HOME

TheInfoList



OR:

In
network management Network management is the process of administering and managing computer networks. Services provided by this discipline include fault analysis, performance management, provisioning of networks and maintaining quality of service. Network managem ...
, fault management is the set of functions that detect, isolate, and correct malfunctions in a telecommunications network, compensate for environmental changes, and include maintaining and examining
error An error (from the Latin ''error'', meaning "wandering") is an action which is inaccurate or incorrect. In some usages, an error is synonymous with a mistake. The etymology derives from the Latin term 'errare', meaning 'to stray'. In statistics ...
logs, accepting and acting on error detection notifications, tracing and identifying faults, carrying out sequences of diagnostics tests, correcting faults, reporting error conditions, and localizing and tracing faults by examining and manipulating
database In computing, a database is an organized collection of data stored and accessed electronically. Small databases can be stored on a file system, while large databases are hosted on computer clusters or cloud storage. The design of databases ...
information Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random, ...
. When a fault or event occurs, a network component will often send a notification to the network operator using a protocol such as
SNMP Simple Network Management Protocol (SNMP) is an Internet Standard protocol for collecting and organizing information about managed devices on IP networks and for modifying that information to change device behaviour. Devices that typically ...
. An alarm is a persistent indication of a fault that clears only when the triggering condition has been resolved. A current list of problems occurring on the network component is often kept in the form of an active alarm list such as is defined in RFC 3877, the Alarm
MIB The byte is a unit of digital information that most commonly consists of eight bits. Historically, the byte was the number of bits used to encode a single character of text in a computer and for this reason it is the smallest addressable unit ...
. A list of cleared faults is also maintained by most
network management Network management is the process of administering and managing computer networks. Services provided by this discipline include fault analysis, performance management, provisioning of networks and maintaining quality of service. Network managem ...
systems. Fault management systems may use complex filtering systems to assign alarms to severity levels. These can range in severity from debug to emergency, as in the
syslog In computing, syslog is a standard for message logging. It allows separation of the software that generates messages, the system that stores them, and the software that reports and analyzes them. Each message is labeled with a facility code, i ...
protocol.RFC 3164 Alternatively, they could use the ITU X.733 Alarm Reporting Function's perceived severity field. This takes on values of cleared, indeterminate, critical, major, minor or warning. Note that the latest version of the syslog protocol draft under development within the
IETF The Internet Engineering Task Force (IETF) is a standards organization for the Internet and is responsible for the technical standards that make up the Internet protocol suite (TCP/IP). It has no formal membership roster or requirements an ...
includes a mapping between these two different sets of severities. It is considered good practice to send a notification not only when a problem has occurred, but also when it has been resolved. The latter notification would have a severity of clear. A fault management console allows a
network administrator A network administrator is a person designated in an organization whose responsibility includes maintaining computer infrastructures with emphasis on local area networks (LANs) up to wide area networks (WANs). Responsibilities may vary between org ...
or
system operator A sysop (; an abbreviation of system operator) is an administrator of a multi-user computer system, such as a bulletin board system (BBS) or an online service virtual community.Jansen, E. & James,V. (2002). NetLingo: the Internet dictionary. Net ...
to monitor events from multiple systems and perform actions based on this information. Ideally, a fault management system should be able to correctly identify events and automatically take action, either launching a program or script to take corrective action, or activating notification software that allows a human to take proper intervention (i.e. send
e-mail Electronic mail (email or e-mail) is a method of exchanging messages ("mail") between people using electronic devices. Email was thus conceived as the electronic (digital) version of, or counterpart to, mail, at a time when "mail" meant ...
or SMS text to a
mobile phone A mobile phone, cellular phone, cell phone, cellphone, handphone, hand phone or pocket phone, sometimes shortened to simply mobile, cell, or just phone, is a portable telephone that can make and receive calls over a radio frequency link whi ...
). Some notification systems also have escalation rules that will notify a chain of individuals based on availability and severity of alarm.


Types

There are two primary ways to perform fault management - these are active and passive. Passive fault management is done by collecting alarms from devices (normally via
SNMP Simple Network Management Protocol (SNMP) is an Internet Standard protocol for collecting and organizing information about managed devices on IP networks and for modifying that information to change device behaviour. Devices that typically ...
traps) when something happens in the devices. In this mode, the fault management system only knows if a device it is monitoring is intelligent enough to generate an error and report it to the management tool. However, if the device being monitored fails completely or locks up, it won't throw an alarm and the problem will not be detected. Active fault management addresses this issue by actively monitoring devices via tools such as
ping Ping may refer to: Arts and entertainment Fictional characters * Ping, a domesticated Chinese duck in the illustrated book '' The Story about Ping'', first published in 1933 * Ping, a minor character in ''Seinfeld'', an NBC sitcom * Ping, a c ...
to determine if the device is active and responding. If the device stops responding, active monitoring will throw an alarm showing the device as unavailable and allows for the proactive correction of the problem. Fault management includes any tools or procedure for testing, diagnosing or repairing the network when a failure occurs.


See also

*
Alarm management Alarm management is the application of human factors and ergonomics along with instrumentation engineering and systems thinking to manage the design of an alarm system to increase its usability. Most often the major usability problem is that t ...
* Alarm fatigue


Notes


References

*{{FS1037C MS188 Network management