23.4.1 Overview

In a fault-tolerant system, fault management encompasses the following activities:

• Fault detection - detecting the presence of a fault in the system and generating a fault report.

• Fault notification - propagating fault reports to entities that have registered for such notifications.

• Fault analysis/diagnosis - analyzing a (potentially large) number of related fault reports and generating condensed or summary reports.

In the Fault Tolerance Infrastructure, Fault Detectors detect faults in the objects, and report faults to the Fault Notifier. The Fault Notifier receives fault reports from the Fault Detectors, filters the reports, and propagates the filtered reports as fault event notifications to consumers that have subscribed for them. The Fault Analyzer reasons about the fault reports that it has received, and produces aggregate or summary fault reports that it propagates back to the Fault Notifier for dissemination to other consumers.

A fault-tolerant system typically has several Fault Detectors, including those provided by the infrastructure to monitor objects, and other fault detectors provided by the infrastructure or the application. Each Fault Detector belongs to a particular fault tolerance domain, and is not shared across fault tolerance domains. Most implementations of Fault Detectors are based on timeouts, and use either pull- or push-based monitoring. This section defines an interface for pull-based monitoring, the PullMonitorable interface, that application objects inherit, and that is invoked by a Fault Detector within the Fault Tolerance Infrastructure.

The section also defines a FaultNotifier interface. The Fault Notifier receives fault reports from the Fault Detectors. The Fault Notifier filters the reports to eliminate unnecessary or duplicate reports. It then sends fault event notifications to the consumers. The Replication Manager is such a consumer, as is the Fault Analyzer. The application can also subscribe to receive fault event notifications. Logically, there is one Fault Notifier per fault tolerance domain, although typically it is replicated for fault tolerance. The Fault Notifier belongs to a particular fault tolerance domain and is not shared across domains.

A fault-tolerant system may also have one or more Fault Analyzers. Each Fault Analyzer collects fault reports and performs event correlation, analysis, and diagnosis. It may condense a large number of related fault reports into a single fault report (e.g.,

the crash of a host can cause fault reports for all objects on that host, as well as a fault report for the host itself). The analysis of fault reports is application-dependent; thus, this chapter does not define a Fault Analyzer interface, but allows an application developer to hook in Fault Analyzers as consumers of fault reports generated by the Fault Notifier.

A problem with fault notification is the potential for a large number of notifications to be generated by a single fault. This problem is addressed by filtering within the Fault Notifier, by Fault Analyzers, and by the FaultMonitoringGranularity.

23.4.2 Architecture

Figure 23-7 shows the interaction between the Fault Detectors, Fault Notifier, Fault Analyzer, and Replication Manager in a relatively simple system. The fault management specification defines interfaces that allow interaction of:

• A Fault Detector with a pull-monitored object within a fault tolerance domain

• A Fault Detector with the Fault Notifier within a fault tolerance domain

• The Fault Notifier with the Replication Manager, a Fault Analyzer, or other application objects within a fault tolerance domain.

ReplicationManager

Fault Notifications

Fault FaultNotifier Analyzer

Fault Reports

Fault FaultDetector Detector

Pings

FaultDetector

Application

ApplicationObject

Figure 23-7 Interactions between the Fault Detectors, Fault Notifier, Fault Analyzer, and Replication Manager.

23.4.2.1 Fault Detection

In the Fault Tolerance Infrastructure, fault detection is initiated by the Replication Manager for members of object groups having either application-controlled or infrastructure-controlled MembershipStyles (see Section 23.3.2, “Fault Tolerance Properties,? on page 23-32). Because the fault management specification focuses on monitoring and timeout-based fault detection, the terms monitor and detector are used interchangeably.

There are two common styles of fault monitoring: PULL and PUSH. These two fault monitoring styles differ in the direction in which fault information flows in the system. Because push-based monitoring depends on characteristics of the application, it is not defined in this specification.

The fault management specification defines the interaction between a pull-based Fault Detector and application objects. It defines a PullMonitorable interface that the application objects inherit. Other kinds of system-specific (for example, host, network) and application-specific Fault Detectors may be present in the system, but they are not defined.

23.4.2.2 Fault Notification

This section defines a FaultNotifier interface that contains operations that allow a Fault Detector or Fault Analyzer to push fault reports to the Fault Notifier. It also defines operations that allow the Replication Manager, a Fault Analyzer or other application object to register as consumers of fault event notifications. The Fault Notifier filters fault reports that it has received from the Fault Detectors, and propagates fault reports to the entities that have registered for such notifications.

23.4.2.3 Fault Analysis

The Fault Analyzer registers with the Fault Notifier as a consumer of fault reports. The Fault Analyzer correlates fault reports and generates condensed fault reports. Because these activities are specific to the application or the environment, the application developer is responsible for the analysis/diagnosis algorithm employed by the Fault Analyzer. The Fault Analyzer may use the Fault Notifier to disseminate its condensed fault reports.

23.4.2.4 Scalability

The fault management specification does not limit the number or arrangement of Fault Detectors in a fault tolerance domain. In a large system spanning many hosts with each host supporting many objects, arranging the Fault Detectors in a hierarchical structure would be more scalable and efficient.

For example, consider a system where all objects at a given location (say, a process) are monitored by a local object-level Fault Detector, as shown in Figure 23-8 on page 23-69. The set of object-level Fault Detectors might be monitored by a process- level Fault Detector. The set of process-level Fault Detectors might be monitored by a host-level Fault Detector. The Replication Manager, or a consumer object created by the Replication Manager, might be implemented to consume either object-level, process-level, or host-level fault reports. If it is implemented to consume only object-level fault reports, a Fault Analyzer that translates object-level fault reports into process- or host-level fault reports can be attached to the Fault Notifier.

Monitoring at the process level can be achieved by monitoring a single proxy object in the process. The proxy object would be responsible for ensuring that all of the other objects in the process are alive, and would monitor those objects through the use of application-specific facilities or private Fault Notifier channels provided by the

	infrastructure.

	Fault Notifier
	Host Fault Detector Fault Reports
	Pings
	Process Fault Detector Host	Host	Process Fault Detector
Object Fault Detector	Object Fault Detector Process Process	Object Fault Detector	Process
Application Object Pi	ngs	Application Object Application Object Application Object Pings	Application Object Pings	Application Object
	Figure 23-8	Hierarchical Fault Detection.

This example shows the generality of the Fault Tolerance Infrastructure in handling different types of arrangements of Fault Detectors. Other organizations are possible and useful.

23.4.2.5 Deployment of Fault Detectors

Fault Detectors can be as varied as the applications they monitor and, for these diverse applications, Fault Detectors can be deployed in several different ways:

• Infrastructure Created Fault Detectors. The Fault Tolerance Infrastructure may create instances of Fault Detectors to meet the needs of the applications. For example, to implement the MEMB FaultMonitoringGranularity, the Fault Tolerance Infrastructure must create Fault Detectors sufficient to ping every member of the object group. Because these Fault Detectors are created (or, at least, configured) by the Fault Tolerance Infrastructure, their identities need only be known to the infrastructure.

• Application Created Fault Detectors. It might be necessary or advantageous for applications to create their own Fault Detectors. For example, applications might have unique knowledge of their operating environment, such as access to hardware indicators of faults within the operating environment. However, unlike the other types of Fault Detectors, application-created Fault Detectors are not inherently known to the Fault Tolerance Infrastructure. They can propagate their fault information to an application-specific Fault Analyzer through the Fault Notifier provided by the infrastructure. The Fault Analyzer can interpret these application-specific fault reports, generate reports that can be understood by the Replication Manager, and propagate them to the Replication Manager through the Fault Notifier, as shown in Figure 23-8.

• Statically Deployed Fault Detectors. In an operating environment with a relatively static configuration, location-specific Fault Detectors will typically be created when the Fault Tolerance Infrastructure is installed. For example, these stand-alone Fault

Detectors could be implemented as daemon processes that are installed with the Fault Tolerance Infrastructure. These Fault Detectors could be registered in a manner internal to the Fault Tolerance Infrastructure, allowing the infrastructure to include them in every fault-tolerant application within the fault tolerance domain in a transparent manner.