23.1.5 Requirements

The requirements of the Fault Tolerant CORBA specification are stated below.

CORBA Object Model

For object groups with the infrastructure-controlled (CONS_INF_CTRL) Consistency Style (Section 23.3.2.3, “ConsistencyStyle,” on page 23-34), the specification requires that the CORBA object model is preserved. Even though an object is replicated to provide protection against faults, at all times its behavior shall appear to be the behavior of a single object. In particular, a replicated object can act as a client or a server or both, and can invoke another replicated object, regardless of the fault tolerance properties of the two object groups.

CORBA Object Reference Model

The specification introduces three new special tagged components into the CORBA object reference model. The object group references that are used for fault tolerance contain multiple profiles that contain these components. Even though an object group reference contains such components in its profiles, an unreplicated object, hosted by an ORB that does not support fault tolerance, can still use the reference to invoke the methods of the replicated object. Similarly, a replicated object can use the object reference of an unreplicated object to invoke the methods of the unreplicated object.

Transparency to Replication and to Faults

Creating or deleting an object using a Generic Factory, and invoking a method of an object, appear the same for replicated objects as for unreplicated objects. Similarly, the behavior of a replicated server object when invoked by a client object appears the same whether or not faults occur, except perhaps for a transient delay if the primary member of a passively replicated object becomes faulty.

No Single Point of Failure

The specification supports applications that need robust fault tolerance, including applications that require higher reliability than can be provided by a single backup. The specification requires that there shall be no single points of failure.

Client Redirection

For a client and a replicated server, the specification defines an interoperable object group reference that allows the client to connect to the server replicas, by connecting to an alternative server or through an alternative network, when a fault in a server replica occurs. It defines an additional service context, in request messages, that allows a server to determine if the object group reference for the server used by a client is obsolete. Transparency to the client application program is provided, with minimal modifications to the client ORB and simple mechanisms in the server ORB. Typical applications include desktop client access to enterprise servers.
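
The redirection behavior described above can be illustrated with a minimal sketch. It assumes that an object group reference can be modeled as an ordered list of profiles and that the transport raises an exception when an endpoint is unreachable. The names invoke_with_failover, ProfileUnreachable, and demo_send are hypothetical and are not defined by the specification.

```python
class ProfileUnreachable(Exception):
    """Raised by the (hypothetical) transport when an endpoint cannot be reached."""


def invoke_with_failover(profiles, send):
    """Try each profile of a group reference in order and return the first reply.

    `profiles` is a list of opaque endpoint descriptions; `send` performs the
    invocation on one profile or raises ProfileUnreachable.
    """
    last_error = None
    for profile in profiles:
        try:
            return send(profile)
        except ProfileUnreachable as exc:
            last_error = exc  # remember the failure and try the next profile
    raise ProfileUnreachable("no profile reachable") from last_error


def demo_send(profile):
    # Stand-in transport for illustration: only the endpoint "backup" is alive.
    if profile != "backup":
        raise ProfileUnreachable(profile)
    return f"reply from {profile}"
```

Calling invoke_with_failover(["primary", "backup"], demo_send) skips the unreachable first profile and returns the reply from the second, without the calling application code being aware of the switch.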

Transparent Reinvocation

The specification introduces an additional service context in Request messages that ensures that, in the presence of faults, a client can reinvoke a request on a replicated server and receive a reply to that request, without risk that the operation will be performed more than once. Typical applications include desktop client access to e-commerce applications.
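
One way a server could use such a service context is sketched below. The sketch assumes that each request carries an identifier pair that stays the same when the client reinvokes the request; the names ReplyCache, client_id, and retention_id are chosen for illustration, and the cache never expires entries, which a real implementation would need to do.

```python
class ReplyCache:
    """Remembers replies so that a reinvoked request is not executed twice."""

    def __init__(self):
        self._replies = {}

    def handle(self, client_id, retention_id, operation):
        """Execute `operation` once per (client_id, retention_id) pair."""
        key = (client_id, retention_id)
        if key not in self._replies:
            # First time this request is seen: perform it and keep the reply.
            self._replies[key] = operation()
        # Reinvocation: return the stored reply without re-executing.
        return self._replies[key]
```

If a client sends the same debit request twice because the first reply was lost, the second call to handle returns the stored reply and the debit is performed only once.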

Infrastructure-Controlled Membership

The infrastructure-controlled (MEMB_INF_CTRL) Membership Style allows the application to direct the Replication Manager to create an object group. The Replication Manager then invokes the factories at the different locations to create the object replicas, and then adds them to the group. The Replication Manager is responsible for creating the initial number of replicas and for maintaining the minimum number of replicas, as specified by the fault tolerance properties for the group. Typical applications include enterprise server applications, such as supply chain applications, and large-scale critical systems, such as defense applications.
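
The duty of maintaining the minimum number of replicas can be sketched as a small planning function. This is an illustration only: replicas_to_create is a hypothetical helper, locations are modeled as plain strings, and the actual choice of locations is left to the infrastructure.

```python
def replicas_to_create(live_locations, candidate_locations, minimum):
    """Return the locations at which new replicas should be created.

    `live_locations` hold the surviving members, `candidate_locations` are the
    locations that have a factory for the object type, and `minimum` is the
    minimum number of replicas required for the group.
    """
    shortfall = max(0, minimum - len(live_locations))
    # Only locations that do not already host a member are eligible.
    spare = [loc for loc in candidate_locations if loc not in live_locations]
    return spare[:shortfall]
```

For example, with one surviving member at location "a", candidates "a", "b", and "c", and a minimum of two, the function returns ["b"].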

Application-Controlled Membership

The application-controlled (MEMB_APP_CTRL) Membership Style allows the application to create the members of an object group and to direct the Replication Manager to add them to the group, or to direct the Replication Manager to create the members of an object group and add them to the group. The application is responsible for maintaining the initial and minimum number of replicas and the locations of the replicas, both initially and after faults. Application-controlled membership is particularly important for applications whose different hosts have different capabilities, such as communication network applications.

Infrastructure-Controlled Consistency

The infrastructure-controlled (CONS_INF_CTRL) Consistency Style provides Strong Replica Consistency between the states of the members of an object group. Strong Replica Consistency requires that, even in the presence of faults, as members of an object group execute a sequence of methods invoked on the object group, the behavior is logically equivalent to that of a single fault-free object processing the same sequence of method invocations. The Fault Tolerance Infrastructure provides logging, checkpointing, activation, and recovery mechanisms to achieve Strong Replica Consistency. Strong Replica Consistency is particularly important for financial applications and safety-critical applications, such as industrial process control and aircraft instrumentation.

Application-Controlled Consistency

The application-controlled (CONS_APP_CTRL) Consistency Style depends on application-specific mechanisms to ensure whatever consistency is required for the members of an object group. Application-controlled consistency does not depend on the Fault Tolerance Infrastructure to provide logging, checkpointing or recovery, and does not guarantee Strong Replica Consistency. Typical applications might include telecommunications applications, and some embedded and real-time applications.

Passive Replication

The COLD_PASSIVE and WARM_PASSIVE Replication Styles require that, during fault-free operation, only one member of the object group, the primary member, executes the methods invoked on the group. Periodically, the state of the primary member is recorded in a log, together with the sequence of method invocations. In the presence of a fault, a backup member is promoted to be the new primary member of the group. The state of the new primary is restored to the state of the old primary by reloading the recorded state from the log and then reapplying the request messages recorded in the log. Passive replication is useful when the cost of executing a method invocation is larger than the cost of transferring a state, and the time for recovery after a fault is not constrained. Typical examples include enterprise inventory, logistics applications, and hospital record keeping.
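
The restore-then-replay step can be sketched as follows. The sketch assumes a toy application object and an in-memory log; Counter, RecoveryLog, and recover are hypothetical names, and the log is assumed to discard requests that a newer checkpoint already covers.

```python
class Counter:
    """Toy application object whose whole state is one integer."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n
        return self.value

    def get_state(self):
        return self.value

    def set_state(self, state):
        self.value = state


class RecoveryLog:
    """Holds the latest checkpoint and the requests received after it."""

    def __init__(self, initial_state):
        self.checkpoint = initial_state
        self.requests = []

    def record_request(self, method, args):
        self.requests.append((method, args))

    def record_checkpoint(self, state):
        # A new checkpoint makes the earlier requests redundant.
        self.checkpoint = state
        self.requests = []


def recover(backup, log):
    """Promote `backup`: reload the checkpoint, then replay logged requests."""
    backup.set_state(log.checkpoint)
    for method, args in log.requests:
        getattr(backup, method)(*args)
    return backup
```

If the primary logs add(1) and add(2), takes a checkpoint, and then logs add(5) before failing, recover(Counter(), log) yields a new primary whose value is 8, the same as the failed primary.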

Active Replication

The ACTIVE Replication Style requires that all of the members of an object group execute each invocation independently but in the same order, so that they maintain exactly the same state. In the event of a fault in one member, the application can then continue with results from another member without waiting for fault detection and recovery. Even though each of the members of the object group generates each request and each reply, the Message Handling Mechanism detects and suppresses duplicate requests and replies, and delivers a single request or reply to the destination object(s). Active replication is useful when the cost of transferring a state is larger than the cost of executing a method invocation, or when the time available for recovery after a fault is tightly constrained. Typical examples include enterprise electronic trading applications and safety-critical applications, such as hospital patient monitoring.
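
Duplicate suppression can be sketched as a filter over incoming messages. The sketch assumes that every copy of the same logical request or reply carries the same identifier regardless of which member sent it; deliver_once and the (message_id, payload) tuple layout are illustrative and are not part of the specification.

```python
def deliver_once(messages):
    """Yield each logical message once, dropping copies sent by other members.

    `messages` is an iterable of (message_id, payload) pairs in arrival order.
    """
    seen = set()
    for message_id, payload in messages:
        if message_id in seen:
            continue  # a copy of this message was already delivered
        seen.add(message_id)
        yield message_id, payload
```

With three members each sending reply 1 and one of them already sending reply 2, the destination receives reply 1 once, followed by reply 2.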

Fault Detection and Notification

The Fault Management interfaces allow detection of object crash faults, and provide fault notifications to the entities that have registered for such notifications. Perfectly accurate fault detection is impossible in an asynchronous distributed system, since a slow object cannot be reliably distinguished from a crashed one, but occasional false suspicions cause no harm in a robust fault-tolerant system. If a host crashes or an object hangs, the Fault Detectors are required to detect the fault in a timely manner. However, a Fault Detector must not continuously suspect all members of an object group, unless all of them are indeed faulty. Most fault-tolerant applications will use the Fault Management interfaces, but they are particularly important for telecommunications, electric power distribution and other safety-critical applications.
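
A timeout-based detector illustrates why suspicions can be false yet still useful. This is a sketch under stated assumptions: HeartbeatDetector is a hypothetical class, members are assumed to report liveness periodically, and time is passed in explicitly so that the behavior is deterministic.

```python
class HeartbeatDetector:
    """Suspects a member when no sign of life arrives within `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, member, now):
        """Record that `member` was observed alive at time `now`."""
        self.last_seen[member] = now

    def suspects(self, now):
        """Return the members whose last sign of life is older than `timeout`."""
        return sorted(
            member
            for member, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A member that is merely slow is suspected until its next heartbeat arrives, at which point the suspicion is withdrawn. A shorter timeout gives faster detection at the cost of more false suspicions.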

Logging and Recovery

The Logging and Recovery Mechanisms and Checkpointable and Updateable interfaces allow an application object to record its state, for use in recovery after a fault or to initialize another replica. Following a fault that damages one or more, but not all, of the members of an object group, recovery is required to ensure that the continued behavior of the replicated object after recovery is the same as it would have been in the absence of the fault. A recovering member executes the same requests in the same order, generates the same replies, invokes the same methods of other objects, and reaches the same internal state, as if no fault had occurred. If a request is partially executed when a fault occurs, that request is fully executed, at the same position in the sequence of messages, during recovery. If an object invokes a method of another object and then becomes faulty, that method invocation must not be duplicated during recovery. Because some objects may be unreplicated, or may be supported by ORBs that do not provide fault tolerance, or may use different Replication Styles, the recovery of each object must be self-contained and must not depend on the cooperation of any other object. Applications that employ the infrastructure-controlled Consistency Style will use these mechanisms and interfaces.
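
The difference between recording a complete state and recording only the changes since the last recording can be sketched as follows. The class Inventory and the signatures of its methods are assumptions made for illustration; they loosely follow the idea of the Checkpointable and Updateable interfaces and do not reproduce their IDL definitions.

```python
class Inventory:
    """Toy object that can export its full state or only recent changes."""

    def __init__(self):
        self.items = {}
        self._changed = {}  # entries modified since the last export

    def set_quantity(self, name, quantity):
        self.items[name] = quantity
        self._changed[name] = quantity

    def get_state(self):
        """Return the complete state and start tracking changes afresh."""
        self._changed = {}
        return dict(self.items)

    def set_state(self, state):
        self.items = dict(state)

    def get_update(self):
        """Return only the entries changed since the last export."""
        update, self._changed = self._changed, {}
        return update

    def set_update(self, update):
        self.items.update(update)
```

A new or recovering replica can be initialized with set_state from a full state and then brought up to date with set_update, which avoids transferring the complete state every time.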