This topic describes iLink session fault tolerance.

Customers must wait for in-flight resend requests to be fulfilled before terminating or failing over to the backup gateway.

Fault tolerance in a network environment is characterized by rapid recovery from failures such as process termination, hardware failure, or network disconnects.

One method of providing fault tolerance is a mechanism called failover; its goal is to minimize service interruption caused by error conditions.

Process failure may be caused by hardware failure of the machine on which the process runs, or by software errors such as faulty logic and improper memory handling, among others. Network failure includes loss of TCP/IP connectivity between the client application and the iLink Gateway. This may be caused by faulty network interface cards, frayed wires, or power failure on network components (e.g., routers).

The iLink Gateway initiates a controlled failover when it detects either process or network failure that impacts its ability to service the client.

See the following topics for details regarding each Fault Tolerance error condition and example scenarios:

Contents

1 Summary of Fault Tolerance Requirements on Client Applications
2 Fault Tolerance Implementation
3 Customers Choosing Not to Implement Fault Tolerance
4 Failover Processing
5 Duplicate Session Connection Attempts

Summary of Fault Tolerance Requirements on Client Applications

The guidelines for implementing fault tolerant client applications are:

Coordinate applications such that the primary and backup processes each establish a separate and independent content stream to the iLink Gateways via TCP/IP socket connection.
Send application messages (e.g. New Order - Single, Order Cancel/Replace Request) only through the primary content stream where sequencing is enforced per FIX 4.2 protocol.
Send only session messages over a backup content stream. Communication over a backup content stream is for link maintenance via session messages only. Application messages sent over a backup connection are rejected.
Messages exchanged over a backup content stream serve to maintain only the connection. As a result, sequence numbers are not relevant and are "0" (zero).
Maintain tracking of inbound and outbound sequence numbers between the primary and backup client processes to ensure proper failover processing.

Fault Tolerance Implementation

A customer choosing to use fault tolerance must coordinate their application processes to establish separate and independent content streams to the iLink FIXP Gateways via TCP/IP socket connections. This group of redundant client processes (commonly referred to as the fault tolerant group) operates together to provide fault tolerant functionality. In a typical deployment scenario, multiple redundant processes are spawned from the same executable file and each of those processes runs on a separate machine. Running redundant processes on the same machine is not recommended: if that machine fails, all the processes running on it fail simultaneously.

The client application must Negotiate and Establish with the designated primary Market Segment Gateway and then Establish with the backup Market Segment Gateway. A backup member must be ready to activate in the same data state as the former primary member being replaced. For example, inbound and outbound sequence numbers for a UUID must be maintained in a consistent state during fail-over between both processes.

For active-active fault tolerance, Negotiation is not allowed on the backup connection.

The customer must use the same UUID for both the primary and backup FIXP sessions so that the sequence streams remain contiguous after a failure. This is the only exception to the general rule that a UUID be globally unique.

The Fault Tolerance Indicator is present only in messages related to Negotiation, Establishment, and Sequence; it is not present in business messages.

The customer is told beforehand of their primary and backup IP addresses. Once Established, the Establishment Acknowledgment and Sequence messages sent from the primary Market Segment Gateway contain a valid value in NextSeqNo, whereas those sent from the backup Market Segment Gateway show NextSeqNo as zero.

The customer must not populate NextSeqNo with a non-zero number in Establish and Sequence messages to the backup Market Segment Gateway; this is considered a protocol violation, so NextSeqNo in these messages must be set to zero. The reverse also holds: setting NextSeqNo to zero in Establish and Sequence messages to the primary Market Segment Gateway is considered a protocol violation.
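The NextSeqNo rule above could be enforced on the client side before an Establish or Sequence message is sent. The following is a minimal sketch, not an iLink API; the function name and the `is_primary` flag are hypothetical.

```python
def check_next_seq_no(is_primary: bool, next_seq_no: int) -> None:
    """Raise ValueError if NextSeqNo breaks the rule for the target gateway.

    Hypothetical helper: primary gateway requires a non-zero NextSeqNo,
    backup gateway requires zero.
    """
    if next_seq_no < 0:
        raise ValueError("NextSeqNo cannot be negative")
    if is_primary and next_seq_no == 0:
        raise ValueError("NextSeqNo must be non-zero toward the primary gateway")
    if not is_primary and next_seq_no != 0:
        raise ValueError("NextSeqNo must be zero toward the backup gateway")


# Usage: validate before building an Establish or Sequence message.
check_next_seq_no(is_primary=True, next_seq_no=42)   # passes
check_next_seq_no(is_primary=False, next_seq_no=0)   # passes
```

Calling the check at the point where messages are built keeps a would-be protocol violation inside the client instead of on the wire.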

Business messages, including New Order - Single, Order Cancel/Replace Request, must be sent only through the primary content stream where sequencing is enforced per FIXP protocol. Communication over the backup content stream is solely for link maintenance; only administrative messages (Establish, Sequence, Terminate) are sent through the backup content stream.
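A client could guard its send path so that business messages never reach the backup content stream. The sketch below assumes messages are identified by a plain string label; the labels and the function name are illustrative and are not taken from any iLink library.

```python
# Administrative messages the text above allows on the backup stream.
BACKUP_ALLOWED = frozenset({"Establish", "Sequence", "Terminate"})


def may_send(message_type: str, on_primary: bool) -> bool:
    """Return True if this message type may be sent on the given stream.

    Hypothetical guard: the primary stream carries everything, the backup
    stream carries only the administrative messages listed above.
    """
    if on_primary:
        return True
    return message_type in BACKUP_ALLOWED


# Usage: a business message is blocked on the backup stream.
print(may_send("NewOrderSingle", on_primary=False))  # False
print(may_send("Sequence", on_primary=False))        # True
```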

In the event of a failure in an active-active fault tolerance scenario, the customer receives a Sequence message from the backup (newly promoted primary) Market Segment Gateway with NextSeqNo containing a valid value (continuing from where the old primary left off) and FaultToleranceIndicator=1 (primary). This message is the trigger that tells the customer a failure occurred and that the backup successfully took over as the new primary.
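The promotion trigger described above can be expressed as a simple predicate on an incoming Sequence message. This is a sketch under stated assumptions: `SequenceMsg` is a hypothetical decoded form of the message, and only the value 1 (primary) for the FaultToleranceIndicator is taken from the text.

```python
from dataclasses import dataclass

FTI_PRIMARY = 1  # FaultToleranceIndicator=1 (primary), as stated in the text


@dataclass(frozen=True)
class SequenceMsg:
    """Hypothetical decoded Sequence message with only the fields used here."""
    next_seq_no: int
    fault_tolerance_indicator: int


def is_promotion(msg: SequenceMsg, currently_backup: bool) -> bool:
    """True when a backup connection sees the signal that it is now primary."""
    return (
        currently_backup
        and msg.next_seq_no != 0
        and msg.fault_tolerance_indicator == FTI_PRIMARY
    )


# Usage: a backup connection receives a Sequence with a real NextSeqNo.
print(is_promotion(SequenceMsg(next_seq_no=1057, fault_tolerance_indicator=1), True))  # True
```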

Customers Choosing Not to Implement Fault Tolerance

The exchange does not mandate that the customer use the fault tolerance feature; however, CME strongly recommends using this functionality. A customer that does not implement fault tolerance will not be able to dynamically recover from process or network failures.

Customers that do not wish to implement fault tolerance should Negotiate and Establish only with the designated primary iLink Gateway and not Establish with the backup iLink Gateway. They are not precluded from Establishing with the backup iLink Gateway at a later time, after first Negotiating and Establishing with the primary gateway.

Failover Processing

Mission-critical client applications must continue to function properly despite sudden difficulties such as process termination, hardware failure, and network disconnects. Fault tolerance in a network environment is characterized by rapid recovery from such failures. One method of providing fault tolerance is a mechanism called fail-over; its goal is to minimize service interruption caused by these error conditions.

iLink detects seven different categories of error conditions:

Customer primary process failure
Customer backup process failure
Primary Convenience Gateway failure
Backup Convenience Gateway failure
Primary Market Segment Gateway failure
Backup Market Segment Gateway failure
Network Failure

Process failure may be caused by hardware failure of the machine on which the process runs. It may also be caused by software errors such as faulty logic or improper memory handling. Network failure includes loss of TCP/IP connectivity between the customer application and the Market Segment Gateway. This may be caused by a faulty network interface card, a frayed wire, power failure on network components (e.g., routers), and similar faults.

The iLink Gateway initiates a controlled fail-over when it detects either process or network failure that impacts its ability to service the customer.

Customer Primary Process Failure

If the exchange does not receive a message from the primary client process within a defined interval (two times the KeepAliveInterval sent by the client in the Establish message), then the exchange designates the primary client process as failed and initiates fail-over. Both the primary and backup client applications are terminated and disconnected from the exchange, and the backup client application is expected to communicate with the designated primary Market Segment Gateway. The backup client application is notified of the fault tolerance status change by the error code in the Terminate message (ErrorCodes = 26 - DisconnectFromPrimary: Backup session will be terminated as well).
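The detection window described above is simple arithmetic, and a client can mirror it to know how long it has before it is considered failed. The sketch below is illustrative only: the function name is hypothetical, timestamps and the interval are assumed to share one unit, and treating the exact boundary as not yet failed is an assumption.

```python
def is_considered_failed(last_message_at: float, now: float,
                         keep_alive_interval: float) -> bool:
    """True once more than two KeepAliveIntervals have passed without a message.

    All three arguments must use the same time unit (an assumption here).
    """
    return (now - last_message_at) > 2 * keep_alive_interval


# Usage: with an interval of 10, silence of 15 is tolerated, 25 is not.
print(is_considered_failed(last_message_at=100, now=115, keep_alive_interval=10))  # False
print(is_considered_failed(last_message_at=100, now=125, keep_alive_interval=10))  # True
```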

If the primary client process fails without closing the TCP connection, then it takes two times the KeepAliveInterval for the exchange to detect the primary process failure. If the customer would like to avoid the time delay in this process, then they should ensure the TCP connection is closed whenever their application fails.
Before the fail-over, the backup client application was receiving NextSeqNo set to zero. During and after the fail-over process, the backup client application is responsible for ensuring that its UUID, inbound sequence number to the exchange and outbound sequence number from the exchange are synchronized with the primary application that just failed. This is critical since the newly elected primary member must know exactly where the failed member left off.
If the NextSeqNo of the Sequence message sent by the new primary client application is lower than that of the original client application for the same UUID, the exchange will terminate the client application as per FIXP protocol.
If the NextSeqNo of the Sequence message sent by the new primary client application is lower than that of the original client application and a different UUID is used, the exchange will accept it and will inform the customer of the perceived gap by sending a NotApplied message. For a new UUID, NextSeqNo=1 is acceptable.
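The requirement that the backup knows exactly where the failed primary left off suggests keeping the UUID and both sequence counters in one shared record. The following is a minimal in-process sketch; all names are hypothetical, and a real deployment would need to replicate this state between machines, which is not shown.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SessionState:
    """State the backup must hold to take over (hypothetical layout)."""
    uuid: int
    next_out_seq_no: int = 1  # next sequence number the client will send
    next_in_seq_no: int = 1   # next sequence number expected from the exchange


class FaultTolerantGroupState:
    """The primary records traffic here; the backup reads a snapshot on fail-over."""

    def __init__(self, uuid: int) -> None:
        self._state = SessionState(uuid)

    def record_sent(self) -> None:
        s = self._state
        self._state = replace(s, next_out_seq_no=s.next_out_seq_no + 1)

    def record_received(self) -> None:
        s = self._state
        self._state = replace(s, next_in_seq_no=s.next_in_seq_no + 1)

    def snapshot(self) -> SessionState:
        return self._state


# Usage: the primary sends two messages and receives one, then fails.
group = FaultTolerantGroupState(uuid=1234567890)
group.record_sent()
group.record_sent()
group.record_received()
print(group.snapshot())
```

Keeping the record immutable and swapping it whole means the backup never reads a half-updated pair of counters.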

Customer Backup Process Failure

If the exchange does not receive any message from a backup client application within a defined interval (two times the KeepAliveInterval in the Establish message sent by the customer), then the backup client application is disconnected and the status of the primary client application connectivity remains intact.

The backup client application is permitted to Negotiate and Establish with the backup Market Segment Gateway using a different UUID than the one the primary client application uses with the primary Market Segment Gateway. The customer should be aware, however, that the sequence stream will then not be contiguous in the event of a failure. It is therefore recommended that the primary and backup client applications use the same UUID.

Primary Convenience Gateway Failure

If the primary iLink FIX Gateway fails:

CME Globex initiates failover by electing the ranking inactive iLink FIX Gateway to assume the primary role.
The client application that is connected to this newly chosen iLink FIX Gateway must act as the primary for the client application FT Group.
The client application learns that it must become primary by examining the FTI in Tag 56-TargetCompID of the next incoming message.

Backup Convenience Gateway Failure

iLink maintains a predefined number of processes running for each iLink component. If a backup iLink Gateway fails:

The failed iLink Gateway is restored.
The backup client application must reestablish the connection to this restored backup iLink Gateway to maintain a pool of backup content streams.
The status of the primary client application connectivity remains intact.

Primary Market Segment Gateway Failure

If the primary Market Segment Gateway fails, then the Exchange fault tolerance mechanism initiates fail-over by electing the backup Market Segment Gateway to assume the primary role. The client application connected to this newly chosen Market Segment Gateway must act as the primary for the client application FT Group. The client application learns that it must become primary by examining the Sequence message, in which NextSeqNo changes from zero to a non-zero value and FaultToleranceIndicator=1 (primary).

The customer can re-establish a previous FIXP session (same UUID) after reconnecting to this new primary Market Segment Gateway without further negotiation, which ensures that the sequence numbers continue.
The customer may also negotiate and establish a new session with a new UUID after reconnecting; this resets the sequence numbers back to 1 in both directions (customer to CME and CME to customer).
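The two recovery options above differ in which session messages are needed and where the sequence numbers resume. The sketch below captures that choice; the function and type names are hypothetical and the step labels simply reuse the message names from the text.

```python
from typing import NamedTuple, Tuple


class ReconnectPlan(NamedTuple):
    steps: Tuple[str, ...]   # session messages to send, in order
    next_out_seq_no: int     # customer to CME
    next_in_seq_no: int      # CME to customer


def plan_reconnect(reuse_uuid: bool, last_out: int, last_in: int) -> ReconnectPlan:
    """Pick the reconnect steps and starting sequence numbers.

    Same UUID: Establish only, sequence numbers continue.
    New UUID: Negotiate then Establish, both directions restart at 1.
    """
    if reuse_uuid:
        return ReconnectPlan(("Establish",), last_out, last_in)
    return ReconnectPlan(("Negotiate", "Establish"), 1, 1)


# Usage: resume the previous session after the gateway fail-over.
print(plan_reconnect(reuse_uuid=True, last_out=501, last_in=498))
```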

Backup Market Segment Gateway Failure

If a backup Market Segment Gateway fails, it will be restored to working status by market operations.

The failure of a backup component will not affect the primary connection state.
The failed backup MSGW is restored.

For active-active fault tolerance, the backup client application must re-establish the same session (same UUID) with this restored backup Market Segment Gateway to maintain a backup. The status of the primary client application connectivity remains intact.

Network Failure

In the event of network failure, iLink handles socket exceptions that are thrown for network error conditions (i.e., loss of TCP/IP connectivity between the client application and the iLink Gateway). When this happens, iLink designates the primary content stream as failed and initiates the failover.

Duplicate Session Connection Attempts

A client system cannot make a duplicate TCP connection to iLink.