Overview

[Figure: FailSafe configuration diagram]
IRIS FailSafe™ -- The SGI® High-Availability Solution

Clusters Configured with IRIS FailSafe in Conjunction with the Application Services Provide:

IRIS FailSafe is a robust and flexible high-availability (HA) cluster solution whose modular design makes it highly scalable. An IRIS FailSafe-based HA cluster combines multiple nodes that work together to provide highly available services to users while minimizing service downtime. Each node is an independent computer with its own CPU, memory, and I/O. The nodes are interconnected using standard, off-the-shelf hardware, and they cooperate and communicate with one another to provide resilience against any single point of failure. SGI provides HA agents for popular applications such as Samba™ and Netscape® FastTrack™ server, and an open API allows almost any application to be integrated easily into the HA framework. Combined with ccNUMA™-based SGI servers, IRIS FailSafe brings high levels of service availability to businesses in industries such as telecommunications, manufacturing, and many others.

HA clusters configured this way provide a cost-effective yet flexible way to meet the availability demands of businesses whose services must be available around the clock. Such businesses range from universities, whose student records, assignments, and other material reside on servers that must be accessible to students, faculty, and staff at all times, to printing and publishing businesses that house all of their content databases on powerful, high-bandwidth Unix servers.

Highly-Available Services with IRIS FailSafe
IRIS FailSafe is the HA infrastructure solution that enables up to 8 nodes (in any combination of SGI Origin® servers) to form a cluster that provides highly available services to clients connected to the cluster over standard networks such as Ethernet, FDDI, and ATM. The cluster is set up to detect failures quickly and take the steps necessary to minimize their impact. This is achieved in part by limiting the impact of a failure to the resources in a single resource group. Recovery steps range from attempting to access the storage over an alternate path (in the case of a storage access path failure) to failing over the application to another node in the cluster. IRIS FailSafe leverages IRIX kernel capabilities to detect and recover from failures at various levels:

  • Disk failure
  • Storage path failure
  • System failures
  • Network failures
  • Application failures
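
The escalation described above, from the least disruptive recovery step to a full failover, can be sketched roughly as follows. This is only an illustrative model of the two recovery steps named in the text. The `Failure` enum, the `plan_recovery` function, and the step strings are hypothetical and are not part of the IRIS FailSafe API.

```python
from enum import Enum, auto


class Failure(Enum):
    """The failure levels listed above (names are illustrative only)."""
    DISK = auto()
    STORAGE_PATH = auto()
    SYSTEM = auto()
    NETWORK = auto()
    APPLICATION = auto()


def plan_recovery(failure, alternate_path_available):
    """Return an ordered list of recovery steps to attempt (hypothetical)."""
    steps = []
    if failure is Failure.STORAGE_PATH and alternate_path_available:
        # Least disruptive option: the service stays on its current node.
        steps.append("retry storage access via alternate path")
    # Fallback: move only the affected resource group, not the whole node.
    steps.append("fail over resource group to another node")
    return steps
```

In this sketch a storage path failure with a spare path yields two steps, tried in order, while every other case goes straight to failing over the affected resource group.
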
Dynamic Cluster Configuration to Suit Your Needs
IRIS FailSafe uses distributed software technology to allow highly flexible and scalable cluster configuration.

Should you need to add a system to the cluster because the processing power requirements of the HA services have outgrown what is currently available, another server can be added to the cluster dynamically, and you can redistribute the services among the systems in the newly formed cluster without disrupting other HA applications in the cluster.

Systems can be added to or removed from the cluster, services and applications can be added or removed, and even the resources on which an application or service depends can be modified on the fly. This lets users keep disruption to their services minimal whenever the cluster configuration needs to change.

Java-Based Portable Cluster Manager
The IRIS FailSafe cluster manager graphical user interface (GUI) allows users to set up, administer, and monitor their high-availability cluster with ease. The Java-based GUI consists of:

  • FailSafe Manager to configure your cluster and set it up to run in production mode.
  • FailSafe Cluster View to display a dynamic graphical overview that lets you monitor the state of your cluster, obtain detailed information on specific highly available resources, and modify the cluster configuration.
The web-like GUI design lets users click any blue text to get more information (glossary, help, or configuration details) or to change the cluster configuration (for example, to move a resource group or define a cluster).

Services Available During Planned Maintenance
A hardware or software upgrade normally results in system downtime. In contrast, an IRIS FailSafe cluster allows you to migrate the services of the target node to other nodes in the cluster, remove the node from the cluster, upgrade it as needed, and then bring it back to rejoin the cluster. The services can then be redistributed among all the cluster members. This enables IT managers to minimize service downtime during planned system maintenance.
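
The maintenance sequence above (migrate services away, remove the node, upgrade, rejoin, redistribute) can be modeled with a minimal sketch. It assumes a cluster is represented as a plain mapping from node name to a list of resource group names, and it uses a simple "fewest groups first" placement rule. The model, the function names, and the placement rule are all assumptions for illustration, not FailSafe's own interfaces or policies.

```python
def drain_node(placement, node):
    """Move every resource group off `node` and drop it from the cluster.

    `placement` maps node name -> list of resource group names.
    Returns a new mapping; the input is left unchanged.
    """
    others = {n: list(groups) for n, groups in placement.items() if n != node}
    if not others:
        raise ValueError("cannot drain the only node in the cluster")
    for group in placement[node]:
        # Send each group to the remaining node that hosts the fewest groups.
        target = min(others, key=lambda n: len(others[n]))
        others[target].append(group)
    return others


def rejoin_node(placement, node):
    """Add the upgraded node back to the cluster, initially hosting nothing."""
    updated = {n: list(groups) for n, groups in placement.items()}
    updated[node] = []
    return updated


def rebalance(placement):
    """Move groups from the busiest to the idlest node until counts differ by at most one."""
    result = {n: list(groups) for n, groups in placement.items()}
    while True:
        busiest = max(result, key=lambda n: len(result[n]))
        idlest = min(result, key=lambda n: len(result[n]))
        if len(result[busiest]) - len(result[idlest]) <= 1:
            return result
        result[idlest].append(result[busiest].pop())


# Example: take node1 out for an upgrade, then bring it back.
cluster = {"node1": ["nfs", "web"], "node2": ["samba"], "node3": ["oracle"]}
during_upgrade = drain_node(cluster, "node1")
after_upgrade = rebalance(rejoin_node(during_upgrade, "node1"))
```

Throughout the sequence every resource group stays hosted on some node, which is the property that keeps services available during the upgrade.
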

Popular Highly-Available Solutions
Packaged with agents for popular applications such as NFS, Samba, and Netscape Enterprise server, IRIS FailSafe enables easy setup of highly available file servers, Web servers, and more. The following agents are available from SGI for use with IRIS FailSafe:


Application Agent        Highly-Available Solution
TMF agent                To set up highly-available TMF servers for tape management
NFS agent                To set up highly-available file servers for Unix clients
Samba agent              To set up highly-available file servers for PC clients
Netscape server agent    To set up highly-available Web servers
Informix agent           To set up highly-available Informix application servers
Oracle agent             To set up highly-available Oracle application servers
DMF agent                To set up highly-available DMF servers for your HSM environment


Open API for Easy Integration of Other Applications
IRIS FailSafe implements an open API to facilitate easy integration of third-party applications. Applications with certain characteristics that make them HA-capable can be easily integrated into the HA framework provided by IRIS FailSafe. These characteristics include:

  • The service is not stateful across start/stop operations; that is, it can be started and stopped multiple times without losing any state.
  • The service manages its data efficiently and concisely; that is, it keeps all of its files in one directory structure that can be duplicated across systems easily, rather than scattering them.
  • The service can run as multiple instances, on the same node or across systems, without problems.

SGI Managed Services team members can provide the custom engineering needed to integrate any customer-desired application under the IRIS FailSafe umbrella. A programmer's guide is also provided for those who want to do the integration themselves.

Note that the application itself does *not* need to be modified in any way to integrate with the IRIS FailSafe framework.

Power of Distributed Computing and Enhanced Throughput for Workloads
Because the HA cluster is built by joining multiple systems that watch over one another, it also brings the power of distributed computing to the user's environment. Each system in the cluster is a complete system with its own I/O, networking, and CPU capabilities. Therefore, by carefully defining the data boundaries and distributing the workload of one system across two or more systems, users can take advantage of the additional I/O throughput and networking bandwidth of the multiple systems in the cluster.

Flexible Cluster Topology for Better Capacity Planning
IRIS FailSafe allows for N+1 or N+M and ring or star cluster configurations, so that users can plan for the backup capacity that best suits their requirements. In an N+1 cluster configuration, one system is designated as the dedicated backup for N primary systems and must be able to assume the workload of the highest-capacity system. In an N+M configuration, multiple systems can be designated as backups while also performing their own primary activities. Flexible, user-configurable failover policies allow users to manage the environment and workload distribution according to their needs.
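
The capacity rule for N+1 and the choice of a backup in N+M can be illustrated with a short sketch. Workloads and capacities are expressed in arbitrary units. The "most spare capacity" selection rule is just one possible policy chosen for this example, since the text says failover policies are user configurable. All function and parameter names here are hypothetical.

```python
def n_plus_1_is_sufficient(primary_loads, backup_capacity):
    """N+1 check: the single backup must absorb the largest primary workload."""
    return backup_capacity >= max(primary_loads.values())


def pick_failover_target(failed_node, loads, capacities, backups):
    """N+M sketch: choose the backup with the most spare capacity that can
    take on the failed node's workload in addition to its own.

    Returns the chosen node name, or None if no backup has enough headroom.
    """
    needed = loads[failed_node]
    candidates = [
        b for b in backups
        if b != failed_node and capacities[b] - loads.get(b, 0) >= needed
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda b: capacities[b] - loads.get(b, 0))
```

For example, with primaries carrying workloads of 40 and 70 units, a dedicated backup of capacity 64 fails the N+1 check, while one of capacity 80 passes it.
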

Protecting Data Integrity
Within an HA cluster, IRIS FailSafe not only makes applications highly available but also protects data integrity: in the event of a failover of services, it makes sure that the failed node does not inadvertently (and most undesirably) attempt to write to the data store again. When the cluster membership protocols determine that a node is down, one of the other cluster members resets it in order to prevent the "split-brain syndrome". Without this guarantee, more than one node could access a disk simultaneously, which could compromise data integrity.
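
The ordering that matters here, reset first and take over second, can be shown in a minimal sketch. The `Node` class and `fail_over` function are hypothetical stand-ins and do not reflect FailSafe's actual interfaces. The sketch also assumes that a takeover is refused when the reset cannot be confirmed, which is a design choice of this example rather than something stated in the text.

```python
class Node:
    """Toy cluster member used only for this illustration."""

    def __init__(self, name, groups=(), reset_works=True):
        self.name = name
        self.resource_groups = list(groups)
        self.reset_works = reset_works  # lets the example simulate a failed reset
        self.fenced = False

    def reset(self):
        """Stand-in for the reset issued by a surviving member.

        Returns True if the node is now known to be unable to write.
        """
        self.fenced = self.reset_works
        return self.fenced


def fail_over(failed, survivor):
    """Reset the failed node, and only then start its groups on the survivor."""
    if not failed.reset():
        # Without a confirmed reset, both nodes might write to the same disks.
        raise RuntimeError(f"could not reset {failed.name}; refusing takeover")
    survivor.resource_groups.extend(failed.resource_groups)
    failed.resource_groups = []


# Example: node "a" is declared down and node "b" takes over its services.
a = Node("a", ["nfs"])
b = Node("b", ["web"])
fail_over(a, b)
```

Because the survivor never starts a resource group until the failed node is fenced, at no point in this model can two nodes own the same group at once.
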

Enhancing Availability in the Entire Processing Environment
HA clusters created with IRIS FailSafe provide a strong foundation for hosting highly available services in mission-critical environments. However, several other factors must be considered, and the environment must be designed for HA from top to bottom. The cluster environment should be set up to eliminate as many single points of failure as possible, and factors such as power failure should be addressed.