IRIS FailSafe --
The SGI® High-Availability Solution
Clusters Configured with IRIS FailSafe in
Conjunction with the Application Services Provide:
IRIS FailSafe is a robust and flexible cluster HA
solution, which is highly scalable due to its modular design. An IRIS
FailSafe-based HA Cluster is built by combining multiple nodes
that work together to achieve a common goal of providing highly
available services to its users by minimizing the service
downtime. Each node is an independent computer with its own CPU,
memory, and I/O. The nodes are interconnected using standard,
off-the-shelf hardware, and they cooperate and communicate with one
another to ensure resilience against any single point of failure. SGI
provides HA agents for popular applications such as Samba and
Netscape® FastTrack Server, along with an open API that allows easy
integration of almost any application into the HA framework. Combined
with ccNUMA-based SGI servers, IRIS FailSafe provides the best
solution to bring high levels of service availability to businesses
across industries such as telecommunications, manufacturing, and many
others.
HA clusters configured this way provide a cost-effective yet flexible
solution to meet the increased availability demands of businesses
whose services must be available around the clock. These
businesses range from universities that keep student
records, assignments, and other material on servers that must be
accessible to students, faculty, and staff at all times, to
printing and publishing companies that house all their content
databases on powerful, high-bandwidth Unix servers.
Highly-Available Services with IRIS FailSafe
IRIS FailSafe is the HA infrastructure solution that enables
up to 8 nodes (which can be any combination of SGI Origin®
servers) to be part of a cluster that provides highly-available
services to clients connected to the cluster via standard networks
such as Ethernet, FDDI, and ATM. The cluster is set up to detect failures
quickly and take the necessary steps to minimize their impact.
This is achieved in part by limiting the impact of a failure
to the resources in a resource group. The recovery steps can range from
attempting to access the storage from an alternate path (in case of a
storage access path failure) to failing over the application to another
node in the cluster. IRIS FailSafe leverages the IRIX kernel
capabilities to detect and provide recovery from failures at various
levels:
- Disk failure
- Storage path failure
- System failures
- Network failures
- Application failures
Dynamic Cluster Configuration to Suit Your Needs
IRIS FailSafe uses distributed software technology to allow highly
flexible and scalable cluster configuration.
Should you need to add a system to the cluster because the processing
power requirements of the HA services have outgrown what is currently
available, another server can be added to the cluster dynamically, and
you can redistribute the services among the systems in the newly
formed cluster without disrupting other HA applications in the
cluster.
Systems can be added or removed, services and applications can be
added to or removed from the cluster, and even the resources on which
an application or service depends can be modified on the fly. All of
this lets users keep disruption to their services to a minimum
when the cluster configuration needs to change.
Java-Based Portable Cluster Manager
The IRIS FailSafe cluster manager graphical user interface (GUI) allows
users to set up, administer, and monitor their high-availability
cluster with ease. The Java-based GUI consists of:
- FailSafe Manager to configure your cluster and set it up to
run in production mode.
- FailSafe Cluster View to display a
dynamic graphical overview that lets you monitor the state
of your cluster, obtain detailed information on specific
highly available resources, and modify the
cluster configuration.
The web-like GUI design lets users click any
blue text to get more information (glossary,
help, or configuration details) and to
change the cluster configuration (for example, move a resource
group or define a cluster).
Services Available During Planned Maintenance
A hardware or software upgrade normally
results in system downtime. In contrast, an IRIS FailSafe cluster
allows you to migrate the services of the target node to other nodes
in the cluster, remove the node from the cluster, upgrade it as needed,
and then bring it back to rejoin the cluster. The services can then be
redistributed among all the cluster members. This enables IT
managers to minimize service downtime during planned system
maintenance.
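The maintenance sequence described above can be sketched as a small simulation. This is an illustrative model only, not the IRIS FailSafe interface: the `placement` dictionary, the function names, and the round-robin rebalancing policy are all assumptions made for the example.

```python
def drain_node(placement, node):
    """Move every service off `node` onto the least-loaded remaining node.

    `placement` is a hypothetical model: node name -> list of service names.
    """
    targets = [n for n in placement if n != node]
    if not targets:
        raise ValueError("cannot drain the only node in the cluster")
    for service in placement[node][:]:
        # Choose the node hosting the fewest services (name breaks ties).
        target = min(targets, key=lambda n: (len(placement[n]), n))
        placement[target].append(service)
    placement[node] = []


def rebalance(placement):
    """Redistribute all services round-robin over the current members."""
    services = sorted(s for hosted in placement.values() for s in hosted)
    nodes = sorted(placement)
    for n in nodes:
        placement[n] = []
    for i, service in enumerate(services):
        placement[nodes[i % len(nodes)]].append(service)


def planned_maintenance(placement, node, upgrade):
    """Migrate services, remove the node, upgrade it, rejoin, redistribute."""
    drain_node(placement, node)   # 1. migrate services to other nodes
    del placement[node]           # 2. remove the node from the cluster
    upgrade(node)                 # 3. upgrade while services keep running
    placement[node] = []          # 4. the node rejoins the cluster
    rebalance(placement)          # 5. redistribute among all members


# Example with made-up node and service names.
placement = {"node1": ["nfs", "samba"], "node2": ["web"], "node3": ["oracle"]}
upgraded = []
planned_maintenance(placement, "node1", upgraded.append)
```

At no point in the sequence is a service left without a hosting node, which is the property that keeps services available during planned maintenance.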
Popular Highly-Available Solutions
Packaged with agents for popular applications such as NFS, Samba, and
Netscape Enterprise Server, IRIS FailSafe enables easy setup of
highly available file servers, Web servers, and more. The following agents are
available from SGI for use with IRIS
FailSafe:
| Application Agent | Highly-Available Solution |
| --- | --- |
| TMF agent | To set up highly-available TMF servers for tape management |
| NFS agent | To set up highly-available file servers for Unix clients |
| Samba agent | To set up highly-available file servers for PC clients |
| Netscape server agent | To set up highly-available Web servers |
| Informix agent | To set up highly-available Informix application servers |
| Oracle agent | To set up highly-available Oracle application servers |
| DMF agent | To set up highly-available DMF servers for your HSM environment |
Open API for Easy Integration of Other Applications
IRIS FailSafe has implemented an open API to facilitate easy
integration of third party applications. Applications with certain
characteristics that make them HA-capable can be easily integrated
into the HA framework provided by IRIS FailSafe. These characteristics
include:
- The service does not depend on state held across start/stop operations,
i.e., it can be started and stopped multiple times without losing any state.
- The service keeps its data organized and self-contained,
i.e., it keeps all its files in one directory structure that can be
duplicated across systems easily, rather than scattered across the file system.
- The service can run as multiple instances on the same node or
across systems without problems.
SGI Managed Services team members can provide the custom engineering
to integrate any customer desired applications into the IRIS FailSafe
umbrella. A programmer's guide is also provided for those wanting to
explore the task themselves.
Note that the application itself does *not* need to be modified in any way to
integrate with the IRIS FailSafe framework.
Power of Distributed Computing and Enhanced Throughput for Workloads
Because the HA cluster is built by joining multiple systems that
watch over one another, it also brings the power of distributed
computing to the user's environment. Each of the systems in the
cluster is a complete system with its own I/O, networking, and CPU
capabilities. Therefore, by carefully defining the data boundaries and
distributing the workload of one system onto two or more systems,
users can take advantage of the additional I/O throughput and networking bandwidth of
the multiple systems in the cluster.
Flexible Cluster Topology for Better Capacity Planning
IRIS FailSafe allows for N+1 or N+M and ring or star cluster
configurations, so that users can plan for the backup capacity that
best suits their requirements. In an N+1 cluster configuration, one
system is designated as the dedicated backup system for N primary
systems and must be able to assume the workload of the highest-capacity
system. In the N+M configuration, multiple systems can be designated
as backups while also performing their primary activities. Flexible,
user-configurable failover policies allow users to manage the
environment and workload distribution according to their needs.
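The N+1 sizing rule above reduces to a single comparison, sketched here. The function name, the load figures, and the idea of expressing workload in abstract capacity units are assumptions for illustration; they are not part of IRIS FailSafe.

```python
def n_plus_1_backup_ok(primary_loads, backup_capacity):
    """Return True if one dedicated backup can absorb any single primary.

    primary_loads: hypothetical mapping of primary node -> workload,
    in the same (arbitrary) units as backup_capacity.
    """
    if not primary_loads:
        raise ValueError("need at least one primary system")
    # The backup must cover the largest primary, since any one may fail.
    return backup_capacity >= max(primary_loads.values())


# Made-up figures for three primaries.
primaries = {"node1": 40, "node2": 75, "node3": 60}
large_backup_ok = n_plus_1_backup_ok(primaries, 80)
small_backup_ok = n_plus_1_backup_ok(primaries, 70)
```

With these figures, a backup of capacity 80 passes the check and one of capacity 70 does not, because `node2` alone carries 75 units.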
Protecting Data Integrity
Within an HA cluster, IRIS FailSafe not only makes applications highly
available, it also ensures data integrity by making sure that in the
event of a failover of services, the failed node does not
inadvertently (and most undesirably) attempt to write to the data
store again. Using the cluster membership protocols, when a node is
determined to be down, it is reset by one of the other cluster
members in order to prevent the "split brain syndrome".
Without this guarantee, more than one node could access a disk
simultaneously and compromise data integrity.
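The ordering that matters here, reset first and take over second, can be sketched as follows. The `Cluster` class, its method names, and the event log are hypothetical and exist only to illustrate the sequence; the real membership protocol and reset mechanism are not described in this document.

```python
class Cluster:
    """Toy model: tracks which node serves each resource group."""

    def __init__(self, owners):
        # owners: resource group name -> node currently serving it
        self.owners = dict(owners)
        self.log = []

    def reset_node(self, node):
        # Stand-in for the reset issued by a surviving cluster member.
        self.log.append(("reset", node))

    def handle_node_down(self, failed, survivor):
        """Fence the failed node, then move its resource groups."""
        if failed == survivor:
            raise ValueError("a node cannot take over from itself")
        # Reset BEFORE any takeover, so the failed node cannot write
        # to shared storage while the survivor is using it.
        self.reset_node(failed)
        for group, owner in self.owners.items():
            if owner == failed:
                self.owners[group] = survivor
                self.log.append(("takeover", group, survivor))


# Example with made-up names.
cluster = Cluster({"nfs-group": "node1", "web-group": "node2"})
cluster.handle_node_down("node1", "node2")
```

Because the reset is logged before any takeover, there is no moment in this model at which both nodes own the same resource group.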
Enhancing Availability in the Entire Processing Environment
HA clusters created with IRIS FailSafe provide a strong foundation for
hosting highly available services for your mission critical
environments. However, several other factors must be considered and
the environment must be designed top-to-bottom for HA. The cluster
environment should be set up to eliminate as many single points of
failure as possible, and factors such as power failure should be
addressed.