Computer cluster

A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing is cloud computing.

Not to be confused with data cluster or grid computing.

The components of a cluster are usually connected to each other through fast local area networks, with each node (a computer used as a server) running its own instance of an operating system. In most circumstances, all of the nodes use the same hardware^[1] and the same operating system, although in some setups (e.g. using Open Source Cluster Application Resources (OSCAR)), different operating systems or different hardware can be used on each computer.^[2]

Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.^[3]

Computer clusters emerged as a result of the convergence of a number of computing trends, including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing. They have a wide range of applicability and deployment, ranging from small business clusters with a handful of nodes to some of the fastest supercomputers in the world, such as IBM's Sequoia.^[4] Prior to the advent of clusters, single-unit fault-tolerant mainframes with modular redundancy were employed, but the lower upfront cost of clusters and the increased speed of network fabric have favoured the adoption of clusters. In contrast to high-reliability mainframes, clusters are cheaper to scale out, but they are also more complex in their error handling, because error modes in a cluster are not opaque to running programs.^[5]

Benefits

Clusters are primarily designed with performance in mind, but installations are based on many other factors. Fault tolerance (the ability of a system to continue operating despite a malfunctioning node) supports scalability and, in high-performance situations, allows for a low frequency of maintenance routines, resource consolidation (e.g. RAID), and centralized management. Other advantages include data recovery in the event of a disaster, parallel data processing, and high processing capacity.^[16]^[17]

Clusters scale horizontally: more computers can be added to the cluster to improve its performance, redundancy, and fault tolerance. This can be a less expensive route to a higher-performing cluster than scaling up a single node in the cluster. It also allows larger computational loads to be executed by a larger number of lower-performing computers.

Adding a new node to a cluster also increases reliability, because the entire cluster does not need to be taken down. A single node can be taken down for maintenance while the rest of the cluster takes on the load of that node.
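The redistribution of load during maintenance can be illustrated with a minimal sketch. The function name `drain_node`, the node names, and the policy of sharing the drained node's load evenly are assumptions made for illustration; real cluster managers use more elaborate placement policies.

```python
def drain_node(loads, node):
    """Return a new load map with `node` removed and its load shared
    evenly among the remaining nodes (a hypothetical, simplified policy)."""
    if node not in loads:
        raise KeyError(node)
    remaining = {name: load for name, load in loads.items() if name != node}
    if not remaining:
        raise ValueError("cannot drain the only node in the cluster")
    # Each surviving node takes an equal share of the drained node's load.
    share = loads[node] / len(remaining)
    return {name: load + share for name, load in remaining.items()}


if __name__ == "__main__":
    # Three hypothetical nodes, each carrying 30 units of load.
    cluster = {"node-a": 30.0, "node-b": 30.0, "node-c": 30.0}
    print(drain_node(cluster, "node-b"))  # {'node-a': 45.0, 'node-c': 45.0}
```

The total load is unchanged by the drain; only its distribution across the remaining nodes changes, which is why the cluster as a whole can stay in service.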

A large number of clustered computers lends itself to the use of distributed file systems and RAID, both of which can increase the reliability and speed of a cluster.

Software development and administration

Parallel programming

Load-balancing clusters, such as web servers, use cluster architectures to support a large number of users. Typically each user request is routed to a specific node, which achieves task parallelism without multi-node cooperation, since the main goal of the system is to provide rapid user access to shared data. However, clusters that perform complex computations for a small number of users need to take advantage of the parallel processing capabilities of the cluster and partition the same computation among several nodes.^[27]
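Request routing of this kind can be sketched as follows. Round-robin rotation is only one possible routing policy and is an assumption here, as are the class name `RoundRobinBalancer`, the function `dispatch`, and the node and request labels.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Assign each incoming request to the next node in a fixed rotation."""

    def __init__(self, nodes):
        nodes = list(nodes)
        if not nodes:
            raise ValueError("at least one node is required")
        self._rotation = cycle(nodes)

    def route(self, request):
        # The request content is ignored; only arrival order matters here.
        return next(self._rotation)


def dispatch(requests, nodes):
    """Pair every request with the node chosen to serve it."""
    balancer = RoundRobinBalancer(nodes)
    return [(request, balancer.route(request)) for request in requests]


if __name__ == "__main__":
    for request, node in dispatch(["r1", "r2", "r3", "r4"], ["n1", "n2", "n3"]):
        print(request, "->", node)
```

Each request is handled entirely by one node, so the nodes never need to coordinate with one another.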

Automatic parallelization of programs remains a technical challenge, but parallel programming models can be used to achieve a higher degree of parallelism through the simultaneous execution of separate portions of a program on different processors.^[27]^[28]
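Partitioning one computation into separately executed portions can be sketched as below. The example is an illustration only: worker threads on a single machine stand in for cluster nodes, and the function names and the sum-of-squares workload are assumptions. On a real cluster the chunks would be sent to other nodes through a message-passing layer.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks receive one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The portion of the computation that each worker runs on its chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=4):
    """Compute partial results per chunk, then combine them."""
    chunks = partition(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), workers=3))  # 285
```

The pattern has three steps: split the input, run the same function on every chunk, and combine the partial results.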

Debugging and monitoring

Developing and debugging parallel programs on a cluster requires parallel language primitives and suitable tools, such as those discussed by the High Performance Debugging Forum (HPDF), which resulted in the HPD specifications.^[21]^[29] Tools such as TotalView were then developed to debug parallel implementations on computer clusters that use Message Passing Interface (MPI) or Parallel Virtual Machine (PVM) for message passing.

The University of California, Berkeley Network of Workstations (NOW) system gathers cluster data and stores them in a database, while a system such as PARMON, developed in India, allows visually observing and managing large clusters.^[21]

Application checkpointing can be used to restore a given state of the system when a node fails during a long multi-node computation.^[30] This is essential in large clusters, given that as the number of nodes increases, so does the likelihood of node failure under heavy computational loads. Checkpointing can restore the system to a stable state so that processing can resume without needing to recompute results.^[30]
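The idea behind checkpointing can be illustrated with a minimal single-process sketch. The file format (JSON), the function names, and the `fail_at` parameter used to simulate a node failure are all assumptions for illustration; production checkpointing systems capture far more state and coordinate across nodes.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it into place, so a
    crash during the write cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run(values, path, fail_at=None):
    """Sum `values`, checkpointing after every step.
    `fail_at` (hypothetical) simulates a node failure at that index."""
    state = load_checkpoint(path)
    for index in range(state["next_index"], len(values)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += values[index]
        state["next_index"] = index + 1
        save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    checkpoint = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    data = [1, 2, 3, 4, 5]
    try:
        run(data, checkpoint, fail_at=3)
    except RuntimeError:
        pass  # the first three values are already saved in the checkpoint
    print(run(data, checkpoint))  # resumes at index 3 and prints 15
```

After the simulated failure, the second call resumes from the saved index instead of recomputing the first three steps.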

Implementations

Linux supports a variety of cluster software. For application clustering, there are distcc and MPICH. Linux Virtual Server and Linux-HA are director-based clusters that allow incoming requests for services to be distributed across multiple cluster nodes. MOSIX, LinuxPMI, Kerrighed, and OpenSSI are full-blown clusters integrated into the kernel that provide automatic process migration among homogeneous nodes. OpenSSI, openMosix, and Kerrighed are single-system image implementations.

Microsoft Windows Compute Cluster Server 2003, based on the Windows Server platform, provides components for high-performance computing such as the job scheduler, the MSMPI library, and management tools.

gLite is a set of middleware technologies created by the Enabling Grids for E-sciencE (EGEE) project.

Slurm is also used to schedule and manage some of the largest supercomputer clusters (see the TOP500 list).

Other approaches

Although most computer clusters are permanent fixtures, attempts at flash mob computing have been made to build short-lived clusters for specific computations. However, larger-scale volunteer computing systems such as BOINC-based systems have had more followers.

