Intro to Cluster Computing
What is a Cluster?
A computing cluster is a group of closely linked computers that work together as a single system. This configuration delivers better performance than any of the individual computers, called nodes, could achieve working alone. Nodes are divided into three types. The majority are compute nodes, which actually carry out the computations. Login nodes handle login requests, allow users to interact with their data, and accept job submissions destined for the backend compute nodes. Lastly, the head node controls the cluster and contains the configuration information for all the other nodes in the cluster.
The head node uses a resource manager to track the cluster's resources, such as node availability, RAM, and CPU time, and a scheduler to decide where and when jobs will run. SLURM serves as both the resource manager and the scheduler on The Forge. For more information on SLURM please see its website at www.slurm.schedmd.com. The head node runs the SLURM control daemon, which listens to the client daemons running on the backend compute nodes. These client daemons report the status of currently running jobs and the general health of each node, and the control daemon uses this information to track resource availability. The scheduler decides which jobs to run based on several criteria, including the resources the user requests in their job file, expected resource availability, and the number of jobs the user already has running, a policy known as “fair share” scheduling.
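To make the request/schedule cycle concrete, a SLURM job file is an ordinary shell script annotated with #SBATCH directives that tell the scheduler what resources the job needs. The sketch below is only illustrative; the partition name, memory figure, and program name are hypothetical and will differ on any particular cluster.

```shell
#!/bin/bash
#SBATCH --job-name=example      # name shown in the queue
#SBATCH --partition=general     # hypothetical partition name
#SBATCH --ntasks=4              # request 4 CPUs
#SBATCH --time=00:15:00         # wall-clock time limit (15 minutes)
#SBATCH --mem=4G                # memory requested per node

# Everything below runs on a compute node once the scheduler starts the job.
echo "Running on $(hostname)"
srun ./my_program               # my_program is a placeholder executable
```

A job file like this would be submitted from a login node with `sbatch job.sh`, and its place in the queue checked with `squeue -u $USER`.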
Some Common Terms
cpu time- the amount of time a CPU is in use, measured per CPU (so a 4-processor job that runs for 15 minutes uses 1 hour of cpu time).
job- any user-submitted program executed on the cluster.
job file- a file describing a job and its resource requirements, used by the resource manager and the scheduler to schedule the job.
queue- an ordered group of jobs waiting to run.
partition- a virtual container of resources to which the scheduler can assign jobs.
wall-clock time- the amount of time a job is expected to run, or has run, on the cluster (a 4-processor job that runs for 15 minutes takes 15 minutes of wall-clock time).
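The difference between cpu time and wall-clock time comes down to a multiplication: cpu time is the number of processors times the wall-clock time. A minimal sketch of the 4-processor example used above:

```shell
# cpu time = processors x wall-clock time
procs=4            # a 4-processor job
wall_minutes=15    # runs for 15 minutes of wall-clock time
cpu_minutes=$((procs * wall_minutes))
echo "wall-clock: ${wall_minutes} min, cpu time: ${cpu_minutes} min"
```

This is why a job's cpu time charge grows with every processor it holds, even while its wall-clock time stays the same.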
ITRSS currently supports two clusters, The NIC and The Forge. The NIC (Numerically Intensive Computing) cluster was built in 2009 on a handful of general compute nodes with a total core count of 652, running CentOS 6.0. Resources have been added to The NIC every year, and the total core count has grown to just over 2600. In 2014 ITRSS began looking to expand the HPC facilities on campus significantly, and it was decided that the new hardware would be built into a new cluster, The Forge. The Forge is built on 20 Super Micro compute nodes using the newest Haswell Intel chips totaling 1280 cores, a large 350 TB shared file system for data, and the latest high-speed EDR InfiniBand interconnect, and runs CentOS 6.5. In the summer of 2015 The Forge opened for system testing by a small group of researchers and students, and it went into production at the beginning of the fall 2015 semester. ITRSS will migrate the newest computing hardware in The NIC cluster to The Forge cluster and discontinue support for The NIC cluster.
S&T students and researchers who have a need for high performance computing have access to consulting and hosting provided by ITRSS. Historically, ITRSS has been able to provide some of the industry's newest hardware designed for high performance computing to S&T students at no cost to them. ITRSS also provides consulting services for research faculty who need HPC equipment; specialized equipment is selected to provide the maximum benefit to the researcher and the computing environment as a whole.