Intro to Cluster Computing

What is a Supercomputing Cluster?

A supercomputing cluster is an aggregation of computing power for the purpose of providing high-performance computing for users.  At its most basic, a cluster is a collection of computers linked together by sets of very fast connection wires.  The Forge cluster at Missouri S&T is used for academic research and is optimized as "software as a service" for performing floating-point operations and fast internal networks, as well as storage.

Computer clusters emerged as a result of the convergence of a number of computing trends including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing. They have a wide range of applicability and deployment, ranging from small business clusters with a handful of nodes to some of the fastest supercomputers in the world.

Cluster Components

Node

A node is a physical, stand-alone computer that can handle computing tasks and run jobs. It's connected to other compute nodes via a fast network interconnect and contains CPUs, memory, and devices managed by an operating system.

Server Chassis

A server chassis is a metal structure that is used to house or physically assemble servers in various different form factors. A server chassis makes it possible to put multiple nodes as well as other storage and peripheral equipment in a single physical body.

Cluster

A cluster is the complete collection of nodes with networking and file storage facilities. It's usually a group of independent computers connected via a fast network interconnect, managed by a resource manager, which acts as a large parallel computer.

Memory (RAM)

RAM (random access memory) is the place in a computing device where the operating system, application programs, and data in current use are kept so they can be quickly reached by the device's processor.

Socket

A socket is a connector that houses the microprocessor. By extension, it represents the physical package of a processor, that typically contains multiple cores.

CPU

A Central Processing Unit (CPU), or core, or CPU core, is the smallest unit in a microprocessor that can run programs. Modern processors typically have multiple cores.  The difference between a GPU and a CPU is CPUs are slower, but more detail-oriented than a GPU.

GPU

A Graphical Processing Unit (GPU) is a specialized device initially designed to generate graphical output. On modern computing architecture, they are used to accelerate certain types of computation; they are much faster at this computation than a CPU is, but can't handle the complicated tasks that a CPU can. GPUs have their own memory and are attached to CPUs, within a node. Each compute node can host one or more GPUs.

PCIe Link

PCIe, technically Peripheral Component Interconnect Express but often seen abbreviated as PCIe or PCI-E, is a standard type of connection for internal devices in a computer. Generally, PCIe refers to the actual expansion slots on the motherboard that accept PCIe-based expansion cards and to the types of expansion cards themselves.

Infiniband Link

Infiniband is a computer-networking communications standard used in high-performance computing that features very high throughput and very low latency. In otherwords, Infiniband can communicate a lot of data very quickly.  It is often used for data interconnect both among and within computers.

Node, Chassis, and Cluster conceptual design schematics

Types of Nodes

Compute Nodes: These are what actually carry out the computations.

Login Nodes: Login nodes are points of access to a computer cluster. Users usually connect to login nodes via SSH to compile and debug their code, review their results, do some simple tests, and submit their batch jobs to the parallel computer.

***Important Note:*** Login nodes are not for computing. Login nodes are usually shared among many users and therefore must not be used to run computationally intensive tasks. Those should be submitted to the scheduler which will dispatch them on compute nodes.

Head Node: The head node controls the cluster and contains the configuration information for all the other nodes in the cluster.

 

Common Terminology

To make this documentation more accessible, we attempt to explain key terms in a relatively simple and non-technical way.

Application

An application is a computer program designed to perform a group of coordinated functions, tasks, or activities for the benefit of the user. In the context of scientific computing, an application typically performs computations related to a scientific goal (molecular dynamics simulations, genome assembly, computational fluid dynamics simulations, etc).

Backfill

Backfill scheduling is a method that a scheduler can use in order to maximize utilization. It allows smaller (both in terms of size and time requirements), lower priority jobs to start before larger, higher priority ones, as long as doing so doesn't push back the higher-priority jobs expected start time.

CPU Time

CPU time is the amount of time for which a central processing unit (CPU) was used for processing instructions of a computer program or operating system, as opposed to, for example, waiting for input/output (I/O) operations or entering low-power (idle) mode.

Executable

A binary (or executable) program refers to the machine-code compiled version of an application. This is a binary file that a computer can execute directly. This is not the same as the application source code, which is the human-readable version of the application internal instructions, and needs to be compiled by a compiler to produce the executable binary.

Fairshare

A resource scheduler ranks jobs by priority for execution. Each job's priority in queue is determined by multiple factors.  One of these factors is the user's fairshare score. A user's fairshare score is computed based on a target (the given portion of the resources that this user should be able to use) and the user's effetive usage, ie the amount of resources (s)he effectively used in the past. As a result, the more resources past jobs have used, the lower the priority of the next jobs will be. Past usage is computed based on a sliding window and is progressively forgotten over time.  Giving higher priorty to users who have been underserved in the past enables all users on a shared resource to be able to use a fair portion of it.

FLOPS

Floating-point Operations Per Second (FLOPS) are a measure of computing performance, and represent the number of floating-point operations that a CPU can perform each second. Modern CPUs and GPUs are capable of doing TeraFLOPS (10^12 floating-point operations per second), depending on the precision of those operations (half-precision: 16 bits, single-precision: 32 bits, double-precision: 64 bits).

IOPS

Input/output operations per second (IOPS, pronounced eye-ops) is an input/output performance measurement used to characterize computer storage system performance.

Job (Batch Job)

A job, or batch job, is the scheduler’s base unit of calculating which resources will be allocated to a user for a specified amount of time. Users create job submission scripts to ask the scheduler for resources such as cores, memory, runtime, etc. The scheduler then puts the requests in a queue and allocates requested resources based on jobs’ priority.

Job File (Batch File)

A job file is the file where the sequence of commands that perform a job are listed. 

Job Step

Job steps are sets of (possibly parallel) tasks within a job.

Modules

Environment modules, or software modules, are a type of software management tool used on in most HPC environments. Using modules enable users to selectively pick the software that they want to use and add them to their environment. This allows users to switch between different versions or flavors of the same software, pick compilers, libraries, and software components and avoid conflicts between them.

MPI

Message Passing Interface (MPI) is a standardized and portable message-passing system designed to exchange information between processes running on different nodes. There are several implementations of the MPI standard, which is the most common way used to scale parallel applications beyond a single compute node.

OpenMP

Open Multi Processing (OpenMP) is a parallel programming model designed for shared memory architecture. It's based on pragmas that can be added in applications to let the compiler generate a code that can run on multiple cores within the same node.

Partition

A partition is a set of compute nodes within a cluster with a common feature. For example, compute nodes with GPUs, or compute nodes belonging to same owner, could form a partition.

QOS

A Quality Of Service (QOS) is the set of rules and limitations that apply to a categories of job. The combination of a partition (set of machines where a job can run) and QOS (set of rules that applies to that job) makes what is often referred to as a scheduler queue.

Queue

A queue is a class of compute node that is managed by the scheduler. 

Run Time (Walltime)

The run time, or walltime, of a job is the time required to finish its execution.

Scheduler

The goal of a scheduler is to find the appropriate resources to run a set of computational tasks in the most efficient manner. Based on resource requirements and job descriptions, it will prioritize those jobs, allocate resources (nodes, CPUs, memory) and schedule their execution.

Slurm

Simple Linux Utility for Resource Management (SLURM) is a software that manages computing resources and schedules tasks on them. Slurm coordinates the running of many programs on a shared facility and makes sure that resources are used in an optimal manner.

SSH

Secure Shell (SSH) is a protocol to securely access remote computers. Based on the client-server model, multiple users with an SSH client can access a remote computer. Some operating systems such as Linux and Mac OS have a built-in SSH client, while others can use one of many publicly available clients.

Thread

A process, in the simplest terms, is an executing program. One or more threads run in the context of the process. A thread is the basic unit to which the operating system allocates processor time. A thread can execute any part of the process code, including parts currently being executed by another thread. Threads are co-located on the same node.

Task

In the Slurm context, a task is to be understood as a process. A multi-process program is made of several tasks. A task is typically used to schedule a MPI process, that in turn can use several CPUs. By contrast, a multi-threaded program is composed of only one task, which uses several CPUs.