The Slurm Workload Manager, formerly known as Simple Linux Utility for Resource Management (SLURM), or simply Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters. It provides three key functions:

* allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
* providing a framework for starting, executing, and monitoring work, typically a parallel job such as Message Passing Interface (MPI), on a set of allocated nodes, and
* arbitrating contention for resources by managing a queue of pending jobs.

Slurm is the workload manager on about 60% of the TOP500 supercomputers. Slurm uses a best fit algorithm based on Hilbert curve scheduling or fat tree network topology in order to optimize locality of task assignments on parallel computers.
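The idea behind Hilbert curve scheduling can be sketched in a few lines of Python. This is an illustration only, not Slurm's implementation: it assumes nodes sit on a hypothetical n x n grid (n a power of two), maps each node to its position along a Hilbert curve so that nearby positions are physically close, and then applies a simple best-fit rule on that one-dimensional ordering. The names `hilbert_index` and `best_fit` are made up for this example.

```python
def hilbert_index(n, x, y):
    """Position of grid cell (x, y) along a Hilbert curve covering an
    n x n grid, where n is a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def best_fit(free_positions, count):
    """Pick `count` nodes from the smallest run of consecutive free
    curve positions that is large enough; None if no run fits."""
    runs, run = [], []
    for pos in sorted(free_positions):
        if run and pos != run[-1] + 1:
            runs.append(run)
            run = []
        run.append(pos)
    if run:
        runs.append(run)
    candidates = [r for r in runs if len(r) >= count]
    if not candidates:
        return None
    return min(candidates, key=len)[:count]


# Hypothetical cluster state: free nodes given by their curve positions.
free = {0, 1, 2, 5, 6, 9, 10, 11, 12}
print(best_fit(free, 2))  # [5, 6]: the tightest run that still fits
```

Choosing the tightest run keeps larger contiguous runs available for bigger jobs, while the curve ordering means that a contiguous run corresponds to nodes that are close together on the grid.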


History

Slurm began development as a collaborative effort primarily by Lawrence Livermore National Laboratory, SchedMD, Linux NetworX, Hewlett-Packard, and Groupe Bull as a Free Software resource manager. It was inspired by the closed source Quadrics RMS and shares a similar syntax. The name is a reference to the soda in Futurama. Over 100 people around the world have contributed to the project. It has since evolved into a sophisticated batch scheduler capable of satisfying the requirements of many large computer centers. The TOP500 list of the most powerful computers in the world indicates that Slurm is the workload manager on more than half of the top ten systems.


Structure

Slurm's design is very modular with about 100 optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes. More sophisticated configurations provide database integration for accounting, management of resource limits and workload prioritization.


Features

Slurm features include:

* No single point of failure, backup daemons, fault-tolerant job options
* Highly scalable (schedules up to 100,000 independent jobs on the 100,000 sockets of IBM Sequoia)
* High performance (up to 1000 job submissions per second and 600 job executions per second)
* Free and open-source software (GNU General Public License)
* Highly configurable with about 100 plugins
* Fair-share scheduling with hierarchical bank accounts
* Preemptive and gang scheduling (time-slicing of parallel jobs)
* Integrated with database for accounting and configuration
* Resource allocations optimized for network topology and on-node topology (sockets, cores and hyperthreads)
* Advanced reservation
* Idle nodes can be powered down
* Different operating systems can be booted for each job
* Scheduling for generic resources (e.g. graphics processing units)
* Real-time accounting down to the task level (identify specific tasks with high CPU or memory usage)
* Resource limits by user or bank account
* Accounting for power consumption by job
* Support of IBM Parallel Environment (PE/POE)
* Support for job arrays
* Job profiling (periodic sampling of each task's CPU use, memory use, power consumption, network and file system use)
* Sophisticated multifactor job prioritization algorithms
* Support for MapReduce+
* Support for burst buffers that accelerate scientific data movement

The following features were announced for version 14.11 of Slurm, released in November 2014:

* Improved job array data structure and scalability
* Support for heterogeneous generic resources
* Add user options to set the CPU governor
* Automatic job requeue policy based on exit value
* Report API use by user, type, count and time consumed
* Communication gateway nodes improve scalability


Supported platforms

Recent Slurm releases run only on Linux. Older versions had been ported to a few other POSIX-based operating systems, including BSDs (FreeBSD, NetBSD and OpenBSD), but this is no longer feasible as Slurm now requires cgroups for core operations. Clusters running operating systems other than Linux will need to use a different batch system, such as LPJS. Slurm also supports several unique computer architectures, including:

* IBM BlueGene/Q models, including the 20 petaflop IBM Sequoia
* Cray XT, XE and Cascade
* Tianhe-2, a 33.9 petaflop system with 32,000 Intel Ivy Bridge chips and 48,000 Intel Xeon Phi chips with a total of 3.1 million cores
* IBM Parallel Environment
* Anton


License

Slurm is available under the GNU General Public License v2.


Commercial support

In 2010, the developers of Slurm founded SchedMD, which maintains the canonical source and provides development, level 3 commercial support and training services. Commercial support is also available from Bull, Cray, and Science + Computing (subsidiary of Atos).


Usage

The Slurm system has three main parts:

* slurmctld, a central control daemon running on a single control node (optionally with failover backups);
* many computing nodes, each with one or more slurmd daemons;
* clients that connect to the manager node, often with ssh.

The clients can issue commands to the control daemon, which accepts the workload and divides it among the computing daemons. For clients, the main commands are srun (queue up an interactive job), sbatch (queue up a job), squeue (print the job queue) and scancel (remove a job from the queue).

Jobs can be run in batch mode or interactive mode. In interactive mode, a compute node starts a shell, connects the client to it, and runs the job. From there the user may observe and interact with the job while it is running. Usually, interactive jobs are used for initial debugging, and after debugging, the same job is submitted with sbatch. For a batch mode job, stdout and stderr are typically directed to text files for later inspection.
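As a rough illustration of the client commands described above, the following Python sketch only builds the command lines and does not run them, so it works on a machine without Slurm installed. The helper names, the script name `job.sh`, the job name and the resource values are hypothetical; the options shown (`--job-name`, `--nodes`, `--ntasks`, `--time`, `--output`, `--pty`, `--user`) and the `%j` job-ID placeholder are standard Slurm options that the text above does not itself mention, so check them against the local installation's manual pages.

```python
import shlex


def sbatch_command(script, job_name, nodes, ntasks, time_limit, output):
    """Command line that queues `script` as a batch job."""
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--nodes={nodes}",
        f"--ntasks={ntasks}",
        f"--time={time_limit}",   # wall-clock limit, e.g. "00:10:00"
        f"--output={output}",     # file that receives the job's stdout
        script,
    ]


def srun_interactive_command(nodes, time_limit, shell="bash"):
    """Command line that requests an interactive shell on a compute node."""
    return ["srun", f"--nodes={nodes}", f"--time={time_limit}", "--pty", shell]


def squeue_command(user):
    """Command line that prints the queued jobs of one user."""
    return ["squeue", f"--user={user}"]


def scancel_command(job_id):
    """Command line that removes a job from the queue."""
    return ["scancel", str(job_id)]


# Hypothetical values, for illustration only.
batch = sbatch_command("job.sh", "demo", nodes=2, ntasks=8,
                       time_limit="00:10:00", output="demo-%j.out")
print(shlex.join(batch))
print(shlex.join(srun_interactive_command(nodes=1, time_limit="00:30:00")))
print(shlex.join(squeue_command("alice")))
print(shlex.join(scancel_command(12345)))
```

Keeping each command as a list of arguments means it could later be passed to `subprocess.run` without shell quoting problems; the sketch stops short of that so it has no dependency on a running cluster.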


See also

* Job Scheduler and Batch Queuing for Clusters
* Beowulf cluster
* Maui Cluster Scheduler
* Open Source Cluster Application Resources (OSCAR)
* TORQUE
* Univa Grid Engine
* Platform LSF



External links

* Slurm Documentation
* Slurm Workload Manager Architecture Configuration and Use

