Setup Slurm cluster for HPC

Slurm, or Simple Linux Utility for Resource Management, is an open-source job scheduler and workload manager for high performance computing (HPC) platforms. It helps manage and distribute compute resources to users, and can start multiple jobs on a single node or a single job on multiple nodes. Slurm’s scheduling capabilities can help improve productivity, reduce costs, and accelerate job execution.

In this blog, we will set up an HPC cluster with Slurm and run some sample jobs to demonstrate its functionality.

Architecture

  • slurmctld — the Slurm controller daemon (runs on the head node)
  • slurmd — the Slurm worker/compute daemon (runs on each compute node)
  • slurmdbd — the Slurm database daemon for accounting storage (optional)

Installation

Setup Munge (Controller or head node)

$ sudo apt install munge libmunge2 libmunge-dev
$ munge -n | unmunge | grep STATUS        

Generate munge key (Location: /etc/munge/munge.key)

$ sudo /usr/sbin/mungekey        

Setup correct permissions

$ sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
$ sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
$ sudo chmod 0755 /run/munge/
$ sudo chmod 0700 /etc/munge/munge.key
$ sudo chown -R munge: /etc/munge/munge.key        

Restart services

$ systemctl enable munge
$ systemctl restart munge
$ systemctl status munge        

Setup Munge (Worker or Compute nodes)

$ sudo apt install munge libmunge2 libmunge-dev
$ munge -n | unmunge | grep STATUS        

Copy munge.key from the controller node to all the worker nodes and set the same permissions as on the controller.
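For example, assuming root SSH access and worker hostnames computen1 through computen9 (adjust for your environment), the key can be copied with a simple loop:

$ for n in computen{1..9}; do scp /etc/munge/munge.key root@$n:/etc/munge/munge.key; done

Then set the permissions on each worker: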

$ sudo chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
$ sudo chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
$ sudo chmod 0755 /run/munge/
$ sudo chmod 0700 /etc/munge/munge.key
$ sudo chown -R munge: /etc/munge/munge.key        

Restart services

$ systemctl enable munge
$ systemctl restart munge
$ systemctl status munge        
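
To confirm that MUNGE authentication works across nodes, encode a credential on the head node and decode it on a worker (assuming SSH access from headn1 to computen1):

$ munge -n | ssh computen1 unmunge | grep STATUS

A STATUS of Success means the nodes share a valid key.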

Setup Slurm

Distribution base installation

$ sudo apt update -y
$ sudo apt install slurmd slurmctld -y        

OR

Build packages from latest source (Recommended way for production)
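
First, download the release tarball from SchedMD (24.05.2 in this example; the exact URL below is an assumption, so check https://www.schedmd.com/downloads.php for the current release):

$ wget https://download.schedmd.com/slurm/slurm-24.05.2.tar.bz2

Then install the build dependencies and build the .deb packages: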

$ apt-get install build-essential fakeroot devscripts equivs
$ tar -xaf slurm-24.05.2.tar.bz2
$ cd slurm-24.05.2
$ mk-build-deps -i debian/control
$ debuild -b -uc -us        

Create the slurm user on all the nodes

$ export SLURMUSER=1001
$ groupadd -g $SLURMUSER slurm
$ useradd  -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm  -s /bin/bash slurm        
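
Slurm expects the slurm user to have the same UID and GID on every node. A quick sanity check (assuming SSH access to all nodes) is:

$ for n in headn1 computen{1..9}; do ssh $n id slurm; done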

Install the packages on the head node (or login node).

$ dpkg -i slurm-smd_24.05.2-1_amd64.deb
$ dpkg -i slurm-smd-slurmctld_24.05.2-1_amd64.deb
$ dpkg -i slurm-smd-client_24.05.2-1_amd64.deb        

Install packages on compute nodes

$ dpkg -i slurm-smd_24.05.2-1_amd64.deb
$ dpkg -i slurm-smd-slurmd_24.05.2-1_amd64.deb
$ dpkg -i slurm-smd-client_24.05.2-1_amd64.deb        

Note: you may need to run the following to fix broken dependencies

$ apt -y --fix-broken install

Configuration

The main configuration file is /etc/slurm/slurm.conf. Keep the defaults and tweak only the few options your setup needs. Copy the same slurm.conf to all compute nodes, or keep it on NFS or another shared location so it stays identical everywhere.

ClusterName=mycluster
SlurmctldHost=headn1
SlurmUser=slurm
ProctrackType=proctrack/cgroup
AccountingStorageType=accounting_storage/none
NodeName=computen[1-9] CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=250000        
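
If you are unsure of the CPU, socket, or memory values for the NodeName line, run slurmd -C on each compute node; it prints the detected hardware as a ready-to-paste NodeName line:

$ slurmd -C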

Start services on the head or controller node

$ systemctl start slurmctld        

Start services on worker or compute nodes

$ systemctl start slurmd        
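
To make the daemons start automatically after a reboot, enable them as well (slurmctld on the head node, slurmd on the compute nodes):

$ systemctl enable slurmctld        # head node
$ systemctl enable slurmd           # compute nodes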

Validation

If everything is working, you will see the following on the head node:

root@headn1:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ALL*         up   infinite      9   idle computen[1-9]        

Test the cluster using the following command, which runs hostname on all the nodes via srun:

root@headn1:~# srun -N 9 hostname
computen2
computen9
computen4
computen5
computen3
computen6
computen1
computen7
computen8        

You can check the status of a submitted job using the squeue command. (Run sleep for 60 seconds and check its status in the queue.)

root@headn1:~# srun -N 4 sleep 60

root@headn1:~# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               126       ALL    sleep     root  R       0:05      4 computen[1-4]        
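
Besides srun, jobs are usually submitted as batch scripts with sbatch. A minimal sketch (job name, node count, and output file are illustrative):

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=2
#SBATCH --output=test_%j.out

srun hostname

Save it as test.sh, submit it with sbatch test.sh, and monitor it with squeue as above.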

Let’s test an MPI job for a more realistic HPC workload.

Copy your MPI program to NFS (or other shared storage) so it is available on all worker/compute nodes. Reference: https://slurm.schedmd.com/mpi_guide.html

hello_world.c

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}        

Compile hello_world.c

mpicc hello_world.c -o hello-world

Run it (in this example, on 4 compute nodes using -N 4):

root@headn1:/data/sample# srun -N 4 --mpi=pmix hello-world
Hello world from processor computen3, rank 0 out of 1 processors
Hello world from processor computen2, rank 0 out of 1 processors
Hello world from processor computen4, rank 0 out of 1 processors
Hello world from processor computen1, rank 0 out of 1 processors        
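
By default this launches one task per node; to start more MPI ranks, add -n with the total number of tasks, for example 8 ranks across the 4 nodes:

root@headn1:/data/sample# srun -N 4 -n 8 --mpi=pmix hello-world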

Add GPU node

A GPU node is just a compute node with a GPU card and a few extra configuration flags.

Create the /etc/slurm/gres.conf file with the following line:

NodeName=gpun1 Name=gpu AutoDetect=off File=/dev/nvidia0        
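
If the node has more than one GPU, list each device file; for example, a hypothetical node with four GPUs could be declared as follows (with Gres=gpu:4 on the matching NodeName line in slurm.conf):

NodeName=gpun1 Name=gpu AutoDetect=off File=/dev/nvidia[0-3]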

Add the following to the slurm.conf file and restart the services, as shown below.

GresTypes=gpu
NodeName=gpun1 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=250000 Gres=gpu:1 Feature=gpu        
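
For example, after updating both files, restart the daemons so the new node is recognized:

$ systemctl restart slurmctld        # head node
$ systemctl restart slurmd           # on gpun1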

Check sinfo status

root@headn1:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ALL*         up   infinite      10  idle computen[1-9],gpun1        

Create partitions for queue management

Add the following to the /etc/slurm/slurm.conf file:

PartitionName=ALL Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=COMP Nodes=computen[1-9] Shared=NO MaxTime=INFINITE State=UP
PartitionName=GPU Nodes=gpun1 Shared=NO MaxTime=INFINITE State=UP        
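
Partition changes can be applied with scontrol reconfigure (or by restarting slurmctld):

root@headn1:~# scontrol reconfigure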

Check status of partitions

root@headn1:/data/sample# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
ALL*         up   infinite     10   idle computen[1-9],gpun1
COMP         up   infinite      9   idle computen[1-9]
GPU          up   infinite      1   idle gpun1        

Now you can target a job at a specific partition or queue (using -p GPU):

root@headn1:~# srun -p GPU --gres=gpu:1 nvidia-smi -L
GPU 0: NVIDIA A10 (UUID: GPU-a759982f-198e-2303-7427-fbc160cf37bd)        

Enjoy!!!
