System architectureΒΆ

An HPC cluster is a collection of many tightly couple nodes that work together to accomplish a task. Such cluster is made up of worker nodes(typically hundreds) that are used for computation and a few login nodes, which allocated tasks to worker nodes and manage the cluster. All the nodes are connected to a shared file-system and communicate through an internal network.

To schedule tasks to worker nodes, a workload management system must be used. Slurm is one of the most widely used workload manager, with the responsibility of managing tasks and resources in a cluster. A typical Slurm architecture involves

  1. slurmctld deamon running on the login nodes. Its responsibility includes allocating and monitoring cluster resources and managing how jobs are launched on worker nodes
  2. slurmd deamon running on the worker nodes. Its responsibility is to communicate with the slurmctld deamon

The figure bellow shows a high-level view of the system architecture.

_images/architecture.png