Introduction to Supercomputing, Part 2

Using a high-performance computing system effectively and efficiently is not an easy task. This course is therefore a follow-up to the introductory Supercomputing course (Part 1): a deeper, hands-on dive into the use of supercomputers, with a special focus on efficiency and good practices.

Regularity
Every 2 months

Trainers
Maxim Masterov
Xavier Álvarez Farré
Carlos Teijeiro Barjas

What will you learn in this course?

The format of this course includes the following modules:

  • Fundamentals of performance analysis. This introductory technical presentation covers high-performance hybrid systems at an abstract level, including the architecture and configuration of the system. The aim is to build an understanding of HPC complexity before delving into performance analysis models, with special attention to the Roofline model.
    • Abstract modelling of hybrid supercomputers. An abstract modelling approach that condenses the complexity of hybrid supercomputers into three key parameters: peak performance, memory bandwidth and network bandwidth.
    • Performance analysis. An overview of different performance models, followed by a closer look at the specifics of the Roofline model.
    • The Roofline model. A description of the Roofline model and a demonstration of its practical application through clear explanations and examples.
  • File systems. This practical session covers the proper use of file systems on HPC systems, especially on Snellius.
  • Slurm hybrid tasks. Slurm, a widely used job scheduler for high-performance computing (HPC) systems, was introduced at a basic level earlier. This module covers the resource allocation parameters specific to hybrid jobs that combine shared and distributed memory.
    • Nodes, cores and tasks. This segment covers the fundamental concepts of nodes, cores and tasks, and highlights their role within the context of HPC systems.
    • Bindings. The concept of bindings is explored, providing insight into how tasks are associated with specific resources, improving participants’ understanding of resource allocation mechanisms.
    • Hands on. We will run the vector optics kernel with multiple configurations using a set of scripts.
  • QCG-PilotJob. Users sometimes need to run a large number of lightweight cases, yet supercomputer nodes are powerful and only allow relatively large allocations. For example, the smallest possible allocation on Snellius is a quarter of a node: 32 cores and 64 GB of memory. Job concurrency is a common strategy for efficiently running many lightweight jobs within such large partitions.
    • Fundamentals of job concurrency. This segment covers the basic principles of job concurrency: the simultaneous execution of multiple smaller jobs within one larger allocated partition. The goal is to optimise resource utilisation and improve efficiency in scenarios where lighter tasks run on nodes designed for heavier workloads.
    • Hands-on QCG-PilotJob. This session provides participants with hands-on experience of the QCG-PilotJob framework. Participants gain practical insight into strategies and techniques for using job concurrency to launch and manage multiple lightweight jobs within large node partitions.
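To give a flavour of the Roofline model covered in the first module, here is a minimal sketch of its core idea: attainable performance is capped either by peak compute or by memory bandwidth times arithmetic intensity. The machine parameters below are illustrative assumptions, not Snellius specifications.

```python
# Minimal sketch of the Roofline model. Attainable performance is the minimum
# of the compute "roof" (peak FLOP rate) and the memory "roof" (bandwidth
# times arithmetic intensity). All numbers below are assumed, for illustration.

def roofline(peak_gflops: float, bandwidth_gbs: float,
             intensity_flops_per_byte: float) -> float:
    """Attainable performance (GFLOP/s) at a given arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

peak = 2000.0  # assumed peak performance, GFLOP/s
bw = 200.0     # assumed sustained memory bandwidth, GB/s

# A low-intensity (memory-bound) kernel hits the bandwidth roof:
print(roofline(peak, bw, 0.25))  # 200 * 0.25 = 50 GFLOP/s
# A high-intensity (compute-bound) kernel hits the compute roof:
print(roofline(peak, bw, 50.0))  # capped at 2000 GFLOP/s
```

The ridge point, where the two roofs meet (here at 2000 / 200 = 10 FLOP/byte), separates memory-bound from compute-bound kernels.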
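As a rough sketch of the kind of hybrid job the Slurm module deals with, the batch script below combines MPI tasks (distributed memory) with OpenMP threads (shared memory) and binds tasks to cores. The binary name `./hybrid_app` is a placeholder, and exact binding flags and defaults vary per site, so treat this as an assumption-laden illustration rather than a Snellius recipe.

```shell
#!/bin/bash
#SBATCH --job-name=hybrid-demo
#SBATCH --nodes=2                 # number of nodes
#SBATCH --ntasks-per-node=4       # MPI tasks per node (distributed memory)
#SBATCH --cpus-per-task=8         # OpenMP threads per task (shared memory)
#SBATCH --time=00:10:00

# One OpenMP thread per core allocated to each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Bind each task to its own set of cores; binding behaviour is site-dependent
srun --cpu-bind=cores ./hybrid_app   # ./hybrid_app is a placeholder binary
```

With these parameters the job would use 2 × 4 × 8 = 64 cores in total: 8 MPI ranks, each spawning 8 threads.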
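The idea behind job concurrency can also be sketched in plain Slurm, independently of the QCG-PilotJob framework taught in the hands-on session (which adds scheduling and management on top). Below, several single-core cases share one quarter-node allocation; `./small_case` is a placeholder and the `--exact` flag may behave differently depending on the Slurm version.

```shell
#!/bin/bash
#SBATCH --ntasks=32               # e.g. the smallest Snellius allocation
#SBATCH --time=00:30:00

# Launch 32 single-core cases concurrently inside the single allocation,
# then wait for all of them to finish before the job ends.
for i in $(seq 1 32); do
    srun --ntasks=1 --exact ./small_case "$i" &   # placeholder executable
done
wait
```

Without such concurrency, each lightweight case would occupy (and be billed for) the whole 32-core allocation on its own.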

Costs

Participation is free of charge

Prerequisites

Participation in the course Introduction to Supercomputing Part 1

The language of instruction is English

Do you want to participate?

Every 2 months there is an Introduction to Supercomputing Part 2. Not sure whether you know enough? Participate first in Introduction to Supercomputing Part 1.