If you need to perform many calculations, or analyses that are too big for your own system, clusters and supercomputers provide the computing power you need. In this course, you will learn to work with the national supercomputer Snellius.
What will you learn in this course?
This course is a continuation of the first introductory course to supercomputing. It takes a deeper dive into the use of supercomputers, with a particular focus on efficiency and good practices and an eminently practical approach.
The outline of this session includes the following modules:
- Fundamentals of performance analysis. This introductory presentation covers hybrid high-performance systems at an abstract level, including the system’s architecture and configuration. The goal is to build an understanding of HPC complexity before turning to performance analysis models, with special focus on the Roofline model.
- Abstract modeling of hybrid supercomputers. This module presents an abstract modeling approach for hybrid supercomputers, condensing their complexity into three core parameters: peak performance, memory, and network bandwidth.
- Performance analysis. This module explores performance analysis, starting with an overview of various models and delving into the specifics of the Roofline model.
- The Roofline model. This module describes the Roofline model and presents its practical application through clear explanations and demonstrations.
- File systems. This practical session covers the correct usage of file systems on HPC systems, especially on Snellius.
- Slurm hybrid jobs. Slurm, a prevalent job scheduler for High-Performance Computing (HPC) systems, was introduced at a fundamental level in previous sections. This module covers the specifics of resource allocation parameters for hybrid shared- and distributed-memory jobs.
- Nodes, cores, and tasks. This segment will delve into the fundamental concepts of nodes, cores, and tasks, shedding light on their roles within the context of HPC systems.
- Bindings. The concept of bindings will be explored, providing insights into how tasks are bound to specific resources, enhancing participants’ understanding of resource allocation mechanisms.
- Hands on. We will execute the vector addition kernel with multiple configurations using a set of scripts.
- QCG PilotJob. In some cases, users have to execute a large number of lightweight jobs. However, supercomputer nodes are very powerful, and only relatively large partitions can be allocated. For instance, the smallest possible allocation on Snellius is 1/4 of a node: 32 cores and 64 GB of memory. Job concurrency is a common strategy for efficiently launching multiple light jobs on such large partitions.
- Fundamentals of job concurrency. This segment delves into the foundational principles underlying job concurrency. Job concurrency is a methodological approach that enables the simultaneous execution of multiple smaller jobs within a larger allocated partition. The objective is to optimize resource utilization and enhance efficiency in scenarios where lighter tasks are executed on nodes designed for heavier workloads.
- Hands on QCG PilotJob. This practical session provides participants with hands-on experience working with the QCG Pilotjob framework. Participants will gain practical insights into the strategies and techniques of utilizing job concurrency to launch and manage multiple lightweight jobs within the context of sizable node partitions.
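The Roofline model discussed above can be sketched in a few lines: a kernel's attainable performance is the minimum of the machine's peak compute rate and its arithmetic intensity times the memory bandwidth. The hardware numbers below are purely illustrative, not Snellius specifications.

```python
# Minimal sketch of the Roofline model.
# PEAK_FLOPS and PEAK_BW are assumed, illustrative values.
PEAK_FLOPS = 2.0e12   # peak floating-point performance, FLOP/s (assumed)
PEAK_BW = 200.0e9     # peak memory bandwidth, byte/s (assumed)

def attainable_performance(arithmetic_intensity):
    """Attainable performance (FLOP/s) for a kernel with the given
    arithmetic intensity (FLOP per byte moved to/from memory)."""
    return min(PEAK_FLOPS, arithmetic_intensity * PEAK_BW)

# Vector addition c = a + b on doubles: 1 FLOP per 24 bytes
# (read a and b, write c), so it sits far left on the roofline.
ai_vecadd = 1.0 / 24.0
print(attainable_performance(ai_vecadd))  # memory-bound: well below peak
```

Kernels with low arithmetic intensity (like vector addition) land under the bandwidth slope of the roofline, while compute-heavy kernels are capped by the flat peak-performance ceiling.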
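A hybrid shared- and distributed-memory job of the kind covered in the Slurm module could look roughly like the following sketch. The partition name, resource counts, and executable name are all assumptions for illustration, not recommendations for Snellius.

```shell
#!/bin/bash
# Hypothetical hybrid MPI+OpenMP job script; all values are illustrative.
#SBATCH --job-name=hybrid-demo
#SBATCH --partition=thin          # assumed partition name
#SBATCH --nodes=2                 # distributed-memory level: 2 nodes
#SBATCH --ntasks-per-node=4       # 4 MPI tasks per node
#SBATCH --cpus-per-task=8         # 8 cores per task for OpenMP threads
#SBATCH --time=00:10:00

# One OpenMP thread per core allocated to each task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun ./vector_add                 # hypothetical hybrid executable
```

The key idea is that `--nodes` and `--ntasks-per-node` control the distributed-memory (MPI) level, while `--cpus-per-task` reserves the cores each task uses for shared-memory (OpenMP) threads.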
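The job-concurrency idea behind QCG PilotJob can be illustrated with a small, self-contained sketch: many light, independent tasks are kept running concurrently inside one large allocation instead of requesting one allocation per task. Here a thread pool stands in for the pilot-job scheduler; this is a conceptual analogy only, not the QCG PilotJob API.

```python
# Conceptual illustration of job concurrency: keep an allocation's
# workers busy with many light tasks rather than one task per job.
from concurrent.futures import ThreadPoolExecutor

def light_task(i):
    # Stand-in for one small, independent simulation or analysis.
    return i * i

# Pretend the allocation provides 32 cores; run 100 light tasks on it.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(light_task, range(100)))

print(len(results))  # all 100 tasks completed within one "allocation"
```

A pilot-job framework applies the same pattern across the cores and nodes of a real Slurm allocation, scheduling the small jobs internally without going back to the batch system for each one.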
Prerequisite: participation in the course Introduction to Supercomputing, Part I.
The language of instruction is English.
This course takes place at the VU Campus:
De Boelelaan 1081, Amsterdam – Room WN-C203/C255