High Performance Computing on a supercomputer
For my research, I have had to run a program with multiple input parameter files to generate different models. Each execution/instance is independent of the others, i.e., model i doesn't rely on the output of model j. There are a few different ways I could run them simultaneously on a multi-core machine to take advantage of the cluster (i.e., supercomputer) at work. Here are three methods I have used in the past. So far, I find Method 3 to be the best, as elaborated below.
# Method 1
Write a Python script that uses mpi4py to generate a pool of workers and the subprocess module to call the executable program from the shell. In my SLURM script (SLURM being the cluster's job scheduler), I define the walltime and call the Python script once.
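A minimal sketch of what such a Python script could look like (the executable name `./my_model` and the parameter-file names are hypothetical placeholders; this version uses mpi4py's `MPIPoolExecutor` rather than a hand-rolled worker pool):

```python
# run_models.py -- run one executable instance per input parameter file.
import subprocess
import sys

from mpi4py.futures import MPIPoolExecutor

def run_model(param_file):
    # Each MPI worker launches one independent instance of the executable.
    result = subprocess.run(["./my_model", param_file])
    return param_file, result.returncode

if __name__ == "__main__":
    param_files = sys.argv[1:]  # e.g. params_001.txt params_002.txt ...
    with MPIPoolExecutor() as pool:
        for name, code in pool.map(run_model, param_files):
            print(f"{name} finished with exit code {code}")
```

Inside the SLURM script this would be launched once, with something like `mpiexec -n 28 python -m mpi4py.futures run_models.py params_*.txt` (the core count and file glob are assumptions about the node and naming scheme).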
# Method 2
Write a SLURM script in which each line calls the executable program with a different input parameter file. Just like Method 1, I also have to define the walltime in the SLURM script.
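A minimal sketch of such a script (the resource requests, walltime, executable name `my_model`, and file names are all placeholder assumptions):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=04:00:00   # walltime has to be chosen up front

# Each line launches one independent model run in the background...
./my_model params_001.txt &
./my_model params_002.txt &
./my_model params_003.txt &
./my_model params_004.txt &
wait   # ...and wait keeps the job alive until all of them finish
```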
# Method 3
Use disBatch to work through a list of tasks, i.e., a task file in which each line calls the executable with a different input parameter file.
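As a sketch, the task file might look like this (each line is an ordinary shell command; the executable and file names are again hypothetical placeholders):

```bash
# Tasks: one independent job per line; disBatch launches the next
# task as soon as a worker slot frees up.
./my_model params_001.txt > model_001.log 2>&1
./my_model params_002.txt > model_002.log 2>&1
./my_model params_003.txt > model_003.log 2>&1
```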
# Problem with Method 2
I have to distribute the total number of jobs N_jobs across m nodes by writing and submitting m SLURM scripts separately, where m = N_jobs / (# of cores per node); for example, 280 jobs on 28-core nodes means m = 10 scripts. Otherwise, the job will only use the cores on one node, regardless of how many nodes are reserved with sbatch in the SLURM script. This is because, even though the job scheduler reserves the requested nodes, it is up to me to make the program called by sbatch take advantage of the multiple nodes; simply calling the executable line by line will not distribute the jobs across nodes.
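For illustration, each of the m scripts then looks like the Method 2 sketch above, pinned to one node and one chunk of the parameter files (the 28-core node size and file names are placeholder assumptions):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --time=04:00:00

# Script 1 of m: runs the first 28 of the N_jobs parameter files,
# one background process per core on this node.
for f in params_{001..028}.txt; do
    ./my_model "$f" &
done
wait
```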
# Why Method 3 is better than Method 1
Similar to Method 2, Method 1 only distributes the jobs across the processes of a single node reserved in the SLURM script, and I have to know a priori how long the jobs will take (and thus how long to reserve the nodes for). (In principle, mpi4py supports multi-node cluster computing; I'm not sure why we didn't write the code in such a way as to distribute jobs across nodes…)
With Method 3, I don't have to predefine how many cores there will be on each node (in Method 2 this also has to be done manually, because some nodes have 28 cores and some have 40). In addition, disBatch automatically decreases the number of nodes and cores in use as the number of remaining jobs drops below N_nodes * N_cores over time. And in this method I don't need to predefine the walltime; the only input parameter I have to give is the number of nodes I want to run all the tasks on.
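One plausible submission, following the disBatch README's pattern of passing disBatch and the task file directly to sbatch (the exact resource flags are a site-specific assumption; here whole nodes are requested exclusively):

```bash
# Request two whole nodes; disBatch farms the lines of Tasks out to the
# available worker slots, starting a new task whenever one finishes.
sbatch --nodes=2 --exclusive disBatch Tasks
```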
Thus far, this is the most efficient way I have found to run many independent jobs across a workstation or cluster with a large number of processors.