Slurm User Guide
This guide will walk you through the basics of using Slurm to submit, manage, and monitor jobs on a cluster.
The first step in using the cluster is getting access to it. For that, make sure you have:
- Your KU credentials: user and password.
- Permissions on the cluster, both for the servers and the project folders. Ask your Group Leader for these, as they are the only one who knows who you are and which group you belong to.
- Access to KU network.
- Access to a terminal/terminal emulator.
Your KU credentials
You need an alphanumeric user ID (e.g., abx123) with the corresponding password. These are granted by KU-IT; contact your HR department if you don't have them.
Permissions
Once you have your KU credentials, ask your Group Leader/Responsible to write an email to me (david.galligani@bric.ku.dk) to grant you access to the cluster:
- SRV-supekgate-users: the gate node
You also need to request, via the KU identity portal (identity.ku.dk), access to:
- the project dir "bricsoftware": additional software modules installed by BRIC
and, from serviceportal.ku.dk, access to:
- additional project and dataset dirs belonging to your group (Compute Dataset or Compute Projects)
Important: Compute nodes have no access to KU's H:, N:, or S: drives. These drives use a Windows-compatible protocol (Samba), which is not compatible with HPC systems. They are available only from the gate nodes, so you NEED to copy the data you need for your work to your /project or /dataset dirs.
If your group is new to the HPC system, your Group Leader needs to request a project dir; further down in this guide you can find information about how to do it.
Access to KU Network
To access the cluster you have to be on the KU network, either physically or virtually, as for security reasons it is not possible to access the servers from outside. If you need to work remotely, you can use the KU VPN by downloading and installing the VPN client from this website.
Terminal
To access the cluster you need a terminal or a terminal emulator. On Linux and Apple systems you can just open a console; on Windows you can use PowerShell. Once you have fired up your terminal, use the SSH (secure shell) command to connect:
$ ssh kuser@supekgate
where kuser is your alphanumeric KU user id and supekgate is an alias for one of the four gates to the cluster (supekgate01-4fl). If you get a DNS error (host not found), try using the fully qualified name supekgate.unicph.domain. If you have never connected to this particular server before, you will see a message similar to this:
The authenticity of host 'supekhead' can't be established. RSA key fingerprint is 2a:b6:f6:8d:9d:c2:f8:2b:8c:c5:03:06:a0:f8:59:12. Are you sure you want to continue connecting (yes/no)?
This is your computer warning you that you are about to connect to another computer; type "yes" to proceed. This will add the host to your "known hosts", and you shouldn't see the message again in the future.
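If you connect often, you can optionally add an entry to your SSH configuration so you don't have to type the full host name and user every time. A minimal sketch, assuming a hypothetical user id abc123 (adjust to your own):
# ~/.ssh/config
Host supekgate
    HostName supekgate.unicph.domain
    User abc123
With this in place, running `ssh supekgate` is enough to connect.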
Before diving into Slurm-specific commands, it's crucial to know some basic Linux commands. Here are some essential ones you should be familiar with when using a Slurm cluster:
1. File and Directory Operations
- ls: List directory contents
- ls -l: Long format listing
- ls -a: Show hidden files
- pwd: Print working directory
- cd: Change directory
- cd ..: Move up one directory
- cd ~: Go to home directory
- mkdir: Create a new directory
- rm: Remove files or directories
- rm -r: Remove directories and their contents recursively
- cp: Copy files or directories
- mv: Move or rename files or directories
- touch: Create an empty file or update file timestamps
2. File Viewing and Editing
- cat: Display file contents
- less: View file contents page by page
- head: Display the beginning of a file
- tail: Display the end of a file
- tail -f: Follow file changes in real-time
- nano or vim: Text editors
3. File Permissions and Ownership
- chmod: Change file permissions
- chown: Change file ownership
4. Process Management
- ps: Display current processes
- ps aux: Show all processes for all users
- top or htop: Interactive process viewers
- kill: Terminate processes
5. System Information
- df: Report file system disk space usage
- du: Estimate file space usage
- free: Display amount of free and used memory
6. Text Processing
- grep: Search text using patterns
- sed: Stream editor for filtering and transforming text
- awk: Pattern scanning and processing language
7. Network Commands
- ssh: Secure shell for remote login
- scp: Securely copy files between hosts on a network
- wget or curl: Retrieve files from the web
8. File Compression and Archiving
- tar: Tape archiver, used for creating and extracting archives
- gzip, gunzip: Compress or expand files
- zip, unzip: Package and compress (archive) files
9. File Transfer
- rsync: Fast, versatile file copying tool
10. Miscellaneous
- man: Display the manual page for a command
- history: Show command history
- which: Locate a command
Using These Commands in Slurm Context
- File Management: You'll use ls, cd, mkdir, cp, and mv to navigate and manage your files and directories on the cluster.
- Job Script Creation: Use text editors like nano or vim to create and edit your Slurm job scripts.
- File Permissions: Use chmod to ensure your job scripts are executable.
- Process Monitoring: Commands like ps and top can be useful for monitoring your jobs on the compute nodes (if you have access).
- Output Examination: Use cat, less, head, and tail to view your job output files.
- Data Transfer: Use scp or rsync to transfer files to and from the cluster.
- Text Processing: Commands like grep, sed, and awk are invaluable for parsing and analyzing job outputs.
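As a small illustration of how these commands fit together on the cluster (directory, script, and output file names here are hypothetical):
# Create a working directory and move into it
mkdir -p ~/my_analysis && cd ~/my_analysis
# Write a job script and make it executable
nano job_script.sh
chmod +x job_script.sh
# Follow the output of a running job and search it for errors
tail -f output_12345.log
grep -i "error" output_12345.log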
Basic Slurm Commands
Here are some essential Slurm commands you'll use frequently:
- srun : Run a parallel job
- sbatch : Submit a batch script
- scancel: Cancel a job
- squeue : View information about jobs in the queue
- sinfo : View information about Slurm nodes and partitions
Batch Jobs
For batch jobs, create a submission script and use `sbatch`:
1. Create a script (e.g., `job_script.sh`):
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=1G

# Your commands here
echo "Hello, Slurm!"
2. Submit the job:
$ sbatch job_script.sh
Monitoring Jobs
To view information about your jobs in the queue:
$ squeue -u $USER
To see detailed information about a specific job:
$ scontrol show job job_id
Managing Jobs
To cancel a job:
$ scancel job_id
To hold a job:
$ scontrol hold job_id
To release a held job:
$ scontrol release job_id
Resource Constraints
Specify resource requirements in your job script:
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --gres=gpu:2
Job Dependencies
Create dependencies between jobs:
sbatch --dependency=afterok:job_id next_job.sh
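To build such a chain in practice, you can capture the first job's ID with `sbatch --parsable`, which prints only the job ID. A short sketch (script names are hypothetical):
# Submit the first job and capture its job ID
jobid=$(sbatch --parsable first_job.sh)
# Submit the second job, to start only if the first finishes successfully
sbatch --dependency=afterok:$jobid next_job.sh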
Best Practices
- Estimate resources accurately: Request only the resources you need to avoid long queue times.
- Set appropriate time limits: This helps the scheduler plan more effectively.
- Use job names: Give your jobs meaningful names for easier management.
- Monitor your jobs: Regularly check the status of your jobs and kill them if they're not behaving as expected.
- Use appropriate partitions: Choose the right partition based on your job's requirements.
- Optimize your code: Well-optimized code can reduce resource usage and improve job throughput.
The module system is a software environment management tool widely used in HPC environments. It allows users to dynamically modify their shell environment to access different software packages and versions. Here's how to effectively use the module system in your Slurm jobs:
1. Basic Module Commands
Before diving into Slurm-specific usage, let's review some basic module commands:
- `module avail`: List all available modules
- `module list`: Show currently loaded modules
- `module load <module_name>`: Load a specific module
- `module unload <module_name>`: Unload a specific module
- `module purge`: Unload all currently loaded modules
- `module show <module_name>`: Display information about a module
2. Using Modules in Slurm Job Scripts
Here's an example of how to use modules in a Slurm job script:
#!/bin/bash
#SBATCH --job-name=module_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Purge all loaded modules
module purge

# Load required modules
module load gcc/9.3.0
module load python/3.8.5
module load openmpi/4.0.4

# Your job commands here
python my_script.py
3. Loading Software Stacks
Sometimes, you might need to load a complete software stack. Many HPC systems provide meta-modules for this purpose:
#!/bin/bash
#SBATCH --job-name=stack_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load a complete software stack
module load foss/2020a

# Load additional modules as needed
module load python/3.8.5

# Your job commands here
4. Module Dependencies
Some modules may have dependencies or conflicts. The module system often handles these automatically, but it's good to be aware of them:
#!/bin/bash
#SBATCH --job-name=dep_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load a module with dependencies
module load tensorflow/2.4.1-cuda11.0-python3
# The above might automatically load CUDA, cuDNN, and Python modules

# Your job commands here
python my_tensorflow_script.py
5. Using Module Collections
If you frequently use the same set of modules, you can create a module collection:
# Create a module collection
$ module save my_collection
# In your Slurm script
#!/bin/bash
#SBATCH --job-name=collection_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load your module collection
module restore my_collection

# Your job commands here
6. Best Practices for Using Modules with Slurm
- Purge before loading: Start your script with `module purge` to ensure a clean environment.
- Be specific: Use full module names including versions to ensure reproducibility.
- Check for conflicts: Use `module show` to check for potential conflicts before loading modules.
- Use module collections: For complex environments, create and use module collections.
- Document your modules: Comment your Slurm script to explain why each module is needed.
- Use module load in job scripts: Don't rely on modules loaded in your login environment; explicitly load them in your job script.
7. Troubleshooting Module Issues in Slurm Jobs
If you encounter module-related issues:
- Check your Slurm output and error logs for module-related errors.
- Ensure the modules you're trying to load are available on the compute nodes (they might differ from login nodes).
- Use `module show <module_name>` to verify module details and dependencies.
- If a module isn't found, check if you need to load a specific compiler or MPI implementation first.
8. Advanced Module Usage
Some advanced module features:
- Module versioning: `module load <module_name>/<version>`
- Swapping modules: `module swap <old_module> <new_module>`
- Module aliases: `module alias my_python python/3.8.5`
A Slurm job script is a shell script (typically bash) that contains both Slurm directives and the commands you want to run on the cluster. Let's break down the components and syntax of a Slurm job script:
Basic Structure
#!/bin/bash #SBATCH [options] #SBATCH [more options] # Your commands here
Shebang
The first line of your script should be the shebang:
#!/bin/bash
This tells the system to interpret the script using the bash shell.
Slurm Directives
Slurm directives are special comments that start with `#SBATCH`. They tell Slurm how to set up and run your job. Here are some common directives:
#SBATCH --job-name=my_job          # Name of the job
#SBATCH --output=output_%j.log     # Standard output log file (%j is replaced by the job ID)
#SBATCH --error=error_%j.log       # Standard error log file
#SBATCH --time=01:00:00            # Time limit (HH:MM:SS)
#SBATCH --ntasks=1                 # Number of tasks (processes)
#SBATCH --cpus-per-task=1          # Number of CPU cores per task
#SBATCH --mem=1G                   # Memory limit
#SBATCH --partition=general        # Partition (queue) name
#SBATCH --gres=gpu:2               # Request 2 GPUs
Common Slurm Directives
Here's a more comprehensive list of Slurm directives:
- `--job-name=<name>`: Set a name for the job
- `--output=<filename>`: Specify the file for standard output
- `--error=<filename>`: Specify the file for standard error
- `--time=<time>`: Set the time limit for the job (e.g., HH:MM:SS)
- `--ntasks=<number>`: Specify the number of tasks to run
- `--cpus-per-task=<number>`: Set the number of CPU cores per task
- `--mem=<size[units]>`: Set the total memory required (e.g., 1G for 1 gigabyte)
- `--partition=<partition_name>`: Specify the partition to run the job on
- `--array=<indices>`: Create a job array (e.g., --array=1-10 for 10 array jobs)
- `--mail-type=<type>`: Specify email notification events (e.g., BEGIN, END, FAIL)
- `--mail-user=<email>`: Set the email address for notifications
- `--nodes=<number>`: Request a specific number of nodes
- `--gres=<resource>`: Request generic consumable resources (e.g., GPUs)
Environment Variables
Slurm sets several environment variables that you can use in your script:
- `$SLURM_JOB_ID`: The ID of the job
- `$SLURM_ARRAY_TASK_ID`: The array index for job arrays
- `$SLURM_CPUS_PER_TASK`: Number of CPUs allocated per task
- `$SLURM_NTASKS`: Total number of tasks in a job
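These variables let your script adapt to whatever was allocated instead of hard-coding values. A small sketch (the program name and its --threads flag are hypothetical):
#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --output=output_%j.log
#SBATCH --cpus-per-task=4
#SBATCH --time=00:30:00

echo "Running job $SLURM_JOB_ID with $SLURM_CPUS_PER_TASK CPUs"
# Pass the allocated CPU count to the program instead of hard-coding it
./my_program --threads "$SLURM_CPUS_PER_TASK"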
Example Job Script
Here's an example of a more complex Slurm job script:
#!/bin/bash
#SBATCH --job-name=complex_job
#SBATCH --output=output_%A_%a.log
#SBATCH --error=error_%A_%a.log
#SBATCH --array=1-5
#SBATCH --time=02:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --partition=general
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your.email@example.com

# Load any necessary modules
module load python/3.8

# Run the main command
python my_script.py --input-file input.txt --output-file output.txt

# Optional: Run some post-processing
if [ $? -eq 0 ]; then
    echo "Job completed successfully"
    python post_process.py output.txt
else
    echo "Job failed"
fi
Understanding --ntasks in Slurm
When you use the `--ntasks` option in Slurm without other specifications, it's important to understand how Slurm interprets and applies this setting.
When you specify `--ntasks=4` without other options:
- Slurm will allocate resources for 4 tasks.
- By default, each task is allocated 1 CPU (core).
- The tasks may be distributed across multiple nodes, depending on the cluster's configuration and available resources.
#SBATCH --ntasks=4

# This will run your command with 4 tasks
srun ./my_program
In this scenario:
- Your job will be allocated 4 CPUs in total.
- These 4 CPUs could be on a single node or spread across multiple nodes, depending on availability and the cluster's configuration.
- Each task will have access to 1 CPU by default.
Important Considerations
- CPU Allocation: Without specifying `--cpus-per-task`, each task gets 1 CPU by default.
- Memory Allocation: The default memory allocation per task depends on the cluster's configuration. It's often a good practice to specify memory requirements explicitly.
- Node Distribution: Tasks may be distributed across nodes unless you specify `--nodes` or use the `--ntasks-per-node` option.
- Parallel Execution: This setting is particularly useful for MPI jobs where you want to run multiple parallel processes.
Examples with Additional Specifications
1. Specifying CPUs per Task
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2

srun ./my_multi_threaded_program
This allocates 4 tasks, each with 2 CPUs, totaling 8 CPUs for the job.
2. Constraining to a Single Node
#SBATCH --ntasks=4
#SBATCH --nodes=1

srun ./my_program
This ensures all 4 tasks run on the same node.
3. Specifying Tasks per Node
#SBATCH --ntasks=4 #SBATCH --ntasks-per-node=2
srun ./my_program
This distributes the 4 tasks across 2 nodes, with 2 tasks per node.
Best Practices
- Be explicit about your resource requirements when possible (CPUs, memory, etc.).
- Consider the nature of your program (MPI, multi-threaded, etc.) when deciding how to allocate tasks and CPUs.
- Use `--ntasks` in combination with other options like `--cpus-per-task` or `--nodes` for more precise control over resource allocation.
- Test your job submissions with smaller task counts before scaling up to ensure proper resource utilization.
Best Practices for Job Scripts
- Use variables: For repeated values or for clarity, use shell variables.
- Comment your script: Explain complex parts of your script for better maintainability.
- Error handling: Include error checking and handling in your script.
- Modularity: For complex workflows, consider breaking your job into multiple scripts.
- Resource estimation: Start with conservative resource estimates and adjust based on actual usage.
- Environment setup: Load necessary modules and set environment variables at the beginning of your script.
- Output management: Use job ID and array ID in output file names to avoid overwrites.
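A brief sketch that puts a few of these practices together (paths, module version, and program names are hypothetical):
#!/bin/bash
#SBATCH --job-name=best_practice_demo
#SBATCH --output=output_%j.log
#SBATCH --time=01:00:00
#SBATCH --mem=2G

# Environment setup
module purge
module load python/3.8.5

# Use variables for repeated values
INPUT="data/input.txt"
OUTPUT="results/output_${SLURM_JOB_ID}.txt"
mkdir -p results

# Error handling: stop and report if the main command fails
if ! python my_script.py "$INPUT" "$OUTPUT"; then
    echo "my_script.py failed" >&2
    exit 1
fi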
When submitting jobs to a Slurm-managed cluster, understanding the difference between threads and CPUs per task is crucial for optimizing your job's performance and efficient use of cluster resources.
Key Concepts
1. CPU (Core)
In Slurm terminology, a CPU typically refers to a physical core on a processor. Each CPU can execute a single thread of instructions at a time (ignoring hyperthreading for simplicity).
2. Task
A task in Slurm is essentially a process. It's a running instance of a program that may use one or more CPUs.
3. Thread
A thread is the smallest unit of processing that can be scheduled by an operating system. A single task (process) can have multiple threads, which can run concurrently on different CPUs.
Slurm Resource Allocation Options
| Option | Description |
|---|---|
| --ntasks | Number of tasks (processes) to run |
| --cpus-per-task | Number of CPUs (cores) allocated to each task |
| --threads-per-core | Number of threads to use per core (relevant for hyperthreading) |
Threads vs CPUs per Task
Using More Threads
When you increase the number of threads in your program:
- It allows for more parallel execution within a single task (process).
- Threads share the same memory space, making communication between threads faster.
- Ideal for programs that are designed to use multi-threading (e.g., OpenMP programs).
- The number of threads is typically controlled by the program itself or environment variables (e.g., OMP_NUM_THREADS).
Using More CPUs per Task
When you increase the number of CPUs per task in Slurm:
- It allocates more physical cores to each task (process).
- Allows for true parallel execution of threads on separate cores.
- Necessary for multi-threaded programs to utilize multiple cores effectively.
- Controlled by the --cpus-per-task option in Slurm.
Examples
1. Single-threaded Program
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

./my_single_threaded_program
This allocates one CPU for a single task, suitable for a program that doesn't use threading.
2. Multi-threaded Program (e.g., OpenMP)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4
./my_openmp_program
This allocates 4 CPUs for a single task, allowing an OpenMP program to use 4 threads effectively.
3. MPI Program with Threading
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4

export OMP_NUM_THREADS=4
mpirun ./my_hybrid_mpi_openmp_program
This runs 2 MPI tasks, each with 4 CPUs, suitable for a hybrid MPI/OpenMP program.
Key Considerations
- Match --cpus-per-task to the number of threads your program will use for optimal performance.
- Be aware of the total resources you're requesting (ntasks * cpus-per-task) to ensure it doesn't exceed node capabilities.
- For programs that don't control their own threading, you may need to set environment variables (like OMP_NUM_THREADS) to match --cpus-per-task.
- Some programs may benefit more from multiple tasks (MPI) rather than multiple threads, depending on their design.
Understanding these concepts allows you to efficiently allocate resources for your specific computational needs, optimizing both performance and cluster utilization.
What are Slurm Job Arrays?
Slurm job arrays are a mechanism for submitting and managing collections of similar jobs quickly and easily. Instead of submitting hundreds or thousands of individual jobs, you can use a job array to submit a single job script that will spawn multiple job tasks.
How to Use Job Arrays
To create a job array, you use the `--array` option in your Slurm batch script:
#SBATCH --array=1-100
This will create 100 job array tasks, numbered from 1 to 100.
More Complex Array Specifications:
- Range with step: `#SBATCH --array=1-100:10` (1, 11, 21, ..., 91)
- Comma-separated list: `#SBATCH --array=1,5,7,9`
- Combination: `#SBATCH --array=1-5,10,20-25`
Using the Array Task ID
Within your job script, you can use the `$SLURM_ARRAY_TASK_ID` environment variable to distinguish between tasks:
#!/bin/bash
#SBATCH --array=1-100
echo "Processing file_${SLURM_ARRAY_TASK_ID}.txt"
./my_program input_${SLURM_ARRAY_TASK_ID}.dat output_${SLURM_ARRAY_TASK_ID}.result
Why Use Job Arrays?
- Efficiency in Job Submission: Submit many similar jobs with a single script.
- Easier Management: Manage a group of related jobs as a single unit.
- Improved Scheduling: Slurm can schedule array tasks more efficiently than individual jobs.
- Reduced System Overhead: Less load on the scheduling system compared to submitting many individual jobs.
- Simplified Dependency Management: You can make other jobs depend on the entire array or specific tasks.
Best Practices and Tips
- Limit Concurrent Tasks: Use `#SBATCH --array=1-1000%20` to limit to 20 concurrently running tasks.
- Output Files: Use `%A` for the array job ID and `%a` for the task ID in output file names: `#SBATCH --output=output_%A_%a.log`
- Resource Allocation: Ensure each task has appropriate resources. Arrays are good for many small, similar jobs.
- Task Independence: Array tasks should be independent of each other to run efficiently.
Example: Processing Multiple Datasets
#!/bin/bash
#SBATCH --job-name=data_process
#SBATCH --output=output_%A_%a.log
#SBATCH --error=error_%A_%a.log
#SBATCH --array=1-100
#SBATCH --time=01:00:00
#SBATCH --mem=4G
# List of datasets
datasets=(dataset1.csv dataset2.csv dataset3.csv ... dataset100.csv)
# Get the dataset for this task
dataset=${datasets[$SLURM_ARRAY_TASK_ID - 1]}
# Run the processing script
python process_data.py $dataset
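If your dataset names live in a text file rather than in an array inside the script, a common alternative is to pick the line matching the task ID. A sketch, assuming a hypothetical datasets.txt with one file name per line:
# Select the Nth line of datasets.txt, where N is the array task ID
dataset=$(sed -n "${SLURM_ARRAY_TASK_ID}p" datasets.txt)
python process_data.py "$dataset"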
Managing Job Arrays
- View array jobs: `squeue -a`
- Cancel entire array: `scancel [array_job_id]`
- Cancel specific task: `scancel [array_job_id]_[task_id]`
When to Use Job Arrays
Job arrays are ideal for:
- Parameter sweeps in simulations or analyses
- Processing multiple input files with the same program
- Running the same analysis on different datasets
- Embarrassingly parallel problems where tasks are independent
By using job arrays effectively, you can significantly streamline your workflow, especially when dealing with large numbers of similar computational tasks.
Conda is a popular package management system and environment management system. When using Slurm on a cluster, you may need to activate and use Conda environments in your job scripts. Here's how to do it effectively:
1. Loading Conda
First, you need to ensure Conda is available in your Slurm job. This usually involves loading a module or sourcing a script to initialize Conda.
$ module load anaconda3
2. Activating a Conda Environment
Once Conda is loaded, you can activate your environment. Here's how you might do this in a Slurm script:
#!/bin/bash
#SBATCH --job-name=conda_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load Conda
module load anaconda3

# Activate your environment
conda activate myenv

# Run your Python script
python my_script.py
3. Creating Conda Environments on the Fly
Sometimes, you might want to create a Conda environment as part of your job. Here's an example:
#!/bin/bash
#SBATCH --job-name=conda_create_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load Conda
module load anaconda3

# Create a new environment
conda create -n job_env python=3.8 numpy pandas -y

# Activate the new environment
conda activate job_env

# Run your Python script
python my_script.py

# Optionally, remove the environment at the end of the job
conda deactivate
conda env remove -n job_env -y
4. Best Practices for Using Conda with Slurm
- Specify exact versions: When creating environments, specify exact versions of packages to ensure reproducibility.
- Use environment files: For complex environments, use an `environment.yml` file to specify dependencies.
- Clean up after your job: If you create temporary environments, make sure to remove them to free up space.
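A minimal sketch of such an environment file and how to create and activate an environment from it (the environment name and package list are only examples):
# environment.yml
name: myenv
dependencies:
  - python=3.8
  - numpy
  - pandas

# Create the environment from the file, then activate it
conda env create -f environment.yml
conda activate myenv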
5. Troubleshooting Conda in Slurm Jobs
If you encounter issues:
- Check your Slurm output and error logs for Conda-related errors.
- Ensure Conda is properly initialized in your job script.
- Verify that the specified Conda environment exists and contains the necessary packages.
- Check for any conflicts between Conda and other modules loaded in your job.
The Jupyter Notebook is a web-based interactive computing platform that combines live code, equations, narrative text, visualizations, etc. Chances are you would like to use it on the cluster.
To do so you have to:
- Create an sbatch job file
- Launch the job
- Create an SSH tunnel
- Connect to the Notebook via browser
Create an sbatch job
You must create something like this:
#!/bin/bash
#SBATCH --job-name jupy
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 1
#SBATCH --cpus-per-task 64
#SBATCH --mem 64G
#SBATCH --time 1-00:00:00
#SBATCH --output jupy.log

cd /yourdir/

module purge
module load your_modules

conda activate your_env

jupyter notebook --no-browser --ip=0.0.0.0 --port=PortNumber
Where PortNumber is a free port for your Notebook to run on; I will use 1357 here as an example.
IMPORTANT: If another process or another Jupyter Notebook is already running on the port number you specified, your job will either die or be queued waiting for the previous job to finish. In that case, please choose a different port number.
Launch the job
Then you have to launch your job with:
$ sbatch yourjob.sh
In the log file you specified in your script you will find a line with a token:
[I 10:23:58.726 NotebookApp] http://computingnode:1357/?token=8e184c4fa1ec5d60e31fb721adc3202317aac0adca177127
Write it down. It will change every time.
Create an SSH tunnel
Create an SSH tunnel from your local machine to the computing node via the headnode:
$ ssh -N -L 1357:computingnode.unicph.domain:1357 youruser@headnode
Any traffic you send to port 1357 on your local machine will be securely forwarded through the headnode and then on to port 1357 on the computing node.
This is useful if you need to access a service on the computing node that you can't reach directly, but you can reach through the head node.
Connect to the Notebook
To connect to your notebook you have just to point your browser to
http://localhost:1357
and paste the token you got before (8e184c4...) into the password field.
Nextflow is a powerful workflow management system that can be integrated with SLURM.
Configuring Nextflow for SLURM
To run Nextflow on SLURM, you need to create a configuration file (nextflow.config) in your project directory:
process {
executor = 'slurm'
queue = 'your_queue_name'
clusterOptions = '--account=your_account'
}
This configuration tells Nextflow to use SLURM as the executor and specifies the queue to use. Adjust the 'queue' and 'clusterOptions' as needed for your specific SLURM setup.
Understanding Process Configuration and SLURM Integration
Process Configuration
The 'process' block in the configuration file defines settings that apply to all processes in your Nextflow script:
- executor = 'slurm': This tells Nextflow to use SLURM for job submission and management.
- queue = 'your_queue_name': Specifies the SLURM partition (queue) to which jobs will be submitted. Replace 'your_queue_name' with the appropriate partition name for your cluster.
- clusterOptions = '--account=your_account': Allows you to specify additional SLURM options. In this example, it sets the account to be charged for the job. Modify this according to your cluster's requirements.
You can also define process-specific settings in your Nextflow script:
process example_task {
cpus 4
memory '8 GB'
time '2h'
script:
"""
your_command_here
"""
}
These settings (cpus, memory, time) will be translated into appropriate SLURM resource requests when the job is submitted.
SLURM and Executor Management
It's important to understand that when using SLURM with Nextflow, SLURM itself handles most of the job execution and resource management tasks. The 'executor' settings in Nextflow are primarily used for local execution or when using other executors. When using SLURM:
- SLURM manages job queuing, scheduling, and resource allocation.
- SLURM's own configurations and limits (set by cluster administrators) control aspects like maximum concurrent jobs, job priorities, and resource limits.
- Nextflow's role is to submit jobs to SLURM and monitor their progress, rather than directly managing execution details.
Therefore, many of the 'executor' settings in Nextflow (like queueSize or submitRateLimit) are not typically necessary or used when working with SLURM. Instead, you would rely on SLURM's own configurations and use SLURM-specific options in your Nextflow process definitions.
SLURM-Specific Options
When using Nextflow with SLURM, you can take advantage of SLURM-specific options in your process definitions:
process resource_intensive_task {
cpus 16
memory '64 GB'
time '12h'
clusterOptions '--qos=high --exclude=node01,node02'
script:
"""
your_command_here
"""
}
In this example:
- cpus, memory, time: These are translated into SLURM resource requests.
- clusterOptions: This allows you to pass SLURM-specific options directly to the sbatch command, such as quality of service (--qos) or node exclusions.
Running Your Pipeline
To run your Nextflow pipeline on SLURM, use the following command:
nextflow run your_script.nf
Nextflow will automatically submit jobs to SLURM based on your configuration and script directives.
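If a pipeline run is interrupted or fails partway, Nextflow's `-resume` flag restarts it while reusing cached results from the tasks that already completed:
nextflow run your_script.nf -resume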
Go to the Service Portal
Click through
- Research IT
- Research Applications and Services
- Choose "Compute Dataset" or "Compute Projects"
Select options:
- The name of the Compute Dataset/Project (Should contain group leader name, e.g. "lund_group")
FS system where you request UCPH-IT to mount the Compute Dataset/Project
- choose: **FS-bric**
Options for Backup and Performance
- For most applications, "basic" should be fine
Audit
- Choose audit only if you need to store/work with patient-sensitive data. Otherwise, choose No.
Understanding what constitutes misuse of cluster resources is crucial for maintaining a productive and secure research computing environment. The following behaviors are generally considered unacceptable and may result in account suspension or termination:
1. Resource Abuse
- **Hogging Resources**: Submitting an excessive number of jobs or requesting more resources than necessary, preventing other users from accessing the cluster.
- **Ignoring Queue Policies**: Consistently submitting jobs to inappropriate queues or violating queue-specific limits.
- **Circumventing Fairshare Mechanisms**: Attempting to game the system to get more than your fair share of resources.
2. Security Violations
- **Unauthorized Access**: Attempting to access other users' data or trying to elevate your privileges on the system.
- **Sharing Credentials**: Giving your login information to others or allowing unauthorized users to access the cluster through your account.
- **Installing Malicious Software**: Introducing malware, viruses, or any unauthorized software to the cluster.
3. Inappropriate Use of Compute Resources
- **Non-Research Activities**: Using the cluster for non-academic or non-approved purposes.
4. Misuse of Data and Storage
- **Storing Inappropriate Content**: Using cluster storage for personal files, especially large media files or inappropriate content.
- **Data Theft**: Attempting to exfiltrate large amounts of data or sensitive information from the cluster.
- **Excessive I/O Operations**: Running jobs that perform excessive read/write operations, potentially damaging the storage systems.
5. Violation of Software Licenses
- **License Violations**: Using software in violation of its license terms or using more licenses than allocated to you.
6. Disrupting Cluster Operations
- **Denial of Service**: Intentionally or unintentionally running jobs that crash nodes or disrupt cluster services.
7. Non-Compliance with Policies
- **Ignoring System Announcements**: Consistently disregarding important announcements from system administrators.
8. Attempts to Evade Monitoring or Quotas
- **Hiding Activities**: Attempting to conceal the nature of your computations or evade system monitoring tools.
- **Quota Evasion**: Trying to bypass storage quotas or other resource limitations set by administrators.
Consequences of Violations
The consequences of engaging in these behaviors can include:
1. Temporary suspension of your account
2. Permanent revocation of cluster access
3. Reporting to your supervisor, department, or funding agency
Best Practices to Avoid Violations
1. **Read and Understand Policies**: Familiarize yourself with all cluster usage policies and guidelines.
2. **When in Doubt, Ask**: If you're unsure about whether a particular use is acceptable, consult with cluster administrators.
3. **Regular Check-ins**: Periodically review your resource usage and job patterns to ensure compliance.
4. **Report Suspicious Activities**: If you notice potential misuse by others, report it to the administrators.
5. **Respect Others**: Remember that the cluster is a shared resource. Be considerate of other users' needs.
By adhering to these guidelines and using the cluster responsibly, you help maintain a productive and secure research computing environment for everyone.
What's an HPC system?
An HPC (High-Performance Computing) system is a network of computers (cluster) designed to process large amounts of data and perform complex calculations at high speeds. Key features include:
- Parallel processing: Multiple computers work together on a single problem.
- Powerful processors: Often uses specialized CPUs or GPUs for faster computations.
- High-speed interconnects: Fast communication between nodes for efficient data sharing.
- Large storage capacity: Handles vast amounts of data for scientific simulations or data analysis.
HPC systems are crucial for tasks like weather forecasting, molecular modeling, and artificial intelligence research.
What is Slurm?
Slurm (Simple Linux Utility for Resource Management) is a job scheduler that manages computational resources in a cluster. It allocates resources to jobs, dispatches them, monitors their execution, and cleans up after job completion.
Why use Slurm?
- Resource allocation: Once resources are allocated to your job, they're exclusively yours for the duration of execution, regardless of system load.
- Detached execution: No need to keep an open terminal session.
- Efficient resource use: Jobs start as soon as requested resources are available, even outside working hours.
- Fair scheduling: Jobs are prioritized based on requested resources, user's system share, and queue time.
Slurm Concepts
Before diving into Slurm usage, it's important to understand some key concepts:
- Node: A computer in the cluster.
- Partition: A group of nodes with specific characteristics.
- Job: A resource allocation request for a specific program or task.
- Task: An instance of a running program within a job.
Which Partition can I use?
You have 3 partitions on the cluster:
- normal_prio: 1 day max running time, priority 100
- normal_prio_long: 15 days max running time, priority 50
- high_prio: 1 day max running time, priority 1000
Basic Usage
Loading Software as modules
To use software that is not part of the system, you can load it as a module:
$ module avail
list all available modules
$ module load R/4.4.0
load R version 4.4.0
$ module list
list loaded modules
$ module unload module_name
unload loaded module module_name
$ module purge
unload all loaded modules
Simple Job Submission
Prefix your command with srun:
$ srun myprogram
Run an interactive bash session
$ srun --pty bash
Note: This uses default settings, which may not always be suitable.
Specifying a Partition
Use the `-p` option with `srun`:
srun -p partition_name myprogram
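For example, to run on one of the partitions listed earlier in this guide:
srun -p high_prio myprogram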
Running Detached Jobs (Batch Mode)
- Create a shell script (batch script) containing:
  - Slurm directives (lines starting with `#SBATCH`)
  - Any necessary preparatory steps (e.g., loading modules)
  - Your `srun` command
- Submit the script using `sbatch`:
sbatch myscript.sh
Using Conda
You can use conda inside your batch script:
# Load Conda
module load anaconda3
# Activate your environment
conda activate myenv
# Run your Python script
python my_script.py
Monitoring Jobs
Checking Job Status
Use `squeue` to see which jobs are running or queued:
squeue
To see only your jobs:
squeue -u yourusername
Viewing Job Details
Use `scontrol`:
scontrol show job <jobid>
Checking Job Output
Slurm captures console output to a file named `slurm-<jobid>.out` in the submission directory. You can examine this file while the job is running or after it finishes.
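For example, to follow the output of a running job as it is written (the job ID here is hypothetical):
tail -f slurm-12345.out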
Resource Requests
CPUs
To request multiple CPU threads:
#SBATCH --cpus-per-task=X
srun --cpus-per-task=X myprogram
Note: This argument must be given to both `sbatch` (via `#SBATCH`) and `srun`: the first one for the job allocation, the second for the task execution.
Other Resources
Specify them in your batch script using `#SBATCH` directives:
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --gres=gpu:1
These are the options for memory, time limit, and GPUs.
Example batch script
#!/bin/bash
#SBATCH --job-name=conda_job
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Load Conda
module load anaconda3

# Activate your environment
conda activate myenv

# Run your Python script
python my_script.py
This is an example script using conda; launch it with:
sbatch conda_job.sh
Useful Slurm Commands
- `squeue`: Show job queue information
- `sinfo`: Display node and partition information
- `scancel <jobid>`: Delete a job
- `sacct`: View accounting data for jobs
- `scontrol`: Detailed info
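As an example, `sacct` can report resource usage for a completed job; the job ID and format fields below are chosen just for illustration:
sacct -j 12345 --format=JobID,JobName,Elapsed,MaxRSS,State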
Best Practice
Please:
- Use resources carefully. Test your requirements in an interactive session with srun before launching sbatch scripts.
- Don't launch a lot of jobs together. Respect the other users.
- Use the appropriate queue for your job.
- Send a mail to the sysadmin if you have any doubts.
Contact
Joachim Weischenfeldt, PhD
Group Leader
joachim.weischenfeldt@bric.ku.dk
Phone:+45 35 45 60 40
David Galligani
IT Specialist
david.galligani@bric.ku.dk