
@Markus92
Last active August 26, 2025 10:51
Revisions

  1. Markus92 revised this gist Apr 21, 2020. 1 changed file with 11 additions and 2 deletions.
    13 changes: 11 additions & 2 deletions serversetup.md
    @@ -242,8 +242,8 @@ TaskPlugin=task/cgroup
    # TIMERS
    #KillWait=30
    #MinJobAge=300
    #SlurmctldTimeout=120
    #SlurmdTimeout=300
    SlurmctldTimeout=600
    SlurmdTimeout=600
    #
    #
    # SCHEDULING
    @@ -372,3 +372,12 @@ user devices /
    This will move all users in the group *gpu* to GPU access, and everyone else to no GPU access. Exactly what we want.

    Now reboot for the final time and you're done!

    ## Post-mortem
    This system has been up and running for around a year now and it works well:
    there have been only two short outages. One was caused by a time-out of the SLURM
    daemon, which for some reason killed all running jobs (new jobs were fine). This
    is now mitigated by setting the time-outs a bit less tight.
    The other one we still can't explain: it was a total hardware lockup, where
    even the physical console didn't respond. A quick physical reboot later and
    everything was up and running again like before!
  2. Markus92 revised this gist Apr 21, 2020. 1 changed file with 8 additions and 6 deletions.
    14 changes: 8 additions & 6 deletions serversetup.md
    @@ -7,13 +7,13 @@ One challenge is, is how to manage these GPUs. There are many approaches, but gi

    A group at a previous affiliation of mine had the same problems and used Docker containers with a job scheduler to mitigate most of these problem. Unfortunately I had never used it myself and was thus not familiar with the exact details of their implementation. This approach solves most of our problems: no conflicting software versions (just roll a container per research paper and archive it), no competing for GPUs and, most importantly, people can't accidentally screw up colleagues' experiments.

    There was one more constraint I had: our system should be as easy to use as possible. When I talked about 'job scheduling' and 'GPU allocation' to my colleagues, the reaction I got was that they were scared it'd be either too complicated to use, or too limited to use. As I really didn't want to go the 'Google Sheets' route for GPU scheduling, I kept this in mind during the design of the system. Another constraint was set by our sysadmin: no root for users, as we got some old NFSv3 fileservers which authenticate on UID/GID level. This immediately excluded Docker, as described [here](https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface). As users would be allowed to control the Docker daemon, it's pretty much the same as putting everyone in the group 'sudoers'. Not something we want, to be honest.
    There was one more constraint I had: our system should be as easy to use as possible. When I talked about 'job scheduling' and 'GPU allocation' to my colleagues, the reaction I got was that they were scared it'd be either too complicated to use, or too limited to use. As I really didn't want to go the 'Google Sheets' route for GPU scheduling, I kept this in mind during the design of the system. Another constraint was set by our sysadmin: no root for users, as we got some legacy NFSv3 fileservers which authenticate on UID/GID level. This immediately excluded Docker, as described [here](https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface). As users would be allowed to control the Docker daemon, it's pretty much the same as putting everyone in the group 'sudoers'. Not something we want, to be honest.

    In the end I decided to use a combination of [Singularity](https://sylabs.io/singularity/) and [SLURM](https://slurm.schedmd.com/). Singularity is a container tool created for and used a lot by HPC facilities. SLURM is an industry-standard job scheduler, also used by many HPC instances. An advantage of these tools is that they are industry-standard, thus used a lot and well-documented. Always helpful when running into problems. To enforce proper usage of these tools, control groups (cgroups) are used to lock access down: by default there are no GPU permissions.

    As most tutorials I found were quite outdated, here's a new one. It's in typical 'follow-along' style, so you can copy/paste the commands onto your own terminal and you'll end up with a similar system! Root access is required, obviously.

    Note: we are running a new install of Ubuntu 18.04 LTS.
    Note: we are running a new, fresh install of Ubuntu 18.04 LTS.

    ## Installing Singularity
    As most debian packages for Singularity are quite outdated, we'll compile it ourselves. It's written in Go, so we'll also install a recent Go version.
    @@ -90,7 +90,9 @@ We're going to make a few changes to the default configuration, mainly to make i
    $sudo nano /usr/local/etc/singularity/singularity.conf
    ```

    First, change `always use nv = no` to `yes`. It doesn't really have any downsides, just saves you from typing --nv every time. Second, we add a few bind paths. Obviously these are user specific, though `/run/user` is useful for everyone running a systemd-based distribution like Ubuntu or Debian. I added these below the standard bind paths, you'll find it easily in the config file.
    First, to bind the NVIDIA binaries into every container, change `always use nv = no` to `yes`. It doesn't really have any downsides, just saves you from typing --nv every time.

    Second, we add a few bind paths. Obviously these are user specific, though `/run/user` is useful for everyone running a systemd-based distribution like Ubuntu or Debian. I added these below the standard bind paths, you'll find it easily in the config file.
    ```sh
    # For temporary files
    bind path = /run/user
    @@ -105,7 +107,7 @@ $ singularity exec docker://nvcr.io/nvidia/pytorch:19.05-py3 jupyter notebook
    ```

    ## SLURM
    Unfortunately the packages in Ubuntu and Debian are a bit too outdated, so we'll compile our own version. First install some dependenices. Note that we'll install the cgroup stuff right away.
    For GPU scheduling, we use SLURM. Unfortunately the packages in Ubuntu and Debian are a bit too outdated, so we'll compile our own version. First install some dependencies. Note that we'll install the cgroup stuff right away.

    ```sh
    sudo apt-get install build-essential ruby-dev libpam0g-dev libmysqlclient-dev munge libmunge-dev libmysqld-dev cgroup-bin libpam-cgroup cgroup-tools
    @@ -309,7 +311,7 @@ sudo groupadd gpu
    sudo usermod -aG gpu mark
    ```

    I'd advise to add every user with root access to this group for administration tasks. Do *not* add any regular users to it, or it'll break the purpose of the scheduling system.
    I'd advise to add every user with root access to this group for administration tasks. Do *not* add any regular users to it, or it'll break the purpose of the scheduling system as they'll have unlimited GPU access, always.

    To load these `cgroups` every time the system boots, we'll run `cgconfigparser` on boot. Let's create a small `systemd` script to do this:

    @@ -369,4 +371,4 @@ user devices /

    This will move all users in the group *gpu* to GPU access, and everyone else to no GPU access. Exactly what we want.

    Now reboot for the final time and you're done!
    Now reboot for the final time and you're done!
  3. Markus92 revised this gist Apr 21, 2020. 1 changed file with 17 additions and 15 deletions.
    32 changes: 17 additions & 15 deletions serversetup.md
    @@ -1,19 +1,19 @@
    # Setting up a GPU server with scheduling and containers


    Our group recently acquired a new server to do some deep learning: a [SuperMicro 4029GP-TRT2](https://www.supermicro.com/products/system/4U/4029/SYS-4029GP-TRT2.cfm), stuffed with 8x NVIDIA RTX 2080 Ti. Though maybe a bit overpowered, with upcoming networks like BigGAN and fully 3D networks, as well as students joining our group, this machine will be used quite a lot in the future.
    Our group recently acquired a new server to do some deep learning: a [SuperMicro 4029GP-TRT2](https://www.supermicro.com/products/system/4U/4029/SYS-4029GP-TRT2.cfm), stuffed with 8x NVidia RTX 2080 Ti. Though maybe a bit overpowered, with upcoming networks like BigGAN and fully 3D networks, as well as students joining our group, this machine will be used quite a lot in the future.

    One challenge is, is how to manage these GPUs. There are many approaches, but given that most PhD candidates aren't sysadmins, these range from 'free-for-all', leading to one person hogging all GPUs for weeks due to a bug in the code, to Excel sheets that noone understands and noone adheres to because changing GPU ids in code is hard. This leads to a lot of frustration, low productivity and under-utilisation of these expensive servers. Another issue is conflicting software versions. TensorFlow and Keras, for example, tend to do breaking API changes every now and then. As these always happen right before a conference deadline, this leads to even more frustration when trying to run a few extra experiments.

    A group at a previous affiliation of mine had the same problems and used Docker containers with a job scheduler to mitigate most of these problem. Unfortunately I had never used it myself and was thus not familiar with the exact details of their implementation. This approach solves most of our problems: no conflicting software versions (just roll a container per research paper and archive it), no competing for GPUs and, most importantly, people can't accidentally screw up colleagues' experiments.

    There was one more constraint I had: our system should be as easy to use as possible. When I talked about 'job scheduling' and 'GPU allocation' to my colleagues, the reaction I got was that they were scared it'd be either too complicated to use, or too limited to use. As I really didn't want to go the 'Google Sheets' route for GPU scheduling, I kept this in mind during the design of the system. Another constraint was set by our sysadmin: no root for users, as we use some old NFSv3 fileservers which authenticate on UID/GID level. This immediately excluded Docker, as described (here)[https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface]. As users would be allowed to control the Docker daemon, it's pretty much the same as putting everyone in the group 'sudoers'. Not something we want, to be honest.
    There was one more constraint I had: our system should be as easy to use as possible. When I talked about 'job scheduling' and 'GPU allocation' to my colleagues, the reaction I got was that they were scared it'd be either too complicated to use, or too limited to use. As I really didn't want to go the 'Google Sheets' route for GPU scheduling, I kept this in mind during the design of the system. Another constraint was set by our sysadmin: no root for users, as we got some old NFSv3 fileservers which authenticate on UID/GID level. This immediately excluded Docker, as described [here](https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface). As users would be allowed to control the Docker daemon, it's pretty much the same as putting everyone in the group 'sudoers'. Not something we want, to be honest.

    In the end I decided to use a combination of [Singularity](https://sylabs.io/singularity/) and [SLURM](https://slurm.schedmd.com/). Singularity is a container tool and used a lot by HPC facilities. SLURM is an industry-standard job scheduler, also used by many HPC instances. An advantage of these tools is that they are industry-standard, thus used a lot and well-documented. Always helpful when running into problems. To enforce proper usage of these tools, control groups (cgroups) are used to lock access down.
    In the end I decided to use a combination of [Singularity](https://sylabs.io/singularity/) and [SLURM](https://slurm.schedmd.com/). Singularity is a container tool created for and used a lot by HPC facilities. SLURM is an industry-standard job scheduler, also used by many HPC instances. An advantage of these tools is that they are industry-standard, thus used a lot and well-documented. Always helpful when running into problems. To enforce proper usage of these tools, control groups (cgroups) are used to lock access down: by default there are no GPU permissions.

    As most tutorials I found were quite outdated, here's a new one. It's in typical 'follow-along' style, so you can copy/paste the commands onto your own terminal and you'll end up with a similar system! Note: root access is required of course.
    As most tutorials I found were quite outdated, here's a new one. It's in typical 'follow-along' style, so you can copy/paste the commands onto your own terminal and you'll end up with a similar system! Root access is required, obviously.

    Note: we are running Ubuntu 18.04 LTS.
    Note: we are running a new install of Ubuntu 18.04 LTS.

    ## Installing Singularity
    As most debian packages for Singularity are quite outdated, we'll compile it ourselves. It's written in Go, so we'll also install a recent Go version.
    @@ -62,18 +62,18 @@ go env
    ```
    This should give some output of Go.

    Next step is compiling Singularity itself. First get dep, then Singularity. Obviously change v3.2.1 to any later version if you want. Take a look at their github tags for more info.
    Next step is compiling Singularity itself. First get dep, then Singularity. Obviously change v3.5.2 to any later version if you want. Take a look at their github tags for more info.
    ```sh
    go get -u github.com/golang/dep/cmd/dep
    go get -d github.com/sylabs/singularity
    cd $GOPATH/src/github.com/sylabs/singularity
    git checkout v3.2.1
    git checkout v3.5.2
    ```
    It'll complain a bit about no Go files being there, but still does its job.
    Now compile time, this will take a few minutes:
    ```sh
    ./mconfig
    make -C builddir
    make -j10 -C builddir
    sudo make -C ./builddir install
    ```

    @@ -82,7 +82,7 @@ You should be done now! Let's test it:
    singularity version
    ```

    And the output should be `3.2.1` or the version you picked before.
    And the output should be `3.5.2` or the version you picked before.

    We're going to make a few changes to the default configuration, mainly to make it easier for our users. We'll add a few bind points and change a few defaults to make the containers as transparent as possible.

    @@ -147,9 +147,9 @@ sudo systemctl enable slurmdbd
    ```

    We can't start them yet because we don't have a slurm.conf file yet.
    There is a generator to make one, but I'll drop my own slurm.conf file here below.
    There is a generator to make one, but I'll drop my own slurm.conf file here below later.

    We also need mysql for accounting. This isn't the most desirable application you can install (for security reasons), but nowadays the defaults of mysql 5.7 at Ubuntu 18.04 are pretty sane.
    We also need mysql for accounting. This isn't the most desirable application you can install (for security reasons), but nowadays the defaults of mysql 5.7 at Ubuntu 18.04 are pretty sane (no more guest access, no empty root password).

    ```sh
    sudo DEBIAN_FRONTEND=noninteractive apt-get install -y mysql-server pwgen
    @@ -160,12 +160,13 @@ Use pwgen two generate two passwords: one for the mysql root user, one for the s
    ```sh
    pwgen 16 2
    ```
    Write them down or store them somewhere.
    Write them down or store them somewhere. Now open a mysql shell:

    ```sh
    mysql
    ```
    Then run these commands in the mysql shell:
    Then run these commands in the shell: Replace your_secure_password
    with one of the password generated by `pwgen` above.
    ```sql
    create user 'slurm'@'localhost';
    set password for 'slurm'@'localhost' = 'your_secure_password';
    @@ -181,7 +182,7 @@ Now it's time for the configuration files. There's two:
    2. `slurmd.conf` which is the generic slurm configuration

    I'll start with `slurmdbd.conf` and will just copypaste them here.
    Put them in `/etc/slurm/`
    Put them in `/etc/slurm/`. Don't forget to replace the password!

    ```
    # SLURMDB config file
    @@ -338,7 +339,8 @@ And run the command `sudo systemctl enable cgconfigparser.service` after.

    This will now be run every time on boot. So reboot the system.

    To move user processes into the right group, we edit `/etc/pam.d/common-session`.
    To move user processes into the right group, we edit
    `/etc/pam.d/common-session`.
    Add below line to the bottom of the file:

    ```
  4. Markus92 revised this gist Oct 21, 2019. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions serversetup.md
    @@ -1,15 +1,15 @@
    # Setting up a GPU server with scheduling and containers


    Our group recently acquired a new server to do some deep learning: a (https://www.supermicro.com/products/system/4U/4029/SYS-4029GP-TRT2.cfm)[SuperMicro 4029GP-TRT2], stuffed with 8x NVIDIA RTX 2080 Ti. Though maybe a bit overpowered, with upcoming networks like BigGAN and fully 3D networks, as well as students joining our group, this machine will be used quite a lot in the future.
    Our group recently acquired a new server to do some deep learning: a [SuperMicro 4029GP-TRT2](https://www.supermicro.com/products/system/4U/4029/SYS-4029GP-TRT2.cfm), stuffed with 8x NVIDIA RTX 2080 Ti. Though maybe a bit overpowered, with upcoming networks like BigGAN and fully 3D networks, as well as students joining our group, this machine will be used quite a lot in the future.

    One challenge is, is how to manage these GPUs. There are many approaches, but given that most PhD candidates aren't sysadmins, these range from 'free-for-all', leading to one person hogging all GPUs for weeks due to a bug in the code, to Excel sheets that noone understands and noone adheres to because changing GPU ids in code is hard. This leads to a lot of frustration, low productivity and under-utilisation of these expensive servers. Another issue is conflicting software versions. TensorFlow and Keras, for example, tend to do breaking API changes every now and then. As these always happen right before a conference deadline, this leads to even more frustration when trying to run a few extra experiments.

    A group at a previous affiliation of mine had the same problems and used Docker containers with a job scheduler to mitigate most of these problem. Unfortunately I had never used it myself and was thus not familiar with the exact details of their implementation. This approach solves most of our problems: no conflicting software versions (just roll a container per research paper and archive it), no competing for GPUs and, most importantly, people can't accidentally screw up colleagues' experiments.

    There was one more constraint I had: our system should be as easy to use as possible. When I talked about 'job scheduling' and 'GPU allocation' to my colleagues, the reaction I got was that they were scared it'd be either too complicated to use, or too limited to use. As I really didn't want to go the 'Google Sheets' route for GPU scheduling, I kept this in mind during the design of the system. Another constraint was set by our sysadmin: no root for users, as we use some old NFSv3 fileservers which authenticate on UID/GID level. This immediately excluded Docker, as described [https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface](here). As users would be allowed to control the Docker daemon, it's pretty much the same as putting everyone in the group 'sudoers'. Not something we want, to be honest.
    There was one more constraint I had: our system should be as easy to use as possible. When I talked about 'job scheduling' and 'GPU allocation' to my colleagues, the reaction I got was that they were scared it'd be either too complicated to use, or too limited to use. As I really didn't want to go the 'Google Sheets' route for GPU scheduling, I kept this in mind during the design of the system. Another constraint was set by our sysadmin: no root for users, as we use some old NFSv3 fileservers which authenticate on UID/GID level. This immediately excluded Docker, as described (here)[https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface]. As users would be allowed to control the Docker daemon, it's pretty much the same as putting everyone in the group 'sudoers'. Not something we want, to be honest.

    In the end I decided to use a combination of (https://sylabs.io/singularity/)[Singularity] and (https://slurm.schedmd.com/)[SLURM]. Singularity is a container tool and used a lot by HPC facilities. SLURM is an industry-standard job scheduler, also used by many HPC instances. An advantage of these tools is that they are industry-standard, thus used a lot and well-documented. Always helpful when running into problems. To enforce proper usage of these tools, control groups (cgroups) are used to lock access down.
    In the end I decided to use a combination of [Singularity](https://sylabs.io/singularity/) and [SLURM](https://slurm.schedmd.com/). Singularity is a container tool and used a lot by HPC facilities. SLURM is an industry-standard job scheduler, also used by many HPC instances. An advantage of these tools is that they are industry-standard, thus used a lot and well-documented. Always helpful when running into problems. To enforce proper usage of these tools, control groups (cgroups) are used to lock access down.

    As most tutorials I found were quite outdated, here's a new one. It's in typical 'follow-along' style, so you can copy/paste the commands onto your own terminal and you'll end up with a similar system! Note: root access is required of course.

  5. Markus92 created this gist Oct 21, 2019.
    370 changes: 370 additions & 0 deletions serversetup.md
    @@ -0,0 +1,370 @@
    # Setting up a GPU server with scheduling and containers


    Our group recently acquired a new server to do some deep learning: a [SuperMicro 4029GP-TRT2](https://www.supermicro.com/products/system/4U/4029/SYS-4029GP-TRT2.cfm), stuffed with 8x NVIDIA RTX 2080 Ti. Though maybe a bit overpowered, with upcoming networks like BigGAN and fully 3D networks, as well as students joining our group, this machine will be used quite a lot in the future.

    One challenge is how to manage these GPUs. There are many approaches, but given that most PhD candidates aren't sysadmins, these range from 'free-for-all', leading to one person hogging all GPUs for weeks due to a bug in the code, to Excel sheets that no one understands and no one adheres to because changing GPU IDs in code is hard. This leads to a lot of frustration, low productivity and under-utilisation of these expensive servers. Another issue is conflicting software versions. TensorFlow and Keras, for example, tend to make breaking API changes every now and then. As these always happen right before a conference deadline, this leads to even more frustration when trying to run a few extra experiments.

    A group at a previous affiliation of mine had the same problems and used Docker containers with a job scheduler to mitigate most of them. Unfortunately I had never used that setup myself and was thus not familiar with the exact details of their implementation. This approach solves most of our problems: no conflicting software versions (just roll a container per research paper and archive it), no competing for GPUs and, most importantly, people can't accidentally screw up colleagues' experiments.

    There was one more constraint I had: our system should be as easy to use as possible. When I talked about 'job scheduling' and 'GPU allocation' to my colleagues, the reaction I got was that they were scared it'd be either too complicated or too limited to use. As I really didn't want to go the 'Google Sheets' route for GPU scheduling, I kept this in mind during the design of the system. Another constraint was set by our sysadmin: no root for users, as we use some old NFSv3 fileservers which authenticate at the UID/GID level. This immediately excluded Docker, as described [here](https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface). As users would be allowed to control the Docker daemon, it's pretty much the same as putting everyone in the group 'sudoers'. Not something we want, to be honest.

    In the end I decided to use a combination of [Singularity](https://sylabs.io/singularity/) and [SLURM](https://slurm.schedmd.com/). Singularity is a container tool used a lot by HPC facilities. SLURM is an industry-standard job scheduler, also used by many HPC sites. An advantage of these tools is that they are industry-standard, thus widely used and well-documented. Always helpful when running into problems. To enforce proper usage of these tools, control groups (cgroups) are used to lock access down.

    As most tutorials I found were quite outdated, here's a new one. It's in typical 'follow-along' style, so you can copy/paste the commands into your own terminal and end up with a similar system! Note: root access is required, of course.

    Note: we are running Ubuntu 18.04 LTS.

    ## Installing Singularity
    As most Debian packages for Singularity are quite outdated, we'll compile it ourselves. It's written in Go, so we'll also install a recent Go version.

    First, install some standard packages for compiling stuff.
    ```sh
    $ sudo apt-get update && \
    sudo apt-get install -y \
    python \
    git \
    dh-autoreconf \
    build-essential \
    libarchive-dev \
    libssl-dev \
    uuid-dev \
    libgpgme11-dev \
    squashfs-tools
    ```

    Next, download and unpack a recent Go release and check that the binary works:
    ```sh
    $ wget https://dl.google.com/go/go1.12.6.linux-amd64.tar.gz
    $ sudo tar -xvf go1.12.6.linux-amd64.tar.gz
    $ sudo mv go /usr/local
    $ /usr/local/go/bin/go version
    ```

    To make sure the GOPATH is set for everyone, I created a new script in `/etc/profile.d`:
    ```sh
    $ sudo nano /etc/profile.d/dl_paths.sh
    ```

    And the script:
    ```sh
    GOROOT="/usr/local/go"

    export GOROOT=${GOROOT}
    export GOPATH=$HOME/go
    export PATH=$GOROOT/bin:$PATH
    ```

    To test this, log out and log back in again (or just reboot), then run:

    ```sh
    export
    go env
    ```
    This should print the Go environment, confirming the paths are set.

    Next step is compiling Singularity itself. First get dep, then Singularity. Obviously change v3.2.1 to any later version if you want. Take a look at their github tags for more info.
    ```sh
    go get -u github.com/golang/dep/cmd/dep
    go get -d github.com/sylabs/singularity
    cd $GOPATH/src/github.com/sylabs/singularity
    git checkout v3.2.1
    ```
    It'll complain a bit about no Go files being there, but still does its job.
    Now it's time to compile; this will take a few minutes:
    ```sh
    ./mconfig
    make -C builddir
    sudo make -C ./builddir install
    ```

    You should be done now! Let's test it:
    ```sh
    singularity version
    ```

    And the output should be `3.2.1` or the version you picked before.

    We're going to make a few changes to the default configuration, mainly to make it easier for our users. We'll add a few bind points and change a few defaults to make the containers as transparent as possible.

    ```sh
    $ sudo nano /usr/local/etc/singularity/singularity.conf
    ```

    First, change `always use nv = no` to `yes`. It doesn't really have any downsides, it just saves you from typing `--nv` every time. Second, we add a few bind paths. Obviously these are user specific, though `/run/user` is useful for everyone running a systemd-based distribution like Ubuntu or Debian. I added these below the standard bind paths; you'll find them easily in the config file.
    ```sh
    # For temporary files
    bind path = /run/user
    # Mounts to data
    bind path = /raid
    ```

    And finally a test run (this might take a while, as the container is HUGE):
    ```sh
    $ cd ~
    $ singularity exec docker://nvcr.io/nvidia/pytorch:19.05-py3 jupyter notebook
    ```
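
    If you want a quicker sanity check than a full Jupyter notebook, something like the line below should confirm that the GPUs are visible inside the container. This is just an illustrative check (same PyTorch image as above, `always use nv` set to `yes`), not part of the original walkthrough:
    ```sh
    # Should print "True 8" on a healthy 8-GPU machine
    $ singularity exec docker://nvcr.io/nvidia/pytorch:19.05-py3 \
        python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
    ```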

    ## SLURM
    Unfortunately the packages in Ubuntu and Debian are a bit too outdated, so we'll compile our own version. First install some dependencies. Note that we'll install the cgroup stuff right away.

    ```sh
    sudo apt-get install build-essential ruby-dev libpam0g-dev libmysqlclient-dev munge libmunge-dev libmysqld-dev cgroup-bin libpam-cgroup cgroup-tools
    ```

    Then download, extract and compile. My machine has many cores so we'll use some multi-threading in the make. Depending on your computer, you might have enough time to grab and drink some coffee.
    ```sh
    wget https://download.schedmd.com/slurm/slurm-19.05.0.tar.bz2
    tar -xaf slurm-19.05.0.tar.bz2
    cd slurm-19.05.0/
    ./configure --sysconfdir=/etc/slurm --enable-pam --localstatedir=/var --with-munge --with-ssl
    make -j10
    sudo make install
    ```

    Logout/login, then check if it actually does something.
    ```sh
    srun
    ```
    You'll get an error about the configuration file not existing. That's expected at this point.

    Now start and enable munge.

    ```sh
    sudo systemctl enable munge
    sudo systemctl start munge
    sudo systemctl status munge
    ```

    Copy the systemd unit files, enable them, and create a user for SLURM.

    ```sh
    cd ~/slurm-19.05.0/etc
    sudo cp *.service /lib/systemd/system/
    sudo adduser --system --no-create-home --group slurm
    sudo systemctl enable slurmd
    sudo systemctl enable slurmctld
    sudo systemctl enable slurmdbd
    ```

    We can't start them yet because there is no slurm.conf file.
    There is a generator to make one, but I'll drop my own slurm.conf file below.

    We also need MySQL for accounting. This isn't the most desirable application to install (for security reasons), but nowadays the defaults of MySQL 5.7 on Ubuntu 18.04 are pretty sane.

    ```sh
    sudo DEBIAN_FRONTEND=noninteractive apt-get install -y mysql-server pwgen
    ```

    Use pwgen to generate two passwords: one for the mysql root user, one for the slurm user.

    ```sh
    pwgen 16 2
    ```
    Write them down or store them somewhere safe. Then open a mysql shell:

    ```sh
    mysql
    ```
    Then run these commands in the mysql shell, replacing `your_secure_password` with one of the passwords generated by `pwgen` above:
    ```sql
    create user 'slurm'@'localhost';
    set password for 'slurm'@'localhost' = 'your_secure_password';
    grant usage on *.* to 'slurm'@'localhost';
    create database slurm_acct_db;
    grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
    flush privileges;
    exit
    ```
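
    To confirm the grants work, you can connect as the new user; it should prompt for the password you just set and drop you into the (still empty) accounting database. Just a sanity check, not a required step:
    ```sh
    mysql -u slurm -p slurm_acct_db -e 'show tables;'
    ```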

    Now it's time for the configuration files. There are two:
    1. `slurmdbd.conf`, which is for the database daemon
    2. `slurm.conf`, which is the generic SLURM configuration

    I'll start with `slurmdbd.conf` and will just copy-paste it here.
    Put both files in `/etc/slurm/`, and don't forget to replace the password!

    ```
    # SLURMDB config file
    # Created by Mark Janse 2019-06-18
    # logging level
    ArchiveEvents=no
    ArchiveJobs=yes
    ArchiveSteps=no
    ArchiveSuspend=no
    # service
    DbdHost=localhost
    SlurmUser=slurm
    AuthType=auth/munge
    # logging; remove this to use syslog
    LogFile=/var/log/slurm-llnl/slurmdbd.log
    # database backend
    StoragePass=your_secure_password
    StorageUser=slurm
    StorageType=accounting_storage/mysql
    StorageLoc=slurm_acct_db
    ```

    And here's the `slurm.conf`. I'll assume hostname `turing` for the main machine. The name of the cluster is `bip-cluster`, but that isn't really too important.
    At the bottom I also define the node: ours has 8 GPUs, 2 CPUs, 10 cores per CPU and 2 threads per core. Change these to match your own hardware.

    ```
    # slurm.conf file generated by configurator easy.html.
    # Put this file on all nodes of your cluster.
    # See the slurm.conf man page for more information.
    #
    # Set your hostname here!
    SlurmctldHost=turing
    #
    #MailProg=/bin/mail
    MpiDefault=none
    #MpiParams=ports=#-#
    ProctrackType=proctrack/cgroup
    ReturnToService=1
    SlurmctldPidFile=/var/run/slurmctld.pid
    #SlurmctldPort=6817
    SlurmdPidFile=/var/run/slurmd.pid
    #SlurmdPort=6818
    SlurmdSpoolDir=/var/spool/slurmd
    SlurmUser=slurm
    #SlurmdUser=root
    StateSaveLocation=/var/spool/slurm
    SwitchType=switch/none
    TaskPlugin=task/cgroup
    #
    #
    # TIMERS
    #KillWait=30
    #MinJobAge=300
    #SlurmctldTimeout=120
    #SlurmdTimeout=300
    #
    #
    # SCHEDULING
    FastSchedule=1
    SchedulerType=sched/backfill
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core
    #
    #
    # LOGGING AND ACCOUNTING
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageEnforce=associations
    ClusterName=bip-cluster
    #JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/linux
    #SlurmctldDebug=info
    SlurmctldLogFile=/var/log/slurm/slurmctld.log
    #SlurmdDebug=info
    SlurmdLogFile=/var/log/slurm/slurmd.log
    #
    #
    # COMPUTE NODES
    # NodeName=turing Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
    # Partitions
    GresTypes=gpu
    NodeName=turing Gres=gpu:8 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
    PartitionName=tu102 Nodes=turing Default=YES MaxTime=96:00:00 MaxNodes=2 DefCpuPerGPU=5 State=UP
    ```
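
    If you're unsure about the `Sockets`/`CoresPerSocket`/`ThreadsPerCore` numbers for your own machine, `slurmd` can report what it detects. A quick sanity check, not part of the original guide:
    ```sh
    # Prints a NodeName=... line with the hardware slurmd sees on this host
    sudo slurmd -C
    ```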

    For GPU scheduling you also need a `gres.conf`. This file differs per machine if your machines have different numbers of GPUs. In our case, there is only one machine with 8 GPUs.
    ```
    # Defines all 8 GPUs on Turing
    Name=gpu File=/dev/nvidia[0-7]
    ```
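
    With `slurmdbd.conf`, `slurm.conf` and `gres.conf` in place, the daemons we enabled earlier can be started. Roughly, and assuming everything above went fine, I'd expect this to work:
    ```sh
    sudo systemctl start slurmdbd
    sudo systemctl start slurmctld
    sudo systemctl start slurmd
    # The node and partition defined above should now show up
    sinfo
    ```
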
    ## Restricting unauthorized GPU access
    Previously, we already installed several tools needed for `cgroups`. Now we're going to put them to work.
    First we create the file `cgconfig.conf`. See below for contents. We create a group `nogpu` for processes without gpu access, and a group `gpu` for processes which can access the GPU.

    Location of the file is `/etc/cgconfig.conf`
    ```
    # Below restricts access to NVIDIA devices for all users in this cgroup
    # Number 195 is documented in kernel for NVIDIA driver stuff
    group nogpu {
        devices {
            devices.deny = "c 195:* rwm";
        }
    }
    # Opposite of above, just to be sure
    group gpu {
        devices {
            devices.allow = "c 195:* rwm";
        }
    }
    ```

    For admin tasks, you might want to create a usergroup which will always have GPU access.

    ```sh
    sudo groupadd gpu
    sudo usermod -aG gpu mark
    ```
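
    To double-check the membership (for the example user `mark` from above; the change only applies to new logins):
    ```sh
    id mark   # 'gpu' should now appear in the list of groups
    ```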

    I'd advise adding every user with root access to this group for administration tasks. Do *not* add any regular users to it, or it'll defeat the purpose of the scheduling system.

    To load these `cgroups` every time the system boots, we'll run `cgconfigparser` on boot. Let's create a small `systemd` script to do this:

    ```sh
    sudo nano /lib/systemd/system/cgconfigparser.service
    ```

    And copy-paste the file below in there:

    ```
    [Unit]
    Description=cgroup config parser
    After=network.target
    [Service]
    User=root
    Group=root
    ExecStart=/usr/sbin/cgconfigparser -l /etc/cgconfig.conf
    Type=oneshot
    [Install]
    WantedBy=multi-user.target
    ```

    And run the command `sudo systemctl enable cgconfigparser.service` after.

    This will now run on every boot, so reboot the system.
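
    After rebooting, you can quickly verify that both groups exist (just a sanity check):
    ```sh
    # Both 'gpu' and 'nogpu' should show up here
    ls /sys/fs/cgroup/devices/
    ```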

    To move user processes into the right cgroup, we edit `/etc/pam.d/common-session`.
    Add the line below to the bottom of the file:

    ```
    session optional pam_cgroup.so
    ```

    The PAM module reads the file `/etc/cgrules.conf`, so create that. Mine is below:

    ```
    # /etc/cgrules.conf
    #The format of this file is described in cgrules.conf(5)
    #manual page.
    #
    # Example:
    #<user> <controllers> <destination>
    #@student cpu,memory usergroup/student/
    #peter cpu test1/
    #% memory test2/
    # End of file
    root devices /
    user devices /
    @gpu devices /gpu
    * devices /nogpu
    ```

    This will move all users in the group *gpu* to GPU access, and everyone else to no GPU access. Exactly what we want.
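
    Once you've done the final reboot below, a quick way to check that the lockdown behaves as intended (a sketch, assuming the cgroup v1 layout that Ubuntu 18.04 uses):
    ```sh
    # As a regular user who is not in the gpu group:
    grep devices /proc/self/cgroup   # should end in /nogpu
    nvidia-smi                       # should fail, as /dev/nvidia* can no longer be opened
    # For a member of the gpu group, the same commands should show /gpu and a normal nvidia-smi listing.
    ```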

    Now reboot for the final time and you're done!
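
    And to see the whole setup in action, here's roughly what a user's job submission could look like. This is an illustrative sketch rather than part of the setup above; the script name, container and resource numbers are just examples:
    ```sh
    #!/bin/bash
    # train.sbatch -- request one GPU and five CPU cores for a training run
    #SBATCH --job-name=train
    #SBATCH --gres=gpu:1
    #SBATCH --cpus-per-task=5
    #SBATCH --time=24:00:00
    singularity exec docker://nvcr.io/nvidia/pytorch:19.05-py3 python train.py
    ```
    Submitted with `sbatch train.sbatch`, SLURM queues the job until a GPU is free and exposes only the allocated card(s) to it via `CUDA_VISIBLE_DEVICES`, while interactive use outside the scheduler stays locked out by the cgroups above.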