Introduction

Running code on a server requires a more complex setup (e.g., ssh or containers) than running code locally, which easily introduces friction into our experimental workflow. Yet offloading part of our workflow to a server should not require much additional effort: the same high-level operations (e.g., running a script) are available both locally and on the server. In this post, I cover my approach to making server interaction painless (even satisfying). I focus on SLURM-managed clusters, but the principles are broadly applicable (e.g., to cloud computing or to a remote machine without a scheduler).

Implementation

Our goal is to make the server an extension of our local machine: we rarely want to leave the comfort of our local terminal. Our workflow is based on having the local project folder mirrored on the server. Whatever changes we make locally (e.g., code changes or the creation of a new file) should be available on the server without us having to produce the specific scp or rsync incantation. Similarly, results and data generated on the server should be readily available on our local machine. Besides file synchronization, we need to run jobs on the server. Typically, long-running or compute-heavy jobs are offloaded to the server, while light exploratory jobs are run locally. The anchor point for most operations is the project folder: by running commands relative to it, we can ignore absolute path differences between the server and the local machine.

We now define a few Bash functions to make interaction nicer. These can be grouped approximately into the categories below. All of the commands below, along with a few additional ones, can be found here. All code is available under an MIT license.

Project configuration

This is the basic configuration for our setup. We log in as USERNAME to our remote machine (see USER_HOST). The absolute paths to our project folder on the server (REMOTE_FOLDERPATH) and locally (LOCAL_FOLDERPATH) are included below; $LOCAL_FOLDERPATH is mirrored on the server at $REMOTE_FOLDERPATH. CONTAINER_REL_FILEPATH is the path of the Singularity container relative to the project folder.

USERNAME="renatomp"
USER_HOST="renatomp@bridges.psc.xsede.org"
# NOTE: use absolute paths.
LOCAL_FOLDERPATH="/Users/renatomp/Desktop/my_project"
REMOTE_FOLDERPATH="/pylon5/yyyy/renatomp/my_project"
# NOTE: path provided relative to the project folder.
CONTAINER_REL_FILEPATH="py36_gpu.img"

Syncing files and folders

This is the most basic functionality, allowing us to move files back and forth between the local and remote machines. The prefix utproj_, here and in later commands, marks commands that run relative to the project folder; these simplify interaction by using the project configuration filled in previously. For example, if we create a file main.py in the local project folder and want to send it to the mirrored remote folder, we write utproj_sync_file_to_server main.py. We also name commands so that the name hints at the order of the arguments, which makes that order easy to remember. For example, for ut_sync_folder_to_server, the first argument is the path of the folder to sync to the server and the second is the server location.

UT_RSYNC_FLAGS="--archive --update --recursive --verbose"
ut_sync_file_to_server(){ rsync $UT_RSYNC_FLAGS "$1" "$2"; }
ut_sync_file_from_server(){ rsync $UT_RSYNC_FLAGS "$1" "$2"; }
ut_sync_folder_to_server(){ rsync $UT_RSYNC_FLAGS "$1/" "$2/"; }
ut_sync_folder_from_server(){ rsync $UT_RSYNC_FLAGS "$1/" "$2/"; }

### commands are run with respect to the specific project folder.
utproj_sync_file_to_server(){ ut_sync_file_to_server "$LOCAL_FOLDERPATH/$1" "$USER_HOST:$REMOTE_FOLDERPATH/$1"; }
utproj_sync_file_from_server(){ ut_sync_file_from_server "$USER_HOST:$REMOTE_FOLDERPATH/$1" "$LOCAL_FOLDERPATH/$1"; }
utproj_sync_folder_to_server(){ ut_sync_folder_to_server "$LOCAL_FOLDERPATH/$1" "$USER_HOST:$REMOTE_FOLDERPATH/$1"; }
utproj_sync_folder_from_server(){ ut_sync_folder_from_server "$USER_HOST:$REMOTE_FOLDERPATH/$1" "$LOCAL_FOLDERPATH/$1"; }

utproj_sync_project_to_server(){ utproj_sync_folder_to_server "."; }
utproj_sync_project_from_server(){ utproj_sync_folder_from_server "."; }

The rsync flags are set so that a file is only transferred if it has changed with respect to the version on the other side. For project folders with large numbers of files, it may be useful to ignore certain folders, as rsync may introduce delays even if little data needs to be transferred. Another alternative is to define commands that distinguish between transfers of data and of code, allowing very fast code transfers, which are likely to occur more frequently and involve only a few small files.
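
For example, a code-only sync might look roughly like the sketch below (the function name and exclude patterns are illustrative, not part of utils.sh):

# sketch: skip bulky paths such as a data folder and the Singularity image.
utproj_sync_code_to_server(){
    rsync $UT_RSYNC_FLAGS --exclude='data/' --exclude='*.img' \
        "$LOCAL_FOLDERPATH/" "$USER_HOST:$REMOTE_FOLDERPATH/";
}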

Submitting jobs

Given a project folder mirrored both locally and on the server, we now develop the functionality to run jobs on the server.

ut_run_command_on_server(){ ssh "$2" -t "$1"; }
ut_run_command_on_server_on_folder(){ ssh "$2" -t "cd \"$3\" && $1"; }
ut_run_bash_on_server_on_folder(){ ssh "$1" -t "cd \"$2\" && bash"; }

utproj_run_command_on_server(){ ut_run_command_on_server "$1" "$USER_HOST"; }
utproj_run_command_on_server_on_project_folder(){ ut_run_command_on_server_on_folder "$1" "$USER_HOST" "$REMOTE_FOLDERPATH"; }

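# 1: command, 2: job name, 3: folder, 4: num cpus, 5: memory in mbs, 6: time in minutes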
utproj_submit_cpu_job_with_resources(){
    script='#!/bin/bash'"
#SBATCH --nodes=1
#SBATCH --partition=RM-shared
#SBATCH --cpus-per-task=$4
#SBATCH --mem=$5MB
#SBATCH --time=$6
#SBATCH --job-name=\"$2\"
$1" && utproj_run_command_on_server "cd \"$3\" && echo \"$script\" > _run.sh && chmod +x _run.sh && sbatch _run.sh && rm _run.sh";
}

# 1: command, 2: job name, 3: folder, 4: num cpus, 5: num_gpus, 6: memory in mbs, 7: time in minutes
# limits: 7GB per gpu, 48 hours, 16 cores per gpu.
# NOTE: read https://www.psc.edu/bridges/user-guide/running-jobs for more details.
# NOTE: if not using Bridges, partitions may have different names, or the
# procedure to submit jobs may be different. Defining auxiliary commands to
# perform these operations is still useful.
utproj_submit_gpu_job_with_resources(){
    script='#!/bin/bash'"
#SBATCH --nodes=1
#SBATCH --partition=GPU-shared
#SBATCH --gres=gpu:k80:$5
#SBATCH --cpus-per-task=$4
#SBATCH --mem=$6MB
#SBATCH --time=$7
#SBATCH --job-name=\"$2\"
$1" && utproj_run_command_on_server "cd \"$3\" && echo \"$script\" > _run.sh && chmod +x _run.sh && sbatch _run.sh && rm _run.sh";
}

# NOTE: basic resource usage for testing. change to suit your project.
# NOTE: it may be useful to define multiple commands to use different
# amounts of resources.
utproj_run_server_cpu_command(){ utproj_submit_cpu_job_with_resources "$1" "my_cpu_job" "$REMOTE_FOLDERPATH" 1 1024 60; }
utproj_run_server_gpu_command(){ utproj_submit_gpu_job_with_resources "$1" "my_gpu_job" "$REMOTE_FOLDERPATH" 1 1 1024 60; }

To submit a job, we need to handle the peculiarities of SLURM for the cluster that we are working with. The code above instantiates this for Bridges: we need to specify the partition on which the job will be run. The commands prefixed by utproj_ in this subsection allow us to access different resource levels (e.g., 1 hour of CPU time or 1 hour of GPU time). These commands were defined for illustrative purposes, so defining your own commands with different resource budgets will likely be useful.
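
For example, a larger budget within the Bridges limits quoted above might look like this (the name is my own):

# 4 cpus, 1 gpu, 7168 MB (the 7GB per gpu limit), 480 minutes (8 hours).
utproj_run_server_gpu_command_8h(){
    utproj_submit_gpu_job_with_resources "$1" "my_gpu_job" "$REMOTE_FOLDERPATH" 4 1 7168 480;
}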

We are dealing with a SLURM cluster, so these are SLURM commands. For remote machines without a scheduler, similar functionality can be developed, although the exact implementation will differ (e.g., instead of going through the scheduler, we might ssh into the machine of interest and run a headless job; instead of checking the queue, we might check which jobs we have running on that machine). In all cases, running a job on the server is an important piece of functionality that we want to make as frictionless as possible.
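
For concreteness, a minimal sketch of this scheduler-free variant could be (the function names here are my own, not part of utils.sh):

utproj_run_headless_command_on_server(){
    # start a detached job that survives the ssh session ending.
    utproj_run_command_on_server_on_project_folder "nohup $1 > _headless.log 2>&1 &";
}
utproj_show_my_headless_jobs(){
    # no queue to inspect, so list our processes on the machine instead.
    utproj_run_command_on_server "ps -u $USERNAME -o pid,etime,command";
}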

Managing jobs

To effectively manage jobs running on the server from the local terminal, it is useful to have functionality to check the queue (e.g., to check whether a previously submitted job is waiting for resources, currently running, or finished) and to cancel jobs (e.g., a job might have been submitted by mistake or may no longer be relevant). The commands below are self-explanatory.

utproj_show_queue(){ utproj_run_command_on_server "squeue"; }
utproj_show_my_jobs(){ utproj_run_command_on_server "squeue -u $USERNAME"; }
utproj_cancel_job(){ utproj_run_command_on_server "scancel -n \"$1\""; }
utproj_cancel_all_my_jobs(){ utproj_run_command_on_server "scancel -u $USERNAME"; }
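
If we want to keep an eye on a submitted job, a simple polling loop from the local terminal does the trick (utproj_watch_my_jobs is a name I introduce here; utils.sh includes a similar helper, ut_run_command_every_num_seconds):

# sketch: print our jobs every 30 seconds (Ctrl-C to stop).
utproj_watch_my_jobs(){ while true; do utproj_show_my_jobs; sleep 30; done; }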

Working with singularity containers

We often don’t have complete control over the server environment (e.g., we may not be able to install additional software). Containers address this problem, although they introduce some workflow friction of their own, e.g., the container needs to be rebuilt when the environment changes. Fortunately, the principles described for server interaction apply to container management as well: friction can be minimized by defining functions for the desired high-level manipulations (e.g., rebuilding a container, executing a command inside a container). If your workflow relies not on containers but, for example, on module load commands or Python virtual environments, the same tricks apply with slight changes (see the sketch at the end of this subsection).

ut_run_command_in_singularity_container(){ singularity exec --nv "$2" "$1"; }

utproj_run_command_in_singularity_container(){ ut_run_command_in_singularity_container "$1" "$CONTAINER_REL_FILEPATH"; }
utproj_run_server_cpu_command_in_singularity_container(){
    script_name=_cpu_cmd_`ut_random_uuid`.sh &&
    ut_create_runnable_script_from_command "$script_name" "$1" &&
    utproj_sync_file_to_server "$script_name" &&
    utproj_run_server_cpu_command "source utils.sh && module load singularity && utproj_run_command_in_singularity_container ./$script_name";
    # NOTE: the script is not deleted here (e.g., with utproj_delete_file
    # "$script_name") because the job runs asynchronously and still needs it.
}
utproj_run_server_gpu_command_in_singularity_container(){
    script_name=_gpu_cmd_`ut_random_uuid`.sh &&
    ut_create_runnable_script_from_command "$script_name" "$1" &&
    utproj_sync_file_to_server "$script_name" &&
    utproj_run_server_gpu_command "source utils.sh && module load singularity && utproj_run_command_in_singularity_container ./$script_name";
    # NOTE: same as above, the script is kept around for the asynchronous job.
}

Running a command inside a container on the server involves a few steps, but we won’t need to concern ourselves with them once the appropriate functions are defined. For example, utproj_run_server_cpu_command_in_singularity_container creates a file with the command that we want to run, syncs it to the server, and submits a job that defines the utility commands on the server, loads Singularity, and executes the desired command inside the container. This complex operation is accomplished trivially by running utproj_run_server_cpu_command_in_singularity_container "$MY_CMD".
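
For reference, the helper ut_create_runnable_script_from_command used above is defined in utils.sh; a rough sketch of what it does (the actual definition may differ) is:

# sketch: write the command into an executable bash script.
ut_create_runnable_script_from_command(){
    printf '#!/bin/bash\n%s\n' "$2" > "$1" && chmod +x "$1";
}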

Complex commands build on simpler commands. For example, utproj_run_server_cpu_command_in_singularity_container builds on utproj_sync_file_to_server, utproj_run_server_cpu_command, and utproj_run_command_in_singularity_container along with other simple functions.
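
As an example of the non-container variant mentioned earlier, the same pattern with a Python virtual environment might look roughly like this (VENV_REL_FOLDERPATH and the function names are mine, not part of utils.sh):

# sketch: run a command inside a Python virtual environment instead of a container.
VENV_REL_FOLDERPATH="venv"
ut_run_command_in_virtualenv(){ bash -c "source \"$2/bin/activate\" && $1"; }
utproj_run_command_in_virtualenv(){ ut_run_command_in_virtualenv "$1" "$VENV_REL_FOLDERPATH"; }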

Aliases for most frequent commands

Long names are useful because they make functionality and argument order explicit, but it can get tedious to write them out (even when using tab completion to bring up the list of commands matching a prefix). We can define shorter aliases for the commands that we use most frequently.

srv_cmd(){ utproj_run_command_on_server_on_project_folder "$1"; }
srv_cpu_cmd(){ utproj_run_server_cpu_command_in_singularity_container "$1"; }
srv_gpu_cmd(){ utproj_run_server_gpu_command_in_singularity_container "$1"; }
srv_sync(){ utproj_sync_project_to_server && utproj_sync_project_from_server; }

Example interaction

We now walk through an example interaction with Bridges using the functions that we just defined. This section is meant to demonstrate the usefulness of working with these commands.

Creating a project

Let us start by creating our project (named my_project) locally. All actions happen locally and are anchored at the root of the project folder. We will interact with the server from within the local terminal in our VS Code window. We start with the following folder structure:

my-local-terminal$ tree
.
├── py36_gpu.img
└── utils.sh

0 directories, 2 files
my-local-terminal$ pwd
/Users/renatomp/Desktop/my_project

utils.sh contains the commands that we defined above (see here). We need to redefine the configuration appropriately. For this particular example, it would look like:

USERNAME="renatomp"
USER_HOST="renatomp@bridges.psc.xsede.org"
LOCAL_FOLDERPATH="/Users/renatomp/Desktop/my_project"
REMOTE_FOLDERPATH="/pylon5/yyyy/renatomp/my_project"
# NOTE: this path is provided relative to the project folder.
CONTAINER_REL_FILEPATH="py36_gpu.img"

I’ve kept my true username private by replacing it with renatomp. We can now go ahead and define all commands:

source utils.sh

A useful trick that makes retrieving commands less tedious is to type ut_ or utproj_ in the terminal and press TAB twice to print the list of all commands matching that prefix. The same trick can be used for longer prefixes, e.g., utproj_run_.

my-local-terminal$ ut_
ut_build_py27_gpu_singularity_container
ut_build_py36_gpu_singularity_container
ut_build_static_singularity_container_from_writable_singularity_container
ut_create_runnable_script_from_command
ut_random_uuid
ut_register_ssh_key_on_server
ut_run_bash_on_server_on_folder
ut_run_command_every_num_seconds
ut_run_command_in_singularity_container
ut_run_command_on_server
ut_run_command_on_server_on_folder
ut_sudo_bash_into_singularity_container
ut_sync_file_from_server
ut_sync_file_to_server
ut_sync_folder_from_server
ut_sync_folder_to_server
my-local-terminal$ utproj_
utproj_cancel_all_my_jobs
utproj_cancel_job
utproj_continuously_sync_project_from_server
utproj_delete_file
utproj_delete_folder
utproj_run_bash_on_server
utproj_run_command_in_singularity_container
utproj_run_command_on_server
utproj_run_command_on_server_on_project_folder
utproj_run_server_cpu_command
utproj_run_server_cpu_command_in_singularity_container
utproj_run_server_gpu_command
utproj_run_server_gpu_command_in_singularity_container
utproj_show_my_jobs
utproj_show_queue
utproj_submit_cpu_job_with_resources
utproj_submit_gpu_job_with_resources
utproj_sync_file_from_server
utproj_sync_file_to_server
utproj_sync_folder_from_server
utproj_sync_folder_to_server
utproj_sync_project_from_server
utproj_sync_project_to_server

Registering an ssh key

It is important to register an ssh key so that we do not have to type our password each time a server command is issued. You can use ssh-copy-id to this effect, i.e., ssh-copy-id $USER_HOST. For Bridges this step is different (see here).

Syncing code

Let us now get the project onto the server. We first create an empty remote folder for the project and then sync the current local state to it. We have:

my-local-terminal$ ut_run_command_on_server "mkdir $REMOTE_FOLDERPATH" "$USER_HOST"
my-local-terminal$ utproj_sync_project_to_server
stdin: is not a tty
building file list ... done
./
py36_gpu.img
utils.sh

sent 4427144763 bytes  received 70 bytes  7746535.14 bytes/sec
total size is 4426604182  speedup is 1.00

We can see that all files are there.

my-local-terminal$ utproj_run_command_on_server_on_project_folder ls
py36_gpu.img  utils.sh
Connection to bridges.psc.xsede.org closed.

If we want to create a new file and sync it, we can create it in the editor locally (say script.sh), and then run

my-local-terminal$ utproj_sync_project_to_server
stdin: is not a tty
building file list ... done
./
script.sh

sent 198 bytes  received 48 bytes  32.80 bytes/sec
total size is 4426604182  speedup is 17994325.94

Only script.sh was transferred, as the large Singularity container and the other files were already on the server.

Running jobs

Let us check the GPUs available on the compute nodes of Bridges.

my-local-terminal$ utproj_run_server_gpu_command nvidia-smi
Submitted batch job 5456467
Connection to bridges.psc.xsede.org closed.

The command was submitted as a SLURM job. After it completes, we can sync back the results, which are in a SLURM output file for the job just submitted.

my-local-terminal$ utproj_sync_project_from_server
stdin: is not a tty
receiving file list ... done
./
slurm-5456467.out

sent 48 bytes  received 1547 bytes  638.00 bytes/sec
total size is 4426605494  speedup is 2775301.25

The output file contains:

Mon May 13 10:44:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   25C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

We can also see that the CPU partition does not have GPUs.

my-local-terminal$ utproj_run_server_cpu_command nvidia-smi
Submitted batch job 5456493
Connection to bridges.psc.xsede.org closed.
my-local-terminal$ utproj_show_
utproj_show_my_jobs  utproj_show_queue
my-local-terminal$ utproj_show_my_jobs
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5456493 RM-shared my_cpu_j renatomp PD       0:00      1 (Priority)
Connection to bridges.psc.xsede.org closed.

We see that the job is still waiting in the queue. Upon completion, we retrieve the results:

my-local-terminal$ utproj_sync_project_from_server
stdin: is not a tty
receiving file list ... done
./
slurm-5456493.out

sent 48 bytes  received 336 bytes  153.60 bytes/sec
total size is 4426605568  speedup is 11527618.67
my-local-terminal$ cat slurm-54564
slurm-5456467.out  slurm-5456493.out
my-local-terminal$ cat slurm-5456493.out
/var/slurmd/job5456493/slurm_script: line 8: nvidia-smi: command not found

Running commands on the server inside a container

We now show how to easily run commands inside the Singularity container on the server. We will simply check that the operating systems of the container and of the remote compute node differ. We will do this from our local terminal.

my-local-terminal$ utproj_run_server_cpu_command "lsb_release -a"
Submitted batch job 5662140
Connection to bridges.psc.xsede.org closed.
my-local-terminal$ utproj_show_my_jobs
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5662140 RM-shared my_cpu_j renatomp PD       0:00      1 (Priority)
Connection to bridges.psc.xsede.org closed.

After it completes:

my-local-terminal$ utproj_sync_project_
utproj_sync_project_from_server
utproj_sync_project_to_server
my-local-terminal$ utproj_sync_project_from_server
stdin: is not a tty
receiving file list ... done
./
slurm-5662140.out

sent 48 bytes  received 598 bytes  258.40 bytes/sec
total size is 4426605989  speedup is 6852331.25
my-local-terminal$ cat slurm-5662140.out
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: CentOS
Description:    CentOS Linux release 7.4.1708 (Core)
Release:        7.4.1708
Codename:       Core

We get CentOS Linux release 7.4.1708 (Core) for the operating system. Now let us check the operating system of the container:

my-local-terminal$ utproj_run_server_cpu_command_in_singularity_container "lsb_release -a"
stdin: is not a tty
building file list ... done
_cpu_cmd_57b16862-cdba-4f98-b174-02a6f6745d4b.sh

sent 195 bytes  received 42 bytes  94.80 bytes/sec
total size is 27  speedup is 0.11
Submitted batch job 5662143
Connection to bridges.psc.xsede.org closed.
my-local-terminal$ utproj_show_my_jobs
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5662143 RM-shared my_cpu_j renatomp PD       0:00      1 (Priority)
Connection to bridges.psc.xsede.org closed.

Syncing back the results from the server after it terminates, we get:

my-local-terminal$ utproj_sync_project_from_server
stdin: is not a tty
receiving file list ... done
./
slurm-5662143.out

sent 48 bytes  received 559 bytes  242.80 bytes/sec
total size is 4426606200  speedup is 7292596.71
my-local-terminal$ cat slurm-5662143.out
INFO:    Could not find any NVIDIA binaries on this host!
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:        18.04
Codename:       bionic

We get Ubuntu 18.04.2 LTS for the operating system, showing that the command ran inside the container.

Conclusion: a personal note

This is the basic structure of a typical interaction with the server. We don’t need long-lived ssh sessions. It is easy to situate ourselves: we work firmly on the local machine and delegate work to the server, never splitting our work disjointly between the two. Each high-level operation corresponds approximately to a single command, and we are free to define new commands.

Personally, one of the main benefits of these commands is that they give me strong reassurance that when I create a file locally, I can get it to the server trivially (i.e., without thinking). The same is true for running jobs on the server (inside a container or not). It becomes very natural to interact in terms of these commands. My recommendation is to define commands for sequences of operations that you find yourself performing over and over again: ask yourself what their semantic goals are and name them accordingly.

I recommend that you copy this code into your project and make your project-specific changes. Repetition between projects is not a major concern, as the main goal is to reduce cognitive load within a particular project. The alternative to this workflow is to string low-level commands together on the terminal every time we want to do something, which is painful both to use and to hand over to other people (including your future self: think about revisiting a project after more than 3 months of not working on it!). Always make your life easier. The guiding principle is “don’t make me think”. Save your thinking for the things that matter.