Testbed Information

This page contains essential information about the ROCS testbed. This testbed comprises servers and switches from Prof. Canini, Prof. Fahmy and Prof. Kalnis.

High level composition

The testbed is composed of various types of machines:

mcnode01-08: Machines with 1 x P100 GPU.
mcnode09-16: Relatively capable machines with no GPU.
mcnode17-20: Wimpy machines with little memory.
mcnode21: Machine with 28 cores, Intel SGX support.
mcnode22-25: Machines with 4 x A100 GPUs.
mcnode26-29: Capable single-socket AMD, 64 cores (128 threads) machines with many PCIe expansion slots. mcnode27 has a single A100 GPU and a BlueField-2 NIC. mcnode28 has a BlueField-2 NIC.
mcnode31-40: Machines with V100 GPUs. mcnode[31-32, 37-40] have 2 V100s and mcnode33-36 have 3 V100s.
mcnode41-44: Machines with 128 cores (256 threads).
mcstore01-02: 300TB storage servers hosting the network filesystems.
mcmgt01: Login node to Slurm cluster.

Access to nodes

The testbed operates under a private network. To connect to a node in the network, you first need to be connected to our VPN server. If you haven’t done so, learn how to connect the testbed VPN.

Once you can connect to the testbed, the next step is to configure your local SSH setup. You will use SSH to connect to the nodes. To use SSH, you need to have a local SSH key-pair and share the public key with us during the registration. Your account will be authenticated using this key pair. If you haven’t done so, please share a public SSH key with us before you proceed.

After you share the public SSH key, configure your local SSH setup with the following settings.

Add these lines to ~/.ssh/config and fill in your KAUST_username:

Host mcnode* mcmgt01
   ForwardAgent yes
   StrictHostKeyChecking no
   UserKnownHostsFile=/dev/null

Host *
  User <KAUST_Username>

Add these lines to /etc/hosts (you will need sudo):

172.18.0.10 mcmgt01

172.18.0.12 mcnode01
172.18.0.13 mcnode02
172.18.0.14 mcnode03
172.18.0.15 mcnode04
172.18.0.16 mcnode05
172.18.0.17 mcnode06
172.18.0.18 mcnode07
172.18.0.19 mcnode08
172.18.0.20 mcnode09
172.18.0.21 mcnode10
172.18.0.22 mcnode11
172.18.0.23 mcnode12
172.18.0.24 mcnode13
172.18.0.25 mcnode14
172.18.0.26 mcnode15
172.18.0.27 mcnode16
172.18.0.28 mcnode17
172.18.0.29 mcnode18
172.18.0.30 mcnode19
172.18.0.31 mcnode20
172.18.0.32 mcnode21
172.18.0.33 mcnode22
172.18.0.34 mcnode23
172.18.0.35 mcnode24
172.18.0.36 mcnode25
172.18.0.37 mcnode26
172.18.0.38 mcnode27
172.18.0.39 mcnode28
172.18.0.40 mcnode29
172.18.0.41 mcnode30
172.18.0.42 mcnode31
172.18.0.43 mcnode32
172.18.0.44 mcnode33
172.18.0.45 mcnode34
172.18.0.46 mcnode35
172.18.0.47 mcnode36
172.18.0.48 mcnode37
172.18.0.49 mcnode38
172.18.0.50 mcnode39
172.18.0.51 mcnode40
172.18.0.52 mcnode41
172.18.0.53 mcnode42
172.18.0.54 mcnode43
172.18.0.55 mcnode44

172.18.0.98 mcstore01
172.18.0.99 mcstore02

After you performed the configuration and your public SSH key has been configured in the cluster, confirm that your setup is working by trying to connect to mcmgt01.

ssh mcmgt01

If you were unable to connect, please reach out. Your account might not have been configured yet.

User Accounts

KAUST user

The nodes in the cluster are configured to use your KAUST user account. However, you can’t use your password to authenticate. Instead, authentication is only performed through SSH keys.

Once you login to a node you will notice you have your own home folder in /home/<KAUST_username>. This folder is in a network file system present on all of the nodes. This means you will see the same content of your home folder across all the nodes.

By default, you won’t have sudo access in the testbed nodes. If you need to install custom software, see the software installation section.

Ubuntu user

Outside of the KAUST accounts, the nodes have a local ubuntu user. This user is useful in case we can’t authenticate using the KAUST network. The SSH authentication to the ubuntu user is done through an SSH key, just like the other user accounts. Usually, only the testbed administrators have access to this ubuntu account.

The ubuntu user has sudo previledges without needing a password.

The home folder of ubuntu user is in /local/ubuntu. This path is not part of the network file system; it is part of the local filesystem. Any modifications to this local folder on a given node are not reflected on the other nodes.

In case you need to use this user, we like to “enforce” a simple netiquette. If you need to use the local filesystem, please create a subfolder as /local/ubuntu/<your KAUST username> and have the habit of working inside that folder. Also, do not delete files that others have created.

Storage

The nodes in the testbed have network and local storage:

Network storage is very reliable and it is mounted across all machines.
Local storage is faster but any data on it should be considered ephemeral.

Even though network storage is reliable, always backup your very important files externally!

Network storage

We have set up a file server on mcstore01. The file server currently has three shares:

/home/ - A 112 TB volume, in RAIDZ3 (tolerates up to three disk failures)
Hosts all the users’ home folders.
/data/fat/ - A 65 TB volume, in RAIDZ2 (tolerates up to two disk failures)
Useful for projects with multiple users. Slightly faster than /home/.
/data/secure/ - A read-only 14 TB volume, in MIRROR (tolerates one disk failure)
Holds immutable data like drivers and datasets.

By default, you should use your /home/KAUST_username/ directory to store data. The storage associated with this share is very reliable.

The /data/fat/ share should be used for projects with multiple users. As these projects come up, we will create a subdirectory under this share and give access permissions to members of that project.

Finally, you should not regularly need to change /data/secure. This is a read-only file system that is meant for things that are immutable like the Mellanox drivers, CUDA drivers, archived containers and datasets.

Keep in mind that these are network file systems and they are far less performant than a local file system.

Local storage

For situations where disk I/O is a concern, the superior performance of local storage might be preferred over network storage.

Local storage is available on nodes through extra disks (non-boot disks). Typically, the extra disks are exposed in /mnt/scratch or /mnt/data. Depending on your needs, we can accommodate various storage configurations.

Note that data stored on local storage is not backed up and should be considered ephemeral.

Software installation

To ensure a consistent experience across our cluster, the nodes in the testbed have the same software installed. If you want to install custom software for yourself, you can do it under Conda environments.

Conda environments

Conda environments are great for two main reasons:

Conda allows you to install many types of packages, including python packages and system packages (e.g. the ones you would install with apt install) under a user’s local folder. This allows the underlying system to stay unmodified for other users. It also avoids the need for sudo. As an example, Conda makes it very easy to install pytorch and mpi.
Environments are reproducible, so they can be used across many setups. This makes it easy to clone the environment between our cluster and IBEX.

You can learn to use Conda environments by following this tutorial.
Conda packages can be browsed on the anaconda website.

Conda environments are great but the conda command is quite slow to install packages. Mamba offers the mamba command, a faster drop in replacement for conda. mamba is fully compatible with conda:

mamba create -n env
mamba activate env
mamba install package

We recommend you install Mamba. It can be installed with:

curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
bash Miniforge3-$(uname)-$(uname -m).sh

Note: this mamba installation also installs conda, so you can still use conda if you want.

Conda limitations

Sometimes you might find that the software you want to install is not available as a conda package. If this happens, reach out. We can help you install the software under your home folder, or if we think it’s worth it, directly install the software across all the nodes.

Requesting sudo access

For some work, sudo access is really required. For example, if it frequently requires configuring the network of a node. In this case you will be granted sudo permissions. Once you no longer require the node, we will remove your sudo access and revert the node to its standard state.

Networking

The testbed is permeated with the following networks:

SSH network
ILO network
Storage network
Backend network
Programmable switch network

Nodes are normally accessed through the SSH network. This network has speeds from 1Gbps or 10Gbps depending on the available network interface ports at the node.

When nodes are unresponsive, we can use the ILO out-of-band management network to address the issue. Check Troubleshooting for more information on how to deal with these issues.

The storage network is a dedicated 100Gbps network between nodes and shared storage. This network ensures high transfer speeds between the nodes and the storage system.

The backend network interconnects all the nodes through a high speed link (100Gbps). This network should be used when running programs that involve communication between nodes. For example, for distributed training of deep learning models.

Finally, the programmable switch network is used to interact with tofino switches. Please do not attempt to use this network or these switches unless you are working on a project that requires them and you know how to use the P4 language.

Outside of the previous networks, the testbed also supports dedicated custom network topologies. If your usecase requires it, please reach out.

Network configuration

In the IPs below, substitute X with the node number (e.g., for mcnode04, X=4) and Y for the switch number (e.g., for mcswitch05, Y=5).

SSH network - 1/10Gbps
- Subnet: 172.18.0.0/24
- IP: 172.18.0.(X+11) (nodes)
- IP: 172.18.0.24Y (switches)
ILO network
- Subnet: 172.17.0.0/24
- IP: 172.17.0.(X+11)
Storage network - 100Gbps
- Interface name: store
- Subnet: 172.20.0.0/24
- IP: 172.20.0.X
Backend network - 100Gbps
- Interface name: fabric
- Subnet: 10.200.0.0/24
- IP: 10.200.0.X
Programmable switch network. Only if you need the programmable switch.
- Subnet: 11.0.0.0/24
- IP: 11.0.0.2X

Network connectivity

The network connectivity in the ROCS testbed can be found here.

Coordination

The testbed is a shared resource and this requires some coordination. We use a Trello board to allocate resources to users.

When you require access to machines for your research, reach out and specify the resources you need. For example, you might need a node with a V100 GPU. Based on your requirements, we will consider the available nodes and the possibility of reassigning nodes to obtain a setup that match your requirements. Once you no longer require those resources, please let us know so we can free them for another user.

This requires a bit of discipline but it is a very simple mechanism and we hope it can continue to work.

Slurm cluster

To increase resource usage, a large portion of the testbed’s nodes with GPUs has been dedicated to a Slurm cluster. If you require a GPU for your work, try to use the Slurm setup.

Check the slurm environment page to learn how to use the Slurm setup.

Docker

Docker is installed across all the nodes of the testbed. To use docker, you need to be part of the local docker group of that particular node. Reach out if you need to be added to a docker group.

Current limitations

Volumes cannot be mounted under network file systems (see network storage). Instead, if you need to mount a volume, this should be done under local storage. For docker volumes you can use /docker_home/<KAUST_username>/. Reach out if you need to use this folder.
The docker command cannot be used directly through a slurm job. If you want to use Docker with slurm, create an interactive job and keep it open. While that job is running, SSH into the node where the job is running from your local machine, and run your docker command there.

Troubleshooting

Things break. Both Hardware and Software are not perfect and sometimes they stop working as intended. The most common issues are unresponsive or slow machines.

IMPORTANT: We want to give users to possibility to troubleshoot their own issues, but we need to be aware of the actions that are being performed in the testbed. Don’t perform any administrative modification without checking with us.

To address unresponsive machines we might need to reboot the node through the ILO. Check Troubleshooting unresponsive node.

Updates

Notification events and further communication about the testbed is done via a Slack channel. Everyone in the group needs to be included in the rocs-testbed channel under the Slack instance for The ML Hub.

Administration software

In case you’re curious, you can find out the software being used to administrate the testbed under the Administration page.

Note: We are using Netdata to monitor node statistics. Netdata runs as a service on each node. Users can request to stop this service in case it impacts performance-oriented experiments.

Other

Any questions? Just ask! Especially when you are not sure.

Last updated on Oct 3, 2024