Administration

Here you can find the documentation for the administrative side of the ROCS testbed.

Management ILOs

Every node has an out-of-band management ILO. You can connect to the node's management interface via https://172.17.0.X, where X is the last octet of the node's main IP address in the 172.18.0.* subnet. For example, 172.18.0.12 (mcnode01) has its ILO at 172.17.0.12.

With the ILO you can perform power management actions, view the virtual screen of the machine, and carry out a number of other administrative tasks. This is useful when, for instance, the kernel crashes and you need to reboot the machine, or when you need to reconfigure something in the BIOS.

The credentials are admin / asusadmin for ASUS nodes and ADMIN / ADMIN for most SuperMicro nodes. For mcnode41-44, the credentials are ADMIN / AdminPassword0.
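If the ILO has IPMI over LAN enabled, power actions can also be scripted with ipmitool instead of the web UI. This is a minimal sketch using the example node above; IPMI-over-LAN availability on our ILOs is an assumption, and the password placeholder should be replaced with the credentials for that node type.

    # Query and cycle power via mcnode01's ILO (adjust address and credentials per node)
    ipmitool -I lanplus -H 172.17.0.12 -U ADMIN -P '<ilo-password>' chassis power status
    ipmitool -I lanplus -H 172.17.0.12 -U ADMIN -P '<ilo-password>' chassis power cycle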

Provisioning and deployment

The nodes are managed via Metal As A Service (MAAS), which lets you treat physical servers like virtual machine instances in the cloud. All servers are listed in MAAS at http://172.18.0.11:5240/MAAS/. We currently run MAAS 3.2.9; the documentation is at https://maas.io/docs. The credentials are maasadmin / Apple0r@ng3.

The main operation you will need from MAAS is node deployment. It is not a very frequent operation; ask for a demonstration if needed, or familiarize yourself with it by reading the documentation linked above.
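Deployment can also be done from the MAAS CLI instead of the web UI. The following is a minimal sketch; the profile name, API key, and system ID are placeholders (the API key comes from your MAAS user preferences page, the system ID from the machine listing).

    # Log in once, list machines, then allocate and deploy one of them
    maas login rocs http://172.18.0.11:5240/MAAS/ "$API_KEY"
    maas rocs machines read | jq -r '.[] | [.hostname, .system_id, .status_name] | @tsv'
    maas rocs machines allocate system_id=$SYSTEM_ID
    maas rocs machine deploy $SYSTEM_ID distro_series=jammy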

Enlisting a new node in MAAS

  1. Change the boot order so that the node boots over PXE.
  2. MAAS will pick the node up automatically.
  3. Configure the node inside MAAS.
  4. Run the Ansible script (see the sketch below).
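The last step typically looks like the following; the playbook and inventory file names are assumptions, and the actual files live in the shared folder listed under "Configuration - Ansible" below.

    # Run the base setup against the newly enlisted node only
    ansible-playbook -i inventory.ini base_setup.yml --limit mcnodeXX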

Configuration - Ansible

The configuration of the testbed is performed via Ansible. Ansible allows us to write files called playbooks where we specify a set of commands to be executed over a set of nodes.

We mostly use Ansible for large configuration tasks, such as setting up Slurm, but it is also very useful for quickly fixing issues across multiple nodes, as shown in the sketch below.
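For the quick-fix case, ad-hoc commands are often enough. A minimal sketch, assuming an inventory file and a host group named mcnodes (both names are assumptions):

    # Check connectivity to all nodes in the group
    ansible mcnodes -i inventory.ini -m ping
    # Restart a service on all of them (requires sudo, hence -b)
    ansible mcnodes -i inventory.ini -b -m service -a "name=slurmd state=restarted"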

Here is the list of configurations we perform using Ansible:

  • Base setup - Configure software and network after fresh OS installation
  • Authentication - Configure nodes to support KAUST authentication
  • User management - Add and remove users from the testbed
  • Slurm configuration - Configure slurm related software and config files

All Ansible playbooks and documentation can be found in the SANDS group shared folder under /sands - Documents/infrastructure/ansible/.

Monitoring

The nodes in the testbed are monitored using Netdata. Netdata runs as a service on every node and sends data to a dashboard hosted at netdata.cloud. You can log in to Netdata using your KAUST email. Once you authenticate, you should see the “ROCS testbed” space on the left side of the screen. If you can’t see it, you may need to be invited.

Netdata is a very comprehensive tool. It has a wide range of default metrics and supports many plugins to complement and customize the monitoring setup.
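If a node stops reporting to the dashboard, checking the local agent on that node is a reasonable first step:

    # Check that the Netdata agent is running and look at its recent logs
    systemctl status netdata
    journalctl -u netdata --since "1 hour ago"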

Enabled plugins

  • nvidia-smi - Collects GPU related metrics, including GPU utilization and memory.
    Enabled on mcnodes with a GPU. Documentation.
  • Ping - Monitors reachability of the ILO and SSH addresses.
    Enabled and configured on mcmgt02. Documentation.
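Plugin configuration lives on the node that runs the collector (mcmgt02 for ping). A sketch, assuming a recent Netdata with the go.d ping collector (the exact config file depends on the installed version):

    # Edit the ping collector job list, then restart the agent to apply it
    sudo /etc/netdata/edit-config go.d/ping.conf
    sudo systemctl restart netdata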

Troubleshooting

High-level suggestions on how to troubleshoot common issues with the nodes.

Unresponsive node

  1. Connect to the Management ILO of the node.
  2. Look for a section called “Remote Control”, then select “iKVM”.
    This opens a virtual console on the machine, as if you were using a locally attached keyboard and monitor.
  3. If even the local console is unresponsive, power cycle the node to reboot it.
  4. Log in as user ubuntu with password ubuntu.
  5. Try to determine the cause of the unresponsiveness (see the sketch after this list):
    1. Check status of the ssh service
    2. Check dmesg
  6. If you can’t resolve the issue, power cycle the node to reboot it.
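A sketch of typical checks from the iKVM console (standard Ubuntu tooling, nothing testbed-specific):

    # Is the SSH daemon running, and what has it logged recently?
    systemctl status ssh
    journalctl -u ssh --since "2 hours ago"
    # Recent kernel messages: OOM kills, hardware or filesystem errors
    dmesg -T | tail -n 50
    # Load, memory, and disk pressure are common culprits
    uptime
    free -h
    df -h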

Node state

Login to Netdata to see collected monitoring metrics.

Current node issues can be found by checking the alerts page.

For past issues, select the appropriate timeframe at the top of the page and go through common problem metrics, including:

  • CPU throttling
  • Low Memory
  • Low Swap memory
  • Low GPU memory
  • Low Disk space

IT

The testbed is supported by three teams. The DataCenter (DC) team assists us with physical installation, topology changes, and installation of components. The IT Networking team is responsible for the network configuration of the management network and the ILO network. The IT Linux team can assist with the management node (mcmgt02). Certain operations require their involvement and can take time.

Location

All hardware sits in the data center in building 14 and currently spans three racks: H4.10, H4.11, and H4.12.
