How to create an HPC cluster with the Slurm scheduler on Google Cloud Platform

Generating SSH keys for specific accounts (from a Linux machine)

  • Log in to the Linux machine console.
  • Run the following command to generate the SSH key pair:
ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
Note 1: Replace KEY_FILENAME with the actual private key file name
Note 2: Replace USERNAME with the user who will use this SSH key to log in
  • Change the permissions of the SSH private key using the command below:
chmod 400 ~/.ssh/[KEY_FILENAME]
Note: Replace KEY_FILENAME with the actual private key file name
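  • For example, a 4096-bit key pair for a hypothetical user alice, stored as ~/.ssh/gcp_hpc_key (both names are placeholders only), would be created as follows:
ssh-keygen -t rsa -b 4096 -f ~/.ssh/gcp_hpc_key -C alice
chmod 400 ~/.ssh/gcp_hpc_key
Note: The matching public key is written to ~/.ssh/gcp_hpc_key.pub and is used in the Metadata section below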

Generating SSH keys for specific accounts (from a Windows machine)

  • Download puttygen.exe from:
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
  • Run the puttygen.exe
  • Click Generate and follow the on-screen instructions to generate a new key.
Note: Make sure you create keys with at least 2048 bits
  • In the Key comment section, replace the existing text with the username who will use this key to log in to the VM instance.
  • Click Save private key to write your private key to a file with a .ppk extension.
  • Click Save public key to write your public key to a file for later use.

Configure Metadata settings for the VM instances

  • From the upper left pane, click on “Compute Engine”
  • From the left pane, click on “Metadata”
  • From the main pane, under “Metadata” click on Edit -> add new key:
  • Key: enable-oslogin
  • Value: false
  • Click on Save
  • Open the previously created public key file (the .pub file created on Linux, or the public key file saved from PuTTYgen on Windows) using a text editor and copy its entire content into memory.
  • From the main pane, under “SSH Keys” -> click on Edit -> click on Add item -> paste the previously copied public key content into the free text field labeled “Enter entire key data” -> click on Save
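  • Alternatively, the same metadata can be set from the Cloud SDK instead of the console. A minimal sketch, assuming gcloud is already initialized and the public key was generated earlier as ~/.ssh/gcp_hpc_key.pub for the user alice (both are placeholders):
gcloud compute project-info add-metadata --metadata enable-oslogin=false
echo "alice:$(cat ~/.ssh/gcp_hpc_key.pub)" > ssh_keys.txt
gcloud compute project-info add-metadata --metadata-from-file ssh-keys=ssh_keys.txt
Note: The last command replaces any existing project-wide ssh-keys metadata, so include every required key (one "username:key" line each) in the file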

Granting OS Login External User Access (Non-IUCC users)

In case you are trying to access your GCP project using a Google G Suite account belonging to your university, and the GCP project was created by IUCC, contact CloudSupport@iucc.ac.il and ask to have your Google G Suite account added to the relevant university access group within the CSU.IUCC.AC.IL Google G Suite organization.

Google Cloud SDK installation phase

  • Log in to a machine using a privileged account
  • Install Google Cloud SDK tools.
  • Linux (Debian / Ubuntu):
https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu
  • Linux (CentOS / RedHat):
https://cloud.google.com/sdk/docs/quickstart-redhat-centos
  • Windows:
https://cloud.google.com/sdk/docs/quickstart-windows
  • Run the following from the command prompt to initialize the Cloud SDK:
gcloud init --console-only
  • Select a GCP project from the list
  • Select a default Compute region and zone
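  • If you skipped setting defaults during gcloud init, or want to change them later, the default region and zone can also be set directly (the values below are examples only):
gcloud config set compute/region europe-west2
gcloud config set compute/zone europe-west2-a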

Common Google Cloud SDK CLI Commands

  • Log in to Google Cloud Platform:
gcloud auth application-default login --no-launch-browser
Note: The command prompt will show you a link – copy the link into a browser, log in with your GCP project credentials, and copy the verification code from the browser back to the command prompt
  • List all active GCP accounts:
gcloud auth list
  • Change the active account:
gcloud config set account <Account_Name>
Note: Replace <Account_Name> with the target GCP account
  • List all available GCP projects:
gcloud projects list
  • Change the GCP project:
gcloud config set project "<Project_ID>"
Note: Replace <Project_ID> with the target GCP project ID

Git installation phase on Linux (Debian / Ubuntu)

  • Log in to the Linux machine using a privileged account
  • Run the commands below:
sudo apt-get update
sudo apt-get install git
  • Run the command below to view the Git version:
git --version

Git installation phase on Linux (CentOS / RedHat)

  • Log in to the Linux machine using a privileged account
  • Run the commands below:
sudo yum install git
  • Run the command below to view the Git version:
git --version

Git installation phase on Windows

  • Log in to the Windows machine using a privileged account
  • Download and install Git from the link below:
https://gitforwindows.org/

Slurm download and configuration phase

  • Download the Slurm deployment scripts from the command prompt:
  • Linux:
cd ~
git clone https://github.com/SchedMD/slurm-gcp.git
cd slurm-gcp
  • Windows:
cd %UserProfile%
git clone https://github.com/SchedMD/slurm-gcp.git
cd slurm-gcp
  • Edit the slurm-cluster.yaml configuration file and update the following parameters (an illustrative excerpt of the edited file appears at the end of this section):
  • cluster_name – Specify your target HPC cluster name (without spaces)
  • static_node_count – Leave the default value (a minimum of 2 nodes)
  • max_node_count – Set the maximum number of nodes for auto-scaling
  • zone – Specify the target zone to deploy the HPC cluster into (such as europe-west2-a)
Note: The full list of GCP zones can be found on: https://cloud.google.com/compute/docs/regions-zones/
  • region – Specify the target region to deploy the HPC cluster into (such as europe-west2)
Note: The full list of GCP regions can be found on:
https://cloud.google.com/compute/docs/regions-zones/
  • cidr – Set the network range for the HPC cluster so that it falls within the subnet range of the target region (shown in the GCP console -> VPC Network)
For example: In case the region's subnet range is 10.154.0.0/20, update the cidr value inside the configuration file to 10.154.0.0/24
  • controller_machine_type – For a small HPC cluster you may leave the default instance type n1-standard-2 (2 vCPUs and 7.5 GB of memory)
  • compute_machine_type – Specify the instance type for the compute nodes
  • login_machine_type - Leave the default instance type n1-standard-2 (2 vCPUs and 7.5 GB of memory)
  • controller_disk_type - Remove the “#” sign
  • controller_disk_size_gb – Remove the “#” sign and change the value to 10
  • default_users – Specify the email addresses of the GCP users who will be allowed to log in to the HPC cluster and manage it (comma-separated)
  • Run the command below to deploy the new cluster:
gcloud deployment-manager deployments create mycluster --config slurm-cluster.yaml
Note: Replace mycluster with your target HPC cluster name (without spaces)
  • If you are using the Deployment Manager API on a new GCP project, press “Y” to enable the API
  • Document the machine names (especially the controller and login1)
  • Wait for the deployment process to complete (around 10 minutes)
  • Go to the Google Deployment Manager console to view the status of the cluster deployment:
https://console.cloud.google.com/dm/deployments
  • Run the command below to log in to the HPC cluster login1 machine:
gcloud compute ssh google1-login1 --zone=<ZONE>
Note 1: Replace google1-login1 with the name of the HPC cluster login machine
Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
  • On first login, press “Y” to continue
  • Wait for the Slurm deployment to complete (might take around 10 minutes)
  • If you are logged in to the login1 machine, you will see the message “Slurm login daemon installation complete” once the cluster installation finishes
Note: Once the Slurm cluster deployment completes, you might need to log out and log back in to the login1 server for the mount of /home to take effect
  • Run the command below to check the status of the Slurm cluster:
sinfo
  • Log off the login1 machine:
exit
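  • For reference, after editing, the parameter lines of slurm-cluster.yaml might look similar to the excerpt below (all values are examples only, and the exact field names and layout may differ between slurm-gcp releases):
cluster_name            : mycluster
static_node_count       : 2
max_node_count          : 10
zone                    : europe-west2-a
region                  : europe-west2
cidr                    : 10.154.0.0/24
controller_machine_type : n1-standard-2
compute_machine_type    : n1-standard-2
login_machine_type      : n1-standard-2
controller_disk_type    : pd-standard
controller_disk_size_gb : 10
default_users           : user1@example.com, user2@example.com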

Granting additional users the ability to log in to the login1 machine

  • Run the command below to log in to the HPC cluster login1 machine:
gcloud compute ssh google1-login1 --zone=<ZONE>
Note 1: Replace google1-login1 with the name of the HPC cluster login machine
Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
  • Run the command below to add a new group:
sudo groupadd mygroup
Note: Change mygroup to the target group
  • Run the command below to add a new user:
sudo useradd -g mygroup myusername
Note 1: Change mygroup to the target group
Note 2: Change myusername to the target username
  • Create the following folder:
sudo mkdir /home/myusername/.ssh
Note: Change myusername to the target username
  • Change the permissions on the .ssh folder:
sudo chmod 700 /home/myusername/.ssh/
Note: Change myusername to the target username
  • Using vi (or another text editor), create the file /home/myusername/.ssh/authorized_keys
Note: Change myusername to the target username
  • Paste the content of the previously created public key into the authorized_keys file
  • Change the permissions on the authorized_keys file:
sudo chmod 600 /home/myusername/.ssh/authorized_keys
Note: Change myusername to the target username
  • Change the ownership of the folder below:
sudo chown myusername:myusername -R /home/myusername/.ssh
Note: Change myusername to the target username
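  • Once the key is in place, the new user can connect to the login1 machine directly with the matching private key, for example (assuming SSH access to the login node is allowed from your network):
gcloud compute instances list
ssh -i ~/.ssh/[KEY_FILENAME] myusername@<LOGIN1_EXTERNAL_IP>
Note 1: Replace [KEY_FILENAME] with the private key matching the public key pasted above, myusername with the new username, and <LOGIN1_EXTERNAL_IP> with the external IP of the login1 machine shown by the first command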

Expanding the controller machine disk

  • Log in to the GCP VM console:
https://console.cloud.google.com/compute/instances
  • Locate the controller VM and select it -> from the upper pane, click on Stop -> on the warning page, click on Stop
  • Once the controller machine has stopped, from the left pane, click on Disks -> select the controller machine disk -> click on Edit -> change the size -> click on Save
  • From the left pane, click on VM instances -> select the controller VM -> click on Start -> on the warning page, click on Start
  • Run the command below to log in to the HPC cluster controller machine:
gcloud compute ssh google1-controller --zone=<ZONE>
Note 1: Replace google1-controller with the name of the HPC cluster controller machine
Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
  • Run the command below to make sure the disk was expanded (if the size did not change, see the note at the end of this section):
df -h
  • Log off the controller machine:
exit
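  • On most images the root filesystem is grown automatically on the next boot after the disk is resized. If df -h still shows the old size, the partition and filesystem may need to be grown manually; a sketch, assuming the root disk is /dev/sda with its first partition mounted on / (device names and filesystem type may differ on your image):
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
Note 1: growpart is provided by the cloud-utils (Debian/Ubuntu) or cloud-utils-growpart (CentOS/RedHat) package
Note 2: For an xfs root filesystem, run sudo xfs_growfs / instead of resize2fs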

Installing software

  • Log in to the controller machine:
gcloud compute ssh google1-controller --zone=<ZONE>
Note 1: Replace google1-controller with the name of the HPC cluster controller machine
Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
  • Install all software into /apps (see the example at the end of this section)
  • Log off the controller machine:
exit
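  • In this deployment /apps is shared across the cluster nodes (like /home), so software installed there on the controller becomes available on the login and compute nodes. As an illustrative sketch only (the application name, version and source directory are placeholders), a typical source build installed under /apps looks like this:
sudo mkdir -p /apps/myapp/1.0
cd ~/myapp-1.0
./configure --prefix=/apps/myapp/1.0
make
sudo make install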

Expanding the default GCP quota limits (Maximum number of CPUs)

  • Open the quota page:
https://console.cloud.google.com/iam-admin/quotas
  • From the main pane, locate Compute Engine API CPUs in the target location (region) where the Slurm cluster is deployed and select it
  • From the upper pane, click on Edit Quotas
  • From the newly opened right pane, specify your contact details (Name, Email and mobile phone) -> click Next
  • Specify the new quota limit and the reason for the request
  • Click Done
  • Click on Submit request
  • Wait until you receive a confirmation email from Google Cloud support that the quota limit was increased, before you proceed with the Lustre deployment phase

Expanding the default GCP quota limits (Maximum capacity of persistent disk)

  • Open the quota page:
https://console.cloud.google.com/iam-admin/quotas
  • From the main pane, locate Compute Engine API Persistent Disk Standard (GB) in the target location (region) where the Slurm cluster is deployed and select it
  • From the upper pane, click on Edit Quotas
  • From the newly opened right pane, specify your contact details (Name, Email and mobile phone) -> click Next
  • Specify the new quota limit and the reason for the request
  • Click Done
  • Click on Submit request
  • Wait until you receive a confirmation email from Google Cloud support that the quota limit was increased, before you proceed with the Lustre deployment phase

Deleting the HPC cluster

  • Run the command below to remove the HPC cluster deployment:
gcloud deployment-manager --project=[PROJECT_ID] deployments delete mycluster
Note 1: Replace [PROJECT_ID] with the target GCP project ID
Note 2: Replace mycluster with your target HPC cluster name (without spaces)
  • Delete the SSH keys (from a Linux machine):
rm -rf ~/.ssh/google_compute_engine
rm -rf ~/.ssh/google_compute_engine.pub
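  • Tip: the deployment name used above can be confirmed at any time by listing the existing deployments:
gcloud deployment-manager deployments list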

References

  • Deploy an Auto-Scaling HPC Cluster with Slurm
https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0
  • Slurm on Google Cloud Platform
https://github.com/SchedMD/slurm-gcp
  • Easy HPC clusters on GCP with Slurm
https://cloud.google.com/blog/products/gcp/easy-hpc-clusters-on-gcp-with-slurm
  • Provisioning a slurm cluster
http://gcpexamplesforresearch.web.unc.edu/2019/01/slurm-cluster/
  • Google Cloud HPC Day
https://hackmd.io/@mB_F56f1R86PnZTa2vK3oQ/SksCSxeOE?type=view
  • Introducing Lustre file system Cloud Deployment Manager scripts
https://cloud.google.com/blog/products/storage-data-transfer/introducing-lustre-file-system-cloud-deployment-manager-scripts
  • Lustre Deployment Manager Script
https://github.com/GoogleCloudPlatform/deploymentmanager-samples/tree/master/community/lustre