How to create an HPC Cluster with the Slurm scheduler on Google Cloud Platform

Generating SSH keys for specific accounts (from Linux machine)

  • Login to the Linux machine console.
  • Run the following command to generate the SSH key pair:
ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
Note 1: Replace [KEY_FILENAME] with the actual key file name
Note 2: Replace [USERNAME] with the user who will use this SSH key to log in
  • Change the permissions of the SSH private key using the command below:
chmod 400 ~/.ssh/[KEY_FILENAME]
Note: Replace [KEY_FILENAME] with the actual private key file name
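  • For example, with a hypothetical key file name and username (neither value comes from this guide):
ssh-keygen -t rsa -f ~/.ssh/hpc_key -C alice    # "hpc_key" and "alice" are placeholders
chmod 400 ~/.ssh/hpc_key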

Generating SSH keys for specific accounts (from Windows machine)

  • Download puttygen.exe from:
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
  • Run the puttygen.exe
  • Click Generate and follow the on-screen instructions to generate a new key.
Note: Make sure you create keys with at least 2048 bits
  • In the Key comment section, replace the existing text with the username who will use this key to login to the VM instance.
  • Click Save private key to write your private key to a file with a .ppk extension.
  • Click Save public key to write your public key to a file for later use.

Configure Metadata settings for the VM instances

  • From the upper left pane, click on “Compute Engine”
  • From the left pane, click on “Metadata”
  • From the main pane, under “Metadata” click on Edit -> add new key:
  • Key: enable-oslogin
  • Value: false
  • Click on Save
  • Open the previously created public key (usually it has no file extension) using a text editor and copy its entire content into memory.
  • From the main pane, under “SSH Keys” -> click on Edit -> click on Add item -> paste the content of the public key previously created into the free text field labeled “Enter entire key data” -> click on Save
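  • For a key generated with ssh-keygen as described above, the pasted key data would look similar to the line below (the key body is abbreviated here, and alice is a hypothetical username that must match the key comment chosen earlier):
ssh-rsa AAAAB3NzaC1yc2E...<abbreviated key body>... alice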

Google Cloud SDK installation phase

  • Login to a machine using a privileged account
  • Install Google Cloud SDK tools.
  • Linux (Debian / Ubuntu):
https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu
  • Linux (CentOS / RedHat):
https://cloud.google.com/sdk/docs/quickstart-redhat-centos
  • Windows:
https://cloud.google.com/sdk/docs/quickstart-windows
  • Run the following from command prompt to initialize the Cloud SDK:
gcloud init --console-only
  • Select a GCP project from the list
  • Select a default Compute region and zone
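  • To verify the selections afterwards, display the active configuration:
gcloud config list    # shows the active account, project, and default region/zone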

Common Google Cloud SDK CLI Commands

  • Login to Google Cloud Platform:
gcloud auth application-default login --no-launch-browser
Note: The command prompt will show you a link – copy the link to a new browser, login with your GCP project credentials and copy the verification code from the browser to the command prompt
  • List all active GCP accounts:
gcloud auth list
  • Change the active account:
gcloud config set account <Account_Name>
Note: Replace <Account_Name> with the target GCP account
  • List all available GCP projects:
gcloud projects list
  • Change the GCP project:
gcloud config set project "<Project_ID>"
Note: Replace <Project_ID> with the target GCP project ID
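  • Example (the account and project ID below are hypothetical placeholders):
gcloud config set account alice@example.com
gcloud config set project "my-hpc-project"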

Git installation phase on Linux (Debian / Ubuntu)

  • Login to the Linux machine using a privileged account
  • Run the commands below:
sudo apt-get update
sudo apt-get install git-core
  • Run the command below to view the Git version:
git --version

Git installation phase on Linux (CentOS / RedHat)

  • Login to the Linux machine using a privileged account
  • Run the commands below:
sudo yum install git
  • Run the command below to view the Git version:
git --version

Git installation phase on Windows

  • Login to the Windows machine using a privileged account
  • Download and install Git from the link below:
https://gitforwindows.org/

Slurm download and configuration phase

  • Download the Slurm deployment from command prompt:
  • Linux:
cd ~
git clone https://github.com/SchedMD/slurm-gcp.git
cd slurm-gcp
  • Windows:
cd %UserProfile%
git clone https://github.com/SchedMD/slurm-gcp.git
cd slurm-gcp
  • Edit the slurm-cluster.yaml configuration file and update the following parameters (an illustrative fragment of the edited file appears after this list):
  • cluster_name – Specify your target HPC cluster name (without spaces)
  • static_node_count – Leave the default (a minimum of 2 nodes)
  • max_node_count – Set the maximum number of nodes for auto-scaling
  • zone – Specify the target zone to deploy the HPC cluster into (such as europe-west2-a)
Note: The full list of GCP zones can be found at: https://cloud.google.com/compute/docs/regions-zones/
  • region – Specify the target region to deploy the HPC cluster into (such as europe-west2)
Note: The full list of GCP regions can be found at: https://cloud.google.com/compute/docs/regions-zones/
  • cidr – Update the target network range for the HPC cluster according to your needs
  • controller_machine_type – For a small HPC cluster you may leave the default instance type n1-standard-2 (2 vCPUs and 7.5 GB of memory)
  • compute_machine_type – Specify the instance type for the compute nodes
  • login_machine_type – Leave the default instance type n1-standard-2 (2 vCPUs and 7.5 GB of memory)
  • controller_disk_type – Remove the “#” sign to uncomment the line
  • controller_disk_size_gb – Remove the “#” sign and change the value to 10
  • default_users – Specify the email addresses of the GCP users who will be allowed to log in to the HPC cluster and manage it (separated by commas)
  • Run the command below to deploy the new cluster:
gcloud deployment-manager deployments create mycluster --config slurm-cluster.yaml
Note: Replace mycluster with your target HPC cluster name (without spaces)
  • If this is the first time the Deployment Manager API is used in this GCP project, press “Y” when prompted to enable it
  • Document the machine names (especially the controller and login1)
  • Wait for the deployment process to complete (around 10 minutes)
  • Go to the Google Deployment manager console to view the status of the cluster deployment:
https://console.cloud.google.com/dm/deployments
  • Run the command below to login to the HPC cluster login1 machine:
gcloud compute ssh google1-login1 --zone=<ZONE>
Note 1: Replace google1-login1 with the name of the HPC cluster login machine
Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
  • On first login, press “Y” to continue
  • If you are logged in to the login1 machine, you will see the message “Slurm login daemon installation complete” once the cluster installation completes
  • Run the command below to check the status of the Slurm cluster:
sinfo
  • Logoff the login1 machine:
exit
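  • For illustration only, the edited parameters in slurm-cluster.yaml might look like the fragment below; all values are placeholders chosen for this example, and the surrounding layout of the file may differ between slurm-gcp versions:
cluster_name: google1
static_node_count: 2
max_node_count: 10
zone: europe-west2-a
region: europe-west2
cidr: 10.10.0.0/16
controller_machine_type: n1-standard-2
compute_machine_type: n1-standard-2
login_machine_type: n1-standard-2
controller_disk_type: pd-standard        # "#" removed to uncomment the line
controller_disk_size_gb: 10              # "#" removed and value changed to 10
default_users: alice@example.com, bob@example.com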

Giving additional users the ability to login to the login1 machine

  • Run the command below to login to the HPC cluster login1 machine:
gcloud compute ssh google1-login1 --zone=<ZONE>
Note 1: Replace google1-login1 with the name of the HPC cluster login machine
Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
  • Run the command below to add a new group:
sudo groupadd mygroup
Note: Change mygroup to the target group
  • Run the command below to add a new user:
sudo useradd -g mygroup myusername
Note 1: Change mygroup to the target group
Note 2: Change myusername to the target username
  • Create the following folder:
sudo mkdir /home/myusername/.ssh
Note: Change myusername to the target username
  • Change the permissions on the .ssh folder:
sudo chmod 700 /home/myusername/.ssh/
Note: Change myusername to the target username
  • Using vi, create the file /home/myusername/.ssh/authorized_keys
Note: Change myusername to the target username
  • Paste the content of the previously created public key into the authorized_keys file
  • Change the permissions on the authorized_keys file:
sudo chmod 600 /home/myusername/.ssh/authorized_keys
Note: Change myusername to the target username
  • Change the ownership of the folder below:
sudo chown myusername:myusername -R /home/myusername/.ssh
Note: Change myusername to the target username
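  • Putting the steps above together for a hypothetical group hpcusers and user alice (both names are placeholders):
sudo groupadd hpcusers
sudo useradd -m -g hpcusers alice          # -m ensures the home directory exists
sudo mkdir /home/alice/.ssh
sudo chmod 700 /home/alice/.ssh/
sudo vi /home/alice/.ssh/authorized_keys   # paste the user's public key, then save
sudo chmod 600 /home/alice/.ssh/authorized_keys
sudo chown alice:alice -R /home/alice/.ssh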

Expanding the controller machine disk

  • Login to the GCP VM console:
https://console.cloud.google.com/compute/instances
  • Locate the controller VM and select it -> from the upper pane, click on Stop -> on the warning page, click on Stop
  • Once the controller machine has stopped, from the left pane, click on Disks -> select the controller machine disk -> click on Edit -> change the size -> click on Save
  • From the left pane, click on VM instances -> select the controller VM -> click on Start -> on the warning page, click on Start
  • Run the command below to login to the HPC cluster controller machine:
gcloud compute ssh google1-controller --zone=<ZONE>
Note 1: Replace google1-controller with the name of the HPC cluster controller machine
Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
  • Run the command below to make sure the disk was expanded (see the note after this list if the new size is not reflected):
df -h
  • Logoff the controller machine
exit
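  • Note: If df -h does not show the new size, the root filesystem may still need to be grown manually. On a CentOS-based controller image with an xfs root partition on /dev/sda1 (an assumption about the deployed image), something like the following would work:
sudo yum install -y cloud-utils-growpart   # provides the growpart tool
sudo growpart /dev/sda 1                   # extend partition 1 to fill the resized disk
sudo xfs_growfs /                          # grow the xfs filesystem to the new partition size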

Installing software

  • Login to the controller machine:
gcloud compute ssh google1-controller --zone=<ZONE>
Note 1: Replace google1-controller with the name of the HPC cluster controller machine
Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
  • Install all software into /apps (a minimal sketch follows this list)
  • Logoff the controller machine:
exit
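  • A minimal sketch, assuming /apps is the controller-hosted directory that the slurm-gcp deployment shares with the login and compute nodes ("myapp" is a placeholder):
sudo mkdir -p /apps/myapp/1.0    # hypothetical application directory on the shared volume
Running ls /apps from the login1 machine afterwards should show the same directory, confirming that the installed software is visible cluster-wide.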

Deleting an HPC cluster

  • Run the command below to remove the HPC cluster deployment:
gcloud deployment-manager --project=[PROJECT_ID] deployments delete mycluster
Note 1: Replace [PROJECT_ID] with the target GCP project ID
Note 2: Replace mycluster with your target HPC cluster name (without spaces)
  • Delete the SSH keys (from Linux machine):
rm -rf ~/.ssh/google_compute_engine
rm -rf ~/.ssh/google_compute_engine.pub

References

  • Deploy an Auto-Scaling HPC Cluster with Slurm
https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0
  • Slurm on Google Cloud Platform
https://github.com/SchedMD/slurm-gcp
  • Easy HPC clusters on GCP with Slurm
https://cloud.google.com/blog/products/gcp/easy-hpc-clusters-on-gcp-with-slurm
  • Provisioning a slurm cluster
http://gcpexamplesforresearch.web.unc.edu/2019/01/slurm-cluster/
  • Google Cloud HPC Day
https://hackmd.io/@mB_F56f1R86PnZTa2vK3oQ/SksCSxeOE?type=view
  • Introducing Lustre file system Cloud Deployment Manager scripts
https://cloud.google.com/blog/products/storage-data-transfer/introducing-lustre-file-system-cloud-deployment-manager-scripts
  • Lustre Deployment Manager Script
https://github.com/GoogleCloudPlatform/deploymentmanager-samples/tree/master/community/lustre