How to create an HPC cluster with the Slurm scheduler on Google Cloud Platform
Generating SSH keys for specific accounts (from Linux machine)
- Log in to the Linux machine console.
- Run the following command to generate the SSH key pair:
- ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
- Note 1: Replace [KEY_FILENAME] with the actual private key file name
- Note 2: Replace [USERNAME] with the user who will use this SSH key to log in
- Change the permissions of the SSH private key using the command below:
- chmod 400 ~/.ssh/[KEY_FILENAME]
- Note: Replace [KEY_FILENAME] with the actual private key file name
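- For illustration, a concrete run with hypothetical values (hpc-key and alice are placeholders):
- ssh-keygen -t rsa -b 4096 -f ~/.ssh/hpc-key -C alice
- chmod 400 ~/.ssh/hpc-key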
Generating SSH keys for specific accounts (from Windows machine)
- Download puttygen.exe from the PuTTY download page: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
- Run puttygen.exe
- Click Generate and follow the on-screen instructions to generate a new key.
- Note: Make sure you create keys with at least 2048 bits
- In the Key comment section, replace the existing text with the username who will use this key to log in to the VM instance.
- Click Save private key to write your private key to a file with a .ppk extension.
- Click Save public key to write your public key to a file for later use.
Configure Metadata settings for the VM instances
- From the upper left pane, click on “Compute Engine”
- From the left pane, click on “Metadata”
- From the main pane, under “Metadata” click on Edit -> add new key:
- Key: enable-oslogin
- Value: false
- Click on Save
- Open the previously created public key (usually it has no file extension) using a text editor and copy its entire contents to the clipboard.
- From the main pane, under “SSH Keys” -> click on Edit -> click on Add item -> paste the content of the public key previously created into the free text field labeled “Enter entire key data” -> click on Save
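- If you prefer the command line over the console, the same metadata can be set with gcloud (a sketch; mykeys.txt is a hypothetical file containing one line per user in the form USERNAME:ssh-rsa AAAA... comment):
- gcloud compute project-info add-metadata --metadata enable-oslogin=false
- gcloud compute project-info add-metadata --metadata-from-file ssh-keys=mykeys.txt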
Google Cloud SDK installation phase
- Log in to the machine using a privileged account
- Install the Google Cloud SDK tools, following the official instructions for your OS (https://cloud.google.com/sdk/docs):
- Linux (Debian / Ubuntu)
- Linux (CentOS / RedHat)
- Windows
- Run the following from command prompt to initialize the Cloud SDK:
- gcloud init --console-only
- Select a GCP project from the list
- Select a default Compute region and zone
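- To confirm the initialization took effect, inspect the active configuration:
- gcloud config list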
Common Google Cloud SDK CLI Commands
- Login to Google Cloud Platform:
- gcloud auth application-default login --no-launch-browser
- Note: The command prompt will show you a link – copy the link to a new browser, login with your GCP project credentials and copy the verification code from the browser to the command prompt
- List all credentialed GCP accounts (the active one is marked):
- gcloud auth list
- Change the active account:
- gcloud config set account <Account_Name>
- Note: Replace <Account_Name> with the target GCP account
- List all available GCP projects:
- gcloud projects list
- Change the GCP project:
- gcloud config set project "<Project_ID>"
- Note: Replace <Project_ID> with the target GCP project ID
Git installation phase on Linux (Debian / Ubuntu)
- Log in to the Linux machine using a privileged account
- Run the commands below:
- sudo apt-get update
- sudo apt-get install git-core
- Run the command below to view the Git version:
- git --version
Git installation phase on Linux (Centos / RedHat)
- Log in to the Linux machine using a privileged account
- Run the commands below:
- sudo yum install git
- Run the command below to view the Git version:
- git --version
Git installation phase on Windows
- Log in to the Windows machine using a privileged account
- Download and install Git from: https://git-scm.com/download/win
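- After installation, verify from a command prompt:
- git --version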
Slurm download and configuration phase
- Download the Slurm deployment from command prompt:
- Linux:
- cd ~
- git clone https://github.com/SchedMD/slurm-gcp.git
- cd slurm-gcp
- Windows:
- cd %UserProfile%
- git clone https://github.com/SchedMD/slurm-gcp.git
- cd slurm-gcp
- Edit the slurm-cluster.yaml configuration file and update the following parameters (an illustrative snippet follows this list):
- cluster_name – Specify your target HPC cluster name (without spaces)
- static_node_count – Leave the default, a minimum of 2 nodes
- max_node_count – Set the maximum number of nodes for auto-scaling
- zone – Specify the target zone to deploy the HPC cluster in (such as europe-west2-a)
- Note: The full list of GCP zones can be found at: https://cloud.google.com/compute/docs/regions-zones/
- region – Specify the target region to deploy the HPC cluster in (such as europe-west2)
- Note: The full list of GCP regions can be found at: https://cloud.google.com/compute/docs/regions-zones/
- cidr – Update the target network range for the HPC cluster according to your needs
- controller_machine_type – For a small HPC cluster you may leave the default instance type n1-standard-2 (2 vCPUs and 7.5 GB of memory)
- compute_machine_type
- For a small HPC cluster you may leave the default instance type n1-standard-2 (2 vCPUs and 7.5 GB of memory)
- For a large HPC cluster, choose an instance type from the High-CPU family (https://cloud.google.com/compute/docs/machine-types#highcpu) or from the Compute-Optimized VMs (https://cloud.google.com/blog/products/compute/introducing-compute-and-memory-optimized-vms-for-google-compute-engine)
- login_machine_type – Leave the default instance type n1-standard-2 (2 vCPUs and 7.5 GB of memory)
- controller_disk_type – Uncomment by removing the "#" sign
- controller_disk_size_gb – Uncomment by removing the "#" sign and change the value to 10
- default_users – Specify the email addresses of the GCP users who will be allowed to log in to the HPC cluster and manage it (separated by commas)
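- For illustration, the properties section of slurm-cluster.yaml might look like the sketch below after editing (all values are hypothetical, and the surrounding structure is assumed from the slurm-gcp repository, so it may differ between versions):

  resources:
  - name: slurm-cluster
    type: slurm.jinja
    properties:
      cluster_name            : mycluster
      static_node_count       : 2
      max_node_count          : 10
      zone                    : europe-west2-a
      region                  : europe-west2
      cidr                    : 10.10.0.0/16
      controller_machine_type : n1-standard-2
      compute_machine_type    : n1-standard-2
      login_machine_type      : n1-standard-2
      controller_disk_type    : pd-standard
      controller_disk_size_gb : 10
      default_users           : alice@example.com, bob@example.com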
- Run the command below to deploy the new cluster:
- gcloud deployment-manager deployments create mycluster --config slurm-cluster.yaml
- Note: Replace mycluster with your target HPC cluster name (without spaces)
- If you are using the Deployment Manager API on a new GCP project, press “Y” to enable the API
- Document the machine names (especially the controller and login1)
- Wait for the deployment process to complete (around 10 minutes)
- Go to the Deployment Manager console (https://console.cloud.google.com/dm/deployments) to view the status of the cluster deployment
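- Alternatively, check the deployment status from the command line (mycluster is your deployment name):
- gcloud deployment-manager deployments describe mycluster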
- Run the command below to login to the HPC cluster login1 machine:
- gcloud compute ssh google1-login1 --zone=<ZONE>
- Note 1: Replace google1-login1 with the name of the HPC cluster login machine
- Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
- On first login, press “Y” to continue
- If you stay logged in to the login1 machine, you will see the message “Slurm login daemon installation complete” once the cluster installation completes
- Run the command below to check the status of the Slurm cluster:
- sinfo
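- Optionally, while still logged in, submit a quick test job to confirm the scheduler can allocate nodes (standard Slurm commands; the node count is an example, and with auto-scaling the first run may take a few minutes while compute nodes are created):
- srun -N 2 hostname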
- Logoff the login1 machine:
- exit
Granting additional users the ability to log in to the login1 machine
- Run the command below to log in to the HPC cluster login1 machine:
- gcloud compute ssh google1-login1 --zone=<ZONE>
- Note 1: Replace google1-login1 with the name of the HPC cluster login machine
- Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
- Run the command below to add a new group:
- sudo groupadd mygroup
- Note: Change mygroup to the target group
- Run the command below to add a new user (the -m flag creates the user's home directory):
- sudo useradd -m -g mygroup myusername
- Note 1: Change mygroup to the target group
- Note 2: Change myusername to the target username
- Create the following folder:
- sudo mkdir /home/myusername/.ssh
- Note: Change myusername to the target username
- Change the permissions on the .ssh folder:
- sudo chmod 700 /home/myusername/.ssh/
- Note: Change myusername to the target username
- Using vi, create the file /home/myusername/.ssh/authorized_keys
- Note: Change myusername to the target username
- Paste the contents of the previously created public key into the authorized_keys file
- Change the permissions on the authorized_keys file:
- sudo chmod 600 /home/myusername/.ssh/authorized_keys
- Note: Change myusername to the target username
- Change the ownership of the .ssh folder (the user's primary group is mygroup, not myusername):
- sudo chown -R myusername:mygroup /home/myusername/.ssh
- Note: Change myusername and mygroup to the target username and group
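- For convenience, the steps above collected into a single sketch (mygroup, myusername and mykey.pub are placeholders; run on the login1 machine):
- sudo groupadd mygroup
- sudo useradd -m -g mygroup myusername
- sudo mkdir -p /home/myusername/.ssh
- sudo chmod 700 /home/myusername/.ssh
- cat ~/mykey.pub | sudo tee /home/myusername/.ssh/authorized_keys
- sudo chmod 600 /home/myusername/.ssh/authorized_keys
- sudo chown -R myusername:mygroup /home/myusername/.ssh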
Expanding the controller machine disk
- Log in to the GCP VM instances console (https://console.cloud.google.com/compute/instances)
- Locate the controller VM and select it -> from the upper pane, click on Stop -> on the warning page, click on Stop
- Once the controller machine has stopped, from the left pane, click on Disks -> select the controller machine disk -> click on Edit -> change the size -> click on Save
- From the left pane, click on VM instances -> select the controller VM -> click on Start -> on the warning page, click on Start
- Run the command below to log in to the HPC cluster controller machine:
- gcloud compute ssh google1-controller --zone=<ZONE>
- Note 1: Replace google1-controller with the name of the HPC cluster controller machine
- Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
- Run the command below to make sure the disk was expanded:
- df -h
- Logoff the controller machine:
- exit
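- If df -h does not show the new size, the filesystem may need to be grown manually; a sketch for a typical Linux image (device and partition names are assumptions, verify with lsblk first; on XFS filesystems use sudo xfs_growfs / instead of resize2fs):
- sudo growpart /dev/sda 1
- sudo resize2fs /dev/sda1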
Installing software
- Log in to the controller machine:
- gcloud compute ssh google1-controller --zone=<ZONE>
- Note 1: Replace google1-controller with the name of the HPC cluster controller machine
- Note 2: Replace <ZONE> with the target zone that you deployed the cluster into (such as europe-west2-a)
- Install all software into /apps (in this deployment the directory is shared from the controller to the login and compute nodes, so software installed there is visible cluster-wide)
- Logoff the controller machine:
- exit
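- For example, a sketch of unpacking a pre-built tool under /apps (myapp, its download URL, and its bin layout are placeholders):
- sudo mkdir -p /apps/myapp
- curl -L -o /tmp/myapp.tar.gz https://example.com/myapp.tar.gz
- sudo tar -xzf /tmp/myapp.tar.gz -C /apps/myapp
- /apps/myapp/bin/myapp --version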
Deleting an HPC cluster
- Run the command below to remove the HPC cluster deployment:
- gcloud deployment-manager --project=[PROJECT_ID] deployments delete mycluster
- Note 1: Replace [PROJECT_ID] with the target GCP project ID
- Note 2: Replace mycluster with your target HPC cluster name (without spaces)
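- If you are unsure of the deployment name, list the existing deployments first:
- gcloud deployment-manager deployments list --project=[PROJECT_ID]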
- Delete the SSH keys (from Linux machine):
- rm -rf ~/.ssh/google_compute_engine
- rm -rf ~/.ssh/google_compute_engine.pub
References
- Deploy an Auto-Scaling HPC Cluster with Slurm: https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0
- Slurm on Google Cloud Platform: https://github.com/SchedMD/slurm-gcp
- Easy HPC clusters on GCP with Slurm: https://cloud.google.com/blog/products/gcp/easy-hpc-clusters-on-gcp-with-slurm
- Provisioning a slurm cluster: http://gcpexamplesforresearch.web.unc.edu/2019/01/slurm-cluster/
- Google Cloud HPC Day: https://hackmd.io/@mB_F56f1R86PnZTa2vK3oQ/SksCSxeOE?type=view
- Introducing Lustre file system Cloud Deployment Manager scripts: https://cloud.google.com/blog/products/storage-data-transfer/introducing-lustre-file-system-cloud-deployment-manager-scripts
- Lustre Deployment Manager Script: https://github.com/GoogleCloudPlatform/deploymentmanager-samples/tree/master/community/lustre