How to create AWS ParallelCluster with Slurm scheduler
Latest revision as of 13:17, 1 August 2019
Environment preparation phase within AWS Management Console
- Log in to the AWS IAM console:
- From the left pane, click on Policies -> click on Users -> Add User -> specify the name parallelcluster-user -> Access type: Programmatic access -> click Next: Permissions -> Set permissions -> select a group with the “AdministratorAccess” policy -> click Next: Tags -> click Next: Review -> click on Create user -> click on Download .csv and keep the file in a secure location -> click on Close
- Follow the instructions below to create a key pair to access the cluster machines:
- Follow the instructions below to create an S3 bucket (with a unique name) for importing data into and exporting data from the FSx for Lustre storage:
- https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html
- Note 1: Document the S3 bucket name for use inside the ParallelCluster config file
- Note 2: Create a folder called export (in lowercase) inside the S3 bucket
- In case you wish to create a dedicated VPC and subnet for the HPC cluster, follow the instructions below:
- Log off from the AWS console
Python installation phase on Linux (Debian / Ubuntu)
- Log in to a Linux machine using SSH, and follow the instructions below to install Python 3:
- https://docs.aws.amazon.com/cli/latest/userguide/install-linux-python.html
- Note: In case Python 3 is already installed, use the command below to upgrade it to the latest build:
- sudo apt-get install --only-upgrade python3
- To install pip3, run the command below:
- sudo apt install python3-pip
Python installation phase on CentOS 7
- Log in to the CentOS machine using SSH, and follow the instructions below to install Python 3 and pip3:
- https://www.rosehosting.com/blog/how-to-install-python-3-6-4-on-centos-7/
Python installation phase on Windows
- Log in to a Windows machine using a privileged account, and follow the instructions below to install Python 3 and pip:
AWS ParallelCluster installation phase
- Run the commands below to install AWS ParallelCluster:
- Linux:
- sudo pip3 install aws-parallelcluster
- Windows:
- pip install aws-parallelcluster
- Run the command below to verify the installed version:
- pcluster version
- Follow the instructions below to install the AWS CLI:
- Run the command below to configure the AWS CLI:
- aws configure
- AWS Access Key ID – Specify the value from the CSV of the previously created IAM user parallelcluster-user
- AWS Secret Access Key – Specify the value from the CSV of the previously created IAM user parallelcluster-user
- Default region name – Specify a region such as eu-west-1
- Default output format – Specify json
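Before moving on to pcluster configure, it may help to confirm that the tools installed so far are reachable on the PATH. The sketch below is one minimal way to check this; the function name missing_tools is hypothetical and is not part of the AWS CLI or AWS ParallelCluster.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is NOT on the PATH.
# missing_tools is a hypothetical helper name used only for illustration.
missing_tools() {
    for tool in "$@"; do
        # command -v exits non-zero when the command cannot be resolved
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example usage: prints nothing when all four tools are installed.
# missing_tools python3 pip3 aws pcluster
```

Any name printed by the function points at an installation step above that still needs attention.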
- Run the command below to set up the initial configuration:
- pcluster configure
- Cluster Template: Specify here a custom name for the HPC template (such as HPC Cluster)
- AWS Region ID: Specify the same region you specified for the aws configure command (such as eu-west-1)
- VPC Name: Specify the same name as the Cluster Template (such as HPC Cluster)
- Key Name: Specify the name of the EC2 Key pair previously created
- VPC ID: Specify the ID of the target VPC to deploy the HPC cluster into
- Note: The full list of VPCs can be found within the AWS management console: https://console.aws.amazon.com/vpc
- Master Subnet ID: Specify the ID of the target subnet to deploy the HPC cluster into
- Note: The full list of subnets can be found within the AWS management console: https://console.aws.amazon.com/vpc
- Edit the ParallelCluster config file:
- Linux: The file is located inside ~/.parallelcluster/config
- Windows: The file is located inside %UserProfile%\.parallelcluster\config
- Add the following parameters to the [cluster] section (for a large cluster):
- base_os = centos7
- master_instance_type = c5n.xlarge
- compute_instance_type = c5n.18xlarge
- cluster_type = ondemand
- initial_queue_size = 2
- scheduler = slurm
- placement_group = DYNAMIC
- enable_efa = compute
- fsx_settings = fs
- Note: For a small cluster, add the following parameters to the [cluster] section instead:
- base_os = centos7
- master_instance_type = m4.large
- compute_instance_type = m4.large
- cluster_type = ondemand
- initial_queue_size = 2
- max_queue_size = 3
- scheduler = slurm
- placement_group = DYNAMIC
- fsx_settings = fs
- Add the following entire section to the config file:
- [fsx fs]
- shared_dir = /fsx
- storage_capacity = 3600
- imported_file_chunk_size = 1024
- export_path = s3://bucket/export
- import_path = s3://bucket
- weekly_maintenance_start_time = 1:00:00
- Note 1: The storage_capacity is the size of the FSx Lustre storage in GB
- Note 2: Replace bucket with the name of the previously created S3 bucket
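To avoid editing the bucket name by hand in two places, the [fsx fs] section above can be generated with the bucket name filled in. This is only a sketch: the helper name render_fsx_section and the bucket name my-hpc-bucket are hypothetical, while the keys and values are the ones listed above.

```shell
#!/bin/sh
# Print the [fsx fs] section with the S3 bucket name substituted.
# render_fsx_section is a hypothetical helper used only for illustration.
render_fsx_section() {
    bucket="$1"
    cat <<EOF
[fsx fs]
shared_dir = /fsx
storage_capacity = 3600
imported_file_chunk_size = 1024
export_path = s3://${bucket}/export
import_path = s3://${bucket}
weekly_maintenance_start_time = 1:00:00
EOF
}

# Example usage (my-hpc-bucket is a placeholder bucket name):
# render_fsx_section my-hpc-bucket
```

The function only prints the section; review the output before pasting it into the ParallelCluster config file.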
- Run the command below to deploy the new cluster:
- pcluster create mycluster
- Note: Replace mycluster with your target HPC cluster name (without spaces)
- Go to the CloudFormation console to view the deployment status:
- Wait for the cluster deployment to complete.
- Document the MasterPublicIP and ClusterUser.
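The two values can also be extracted from text output rather than copied by hand. The sketch below assumes that the output of pcluster status consists of "Key: value" lines that include MasterPublicIP and ClusterUser; that layout is an assumption and should be checked against the installed version. The helper name get_field is hypothetical.

```shell
#!/bin/sh
# Read "Key: value" lines from standard input and print the value for one key.
# get_field is a hypothetical helper; the "Key: value" layout is an assumption
# about the status output, not something this guide confirms.
get_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Example usage (assumes the cluster is named mycluster):
# pcluster status mycluster | get_field MasterPublicIP
# pcluster status mycluster | get_field ClusterUser
```

If the key is not present in the input, the function prints nothing.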
Increase EC2 service limit
- In case you need to increase the EC2 service limit (for example, the number of EC2 instances of a specific instance type), follow the instructions below:
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html#request-increase
Connecting to the HPC cluster (from Linux Machine)
- Run the command below to connect using SSH to the master server:
- pcluster ssh mycluster -i /path/to/keyfile.pem
- Note 1: Replace mycluster with the previously created cluster name
- Note 2: Replace /path/to/keyfile.pem with the actual path and key file name
- Run the command below to verify the state of the cluster:
- sinfo
Connecting to the HPC cluster (from Windows Machine)
- Download puttygen.exe from:
- Run puttygen.exe
- Click on “Load” -> change the file extension from “Putty Private key files” to “All Files” -> locate the private key file and click on Open -> click on OK -> click on “Save private key” -> click on “Yes” -> save the private key file with the PPK extension -> close puttygen.exe
- Download Putty from:
- Run putty.exe
- From the left pane, under “Connection” -> expand SSH -> click on “Auth” -> from the main pane, under “Authentication parameters”, click on “Browse” -> locate the SSH private key generated by puttygen.exe
- From the left pane, click on “Session” -> from the main pane, under “Host Name (or IP address)” specify the following:
- user@IP_Address
- Note 1: Replace user with the previously documented ClusterUser value
- Note 2: Replace IP_Address with the previously documented MasterPublicIP
- Under “Saved Sessions”, specify a name for this newly created connection.
- Click on Save
- Click on Open
- Run the command below to verify the state of the cluster:
- sinfo
Common actions to control the cluster
- Displays a list of stacks that are associated with AWS ParallelCluster:
- pcluster list
- Displays a list of all instances in a cluster:
- pcluster instances mycluster
- Note: Replace mycluster with the previously created cluster name
- View the current status of the cluster:
- pcluster status mycluster
- Note: Replace mycluster with the previously created cluster name
- Updates a running cluster by using the values in the configuration file:
- pcluster update mycluster -c ~/.parallelcluster/config
- Note 1: Replace mycluster with the previously created cluster name
- Note 2: Replace ~/.parallelcluster/config with the target config file location
- Stops the compute fleet, leaving the master node running:
- pcluster stop mycluster
- Note: Replace mycluster with the previously created cluster name
- Starts the compute fleet for a cluster that has been stopped:
- pcluster start mycluster
- Note: Replace mycluster with the previously created cluster name
Delete AWS ParallelCluster
- In case you wish to keep the static IP of the AWS ParallelCluster master node, log in to the AWS console:
- https://console.aws.amazon.com/ec2/
- From the left pane, click on Elastic IPs -> select the public IP of the master node -> Actions -> Disassociate address
- From the command prompt (on the same machine where you ran the pcluster commands), run the command below to delete the cluster:
- pcluster delete mycluster
- Note: Replace mycluster with the previously created cluster name
- Long-term data must be stored in the S3 bucket
- The Amazon FSx for Lustre storage (mounted at /fsx) is used only for the duration of the compute job
References
- Getting started with AWS ParallelCluster:
- Setting Up AWS ParallelCluster
- Install AWS ParallelCluster in a Virtual Environment
- A Scientist's Guide to Cloud-HPC: Example with AWS ParallelCluster, Slurm, Spack, and WRF
- Launch your first sample HPC environment on AWS and review important concepts along the way
- AWS ParallelCluster Wiki:
- Deploying an Elastic HPC Cluster
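- Scale HPC Workloads with Elastic Fabric Adapter and AWS ParallelCluster
- https://idk.dev/scale-hpc-workloads-with-elastic-fabric-adapter-and-aws-parallelcluster/
- Best Practices for Running Ansys Fluent Using AWS ParallelCluster
- https://aws.amazon.com/blogs/opensource/best-practices-running-ansys-fluent-aws-parallelcluster/
- AWS ParallelCluster with AWS Directory Services Authentication
- https://aws.amazon.com/blogs/opensource/aws-parallelcluster-aws-directory-services-authentication/
- Adding support for FSx for Lustre:
- https://aws-parallelcluster.readthedocs.io/en/develop/configuration.html#fsx-section
- Getting Started with Amazon FSx for Lustre
- Amazon FSx for Lustre User Guide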