Scientific Cloud Computing

We have recently published a paper describing our progress on Scientific Cloud Computing for materials simulations:
"A high performance scientific cloud computing environment for materials simulations", K. Jorissen, F.D. Vila and J.J. Rehr, Comp. Phys. Comm. 183 (2012) 1911.

In a previous publication we established proof-of-principle for Scientific Cloud Computing:
"Scientific Computing in the Cloud", J.J. Rehr, F.D. Vila, J.P. Gardner, L. Svec and M. Prange, Comp. Sci. Eng. 12 (2010) 34-43.

  • What is the SC2IT Toolset?

    Who is the SCC Toolset for?

    The toolset is for people who want to launch and use virtual cloud clusters for scientific computing. The virtual cloud cluster is optimized for materials simulations, although it can be used for other purposes too. Using the SCC Toolset requires that you work in a *NIX environment (Mac OS X, Linux, or Windows + Cygwin), that you are comfortable doing basic work from a command-line terminal, and that you have an Amazon Web Services account in good standing. You will not need to configure the virtual cluster or do any kind of system administration.
    (If you are a developer, however, you will find it easy to change just about anything you want.)

    Who is the SCC Toolset not for?

    If you are only interested in scientific cloud computing from a user-friendly, graphical environment that does not require commandline skills, then you should not install the stand-alone SCC Toolset. However, you may be interested in the latest version of JFEFF, the GUI for the FEFF9 code. This GUI can run FEFF9 cloud calculations from its graphical environment. We plan to develop more general-purpose GUIs for materials simulations in the future.

    What is the SCC Toolset?

    Our second-generation SCC toolset consists of a set of bash scripts that run on your local machine (e.g., a laptop PC or UNIX desktop) and control the virtual SCC environment. First, the toolset transforms a group of EC2 instances based on our Amazon Machine Image into an interconnected cluster that functions as a virtual parallel computing platform. Second, the toolset acts as a wrapper for the EC2 API, replacing cumbersome API calls with much more user-friendly commands that store many settings in the environment and in configuration files, so the user does not have to manage them manually. The scripts thus function as an intermediary layer between the user and the EC2 API. For example, to connect to an existing virtual machine from a command-line terminal, a user would normally have to enter a rather complicated, session-dependent command:
    ssh -i /home/user/.ec2_clust/.ec2_clust_info.7729.r-de70cdb7/key_pair_user.pem user@ec2-72-44-53-27.compute-1.amazonaws.com
    Alternatively, using the SCC toolset script, the same task only requires:
    ec2-clust-connect
    The toolset also simplifies the use of applications inside the cluster by providing scripts for launching and monitoring the load of different tasks.

    Which applications can I run on these virtual cloud clusters?

    This depends on the type of virtual machine you choose. If you choose our Scientific AMI (Amazon Machine Image), it comes preinstalled with several optimized materials science codes. The SCC AMI includes electronic structure codes (ABINIT, Quantum ESPRESSO, Car-Parrinello, and WIEN2k*), excited-state codes (AI2NBSE, Exc!ting, FEFF9*, OCEAN, and RT-SIESTA), and quantum chemistry codes (NWChem). Note that some of these codes (marked with *) require the purchase of a license before you can use them.
    You can also install your own applications on the virtual cluster. In our opinion, SCC is currently appropriate for CPU-intensive work, which includes most typical materials science applications. For data-intensive applications, however, data traffic to and from the cloud can become prohibitively slow and costly.

    What makes the SCC Toolset and SCC AMI appropriate for scientific computing?

    The SCC AMI is a blueprint for a cloud instance specifically configured for parallel, HPC scientific computing applications. It is a 64-bit Fedora 13 LINUX distribution enhanced with tools typically needed for scientific computing, such as Fortran 95 and C++ compilers, numerical libraries (e.g. BLAS, ScaLAPACK), MPI (1.4.3), PBS, PVFS, etc. It is bundled with several widely used materials science codes that typically demand HPC capabilities. Depending on your needs, the SCC AMI can be loaded onto different "instance types": slower but cheaper instances can be used for simple tasks, while higher-performance workhorses are available for more demanding calculations. The SCC Toolset creates a number of individual virtual machines in the cloud and then transforms the N individual machines into an N-node cluster that functions like a traditional LINUX Beowulf cluster. This includes meeting the requirements of typical parallel materials simulation applications, such as mounting an NFS partition, creating appropriate "/etc/hosts" files on all nodes, and configuring passwordless ssh access between nodes.

    I want to know more about the performance of SCC and this Toolset.

    Please consult our recent publication on the Publications page.

  • Users Guide

    Obtaining and installing the SCC Toolset

    Download the SCC Toolset and unpack it on your computer. You can install the files anywhere, but we suggest keeping them in the same place as your AWS EC2 installation.
    You will then need to set several environment variables. It is most convenient to do this once and for all by editing ~/.cshrc or ~/.bashrc (or another file if you are not using csh or bash); a template is provided with the SCC Toolset. Specifically, you will need to set the path to the SCC Toolset and to your AWS credentials. If you are already using Amazon EC2, you likely have everything you need. If you are new to EC2, you will need to go through the EC2 setup process; make sure to create two ssh key pairs. The AWS website provides a handy GUI to help you do these things.
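    A minimal sketch of the kind of settings involved is shown below, assuming a bash shell. The EC2_HOME, EC2_PRIVATE_KEY, and EC2_CERT variables belong to the standard Amazon EC2 API tools; the toolset directory and file names shown here are only hypothetical placeholders, and the template distributed with the SCC Toolset defines the actual names.
    # Hypothetical ~/.bashrc additions (paths and file names are placeholders)
    export JAVA_HOME=/usr/lib/jvm/java                  # JRE used by the EC2 API tools
    export EC2_HOME=$HOME/ec2-api-tools                 # Amazon EC2 API tools installation
    export EC2_PRIVATE_KEY=$HOME/.ec2/pk-XXXXXXXX.pem   # AWS private key credential
    export EC2_CERT=$HOME/.ec2/cert-XXXXXXXX.pem        # AWS X.509 certificate credential
    export PATH=$PATH:$EC2_HOME/bin:$HOME/scc-toolset   # put the EC2 tools and SCC scripts on the PATH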
    If we have given you a demo version of our Toolset, it already contains all the necessary configuration settings and key pairs, and you do not need to worry about these instructions.
    Please contact us if you need further help.

    System requirements and AWS account

    The SCC Toolset does not use significant system resources. In terms of software requirements, you will need the Java EC2 API tools, a Java runtime environment (JRE), and a *NIX environment with bash and ssh. The toolset can be installed under many common operating systems, including Linux, Mac OS, and, using Cygwin, MS Windows. In addition to these software requirements, you need a valid Amazon AWS account and appropriate security credentials, including two ssh key pairs linked to the AWS account.

    What tools are included in the SCC Toolset?

    ec2-clust-launch -n N    Launch a cluster of N instances
    ec2-clust-connect        Connect to a cluster
    ec2-clust-put            Copy files to a cluster
    ec2-clust-get            Copy files from a cluster
    ec2-clust-run            Start a job on a cluster
    ec2-clust-list           List running clusters
    ec2-clust-usage          Monitor CPU usage in a cluster
    ec2-clust-load           Monitor load in a cluster
    ec2-clust-terminate      Terminate a running cluster


    Creating a virtual cluster

    In a command-line terminal, type

    ec2-clust-launch -n N [-c Name] [-m MachineType] [-t InstanceType] [-e EphStorage]

    The ec2-clust-launch script performs all the tasks needed to launch and configure a cluster of N instances on EC2. Optionally, the MachineType (i.e., the AMI) and the InstanceType can be selected. AWS currently offers about a dozen instance types of varying CPU, memory, and network capabilities, and bills you per hour for any instances you start. It is a good idea to label the cluster with a Name. The EphStorage option applies only to larger instance types, such as cluster instances, and creates ephemeral data volumes, which may be necessary for calculations requiring a lot of scratch disk space.
    For example, to create a large cluster of small instances for a FEFF9 calculation:
    ec2-clust-launch -n 84 -c FeffCluster1 -m fer-32 -t m1.small
    To create a cluster of cluster instances for a large WIEN2k calculation:
    ec2-clust-launch -n 32 -c Cl32HPC -m kevin-cluster -e 1
    (This creates one 900 GB ephemeral data volume; the instance type defaults to "cc1.4xlarge" for this MachineType.)

    Accessing the cluster and running a calculation

    To open an ssh session on the cluster to run applications, use

    ec2-clust-connect [-c Name]

    To open an ssh session on the cluster to configure the system, use

    ec2-clust-connect-root [-c Name]

    To upload files or directories to the cluster, use

    ec2-clust-put [-c Name] localfile remotefile

    To download files or directories from the cluster, use

    ec2-clust-get [-c Name] remotefile localfile

    To send a calculation to the cluster and retrieve the results when done, use

    ec2-clust-run -e Task [-c Name] [-t]

    The "Name" argument is always optional; if omitted, the most recently launched cluster is used. If a directory is specified for upload or download, it is copied recursively. The ec2-clust-run script requires a configuration file for each type of Task; e.g., for FEFF9 this configuration file tells the Toolset which input files to gather and send to the cloud, how to determine whether the calculation in the cloud has finished, and which files to bring back before optionally (-t) terminating the cluster. Thus the ec2-clust-run script can execute a calculation for supported Tasks without requiring the user to ever log in to the virtual cluster.
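    As an illustration, a complete session using only the commands above might look as follows; the cluster name and file names are hypothetical examples.
    ec2-clust-launch -n 4 -c demo               # start a 4-node cluster
    ec2-clust-put -c demo feff.inp feff.inp     # upload the input file
    ec2-clust-connect -c demo                   # log in, run the calculation, log out
    ec2-clust-get -c demo feff.out feff.out     # retrieve the results
    ec2-clust-terminate -c demo                 # shut the cluster down to stop charges
    For supported Tasks, the middle three steps can be replaced by a single ec2-clust-run call.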


    Monitoring the cluster

    To see all currently active clusters, use

    ec2-clust-list

    To get the CPU usage for all nodes of a cluster, use

    ec2-clust-usage [-c Name]

    To get the load for all nodes of a cluster, use

    ec2-clust-load [-c Name]


    Stopping the cluster

    To destroy a cloud cluster, use

    ec2-clust-terminate [-c Name]

    This command should be used when the calculation is done, to avoid paying unnecessary usage charges. However, it is important to copy the results back to your local computer first: once a cluster is terminated, it cannot be restarted, and all output remaining on it is lost!
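    For example, assuming a cluster named MyCluster and a results directory named output (both hypothetical):
    ec2-clust-get -c MyCluster output ./output   # copy the results directory back (recursively)
    ec2-clust-terminate -c MyCluster             # then release the instances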


    What instance type should I use?

    This really depends on your application. If network performance is crucial, as it is for many parallel HPC codes that use MPI and ScaLAPACK routines, we recommend the cc1.4xlarge "cluster instances". For other applications, consider your RAM requirements and how quickly you need the results: "m1.small" instances are a bargain, but they can also be about 3x slower per core than other instance types.


    What should I use for "MachineType"?

    description forthcoming


    I still don't know what to do!

    Better drop us a line at feff@uw.edu.


  • Demo

    You got it.


    [Embedded video: sccdemo]


    (Download if network is too slow for streaming)


  • Reference

    ec2-clust-launch -n N [-c Name] [-m MachineType] [-t InstanceType] [-e EphStorage]

    The ec2-clust-launch script is the most important tool in the set: it performs all the tasks needed to launch and configure a cluster of N instances on EC2. Optionally, the MachineType (i.e., the AMI) and the InstanceType can be selected. AWS currently offers about a dozen instance types of varying CPU, memory, and network capabilities. Schematically, the ec2-clust-launch script performs the following tasks, as summarized from the comments within the script:

    ### Create a cluster of CLUST_NINS instances
    # Get the general configuration information
    # Check if the EC2_HOME is set and set all the derived variables
    # Check and set the location of the cluster tools
    # To avoid launch problems check for the presence of a lock
    # Create a lock for this process
    # Check if we have a cluster list
    # Get the total number of instances
    # Set the default cluster name (use current process id)
    # Set the default machine type
    # Process input options
    # Check that the PK (private key) and CERT files are available
    # Set the cluster index
    # Load the appropriate machine profile
    # Create an EC2 placement group if requesting cluster instances
    # Launch instances on EC2
    # Get reservation ID and list of instance IDs
    # Get the instance rank indices
    # Save name and reservation ID in cluster list
    # Release the lock
    # Make a directory to hold the cluster information
    # Manage the certificates that are used to access the cluster
    # Monitor instances until all the information we need is available
    # Make directory that will be used to store info to transfer
    # Initialize setup script in transfer directory
    # Get public and private DNS names
    # Set the head instance public DNS name
    # Create a list of the internal EC2 addresses
    # Save information about the cluster to .ec2_clust_config file
    # Create a hosts file and mpi hostfile for all the cluster instances
    # Copy hosts files to directory to transfer and add to setup script
    # Copy monitor tools to directory to transfer, add to setup script
    # Set up ephemeral storage on cluster instances
    # Point SCRATCH file system to ephemeral volume
    # Add shared dir creation to setup script
    # Create the exports file for the head instance
    # Copy exports file to directory to transfer and add to setup script
    # Add nfs config reload to setup script
    # Add fstab update and mount to setup script
    # Copy user certificates to directory to transfer
    # Add user certificate setup to setup script
    # Compress the info storage directory
    # Make sure the keys are ready on the other side
    # Transfer all files at once but don't launch more processes than permitted by the OS
    # Run setup on all nodes
    # Optionally give the head node a head start so it can get the nfs exports ready by the time the nodes want to mount them
    # Do cleanup locally but save cluster information
    # Print out setup timing info

    We now discuss this set of operations further. Each of the N nodes is a clone of the selected AMI, with all its preinstalled software and data, running on virtualized hardware determined by the InstanceType (e.g., "High CPU, 8 cores"). When the N instances have booted in EC2, the launch script performs setup tasks that transform the N individual machines into an N-node cluster that functions like a traditional LINUX Beowulf cluster. These tasks include mounting an NFS partition, creating appropriate "/etc/hosts" files on all nodes, and configuring passwordless ssh access between nodes, all of which are requirements for many parallel scientific codes. One node is designated the master node; it generally distributes MPI jobs to the other nodes and makes a working directory partition available over the local network. The script also sets up a user account other than root for users to run the scientific codes provided in the AMIs.
    It is useful to tag the cluster with a Name (-c argument), especially if one intends to run several clusters at the same time. Certain InstanceTypes can create additional ephemeral data volumes of up to about 2 TB for storage-intensive calculations (-e argument).
    The ec2-clust-launch command creates a temporary folder on the local control computer to store information about the cluster, including identifiers and internal and external IP addresses for each of the instances comprising the cluster. The other scripts in the toolset access this information when they need to communicate with the cluster. User-related information, e.g. identifiers for the user's AWS account, is stored in environment variables.
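    To make the setup concrete, the sketch below shows the kind of "/etc/hosts" entries and MPI hostfile that the launch script distributes so that the N instances can see each other. The host names, addresses, and slot counts here are purely hypothetical placeholders; the real files are generated automatically by the toolset and will differ.
    # Illustrative /etc/hosts entries pushed to every node (addresses are placeholders)
    10.16.0.11   node0    # master node: exports the shared work directory over NFS
    10.16.0.12   node1
    10.16.0.13   node2
    10.16.0.14   node3
    # Illustrative MPI hostfile used when distributing jobs from the master node
    node0 slots=8
    node1 slots=8
    node2 slots=8
    node3 slots=8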

    ec2-clust-connect [-c Name] [-d command]

    Opens an ssh session on the Name cluster, or on the most recently launched cluster if no argument is given. The script logs in with the user account created by ec2-clust-launch, instead of the default root access offered by AWS. If -d is given, the command (best entered 'between quotes') is executed on the cloud cluster, after which control returns to the local shell.
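    For example, a single illustrative command such as
    ec2-clust-connect -d 'uptime'
    prints the uptime of the master node of the most recent cluster and then returns to the local shell.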

    ec2-clust-connect-root [-c Name] [-d command]

    Opens an ssh session on the Name cluster and logs in as root. This is needed only by developers, not by users running a calculation, unless runtime changes in configuration are required. If -d is given, the command (best entered 'between quotes') is executed on the cloud cluster, after which control returns to the local shell.

    ec2-clust-put [-c Name] localfile remotefile

    Copies the file localfile on the local machine to the file remotefile on the master node of the Name cluster (or the most recent cluster if none is specified). If localfile is a directory it will be copied recursively. The master node has a shared working directory that all other nodes can access.

    ec2-clust-get [-c Name] remotefile localfile

    Copies the file remotefile on the head node of the Name cluster (or the most recent cluster if none is specified) to the file localfile on the local machine. If remotefile is a directory it will be copied recursively. The master node has a shared working directory that all other nodes can access.

    ec2-clust-list

    Lists all active clusters. Each cluster is identified by a Name, its AWS reservation ID, and an index number.

    ec2-clust-terminate [-c Name]

    Terminates all N instances comprising the cloud cluster Name, and cleans up the configuration files containing the specifics of the cluster on the local machine. The cluster cannot be restarted; all desired data should be retrieved before running the ‘terminate’ script. If no cluster is specified, the most recent one will be terminated.

    ec2-clust-run -e Task [-c Name] [-t]

    This tool connects to cluster Name (or the most recent cluster if none is specified) and executes a job there. Currently, Task can be WIEN2k or FEFF9. The tool loads a profile describing the selected Task. It scans the working directory for required input files and copies them to the cloud cluster Name. It then instructs the cluster to execute the task on all its nodes. It periodically connects to check for Task specific error files or successful termination. It copies relevant output files back to the local working directory, and terminates the cluster after completion if the -t flag is given.
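    For example, to run a FEFF9 calculation on the cluster named FeffCluster1 from the earlier example and terminate it when the job completes:
    ec2-clust-run -e FEFF9 -c FeffCluster1 -t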

    ec2-clust-usage [-c Name]

    Reports current CPU and memory usage for all nodes in cluster Name. This command can be executed either from within the cluster or from outside it, in which case the -c option is not required.

    ec2-clust-load [-c Name]

    Reports the 1 min, 5 min, and 15 min average load for all nodes in cluster Name. As in the ec2-clust-usage case, this command can be executed either from within the cluster or from outside it, in which case the -c option is not required.