Michael J. Gallagher, PhD.
Associate Professor of Finance, St Bonaventure University
High Performance Computing
MPI - R Studio - AWS
Quick Reference Guide
(from a Windows machine to a Linux server)
RStudio IDE is licensed under the GNU Affero General Public License:
http://www.gnu.org/licenses/agpl-3.0-standalone.html
Open MPI is distributed under the New BSD license
https://www.open-mpi.org/community/license.php
The links below point to preconfigured, out-of-the-box computing environments that run on Amazon Web Services and have Open MPI and RStudio already installed. To my knowledge, this is the only place these applications have been combined and made available for public use. The platform is designed to be a scalable environment for nonparametric data analysis using the R package “npRmpi”. A common complaint of any researcher who has done nonparametric cross-validation bootstrapping is the enormous computing resources required and the length of time needed to perform the calculations. As a benchmark, nonparametric conditional model specification testing on a multivariate model with 50 years of daily data can easily take twenty-four hours on a standard quad-core machine with 8 GB of RAM. Using the configuration made available here, and clustering 4 instances from Amazon, the run time for the same analysis is cut to about 18 minutes. Using Amazon's spot pricing, 4 instances can be set up at a cost of less than $1.00 per hour, conceivably within the budget of many academic researchers.
1) Open an account at Amazon Web Services.
Amazon provides detailed instructions for getting up and running, along with extensive documentation and a myriad of other resources.
2) You will need an SSH and SFTP client. I recommend PuTTY.
http://www.chiark.greenend.org.uk/~sgtatham/putty/
Install all the binaries. PuTTY provides two essential ingredients. The first is a secure shell from which you access the compute nodes you have created, execute R command files, and otherwise fine-tune your environment for your purposes. The second is PSFTP (PuTTY Secure File Transfer Protocol), which runs from your Windows command shell and gives you a backup method to upload files to and download files from your server.
3) You will need an encryption key pair. Amazon provides detailed instructions for generating keys and for saving them in .pem format and .ppk format. You will need both, so carefully save them in the directory from which you will access the server. The .pem file will enable passwordless communication between your compute nodes, which is essential for MPI. The .ppk file is necessary to allow PuTTY to give you a command shell on your server.
4) Now you are ready to access your server and complete a few remaining steps to configure your environment. Choose an Amazon Machine Image (AMI) corresponding to your region from the list on this web page.
a. Choosing an AMI is the first step of the prompts on Amazon to set up your environment. The AMIs I created run the Ubuntu Linux operating system.
b. Choose an instance type. I have been using the cc2.8xlarge, a compute-optimized instance designed for cluster computing. I believe any of the newer generations should work as well; I just have not tested them. The exception is the free tier, on which I was unable to establish node-to-node communication.
c. The next step is to configure your instance and choose the number of instances. I usually run 4. There is no reason not to run more, but bear in mind that each node is an additional cost (the spot rate has recently been around $0.25 per node per hour) and, more importantly, that there is a point of diminishing returns. The cc2.8xlarge instance has 16 cores (more precisely, 16 cores plus 16 hyper-threads), along with 32 GB of RAM, so running 4 in parallel is equivalent to a 64-core machine with 128 GB of RAM: pretty much your own little supercomputer. You are also given the option of spot instances, which can be a significant savings over on-demand instances.
d. Configure your security group. Here the focus is on inbound rules; you want only the IP address of your local machine to be able to access your server. You will need the following:
i.  TYPE              PROTOCOL   PORT RANGE   SOURCE
    SSH               TCP        22           local IP address
    All TCP           TCP        0-65535      Security Group ID
    Custom TCP Rule   TCP        8787         local IP address
    HTTP              TCP        80           local IP address
These rules accomplish the following: SSH on port 22 and HTTP on port 80 allow you to communicate from your local machine to your server. The custom rule on port 8787 allows you to communicate with your server through RStudio Server. Finally, when you save and name your security group, it is assigned a security group ID. Using that ID as the source of the All TCP rule allows your compute nodes to communicate with each other on all ports, which is essential for MPI.
e. Launch your instance.
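The sizing arithmetic from step 4c can be checked with a few lines of shell. This is only a sketch using the figures quoted in this guide (4 nodes, 16 cores and 32 GB of RAM per node, a spot price of roughly $0.25 per node per hour); the variable names are illustrative, and actual spot prices vary.

```shell
#!/bin/sh
# Cluster sizing from the figures quoted in step 4c (assumed, not measured).
nodes=4                  # number of instances launched
cores_per_node=16        # cores per cc2.8xlarge, as stated in this guide
ram_gb_per_node=32       # RAM per instance, as stated in this guide
spot_cents_per_hour=25   # approximate spot price per node, in cents

total_cores=$((nodes * cores_per_node))
total_ram_gb=$((nodes * ram_gb_per_node))
total_cents=$((nodes * spot_cents_per_hour))

printf 'cores: %d\n' "$total_cores"
printf 'RAM:   %d GB\n' "$total_ram_gb"
printf 'cost:  $%d.%02d per hour\n' $((total_cents / 100)) $((total_cents % 100))
```

Changing nodes or the per-node figures shows how quickly the hourly cost grows relative to the added cores.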
5) When your instance is up and running, which may take a few minutes, you can connect to your server through a PuTTY shell. You will need a bare minimum of Linux skills.
a. From the Ubuntu command line prompt, ubuntu@ip-XXX-XX-XX-XX:~$, you will edit your hosts file to include the other instances you have launched. Use an elementary Linux editor such as nano, together with the Linux command sudo, which lets you edit the file without changing its ownership. In other words, type:
ubuntu@ip-XXX-XX-XX-XX:~$ sudo nano /etc/hosts
This will open your hosts file and allow you to edit it. Add the internal IP addresses of the instances you created and give each one a name, perhaps node1, node2, etc. Save the edited hosts file.
b. Create a file with the configuration of the instances you have selected. For example if you have chosen a cc2.8xlarge your file will look like this:
node1 slots=16
node2 slots=16
node3 slots=16
node4 slots=16
To create such a file, simply use nano again:
ubuntu@ip-XXX-XX-XX-XX:~$ sudo nano nodefile
Then type the information about your nodes, based on the type and number of instances you have chosen. You must do this on all nodes, or you may use the file extend, which is on the image and will send whatever files you wish to all your nodes. Edit extend to reflect the number of nodes you are using.
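As an alternative to typing both files by hand, the hosts entries and the nodefile from step 5 can be generated from a single list of addresses. The sketch below is illustrative only: the 10.0.0.x addresses are placeholders for the internal IP addresses of your own instances, hosts.add is a hypothetical file name, and the output goes to a scratch directory rather than to any system file.

```shell
#!/bin/sh
# Build hosts-file entries and an Open MPI nodefile from a list of internal IPs.
# The addresses are placeholders; substitute those of your own instances.
ips="10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.14"
slots=16   # cores per cc2.8xlarge, as in the example above

workdir=$(mktemp -d)   # scratch directory; nothing on the system is modified
i=1
for ip in $ips; do
    printf '%s node%d\n' "$ip" "$i" >> "$workdir/hosts.add"
    printf 'node%d slots=%d\n' "$i" "$slots" >> "$workdir/nodefile"
    i=$((i + 1))
done

echo "--- lines to add to the hosts file ---"
cat "$workdir/hosts.add"
echo "--- nodefile ---"
cat "$workdir/nodefile"
```

The lines in hosts.add are what you would add to the hosts file edited in step 5a. Open MPI's mpirun accepts a file in the nodefile format through its --hostfile option.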
6) You must enable passwordless SSH between all your nodes. The easiest way to do this is to put your .pem file on all your nodes. You can do this through a Windows command shell using PSFTP (PuTTY Secure File Transfer Protocol). From the PuTTY directory in the Windows command shell, type:
psftp.exe ubuntu@XX.XX.XX.XX -i yourputty-key-pair-name.ppk
where XX.XX.XX.XX is the public IP address of your instance and yourputty-key-pair-name.ppk is the PuTTY key pair you generated in step 3. Once you have accessed your server through psftp, simply type:
put C:\Dir\where\keys\are\name-of-key-pair.pem /home/ubuntu/.ssh/id_rsa
You will need to do this on all your compute nodes.
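One detail worth checking after the upload: OpenSSH will normally refuse to use a private key file that other users can read, so the key's permissions may need tightening on each node. The sketch below demonstrates the fix on a stand-in file in a scratch directory so that it can be run anywhere; on a real node the file would be /home/ubuntu/.ssh/id_rsa, and whether your upload actually leaves the key too permissive is an assumption here, not something this guide confirms.

```shell
#!/bin/sh
# Stand-in for /home/ubuntu/.ssh/id_rsa so the sketch can run anywhere.
keydir=$(mktemp -d)
key="$keydir/id_rsa"
printf 'placeholder, not a real key\n' > "$key"
chmod 644 "$key"   # simulate an upload that left the key readable by others

# Restrict the key to its owner, which is what ssh expects for private keys.
chmod 600 "$key"
key_perms=$(ls -l "$key" | cut -c1-10)
echo "key permissions: $key_perms"

# On a real node you could then test a hop to another node, for example:
#   ssh node2 hostname
# (node2 is one of the names added to the hosts file in step 5a.)
```

If the ssh test prints the other node's host name without asking for a password, passwordless communication is working in that direction.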
7) While you have psftp open, you may upload data files or R script files, or you may do this from the R Studio Server GUI which we’ll launch next.
8) From the PuTTY command shell prompt, ubuntu@ip-XXX-XX-XX-XX:~$, type:
sudo adduser username
You will be prompted for a password, which you type in twice. Now open a web browser window and enter the public IP address of node1 followed by the default port for RStudio Server, :8787. In other words:
XX.XX.XX.XX:8787
This will open an RStudio sign-in page. Use the username and password you just created in the PuTTY command window and you are ready to go!
A word about Linux file and directory permissions: reading, writing, and executing files in Linux requires specific permissions. These can be configured using the command:
sudo chmod XXX /directory/filename
where the three digits of XXX correspond to owner, group, and other, and each X is a number that sets the privileges: 7 is Read, Write, Execute; 6 is Read, Write; 5 is Read, Execute; 4 is Read only. In other words, sudo chmod 655 /directory/filename gives the owner Read, Write and gives both group and other Read, Execute. sudo chmod 777 /directory/filename gives everyone all privileges, so be careful with this one!
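To make the numeric modes concrete, here is a small sketch that decodes a three-digit mode into the rwx notation shown by ls -l, then applies one mode to a scratch file for comparison. The function names digit_to_rwx and decode_mode are made up for this illustration; each digit is the sum of 4 (read), 2 (write), and 1 (execute).

```shell
#!/bin/sh
# Decode one octal permission digit (0-7) into its rwx form.
digit_to_rwx() {
    d=$1
    r=-; w=-; x=-
    [ $((d & 4)) -ne 0 ] && r=r   # 4 = read
    [ $((d & 2)) -ne 0 ] && w=w   # 2 = write
    [ $((d & 1)) -ne 0 ] && x=x   # 1 = execute
    printf '%s%s%s' "$r" "$w" "$x"
}

# Decode a three-digit mode: first digit owner, second group, third other.
decode_mode() {
    m=$1
    owner=${m%??}
    rest=${m#?}
    group=${rest%?}
    other=${m#??}
    printf '%s%s%s' "$(digit_to_rwx "$owner")" "$(digit_to_rwx "$group")" "$(digit_to_rwx "$other")"
}

for mode in 644 655 777; do
    printf '%s -> %s\n' "$mode" "$(decode_mode "$mode")"
done

# Compare with what chmod actually does to a scratch file.
scratch=$(mktemp)
chmod 655 "$scratch"
scratch_perms=$(ls -l "$scratch" | cut -c2-10)
echo "chmod 655 on a scratch file gives: $scratch_perms"
rm -f "$scratch"
```

The scratch file is owned by the current user, so no sudo is needed here; sudo is only required when changing files you do not own.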