Amazon Web Services (AWS) and other cloud computing resources provide two main advantages over running code on your local computer. First, you can access much larger, faster computers that allow you to complete your analyses in a fraction of the time it would take on the average personal computer. Second, you can run these resources remotely, thus connecting from anywhere and allowing processes to continue to run even when your personal computer is off or tied up doing other things.
So before committing to AWS, consider whether your analyses actually need these advantages.
This is in no way meant to dissuade you from learning to use cloud resources! However, we want to avoid frustrations that end in abandonment and thinking “I should have just done this in Excel…”
The two AWS resources you will likely use are Elastic Compute Cloud (EC2) and Simple Cloud Storage Service (S3). Think of EC2 as RAM and processors, or where you run programs. Think of S3 as the hard drive, or where you store files. While some files can be stored on EC2, it is more expensive and not as stable. Thus, it is best to only have what you’re currently working on in that space.
Go to the AWS account setup page at https://portal.aws.amazon.com/billing/signup#/start and create an account. Please note that AWS does talk to regular Amazon, so if you have a personal Amazon account, it is best to use a different email here. Your UW email is preferred.
Next, send your account admin (probably Kim, kadm@uw.edu) an email with the email address you used for your account. They will add you to the AWS organization and send you 1) an access key ID, 2) a secret access key, and 3) a log-in link in a .csv file. SAVE this file, as you'll need these keys to access AWS.
Finally, link your individual account to the organization by logging in as an IAM user with the console log-in link in the .csv from Kim (or input the account ID in that link at https://console.aws.amazon.com/).
AWS has its own Command Line Interface (CLI), which smooths over differences between AWS resources and your personal computer (e.g. operating system, versions, etc.). Download the CLI at https://aws.amazon.com/cli/ (links on the right). Once complete, check that it installed correctly by going to your command line (called Terminal on Mac/Linux) and typing aws [Enter]. You should see some information on aws commands and an error (because we didn't provide a full command) as below. The output on Windows may look slightly different.
aws
usage: aws [options] <command> <subcommand> [<subcommand> ...] [parameters]
To see help text, you can run:
aws help
aws <command> help
aws <command> <subcommand> help
aws: error: the following arguments are required: command
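As an additional check, you can print the installed version; the exact output will depend on your CLI release.

aws --version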
Next, you need to link your account to the CLI on your computer using the keys that your account admin sent you. Do so with aws configure and input your values as below. You'll need to do this once on every computer from which you access AWS.
aws configure
AWS Access Key ID [None]: #################
AWS Secret Access Key [None]: #################
Default region name [None]: us-west-2
Default output format [None]: text
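To confirm that your keys were entered correctly, you can ask AWS which identity you are authenticated as. This is a standard check; the account and user details returned will be specific to you.

aws sts get-caller-identity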
Now you’re ready to use AWS!
The following will take you through a simple example using AWS.
In your word processor of choice, create a text (.txt) file on your Desktop named test.txt. You can write whatever you want in there! Next, upload this file to S3 using the online tool or through the CLI as described below.
Go to AWS S3 at https://s3.console.aws.amazon.com/s3/.
Create a bucket (which is basically a folder) with the orange button in the upper right. Keep the default settings, especially ‘Block all public access’, in order to adhere to HIPAA. Note that the bucket name must be unique across all of S3, so names like ‘data’ won’t work. For those in the Hawn organization, please start all your bucket names with your initials.
You can then interact with the bucket just like the file explorer on your computer. Click ‘Upload’ or drag and drop your file to upload it to S3.
Alternatively, the AWS CLI has built-in commands for interacting with both S3 and EC2. To make a bucket, use aws s3 mb like so.
aws s3 mb s3://kadm-test
Then you can copy to S3 using aws s3 cp with the standard command line setup of [where the file is] followed by [where you want it to go]. Since you've already provided your access keys with aws configure, you shouldn't need to log in again.
aws s3 cp ~/Desktop/test.txt s3://kadm-test
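To confirm the upload worked, list the contents of the bucket (here, the kadm-test bucket from above).

aws s3 ls s3://kadm-test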
Or copy an entire directory using aws s3 sync. This is helpful if you have many files, as it only copies or updates files that differ between the two locations. Do not run this, as it will copy everything on your Desktop to S3.
aws s3 sync ~/Desktop/ s3://kadm-test
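If you want to preview what a sync would transfer without actually copying anything, you can add the --dryrun flag, which only prints the planned operations.

aws s3 sync ~/Desktop/ s3://kadm-test --dryrun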
Returning to your AWS account homepage, go to the EC2 resources at https://us-west-2.console.aws.amazon.com/ec2/.
There are a lot of options here, but all we really care about is in the left panel under ‘Instances > Instances’ and ‘Elastic Block Store > Volumes’. The instance is the basic computer you set up in the cloud, and the EBS volumes are additional hard drive space you can add to that basic computer.
First, go to Instances and click ‘Launch Instance’. Then, build your desired computer. When setting up to run a real job, you'll need to choose the operating system, size, and security settings that best fit your needs. Here, we will make the most basic one possible.
You may wish to add additional storage to your EC2 instance either because you forgot in Step 4 above or because you find that you need more later on.
If you go to ‘Elastic Block Store > Volumes’, you will see the storage that we created with the EC2 instance we just built. To make more, click ‘Create Volume’ and choose your desired size. Be sure to select the ‘Availability zone’ (us-west-2a/b/c/d) that matches the EC2 instance you want to use.
Then select the new volume and link it to your EC2 instance under ‘Actions > Attach volume’. If you did not pick the correct zone, you will not see your instance as an option and will need to delete this volume and make a new one. There are also options to detach, delete, etc. under the ‘Actions’ menu.
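You can also attach a volume from the CLI. A minimal sketch, using hypothetical volume and instance IDs (substitute the real ones shown in your console):

aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/sdf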
You access your EC2 instance from the command line using the key file you downloaded. Depending on your operating system, this key may not be accessible by the CLI, so we need to change its permissions. In your terminal, navigate to wherever the key .pem file is and change its permissions to 600. For example, I keep all my keys in one folder here.
cd ~/AWS/keys/
chmod 600 test.pem
Next, log in to your instance with secure shell ssh using the key file and your instance's public IPv4 DNS (found under ‘Instances > Instances’). It will be something similar to below.

ssh -i ~/AWS/keys/test.pem ec2-user@ec2-54-185-232-33.us-west-2.compute.amazonaws.com
Note that different operating systems on EC2 have different default user names. The Amazon AMI uses ec2-user@ while Ubuntu Linux ones use ubuntu@.
If you get a question about ‘The authenticity of host … can’t be established’, type yes [Enter] to allow the connection.
You can now explore your cloud computer on the command line just like you would your own. At this point, though, there is nothing to see in the home directory.
pwd
/home/ec2-user
ls
Note that this is all specific to the Amazon AMI, which uses yum for package management. If you are using an Ubuntu EC2 instance, replace yum with apt-get.
Each EC2 operating system comes with some pre-installed software. Always begin by updating all software currently on the EC2.
sudo yum upgrade -y
sudo yum update -y
The Amazon AMI comes with the AWS CLI but if you choose an AMI without it, you can install it like so.
sudo yum install awscli -y
You can install pretty much any program compatible with the OS in one way or another. There are a number of pre-packaged AWS programs that can be installed with
sudo amazon-linux-extras install PROGRAM_NAME
Please see the conda and R tutorials for specific download instructions. Or simply Google for command line instructions to install your software of choice.
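If you are unsure what is available through amazon-linux-extras on the Amazon AMI, you can list the topics before installing one.

amazon-linux-extras list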
As we’ve mentioned, S3 is the best place to put your data. You can directly access data on S3 without copying it to EC2 by mounting (“fusing”) a bucket to a directory on the instance. This saves a lot of time! Plus, data fused from S3 do not count against your EC2/EBS storage. Thus, you could have a 100 GB EBS volume with a directory fused to a 1 TB S3 bucket.
First, configure your AWS CLI on the cloud computer just as you did on your own computer.
aws configure
AWS Access Key ID [None]: #################
AWS Secret Access Key [None]: #################
Default region name [None]: us-west-2
Default output format [None]: text
Then install fuse and all its dependencies.
sudo amazon-linux-extras install -y epel
sudo yum install -y s3fs-fuse
Create a key for fuse, which is your UserKey:SecretKey in a file named .passwd-s3fs. Similar to your other key file, change its permissions to 600.
echo #################:################# > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs
Mount your S3 bucket to this instance. A bucket can be fused to multiple directories on multiple instances, but a directory on the instance can only be fused to one bucket at a time. Here, we make a data directory in the home directory and link our test bucket to it.
mkdir data/
s3fs kadm-test ~/data -o passwd_file=~/.passwd-s3fs \
-o default_acl=public-read -o uid=1000 -o gid=1000 -o umask=0007
Note the \ in the above code. This allows you to write one command across multiple lines in the terminal. You could also input the above all on one line without the \.
If for some reason you need to unmount this bucket, use the following on the directory it is fused to.

fusermount -u ~/data
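To check whether the bucket is currently mounted, you can look at the filesystem backing the directory; a fused bucket shows up with s3fs as its source.

df -h ~/data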
S3 bucket permissions can be a bit tricky. To avoid most of these issues, I recommend you create separate directories for 1) original data, 2) intermediate data, 3) final data.
As we did above, the original data is fused from S3 to your EC2 in data/. However, if you try to write a new file to this directory, you will get a permissions error; it becomes “read-only” once fused. So, as you work on the instance, save all outputs to a second directory like working/. Finally, copy any files you want to download to a third directory like results/ and download this to a different S3 bucket (more on this below).
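As a minimal sketch of this layout: data/ already exists from the fuse step above and results/ is created along with the EBS volume below, so only the intermediate directory needs to be made here.

mkdir ~/working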
Earlier we added a 1 GB elastic block storage (EBS) volume to our EC2. You can see all the volumes associated with your instance with lsblk.

Below only highlights the disks; you can ignore the part types (short for partition). You'll see that the main disk (nvme0n1) is mounted to the root directory / while the additional EBS volume (nvme1n1) has no mount point. The names you see may differ based on the OS you chose, but you should be able to identify the EBS volume by its size.
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme1n1 259:0 0 1G 0 disk
nvme0n1 259:1 0 8G 0 disk /
Any added EBS volumes are completely unformatted. So, we need to format the volume before using it. First, create a filesystem on the volume with sudo mkfs. Note that adding sudo to any command runs it as an administrator on your account.
sudo mkfs -t ext4 /dev/nvme1n1
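To double-check the formatting, you can inspect the device before and after; an unformatted volume reports simply ‘data’ while a formatted one reports the filesystem details.

sudo file -s /dev/nvme1n1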
Then, make a directory and mount it to the EBS volume. You will also need to change the permissions on the directory to 777 so you can read and write to it.
sudo mkdir ~/results
sudo mount /dev/nvme1n1 ~/results
sudo chmod 777 -R ~/results
Listing the volumes again, you will see that the EBS one is now mounted to the results/ directory.
lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme1n1 259:0 0 1G 0 disk /home/ec2-user/results
nvme0n1 259:1 0 8G 0 disk /
You can now use this additional storage just like it is part of the original EC2 machine. For example, copy and rename the test.txt file into your results directory with
cp ~/data/test.txt ~/results/test2.txt
And you will see it there in addition to an automatically created ‘lost+found’ directory, which the filesystem uses to hold files recovered by disk checks.
ls results/
lost+found test2.txt
If you need to download additional software, you have the option to either put it on the main EC2 or on an EBS volume. There are pros and cons to each. Using the main EC2 means that every time you turn off the instance, your programs are deleted. This prevents you from cluttering the machine and being charged for storage, but forces you to re-download and install next time you start it up again. On the other hand, programs on an EBS volume are only deleted if you delete that volume. Thus, you can re-use the same set of programs on any of your EC2 instances without re-downloading. But be careful of cost as the EBS storage accrues charges even when you have no instances currently running.
Similar to putting your data up on S3, you can copy it down from the EC2. This is easiest if you are NOT logged into your EC2 instance, because we know the EC2 server address (we used it to log in) but not our home computer's address.
Thus, exit your instance in the terminal with exit (or open another command line window) and secure copy scp a file you want with the standard command line setup of [where the file is] followed by [where you want it to go].
scp -i ~/AWS/keys/test.pem \
    ec2-user@ec2-54-185-232-33.us-west-2.compute.amazonaws.com:~/results/test2.txt ~/Desktop/
Or copy an entire directory by adding the recursive flag -r to scp (note that aws s3 sync only works with S3 addresses, not ssh-style ones). As before, make sure the source directory contains only files you want to copy.

scp -i ~/AWS/keys/test.pem -r \
    ec2-user@ec2-54-185-232-33.us-west-2.compute.amazonaws.com:~/results/ ~/Desktop/
You may also want to download your files through S3. This is helpful in that your files are then backed up on S3 and easily transferable back to EC2 if you find you need to do more analyses.
First, log back into your EC2 instance if you logged out in the previous section. Then, create a bucket for your results.
aws s3 mb s3://kadm-results
Then sync your results directory to S3. This will copy everything to S3, where you can download it from the online tool or CLI, similar to how we uploaded the test.txt file earlier.
aws s3 sync ~/results/ s3://kadm-results
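You can confirm everything arrived with a recursive listing of the results bucket.

aws s3 ls s3://kadm-results --recursive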
Both EC2 and S3 are charged per time of use. So when not in use, you need to turn off your EC2 instance. To turn off EC2, go to the EC2 resources at https://us-west-2.console.aws.amazon.com/ec2/. Select your instance and under Actions > Instance State, either stop it (so you can restart it later) or terminate it (deleting the instance permanently).
Either option pops up a warning about losing data. This refers to anything on the original EC2 build NOT to your EBS volumes. In either case, the instance is now off and you are not being charged for it.
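The same actions are available from the CLI. A sketch with a hypothetical instance ID (find yours in the console):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

The first command stops the instance; the second deletes it permanently.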
However, you are still being charged for any EBS volumes not deleted upon termination, so you also need to go to the EBS volumes tab, select your volume, and under ‘Actions’, either detach it (to keep it for later) or delete it.
Thus, unless you will be using the same programs or data very soon, it is best to save whatever is on your EBS volumes to S3 and delete the volume at this point.
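Deleting a volume can likewise be done from the CLI, again with a hypothetical volume ID.

aws ec2 delete-volume --volume-id vol-0123456789abcdef0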
Also, though not shown here, you should clean out your S3 buckets so they contain only the files you need.
These policies are specific to the Hawn organization.
Name S3 buckets in the format PI-project-type, like hawn-rstr-rnaseq. You may remove the project label if data span multiple projects, such as hawn-megaex.