EC2 Village

A collection of mini guides, tips, and tricks for using EC2 instances, focused on bioinformatics
Categories: EC2, AWS, Bioinformatics, S3

Author: LW Pembleton

Published: December 14, 2023

Modified: January 28, 2024

Photo by Karl Pawlowicz on Unsplash

This originally started as a collection of notes I had made after progressively setting up multiple EC2 dev environments. Then I thought I might as well tidy it up 🧹🧼 a bit and push it out as a blog post in the hope it will help others. I plan to progressively add to it, so shout out 💬 if you have suggestions.

The following is written with the default assumption of an Ubuntu 22.04 LTS base image.

Notes when creating a new EC2 instance

When creating your EC2 instance, ensure that under the storage settings you change the default gp2 to gp3. gp3 has higher baseline performance plus adjustable IOPS and throughput, and it's cheaper per GiB ($0.08 vs $0.10/GiB-month) 🤔 hmmm, maybe that is why AWS hasn't gotten around to making it the default when creating EC2 instances.

When you are selecting an instance type, you should check out the ec2-instance-selector and my blog post on Navigating EC2 Pricing - The Better Way.

Note

gp3 does now appear to be the default for Amazon Linux AMIs; however, gp2 is still the default for Ubuntu AMIs.
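If you have already launched an instance with a gp2 root volume, it can be switched to gp3 in place with the AWS CLI. A minimal sketch, assuming hypothetical instance and volume IDs (replace them with your own):

# find the volume ID attached to your instance (instance ID is a placeholder)
aws ec2 describe-volumes --filters Name=attachment.instance-id,Values=i-0123456789abcdef0

# change the volume type in place; the volume stays attached and usable (volume ID is a placeholder)
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3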

Installing AWS CLI

The AWS CLI is a must, enabling you to transfer files and results between S3 and your EC2 instance.

First, you will need to install unzip

sudo apt-get install unzip

Now you can download and install the AWS CLI

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

This should automatically appear in your path, and you can call it with aws
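You can quickly confirm the install worked with:

# confirm the CLI is on your path and check the installed version
aws --version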

You will need to configure it with your credentials to grant the necessary permissions 🔑

aws configure

Alternatively, a more secure option is to pass the relevant S3 permissions to your EC2 instance using an IAM instance profile. Details on setting up an IAM role with S3 permissions can be found here.

You can add the IAM profile when first creating your instance under ‘Advanced details’

or to an existing instance by selecting the instance, clicking the ‘Actions’ drop-down menu, choosing ‘Modify IAM role’ under ‘Security’, and then selecting the relevant IAM role. Then press ‘Update IAM role’.
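Whichever approach you use, a quick way to confirm that the instance is picking up credentials is to ask AWS who it thinks you are and list a bucket you expect to have access to (the bucket name below is just a placeholder):

# show the identity (user or role) the CLI is currently using
aws sts get-caller-identity

# list a bucket you expect to have access to (replace with your own bucket)
aws s3 ls s3://your-bucket-name/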

Mount a secondary volume such as a Nitro NVMe SSD

If you launch a storage-optimised EC2 instance, such as one from the I4i family, to take advantage of the super-fast local storage, you might get a rude shock at first: where is the SSD 😕? All you initially have access to is the EBS boot volume… Nothing to fear; it is there, it just hasn't been mounted yet.

Use the lsblk command to view your available disk devices and their mount points.

NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0          7:0    0  24.4M  1 loop /snap/amazon-ssm-agent/6312
loop1          7:1    0  55.6M  1 loop /snap/core18/2745
loop2          7:2    0  63.3M  1 loop /snap/core20/1879
loop3          7:3    0 111.9M  1 loop /snap/lxd/24322
loop4          7:4    0  53.2M  1 loop /snap/snapd/19122
nvme1n1      259:0    0   1.7T  0 disk 
nvme0n1      259:1    0     8G  0 disk 
├─nvme0n1p1  259:2    0   7.9G  0 part /
├─nvme0n1p14 259:3    0     4M  0 part 
└─nvme0n1p15 259:4    0   106M  0 part /boot/efi

Identify the name of the missing drive you want to mount, for example, nvme1n1

Now check whether there is a file system already on the device

sudo file -s /dev/<DRIVE-NAME>

If this just shows data as the output, there is no file system currently on the device. However, if it shows other information, for example

[ubuntu ~]$ sudo file -s /dev/nvme1n1
/dev/nvme1n1: SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)

It means there is already a file system.

Assuming there is no current file system (a likely scenario for new instances), you need to make one with the mkfs -t command before you can mount it.

sudo mkfs -t xfs /dev/<DRIVE-NAME>
Tip

If you get a mkfs.xfs is not found error, you just need to install the XFS tools first using:

sudo apt-get install xfsprogs

Once the filesystem is established, all you need to do is mount it at a directory.

You can use any directory, but in this example we are going to create a directory called data.

mkdir data
sudo mount /dev/<DRIVE-NAME> /home/ubuntu/data

To check this has all worked correctly, run lsblk again and you should see your drive of interest show up with the associated mount point directory.

NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0          7:0    0  24.4M  1 loop /snap/amazon-ssm-agent/6312
loop1          7:1    0  55.6M  1 loop /snap/core18/2745
loop2          7:2    0  63.3M  1 loop /snap/core20/1879
loop3          7:3    0 111.9M  1 loop /snap/lxd/24322
loop4          7:4    0  53.2M  1 loop /snap/snapd/19122
nvme1n1      259:0    0   1.7T  0 disk /home/ubuntu/data
nvme0n1      259:1    0     8G  0 disk 
├─nvme0n1p1  259:2    0   7.9G  0 part /
├─nvme0n1p14 259:3    0     4M  0 part 
└─nvme0n1p15 259:4    0   106M  0 part /boot/efi

It is likely you will need to update the ownership of the mounted directory, as it will have been mounted as root.

To change the ownership to your current user:

sudo chown -R $USER /home/ubuntu/data

For more details visit the AWS help page.

Caution

NVMe instance store drives on EC2 instances are ephemeral disks: they physically live on the host server, so all data is lost when the instance is stopped, hibernated or terminated (the instance may come back up on a different physical host). The mount itself also does not persist, so after any restart you will need to remount the drive (and recreate the filesystem if the data has been lost).

Resource Monitoring

atop

atop is a Linux tool that provides real-time monitoring and detailed insights into system resource usage, displaying data on CPU, memory, disk, and network activities. It offers a comprehensive overview of processes, their resource consumption, and system-level performance metrics.

However, more importantly, it logs resource usage so you can go back and review CPU and/or memory spikes to determine what is causing errors or early terminations of commands, and whether a different instance size is necessary.

Installing atop:

sudo apt-get update
sudo apt-get -y install atop sysstat

Configure the atop reporting frequency and what statistics it captures: by default atop logs every 600 seconds (10 minutes); let's increase that to every 60 seconds to help pick up memory spikes that might be killing tasks.

sudo sed -i 's/^LOGINTERVAL=600.*/LOGINTERVAL=60/' /usr/share/atop/atop.daily
sudo sed -i -e 's|5-55/10|*/1|' -e 's|every 10 minutes|every 1 minute|' -e 's|debian-sa1|debian-sa1 -S XALL|g' /etc/cron.d/sysstat
sudo bash -c "echo 'SA1_OPTIONS=\"-S XALL\"' >> /etc/default/sysstat"

Activate and restart the services:

sudo systemctl enable atop.service cron.service sysstat.service

sudo systemctl restart atop.service cron.service sysstat.service
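You can confirm the services are up and running with:

# check that the monitoring services are active
systemctl status atop.service sysstat.service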

atop creates log files in /var/log/atop, saved with the format atop_ccyymmdd

To view historical log results, run the command:

atop -r logfile

Where logfile is the path and name of the log file for the relevant day. Once viewing the log results, press t to move forward to the next resource usage snapshot and Shift + t to move back.
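For example, to open a particular day's log (the date in the filename below is just illustrative) and start the view near a time of interest:

# open a specific day's log and begin the view at 09:00 (atop's -b flag takes hhmm)
atop -r /var/log/atop/atop_20240115 -b 0900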

Installing the latest stable R on Ubuntu

Step 1: Obtain the R Project GPG Key

To start, you need to acquire the R Project’s public GPG key. This key will ensure the authenticity and security of the software you’re about to install. Open your terminal and use the following command:

wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo gpg --dearmor -o /usr/share/keyrings/r-project.gpg 

Step 2: Add the R Source List to APT

With the GPG key in place, it’s time to add the R source list to APT. This list is essential for Ubuntu to recognise the repository where R is located. Run the following command:

echo "deb [signed-by=/usr/share/keyrings/r-project.gpg] https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/" | sudo tee -a /etc/apt/sources.list.d/r-project.list 

Step 3: Update Your Package List

To make Ubuntu aware of the new source you’ve added, update your package list with the following command:

sudo apt update 

Step 4: Install R with APT

Now, you’re ready to install R on your system. Execute this command:

sudo apt install --no-install-recommends r-base 
Tip

If prompted to confirm the installation, simply press ‘y’ and then ‘Enter’ to proceed.

That’s it! You’ve successfully installed the latest stable version of R on your Ubuntu machine.
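You can confirm which version was installed with:

# print the installed R version
R --version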

Install relevant R packages

Installing packages in R on Windows is a relatively straightforward process, typically accomplished using the `install.packages()` function. However, the process can be a bit more involved on Linux, as you’ll often need to manually install external system dependencies. In this article, I’ll provide you with essential details for common packages, making the installation process smoother.

In many cases, you’ll find that the ‘make’ command is a fundamental requirement. While you can install ‘make’ as a standalone package, it’s usually more beneficial to install a comprehensive set of build tools through the ‘build-essential’ package.

To install ‘build-essential’ on a Debian-based Linux system like Ubuntu, run the following command in your terminal:

sudo apt-get install build-essential

The remotes R package provides a simple method to find out what system requirements certain R packages have on Linux. Below is an example that lists the requirements of the tidyverse package.

install.packages("remotes") 
remotes::system_requirements(os = "ubuntu", os_release = "20.04", package = "tidyverse")

Tidyverse

The tidyverse is a large collection of packages, and therefore, naturally, there is a reasonable list of dependencies. The following should install all the required system libraries.

sudo apt install libssl-dev libcurl4-openssl-dev unixodbc-dev libxml2-dev libmariadb-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev zlib1g-dev libicu-dev

To install tidyverse via CRAN:

install.packages("tidyverse")

While using remotes to determine dependencies works most of the time, there are still cases where the dependencies are ‘unknown’ until you try to install the package. In these scenarios you will need to revert to the painstaking 🧗 back and forth of installing a dependency, retrying the R package installation, installing the next required dependency, and so on… BGLR is one such example 👇

BGLR

The genomic selection package BGLR has a few additional dependencies.

sudo apt-get install libblas-dev liblapack-dev gfortran

To install BGLR via CRAN:

install.packages("BGLR")

Posit Package Manager

Shoutout to Josiah for this tip.

If you frequently find yourself installing numerous or large packages, it’s worth exploring the Posit package manager. Among its various features, one stands out: the ability to serve precompiled binary packages tailored for different system environments. This feature proves particularly beneficial when dealing with sizable R packages like dplyr that traditionally require compilation.

To integrate Posit into your workflow, visit the official website at https://packagemanager.posit.co/client/#/repos/cran/setup. Provide your system information, and under ‘Environment,’ choose ‘some other R outside RStudio IDE.’ Now, copy the command listed under ‘Setup Instructions’ and paste it into your R session. It should resemble something similar to the following:

options(repos = c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest"))

By executing this command, you redirect the install.packages() function to retrieve binaries from the Posit package manager, potentially saving valuable time during installations.
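If you want this repository to be picked up in every R session rather than setting it each time, one option is to append the line to your ~/.Rprofile. A small sketch (adjust the URL to match the one the Posit setup page gave you for your system):

# make the Posit binary repo the default for all future R sessions
echo 'options(repos = c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest"))' >> ~/.Rprofile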

Working with S3 objects within R

If you search for an R package for working with AWS S3 files, you will likely come across aws.s3. It is a great package and makes dealing with S3 files within R super simple. However, it doesn’t appear to be maintained anymore, with no updates since May 2020. If you are storing your credentials within your EC2 instance via aws configure (which I don’t recommend) then aws.s3 still seems to work great, and many people still use it. However, if you provide your S3 credentials via an IAM profile it appears to be problematic, and the aws.ec2metadata update trick no longer works (mileage for others may vary).

For this reason I suggest you check out paws, which is still in active development. It should automatically detect your credentials, and there is a code example for downloading and uploading S3 files.

I plan to add an example snippet in time showing how to read a file directly from S3 into an R object using paws, as the process is not as straightforward as some would expect.

Installing bioinformatic software

PLINK2

Installing plink2 is as simple as grabbing the Linux 64-bit Intel zipped binary from the website and unpacking it. Then just move the executable to your bin directory.

For example:

wget https://s3.amazonaws.com/plink2-assets/alpha5/plink2_linux_x86_64_20231011.zip
unzip plink2_linux_x86_64_20231011.zip
sudo mv plink2 /bin/
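A quick check that the binary is on your path and working:

# print the plink2 version string
plink2 --version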

VCFtools

Grab the tarball from the GitHub release page and unpack it.

wget https://github.com/vcftools/vcftools/releases/download/v0.1.16/vcftools-0.1.16.tar.gz
tar -xvf vcftools-0.1.16.tar.gz

The PERL5LIB environment variable needs to be set to include the directory containing the Vcf.pm module.

export PERL5LIB=/path/to/your/vcftools-directory/src/perl/

Before you can compile the program you need to install the following dependencies:

sudo apt install build-essential
sudo apt install pkg-config
sudo apt install zlib1g-dev

You can now compile the program.

cd vcftools-0.1.16/
./configure
make
sudo make install

VCFtools is installed to a location already on your path, so you should now be able to call it with vcftools.

BCFtools

A similar process to VCFtools: grab the BCFtools tarball from the latest releases page, then run configure, make and install.
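As a sketch, the steps look like the following. The version number is just the release that was current around the time of writing, so check the releases page for the latest; the bundled HTSlib may also want libbz2-dev and liblzma-dev installed before configuring.

# extra compression libraries used by the bundled HTSlib
sudo apt install libbz2-dev liblzma-dev

# download, unpack and build BCFtools (example version; check the releases page for the latest)
wget https://github.com/samtools/bcftools/releases/download/1.19/bcftools-1.19.tar.bz2
tar -xjf bcftools-1.19.tar.bz2
cd bcftools-1.19/
./configure
make
sudo make install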

What’s next

more to come…