RNA-Seq From Scratch: Part 0 — Installing Software

As promised in my short primer last time, we’re about to get started on our journey of really analyzing RNA-Seq data. As a side note, this is not another cute little tutorial with a small amount of data to get you started. This is the real deal: lots of data, real-world conditions, and all the big scary problems that come with them.

First things first, let’s talk about the experiment setup. I have recently become interested in pseudo-genes, and in order to come up with a robust way to detect them, I’d first like to curate a large dataset of public gene expression data. The organism I have chosen is Arabidopsis thaliana. Why?

  • I work with plants, so it’s right at home for me

Further down the line, I want to integrate transcriptomics, genomics and proteomics to curate a high-confidence set of pseudo-genes in A. thaliana, which also makes for a great connection between different blog posts here.

Aims for today:

  • Install Singularity for software management

Installing Singularity from source:

First things first. The OS of the bioinformatics world is, for the most part, Linux. If you use Windows, install WSL 2; if you use macOS, run Linux in VirtualBox. There are tons of guides on the installation of those out there.

Until we get to the statistical analysis, we’ll only need to install (in the traditional sense) a single tool: Singularity. Singularity is a container system that will let us create single-file executable images for everything we’ll need.

We start off by installing essential system dependencies. I’m running Ubuntu 20.04.1 for this tutorial. Most of these steps are lifted directly from the Singularity docs with small adaptations:
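
A minimal sketch of this step, assuming the package list from the Singularity 3.x installation guide for Debian/Ubuntu (package names may differ on other releases):

```shell
# Assumption: package list follows the Singularity 3.x install guide for Ubuntu.
sudo apt-get update
sudo apt-get install -y \
    build-essential \
    libssl-dev \
    uuid-dev \
    libgpgme11-dev \
    squashfs-tools \
    libseccomp-dev \
    wget \
    pkg-config \
    git \
    cryptsetup
```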

Singularity is written in Go, so if we want to install from source (and we do, since they don’t offer official packages), we’ll need Go (version 1.13 or newer):
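
Something along these lines should do. The Go version (1.14.12), the 64-bit Linux build and the /usr/local install location are assumptions on my part, so adjust them to your system:

```shell
# Assumption: Go 1.14.12 for 64-bit Linux, unpacked to /usr/local/go.
GO_VERSION=1.14.12
GO_OS=linux
GO_ARCH=amd64
GO_TARBALL="go${GO_VERSION}.${GO_OS}-${GO_ARCH}.tar.gz"

# Download the official tarball, unpack it as root, then clean up.
wget "https://dl.google.com/go/${GO_TARBALL}"
sudo tar -C /usr/local -xzf "${GO_TARBALL}"
rm "${GO_TARBALL}"
```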

Go is pretty strict with its directory layout, so let’s set a Go home directory and add that to our PATH and GOPATH environment variables:
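
A sketch of that setup, assuming Go was unpacked to /usr/local/go and that ~/go serves as the Go home directory (both are assumptions, pick whatever suits you):

```shell
# Assumption: Go lives in /usr/local/go and ~/go is the Go home directory.
export GOPATH="${HOME}/go"
export PATH="/usr/local/go/bin:${PATH}:${GOPATH}/bin"
mkdir -p "${GOPATH}"

# Persist the settings for future shells. The single quotes keep the
# variables unexpanded until .bashrc is actually read.
echo 'export GOPATH="${HOME}/go"' >> "${HOME}/.bashrc"
echo 'export PATH="/usr/local/go/bin:${PATH}:${GOPATH}/bin"' >> "${HOME}/.bashrc"
```

Exporting the variables directly, in addition to writing them to .bashrc, means the current shell picks them up right away.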

We can now download Singularity and its build dependencies using go get. The second command will complain about missing Go files; you can ignore that:
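
Roughly like this. The first command is my assumption, based on older Singularity 3.x docs that fetched the dep tool; newer releases may not need it. Both commands assume a Go version that still supports GOPATH mode:

```shell
# Assumption: fetching `dep` follows older Singularity 3.x docs and may be
# unnecessary for recent releases.
go get -u github.com/golang/dep/cmd/dep

# Download the Singularity source into $GOPATH/src without building it.
# This is the command that complains about missing Go files; `|| true`
# keeps a script going in case the complaint comes with a failing exit status.
go get -d github.com/sylabs/singularity || true
```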

Next, we’ll check out the most recent stable release of Singularity from the git repository:
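
As a sketch, with v3.6.4 standing in as an example tag (an assumption; check the project’s releases page for the current stable version):

```shell
# Assumption: v3.6.4 is only an example tag, not necessarily the latest release.
RELEASE_TAG=v3.6.4

# The source was placed here by `go get -d`.
cd "${GOPATH}/src/github.com/sylabs/singularity"
git fetch
git checkout "${RELEASE_TAG}"
```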

And now we can finally build and install:
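
The build follows the usual three steps from the Singularity 3.x docs. The source path below assumes the go get layout from earlier:

```shell
# Run from the Singularity source directory checked out earlier.
cd "${GOPATH}/src/github.com/sylabs/singularity"

# Configure, compile, and install system-wide (the install step needs root).
./mconfig
make -C ./builddir
sudo make -C ./builddir install

# Quick sanity check that the binary is on the PATH.
singularity --version
```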

We can keep the files that Singularity creates in a dedicated directory inside our home directory and add that directory to the PATH variable, so the binaries are always accessible:
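
For example, with ~/singularity as a hypothetical directory name (any name works):

```shell
# Assumption: ~/singularity is a hypothetical directory for storing images.
IMAGE_DIR="${HOME}/singularity"
mkdir -p "${IMAGE_DIR}"

# Make the directory available now and in future shells.
export PATH="${PATH}:${IMAGE_DIR}"
echo 'export PATH="${PATH}:${HOME}/singularity"' >> "${HOME}/.bashrc"
```

Since image files are executable, anything stored in that directory can then be called by name from anywhere.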

Testing Singularity

We can finally build our very first Singularity image. I’ve chosen to build NCBI’s SRA Toolkit here, since we’ll need it really soon anyway:
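
One way to do this is to convert a Docker image. The sketch below assumes the ncbi/sra-tools repository on Docker Hub as the source and a hypothetical ~/singularity directory as the destination:

```shell
# Assumption: image source is ncbi/sra-tools on Docker Hub; the output
# location ~/singularity/sra-tools.sif is a hypothetical choice.
mkdir -p "${HOME}/singularity"
singularity build "${HOME}/singularity/sra-tools.sif" docker://ncbi/sra-tools
```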

We’ll need to configure the SRA Toolkit before first use:
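
The toolkit ships vdb-config for this. A sketch, assuming the image sits at the hypothetical path ~/singularity/sra-tools.sif:

```shell
# Assumption: the image is stored at ~/singularity/sra-tools.sif.
# Opens the interactive configuration screen; saving and exiting
# writes the settings to your home directory.
singularity exec "${HOME}/singularity/sra-tools.sif" vdb-config --interactive
```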

And now we’re all set up and can start downloading public data from the Sequence Read Archive (SRA):
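
A download could look roughly like this. The accession is a placeholder and the image path is the same hypothetical one as before, so replace both with your own values:

```shell
# Assumption: ACCESSION is a placeholder, not a real run; the image path is hypothetical.
ACCESSION="SRRXXXXXXX"
SRA_IMAGE="${HOME}/singularity/sra-tools.sif"

# Fetch the run from the archive, then convert it to FASTQ.
singularity exec "${SRA_IMAGE}" prefetch "${ACCESSION}"
singularity exec "${SRA_IMAGE}" fasterq-dump "${ACCESSION}"
```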
