As promised in my short primer last time, we’re about to get started on our journey of really analyzing RNA-Seq data. Just as a side note: this is not another cute little tutorial with a small amount of data to get you started. This is the real deal. Lots of data, real-world conditions, and all the big scary problems that come with them.
If you prefer a guided tour, you can watch the video here:
First things first, let’s talk about the experiment setup. I have recently become interested in pseudo-genes, and in order to come up with a robust way to detect them, I’d first like to curate a large data set of public gene expression data. The organism I have chosen is Arabidopsis thaliana. Why?
- I work with plants, so it’s right at home for me
- There is an incredible amount of data available for it, as it is probably the best-studied plant
- I would like to apply the resulting pipeline to plants first and foremost
Further down the line, I want to integrate transcriptomics, genomics and proteomics to curate a high-confidence set of pseudo-genes in A. thaliana, so this also makes for a great connection between different blog posts here.
Aims for today:
- Install Singularity for software management
- Install our first Singularity image: NCBI’s SRA toolkit
- Download a single public RNA-Seq run using the Singularity image
Installing Singularity from source:
First things first. The OS of the bioinformatics world is, for the most part, Linux. If you use Windows, install WSL(2); if you use macOS, run Linux in VirtualBox. There are tons of guides on installing those out there.
Until we get to the statistical analysis, we’ll only need to install (in the traditional sense) a single tool: Singularity. Singularity is a container system that will let us create single file executable images for everything we’ll need.
We start off by installing essential system dependencies. I’m running Ubuntu 20.04.1 for this tutorial. Most of these steps are lifted directly from the Singularity docs with small adaptations:
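Here is a sketch of that dependency step. The package list follows the Singularity 3.x install guide as far as I can tell, so double-check it against the docs for the release you’re building. `install_build_deps` is just a name I made up so that nothing gets installed until you call it.

```shell
# Build dependencies for Singularity 3.x on Debian/Ubuntu.
# Assumption: package names as listed in the Singularity 3.x install guide.
SINGULARITY_BUILD_DEPS="build-essential libssl-dev uuid-dev libgpgme11-dev squashfs-tools libseccomp-dev wget pkg-config git cryptsetup"

# install_build_deps is a hypothetical helper name; nothing runs until it is called.
# Word splitting of the package list is intentional.
install_build_deps() {
    sudo apt-get update &&
        sudo apt-get install -y $SINGULARITY_BUILD_DEPS
}
```

Running `install_build_deps` afterwards kicks off the actual installation.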
Singularity is written in Go, so if we want to install from source (and we do, since they don’t offer official packages) we’ll need to install Go first:
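A minimal sketch of a manual Go install, assuming a 64-bit x86 Linux machine and the official tarballs. The helper names are hypothetical, and the version is deliberately left as an argument, so pick one that satisfies the minimum the Singularity docs ask for.

```shell
# Sketch: install an official Go tarball system-wide under /usr/local.
# Assumptions: 64-bit x86 Linux and the dl.google.com download location.

# Print the tarball URL for a given Go version.
go_tarball_url() {
    echo "https://dl.google.com/go/go${1}.linux-amd64.tar.gz"
}

# install_go is a hypothetical helper; it needs the Go version as its only argument.
install_go() {
    [ -n "${1:-}" ] || { echo "usage: install_go <version>" >&2; return 1; }
    tarball="go${1}.linux-amd64.tar.gz"
    wget -O "/tmp/${tarball}" "$(go_tarball_url "$1")" &&
        sudo tar -C /usr/local -xzf "/tmp/${tarball}" &&
        rm "/tmp/${tarball}"
}
```

Calling `install_go` with a version number downloads that release and unpacks it to `/usr/local/go`.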
Go is pretty strict with its directory layout, so let’s set a Go home directory and add that to our GOPATH environment variables:
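Roughly, that could look like the following. I’m assuming Go was unpacked to `/usr/local/go` and that `~/go` is the directory you want to use; adjust both to your setup.

```shell
# Keep Go's workspace in ~/go and expose both the Go toolchain and
# any tools installed with `go get` on the PATH.
# Assumption: Go was unpacked to /usr/local/go.
export GOPATH="${HOME}/go"
mkdir -p "${GOPATH}/bin" "${GOPATH}/src"
export PATH="/usr/local/go/bin:${PATH}:${GOPATH}/bin"

# To make this permanent, append the two export lines to ~/.bashrc.
```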
We can now already download Singularity and its build dependencies using go get. The second command will complain about missing Go files; ignore that:
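As a sketch, the two commands from the older Singularity 3.x install guide are wrapped in a helper below. The function name is hypothetical, and I’m assuming a Go release that still works in GOPATH mode; newer Go versions handle `go get` differently.

```shell
# fetch_singularity_sources is a hypothetical helper name.
# Assumption: GOPATH mode, as in the older Singularity 3.x install guide.
fetch_singularity_sources() {
    # dep is the Go dependency tool that guide installs first.
    go get -u github.com/golang/dep/cmd/dep
    # -d downloads without building; the complaint about missing Go files shows up here.
    go get -d github.com/sylabs/singularity
    # Succeed as long as the source tree actually arrived.
    [ -d "${GOPATH:-$HOME/go}/src/github.com/sylabs/singularity" ]
}
```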
Next we’ll get the most recent stable release of Singularity from the git repository:
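A small sketch of the checkout, assuming the sources live under the Go workspace. The release tag is not hard-coded here; look up the current stable tag on the project’s releases page and pass it in. The helper name is made up.

```shell
# checkout_singularity_release is a hypothetical helper; pass the release tag you want.
# Note: it changes into the source directory and stays there for the build step.
checkout_singularity_release() {
    [ -n "${1:-}" ] || { echo "usage: checkout_singularity_release <tag>" >&2; return 1; }
    cd "${GOPATH:-$HOME/go}/src/github.com/sylabs/singularity" || return 1
    git fetch --tags &&
        git checkout "$1"
}
```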
And now we can finally build and install:
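The build itself is the usual configure, compile, install sequence. This sketch assumes the `mconfig` script and `builddir` layout of the Singularity 3.x source tree; `build_singularity` is a hypothetical wrapper that takes the checkout directory.

```shell
# build_singularity is a hypothetical helper; pass the source checkout directory.
build_singularity() {
    cd "${1:-.}" || return 1
    [ -x ./mconfig ] || { echo "no mconfig here, is this the Singularity source tree?" >&2; return 1; }
    ./mconfig &&                          # configure; creates ./builddir
        make -C ./builddir &&             # compile
        sudo make -C ./builddir install   # install system-wide
}
```

Once it finishes, `singularity --version` is a quick way to check that the install worked.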
We can organize the files that Singularity creates in our home directory and also add that directory to the PATH variable, so we’ll always have the binaries accessible:
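One way to do this, purely as an illustration: keep all image files in a single directory and put it on the PATH. The directory and variable names below are made up for this sketch, so use whatever layout suits you.

```shell
# Assumption: one directory for all image files; SIF_DIR and the path are hypothetical.
export SIF_DIR="${HOME}/singularity_images"
mkdir -p "${SIF_DIR}"

# Image files are executable, so with this directory on the PATH they can be called by name.
export PATH="${PATH}:${SIF_DIR}"

# To make this permanent, append the two export lines to ~/.bashrc.
```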
We can finally build our very first Singularity image. I’ve chosen to build NCBI’s SRA toolkit here, since we’ll need it really soon anyway:
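A sketch of the image build, assuming we convert NCBI’s `ncbi/sra-tools` image from Docker Hub rather than writing a definition file. The helper name and the output file name are up to you.

```shell
# build_sra_image is a hypothetical helper; pass the output file, e.g. sra-tools.sif.
# Assumption: the image is pulled from the ncbi/sra-tools repository on Docker Hub.
build_sra_image() {
    [ -n "${1:-}" ] || { echo "usage: build_sra_image <output.sif>" >&2; return 1; }
    singularity build "$1" docker://ncbi/sra-tools
}
```

Building from a `docker://` source should not need root, so this can run as a normal user.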
We’ll need to configure the SRA tools before first use:
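The configuration is done with `vdb-config`, run from inside the image. A minimal sketch, with a made-up helper name, that expects the path to the image file built earlier:

```shell
# configure_sra_tools is a hypothetical helper; pass the path to the SRA toolkit image.
configure_sra_tools() {
    [ -f "${1:-}" ] || { echo "usage: configure_sra_tools <image.sif>" >&2; return 1; }
    # Opens the text-based configuration screen of the SRA toolkit.
    singularity exec "$1" vdb-config --interactive
}
```

As far as I know, saving and exiting that screen once is enough for a default setup.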
And now we’re all set up and can start downloading public data from the Sequence Read Archive:
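Here is a sketch of a single-run download using `prefetch` and `fasterq-dump` from the image. The function names are hypothetical, the accession is left as an argument rather than picked for you, and the image path should be absolute because the helper changes directory.

```shell
# Succeed if the argument looks like an SRA run accession (SRR, ERR or DRR plus digits).
is_run_accession() {
    printf '%s\n' "${1:-}" | grep -Eq '^[SED]RR[0-9]+$'
}

# download_run is a hypothetical helper: download_run <image.sif> <accession> [outdir]
download_run() {
    image="${1:-}"; accession="${2:-}"; outdir="${3:-.}"
    is_run_accession "$accession" || { echo "not a run accession: $accession" >&2; return 1; }
    (
        cd "$outdir" || exit 1
        # prefetch downloads the run in SRA format; fasterq-dump converts it to FASTQ.
        singularity exec "$image" prefetch "$accession" &&
            singularity exec "$image" fasterq-dump --split-files "$accession"
    )
}
```

For a paired-end run, `--split-files` should leave one FASTQ file per read direction in the output directory.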