RNA-Seq (RNA sequencing) has rapidly become the preferred method for measuring gene expression, providing an accurate proxy for absolute quantitation of messenger RNA (mRNA) levels within a sample (Mortazavi et al., 2008). Methods for RNA-Seq data handling, quality control (QC) and downstream statistical analysis have matured rapidly, benefiting substantially from the extensive body of literature on the analysis of microarray data and its application to measuring gene expression. Although the analysis of RNA-Seq data remains more challenging than that of microarray data, the field has now advanced to the point where it is possible to define mature pipelines and guidelines for such analyses. However, while commercial software options such as the CLCbio CLC Genomics Workbench exist, we are not aware of any fully integrated open-source pipeline for performing the pre-processing steps involved. Both the technology behind RNA-Seq and the associated analysis methods continue to evolve at a rapid pace, and not all properties of the data are yet fully understood. Hence, the steps and software tools that could be used in such a pipeline have changed rapidly in recent years, and only recently has it become possible to propose a de facto standard pipeline. Although proposing such a skeleton pipeline is now feasible, a number of caveats must be kept in mind in order to produce biologically and statistically sound results.
Here we present what is, in our opinion, a mature pipeline to pre-process and analyze RNA-Seq data. Wherever possible, we identify the most obvious pitfalls one will face while working with RNA-Seq data and point out caveats that should be considered. An overview of the pipeline is presented in Figure 1 and is detailed below. Briefly, the first step upon receiving the raw data from a sequencing facility is to conduct initial QC checks. These QC results indicate whether the data require filtering to remove ribosomal RNA (rRNA) contamination, whether sequence reads require ‘trimming’ to remove low-quality bases, and whether reads need to be trimmed to remove sequencing adapters. These data pre-processing steps must be performed with care to ensure that the required data cleaning is adequately performed while avoiding the introduction of any potential bias, for example by removing sequences of interest. Once the data are deemed of sufficient quality, they are aligned/mapped (the two terms are used synonymously below) against the chosen reference; this can be a model organism genome, a novel draft genome or a de novo assembled transcriptome. Each of these alternatives has advantages and caveats, some of which are detailed below. Once the RNA-Seq reads have been mapped to the reference, the subsequent analysis steps are determined by the project goals and the scientific questions that one wishes to address. For example, distinctly different analysis methods are required depending on whether interest lies in identifying sequence variants or in exploring expression level differences between sample groups, i.e. differential expression (DE). These are the two most popular uses of RNA-Seq data and are hence briefly introduced at the end of the current protocol. However, as these analyses are complex, we refer the reader to more comprehensive literature.
There are many additional analyses that RNA-Seq data can be used for, including examining allele-specific expression and RNA editing, among others.
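To make the read-trimming steps mentioned in the overview above more concrete, the following Python sketch clips a sequencing adapter from the 3' end of a read and then removes trailing low-quality bases. It is only an illustration and not part of the pipeline presented here: the function names, the Phred+33 quality encoding, the Q20 threshold, the minimum overlap, the minimum read length and the adapter sequence are all assumptions chosen for the example. In practice, dedicated trimming tools that tolerate mismatches and handle paired-end reads would be used instead.

```python
from typing import List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Decode an ASCII-encoded FASTQ quality string into Phred scores."""
    return [ord(ch) - offset for ch in quality]


def trim_adapter(seq: str, quality: str, adapter: str,
                 min_overlap: int = 5) -> Tuple[str, str]:
    """Clip an exact adapter match, or an adapter prefix at the read's 3' end.

    Only exact matches are considered (a simplification for illustration).
    """
    cut = seq.find(adapter)
    if cut == -1:
        # Look for the longest adapter prefix that ends the read.
        for k in range(min(len(adapter), len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                cut = len(seq) - k
                break
    if cut == -1:
        return seq, quality
    return seq[:cut], quality[:cut]


def trim_low_quality_tail(seq: str, quality: str,
                          min_q: int = 20) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def clean_read(seq: str, quality: str, adapter: str, min_q: int = 20,
               min_length: int = 5) -> Tuple[str, str, bool]:
    """Adapter-trim, then quality-trim; flag reads that become too short."""
    seq, quality = trim_adapter(seq, quality, adapter)
    seq, quality = trim_low_quality_tail(seq, quality, min_q)
    return seq, quality, len(seq) >= min_length


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"          # illustrative adapter sequence (assumption)
    read = "ACGTACGTAC" + "AGATCGG"    # read ending in a partial adapter
    qual = "IIIIIIII##" + "IIIIIII"    # two low-quality bases before the adapter
    print(clean_read(read, qual, adapter))
```

The sketch removes the adapter before quality trimming so that low-quality bases hidden upstream of the adapter are exposed and clipped as well; reads that become shorter than the (hypothetical) minimum length are flagged rather than silently dropped, so that the amount of data removed can be inspected for potential bias.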
The pipeline we describe in the following is made publicly available and we aim to shortly release a worked example using a representative dataset as a companion to this protocol. The worked example will exemplify all of the steps detailed below and will demonstrate the influence of different biases and steps taken to mitigate them. The guideline will be available at https://bioinformatics.upsc.se/.
Before reading on, we wish to stress that, because the analysis of RNA-Seq data is still a rapidly maturing field, one must always keep an open mind and challenge the results obtained to ensure that a technical artifact does not underlie an observed difference, particularly where unexpected results are obtained. Only when such possibilities have been considered and eliminated can one assume that the observed results are likely of biological origin.
1 Umeå Plant Science Center, Department of Plant Physiology, Umeå University, 90187, Umeå, Sweden
2 Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, 1432 Ås, Norway
3 Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden
Corresponding author: Nicolas Delhomme