Advancing Epigenetics Towards Systems Biology

Guidelines for RNA-Seq data analysis (prot 67)

Nicolas Delhomme1, Niklas Mähler2, Bastian Schiffthaler1, David Sundell1, Chanaka Mannapperuma1, Torgeir R. Hvidsten1,2, Nathaniel R. Street1,3


RNA-Seq (RNA sequencing) has rapidly become the preferred method for measuring gene expression, providing an accurate proxy for absolute quantitation of messenger RNA (mRNA) levels within a sample (Mortazavi et al, 2008). Data handling, quality control (QC) and downstream statistical analysis methods for RNA-Seq have matured quickly, benefiting substantially from the extensive body of literature on the analysis of microarray data and its application to measuring gene expression. Although the analysis of RNA-Seq data remains more challenging than that of microarray data, the field has now advanced to the point where it is possible to define mature pipelines and guidelines for such analyses. However, with the exception of commercial software such as the CLCbio CLC Genomics Workbench, we are not aware of any fully integrated open-source pipeline for performing the required pre-processing steps. Both the technology behind RNA-Seq and the associated analysis methods continue to evolve at a rapid pace, and not all properties of the data are yet fully understood. Hence, the steps and software tools that could be used in such a pipeline have changed rapidly in recent years, and only recently has it become possible to propose a de facto standard pipeline. Although proposing such a skeleton pipeline is now feasible, a number of caveats remain to be kept in mind in order to produce biologically and statistically sound results.

Here we present what is, in our opinion, a mature pipeline to pre-process and analyze RNA-Seq data. Wherever possible, we identify the most obvious pitfalls one will face while working with RNA-Seq data and point out caveats that should be considered. An overview of the pipeline is presented in Figure 1 and is detailed below. Briefly, the first step upon receiving the raw data from a sequencing facility is to conduct initial QC checks. The QC results indicate whether the data require filtering to remove ribosomal RNA (rRNA) contamination, whether reads require ‘trimming’ to remove low-quality bases, and whether reads need to be trimmed to remove sequencing adapters. These pre-processing steps must be performed with care so that the data are adequately cleaned without introducing bias, for example by removing sequences of interest. Once the data are deemed of sufficient quality, they are aligned/mapped (the two terms are used synonymously in the following) against the chosen reference, which can be a model organism genome, a novel draft genome or a de novo assembled transcriptome. Each of these alternatives has advantages and caveats, some of which are detailed below. Once the RNA-Seq reads have been mapped to the reference, the subsequent analysis steps are determined by the project goals and the scientific questions that one wishes to address. Distinctly different analysis methods are required depending on whether interest lies, for example, in identifying sequence variants or in exploring expression level differences between sample groups, i.e. differential expression (DE). These are the two most popular uses of RNA-Seq data and are hence briefly introduced at the end of the current protocol. However, as these analyses are complex, we refer the reader to more complete literature.
There are many additional analyses that RNA-Seq data can be used for, including examining allele-specific expression and RNA editing, among others.
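
To make the trimming steps mentioned in the overview more concrete, the following is a minimal sketch of 3' quality trimming and adapter clipping on a single read. It is not the pipeline described in this protocol: the function names, the quality threshold of 20, the Phred+33 encoding and the example read and adapter sequence are all assumptions chosen for illustration, and dedicated trimming tools use more robust algorithms (for example, tolerating mismatches in adapter matches).

```python
# Illustrative sketch only; all names and thresholds are hypothetical.

def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_line]


def trim_3prime(sequence, quality_line, min_quality=20):
    """Remove trailing bases whose quality score is below min_quality."""
    scores = phred_scores(quality_line)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_line[:end]


def clip_adapter(sequence, quality_line, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    position = sequence.find(adapter)
    if position == -1:
        return sequence, quality_line
    return sequence[:position], quality_line[:position]


# Hypothetical read: 8 bases of insert followed by an adapter-like sequence,
# with four low-quality bases at the 3' end.
read_seq = "ACGTACGTAGATCGGAAGAGC"
read_qual = "IIIIIIIIIIIIIIIII##!!"
example_adapter = "AGATCGGAAGAGC"

clipped_seq, clipped_qual = clip_adapter(read_seq, read_qual, example_adapter)
trimmed_seq, trimmed_qual = trim_3prime(clipped_seq, clipped_qual)
print(trimmed_seq, trimmed_qual)
```

In this sketch the adapter is clipped before quality trimming; which order is appropriate, and how aggressive the thresholds should be, depends on the data and should be checked against the QC results so that sequences of interest are not removed.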
The pipeline we describe in the following is publicly available, and we aim to release shortly a worked example using a representative dataset as a companion to this protocol. The worked example will illustrate all of the steps detailed below and will demonstrate the influence of different biases and the steps taken to mitigate them. The guideline will be available at

Before reading on, we wish to stress that, because the analysis of RNA-Seq data is still a rapidly maturing field, one must always keep an open mind and challenge the results obtained to be sure that a technical artifact does not underlie an observed difference, particularly where unexpected results are obtained. Only when such possibilities have been considered and eliminated can one assume that observed results are likely of biological origin.


1 Umeå Plant Science Center, Department of Plant Physiology, Umeå University, 90187, Umeå, Sweden
2 Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, 1432 Ås, Norway
3 Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden

Corresponding author: Nicolas Delhomme
