Advancing Epigenetics Towards Systems Biology

Guidelines for RNA-Seq data analysis (prot 67)

Nicolas Delhomme1, Niklas Mähler2, Bastian Schiffthaler1, David Sundell1, Chanaka Mannapperuma1, Torgeir R. Hvidsten1,2, Nathaniel R. Street1,3


RNA-Seq (RNA-Sequencing) has fast become the preferred method for measuring gene expression, providing an accurate proxy for absolute quantitation of messenger RNA (mRNA) levels within a sample (Mortazavi et al., 2008). Methods for RNA-Seq data handling, QC (Quality Control) and downstream statistical analysis have matured rapidly, benefiting substantially from the extensive body of literature on the analysis of microarray technologies and their application to measuring gene expression. Although analysis of RNA-Seq data remains more challenging than that of microarray data, the field has now advanced to the point where it is possible to define mature pipelines and guidelines for such analyses. However, although commercial options such as the CLCbio CLC Genomics Workbench exist, we are not aware of any fully integrated open-source pipeline for performing the required pre-processing steps. Both the technology behind RNA-Seq and the associated analysis methods continue to evolve at a rapid pace, and not all the properties of the data are yet fully understood. Hence, the steps and software tools that could be used in such a pipeline have changed rapidly in recent years, and only recently has it become possible to propose a de facto standard pipeline. Although proposing such a skeleton pipeline is now feasible, a number of caveats must be kept in mind in order to produce biologically and statistically sound results.

Here we present what is, in our opinion, a mature pipeline to pre-process and analyze RNA-Seq data. Wherever possible we identify the most obvious pitfalls one will face while working with RNA-Seq data and point out caveats that should be considered. An overview of the pipeline is presented in Figure 1 and is detailed below. Briefly, the first step upon receiving the raw data from a sequencing facility is to conduct initial QC checks. The QC results indicate whether the data requires filtering to remove ribosomal RNA (rRNA) contamination, whether sequence reads require ‘trimming’ to remove low-quality bases and whether reads need to be trimmed to remove sequencing adapters. These data pre-processing steps must be performed with care to ensure that the required data cleaning is adequately performed while avoiding the introduction of any bias, for example by removing sequences of interest. Once the data is deemed of sufficient quality, it is aligned/mapped (the two terms are used as synonyms in the following) against the chosen reference; this can be a model organism genome, a novel draft genome or a de novo assembled transcriptome. Each of these alternatives has advantages and caveats, some of which are detailed below. Once the RNA-Seq reads have been mapped to the reference, the subsequent analysis steps are determined by the project goals and the scientific questions that one wishes to address. Distinctly different analysis methods are required depending on whether interest lies, for example, in identifying sequence variants or in exploring expression level differences between sample groups, i.e. differential expression (DE). These are the two most popular uses of RNA-Seq data and are hence briefly introduced at the end of the current protocol. However, as these analyses are complex, we refer the reader to the more complete literature.
There are many additional analyses that RNA-Seq data can be used for, including examining allele-specific expression and RNA editing, among others.
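To make the read-level pre-processing described above more concrete, the following is a minimal Python sketch of adapter removal followed by 3' quality trimming. It is an illustration only and not the software used in the pipeline: the function names, the exact-match adapter search, the Phred+33 quality encoding and the thresholds (min_quality=20, min_length=5) are all assumptions made for this example. Dedicated trimming tools typically tolerate mismatches and partial adapter matches, which this sketch does not.

```python
from typing import List, Optional, Tuple


def phred33_scores(quality: str) -> List[int]:
    """Decode a quality string into integer scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in quality]


def remove_adapter(sequence: str, quality: str, adapter: str) -> Tuple[str, str]:
    """Cut the read at the first exact occurrence of the adapter, if any."""
    pos = sequence.find(adapter)
    if pos == -1:
        return sequence, quality
    return sequence[:pos], quality[:pos]


def trim_3prime(sequence: str, quality: str, min_quality: int = 20) -> Tuple[str, str]:
    """Drop trailing bases whose score is below min_quality (hypothetical rule)."""
    scores = phred33_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def preprocess(
    sequence: str,
    quality: str,
    adapter: str,
    min_quality: int = 20,
    min_length: int = 5,
) -> Optional[Tuple[str, str]]:
    """Remove the adapter, trim low-quality 3' bases, discard short reads.

    Returns None when the cleaned read is shorter than min_length, so that
    the caller can count how many reads are lost at this step.
    """
    sequence, quality = remove_adapter(sequence, quality, adapter)
    sequence, quality = trim_3prime(sequence, quality, min_quality)
    if len(sequence) < min_length:
        return None
    return sequence, quality


if __name__ == "__main__":
    # Toy read: eight good bases, two low-quality bases, then an adapter.
    print(preprocess("ACGTACGTTTAGATC", "IIIIIIII##IIIII", adapter="AGATC"))
```

Tracking how many reads such a step shortens or discards is one simple way to check that the cleaning is not removing sequences of interest.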
The pipeline we describe in the following is made publicly available and we aim to shortly release a worked example using a representative dataset as a companion to this protocol. The worked example will exemplify all of the steps detailed below and will demonstrate the influence of different biases and steps taken to mitigate them. The guideline will be available at

Before reading on, we wish to stress that as the analysis of RNA-Seq data is still a rapidly maturing field, one must always keep an open mind, challenging the results obtained to be sure a possible technical artifact does not underlie an observed difference, particularly where unexpected results are obtained. Only when such possibilities have been considered and eliminated can one assume that observed results are likely of biological origin.

1 Umeå Plant Science Center, Department of Plant Physiology, Umeå University, 90187, Umeå, Sweden
2 Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, 1432 Ås, Norway
3 Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden

Corresponding author: Nicolas Delhomme
