Advancing Epigenetics Towards Systems Biology

Guidelines for RNA-Seq data analysis (prot 67)

Nicolas Delhomme1, Niklas Mähler2, Bastian Schiffthaler1, David Sundell1, Chanaka Mannapperuma1, Torgeir R. Hvidsten1,2, Nathaniel R. Street1,3

Introduction

RNA-Seq (RNA sequencing) has rapidly become the preferred method for measuring gene expression, providing an accurate proxy for absolute quantitation of messenger RNA (mRNA) levels within a sample (Mortazavi et al., 2008). Methods for RNA-Seq data handling, quality control (QC) and downstream statistical analysis have matured rapidly, benefiting substantially from the extensive body of literature on the analysis of microarray data for measuring gene expression. Although the analysis of RNA-Seq data remains more challenging than that of microarray data, the field has now advanced to the point where it is possible to define mature pipelines and guidelines for such analyses. However, with the exception of commercial software such as the CLCbio CLC Genomics Workbench, we are not aware of any fully integrated open-source pipeline for performing the required pre-processing steps. Both the technology behind RNA-Seq and the associated analysis methods continue to evolve at a rapid pace, and not all properties of the data are yet fully understood. Consequently, the steps and software tools that could be used in such a pipeline have changed rapidly in recent years, and only recently has it become possible to propose a de facto standard pipeline. Although proposing such a skeleton pipeline is now feasible, a number of caveats must be kept in mind in order to produce biologically and statistically sound results.

Here we present what is, in our opinion, a mature pipeline to pre-process and analyze RNA-Seq data. Wherever possible, we identify the most obvious pitfalls one will face when working with RNA-Seq data and point out caveats that should be considered. An overview of the pipeline is presented in Figure 1 and is detailed below. Briefly, the first step upon receiving the raw data from a sequencing facility is to conduct initial QC checks. The QC results indicate whether the data require filtering to remove ribosomal RNA (rRNA) contamination, whether sequence reads require 'trimming' to remove low-quality bases, and whether reads need to be trimmed to remove sequencing adapters. These pre-processing steps must be performed with care, to ensure that the data are adequately cleaned without introducing bias, for example by removing sequences of interest. Once the data are deemed to be of sufficient quality, they are aligned/mapped (the two terms are used as synonyms in the following) against the chosen reference; this can be a model organism genome, a novel draft genome or a de novo assembled transcriptome. Each of these alternatives has advantages and caveats, some of which are detailed below. Once the RNA-Seq reads have been mapped to the reference, the subsequent analysis steps are determined by the project goals and the scientific questions that one wishes to address. For example, distinctly different analysis methods are required depending on whether the interest lies in identifying sequence variants or in exploring expression level differences between sample groups, i.e. differential expression (DE). These are the two most popular uses of RNA-Seq data and are hence briefly introduced at the end of the current protocol. However, as these analyses are complex, we refer the reader to more complete literature.
There are many additional analyses that RNA-Seq data can be used for, including examining allele-specific expression and RNA editing, among others.
The pipeline we describe in the following is made publicly available, and we aim to release shortly a worked example using a representative dataset as a companion to this protocol. The worked example will illustrate all of the steps detailed below and will demonstrate the influence of different biases and the steps taken to mitigate them. The guideline will be available at https://bioinformatics.upsc.se/.
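As a purely illustrative aid, the decision flow outlined above (initial QC, then optional rRNA filtering and trimming, then alignment) can be sketched in a few lines of Python. This is a minimal sketch and not the pipeline itself: the class `QcSummary`, the function `plan_preprocessing`, the field names and all threshold values are hypothetical and chosen only for the example, and the order in which the optional steps are listed is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class QcSummary:
    """Hypothetical per-sample QC summary; field names are illustrative only."""
    rrna_fraction: float         # fraction of reads matching rRNA sequences
    low_quality_fraction: float  # fraction of reads carrying low-quality bases
    adapter_fraction: float      # fraction of reads containing adapter sequence


def plan_preprocessing(qc, rrna_max=0.05, low_quality_max=0.05, adapter_max=0.01):
    """Return the ordered list of steps suggested by the QC summary.

    The default thresholds are placeholders (assumptions), not recommendations.
    """
    steps = ["initial QC"]
    if qc.rrna_fraction > rrna_max:
        steps.append("filter rRNA")
    if qc.low_quality_fraction > low_quality_max:
        steps.append("trim low-quality bases")
    if qc.adapter_fraction > adapter_max:
        steps.append("trim adapters")
    # Alignment is always performed once the data are deemed of sufficient quality.
    steps.append("align to reference")
    return steps


if __name__ == "__main__":
    sample = QcSummary(rrna_fraction=0.20, low_quality_fraction=0.02, adapter_fraction=0.03)
    for step in plan_preprocessing(sample):
        print(step)
```

In practice the thresholds would need to be set per project and per library type, and each cleaning step should be checked to make sure it does not remove sequences of interest.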

Before reading on, we wish to stress that, because the analysis of RNA-Seq data is still a rapidly maturing field, one must always keep an open mind and challenge the results obtained to be sure that a technical artifact does not underlie an observed difference, particularly where unexpected results are obtained. Only when such possibilities have been considered and eliminated can one assume that the observed results are likely of biological origin.

1 Umeå Plant Science Center, Department of Plant Physiology, Umeå University, 90187, Umeå, Sweden
2 Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, 1432 Ås, Norway
3 Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden

Corresponding author: Nicolas Delhomme
