Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is becoming the standard experimental procedure for investigating transcriptional regulation and epigenetic mechanisms on a genome-wide scale (reviewed in Park, 2009). The technique involves covalent cross-linking of proteins to DNA, followed by fragmentation and immunoprecipitation (IP) of the chromatin using an antibody against the protein or histone modification of interest. The result of this experiment is a set of short DNA fragments, about 200 bp in length, that represent regions of the genome where the protein is bound or where specific histone modifications occur. The fragments are then sequenced using one of the next-generation sequencing procedures now available. The resulting reads (usually 36 to 100 bp) are mapped back to the reference genome of interest in order to identify regions with significant binding.
Since the introduction of the experimental technique, several bioinformatics approaches have been developed to analyse these data (reviewed in Laajala et al., 2009; Wilbanks and Facciotti, 2010). These methods were usually developed to analyse a particular dataset and its associated experimental design, and they are therefore based on different assumptions. For example, some methods use window-based scans to establish read density profiles, while others use kernel density estimators; some perform peak assignments on a strand-specific basis, whereas others do so in a strand-insensitive fashion; some make use of control or background datasets; and almost every method is based on a different statistical model or test and takes a different approach to adjusting for multiple testing or normalising the data. Given such disparity, it is not surprising that the results obtained depend heavily on the method employed, with peak overlaps ranging from 13% to 100% depending on the algorithm (Wilbanks and Facciotti, 2010).
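To make the idea of a window-based read density profile concrete, the following minimal sketch counts mapped read start positions in fixed, non-overlapping windows. It is an illustration only and is not taken from any of the reviewed tools: the function name `window_counts`, the 200 bp window size and the toy read positions are all assumptions.

```python
def window_counts(read_starts, genome_length, window=200):
    """Count reads whose start position falls in each non-overlapping window.

    read_starts   : iterable of 0-based read start coordinates (toy data here)
    genome_length : length of the (hypothetical) reference sequence in bp
    window        : window size in bp (assumed value, chosen for illustration)
    """
    n_windows = (genome_length + window - 1) // window  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < genome_length:  # ignore reads outside the reference
            counts[pos // window] += 1
    return counts


# Toy example: eight reads on a 1,000 bp sequence.
reads = [5, 150, 199, 200, 610, 620, 630, 999]
profile = window_counts(reads, genome_length=1000, window=200)
print(profile)  # [3, 1, 0, 3, 1]
```

A kernel density estimator would replace the hard window boundaries with a smooth weighting of nearby reads, which is one reason different tools can report different peaks from the same alignments.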
Here, we present a step-by-step protocol for the analysis of ChIP-seq data using a new, robust procedure based on the estimation of background signal from an input DNA control. Unlike many of the currently available methods, which are based on fitting the ChIP-seq signal to a given distribution, our approach is based on an unbiased evaluation of the noise in the sample, which is then used to calculate the statistical significance of the binding events. Hence, our procedure is well suited to profiles for which no prior information about the mode of binding (e.g. sharp peaks or broad domains) is available. The method, implemented in the statistical environment R/Bioconductor (Gentleman et al., 2004), has been successfully used for small genomes such as that of D. melanogaster (Schwartz et al., 2006; Kind et al., 2008; Conrad et al., 2012), and can be used for any dataset with sufficient coverage for both the input and the IP sample. In this protocol, we use a recent ChIP-seq dataset (Raja et al., 2010) to illustrate each step of the analysis. The code is available at http://www.ebi.ac.uk/luscombe-srv/protocols/epigenesys.
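The general idea of scoring IP signal against noise estimated from the data, rather than against a fitted parametric distribution, can be sketched as follows. This is not the R/Bioconductor implementation used in the protocol. It is a simplified Python illustration under stated assumptions: per-window counts are already available, the input is scaled to the IP by total read count, and the null distribution is built by mirroring the negative log2 ratios around zero. The function names, the pseudocount and the toy counts are all hypothetical.

```python
import math


def log2_enrichment(ip_counts, input_counts, pseudocount=1.0):
    """Per-window log2(IP / scaled input), with input scaled to IP depth."""
    scale = sum(ip_counts) / sum(input_counts)  # assumed depth normalisation
    return [
        math.log2((ip + pseudocount) / (inp * scale + pseudocount))
        for ip, inp in zip(ip_counts, input_counts)
    ]


def empirical_p_values(ratios):
    """Empirical upper-tail p-values against a mirrored-noise null.

    Assumption: windows with a negative log2 ratio contain only noise, and
    noise is symmetric around zero, so the null is the negative ratios plus
    their mirror images.
    """
    negatives = [r for r in ratios if r < 0]
    if not negatives:
        raise ValueError("no negative ratios available to estimate the noise")
    null = negatives + [-r for r in negatives]
    n = len(null)
    # Add-one correction keeps every p-value strictly above zero.
    return [(sum(1 for x in null if x >= r) + 1) / (n + 1) for r in ratios]


# Toy counts for eight windows; window 4 carries a simulated binding event.
ip = [10, 12, 9, 11, 80, 10, 8, 12]
inp = [11, 10, 10, 12, 10, 9, 11, 10]

ratios = log2_enrichment(ip, inp)
p_values = empirical_p_values(ratios)
best = p_values.index(min(p_values))
print(best)  # 4
```

Because the null is taken from the observed ratios themselves, the sketch makes no assumption about the shape of the enriched regions, which is the property the text highlights for profiles with unknown binding modes. With real data, the resulting p-values would still need adjustment for multiple testing.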
1 European Bioinformatics Institute. Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK.
2 Okinawa Institute of Science & Technology, 1919-1 Tancha, Onna-son, Kunigami-gun, Okinawa 904-0495, Japan.
3 University College London Genetics Institute, Gower Street, London WC1E 6BT, UK
Corresponding author: Nicholas M. Luscombe