An ambitious EC-funded research initiative on epigenetics advancing towards systems biology

Identification of Transcription Factor Binding Sites in ChIP-exo using R/Bioconductor (Prot 68)

Pedro Madrigal1,2


Precisely mapping protein-DNA binding to genomic sites is pivotal to understanding gene regulation. Chromatin immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-chip) or sequencing (ChIP-seq) has been used extensively to map transcription factor binding sites (TFBSs), with ChIP-seq comparing favourably to ChIP-chip in terms of resolution and signal-to-noise ratio (Ho et al., 2011). While ChIP-seq remains the standard and most widely used methodology (Furey, 2012), ChIP coupled with λ exonuclease digestion and high-throughput sequencing, or ChIP-exo, has recently emerged as a powerful and promising technique that can replace ChIP-seq and circumvent its limitations (Rhee and Pugh, 2011; Mendenhall and Bernstein, 2012). In ChIP-exo data, the distribution of mapped reads is characterised by pairs of distinct peaks, one on each DNA strand, centred at the λ exonuclease borders and frequently separated by a fixed distance (Rhee and Pugh, 2011). Importantly, the improved resolution of ChIP-exo can provide novel insights into protein-DNA interactions (Rhee and Pugh, 2011; Serandour et al., 2013). Furthermore, ChIP-exo distinguishes weaker peaks with greater confidence and resolves closely located binding events that in ChIP-seq generally remain unresolved or must be deconvolved computationally (e.g., Guo et al., 2012).
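The strand-specific peak-pair signature can be illustrated with a small toy sketch. It is not CexoR's algorithm, only a minimal illustration in Python under simplifying assumptions: read 5' ends have already been counted per position on each strand, and a forward-strand peak is paired with the nearest reverse-strand peak downstream of it. The function names `local_maxima` and `pair_peaks`, the thresholds `min_height` and `max_gap`, and the coverage values are all hypothetical.

```python
from typing import List, Tuple


def local_maxima(cov: List[int], min_height: int) -> List[int]:
    """Return positions that reach min_height and are local maxima.

    A position counts as a peak if its value is strictly greater than the
    left neighbour and at least as large as the right neighbour.
    """
    peaks = []
    for i, c in enumerate(cov):
        left = cov[i - 1] if i > 0 else -1
        right = cov[i + 1] if i + 1 < len(cov) else -1
        if c >= min_height and c > left and c >= right:
            peaks.append(i)
    return peaks


def pair_peaks(fwd: List[int], rev: List[int],
               max_gap: int) -> List[Tuple[int, int, int]]:
    """Pair each forward-strand peak with the nearest reverse-strand peak
    located at most max_gap positions downstream of it.

    Returns (forward peak, reverse peak, midpoint) tuples; the midpoint is
    a rough estimate of the centre of the protected region.
    """
    pairs = []
    for f in fwd:
        candidates = [r for r in rev if 0 <= r - f <= max_gap]
        if candidates:
            r = min(candidates)
            pairs.append((f, r, (f + r) // 2))
    return pairs


# Hypothetical 5'-end counts over a 30-bp window (illustrative values only).
fwd_cov = [0] * 30
rev_cov = [0] * 30
fwd_cov[9], fwd_cov[10], fwd_cov[11] = 3, 9, 2   # forward-strand border
rev_cov[17], rev_cov[18] = 2, 8                  # reverse-strand border

pairs = pair_peaks(local_maxima(fwd_cov, min_height=5),
                   local_maxima(rev_cov, min_height=5),
                   max_gap=20)
print(pairs)  # [(10, 18, 14)]
```

In this toy window the forward-strand pile-up at position 10 and the reverse-strand pile-up at position 18 form one pair, and their midpoint (14) marks the putative binding site. A real peak caller would additionally assess statistical significance and reproducibility across replicates.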
In this protocol, I first describe the differences between the ChIP-seq and ChIP-exo data analysis pipelines, and then concentrate on peak calling with the R/Bioconductor package CexoR. Unlike, for example, the popular ChIP-seq peak caller MACS (Feng et al., 2012), CexoR analyses multiple ChIP-exo replicates together, allowing better identification of narrow peaks and simpler downstream analysis.
CexoR locates reproducible protein-DNA interactions in ChIP-exo datasets without the need for genome sequence information, manual matching of peak-pairs, paired control data (inputs), or downstream assessment of replicate reproducibility. In addition, the R statistical environment allows integration with other pipelines and downstream analyses via other R and Bioconductor packages.



1 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK
2 Wellcome Trust-MRC Cambridge Stem Cell Institute, Anne McLaren Laboratory for Regenerative Medicine, Department of Surgery, University of Cambridge, Cambridge, CB2 0SZ, UK

Corresponding author: Pedro Madrigal

