Tool Usage

Command:

rgt-THOR [option] CONFIG

Required Input

Option Name	Type	Description
CONFIG	File	Configuration file for the experiment

Options

Option Name	Type	Default	Description
–name	string	none	Experiment’s name and prefix for all files that are created.
–merge	boolean	False	Merge peaks which have a distance less than the estimated mean fragment size (recommended for histone data).
–no-merge-bin	boolean	True	Merge the overlapping bin before filtering by p-value.
–housekeeping-genes	file (bed format)	none	Define housekeeping genes (BED format) used for normalizing.
–output-dir	string	none	Store files in output directory.
–report	boolean	False	Generate HTML report about experiment.
–deadzones	file (bed format)	none	Define blacklisted genomic regions avoided for analysis (BED format).
–no-correction	boolean	False	Do not use multipe test correction for p-values (Benjamini/Hochberg).
–pvalue	float	0.1	P-value cutoff for peak detection. Call only peaks with p-value lower than cutoff.
–exts	interger1,interger2,…	estimate	Read’s extension size for BAM files (comma separated list for each BAM file in config file). If option is not chosen, estimate extension sizes from reads.
–factors-inputs	float1,float2, …	estimate	Normalization factors for input-DNA (comma separated list for each BAM file in config file). If option is not chosen, estimate factors.
–scaling-factors	float1,float2, …	estimate	Scaling factor for each BAM file (not control input-DNA) as comma separated list for each BAM file in config file. If option is not chosen, follow normalization strategy (TMM or HK approach)
–save-input	boolean	False	Save input DNA bigwig (if input was provided).

Advanced options

Option Name	Type	Default	Description
–regions	file (bed format)	none	Define regions (BED format) to restrict the analysis, that is, where to train the HMM and search for DPs. It is faster, but less precise.
–binsize	integer	100	Size of bins for creating the signal.
–step	integer	50	Stepsize with which the window consecutively slides across the genome to create the signal.
–debug	boolean	False	Output debug information. Warning: space consuming!
–no-gc-content	boolean	False	Do not normalize towards GC content.
–norm-regions	file (bed format)	none	Restrict normalization to particular regions (BED format).
–threshold	float	95	Minimum signal support for differential peaks to define training set as percentage (t₂, see paper).
–size	integer	10000	Number of bins the HMM’s training set constists of.
–par	integer	1	Percentile for p-value postprocessing filter.
–poisson	boolean	False	Use binomial distribution as emmission.
–single-strand	boolean	False	Allow single strand BAM file as input.
–m_threshold	integer	80	Define the M threshold of percentile for training TMM.
–a_threshold	integer	95	Define the A threshold of percentile for training TMM.
–rmdup	boolean	False	Remove the duplicate reads.

Config File

The config file contains all necessary information for the experiment.

The file contains several groups, each group is indicated by a header symbol (#) followed by the group’s name. The information of each group is then given in the following lines.

The headers are:

#rep1
#rep2
#chrom_sizes
#genome (optional)
#inputs1 (optional)
#inputs2 (optional)

The headers #inputs1, #inputs2 and #genome are optional. All other headers are required. An example for a config file is given here.

#rep

Headers #rep1 and #rep2 give all BAM files to be analysed. All files are listed line by line for the first (#rep1) and second (#rep2) biological condition.

#chrom_sizes

The chromosome sizes is a tab limited file with the chromosome name and the size of the chromosome. To download the chromosome size file of an organism, follow these instructions.

#inputs

For each BAM file, we can also provide control input-DNA BAM files by using headers #inputs1 and #inputs2. Input-DNA helps to handle bias in ChIP-seq profiles and can therefore improve the differential peak estimation.

#genome

Input-DNA and the genome (in fasta format) is necessary to correct for GC-content. GC-content correction can lead to more precise DP estimates. For instance, to download the human genome hg19.fa, run

wget 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/*'
gunzip *gz
cat *fa > hg19.fa
rm md5sum.txt chr*fa README.txt

Output

By default, THOR creates four types of files. The file(s)

*.bw give the postprocessed ChIP-seq signal (in bigWig format),
*-setup.info gives information about the setting,
*-diffpeaks.bed describes the estimated differential peaks in a proprietary BED format, and
*-diffpeaks.narrowPeak describes the estimated differential peaks in narrowPeak format.

setup.info file

The setup file contains information about the experiment and the configuration of THOR used to predict DPs.

This includes the extension sizes used for the reads, the normalization scaling factors for each single ChIP-seq signal and the overdispersion factor c describing the quadratic relationship between the mean and the variance in the data.

Moreover, the setup.info file gives information about the HMM configuration, that is, the transition matrix as well as the emission distribution. The emission distribution is a Negative Binomial distribution described by mu and alpha.

BigWig files

THOR creates a BigWig file for each ChIP-seq signal (*s[1/2]rep[0/1/2/3/…].bw), that is, for each biological condition (s[1/2]) and each replicate (rep[0/1/2/3/…]).

narrowPeak and BED files

As the narrowPeak format does not give the possibility to store the counts of condition 1 and condition 2 for each peak, we output an BED format with additional information in the 11th column.

The 11th column in the BED file gives a semicolon separated list for each differential peak. The first (second) element of the semicolon separated list contains a comma separated list of the counts of each replicate of the first (second) biological conditions. The third element of the list gives the calculated p-value.

For downstream analysis of this BED file, we provide two tools. The first tool separates the BED file by differential peaks that gain peaks in condition 1 and that gain peaks in condition 2. The second tool filters the BED file by p-value.

Please note: In both files, *-diffpeaks.bed and *-diffpeaks.narrowPeak, the strand column (6th column) indicates whether the peak gains condition 1 (positive strand) or gains condition 2 (negative strand).

To output valid narrowPeak and BED files, we add several dummy columns. Only then it is possible to use them in other tools, like e.g. the IGV browser.

For the BED file, columns 7 and 8, give the same information as columns 2 and 3; column 9 gives a colour code for the peaks (red for a differential peak in signal 1, and green for a differential peak in signal 2). Column 10 is always 0.

For the narrowPeak file, column 5, 7, 9 and 10 do not give any information.

Advanced Example

In this example, we study the Dendritic Cell development. Dendritic cells (DC) are professional antigen presenting cells that develop from hematopoietic stem cells in bone marrow. This system allows to differentiate ex-vivo mul- tipotent progenitors (MPP) to DC progenitors (CDP). We want to call differential peaks in our provided ChIP-seq data for histone mark H3K27ac of MPP and CDP (see here the GEO entry for the fastq files).

We create the config file DC_THOR.config:

#rep1
MPP_WT_H3K27ac_1.bam
MPP_WT_H3K27ac_2.bam
#rep2
CDP_WT_H3K27ac_1.bam
CDP_WT_H3K27ac_2.bam
#genome
mm9.fa
#chrom_sizes
mm9.chrom.sizes
#inputs1
Input_MPP_H3K27ac.bam
Input_MPP_H3K27ac.bam
#inputs2
Input_CDP_H3K27ac.bam
Input_CDP_H3K27ac.bam

Here, we do not give only the required (#rep1, #rep2 and #chrom_sizes), but also the optional header (#genome, #inputs1, #inputs2) to take advantage of input-DNA and GC-content normalization.

The following command finds differential peaks in the two conditions MPP and CDP.

rgt-THOR THOR_DC.config --report --housekeeping-genes hk_genes_new_promotor_mm9.bed --no-correction --output-dir ~/my_MPP_CDP_exp -n THOR_DC

THOR_DC.config contains all information necessary to run THOR. With –output-dir we determine the working directory ~/my_MPP_CDP_exp. THOR stops, if the folder already exists. All files created in ~/my_MPP_CDP_exp have the prefix THOR_DC (specific by the option -n). The p-value is not corrected (option –no-correction). Instead of using a TMM based approach to normalize the ChIP-seq profiles (default), we use a housekeeping genes approach. For that, please download the housekeeping genes and use the option –house-keeping-genes.

Moreover, we create a HTML report of our experiment (option –report). The HTML report contains useful information about our experiment, such as, the the experimental configuration (also described by *-setup.info), the function THOR is using to model the mean-variance relationship, the function used to estimate the fragmentation size, and the quality check of the house-keeping-gene normalization approach.