Tutorial of regions versus regions

In this tutorial, we demonstrate how to use RGT-Viz to visualize associations among different region sets.

Download the data

We will use epigenetic data from a dendritic cell (DC) development study as an example. It comprises ChIP-Seq data for the transcription factors PU.1 and IRF8 and for the histone modifications H3K4me1, H3K4me3, H3K9me3, H3K27me3, and H3K27ac in four cellular states: multipotent progenitors (MPP), common dendritic cell progenitors (CDP), conventional dendritic cells (cDC), and plasmacytoid dendritic cells (pDC). The functional annotation of these histone marks is as follows:

H3K4me1 is enriched at active and primed enhancers;
H3K4me3 is highly enriched at active promoters near the transcription start site (TSS);
H3K9me3 is a marker of heterochromatin, which has a pivotal role during lineage commitment;
H3K27me3 is associated with the downregulation of nearby genes via the formation of heterochromatic regions;
H3K27ac is associated with higher activation of transcription and is defined as an active enhancer marker.

The peaks of PU.1 and IRF8 are further processed into 3 groups: overlapping peaks of PU.1 and IRF8, PU.1 peaks (no IRF8), and IRF8 peaks (no PU.1). Those files are listed below:

PU1_IRF8_pDC_overlap_peaks.bed
PU1_pDC_noIRF8_peaks.bed
IRF8_pDC_noPU1_peaks.bed
PU1_IRF8_cDC_overlap_peaks.bed
PU1_cDC_noIRF8_peaks.bed
IRF8_cDC_noPU1_peaks.bed
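
The tutorial does not describe how these three groups were produced. As a minimal sketch, assuming that two peaks count as overlapping when they share at least one base pair and that the co-bound group is reported with the PU.1 peak coordinates, the split could look like this in Python. The function names and the toy peaks are hypothetical and only serve the illustration.

```python
from typing import List, Tuple

# A region is (chromosome, start, end) with BED-style half-open coordinates.
Region = Tuple[str, int, int]


def overlaps(a: Region, b: Region) -> bool:
    """Return True if two regions share at least one base pair."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_peaks(pu1: List[Region], irf8: List[Region]):
    """Partition peaks into co-bound, PU.1-only, and IRF8-only groups.

    Assumption: the co-bound group keeps the PU.1 peak coordinates.
    The pairwise comparison is quadratic, which is fine for a toy example.
    """
    co_bound = [p for p in pu1 if any(overlaps(p, q) for q in irf8)]
    pu1_only = [p for p in pu1 if not any(overlaps(p, q) for q in irf8)]
    irf8_only = [q for q in irf8 if not any(overlaps(q, p) for p in pu1)]
    return co_bound, pu1_only, irf8_only


# Toy peaks, not taken from the tutorial data.
pu1_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
irf8_peaks = [("chr1", 150, 250), ("chr2", 300, 400)]

co_bound, pu1_only, irf8_only = split_peaks(pu1_peaks, irf8_peaks)
print(co_bound, pu1_only, irf8_only)
```

For real peak files with tens of thousands of regions, a sorted sweep or a dedicated interval tool would be a better choice than this pairwise comparison.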

Next, please download the folder “rgt_viz_example” from here.

unzip rgt_viz_example
cd rgt_viz_example

Now you have the files as described below:

data/
├── bw
│   ├── H3K27ac_cDC.bw
│   ├── H3K27ac_CDP.bw
│   ├── H3K27ac_MPP.bw
│   ├── H3K27ac_pDC.bw
│   ├── H3K27me3_cDC.bw
│   ├── H3K27me3_CDP.bw
│   ├── H3K27me3_MPP.bw
│   ├── H3K27me3_pDC.bw
│   ├── H3K4me1_cDC.bw
│   ├── H3K4me1_CDP.bw
│   ├── H3K4me1_MPP.bw
│   ├── H3K4me1_pDC.bw
│   ├── H3K4me3_cDC.bw
│   ├── H3K4me3_CDP.bw
│   ├── H3K4me3_MPP.bw
│   ├── H3K4me3_pDC.bw
│   ├── H3K9me3_cDC.bw
│   ├── H3K9me3_CDP.bw
│   ├── H3K9me3_MPP.bw
│   ├── H3K9me3_pDC.bw
│   ├── IRF8_cDC.bw
│   ├── IRF8_pDC.bw
│   ├── PU1_cDC.bw
│   ├── PU1_CDP.bw
│   ├── PU1_MPP.bw
│   └── PU1_pDC.bw
└── peaks
    ├── H3K4me3_cDC_WT_peaks.bed
    ├── H3K4me3_CDP_WT_peaks.bed
    ├── H3K4me3_MPP_WT_peaks.bed
    ├── H3K4me3_pDC_WT_peaks.bed
    ├── PU1_IRF8_cDC_overlap_peaks.bed
    ├── IRF8_cDC_noPU1_peaks.bed
    ├── PU1_cDC_noIRF8_peaks.bed
    ├── PU1_IRF8_pDC_overlap_peaks.bed
    ├── IRF8_pDC_noPU1_peaks.bed
    └── PU1_pDC_noIRF8_peaks.bed

These directories include the genomic signals of the histone modifications and transcription factors (files with a .bw ending, as generated by bamCoverage) and the genomic regions of the H3K4me3, PU.1 and IRF8 peaks (files with a .bed ending, derived from MACS2 peak calls) in different DC cell types.

With these data, the first question we would like to ask is: are PU.1 and IRF8 co-binders in DC differentiation? If so, in which cells?

Intersection test

To evaluate the association between PU.1 and IRF8, we apply the intersection test to compare the ChIP-seq binding regions of PU.1 in all cell types with the ChIP-seq binding regions of IRF8 in cDC and pDC (IRF8 binding regions are not available for CDP and MPP).

rgt-viz intersect -r Matrix_PU1.txt -q Matrix_IRF8.txt -o results -t PU1_IRF8_intersection -organism mm9 -stest 30

-r is reference region set as the base for statistics;
-q is query region set for testing its association with the reference regions;
-o indicates the output directory;
-t defines the title of this experiment;
-c defines the color tag for coloring the test;
-organism defines the genome assembly used here;
-stest defines the number of repetitions of the random subregion test between reference and query. More repetitions give a more reliable result but take longer to run.

This command will generate a directory “results/PU1_IRF8_intersection” with figures and html pages.

The exact numbers of intersected regions between PU.1 and IRF8 and the p-values are shown in the table below:

Reference name	Query name	Reference number	Query number	Intersect.	Average intersect.	Chi-square statistic	Positive Association p-value	Negative Association p-value
MPP_PU1_peaks	cDC_IRF8_peaks	6212	34003	4412	1163	3359	0	1.00
MPP_PU1_peaks	pDC_IRF8_peaks	6212	6467	1745	848.5	351.3	5.1e-77	1.00
CDP_PU1_peaks	cDC_IRF8_peaks	20237	34003	14003	6574	4438	0	1.00
CDP_PU1_peaks	pDC_IRF8_peaks	20237	6467	4078	1494	1979	0	1.00
cDC_PU1_peaks	cDC_IRF8_peaks	20054	34003	13973	6538	4489	0	1.00
cDC_PU1_peaks	pDC_IRF8_peaks	20054	6467	4066	1497	1955	0	1.00
pDC_PU1_peaks	cDC_IRF8_peaks	21050	34003	14307	6757	4367	0	1.00
pDC_PU1_peaks	pDC_IRF8_peaks	21050	6467	4137	1478	2100	0	1.00

The intersection test reveals that PU.1 and IRF8 are significantly associated in all cell types, as shown in the 8th column (positive association p-value). Although PU.1 and IRF8 overlap extensively in all cell types, the table shows that the highest numbers of overlaps are found with the IRF8 peaks in cDC.
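
To put the overlap counts in the table into perspective, we can divide each observed count by its randomized average. The sketch below does this in Python with the values copied from the table. It assumes that the "Average intersect." column is the mean intersection count over the random repetitions, which is our reading of the column name rather than something the table states.

```python
# (reference, query, observed intersections, average random intersections),
# copied from the intersection test table above.
rows = [
    ("MPP_PU1_peaks", "cDC_IRF8_peaks", 4412, 1163),
    ("MPP_PU1_peaks", "pDC_IRF8_peaks", 1745, 848.5),
    ("CDP_PU1_peaks", "cDC_IRF8_peaks", 14003, 6574),
    ("CDP_PU1_peaks", "pDC_IRF8_peaks", 4078, 1494),
    ("cDC_PU1_peaks", "cDC_IRF8_peaks", 13973, 6538),
    ("cDC_PU1_peaks", "pDC_IRF8_peaks", 4066, 1497),
    ("pDC_PU1_peaks", "cDC_IRF8_peaks", 14307, 6757),
    ("pDC_PU1_peaks", "pDC_IRF8_peaks", 4137, 1478),
]

# Fold enrichment of the observed overlap over the randomized average.
folds = {(ref, qry): observed / average for ref, qry, observed, average in rows}

for (ref, qry), fold in folds.items():
    print(f"{ref}\t{qry}\t{fold:.2f}")
```

Every pair shows at least about twice as many intersections as the randomized average, which is consistent with the positive association p-values in the table.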

Jaccard test

Alternatively, we can use the Jaccard test to evaluate the association between PU.1 and IRF8 by comparing the observed Jaccard index with the Jaccard indices obtained from repeated randomization.

Run the command:

rgt-viz jaccard -r Matrix_PU1.txt -q Matrix_IRF8.txt -o results -t PU1_IRF8_jaccard -organism mm9

This command will generate a directory “results/PU1_IRF8_jaccard” with figures and html pages.

We can also look at the statistics and p-values as shown below:

Reference name	Query name	Reference number	Query number	True Jaccard index	Average random Jaccard	Negative Association p-value
MPP_PU1_peaks	cDC_IRF8_peaks	6212	34003	0.1101	0.0007	1.00
MPP_PU1_peaks	pDC_IRF8_peaks	6212	6467	0.1387	0.0004	1.00
CDP_PU1_peaks	cDC_IRF8_peaks	20237	34003	0.3125	0.0019	1.00
CDP_PU1_peaks	pDC_IRF8_peaks	20237	6467	0.1430	0.0007	1.00
cDC_PU1_peaks	cDC_IRF8_peaks	20054	34003	0.3140	0.0019	1.00
cDC_PU1_peaks	pDC_IRF8_peaks	20054	6467	0.1437	0.0007	1.00
pDC_PU1_peaks	cDC_IRF8_peaks	21050	34003	0.3147	0.0020	1.00
pDC_PU1_peaks	pDC_IRF8_peaks	21050	6467	0.1402	0.0007	1.00
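
The idea behind the Jaccard test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RGT-Viz implementation: we assume the Jaccard index is computed on base pairs (length of the intersection divided by length of the union), we work on a single toy chromosome, and we randomize by placing each query region at a uniformly random position. All function names and numbers are hypothetical.

```python
import random
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, on one toy chromosome


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bp(intervals: List[Interval]) -> int:
    """Number of base pairs covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def jaccard(a: List[Interval], b: List[Interval]) -> float:
    """Base-pair Jaccard index: |A intersect B| / |A union B|."""
    union = covered_bp(a + b)
    intersection = covered_bp(a) + covered_bp(b) - union
    return intersection / union if union else 0.0


def shuffle(intervals: List[Interval], genome_bp: int, rng: random.Random):
    """Place every interval at a uniformly random position, keeping its length."""
    placed = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(0, genome_bp - length + 1)
        placed.append((new_start, new_start + length))
    return placed


def jaccard_test(ref, qry, genome_bp, repeats=200, seed=0):
    """Compare the true Jaccard index with indices from randomized queries."""
    rng = random.Random(seed)
    true_index = jaccard(ref, qry)
    randomized = [jaccard(ref, shuffle(qry, genome_bp, rng)) for _ in range(repeats)]
    average = sum(randomized) / repeats
    # Empirical p-values with a pseudo-count so they are never exactly zero.
    p_positive = (1 + sum(v >= true_index for v in randomized)) / (repeats + 1)
    p_negative = (1 + sum(v <= true_index for v in randomized)) / (repeats + 1)
    return true_index, average, p_positive, p_negative


# Toy regions, not taken from the tutorial data.
ref = [(100, 200), (1000, 1100), (5000, 5200)]
qry = [(150, 250), (1050, 1150), (9000, 9100)]
true_j, avg_j, p_pos, p_neg = jaccard_test(ref, qry, genome_bp=100_000)
print(true_j, avg_j, p_pos, p_neg)
```

As in the table above, a true Jaccard index far above the randomized average goes together with a negative association p-value close to 1.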

Projection test

We next evaluate the association between IRF8 and PU.1 binding sites and histone modification marks. For this, we can use the projection test. It evaluates the association between a collection of region sets (peaks of IRF8, PU.1, and IRF8/PU.1) and a set of background regions (H3K4me1 peaks in cDC cells) by evaluating intersection counts with a random binomial model.

rgt-viz projection -r Matrix_H3K4me1.txt -q Matrix_cDC_pDC.txt -o results -t projection -c factor -organism mm9 -g cell

-r is reference region set as the base for statistics;
-q is query region set for testing its association with the reference regions;
-o indicates the output directory;
-t defines the title of this experiment;
-c defines the color tag for coloring the test;
-organism defines the genome assembly used here;
-g defines the group tag for grouping the test.

This command will generate a directory “results/projection” with figures and html pages.
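
The random binomial model mentioned above can be sketched as follows. We assume, for illustration only, that the background probability is the fraction of the genome covered by the reference regions and that each query region is an independent trial that either overlaps the reference or not. Whether RGT-Viz defines the background in exactly this way is not stated here, and all function names and numbers below are hypothetical.

```python
from math import exp, lgamma, log


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability of exactly k successes (0 < p < 1)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1 - p)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    return min(1.0, exp(peak) * sum(exp(v - peak) for v in logs))


def projection_test(n_query: int, n_overlapping: int, reference_bp: int, genome_bp: int):
    """Compare the observed overlap proportion with the background proportion."""
    background = reference_bp / genome_bp   # chance that a random region hits the reference
    observed = n_overlapping / n_query      # fraction of query regions that actually do
    p_value = binom_upper_tail(n_overlapping, n_query, background)
    return background, observed, p_value


# Toy numbers, not taken from the tutorial data.
background, observed, p_value = projection_test(
    n_query=1000, n_overlapping=80, reference_bp=50_000_000, genome_bp=2_500_000_000
)
print(background, observed, p_value)
```

An observed proportion well above the background proportion, together with a small p-value, points to a positive association between the query and the reference regions.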

These results indicate that the majority of peaks associated with H3K4me3 in cDC show PU.1 and IRF8 co-binding, while H3K4me3 peaks are associated with IRF8 peaks.

Combinatorial test

We next ask whether co-binding of PU.1 and IRF8 relates to different histone modifications. For this, we use the combinatorial test. This is another variant of the intersection test that checks all combinations of the given query regions and calculates their intersections with the reference. This test is useful for exploring unknown associations between region sets.

rgt-viz combinatorial -o results -q Matrix_H3K4me1_cDC_pDC.txt -r Matrix_PU1_IRF8_peaks.txt -t combinatorial -organism mm9 -g cell -c factor
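
To illustrate what checking all combinations means, here is a minimal Python sketch. It assumes one possible reading of the test: each reference region is assigned to the exact combination of query sets it overlaps, and we count the reference regions per combination. RGT-Viz may compute the combinations differently. The set names markA and markB and the toy regions are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open


def overlaps(a: Region, b: Region) -> bool:
    """Return True if two regions share at least one base pair."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def combinatorial_counts(reference: List[Region], queries: Dict[str, List[Region]]):
    """Count reference regions per exact combination of overlapping query sets."""
    names = sorted(queries)
    # Start every non-empty combination at zero so empty ones are reported too.
    counts = {
        combo: 0
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    }
    for ref in reference:
        hit = tuple(n for n in names if any(overlaps(ref, q) for q in queries[n]))
        if hit:
            counts[hit] += 1
    return counts


# Toy regions, not taken from the tutorial data.
reference = [("chr1", 100, 200), ("chr1", 500, 600), ("chr1", 900, 1000)]
queries = {
    "markA": [("chr1", 150, 160), ("chr1", 550, 560)],
    "markB": [("chr1", 190, 210)],
}

counts = combinatorial_counts(reference, queries)
for combo, count in counts.items():
    print("+".join(combo), count)
```

The number of combinations grows as 2^n - 1 with the number n of query sets, so this kind of exploration is most practical with a handful of query sets.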