Configuration of Genomic Data

When RGT is installed, it will automatically create a folder that stores additional data (default: ~/rgtdata). This data includes chromosome sizes, position frequency matrices (describing transcription factor motifs), HTML scripts, etc. Some tools require data too big to fit in the installation procedure, such as genomes and genomic annotations. In this section we will describe how to obtain these data.

Automatic Data Setup

The easiest way to obtain all data sets required by RGT is to run the setupGenomicData.py python script inside the installed data directory. This will download the files from public servers and will take a few minutes. If you use MAC OS, make sure the command “wget” is available (further instructions here).

The following command will install all the necessary human genome (hg19) data sets:

cd ~/rgtdata
python setupGenomicData.py --hg19

The following command will install all the necessary mouse genome (mm9) data sets:

cd ~/rgtdata
python setupGenomicData.py --mm9

The following command will install all available data sets: (hg19, hg38, mm9, mm10, zv9, and zv10)

cd ~/rgtdata
python setupGenomicData.py --all

This script has further options that can be viewed with:

python setupGenomicData.py -h

Customize RGT Data Folder

The data.config File

The data.config file contains the default data set names (inside RGT data path) used by RGT tools. It is divided into sections (with labels in brackets), such as GenomeData and MotifData.

GenomeData

For each genome assembly, there are five fields targeting to the relevant files. You can customize these paths by yourself. Below is the example for hg19:

Field Name

Default Value

Description

genome

genome_hg19.fa

Sequence of assembly hg19 in FASTA format. This data set is not available upon installation. See instructions above on how to obtain this data set.

chromosome_sizes

chrom.sizes.hg19

Chromosome sizes file of assembly hg19.

gene_regions

genes_hg19.bed

Gene locations in BED format (from Gencode annotation file in GTF format).

annotation

gencode.v19.annotation.gtf

Gene annotation from Gencode version 19 for human in GTF format. This data set is not available upon installation. See instructions above on how to obtain this data set.

gene_alias

alias_human.txt

Alias file which allows for translation between multiple different gene IDs.

You should never modify the data.config file! This is due to the fact that every RGT installation will overwrite it. You can however customize the data.config.user file, by copying a similar section from the data.config file and modifying it to your wishes. For example, to use data from the organisms the user is interested in studying you simply create a section with the genome name and define all the relevant paths.

For example, here is a customized genome for Arabis thaliana (TAIR10):

[tair10]
genome: path/to/genome_tair10.fa
chromosome_sizes: path/to/chrom.sizes.tair10
gene_regions: path/to/genes_tair10.bed
annotation: path/to/tair10.annotation.gtf
gene_alias: path/to/alias_tair10.txt

The files that should be defined include:

  • Genome fasta file: These files must contain one sequence for each chromosome. Each sequence header must be the chromosome symbol (such as “chr1” for chromosome 1). It can be obtained from several resources, including the UCSC Downloads Website.

  • Gene annotation file: It is a BED file containing the genomic coordinates of each gene for the selected organism. It can be downloaded, among other places, in the UCSC Table Browser.

  • Chromosome sizes: It is a tab-separated plain text file with two columns. The first must contain the chromosome alias and the second must contain the length of the chromosome in base pairs. It can be fetched for some organisms using the fetchChromSizes script available at the UCSC Utilities Website.

  • GTF annotation file: A GTF file.

  • Gene ID/Symbol aliases: A tab delimited file with three columns. Each row describes a gene and its aliases. The first column contains the gene’s ENSEMBL ID. The second column contains the gene’s official symbol (or user’s symbol of preference). The third column contains an ampersand(&)-separated list of aliases.

MotifData

The following table describes the data.config path fields:

Field Name

Default Value

Description

pwm_dataset

motifs

Contains the path to the motif position weight matrices (PWM) repositories.

logo_dataset

logos

Contains the path to the logo graphs(graphical depiction of PWMs). This data set is not available upon installation. For more information on how to create this data set click here.

repositories

jaspar_vertebrates,uniprobe_primary

The PWM repositories that will be used in the analyses. It is a comma-separated list of folders inside <pwm_dataset> (see this option above) folder. For information on how to add additional repositories, click here.

RGT Data Folder Structure

After installation, the RGT data folder will contain the following data.

  • Organism Folders: Currently, we provide data for Homo sapiens (hg19, hg38) , Mus musculus (mm9, mm10), and Danio rerio (zv9, zv10). Inside these folders, you can find information regarding gene annotation and chromosome sizes.

  • fig: Default figures and HTML style files.

  • fp_hmms: Contains default hidden Markov model files for HINT tool.

  • motifs: Contain position weight matrices (PWMs) for vertebrates obtained in many different repositories.

Additional folders may be created regarding tool-specific data.