https://github.com/bioinfo-pf-curie/hic-pro
HiC-Pro: An optimized and flexible pipeline for Hi-C data processing
Basic Info
- Host: GitHub
- Owner: bioinfo-pf-curie
- License: other
- Language: Python
- Default Branch: master
- Size: 45.9 MB
Statistics
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of nservant/HiC-Pro
Created almost 9 years ago
· Last pushed about 6 years ago
https://github.com/bioinfo-pf-curie/HiC-Pro/blob/master/
# HiC-Pro
### An optimized and flexible pipeline for Hi-C data processing


----
Find documentation and examples at [http://nservant.github.io/HiC-Pro/](http://nservant.github.io/HiC-Pro/)
For any question about HiC-Pro, please contact nicolas.servant@curie.fr or use the [HiC-Pro forum](https://groups.google.com/forum/#!forum/hic-pro)
## What is HiC-Pro ?
HiC-Pro was designed to process Hi-C data, from raw fastq files (paired-end Illumina data) to normalized contact maps. It supports the main Hi-C protocols, including digestion protocols as well as protocols that do not require restriction enzymes, such as DNase Hi-C. In practice, HiC-Pro has been successfully applied to many datasets, including dilution Hi-C, in situ Hi-C, DNase Hi-C, Micro-C, capture-C, capture Hi-C and HiChIP data.
The pipeline is flexible, scalable and optimized. It can operate either on a single laptop or on a computational cluster. HiC-Pro is sequential and each step of the workflow can be run independently.
HiC-Pro includes a fast implementation of the iterative correction method (see the [iced python package](https://github.com/hiclib/iced) for more information).
Finally, HiC-Pro can use phasing data to build [allele-specific contact maps](doc/AS.md).
If you use HiC-Pro, please cite :
*Servant N., Varoquaux N., Lajoie BR., Viara E., Chen CJ., Vert JP., Dekker J., Heard E., Barillot E.* HiC-Pro: An optimized and flexible pipeline for Hi-C processing. Genome Biology 2015, 16:259 [doi:10.1186/s13059-015-0831-x](https://doi.org/10.1186/s13059-015-0831-x)
## Using HiC-Pro through Singularity
HiC-Pro provides a Singularity container to ease its installation process.
A ready-to-use container is available [here](https://zerkalo.curie.fr/partage/HiC-Pro/singularity_images/hicpro_latest_ubuntu.img).
To build your own Singularity image:
1- Install Singularity
- Linux : http://singularity.lbl.gov/install-linux
- MAC : http://singularity.lbl.gov/install-mac
- Windows : http://singularity.lbl.gov/install-windows
2- Build the singularity HiC-Pro image using the 'Singularity' file available in the HiC-Pro root directory.
```
sudo singularity build hicpro_latest_ubuntu.img MY_INSTALL_PATH/HiC-Pro/Singularity
```
3- Run HiC-Pro
You can then run HiC-Pro either with the 'exec' command:
```
singularity exec hicpro_latest_ubuntu.img HiC-Pro -h
```
Or directly within the Singularity shell:
```
singularity shell hicpro_latest_ubuntu.img
HiC-Pro -h
```
## How to install it ?
The HiC-Pro pipeline requires the following dependencies :
- The [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) mapper
- Python (>2.7) with *pysam (>=0.8.3)*, *bx-python(>=0.5.0)*, *numpy(>=1.8.2)*, and *scipy(>=0.15.1)* libraries.
**Note that the current version does not support python 3**
- R with the *RColorBrewer* and *ggplot2 (>2.2.1)* packages
- g++ compiler
- samtools (>1.1)
- Unix sort (**which supports the -V option**) is required! Mac OS users should install the GNU core utilities.
Note that bowtie2 >2.2.2 is strongly recommended for allele-specific analysis.
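The dependency list above can be checked before running the installer. The following is a minimal sketch, not part of HiC-Pro: the executable names in `REQUIRED_TOOLS` are assumptions derived from the list, and the helper names are hypothetical. It is written for Python 3 and is meant to be run as a separate script, independently of the Python 2.7 interpreter used by the pipeline itself.

```python
import shutil
import subprocess

# Assumed executable names, taken from the dependency list above.
REQUIRED_TOOLS = ["bowtie2", "samtools", "R", "g++", "sort"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found in the PATH."""
    return [tool for tool in tools if which(tool) is None]


def sort_supports_version_flag(sort_cmd="sort"):
    """Return True if `sort -V` runs and orders chr2 before chr10."""
    try:
        result = subprocess.run(
            [sort_cmd, "-V"],
            input="chr10\nchr2\n",
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # The command does not exist or cannot be executed.
        return False
    return result.returncode == 0 and result.stdout.split() == ["chr2", "chr10"]
```

Calling `missing_tools(REQUIRED_TOOLS)` returns an empty list when every executable is found, and `sort_supports_version_flag()` returns `False` when the local `sort` lacks version sorting. Version numbers (for example samtools >1.1) are not checked here.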
To install HiC-Pro (>=2.7.8), be sure to have the appropriate rights and run :
```
tar -zxvf HiC-Pro-master.tar.gz
cd HiC-Pro-master
## Edit config-install.txt file if necessary
make configure
make install
```
For older versions (<2.7.8), use the following process:
```
tar -zxvf HiC-Pro-master.tar.gz
cd HiC-Pro-master
## Edit config-install.txt file if necessary
make CONFIG_SYS=config-install.txt install
```
Note that if some of these dependencies are not installed (i.e. not detected in the $PATH), HiC-Pro will try to install them.
You can also edit the *config-install.txt* file and manually define the paths to the dependencies.
| | SYSTEM CONFIGURATION |
|---------------|-------------------------------------------------------------------------------|
| PREFIX | Path to installation folder |
| BOWTIE2_PATH | Full path to the bowtie2 installation directory |
| SAMTOOLS_PATH | Full path to the samtools installation directory (>1.1 ) |
| R_PATH | Full path to the R installation directory |
| PYTHON_PATH | Full path to the python installation directory (>2.7 - python3 not supported) |
| CLUSTER_SYS | Scheduler to use for cluster submission. Must be TORQUE, SGE, SLURM or LSF |
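
As an illustration of the table above, the sketch below reads installation settings and checks the `CLUSTER_SYS` value against the four schedulers listed. The `KEY = VALUE` layout with `#` comments is an assumption about *config-install.txt*, not something stated here, and both function names are hypothetical (Python 3).

```python
# Schedulers named in the SYSTEM CONFIGURATION table.
VALID_SCHEDULERS = {"TORQUE", "SGE", "SLURM", "LSF"}


def parse_install_config(text):
    """Parse assumed 'KEY = VALUE' lines into a dict, ignoring '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_scheduler(settings):
    """Return the CLUSTER_SYS value, or raise ValueError if it is not supported."""
    scheduler = settings.get("CLUSTER_SYS", "")
    if scheduler not in VALID_SCHEDULERS:
        raise ValueError(
            "CLUSTER_SYS must be one of %s, got %r"
            % (", ".join(sorted(VALID_SCHEDULERS)), scheduler)
        )
    return scheduler
```

For example, `check_scheduler(parse_install_config(open("config-install.txt").read()))` raises an error early if the scheduler name is misspelled.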
## Annotation Files
In order to process the raw data, HiC-Pro requires three annotation files. Note that the pipeline is provided with some Human and Mouse annotation files.
**Please be sure that the chromosome names are the same as the ones used in your bowtie indexes !**
- **A BED file** of the restriction fragments after digestion. This file depends on both the restriction enzyme and the reference genome. See the [FAQ](doc/FAQ.md) and the [HiC-Pro utilities](doc/UTILS.md) for details on how to generate this file. A few annotation files are provided with the HiC-Pro sources as examples.
```
chr1 0 16007 HIC_chr1_1 0 +
chr1 16007 24571 HIC_chr1_2 0 +
chr1 24571 27981 HIC_chr1_3 0 +
chr1 27981 30429 HIC_chr1_4 0 +
chr1 30429 32153 HIC_chr1_5 0 +
chr1 32153 32774 HIC_chr1_6 0 +
chr1 32774 37752 HIC_chr1_7 0 +
chr1 37752 38369 HIC_chr1_8 0 +
chr1 38369 38791 HIC_chr1_9 0 +
chr1 38791 39255 HIC_chr1_10 0 +
(...)
```
- **A table file** of chromosome sizes. This file can easily be found on the UCSC genome browser. Pay attention to contigs and scaffolds, and be aware that HiC-Pro will generate one map per chromosome pair. For well-annotated model organisms such as Human or Mouse, we usually recommend removing all scaffolds.
```
chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chr8 146364022
chr9 141213431
chr10 135534747
(...)
```
- **The bowtie2 indexes**. See the [bowtie2 manual page](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) for details about how to create such indexes.
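
Because mismatched chromosome names are a common source of trouble, it can help to cross-check the BED file against the chromosome size table before running the pipeline. The following is a minimal sketch under the assumption that both files are whitespace-separated as in the excerpts above; the function names are hypothetical and it does not inspect the bowtie2 indexes (Python 3).

```python
def read_chrom_sizes(lines):
    """Build {chromosome: size} from 'name size' lines, skipping other lines."""
    sizes = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            sizes[fields[0]] = int(fields[1])
    return sizes


def check_fragments(bed_lines, sizes):
    """Return messages for fragments on unknown chromosomes or out of bounds."""
    problems = []
    for number, line in enumerate(bed_lines, start=1):
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or placeholder line
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom not in sizes:
            problems.append("line %d: unknown chromosome %s" % (number, chrom))
        elif not 0 <= start < end <= sizes[chrom]:
            problems.append("line %d: %s:%d-%d is out of bounds" % (number, chrom, start, end))
    return problems
```

Both functions accept any iterable of lines, so open file handles can be passed directly; an empty result means every fragment lies on a chromosome listed in the size table.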
## How to use it ?
First, have a look at the help message:
```
HiC-Pro --help
usage : HiC-Pro -i INPUT -o OUTPUT -c CONFIG [-s ANALYSIS_STEP] [-p] [-h] [-v]
Use option -h|--help for more information
HiC-Pro 2.11.3
---------------
OPTIONS
-i|--input INPUT : input data folder; Must contains a folder per sample with input files
-o|--output OUTPUT : output folder
-c|--conf CONFIG : configuration file for Hi-C processing
[-p|--parallel] : if specified run HiC-Pro on a cluster
[-s|--step ANALYSIS_STEP] : run only a subset of the HiC-Pro workflow; if not specified the complete workflow is run
mapping: perform reads alignment - require fast files
proc_hic: perform Hi-C filtering - require BAM files
quality_checks: run Hi-C quality control plots
merge_persample: merge multiple inputs and remove duplicates if specified - require .validPairs files
build_contact_maps: Build raw inter/intrachromosomal contact maps - require .allValidPairs files
ice_norm : run ICE normalization on contact maps - require .matrix files
[-h|--help]: help
[-v|--version]: version
```
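
When HiC-Pro is launched from a script, the options in the help message can be assembled programmatically. The sketch below is a hypothetical helper (Python 3), not part of HiC-Pro: it uses only the flags and step names shown in the help output above, and assumes the executable is reachable as `HiC-Pro`.

```python
# Step names copied from the help message above.
VALID_STEPS = (
    "mapping",
    "proc_hic",
    "quality_checks",
    "merge_persample",
    "build_contact_maps",
    "ice_norm",
)


def build_hicpro_command(input_dir, output_dir, config, step=None,
                         parallel=False, executable="HiC-Pro"):
    """Return the argument list for one HiC-Pro invocation."""
    command = [executable, "-i", input_dir, "-o", output_dir, "-c", config]
    if step is not None:
        if step not in VALID_STEPS:
            raise ValueError("unknown step: %r" % step)
        command += ["-s", step]
    if parallel:
        command.append("-p")  # cluster mode
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`; building a list rather than a string avoids shell quoting problems with paths that contain spaces.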
- Copy and edit the configuration file *'config-hicpro.txt'* in your local folder. See the [manual](doc/MANUAL.md) for details about the configuration file
- Put all input files in a rawdata folder. The input files have to be organized with **one folder per sample**, such as:
```
+ PATH_TO_MY_DATA
  + sample1
    ++ file1_R1.fastq.gz
    ++ file1_R2.fastq.gz
    ++ ...
  + sample2
    ++ file1_R1.fastq.gz
    ++ file1_R2.fastq.gz
  *...
```
- Run HiC-Pro on your laptop in standalone mode
```
MY_INSTALL_PATH/bin/HiC-Pro -i FULL_PATH_TO_DATA_FOLDER -o FULL_PATH_TO_OUTPUTS -c MY_LOCAL_CONFIG_FILE
```
- Run HiC-Pro on a cluster (TORQUE/SGE/SLURM/LSF)
```
MY_INSTALL_PATH/bin/HiC-Pro -i FULL_PATH_TO_DATA_FOLDER -o FULL_PATH_TO_OUTPUTS -c MY_LOCAL_CONFIG_FILE -p
```
In the latter case, you will see the following message:
```
Please run HiC-Pro in two steps :
1- The following command will launch the parallel workflow through 12 torque jobs:
qsub HiCPro_step1.sh
2- The second command will merge all outputs to generate the contact maps:
qsub HiCPro_step2.sh
```
Execute the displayed command from the output folder:
```
qsub HiCPro_step1.sh
```
Once it has run successfully (this may take several hours), run the second step using:
```
qsub HiCPro_step2.sh
```
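
Before launching either mode, the one-folder-per-sample layout described above can be verified with a short script. This is a minimal sketch (Python 3) that assumes paired files are distinguished by `_R1` and `_R2` in their names, as in the example tree; the function name and the tag parameters are hypothetical and should be adapted to your own naming scheme.

```python
import os


def find_unpaired_reads(data_dir, r1_tag="_R1", r2_tag="_R2"):
    """Map each sample folder to the read files that lack their mate file."""
    unpaired = {}
    for sample in sorted(os.listdir(data_dir)):
        sample_dir = os.path.join(data_dir, sample)
        if not os.path.isdir(sample_dir):
            continue  # only sample folders are inspected
        files = set(os.listdir(sample_dir))
        lonely = []
        for name in sorted(files):
            if r1_tag in name and name.replace(r1_tag, r2_tag, 1) not in files:
                lonely.append(name)
            elif r2_tag in name and name.replace(r2_tag, r1_tag, 1) not in files:
                lonely.append(name)
        if lonely:
            unpaired[sample] = lonely
    return unpaired
```

An empty dictionary means every `_R1` file has a matching `_R2` file in the same sample folder.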
## Test Dataset
The test dataset and associated results are available [here](https://zerkalo.curie.fr/partage/HiC-Pro/).
Small fastq files (2M reads) extracted from the Dixon et al. 2012 paper are available for testing.
```
## Get the data. Will download a test_data folder and a configuration file
wget https://zerkalo.curie.fr/partage/HiC-Pro/HiCPro_testdata.tar.gz && tar -zxvf HiCPro_testdata.tar.gz
## Edit the configuration file and set the path to Human bowtie2 indexes
## Run HiC-Pro
time HICPRO_INSTALL_DIR/bin/HiC-Pro -c config_test_latest.txt -i test_data -o hicpro_latest_test
Run HiC-Pro 2.11.3
--------------------------------------------
Thu Mar 19, 12:18:10 (UTC+0100)
Bowtie2 alignment step1 ...
Logs: logs/dixon_2M_2/mapping_step1.log
Logs: logs/dixon_2M/mapping_step1.log
--------------------------------------------
Thu Mar 19, 12:18:57 (UTC+0100)
Bowtie2 alignment step2 ...
Logs: logs/dixon_2M_2/mapping_step2.log
Logs: logs/dixon_2M/mapping_step2.log
--------------------------------------------
Thu Mar 19, 12:19:08 (UTC+0100)
Combine R1/R2 alignment files ...
Logs: logs/dixon_2M_2/mapping_combine.log
Logs: logs/dixon_2M/mapping_combine.log
--------------------------------------------
Thu Mar 19, 12:19:13 (UTC+0100)
Mapping statistics for R1 and R2 tags ...
Logs: logs/dixon_2M_2/mapping_stats.log
Logs: logs/dixon_2M/mapping_stats.log
--------------------------------------------
Thu Mar 19, 12:19:15 (UTC+0100)
Pairing of R1 and R2 tags ...
Logs: logs/dixon_2M_2/mergeSAM.log
Logs: logs/dixon_2M/mergeSAM.log
--------------------------------------------
Thu Mar 19, 12:19:25 (UTC+0100)
Assign alignments to restriction fragments ...
Logs: logs/dixon_2M_2/mapped_2hic_fragments.log
Logs: logs/dixon_2M/mapped_2hic_fragments.log
--------------------------------------------
Thu Mar 19, 12:20:10 (UTC+0100)
Merge chunks from the same sample ...
Logs: logs/dixon_2M/merge_valid_interactions.log
Logs: logs/dixon_2M_2/merge_valid_interactions.log
--------------------------------------------
Thu Mar 19, 12:20:11 (UTC+0100)
Merge stat files per sample ...
Logs: logs/dixon_2M/merge_stats.log
Logs: logs/dixon_2M_2/merge_stats.log
--------------------------------------------
Thu Mar 19, 12:20:11 (UTC+0100)
Run quality checks for all samples ...
Logs: logs/dixon_2M/make_Rplots.log
Logs: logs/dixon_2M_2/make_Rplots.log
--------------------------------------------
Thu Mar 19, 12:20:22 (UTC+0100)
Generate binned matrix files ...
Logs: logs/dixon_2M/build_raw_maps.log
Logs: logs/dixon_2M_2/build_raw_maps.log
--------------------------------------------
Thu Mar 19, 12:20:22 (UTC+0100)
Run ICE Normalization ...
Logs: logs/dixon_2M/ice_500000.log
Logs: logs/dixon_2M/ice_1000000.log
Logs: logs/dixon_2M_2/ice_500000.log
Logs: logs/dixon_2M_2/ice_1000000.log
real 2m15,736s
user 4m3,277s
sys 0m24,423s
```
Owner
- Name: Institut Curie, Bioinformatics Core Facility
- Login: bioinfo-pf-curie
- Kind: organization
- Location: Paris, France
- Website: https://bioinfo-pf-curie.github.io/
- Repositories: 11
- Profile: https://github.com/bioinfo-pf-curie
bioinformatics platform of the Institut Curie