https://github.com/bioinfo-pf-curie/hic-pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing

https://github.com/bioinfo-pf-curie/hic-pro

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing

Basic Info
  • Host: GitHub
  • Owner: bioinfo-pf-curie
  • License: other
  • Language: Python
  • Default Branch: master
  • Size: 45.9 MB
Statistics
  • Stars: 0
  • Watchers: 4
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Fork of nservant/HiC-Pro
Created almost 9 years ago · Last pushed about 6 years ago

https://github.com/bioinfo-pf-curie/HiC-Pro/blob/master/

# HiC-Pro

### An optimized and flexible pipeline for Hi-C data processing

![MultiQC](https://img.shields.io/badge/MultiQC-1.6-blue.svg)
![Singularity](https://img.shields.io/badge/Singularity-build-brightgreen.svg)
[![Forum](https://img.shields.io/badge/Groups-%20join%20chat%20%E2%86%92-4fb99a.svg?style=flat-square)](https://groups.google.com/forum/#!forum/hic-pro)
[![DOI](https://img.shields.io/badge/DOI-10.1186%2Fs13059--015--0831--x-lightgrey.svg?style=flat-square)](https://doi.org/10.1186/s13059-015-0831-x)

----

Find documentation and examples at [http://nservant.github.io/HiC-Pro/](http://nservant.github.io/HiC-Pro/)

For any question about HiC-Pro, please contact nicolas.servant@curie.fr or use the [HiC-Pro forum](https://groups.google.com/forum/#!forum/hic-pro)

## What is HiC-Pro ?

HiC-Pro was designed to process Hi-C data, from raw fastq files (paired-end Illumina data) to normalized contact maps. It supports the main Hi-C protocols, including digestion protocols as well as protocols that do not require restriction enzymes such as DNase Hi-C. In practice, HiC-Pro was successfully applied to many data-sets including dilution Hi-C, in situ Hi-C, DNase Hi-C, Micro-C, capture-C, capture Hi-C or HiChip data.  
The pipeline is flexible, scalable and optimized. It can operate either on a single laptop or on a computational cluster. HiC-Pro is sequential and each step of the workflow can be run independantly.  
HiC-Pro includes a fast implementatation of the iterative correction method (see the [iced python package](https://github.com/hiclib/iced) for more information).
Finally, HiC-Pro can use phasing data to build [allele-specific contact maps](doc/AS.md).

If you use HiC-Pro, please cite :

*Servant N., Varoquaux N., Lajoie BR., Viara E., Chen CJ., Vert JP., Dekker J., Heard E., Barillot E.* HiC-Pro: An optimized and flexible pipeline for Hi-C processing. Genome Biology 2015, 16:259 [doi:10.1186/s13059-015-0831-x](https://doi.org/10.1186/s13059-015-0831-x)

## Using HiC-Pro through Singularity

HiC-Pro provides a Singularity container to ease its installation process.
A ready-to-use container is available [here](https://zerkalo.curie.fr/partage/HiC-Pro/singularity_images/hicpro_latest_ubuntu.img).

In order to build you own Singularity image;

1- Install singularity

- Linux : http://singularity.lbl.gov/install-linux
- MAC : http://singularity.lbl.gov/install-mac
- Windows : http://singularity.lbl.gov/install-windows

2- Build the singularity HiC-Pro image using the 'Singularity' file available in the HiC-Pro root directory.

```
sudo singularity build hicpro_latest_ubuntu.img MY_INSTALL_PATH/HiC-Pro/Singularity
```

3- Run HiC-pro

You can then either use HiC-Pro using the 'exec' command ;

```
singularity exec hicpro_latest_ubuntu.img HiC-Pro -h
```

Or directly use HiC-Pro within the Singularity shell

```
singularity shell hicpro_latest_ubuntu.img
HiC-Pro -h
```

## How to install it ?

The HiC-Pro pipeline requires the following dependencies :

- The [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) mapper
- Python (>2.7) with *pysam (>=0.8.3)*, *bx-python(>=0.5.0)*, *numpy(>=1.8.2)*, and *scipy(>=0.15.1)* libraries.  
**Note that the current version does not support python 3**
- R with the *RColorBrewer* and *ggplot2 (>2.2.1)* packages
- g++ compiler
- samtools (>1.1)
- Unix sort (**which support -V option**) is required ! For Mac OS user, please install the GNU core utilities !

Note that Bowtie >2.2.2 is strongly recommanded for allele specific analysis.  

To install HiC-Pro (>=2.7.8), be sure to have the appropriate rights and run :

```
tar -zxvf HiC-Pro-master.tar.gz
cd HiC-Pro-master
## Edit config-install.txt file if necessary
make configure
make install
```

For older version (<2.7.8), the following process can be used

```
tar -zxvf HiC-Pro-master.tar.gz
cd HiC-Pro-master
## Edit config-install.txt file if necessary
make CONFIG_SYS=config-install.txt install
```

Note that if some of these dependencies are not installed (i.e. not detected in the $PATH), HiC-Pro will try to install them.  
You can also edit the *config-install.txt* file and manually defined the paths to dependencies.


|               | SYSTEM CONFIGURATION                                                          |
|---------------|-------------------------------------------------------------------------------|
| PREFIX        | Path to installation folder                                                   |
| BOWTIE2_PATH  | Full path the bowtie2 installation directory                                  |
| SAMTOOLS_PATH | Full path to the samtools installation directory (>1.1   )                    |
| R_PATH        | Full path to the R installation directory                                     |
| PYTHON_PATH   | Full path to the python installation directory (>2.7 - python3 not supported) |
| CLUSTER_SYS   | Scheduler to use for cluster submission. Must be TORQUE, SGE, SLURM or LSF    |


## Annotation Files

In order to process the raw data, HiC-Pro requires three annotation files. Note that the pipeline is provided with some Human and Mouse annotation files.  
**Please be sure that the chromosome names are the same than the ones used in your bowtie indexes !**

- **A BED file** of the restriction fragments after digestion. This file depends both of the restriction enzyme and the reference genome. See the [FAQ](doc/FAQ.md) and the [HiC-Pro utilities](doc/UTILS.md) for details about how to generate this file. A few annotation files are provided with the HiC-Pro sources as examples.

```
   chr1   0       16007   HIC_chr1_1    0   +
   chr1   16007   24571   HIC_chr1_2    0   +
   chr1   24571   27981   HIC_chr1_3    0   +
   chr1   27981   30429   HIC_chr1_4    0   +
   chr1   30429   32153   HIC_chr1_5    0   +
   chr1   32153   32774   HIC_chr1_6    0   +
   chr1   32774   37752   HIC_chr1_7    0   +
   chr1   37752   38369   HIC_chr1_8    0   +
   chr1   38369   38791   HIC_chr1_9    0   +
   chr1   38791   39255   HIC_chr1_10   0   +
   (...)
```

- **A table file** of chromosomes' size. This file can be easily find on the UCSC genome browser. Of note, pay attention to the contigs or scaffolds, and be aware that HiC-pro will generate a map per chromosomes pair. For model organisms such as Human or Mouse, which are well annotated, we usually recommand to remove all scaffolds.  

```
   chr1    249250621
   chr2    243199373
   chr3    198022430
   chr4    191154276
   chr5    180915260
   chr6    171115067
   chr7    159138663
   chr8    146364022
   chr9    141213431
   chr10   135534747
   (...)
```

- **The bowtie2 indexes**. See the [bowtie2 manual page](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) for details about how to create such indexes.


## How to use it ?

First have a look at the help message !

```
  HiC-Pro --help
  usage : HiC-Pro -i INPUT -o OUTPUT -c CONFIG [-s ANALYSIS_STEP] [-p] [-h] [-v]
  Use option -h|--help for more information

  HiC-Pro 2.11.3
  ---------------
  OPTIONS

   -i|--input INPUT : input data folder; Must contains a folder per sample with input files
   -o|--output OUTPUT : output folder
   -c|--conf CONFIG : configuration file for Hi-C processing
   [-p|--parallel] : if specified run HiC-Pro on a cluster
   [-s|--step ANALYSIS_STEP] : run only a subset of the HiC-Pro workflow; if not specified the complete workflow is run
      mapping: perform reads alignment - require fast files
      proc_hic: perform Hi-C filtering - require BAM files
      quality_checks: run Hi-C quality control plots
      merge_persample: merge multiple inputs and remove duplicates if specified - require .validPairs files
      build_contact_maps: Build raw inter/intrachromosomal contact maps - require .allValidPairs files
      ice_norm : run ICE normalization on contact maps - require .matrix files
   [-h|--help]: help
   [-v|--version]: version
```

- Copy and edit the configuration file *'config-hicpro.txt'* in your local folder. See the [manual](doc/MANUAL.md) for details about the configuration file

- Put all input files in a rawdata folder. The input files have to be organized with **one folder per sample**, such as;

```
   + PATH_TO_MY_DATA
     + sample1
       ++ file1_R1.fastq.gz
       ++ file1_R2.fastq.gz
       ++ ...
     + sample2
       ++ file1_R1.fastq.gz
       ++ file1_R2.fastq.gz
     *...
```

- Run HiC-Pro on your laptop in standalone model

```
    MY_INSTALL_PATH/bin/HiC-Pro -i FULL_PATH_TO_DATA_FOLDER -o FULL_PATH_TO_OUTPUTS -c MY_LOCAL_CONFIG_FILE
```

  - Run HiC-Pro on a cluster (TORQUE/SGE/SLURM/LSF)

```
   MY_INSTALL_PATH/bin/HiC-Pro -i FULL_PATH_TO_DATA_FOLDER -o FULL_PATH_TO_OUTPUTS -c MY_LOCAL_CONFIG_FILE -p
```

In the latter case, you will have the following message :

```
  Please run HiC-Pro in two steps :
  1- The following command will launch the parallel workflow through 12 torque jobs:
  qsub HiCPro_step1.sh
  2- The second command will merge all outputs to generate the contact maps:
  qsub HiCPro_step2.sh
```

Execute the displayed command from the output folder:

```
  qsub HiCPro_step1.sh
```

Once executed succesfully (may take several hours), run the step using:

```
  qsub HiCPro_step2.sh
```

## Test Dataset

The test dataset and associated results are available [here](https://zerkalo.curie.fr/partage/HiC-Pro/).
Small fastq files (2M reads) extracted from the Dixon et al. 2012 paper are available for test.

```
 ## Get the data. Will download a test_data folder and a configuration file
 wget https://zerkalo.curie.fr/partage/HiC-Pro/HiCPro_testdata.tar.gz && tar -zxvf HiCPro_testdata.tar.gz

 ## Edit the configuration file and set the path to Human bowtie2 indexes

 ## Run HiC-Pro
 time HICPRO_INSTALL_DIR/bin/HiC-Pro -c config_test_latest.txt -i test_data -o hicpro_latest_test

Run HiC-Pro 2.11.3
--------------------------------------------
Thu Mar 19, 12:18:10 (UTC+0100)
Bowtie2 alignment step1 ...
Logs: logs/dixon_2M_2/mapping_step1.log
Logs: logs/dixon_2M/mapping_step1.log

--------------------------------------------
Thu Mar 19, 12:18:57 (UTC+0100)
Bowtie2 alignment step2 ...
Logs: logs/dixon_2M_2/mapping_step2.log
Logs: logs/dixon_2M/mapping_step2.log

--------------------------------------------
Thu Mar 19, 12:19:08 (UTC+0100)
Combine R1/R2 alignment files ...
Logs: logs/dixon_2M_2/mapping_combine.log
Logs: logs/dixon_2M/mapping_combine.log

--------------------------------------------
Thu Mar 19, 12:19:13 (UTC+0100)
Mapping statistics for R1 and R2 tags ...
Logs: logs/dixon_2M_2/mapping_stats.log
Logs: logs/dixon_2M/mapping_stats.log

--------------------------------------------
Thu Mar 19, 12:19:15 (UTC+0100)
Pairing of R1 and R2 tags ...
Logs: logs/dixon_2M_2/mergeSAM.log
Logs: logs/dixon_2M/mergeSAM.log

--------------------------------------------
Thu Mar 19, 12:19:25 (UTC+0100)
Assign alignments to restriction fragments ...
Logs: logs/dixon_2M_2/mapped_2hic_fragments.log
Logs: logs/dixon_2M/mapped_2hic_fragments.log

--------------------------------------------
Thu Mar 19, 12:20:10 (UTC+0100)
Merge chunks from the same sample ...
Logs: logs/dixon_2M/merge_valid_interactions.log
Logs: logs/dixon_2M_2/merge_valid_interactions.log

--------------------------------------------
Thu Mar 19, 12:20:11 (UTC+0100)
Merge stat files per sample ...
Logs: logs/dixon_2M/merge_stats.log
Logs: logs/dixon_2M_2/merge_stats.log

--------------------------------------------
Thu Mar 19, 12:20:11 (UTC+0100)
Run quality checks for all samples ...
Logs: logs/dixon_2M/make_Rplots.log
Logs: logs/dixon_2M_2/make_Rplots.log

--------------------------------------------
Thu Mar 19, 12:20:22 (UTC+0100)
Generate binned matrix files ...
Logs: logs/dixon_2M/build_raw_maps.log
Logs: logs/dixon_2M_2/build_raw_maps.log

--------------------------------------------
Thu Mar 19, 12:20:22 (UTC+0100)
Run ICE Normalization ...
Logs: logs/dixon_2M/ice_500000.log
Logs: logs/dixon_2M/ice_1000000.log
Logs: logs/dixon_2M_2/ice_500000.log
Logs: logs/dixon_2M_2/ice_1000000.log

real	2m15,736s
user	4m3,277s
sys	0m24,423s

```

Owner

  • Name: Institut Curie, Bioinformatics Core Facility
  • Login: bioinfo-pf-curie
  • Kind: organization
  • Location: Paris, France

bioinformatics platform of the Institut Curie

GitHub Events

Total
Last Year