pairtools

Extract 3D contacts (.pairs) from sequencing alignments

https://github.com/open2c/pairtools

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    3 of 12 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary

Keywords

3d-genome bioinformatics file-formatter hi-c ngs pairs-file python

Keywords from Contributors

genomics dataframes genomic-intervals genomic-ranges ngs-analysis spatial-join chromatin contact-matrix cooler file-format
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: open2c
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 3.13 MB
Statistics
  • Stars: 115
  • Watchers: 10
  • Forks: 35
  • Open Issues: 48
  • Releases: 14
Created about 9 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License

README.md

pairtools

Process Hi-C pairs with pairtools

pairtools is a simple and fast command-line framework to process sequencing data from a Hi-C experiment.

pairtools processes paired-end sequence alignments and performs the following operations:

  • detect ligation junctions (a.k.a. Hi-C pairs) in aligned paired-end sequences of Hi-C DNA molecules
  • sort .pairs files for downstream analyses
  • detect, tag and remove PCR/optical duplicates
  • generate extensive statistics of Hi-C datasets
  • select Hi-C pairs given flexibly defined criteria
  • restore .sam alignments from Hi-C pairs
  • annotate restriction digestion sites
  • get the mutated positions in Hi-C pairs

To get started:

  • visit the pairtools tutorials,
  • take a look at a quick example,
  • check out the detailed documentation.

Data formats

pairtools produces and operates on tab-separated files compliant with the .pairs format defined by the 4D Nucleome Consortium. All pairtools commands properly manage file headers and keep track of the data processing history.

Additionally, pairtools defines the .pairsam format, an extension of .pairs that includes the SAM alignments of a sequenced Hi-C molecule. .pairsam complies with the .pairs format, and can be processed by any tool that operates on .pairs files.

pairtools can also produce a set of extra columns, which describe properties of alignments, phase, mutations, restriction and complex walks. The full list of possible extra columns is provided in the pairtools format specification.
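
As a rough illustration of the layout described above, the following Python sketch splits a .pairs stream into header lines, column names and rows. The sample text, the `read_pairs` helper and the fallback column list are assumptions made for this example; they are not part of pairtools, and real files may carry more header fields and extra columns.

```python
import io

# Hypothetical miniature .pairs content, written for this example only.
EXAMPLE = (
    "## pairs format v1.0\n"
    "#chromsize: chrI 230218\n"
    "#chromsize: chrII 813184\n"
    "#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type\n"
    "read1\tchrI\t100\tchrII\t5000\t+\t-\tUU\n"
    "read2\tchrI\t250\tchrI\t900\t-\t+\tUU\n"
)

# Assumed fallback when no "#columns:" line is present.
DEFAULT_COLUMNS = ["readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]


def read_pairs(handle):
    """Split a .pairs stream into header lines, column names and row dicts."""
    header, columns, rows = [], list(DEFAULT_COLUMNS), []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            if line.startswith("#columns:"):
                # Column names follow the "#columns:" keyword, space-separated.
                columns = line.split()[1:]
            continue
        # Data rows are TAB-separated; values are kept as strings here.
        rows.append(dict(zip(columns, line.split("\t"))))
    return header, columns, rows


header, columns, rows = read_pairs(io.StringIO(EXAMPLE))
print(columns)
print(rows[0]["chrom2"], rows[0]["pos2"])
```

Because the header travels with the data, a reader like this needs no outside schema to interpret the rows.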

Installation

Requirements:

  • Python 3.x
  • Python packages cython, pysam, bioframe, pyyaml, numpy, scipy, pandas and click.
  • Command-line utilities sort (the Unix version), samtools and bgzip (shipped with samtools). If available, pairtools can compress outputs with pbgzip and lz4.

For the full list of recommended versions, see the requirements section in the pyproject.toml.

There are three options for installing pairtools:

  1. We highly recommend using the conda package manager to install pairtools together with all its dependencies. To get it, you can either install the full Anaconda Python distribution or just the standalone conda package manager.

With conda, you can install pairtools and all of its dependencies from the bioconda channel:

    $ conda install -c conda-forge -c bioconda pairtools

  2. Alternatively, install non-Python dependencies (sort, samtools, bgzip, pbgzip and lz4) separately and download pairtools with Python dependencies from PyPI using pip:

    $ pip install pairtools

  3. Finally, when the two options above don't work or when you want to modify pairtools, build pairtools from source via pip's "editable" mode:

    $ pip install numpy cython pysam
    $ git clone https://github.com/open2c/pairtools
    $ cd pairtools
    $ pip install -e ./ --no-build-isolation

Quick example

Set up a new test folder and download a small Hi-C dataset mapped to the sacCer3 genome:

    $ mkdir /tmp/test-pairtools
    $ cd /tmp/test-pairtools
    $ wget https://github.com/open2c/distiller-test-data/raw/master/bam/MATalpha_R1.bam

Additionally, we will need a .chromsizes file, a TAB-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:

    $ wget https://raw.githubusercontent.com/open2c/distiller-test-data/master/genome/sacCer3.reduced.chrom.sizes
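
To make the .chromsizes layout concrete, here is a minimal Python sketch that reads such a table into a dictionary while preserving chromosome order. The three sample rows and the `read_chromsizes` helper are illustrative assumptions, not the contents of the downloaded file or a pairtools function.

```python
import io

# Hypothetical excerpt: one "name<TAB>size" row per chromosome.
EXAMPLE = "chrI\t230218\nchrII\t813184\nchrIII\t316620\n"


def read_chromsizes(handle):
    """Return {name: size}; dicts keep insertion order, so file order is preserved."""
    sizes = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        name, size = line.split("\t")[:2]
        sizes[name] = int(size)
    return sizes


chromsizes = read_chromsizes(io.StringIO(EXAMPLE))
# Rank of each chromosome in the file, usable later as an ordering index.
order = {name: rank for rank, name in enumerate(chromsizes)}
print(order)
```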

With pairtools parse, we can convert paired-end sequence alignments stored in .sam/.bam format into .pairs, a TAB-separated table of Hi-C ligation junctions:

    $ pairtools parse -c sacCer3.reduced.chrom.sizes -o MATalpha_R1.pairs.gz --drop-sam MATalpha_R1.bam

Inspect the resulting table:

    $ less MATalpha_R1.pairs.gz
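
The same inspection can be done programmatically. The sketch below counts header lines and pair types in a gzip-compressed .pairs file using only the Python standard library. It writes its own tiny demo file so that it runs without the download; the demo rows, the `summarize_pairs` helper and the assumption that `pair_type` is the eighth column are all illustrative.

```python
import gzip
import os
import tempfile
from collections import Counter


def summarize_pairs(path):
    """Count header lines and tally pair types in a gzip-compressed .pairs file."""
    n_header = 0
    pair_types = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                n_header += 1
                continue
            fields = line.rstrip("\n").split("\t")
            # Assumption: pair_type is stored in the 8th column.
            pair_types[fields[7]] += 1
    return n_header, pair_types


# Build a small demo file (hypothetical content) instead of using real data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pairs.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("## pairs format v1.0\n")
    handle.write("#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type\n")
    handle.write("r1\tchrI\t100\tchrII\t5000\t+\t-\tUU\n")
    handle.write("r2\t!\t0\tchrI\t250\t-\t+\tNU\n")
    handle.write("r3\tchrI\t250\tchrI\t900\t-\t+\tUU\n")

n_header, pair_types = summarize_pairs(demo_path)
print(n_header, dict(pair_types))
```

To try it on the file produced above, pass the path to MATalpha_R1.pairs.gz instead of `demo_path`.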

Pipelines

  • We provide a simple working example of a mapping bash pipeline in /examples/.
  • distiller is a powerful Hi-C data analysis workflow, based on pairtools and nextflow.

Tools

  • parse: read .sam/.bam files produced by bwa and form Hi-C pairs

    • form Hi-C pairs by reporting the outer-most mapped positions and the strand on either side of each molecule;
    • report unmapped/multimapped (ambiguous alignments)/chimeric alignments as chromosome "!", position 0, strand "-";
    • perform upper-triangular flipping of the sides of Hi-C molecules such that the first side has a lower sorting index than the second side;
    • form hybrid pairsam output, where each line contains all available data for one Hi-C molecule (outer-most mapped positions on either side, read ID, pair type, and .sam entries for each alignment);
    • report .sam tags or mutations of the alignments;
    • print the .sam header as #-comment lines at the start of the file.
  • parse2: read .sam/.bam files with long paired-end or single-end reads and form Hi-C pairs from complex walks

    • identify and rescue chimeric alignments produced by singly-ligated Hi-C molecules with a sequenced ligation junction on one of the sides;
    • annotate chimeric alignments by restriction fragments and report true junctions and hops (One-Read-Based Interactions Annotation, ORBITA);
    • perform intra-molecule deduplication of paired-end data when one side reads through the DNA on the other side of the read;
    • report index of the pair in the complex walk;
    • make combinatorial expansion of pairs produced from the same walk;
  • sort: sort pairs files (the lexicographic order for chromosomes, the numeric order for the positions, the lexicographic order for pair types).

  • merge: merge sorted .pairs files

    • merge sort .pairs;
    • combine the .pairs headers from all input files;
    • check that each .pairs file was mapped to the same reference genome index (by checking the identity of the @SQ sam header lines).
  • select: select pairs according to specified criteria

    • select pairs entries according to the provided condition. A programmable interface allows for arbitrarily complex queries on specific pair types, chromosomes, positions, strands, read IDs (including matches to a wildcard/regexp/list).
    • optionally print the non-matching entries into a separate file.
  • dedup: remove PCR duplicates from a sorted triu-flipped .pairs file

    • remove PCR duplicates by finding pairs of entries with both sides mapped to similar genomic locations (+/- N bp);
    • optionally output the PCR duplicate entries into a separate file;
    • detect optical duplicates from the original Illumina read ids;
    • apply filtering by various properties of pairs (MAPQ; orientation; distance) together with deduplication;
    • output yaml or convenient tsv deduplication stats into a text file;
    • NOTE: in order to remove all PCR duplicates, the input must contain *all* mapped read pairs from a single experimental replicate.
  • maskasdup: mark all pairs in a pairsam as Hi-C duplicates

    • change the field pair_type to DD;
    • change the pair_type tag (Yt:Z:) for all sam alignments;
    • set the PCR duplicate binary flag for all sam alignments (0x400).
  • split: split a .pairsam file into .pairs and .sam.

  • flip: flip pairs to get an upper-triangular matrix

  • header: manipulate the .pairs/.pairsam header

    • generate new header for headerless .pairs file
    • transfer header from one .pairs file to another
    • set column names for the .pairs file
    • validate that the header corresponds to the information stored in .pairs file
  • stats: calculate various statistics of .pairs files

  • restrict: identify the span of the restriction fragment forming a Hi-C junction

  • phase: phase pairs mapped to a diploid genome
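
The flipping, sorting and deduplication steps listed above can be sketched in plain Python. This is a simplified illustration, not the pairtools implementation: the `Pair` tuple, `chrom_order`, `sort_key` and `max_mismatch` names are hypothetical, the sort key order (chrom1, chrom2, pos1, pos2) is an assumption, and a pair is called a duplicate only if an earlier kept pair has the same chromosomes and strands with both positions within `max_mismatch` bp.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    read_id: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    strand1: str
    strand2: str


def flip(pair, chrom_order):
    """Swap sides so that side 1 has the lower (chromosome rank, position)."""
    side1 = (chrom_order[pair.chrom1], pair.pos1)
    side2 = (chrom_order[pair.chrom2], pair.pos2)
    if side1 <= side2:
        return pair
    return Pair(pair.read_id, pair.chrom2, pair.pos2,
                pair.chrom1, pair.pos1, pair.strand2, pair.strand1)


def sort_key(pair):
    # Lexicographic on chromosome names, numeric on positions (assumed key order).
    return (pair.chrom1, pair.chrom2, pair.pos1, pair.pos2)


def dedup(sorted_pairs, max_mismatch=3):
    """Split sorted, flipped pairs into (kept, duplicates)."""
    kept, dups = [], []
    for pair in sorted_pairs:
        is_dup = False
        # Walk back through kept pairs; sorting lets us stop early.
        for prev in reversed(kept):
            if (prev.chrom1, prev.chrom2) != (pair.chrom1, pair.chrom2):
                break
            if pair.pos1 - prev.pos1 > max_mismatch:
                break
            same_strands = (prev.strand1, prev.strand2) == (pair.strand1, pair.strand2)
            if same_strands and abs(pair.pos2 - prev.pos2) <= max_mismatch:
                is_dup = True
                break
        (dups if is_dup else kept).append(pair)
    return kept, dups


chrom_order = {"chrI": 0, "chrII": 1}  # e.g. taken from a .chromsizes file
raw = [
    Pair("r1", "chrII", 5000, "chrI", 100, "-", "+"),  # needs flipping
    Pair("r2", "chrI", 101, "chrII", 5002, "+", "-"),  # within 3 bp of flipped r1
    Pair("r3", "chrI", 101, "chrII", 5002, "+", "+"),  # same place, other strand
    Pair("r4", "chrI", 900, "chrI", 250, "-", "+"),    # needs flipping
]
flipped = [flip(p, chrom_order) for p in raw]
kept, dups = dedup(sorted(flipped, key=sort_key))
print([p.read_id for p in kept], [p.read_id for p in dups])
```

Flipping before sorting is what places duplicate molecules next to each other regardless of which read was sequenced first, which is why dedup expects sorted, triu-flipped input.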

Contributing

Pull requests are welcome.

For development, clone and install in "editable" (i.e. development) mode with the -e option. This way you can also pull changes on the fly.

    $ git clone https://github.com/open2c/pairtools.git
    $ cd pairtools
    $ pip install -e .

Citing pairtools

Open2C, Nezar Abdennur, Geoffrey Fudenberg, Ilya M. Flyamer, Aleksandra A. Galitsyna, Anton Goloborodko*, Maxim Imakaev, Sergey V. Venev. "Pairtools: from sequencing data to chromosome contacts." bioRxiv, February 13, 2023. doi: https://doi.org/10.1101/2023.02.13.528389

License

MIT

Owner

  • Name: Open Chromosome Collective
  • Login: open2c
  • Kind: organization
  • Email: open.chromosome.collective@gmail.com

GitHub Events

Total
  • Create event: 8
  • Release event: 2
  • Issues event: 14
  • Watch event: 13
  • Delete event: 11
  • Issue comment event: 37
  • Push event: 20
  • Pull request event: 18
  • Pull request review event: 14
  • Pull request review comment event: 14
  • Fork event: 6
Last Year
  • Create event: 8
  • Release event: 2
  • Issues event: 14
  • Watch event: 13
  • Delete event: 11
  • Issue comment event: 37
  • Push event: 20
  • Pull request event: 18
  • Pull request review event: 14
  • Pull request review comment event: 14
  • Fork event: 6

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 680
  • Total Committers: 12
  • Avg Commits per committer: 56.667
  • Development Distribution Score (DDS): 0.256
Past Year
  • Commits: 30
  • Committers: 2
  • Avg Commits per committer: 15.0
  • Development Distribution Score (DDS): 0.2
Top Committers
Name Email Commits
golobor g****n@g****m 506
Phlya f****r@g****m 71
Aleksandra Galitsyna a****a@g****m 54
sergpolly s****y@g****m 18
Nezar n****r@g****m 17
gfudenberg g****g@g****m 5
anton.goloborodko a****o@c****t 4
hbbrandao h****o 1
Sameer Abraham s****0@g****m 1
Hugo Brandao h****o@g****u 1
Hagai Kariti h****k@c****l 1
Fliamer f****a@x****h 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 124
  • Total pull requests: 61
  • Average time to close issues: over 1 year
  • Average time to close pull requests: 3 months
  • Total issue authors: 60
  • Total pull request authors: 14
  • Average comments per issue: 3.9
  • Average comments per pull request: 1.97
  • Merged pull requests: 44
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 12
  • Pull requests: 13
  • Average time to close issues: 5 months
  • Average time to close pull requests: 18 days
  • Issue authors: 12
  • Pull request authors: 5
  • Average comments per issue: 2.58
  • Average comments per pull request: 0.23
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • golobor (21)
  • Phlya (11)
  • sergpolly (10)
  • agalitsyna (6)
  • jiangshan529 (4)
  • bskubi (3)
  • zhqu1148980644 (3)
  • gdolsten (3)
  • wbszhu (3)
  • aakashsur (2)
  • YingChen94 (2)
  • nservant (2)
  • kalavattam (2)
  • SaimMomin12 (2)
  • mblanche (2)
Pull Request Authors
  • golobor (32)
  • Phlya (18)
  • agalitsyna (16)
  • ShigrafS (3)
  • sergpolly (2)
  • muffato (2)
  • Reovirus (2)
  • Unique-Usman (2)
  • hkariti (2)
  • stefanor (1)
  • frankyan (1)
  • gfudenberg (1)
  • itsameerkat (1)
  • ccoulombe (1)
Top Labels
Issue Labels
enhancement (18) help wanted (7) question (4) bug (3)
Pull Request Labels
bug (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 383 last-month
  • Total docker downloads: 55
  • Total dependent packages: 3
  • Total dependent repositories: 4
  • Total versions: 13
  • Total maintainers: 2
pypi.org: pairtools

CLI tools to process mapped Hi-C data

  • Versions: 13
  • Dependent Packages: 3
  • Dependent Repositories: 4
  • Downloads: 383 Last month
  • Docker Downloads: 55
Rankings
Dependent packages count: 3.2%
Docker downloads count: 4.1%
Average: 7.0%
Forks count: 7.1%
Dependent repos count: 7.5%
Stargazers count: 7.9%
Downloads: 12.2%
Maintainers (2)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • bioframe >=0.3.3
  • click >=6.6
  • cython *
  • nose >=1.3
  • numpy >=1.10
  • pandas >=1.3.4
  • pysam >=0.15.0
  • pyyaml *
  • scipy >=1.7.0
requirements_doc.txt pypi
  • Cython *
  • Sphinx *
  • bioframe *
  • click >=7.0
  • docutils >0.16
  • ipython *
  • nbsphinx *
  • nose *
  • numpy *
  • pandas *
  • pysam *
  • scipy *
  • sphinx_rtd_theme *
.github/workflows/python-package.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/python-publish-test.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/python-publish.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
requirements-dev.txt pypi
  • pytest * development
  • pytest-cov * development
  • pytest-flake8 * development