cyvcf2

cython + htslib == fast VCF and BCF processing

https://github.com/brentp/cyvcf2

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    11 of 43 committers (25.6%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary

Keywords

bioinformatics cython genomics htslib vcf

Keywords from Contributors

tskit trees msprime genetics coalescent argument-parser structural-variation dna genotype alignment
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: brentp
  • License: mit
  • Language: Cython
  • Default Branch: main
  • Homepage:
  • Size: 14.4 MB
Statistics
  • Stars: 411
  • Watchers: 6
  • Forks: 75
  • Open Issues: 47
  • Releases: 0
Topics
bioinformatics cython genomics htslib vcf
Created over 10 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License

README.md

cyvcf2

Note: cyvcf2 versions < 0.20.0 require htslib < 1.10. cyvcf2 versions >= 0.20.0 require htslib >= 1.10
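The version rule above is easy to misapply when versions are compared as strings or floats, since htslib 1.9 is older than 1.10. Below is a minimal sketch of the check, assuming plain dotted numeric version strings. The helper names `parse_version` and `is_compatible` are hypothetical and are not part of cyvcf2.

```python
def parse_version(text):
    """Turn '1.10.2' into (1, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(cyvcf2_version, htslib_version):
    """Apply the rule from the note: cyvcf2 < 0.20.0 needs htslib < 1.10,
    and cyvcf2 >= 0.20.0 needs htslib >= 1.10."""
    new_cyvcf2 = parse_version(cyvcf2_version) >= (0, 20)
    new_htslib = parse_version(htslib_version) >= (1, 10)
    # Compatible only when both sit on the same side of the boundary.
    return new_cyvcf2 == new_htslib


if __name__ == "__main__":
    print(is_compatible("0.20.0", "1.10"))  # True
    print(is_compatible("0.20.0", "1.9"))   # False
```

Comparing integer tuples avoids the pitfall where `"1.9" > "1.10"` as strings, or `1.9 > 1.10` as floats.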

The latest documentation for cyvcf2 can be found here:

Docs

If you use cyvcf2, please cite the paper

Fast python (2 and 3) parsing of VCF and BCF including region-queries.

cyvcf2 is a cython wrapper around htslib built for fast parsing of Variant Call Format (VCF) files.

Attributes like variant.gt_ref_depths work for diploid samples and return a numpy array directly, so they are immediately ready for downstream use. Note that the array is backed by the underlying C data, so once variant goes out of scope the array will contain nonsense. To persist a copy, use cpy = np.array(variant.gt_ref_depths) instead of just arr = variant.gt_ref_depths.
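The lifetime caveat matters most when values are collected across a loop, because each array must be copied before the next variant replaces the current one. Below is a minimal sketch of that pattern. The helper name `collect_ref_depths` is hypothetical, and it uses the array's `.copy()` method, which has the same effect here as the `np.array(...)` call mentioned above.

```python
def collect_ref_depths(variants):
    """Map (chrom, start) -> an owned copy of each variant's gt_ref_depths."""
    kept = {}
    for variant in variants:
        # .copy() allocates fresh memory owned by the new array, so the
        # values stay valid after `variant` is released.
        kept[(variant.CHROM, variant.start)] = variant.gt_ref_depths.copy()
    return kept
```

With cyvcf2 installed, this could be called as `collect_ref_depths(VCF('some.vcf.gz'))`. Storing `variant.gt_ref_depths` directly in the dictionary would keep views into memory that is no longer valid.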

Example

The example below shows much of the use of cyvcf2.

```Python
from cyvcf2 import VCF

vcf = VCF('some.vcf.gz')  # or VCF('some.bcf')
n_samples = len(vcf.samples)

for variant in vcf:
    variant.REF, variant.ALT  # e.g. REF='A', ALT=['C', 'T']

    variant.CHROM, variant.start, variant.end, variant.ID, \
        variant.FILTER, variant.QUAL

    # numpy arrays of specific things we pull from the sample fields.
    # gt_types is array of 0,1,2,3==HOM_REF, HET, UNKNOWN, HOM_ALT
    variant.gt_types, variant.gt_ref_depths, variant.gt_alt_depths  # numpy arrays
    variant.gt_phases, variant.gt_quals, variant.gt_bases  # numpy array

    ## INFO Field.
    ## extract from the info field by its name:
    variant.INFO.get('DP')  # int
    variant.INFO.get('FS')  # float
    variant.INFO.get('AC')  # float

    # convert back to a string.
    str(variant)

    ## sample info...

    # Get a numpy array of the depth per sample:
    dp = variant.format('DP')
    # or of any other format field:
    sb = variant.format('SB')
    assert sb.shape == (n_samples, 4)  # 4 values per sample

# to do a region-query:

vcf = VCF('some.vcf.gz')
for v in vcf('11:435345-556565'):
    if v.INFO["AF"] > 0.1:
        continue
    print(str(v))
```
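Two patterns from the example above are often wrapped in small helpers: turning the gt_types codes into named counts, and filtering a region query on AF. The sketch below assumes the code-to-name mapping listed in the example comment. It also assumes that AF can be missing from a record or can hold several values at multi-allelic sites. The helper names `tally_gt_types` and `rare_variants` are hypothetical and are not part of cyvcf2.

```python
from collections import Counter

# Codes as listed in the example above: 0,1,2,3 == HOM_REF, HET, UNKNOWN, HOM_ALT.
GT_NAMES = {0: "HOM_REF", 1: "HET", 2: "UNKNOWN", 3: "HOM_ALT"}


def tally_gt_types(gt_types):
    """Count samples per genotype class for a single variant."""
    counts = Counter(GT_NAMES[int(code)] for code in gt_types)
    return {name: counts.get(name, 0) for name in GT_NAMES.values()}


def rare_variants(vcf, region, max_af=0.1):
    """Yield variants in `region` whose AF is present and <= max_af."""
    for v in vcf(region):
        af = v.INFO.get("AF")
        if af is None:
            continue  # no AF annotation: skip instead of raising KeyError
        # Assumption: AF may hold one value per ALT allele; use the largest.
        if isinstance(af, (tuple, list)):
            af = max(af)
        if af <= max_af:
            yield v
```

With cyvcf2 installed, `rare_variants(VCF('some.vcf.gz'), '11:435345-556565')` would keep the same records as the loop in the example, and it also tolerates records without an AF entry.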

Installation

pip with bundled htslib

pip install cyvcf2

pip with system htslib

This assumes you have already built and installed htslib version 1.12 or higher:

CYVCF2_HTSLIB_MODE=EXTERNAL pip install --no-binary cyvcf2 cyvcf2

windows (experimental, only tested on MSYS2)

This assumes you have already built and installed htslib:

SETUPTOOLS_USE_DISTUTILS=stdlib pip install cyvcf2

github (building htslib and cyvcf2 from source)

```
git clone --recursive https://github.com/brentp/cyvcf2
cd cyvcf2
pip install -r requirements.txt

# sometimes it can be required to remove old files:
# python setup.py clean_ext

CYVCF2_HTSLIB_MODE=BUILTIN CYTHONIZE=1 python setup.py install

# or to use a system htslib.so
CYVCF2_HTSLIB_MODE=EXTERNAL python setup.py install
```

On OSX, using brew, you may have to set the following as indicated by the brew install:

```
For compilers to find openssl you may need to set:
  export LDFLAGS="-L/usr/local/opt/openssl/lib"
  export CPPFLAGS="-I/usr/local/opt/openssl/include"

For pkg-config to find openssl you may need to set:
  export PKG_CONFIG_PATH="/usr/local/opt/openssl/lib/pkgconfig"
```
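After any of the install routes above, it can help to confirm which cyvcf2 version the current Python environment actually sees, since the htslib requirement depends on it. Below is a small sketch using only the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package="cyvcf2"):
    """Return the installed distribution version, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "cyvcf2 is not installed in this environment")
```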

Testing

Install pytest, then tests can be run with:

pytest

CLI

Run with cyvcf2 path_to_vcf

```
$ cyvcf2 --help
Usage: cyvcf2 [OPTIONS] or -

  fast vcf parsing with cython + htslib

Options:
  -c, --chrom TEXT                Specify what chromosome to include.
  -s, --start INTEGER             Specify the start of region.
  -e, --end INTEGER               Specify the end of the region.
  --include TEXT                  Specify what info field to include.
  --exclude TEXT                  Specify what info field to exclude.
  --loglevel [DEBUG|INFO|WARNING|ERROR|CRITICAL]
                                  Set the level of log output.  [default:
                                  INFO]
  --silent                        Skip printing of vcf.
  --help                          Show this message and exit.
```
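When the CLI is driven from a script, building the argument list as a Python list avoids shell-quoting problems. The sketch below uses only the flags shown in the help output above. The helper names `build_cyvcf2_argv` and `run_cyvcf2` are hypothetical, and it assumes the `cyvcf2` executable is on PATH and that each option is passed at most once.

```python
import subprocess


def build_cyvcf2_argv(vcf_path, chrom=None, start=None, end=None,
                      include=None, exclude=None, silent=False):
    """Assemble an argument list using only flags listed in `cyvcf2 --help`."""
    argv = ["cyvcf2"]
    for flag, value in (("--chrom", chrom), ("--start", start), ("--end", end),
                        ("--include", include), ("--exclude", exclude)):
        if value is not None:
            argv += [flag, str(value)]
    if silent:
        argv.append("--silent")
    argv.append(str(vcf_path))  # positional VCF path comes last
    return argv


def run_cyvcf2(vcf_path, **options):
    """Run the CLI and return its stdout as text (needs cyvcf2 on PATH)."""
    result = subprocess.run(build_cyvcf2_argv(vcf_path, **options),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `run_cyvcf2('some.vcf.gz', chrom='11', start=435345, end=556565)` would restrict the output to the region used in the region-query example.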

See Also

Pysam also has a cython wrapper to htslib, and one block of code here is taken directly from that library. However, the optimizations that we want for gemini are very specific, so we have chosen to create a separate project.

Performance

For the performance comparison in the paper, we used Thousand Genomes chromosome 22, with the full comparison runner here.

Owner

  • Name: Brent Pedersen
  • Login: brentp
  • Kind: user
  • Location: Oregon, USA

Doing genomics

GitHub Events

Total
  • Issues event: 5
  • Watch event: 33
  • Issue comment event: 10
  • Fork event: 3
Last Year
  • Issues event: 5
  • Watch event: 33
  • Issue comment event: 10
  • Fork event: 3

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 475
  • Total Committers: 43
  • Avg Commits per committer: 11.047
  • Development Distribution Score (DDS): 0.276
Past Year
  • Commits: 46
  • Committers: 7
  • Avg Commits per committer: 6.571
  • Development Distribution Score (DDS): 0.674
Top Committers
Name Email Commits
Brent Pedersen b****e@g****m 344
Tom White t****e@g****m 21
graphenn g****n@g****m 15
indraniel i****l@g****m 12
Arya Massarat 2****m 7
Jerome Kelleher jk@w****k 7
Liang-Bo Wang l****g@w****u 5
Måns Magnusson m****n@s****e 5
Dave Lawrence d****w@g****m 4
Marcel Martin m****n@s****e 4
Sam Lichtenberg s****e@g****m 4
Nils Homer n****3 4
Graham Gower g****r@g****m 3
Michael Hall m****l@m****h 3
Tim Millar t****r 2
Elias Kuthe e****e@t****e 2
LiterallyUniqueLogin j****e@g****m 2
Sander Bollen a****n@l****l 2
Wouter De Coster d****r@g****m 2
arq5x a****x@v****u 2
cclauss c****s@b****h 2
chapmanb c****b@5****m 2
Ben Jeffery b****y@b****k 1
Ben Jeffery b****y@g****m 1
Danilo Horta d****a@p****e 1
Derek Croote d****e 1
Dave Larson d****n@g****u 1
Gehring, Julian j****g@i****m 1
Gilad Mishne g****d@m****g 1
James Eapen j****n@g****m 1
and 13 more...
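The All Time figures above can be reproduced from the commit counts. The short sketch below assumes that the Development Distribution Score is one minus the top committer's share of all commits. That definition is not stated on this page, though it matches the reported value.

```python
total_commits = 475
total_committers = 43
top_committer_commits = 344  # Brent Pedersen, first row of the table

avg_commits = total_commits / total_committers
dds = 1 - top_committer_commits / total_commits  # assumed definition

print(round(avg_commits, 3))  # 11.047
print(round(dds, 3))          # 0.276
```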

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 105
  • Total pull requests: 52
  • Average time to close issues: 6 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 71
  • Total pull request authors: 22
  • Average comments per issue: 3.96
  • Average comments per pull request: 2.56
  • Merged pull requests: 45
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 1
  • Average time to close issues: 15 days
  • Average time to close pull requests: 3 days
  • Issue authors: 5
  • Pull request authors: 1
  • Average comments per issue: 2.4
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • tomwhite (15)
  • davmlaw (5)
  • grahamgower (3)
  • vsbuffalo (3)
  • hammer (3)
  • jeromekelleher (2)
  • quattro (2)
  • nh13 (2)
  • CholoTook (2)
  • alanwilter (2)
  • brentp (2)
  • gnxsf (2)
  • graphenn (2)
  • shubhamsaini (1)
  • adrienlemeur (1)
Pull Request Authors
  • tomwhite (20)
  • graphenn (9)
  • grahamgower (3)
  • benjeffery (2)
  • esrice (2)
  • nakib103 (2)
  • EQt (2)
  • jeromekelleher (2)
  • Hoeze (2)
  • horta (1)
  • kullrich (1)
  • CholoTook (1)
  • davmlaw (1)
  • stefanor (1)
  • brentp (1)
Top Labels
Issue Labels
enhancement (2) bug (1)
Pull Request Labels
enhancement (1)

Packages

  • Total packages: 3
  • Total downloads:
    • pypi 82,567 last-month
  • Total docker downloads: 1,802
  • Total dependent packages: 45
    (may contain duplicates)
  • Total dependent repositories: 99
    (may contain duplicates)
  • Total versions: 197
  • Total maintainers: 2
pypi.org: cyvcf2

fast vcf parsing with cython + htslib

  • Versions: 95
  • Dependent Packages: 43
  • Dependent Repositories: 99
  • Downloads: 82,567 Last month
  • Docker Downloads: 1,802
Rankings
Dependent packages count: 0.4%
Docker downloads count: 1.1%
Dependent repos count: 1.5%
Downloads: 1.7%
Average: 2.2%
Stargazers count: 3.5%
Forks count: 5.3%
Maintainers (1)
Last synced: 6 months ago
proxy.golang.org: github.com/brentp/cyvcf2
  • Versions: 101
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 9.0%
Average: 9.6%
Dependent repos count: 10.2%
Last synced: 6 months ago
spack.io: py-cyvcf2

fast vcf parsing with cython + htslib

  • Versions: 1
  • Dependent Packages: 2
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Average: 11.1%
Stargazers count: 12.0%
Forks count: 13.2%
Dependent packages count: 19.0%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • click *
  • coloredlogs *
  • cython >=0.23.3
  • numpy *
setup.py pypi
  • click *
  • coloredlogs *
  • numpy *
.github/workflows/build.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/wheels.yml actions
  • actions/checkout v2 composite
  • actions/download-artifact v2 composite
  • actions/setup-python v2 composite
  • actions/upload-artifact v2 composite
  • mxschmitt/action-tmate v3 composite
  • pypa/gh-action-pypi-publish release/v1 composite