snps

tools for reading, writing, merging, and remapping SNPs

https://github.com/apriha/snps

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
✓ Academic publication links: Links to ncbi.nlm.nih.gov
✓ Committers with academic emails: 1 of 14 committers (7.1%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.5%) to scientific vocabulary

Keywords

bioinformatics chromosomes dna snps vcf

Keywords from Contributors

ancestry genealogy genes genetics genotype

Last synced: 6 months ago

Repository

tools for reading, writing, merging, and remapping SNPs

Basic Info

Host: GitHub
Owner: apriha
License: bsd-3-clause
Language: Python
Default Branch: main
Homepage:
Size: 2.05 MB

Statistics

Stars: 109
Watchers: 4
Forks: 18
Open Issues: 23
Releases: 45

Topics

bioinformatics chromosomes dna snps vcf

Created over 6 years ago · Last pushed 11 months ago

Metadata Files

Readme Changelog Contributing License

README.rst

.. image:: https://raw.githubusercontent.com/apriha/snps/main/docs/images/snps_banner.png

|ci| |codecov| |docs| |pypi| |python| |downloads| |ruff|

snps
====
tools for reading, writing, merging, and remapping SNPs 🧬

``snps`` *strives to be an easy-to-use and accessible open-source library for working with
genotype data*

Features
--------
Input / Output
``````````````
- Read raw data (genotype) files from a variety of direct-to-consumer (DTC) DNA testing
  sources with a ``SNPs`` object
- Read and write VCF files (e.g., convert 23andMe to VCF)
- Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
- Read data in a variety of formats (e.g., files, bytes, compressed with `gzip` or `zip`)
- Handle several variations of file types, validated via openSNP parsing analysis

Build / Assembly Detection and Remapping
````````````````````````````````````````
- Detect the build / assembly of SNPs (supports builds 36, 37, and 38)
- Remap SNPs between builds / assemblies

Data Cleaning
`````````````
- Perform quality control (QC) / filter low-quality SNPs based on chip clusters
- Fix several common issues when loading SNPs
- Sort SNPs based on chromosome and position
- Deduplicate RSIDs
- Deduplicate alleles in the non-PAR regions of the X and Y chromosomes for males
- Deduplicate alleles on MT
- Assign PAR SNPs to the X or Y chromosome
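As a rough illustration of two of the cleaning steps listed above (sorting by chromosome and position, and collapsing duplicated alleles on haploid chromosomes), here is a minimal pure-Python sketch. The helper names ``chrom_sort_key``, ``sort_snps``, and ``dedupe_haploid`` are hypothetical and the sample rows are made up; this is not how ``snps`` implements these steps internally. In particular, returning null for a heterozygous call on a haploid chromosome is an assumption.

```python
# Illustrative sketch only; helper names are hypothetical, not part of snps.

def chrom_sort_key(chrom):
    """Order chromosomes as 1..22, then X, Y, MT; anything else sorts last."""
    special = {"X": 23, "Y": 24, "MT": 25}
    if chrom.isdigit():
        return (int(chrom), "")
    return (special.get(chrom, 26), chrom)


def sort_snps(rows):
    """Sort (rsid, chrom, pos, genotype) tuples by chromosome, then position."""
    return sorted(rows, key=lambda r: (chrom_sort_key(r[1]), r[2]))


def dedupe_haploid(genotype):
    """Collapse a doubled haploid call ("AA" -> "A").

    Assumption: a heterozygous call on a haploid chromosome is treated as
    unreliable and returned as None (null).
    """
    if genotype is None or len(genotype) != 2:
        return genotype
    first, second = genotype
    return first if first == second else None


# Made-up rows for demonstration.
rows = [
    ("rs3", "X", 500, "AA"),
    ("rs1", "10", 200, "CT"),
    ("rs2", "2", 900, "GG"),
    ("rs4", "MT", 50, "TT"),
    ("rs5", "2", 100, "AG"),
]
sorted_rows = sort_snps(rows)
```

Sorting on a numeric key avoids the string-ordering pitfall where chromosome "10" would sort before chromosome "2".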

Analysis
````````
- Derive sex from SNPs
- Detect deduced genotype / chip array and chip version based on chip clusters
- Predict ancestry from SNPs (when installed with ezancestry)
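To give a feel for how sex might be derived from SNPs, the sketch below uses a simple heuristic: a high fraction of heterozygous calls on the X chromosome suggests two X chromosomes, and when no X calls are available, the fraction of called Y SNPs is used instead. The function name ``derive_sex``, both threshold values, and the heuristic itself are assumptions made for illustration; they should not be read as the behavior of ``snps``.

```python
# Illustrative heuristic only; the function name and thresholds are assumptions.

def derive_sex(x_genotypes, y_genotypes,
               heterozygous_x_threshold=0.03, y_called_threshold=0.3):
    """Guess sex from X heterozygosity, falling back to the Y call rate.

    x_genotypes / y_genotypes: lists of genotype strings such as "AG",
    with None for non-called SNPs. Returns "Male", "Female", or "".
    """
    # Only two-allele calls can be checked for heterozygosity.
    called_x = [g for g in x_genotypes if g is not None and len(g) == 2]
    if called_x:
        het = sum(1 for g in called_x if g[0] != g[1])
        if het / len(called_x) > heterozygous_x_threshold:
            return "Female"
        return "Male"
    if y_genotypes:
        called_y = sum(1 for g in y_genotypes if g is not None)
        if called_y / len(y_genotypes) > y_called_threshold:
            return "Male"
        return "Female"
    return ""  # not enough data to decide
```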

Supported Genotype Files
------------------------
``snps`` supports VCF files and
genotype files from the following DNA testing sources:

- 23andMe
- 23Mofang
- Ancestry
- CircleDNA
- Código 46
- DNA.Land
- Family Tree DNA
- Genes for Good
- LivingDNA
- Mapmygenome
- MyHeritage
- PLINK
- Sano Genetics
- SelfDecode
- tellmeGen

Additionally, ``snps`` can read a variety of "generic" CSV and TSV files.

Dependencies
------------
``snps`` requires Python 3.8+ and the following Python
packages:

- numpy
- pandas
- atomicwrites

Installation
------------
``snps`` is available on the Python Package Index. Install ``snps`` (and its required
Python dependencies) via ``pip``::

    $ pip install snps

For ancestry prediction capability, ``snps`` can be installed with ezancestry::

    $ pip install snps[ezancestry]

Examples
--------
Download Example Data
`````````````````````
First, let's set up logging to get some helpful output:

>>> import logging, sys
>>> logger = logging.getLogger()
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler(sys.stdout))

Now we're ready to download some example data from openSNP:

>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz

Load Raw Data
`````````````
Load a 23andMe raw data file:

>>> from snps import SNPs
>>> s = SNPs("resources/662.23andme.340.txt.gz")
>>> s.source
'23andMe'
>>> s.count
991786

The ``SNPs`` class accepts a path to a file or a bytes object. A ``Reader`` class attempts to
infer the data source and load the SNPs. The loaded SNPs are normalized and
available via a ``pandas.DataFrame``:

>>> df = s.snps
>>> df.columns.values
array(['chrom', 'pos', 'genotype'], dtype=object)
>>> df.index.name
'rsid'
>>> df.chrom.dtype.name
'object'
>>> df.pos.dtype.name
'uint32'
>>> df.genotype.dtype.name
'object'
>>> len(df)
991786
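To make the normalized layout concrete (``rsid`` as the key, with ``chrom``, ``pos``, and ``genotype`` values), the following standard-library sketch parses a tiny gzip-compressed bytes object into that shape. The sample content is made up, ``parse_genotype_bytes`` is a hypothetical helper, and this is not the ``Reader`` implementation; real files vary far more than this handles.

```python
import csv
import gzip
import io

# A tiny made-up file in a 23andMe-like layout (tab-separated, '#' comments).
raw_text = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs1\t1\t1000\tAG\n"
    "rs2\t1\t2000\t--\n"
    "rs3\tX\t3000\tCC\n"
)
data = gzip.compress(raw_text.encode("utf-8"))  # bytes, as a file might arrive


def parse_genotype_bytes(data):
    """Return {rsid: {"chrom", "pos", "genotype"}} from (optionally gzipped) bytes."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    text = io.StringIO(data.decode("utf-8"))
    lines = (ln for ln in text if not ln.startswith("#"))  # skip comment lines
    table = {}
    for rsid, chrom, pos, genotype in csv.reader(lines, delimiter="\t"):
        table[rsid] = {
            "chrom": chrom,
            "pos": int(pos),
            # Assumption: "--" marks a non-called genotype; store it as None.
            "genotype": None if genotype == "--" else genotype,
        }
    return table


table = parse_genotype_bytes(data)
```

Keeping chromosomes as strings and positions as integers mirrors the ``object`` and ``uint32`` dtypes shown above.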

``snps`` also attempts to detect the build / assembly of the data:

>>> s.build
37
>>> s.build_detected
True
>>> s.assembly
'GRCh37'
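One way to think about build detection is as a lookup: if a well-known SNP sits at the position expected for a given build, the data is probably on that build. The sketch below applies that idea using only the two positions for ``rs3094315`` that appear in the remapping example in this README (Build 37 and Build 38). The names ``KNOWN_POSITIONS`` and ``detect_build`` are hypothetical, and the approach is an assumption for illustration rather than a description of the library's internals.

```python
# Known positions of a reference SNP per build. The two values below are the
# positions shown for rs3094315 in this README (Build 37 and Build 38).
KNOWN_POSITIONS = {
    "rs3094315": {37: 752566, 38: 817186},
}

# Build number to assembly name, as listed in this README.
ASSEMBLY_NAMES = {36: "NCBI36", 37: "GRCh37", 38: "GRCh38"}


def detect_build(positions):
    """Return the build whose known positions match, or None if undetermined.

    positions: {rsid: observed position} for the loaded data.
    """
    for rsid, per_build in KNOWN_POSITIONS.items():
        observed = positions.get(rsid)
        for build, known_pos in per_build.items():
            if observed == known_pos:
                return build
    return None


build = detect_build({"rs3094315": 752566})
assembly = ASSEMBLY_NAMES.get(build)
```

Returning ``None`` when nothing matches keeps "not detected" distinct from a wrong guess.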

Merge Raw Data Files
````````````````````
The dataset consists of raw data files from two different DNA testing sources - let's combine
these files. Specifically, we'll update the ``SNPs`` object with SNPs from a
Family Tree DNA file.

>>> merge_results = s.merge([SNPs("resources/662.ftdna-illumina.341.csv.gz")])
Merging SNPs('662.ftdna-illumina.341.csv.gz')
SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37
Downloading resources/NCBI36_GRCh37.tar.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
>>> s.source
'23andMe, FTDNA'
>>> s.count
1006960
>>> s.build
37
>>> s.build_detected
True

If the SNPs being merged have a build that differs from the destination build, the SNPs to merge
will be remapped automatically. After this example merge, the build is still detected, since the
build was detected for all ``SNPs`` objects that were merged.

As the data gets added, it's compared to the existing data, and SNP position and genotype
discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.) These
discrepant SNPs are available for inspection after the merge via properties of the ``SNPs`` object.

>>> len(s.discrepant_merge_genotypes)
151

Additionally, any non-called / null genotypes will be updated during the merge, if the file
being merged has a called genotype for the SNP.

Moreover, ``merge`` takes a ``chrom`` parameter - this enables merging of only SNPs associated
with the specified chromosome (e.g., "Y" or "MT").

Finally, ``merge`` returns a list of ``dict``, where each ``dict`` has information corresponding
to the results of each merge (e.g., SNPs in common).

>>> sorted(list(merge_results[0].keys()))
['common_rsids', 'discrepant_genotype_rsids', 'discrepant_position_rsids', 'merged']
>>> merge_results[0]["merged"]
True
>>> len(merge_results[0]["common_rsids"])
692918
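The merge behavior described above can be sketched with plain dictionaries. This simplified version keeps original positions on a position discrepancy, nulls out discrepant genotypes, fills in non-called genotypes, and honors an optional ``chrom`` filter. The function ``merge_snps`` and the sample data are hypothetical; the result keys simply mirror the ones shown above. Ignoring allele order when comparing genotypes is an assumption, and the tunable discrepancy thresholds and build remapping are left out.

```python
def merge_snps(base, other, chrom=None):
    """Merge `other` into `base` in place and report what happened.

    Both arguments map rsid -> {"chrom": str, "pos": int, "genotype": str or None}.
    """
    common, discrepant_pos, discrepant_gt = [], [], []
    for rsid, new in other.items():
        if chrom is not None and new["chrom"] != chrom:
            continue  # only merge SNPs on the requested chromosome
        if rsid not in base:
            base[rsid] = dict(new)  # new SNP: add it as-is
            continue
        common.append(rsid)
        old = base[rsid]
        if (old["chrom"], old["pos"]) != (new["chrom"], new["pos"]):
            discrepant_pos.append(rsid)  # keep the original position
        if old["genotype"] is None:
            old["genotype"] = new["genotype"]  # fill in a non-called genotype
        elif new["genotype"] is not None and (
            sorted(old["genotype"]) != sorted(new["genotype"])
        ):
            # Assumption: allele order is ignored, so "AG" matches "GA".
            discrepant_gt.append(rsid)
            old["genotype"] = None  # mark the discrepant genotype as null
    return {
        "merged": True,
        "common_rsids": common,
        "discrepant_position_rsids": discrepant_pos,
        "discrepant_genotype_rsids": discrepant_gt,
    }


# Made-up data for demonstration.
base = {
    "rs1": {"chrom": "1", "pos": 100, "genotype": "AG"},
    "rs2": {"chrom": "1", "pos": 200, "genotype": None},
    "rs3": {"chrom": "2", "pos": 300, "genotype": "CC"},
}
other = {
    "rs1": {"chrom": "1", "pos": 100, "genotype": "GA"},
    "rs2": {"chrom": "1", "pos": 200, "genotype": "TT"},
    "rs3": {"chrom": "2", "pos": 301, "genotype": "CT"},
    "rs4": {"chrom": "Y", "pos": 400, "genotype": "A"},
}
result = merge_snps(base, other)
```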

Remap SNPs
``````````
Now, let's remap the merged SNPs to change the assembly / build:

>>> s.snps.loc["rs3094315"].pos
752566
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38)
Downloading resources/GRCh37_GRCh38.tar.gz
>>> s.build
38
>>> s.assembly
'GRCh38'
>>> s.snps.loc["rs3094315"].pos
817186

SNPs can be remapped between Build 36 (``NCBI36``), Build 37 (``GRCh37``), and Build 38
(``GRCh38``).
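Conceptually, remapping translates each position through a set of segment mappings between two assemblies. The sketch below shows that idea with invented segments: a position inside a segment is shifted (or mirrored on the reverse strand), and a position that falls in a gap cannot be remapped. ``SEGMENTS`` and ``remap_position`` are hypothetical, and the coordinates are toy values, not real assembly data or the format of the resources that ``snps`` downloads.

```python
from bisect import bisect_right

# Hypothetical mapping segments for one chromosome, sorted by source start:
# (source_start, source_end, target_start, strand). Values are invented.
SEGMENTS = [
    (100, 199, 1100, 1),    # forward strand: shifted by +1000
    (300, 399, 2399, -1),   # reverse strand: 300 -> 2399, 399 -> 2300
]
STARTS = [seg[0] for seg in SEGMENTS]


def remap_position(pos):
    """Map a position to the target build, or None if no segment covers it."""
    i = bisect_right(STARTS, pos) - 1  # last segment starting at or before pos
    if i < 0:
        return None
    src_start, src_end, tgt_start, strand = SEGMENTS[i]
    if pos > src_end:
        return None  # falls in a gap between segments
    offset = pos - src_start
    return tgt_start + offset if strand == 1 else tgt_start - offset
```

Positions that return ``None`` here correspond to SNPs that cannot be placed on the target build.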

Save SNPs
`````````
Ok, so far we've merged the SNPs from two files (ensuring the same build in the process and
identifying discrepancies along the way). Then, we remapped the SNPs to Build 38. Now, let's save
the merged and remapped dataset consisting of 1M+ SNPs to a tab-separated values (TSV) file:

>>> saved_snps = s.to_tsv("out.txt")
Saving output/out.txt
>>> print(saved_snps)
output/out.txt

Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:

>>> saved_snps = s.to_vcf("out.vcf")
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
Saving output/out.vcf
1 SNP positions were found to be discrepant when saving VCF

When saving a VCF, if any SNPs have positions outside of the reference sequence, they are marked
as discrepant and are available via a property of the ``SNPs`` object.
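The position check described above can be sketched as follows: each SNP position is compared with the length of the reference sequence, in-range SNPs get their reference base looked up, and the rest are set aside as discrepant. ``split_by_reference`` is a hypothetical helper and the ten-base "chromosome" is a toy value; this is an illustration of the idea, not the library's VCF writer.

```python
def split_by_reference(snps, reference):
    """Separate SNPs that fall inside the reference from discrepant ones.

    snps: list of (rsid, pos) with 1-based positions.
    reference: the chromosome's sequence as a string.
    Returns (records, discrepant) where records are (rsid, pos, ref_base).
    """
    records, discrepant = [], []
    for rsid, pos in snps:
        if 1 <= pos <= len(reference):
            # Convert the 1-based position to a 0-based string index.
            records.append((rsid, pos, reference[pos - 1]))
        else:
            discrepant.append(rsid)
    return records, discrepant


reference = "ACGTACGTAC"  # toy 10-base "chromosome"
records, discrepant = split_by_reference(
    [("rs1", 1), ("rs2", 10), ("rs3", 11)], reference
)
```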

All output files are saved to the
output directory.

Documentation
-------------
Documentation is available at https://snps.readthedocs.io/.

Acknowledgements
----------------
Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, openSNP,
Open Humans, and Sano Genetics.

``snps`` incorporates code and concepts generated with the assistance of
OpenAI's ChatGPT. ✨

License
-------
``snps`` is licensed under the BSD 3-Clause License.

.. https://github.com/rtfd/readthedocs.org/blob/master/docs/badges.rst
.. |ci| image:: https://github.com/apriha/snps/actions/workflows/ci.yml/badge.svg?branch=main
   :target: https://github.com/apriha/snps/actions/workflows/ci.yml
.. |codecov| image:: https://codecov.io/gh/apriha/snps/branch/main/graph/badge.svg
   :target: https://codecov.io/gh/apriha/snps
.. |docs| image:: https://readthedocs.org/projects/snps/badge/?version=stable
   :target: https://snps.readthedocs.io/
.. |pypi| image:: https://img.shields.io/pypi/v/snps.svg
   :target: https://pypi.python.org/pypi/snps
.. |python| image:: https://img.shields.io/pypi/pyversions/snps.svg
   :target: https://www.python.org
.. |downloads| image:: https://pepy.tech/badge/snps
   :target: https://pepy.tech/project/snps
.. |ruff| image:: https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json
   :target: https://github.com/astral-sh/ruff
   :alt: Ruff

Owner

Name: Andrew Riha
Login: apriha
Kind: user
Location: California

Repositories: 3
Profile: https://github.com/apriha

Systems engineer chasing hobbies in bioinformatics and data science

GitHub Events

Total

Create event: 7
Release event: 1
Issues event: 2
Watch event: 10
Delete event: 8
Issue comment event: 11
Push event: 11
Pull request event: 19

Last Year

Create event: 7
Release event: 1
Issues event: 2
Watch event: 10
Delete event: 8
Issue comment event: 11
Push event: 11
Pull request event: 19

Committers

Last synced: over 2 years ago

All Time

Total Commits: 835
Total Committers: 14
Avg Commits per committer: 59.643
Development Distribution Score (DDS): 0.438
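The two derived figures above can be reproduced from the raw counts. This sketch assumes the Development Distribution Score is computed as one minus the top committer's share of commits, which is consistent with the numbers on this page but is not stated here; the variable names are illustrative.

```python
total_commits = 835
total_committers = 14
top_committer_commits = 469  # highest count in the Top Committers table

# Average commits per committer.
avg_commits = total_commits / total_committers

# Assumed DDS definition: 1 - (top committer's commits / total commits).
dds = 1 - top_committer_commits / total_commits

print(round(avg_commits, 3), round(dds, 3))
```

Under this definition, a score near 0 means one identity made almost all commits, and a higher score means the work is spread more evenly.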

Past Year

Commits: 8
Committers: 2
Avg Commits per committer: 4.0
Development Distribution Score (DDS): 0.25

Top Committers

Name	Email	Commits
Andrew Riha	a**a@g**m	469
apriha	a****a	142
Adam Faulconbridge	a**m@s**m	79
William Jones	w**s@g**m	68
Julian Runnels	j**s@y**m	19
Andrew Riha	a**w@s**m	17
William Jones	w**s@W**l	15
arvkevi	a**i@g**m	11
Anatoli Babenia	a**i@r**g	6
PhilPalmer	p**l@g**m	5
Gabriel Mota	g**a@h**m	1
Oleg Krasnikov	a**e@g**m	1
moffetak	a**t@h**m	1
Ubuntu	u**u@i**l	1

Committer Domains (Top 20 + Academic)

sanogenetics.com: 2
ip-172-31-16-240.eu-west-2.compute.internal: 1
rainforce.org: 1
yum.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 35
Total pull requests: 95
Average time to close issues: 7 months
Average time to close pull requests: about 2 months
Total issue authors: 13
Total pull request authors: 11
Average comments per issue: 1.03
Average comments per pull request: 1.19
Merged pull requests: 78
Bot issues: 0
Bot pull requests: 10

Past Year

Issues: 2
Pull requests: 13
Average time to close issues: N/A
Average time to close pull requests: 1 day
Issue authors: 2
Pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.62
Merged pull requests: 12
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

apriha (18)
afaulconbridge (3)
satishmp (2)
willgdjones (2)
GerardManning (2)
ElSaico (1)
teepean (1)
GhostDog98 (1)
lakishadavid (1)
IsmailM (1)
IMingGarson (1)
kkarbasi (1)
mkbond777 (1)

Pull Request Authors

apriha (72)
dependabot[bot] (10)
afaulconbridge (9)
teepean (6)
arvkevi (3)
willgdjones (2)
JulianRunnels (1)
gabrielmotaa (1)
insolite (1)
adrianodemarino (1)
Titorat (1)

Top Labels

Issue Labels

enhancement (7)
help wanted (2)
good first issue (2)
question (1)
bug (1)

Pull Request Labels

dependencies (10)

Packages

Total packages: 1
Total downloads:
- pypi 3,895 last-month

Total dependent packages: 2
Total dependent repositories: 6
Total versions: 47
Total maintainers: 1

pypi.org: snps

tools for reading, writing, merging, and remapping SNPs

Homepage: https://github.com/apriha/snps
Documentation: https://snps.readthedocs.io/
License: BSD 3-Clause License
Latest release: 2.10.0
published 11 months ago

Versions: 47
Dependent Packages: 2
Dependent Repositories: 6
Downloads: 3,895 Last month

Rankings

Dependent packages count: 4.7%

Dependent repos count: 6.0%

Downloads: 6.2%

Average: 6.7%

Stargazers count: 7.8%

Forks count: 8.6%

Maintainers (1)

apriha

Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
codecov/codecov-action v3 composite

.github/workflows/deploy.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite

Pipfile pypi

black * develop
matplotlib * develop
pytest * develop
pytest-cov * develop
pytest-watch * develop
sphinx * develop
sphinx-rtd-theme * develop
snps *

Pipfile.lock pypi

alabaster ==0.7.12 develop
attrs ==21.4.0 develop
babel ==2.10.1 develop
black ==22.3.0 develop
certifi ==2022.5.18.1 develop
charset-normalizer ==2.0.12 develop
click ==8.1.3 develop
colorama ==0.4.4 develop
coverage ==6.4.1 develop
cycler ==0.11.0 develop
docopt ==0.6.2 develop
docutils ==0.17.1 develop
fonttools ==4.33.3 develop
idna ==3.3 develop
imagesize ==1.3.0 develop
importlib-metadata ==4.11.4 develop
iniconfig ==1.1.1 develop
jinja2 ==3.1.2 develop
kiwisolver ==1.4.2 develop
markupsafe ==2.1.1 develop
matplotlib ==3.5.2 develop
mypy-extensions ==0.4.3 develop
numpy ==1.21.6 develop
packaging ==21.3 develop
pathspec ==0.9.0 develop
pillow ==9.1.1 develop
platformdirs ==2.5.2 develop
pluggy ==1.0.0 develop
py ==1.11.0 develop
pygments ==2.12.0 develop
pyparsing ==3.0.9 develop
pytest ==7.1.2 develop
pytest-cov ==3.0.0 develop
pytest-watch ==4.2.0 develop
python-dateutil ==2.8.2 develop
pytz ==2022.1 develop
requests ==2.27.1 develop
six ==1.16.0 develop
snowballstemmer ==2.2.0 develop
sphinx ==5.0.1 develop
sphinx-rtd-theme ==1.0.0 develop
sphinxcontrib-applehelp ==1.0.2 develop
sphinxcontrib-devhelp ==1.0.2 develop
sphinxcontrib-htmlhelp ==2.0.0 develop
sphinxcontrib-jsmath ==1.0.1 develop
sphinxcontrib-qthelp ==1.0.3 develop
sphinxcontrib-serializinghtml ==1.1.5 develop
tomli ==2.0.1 develop
typed-ast ==1.5.4 develop
typing-extensions ==4.2.0 develop
urllib3 ==1.26.9 develop
watchdog ==2.1.8 develop
zipp ==3.8.0 develop
atomicwrites ==1.4.0
numpy ==1.21.6
pandas ==1.3.5
python-dateutil ==2.8.2
pytz ==2022.1
six ==1.16.0
snps *

docs/requirements.txt pypi

sphinx ==4.2.0
sphinx-rtd-theme ==1.0.0

setup.py pypi

numpy *

pyproject.toml pypi
