variation-normalization
Services and guidelines for normalizing variants
Repository
Basic Info
- Host: GitHub
- Owner: cancervariants
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://normalize.cancervariants.org/variation/
- Size: 19.5 MB
Statistics
- Stars: 14
- Watchers: 7
- Forks: 2
- Open Issues: 66
- Releases: 67
Metadata Files
README.md
Variation Normalization
The Variation Normalizer parses and translates free-text descriptions of genomic variations into computable objects conforming to the Variation Representation Specification (VRS), enabling consistent and accurate variant harmonization across a diversity of genomic knowledge resources.
Installation
Install from PyPI:
shell
python3 -m pip install variation-normalizer
About
Variation Normalization works in four main steps: tokenization, classification, validation, and translation. During tokenization, we split strings on whitespace and parse each piece to determine its token type. During classification, we specify the order of tokens a classification can have. During validation, we run checks such as ensuring that the reference nucleotide or amino acid matches the expected value and that the position exists on the given transcript. During translation, we return a VRS Allele object.
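The four steps above can be sketched as a toy pipeline for a single variant type (a protein substitution such as BRAF V600E). This is only an illustration under stated assumptions, not the package's implementation: every name below (`Token`, `KNOWN_GENES`, `TOY_REFERENCE`, `normalize_text`) is hypothetical, the lookup tables stand in for the real data sources, and the returned dictionary is a simplified stand-in rather than a real VRS Allele.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical stand-ins for the real data sources (gene lookup, reference sequence).
KNOWN_GENES = {"BRAF"}
TOY_REFERENCE = {("BRAF", 600): "V"}
PROTEIN_SUB = re.compile(r"^(?:p\.)?([A-Z])(\d+)([A-Z])$")


@dataclass
class Token:
    text: str
    kind: str  # "gene", "protein_substitution", or "unknown"


def tokenize(query: str) -> List[Token]:
    """Split on whitespace and assign each piece a token type."""
    tokens = []
    for part in query.split():
        if part.upper() in KNOWN_GENES:
            tokens.append(Token(part.upper(), "gene"))
        elif PROTEIN_SUB.match(part):
            tokens.append(Token(part, "protein_substitution"))
        else:
            tokens.append(Token(part, "unknown"))
    return tokens


def classify(tokens: List[Token]) -> Optional[str]:
    """Accept only the token order this toy classification allows."""
    if [t.kind for t in tokens] == ["gene", "protein_substitution"]:
        return "protein_substitution"
    return None


def validate(tokens: List[Token]) -> bool:
    """Check that the stated reference residue matches the expected one."""
    gene, change = tokens
    ref, pos, _alt = PROTEIN_SUB.match(change.text).groups()
    return TOY_REFERENCE.get((gene.text, int(pos))) == ref


def translate(tokens: List[Token]) -> dict:
    """Build a simplified result (NOT a real VRS Allele)."""
    gene, change = tokens
    ref, pos, alt = PROTEIN_SUB.match(change.text).groups()
    return {"gene": gene.text, "position": int(pos), "ref": ref, "alt": alt}


def normalize_text(query: str) -> Optional[dict]:
    """Run the four steps in order; return None if any step rejects the query."""
    tokens = tokenize(query)
    if classify(tokens) is None or not validate(tokens):
        return None
    return translate(tokens)
```

In this sketch, a query whose reference residue disagrees with the toy lookup table is rejected during validation, which mirrors the role validation plays in the description above.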
Variation Normalization is limited to the following types of variants:
- HGVS expressions and text representations (ex: BRAF V600E):
  - protein (p.): substitution, deletion, insertion, deletion-insertion
  - coding DNA (c.): substitution, deletion, insertion, deletion-insertion
  - genomic (g.): substitution, deletion, ambiguous deletion, insertion, deletion-insertion, duplication
- gnomAD-style VCF (chr-pos-ref-alt, ex: 7-140753336-A-T):
  - genomic (g.): substitution, deletion, insertion
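To make the gnomAD-style VCF format concrete, here is a minimal sketch of splitting a chr-pos-ref-alt string and guessing which of the three supported change types it describes. The names `GnomadVcf`, `parse_gnomad_vcf`, and `classify_change` are hypothetical, and the length-based classification is a simplification assumed for illustration; it is not how the package itself decides.

```python
from typing import NamedTuple


class GnomadVcf(NamedTuple):
    chromosome: str
    position: int
    ref: str
    alt: str


def parse_gnomad_vcf(query: str) -> GnomadVcf:
    """Parse a chr-pos-ref-alt string such as 7-140753336-A-T."""
    parts = query.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected chr-pos-ref-alt, got {query!r}")
    chrom, pos, ref, alt = parts
    ref, alt = ref.upper(), alt.upper()
    if not pos.isdigit():
        raise ValueError(f"position must be a positive integer, got {pos!r}")
    if not ref or not alt or not set(ref + alt) <= set("ACGT"):
        raise ValueError("ref and alt must be non-empty strings of A, C, G, T")
    return GnomadVcf(chrom, int(pos), ref, alt)


def classify_change(variant: GnomadVcf) -> str:
    """Simplified, length-based guess at the change type (an assumption)."""
    if len(variant.ref) == 1 and len(variant.alt) == 1:
        return "substitution"
    if len(variant.ref) > len(variant.alt) and variant.ref.startswith(variant.alt):
        return "deletion"
    if len(variant.alt) > len(variant.ref) and variant.alt.startswith(variant.ref):
        return "insertion"
    return "other"
```

For example, `classify_change(parse_gnomad_vcf("7-140753336-A-T"))` yields `"substitution"` under these rules.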
Variation Normalizer accepts input from GRCh37 or GRCh38 assemblies.
We are working towards adding more types of variations, coordinates, and representations.
VRS Versioning
The variation-normalization repo depends on VRS models, and therefore each variation-normalizer package on PyPI uses a particular version of VRS. The correspondences between packages may be summarized as:
| variation-normalization branch | variation-normalizer version | gene-normalizer version | VRS version |
| ---- | --- | ---- | --- |
| main | >=0.14.Z | >=0.9.Z | 2.0 |
Previous VRS Versioning
The correspondences between the packages that are no longer maintained may be summarized as:
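| variation-normalization branch | variation-normalizer version | gene-normalizer version | VRS version |
| ---- | --- | ---- | --- |
| vrs-1.3 | 0.6.Z | 0.1.Z | 1.3 |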
Available Endpoints
/to_vrs
Returns a list of validated VRS Variations.
/normalize
Returns a VRS Variation aligned to the prioritized transcript. The Variation Normalizer relies on Common Operations On Lots-of Sequences Tool (cool-seq-tool) for retrieving the prioritized transcript data. More information on the transcript selection algorithm can be found here.
If a genomic variation query includes a gene (e.g. BRAF g.140753336A>T), the associated cDNA representation will be returned, because the gene provides additional strand context. If a genomic variation query does not include a gene, the GRCh38 representation will be returned.
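As a rough illustration of calling these endpoints, the sketch below only builds the request URL. It assumes the query text is passed in a query-string parameter named `q`; that parameter name and the helper `build_query_url` are assumptions, so check the service's OpenAPI docs for the actual parameters before relying on this.

```python
from urllib.parse import urlencode

# Base path taken from the project homepage; the endpoints are /to_vrs and /normalize.
BASE_URL = "https://normalize.cancervariants.org/variation"


def build_query_url(endpoint: str, query: str, base_url: str = BASE_URL) -> str:
    """Build a GET URL for an endpoint, URL-encoding the free-text query.

    The parameter name "q" is an assumption made for this sketch.
    """
    params = urlencode({"q": query})
    return f"{base_url.rstrip('/')}/{endpoint.strip('/')}?{params}"


normalize_url = build_query_url("normalize", "BRAF V600E")
to_vrs_url = build_query_url("to_vrs", "BRAF g.140753336A>T")
```

The resulting URLs can be fetched with any HTTP client. Passing `base_url="http://127.0.0.1:8000/variation"` would point the same helper at a locally running instance.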
Development
Clone the repo:
shell
git clone https://github.com/cancervariants/variation-normalization.git
cd variation-normalization
For a development install, we recommend using Pipenv. See the pipenv docs for directions on installing pipenv in your environment.
Once installed, from the project root dir, just run:
shell
pipenv shell
pipenv update && pipenv install --dev
Required resources
Variation Normalization relies on some local data caches which you will need to set up. It uses pipenv to manage its environment, which you will also need to install.
Gene Normalizer
Variation Normalization relies on data from Gene Normalization. You must load all sources and merged concepts.
You must also have Gene Normalization's DynamoDB running in a separate terminal for the application to work.
For more information about the gene-normalizer and how to load the database, visit the README.
SeqRepo
Variation Normalization relies on seqrepo, which you must download yourself.
Variation Normalizer uses seqrepo to retrieve sequences at given positions on a transcript.
From the root directory:
shell
pip install seqrepo
sudo mkdir /usr/local/share/seqrepo
sudo chown $USER /usr/local/share/seqrepo
seqrepo pull -i 2024-12-20/ # Replace with latest version using `seqrepo list-remote-instances` if outdated
If you get an error similar to the one below:
shell
PermissionError: [Errno 13] Permission denied: '/usr/local/share/seqrepo/2024-12-20/._fkuefgd' -> '/usr/local/share/seqrepo/2024-12-20/'
You will want to do the following (the suffix might not be ._fkuefgd, so use the path from your own error message):
shell
sudo mv /usr/local/share/seqrepo/2024-12-20._fkuefgd /usr/local/share/seqrepo/2024-12-20
exit
Use the SEQREPO_ROOT_DIR environment variable to set the path of an already existing SeqRepo directory. The default is /usr/local/share/seqrepo/latest.
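The lookup described above amounts to reading one environment variable with a fallback. A minimal sketch follows; `resolve_seqrepo_root` is a hypothetical helper name, and treating an empty value as unset is an assumption of this sketch rather than documented behavior.

```python
import os
from typing import Mapping, Optional

# Default path as stated in this README.
DEFAULT_SEQREPO_ROOT = "/usr/local/share/seqrepo/latest"


def resolve_seqrepo_root(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return SEQREPO_ROOT_DIR if it is set and non-empty, else the default."""
    env = os.environ if environ is None else environ
    # `or` makes an empty string fall back to the default (assumption).
    return env.get("SEQREPO_ROOT_DIR") or DEFAULT_SEQREPO_ROOT
```

Accepting the environment as an argument keeps the helper easy to test without modifying the real process environment.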
UTA
Variation Normalizer also uses Common Operations On Lots-of Sequences Tool (cool-seq-tool) which uses UTA as the underlying PostgreSQL database.
We provide two options for installing UTA:
- Using Docker: This is the preferred way
- Locally
Installing UTA via Docker
For this, you will need to install Docker. We recommend using Docker Desktop.
Once Docker is running, from the root of the directory, run the following:
shell
docker volume create --name=uta_vol
docker compose up
This should start the following container:
- uta: a database of transcripts and alignments (localhost:5432)
Check that the container is running:
shell
$ docker ps
CONTAINER ID IMAGE // NAMES
a40576b8cf1f biocommons/uta:uta_20241220 // variation-normalization-uta-1
Depending on your network and host, the first run is likely to take 5-15 minutes in order to download and install data. Subsequent startups should be nearly instantaneous.
You can test UTA and seqrepo installations like so:
shell
$ psql -XAt postgres://anonymous@localhost/uta -c 'select count(*) from uta_20241220.transcript'
329090
Installing UTA Locally
The following commands will likely need modification appropriate for the installation environment.
- Install PostgreSQL
- Create user and database.
shell
createuser -U postgres uta_admin
createuser -U postgres anonymous
createdb -U postgres -O uta_admin uta
To install locally:
shell
export UTA_VERSION=uta_20241220.pgd.gz
curl -O http://dl.biocommons.org/uta/$UTA_VERSION
gzip -cdq ${UTA_VERSION} | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta -p 5432
If you have trouble installing UTA, you can visit these two READMEs.
Connecting to the UTA database
To connect to the UTA database, you can use the default URL (postgresql://uta_admin@localhost:5432/uta/uta_20241220). If you do not wish to use the default, set the environment variable UTA_DB_URL, which has the format driver://user:pass@host:port/database/schema.
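The URL format above differs from a plain PostgreSQL URL in that the path carries both the database and the schema. The following sketch splits such a URL into its parts with the standard library, which can help when checking a custom UTA_DB_URL. `UtaConnectionParts` and `parse_uta_db_url` are hypothetical names, not part of the package or of cool-seq-tool.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit

# Default URL as stated in this README.
DEFAULT_UTA_DB_URL = "postgresql://uta_admin@localhost:5432/uta/uta_20241220"


class UtaConnectionParts(NamedTuple):
    driver: str
    user: Optional[str]
    password: Optional[str]
    host: Optional[str]
    port: Optional[int]
    database: str
    schema: str


def parse_uta_db_url(url: str) -> UtaConnectionParts:
    """Split driver://user:pass@host:port/database/schema into its parts."""
    parts = urlsplit(url)
    path_items = [item for item in parts.path.split("/") if item]
    if len(path_items) != 2:
        raise ValueError("expected a path of the form /database/schema")
    database, schema = path_items
    return UtaConnectionParts(
        driver=parts.scheme,
        user=parts.username,
        password=parts.password,
        host=parts.hostname,
        port=parts.port,
        database=database,
        schema=schema,
    )
```

A URL that omits the schema segment raises `ValueError` in this sketch, which surfaces the most likely formatting mistake early.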
Starting the Variation Normalization Service Locally
Gene Normalizer's DynamoDB and the UTA database must be running.
To start the service, run the following:
shell
uvicorn variation.main:app --reload
Next, view the OpenAPI docs on your local machine: http://127.0.0.1:8000/variation
Code QC
Code style is managed by Ruff and checked prior to commit.
To perform formatting and check style:
shell
python3 -m ruff format . && python3 -m ruff check --fix .
We use pre-commit to run conformance tests.
This ensures:
- Style correctness
- No large files
- No AWS credentials are committed
- No private keys are committed
Pre-commit must be installed before your first commit. Use the following command:
shell
pre-commit install
Testing
From the root directory of the repository:
shell
pytest tests/
Owner
- Name: VICC
- Login: cancervariants
- Kind: organization
- Website: http://cancervariants.org
- Repositories: 14
- Profile: https://github.com/cancervariants
The Variant Interpretation for Cancer Consortium
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Kuzma"
    given-names: "Kori"
  - family-names: "Stevenson"
    given-names: "James"
  - family-names: "Liu"
    given-names: "Jiachen"
  - family-names: "Coffman"
    given-names: "Adam"
  - family-names: "Henkenjohann"
    given-names: "Richard"
  - family-names: "Babb"
    given-names: "Lawrence"
  - family-names: "Liu"
    given-names: "Xuelu"
  - family-names: "Wagner"
    given-names: "Alex H."
    orcid: "https://orcid.org/0000-0002-2502-8961"
doi: 10.5281/zenodo.5894937
title: "VICC Variation Normalization Service"
version: 0.2.16dev
date-released: 2022-01-23
url: "https://github.com/cancervariants/variation-normalization"
GitHub Events
Total
- Create event: 43
- Release event: 8
- Issues event: 48
- Watch event: 3
- Delete event: 30
- Issue comment event: 177
- Push event: 66
- Pull request review comment event: 7
- Pull request review event: 44
- Pull request event: 56
Last Year
- Create event: 43
- Release event: 8
- Issues event: 48
- Watch event: 3
- Delete event: 30
- Issue comment event: 177
- Push event: 66
- Pull request review comment event: 7
- Pull request review event: 44
- Pull request event: 56
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 958
- Total Committers: 12
- Avg Commits per committer: 79.833
- Development Distribution Score (DDS): 0.102
Top Committers
| Name | Email | Commits |
|---|---|---|
| korikuzma | k****a@g****m | 860 |
| Adam Coffman | a****n@w****u | 34 |
| Kori Kuzma | 4****a@u****m | 18 |
| dependabot[bot] | 4****]@u****m | 15 |
| Alex H. Wagner, PhD | a****4@w****u | 13 |
| James Stevenson | j****n@n****g | 9 |
| Alex H. Wagner, PhD | A****r@n****g | 3 |
| Jiachen Liu | 5****7@u****m | 2 |
| Brian | b****n@b****m | 1 |
| Brian Walsh | w****r@o****u | 1 |
| Richard Henkenjohann | r****n@g****m | 1 |
| Alex H. Wagner, PhD | a@a****o | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 183
- Total pull requests: 186
- Average time to close issues: over 1 year
- Average time to close pull requests: 9 days
- Total issue authors: 11
- Total pull request authors: 7
- Average comments per issue: 2.22
- Average comments per pull request: 0.63
- Merged pull requests: 164
- Bot issues: 0
- Bot pull requests: 7
Past Year
- Issues: 13
- Pull requests: 56
- Average time to close issues: 5 days
- Average time to close pull requests: 3 days
- Issue authors: 4
- Pull request authors: 2
- Average comments per issue: 0.08
- Average comments per pull request: 0.59
- Merged pull requests: 49
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- korikuzma (150)
- ahwagner (8)
- jsstevenson (7)
- theferrit32 (4)
- wesleygoar (4)
- larrybabb (2)
- anastasiasmith1221 (1)
- rajatkapoordfci (1)
- katiestahl (1)
- jarbesfeld (1)
- MayLiu27 (1)
- bwalsh (1)
Pull Request Authors
- korikuzma (169)
- jsstevenson (39)
- dependabot[bot] (6)
- rajatkapoordfci (3)
- anastasiabratulin (1)
- theferrit32 (1)
- ahwagner (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 1,309 last month
- Total dependent packages: 3
- Total dependent repositories: 3
- Total versions: 55
- Total maintainers: 3
pypi.org: variation-normalizer
VICC normalization routine for variations
- Homepage: https://github.com/cancervariants/variation-normalization
- Documentation: https://github.com/cancervariants/variation-normalization
- License: MIT License Copyright (c) 2018-2024 VICC Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- Latest release: 0.15.0 (published 6 months ago)
Dependencies
- coverage * develop
- coveralls * develop
- flake8 * develop
- flake8-annotations * develop
- flake8-docstrings * develop
- flake8-import-order * develop
- flake8-quotes * develop
- ipykernel * develop
- jupyter * develop
- jupyterlab * develop
- matplotlib * develop
- pre-commit * develop
- psycopg2-binary * develop
- pytest * develop
- pytest-asyncio * develop
- pytest-cov * develop
- pyyaml * develop
- twine * develop
- variation-normalizer * develop
- biocommons.seqrepo *
- boto3 *
- fastapi *
- ga4gh.vrs >=0.7.5.dev1
- ga4gh.vrsatile.pydantic >=0.0.11
- gene-normalizer >=0.1.26
- pydantic *
- pyliftover *
- uta-tools >=0.1.1
- uvicorn *
- aiofiles ==0.8.0
- anyio ==3.6.1
- appdirs ==1.4.4
- appnope ==0.1.3
- argcomplete ==2.0.0
- argh ==0.26.2
- argon2-cffi ==21.3.0
- argon2-cffi-bindings ==21.2.0
- asgiref ==3.5.2
- asttokens ==2.0.5
- asyncpg ==0.25.0
- attrs ==21.4.0
- babel ==2.10.1
- backcall ==0.2.0
- beautifulsoup4 ==4.11.1
- biocommons.seqrepo ==0.6.5
- bioutils ==0.5.5
- bleach ==5.0.0
- boto3 ==1.24.5
- botocore ==1.27.5
- bs4 ==0.0.1
- canonicaljson ==1.6.2
- certifi ==2022.5.18.1
- cffi ==1.15.0
- cfgv ==3.3.1
- charset-normalizer ==2.0.12
- click ==8.1.3
- coloredlogs ==15.0.1
- commonmark ==0.9.1
- configparser ==5.2.0
- coverage ==6.4.1
- coveralls ==3.3.1
- cssselect ==1.1.0
- cycler ==0.11.0
- debugpy ==1.6.0
- decorator ==5.1.1
- defusedxml ==0.7.1
- distlib ==0.3.4
- docopt ==0.6.2
- docutils ==0.18.1
- entrypoints ==0.4
- executing ==0.8.3
- fake-useragent ==0.1.11
- fastapi ==0.78.0
- fastjsonschema ==2.15.3
- filelock ==3.7.1
- flake8 ==4.0.1
- flake8-annotations ==2.9.0
- flake8-docstrings ==1.6.0
- flake8-import-order ==0.18.1
- flake8-quotes ==3.3.1
- fonttools ==4.33.3
- ga4gh.vrs ==0.8a0
- ga4gh.vrsatile.pydantic ==0.0.11
- gene-normalizer ==0.1.27
- gffutils ==0.11.0
- h11 ==0.13.0
- hgvs ==1.5.2
- humanfriendly ==10.0
- identify ==2.5.1
- idna ==3.3
- importlib-metadata ==4.11.4
- inflection ==0.5.1
- iniconfig ==1.1.1
- ipykernel ==6.13.1
- ipython ==8.4.0
- ipython-genutils ==0.2.0
- ipywidgets ==7.7.0
- jedi ==0.18.1
- jinja2 ==3.1.2
- jmespath ==1.0.0
- json5 ==0.9.8
- jsonschema ==3.2.0
- jupyter ==1.0.0
- jupyter-client ==7.3.4
- jupyter-console ==6.4.3
- jupyter-core ==4.10.0
- jupyter-server ==1.17.1
- jupyterlab ==3.4.3
- jupyterlab-pygments ==0.2.2
- jupyterlab-server ==2.14.0
- jupyterlab-widgets ==1.1.0
- keyring ==23.6.0
- kiwisolver ==1.4.2
- lxml ==4.9.0
- markdown ==3.3.7
- markupsafe ==2.1.1
- matplotlib ==3.5.2
- matplotlib-inline ==0.1.3
- mccabe ==0.6.1
- mistune ==0.8.4
- nbclassic ==0.3.7
- nbclient ==0.6.4
- nbconvert ==6.5.0
- nbformat ==5.4.0
- nest-asyncio ==1.5.5
- nodeenv ==1.6.0
- notebook ==6.4.12
- notebook-shim ==0.1.0
- numpy ==1.22.4
- packaging ==21.3
- pandas ==1.4.2
- pandocfilters ==1.5.0
- parse ==1.19.0
- parsley ==1.3
- parso ==0.8.3
- pexpect ==4.8.0
- pickleshare ==0.7.5
- pillow ==9.1.1
- pkginfo ==1.8.3
- platformdirs ==2.5.2
- pluggy ==1.0.0
- pre-commit ==2.19.0
- prometheus-client ==0.14.1
- prompt-toolkit ==3.0.29
- psutil ==5.9.1
- psycopg2 ==2.9.3
- psycopg2-binary ==2.9.3
- ptyprocess ==0.7.0
- pure-eval ==0.2.2
- py ==1.11.0
- pycodestyle ==2.8.0
- pycparser ==2.21
- pydantic ==1.9.1
- pydocstyle ==6.1.1
- pyee ==8.2.2
- pyfaidx ==0.7.0
- pyflakes ==2.4.0
- pygments ==2.12.0
- pyliftover ==0.4
- pyparsing ==3.0.9
- pyppeteer ==1.0.2
- pyquery ==1.4.3
- pyrsistent ==0.18.1
- pysam ==0.19.1
- pytest ==7.1.2
- pytest-asyncio ==0.18.3
- pytest-cov ==3.0.0
- python-dateutil ==2.8.2
- python-jsonschema-objects ==0.4.1
- pytz ==2022.1
- pyyaml ==6.0
- pyzmq ==23.1.0
- qtconsole ==5.3.1
- qtpy ==2.1.0
- readme-renderer ==35.0
- requests ==2.27.1
- requests-html ==0.10.0
- requests-toolbelt ==0.9.1
- rfc3986 ==2.0.0
- rich ==12.4.4
- s3transfer ==0.6.0
- send2trash ==1.8.0
- setuptools ==62.3.3
- simplejson ==3.17.6
- six ==1.16.0
- sniffio ==1.2.0
- snowballstemmer ==2.2.0
- soupsieve ==2.3.2.post1
- sqlparse ==0.4.2
- stack-data ==0.2.0
- starlette ==0.19.1
- tabulate ==0.8.9
- terminado ==0.15.0
- tinycss2 ==1.1.1
- toml ==0.10.2
- tomli ==2.0.1
- tornado ==6.1
- tqdm ==4.64.0
- traitlets ==5.2.2.post1
- twine ==4.0.1
- typing-extensions ==4.2.0
- urllib3 ==1.26.9
- uta-tools ==0.1.1
- uvicorn ==0.17.6
- virtualenv ==20.14.1
- w3lib ==1.22.0
- wcwidth ==0.2.5
- webencodings ==0.5.1
- websocket-client ==1.3.2
- websockets ==10.3
- widgetsnbextension ==3.6.0
- yoyo-migrations ==7.3.2
- zipp ==3.8.0
- aiofiles ==0.8.0
- anyio ==3.6.1
- appdirs ==1.4.4
- appnope ==0.1.3
- argcomplete ==2.0.0
- argh ==0.26.2
- asgiref ==3.5.2
- asttokens ==2.0.5
- asyncpg ==0.25.0
- attrs ==21.4.0
- backcall ==0.2.0
- beautifulsoup4 ==4.11.1
- biocommons.seqrepo ==0.6.5
- bioutils ==0.5.5
- boto3 ==1.24.5
- botocore ==1.27.5
- bs4 ==0.0.1
- canonicaljson ==1.6.2
- certifi ==2022.5.18.1
- charset-normalizer ==2.0.12
- click ==8.1.3
- coloredlogs ==15.0.1
- configparser ==5.2.0
- cssselect ==1.1.0
- decorator ==5.1.1
- executing ==0.8.3
- fake-useragent ==0.1.11
- fastapi ==0.78.0
- ga4gh.vrs ==0.8a0
- ga4gh.vrsatile.pydantic ==0.0.11
- gene-normalizer ==0.1.27
- gffutils ==0.11.0
- h11 ==0.13.0
- hgvs ==1.5.2
- humanfriendly ==10.0
- idna ==3.3
- importlib-metadata ==4.11.4
- inflection ==0.5.1
- ipython ==8.4.0
- jedi ==0.18.1
- jmespath ==1.0.0
- jsonschema ==3.2.0
- lxml ==4.9.0
- markdown ==3.3.7
- matplotlib-inline ==0.1.3
- numpy ==1.22.4
- pandas ==1.4.2
- parse ==1.19.0
- parsley ==1.3
- parso ==0.8.3
- pexpect ==4.8.0
- pickleshare ==0.7.5
- prompt-toolkit ==3.0.29
- psycopg2 ==2.9.3
- ptyprocess ==0.7.0
- pure-eval ==0.2.2
- pydantic ==1.9.1
- pyee ==8.2.2
- pyfaidx ==0.7.0
- pygments ==2.12.0
- pyliftover ==0.4
- pyppeteer ==1.0.2
- pyquery ==1.4.3
- pyrsistent ==0.18.1
- pysam ==0.19.1
- python-dateutil ==2.8.2
- python-jsonschema-objects ==0.4.1
- pytz ==2022.1
- pyyaml ==6.0
- requests ==2.27.1
- requests-html ==0.10.0
- s3transfer ==0.6.0
- setuptools ==62.3.3
- simplejson ==3.17.6
- six ==1.16.0
- sniffio ==1.2.0
- soupsieve ==2.3.2.post1
- sqlparse ==0.4.2
- stack-data ==0.2.0
- starlette ==0.19.1
- tabulate ==0.8.9
- tqdm ==4.64.0
- traitlets ==5.2.2.post1
- typing-extensions ==4.2.0
- urllib3 ==1.26.9
- uta-tools ==0.1.1
- uvicorn ==0.17.6
- w3lib ==1.22.0
- wcwidth ==0.2.5
- websockets ==10.3
- yoyo-migrations ==7.3.2
- zipp ==3.8.0
- ldez/gha-mjolnir v1.0.3 composite
- actions/checkout v2 composite
- actions/setup-python v1 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- python 3.7 build