https://github.com/cancervariants/gene-harmony-analysis

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.0%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: cancervariants
  • License: mit
  • Language: HTML
  • Default Branch: main
  • Size: 372 MB
Statistics
  • Stars: 3
  • Watchers: 3
  • Forks: 0
  • Open Issues: 15
  • Releases: 0
Created about 2 years ago · Last pushed 10 months ago
Metadata Files
Readme Contributing License

README.md

gene-harmony-analysis

Background

Human gene symbols are regulated and follow guidelines established by HGNC. All genes are designated an authoritative symbol (also known as a primary gene symbol), a descriptive name, and an HGNC identification number. Although primary gene symbols are monitored to be unique, alias symbols are not.

Aliases are additional gene symbols and short descriptions that are used synonymously for the gene and/or any associated gene products. Gene symbols are curated from use in databases, experimental results, and literature. Primary gene symbols and aliases play a crucial role in referencing genes across publications, medical records, and data collections.

Our preliminary research uncovered collisions, or instances where a single gene symbol was used for multiple different genes. We have categorized them into two kinds:

a) Alias-primary collisions are gene symbols that are used both as the primary gene symbol of one gene and as an alias of another. For example, KRAS is a primary gene symbol that is also used as an alias.

b) Alias-alias collisions are gene symbols that are used as an alias of multiple genes. For example, VH is an alias for 35 genes in the NCBI database.
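The two categories above can be illustrated with a minimal sketch. The gene records below are made-up toy data, and `find_collisions` is a hypothetical helper, not the code used in the notebooks. It assumes symbols are compared exactly (case-sensitive).

```python
from collections import defaultdict

# Toy records (hypothetical, for illustration only): primary symbol -> aliases.
genes = {
    "GENE1": ["A1", "SHARED"],
    "GENE2": ["SHARED", "GENE1"],
    "GENE3": ["B7"],
}


def find_collisions(genes):
    """Return (alias_primary, alias_alias) collision maps for the given records."""
    # Invert the mapping: alias -> set of genes that list it as an alias.
    alias_to_genes = defaultdict(set)
    for primary, aliases in genes.items():
        for alias in aliases:
            alias_to_genes[alias].add(primary)

    # Alias-primary: an alias of one gene is the primary symbol of a different gene.
    alias_primary = {}
    for alias, owners in alias_to_genes.items():
        if alias in genes:
            others = owners - {alias}
            if others:
                alias_primary[alias] = sorted(others)

    # Alias-alias: the same alias is attached to more than one gene.
    alias_alias = {
        alias: sorted(owners)
        for alias, owners in alias_to_genes.items()
        if len(owners) > 1
    }
    return alias_primary, alias_alias


alias_primary, alias_alias = find_collisions(genes)
print(alias_primary)  # {'GENE1': ['GENE2']}
print(alias_alias)    # {'SHARED': ['GENE1', 'GENE2']}
```

Here `GENE1` is an alias-primary collision because it is the primary symbol of one gene and an alias of `GENE2`, while `SHARED` is an alias-alias collision because two genes list it.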

Out of the 43,164 genes in the HGNC database, 483 (1.12%) had alias-primary collisions and 2,084 (4.83%) had alias-alias collisions. The Ensembl database, which has 40,353 genes, was found to have alias-primary collisions in 218 (0.54%) of genes and alias-alias collisions in 3,680 (9.12%) of genes.

The NCBI database, which had the largest number of genes (75,346), had alias-primary collisions in 1,712 (2.27%) of genes and alias-alias collisions in 5,670 (7.53%) of genes, illustrating the prevalence of ambiguity that challenges the aggregation of genomic knowledge.

The alias-primarycollisionanalysis and alias-aliascollisionanalysis Jupyter notebooks contain the analyses that produce these values.
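As a quick check on the reported rates, the percentages can be recomputed from the counts stated above. This is a small sketch; the dictionary layout and the `pct` helper are illustrative choices, not part of the repository.

```python
# (total genes, genes with alias-primary collisions, genes with alias-alias collisions)
counts = {
    "HGNC": (43164, 483, 2084),
    "Ensembl": (40353, 218, 3680),
    "NCBI": (75346, 1712, 5670),
}


def pct(part, total):
    """Percentage of `total` represented by `part`, rounded to two decimals."""
    return round(100 * part / total, 2)


rates = {
    source: (pct(alias_primary, total), pct(alias_alias, total))
    for source, (total, alias_primary, alias_alias) in counts.items()
}

for source, (ap_rate, aa_rate) in rates.items():
    print(f"{source}: alias-primary {ap_rate}%, alias-alias {aa_rate}%")
```

The recomputed values agree with the percentages quoted in the text.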

collision_graphic

Purpose

The difficulty of resolving ambiguous gene symbols and interpreting them accurately slows clinical decision-making and contributes to confusion when aggregating gene knowledge. The gene nomenclature system would be most effective if it were unambiguous and supported by a tool that takes existing knowledgebase entries as input and resolves the symbols they contain.

This curated collection of alias data will be a foundation for disambiguating gene symbols.

Notebook Dependencies

| # | Name of Notebook | Prerequisite Notebook(s) | Input files | Notes |
|---|---|---|---|---|
| 1 | aliasprimarycollisionanalysis | none | ensgbiomartgene20240626.txt | |
| | | | hgncbiomartgene20240626.txt | |
| | | | Homosapiens.geneinfo20240627 | |
| 2 | aliasaliascollisionanalysis | 1 | none | |
| 3 | aliasaliascollisiondistributionanalysis | 2, 1 | none | |
| 4 | symbolcapturegeneration | 1 | ensgmartexportdrosmurinortho.txt | takes longer than an hour to run |
| | | | orthologset1df.txt | |
| | | | | |
| | | | orthologset10df.txt | |
| 5 | symbolcaptureanalysis | 4, 1 | none | one cell needs to run overnight |
| 6 | sqlitesymbolcapturetransformation | 4, 1 | ensgbiomartgene20240626.txt | |
| | | | hgncbiomartgene20240626.txt | |
| | | | Homosapiens.geneinfo20240627 | |
| | | | orthologset1df.txt | |
| | | | | |
| | | | orthologset10df.txt | |
| 7 | ambiguoussymboldistributionanalysis | 2, 1 | none | |
| 8 | concordancevianetworkxanalysis | 6, 4, 1 | none | |
| 9 | concordanceviaupsetplotanalysis | 6, 4, 1 | none | |
| 10 | dgidbgenecontentanalysis | 2, 1 | dgidbgenesJUNE.tsv | |
| 11 | dgidbqueryanalysis | 10, 2, 1 | log_data.xlsx | |
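One way to derive a valid run order from the prerequisite column is a topological sort. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the `prerequisites` dictionary is transcribed from the table by notebook number, and the approach is an illustration rather than something the repository provides.

```python
from graphlib import TopologicalSorter

# Notebook number -> prerequisite notebook numbers, transcribed from the table.
prerequisites = {
    1: [],
    2: [1],
    3: [2, 1],
    4: [1],
    5: [4, 1],
    6: [4, 1],
    7: [2, 1],
    8: [6, 4, 1],
    9: [6, 4, 1],
    10: [2, 1],
    11: [10, 2, 1],
}

# static_order() yields each notebook only after all of its prerequisites.
order = list(TopologicalSorter(prerequisites).static_order())
print(order)
```

Running the notebooks in plain numeric order (1 through 11) also satisfies every prerequisite listed in the table; the sort is mainly useful when only a subset needs to be rerun.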

How can you help?

If you come across a gene symbol collision, please contribute information about it. These reports help identify which collisions would be most impactful to resolve, and they increase the data available for developing resolution strategies and downstream tools.

Contact Information

For any feedback, questions, or conversation, please open an issue.

Owner

  • Name: VICC
  • Login: cancervariants
  • Kind: organization

The Variant Interpretation for Cancer Consortium

GitHub Events

Total
  • Issues event: 43
  • Delete event: 12
  • Issue comment event: 31
  • Push event: 57
  • Pull request event: 19
  • Pull request review comment event: 1
  • Pull request review event: 3
  • Create event: 14
Last Year
  • Issues event: 43
  • Delete event: 12
  • Issue comment event: 31
  • Push event: 57
  • Pull request event: 19
  • Pull request review comment event: 1
  • Pull request review event: 3
  • Create event: 14

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 34
  • Total pull requests: 16
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 23 hours
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 0.56
  • Average comments per pull request: 0.19
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 27
  • Pull requests: 11
  • Average time to close issues: 12 days
  • Average time to close pull requests: 1 day
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 0.7
  • Average comments per pull request: 0.27
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • anastasiabratulin (33)
  • mcannon068nw (1)
Pull Request Authors
  • anastasiabratulin (14)
  • mcannon068nw (1)
Top Labels
Issue Labels
clean up (3) help wanted (1) question (1)
Pull Request Labels
documentation (1)

Dependencies

requirements-dev.txt pypi
  • PyYAML ==6.0.1 development
  • annotated-types ==0.6.0 development
  • anyio ==4.3.0 development
  • attrs ==23.2.0 development
  • bioutils ==0.5.8.post1 development
  • boto3 ==1.34.76 development
  • botocore ==1.34.76 development
  • canonicaljson ==2.0.0 development
  • certifi ==2024.2.2 development
  • charset-normalizer ==3.3.2 development
  • click ==8.1.7 development
  • coloredlogs ==15.0.1 development
  • exceptiongroup ==1.2.0 development
  • fastapi ==0.110.1 development
  • ga4gh.vrs ==2.0.0a5 development
  • gene-normalizer ==0.3.0.dev1 development
  • h11 ==0.14.0 development
  • humanfriendly ==10.0 development
  • idna ==3.6 development
  • jmespath ==1.0.1 development
  • numpy ==1.26.4 development
  • pydantic ==2.6.4 development
  • pydantic-core ==2.16.3 development
  • python-dateutil ==2.9.0.post0 development
  • requests ==2.31.0 development
  • s3transfer ==0.10.1 development
  • six ==1.16.0 development
  • sniffio ==1.3.1 development
  • starlette ==0.37.2 development
  • typing-extensions ==4.10.0 development
  • urllib3 ==1.26.18 development
  • uvicorn ==0.29.0 development
requirements.txt pypi
  • gene-normalizer ==0.3.0.dev1