https://github.com/arvkevi/clinvar-kaggle
Scripts used to generate the ClinVar conflicting classifications dataset on Kaggle
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: ncbi.nlm.nih.gov)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 9.1%, to scientific vocabulary)
Keywords
bioinformatics
genomics
kaggle
kaggle-dataset
machine-learning
Last synced: 9 months ago
Repository
Basic Info
- Host: GitHub
- Owner: arvkevi
- License: MIT
- Language: Python
- Default Branch: master
- Homepage: https://www.kaggle.com/kevinarvai/clinvar-conflicting
- Size: 46.3 MB
Statistics
- Stars: 11
- Watchers: 4
- Forks: 8
- Open Issues: 0
- Releases: 0
- Created: about 8 years ago
- Last pushed: almost 6 years ago
https://github.com/arvkevi/clinvar-kaggle/blob/master/
Scripts and data used to prepare a [Kaggle dataset](https://www.kaggle.com/kevinarvai/clinvar-conflicting).

**Generate dataset using ClinVar .vcf w/ VEP annotations:**
`python process_clinvar.py` will generate a version of the file `clinvar_conflicting.csv` with [VEP annotations](https://useast.ensembl.org/Tools/VEP).

Check out the [notebook](https://github.com/arvkevi/clinvar-kaggle/blob/master/clinvar-conflicting-eda.ipynb) to see some exploratory data analysis.

## Problem Statement

The objective is to predict whether a ClinVar variant will have **conflicting classifications**.

*Conflicting classifications occur when any two of the following three classification categories are present for one variant. Two submissions in the same category are not considered conflicting.*

1. Likely Benign or Benign
2. VUS
3. Likely Pathogenic or Pathogenic

The `CLASS` feature in `clinvar_conflicting.csv` is a binary representation of whether a variant has conflicting classifications: `0` represents consistent classifications and `1` represents conflicting classifications.

Since this problem only relates to variants with multiple classifications, I removed all variants from the original ClinVar vcf that had only one submission.

## Background

[ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) is a public resource containing annotations about human genetic variants. These variants are classified on a spectrum of benign, likely benign, uncertain significance, likely pathogenic, and pathogenic. Variants that have conflicting classifications (defined above) can cause confusion when clinicians or researchers try to interpret whether the variant has an impact on the disease of a given patient.

I'm exploring ideas for applying machine learning to genomics. I'm hoping this project will encourage others to think about the additional feature engineering that's probably necessary to confidently assess the objective. There could also be a benefit in identifying *single submission* variants that may not yet have been assigned a **conflicting classification**.

## VEP annotations

Ensembl's [Variant Effect Predictor (VEP)](http://grch37.ensembl.org/Homo_sapiens/Tools/VEP) was used to annotate the original ClinVar `.vcf`. It provides additional information about variants that can serve as features for the dataset.

#### Step 1: Download and rename the annotated `.vcf` as `clinvar.annotated.vcf`

#### Step 2: Create the new dataset with VEP annotations.

`python process_clinvar.py`
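
The conflicting-classification rule from the Problem Statement can be sketched in a few lines of Python. This is only an illustration of the rule as worded in this README, not the logic in `process_clinvar.py`: the `CATEGORY` mapping, the `conflict_class` function, and the classification strings it accepts are all assumptions made for the example.

```python
# Hypothetical sketch of the labeling rule; names and input strings are assumed,
# and this is not taken from process_clinvar.py.

# Collapse the five classification terms into the three categories from the README.
CATEGORY = {
    "benign": "benign",
    "likely benign": "benign",
    "uncertain significance": "vus",
    "vus": "vus",
    "likely pathogenic": "pathogenic",
    "pathogenic": "pathogenic",
}


def conflict_class(classifications):
    """Return 1 if the submissions span two or more categories, else 0."""
    if len(classifications) < 2:
        # Single-submission variants were removed from the dataset.
        raise ValueError("at least two submissions are required")
    categories = {CATEGORY[c.strip().lower()] for c in classifications}
    return int(len(categories) >= 2)


# Two submissions in the same category are consistent.
print(conflict_class(["Benign", "Likely benign"]))           # 0
# Submissions in different categories are conflicting.
print(conflict_class(["Benign", "Uncertain significance"]))  # 1
```

Collapsing each term to its category first keeps the check to a single set-size comparison, so "Benign" plus "Likely benign" is treated as consistent, as the definition requires.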
Owner
- Name: Kevin Arvai
- Login: arvkevi
- Kind: user
- Location: Washington, D.C.
- Website: linkedin.com/in/kevinarvai/
- Twitter: arvkevi
- Repositories: 27
- Profile: https://github.com/arvkevi
Data science & clinical genomics
GitHub Events
Total
- Watch event: 4
- Fork event: 1
Last Year
- Watch event: 4
- Fork event: 1