https://github.com/arvkevi/clinvar-kaggle

Scripts used to generate the ClinVar conflicting classifications dataset on Kaggle

Keywords

bioinformatics genomics kaggle kaggle-dataset machine-learning

Repository

Basic Info

Host: GitHub
Owner: arvkevi
License: mit
Language: Python
Default Branch: master
Homepage: https://www.kaggle.com/kevinarvai/clinvar-conflicting
Size: 46.3 MB

Statistics

Stars: 11
Watchers: 4
Forks: 8
Open Issues: 0
Releases: 0

Created over 8 years ago · Last pushed almost 6 years ago

Scripts and data used to prepare a [Kaggle dataset](https://www.kaggle.com/kevinarvai/clinvar-conflicting).

**Generate the dataset from a ClinVar `.vcf` with VEP annotations:**
Running `python process_clinvar.py` generates a version of `clinvar_conflicting.csv` that includes [VEP annotations](https://useast.ensembl.org/Tools/VEP).

Check out the [notebook](https://github.com/arvkevi/clinvar-kaggle/blob/master/clinvar-conflicting-eda.ipynb) to see some exploratory data analysis.

## Problem Statement

The objective is to predict whether a ClinVar variant will have **conflicting classifications**.

*Classifications conflict when the submissions for one variant fall into two or more of the following three categories. Two submissions in the same category are not considered conflicting.*

1. Likely Benign or Benign
2. VUS
3. Likely Pathogenic or Pathogenic
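
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in `process_clinvar.py`: the function name `is_conflicting`, the `CATEGORY` mapping, and the label spellings are all assumptions made for the example.

```python
# Map each classification label to one of the three categories listed above.
# The exact label spellings are assumptions chosen for illustration.
CATEGORY = {
    "benign": "benign",
    "likely benign": "benign",
    "vus": "vus",
    "uncertain significance": "vus",
    "likely pathogenic": "pathogenic",
    "pathogenic": "pathogenic",
}


def is_conflicting(submissions):
    """Return 1 if the submissions span two or more categories, else 0."""
    categories = {CATEGORY[label.strip().lower()] for label in submissions}
    return int(len(categories) >= 2)


print(is_conflicting(["Benign", "Likely benign"]))   # 0: both in category 1
print(is_conflicting(["Likely benign", "VUS"]))      # 1: categories 1 and 2
print(is_conflicting(["Pathogenic", "Pathogenic"]))  # 0: one category, twice
```

Collapsing labels into a set of categories before counting is what makes two submissions of the same category count as consistent.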

The `CLASS` feature in `clinvar_conflicting.csv` is a binary representation of whether or not a variant has conflicting classifications where `0` represents consistent classifications and `1` represents conflicting classifications.
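
As a rough sketch of how the `CLASS` column might be inspected, the snippet below counts the two labels and reports the conflicting fraction. The rows and the `CHROM` and `POS` columns are made-up placeholders, and `class_balance` is a hypothetical helper. Only the `CLASS` column and its `0`/`1` encoding come from the description above.

```python
import csv
import io
from collections import Counter

# Toy stand-in for clinvar_conflicting.csv. The rows and the CHROM/POS
# columns are invented for illustration; only CLASS is taken from the text.
toy_csv = io.StringIO(
    "CHROM,POS,CLASS\n"
    "1,1000,0\n"
    "1,2000,1\n"
    "2,3000,0\n"
    "X,4000,0\n"
)


def class_balance(handle):
    """Count rows per CLASS value and return (counts, conflicting fraction)."""
    counts = Counter(int(row["CLASS"]) for row in csv.DictReader(handle))
    total = sum(counts.values())
    fraction = counts[1] / total if total else 0.0
    return counts, fraction


counts, fraction = class_balance(toy_csv)
print(counts[0], counts[1], fraction)
```

To run it against the real file, replace `toy_csv` with `open("clinvar_conflicting.csv", newline="")`.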

Since this problem only relates to variants with multiple classifications, I removed all variants from the original ClinVar `.vcf` that had only one submission.

![](https://github.com/arvkevi/clinvar-kaggle/blob/master/clinvar-class-fig.png)

## Background

[ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/) is a public resource containing annotations about human genetic variants. These variants are classified on a five-level spectrum: benign, likely benign, uncertain significance, likely pathogenic, and pathogenic. Variants that have conflicting classifications (defined above) can cause confusion when clinicians or researchers try to interpret whether the variant has an impact on the disease of a given patient.

I'm exploring ideas for applying machine learning to genomics. I'm hoping this project will encourage others to think about the additional feature engineering that's probably necessary to confidently assess the objective. There could be benefit to identifying *single submission* variants that may yet be assigned a **conflicting classification**.

## VEP annotations

Ensembl's [Variant Effect Predictor (VEP)](http://grch37.ensembl.org/Homo_sapiens/Tools/VEP) was used to annotate the original ClinVar `.vcf`. It provides additional information about variants that can serve as features for the dataset.
#### Step 1:
Download the VEP-annotated `.vcf` and rename it to `clinvar.annotated.vcf`.

#### Step 2:
Create the new dataset with VEP annotations:
`python process_clinvar.py`
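
To show how VEP annotations can become features, here is a minimal sketch of reading them back out of an annotated `.vcf`. It assumes VEP's default VCF output, where annotations are stored in a `CSQ` entry of the INFO column and the sub-field order is given after `Format:` in the header. The helper names `csq_fields` and `parse_csq` and the toy header and record are invented for illustration. This is not necessarily how `process_clinvar.py` does it.

```python
def csq_fields(header_lines):
    """Extract the CSQ sub-field names from the VCF header written by VEP."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            # The field order follows "Format: " in the Description string.
            fmt = line.split("Format: ", 1)[1]
            return fmt.rstrip().rstrip('">').split("|")
    raise ValueError("no CSQ header line found")


def parse_csq(info, fields):
    """Return one dict per transcript annotation found in an INFO column."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            annotations = entry[len("CSQ="):].split(",")
            return [dict(zip(fields, ann.split("|"))) for ann in annotations]
    return []


# Toy header and INFO value; the gene names and IDs are made up.
header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
    'annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">',
]
info = "ALLELEID=12345;CSQ=A|missense_variant|MODERATE|GENE1,A|intron_variant|MODIFIER|GENE2"

fields = csq_fields(header)
for annotation in parse_csq(info, fields):
    print(annotation["SYMBOL"], annotation["Consequence"], annotation["IMPACT"])
```

A variant can carry several comma-separated annotations (typically one per transcript), so a real pipeline has to decide which one to keep for each row of the CSV.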

Owner

Name: Kevin Arvai
Login: arvkevi
Kind: user
Location: Washington, D.C.

Website: linkedin.com/in/kevinarvai/
Twitter: arvkevi
Repositories: 27
Profile: https://github.com/arvkevi

Data science & clinical genomics

GitHub Events

Total

Watch event: 4
Fork event: 1

Last Year

Watch event: 4
Fork event: 1
