duvel

Data set for Detection of Unique Variant Ensemble in Literature

https://github.com/cnachteg/duvel

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.7%) to scientific vocabulary

Repository

Data set for Detection of Unique Variant Ensemble in Literature

Basic Info
  • Host: GitHub
  • Owner: cnachteg
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 5.9 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

DUVEL

DUVEL stands for Detection of Unique Variant Ensemble in Literature.

Construction of DUVEL

References from OLIDA (https://olida.ibsquare.be) were selected and annotated with PubTator for gene and variant entities. Papers were then filtered to keep those containing digenic variant combinations (i.e., variant combinations involving two genes). Candidates were limited to texts of at most 256 tokens, with different genes and different variants in each candidate. Scripts to create the unlabelled data sets can be found in the scripts/construction folder.
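The candidate filter described above can be sketched roughly as follows. This is only an illustration: the function name, the argument layout, and the whitespace tokenization are assumptions made here, and the actual scripts in scripts/construction may count tokens and compare entities differently.

```python
MAX_TOKENS = 256  # token limit stated in the text


def is_valid_candidate(text, genes, variants, max_tokens=MAX_TOKENS):
    """Return True if a candidate passes the length and distinctness checks.

    Assumptions (not from the repository): tokens are counted by splitting
    on whitespace, and a digenic candidate carries exactly two gene mentions
    and two variant mentions that must all be different.
    """
    n_tokens = len(text.split())
    distinct_genes = len(set(genes)) == 2
    distinct_variants = len(set(variants)) == 2
    return n_tokens <= max_tokens and distinct_genes and distinct_variants


# Toy candidates with made-up gene and variant names.
candidates = [
    {"text": "short text", "genes": ["GENE_A", "GENE_B"], "variants": ["v1", "v2"]},
    {"text": "short text", "genes": ["GENE_A", "GENE_A"], "variants": ["v1", "v2"]},
    {"text": "word " * 300, "genes": ["GENE_A", "GENE_B"], "variants": ["v1", "v2"]},
]
kept = [c for c in candidates if is_valid_candidate(c["text"], c["genes"], c["variants"])]
print(len(kept))
```

Only the first toy candidate is kept: the second repeats a gene and the third exceeds the token limit.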

Annotation was done through the ALAMBIC (https://github.com/Trusted-AI-Labs/ALAMBIC) platform, within an active learning framework with the Margin selection strategy and with an active batch size of 500 samples. The model was a BiomedBERT model (https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext), trained for 10 epochs and with a learning rate of 1e-5.
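For readers unfamiliar with margin selection, here is a minimal sketch of the general technique: at each round, the unlabelled samples whose two highest predicted class probabilities are closest together are sent for annotation. The function names are hypothetical and this is not the ALAMBIC implementation.

```python
def margin_scores(probabilities):
    """For each sample, return the gap between its two highest class probabilities."""
    scores = []
    for probs in probabilities:
        top_two = sorted(probs, reverse=True)[:2]
        scores.append(top_two[0] - top_two[1])
    return scores


def select_batch(probabilities, batch_size):
    """Return the indices of the batch_size samples with the smallest margin."""
    scores = margin_scores(probabilities)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:batch_size]


# Toy predicted probabilities for four unlabelled samples (binary task).
probs = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.2, 0.8]]
selected = select_batch(probs, batch_size=2)
print(selected)
```

With the toy probabilities, samples 1 and 2 are selected because the model is least decided about them. In the setup described above the batch size would be 500.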

Fine-tuning experiments

Preliminary experiments were conducted with different biomedical large language models, with hyperparameter tuning. Code for reproducing the experiments can be found in the scripts/fine_tuning folder. Additionally, the scripts to create and fine-tune with the simulated data sets, which mimic data sets built without active learning (i.e., with only around 1% of positive samples across the training set), can be found in the scripts/fine_tuning/low_positive folder.

The scripts used to evaluate the active learning process on the DUVEL test set, by excluding the samples belonging to that test set from the samples selected during the process, can be found in the scripts/fine_tuning/AL_process folder.
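One way to build such a low-positive training set is to keep all negative samples and subsample the positives until they make up about 1% of the result. The sketch below assumes this approach, a hypothetical `label` field with 1 for positive, and a fixed seed; the scripts in scripts/fine_tuning/low_positive may proceed differently.

```python
import random


def simulate_low_positive(samples, positive_rate=0.01, seed=0):
    """Keep every negative sample and subsample positives to the target rate.

    Assumption: each sample is a dict with a "label" key (1 = positive,
    0 = negative). The field name is illustrative only.
    """
    positives = [s for s in samples if s["label"] == 1]
    negatives = [s for s in samples if s["label"] == 0]
    # Solve k / (k + n_neg) = rate for k, the number of positives to keep.
    n_keep = round(positive_rate * len(negatives) / (1 - positive_rate))
    n_keep = min(n_keep, len(positives))
    rng = random.Random(seed)
    subset = negatives + rng.sample(positives, n_keep)
    rng.shuffle(subset)
    return subset


# Toy data: 200 positives and 990 negatives.
toy = [{"id": i, "label": 1 if i < 200 else 0} for i in range(1190)]
low_positive = simulate_low_positive(toy)
n_positive = sum(s["label"] for s in low_positive)
print(len(low_positive), n_positive)
```

Changing the seed gives a different draw of positives, which is one way several simulated data sets could be produced from the same source.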
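The exclusion step amounts to a set difference between the samples picked during active learning and the test set. A minimal sketch, assuming samples can be matched by a hypothetical identifier:

```python
def exclude_test_samples(selected_ids, test_ids):
    """Drop any selected sample that also belongs to the test set, keeping order."""
    test_set = set(test_ids)
    return [sid for sid in selected_ids if sid not in test_set]


# Toy identifiers; real samples may be keyed differently.
selected_ids = ["s1", "s2", "s3", "s4"]
test_ids = ["s2", "s4", "s9"]
train_ids = exclude_test_samples(selected_ids, test_ids)
print(train_ids)
```

Filtering this way keeps the evaluation on the test set free of samples the model was trained on during the active learning rounds.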

Data availability

CSV files of the data can be found in the data folder, corresponding to the train/validation/test splits used for the fine-tuning in the experiments of the article. Moreover, the five simulated data sets that reproduce a construction without active learning can be found in the data/low_positive folder.

The train/validation/test splits of the data are also available on Hugging Face (https://huggingface.co/datasets/cnachteg/DUVEL) and can be used with the following code:

```python
from datasets import load_dataset

dataset = load_dataset("cnachteg/DUVEL")
```

Cite us

TBA

Owner

  • Name: Charlotte Nachtegael
  • Login: cnachteg
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
authors:
- family-names: "Nachtegael"
  given-names: "Charlotte"
  orcid: "https://orcid.org/0000-0002-5034-8975"
- family-names: "De Stefani"
  given-names: "Jacopo"
  orcid: "https://orcid.org/0000-0003-0257-4537"
- family-names: "Cnudde"
  given-names: "Anthony"
  orcid: "https://orcid.org/0000-0001-6363-6506"
- family-names: "Lenaerts"
  given-names: "Tom"
  orcid: "https://orcid.org/0000-0003-3645-1455"
title: "DUVEL"
version: 0.2.0-alpha
identifiers:
  - type: doi
    value: 10.5281/zenodo.10410665
date-released: 2023-12-20
url: "https://github.com/cnachteg/DUVEL"
