ai_bio_project

This is the carpentries repository for our funded project "How to Build FAIR Domain-Specific Datasets for fine tuning/training NLP models"

https://github.com/sara-morsy/ai_bio_project

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (7.8%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: Sara-Morsy
License: other
Language: R
Default Branch: main
Homepage: https://sara-morsy.github.io/AI_BIO_project/
Size: 6.93 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed 12 months ago

Metadata Files

Readme, Contributing, License, Code of conduct, Citation

How to Build FAIR Domain-Specific Datasets for fine tuning/training NLP models?

Domain-specific natural language processing (NLP) models extract data with high accuracy from unstructured text by identifying specialized vocabularies that can be used in different applications. These models are developed using domain-specific data. Using specialized datasets, researchers have fine-tuned pretrained models for protein-protein relationships in the STRING database (1), accelerated drug development by identifying chemical-gene and drug-drug interactions and predicting peptide toxicity (2-4), and extracted brain connectivity data in neurological disorders (5). Despite these applications, fields such as veterinary medicine and agricultural biology lack NLP-based applications. Barriers include the absence of high-quality domain-specific datasets, small and unbalanced datasets, and insufficient expertise to build datasets (6, 7). Manual annotation, a common necessity in these fields, is time-consuming and prone to bias, which affects model performance. To address this, we propose a training course focused on building FAIR (Findable, Accessible, Interoperable, Reusable) domain-specific datasets (6).

Our target audience includes:

• Researchers looking to adopt NLP solutions for analyzing domain-specific text, including those who lack expertise in AI but have domain knowledge, which is crucial for building annotated datasets.
• Computational biology or AI researchers building domain-specific NLP applications who need to overcome dataset scarcity and quality challenges.

Owner

Login: Sara-Morsy
Kind: user

Repositories: 1
Profile: https://github.com/Sara-Morsy

Citation (CITATION.cff)

# This template CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to replace its contents
# with information about your lesson.
# Remember to update this file periodically, 
# ensuring that the author list and other fields remain accurate.

cff-version: 1.2.0
title: FIXME
message: >-
  Please cite this lesson using the information in this file
  when you refer to it in publications, and/or if you
  re-use, adapt, or expand on the content in your own
  training material.
type: dataset
authors:
  - given-names: FIXME
    family-names: FIXME
abstract: >-
  FIXME Replace this with a short abstract describing the
  lesson, e.g. its target audience and main intended
  learning objectives.
license: CC-BY-4.0
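
The template above still carries FIXME placeholders in several fields. As a minimal sketch, assuming one wants a quick check before citing or publishing the lesson, the following Python snippet lists the keys whose values still contain FIXME. The helper name `find_fixme_fields` and the `SAMPLE` string are hypothetical and not part of the repository or of any CFF tooling. The scan is line-based rather than a real YAML parse, so it only suits simple files like this one.

```python
import re

# Abridged copy of the CITATION.cff fields shown above (illustrative only).
SAMPLE = """\
cff-version: 1.2.0
title: FIXME
message: >-
  Please cite this lesson using the information in this file
type: dataset
authors:
  - given-names: FIXME
    family-names: FIXME
abstract: >-
  FIXME Replace this with a short abstract describing the
  lesson, e.g. its target audience and main intended
  learning objectives.
license: CC-BY-4.0
"""

# Matches "key: value" lines, optionally preceded by a list dash.
KEY_LINE = re.compile(r"^\s*(?:-\s+)?([A-Za-z][\w-]*):\s*(.*)$")


def find_fixme_fields(cff_text: str) -> list[str]:
    """Return keys (in file order) whose values still contain 'FIXME'.

    Hypothetical helper: a line-based scan, not a YAML parser.
    """
    flagged: list[str] = []
    current_key = None
    for raw in cff_text.splitlines():
        line = raw.rstrip()
        # Skip blank lines and YAML comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = KEY_LINE.match(line)
        if match:
            current_key, value = match.group(1), match.group(2)
        else:
            # Continuation line of a folded scalar belongs to the last key.
            value = line
        if "FIXME" in value and current_key and current_key not in flagged:
            flagged.append(current_key)
    return flagged


if __name__ == "__main__":
    print(find_fixme_fields(SAMPLE))
```

An empty list would suggest that no placeholders remain. A continuation line that happens to look like `word: text` would be misread as a new key, which is the main limitation of this approach.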

GitHub Events

Total

Push event: 6

Last Year

Push event: 6

Dependencies

.github/workflows/pr-close-signal.yaml actions

actions/upload-artifact v4 composite

.github/workflows/pr-comment.yaml actions

actions/checkout v4 composite
carpentries/actions/check-valid-pr main composite
carpentries/actions/comment-diff main composite
carpentries/actions/download-workflow-artifact main composite

.github/workflows/pr-post-remove-branch.yaml actions

carpentries/actions/download-workflow-artifact main composite
carpentries/actions/remove-branch main composite

.github/workflows/pr-preflight.yaml actions

carpentries/actions/check-valid-pr main composite
carpentries/actions/comment-diff main composite

.github/workflows/pr-receive.yaml actions

actions/checkout v4 composite
actions/upload-artifact v4 composite
carpentries/actions/check-valid-pr main composite
carpentries/actions/setup-lesson-deps main composite
carpentries/actions/setup-sandpaper main composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite

.github/workflows/sandpaper-main.yaml actions

actions/checkout v4 composite
carpentries/actions/setup-lesson-deps main composite
carpentries/actions/setup-sandpaper main composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite

.github/workflows/update-cache.yaml actions

actions/checkout v4 composite
carpentries/actions/check-valid-credentials main composite
carpentries/actions/update-lockfile main composite
carpentries/create-pull-request main composite
r-lib/actions/setup-r v2 composite

.github/workflows/update-workflows.yaml actions

actions/checkout v4 composite
carpentries/actions/check-valid-credentials main composite
carpentries/actions/update-workflows main composite
carpentries/create-pull-request main composite
