https://github.com/bigbio/sdrf-cellline-metadata-db

SDRF Cell Line Metadata Database

Keywords

annotations celllines hupo-psi mass-spectrometry metadata proteomics sdrf sdrf-proteomics

Repository

Basic Info
Statistics
  • Stars: 0
  • Watchers: 5
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License

README.md

SDRF Cell Line Metadata Database

This repository provides tools to create and manage a cell line metadata database for annotating SDRFs (Sample and Data Relationship Format) in proteomics studies. The primary use case is enhancing annotation consistency for quantms.org datasets. The scripts integrate multiple ontologies and natural language processing (NLP) methods to standardize cell line metadata.

You can query the database directly on GitHub for cell line metadata, including the organism, tissue, disease, and other relevant fields. The database is designed to be easily extensible and can be updated with new cell line information.

NOTE: This is NOT an ontology of cell lines, but a registry/table where users can find the required information about a cell line in a standardized format for SDRF annotation.


Table of Contents

  1. Motivation
  2. Metadata Sources
  3. Ontologies
  4. Database Structure
  5. Features
  6. Requirements
  7. Installation
  8. Usage
  9. Contribution
  10. License

Motivation

Cell lines are essential in biological research but often lack standardized metadata, leading to inconsistencies. This repository aims to:

  • Create a centralized database for cell line metadata.
  • Facilitate annotation and validation of cell line SDRFs, particularly in proteomics datasets.

Cell line metadata sources

We integrate metadata from three main sources and additional curation efforts:

  1. Cellosaurus:
    The primary metadata source.

  2. Cell Model Passports:
    A collection of cell lines from various sources.

  3. Expression Atlas (EA):
    Metadata curated from RNA experiments over more than 10 years.

    • Collected data: Stored in the ea/ folder.
    • Script: ea_db.py processes this source.

Additional Curation: Manual annotation is performed using data from:

  • Coriell Cell Line Catalog
  • Cell Bank RIKEN
  • ATCC


Ontologies

The following ontologies are used for annotation:

  1. MONDO:
    Used to annotate the disease associated with a cell line.

  2. BTO:
    Provides additional references for cell line IDs.


Database Structure

The database is stored as a tab-separated values (TSV) file and contains the following key fields:

| Field Name | Description |
|---------------------------|----------------------------------------------------------------|
| cell line | Cell line code |
| cellosaurus name | Name as annotated in Cellosaurus ID. |
| cellosaurus accession | Accession ID from Cellosaurus AC. |
| bto cell line | Name as annotated in BTO. |
| organism | Organism species (from Cellosaurus). |
| organism part | Annotated from supplementary sources. |
| sampling site | Sampling site of the cell line. |
| age | Age of the cell line (from Cellosaurus or additional sources). |
| developmental stage | Developmental stage (inferred from age if missing). |
| sex | Sex information (from Cellosaurus). |
| ancestry category | Ancestry classification (from Cellosaurus or supplementary sources). |
| disease | Agreed-upon disease annotation across sources. |
| cell type | Agreed-upon cell type annotation across sources. |
| material type | Agreed-upon material classification. |
| synonyms | Consolidated synonyms and accessions from all sources. |
| curated | Curation status: _not curated_, _AI curated_, or _manual curated_. |

Note: The final database is provided as a tab-delimited file for easy integration. It can be loaded into tools like Pandas or viewed directly via GitHub's table renderer.
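
As a minimal sketch of how the tab-delimited file could be read and queried, the example below uses only the Python standard library. The two sample rows are an illustrative excerpt, not the real database content, and `load_cell_line_db` and `find_by_accession` are hypothetical helper names that are not part of this repository. The column names follow the field table above.

```python
import csv
import io

# Illustrative excerpt only; the real file has more columns and many more rows.
SAMPLE_TSV = (
    "cell line\tcellosaurus name\tcellosaurus accession\torganism\tsynonyms\n"
    "HeLa\tHeLa\tCVCL_0030\tHomo sapiens\tHela; He La\n"
    "HEK293\tHEK293\tCVCL_0045\tHomo sapiens\tHEK-293; 293; HEK 293\n"
)


def load_cell_line_db(handle):
    """Read the tab-delimited database into a list of row dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [dict(row) for row in reader]


def find_by_accession(rows, accession):
    """Return the first row with the given Cellosaurus accession, or None."""
    for row in rows:
        if row["cellosaurus accession"] == accession:
            return row
    return None


# With a real file, replace io.StringIO(...) with open(path, newline="").
rows = load_cell_line_db(io.StringIO(SAMPLE_TSV))
hit = find_by_accession(rows, "CVCL_0045")
print(hit["cell line"], hit["organism"])
```

The same file can also be loaded with `pandas.read_csv(path, sep="\t")` when a DataFrame is more convenient than a list of dictionaries.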


Features

  • Standardizes metadata from multiple sources.
  • Uses ontologies to annotate diseases and tissue information.
  • Supports AI-based curation and manual validation for accuracy.
  • Provides easy-to-query tab-delimited outputs.

SDRF Cell Line Annotator

This script annotates the cell lines in an SDRF (Sample and Data Relationship Format) file with information from a provided cell line metadata database. It matches cell line names from the SDRF against entries in the database, using exact matches on cell line, cellosaurus name, and cellosaurus accession, as well as partial matches against synonyms. If a match is found, the corresponding metadata (e.g., organism, disease, age, and more) is provided. If no match is found, the fields are populated with "not available" and a warning is logged.

```bash
python annotator.py --sdrf-file MSV000085836.sdrf.tsv --db-file cl-annotations-db.tsv --output-file suggested-terms.tsv
```
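
Before any matching can happen, the cell line names have to be read from the SDRF. The sketch below shows one way to collect them, assuming the SDRF uses the standard `characteristics[cell line]` column. The helper name `unique_cell_lines` and the sample rows are hypothetical and are not taken from `annotator.py`.

```python
import csv
import io

# Standard SDRF-Proteomics column holding the cell line name.
CELL_LINE_COLUMN = "characteristics[cell line]"

# Illustrative SDRF excerpt; real files contain many more columns.
SAMPLE_SDRF = (
    "source name\tcharacteristics[organism]\tcharacteristics[cell line]\n"
    "sample 1\tHomo sapiens\tHeLa\n"
    "sample 2\tHomo sapiens\tHEK293\n"
    "sample 3\tHomo sapiens\tHeLa\n"
)


def unique_cell_lines(handle):
    """Return the distinct cell line names in an SDRF, in order of appearance."""
    # csv.reader is used instead of DictReader because SDRF files may
    # repeat column names, which a dictionary per row would collapse.
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip().lower() for name in next(reader)]
    if CELL_LINE_COLUMN not in header:
        raise ValueError(f"SDRF has no '{CELL_LINE_COLUMN}' column")
    index = header.index(CELL_LINE_COLUMN)

    names = []
    for fields in reader:
        if index >= len(fields):
            continue  # skip short or malformed rows
        value = fields[index].strip()
        if value and value not in names:
            names.append(value)
    return names


cell_lines = unique_cell_lines(io.StringIO(SAMPLE_SDRF))
print(cell_lines)
```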

Key Features:

  • Database Matching: Matches cell line names from the SDRF file against a cell line database with multiple matching criteria (exact and synonym-based).
  • Synonym Handling: Synonyms in the database are split by semicolon and compared to the cell line names, ensuring flexible matching.
  • Logging and Error Handling: Warnings are logged for any unmatched cell lines, and errors are gracefully handled.
  • TSV Output: Annotates and outputs the results to a new TSV file, maintaining structured data for downstream analysis.
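
The matching strategy described in this section can be sketched as a two-pass lookup: exact fields first, then the semicolon-separated synonyms. This is not the code in `annotator.py`. The function names, the case-insensitive comparison, and the demo row are assumptions made for illustration, and the synonym pass compares whole synonyms rather than substrings.

```python
import logging

logger = logging.getLogger("annotator_sketch")  # hypothetical logger name

EXACT_FIELDS = ("cell line", "cellosaurus name", "cellosaurus accession")
OUTPUT_FIELDS = ("organism", "disease", "age")

# Illustrative database row; fields this sketch does not know are left
# as "not available" rather than filled in.
DEMO_ROWS = [
    {
        "cell line": "HEK293",
        "cellosaurus name": "HEK293",
        "cellosaurus accession": "CVCL_0045",
        "organism": "Homo sapiens",
        "disease": "not available",
        "age": "not available",
        "synonyms": "HEK-293; 293; HEK 293",
    }
]


def normalize(text):
    """Assumption: names are compared case-insensitively, ignoring edge spaces."""
    return text.strip().lower()


def match_cell_line(name, db_rows):
    """Return the matching database row, or None when nothing matches."""
    query = normalize(name)
    if not query:
        return None
    # Pass 1: exact match on the three identifying fields.
    for row in db_rows:
        if any(normalize(row.get(field, "")) == query for field in EXACT_FIELDS):
            return row
    # Pass 2: match against each semicolon-separated synonym.
    for row in db_rows:
        synonyms = [normalize(s) for s in row.get("synonyms", "").split(";")]
        if query in [s for s in synonyms if s]:
            return row
    return None


def annotate(name, db_rows):
    """Return the metadata for a cell line, or "not available" placeholders."""
    row = match_cell_line(name, db_rows)
    if row is None:
        logger.warning("No database match for cell line %r", name)
        return {field: "not available" for field in OUTPUT_FIELDS}
    return {field: row.get(field, "not available") for field in OUTPUT_FIELDS}


print(annotate("hek-293", DEMO_ROWS))
print(annotate("unknown line", DEMO_ROWS))
```

Running the exact pass over all rows before the synonym pass means a name that is the primary identifier of one entry is never captured by another entry's synonym list.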

Requirements

To use the scripts, ensure the following are installed:

  • Python 3.x
  • Required libraries:
    • pandas
    • numpy
    • spacy
  • Install the en_core_web_lg model for spaCy: python -m spacy download en_core_web_sm

Code of Conduct

We strive to foster a welcoming, inclusive, and respectful community where everyone feels encouraged to participate and contribute. As contributors and maintainers, we are committed to upholding ethical standards to prevent conflicts, harassment, and discrimination. We ask all participants to communicate respectfully, avoid personal attacks, and be constructive in their feedback. Contributions should be made with honesty, empathy, and respect for differing perspectives. Read the full Code of Conduct.

Commenting and contributing

We welcome contributions from the community. If you would like to contribute, please open an issue or a pull request. We will review your contribution and provide feedback. We aim to be inclusive and collaborative, and we welcome all contributions that are in line with our goals.

  • If you want to contribute to the manuscript, please do the following:
    • Fork the repository
    • Edit the content of manuscript.md
    • Submit a pull request
    • We will review your contribution and provide feedback
  • If you want to discuss a topic, please open an issue.

NOTE: If, based on your contribution, you would like to be added as a co-author, please open an issue and provide your name and affiliation and a short description of your contribution or a link to the relevant issue and pull request.


Contributors

Owner

  • Name: BigBio Stack
  • Login: bigbio
  • Kind: organization
  • Email: proteomicsstack@gmail.com
  • Location: Cambridge, UK

Provide big data solutions Bioinformatics

GitHub Events

Total
  • Release event: 1
  • Issue comment event: 5
  • Push event: 23
  • Pull request review event: 2
  • Pull request event: 10
  • Create event: 3
Last Year
  • Release event: 1
  • Issue comment event: 5
  • Push event: 23
  • Pull request review event: 2
  • Pull request event: 10
  • Create event: 3