https://github.com/bigbio/sdrf-cellline-metadata-db

SDRF Cell Line Metadata Database

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.3%) to scientific vocabulary

Keywords

annotations celllines hupo-psi mass-spectrometry metadata proteomics sdrf sdrf-proteomics

Last synced: 10 months ago · JSON representation

Repository

SDRF Cell Line Metadata Database

Basic Info

Host: GitHub
Owner: bigbio
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://github.com/bigbio/sdrf-cellline-metadata-db/blob/main/cl-annotations-db.tsv
Size: 19.2 MB

Statistics

Stars: 0
Watchers: 5
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme License

SDRF Cell Line Metadata Database

This repository provides tools to create and manage a cell line metadata database for annotating SDRFs (Sample and Data Relationship Format) in proteomics studies. The primary use case is enhancing annotation consistency for quantms.org datasets. The scripts integrate multiple ontologies and natural language processing (NLP) methods to standardize cell line metadata.

You can query the database directly on GitHub for cell line metadata, including the organism, tissue, disease, and other relevant fields. The database is designed to be easily extensible and can be updated with new cell line information.

NOTE: This is NOT an ontology of cell lines, but a registry/table where users can find the required information about a cell line in a standardized format for SDRF annotation.

Motivation
Metadata Sources
Ontologies
Database Structure
Features
Requirements
Installation
Usage
Contribution
License

Motivation

Cell lines are essential in biological research but often lack standardized metadata, leading to inconsistencies. This repository aims to:

Create a centralized database for cell line metadata.
Facilitate annotation and validation of cell line SDRFs, particularly in proteomics datasets.

Cell line metadata sources

We integrate metadata from three main sources and additional curation efforts:

Cellosaurus:
The primary metadata source.
- Download: cellosaurus.txt
- Script: cellosaurus_db.py extracts the relevant fields and transforms some Cellosaurus fields into an SDRF-compatible format.
Cell Model Passports:
A collection of cell lines from various sources.
- Input file: modellist20240110.csv
- Script: cellpassports_db.py processes this data.
Expression Atlas (EA):
Metadata curated from RNA experiments over more than 10 years.
- Collected data: Stored in the ea/ folder.
- Script: ea_db.py processes this source.
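
As an illustration of the kind of extraction described for Cellosaurus, the sketch below parses entries in the Cellosaurus flat-file layout (two-letter line codes such as ID, AC, SY, OX and SX, with `//` closing each entry) and maps them onto a few of the database columns. It is not the code in cellosaurus_db.py: the function names `parse_cellosaurus` and `to_sdrf_fields`, the chosen fields, and the sample record are assumptions made for this example.

```python
# Illustrative record in the Cellosaurus flat-file layout; values are
# abbreviated sample data, not copied from the current release.
SAMPLE = """\
ID   HeLa
AC   CVCL_0030
SY   HELA; Hela; He La
OX   NCBI_TaxID=9606; ! Homo sapiens (Human)
SX   Female
//
"""


def parse_cellosaurus(lines):
    """Group 'CODE   value' lines into one dict per entry ('//' ends an entry)."""
    records, current = [], {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # Keep only blocks that actually carry an ID line.
            if "ID" in current:
                records.append(current)
            current = {}
            continue
        code, gap, value = line[:2], line[2:5], line[5:].strip()
        # Skip anything that is not '<two uppercase letters><three spaces><value>'.
        if gap == "   " and code.isalpha() and code.isupper() and value:
            current.setdefault(code, []).append(value)
    return records


def to_sdrf_fields(record):
    """Map a parsed entry onto a few database columns (hypothetical helper)."""
    ox = record.get("OX", [""])[0]
    # 'NCBI_TaxID=9606; ! Homo sapiens (Human)' -> 'Homo sapiens'
    organism = ox.split("!")[-1].split("(")[0].strip() if "!" in ox else "not available"
    synonyms = [s.strip() for line in record.get("SY", []) for s in line.split(";") if s.strip()]
    return {
        "cellosaurus name": record["ID"][0],
        "cellosaurus accession": record.get("AC", ["not available"])[0],
        "organism": organism,
        "sex": record.get("SX", ["not available"])[0],
        "synonyms": "; ".join(synonyms),
    }


rows = [to_sdrf_fields(r) for r in parse_cellosaurus(SAMPLE.splitlines())]
print(rows[0])
```

To run the same sketch on the downloaded file, pass an open file handle for cellosaurus.txt to `parse_cellosaurus` instead of `SAMPLE.splitlines()`.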

Additional Curation: Manual annotation is performed using data from:
- Coriell Cell Line Catalog
- Cell Bank RIKEN
- ATCC

Ontologies

The following ontologies are used for annotation:

MONDO:
Used to annotate the disease associated with a cell line.
BTO:
Provides additional references for cell line IDs.

Database Structure

The database is stored as a TSV (tab-separated values) file and contains the following key fields:

| Field Name | Description |
|---------------------------|----------------------------------------------------------------|
| cell line | Cell line code |
| cellosaurus name | Name as annotated in Cellosaurus ID. |
| cellosaurus accession | Accession ID from Cellosaurus AC. |
| bto cell line | Name as annotated in BTO. |
| organism | Organism species (from Cellosaurus). |
| organism part | Annotated from supplementary sources. |
| sampling site | Sampling site of the cell line. |
| age | Age of the cell line (from Cellosaurus or additional sources). |
| developmental stage | Developmental stage (inferred from age if missing). |
| sex | Sex information (from Cellosaurus). |
| ancestry category | Ancestry classification (from Cellosaurus or supplementary sources). |
| disease | Agreed-upon disease annotation across sources. |
| cell type | Agreed-upon cell type annotation across sources. |
| material type | Agreed-upon material classification. |
| synonyms | Consolidated synonyms and accessions from all sources. |
| curated | Curation status: _not curated_, _AI curated_, or _manual curated_. |

Note: The final database is provided as a tab-delimited file for easy integration. It can be loaded into tools like Pandas or viewed directly via GitHub's table renderer.
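
A minimal sketch of loading the table with pandas and filtering on one of the columns listed above. The two inline rows are made-up sample data using a subset of the documented columns, and `load_db` is a hypothetical helper; with the real file you would pass the path to cl-annotations-db.tsv instead of the in-memory buffer.

```python
import io

import pandas as pd

# Made-up rows with a subset of the documented columns; the values in the
# real cl-annotations-db.tsv may differ.
SAMPLE_TSV = (
    "cell line\tcellosaurus name\tcellosaurus accession\torganism\tcurated\n"
    "HELA\tHeLa\tCVCL_0030\tHomo sapiens\tnot curated\n"
    "HEK293\tHEK293\tCVCL_0045\tHomo sapiens\tnot curated\n"
)


def load_db(source):
    """Read the tab-delimited database, keeping every value as a plain string."""
    # keep_default_na=False stops pandas from turning text such as "NA" into NaN.
    return pd.read_csv(source, sep="\t", dtype=str, keep_default_na=False)


db = load_db(io.StringIO(SAMPLE_TSV))

# Select the row for a given Cellosaurus accession.
hits = db[db["cellosaurus accession"] == "CVCL_0030"]
print(hits[["cell line", "organism"]].to_dict("records"))
```

Reading every column as a string is a deliberate choice here: fields such as age can mix numbers with free text, and string comparison keeps lookups predictable.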

Features

Standardizes metadata from multiple sources.
Uses ontologies to annotate diseases and tissue information.
Supports AI-based curation and manual validation for accuracy.
Provides easy-to-query tab-delimited outputs.

SDRF Cell Line Annotator

This script annotates the cell lines in an SDRF (Sample and Data Relationship Format) file with information from a provided cell line metadata database. It matches cell line names from the SDRF against entries in the database, using exact matches on cell line, cellosaurus name, and cellosaurus accession, as well as partial matches against synonyms. If a match is found, the corresponding metadata (e.g., organism, disease, and age) is provided. If no match is found, the fields are populated with "not available" and a warning is logged.

    python annotator.py --sdrf-file MSV000085836.sdrf.tsv --db-file cl-annotations-db.tsv --output-file suggested-terms.tsv

Key Features:

Database Matching: Matches cell line names from the SDRF file against a cell line database with multiple matching criteria (exact and synonym-based).
Synonym Handling: Synonyms in the database are split by semicolon and compared to the cell line names, ensuring flexible matching.
Logging and Error Handling: Warnings are logged for any unmatched cell lines, and errors are gracefully handled.
TSV Output: Annotates and outputs the results to a new TSV file, maintaining structured data for downstream analysis.
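
The matching behaviour described above can be sketched roughly as follows. This is not the code in annotator.py: `find_match` and `annotate` are hypothetical names, the case-insensitive comparison is an assumption, and synonyms are compared as whole entries after splitting on semicolons, which is only one possible reading of "partial matches". The two database rows are made-up sample data.

```python
import logging

logger = logging.getLogger("annotator_sketch")

# Columns checked for an exact match, as listed in the description above.
EXACT_FIELDS = ("cell line", "cellosaurus name", "cellosaurus accession")

# Made-up sample rows; the real database has more columns and entries.
DB = [
    {"cell line": "HELA", "cellosaurus name": "HeLa",
     "cellosaurus accession": "CVCL_0030", "organism": "Homo sapiens",
     "synonyms": "Hela; He La; HeLa-CCL2"},
    {"cell line": "HEK293", "cellosaurus name": "HEK293",
     "cellosaurus accession": "CVCL_0045", "organism": "Homo sapiens",
     "synonyms": "HEK-293; 293"},
]


def find_match(name, db_rows):
    """Return the first row matching `name`, or None if nothing matches."""
    query = name.strip().lower()
    if not query:
        return None
    # Pass 1: exact match on the identifying columns.
    for row in db_rows:
        if any(row.get(field, "").strip().lower() == query for field in EXACT_FIELDS):
            return row
    # Pass 2: compare against each semicolon-separated synonym.
    for row in db_rows:
        synonyms = [s.strip().lower() for s in row.get("synonyms", "").split(";")]
        if query in synonyms:
            return row
    return None


def annotate(name, db_rows, fields=("organism",)):
    """Return the requested fields, or 'not available' plus a warning if unmatched."""
    row = find_match(name, db_rows)
    if row is None:
        logger.warning("No match found for cell line %r", name)
        return {field: "not available" for field in fields}
    return {field: row.get(field, "not available") for field in fields}


print(annotate("HEK-293", DB))
print(annotate("unknown-line", DB))
```

Running the exact pass over all rows before the synonym pass means a name that is another line's synonym never shadows an exact hit.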

Requirements

To use the scripts, ensure the following are installed:

Python 3.x
Required libraries:
pandas
numpy
spacy
Install the en_core_web_lg model for spaCy: python -m spacy download en_core_web_lg

Code of Conduct

We strive to foster a welcoming, inclusive, and respectful community where everyone feels encouraged to participate and contribute. As contributors and maintainers, we are committed to upholding ethical standards to prevent conflicts, harassment, and discrimination. We ask all participants to communicate respectfully, avoid personal attacks, and be constructive in their feedback. Contributions should be made with honesty, empathy, and respect for differing perspectives. Read the full Code of Conduct.

Commenting and contributing

We welcome contributions from the community. If you would like to contribute, please open an issue or a pull request. We will review your contribution and provide feedback. We aim to be inclusive and collaborative, and we welcome all contributions that are in line with our goals.

If you want to contribute to the manuscript, please do the following:
- Fork the repository
- Edit the content of manuscript.md
- Submit a pull request
- We will review your contribution and provide feedback
If you want to discuss a topic, please open an issue.

NOTE: If, based on your contribution, you would like to be added as a co-author, please open an issue and provide your name and affiliation and a short description of your contribution or a link to the relevant issue and pull request.

Contributors

Yasset Perez-Riverol - EMBL-EBI, UK

Owner

Name: BigBio Stack
Login: bigbio
Kind: organization
Email: proteomicsstack@gmail.com
Location: Cambridge, UK

Website: http://bigbio.xyz
Repositories: 24
Profile: https://github.com/bigbio

Provide big data solutions Bioinformatics

GitHub Events

Total

Release event: 1
Issue comment event: 5
Push event: 23
Pull request review event: 2
Pull request event: 10
Create event: 3

Last Year

Release event: 1
Issue comment event: 5
Push event: 23
Pull request review event: 2
Pull request event: 10
Create event: 3
