crocodeel

CroCoDeEL is a tool that detects cross-sample contamination in shotgun metagenomic data

https://github.com/metagenopolis/crocodeel

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 11 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary

Keywords

contamination-detection metagenomics quality-control
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: metagenopolis
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 2.74 MB
Statistics
  • Stars: 24
  • Watchers: 2
  • Forks: 0
  • Open Issues: 1
  • Releases: 9
Topics
contamination-detection metagenomics quality-control
Created almost 2 years ago · Last pushed 7 months ago

README.md

CroCoDeEL: CROss-sample COntamination DEtection and Estimation of its Level 🐊

Introduction

CroCoDeEL is a tool that detects cross-sample contamination (aka well-to-well leakage) in shotgun metagenomic data.
It not only identifies contaminated samples accurately but also pinpoints contamination sources and estimates contamination rates.
CroCoDeEL relies only on species abundance tables and requires neither negative controls nor sample positions during processing (i.e. plate maps).

Installation

CroCoDeEL is available on bioconda:

    conda create --name crocodeel_env -c conda-forge -c bioconda crocodeel
    conda activate crocodeel_env

Alternatively, you can use pip with Python ≥ 3.12:

    pip install crocodeel

Docker and Singularity containers are also available on BioContainers.

Installation test

To verify that CroCoDeEL is installed correctly, run the following command:

    crocodeel test_install

This command runs CroCoDeEL on a toy dataset and checks whether the generated results match the expected ones.
To inspect the results, you can rerun the command with the --keep-results parameter.

Quick start

Input

CroCoDeEL takes as input a species abundance table in TSV format.
The first column should contain species names. The other columns contain the abundance of each species in each sample.
An example is available here.

| species_name | sample1 | sample2 | sample3 | ... |
|:-------------|:-------:|:-------:|:-------:|:---:|
| species 1    | 0       | 0.05    | 0.07    | ... |
| species 2    | 0.1     | 0.01    | 0       | ... |
| ...          | ...     | ...     | ...     | ... |

CroCoDeEL works with relative abundances. The table is automatically normalized so that the abundances in each column (sample) sum to 1.
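The column normalization described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not CroCoDeEL's own code: the function names `read_abundance_table` and `normalize_columns` are hypothetical, and the handling of all-zero columns (left at 0) is an assumption.

```python
import csv
import io


def read_abundance_table(handle):
    """Read a species abundance TSV into (sample_names, {species: [abundances]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first column holds species names
    table = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return samples, table


def normalize_columns(table, n_samples):
    """Scale each sample (column) so that its abundances sum to 1."""
    totals = [sum(values[i] for values in table.values()) for i in range(n_samples)]
    return {
        # Assumption: a sample with a total of 0 is left as all zeros.
        species: [v / t if t > 0 else 0.0 for v, t in zip(values, totals)]
        for species, values in table.items()
    }


# Tiny in-memory table in the input format described above.
example = "species_name\tsample1\tsample2\nspecies 1\t0\t0.05\nspecies 2\t0.1\t0.15\n"
samples, table = read_abundance_table(io.StringIO(example))
normalized = normalize_columns(table, len(samples))
```

After normalization, the values of each sample add up to 1 regardless of the scale used in the input file.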

Important: CroCoDeEL requires accurate estimation of the abundance of subdominant species.
We strongly recommend using the Meteor software suite to generate the species abundance table.
Alternatively, MetaPhlAn4 can be used (parameter: --tax_level t), although it will fail to detect low-level contaminations.
We advise against using other taxonomic profilers that, according to our benchmarks, do not meet this requirement.

Search for contamination

Run the following command to identify cross-sample contamination:

    crocodeel search_conta -s species_abundance.tsv -c contamination_events.tsv

CroCoDeEL will output all detected contamination events in the file contamination_events.tsv.
This TSV file includes the following details for each contamination event:
- The contamination source
- The contaminated sample (target)
- The estimated contamination rate
- The score (probability) computed by the Random Forest model
- The species specifically introduced into the target by contamination

An example output file is available here.
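As a rough illustration of how the output file might be post-processed, the sketch below loads a contamination events TSV and keeps only events above a score threshold. The column names used here (`source`, `target`, `rate`, `score`) and the 0.9 threshold are assumptions made for the example; check the header of your own contamination_events.tsv and adjust them.

```python
import csv
import io


def load_events(handle):
    """Read a contamination events TSV into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_by_score(events, score_column, min_score):
    """Keep events whose Random Forest score is at least min_score."""
    return [e for e in events if float(e[score_column]) >= min_score]


# Hypothetical file content; the real column names may differ.
example = (
    "source\ttarget\trate\tscore\n"
    "sampleA\tsampleB\t0.02\t0.97\n"
    "sampleC\tsampleD\t0.001\t0.55\n"
)
events = load_events(io.StringIO(example))
confident = filter_by_score(events, score_column="score", min_score=0.9)
```

Passing the score column as a parameter keeps the sketch usable even if the actual header differs from the one assumed here.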

If you are using MetaPhlAn4, we strongly recommend filtering out low-abundance species to improve CroCoDeEL's sensitivity.
Use the --filter-low-ab option as shown below:

    crocodeel search_conta -s species_abundance.tsv --filter-low-ab 20 -c contamination_events.tsv

Visualization of the results

Contamination events can be visually inspected by generating a PDF file consisting of scatterplots:

    crocodeel plot_conta -s species_abundance.tsv -c contamination_events.tsv -r contamination_events.pdf

Each scatterplot compares, on a log scale, the species abundance profiles of a contaminated sample (x-axis) and its contamination source (y-axis).
The contamination line (in red) highlights species specifically introduced by contamination.
An example is available here.
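To make the axes of these scatterplots concrete, here is a minimal sketch that computes the log-scale coordinates for one pair of samples. It is not how CroCoDeEL draws its plots: the function name `log_scale_points` is hypothetical, and dropping species with zero abundance in either sample is an assumption made because zero has no logarithm (the tool may display undetected species differently).

```python
import math


def log_scale_points(target, source):
    """Return (species, x, y) with x = log10 abundance in the contaminated
    sample and y = log10 abundance in the contamination source."""
    points = []
    for species, target_ab in target.items():
        source_ab = source.get(species, 0.0)
        # Assumption: only species detected in both samples are kept.
        if target_ab > 0 and source_ab > 0:
            points.append((species, math.log10(target_ab), math.log10(source_ab)))
    return points


# Relative abundances of three species in a hypothetical sample pair.
target = {"species 1": 0.001, "species 2": 0.5, "species 3": 0.0}
source = {"species 1": 0.1, "species 2": 0.0, "species 3": 0.2}
points = log_scale_points(target, source)
```

In this toy pair, only "species 1" is detected in both samples, so it is the only point that would appear on the plot.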

Easy workflow

Alternatively, you can search for cross-sample contamination and create the PDF report in one command:

    crocodeel easy_wf -s species_abundance.tsv -c contamination_events.tsv -r contamination_events.pdf
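When several abundance tables need to be processed, the one-command workflow can be wrapped in a small Python script. The sketch below only builds and runs the `crocodeel easy_wf` command with the -s, -c and -r options shown in this README. The helper names `easy_wf_command` and `run_batch` and the output naming scheme (replacing the last extension with `.crocodeel.tsv` and `.crocodeel.pdf`) are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def easy_wf_command(table):
    """Build the easy_wf command line for one species abundance table."""
    table = Path(table)
    # Assumption: outputs are written next to the input, with the last
    # extension replaced by .crocodeel.tsv and .crocodeel.pdf.
    events = table.with_suffix(".crocodeel.tsv")
    report = table.with_suffix(".crocodeel.pdf")
    return ["crocodeel", "easy_wf", "-s", str(table), "-c", str(events), "-r", str(report)]


def run_batch(tables):
    """Run easy_wf on each table in turn, stopping at the first failure."""
    for table in tables:
        subprocess.run(easy_wf_command(table), check=True)


# Building the command does not require CroCoDeEL to be installed.
command = easy_wf_command("PRJNA698986_P3.meteor.tab")
```

Calling `run_batch(["plate1.tsv", "plate2.tsv"])` would then process each file sequentially, provided the `crocodeel` executable is on the PATH.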

Results interpretation

CroCoDeEL will probably report false contamination events for samples with similar species abundance profiles (e.g. longitudinal data, animals raised together).
For unrelated samples, CroCoDeEL may occasionally generate false positives that can be filtered out by a human expert.
We therefore strongly recommend inspecting the scatterplot of each contamination event to discard potential false positives.
Please check the wiki for more information.

Reproduce results of the paper

Species abundance tables of the training, validation and test datasets are available in this repository.
You can use CroCoDeEL to analyze these tables and reproduce the results presented in the paper.
For example, to process Plate 3 from the Lou et al. dataset, first download the species abundance table:

    wget --content-disposition 'https://entrepot.recherche.data.gouv.fr/api/access/datafile/:persistentId?persistentId=doi:10.57745/BH1RKY'

Then run CroCoDeEL:

    crocodeel easy_wf -s PRJNA698986_P3.meteor.tab -c PRJNA698986_P3.meteor.crocodeel.tsv -r PRJNA698986_P3.meteor.crocodeel.pdf

Train a new Random Forest model

Advanced users can train a custom Random Forest model, which classifies sample pairs as contaminated or not.
You will need a species abundance table with labeled contaminated and non-contaminated sample pairs, to be used for training and testing.
To get started, you can download and decompress the dataset we used to train CroCoDeEL's default model:

    wget --content-disposition 'https://entrepot.recherche.data.gouv.fr/api/access/datafile/:persistentId?persistentId=doi:10.57745/IBIPVG'
    xz -d training_dataset.meteor.tsv.xz

Then, use the following command to train a new model:

    crocodeel train_model -s training_dataset.meteor.tsv -m crocodeel_model.tsv -r crocodeel_model_perf.tsv

Finally, to use your trained model instead of the default one, pass it with the -m option:

    crocodeel search_conta -s species_ab.tsv -m crocodeel_model.tsv -c conta_events.tsv

Citation

If you find CroCoDeEL useful, please cite:
Goulet, L. et al. "CroCoDeEL: accurate control-free detection of cross-sample contamination in metagenomic data" bioRxiv (2025). https://doi.org/10.1101/2025.01.15.633153.

Owner

  • Name: metagenopolis
  • Login: metagenopolis
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Goulet
    given-names: Lindsay
    email: lindsay.goulet@inrae.fr
    affiliation: INRAE
  - family-names: Plaza Oñate
    given-names: Florian
    email: florian.plaza-onate@inrae.fr
    affiliation: INRAE
  - family-names: Prifti
    given-names: Edi
    email: edi.prifti@ird.fr
    affiliation: IRD
  - family-names: Belda
    given-names: Eugeni
    email: eugeni.belda@ird.fr
    affiliation: IRD
  - family-names: Le Chatelier
    given-names: Emmanuelle
    email: emmanuelle.le-chatelier@inrae.fr
    affiliation: INRAE
  - family-names: Gautreau
    given-names: Guillaume
    email: guillaume.gautreau@inrae.fr
    affiliation: INRAE
title: "CroCoDeEL: CROss-sample COntamination DEtection and Estimation of its Level"
version: 1.0.8
doi: 10.1101/2025.01.15.633153
date-released: 2025-07-22

GitHub Events

Total
  • Create event: 8
  • Release event: 6
  • Issues event: 2
  • Watch event: 16
  • Delete event: 1
  • Push event: 52
  • Pull request event: 2
Last Year
  • Create event: 8
  • Release event: 6
  • Issues event: 2
  • Watch event: 16
  • Delete event: 1
  • Push event: 52
  • Pull request event: 2

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 27 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 9
  • Total maintainers: 1
pypi.org: crocodeel

CroCoDeEL is a tool that detects cross-sample (aka well-to-well) contamination in shotgun metagenomic data

  • Versions: 9
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 27 Last month
Rankings
Dependent packages count: 9.5%
Average: 35.9%
Dependent repos count: 62.4%
Maintainers (1)
Last synced: 6 months ago

Dependencies

poetry.lock pypi
  • astroid 3.1.0
  • colorama 0.4.6
  • contourpy 1.2.1
  • cycler 0.12.1
  • dill 0.3.8
  • fonttools 4.51.0
  • isort 5.13.2
  • joblib 1.4.0
  • kiwisolver 1.4.5
  • matplotlib 3.8.4
  • mccabe 0.7.0
  • mypy 1.10.0
  • mypy-extensions 1.0.0
  • numpy 1.26.4
  • packaging 24.0
  • pandas 2.2.2
  • pandas-stubs 2.2.1.240316
  • pillow 10.3.0
  • platformdirs 4.2.1
  • pylint 3.1.0
  • pyparsing 3.1.2
  • python-dateutil 2.9.0.post0
  • pytz 2024.1
  • scikit-learn 1.3.0
  • scipy 1.13.0
  • six 1.16.0
  • threadpoolctl 3.5.0
  • tomlkit 0.12.4
  • tqdm 4.66.2
  • types-pytz 2024.1.0.20240417
  • types-tqdm 4.66.0.20240417
  • typing-extensions 4.11.0
  • tzdata 2024.1
pyproject.toml pypi
  • mypy ^1.10 mypy
  • pandas-stubs ^2.2 mypy
  • types-tqdm ^4.66 mypy
  • pylint ^3.1 pylint
  • joblib ^1.4
  • matplotlib ^3.8
  • numpy ^1.26
  • pandas ^2.2
  • python >=3.12
  • scikit-learn =1.3.0
  • scipy ^1.13
  • tqdm ^4.66
.github/workflows/mirror_gitlab.yml actions
  • actions/checkout v3 composite
  • pixta-dev/repository-mirroring-action v1 composite
.github/workflows/publish_crocodeel.yml actions
  • abatilo/actions-poetry v2 composite
  • actions/checkout v4 composite
  • actions/setup-python v5 composite