causobiome

A Multi-Stage Pipeline for Microbiome-Driven Causal Discovery and Intervention Design in Colorectal Cancer

https://github.com/ascanofficiel2/causobiome


Repository

Basic Info
  • Host: GitHub
  • Owner: AscanOfficiel2
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 60.5 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 1 year ago · Last pushed about 1 year ago

README.md

CausoBiome: A Microbiome Causality and Biomarker Discovery framework for Colorectal Cancer

🧠 Purpose

CausoBiome is a modular, stage-aware, and functionally grounded microbiome analysis pipeline built to uncover microbial features (species, antibiotic resistance genes (ARGs), and virulence factors (VFs)) that are functionally and causally linked to human disease progression, most notably colorectal cancer (CRC).

Unlike conventional pipelines that stop at descriptive comparisons, CausoBiome is designed to bridge microbiome discovery with translational insight by leveraging modern statistical learning, causal inference, and external validation strategies.

CausoBiome builds upon the high-quality, genome-resolved metagenomic preprocessing pipelines from the genome-resolved-urban-microbiome-biosurveillance repository, specifically Modules 01_Bioinformatics and 02_Quality_Batch_subsetting.


🎯 Design Philosophy

CausoBiome is built around five central goals:

  1. Stage-awareness
    It models microbiome dynamics along clinically meaningful transitions, e.g.
    Healthy → Adenoma → Cancer, capturing directional microbial shifts rather than static contrasts.

  2. Causality over correlation
    Through Double Machine Learning (DML), CausoBiome estimates the average treatment effect (ATE) of each feature while controlling for key confounders (e.g., Age, BMI, Sex).

  3. Functional resolution
    It analyzes both taxonomic (species-level) and functional (ARGs, VFs) features to capture mechanisms of microbial influence, including antibiotic resistance, immune modulation, and virulence.

  4. Feature robustness
    Using bootstrap stability, permutation importance, and ordinal forest modeling, the pipeline identifies robust biomarkers that generalize across datasets.

  5. Generalizability
    By incorporating external cohort validation, trend consistency, and statistical replication, CausoBiome ensures that its findings are not dataset-specific artifacts.


🔹 Step 1: Preprocessing and Normalization

CausoBiome builds upon genome-resolved upstream modules (adapted from genome-resolved-urban-microbiome-biosurveillance). Start from the 01_Bioinformatics module, then proceed to the 02_Quality_batch_subsetting module and run the script normalize_species_counts.py. Then transition into CausoBiome at module 01_Quality_Normalization_Batch_ecology, starting with the script 01_mag_quality_metrics_analysis.py, to deliver:

  • MAG binning & QC (completeness, contamination)
  • Species abundance normalization (Bracken)
  • ARG/VF normalization (CARD, VFDB + Hellinger transform)
  • Batch correction using ComBat / limma
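
As an illustration of the Hellinger step listed above, here is a minimal sketch, assuming a samples-by-features count matrix. The function name `hellinger_transform` and the toy matrix are hypothetical and not taken from the repository's scripts.

```python
import numpy as np

def hellinger_transform(counts):
    """Square root of row-wise relative abundances (samples in rows)."""
    counts = np.asarray(counts, dtype=float)
    row_totals = counts.sum(axis=1, keepdims=True)
    # Leave all-zero samples as zeros instead of dividing by zero.
    rel = np.divide(counts, row_totals,
                    out=np.zeros_like(counts), where=row_totals > 0)
    return np.sqrt(rel)

# Hypothetical toy matrix: 3 samples x 4 ARG/VF features.
toy_counts = np.array([[10, 0, 30, 60],
                       [5, 5, 5, 5],
                       [0, 0, 0, 0]])
transformed = hellinger_transform(toy_counts)
```

After the transform, the squared values of each non-empty sample sum to 1, which is what makes Euclidean distances on the transformed matrix behave like Hellinger distances.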

🔹 Step 2: Ecology and Diversity

It performs alpha/beta diversity analyses across disease stages using:

  • Shannon, Simpson, Richness indices
  • Bray–Curtis dissimilarity and NMDS ordinations
  • PERMANOVA and ANOSIM statistical tests
  • Dispersion tests and Cliff’s delta effect size metrics
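
The indices named above have compact definitions. The sketch below writes them out in NumPy for a single sample (or pair of samples). The function names are illustrative, and the Simpson variant shown (Gini-Simpson, 1 minus the sum of squared proportions) is an assumption about which form the pipeline reports.

```python
import numpy as np

def shannon(counts):
    """Shannon entropy H = -sum(p_i * ln p_i) over non-zero features."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log(p)).sum())

def simpson(counts):
    """Gini-Simpson index, 1 - sum(p_i^2) (assumed variant)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - (p ** 2).sum())

def richness(counts):
    """Number of features observed at least once."""
    return int((np.asarray(counts) > 0).sum())

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.abs(u - v).sum() / (u + v).sum())

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return float(((x > y).sum() - (x < y).sum()) / (x.size * y.size))
```

Cliff's delta ranges from -1 to 1, so it can be read as a non-parametric effect size alongside the PERMANOVA and ANOSIM p-values.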

🔹 Step 3: Machine Learning for Feature Selection

ML classifiers (Random Forest, SVM, LR, GB, Ordinal Forest) are benchmarked using:

  • Foldwise cross-validation
  • Model comparison using Accuracy, Kappa, MCC, and F1 Macro
  • Feature importance via permutation + bootstrap ranking
  • Signature score projection and retraining on top biomarkers
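
A foldwise benchmark of this kind can be sketched with scikit-learn as follows. This is an illustrative outline rather than the repository's scripts: the helper `foldwise_benchmark`, the synthetic three-class data, and the hyperparameters are all assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef)
from sklearn.model_selection import StratifiedKFold

def foldwise_benchmark(model, X, y, n_splits=5, seed=0):
    """Score a classifier per fold and average permutation importances."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_metrics = []
    importances = np.zeros((n_splits, X.shape[1]))
    for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        pred = fitted.predict(X[test_idx])
        fold_metrics.append({
            "accuracy": accuracy_score(y[test_idx], pred),
            "kappa": cohen_kappa_score(y[test_idx], pred),
            "mcc": matthews_corrcoef(y[test_idx], pred),
            "f1_macro": f1_score(y[test_idx], pred, average="macro"),
        })
        # Importance is measured on held-out samples of this fold only.
        result = permutation_importance(fitted, X[test_idx], y[test_idx],
                                        n_repeats=5, random_state=seed)
        importances[fold] = result.importances_mean
    return fold_metrics, importances.mean(axis=0)

# Synthetic stand-in for a 3-stage (Healthy/Adenoma/Cancer) feature matrix.
X, y = make_classification(n_samples=150, n_features=8, n_informative=3,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0)
fold_metrics, mean_importance = foldwise_benchmark(rf, X, y)
```

Swapping `rf` for another estimator with the same `fit`/`predict` interface gives the per-fold metrics needed for a model comparison.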

🔹 Step 4: Causal Inference via DML

Using the EconML framework, CausoBiome applies LinearDML to:

  • Estimate ATE per microbial gene on CRC stage (Healthy → Cancer)
  • Control for Age, BMI, Sex using machine-learned nuisance functions
  • Compute confidence intervals, E-values, and required sample sizes
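
To make the idea behind DML concrete, the sketch below hand-rolls the partialling-out estimator with cross-fitting in scikit-learn. It is not EconML's `LinearDML` API and not the repository's code. It assumes the feature abundance is the treatment `t`, a numerically encoded stage is the outcome `y`, and the covariates (e.g., Age, BMI, Sex) form `W`. The function name and the simulated data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def dml_partial_out(y, t, W, make_model=LinearRegression, n_splits=5):
    """Residual-on-residual effect estimate with a normal-approximation CI."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    # Cross-fitting: each sample's nuisance prediction comes from a model
    # that was trained without that sample.
    y_hat = cross_val_predict(make_model(), W, y, cv=n_splits)
    t_hat = cross_val_predict(make_model(), W, t, cv=n_splits)
    y_res, t_res = y - y_hat, t - t_hat
    theta = float((t_res @ y_res) / (t_res @ t_res))
    # Heteroskedasticity-robust standard error of the final regression.
    eps = y_res - theta * t_res
    se = float(np.sqrt(np.mean(t_res**2 * eps**2) / len(y))
               / np.mean(t_res**2))
    return theta, (theta - 1.96 * se, theta + 1.96 * se)

# Simulated example with a known effect of 2.0 and three confounders.
rng = np.random.default_rng(0)
n = 1000
W = rng.normal(size=(n, 3))
t = W @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)
y = 2.0 * t + W @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
theta, (ci_low, ci_high) = dml_partial_out(y, t, W)
```

Passing a different `make_model` (for example a random forest regressor factory) changes the nuisance learners without changing the final estimating equation.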

🔹 Step 5: Microbial Interaction Modeling

Microbial feature interactions are analyzed through:

  • Pairwise causal effect modeling (LinearDML on gene-gene products)
  • Identification of synergistic vs. antagonistic effects
  • Network construction with node centrality metrics
  • Visualization of causal hubs and clusters
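
The bookkeeping around these steps can be sketched in plain Python. The example below builds gene-gene product features, labels an interaction from its confidence interval, and computes a simple weighted-degree centrality. The sign-based definition of synergistic versus antagonistic, the gene names, and the effect values are all assumptions made for illustration.

```python
from itertools import combinations
import numpy as np

def pairwise_products(X, names):
    """Return one product column per unordered feature pair."""
    X = np.asarray(X, dtype=float)
    pairs = list(combinations(range(X.shape[1]), 2))
    products = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    labels = [(names[i], names[j]) for i, j in pairs]
    return products, labels

def classify_interaction(ci_low, ci_high):
    """Assumed rule: the sign of an effect whose CI excludes zero."""
    if ci_low > 0:
        return "synergistic"
    if ci_high < 0:
        return "antagonistic"
    return "inconclusive"

def weighted_degree(edges):
    """Sum of absolute edge weights touching each node."""
    strength = {}
    for (a, b), w in edges.items():
        strength[a] = strength.get(a, 0.0) + abs(w)
        strength[b] = strength.get(b, 0.0) + abs(w)
    return strength

# Hypothetical abundances for three genes across five samples.
rng = np.random.default_rng(1)
X_toy = rng.random((5, 3))
products, labels = pairwise_products(X_toy, ["gene_A", "gene_B", "gene_C"])

# Hypothetical interaction effects used as network edge weights.
hubs = weighted_degree({("gene_A", "gene_B"): 0.5,
                        ("gene_A", "gene_C"): -0.25})
```

Each product column could then be passed as the treatment to a DML estimator, with the resulting effects supplying the edge weights.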

🔹 Step 6: External Validation

CausoBiome validates biomarker generalizability via:

  • Mann–Whitney tests in both internal and external datasets
  • Directional trend comparison (CRC vs. Control)
  • Concordance barplots and scatterplots
  • Venn diagrams of statistical overlap
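
A minimal sketch of the test-and-compare logic, assuming SciPy is available and that each cohort is stored as a mapping from feature name to (CRC values, control values). The helper names, the median-based direction rule, and the simulated cohorts are assumptions, not the notebook's actual code.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def compare_groups(crc, control):
    """Two-sided Mann-Whitney p-value and sign of the median difference."""
    _, p = mannwhitneyu(crc, control, alternative="two-sided")
    direction = int(np.sign(np.median(crc) - np.median(control)))
    return float(p), direction

def concordance(internal, external, alpha=0.05):
    """Flag features whose direction (and significance) replicate."""
    rows = {}
    for feat in internal.keys() & external.keys():
        p_in, d_in = compare_groups(*internal[feat])
        p_ex, d_ex = compare_groups(*external[feat])
        same = d_in == d_ex and d_in != 0
        rows[feat] = {
            "same_direction": same,
            "replicated": same and p_in < alpha and p_ex < alpha,
        }
    return rows

# Simulated cohorts: gene_A is enriched in CRC in both cohorts,
# gene_B flips direction in the external cohort.
rng = np.random.default_rng(2)
internal = {"gene_A": (rng.normal(2, 1, 40), rng.normal(0, 1, 40)),
            "gene_B": (rng.normal(2, 1, 40), rng.normal(0, 1, 40))}
external = {"gene_A": (rng.normal(2, 1, 40), rng.normal(0, 1, 40)),
            "gene_B": (rng.normal(0, 1, 40), rng.normal(2, 1, 40))}
report = concordance(internal, external)
```

The `same_direction` and `replicated` flags map naturally onto the concordance plots and Venn diagrams listed above.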

🧬 Who Should Use CausoBiome?

CausoBiome is ideal for:

  • Microbiome researchers aiming for causal inference beyond correlation
  • Cancer biologists exploring functional microbial signatures
  • Clinical bioinformaticians validating microbial biomarkers across cohorts
  • Systems biologists modeling microbial interactions and networks

Pipeline Structure

```text
CausoBiome/
├── 01QualityNormalizationBatchecology/
│   ├── 01magqualitymetricsanalysis.py
│   ├── 02Taxoncountsnormalization.py
│   ├── 03Taxonsubsetpathogenicsamples.py
│   ├── 04taxonbatchcorrection.R
│   ├── 05taxonecology.R
│   ├── 06argvfnormalization.py
│   ├── 07argvfbatchcorrection.R
│   ├── 08argvfecology.R
│   └── 01READMEModuleQCNormalization.md
│
├── 02MachinelearningFeatureselection/
│   ├── 01mlbenchmarkARG-VF.sh
│   ├── 02RFTUNINGARG-VF.sh
│   ├── 03ARGVFdatageneration.py
│   ├── 04validationmodel.sh
│   ├── 05Getfeatureimportance.sh
│   ├── 06retraintop20.sh
│   ├── 07foldwiseComparisonARG-VF.R
│   ├── 08ARGVFOrdinalforest.R
│   ├── 09taxonMachinelearningfeature.sh
│   ├── 10TaxonOrdinalforest.R
│   ├── 11taxonstatscorrelations.py
│   └── 02READMEModuleMachineLearningFULL.md
│
├── 03CausalAnalysis/
│   ├── 01Complexheatmapargvf.R
│   ├── 02causalanalysis.py
│   └── 03READMEModuleCausalAnalysis.md
│
├── 04Externalcohortvalidation/
│   ├── ExternalcohortValidation.ipynb
│   └── 04READMEModuleExternalValidation.md
│
├── data/
│   ├── combinedARGVFDBfinalDATA.csv
│   ├── MetadataAlignedVFARGCountMatrix.csv
│   └── ... (external validation matrices)
│
└── READMECausoBiome_Detailed.md
```


🚀 What Makes CausoBiome Novel?

CausoBiome differs from typical microbiome pipelines by addressing not just what is different but what functionally drives disease progression. Its novelty lies in several aspects:

1. Stage-Aware Ordinal Modeling

  • Directly models disease trajectory: Healthy → Adenoma → Cancer
  • Employs Ordinal Forests and Double Machine Learning (DML) to capture progression-aware microbial signals

2. Causal Inference Core

  • Implements DML via EconML to estimate Average Treatment Effects (ATEs) per feature
  • Controls for covariates like Age, Sex, BMI using flexible machine-learned nuisance models
  • Computes:
    • E-values for sensitivity to unmeasured confounding
    • Bootstrap confidence intervals for robustness
    • Required sample sizes for validation studies
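
For reference, the E-value for an effect expressed as a risk ratio follows the closed form RR + sqrt(RR × (RR − 1)), applied to the reciprocal when RR is below 1. The sketch below implements only that formula. How the pipeline converts an ATE to the risk-ratio scale is not stated here, so the input is assumed to already be a risk ratio.

```python
import math

def e_value(rr):
    """E-value for a risk ratio (assumes rr is already on the RR scale)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    # Protective effects (rr < 1) are mirrored onto the rr >= 1 side.
    rr = max(rr, 1.0 / rr)
    return rr + math.sqrt(rr * (rr - 1.0))
```

An E-value of 1 means no unmeasured confounding is needed to explain the estimate away. Larger values indicate that a stronger unmeasured confounder would be required.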

3. Synthetic Cohort Generation

  • Produces PCA-enhanced synthetic metagenomes
  • Enables model testing under both balanced and realistic class distributions

4. Intervention-Ready Outputs

  • Identifies synergistic and antagonistic microbial gene interactions
  • Constructs causal and co-occurrence networks from inferred ATE effects
  • Annotates ARGs and VFs with known mechanisms and matched microbial species

⚙️ Key Functional Highlights

| Component | Description |
|---------------------|-------------------------------------------------------------------------|
| Ecological Analysis | Rich alpha/beta diversity metrics, NMDS ordination, dispersion tests |
| Machine Learning | Foldwise benchmarking of RF, SVM, KNN, GB, Logistic Regression |
| Feature Robustness | Combined permutation importance + bootstrap stability (Top 20 features) |
| Ordinal Forest | Identifies rank-aware discriminative genes and taxa |
| Signature Score | Mean and PCA1 scores for individual-level CRC burden assessment |
| Causal Estimation | ATEs from LinearDML with Random Forests as base learners |
| Interaction Effects | Gene × gene product modeling for combined causal effects |
| External Validation | Tests trend consistency in independent cohort (e.g., PRJEB10878) |
| Heatmaps & Networks | Visualizes consistent biomarkers and their interaction hubs |
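
The signature score row can be illustrated with a short NumPy sketch: one score is the mean of standardized biomarker abundances per sample, the other is the projection onto the first principal component. Standardizing before averaging and the sign orientation of PC1 are assumptions, and `signature_scores` is a hypothetical name.

```python
import numpy as np

def signature_scores(X):
    """Per-sample mean score and PC1 score from a samples x features matrix."""
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0)
    # Z-score each feature; constant features are only centered.
    Z = (X - X.mean(axis=0)) / np.where(sd > 0, sd, 1.0)
    mean_score = Z.mean(axis=1)
    # First principal component via SVD of the standardized matrix.
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    pc1 = U[:, 0] * S[0]
    # The PCA sign is arbitrary; orient PC1 to agree with the mean score.
    if np.dot(pc1, mean_score) < 0:
        pc1 = -pc1
    return mean_score, pc1

# Simulated biomarkers sharing one latent "burden" factor.
rng = np.random.default_rng(3)
latent = rng.normal(size=60)
X_sig = latent[:, None] + 0.5 * rng.normal(size=(60, 5))
mean_score, pc1_score = signature_scores(X_sig)
```

When the selected biomarkers move together, the two scores are strongly correlated. They diverge when the features carry more than one dominant pattern.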


🔄 Extensibility

CausoBiome is designed with modularity and disease-agnostic flexibility:

  • Supports any disease with ordered clinical stages (e.g., liver fibrosis, IBD, NAFLD)
  • Compatible with metabolomic, proteomic, or transcriptomic feature matrices

  • Adaptable to longitudinal or time-series microbiome datasets

Pipeline Origin

CausoBiome is an extension of the upstream genome-resolved-urban-microbiome-biosurveillance workflow in:

GitHub: genome-resolved-urban-microbiome-biosurveillance

  • Users should start from the 01_Bioinformatics module and then proceed to the 02_Quality_batch_subsetting module, specifically running the script normalize_species_counts.py.
  • Then transition into CausoBiome at module 01_Quality_Normalization_Batch_ecology, starting with the script 01_mag_quality_metrics_analysis.py.

Requirements

  • SLURM-based HPC environment (for .sh scripts)
  • Python ≥ 3.11 and R ≥ 4.0
  • Pip packages: pandas, numpy, scikit-learn, matplotlib, seaborn, joblib, econml, networkx, statsmodels
  • R packages: vegan, sva, ggplot2, umap, pairwiseAdonis, FSA, etc.

Tools/Databases

  • VFDB – Virulence Factor Database
  • CARD – Comprehensive Antibiotic Resistance Database
  • ComBat – Batch correction via the sva R package
  • EconML – Causal inference library for treatment effect estimation
  • scikit-learn – Model training, permutation importance, ROC/AUC scoring

Output Highlights

  • Diversity metrics, ordination plots
  • Classification metrics (F1, MCC, AUROC)
  • Feature importance/stability plots
  • Causal estimates for microbial features (DML ATE)
  • Robustness plots (E-values, bootstraps)
  • Microbial interaction networks (weighted, annotated)

Use Cases

  • Functional microbiome biomarker discovery
  • Ecological profiling of CRC microbiomes
  • Translational microbiome-based risk stratification
  • Design of synthetic consortia or microbial interventions

License

MIT License — free to use, adapt, and cite with attribution.

This framework, referred to as CausoBiome, is currently part of a manuscript under peer review. This repository is shared under the MIT License to promote transparency and reproducibility.

We kindly request that you do not republish or repackage this methodology before journal publication.

Citation

If you use CausoBiome, please cite the following manuscript:

Ascandari, A., Aminu, S., Benhida, R., & Daoud, R. (2025).
A Core Genome-Resolved Microbial Resistome–Virulome Hub Causally Drives Colorectal Cancer Progression.

Submitted Articles Related to the Framework

Ascandari, A., Aminu, S., Benhida, R., & Daoud, R. (2025).
A Core Genome-Resolved Microbial Resistome–Virulome Hub Causally Drives Colorectal Cancer Progression (under review).

Contact

For questions, feedback, or collaboration regarding this framework, please reach out:

AbdulAziz Ascandari, PhD Researcher, Department of Chemical and Biochemical Sciences, University Mohammed VI Polytechnic (UM6P), Morocco, abdulaziz.ascandari@um6p.ma

Prof. Rachid Daoud, Group Leader & Supervisor, Department of Chemical and Biochemical Sciences, University Mohammed VI Polytechnic (UM6P), Morocco, rachid.daoud@um6p.ma

Owner

  • Name: AbdulAziz Ascandari
  • Login: AscanOfficiel2
  • Kind: user
  • Location: Ben Guerir, Morocco
  • Company: University Mohammed VI Polytechnic

I am a doctoral researcher in biomedical science and a computational biology enthusiast. I enjoy watching blockbuster movies and going on nature walks.

GitHub Events

Total
  • Release event: 1
  • Public event: 1
  • Push event: 6
Last Year
  • Release event: 1
  • Public event: 1
  • Push event: 6