causobiome

A Multi-Stage Pipeline for Microbiome-Driven Causal Discovery and Intervention Design in Colorectal Cancer

https://github.com/ascanofficiel2/causobiome



Basic Info

Host: GitHub
Owner: AscanOfficiel2
License: mit
Language: Python
Default Branch: main
Size: 60.5 KB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 1

Created about 1 year ago · Last pushed about 1 year ago


CausoBiome: A Microbiome Causality and Biomarker Discovery framework for Colorectal Cancer

🧠 Purpose

CausoBiome is a modular, stage-aware, and functionally grounded microbiome analysis pipeline purpose-built to uncover microbial features (species, ARGs, and VFs) that are functionally and causally linked to human disease progression, most notably colorectal cancer (CRC).

Unlike conventional pipelines that stop at descriptive comparisons, CausoBiome is designed to bridge microbiome discovery with translational insight by leveraging modern statistical learning, causal inference, and external validation strategies.

CausoBiome builds upon high-quality, genome-resolved metagenomic preprocessing pipelines from the genome-resolved-urban-microbiome-biosurveillance repository, specifically Modules 01_Bioinformatics and 02_Quality_Batch_subsetting.

🎯 Design Philosophy

CausoBiome is built around five central goals:

Stage-awareness
It models microbiome dynamics along clinically meaningful transitions, e.g.
Healthy → Adenoma → Cancer, capturing directional microbial shifts rather than static contrasts.
Causality over correlation
Through Double Machine Learning (DML), CausoBiome estimates the average treatment effect (ATE) of each feature while controlling for key confounders (e.g., Age, BMI, Sex).
Functional resolution
It analyzes both taxonomic (species-level) and functional (ARGs, VFs) features to capture mechanisms of microbial influence, including antibiotic resistance, immune modulation, and virulence.
Feature robustness
Using bootstrap stability, permutation importance, and ordinal forest modeling, the pipeline identifies robust biomarkers that generalize across datasets.
Generalizability
By incorporating external cohort validation, trend consistency, and statistical replication, CausoBiome ensures that its findings are not dataset-specific artifacts.

🔹 Step 1: Preprocessing and Normalization

CausoBiome builds upon genome-resolved upstream modules (adapted from genome-resolved-urban-microbiome-biosurveillance). Start from the 01_Bioinformatics module, then proceed to the 02_Quality_batch_subsetting module and run the script normalize_species_counts.py. Then transition into CausoBiome, starting from the script 01_mag_quality_metrics_analysis.py in module 01_Quality_Normalization_Batch_ecology, to deliver:

MAG binning & QC (completeness, contamination)
Species abundance normalization (Bracken)
ARG/VF normalization (CARD, VFDB + Hellinger transform)
Batch correction using ComBat / limma
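
As an illustration of the Hellinger transform named above, here is a minimal sketch, assuming a single sample's raw ARG/VF counts are available as a list. The function name `hellinger_transform` and the example counts are hypothetical and do not come from the repository scripts.

```python
import math

def hellinger_transform(counts):
    """Return the Hellinger-transformed copy of one sample's count vector.

    Each count is divided by the sample total, then square-rooted.
    An all-zero sample is returned as zeros to avoid division by zero.
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.sqrt(c / total) for c in counts]

# Hypothetical ARG/VF counts for one sample (illustrative values only).
sample_counts = [4, 0, 12, 0]
transformed = hellinger_transform(sample_counts)
print(transformed)
```

After the transform, the squared values of a non-empty sample sum to 1, which dampens the influence of a few highly abundant genes.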

🔹 Step 2: Ecology and Diversity

It performs alpha/beta diversity analyses across disease stages using:

Shannon, Simpson, Richness indices
Bray–Curtis dissimilarity and NMDS ordinations
PERMANOVA and ANOSIM statistical tests
Dispersion tests and Cliff’s delta effect size metrics
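
The indices listed above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the pipeline itself runs these analyses in R (the requirements list vegan), and the function names and example counts below are assumptions. Simpson is computed here in its Gini-Simpson form, 1 − Σp².

```python
import math

def shannon(counts):
    """Shannon index H = -sum(p * ln p) over non-zero taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Gini-Simpson index 1 - sum(p^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def richness(counts):
    """Number of taxa with a non-zero count."""
    return sum(1 for c in counts if c > 0)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples over the same taxa."""
    denominator = sum(a) + sum(b)
    if denominator == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for xi in x for yi in y if xi > yi)
    less = sum(1 for xi in x for yi in y if xi < yi)
    return (greater - less) / (len(x) * len(y))

# Hypothetical species counts for two samples (illustrative values only).
healthy = [10, 10, 10, 10]
cancer = [30, 5, 5, 0]
print(shannon(healthy), simpson(healthy), richness(cancer))
print(bray_curtis(healthy, cancer))
print(cliffs_delta([3, 4, 5], [1, 2, 3]))
```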

🔹 Step 3: Machine Learning for Feature Selection

ML classifiers (Random Forest, SVM, LR, GB, Ordinal Forest) are benchmarked using:

Foldwise cross-validation
Model comparison using Accuracy, Kappa, MCC, and F1 Macro
Feature importance via permutation + bootstrap ranking
Signature score projection and retraining on top biomarkers
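
To make the comparison metrics above concrete, the following sketch computes Accuracy, Cohen's Kappa, multiclass MCC, and macro F1 from two label lists. It is a standard-library illustration, not the repository's benchmarking code (which lists scikit-learn among its requirements). The function name `classification_summary` and the example labels are hypothetical.

```python
import math
from collections import Counter

def classification_summary(y_true, y_pred):
    """Return accuracy, Cohen's kappa, macro F1 and multiclass MCC."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    overlap = sum(true_counts[l] * pred_counts[l] for l in labels)

    accuracy = correct / n

    # Kappa: observed agreement corrected for chance agreement.
    expected = overlap / n ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else 0.0

    # Macro F1: unweighted mean of per-class F1 = 2TP / (2TP + FP + FN).
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        denom = true_counts[label] + pred_counts[label]
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)

    # Multiclass Matthews correlation coefficient.
    cov_tt = n * n - sum(true_counts[l] ** 2 for l in labels)
    cov_pp = n * n - sum(pred_counts[l] ** 2 for l in labels)
    if cov_tt and cov_pp:
        mcc = (correct * n - overlap) / math.sqrt(cov_tt * cov_pp)
    else:
        mcc = 0.0

    return {"accuracy": accuracy, "kappa": kappa, "macro_f1": macro_f1, "mcc": mcc}

# Hypothetical held-out fold (illustrative labels only).
y_true = ["Healthy", "Healthy", "Adenoma", "Adenoma", "Cancer", "Cancer"]
y_pred = ["Healthy", "Adenoma", "Adenoma", "Adenoma", "Cancer", "Healthy"]
summary = classification_summary(y_true, y_pred)
print(summary)
```

Kappa and MCC are useful alongside Accuracy because both discount agreement that would be expected from the class frequencies alone.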

🔹 Step 4: Causal Inference via DML

Using the EconML framework, CausoBiome applies LinearDML to:

Estimate ATE per microbial gene on CRC stage (Healthy → Cancer)
Control for Age, BMI, Sex using machine-learned nuisance functions
Compute confidence intervals, E-values, and required sample sizes
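
The idea behind these estimates can be illustrated with a hand-rolled partialling-out estimator. This is a simplified sketch and not the pipeline's code: it assumes a single confounder, linear nuisance models, and a continuous synthetic outcome, whereas the pipeline uses EconML's LinearDML with machine-learned nuisance functions on real cohort data. All function and variable names here are hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y ~ intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def out_of_fold_residuals(x, target, folds):
    """Residuals of target ~ x, each predicted by a model fit on the other folds."""
    resid = [0.0] * len(target)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        intercept, slope = fit_line([x[i] for i in train_idx],
                                    [target[i] for i in train_idx])
        for i in test_idx:
            resid[i] = target[i] - (intercept + slope * x[i])
    return resid

def dml_effect(x, treatment, outcome, n_folds=2):
    """Cross-fitted partialling-out estimate of the treatment effect."""
    folds = [list(range(k, len(outcome), n_folds)) for k in range(n_folds)]
    res_t = out_of_fold_residuals(x, treatment, folds)
    res_y = out_of_fold_residuals(x, outcome, folds)
    # Regress outcome residuals on treatment residuals (no intercept).
    return sum(t * y for t, y in zip(res_t, res_y)) / sum(t * t for t in res_t)

# Synthetic data: the confounder drives both abundance and outcome;
# the true effect of abundance on the outcome is 2.0 by construction.
rng = random.Random(0)
confounder = [rng.gauss(0, 1) for _ in range(2000)]
abundance = [c + rng.gauss(0, 1) for c in confounder]
outcome = [2.0 * a + 3.0 * c + rng.gauss(0, 1)
           for a, c in zip(abundance, confounder)]

naive_slope = fit_line(abundance, outcome)[1]   # confounded estimate
dml_estimate = dml_effect(confounder, abundance, outcome)
print(f"naive slope: {naive_slope:.2f}, DML estimate: {dml_estimate:.2f}")
```

In this toy setting the naive slope absorbs part of the confounder's influence, while the residual-on-residual estimate should land close to the true effect of 2.0.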

🔹 Step 5: Microbial Interaction Modeling

Microbial feature interactions are analyzed through:

Pairwise causal effect modeling (LinearDML on gene-gene products)
Identification of synergistic vs. antagonistic effects
Network construction with node centrality metrics
Visualization of causal hubs and clusters
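
A minimal sketch of the network step, assuming pairwise interaction estimates are already available as `(gene_a, gene_b, effect)` tuples. The gene names, effect values, threshold, and the rule "positive effect = synergistic, negative = antagonistic" are illustrative assumptions, not the repository's definitions. The pipeline itself lists networkx as a requirement; plain dictionaries are used here to keep the example dependency-free.

```python
from collections import defaultdict

# Hypothetical pairwise interaction estimates (illustrative values only).
pairwise_effects = [
    ("argA", "vfB", 0.42),
    ("argA", "vfC", -0.30),
    ("vfB", "vfC", 0.05),
    ("argA", "argD", 0.25),
]

def build_network(effects, min_abs_effect=0.1):
    """Keep edges whose absolute effect passes the threshold and label their sign."""
    adjacency = defaultdict(dict)
    for gene_a, gene_b, effect in effects:
        if abs(effect) < min_abs_effect:
            continue
        kind = "synergistic" if effect > 0 else "antagonistic"
        adjacency[gene_a][gene_b] = (effect, kind)
        adjacency[gene_b][gene_a] = (effect, kind)
    return adjacency

def degree_centrality(adjacency):
    """Fraction of the other nodes each node is connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbours) / (n - 1)
            for node, neighbours in adjacency.items()}

network = build_network(pairwise_effects)
centrality = degree_centrality(network)
hub = max(centrality, key=centrality.get)
print(hub, centrality)
```

Nodes with the highest centrality are the candidates for the "causal hubs" mentioned above.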

🔹 Step 6: External Validation

CausoBiome validates biomarker generalizability via:

Mann–Whitney tests in both internal and external datasets
Directional trend comparison (CRC vs. Control)
Concordance barplots and scatterplots
Venn diagrams of statistical overlap
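
The directional trend comparison can be sketched as follows, assuming each cohort is summarised as a mapping from feature name to `(crc_values, control_values)`. The data layout, feature names, and values are hypothetical. The U statistic is computed by direct pair counting; for p-values one would normally rely on a statistics library such as `scipy.stats.mannwhitneyu`.

```python
from statistics import median

def mann_whitney_u(x, y):
    """U statistic for x versus y: pairs with x > y count 1, ties count 0.5."""
    return sum(1.0 if xi > yi else 0.5 if xi == yi else 0.0
               for xi in x for yi in y)

def direction(crc, control):
    """+1 if the CRC median is higher, -1 if lower, 0 if equal."""
    diff = median(crc) - median(control)
    return (diff > 0) - (diff < 0)

def concordance(internal, external):
    """For each shared feature, does the CRC-vs-control direction agree?"""
    shared = internal.keys() & external.keys()
    return {feature: direction(*internal[feature]) == direction(*external[feature])
            for feature in shared}

# Hypothetical abundances: feature -> (CRC samples, control samples).
internal = {"geneA": ([5, 6, 7], [1, 2, 6]), "geneB": ([1, 1, 2], [3, 4, 5])}
external = {"geneA": ([4, 8, 9], [2, 3, 3]), "geneB": ([6, 7, 8], [1, 2, 2])}

agreement = concordance(internal, external)
print(agreement)
print(mann_whitney_u(*internal["geneA"]))
```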

🧬 Who Should Use CausoBiome?

CausoBiome is ideal for:

Microbiome researchers aiming for causal inference beyond correlation
Cancer biologists exploring functional microbial signatures
Clinical bioinformaticians validating microbial biomarkers across cohorts
Systems biologists modeling microbial interactions and networks

Pipeline Structure

```text
CausoBiome/
├── 01QualityNormalizationBatchecology/
│   ├── 01magqualitymetricsanalysis.py
│   ├── 02Taxoncountsnormalization.py
│   ├── 03Taxonsubsetpathogenicsamples.py
│   ├── 04taxonbatchcorrection.R
│   ├── 05taxonecology.R
│   ├── 06argvfnormalization.py
│   ├── 07argvfbatchcorrection.R
│   ├── 08argvfecology.R
│   └── 01READMEModuleQCNormalization.md
│
├── 02MachinelearningFeatureselection/
│   ├── 01mlbenchmarkARG-VF.sh
│   ├── 02RFTUNINGARG-VF.sh
│   ├── 03ARGVFdatageneration.py
│   ├── 04validationmodel.sh
│   ├── 05Getfeatureimportance.sh
│   ├── 06retraintop20.sh
│   ├── 07foldwiseComparisonARG-VF.R
│   ├── 08ARGVFOrdinalforest.R
│   ├── 09taxonMachinelearningfeature.sh
│   ├── 10TaxonOrdinalforest.R
│   ├── 11taxonstatscorrelations.py
│   └── 02READMEModuleMachineLearningFULL.md
│
├── 03CausalAnalysis/
│   ├── 01Complexheatmapargvf.R
│   ├── 02causalanalysis.py
│   └── 03READMEModuleCausalAnalysis.md
│
├── 04Externalcohortvalidation/
│   ├── ExternalcohortValidation.ipynb
│   └── 04READMEModuleExternalValidation.md
│
├── data/
│   ├── combinedARGVFDBfinalDATA.csv
│   ├── MetadataAlignedVFARGCountMatrix.csv
│   └── ... (external validation matrices)
│
└── READMECausoBiome_Detailed.md
```

🚀 What Makes CausoBiome Novel?

CausoBiome differs from typical microbiome pipelines by addressing not just what is different but what functionally drives disease progression. Its novelty lies in several aspects:

1. Stage-Aware Ordinal Modeling

Directly models disease trajectory: Healthy → Adenoma → Cancer
Employs Ordinal Forests and Double Machine Learning (DML) to capture progression-aware microbial signals

2. Causal Inference Core

Implements DML via EconML to estimate Average Treatment Effects (ATEs) per feature
Controls for covariates like Age, Sex, BMI using flexible machine-learned nuisance models
Computes:
- E-values for sensitivity to unmeasured confounding
- Bootstrap confidence intervals for robustness
- Required sample sizes for validation studies
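
For readers unfamiliar with E-values, the sketch below applies the standard formula for a risk ratio, E = RR + sqrt(RR × (RR − 1)), after inverting ratios below 1. How the pipeline converts a DML effect estimate into a risk-ratio scale is not described here, so the input values are purely illustrative and the function name `e_value` is an assumption.

```python
import math

def e_value(risk_ratio):
    """E-value for an observed risk ratio (ratios below 1 are inverted first)."""
    if risk_ratio <= 0:
        raise ValueError("risk_ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))

# Illustrative risk ratios only; not results from the pipeline.
for rr in (1.0, 2.0, 0.5):
    print(rr, round(e_value(rr), 3))
```

An E-value of 1 means no unmeasured confounding is needed to explain the association away; larger values indicate that only a stronger unmeasured confounder could do so.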

3. Synthetic Cohort Generation

Produces PCA-enhanced synthetic metagenomes
Enables model testing under both balanced and realistic class distributions

4. Intervention-Ready Outputs

Identifies synergistic and antagonistic microbial gene interactions
Constructs causal and co-occurrence networks from inferred ATE effects
Annotates ARGs and VFs with known mechanisms and matched microbial species

⚙️ Key Functional Highlights

| Component | Description |
|---|---|
| Ecological Analysis | Rich alpha/beta diversity metrics, NMDS ordination, dispersion tests |
| Machine Learning | Foldwise benchmarking of RF, SVM, KNN, GB, Logistic Regression |
| Feature Robustness | Combined permutation importance + bootstrap stability (Top 20 features) |
| Ordinal Forest | Identifies rank-aware discriminative genes and taxa |
| Signature Score | Mean and PCA1 scores for individual-level CRC burden assessment |
| Causal Estimation | ATEs from LinearDML with Random Forests as base learners |
| Interaction Effects | Gene × gene product modeling for combined causal effects |
| External Validation | Tests trend consistency in independent cohort (e.g., PRJEB10878) |
| Heatmaps & Networks | Visualizes consistent biomarkers and their interaction hubs |

🔄 Extensibility

CausoBiome is designed with modularity and disease-agnostic flexibility:

Supports any disease with ordered clinical stages (e.g., liver fibrosis, IBD, NAFLD)
Compatible with metabolomic, proteomic, or transcriptomic feature matrices
Adaptable to longitudinal or time-series microbiome datasets

Pipeline Origin

CausoBiome is an extension of the upstream genome-resolved-urban-microbiome-biosurveillance workflow in:

GitHub: genome-resolved-urban-microbiome-biosurveillance

Users should start from the 01_Bioinformatics module, then proceed to the 02_Quality_batch_subsetting module and run the script normalize_species_counts.py.
They can then transition into CausoBiome, starting from the script 01_mag_quality_metrics_analysis.py in the 01_Quality_Normalization_Batch_ecology module.

Requirements

SLURM-based HPC environment (for .sh scripts)
Python ≥ 3.11 and R ≥ 4.0
Pip packages: pandas, numpy, scikit-learn, matplotlib, seaborn, joblib, econml, networkx, statsmodels
R packages: vegan, sva, ggplot2, umap, pairwiseAdonis, FSA, etc.

Tools/Databases

VFDB – Virulence Factor Database
CARD – Comprehensive Antibiotic Resistance Database
ComBat – Batch correction via the sva R package
EconML – Causal inference library for treatment effect estimation
scikit-learn – Model training, permutation importance, ROC/AUC scoring

Output Highlights

Diversity metrics, ordination plots
Classification metrics (F1, MCC, AUROC)
Feature importance/stability plots
Causal effect estimates for microbial features (DML ATE)
Robustness plots (E-values, bootstraps)
Microbial interaction networks (weighted, annotated)

Use Cases

Functional microbiome biomarker discovery
Ecological profiling of CRC microbiomes
Translational microbiome-based risk stratification
Design of synthetic consortia or microbial interventions

License

MIT License — free to use, adapt, and cite with attribution.

This framework, referred to as CausoBiome, is currently part of a manuscript under peer review. This repository is shared under the MIT License to promote transparency and reproducibility.

We kindly request that you do not republish or repackage this methodology before journal publication.

Citation

If you use CausoBiome, please cite the following manuscript:

Ascandari, A., Aminu, S., Benhida, R., & Daoud, R. (2025).
A Core Genome-Resolved Microbial Resistome–Virulome Hub Causally Drives Colorectal Cancer Progression.

Submitted Articles Related to the Framework

Ascandari, A., Aminu, S., Benhida, R., & Daoud, R. (2025).
A Core Genome-Resolved Microbial Resistome–Virulome Hub Causally Drives Colorectal Cancer Progression (under review).

Contact

For questions, feedback, or collaboration regarding this framework, please reach out:

AbdulAziz Ascandari, PhD Researcher, Department of Chemical and Biochemical Sciences, University Mohammed VI Polytechnic (UM6P), Morocco, abdulaziz.ascandari@um6p.ma

Prof. Rachid Daoud, Group Leader & Supervisor, Department of Chemical and Biochemical Sciences, University Mohammed VI Polytechnic (UM6P), Morocco, rachid.daoud@um6p.ma

Owner

Name: AbdulAziz Ascandari
Login: AscanOfficiel2
Kind: user
Location: Ben Guerir, Morocco
Company: University Mohammed VI Polytechnic

Repositories: 1
Profile: https://github.com/AscanOfficiel2

I am a doctoral researcher in biomedical science and a computational biology enthusiast. I enjoy watching blockbuster movies and going on nature walks.