investigut

Lineage-specific microbial protein prediction facilitates exploration of protein ecology within the human gut.

https://github.com/matt-schmitz/investigut

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 2 DOI reference(s) in README
✓
Academic publication links
Links to: nature.com, zenodo.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (16.0%) to scientific vocabulary

Last synced: 9 months ago · JSON representation ·

Repository

Lineage-specific microbial protein prediction facilitates exploration of protein ecology within the human gut.

Basic Info

Host: GitHub
Owner: Matt-Schmitz
License: gpl-3.0
Language: Python
Default Branch: main
Homepage:
Size: 24.3 MB

Statistics

Stars: 3
Watchers: 2
Forks: 2
Open Issues: 0
Releases: 4

Created about 2 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

What is InvestiGUT?

InvestiGUT has been designed to allow users to study the ecological prevalence of a protein sequence across human gut microbiome samples.

How does InvestiGUT work?

InvestiGUT (v1.1) accepts either a single protein sequence or a set of sequences, as defined by the options ‘–s’ or ‘–m’ respectively. User-provided protein sequences are matched to the collection of human gut microbiome proteins via DIAMOND (v2.1.11) alignment with a default minimum query and subject coverage of 90% and identity of 90%, which can be defined by the user. User proteins can be analysed individually (-s), or as a group (-m) where all proteins must be present within a sample to be considered. The group analysis allows users to determine how frequently all enzymes in a certain pathway or all subunits in a protein complex are present. Output is divided into metagenomic and species-based analysis results, with both being available as raw data in the form of TSV files and as automatically generated vector figures.

Metadata for each metagenomic sample included age, sexgender, BMI, smoker-status, westernised-diet status, use of antibiotics, and geographical location. Additionally, disease prevalence includes output for Crohn’s disease, ulcerative colitis, colorectal cancer, Clostridioides difficile infection, type-2 diabetes, rheumatoid arthritis, and fatty liver disease. Metadata vocabulary is consistent with that used in the curatedMetagenomeData repository. Prevalence of the queried sequence in each of these groups is quantified, and compared either between categories, or healthy controls, using Fisher’s exact test from the scipy.stats module in Python.

Querying the user input protein against the 3,594 nonredundant genomes provides a list of species which contain the protein, termed “function-positive species”. The taxonomic range of the protein is determined by the prevalence of the protein with each genus, family, and phylum of positive species. The function-positive fraction is obtained as the cumulative relative abundance of function-positive genomes across 4,624 individuals for which pre-computed relative abundance values are available.

Instructions

Clone the repository.
bash git clone https://github.com/Matt-Schmitz/InvestiGUT.git
Enter the InvestiGUT folder.
bash cd InvestiGUT
Create the environment with mamba.
bash mamba create --no-channel-priority -n investigut \ -c bioconda -c conda-forge \ "python=3.11" "numpy=1.24.3" "scipy=1.10.1" \ "conda-forge::matplotlib-base" "seaborn=0.13.0" \ "libsqlite=3.48.0" "pandas=1.5.3" "statsmodels=0.13.5" \ "ete3=3.1.2" "openpyxl=3.0.10" "bioconda::diamond=2.1.11"
Activate the environment.
bash mamba activate investigut
Download proteins and create DIAMOND database. After the DIAMOND database has been created, the compressed fasta data can be deleted.

[!WARNING] The compressed data take up 79GB (two files). The DIAMOND database will take up 311GB.

bash ./download_fasta.sh 6. Add the InvestiGUT folder to your ~/.bashrc and apply the changes.
- Add to .bashrc: export PATH="/PATH/TO/InvestiGUT:$PATH" - Enter in terminal: bash source ~/.bashrc - Reactivate the mamba environment: bash mamba activate investigut

Run InvestiGUT on chosen proteins.
Basic usage:

bash investigut.py -i /PATH/TO/FASTA/FILE

Options list:

[!TIP] Any arguments not recognized by InvesitGut will be passed on to DIAMOND (e.g., --threads, --faster, --ultra-sensitive). -h, --help show this help message and exit -i {INPUT} input fasta file -o {OUTPUT} output folder -s single mode (default) -m multi mode -d {DMND_TSV_OUTPUT} optional DIAMOND .tsv output file to skip alignment --query-cover {NUM} DIAMOND query-coverage (default 90.0) --subject-cover {NUM} DIAMOND subject-coverage (default 90.0) --id {NUM} DIAMOND %id (default 90) --low overrides default DIAMOND query/sujbect coverage and %id giving all a value of 50

Example Usage

In the example folder, a file called "seaweed.fa" contains two bacterial proteins, Bp1670 and Bp1689, that are involved in seaweed digestion (taken from https://www.nature.com/articles/nature08937). In multi mode (-m), metagenomes and metagenome-assembled-genomes (MAGs) containing all proteins in the fasta file are counted as positive, whereas in single mode (-s), metagenomes and MAGs are separately examined for each protein sequence. The following command searches for matches to both proteins (-m) using a subject/query-coverage of 50% and a percetage identity threshhold of 50% (--low). bash investigut.py -i /PATH/TO/seaweed.fa -m --low The resulting output can be found under ./examples/Bp1670 + Bp1689. In this folder, the "overview.txt" file contains global prevalence within the metagenomes, disease prevalence statistics, country prevalence statistics, and other demographic factor data (smokers vs. non-smokers, BMI, gender, age by decade of life, and antibiotic usage). The MAG overview data include a list of positive gut bacteria, occurrence of the protein(s) by taxonomic rank, and the cumulative relative abundance of species containing the protein(s) of interest within the two cohorts of the origin data (https://www.nature.com/articles/s41467-022-31502-1).

The following is an example of an auto-generated figure of prevalence by country found within the output folder after running the seaweed digestion proteins mentioned above through InvestiGUT.

Prevalence_of_Bp1670 + Bp1689_by_Country

Database Creation

[!TIP] A premade database is automatically downloaded when following the instructions above. The steps below are only necessary to create a custom database.

Clone the repository.
bash git clone https://github.com/Matt-Schmitz/InvestiGUT.git
Enter the InvestiGut folder.
bash cd InvestiGUT
Create the environment with mamba.
bash mamba create --no-channel-priority -n investigut_database \ -c conda-forge -c bioconda \ "python=3.9" "conda-forge::gsl=2.7" \ "bioconda::augustus=3.5.0" fraggenescan phanotate \ "bioconda::prodigal=2.6.3" kraken2 "bioconda::pyrodigal" \ glimmerhmm metagene_annotator snap ete3=3.1.2 \ "bioconda::nextflow=24.04.2"
Activate the environment.
bash mamba activate investigut_database
With Kraken 2 install in the mamba environment, you need to create the taxonomy prediction database. > [!TIP] > Check out the Kraken 2 manual for a more detailed list of options.

Build the standard database. bash kraken2-build --standard --db $DBNAME Alternatively, a custom database can be created. The following database includes bacteria, archaea, and viruses from NCBI. bash kraken2-build --download-taxonomy --db $DBNAME kraken2-build --download-library bacteria --db $DBNAME kraken2-build --download-library archaea --db $DBNAME kraken2-build --download-library viral --db $DBNAME kraken2-build --build --db $DBNAME

Make the Kraken 2 database available in RAM for fast taxonomy prediciton.

Make the tmpfs folder that will be mounted. bash mkdir /PATH/TO/tmpfs

Edit the system configuration file. bash sudo nano /etc/fstab

In the fstab file, add the following line. Modify the path to the tmpfs directory and the size to be large enough for the hash.k2d, opts.k2d, and taxo.k2d files created in step 5.

bash tmpfs /PATH/TO/tmpfs tmpfs rw,nodev,noexec,nosuid,size=120g,user,noauto 0 0

Reload the modified fstab file. bash sudo systemctl daemon-reload

Move the Kraken 2 files into the tmpfs directory. bash mv /PATH/TO/hash.k2d /PATH/TO/tmpfs/hash.k2d mv /PATH/TO/opts.k2d /PATH/TO/tmpfs/opts.k2d mv /PATH/TO/taxo.k2d /PATH/TO/tmpfs/taxo.k2d

Make GeneMark* software available to InvestiGUT. The lisence of GeneMark* software does not allow them to be packaged with InvestiGUT, so they must be manually downloaded.

https://genemark.bme.gatech.edu/GeneMark/license_download.cgi text GeneMarkS -> Add /PATH/TO/gmsn.pl to PATH for usage with InvestiGUT GeneMarkS2 -> Add /PATH/TO/gms2.pl to PATH for usage with InvestiGUT GeneMarkST -> Add /PATH/TO/gmst.pl to PATH for usage with InvestiGUT MetaGeneMark -> Add /PATH/TO/gmhmmp to PATH for usage with InvestiGUT MetaGeneMark2 -> Add /PATH/TO/run_mgm.pl to PATH for usage with InvestiGUT

Set parameters and run the pipeline.

In the pipeline.nf file, there are the following options.

```bash params.inputDir = '/PATH/TO/input' inputDirMAGs = params.inputDir + "/MAGs" inputDirMetagenomes = params.inputDir + "/metagenomes" params.outputDir = '/PATH/TO/output' params.krakendbDir = '/PATH/TO/tmpfs'

params.threads = Runtime.runtime.availableProcessors() // maximum number available params.memoryMapping = true pathToNodesDmp = "/PATH/TO/nodes.dmp"

// GeneMark* tools are available for use after download: params.toolsfororg_group = """{ "Archaea":["FragGeneScan", "GeneMarkST","MetaGeneMark"], "Bacteria":["MetaGeneAnnotator","MetaGeneMark","Pyrodigal"], "Eukaryota":["Augustus","Pyrodigal","Snap"], "Viruses":["MetaGeneAnnotator","GeneMarkS","GeneMarkS2"] }"""

// Leaving these paths blank will assume that they have been added to PATH // The lisence of GeneMark* software does not allow them to be packaged with InvestiGUT pathToGeneMarkS = "" // "/PATH/TO/gmsn.pl" pathToGeneMarkS2 = "" // "/PATH/TO/gms2.pl" pathToGeneMarkST = "" // "/PATH/TO/gmst.pl" pathToMetaGeneMark = "" // "/PATH/TO/gmhmmp" pathToMetaGeneMark2 = "" // "/PATH/TO/run_mgm.pl" ```

[!WARNING] The input files must be in the format {studyname}{sampleid}.fa(.gz)? as studyname and sampleid appear in https://waldronlab.io/curatedMetagenomicData/

Run the pipeline. bash nextflow pipeline.nf

Owner

Login: Matt-Schmitz
Kind: user

Repositories: 1
Profile: https://github.com/Matt-Schmitz

Citation (CITATION.cff)

cff-version: 1.2.0
title: InvestiGut
message: >-
  Lineage-specific microbial protein prediction within the
  human gut
type: software
authors:
  - given-names: Matthias Andreas
    family-names: Schmitz
    orcid: 'https://orcid.org/0009-0008-4418-4012'
    affiliation: >-
      Functional Microbiome Research Group, RWTH University
      Hospital
  - given-names: Nicholas John
    family-names: Dimonaco
    affiliation: >-
      Department of Medicine, McMaster University; Farncombe
      Family Digestive Health Research Institute, McMaster
      University; School of Biological Sciences, Queen’s
      University Belfast; Department of Computer Science,
      Aberystwyth University
    orcid: 'https://orcid.org/0000-0002-3808-206X'
  - given-names: Thomas
    family-names: Clavel
    orcid: 'https://orcid.org/0000-0002-7229-5595'
    affiliation: >-
      Functional Microbiome Research Group, RWTH University
      Hospital
  - affiliation: >-
      Functional Microbiome Research Group, RWTH University
      Hospital
    orcid: 'https://orcid.org/0000-0003-2244-7412'
    given-names: Thomas Charles Adrian
    family-names: Hitch
repository-code: 'https://github.com/Matt-Schmitz/InvestiGut'

GitHub Events

Total

Create event: 9
Issues event: 2
Release event: 5
Watch event: 1
Delete event: 7
Issue comment event: 5
Push event: 4

Last Year

Create event: 9
Issues event: 2
Release event: 5
Watch event: 1
Delete event: 7
Issue comment event: 5
Push event: 4

Dependencies

data/metadata/KarlssonFH_2013/setup.py pypi

environment.yml conda

diamond 2.1.8.*
ete3 3.1.2.*
matplotlib 3.7.1.*
numpy 1.24.3.*
openpyxl 3.0.10.*
pandas 1.5.3.*
python 3.11.*
scipy 1.10.1.*
seaborn 0.13.0.*
statsmodels 0.13.5.*

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science