https://github.com/arcadia-science/2022-mtx-not-in-mgx-pairs

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.5%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Arcadia-Science
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 1.69 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 7
  • Releases: 0
Created almost 4 years ago · Last pushed over 3 years ago
Metadata Files
Readme

README.md

Identifying sequences that are in a metatranscriptome but not in a metagenome

This repository curates a set of paired metagenomes and metatranscriptomes and provides a pipeline to rapidly identify the fraction of sequences in a metatranscriptome that are not in a metagenome. The pipeline is shaped around a metadata file, inputs/metadata-paired-mgx-mtx.tsv, that contains a sample name (sample_name), a metagenome SRA run accession (mgx_run_accession; SRR*, ERR*, DRR*), a metatranscriptome SRA run accession (mtx_run_accession), and a sample type (sample_type). Using the run accessions, the pipeline downloads the sequencing data from the SRA and generates a FracMinHash sketch of each run. Then, it uses the pairing information encoded in the metadata table to subtract the metagenome sketch from the metatranscriptome sketch. This produces an estimate of the fraction of metatranscriptome sequences not found in the paired metagenome. These estimates are also grouped by sample_type to generate biome-specific estimates. The pipeline also analyzes the fraction of metatranscriptome-specific sequences that are shared between samples to discover what fraction of sequences we are systematically missing within and across biomes.
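The sketch subtraction described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the pipeline's implementation: the function names `fracminhash` and `fraction_not_in`, the SHA-1 based hash, the k-mer size, and the example sequences are all hypothetical, and reverse-complement handling is omitted for brevity.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space used in this toy example


def fracminhash(seq, ksize=5, scaled=1):
    """Toy FracMinHash: keep k-mer hashes that fall below MAX_HASH / scaled.

    With scaled=1 every k-mer hash is kept; larger values keep roughly
    1/scaled of all distinct k-mers.
    """
    threshold = MAX_HASH // scaled
    kept = set()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize]
        # Assumption: SHA-1 truncated to 64 bits stands in for the hash function.
        digest = hashlib.sha1(kmer.encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        if h < threshold:
            kept.add(h)
    return kept


def fraction_not_in(mtx_hashes, mgx_hashes):
    """Fraction of metatranscriptome hashes absent from the metagenome sketch."""
    if not mtx_hashes:
        return 0.0
    return len(mtx_hashes - mgx_hashes) / len(mtx_hashes)


# Hypothetical sequences standing in for a paired metagenome and metatranscriptome.
mgx_sketch = fracminhash("ACGTACGTAC")
mtx_sketch = fracminhash("ACGTACGTACGGGTT")
mtx_only_fraction = fraction_not_in(mtx_sketch, mgx_sketch)
print(f"fraction of mtx sketch not in mgx sketch: {mtx_only_fraction:.2f}")
```

Because both sketches are plain sets of hashes, the subtraction is a set difference, which is what makes the comparison fast even for large samples.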

Some metagenome and metatranscriptome pairs are true pairs that were extracted from the same sample, while others come from separate samples taken from the same location at the same time.
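As an illustration of the metadata table layout, the following sketch reads a tab-separated table with the four columns named above and checks that each run accession starts with SRR, ERR, or DRR. The helper name `read_pairs`, the accession pattern, and the example row (including its accessions) are hypothetical placeholders, not values from inputs/metadata-paired-mgx-mtx.tsv.

```python
import csv
import io
import re

# Assumption: run accessions are SRR/ERR/DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")
COLUMNS = ["sample_name", "mgx_run_accession", "mtx_run_accession", "sample_type"]


def read_pairs(handle):
    """Return one dict per metagenome/metatranscriptome pair in the table."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    pairs = []
    for row in reader:
        for col in ("mgx_run_accession", "mtx_run_accession"):
            if not RUN_ACCESSION.match(row[col]):
                raise ValueError(f"unexpected accession in {col}: {row[col]}")
        pairs.append(row)
    return pairs


# Made-up placeholder row; these are not real SRA accessions.
example = (
    "sample_name\tmgx_run_accession\tmtx_run_accession\tsample_type\n"
    "demo_sample\tSRR0000001\tSRR0000002\tsoil\n"
)
pairs = read_pairs(io.StringIO(example))
for pair in pairs:
    print(pair["sample_name"], pair["mgx_run_accession"], pair["mtx_run_accession"])
```

To read an actual file, pass an open file handle instead of the `io.StringIO` object.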

Getting started with this repository

This repository uses snakemake to run the pipeline and conda to manage software environments and installations. You can find operating system-specific instructions for installing miniconda here. After installing conda and mamba, run the following commands to create and activate the pipeline run environment. We installed Miniconda3 version py39_4.12.0 and mamba version 0.15.3.

mamba env create -n mtx_mgx --file environment.yml
conda activate mtx_mgx

To start the pipeline, run: snakemake --use-conda -j 2

Running this repository on AWS

This repository was executed on an AWS EC2 instance (Ubuntu 22.04 LTS ami-085284d24fe829cd0, t2.large, 500 GiB EBS gp2 root storage). The instance was configured using the following commands:

```
curl -JLO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh # download the miniconda installation script
bash Miniconda3-latest-Linux-x86_64.sh # run the miniconda installation script. Accept the license and follow the defaults.
source ~/.bashrc # source the .bashrc for miniconda to be available in the environment

# configure miniconda channel order
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
conda install mamba # install mamba for faster software installation.
```

Once miniconda is configured, clone the repository and cd into it, then follow the instructions in the section above.

git clone https://github.com/Arcadia-Science/2022-mtx-not-in-mgx-pairs.git
cd 2022-mtx-not-in-mgx-pairs

Note that there is a known issue with the conda package sra-tools=2.11.0 that will prevent this workflow from running on Mac operating systems.

Next steps

TBD

Owner

  • Name: Arcadia Science
  • Login: Arcadia-Science
  • Kind: organization
  • Location: United States of America

Dependencies

environment.yml conda
  • pandas 1.4.3.*
  • snakemake-minimal 7.12.1.*
  • sourmash-minimal 4.4.3.*