atavide_lite

A simpler version of atavide that relies only on slurm or PBS scripts. Some of the settings are specific for our compute resources

https://github.com/linsalrob/atavide_lite

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary

Keywords

bioinformatics metagenomics metagenomics-analysis metagenomics-bioinformatics metagenomics-pipeline metagenomics-toolkit microbiome microbiome-analysis-pipelines microbiome-workflow
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: linsalrob
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 547 KB
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 2
  • Open Issues: 5
  • Releases: 2
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md


atavide lite

Atavide lite is a simple yet complete workflow for metagenomics data analysis, including QC/QA, optional host removal, annotation, assembly and cross-assembly, and individual read-based annotations.

The motivation comes from the more complete atavide pipeline we built, which uses snakemake as a workflow manager. We found that solution effective, but routine failures at different steps were hard to debug and trace. In addition, as we move between compute resources, we need to adjust the time and memory requirements for each step, and that was not easy to do with snakemake.

Our goal is to provide a simple, easy to use, and easy to understand pipeline that can be used for metagenomics data analysis, but one that is broken down into individual steps that can be run one at a time, and that can be easily modified to suit your compute resources.

Our solution is to craft a series of scripts suitable for different clusters. We still lean on snakemake for some parts of the pipeline, but we run each part separately, so it is straightforward to see what has worked and what has failed. We provide generic slurm and PBS scripts that will run on most clusters, and then there are specific scripts for the Pawsey Supercomputing Centre setonix system, Flinders University's deepthought cluster, and the National Computational Infrastructure Gadi system. These are the machines that we use every day, and so we maintain those scripts to ensure that they work for us. If you would like to amend the scripts for your cluster, please submit a pull request, or open an issue, and we will try to address it.

In our experience, each cluster has enough minor differences that it is easier to maintain individual scripts for each cluster, rather than trying to make a single script that works on all clusters.

Pipeline steps

Our pipeline is designed to be run in a series of steps, and each step can be run independently. In our day to day work we lean heavily on the --dependency option in sbatch to ensure that each step is run only after the previous step has completed successfully.
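As an illustrative sketch of this pattern (the script names here are hypothetical, not the actual atavide_lite scripts), chaining steps with sbatch looks like:

```shell
# Hypothetical sketch of chaining pipeline steps with sbatch.
# --parsable makes sbatch print just the job ID, and
# --dependency=afterok ensures a step starts only after the previous
# step has completed successfully.
QC=$(sbatch --parsable step1_fastp.slurm)
HOST=$(sbatch --parsable --dependency=afterok:$QC step2_host_removal.slurm)
TAX=$(sbatch --parsable --dependency=afterok:$HOST step3_mmseqs_taxonomy.slurm)
```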

  1. Run fastp to trim Illumina or Nanopore barcodes. We provide those in adapters
  2. Use minimap2 and samtools to separate host from not-host reads. Currently, host reads are ignored. The host can be human, sharks, coral, or anything else.
  3. Use mmseqs easy-taxonomy to compare not host reads to UniRef. By default we use UniRef50 but you could use any version.
  4. Create a taxonomic summary for each sample and make a single .tsv file.
  5. Connect in the subsystems from BV-BRC, and make a table that includes subsystems and taxonomic information
  6. Create a subsystems taxonomy for the data
  7. Use vamb to bin the reads into MAGs

We have described all the steps in a detailed description of the workflows.

different versions

Paired vs Single End

In our current processing, we have:

  • Paired end reads from MGI or Illumina sequencing. Those files usually end _R1.fastq.gz and _R2.fastq.gz, and the code looks for those.
  • Single end reads from ONT sequencing. These files end .fastq.gz

We have two versions of the pipeline that work with either paired end or single end, and you need to choose the appropriate version for your data.

However: if you download some sequences from SRA, ENA, or DDBJ, they may have paired end reads that end _1.fastq.gz and _2.fastq.gz, in which case you should change the names (see the README for a simple command). You might also have Illumina single end reads (which is old school!), in which case you should use the pawsey minion pipeline to process that data.
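The rename can be done with a short shell loop; this is a hypothetical sketch (the README's own command may differ), assuming the read files sit in the current directory:

```shell
# Sketch: convert SRA/ENA-style _1/_2 names to the _R1/_R2 names
# that the paired end pipeline looks for.
for f in *_1.fastq.gz; do
    [ -e "$f" ] || continue                  # skip if nothing matches
    mv "$f" "${f%_1.fastq.gz}_R1.fastq.gz"
done
for f in *_2.fastq.gz; do
    [ -e "$f" ] || continue
    mv "$f" "${f%_2.fastq.gz}_R2.fastq.gz"
done
```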

See the versions:

  • pawsey slurm -- use this for paired end (R1 and R2) reads. Although it's designed to run on Pawsey's setonix, it will probably work on any system with a /scratch drive.
  • pawsey minion -- use this for single end reads. Also designed to run on Pawsey's setonix.
  • deepthought_slurm -- designed to work on Flinders' deepthought infrastructure. This is esoteric and probably not portable, because the deepthought system has a $BGFS drive that is used for temporary storage.
  • nci_pbs -- designed to work on the NCI infrastructure.

Currently the pipeline depends on the following software:

  • samtools>=1.20
  • fastp
  • minimap2>=2.29
  • checkm-genome
  • mmseqs2
  • megahit
  • rclone
  • rsync
  • parallel
  • pigz
  • pytaxonkit
  • snakemake
  • sra-tools
  • snakemake-executor-plugin-cluster-generic
  • taxonkit

You can install all of these with:

mamba env create -f ~/atavide_lite/atavide_lite.yaml

Note: if you are using an ephemeral system like Pawsey, we also have a mechanism for making temporary conda installations. See the pawsey slurm or pawsey minion READMEs.

If you use atavide_lite, please cite it, and then please also cite the other papers that describe these great tools.

System nuances

This is not a comprehensive list of the nuances of each system, but it provides some of the differences that we run into and the motivation for some of the choices we made in the scripts.

Flinders' deepthought

Deepthought uses slurm for scheduling and has a $BGFS drive that is used for temporary storage. This is fast local storage that is only available to the compute node that you are running on; it is not shared between nodes. For most processing, it is a lot quicker to transfer the data to the $BGFS drive, run the processing there, and finally copy the files back to the working directory when they are complete. This is especially true for processes that use a lot of memory and create temporary files, such as megahit and mmseqs2.
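The stage-in/stage-out pattern described above might look like this in a slurm job script; the paths and filenames are illustrative only, not taken from the actual atavide_lite scripts:

```shell
#!/bin/bash
#SBATCH --job-name=assembly
# Hypothetical sketch: stage reads onto the node-local $BGFS drive,
# run the heavy step there, then copy the results back to the
# directory the job was submitted from.
cp "$SLURM_SUBMIT_DIR"/sample_R1.fastq.gz "$SLURM_SUBMIT_DIR"/sample_R2.fastq.gz "$BGFS"/
cd "$BGFS"
megahit -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o megahit_out
cp -r megahit_out "$SLURM_SUBMIT_DIR"/
```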

Pawsey Supercomputing Centre's Setonix

Setonix uses slurm for scheduling and has a /scratch drive that is used for temporary storage and is available from all the compute nodes.

NCI's Gadi

Gadi uses PBS for scheduling and has a /g/data drive that is used for temporary storage. Gadi does not allow array jobs, so we have to run each step separately.

Gadi also has fast local drives that are accessible from the compute nodes; they are at $PBS_JOBFS.
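The same stage-in/stage-out idea on Gadi, using $PBS_JOBFS, might be sketched as below; the resource requests and filenames are illustrative (on Gadi, jobfs space has to be requested explicitly):

```shell
#!/bin/bash
#PBS -l ncpus=16
#PBS -l mem=64GB
#PBS -l jobfs=100GB
# Hypothetical sketch: run a memory- and IO-heavy step on the fast
# node-local $PBS_JOBFS drive, then copy results back to the
# directory the job was submitted from.
cd "$PBS_JOBFS"
cp "$PBS_O_WORKDIR"/sample.fastq.gz .
# ... run the heavy step here on the local drive ...
cp -r results "$PBS_O_WORKDIR"/
```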

Owner

  • Name: Rob Edwards
  • Login: linsalrob
  • Kind: user
  • Location: Adelaide, Australia
  • Company: Flinders University

Professor of CS and Biology. Writing bioinformatics code to study viruses, phages, and metagenomes.

Citation (citation.cff)

cff-version: 1.2.0
title: atavide_light
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Robert
    name-particle: Robert
    family-names: Edwards
    email: raedwards@gmail.com
    affiliation: Flinders University
    orcid: 'https://orcid.org/0000-0001-8383-8949'
identifiers:
  - type: doi
    value: 10.5281/zenodo.8221008
    description: Zenodo repository of release 0.1
repository-code: 'https://github.com/linsalrob/atavide_lite'
abstract: >-
  atavide_light is a series of steps for processing
  metagenomics data. Each step is independent but you will
  end up with a fully processed metagenome.
keywords:
  - metagenome
  - DNA sequencing
  - microbiome
  - bacteria
  - virus
license: MIT

GitHub Events

Total
  • Create event: 1
  • Issues event: 5
  • Release event: 1
  • Watch event: 1
  • Issue comment event: 2
  • Push event: 100
Last Year
  • Create event: 1
  • Issues event: 5
  • Release event: 1
  • Watch event: 1
  • Issue comment event: 2
  • Push event: 100