Science Score: 75.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README
- ✓ Academic publication links: Links to ncbi.nlm.nih.gov
- ○ Academic email domains
- ✓ Institutional organization owner: Organization sydney-informatics-hub has institutional domain (sydney.edu.au)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Sydney-Informatics-Hub
- License: gpl-3.0
- Language: Shell
- Default Branch: main
- Size: 271 KB
Statistics
- Stars: 0
- Watchers: 5
- Forks: 1
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Gadi-Trinity
Description
This repository contains a staged Trinity workflow that can be run on the National Computational Infrastructure’s (NCI) Gadi supercomputer. Trinity performs de novo transcriptome assembly of RNA-seq data by combining three independent software modules (Inchworm, Chrysalis and Butterfly) to process RNA-seq reads. The algorithm can detect isoforms and handle paired-end reads, multiple insert sizes and strandedness. For more information see the Trinity user guide.
The Gadi-Trinity workflow leverages multiple nodes on NCI Gadi to run a number of Butterfly processes in parallel. This workflow is suitable for single sample and global assemblies of genomes < 2 Gb.
Set up
This repository contains all scripts and software required to run Gadi-Trinity. Before running this workflow, you will need to do the following:
- Clone the Gadi-Trinity repository from GitHub (see ‘Installation’ below)
- Prepare the module archive by running create-apps.sh from the resources directory (see ‘Software requirements’ below)
- Copy the template submission script Scripts/template.sh into the project directory and edit it for the project. Give it a meaningful name.
- Make a list of fastq files to be submitted (see ‘Input’ below)
- Edit key input variables in template.sh (see ‘Input’ below)
  - project= (er00)
  - list= (path/to/fastq/list)
  - seqtype= (fq or fa)
  - tissue= make sure this pulls the correct field from your fq file name
  - storage= the string to pass the PBS storage key, ie scratch/ +gdata/
  - version= Choose from Trinity version 2.9.1 or 2.12.0
Installation
Clone the Gadi-Trinity repository to your project’s scratch directory:
module load git
git clone https://github.com/Sydney-Informatics-Hub/Gadi-Trinity.git
Software requirements
Trinity requires the following software, which is loaded as modules from apps already installed on Gadi. A module archive of this software is created by running create-apps.sh in the resources directory.
trinity/2.9.1 or trinity/2.12.0
bowtie2/2.3.5.1
samtools/1.10
salmon/1.1.0
python2/2.7.17 or python3/3.7.4
jellyfish/2.3.0
Input
A plain text file listing the input fastq files is required. Each row corresponds to one sample and consists of three columns: column 1 is an incremental number (for the job array), column 2 is the read 1 name and column 3 is the read 2 name. This file can be created by running the following from the directory containing your fastq files:
readlink -f *.fastq.gz | sort -V | xargs -n 2 | cat -n > fastq.list
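Before submitting, it can help to confirm that the list has the three-column layout described above. The following is a minimal sketch, not part of the Gadi-Trinity scripts: the function name check_fastq_list is hypothetical, and it assumes file paths contain no whitespace.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with Gadi-Trinity): verify that every row
# of a fastq list has exactly 3 whitespace-separated columns and that
# column 1 counts up from 1, as produced by `cat -n`.
check_fastq_list() {
    awk '
        NF != 3  { printf "line %d: expected 3 columns, found %d\n", NR, NF; bad = 1 }
        $1 != NR { printf "line %d: index is \"%s\", expected %d\n", NR, $1, NR; bad = 1 }
        END      { exit bad }
    ' "$1"
}

# Usage (assumes fastq.list exists in the current directory):
#   check_fastq_list fastq.list && echo "fastq.list looks OK"
```

A non-zero exit status usually means an odd number of fastq files was found, so one sample is missing a mate.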
You will also need to edit key input variables in ‘set variables’ in template.sh that are required to run Trinity:
- project= (er00)
- list= (fastqlist.txt)
- seqtype= (fq)
Usage
Overview
To manage the data-intensive computation of Trinity, each job utilises /jobfs, which requires data to be copied between file systems on Gadi.
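The copy-in, compute, copy-out pattern this implies can be sketched as follows. This is an illustration under stated assumptions, not the code in the workflow's PBS scripts: the function name stage_and_archive and the run directory are hypothetical, and a temporary directory stands in for the node-local disk when PBS_JOBFS is not set (that is, outside a PBS job).

```shell
#!/bin/bash
# Hypothetical sketch of staging data through node-local disk.
# PBS_JOBFS is set by PBS inside a job on Gadi; outside a job a temporary
# directory is used instead so the sketch still runs.
stage_and_archive() {
    local input=$1 outdir=$2
    local work=${PBS_JOBFS:-$(mktemp -d)}
    mkdir -p "$work/run" "$outdir"
    cp "$input" "$work/run/"                   # stage in to node-local disk
    # ... the compute step would run here, writing into "$work/run" ...
    tar -cf "$outdir/run.tar" -C "$work" run   # stage out as a single tar file
}

# Usage (paths are placeholders):
#   stage_and_archive /path/to/input.fq /path/to/network/storage
```

Writing one tar file back, rather than many small files, keeps the load on network storage low.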
Once you have made the fastq.list and set the variables in template.sh, run the workflow with:
sh template.sh
template.sh runs Trinity in 3 phases (trinity_1_fb.pbs to trinity_3_fb.pbs), each launched as an independent PBS script.
- trinity_1_fb.pbs: clusters Inchworm contigs with Chrysalis and maps reads. Stops before the parallel assembly of clustered reads.
- trinity_2_fb.pbs: assembles clusters of reads using Inchworm, Chrysalis and Butterfly. Chrysalis and Butterfly can be executed in parallel, each having independent input and output. This is the distributed part of the workflow.
- trinity_3_fb.pbs: final assembly. Harvests all assembled transcripts into a single multi-fasta file.
HPC usage report scripts are provided at the SIH repository for users to evaluate the KSU, walltime and resource consumption and efficiency of their job submissions. These scripts gather job request metrics from Gadi log files. To use, run all scripts from within the directories containing log files to be read.
Resource usage
The Trinity pipeline consists of a series of executables launched with a single command. Each stage of the pipeline has different compute resource requirements. The initial stages of the workflow (Inchworm and Chrysalis) are data-intensive and require high memory per core, while the later stages are scalable, embarrassingly parallel, single-core jobs. Trinity's general recommendation is ~1 GB of RAM per ~1 M pairs of Illumina sequence reads.
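The ~1 GB per ~1 M read pairs rule of thumb can be turned into a rough estimate for a given sample. The sketch below is hypothetical (the function name estimate_trinity_mem is not part of this repository) and assumes a gzipped fastq file with four lines per record; it counts the records in the read 1 file and applies the rule directly.

```shell
#!/bin/bash
# Hypothetical helper: estimate RAM for Trinity from the number of read pairs,
# using the ~1 GB per ~1 million pairs rule of thumb.
# $1 = gzipped read 1 fastq file (assumes 4 lines per fastq record).
estimate_trinity_mem() {
    gzip -dc "$1" | awk 'END {
        pairs = NR / 4          # one fastq record per read pair in the R1 file
        gb = pairs / 1e6        # ~1 GB of RAM per ~1 M read pairs
        printf "%d read pairs: ~%.1f GB RAM\n", pairs, gb
    }'
}

# Usage (file name is a placeholder):
#   estimate_trinity_mem sample_R1.fastq.gz
```

For a global assembly, the estimates for all samples would need to be summed, since the reads are assembled together.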
The distributed part of the workflow is unlikely to require significant jobfs or memory resources. However, the initial phase of the workflow may need to run on the hugemem nodes. If this is the case, edit the qsub definition at the bottom of the template.sh script. As there are some serial bottlenecks in the first part of the workflow, reducing the requested resources may improve the 'efficiency' of the calculation. For instance, half of a hugemem node (24 cores, 750 GB memory, 700 GB jobfs) may be sufficient for a larger assembly. The memory and jobfs requirements for processing samples are adequately met by NCI Gadi’s normal nodes (48 CPUs, 400 GB of /jobfs disk space).
Benchmarking metrics
The following benchmarking metrics were obtained using stem rust (Puccinia graminis) datasets with a genome size of ~170 Mb. Each was run on Gadi’s normal nodes (48 CPUs, 400 GB of /jobfs disk space).
Wheat stem rust
| Job | CPUs | Mem | CPU time | Walltime used | JobFS used | Efficiency | Service units |
|--------------------|------|----------|-----------|---------------|------------|------------|---------------|
| trinity1.pbs | 48 | 182.49GB | 68:27:15 | 2:59:16 | 193.35GB | 0.48 | 286.83 |
| trinity2fb0.pbs | 48 | 80.33GB | 115:52:10 | 2:33:03 | 19.89GB | 0.95 | 244.88 |
| trinity2fb1.pbs | 48 | 17.42GB | 18:51:03 | 0:26:00 | 243.04MB | 0.91 | 41.6 |
| trinity3.pbs | 48 | 5.14GB | 0:00:12 | 0:01:26 | 267.1MB | 0 | 2.29 |
| Total | | | | 5:33:45 | | | 576 |
Rye rust
| Job | CPUs | Mem | CPU time | Walltime used | JobFS used | Efficiency | Service units |
|--------------------|------|----------|----------|---------------|------------|------------|---------------|
| trinity1.pbs | 48 | 182.26GB | 23:37:52 | 2:51:09 | 182.89GB | 0.52 | 273.84 |
| trinity2fb0.pbs | 48 | 66.32GB | 21:48:17 | 2:05:58 | 19.73GB | 0.93 | 201.55 |
| trinity2fb1.pbs | 48 | 25.09GB | 1:51:32 | 0:02:51 | 61.87MB | 0.82 | 4.56 |
| trinity3.pbs | 48 | 4.39GB | 0:00:08 | 0:00:16 | 192.7MB | 0.01 | 0.43 |
| Total | | | | 4:57:23 | | | 480 |
Scabrum rust
| Job | CPUs | Mem | CPU time | Walltime used | JobFS used | Efficiency | Service units |
|--------------------|------|----------|----------|---------------|------------|------------|---------------|
| trinity1.pbs | 48 | 141.51GB | 37:21:15 | 1:46:16 | 111.15GB | 0.44 | 170.03 |
| trinity2fb0.pbs | 48 | 53.1GB | 99:12:17 | 2:12:50 | 11.39GB | 0.93 | 212.53 |
| trinity2fb1.pbs | 48 | 20.24GB | 11:01:19 | 0:15:38 | 185.78MB | 0.88 | 25.01 |
| trinity3.pbs | 48 | 4.54GB | 0:00:08 | 0:00:13 | 233.19MB | 0.01 | 0.35 |
| Total | | | | 3:59:19 | | | 408 |
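The efficiency and service unit columns in these tables appear to follow from the CPU count, CPU time and walltime. The sketch below is a hypothetical helper, not one of the SIH usage report scripts. It assumes efficiency = CPU time / (walltime × CPUs) and service units = CPUs × walltime in hours × a charge rate of 2 per CPU-hour. Both formulas and the rate are inferred from the tables rather than stated in the source. They reproduce the wheat stem rust and scabrum rust rows; the first two rye rust rows list CPU times that do not reproduce their efficiencies under this formula.

```shell
#!/bin/bash
# Hypothetical helper: recompute efficiency and service units for one job.
# $1 = CPUs, $2 = CPU time (H:MM:SS), $3 = walltime (H:MM:SS),
# $4 = service units charged per CPU-hour (assumed default: 2).
job_metrics() {
    awk -v cpus="$1" -v cput="$2" -v wall="$3" -v rate="${4:-2}" '
        # Convert H:MM:SS to decimal hours.
        function hours(t,   p) { split(t, p, ":"); return p[1] + p[2] / 60 + p[3] / 3600 }
        BEGIN {
            printf "efficiency=%.2f service_units=%.2f\n",
                   hours(cput) / (hours(wall) * cpus),
                   cpus * hours(wall) * rate
        }'
}

# Wheat stem rust, trinity1.pbs row:
job_metrics 48 68:27:15 2:59:16
# prints: efficiency=0.48 service_units=286.83
```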
Additional notes
Trinity’s running time is exponentially related to the number of de Bruijn graph branches created. Given walltime limitations on Gadi, the Gadi-Trinity workflow is not recommended for use on genomes >2 Gb. For larger single sample and global assemblies, we recommend the Flashlite-Trinity workflow that runs Trinity on the University of Queensland’s HPC, Flashlite.
- All work is performed local to the node in /jobfs or in /dev/shm.
- At the end of trinity_1_fb.pbs, a single tar file containing the full Trinity output directory is copied back to network storage. This will be >100 GB.
- Each task running trinity_2_fb.pbs works on a single file bin representing ~100,000 tasks. Only the recursive_trinity.cmds and the relevant data from read_partitions are copied to the node. The full read_partitions directory is archived and pushed back to network storage at the end of processing. This will be up to 10 GB.
- In trinity_3_fb.pbs, only the fasta files from the distributed step are copied to the node. Only the full assembly is copied back.
- The sort-recursive.py script is run by trinity_2_fb.pbs. It sorts the recursive commands by the size of their input files (largest to smallest). This prevents long-running, single-CPU jobs from holding up a whole node and improves overall job efficiency.
- The scripts were designed to use a single project for KSU debiting and storage.
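The largest-first ordering can be illustrated with a short shell sketch. This is not the repository's sort-recursive.py: the function name sort_by_size_desc is hypothetical, and it orders input files rather than Trinity commands. It assumes file paths contain no tabs or newlines.

```shell
#!/bin/bash
# Hypothetical sketch of largest-first scheduling order: print the given
# file paths sorted by size in bytes, largest to smallest.
sort_by_size_desc() {
    local f
    for f in "$@"; do
        # wc -c reports the size in bytes; tr strips padding some systems add
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done | sort -k1,1nr | cut -f2-
}

# Usage (the glob is a placeholder for the per-cluster input files):
#   sort_by_size_desc read_partitions/*/*/*.fa
```

Starting the largest inputs first means the slowest tasks overlap with many short ones, instead of running alone at the end while the other cores on the node sit idle.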
Acknowledgements
Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities.
Authors
- Tracy Chew (Sydney Informatics Hub, University of Sydney)
- Georgina Samaha (Sydney Informatics Hub, University of Sydney)
- Cali Willet (Sydney Informatics Hub, University of Sydney)
- Rosemarie Sadsad (Sydney Informatics Hub, University of Sydney)
- Rika Kobayashi (National Computational Infrastructure)
- Matthew Downton (National Computational Infrastructure)
- Ben Menadue (National Computational Infrastructure)
Suggested acknowledgement:
The authors acknowledge the support provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government, and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia funding.
Cite us to support us!
Chew, T., Samaha, G., Downton, M., Willet, C., Menadue, B. J., Kobayashi, R., & Sadsad, R. (2021). Gadi-Trinity (Version 1.0) [Computer software]. https://doi.org/10.48546/workflowhub.workflow.145.1
References
Grabherr MG, Haas BJ, Yassour M, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644-652. Published 2011 May 15. doi:10.1038/nbt.1883
Haas BJ, Papanicolaou A, Yassour M, et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 2013;8(8):1494-1512. doi:10.1038/nprot.2013.084
Owner
- Name: Sydney Informatics Hub
- Login: Sydney-Informatics-Hub
- Kind: organization
- Email: sih.admin@sydney.edu.au
- Location: University of Sydney, Sydney Australia
- Website: https://sydney.edu.au/sydney-informatics-hub
- Twitter: Sydney_CRF
- Repositories: 189
- Profile: https://github.com/Sydney-Informatics-Hub
The Sydney Informatics Hub is a Core Research Facility of the University of Sydney, providing training and expertise on research data, analysis and computing.
Citation (CITATION.cff)
cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Chew"
    given-names: "Tracy"
    orcid: "https://orcid.org/0000-0001-9529-7705"
  - family-names: "Samaha"
    given-names: "Georgina"
    orcid: "https://orcid.org/0000-0003-0419-1476"
  - family-names: "Downton"
    given-names: "Matthew"
    orcid: ""
  - family-names: "Willet"
    given-names: "Cali"
    orcid: "https://orcid.org/0000-0001-8449-1502"
  - family-names: "Menadue"
    given-names: "Benjamin J"
    orcid: "https://orcid.org/0000-0001-5013-6350"
  - family-names: "Kobayashi"
    given-names: "Rika"
    orcid: "https://orcid.org/0000-0002-0672-833X"
  - family-names: "Sadsad"
    given-names: "Rosemarie"
    orcid: "https://orcid.org/0000-0003-2488-953X"
title: "Gadi-Trinity"
version: 1.0
doi: 10.48546/workflowhub.workflow.145.1
date-released: 2021-06-18
url: "https://github.com/Sydney-Informatics-Hub/Gadi-Trinity"
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 2
- Total pull requests: 7
- Average time to close issues: 5 months
- Average time to close pull requests: about 7 hours
- Total issue authors: 1
- Total pull request authors: 3
- Average comments per issue: 2.5
- Average comments per pull request: 0.0
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- hp2048 (2)
Pull Request Authors
- mattdton (3)
- georgiesamaha (2)
- tracychew (2)