https://github.com/alleninstitute/bulk-rna-snakeline

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
    Organization alleninstitute has institutional domain (alleninstitute.org)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: AllenInstitute
  • Language: Python
  • Default Branch: main
  • Size: 658 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed 10 months ago
Metadata Files
Readme

README.md

Bulk-RNA-Snakeline

Motivation

Rapid advances in sequencing technology have made it affordable to sequence far more samples, so researchers now generate massive amounts of biological data. This has created a growing demand for simple, efficient methods to process and analyze large datasets and turn them into meaningful, reproducible information. Workflow engines address this need: they streamline and automate processing tasks, reducing the risk of user bias and of errors that arise from manual procedures.

Recognizing the need to adapt to the ever-increasing volume of data inputs and the constant evolution of processing software, the Bioinformatics Core team at the Allen Institute (BiCore) has begun transitioning towards automated workflows. In particular, they are employing Snakemake, a powerful workflow engine, to facilitate the quality assessment, trimming, and mapping of Bulk RNA-Sequencing (RNA-Seq) data. By embracing automated workflows, the BiCore team aims to improve efficiency, consistency, and reproducibility in their research, ultimately enhancing the overall quality of their findings.

Quickstart Guide

Follow these steps to use the Bulk-RNA-Snakeline:

1. Download the repository (.zip), move it to your working directory, and unzip it.
2. Create and load Conda environment with all dependencies:
    ```bash
    conda env create --name snakeline_env -f envs/Bulk-RNA-Snakeline.yml
    ```
3. Activate the Conda environment:
    ```bash
    conda activate snakeline_env
    ```
4. Move RAW Fastq Files into Bulk-RNA-Snakeline folder.
5. Prepare the pipeline by creating directory structure:
    ```bash
    python3 setup.py
    ```
    Or if sample_list.txt is supplied:
    ```bash
    python3 setup.py -s <name_of_sample_file>
    ```
6. Adjust parameters in config.yml:
    ```bash
    nano config/config.yml
    ```
7. Execute snakemake and run the workflow:
    ```bash
    snakemake --cores 160 -s <snakefile>
    ```
    Or using Slurm (optional):
    ```bash
    srun --partition=celltypes --mem=60g --time=24:00:00 snakemake --cores 160 -s main.smk
    ```
    ```bash
    sbatch run.sh
    ```
8. Troubleshooting common errors:

- A raised LockException:
    ```bash
    rm .snakemake/locks/*
    ```
- Directory cannot be locked:
    ```bash
    snakemake -s main.smk --unlock
    ```
- Incomplete Run:
    ```bash
    srun --partition=celltypes --mem=60g --time=24:00:00 snakemake --cores 160 -s main.smk --latency-wait 60 --rerun-incomplete
    ```
    ```bash
    sbatch rerun.sh
    ```

Note: Runtime depends on the size of the dataset and the number of cores available; large runs can take a long time.
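
The Quickstart commands request 160 cores, which not every host or Slurm allocation provides. The following is a minimal sketch, not part of the repository, of how one might cap that number before launching. The helper names `pick_cores` and `snakemake_command` are hypothetical, and it assumes that Slurm exports `SLURM_CPUS_PER_TASK` for the job (this happens when `--cpus-per-task` is requested).

```python
import os
import shlex


def pick_cores(requested: int = 160) -> int:
    """Cap the requested core count at what the host or Slurm job offers."""
    # SLURM_CPUS_PER_TASK is only present when --cpus-per-task was requested;
    # otherwise fall back to the CPU count reported by the operating system.
    slurm_cpus = os.environ.get("SLURM_CPUS_PER_TASK")
    available = int(slurm_cpus) if slurm_cpus else (os.cpu_count() or 1)
    return max(1, min(requested, available))


def snakemake_command(snakefile: str = "main.smk", requested: int = 160) -> str:
    """Build the Quickstart command line with the capped core count."""
    args = ["snakemake", "--cores", str(pick_cores(requested)), "-s", snakefile]
    return shlex.join(args)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(snakemake_command())
```

Printing the command rather than executing it keeps the sketch free of side effects; the output can be pasted into a shell or into a batch script.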

Required Tools

  • FastQC 0.11.9 (A quality control tool for high throughput sequence data)

  • CutAdapt 4.1 (Finds and removes adapter sequences and low-quality bases from FASTQ reads)

  • STAR v2.7.1a (Splice-aware, ultrafast aligner of RNA-Seq reads to a reference genome)

  • StringTie 2.2.1 (A fast and highly efficient assembler of RNA-Seq alignments into potential transcripts)
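
Before starting a run, it can help to confirm that the tools listed above are reachable from the active environment. The sketch below is an illustration rather than a script shipped with the pipeline. It assumes the usual executable names (`fastqc`, `cutadapt`, `STAR`, `stringtie`) and only checks that each one is on `PATH`; it does not verify versions. The function name `missing_tools` is hypothetical.

```python
import shutil

# Executable name -> tool label as listed above. The executable names are the
# ones these tools are commonly installed under (an assumption); versions are
# not checked here.
REQUIRED_TOOLS = {
    "fastqc": "FastQC 0.11.9",
    "cutadapt": "CutAdapt 4.1",
    "STAR": "STAR v2.7.1a",
    "stringtie": "StringTie 2.2.1",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the labels of tools whose executable is not found on PATH."""
    return [label for exe, label in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it after `conda activate snakeline_env`; anything reported as missing points to an incomplete environment.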

About Bulk-RNA-Snakeline

The Allen Institute's Bioinformatics Core team currently employs a pipeline to process raw Bulk RNA-Seq data. This existing pipeline, however, relies on users executing a series of custom bash scripts for each workflow step. This approach is not only time-consuming but also demands extra effort from users to ensure proper script execution, correct parameter adjustments, and accurate file paths. It is crucial to recognize that user errors can negatively impact downstream analyses and compromise result accuracy.

Furthermore, users often face input and output compatibility issues when running multiple scripts. Incompatibilities arise when the output files generated by one script are not compatible with the inputs required for another script due to differences in file formats or software versions. Additionally, the virtual environment must be checked to guarantee the successful installation of all necessary software tools and dependencies.

To minimize manual intervention and enhance the efficiency of processing Bulk RNA-Seq data, the BiCore team is transitioning from a basic Unix shell pipeline to Snakemake. As a user-friendly workflow engine, Snakemake processes data through well-defined rules, each consisting of input and output files, parameters, computational tasks, and, optionally, an environment path. Snakemake's unique features reduce code complexity and enhance readability. Designed specifically for bioinformatics analyses, Snakemake is a domain-specific language (DSL) that offers portability, readability, reproducibility, scalability, and reusability, making it the ideal choice for the BiCore team's needs.
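
To make the parts of a rule concrete, here is a plain-Python model of one: input and output patterns, parameters, and a command template. This is a teaching sketch and not Snakemake's API or a rule from this repository. The `Rule` class, the `raw/` and `trimmed/` paths, and the quality cutoff of 20 are all hypothetical; only the `cutadapt` options `-q` and `-o` are real.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Plain-Python stand-in for the parts of a workflow rule."""
    name: str
    input: str    # input path pattern containing a {sample} wildcard
    output: str   # output path pattern containing the same wildcard
    shell: str    # command template using {input}, {output} and parameters
    params: dict = field(default_factory=dict)

    def command_for(self, sample: str) -> str:
        """Fill the wildcard and the parameters into the command template."""
        return self.shell.format(
            input=self.input.format(sample=sample),
            output=self.output.format(sample=sample),
            **self.params,
        )


# Hypothetical trimming rule; paths and the quality cutoff are illustrative.
trim = Rule(
    name="trim",
    input="raw/{sample}.fastq.gz",
    output="trimmed/{sample}.fastq.gz",
    shell="cutadapt -q {quality} -o {output} {input}",
    params={"quality": 20},
)

if __name__ == "__main__":
    for sample in ["sampleA", "sampleB"]:
        print(trim.command_for(sample))
```

In an actual workflow the engine, not the user, resolves the wildcard for every sample and decides which commands to run from the requested output files; the model only shows how one rule definition covers many samples.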

Pipeline Overview

[Figure: pipeline overview diagram]

Directory Structure

[Figure: directory structure diagram]

Authors and History

  • Beagan Nguy - Algorithm Design
  • Anish Chakka - Project Manager

Acknowledgments

Allen Institute Bioinformatics Core Team

References

Johannes Köster, Sven Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics, Volume 28, Issue 19, 1 October 2012, Pages 2520–2522, https://doi.org/10.1093/bioinformatics/bts480

Owner

  • Name: Allen Institute
  • Login: AllenInstitute
  • Kind: organization
  • Location: Seattle, WA

Please visit http://alleninstitute.github.io/ for more information.

GitHub Events

Total
  • Push event: 2
Last Year
  • Push event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 157
  • Total Committers: 2
  • Avg Commits per committer: 78.5
  • Development Distribution Score (DDS): 0.045
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| Beagan Nguy | 3****g | 150 |
| Beagan Nguy | b****y@h****g | 7 |

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

main/setup.py pypi
setup.py pypi