https://github.com/alleninstitute/bulk-rna-snakeline

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
○ Academic publication links
✓ Committers with academic emails: 1 of 2 committers (50.0%) from academic institutions
✓ Institutional organization owner: Organization alleninstitute has institutional domain (alleninstitute.org)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.9%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: AllenInstitute
Language: Python
Default Branch: main
Size: 658 KB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme

Bulk-RNA-Snakeline


Motivation

Rapid advances in sequencing technology have made sequencing more affordable and allow more samples per study, so researchers can now generate massive amounts of biological data. This has created a growing demand for simple, efficient methods to process and analyze large datasets and turn them into meaningful, reproducible information. Workflow engines address this need by streamlining and automating processing tasks, reducing the user bias and errors that can arise from manual procedures.

Recognizing the need to adapt to the ever-increasing volume of data inputs and the constant evolution of processing software, the Bioinformatics Core team at the Allen Institute (BiCore) has begun transitioning towards automated workflows. In particular, they are employing Snakemake, a powerful workflow engine, to facilitate the quality assessment, trimming, and mapping of Bulk RNA-Sequencing (RNA-Seq) data. By embracing automated workflows, the BiCore team aims to improve efficiency, consistency, and reproducibility in their research, ultimately enhancing the overall quality of their findings.

Table of Contents

Quickstart Guide
Required Tools
About-Bulk-RNA-Snakeline
Pipeline Overview
Directory Structure
Authors and History
Acknowledgments
References

Quickstart Guide

Follow these steps to use the Bulk-RNA-Snakeline:

1. Download the repository (.zip), move it to your working directory, and unzip it.
2. Create and load the Conda environment with all dependencies:
    ```bash
    conda env create --name snakeline_env -f envs/Bulk-RNA-Snakeline.yml
    ```
3. Activate the Conda environment:
    ```bash
    conda activate snakeline_env
    ```
4. Move the raw FASTQ files into the Bulk-RNA-Snakeline folder.
5. Prepare the pipeline by creating the directory structure:
    ```bash
    python3 setup.py
    ```
    Or, if sample_list.txt is supplied:
    ```bash
    python3 setup.py -s <name_of_sample_file>
    ```
6. Adjust parameters in config.yml:
    ```bash
    nano config/config.yml
    ```
7. Execute Snakemake and run the workflow:
    ```bash
    snakemake --cores 160 -s <snakefile>
    ```
    Or using Slurm (optional):
    ```bash
    srun --partition=celltypes --mem=60g --time=24:00:00 snakemake --cores 160 -s main.smk
    ```
    ```bash
    sbatch run.sh
    ```
8. Troubleshooting common errors:

- A raised LockException:
    ```bash
    rm .snakemake/locks/*
    ```
- Directory cannot be locked:
    ```bash
    snakemake -s main.smk --unlock
    ```
- Incomplete Run:
    ```bash
    srun --partition=celltypes --mem=60g --time=24:00:00 snakemake --cores 160 -s main.smk --latency-wait 60 --rerun-incomplete
    ```
    ```bash
    sbatch rerun.sh
    ```

Note: Runtime depends on the size of the input data and the number of cores available, and a full run can take a long time.
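
The Snakemake invocations in the guide differ only in a few flags, so they can be assembled programmatically. The following is a minimal Python sketch, not part of the repository: the function names `build_snakemake_command` and `wrap_with_srun` are hypothetical, and the defaults (`main.smk`, 160 cores, the `celltypes` partition, 60g, 24 hours) are simply copied from the commands above.

```python
import shlex


def build_snakemake_command(snakefile="main.smk", cores=160,
                            rerun_incomplete=False, latency_wait=None,
                            unlock=False):
    """Assemble a snakemake invocation as an argument list (hypothetical helper)."""
    if unlock:
        # --unlock only clears a stale lock on the working directory; no jobs run.
        return ["snakemake", "-s", snakefile, "--unlock"]
    cmd = ["snakemake", "--cores", str(cores), "-s", snakefile]
    if latency_wait is not None:
        # Seconds to wait for output files to appear on slow shared filesystems.
        cmd += ["--latency-wait", str(latency_wait)]
    if rerun_incomplete:
        # Re-run jobs whose outputs were left incomplete by an interrupted run.
        cmd.append("--rerun-incomplete")
    return cmd


def wrap_with_srun(cmd, partition="celltypes", mem="60g", time_limit="24:00:00"):
    """Prefix a command with the Slurm srun options used in the guide."""
    return ["srun", f"--partition={partition}", f"--mem={mem}",
            f"--time={time_limit}"] + cmd


if __name__ == "__main__":
    rerun = build_snakemake_command(rerun_incomplete=True, latency_wait=60)
    print(shlex.join(wrap_with_srun(rerun)))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues when the Snakefile path contains spaces.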

Required Tools

FastQC 0.11.9 (A quality control tool for high throughput sequence data)
CutAdapt 4.1 (Automates quality control and adapter trimming of fastq files)
STAR v2.7.1a (Splice-aware, ultrafast aligner of RNA-Seq reads to a reference genome)
StringTie 2.2.1 (A fast and highly efficient assembler of RNA-Seq alignments into potential transcripts.)
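
Before running the workflow, it can help to confirm that the activated environment actually exposes these tools. The sketch below is an illustration rather than repository code: it assumes the usual executable names (`fastqc`, `cutadapt`, `STAR`, `stringtie`) and only checks that they are on `PATH`, not that the versions match those listed above. The helper name `find_missing_tools` is hypothetical.

```python
import shutil

# Assumed executable names for each required tool (not taken from the repository).
REQUIRED_TOOLS = {
    "FastQC": "fastqc",
    "CutAdapt": "cutadapt",
    "STAR": "STAR",
    "StringTie": "stringtie",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.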

About-Bulk-RNA-Snakeline

The Allen Institute's Bioinformatics Core team currently employs a pipeline to process raw Bulk RNA-Seq data. This existing pipeline, however, relies on users executing a series of custom bash scripts for each workflow step. This approach is not only time-consuming but also demands extra effort from users to ensure proper script execution, correct parameter adjustments, and accurate file paths. It is crucial to recognize that user errors can negatively impact downstream analyses and compromise result accuracy.

Furthermore, users often face input and output compatibility issues when running multiple scripts. Incompatibilities arise when the output files generated by one script are not compatible with the inputs required for another script due to differences in file formats or software versions. Additionally, the virtual environment must be checked to guarantee the successful installation of all necessary software tools and dependencies.

To minimize manual intervention and enhance the efficiency of processing Bulk RNA-Seq data, the BiCore team is transitioning from a basic Unix shell pipeline to Snakemake. As a user-friendly workflow engine, Snakemake processes data through well-defined rules, each consisting of input and output files, parameters, computational tasks, and, optionally, an environment path. Snakemake's unique features reduce code complexity and enhance readability. Designed specifically for bioinformatics analyses, Snakemake is a domain-specific language (DSL) that offers portability, readability, reproducibility, scalability, and reusability, making it the ideal choice for the BiCore team's needs.
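
To illustrate how rules defined by input and output files determine the order of work, here is a toy Python model. It is not Snakemake's API and not the repository's Snakefile: the `Rule` class, the `execution_order` function, and all file names are hypothetical. Snakemake itself resolves dependencies backwards from the requested targets and supports wildcards; this sketch does neither and only shows the core idea that a rule can run once its inputs exist.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A simplified rule: a name plus the files it reads and writes."""
    name: str
    inputs: list
    outputs: list


def execution_order(rules, available):
    """Order rules so that each runs only after all of its inputs exist."""
    available = set(available)
    pending = list(rules)
    order = []
    while pending:
        # A rule is ready when every input file is already available.
        ready = [r for r in pending if set(r.inputs) <= available]
        if not ready:
            names = ", ".join(r.name for r in pending)
            raise ValueError("inputs can never be satisfied for: " + names)
        for rule in ready:
            order.append(rule.name)
            available.update(rule.outputs)
            pending.remove(rule)
    return order


# Hypothetical rules for one sample, deliberately listed out of order.
rules = [
    Rule("stringtie", ["sample.bam"], ["sample.gtf"]),
    Rule("star", ["sample.trimmed.fastq"], ["sample.bam"]),
    Rule("cutadapt", ["sample.fastq"], ["sample.trimmed.fastq"]),
    Rule("fastqc", ["sample.fastq"], ["sample_fastqc.html"]),
]

if __name__ == "__main__":
    print(execution_order(rules, ["sample.fastq"]))
```

Because the order is derived from file names rather than from the order in which rules are written, a mismatch between one step's output and the next step's input surfaces as an error before any tool runs.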

Pipeline Overview

[Image: pipeline overview diagram]

Directory Structure

[Image: directory structure diagram]

Authors and History

Beagan Nguy - Algorithm Design
Anish Chakka - Project Manager

Acknowledgments

Allen Institute Bioinformatics Core Team

References

Johannes Köster, Sven Rahmann, Snakemake—a scalable bioinformatics workflow engine, Bioinformatics, Volume 28, Issue 19, 1 October 2012, Pages 2520–2522, https://doi.org/10.1093/bioinformatics/bts480

Owner

Name: Allen Institute
Login: AllenInstitute
Kind: organization
Location: Seattle, WA

Website: https://alleninstitute.org
Repositories: 184
Profile: https://github.com/AllenInstitute

Please visit http://alleninstitute.github.io/ for more information.

GitHub Events

Total

Push event: 2

Last Year

Push event: 2

Committers

Last synced: about 1 year ago

All Time

Total Commits: 157
Total Committers: 2
Avg Commits per committer: 78.5
Development Distribution Score (DDS): 0.045

Past Year

Commits: 2
Committers: 1
Avg Commits per committer: 2.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Beagan Nguy	3****g	150
Beagan Nguy	b**y@h**g	7
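
The DDS values above are consistent with a common definition of the Development Distribution Score: one minus the top committer's share of commits. That definition is an assumption here, since the page does not state its formula. Under it, the figures can be reproduced from the commit counts in the table:

```python
def development_distribution_score(commit_counts):
    """DDS under the assumed definition: 1 - (top committer's commits / total commits)."""
    total = sum(commit_counts)
    if total == 0:
        raise ValueError("no commits to score")
    return 1 - max(commit_counts) / total


if __name__ == "__main__":
    # All time: 150 and 7 commits from the two committer identities (157 total).
    print(round(development_distribution_score([150, 7]), 3))
    # Past year: 2 commits from a single committer.
    print(development_distribution_score([2]))
```

A score near 0 means nearly all commits come from one committer identity, which matches the table.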

Committer Domains (Top 20 + Academic)

hpc-login.corp.alleninstitute.org: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
