sci_wiz-rna-seq-workflow

SCI_WIZ: Scotland institute(CRUK) workflow wizard

https://github.com/beatson-compbio/sci_wiz-rna-seq-workflow

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (19.1%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: Beatson-CompBio
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 438 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 1
  • Releases: 1
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Citation

README.md

CRUK Scotland Institute Workflow Wizard: sci_wiz

sci_wiz combines the Nextflow framework with a modular Python package to provide easy-to-use RNA-Seq data pre-processing built on standard, widely accepted tools. The package installs all the dependencies required to run the workflow, so you do not have to install them separately.

[!NOTE] You will need to download the STAR index, reference genome, and annotation files separately. The package will not download these files for you.

What sci_wiz can do:

  1. Data pre-processing: Running the data pre-processing workflow carries out the steps below and generates a read count matrix. More information on the tool used in each step is available through the links.
    1. Trimming: FastP.
    2. Read QC: FastQC and MultiQC.
    3. Mapping: STAR.
    4. Counting: FeatureCounts.
    5. BAM to CRAM: reduces the storage needed for the alignment files.

System Requirements

Running this package requires access to a virtual machine (VM) or a high-performance computing (HPC) cluster. The requirements are:

  • Python 3.10 or higher
  • Slurm scheduler for running on HPC

[!NOTE] You need permission to create and remove symlinks. Increase the number of files your system can open with ulimit -n 3000; this value is a suggestion from the STAR developers in this issue. Without it, STAR may fail with an open-file-limit error.
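The requirements above can be checked before starting a run. The following is a minimal sketch using only the Python standard library, not part of sci_wiz itself: the function name check_environment is hypothetical, looking for Slurm's sbatch command is an assumed way of detecting the scheduler, and the resource module is only available on Unix-like systems.

```python
import resource
import shutil
import sys

MIN_PYTHON = (3, 10)     # minimum version from the requirements list
SUGGESTED_NOFILE = 3000  # open-file limit suggested in the note above


def check_environment(profile: str = "hpc") -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = sys.version.split()[0]
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}")
    # sbatch is Slurm's job submission command; it only matters for the hpc profile.
    if profile == "hpc" and shutil.which("sbatch") is None:
        problems.append("Slurm 'sbatch' command not found on PATH")
    # The soft limit is what 'ulimit -n' reports for the current process.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < SUGGESTED_NOFILE:
        problems.append(f"open-file limit is {soft}; consider 'ulimit -n {SUGGESTED_NOFILE}'")
    return problems


if __name__ == "__main__":
    for problem in check_environment(profile="vm"):
        print("WARNING:", problem)
```

The check only reports problems and does not change any limits, so it is safe to run on a shared login node.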

Quick start

To help you start your RNA-seq analysis quickly, we have developed a package that is quick to install. You can either use the command-line interface (CLI) or import the package in a script. We recommend running all analyses in a virtual environment so that your system configuration remains unchanged.

Installation

  • Start by creating a virtual environment and activating it. After running the command below, you should see the name of your virtual environment at the far left of the terminal prompt. If not, please refer to the Python documentation on creating virtual environments.

```console
python -m venv .sci_wiz && source .sci_wiz/bin/activate
```

  • Download the latest package .whl file from the release section of this repository and save it into your current working directory. Use the following command to install the package.

```console
pip install sci_wiz-{version}-py3-none-any.whl
```

  • You can list all the functionality that sci_wiz provides with the --help option.

```console
sci_wiz --help
```

Configuration

  • This required configuration step generates a user_input.ini file that stores your inputs. Run the command below in your terminal:

```console
sci_wiz create-config
```

  • You should now have a user_input.ini file in your current working directory. Edit it to supply the inputs required to run the data pre-processing.

```ini
[USER_INPUT]
project_name = G12yymmuniqueName # G12 is group code, yymm: year and month; uniqueName.
profile = hpc # 'vm' if running on VM, 'hpc' if running on HPC.
reads = /projectname//{R1,R2}001.fastq.gz # absolute path
output_dir = /projectname/Data/
index = STAR75bpor150bp
annotation = Org.OrgCode.110.gtf
reference = Org.OrgCode.110.fa
annotation_bed = Org.OrgCode.110.bed
batch_info = false # batch_info True will require run1, run2, batch_destination; input reads will be ignored.
run1 =
run2 =
batch_destination =

[TRIMMING]
trimfrontread_01 = 1 # will trim the front bases from read 1
trimfrontread_02 = 1 # will trim front bases from read 2
trimtailread_01 = 0 # will trim tail bases from read 1
trimtailread_02 = 0 # will trim tail bases from read 2
```

  • project_name: Name of the project; you can follow the project naming convention shown in the example above.
  • profile: Type of system you are running the analysis on: vm for a virtual machine or hpc for a high-performance cluster.
  • reads: Path to input raw RNA-seq reads in fastq.gz format.
  • output_dir: Base path for the output directory.
  • index: Path to the STAR index. Right now this workflow only supports alignment using STAR.
  • annotation: Path to the GTF file containing gene annotations.
  • reference: Path to the reference genome FASTA file.
  • annotation_bed: Path to the BED file containing gene annotations.
  • batch_info: Flag indicating whether raw files are available in multiple batches. If this is True, you will need to provide run1, run2, & batch_destination.
  • run1 and run2: Paths to raw data for batch setup.
  • batch_destination: Destination path for organized batch data.
  • trimfrontread_01: Number of bases trimmed from front of Read_01. Default is 1.
  • trimfrontread_02: Number of bases trimmed from front of Read_02. Default is 1.
  • trimtailread_01: Number of bases trimmed from tail of Read_01. Default is 0.
  • trimtailread_02: Number of bases trimmed from tail of Read_02. Default is 0.
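Before launching a run, it can help to confirm that the required fields are filled in. The sketch below reads the settings with Python's configparser and reports missing values. It is an illustration only: the function name check_user_input is hypothetical, the paths in SAMPLE are invented, and whether sci_wiz itself reads the file this way is an assumption. The section and key names follow the list above.

```python
import configparser

REQUIRED_KEYS = ("project_name", "profile", "reads", "output_dir",
                 "index", "annotation", "reference", "annotation_bed")

# Invented example values, used only to demonstrate the check.
SAMPLE = """
[USER_INPUT]
project_name = demo
profile = vm   # 'vm' or 'hpc'
reads = /data/demo/reads
output_dir = /data/demo/out
index = /refs/star_index
annotation = /refs/genes.gtf
reference = /refs/genome.fa
annotation_bed = /refs/genes.bed
batch_info = false
"""


def check_user_input(text: str) -> list[str]:
    """Return a list of problems found in the [USER_INPUT] section."""
    # inline_comment_prefixes lets '# ...' follow a value on the same line.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(text)
    if not parser.has_section("USER_INPUT"):
        return ["missing [USER_INPUT] section"]
    section = parser["USER_INPUT"]
    problems = [f"missing or empty key: {key}"
                for key in REQUIRED_KEYS if not section.get(key, "").strip()]
    if section.get("profile", "") not in ("vm", "hpc"):
        problems.append("profile must be 'vm' or 'hpc'")
    # Batch mode needs the three batch paths as well.
    if section.getboolean("batch_info", fallback=False):
        for key in ("run1", "run2", "batch_destination"):
            if not section.get(key, "").strip():
                problems.append(f"batch_info is true but {key} is empty")
    return problems


if __name__ == "__main__":
    print(check_user_input(SAMPLE) or "configuration looks complete")
```

To check a real file, pass its contents to the function, for example check_user_input(open("user_input.ini").read()).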

Data Pre-processing

[!IMPORTANT] If you want to use the default trimming inputs, go straight to the pre-processing command. Otherwise, read the Trimming raw data section first. Please make sure the system requirements are met before running the commands below.

  • Trigger the data pre-processing with the command below. It should run smoothly if the following two conditions are satisfied:
    • The given inputs are as expected.
    • You have permission to read all the required files and folders, such as input.fastq.gz, the index folder, annotation.gtf, annotation.bed, and reference.fa.
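The read-permission condition can be verified up front rather than discovered partway through a run. This is a minimal sketch, not part of sci_wiz: the function name unreadable is hypothetical, and the file names in the usage example are the placeholders from the list above.

```python
import os
from pathlib import Path


def unreadable(paths: list[str]) -> list[str]:
    """Return the entries of `paths` that are missing or not readable by the current user."""
    bad = []
    for raw in paths:
        path = Path(raw)
        # Directories (such as the STAR index folder) also need execute
        # permission so that their contents can be listed and opened.
        mode = (os.R_OK | os.X_OK) if path.is_dir() else os.R_OK
        if not path.exists() or not os.access(path, mode):
            bad.append(raw)
    return bad


if __name__ == "__main__":
    inputs = ["annotation.gtf", "annotation.bed", "reference.fa"]  # placeholder names
    for missing in unreadable(inputs):
        print("cannot read:", missing)
```

Note that os.access checks the permissions of the user running the script; jobs submitted through a scheduler run as the same user but may see different mounts on the compute nodes.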

```console
sci_wiz run-preprocessing
```

Running on a virtual machine (VM)

  • Make sure you have entered vm as your profile in user_input.ini. For example:

```ini
[USER_INPUT]
profile = vm
...
```

Running on a high-performance cluster (HPC)

Dependency: The current workflow is only configured to work with SLURM.

  • Make sure your input data is available in the shared scratch space, preferably in your current working directory.

  • To run data pre-processing on an HPC, set profile as below:

```ini
[USER_INPUT]
profile = hpc
...
```

Trimming raw data

The Illumina Stranded library preparation kit is used as the default kit. This kit requires trimming of the first base from both reads, and the defaults in the user_input.ini file are set accordingly. If you are using a different library preparation kit, the trimming parameters may differ, so please check the documentation for your kit. If the kit requires different trimming, or if you want to switch trimming off, edit the user_input.ini file. The workflow uses FastP, and the user_input.ini file follows FastP's flags, restricted to a controlled set of parameters. The following command runs only the initial QC step, without trimming:

```console
sci_wiz run-initial-qc
```
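To make the link between the trimming settings and FastP concrete, the sketch below turns four trimming values into fastp command-line arguments. The option names --trim_front1, --trim_front2, --trim_tail1 and --trim_tail2 are fastp's own; the function fastp_trim_args is hypothetical, and how sci_wiz maps its settings onto fastp internally is an assumption.

```python
def fastp_trim_args(front1: int = 1, front2: int = 1,
                    tail1: int = 0, tail2: int = 0) -> list[str]:
    """Translate per-read trimming settings into fastp command-line arguments."""
    settings = {
        "--trim_front1": front1,  # bases trimmed from the front of read 1
        "--trim_front2": front2,  # bases trimmed from the front of read 2
        "--trim_tail1": tail1,    # bases trimmed from the tail of read 1
        "--trim_tail2": tail2,    # bases trimmed from the tail of read 2
    }
    args = []
    for flag, bases in settings.items():
        if bases < 0:
            raise ValueError(f"{flag} must be non-negative, got {bases}")
        # Every flag is passed explicitly, including zeros: fastp documents that
        # unspecified read-2 trimming options follow the read-1 settings.
        args += [flag, str(bases)]
    return args


if __name__ == "__main__":
    # Defaults mirror the Illumina Stranded settings described above.
    print(" ".join(fastp_trim_args()))
```

Setting all four values to 0 switches trimming off while keeping the read-2 behaviour explicit.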

Import as a module

Did I mention that you can import the rna_seq module and carry out all the above steps in a script or Jupyter notebook? Here is a quick example:

```python
from sci_wiz import rna_seq

rna_seq.generate_config()

# run below step after editing user_input.ini
rna_seq.launch_data_preprocessing()
```
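In a script or notebook, the editing step can also be done in code before pre-processing is launched. The sketch below updates values in user_input.ini with Python's configparser. It is not part of the sci_wiz API: the helper set_user_input is hypothetical, and it assumes the settings live in a [USER_INPUT] section as shown in the profile examples above.

```python
import configparser
from pathlib import Path


def set_user_input(path: str, **updates: str) -> None:
    """Set keys in the [USER_INPUT] section of an .ini file, rewriting it in place."""
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read(path)  # a missing file is silently treated as empty
    if not parser.has_section("USER_INPUT"):
        parser.add_section("USER_INPUT")
    for key, value in updates.items():
        parser.set("USER_INPUT", key, value)
    # Caveat: configparser does not preserve comments when writing the file back.
    with Path(path).open("w") as handle:
        parser.write(handle)


if __name__ == "__main__":
    # Switch the generated config to the VM profile.
    set_user_input("user_input.ini", profile="vm")
```

Because comments are lost on rewrite, editing the file by hand remains the better choice if you rely on the inline notes in the generated template.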

Here is a Jupyter notebook with the same steps that you can expand according to your use case.

Report issues

If you find any issues with our code, you can reach out to us by:

  • Reporting Issues: If you encounter any issues or bugs, please create a detailed issue report on the repository.

  • Providing Feedback: Share your feedback on existing features or suggest improvements.

  • Documentation Edits: If you find any discrepancies or have suggestions for improving the documentation, feel free to submit edits or open an issue.

Citation

If you find sci_wiz useful in your research, please consider citing it:

```bibtex
@software{sci_wiz,
  author = {Ojo, Ifedayo and Sikarwar, Mayank and Kwan, Ryan and Shaw, Robin and Miller, Crispin},
  month = {1},
  title = {CRUK Scotland Institute Workflow Wizard: sci_wiz},
  url = {https://github.com/Beatson-CompBio/RNA-seq-workflow},
  year = {2024}
}
```

How to cite dependencies?

We would really appreciate it if you could also cite the dependencies using this bib file:

  • Nextflow
  • Bamtools
  • FeatureCounts
  • Fastp
  • STAR
  • Multiqc

Owner

  • Name: CRUK Beatson Institute - Computational Biology
  • Login: Beatson-CompBio
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: 'CRUK Scotland Institute Workflow Wizard: sci_wiz'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Ifedayo
    family-names: Ojo
    email: i.ojo@crukscotlandinstitute.ac.uk
    affiliation: CRUK Scotland Institute
  - given-names: Mayank
    family-names: Sikarwar
    email: m.sikarwar@crukscotlandinstitute.ac.uk
    affiliation: CRUK Scotland Institute
  - given-names: Ryan
    family-names: Kwan
    email: s.kwan@crukscotlandinstitute.ac.uk
    affiliation: CRUK Scotland Institute
  - given-names: Robin
    family-names: Shaw
    email: r.shaw@crukscotlandinstitute.ac.uk
    affiliation: CRUK Scotland Institute
  - given-names: Crispin
    family-names: Miller
    email: crispin.miller@glasgow.ac.uk
    affiliation: CRUK Scotland Institute
identifiers:
  - type: url
    value: >-
      https://github.com/Beatson-CompBio/sci_wiz-rna-seq-workflow
    description: 'SCI_WIZ: Scotland institute(CRUK) workflow wizard'
repository-code: >-
  https://github.com/Beatson-CompBio/sci_wiz-rna-seq-workflow
license: MIT
version: 1.0.0

GitHub Events

Total
  • Issues event: 7
  • Fork event: 1
Last Year
  • Issues event: 7
  • Fork event: 1

Dependencies

.github/workflows/helperfunc_groovy.yml actions
  • actions/checkout v3 composite
.github/workflows/helperfunc_python.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/pipeline_ci.yml actions
  • actions/checkout v4 composite
  • actions/setup-go v4 composite
  • eWaterCycle/setup-singularity v7 composite
  • nf-core/setup-nextflow v1 composite
.github/workflows/sci_wiz_ci_workflow.yaml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • actions/upload-artifact v2 composite
.github/workflows/sci_wiz_release.yaml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • actions/upload-artifact v2 composite
poetry.lock pypi
  • annotated-types 0.7.0
  • appnope 0.1.4
  • asttokens 2.4.1
  • attrs 23.2.0
  • black 24.4.2
  • certifi 2024.2.2
  • cffconvert 2.0.0
  • cffi 1.16.0
  • charset-normalizer 3.3.2
  • click 8.1.7
  • colorama 0.4.6
  • comm 0.2.2
  • coverage 7.5.2
  • debugpy 1.8.1
  • decorator 5.1.1
  • docopt 0.6.2
  • dotty-dict 1.3.1
  • exceptiongroup 1.2.1
  • executing 2.0.1
  • gitdb 4.0.11
  • gitpython 3.1.43
  • idna 3.7
  • importlib-resources 6.4.0
  • iniconfig 2.0.0
  • ipykernel 6.29.4
  • ipython 8.24.0
  • jedi 0.19.1
  • jinja2 3.1.3
  • jsonschema 3.2.0
  • jupyter-client 8.6.2
  • jupyter-core 5.7.2
  • markdown-it-py 3.0.0
  • markupsafe 2.1.5
  • matplotlib-inline 0.1.7
  • mdurl 0.1.2
  • mypy-extensions 1.0.0
  • nest-asyncio 1.6.0
  • nextflow 23.10.1
  • packaging 24.0
  • parso 0.8.4
  • pathspec 0.12.1
  • pexpect 4.9.0
  • platformdirs 4.2.2
  • pluggy 1.5.0
  • prompt-toolkit 3.0.45
  • psutil 5.9.8
  • ptyprocess 0.7.0
  • pure-eval 0.2.2
  • pycparser 2.22
  • pydantic 2.7.1
  • pydantic-core 2.18.2
  • pygments 2.18.0
  • pykwalify 1.8.0
  • pyrsistent 0.20.0
  • pytest 8.2.1
  • pytest-cov 5.0.0
  • python-dateutil 2.9.0.post0
  • python-gitlab 4.6.0
  • python-semantic-release 8.7.0
  • pywin32 306
  • pyzmq 26.0.3
  • requests 2.32.2
  • requests-toolbelt 1.0.0
  • rich 13.7.1
  • ruamel-yaml 0.18.6
  • ruamel-yaml-clib 0.2.8
  • setuptools 70.0.0
  • shellingham 1.5.4
  • six 1.16.0
  • smmap 5.0.1
  • stack-data 0.6.3
  • tomli 2.0.1
  • tomlkit 0.12.5
  • tornado 6.4
  • traitlets 5.14.3
  • typer 0.9.0
  • typing-extensions 4.12.0
  • urllib3 2.2.1
  • wcwidth 0.2.13
pyproject.toml pypi
  • black 23.12.1 develop
  • cffconvert 2.0.0 develop
  • flake8 7.0.0 develop
  • ipykernel 6.29.0 develop
  • nextflow 23.10.1 develop
  • pytest 7.4.4 develop
  • pytest-cov 4.1.0 develop
  • python-semantic-release 8.7.0 develop
  • Jinja2 3.1.3
  • nextflow 23.10.1
  • python ^3.10 || ^3.11 || ^3.12
  • rich 13.7.1
  • typer 0.9.0