sci_wiz-rna-seq-workflow
SCI_WIZ: Scotland institute(CRUK) workflow wizard
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (19.1%) to scientific vocabulary
Repository
SCI_WIZ: Scotland institute(CRUK) workflow wizard
Basic Info
- Host: GitHub
- Owner: Beatson-CompBio
- License: mit
- Language: Python
- Default Branch: main
- Size: 438 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
- Releases: 1
Metadata Files
README.md
CRUK Scotland Institute Workflow Wizard: sci_wiz
sci_wiz combines the Nextflow framework with a modular Python package to provide easy-to-use RNA-Seq data pre-processing built on standard, widely accepted tools. The package installs all the dependencies required to run the workflow, so you do not have to install them separately.
[!NOTE] You will need to download the STAR index, reference genome, and annotation files separately. The package will not download these files for you.
What sci_wiz can do:
- Data pre-processing: running the data pre-processing workflow carries out the steps below and generates a read count matrix. More information on the tool used in each step is available through the links.
- Trimming: FastP.
- Reads QC: FastQC and MultiQC.
- Mapping: STAR.
- Counting: FeatureCounts.
- BAM to CRAM: converts BAM files to CRAM to reduce storage.
System Requirements
Running this package requires access to a virtual machine (VM) or a high-performance computing (HPC) cluster. The requirements are:
- Python 3.10 or higher
- Slurm scheduler for running on HPC
[!NOTE] You need permission to create and remove symlinks. Increase the number of files that your system can open using
ulimit -n 3000. This number is a suggestion from the STAR developers in this issue. Without it, STAR may fail with an open file limit error.
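As a rough illustration of the open file limit mentioned above, the sketch below checks and raises the soft limit from Python using the standard-library resource module (Unix only). The function name raise_open_file_limit is hypothetical and not part of sci_wiz; the default of 3000 simply mirrors the ulimit suggestion. A limit raised this way applies only to the current Python process and the processes it starts, so it is not a replacement for running ulimit in the shell that launches the workflow.

```python
import resource


def raise_open_file_limit(target: int = 3000) -> tuple[int, int]:
    """Raise the soft open-file limit towards `target` (hypothetical helper).

    The soft limit is never pushed above the hard limit, and it is left
    untouched when it is already high enough.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # RLIM_INFINITY means "no cap"; otherwise the hard limit bounds the request.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)

    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft, hard

    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling raise_open_file_limit() returns the (soft, hard) pair that is in effect afterwards, which can be logged before starting a run.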
Quick start
To help you quickly start your RNA-seq analysis, we provide a package that is quick to install. You can use it either through the command-line interface (CLI) or by importing it in a script. We recommend running all analyses in a virtual environment so that your system configuration remains unchanged.
Installation
- Start by creating a virtual environment and activating it. After running the command below, you should see the name of your virtual environment at the far left of the terminal prompt. If not, please refer to the Python documentation on how to create a virtual environment.
```console
python -m venv .sci_wiz && source .sci_wiz/bin/activate
```
- Download the latest package .whl file from the release section of this repository and save it into your current working directory. Use the following command to install the package.
```console
pip install sci_wiz-{version}-py3-none-any.whl
```
- You can list all the functionality that sci_wiz provides with the --help option.
```console
sci_wiz --help
```
Configuration
- This required configuration step generates a user_input.ini file to store your inputs. Run the command below in your terminal:
```console
sci_wiz create-config
```
- You should now have a user_input.ini file in your current working directory. Edit this file to supply the inputs required to run the data pre-processing smoothly.
```ini
[USER_INPUT]
project_name = G12yymmuniqueName  # G12 is group code, yymm: year and month; uniqueName.
profile = hpc  # 'vm' if running on VM, 'hpc' if running on HPC.
reads = /projectname//{R1,R2}001.fastq.gz  # absolute path
output_dir = /projectname/Data/
index = STAR75bpor150bp
annotation = Org.OrgCode.110.gtf
reference = Org.OrgCode.110.fa
annotation_bed = Org.OrgCode.110.bed
batch_info = false  # batch_info True will require run1, run2, batch_destination; input reads will be ignored.
run1 =
run2 =
batch_destination =

[TRIMMING]
trimfrontread_01 = 1  # will trim the front bases from read 1
trimfrontread_02 = 1  # will trim front bases from read 2
trimtailread_01 = 0  # will trim tail bases from read 1
trimtailread_02 = 0  # will trim tail bases from read 2
```
- project_name: Name of the project. You can follow the project naming convention shown in the example above.
- profile: Type of system you are running the analysis on: vm for a virtual machine or hpc for a high-performance cluster.
- reads: Path to input raw RNA-seq reads in fastq.gz format.
- output_dir: Base path for the output directory.
- index: Path to the STAR index. Right now this workflow only supports alignment using STAR.
- annotation: Path to the GTF file containing gene annotations.
- reference: Path to the reference genome FASTA file.
- annotation_bed: Path to the BED file containing gene annotations.
- batch_info: Flag indicating whether raw files are available in multiple batches. If this is True, you will need to provide run1, run2, & batch_destination.
- run1 and run2: Paths to raw data for batch setup.
- batch_destination: Destination path for organized batch data.
- trimfrontread_01: Number of bases trimmed from front of Read_01. Default is 1.
- trimfrontread_02: Number of bases trimmed from front of Read_02. Default is 1.
- trimtailread_01: Number of bases trimmed from tail of Read_01. Default is 0.
- trimtailread_02: Number of bases trimmed from tail of Read_02. Default is 0.
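The settings above can be sanity-checked before launching a run. The following is a minimal sketch using Python's configparser, assuming the section is named USER_INPUT and the keys are spelled as in the list above. The names load_user_input and REQUIRED_KEYS are hypothetical, and this is not necessarily how sci_wiz itself parses the file.

```python
import configparser

# Keys assumed to be mandatory, taken from the parameter list above.
REQUIRED_KEYS = (
    "project_name", "profile", "reads", "output_dir",
    "index", "annotation", "reference", "annotation_bed",
)
BATCH_KEYS = ("run1", "run2", "batch_destination")


def load_user_input(text: str) -> dict[str, str]:
    """Parse the contents of user_input.ini and check that required values are set."""
    # inline_comment_prefixes lets "key = value  # comment" lines parse cleanly;
    # interpolation is disabled so that '%' in a path is taken literally.
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",), interpolation=None
    )
    parser.read_string(text)
    section = parser["USER_INPUT"]

    missing = [key for key in REQUIRED_KEYS if not section.get(key, "").strip()]
    if missing:
        raise ValueError(f"missing values for: {', '.join(missing)}")

    if section["profile"] not in ("vm", "hpc"):
        raise ValueError("profile must be 'vm' or 'hpc'")

    # When batch_info is true, the batch paths must be filled in as well.
    if section.getboolean("batch_info", fallback=False):
        empty = [key for key in BATCH_KEYS if not section.get(key, "").strip()]
        if empty:
            raise ValueError(f"batch_info is true but these are empty: {', '.join(empty)}")

    return dict(section)
```

To check a file on disk, pass its contents, for example load_user_input(open("user_input.ini").read()).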
Data Pre-processing
[!IMPORTANT] If you want to use the default trimming inputs, run the pre-processing command directly. Otherwise, read the "Trimming raw data" section first. Please make sure the system requirements are met before running the commands below.
- Trigger data pre-processing with the command below. The program should run smoothly if these two conditions are satisfied:
- The inputs are as expected.
- You have permission to read all the required files and folders, such as input.fastq.gz, the index folder, annotation.gtf, annotation.bed, and reference.fa.
```console
sci_wiz run-preprocessing
```
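The read-permission condition above can be checked ahead of time. The sketch below, assuming you collect the input paths yourself, reports any path that is missing or unreadable for the current user. The helper name unreadable_paths is hypothetical and not part of sci_wiz.

```python
import os
from pathlib import Path


def unreadable_paths(paths):
    """Return the given paths that are missing or not readable by the current user."""
    problems = []
    for raw in paths:
        path = Path(raw)
        # A directory (such as the STAR index folder) also needs execute
        # permission before its contents can be listed and opened.
        mode = os.R_OK | os.X_OK if path.is_dir() else os.R_OK
        if not path.exists() or not os.access(path, mode):
            problems.append(str(path))
    return problems
```

An empty list means every path passed the check; otherwise the returned entries are the ones to fix before running the pre-processing command.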
Running on a virtual machine (VM)
- Make sure you have entered vm as your profile in user_input.ini. For example:
```ini
[USER_INPUT]
profile = vm
...
```
Running on a high-performance cluster (HPC)
Dependency: The current workflow is only configured to work with SLURM.
Make sure your input data is available in the shared scratch space, preferably in your current working directory.
To run data pre-processing on an HPC, set the profile to hpc in user_input.ini:
```ini
[USER_INPUT]
profile = hpc
...
```
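Because the hpc profile depends on SLURM, it can help to confirm that the SLURM client commands are on your PATH before choosing it. This is a small sketch under that assumption; slurm_available and suggest_profile are hypothetical helpers, and sci_wiz may detect or configure the scheduler differently.

```python
import shutil


def slurm_available(commands=("sbatch", "squeue")) -> bool:
    """Return True if every listed SLURM client command is found on PATH."""
    return all(shutil.which(command) is not None for command in commands)


def suggest_profile() -> str:
    """Suggest 'hpc' when SLURM commands are present, otherwise 'vm'."""
    return "hpc" if slurm_available() else "vm"
```

The suggestion is only a hint: a login node can expose sbatch while your data sits outside the shared scratch space, so the profile in user_input.ini remains your decision.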
Trimming raw data
The Illumina Stranded library preparation kit is assumed by default. This kit requires trimming the first base from both reads, and the defaults in the user_input.ini file are set accordingly. If you are using a different library preparation kit, the trimming parameters may differ, so please check the documentation for your kit. If the kit requires different trimming, or if you want to switch trimming off, edit the user_input.ini file. The workflow uses FastP, and the user_input.ini file uses the same flags as FastP but for a controlled set of parameters. The following command runs only the initial QC step, without trimming:
```console
sci_wiz run-initial-qc
```
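To make the link between the four trimming settings and FastP concrete, the sketch below builds a FastP argument list for one read pair. The flags -i, -I, -o, -O, --trim_front1, --trim_front2, --trim_tail1, and --trim_tail2 are FastP's own options; the function names, the parameter names, and the way sci_wiz actually passes these values to FastP are assumptions made for illustration.

```python
def fastp_trim_args(front_1=1, front_2=1, tail_1=0, tail_2=0):
    """Translate the four trimming settings into FastP flags (hypothetical helper)."""
    values = {
        "--trim_front1": front_1,  # bases removed from the start of read 1
        "--trim_front2": front_2,  # bases removed from the start of read 2
        "--trim_tail1": tail_1,    # bases removed from the end of read 1
        "--trim_tail2": tail_2,    # bases removed from the end of read 2
    }
    args = []
    for flag, count in values.items():
        if int(count) < 0:
            raise ValueError(f"{flag} must be zero or positive, got {count}")
        args += [flag, str(int(count))]
    return args


def fastp_command(read_1, read_2, out_1, out_2, **trim):
    """Assemble a FastP command line for one paired-end sample without running it."""
    return ["fastp", "-i", read_1, "-I", read_2, "-o", out_1, "-O", out_2,
            *fastp_trim_args(**trim)]
```

With the defaults, the command trims one base from the front of each read and nothing from the tails, matching the default values listed for the trimming settings. Passing front_1=0 and front_2=0 corresponds to switching trimming off.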
Import as a module
You can also import the rna_seq module and carry out all the steps above in a script or Jupyter notebook. Here is a quick example:
```python
from sci_wiz import rna_seq

rna_seq.generate_config()

# run the step below after editing user_input.ini
rna_seq.launch_data_preprocessing()
```
Here is a Jupyter notebook with the same steps that you can expand according to your use case.
Report issues
If you find any issues with our code, you can reach out to us by:
Reporting Issues: If you encounter any issues or bugs, please create a detailed issue report on the repository.
Providing Feedback: Share your feedback on existing features or suggest improvements.
Documentation Edits: If you find any discrepancies or have suggestions for improving the documentation, feel free to submit edits or open an issue.
Citation
If you find sci_wiz useful in your research, please consider citing it:
```bibtex
@software{sci_wiz,
  author = {Ojo, Ifedayo and Sikarwar, Mayank and Kwan, Ryan and Shaw, Robin and Miller, Crispin},
  month = {1},
  title = {CRUK Scotland Institute Workflow Wizard: sci_wiz},
  url = {https://github.com/Beatson-CompBio/RNA-seq-workflow},
  year = {2024}
}
```
How to cite dependencies?
We would really appreciate it if you could also cite the dependencies using this bib file:
- Nextflow
- Bamtools
- FeatureCounts
- Fastp
- STAR
- Multiqc
Owner
- Name: CRUK Beatson Institute - Computational Biology
- Login: Beatson-CompBio
- Kind: organization
- Repositories: 1
- Profile: https://github.com/Beatson-CompBio
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: 'CRUK Scotland Institute Workflow Wizard: sci_wiz'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Ifedayo
    family-names: Ojo
    email: i.ojo@crukscotlandinstitute.ac.uk
    affiliation: CRUK Scotland Institute
  - given-names: Mayank
    family-names: Sikarwar
    email: m.sikarwar@crukscotlandinstitute.ac.uk
    affiliation: CRUK Scotland Institute
  - given-names: Ryan
    family-names: Kwan
    email: s.kwan@crukscotlandinstitute.ac.uk
    affiliation: CRUK Scotland Institute
  - given-names: Robin
    family-names: Shaw
    email: r.shaw@crukscotlandinstitute.ac.uk
    affiliation: CRUK Scotland Institute
  - given-names: Crispin
    family-names: Miller
    email: crispin.miller@glasgow.ac.uk
    affiliation: CRUK Scotland Institute
identifiers:
  - type: url
    value: >-
      https://github.com/Beatson-CompBio/sci_wiz-rna-seq-workflow
    description: 'SCI_WIZ: Scotland institute(CRUK) workflow wizard'
repository-code: >-
  https://github.com/Beatson-CompBio/sci_wiz-rna-seq-workflow
license: MIT
version: 1.0.0
GitHub Events
Total
- Issues event: 7
- Fork event: 1
Last Year
- Issues event: 7
- Fork event: 1
Dependencies
- actions/checkout v3 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/checkout v4 composite
- actions/setup-go v4 composite
- eWaterCycle/setup-singularity v7 composite
- nf-core/setup-nextflow v1 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/upload-artifact v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/upload-artifact v2 composite
- annotated-types 0.7.0
- appnope 0.1.4
- asttokens 2.4.1
- attrs 23.2.0
- black 24.4.2
- certifi 2024.2.2
- cffconvert 2.0.0
- cffi 1.16.0
- charset-normalizer 3.3.2
- click 8.1.7
- colorama 0.4.6
- comm 0.2.2
- coverage 7.5.2
- debugpy 1.8.1
- decorator 5.1.1
- docopt 0.6.2
- dotty-dict 1.3.1
- exceptiongroup 1.2.1
- executing 2.0.1
- gitdb 4.0.11
- gitpython 3.1.43
- idna 3.7
- importlib-resources 6.4.0
- iniconfig 2.0.0
- ipykernel 6.29.4
- ipython 8.24.0
- jedi 0.19.1
- jinja2 3.1.3
- jsonschema 3.2.0
- jupyter-client 8.6.2
- jupyter-core 5.7.2
- markdown-it-py 3.0.0
- markupsafe 2.1.5
- matplotlib-inline 0.1.7
- mdurl 0.1.2
- mypy-extensions 1.0.0
- nest-asyncio 1.6.0
- nextflow 23.10.1
- packaging 24.0
- parso 0.8.4
- pathspec 0.12.1
- pexpect 4.9.0
- platformdirs 4.2.2
- pluggy 1.5.0
- prompt-toolkit 3.0.45
- psutil 5.9.8
- ptyprocess 0.7.0
- pure-eval 0.2.2
- pycparser 2.22
- pydantic 2.7.1
- pydantic-core 2.18.2
- pygments 2.18.0
- pykwalify 1.8.0
- pyrsistent 0.20.0
- pytest 8.2.1
- pytest-cov 5.0.0
- python-dateutil 2.9.0.post0
- python-gitlab 4.6.0
- python-semantic-release 8.7.0
- pywin32 306
- pyzmq 26.0.3
- requests 2.32.2
- requests-toolbelt 1.0.0
- rich 13.7.1
- ruamel-yaml 0.18.6
- ruamel-yaml-clib 0.2.8
- setuptools 70.0.0
- shellingham 1.5.4
- six 1.16.0
- smmap 5.0.1
- stack-data 0.6.3
- tomli 2.0.1
- tomlkit 0.12.5
- tornado 6.4
- traitlets 5.14.3
- typer 0.9.0
- typing-extensions 4.12.0
- urllib3 2.2.1
- wcwidth 0.2.13
- black 23.12.1 develop
- cffconvert 2.0.0 develop
- flake8 7.0.0 develop
- ipykernel 6.29.0 develop
- nextflow 23.10.1 develop
- pytest 7.4.4 develop
- pytest-cov 4.1.0 develop
- python-semantic-release 8.7.0 develop
- Jinja2 3.1.3
- nextflow 23.10.1
- python ^3.10 || ^3.11 || ^3.12
- rich 13.7.1
- typer 0.9.0