Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: deluair
- License: mit
- Language: Python
- Default Branch: master
- Size: 0 Bytes
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
RNA-seq Analysis Platform
This repository contains a comprehensive platform for RNA-seq data analysis, from raw data retrieval to differential expression and splicing analysis. It features a user-friendly web interface and a powerful command-line pipeline, making it suitable for both biologists and bioinformaticians. The platform is designed to be robust, with features like automatic retry mechanisms for API calls and fallback mock data generation.
Screenshots
[Screenshot placeholder: Web App]
To-Do
- [ ] Add more tests
- [ ] Add more documentation
- [ ] Add more features
Contact
If you have any questions, please contact us at deluair@gmail.com.
Acknowledgements
This project was inspired by the work of others.
Support
If you need support, please open an issue on GitHub.
Authors
- M. Deluair Hossen - Initial work - deluair
See also the list of contributors who participated in this project.
Versioning
We use SemVer for versioning. For the versions available, see the tags on this repository.
Features
- Automated Data Retrieval: Download data directly from NCBI's Sequence Read Archive (SRA).
- Robust Error Handling: Includes exponential backoff and retry mechanisms for NCBI API calls to handle rate limiting.
- Mock Data Generation: Can generate realistic mock FASTQ files if downloads fail, ensuring the pipeline can run end-to-end for demonstration and testing.
- Quality Control: Integrated FastQC for assessing raw read quality.
- Alignment: Uses the STAR aligner for mapping reads to a reference genome.
- Differential Expression Analysis:
  - Primary analysis using R/DESeq2.
  - A pure Python-based implementation as a fallback for systems without R.
  - Generates volcano plots and MA plots.
- Splicing Analysis: A mock module to identify and visualize alternative splicing events (SE, MXE, A3SS, A5SS, RI).
- Web Interface: A Flask-based web application to run and monitor analysis steps.
- Comprehensive Reporting: Generates detailed HTML reports for each analysis stage.
- Scientific Manuscript Generation: Includes a LaTeX template and Makefile to automatically generate a publication-quality manuscript from the analysis results.
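As an illustration of the mock data generation feature listed above, here is a minimal sketch of how a mock FASTQ file could be produced. The function name `make_mock_fastq`, the read naming scheme, and the quality range are assumptions made for this example; the repository's actual generator may work differently.

```python
import random


def make_mock_fastq(n_reads, read_len=50, seed=0):
    """Return FASTQ-formatted text with n_reads random reads (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so repeated runs give identical mock data
    records = []
    for i in range(n_reads):
        # Random nucleotide sequence of fixed length
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        # Phred+33 encoded qualities between Q20 and Q40 (characters '5' to 'I')
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        # A FASTQ record is four lines: header, sequence, separator, qualities
        records.append(f"@mock_read_{i + 1}\n{seq}\n+\n{qual}\n")
    return "".join(records)


if __name__ == "__main__":
    print(make_mock_fastq(2, read_len=20), end="")
```

Seeding the generator keeps demonstration runs reproducible, which matters when the mock data feeds tests of downstream steps.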
Project Structure
rnaseq/
├── app.py # Flask web application
├── pipeline.py # Main command-line pipeline script
├── demo.py # Script to run a quick demonstration
├── run_full_analysis.py # Script to run the full analysis pipeline with mock data
├── requirements.txt # Python package dependencies
├── setup.py # Project setup script
├── config/ # Configuration files for analysis runs
├── data/ # Raw and reference data
├── results/ # Output directory for analysis results
├── logs/ # Log files
├── manuscript/ # LaTeX source and figures for the scientific manuscript
├── src/ # Main source code
│ ├── data_retrieval/ # Data downloaders (e.g., from NCBI)
│ ├── preprocessing/ # QC and read trimming tools
│ ├── alignment/ # STAR aligner wrapper
│ ├── analysis/ # DE and splicing analysis modules
│ ├── quantification/ # Gene quantification tools
│ └── visualization/ # Plot generation modules
├── static/ # CSS and JavaScript for the web app
├── templates/ # HTML templates for the web app
└── README.md # This file
Getting Started
Prerequisites
Before you begin, ensure you have the following dependencies installed:
- Python 3.8+
- Miniconda or Anaconda: Recommended for managing environments.
- Bioinformatics Tools:
  - SRA Toolkit: For downloading data from NCBI.
  - FastQC: For quality control of raw sequencing data.
  - STAR: For aligning reads to the reference genome.
- R and DESeq2 (Optional but recommended): For differential expression analysis.
- LaTeX Distribution (Optional): For compiling the manuscript (e.g., MacTeX on macOS, TeX Live on Linux).
On macOS, you can install most of these with Homebrew:
```bash
brew install sratoolkit
brew install fastqc
brew install star
brew install R
```
Installation
- Clone the repository:

```bash
git clone https://github.com/deluair/RNAseq.git
cd RNAseq
```

- Create and activate a conda environment:

```bash
conda create -n rnaseq_env python=3.9
conda activate rnaseq_env
```

- Install Python dependencies:

```bash
pip install -r requirements.txt
```

- Install DESeq2 in R (if you installed R):

```R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
```

- Download Reference Genome and Annotation: Place your reference genome (.fa) and gene annotation (.gtf) files in the data/reference/ directory. Example files for soybean are provided. Note that these files are not tracked by Git and must be acquired separately.
Usage
You can use the Makefile for common tasks:
- make install: Install dependencies.
- make lint: Lint the code.
- make test: Run tests.
- make run: Run the web application.
- make demo: Run the demo script.
- make full_analysis: Run the full analysis script.
- make manuscript: Compile the manuscript.
- make clean: Remove temporary files.
- make all: Install, lint, and test.
Web Application
To start the web application, run:
```bash
make run
```
The application will be accessible at http://127.0.0.1:5001. The web interface allows you to:
- Search for and download datasets from NCBI.
- Run quality control.
- Perform differential expression and splicing analysis.
- View results and reports.
Demonstration Script
The demo.py script runs a small, pre-configured analysis on sample data.
```bash
make demo
```
This will generate sample reports in the demo_results/ directory.
Full Analysis Pipeline
The run_full_analysis.py script executes the complete end-to-end pipeline using mock data. This is useful for testing the integration of all modules.
```bash
make full_analysis
```
This script simulates a full analysis run, including DE and splicing analysis, and generates a comprehensive HTML report in results/full_analysis_example/.
Manuscript Generation
A LaTeX manuscript template is provided in the manuscript/ directory. After running an analysis, you can compile the manuscript to a PDF.
```bash
make manuscript
```
This will produce a soybean_drought_rnaseq.pdf file.
Modules
Data Retrieval (src/data_retrieval)
Handles the download of public datasets from NCBI SRA. It is designed to be resilient to network issues and API limits, using a retry strategy with exponential backoff. If a download ultimately fails, it can create a mock FASTQ file to allow downstream steps to proceed.
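The retry strategy described above can be sketched as follows. This is a generic illustration of exponential backoff with jitter, not the module's actual code: the name `retry_with_backoff` and all parameter defaults are assumptions chosen for the example.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is exhausted (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles on each failed attempt, capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% random jitter so parallel clients do not retry in lockstep
            sleep(delay + random.uniform(0, 0.1 * delay))
```

A caller could wrap a download function in this helper and fall back to mock data generation only when the final exception propagates. Passing `sleep` as a parameter makes the waiting behaviour easy to replace in tests.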
Quality Control (src/preprocessing)
Uses FastQC to generate quality reports for each FASTQ file, which are essential for identifying issues with raw sequencing data.
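A minimal sketch of how a Python wrapper might invoke FastQC is shown below. The function names are hypothetical and the repository's wrapper may differ; the `--outdir` and `--threads` options are standard FastQC command-line flags.

```python
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_files, out_dir, threads=1):
    """Build the FastQC argument list (hypothetical helper)."""
    return ["fastqc", "--outdir", str(out_dir), "--threads", str(threads),
            *[str(f) for f in fastq_files]]


def run_fastqc(fastq_files, out_dir, threads=1):
    """Run FastQC on the given files; requires fastqc on the PATH."""
    # FastQC expects the output directory to exist already
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if FastQC exits with an error
    subprocess.run(build_fastqc_command(fastq_files, out_dir, threads), check=True)
```

Keeping command construction separate from execution lets the argument list be checked without FastQC installed.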
Alignment (src/alignment)
A wrapper for the STAR aligner. It takes FASTQ files and a reference genome index to produce BAM alignment files.
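The following sketch shows how such a wrapper might assemble a STAR command line. The function name and default thread count are assumptions for illustration; the flags themselves (`--runThreadN`, `--genomeDir`, `--readFilesIn`, `--outFileNamePrefix`, `--outSAMtype`) are standard STAR options.

```python
def build_star_command(genome_dir, reads, out_prefix, threads=4):
    """Build a STAR alignment command for single- or paired-end reads (hypothetical helper)."""
    if not 1 <= len(reads) <= 2:
        raise ValueError("expected one FASTQ file (single-end) or two (paired-end)")
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", str(genome_dir),        # directory holding the STAR genome index
        "--readFilesIn", *[str(r) for r in reads],
        "--outFileNamePrefix", str(out_prefix),
        # Write a coordinate-sorted BAM instead of the default SAM output
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ]
```

With these options STAR writes the alignments to `<out_prefix>Aligned.sortedByCoord.out.bam`. The list can be passed directly to `subprocess.run(..., check=True)`.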
Differential Expression Analysis (src/analysis)
- deseq2_analyzer.py: Interfaces with R/DESeq2 for robust DE analysis.
- python_de_analyzer.py: A pure Python implementation providing an alternative when R is not available. It performs normalization and statistical tests to find differentially expressed genes.
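The README does not say which normalization or test the Python fallback uses. As one plausible illustration, the sketch below implements DESeq2-style median-of-ratios size factors and Benjamini-Hochberg p-value adjustment using only the standard library. The function names and input layout are assumptions, and the per-gene statistical test itself is left out.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors; counts maps gene -> list of per-sample counts."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue  # genes with a zero count have no finite geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Per sample: median ratio of its counts to the per-gene geometric mean
    return [math.exp(median(r)) for r in log_ratios]


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Dividing each sample's counts by its size factor puts samples on a comparable scale before testing; the adjusted p-values then control the false discovery rate across genes.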
Splicing Analysis (src/analysis)
The splicing_analyzer.py module provides a framework for identifying alternative splicing events. The current implementation is a mock that generates a realistic report, but it can be extended with tools like rMATS.
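To illustrate what a mock splicing module of this kind might produce, the sketch below generates random events of the five types named in the feature list (SE, MXE, A3SS, A5SS, RI). The field names, the PSI (percent spliced in) values, and the function names are all invented for this example and do not reflect the repository's actual output format.

```python
import random
from collections import Counter

EVENT_TYPES = ("SE", "MXE", "A3SS", "A5SS", "RI")


def mock_splicing_events(n_events, seed=0):
    """Generate mock alternative splicing events (hypothetical format)."""
    rng = random.Random(seed)  # seeded for reproducible mock reports
    events = []
    for i in range(n_events):
        psi_control = rng.random()
        psi_treated = rng.random()
        events.append({
            "event_id": f"mock_event_{i + 1}",
            "type": rng.choice(EVENT_TYPES),
            "psi_control": round(psi_control, 3),
            "psi_treated": round(psi_treated, 3),
            # Difference in inclusion level between conditions
            "delta_psi": round(psi_treated - psi_control, 3),
        })
    return events


def count_by_type(events):
    """Summarise how many events of each type were generated."""
    return Counter(event["type"] for event in events)
```

A real implementation would replace `mock_splicing_events` with parsed output from a tool such as rMATS while keeping the summary and reporting steps unchanged.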
Visualization (src/visualization)
Generates various plots for interpreting the analysis results, including volcano plots, MA plots, heatmaps, and splicing event diagrams.
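As a small example of the data preparation behind a volcano plot, the sketch below converts differential expression results into plot coordinates and significance labels. The thresholds, labels, and function name are assumptions for illustration; the actual plotting would be done with a library such as matplotlib or plotly, both listed in the dependencies.

```python
import math


def volcano_points(results, lfc_threshold=1.0, padj_threshold=0.05):
    """Map (gene, log2_fold_change, padj) tuples to volcano plot points (hypothetical helper)."""
    points = []
    for gene, lfc, padj in results:
        # y axis: -log10 of the adjusted p-value, floored to avoid log(0)
        y = -math.log10(max(padj, 1e-300))
        if padj < padj_threshold and lfc >= lfc_threshold:
            label = "up"
        elif padj < padj_threshold and lfc <= -lfc_threshold:
            label = "down"
        else:
            label = "ns"  # not significant at the chosen thresholds
        points.append((gene, lfc, y, label))
    return points
```

The labels can then drive point colours, with log2 fold change on the x axis and the transformed p-value on the y axis.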
Contributing
Contributions are welcome! Please feel free to submit a pull request or open an issue to discuss proposed changes.
- Fork the repository.
- Create a new branch (git checkout -b feature/your-feature-name).
- Make your changes.
- Commit your changes (git commit -m 'Add some feature').
- Push to the branch (git push origin feature/your-feature-name).
- Open a pull request.
License
This project is licensed under the MIT License - see the LICENSE.md file for details.
Citation
If you use this software, please cite it as below.
@software{hossen_2024_rnaseq,
author = {Hossen, M. Deluair},
title = {{RNA-seq Analysis Platform}},
month = jan,
year = 2024,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.1234},
url = {https://doi.org/10.5281/zenodo.1234}
}
Owner
- Name: Md Deluair Hossen
- Login: deluair
- Kind: user
- Location: Knoxville, TN
- Company: University of Tennessee
- Website: https://deluair.com/
- Repositories: 1
- Profile: https://github.com/deluair
Postdoctoral research associate, Economics
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Hossen"
    given-names: "M. Deluair"
    orcid: "https://orcid.org/0000-0002-deluair-hossen"
title: "RNA-seq Analysis Platform"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2024-01-01
GitHub Events
Total
- Push event: 38
- Create event: 2
Last Year
- Push event: 38
- Create event: 2
Dependencies
- actions/checkout v2 composite
- actions/setup-python v2 composite
- biopython >=1.79
- click >=8.0.0
- dash >=2.0.0
- dash-bootstrap-components >=1.0.0
- flask >=2.0.0
- flask-cors >=3.0.10
- loguru >=0.6.0
- matplotlib >=3.5.0
- numpy >=1.21.0
- pandas >=1.3.0
- plotly >=5.0.0
- pysam >=0.19.0
- pyyaml >=6.0
- requests >=2.25.0
- scikit-learn >=1.0.0
- scipy >=1.7.0
- seaborn >=0.11.0
- tqdm >=4.62.0
- urllib3 >=1.26.0