deg-pipeline-assistant

This pipeline takes normalized RNA-seq data and outputs differentially expressed genes and pathways.

https://github.com/shaan7071/deg-pipeline-assistant

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.2%) to scientific vocabulary

Last synced: 9 months ago

Repository


Basic Info

Host: GitHub
Owner: Shaan7071
License: mit
Language: Python
Default Branch: main
Size: 204 KB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

DEG Pipeline Assistant

A comprehensive tool for analyzing RNA-seq data, identifying differentially expressed genes (DEGs), and performing pathway enrichment analysis.

Overview

This pipeline takes normalized RNA-seq data and performs a complete analysis workflow, generating various visualizations and identifying biologically significant patterns. The tool features both a command-line interface and a user-friendly Streamlit web application, making it accessible to both bioinformaticians and researchers with limited programming experience.

Features

Data Processing: Creates standardized metadata from normalized counts
Quality Control: Generates boxplots, correlation heatmaps, and PCA plots
Differential Expression Analysis: Processes DESeq2 files to identify DEGs
Visualization: Creates MA plots, volcano plots, and DEG heatmaps
Ortholog Mapping: Maps genes to human orthologs for cross-species analysis
Enrichment Analysis: Performs GO and KEGG pathway enrichment analysis
Interactive UI: Streamlit-based web interface for easy parameter configuration
AI Assistant: Integrated AI functionality to help with parameter selection
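
To make the quality-control step more concrete, here is a minimal sketch of how PCA coordinates could be computed from a normalized count matrix before plotting. It is not taken from the pipeline's code: the function name `pca_scores`, the log2(x + 1) transform, and the simulated counts are assumptions for illustration, and the sketch uses a plain NumPy SVD rather than whatever routine the pipeline calls.

```python
import numpy as np


def pca_scores(counts, n_components=2):
    """Return per-sample PCA scores and explained-variance ratios.

    counts: genes x samples array of normalized counts (assumed layout).
    """
    # Log-transform to dampen the influence of highly expressed genes.
    logged = np.log2(np.asarray(counts, dtype=float) + 1.0)
    # PCA treats samples as observations, so transpose to samples x genes.
    samples = logged.T
    centered = samples - samples.mean(axis=0)
    # SVD of the centered matrix gives the principal axes directly.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


# Simulated data: 50 genes, 3 control and 3 treated samples.
rng = np.random.default_rng(0)
control = rng.poisson(100, size=(50, 3))
treated = rng.poisson(100, size=(50, 3))
treated[:10] *= 8  # hypothetical effect: the first 10 genes are up-regulated
counts = np.hstack([control, treated])

scores, explained = pca_scores(counts)
```

In this simulated example the first component should separate the control samples from the treated ones, which is the pattern a PCA plot is meant to reveal.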

Installation

Prerequisites

Python 3.13+
Required Python packages (install via pip): pandas numpy matplotlib seaborn scipy statsmodels scikit-learn streamlit click requests gseapy mygene openai

Setup

Clone the repository:
    git clone https://github.com/shaan7071/deg-pipeline-assistant.git
Navigate to the project directory:
    cd deg-pipeline-assistant
Install dependencies:
    pip install -r requirements.txt
Set up the OpenAI API key (for the AI assistant functionality):
    export OPENAI_API_KEY="your-api-key"

Usage

Command Line Interface

The pipeline can be run using the CLI with two main commands:

Setup: Initialize the directory structure

    python pipeline_CLI.py \
        --norm-file "path/to/normalized_data.csv" \
        --pw-data "path/to/pairwise_data" \
        --base-dir "output_directory" \
        --num-replicates 3 \
        --conditions "Control" \
        --conditions "Treatment1" \
        --conditions "Treatment2" \
        setup

Run Analysis: Execute the full analysis pipeline

    python pipeline_CLI.py \
        --norm-file "path/to/normalized_data.csv" \
        --pw-data "path/to/pairwise_data" \
        --base-dir "output_directory" \
        --num-replicates 3 \
        --conditions "Control" \
        --conditions "Treatment1" \
        --conditions "Treatment2" \
        run-all \
        --model-organism "drerio" \
        --pw-interest "Treatment1_vs_Control" \
        --pw-interest "Treatment2_vs_Control" \
        --log2fc-threshold 1.0 \
        --padj-threshold 0.05 \
        --enrich-sig-cutoff 0.05

Web Interface

For a more user-friendly experience, run the Streamlit app:

    streamlit run app.py

This will open a web interface where you can:
- Input all parameters through form fields
- Generate and review pipeline commands
- Execute commands directly from the interface
- View real-time command output
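
As a rough illustration of the "generate, review, and execute" flow, the sketch below assembles a `pipeline_CLI.py` command from form-style inputs and streams its output line by line. It uses only the flags documented in this README; the helper names `build_command` and `stream_command` are hypothetical and are not taken from `app.py`.

```python
import shlex
import subprocess


def build_command(norm_file, pw_data, base_dir, num_replicates,
                  conditions, subcommand, extra_args=()):
    """Assemble the argument list for pipeline_CLI.py (hypothetical helper)."""
    args = [
        "python", "pipeline_CLI.py",
        "--norm-file", norm_file,
        "--pw-data", pw_data,
        "--base-dir", base_dir,
        "--num-replicates", str(num_replicates),
    ]
    # Each condition is passed with its own repeated --conditions flag.
    for condition in conditions:
        args += ["--conditions", condition]
    # Shared options come first, then the subcommand and its own options.
    args.append(subcommand)
    args += list(extra_args)
    return args


def stream_command(args):
    """Run the command and yield its output one line at a time."""
    with subprocess.Popen(args, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")


setup_cmd = build_command(
    "data/norm.csv", "data/pairwise", "results", 3,
    ["Control", "Treatment1"], "setup",
)
# shlex.join gives a copy-pasteable string for review before running.
preview = shlex.join(setup_cmd)
print(preview)
```

Keeping the command as a list until it is displayed avoids shell-quoting problems when paths or condition names contain spaces.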

Alternatively, deploy this repository on Streamlit Community Cloud (https://streamlit.io/) to handle data that requires more memory.

Parameters

Setup Parameters

norm-file: Path to the normalized data file (CSV format)
pw-data: Directory containing pairwise comparison files
base-dir: Base output directory for results
conditions: Experimental conditions (specify multiple with repeated flags)
num-replicates: Number of replicates per condition
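
The `conditions` and `num-replicates` parameters together describe the experimental design. The sketch below shows one way a sample metadata table could be derived from them. The function name `build_metadata` and the `<condition>_rep<n>` sample naming scheme are assumptions for illustration; the pipeline's actual naming may differ.

```python
import pandas as pd


def build_metadata(conditions, num_replicates):
    """Build a one-row-per-sample table from conditions and replicate count."""
    rows = [
        # Hypothetical naming scheme: Control_rep1, Control_rep2, ...
        {"sample": f"{condition}_rep{rep}", "condition": condition, "replicate": rep}
        for condition in conditions
        for rep in range(1, num_replicates + 1)
    ]
    return pd.DataFrame(rows).set_index("sample")


metadata = build_metadata(["Control", "Treatment1", "Treatment2"], num_replicates=3)
```

With three conditions and three replicates each, the resulting table has nine rows, one per sample.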

Analysis Parameters

model-organism: Model organism code (e.g., "drerio" for zebrafish)
pw-interest: Pairwise comparisons of interest (e.g., "Treatment_vs_Control")
log2fc-threshold: Log2 fold change threshold for DEG identification (default: 1.0)
padj-threshold: Adjusted p-value threshold (default: 0.05)
enrich-sig-cutoff: Significance cutoff for enrichment analysis (default: 0.05)
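
To show how the two DEG thresholds interact, here is a minimal sketch that labels genes in a DESeq2-style results table. It assumes the standard DESeq2 column names `log2FoldChange` and `padj`; the function name `classify_degs`, the inclusive fold-change comparison, and the example values are illustrative assumptions rather than the pipeline's own logic.

```python
import pandas as pd


def classify_degs(results, log2fc_threshold=1.0, padj_threshold=0.05):
    """Label each gene as 'up', 'down', or 'ns' (not significant)."""
    # Genes without a fold change or adjusted p-value cannot be classified.
    table = results.dropna(subset=["log2FoldChange", "padj"]).copy()
    significant = table["padj"] < padj_threshold
    up = significant & (table["log2FoldChange"] >= log2fc_threshold)
    down = significant & (table["log2FoldChange"] <= -log2fc_threshold)
    table["regulation"] = "ns"
    table.loc[up, "regulation"] = "up"
    table.loc[down, "regulation"] = "down"
    return table


# Made-up values, for illustration only.
example = pd.DataFrame(
    {
        "log2FoldChange": [2.3, -1.8, 0.4, 3.1, None],
        "padj": [0.001, 0.02, 0.01, 0.2, 0.03],
    },
    index=["geneA", "geneB", "geneC", "geneD", "geneE"],
)
labelled = classify_degs(example)
```

A gene must pass both thresholds to count as a DEG: `geneC` is significant but changes too little, while `geneD` changes a lot but is not significant.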

Optional Flags

Skip specific analysis steps with these flags:
- --skip-boxplots: Skip boxplot generation
- --skip-correlation-heatmap: Skip correlation heatmap
- --skip-pca: Skip PCA analysis
- --skip-ma-plots: Skip MA plot generation
- --skip-volcano-plots: Skip volcano plot generation
- --skip-heatmap: Skip DEG heatmap
- --skip-go: Skip GO enrichment analysis
- --skip-kegg: Skip KEGG enrichment analysis

Output Structure

The pipeline creates a standardized directory structure:

base_dir/
├── data/
│   ├── DESeq2/
│   ├── ortholog_mapping/
│   └── GO_and_KEGG_enrichments/
├── plots/
│   ├── boxplots/
│   ├── correlation_heatmap/
│   ├── PCA/
│   ├── MA_plots/
│   ├── volcano_plots/
│   ├── DEG_heatmap/
│   ├── GO_enrichment/
│   └── KEGG_enrichment/
└── results/
    ├── DEGs/
    ├── GO_enrichment/
    └── KEGG_enrichment/
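
The layout above could be created with a few lines of `pathlib`. This is a sketch of what a setup step might do, not the pipeline's implementation; the `LAYOUT` mapping and the `create_layout` function are hypothetical names, and the folder names are copied from the tree above.

```python
import tempfile
from pathlib import Path

# Folder names copied from the documented output structure.
LAYOUT = {
    "data": ["DESeq2", "ortholog_mapping", "GO_and_KEGG_enrichments"],
    "plots": [
        "boxplots", "correlation_heatmap", "PCA", "MA_plots",
        "volcano_plots", "DEG_heatmap", "GO_enrichment", "KEGG_enrichment",
    ],
    "results": ["DEGs", "GO_enrichment", "KEGG_enrichment"],
}


def create_layout(base_dir):
    """Create every subdirectory under base_dir and return the paths."""
    base = Path(base_dir)
    created = []
    for parent, children in LAYOUT.items():
        for child in children:
            path = base / parent / child
            # exist_ok makes the step safe to re-run on an existing tree.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


# Demonstrate on a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    created = create_layout(tmp)
    n_dirs = len(created)
    all_exist = all(path.is_dir() for path in created)
```

Because `mkdir` is called with `exist_ok=True`, running the step twice on the same base directory does not raise an error.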

File Descriptions

app.py: Streamlit web application interface
ai_assistant.py: AI functionality that collects parameters from the user in natural language and transforms them into valid commands
pipeline_CLI.py: Command-line interface for the pipeline
DEGpipeline.py: Core pipeline functionality and analysis modules

Example Workflow

Prepare normalized RNA-seq data in CSV format
Prepare pairwise comparison files
Run the setup command to create directory structure
Run the analysis command with appropriate parameters
Examine the generated plots and results files
Interpret biological significance of DEGs and enriched pathways

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions or support, please contact i.banwait7@gmail.com.

Acknowledgments

This pipeline uses several open-source libraries including pandas, matplotlib, seaborn, and gseapy
Ortholog mapping is performed using the g:Profiler API
Pathway enrichment analysis uses the Enrichr API through gseapy

Owner

Name: Ishaan Banwait
Login: Shaan7071
Kind: user

Repositories: 1
Profile: https://github.com/Shaan7071

Bioinformatics student aiming to have an impact in the global healthcare industry.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Banwait"
    given-names: "Ishaan"
    orcid: "https://orcid.org/0009-0002-1431-4599"
title: "DEG Pipeline Assistant"
version: 1.0.0
date-released: 2025-04-17
url: "https://github.com/Shaan7071/DEG-pipeline-assistant"
repository-code: "https://github.com/yourusername/rnaseq-pipeline-assistant"
abstract: "A comprehensive tool for analyzing RNA-seq data, identifying differentially expressed genes (DEGs), and performing pathway enrichment analysis."
keywords:
  - RNA-seq
  - bioinformatics
  - differential expression
  - pathway analysis
  - transcriptomics
license: MIT

GitHub Events

Total

Watch event: 1
Delete event: 2
Push event: 27
Create event: 2

Last Year

Watch event: 1
Delete event: 2
Push event: 27
Create event: 2

Dependencies

requirements.txt pypi

click >=8.1.8
gseapy >=1.1.7
matplotlib >=3.10.1
matplotlib-inline >=0.1.7
mygene >=3.2.2
numpy >=2.2.3
openai >=1.70.0
pandas >=2.2.3
pydeseq2 >=0.5.0
requests >=2.32.3
scikit-learn >=1.6.1
scipy >=1.15.2
seaborn >=0.13.2
statsmodels >=0.14.4
streamlit >=1.44.1
