deg-pipeline-assistant
This pipeline takes normalized RNA-seq data and outputs differentially expressed genes and pathways.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (15.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Shaan7071
- License: mit
- Language: Python
- Default Branch: main
- Size: 204 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
DEG Pipeline Assistant
A comprehensive tool for analyzing RNA-seq data, identifying differentially expressed genes (DEGs), and performing pathway enrichment analysis.
Overview
This pipeline takes normalized RNA-seq data and performs a complete analysis workflow, generating various visualizations and identifying biologically significant patterns. The tool features both a command-line interface and a user-friendly Streamlit web application, making it accessible to both bioinformaticians and researchers with limited programming experience.
Features
- Data Processing: Creates standardized metadata from normalized counts
- Quality Control: Generates boxplots, correlation heatmaps, and PCA plots
- Differential Expression Analysis: Processes DESeq2 files to identify DEGs
- Visualization: Creates MA plots, volcano plots, and DEG heatmaps
- Ortholog Mapping: Maps genes to human orthologs for cross-species analysis
- Enrichment Analysis: Performs GO and KEGG pathway enrichment analysis
- Interactive UI: Streamlit-based web interface for easy parameter configuration
- AI Assistant: Integrated AI functionality to help with parameter selection
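As a rough illustration of the quality-control step, the sketch below computes the sample-to-sample Pearson correlation values that a correlation heatmap would display. It uses only the standard library. The log2(x + 1) transform, the function names, and the toy counts are assumptions made for illustration, not the pipeline's actual code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sample_correlations(counts):
    """Return {(sample_a, sample_b): r} for every pair of samples.

    `counts` maps a sample name to its normalized counts, one value per gene.
    Values are log2(x + 1)-transformed first (an assumed, common choice).
    """
    logged = {s: [math.log2(v + 1) for v in vals] for s, vals in counts.items()}
    return {(a, b): pearson(logged[a], logged[b]) for a in logged for b in logged}


# Toy data (hypothetical): three genes measured in three samples.
counts = {
    "Control_1": [10.0, 200.0, 35.0],
    "Control_2": [12.0, 190.0, 40.0],
    "Treatment1_1": [90.0, 20.0, 33.0],
}
corr = sample_correlations(counts)
```

Replicates of the same condition are expected to correlate more strongly with each other than with samples from other conditions, which is the pattern to look for in the heatmap.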
Installation
Prerequisites
- Python 3.13+
- Required Python packages (install via pip):
pandas numpy matplotlib seaborn scipy statsmodels scikit-learn streamlit click requests gseapy mygene openai
Setup
- Clone the repository
  ```bash
  git clone https://github.com/Shaan7071/DEG-pipeline-assistant.git
  ```
- Navigate to the project directory
  ```bash
  cd DEG-pipeline-assistant
  ```
- Install dependencies
  ```bash
  pip install -r requirements.txt
  ```
- Set up OpenAI API key (for AI assistant functionality)
  ```bash
  export OPENAI_API_KEY="your-api-key"
  ```
Usage
Command Line Interface
The pipeline can be run using the CLI with two main commands:
Setup: Initialize the directory structure
```bash
python pipeline_CLI.py \
  --norm-file "path/to/normalized_data.csv" \
  --pw-data "path/to/pairwise_data" \
  --base-dir "output_directory" \
  --num-replicates 3 \
  --conditions "Control" --conditions "Treatment1" --conditions "Treatment2" \
  setup
```
Run Analysis: Execute the full analysis pipeline
```bash
python pipeline_CLI.py \
  --norm-file "path/to/normalized_data.csv" \
  --pw-data "path/to/pairwise_data" \
  --base-dir "output_directory" \
  --num-replicates 3 \
  --conditions "Control" --conditions "Treatment1" --conditions "Treatment2" \
  run-all \
  --model-organism "drerio" \
  --pw-interest "Treatment1_vs_Control" --pw-interest "Treatment2_vs_Control" \
  --log2fc-threshold 1.0 \
  --padj-threshold 0.05 \
  --enrich-sig-cutoff 0.05
```
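Both commands share the same leading options, followed by the subcommand name and then any subcommand-specific options. The sketch below assembles such an invocation as an argument list. `build_command` is a hypothetical helper written for illustration, not part of the repository, and the file paths are placeholders.

```python
import shlex


def build_command(subcommand, norm_file, pw_data, base_dir, num_replicates,
                  conditions, subcommand_args=()):
    """Assemble a pipeline_CLI.py invocation as an argument list.

    Shared options come first, then the subcommand, then its own options,
    mirroring the order used in the documented examples.
    """
    args = [
        "python", "pipeline_CLI.py",
        "--norm-file", norm_file,
        "--pw-data", pw_data,
        "--base-dir", base_dir,
        "--num-replicates", str(num_replicates),
    ]
    for condition in conditions:
        # One --conditions flag per experimental condition.
        args += ["--conditions", condition]
    args.append(subcommand)
    args += list(subcommand_args)
    return args


# Placeholder paths and conditions, for illustration only.
setup_cmd = build_command(
    "setup", "data/normalized.csv", "data/pairwise", "results", 3,
    ["Control", "Treatment1"],
)
print(shlex.join(setup_cmd))  # shell-quoted string, convenient for review
```

Passing the list to `subprocess.run` rather than a single shell string sidesteps quoting problems with paths that contain spaces.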
Web Interface
For a more user-friendly experience, run the Streamlit app:
```bash
streamlit run app.py
```
This will open a web interface where you can:
- Input all parameters through form fields
- Generate and review pipeline commands
- Execute commands directly from the interface
- View real-time command output
Alternatively, connect this repository to the Streamlit cloud at https://streamlit.io/ to handle data that requires more memory.
Parameters
Setup Parameters
- norm-file: Path to the normalized data file (CSV format)
- pw-data: Directory containing pairwise comparison files
- base-dir: Base output directory for results
- conditions: Experimental conditions (specify multiple with repeated flags)
- num-replicates: Number of replicates per condition
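To make the role of `conditions` and `num-replicates` concrete, the following sketch derives a sample metadata table from those two values. The `<condition>_<replicate>` naming scheme, the record fields, and the assumption that every condition has the same number of replicates are illustrative guesses. The pipeline's real metadata format may differ.

```python
def build_metadata(conditions, num_replicates):
    """Return one {"sample", "condition", "replicate"} record per sample."""
    return [
        {"sample": f"{condition}_{rep}", "condition": condition, "replicate": rep}
        for condition in conditions
        for rep in range(1, num_replicates + 1)
    ]


# Three conditions with three replicates each give nine samples.
metadata = build_metadata(["Control", "Treatment1", "Treatment2"], 3)
for row in metadata[:4]:
    print(row)
```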
Analysis Parameters
- model-organism: Model organism code (e.g., "drerio" for zebrafish)
- pw-interest: Pairwise comparisons of interest (e.g., "Treatment_vs_Control")
- log2fc-threshold: Log2 fold change threshold for DEG identification (default: 1.0)
- padj-threshold: Adjusted p-value threshold (default: 0.05)
- enrich-sig-cutoff: Significance cutoff for enrichment analysis (default: 0.05)
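A minimal sketch of how the two DEG thresholds might combine is shown below. `log2FoldChange` and `padj` are the column names DESeq2 uses in its results tables. The boundary handling (`>=` on fold change, `<` on adjusted p-value), the treatment of missing `padj` values, and the function name are assumptions, not confirmed pipeline behavior.

```python
def classify_gene(log2fc, padj, log2fc_threshold=1.0, padj_threshold=0.05):
    """Label a gene "up", "down", or "ns" (not significant)."""
    # A missing or too-large adjusted p-value is never significant.
    if padj is None or padj >= padj_threshold:
        return "ns"
    if log2fc >= log2fc_threshold:
        return "up"
    if log2fc <= -log2fc_threshold:
        return "down"
    return "ns"


# Hypothetical rows using DESeq2's result column names.
rows = [
    {"gene": "geneA", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "geneB", "log2FoldChange": -1.7, "padj": 0.020},
    {"gene": "geneC", "log2FoldChange": 0.4, "padj": 0.0001},
    {"gene": "geneD", "log2FoldChange": 3.1, "padj": 0.300},
    {"gene": "geneE", "log2FoldChange": 1.5, "padj": None},  # padj can be NA
]
labels = {r["gene"]: classify_gene(r["log2FoldChange"], r["padj"]) for r in rows}
degs = [gene for gene, label in labels.items() if label != "ns"]
```

Here `geneC` is highly significant but changes too little, and `geneD` changes a lot but is not significant, so neither counts as a DEG under the default thresholds.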
Optional Flags
Skip specific analysis steps with these flags:
- --skip-boxplots: Skip boxplot generation
- --skip-correlation-heatmap: Skip correlation heatmap
- --skip-pca: Skip PCA analysis
- --skip-ma-plots: Skip MA plot generation
- --skip-volcano-plots: Skip volcano plot generation
- --skip-heatmap: Skip DEG heatmap
- --skip-go: Skip GO enrichment analysis
- --skip-kegg: Skip KEGG enrichment analysis
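One way to picture the skip flags is as a filter over an ordered list of steps. The sketch below is hypothetical: the step names and the `steps_to_run` helper are invented for illustration, and only the flag names come from the list above.

```python
# Assumed mapping from each documented skip flag to the step it disables.
SKIP_FLAGS = {
    "--skip-boxplots": "boxplots",
    "--skip-correlation-heatmap": "correlation_heatmap",
    "--skip-pca": "pca",
    "--skip-ma-plots": "ma_plots",
    "--skip-volcano-plots": "volcano_plots",
    "--skip-heatmap": "deg_heatmap",
    "--skip-go": "go_enrichment",
    "--skip-kegg": "kegg_enrichment",
}


def steps_to_run(skip_flags):
    """Return the steps that remain after applying the given skip flags."""
    unknown = set(skip_flags) - set(SKIP_FLAGS)
    if unknown:
        raise ValueError(f"Unknown flag(s): {sorted(unknown)}")
    # Dicts preserve insertion order, so steps come back in pipeline order.
    return [step for flag, step in SKIP_FLAGS.items() if flag not in skip_flags]


remaining = steps_to_run(["--skip-pca", "--skip-kegg"])
print(remaining)
```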
Output Structure
The pipeline creates a standardized directory structure:
base_dir/
├── data/
│   ├── DESeq2/
│   ├── ortholog_mapping/
│   └── GO_and_KEGG_enrichments/
├── plots/
│   ├── boxplots/
│   ├── correlation_heatmap/
│   ├── PCA/
│   ├── MA_plots/
│   ├── volcano_plots/
│   ├── DEG_heatmap/
│   ├── GO_enrichment/
│   └── KEGG_enrichment/
└── results/
    ├── DEGs/
    ├── GO_enrichment/
    └── KEGG_enrichment/
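For reference, a layout like this can be created with `pathlib` in a few lines. This is a sketch of the idea, not the repository's setup code. `LAYOUT` and `create_layout` are hypothetical names, and the demo writes into a temporary directory rather than a real output folder.

```python
import tempfile
from pathlib import Path

# Sub-directories taken from the tree above, relative to the base directory.
LAYOUT = {
    "data": ["DESeq2", "ortholog_mapping", "GO_and_KEGG_enrichments"],
    "plots": ["boxplots", "correlation_heatmap", "PCA", "MA_plots",
              "volcano_plots", "DEG_heatmap", "GO_enrichment", "KEGG_enrichment"],
    "results": ["DEGs", "GO_enrichment", "KEGG_enrichment"],
}


def create_layout(base_dir):
    """Create every directory in LAYOUT under base_dir and return the paths."""
    created = []
    for parent, children in LAYOUT.items():
        for child in children:
            path = Path(base_dir) / parent / child
            path.mkdir(parents=True, exist_ok=True)  # safe to re-run
            created.append(path)
    return created


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    paths = create_layout(tmp)
    print(f"created {len(paths)} directories")
```

Using `exist_ok=True` means the function can be called again on an existing output directory without raising an error.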
File Descriptions
- app.py: Streamlit web application interface
- ai_assistant.py: AI functionality that collects parameters from the user in natural language and transforms them into valid commands
- pipeline_CLI.py: Command-line interface for the pipeline
- DEGpipeline.py: Core pipeline functionality and analysis modules
Example Workflow
- Prepare normalized RNA-seq data in CSV format
- Prepare pairwise comparison files
- Run the setup command to create directory structure
- Run the analysis command with appropriate parameters
- Examine the generated plots and results files
- Interpret biological significance of DEGs and enriched pathways
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
For questions or support, please contact i.banwait7@gmail.com.
Acknowledgments
- This pipeline uses several open-source libraries including pandas, matplotlib, seaborn, and gseapy
- Ortholog mapping is performed using the g:Profiler API
- Pathway enrichment analysis uses the Enrichr API through gseapy
Owner
- Name: Ishaan Banwait
- Login: Shaan7071
- Kind: user
- Repositories: 1
- Profile: https://github.com/Shaan7071
Bioinformatics student aiming to have an impact in the global healthcare industry.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Banwait"
    given-names: "Ishaan"
    orcid: "https://orcid.org/0009-0002-1431-4599"
title: "DEG Pipeline Assistant"
version: 1.0.0
date-released: 2025-04-17
url: "https://github.com/Shaan7071/DEG-pipeline-assistant"
repository-code: "https://github.com/yourusername/rnaseq-pipeline-assistant"
abstract: "A comprehensive tool for analyzing RNA-seq data, identifying differentially expressed genes (DEGs), and performing pathway enrichment analysis."
keywords:
- RNA-seq
- bioinformatics
- differential expression
- pathway analysis
- transcriptomics
license: MIT
GitHub Events
Total
- Watch event: 1
- Delete event: 2
- Push event: 27
- Create event: 2
Last Year
- Watch event: 1
- Delete event: 2
- Push event: 27
- Create event: 2
Dependencies
- click >=8.1.8
- gseapy >=1.1.7
- matplotlib >=3.10.1
- matplotlib-inline >=0.1.7
- mygene >=3.2.2
- numpy >=2.2.3
- openai >=1.70.0
- pandas >=2.2.3
- pydeseq2 >=0.5.0
- requests >=2.32.3
- scikit-learn >=1.6.1
- scipy >=1.15.2
- seaborn >=0.13.2
- statsmodels >=0.14.4
- streamlit >=1.44.1