forensic-mirna-analysis

Forensic miRNA analysis pipeline for body fluid identification using GEO datasets

https://github.com/gvmfhy/forensic-mirna-analysis

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: gvmfhy
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 17.5 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created 8 months ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

Vibe Coding: AI-assisted Development of Forensic miRNA Analysis Pipeline using Claude Opus 4

A bioinformatics pipeline for identifying microRNA (miRNA) signatures in forensic body fluid samples, developed through transparent human-AI collaboration. This project demonstrates "vibe coding" - using natural language to guide AI through complex scientific software development, resulting in a working pipeline that identifies 393 miRNA markers for forensic body fluid identification.

🎯 Development Transparency

This entire pipeline was developed through AI-assisted programming using Claude Opus 4. The complete, unedited development session is available:

  • View the full vibe coding transcript: 338,952 lines of human-AI interaction
  • Read about the development process, including failures and recovery strategies
  • Visit the web version for rendered documentation

🔬 Project Overview

This pipeline processes two complementary microarray datasets:

  • GSE153135: Two-color Agilent arrays (GPR format) with pooled samples
  • GSE49630: Affymetrix ST arrays (CEL format) with individual samples

The analysis identifies 393 forensic miRNA marker candidates with large effect sizes suitable for body fluid identification.

📊 Key Results

From analysis of n=5 samples per body fluid type:

  • Blood markers: 155 candidates (e.g., hsa-miR-486-5p with 3243-fold enrichment)
  • Semen markers: 167 candidates (e.g., hsa-miR-891a with 175-fold enrichment)
  • Saliva markers: 49 candidates (e.g., hsa-miR-205 with 275-fold enrichment)
  • Vaginal markers: 22 candidates (e.g., hsa-miR-138-1-star with 7-fold enrichment)

Note: Large fold-changes may reflect variance in small sample sizes and require validation.

⚠️ Important Limitations

  • Small sample size: Only n=5 samples per body fluid type
  • No independent validation: Results require confirmation in larger cohorts
  • Statistical constraints: FDR < 0.05 not achievable with 13,564 tests and small n
  • Forensic applicability: These are research findings, not validated forensic markers
  • Development history: See development logs for complete transparency including errors

🚀 Quick Start

Prerequisites

  • Python 3.8+ with pandas, numpy, scipy, matplotlib, seaborn
  • R 4.5+ with BiocManager, oligo, limma packages
  • macOS/Linux environment (tested on macOS ARM64)

Installation

```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/forensic-mirna-analysis.git
cd forensic-mirna-analysis

# Set up Python environment
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Set up R environment
Rscript setuprenv.sh
```

Data Download

Due to size constraints, raw data files are not included. Download from GEO:

```bash
# Create data directories
mkdir -p data/raw/GSE153135GPR data/raw/GSE49630CEL

# Download GSE153135 (GPR files)
# Visit https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE153135

# Download GSE49630 (CEL files)
# Visit https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49630
```
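If you prefer a scripted download over the GEO web pages, the sketch below builds the FTP location where GEO conventionally publishes a series' raw archive. The helper name `geo_raw_tar_url` is hypothetical, and the `<accession>_RAW.tar` file name is an assumption based on GEO's usual layout; confirm the link on each series page before relying on it.

```python
def geo_raw_tar_url(accession: str) -> str:
    """Build the conventional GEO FTP URL for a series' *_RAW.tar archive.

    Hypothetical helper: it assumes the standard GEO directory layout and
    does not check that the archive actually exists.
    """
    if not (accession.startswith("GSE") and accession[3:].isdigit()):
        raise ValueError(f"not a GEO series accession: {accession!r}")
    digits = accession[3:]
    # GEO groups series into directories named by replacing the last
    # three digits of the accession number with "nnn".
    stub = "GSE" + digits[:-3] + "nnn"
    return (
        "https://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{accession}/suppl/{accession}_RAW.tar"
    )


for acc in ("GSE153135", "GSE49630"):
    print(acc, geo_raw_tar_url(acc))
```

The resulting URLs can be fetched with any HTTP client and unpacked into the matching directory under `data/raw/`.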

Running the Analysis

```bash
# 1. Process GPR files (two-color arrays)
python scripts/preprocessing/gpr_parser.py

# 2. Process CEL files (Affymetrix arrays)
Rscript scripts/preprocessing/celprocessorminimal.R

# 3. Run forensic marker analysis
python scripts/analysis/celpracticalforensic.py
```
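For orientation, step 1 reads GenePix Results (GPR) files, which are tab-delimited text in the Axon Text File layout: a signature line, a line of counts, optional header records, then a column-title row and data rows. The sketch below is not the repository's `gpr_parser.py`. It is a minimal stand-alone illustration in which the function names, the choice of the 635 nm channel as numerator, the intensity floor, and every number in `SAMPLE` are assumptions made up for the example.

```python
import csv
import io
import math


def parse_gpr(text: str) -> list:
    """Parse GPR (ATF) text into a list of row dicts keyed by column title."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("ATF"):
        raise ValueError("missing ATF signature line")
    # Line 2 holds two counts: optional header records, then data columns.
    n_header = int(lines[1].split("\t")[0])
    table = lines[2 + n_header:]  # column titles followed by data rows
    reader = csv.DictReader(io.StringIO("\n".join(table)), delimiter="\t")
    return list(reader)


def log2_ratio(row: dict, floor: float = 1.0) -> float:
    """Background-subtracted log2(635 nm / 532 nm) for one spot.

    Assumption: which channel carries the sample depends on the experiment;
    swap the columns if the labelling is reversed. Signals at or below
    background are clamped to `floor` so the logarithm stays defined.
    """
    red = max(float(row["F635 Median"]) - float(row["B635 Median"]), floor)
    green = max(float(row["F532 Median"]) - float(row["B532 Median"]), floor)
    return math.log2(red / green)


# Made-up miniature file; the intensities are illustrative only.
SAMPLE = "\n".join([
    "ATF\t1.0",
    "2\t5",
    '"Type=GenePix Results 3"',
    '"DateTime=2020/01/01 00:00:00"',
    '"Name"\t"F635 Median"\t"B635 Median"\t"F532 Median"\t"B532 Median"',
    '"hsa-miR-486-5p"\t4100\t100\t600\t100',
    '"hsa-miR-205"\t150\t100\t900\t100',
])

rows = parse_gpr(SAMPLE)
for row in rows:
    print(row["Name"], round(log2_ratio(row), 2))
```

Clamping to a floor is only one way to handle spots at or below background; flagging or dropping them are alternatives that change the downstream statistics.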

📁 Project Structure

├── data/
│   ├── raw/                              # Raw data files (not in repo)
│   └── processed/                        # Normalized expression matrices
├── scripts/
│   ├── preprocessing/                    # Data parsing and normalization
│   └── analysis/                         # Statistical analysis and visualization
├── results/
│   ├── continuous_expression_analysis/   # GPR analysis results
│   └── cel_practical_forensic/           # CEL forensic markers
├── docs/
│   ├── DETAILED_ANALYSIS_WALKTHROUGH.md
│   └── RESULTS_AND_VISUALIZATIONS_STRUCTURE.md
└── requirements.txt

🔍 Analysis Approach

Multi-Tier Expression Framework

The pipeline implements a multi-tier detection system for continuous expression data:

  • High confidence: log2 ratio > -2.0
  • Moderate confidence: -6.0 to -2.0
  • Low confidence: -10.0 to -6.0
  • Undetected: < -10.0
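The tiers above amount to a simple threshold lookup. The function below is a minimal sketch of that mapping, not the pipeline's own code: the name `detection_tier` is hypothetical, and assigning a value of exactly -6.0 to the moderate tier is an assumption, since the ranges as stated share that boundary.

```python
def detection_tier(log2_ratio: float) -> str:
    """Map a log2 ratio to one of the four detection tiers."""
    if log2_ratio > -2.0:
        return "high"
    if log2_ratio >= -6.0:   # assumption: -6.0 itself counts as moderate
        return "moderate"
    if log2_ratio >= -10.0:  # "undetected" is strictly below -10.0
        return "low"
    return "undetected"


for value in (-0.5, -2.0, -7.3, -12.0):
    print(value, detection_tier(value))
```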

Statistical Methods

  • Non-parametric Wilcoxon tests (suitable for small samples)
  • Cohen's d effect size (prioritized over p-values)
  • FDR correction for multiple testing
  • Forensic-specific thresholds (>4-fold change, >99% specificity)
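To make the effect-size and fold-change criteria concrete, here is a small sketch using only the standard library. It assumes expression values are on a log2 scale and uses the pooled-standard-deviation form of Cohen's d; the function names and the sample values are invented for illustration and are not taken from the pipeline or its results. The rank-based test itself is available as `scipy.stats.mannwhitneyu`, and the specificity criterion is left out because its exact definition is not given here.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed variant).

    Raises ZeroDivisionError if both groups have zero variance.
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def fold_change(a, b):
    """Linear fold change between two groups of log2-scale values."""
    return 2.0 ** (mean(a) - mean(b))


# Made-up log2 expression values for one miRNA (illustrative only).
target_fluid = [10.1, 9.8, 10.4, 10.0, 9.7]
other_fluids = [6.0, 5.8, 6.2, 6.1, 5.9]

d = cohens_d(target_fluid, other_fluids)
fc = fold_change(target_fluid, other_fluids)
passes_fold_threshold = fc > 4.0  # the >4-fold criterion listed above

print(f"Cohen's d = {d:.2f}, fold change = {fc:.1f}, >4-fold: {passes_fold_threshold}")
```

With only five values per group, the pooled standard deviation is itself a noisy estimate, which is one reason very large d values from such data call for independent validation.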

📚 Documentation

Comprehensive documentation is available:

  • Detailed Analysis Walkthrough: step-by-step process
  • Results Structure: output file descriptions
  • Fully documented script versions in *_documented.py/R files

⚠️ Important Notes

Platform-Specific Requirements

  • CEL files MUST use oligo package (not affy) for ST arrays
  • macOS ARM64: Requires specific R configuration (see setuprenv.sh)

Statistical Limitations

  • Small sample sizes (n=2-5 per fluid) limit statistical power
  • Strict FDR < 0.05 not achievable with 13,564 tests
  • Focus on effect size over p-values for forensic relevance
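One way to see why FDR control is so hard here: an exact rank-sum test on tiny groups has a floor on the p-value it can return, and Benjamini-Hochberg (BH) needs many tests at that floor before it rejects anything. The sketch below works through that arithmetic. The group sizes are illustrative assumptions (the pipeline's actual comparisons may differ), the function names are hypothetical, and BH is assumed as the FDR procedure.

```python
from math import ceil, comb


def min_two_sided_p(n_a: int, n_b: int) -> float:
    """Smallest two-sided p-value an exact rank-sum test can give (no ties).

    Only the two most extreme of the comb(n_a + n_b, n_a) equally likely
    rank arrangements are at least as extreme as complete separation.
    """
    return min(1.0, 2 / comb(n_a + n_b, n_a))


def bh_min_hits(p_floor: float, n_tests: int, alpha: float = 0.05):
    """Fewest tests that must reach p_floor before BH rejects any of them.

    BH rejects the k smallest p-values when p_(k) <= k * alpha / n_tests,
    so with every p-value >= p_floor we need k >= p_floor * n_tests / alpha.
    Returns None when that exceeds the number of tests.
    """
    k = ceil(p_floor * n_tests / alpha)
    return k if k <= n_tests else None


N_TESTS = 13_564
for n_a, n_b in ((5, 5), (5, 15), (2, 2)):
    p_floor = min_two_sided_p(n_a, n_b)
    print(f"{n_a} vs {n_b}: p floor = {p_floor:.6f}, "
          f"BH needs {bh_min_hits(p_floor, N_TESTS)} tests at the floor")
```

Under these assumptions, a result of `None` means no outcome of the test could pass the BH threshold at all, which is the situation for groups of two.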

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add comprehensive documentation
  4. Submit a pull request

📄 License

This project is licensed under the MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • GEO datasets GSE153135 and GSE49630 authors
  • Bioconductor community for R packages
  • Forensic miRNA research community
  • Claude Opus 4 (Anthropic) for AI programming assistance

💡 LLMs and Scientific Discovery

Historically, experimental science has operated in siloed compartments, often insulated from broader technological innovations, especially those involving computational advancements. Computationally adjacent "dry-lab" fields rapidly incorporate cutting-edge AI tools, while traditional "wet-lab" research frequently lags behind due to habits of secrecy, selective reporting, and institutional inertia.

Despite known errors and current limitations, the widespread use of LLMs ensures their lasting impact. They underpin future technologies and scientific methodologies, making transparent practices today not just beneficial but essential.

Scientific publishing currently suffers from an entrenched culture of presenting only successful results. By openly documenting failures, misunderstandings, and friction points through transparent, raw development logs, we directly challenge this problematic status quo. Publishing these records online, such as full terminal logs hosted on publicly accessible websites, achieves two goals:

First, it immediately improves reproducibility, honesty, and transparency in scientific discourse. Researchers can directly examine genuine human-AI interactions, learn from recorded failures, and clearly understand breakdown points. We can better criticize each other's work and offer our domain-specific knowledge when we notice errors.

Second, openly shared development logs represent an investment in future AI development. Public repositories containing candid transcripts may enter LLM training datasets, helping AI models learn from realistic, error-inclusive examples of human-AI collaboration, thereby improving their capability to navigate and mitigate such friction points.

By contributing realistic scientific problem-solving data to public repositories, individual researchers actively shape how future models understand and respond to scientific queries. This approach values instructive failures as essential components of scientific progress.

Publishing raw terminal logs directly targets the largest visibility gap in science by surfacing misfires and dead ends that traditional papers omit. This transparency provides peers—and future LLMs—access to the full causal chain of discovery, improving present-day scientific transparency and influencing human-AI interactions.

📞 Contact

For questions or collaborations, please open an issue on GitHub.


Note: This pipeline was developed for research purposes. Forensic applications require additional validation on independent samples. The development process demonstrates that sophisticated bioinformatics tools can be created through AI collaboration by researchers without traditional programming expertise.

Owner

  • Name: Austin_Patrick
  • Login: gvmfhy
  • Kind: user

noob

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: 'Vibe Coding: AI-assisted Development of Forensic miRNA Analysis Pipeline using Claude Opus 4'
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Austin P.
    family-names: Morrissey
    email: austinm@bu.edu
    affiliation: 'Pinpoint RNA'
    orcid: 'https://orcid.org/0009-0007-8578-2687'
    website: 'https://www.linkedin.com/in/austin-p-morrissey/'
identifiers:
  - type: doi
    value: 10.5281/zenodo.15860835
    description: The concept DOI for the collection containing all versions of this software.
repository-code: 'https://github.com/gvmfhy/forensic-mirna-analysis'
url: 'https://austinpatrick.substack.com'
abstract: >-
  A bioinformatics pipeline for identifying microRNA (miRNA) 
  signatures in forensic body fluid samples, developed through 
  transparent AI-assisted programming using Claude Opus 4. This 
  project demonstrates "vibe coding" - using natural language to 
  guide AI through complex scientific software development. The 
  pipeline processes publicly available microarray datasets 
  (GSE153135 and GSE49630) to discover miRNA markers that can 
  distinguish between blood, saliva, semen, and vaginal secretions 
  for forensic identification purposes. The analysis uses standard 
  statistical methods and identifies 393 forensic marker candidates 
  with large effect sizes. Complete development logs are included 
  to promote transparency in AI-assisted scientific computing.
keywords:
  - forensics
  - miRNA
  - microRNA
  - bioinformatics
  - body fluid identification
  - gene expression
  - microarray analysis
  - AI-assisted development
  - vibe coding
  - Claude Opus 4
  - transparent scientific computing
  - human-AI collaboration
license: MIT
version: 1.0.0
date-released: '2025-07-11'

GitHub Events

Total
  • Release event: 1
  • Push event: 3
  • Create event: 2
Last Year
  • Release event: 1
  • Push event: 3
  • Create event: 2