hmm_kunitz_domain

Hidden Markov Model-based classification and structural validation of Kunitz-type protein domains. Includes sequence alignment, model training, validation on curated datasets, and structural quality assessment using PDBeFold and RMSD/Q-score analysis. Final project for the Laboratory of Bioinformatics 1 course at the University of Bologna.

https://github.com/kianinsilico/hmm_kunitz_domain

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: kianinsilico
  • License: other
  • Language: Shell
  • Default Branch: main
  • Homepage:
  • Size: 7.7 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 7 months ago

README.md

HMM-Based Detection of Kunitz Domains from Structure-Derived Alignments

This repository contains the pipeline and materials for building, evaluating, and validating a profile HMM that detects Kunitz-type domains in protein sequences. It was developed as the final assessment for Laboratory of Bioinformatics 1 at the University of Bologna.


Project Overview

This project identifies Kunitz-type serine protease inhibitor domains (Pfam PF00014) with a custom-trained profile HMM. The model was built from a structure-based alignment and benchmarked against known positive and negative datasets using the Matthews Correlation Coefficient (MCC), True Positive Rate (TPR), and Positive Predictive Value (PPV).


🗂️ Repository Structure

```bash
├── data/                              # All datasets and processed data
│   ├── datasets/                      # FASTA files, classification data, evaluation results
│   ├── pdbs/                          # Cleaned single-chain PDB structures
│   ├── processed_data/                # Intermediate processing results
│   ├── raw_data/                      # Original input data
│   ├── raw_pdbs/                      # Downloaded full PDB structures
│   ├── tmalign_results/               # Pairwise alignment outputs
│   ├── visualization/                 # Plots and visual analysis results
│   └── consistency_check.txt          # Data validation results
├── scripts/                           # Processing pipeline scripts
│   ├── 01_csv_to_fasta.sh             # Convert CSV to FASTA format
│   ├── 02_mmseqs_cluster.sh           # Sequence clustering with MMseqs2
│   ├── 03_extract_ids.sh              # ID formatting for PDBeFold
│   ├── 04_extract_chains.sh           # PDB chain extraction
│   ├── 05_run_tmalign.sh              # Structural alignment pipeline
│   ├── 06_parse_tmalign.py            # TM-align results parser
│   ├── 07_kunitz_superposition.cxc    # ChimeraX superposition script
│   ├── 08_plot_pdbefold_matrices.py   # Visualization scripts
│   ├── 09_build_hmm.sh                # HMM model construction
│   ├── 11_consistency_check.sh        # Data validation
│   └── 12_kunitz_hmm_validator.sh     # Enhanced HMM evaluation tool
├── docs/                              # Documentation and visuals
│   ├── .repo_visulas/                 # Repository graphics and banner
│   └── results/                       # Analysis results and reports
├── environment_full.yml               # Conda environment specification
├── LICENCE                            # License file
├── LICENCE-DATA.txt                   # Data license information
└── README.md                          # Project documentation
```

Project Workflow Summary

1. Data Acquisition and Preprocessing

  • Data was retrieved from RCSB PDB using a custom query:
    • Pfam ID: PF00014
    • Resolution ≤ 3 Å
    • Sequence length between 45–80 residues
  • A custom report was downloaded from the RCSB website including the following fields:

    • Entry ID
    • Polymer Entity ID
    • Sequence
    • Annotation Identifier
    • Chain ID
  • Extracted FASTA sequences from the CSV report using scripts/01_csv_to_fasta.sh
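
The conversion step can be illustrated with a minimal Python sketch. It is not the repository's `01_csv_to_fasta.sh`; it assumes a simplified CSV whose header uses the field names listed above (`Entry ID`, `Sequence`, `Chain ID`), one chain per row, and an `ENTRY_CHAIN` FASTA header. The function name, the entry `1ABC`, and the sequence are made up for illustration.

```python
import csv
import io


def csv_to_fasta(csv_text: str) -> str:
    """Turn report rows into FASTA records with '>ENTRY_CHAIN' headers."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = (row.get("Entry ID") or "").strip()
        chain = (row.get("Chain ID") or "").strip()
        seq = (row.get("Sequence") or "").strip()
        if not (entry and chain and seq):
            continue  # skip rows that lack an ID, a chain, or a sequence
        records.append(f">{entry}_{chain}\n{seq}")
    return "\n".join(records) + ("\n" if records else "")


# Toy input: the second row has no sequence and is dropped.
example = "Entry ID,Sequence,Chain ID\n1ABC,RPDFCLEPPYTGPCK,A\n2XYZ,,B\n"
print(csv_to_fasta(example), end="")
```

A real RCSB custom report may spread one entry over several rows or list several chains in one cell, so the row handling would need adapting to the actual export.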

2. Sequence Clustering

  • Clustered with MMseqs2 using scripts/02_mmseqs_cluster.sh
  • Identity threshold: 90%, coverage: 80%
  • Output: representative sequences for further analysis
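
As a rough illustration of how representatives can be read back after clustering, the sketch below parses a two-column cluster table (representative, member, tab-separated), which is the layout MMseqs2 writes for its cluster TSV. Whether `02_mmseqs_cluster.sh` produces exactly this file is an assumption, and the IDs are invented.

```python
from collections import defaultdict


def clusters_from_tsv(tsv_text: str) -> dict[str, list[str]]:
    """Group member IDs under their cluster representative."""
    clusters = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)


# Toy table: two clusters, the first with two members.
toy_tsv = "1ABC_A\t1ABC_A\n1ABC_A\t2DEF_B\n3GHI_C\t3GHI_C\n"
clusters = clusters_from_tsv(toy_tsv)
representatives = sorted(clusters)
print(representatives)
```

Keeping the full member lists, not just the representatives, makes it easy to check later which redundant structures were collapsed into each cluster.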

3. ID Extraction for Structural Search

  • Used scripts/03_extract_ids.sh to format IDs for PDBeFold
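
A minimal sketch of the ID formatting, assuming FASTA headers of the form `>1f5r_I` and a target format of `entry:chain` (the notation used for `1f5r:I` later in this README). The function name is hypothetical, and the actual `03_extract_ids.sh` may differ.

```python
def to_pdbefold_id(fasta_header: str) -> str:
    """Convert '>ENTRY_CHAIN ...' into 'ENTRY:CHAIN'."""
    identifier = fasta_header.lstrip(">").split()[0]
    entry, _, chain = identifier.partition("_")
    if not entry or not chain:
        raise ValueError(f"expected ENTRY_CHAIN, got {identifier!r}")
    return f"{entry}:{chain}"


print(to_pdbefold_id(">1f5r_I"))
```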

4. Structural Filtering

  • Extracted desired chains from downloaded PDB files using: ./scripts/04_extract_chains.sh <cleaned_id_list> <raw_pdb_dir> <output_dir>
  • Manual QC in AliView and ChimeraX identified and excluded:
    • 1yld_B (truncated structure)
    • 5jbt_Y (structurally divergent)
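
Chain extraction can be sketched in a few lines of Python. This is an illustration, not the logic of `04_extract_chains.sh`: it relies only on the fixed-column PDB layout, where the chain ID of an `ATOM`/`HETATM` record sits in column 22. The helper `toy_atom` is hypothetical and builds truncated records just for the demo.

```python
def extract_chain(pdb_text: str, chain_id: str) -> str:
    """Keep ATOM/HETATM records whose chain ID (column 22) equals chain_id."""
    kept = [
        line
        for line in pdb_text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
        and len(line) > 21
        and line[21] == chain_id
    ]
    return "\n".join(kept + ["END"]) + "\n"


def toy_atom(serial: int, chain: str, resseq: int) -> str:
    # Hypothetical helper: only the leading columns of an ATOM record,
    # enough to place the chain ID at string index 21 (column 22).
    return f"ATOM  {serial:>5}  N   ARG {chain}{resseq:>4}"


toy_pdb = "\n".join([toy_atom(1, "A", 1), toy_atom(2, "B", 1), toy_atom(3, "B", 2)])
chain_b = extract_chain(toy_pdb, "B")
print(chain_b, end="")
```

Whether `HETATM` records should be kept depends on the downstream tool; dropping them is a one-line change to the prefix tuple.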

5. Structural Alignment and Quality Assessment

  • Ran all-vs-all TM-align using scripts/05_run_tmalign.sh
  • Parsed and visualized results using scripts/06_parse_tmalign.py
    • Outputs: RMSD and TM-score matrices, rankings, heatmaps
    • Top reference: 1f5r_I
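
To show what parsing a pairwise result involves, here is a small sketch that pulls the RMSD and the two TM-scores out of one TM-align report. It is not `06_parse_tmalign.py`; the sample text mimics the `RMSD=` and `TM-score=` lines of TM-align's standard output with invented numbers, and the dictionary keys are arbitrary names.

```python
import re


def parse_tmalign(report: str) -> dict[str, float]:
    """Extract RMSD and both TM-scores (normalized by chain 1 and chain 2)."""
    rmsd = re.search(r"RMSD=\s*([\d.]+)", report)
    scores = re.findall(r"TM-score=\s*([\d.]+)", report)
    if rmsd is None or len(scores) < 2:
        raise ValueError("text does not look like a TM-align report")
    return {
        "rmsd": float(rmsd.group(1)),
        "tm_chain1": float(scores[0]),
        "tm_chain2": float(scores[1]),
    }


# Invented values, formatted like the relevant TM-align output lines.
sample = (
    "Aligned length=   56, RMSD=   0.85, Seq_ID=n_identical/n_aligned= 0.411\n"
    "TM-score= 0.93210 (if normalized by length of Chain_1, i.e., LN=58, d0=2.74)\n"
    "TM-score= 0.91000 (if normalized by length of Chain_2, i.e., LN=60, d0=2.80)\n"
)
print(parse_tmalign(sample))
```

Collecting one such dictionary per pair is enough to fill the RMSD and TM-score matrices mentioned above.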

6. Superposition

  • Structures were aligned in ChimeraX using Matchmaker
  • Alignment centered on 1f5r:I
  • Outputs include .cxs session file and figure

7. HMM Construction

```bash
hmmbuild kunitz_model.hmm pdb_kunitz_PDBeFold_alignment_clean.fasta.ali
```

8. HMM Evaluation & Cross-Validation

The final HMM is evaluated with 2-fold cross-validation on datasets from which the training sequences have been removed. The evaluation measures how well the model identifies Kunitz-type domains while keeping false positives low.

Evaluation Methodology:
  • Cross-validation design: 2-fold splitting to maximize dataset utilization
  • Overlap removal: Training sequences are filtered out to prevent data leakage
  • Performance metrics: Matthews Correlation Coefficient (MCC), True Positive Rate (TPR), and Positive Predictive Value (PPV)
  • Threshold optimization: Multiple E-value thresholds tested to find optimal cutoffs
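
The three metrics follow directly from the confusion matrix. The sketch below uses the standard definitions; the function name and the choice to return 0.0 when a denominator is zero are assumptions, and the counts in the example are made up.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute MCC, TPR (recall), and PPV (precision) from confusion counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"MCC": mcc, "TPR": tpr, "PPV": ppv}


# Made-up counts, only to show the call.
print(classification_metrics(tp=90, fp=1, tn=1000, fn=0))
```

MCC is the most informative of the three here because the negative set (SwissProt background) is far larger than the positive set, which would make plain accuracy misleadingly high.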

Validation Datasets:
  • human_kunitz.fasta: Human Kunitz domain sequences (positive set)
  • human_notkunitz.fasta: Human non-Kunitz sequences (negative control)
  • nothuman_kunitz.fasta: Non-human Kunitz sequences (diversity test)
  • uniprot_sprot.fasta: SwissProt background sequences (large negative set)

Analysis Pipeline:
  1. Combine input datasets and remove training sequence overlaps
  2. Random split into balanced positive/negative folds
  3. Execute hmmsearch against the trained HMM model
  4. Parse results and calculate classification performance
  5. Generate comprehensive performance reports and threshold analysis
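
Steps 2 and 4 of the pipeline above can be sketched as follows. This is not `12_kunitz_hmm_validator.sh`: it assumes `hmmsearch` was run with `--tblout`, whose fifth whitespace-separated column is the full-sequence E-value, that sequences without a hit count as predicted negatives, and that a hit is positive when its E-value is at or below the threshold. All function names, sequence IDs, and the abbreviated toy table are hypothetical.

```python
import random


def parse_tblout(text: str) -> dict[str, float]:
    """Map each target name to its full-sequence E-value."""
    evalues = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        evalues[fields[0]] = float(fields[4])  # column 5: full-sequence E-value
    return evalues


def confusion_counts(labels: dict[str, bool], evalues: dict[str, float], threshold: float):
    """Return (tp, fp, tn, fn); sequences with no hit are predicted negative."""
    tp = fp = tn = fn = 0
    for seq_id, is_kunitz in labels.items():
        predicted = evalues.get(seq_id, float("inf")) <= threshold
        if predicted and is_kunitz:
            tp += 1
        elif predicted:
            fp += 1
        elif is_kunitz:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def two_fold_split(ids, seed: int = 0):
    """Shuffle IDs reproducibly and cut them into two halves."""
    shuffled = sorted(ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Abbreviated toy table: real --tblout rows carry more columns.
toy_tblout = (
    "# target name  accession  query name  accession  E-value  score  bias\n"
    "P1  -  kunitz_model  -  1e-25  90.1  0.1  2e-25  89.0  0.1\n"
    "N1  -  kunitz_model  -  0.5  5.2  0.0  0.7  4.9  0.0\n"
)
labels = {"P1": True, "P2": True, "N1": False, "N2": False}
counts = confusion_counts(labels, parse_tblout(toy_tblout), threshold=1e-6)
print(counts)
```

Calling `two_fold_split` separately on the positive IDs and on the negative IDs, then pairing the halves, gives two folds with the same class balance. A threshold can then be chosen on one fold and checked on the other.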


🛠️ Environment Setup

The project includes a streamlined conda environment with only essential bioinformatics tools:

```bash
# Create the environment
conda env create -f environment_full.yml

# Activate the environment
conda activate kunitz
```

Key dependencies:
  • Python 3.13 with scientific computing stack (NumPy, Pandas, SciPy)
  • Bioinformatics tools: BLAST, HMMER, MMseqs2, MUSCLE, TM-align
  • Analysis tools: BioPython, Matplotlib, Seaborn
  • Sequence utilities: SeqKit, PDB-tools


📈 Performance Results

| UniProt ID | Length | # Domains (PF00014) | Domain Position(s) | Comments               |
|------------|--------|---------------------|--------------------|------------------------|
| A0A1Q1NL17 | 101    | 1                   | 32–88              | Short sequence         |
| O62247     | 202    | 1                   | 138–184            | Domain near C-terminal |
| Q8WPG5     | 134    | 2                   | 17–69, 83–129      | Tandem domains         |
| D3GGZ8     | 195    | 1                   | 120–190            | Domain near C-terminal |

Best result: E-value threshold = 1e-06 in full-sequence mode, with MCC = 0.9945, TPR = 1.0, PPV = 0.989.

Owner

  • Name: Kianoush Keshani
  • Login: kianinsilico
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: "Kunitz HMM Classification Project"
abstract: "HMM-based classification and evaluation pipeline for Kunitz domain proteins using cross-validation and performance assessment tools."
authors:
  - family-names: "Keshani"
    given-names: "Kianoush"
    orcid: "https://orcid.org/0000-0000-0000-0000"
    affiliation: "University of Bologna"
version: "1.0.0"
date-released: 2025-01-24
url: "https://github.com/kianinsilico/labexam"
repository-code: "https://github.com/kianinsilico/labexam"
license: "CC-BY-NC-SA-4.0"
keywords:
  - "bioinformatics"
  - "protein-classification"
  - "hmm"
  - "kunitz-domain"
  - "cross-validation"
  - "machine-learning"
  - "structural-biology"
  - "computational-biology"
preferred-citation:
  type: software
  title: "Kunitz HMM Classification Project: HMM-based classification and evaluation pipeline for Kunitz domain proteins"
  authors:
    - family-names: "Keshani"
      given-names: "Kianoush"
      orcid: "https://orcid.org/0000-0000-0000-0000"
      affiliation: "University of Bologna"
  year: 2025
  url: "https://github.com/kianinsilico/labexam"
  notes: "Laboratory examination project for Laboratory of Bioinformatics 1 course"

GitHub Events

Total
  • Push event: 7
Last Year
  • Push event: 7