fasta-rbca-resolver

Fasta Automated Rule-Based Country Assignment (RBCA) for influenza

https://github.com/bambusaoldhamii/fasta-rbca-resolver

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Bambusaoldhamii
  • License: other
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 1.69 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created 10 months ago · Last pushed 10 months ago
Metadata Files
Readme · License · Citation

README.md

Rule-Based Country Assignment (RBCA) for Avian Influenza FASTA files

This repository implements a rule-based pipeline that assigns standardized ISO 3166-1 country names to sequences, based on the geographic metadata embedded in avian influenza FASTA headers. The pipeline is implemented in the notebook Fasta RBCA R1.ipynb.

Overview

The pipeline uses a deterministic string-matching strategy to resolve sampling locations from HA segment FASTA headers. Virus names are expected to follow the GISAID-recommended format:
A/host/location/isolate/year.

The third component (location) is extracted using regular expressions and matched to a standardized dictionary (location_to_country_ISO_3166_1.json) that maps known location strings to ISO 3166-1 short English country names. Unmatched locations are labeled as Other.
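The extraction and lookup described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the regular expression, the function names `extract_location` and `assign_country`, and the sample dictionary entries are all assumptions, and the real location_to_country_ISO_3166_1.json is assumed to be a flat object mapping location strings to country names.

```python
import re
from typing import Optional

# Assumed pattern for "A/host/location/isolate/year"; the notebook's actual
# regular expression may differ.
VIRUS_NAME = re.compile(r"A/([^/|]+)/([^/|]+)/([^/|]+)/(\d{4})")

# Tiny stand-in for location_to_country_ISO_3166_1.json (hypothetical entries).
LOCATION_TO_COUNTRY = {
    "Guangdong": "China",
    "Hokkaido": "Japan",
    "Viet Nam": "Viet Nam",
}


def extract_location(header: str) -> Optional[str]:
    """Return the third name component, or None if the header does not match."""
    match = VIRUS_NAME.search(header)
    return match.group(2).strip() if match else None


def assign_country(header: str, mapping: dict) -> str:
    """Map a FASTA header to a country name, falling back to 'Other'."""
    location = extract_location(header)
    if location is None:
        return "Other"
    return mapping.get(location, "Other")


print(assign_country(">A/duck/Guangdong/1234/2016|HA", LOCATION_TO_COUNTRY))
print(assign_country(">A/goose/Atlantis/1/2020|HA", LOCATION_TO_COUNTRY))
```

In practice the mapping would be loaded once with `json.load` from the dictionary file instead of being defined inline. In this sketch, a header whose virus name cannot be parsed is also labeled Other; whether the notebook treats that case the same way is not stated in the README.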

Features

  • Deterministic parsing using regular expressions
  • ISO 3166-1-based country mapping (no AI inference)
  • Robust handling of location extraction errors
  • Export of location list and country sample counts
  • Compatible with Jupyter Notebook and Python 3.12+

Output

  • location_list.csv: Extracted location names and frequencies
  • country_stat.csv: Country-level sample count distribution
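As a rough idea of how these two files could be produced, the sketch below counts locations and countries with the standard library and writes one CSV per table. The column names (`location`, `country`, `count`), the sort order, and the helper names `write_counts` and `export_tables` are assumptions; the notebook may well use pandas and a different layout.

```python
import csv
from collections import Counter
from pathlib import Path


def write_counts(path: Path, header: tuple, counts: Counter) -> None:
    """Write (name, count) rows to a CSV file, most frequent first."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(counts.most_common())


def export_tables(records, out_dir: Path) -> None:
    """Export both tables from (location, country) pairs, one pair per sequence."""
    records = list(records)
    location_counts = Counter(location for location, _ in records)
    country_counts = Counter(country for _, country in records)
    write_counts(out_dir / "location_list.csv", ("location", "count"), location_counts)
    write_counts(out_dir / "country_stat.csv", ("country", "count"), country_counts)
```

Calling `export_tables(pairs, Path("."))` would write both files next to the notebook.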

Requirements

Install dependencies with:

    pip install -r requirements.txt

Or create a dedicated Conda environment:

    conda create -n rbca-env python=3.12
    conda activate rbca-env
    pip install -r requirements.txt

🚀 How to Run

  1. Launch Jupyter:

    jupyter notebook

  2. Open the file: Fasta RBCA R1.ipynb

  3. Follow the notebook cells step-by-step.

📂 Note: Make sure your FASTA file is placed in the same directory as the notebook. The script automatically detects the latest .fasta file for processing.
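For readers wondering what "latest" might mean here, one plausible reading is the .fasta file with the most recent modification time. The sketch below illustrates that interpretation only. `find_latest_fasta` and `iter_headers` are hypothetical helper names, and the notebook may instead select the file differently or read it with Biopython's `SeqIO`.

```python
from pathlib import Path
from typing import Iterator, Optional


def find_latest_fasta(directory: Path) -> Optional[Path]:
    """Return the most recently modified *.fasta file, or None if there is none."""
    candidates = list(directory.glob("*.fasta"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


def iter_headers(path: Path) -> Iterator[str]:
    """Yield each FASTA header line without the leading '>'."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                yield line[1:].strip()
```

With these helpers, `find_latest_fasta(Path.cwd())` would pick the input file from the notebook's directory, and `iter_headers` would supply the header strings for location extraction.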

Citation

He, Jie-Long. (2025). Fasta RBCA R1: Rule-Based Country Assignment (RBCA) for Avian Influenza FASTA files (v1.0.2). Zenodo. https://doi.org/10.5281/zenodo.15342773

License

MIT License

Owner

  • Login: Bambusaoldhamii
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this notebook, please cite it as below:"
title: "Fasta RBCA R1: Rule-Based Country Assignment (RBCA) for Avian Influenza FASTA files"
version: 1.0.2
authors:
  - family-names: YourLastName
    given-names: YourFirstName
    affiliation: Asia University
date-released: 2025-05-05
doi: 10.5281/zenodo.15342773
url: https://github.com/yourusername/fasta-rbca-resolver

GitHub Events

Total
  • Release event: 4
  • Push event: 10
  • Create event: 5
Last Year
  • Release event: 4
  • Push event: 10
  • Create event: 5

Dependencies

requirements.txt pypi
  • biopython ==1.85
  • geopandas ==0.14.3
  • matplotlib ==3.8.4
  • pandas ==2.2.3
  • plotly ==5.21.0
  • tqdm ==4.67.1