https://github.com/biodivhealth/lf_nigeria_reports
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: BioDivHealth
- License: cc0-1.0
- Language: Python
- Default Branch: main
- Size: 313 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Lassa Fever Reports Scraping Pipeline
A Python-based data processing pipeline for scraping, processing, and analyzing Lassa fever situation reports from the Nigeria Centre for Disease Control (NCDC).
The latest dataset is available here 🔗
Project Overview
This pipeline automates the end-to-end processing of weekly Lassa fever reports:
- Scrapes the NCDC website for report listings and extracts metadata.
- Downloads raw PDF reports and organizes them by year.
- Enhances table images in PDFs for accurate data extraction.
- Uses Google Gemini AI to extract structured case data (Suspected, Confirmed, Probable, HCW, Deaths) at state and week granularity.
- Validates logical consistency (Suspected ≥ Confirmed ≥ Deaths) with retry and correction logic.
- Combines per-year CSV datasets into a unified master CSV for time-series analysis.
Data sources:
- Raw PDF situation reports from NCDC
- Intermediate enhanced table images
- Yearly extracted CSVs
- Final combined master CSV
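The consistency rule listed above (Suspected ≥ Confirmed ≥ Deaths) can be sketched as a small check on one state-week record. This is an illustrative sketch only, not the repository's validation code: the `CaseRow` class and the `consistency_errors` function are hypothetical names, the non-negativity check is an added assumption, and the retry and correction logic is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseRow:
    """One state-week record; field names mirror the exported CSV columns."""
    year: int
    week: int
    states: str
    suspected: int
    confirmed: int
    probable: int
    hcw: int
    deaths: int


def consistency_errors(row: CaseRow) -> List[str]:
    """Return a list of problems; an empty list means the row passes."""
    errors = []
    counts = {
        "suspected": row.suspected,
        "confirmed": row.confirmed,
        "probable": row.probable,
        "hcw": row.hcw,
        "deaths": row.deaths,
    }
    # Assumption (not stated in the README): counts are never negative.
    for name, value in counts.items():
        if value < 0:
            errors.append(f"{name} is negative ({value})")
    # The rule from the README: Suspected >= Confirmed >= Deaths.
    if row.suspected < row.confirmed:
        errors.append("suspected < confirmed")
    if row.confirmed < row.deaths:
        errors.append("confirmed < deaths")
    return errors
```

A caller could re-run extraction for any row where the returned list is non-empty; how the pipeline actually retries is not described here.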
Repository Structure
```text
Lassa_Reports_Scraping/
├── README.md                                 # This file
├── main.py                                   # Orchestrates the full pipeline
├── requirements.txt                          # Python dependencies
├── .env                                      # Environment variables (API keys)
├── data/
│   ├── raw/                                  # Raw downloaded PDFs (organized by year)
│   ├── processed/                            # Processed data and images
│   │   ├── PDF/                              # Enhanced table images for each year
│   │   └── CSV/                              # Extracted and sorted CSV data by year
│   └── documentation/                        # Metadata and status tracking CSVs
├── exports/                                  # Exported final datasets (CSV, README)
├── src/                                      # Core pipeline scripts
│   ├── 01_URL_Sourcing.py                    # Scrape report URLs and metadata, update Supabase
│   ├── 02_PDF_Download_Supabase.py           # Sync/download PDFs, update download status in Supabase
│   ├── 03a_SyncEnhancement.py                # Sync enhanced status between B2 and Supabase
│   ├── 03b_TableEnhancement_Supabase.py      # Enhance table images, upload to B2, update DB
│   ├── 04a_SyncProcessed.py                  # Sync processed (CSV) status between B2 and Supabase
│   ├── 04b_LLM_Extraction_Supabase.py        # Extract tables from images using Gemini AI, save as CSV, update DB
│   ├── 05a_SyncCombiningStatus.py            # Sync 'combined' status for CSVs between local, B2, and Supabase
│   ├── 05b_PushToDB.py                       # Push processed CSVs to main DB table (lassa_data)
│   ├── 05c_CombinedStatus.py                 # Ensure DB 'combined' status matches data table
│   ├── 05d_CleanStates.py                    # Standardize state names in lassa_data
│   ├── 06_CloudSync.py                       # Upload all pipeline artifacts to B2 cloud storage
│   ├── 07_ExportData.py                      # Export final data to CSV, upload to Supabase Storage
│   └── utils/                                # Utility modules (logging, cloud, db, validation, etc.)
└── notebooks/                                # Jupyter notebooks and experiments
```
Pipeline Script Overview
| Script Name | Description |
|-----------------------------------|-------------|
| 01_URL_Sourcing.py | Scrape NCDC website for Lassa fever reports, extract metadata, update Supabase |
| 02_PDF_Download_Supabase.py | Sync/download PDFs from B2, update download status in Supabase |
| 03a_SyncEnhancement.py | Sync 'enhanced' status for images between B2 and Supabase |
| 03b_TableEnhancement_Supabase.py | Enhance table images from PDFs, upload to B2, update DB |
| 04a_SyncProcessed.py | Sync 'processed' (CSV) status between B2 and Supabase |
| 04b_LLM_Extraction_Supabase.py | Extract tables from enhanced images using Gemini AI, validate, save as CSV, update DB |
| 05a_SyncCombiningStatus.py | Sync 'combined' status for CSVs between local, B2, and Supabase |
| 05b_PushToDB.py | Push processed CSVs to the main DB table (lassa_data) |
| 05c_CombinedStatus.py | Ensure DB 'combined' status matches actual data table |
| 05d_CleanStates.py | Standardize and clean state names in lassa_data |
| 06_CloudSync.py | Upload all pipeline artifacts (PDFs, images, CSVs) to B2 cloud storage |
| 07_ExportData.py | Export final, cleaned data to CSV and Supabase Storage |
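To illustrate what standardizing state names (the job of 05d_CleanStates.py) might involve, here is a minimal sketch of fuzzy matching against a canonical list. It is not the repository's implementation: it uses the standard-library `difflib` as a stand-in for the `fuzzywuzzy` dependency, and `CANONICAL_STATES`, `clean_state`, and the 0.8 cutoff are hypothetical choices made for the example.

```python
import difflib
from typing import Optional

# Illustrative subset only; a real list would cover every state and the FCT.
CANONICAL_STATES = ["Ondo", "Edo", "Bauchi", "Taraba", "Ebonyi"]


def clean_state(raw: str, cutoff: float = 0.8) -> Optional[str]:
    """Map a raw label to a canonical state name, or None if nothing is close."""
    candidate = raw.strip().title()
    # get_close_matches returns the best-scoring candidates first.
    matches = difflib.get_close_matches(candidate, CANONICAL_STATES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Returning `None` for unmatched labels, rather than guessing, leaves it to the caller to flag those rows for manual review.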
Setup
Clone the repository and create a virtual environment:
```bash
git clone <repo_url>
cd Lassa_Reports_Scraping
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Create a `.env` file in the project root with your API keys, e.g.:
```bash
GOOGLE_GENAI_API_KEY=<your_key>
B2_APPLICATION_KEY_ID=<your_key>
B2_APPLICATION_KEY=<your_key>
B2_BUCKET_NAME=<your_key>
DATABASE_URL=<your_key>
SUPABASE_URL=<your_key>
SUPABASE_KEY=<your_key>
```
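Before running the pipeline, it can help to confirm that every variable listed above is set. The following is a minimal sketch using only the standard library; `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names, not part of the repository. It assumes the values have already been loaded into the process environment, for example with `load_dotenv()` from the `python-dotenv` dependency.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_ENV_VARS = [
    "GOOGLE_GENAI_API_KEY",
    "B2_APPLICATION_KEY_ID",
    "B2_APPLICATION_KEY",
    "B2_BUCKET_NAME",
    "DATABASE_URL",
    "SUPABASE_URL",
    "SUPABASE_KEY",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```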
Usage
Run the Full Pipeline
```bash
python main.py
```
This executes the following steps in order:
- URL Sourcing (`src/01_URL_Sourcing.py`)
- PDF Download (`src/02_PDF_Download_Supabase.py`)
- SyncEnhancement (`src/03a_SyncEnhancement.py`)
- Table Enhancement (`src/03b_TableEnhancement_Supabase.py`)
- SyncProcessed (`src/04a_SyncProcessed.py`)
- LLM Extraction (`src/04b_LLM_Extraction_Supabase.py`)
- SyncCombiningStatus (`src/05a_SyncCombiningStatus.py`)
- PushToDB (`src/05b_PushToDB.py`)
- CombinedStatus (`src/05c_CombinedStatus.py`)
- State Cleaning (`src/05d_CleanStates.py`)
- CloudSync (`src/06_CloudSync.py`)
- ExportData (`src/07_ExportData.py`)
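The step order above can be sketched as a simple sequential runner. This is a hedged illustration of one way an orchestrator could work, not the contents of `main.py`: `PIPELINE_STEPS` and `run_pipeline` are hypothetical names, and running each script as a separate subprocess is an assumption.

```python
import subprocess
import sys

# Script paths in the order listed above.
PIPELINE_STEPS = [
    "src/01_URL_Sourcing.py",
    "src/02_PDF_Download_Supabase.py",
    "src/03a_SyncEnhancement.py",
    "src/03b_TableEnhancement_Supabase.py",
    "src/04a_SyncProcessed.py",
    "src/04b_LLM_Extraction_Supabase.py",
    "src/05a_SyncCombiningStatus.py",
    "src/05b_PushToDB.py",
    "src/05c_CombinedStatus.py",
    "src/05d_CleanStates.py",
    "src/06_CloudSync.py",
    "src/07_ExportData.py",
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each script in order and return the list of completed steps."""
    completed = []
    for script in steps:
        # With check=True, subprocess.run raises CalledProcessError on a
        # non-zero exit code, so later steps never run on stale inputs.
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed


if __name__ == "__main__":
    run_pipeline()
```

The `runner` parameter exists only so the loop can be exercised without launching the real scripts.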
Data Flow
- Raw PDFs: `data/raw/`, `PDFs_Sourced/`
- Enhanced Images: `data/processed/PDFs_Lines_{year}/`
- Extracted CSV: `data/processed/CSV_LF_{year}_Sorted/`
- Combined Master CSV: `data/processed/combined_lassa_data_{years}.csv`
- Metadata: `data/documentation/website_raw_data.csv` and status CSVs
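The step from per-year extracted CSVs to a combined master CSV can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's code: `combine_yearly_csvs` and `write_master_csv` are hypothetical names, and the per-year files are assumed to share the column names of the exported dataset.

```python
import csv
import io

# Assumed column layout, matching the exported dataset.
COLUMNS = ["year", "week", "states", "suspected", "confirmed", "probable", "hcw", "deaths"]


def combine_yearly_csvs(sources):
    """Merge rows from several open CSV files and sort by year, week, state."""
    rows = []
    for source in sources:
        for record in csv.DictReader(source):
            # Keep only the expected columns, in a fixed order.
            rows.append({column: record[column] for column in COLUMNS})
    rows.sort(key=lambda r: (int(r["year"]), int(r["week"]), r["states"]))
    return rows


def write_master_csv(rows, target):
    """Write the combined rows to an open text file or in-memory buffer."""
    writer = csv.DictWriter(target, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Tiny in-memory demo with made-up placeholder values, not real case counts.
    header = ",".join(COLUMNS) + "\n"
    demo = [io.StringIO(header + "2024,2,Ondo,5,2,0,0,0\n"),
            io.StringIO(header + "2023,10,Edo,3,1,0,0,0\n")]
    out = io.StringIO()
    write_master_csv(combine_yearly_csvs(demo), out)
    print(out.getvalue())
```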
Data Access
Download Latest Data
The pipeline automatically exports the latest Lassa fever case data to CSV files. You can access this data in two ways:
Direct Download from GitHub:
- Navigate to the exports directory in the repository
- Download `lassa_data_latest.csv` for the most recent data
Supabase Storage:
- The data is also available through Supabase Storage
- Direct download link: click here
Data Format
Each CSV file contains the following columns:
- year: Year of the report
- week: Epidemiological week number
- states: Nigerian state name
- suspected: Number of suspected cases
- confirmed: Number of confirmed cases
- probable: Number of probable cases
- hcw: Number of healthcare worker cases
- deaths: Number of deaths
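A short sketch of reading a file in this format into typed rows, using only the standard library. `load_rows` is a hypothetical helper; treating blank cells as "not reported" and tolerating values such as `3.0` are assumptions, since the README does not say how missing values are encoded.

```python
import csv
import io

INT_COLUMNS = ("year", "week", "suspected", "confirmed", "probable", "hcw", "deaths")


def load_rows(source):
    """Parse a CSV with the columns listed above into a list of dicts."""
    rows = []
    for record in csv.DictReader(source):
        row = {"states": record["states"].strip()}
        for column in INT_COLUMNS:
            text = (record[column] or "").strip()
            # Assumption: a blank cell means "not reported" and becomes None.
            # int(float(...)) also accepts values written as "3.0".
            row[column] = int(float(text)) if text else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Made-up placeholder row for demonstration, not real case data.
    sample = io.StringIO(
        "year,week,states,suspected,confirmed,probable,hcw,deaths\n"
        "2024,3,Ondo,12,4,,1,0\n"
    )
    print(load_rows(sample))
```

For the real file, the same function could be called on `open("exports/lassa_data_latest.csv", newline="")`.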
License and data attribution
- License (code and derived CSVs): CC0 1.0 Universal (Public Domain Dedication). See `LICENSE`.
- Source data attribution: The raw situation report PDFs and underlying figures are published by the Nigeria Centre for Disease Control (NCDC) and made publicly available. This repository automates retrieval and produces a cleaned, combined dataset for convenience; we do not claim ownership over NCDC materials.
How to cite
If you use this repository or the dataset, please cite it. A machine-readable citation file is provided in CITATION.cff. Example citations:
- Software: Trebski, A. (2025). Lassa Fever NCDC Reports Sourcing Pipeline. GitHub. https://github.com/BioDivHealth/LF_Nigeria_Reports
- Dataset: Trebski, A. (2025). Lassa Fever weekly NCDC reports dataset. https://github.com/BioDivHealth/LF_Nigeria_Reports/blob/main/exports/lassa_data_latest.csv
Owner
- Name: Biodiversity and Health
- Login: BioDivHealth
- Kind: organization
- Repositories: 1
- Profile: https://github.com/BioDivHealth
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software or the derived dataset, please cite it as below."
title: "Lassa Fever NCDC Reports Sourcing Pipeline"
repository-code: "https://github.com/BioDivHealth/LF_Nigeria_Reports"
license: CC0-1.0
keywords:
  - lassa fever
  - nigeria
  - epidemiology
  - data pipeline
authors:
  - family-names: Trebski
    given-names: Artur
preferred-citation:
  type: software
  title: "Lassa Fever NCDC Reports Sourcing Pipeline"
  authors:
    - family-names: Trebski
      given-names: Artur
  repository-code: "https://github.com/BioDivHealth/LF_Nigeria_Reports"
  year: 2025
  url: "https://github.com/BioDivHealth/LF_Nigeria_Reports"
references:
  - type: dataset
    title: "Lassa Fever weekly NCDC reports dataset"
    authors:
      - family-names: Trebski
        given-names: Artur
    year: 2025
    url: "https://github.com/BioDivHealth/LF_Nigeria_Reports/blob/main/exports/lassa_data_latest.csv"
    notes: "Derived from publicly available weekly situation reports by the Nigeria Centre for Disease Control (NCDC)."
  - type: webpage
    title: "Nigeria Centre for Disease Control (NCDC) Weekly Epidemiological Reports"
    url: "https://ncdc.gov.ng/diseases/sitreps/?cat=5&name=An%20update%20of%20Lassa%20fever%20outbreak%20in%20Nigeria"
GitHub Events
Total
- Push event: 3
Last Year
- Push event: 3
Dependencies
- actions/checkout v3 composite
- githubocto/flat v3 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- Pillow >=10.0.0
- PyMuPDF >=1.22.5
- b2sdk >=2.8.1
- beautifulsoup4 >=4.12.2
- cloudscraper >=1.2.71
- fuzzywuzzy >=0.18.0
- google-genai >=1.12.1
- google-generativeai >=0.8.4
- numpy >=1.24.0
- opencv-python >=4.11.0.86
- pandas >=1.4.0
- psycopg2-binary >=2.9.10
- pydantic >=2.4.0
- python-dotenv >=1.0.0
- requests >=2.31.0
- sqlalchemy >=2.0.40