https://github.com/biodivhealth/lf_nigeria_reports

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: BioDivHealth
  • License: cc0-1.0
  • Language: Python
  • Default Branch: main
  • Size: 313 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

Lassa Fever Reports Scraping Pipeline

A Python-based data processing pipeline for scraping, processing, and analyzing Lassa fever situation reports from the Nigeria Centre for Disease Control (NCDC).

The latest dataset is available here 🔗


Project Overview

This pipeline automates the end-to-end processing of weekly Lassa fever reports:

  • Scrapes the NCDC website for report listings and extracts metadata.
  • Downloads raw PDF reports and organizes them by year.
  • Enhances table images in PDFs for accurate data extraction.
  • Uses Google Gemini AI to extract structured case data (Suspected, Confirmed, Probable, HCW, Deaths) at state and week granularity.
  • Validates logical consistency (Suspected ≥ Confirmed ≥ Deaths) with retry and correction logic.
  • Combines per-year CSV datasets into a unified master CSV for time-series analysis.
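The consistency rule above can be sketched as a simple predicate (a minimal illustration only; the actual retry and correction logic lives in src/04b_LLM_Extraction_Supabase.py):

```python
def is_consistent(suspected: int, confirmed: int, deaths: int) -> bool:
    """Per state-week row, suspected cases must bound confirmed,
    which in turn must bound deaths."""
    return suspected >= confirmed >= deaths

print(is_consistent(12, 5, 1))  # True: 12 >= 5 >= 1
print(is_consistent(3, 7, 1))   # False: confirmed exceeds suspected
```

Rows failing a check like this would be flagged for re-extraction rather than written to the final CSV.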

Data sources:

  • Raw PDF situation reports from NCDC
  • Intermediate enhanced table images
  • Yearly extracted CSVs
  • Final combined master CSV

Repository Structure

```text
Lassa_Reports_Scraping/
├── README.md                            # This file
├── main.py                              # Orchestrates the full pipeline
├── requirements.txt                     # Python dependencies
├── .env                                 # Environment variables (API keys)
├── data/
│   ├── raw/                             # Raw downloaded PDFs (organized by year)
│   ├── processed/                       # Processed data and images
│   │   ├── PDF/                         # Enhanced table images for each year
│   │   ├── CSV/                         # Extracted and sorted CSV data by year
│   └── documentation/                   # Metadata and status tracking CSVs
├── exports/                             # Exported final datasets (CSV, README)
├── src/                                 # Core pipeline scripts
│   ├── 01_URL_Sourcing.py               # Scrape report URLs and metadata, update Supabase
│   ├── 02_PDF_Download_Supabase.py      # Sync/download PDFs, update download status in Supabase
│   ├── 03a_SyncEnhancement.py           # Sync enhanced status between B2 and Supabase
│   ├── 03b_TableEnhancement_Supabase.py # Enhance table images, upload to B2, update DB
│   ├── 04a_SyncProcessed.py             # Sync processed (CSV) status between B2 and Supabase
│   ├── 04b_LLM_Extraction_Supabase.py   # Extract tables from images using Gemini AI, save as CSV, update DB
│   ├── 05a_SyncCombiningStatus.py       # Sync 'combined' status for CSVs between local, B2, and Supabase
│   ├── 05b_PushToDB.py                  # Push processed CSVs to main DB table (lassa_data)
│   ├── 05c_CombinedStatus.py            # Ensure DB 'combined' status matches data table
│   ├── 05d_CleanStates.py               # Standardize state names in lassa_data
│   ├── 06_CloudSync.py                  # Upload all pipeline artifacts to B2 cloud storage
│   ├── 07_ExportData.py                 # Export final data to CSV, upload to Supabase Storage
│   └── utils/                           # Utility modules (logging, cloud, db, validation, etc.)
└── notebooks/                           # Jupyter notebooks and experiments
```


Pipeline Script Overview

| Script Name | Description |
|---|---|
| 01_URL_Sourcing.py | Scrape NCDC website for Lassa fever reports, extract metadata, update Supabase |
| 02_PDF_Download_Supabase.py | Sync/download PDFs from B2, update download status in Supabase |
| 03a_SyncEnhancement.py | Sync 'enhanced' status for images between B2 and Supabase |
| 03b_TableEnhancement_Supabase.py | Enhance table images from PDFs, upload to B2, update DB |
| 04a_SyncProcessed.py | Sync 'processed' (CSV) status between B2 and Supabase |
| 04b_LLM_Extraction_Supabase.py | Extract tables from enhanced images using Gemini AI, validate, save as CSV, update DB |
| 05a_SyncCombiningStatus.py | Sync 'combined' status for CSVs between local, B2, and Supabase |
| 05b_PushToDB.py | Push processed CSVs to the main DB table (lassa_data) |
| 05c_CombinedStatus.py | Ensure DB 'combined' status matches actual data table |
| 05d_CleanStates.py | Standardize and clean state names in lassa_data |
| 06_CloudSync.py | Upload all pipeline artifacts (PDFs, images, CSVs) to B2 cloud storage |
| 07_ExportData.py | Export final, cleaned data to CSV and Supabase Storage |


Setup

  1. Clone the repository and create a virtual environment:

```bash
git clone <repo_url>
cd Lassa_Reports_Scraping
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

  2. Create a .env file in the project root with your API keys and connection details, e.g.:

```bash
GOOGLE_GENAI_API_KEY=<your_key>
B2_APPLICATION_KEY_ID=<your_key_id>
B2_APPLICATION_KEY=<your_key>
B2_BUCKET_NAME=<your_bucket_name>
DATABASE_URL=<your_database_url>
SUPABASE_URL=<your_supabase_url>
SUPABASE_KEY=<your_key>
```

Usage

Run the Full Pipeline

```bash
python main.py
```

This executes the following steps in order:

  1. URL Sourcing (src/01_URL_Sourcing.py)
  2. PDF Download (src/02_PDF_Download_Supabase.py)
  3. SyncEnhancement (src/03a_SyncEnhancement.py)
  4. Table Enhancement (src/03b_TableEnhancement_Supabase.py)
  5. SyncProcessed (src/04a_SyncProcessed.py)
  6. LLM Extraction (src/04b_LLM_Extraction_Supabase.py)
  7. SyncCombiningStatus (src/05a_SyncCombiningStatus.py)
  8. PushToDB (src/05b_PushToDB.py)
  9. CombinedStatus (src/05c_CombinedStatus.py)
  10. State Cleaning (src/05d_CleanStates.py)
  11. CloudSync (src/06_CloudSync.py)
  12. ExportData (src/07_ExportData.py)
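The sequencing above could be sketched as follows (a hypothetical reconstruction of main.py's orchestration; the real script may differ in error handling and configuration):

```python
import subprocess
import sys
from pathlib import Path

# The twelve pipeline stages, in the order listed above.
STEPS = [
    "01_URL_Sourcing.py",
    "02_PDF_Download_Supabase.py",
    "03a_SyncEnhancement.py",
    "03b_TableEnhancement_Supabase.py",
    "04a_SyncProcessed.py",
    "04b_LLM_Extraction_Supabase.py",
    "05a_SyncCombiningStatus.py",
    "05b_PushToDB.py",
    "05c_CombinedStatus.py",
    "05d_CleanStates.py",
    "06_CloudSync.py",
    "07_ExportData.py",
]

def run_pipeline(src_dir: str = "src") -> None:
    """Run each stage as a subprocess; check=True aborts on the first failure."""
    for script in STEPS:
        path = Path(src_dir) / script
        print(f"--- running {path} ---")
        subprocess.run([sys.executable, str(path)], check=True)
```

Running the stages as separate processes keeps their Supabase/B2 state isolated, and the numeric prefixes make the intended order explicit.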

Data Flow

  1. Raw PDFs: data/raw/, PDFs_Sourced/
  2. Enhanced Images: data/processed/PDFs_Lines_{year}/
  3. Extracted CSV: data/processed/CSV_LF_{year}_Sorted/
  4. Combined Master CSV: data/processed/combined_lassa_data_{years}.csv
  5. Metadata: data/documentation/website_raw_data.csv and status CSVs

Data Access

Download Latest Data

The pipeline automatically exports the latest Lassa fever case data to CSV files. You can access this data in two ways:

  1. Direct Download from GitHub:

    • Navigate to the exports directory in the repository
    • Download lassa_data_latest.csv for the most recent data
  2. Supabase Storage:

    • The data is also available through Supabase Storage
    • Direct download link: click here

Data Format

Each CSV file contains the following columns:

  • year: Year of the report
  • week: Epidemiological week number
  • states: Nigerian state name
  • suspected: Number of suspected cases
  • confirmed: Number of confirmed cases
  • probable: Number of probable cases
  • hcw: Number of healthcare worker cases
  • deaths: Number of deaths

License and data attribution

  • License (code and derived CSVs): CC0 1.0 Universal (Public Domain Dedication). See LICENSE.
  • Source data attribution: The raw situation report PDFs and underlying figures are published by the Nigeria Centre for Disease Control (NCDC) and made publicly available. This repository automates retrieval and produces a cleaned, combined dataset for convenience; we do not claim ownership over NCDC materials.

How to cite

If you use this repository or the dataset, please cite it. A machine-readable citation file is provided in CITATION.cff. Example citations:

  • Software: Trebski, A. (2025). Lassa Fever NCDC Reports Sourcing Pipeline. GitHub. https://github.com/BioDivHealth/LF_Nigeria_Reports
  • Dataset: Trebski, A. (2025). Lassa Fever weekly NCDC reports dataset. https://github.com/BioDivHealth/LF_Nigeria_Reports/blob/main/exports/lassa_data_latest.csv

Owner

  • Name: Biodiversity and Health
  • Login: BioDivHealth
  • Kind: organization

Citation (CITATION.cff)

```yaml
cff-version: 1.2.0
message: "If you use this software or the derived dataset, please cite it as below."
title: "Lassa Fever NCDC Reports Sourcing Pipeline"
repository-code: "https://github.com/BioDivHealth/LF_Nigeria_Reports"
license: CC0-1.0
keywords:
  - lassa fever
  - nigeria
  - epidemiology
  - data pipeline
authors:
  - family-names: Trebski
    given-names: Artur
preferred-citation:
  type: software
  title: "Lassa Fever NCDC Reports Sourcing Pipeline"
  authors:
    - family-names: Trebski
      given-names: Artur
  repository-code: "https://github.com/BioDivHealth/LF_Nigeria_Reports"
  year: 2025
  url: "https://github.com/BioDivHealth/LF_Nigeria_Reports"
references:
  - type: dataset
    title: "Lassa Fever weekly NCDC reports dataset"
    authors:
      - family-names: Trebski
        given-names: Artur
    year: 2025
    url: "https://github.com/BioDivHealth/LF_Nigeria_Reports/blob/main/exports/lassa_data_latest.csv"
    notes: "Derived from publicly available weekly situation reports by the Nigeria Centre for Disease Control (NCDC)."
  - type: webpage
    title: "Nigeria Centre for Disease Control (NCDC) Weekly Epidemiological Reports"
    url: "https://ncdc.gov.ng/diseases/sitreps/?cat=5&name=An%20update%20of%20Lassa%20fever%20outbreak%20in%20Nigeria"
```

GitHub Events

Total
  • Push event: 3
Last Year
  • Push event: 3

Dependencies

.github/workflows/flat-one.yml actions
  • actions/checkout v3 composite
  • githubocto/flat v3 composite
.github/workflows/lassa-scraping.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/test.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/url-sourcing.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
requirements.txt pypi
  • Pillow >=10.0.0
  • PyMuPDF >=1.22.5
  • b2sdk >=2.8.1
  • beautifulsoup4 >=4.12.2
  • cloudscraper >=1.2.71
  • fuzzywuzzy >=0.18.0
  • google-genai >=1.12.1
  • google-generativeai >=0.8.4
  • numpy >=1.24.0
  • opencv-python >=4.11.0.86
  • pandas >=1.4.0
  • psycopg2-binary >=2.9.10
  • pydantic >=2.4.0
  • python-dotenv >=1.0.0
  • requests >=2.31.0
  • sqlalchemy >=2.0.40