epylabel

This repository contains the code for the manuscript Ensemble-labeling of infectious diseases time series to evaluate early warning systems with which you can reproduce the manuscript's results and figures.

https://github.com/robert-koch-institut/epylabel

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

7-tage-inzidenz covid-19 epidemiologie epidemiology germany gesundheitsberichterstattung incidence infections infektion inzidenz open-data open-source public-health-surveillance python r rki sars-cov-2

Repository

Basic Info
  • Host: GitHub
  • Owner: robert-koch-institut
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 4.84 MB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed 6 months ago

Readme.md

Documentation

Epylabel: Ensemble-labeling of infectious diseases time series




Andreas Hicketier¹, Moritz Bach¹, Philip Oedi¹, Alexander Ullrich¹, & Auss Abbood²


  ¹ Robert Koch-Institut | Unit 32
  ² Robert Koch-Institut | ZIG 1


Cite
Hicketier, A., Bach, M., Oedi, P., Ullrich, A., & Abbood, A. (2024). Epylabel: Ensemble-labeling of infectious diseases time series. Zenodo. https://doi.org/10.5281/zenodo.12665040


Abstract
This repository contains the code for the manuscript "Ensemble-labeling of Infectious Diseases Time Series to Evaluate Early Warning Systems" (Epylabel), with which the manuscript's results and figures can be reproduced. Developed at the Robert Koch Institute within the DAKI-FWS project, this Python/R-based tool combines several individual labeling techniques through a majority-voting ensemble to detect diverse outbreak patterns across varying spatial resolutions. The resulting labels were used to benchmark machine learning models and compare them with traditional outbreak detection methods.


Table of Contents

  • Project Information
  • Installation
  • Running the Code
  • Code
  • Data
  • Collaborate
  • Publication platforms
  • License


Project Information

This code was developed at the Robert Koch Institute as part of the project Daten- und KI-gestütztes Frühwarnsystem zur Stabilisierung der deutschen Wirtschaft (DAKI-FWS), funded by the Federal Ministry for Economic Affairs and Climate Action. The project launched on 1 December 2021 and ends on 30 November 2024. Together with over a dozen research and industry partners, we work on preventing economic losses like those seen during the COVID-19 pandemic with the help of early warning systems. These systems are not limited to infectious diseases; this code was developed within a work package on early warning for infectious diseases. For more information on the project, visit the DAKI-FWS website and the Digitale Technologien website of the German Federal Ministry for Economic Affairs and Climate Action.

Administrative and organizational information

This work was conducted by staff from Unit 32 | Surveillance with technical supervision by Alexander Ullrich and Auss Abbood from ZIG 1 | Information Centre for International Health Protection (INIG). The publication of the code and the quality management of the metadata are handled by Department MF 4 | Domain Specific Data and Research Data Management. Questions regarding data management and the publication infrastructure can be directed to the Open Data Team of Department MF 4 at OpenData@rki.de.

Motivation

Early warning systems (EWS) can help make informed public health decisions. Depending on the EWS, various evaluation strategies exist, such as simulating data with outbreaks or using expert-labeled data. In the absence of ground truth knowledge about outbreaks, we can use post-hoc labeling methods. While these perform well for a selection of well-behaved disease time series, they do not perform as well on heterogeneous COVID-19 time series. To address this evaluation gap, we propose an adaptive labeling method that produces useful labels on highly heterogeneous, non-stationary COVID-19 time series.

This repository allows you to use our ensemble labeling method. It detects various outbreak types, such as waves or short peaks, as they occur at different spatial resolutions, and uses a majority vote to assign outbreak labels post hoc for the evaluation of EWSs. This repository also contains evaluation experiments in which our labels were used to train machine learning models, which we compared with traditional outbreak detection methods.

Installation

Our scripts make use of Python and R. Please make sure you have both programming languages installed. We also encourage users to use conda as an environment management tool for this repo. After installing Anaconda or Miniconda, run the following commands in a properly configured shell:

    conda env create -f environment.yml
    conda activate epylabel

Running the Code

Warning: This repo uses rpy2, a Python library that enables running R code and libraries in Python. As of now, rpy2 is not supported on Windows, so this repo may not work for you if you use Windows.
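Before running the scripts, a quick check of the local setup can save time. The following is a minimal sketch, not part of the repository: the function name `preflight_problems` is hypothetical, and the check for an `R` executable on the PATH assumes that R was installed in a way that exposes it there (as an activated conda environment typically does).

```python
import importlib.util
import platform
import shutil


def preflight_problems():
    """Return human-readable reasons why the scripts might not run here."""
    problems = []
    # The README warns that rpy2 is not supported on Windows.
    if platform.system() == "Windows":
        problems.append("Windows detected: rpy2 may not work on this platform.")
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("rpy2") is None:
        problems.append("rpy2 is not importable: is the conda environment active?")
    # Assumption: R is expected to be reachable via the PATH.
    if shutil.which("R") is None:
        problems.append("No R executable found on PATH.")
    return problems


for problem in preflight_problems():
    print("WARNING:", problem)
```

An empty list means none of these three checks found a problem; it does not guarantee that every R package the scripts need is installed.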

Reproduce Labels

To reproduce the labels presented in the manuscript run python paper_labels.py after the appropriate conda environment has been activated. Note, you need to navigate to the folder containing this script for it to work.

Generate Figures

You can also reproduce the figures from the manuscript using python paper_plots.py

Generating Docs

You can build the docs with Sphinx:

    sphinx-build -b html docs/source/ docs/build/

Code

This repo uses a pipeline approach to compose the ensemble of labeling methods. Each labeling method inherits from the abstract class Transformation (see labeler.py). These classes need to implement the transform() method, which returns either labels or transformed data.

The Pipeline class allows you to execute transform operations of various labeling methods successively.
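The pattern described so far can be sketched with stand-in classes. This is an illustration only, not the code in epylabel: the names `Transformation` and `Pipeline` mirror the text, while `Smooth`, `Threshold`, their parameters, and the single-argument `transform(data)` signature are assumptions made for this example.

```python
from abc import ABC, abstractmethod

import pandas as pd


class Transformation(ABC):
    """Stand-in base class: every step exposes transform()."""

    @abstractmethod
    def transform(self, data: pd.DataFrame) -> pd.DataFrame:
        ...


class Smooth(Transformation):
    """Hypothetical step that returns transformed data (a rolling mean)."""

    def __init__(self, window: int = 3):
        self.window = window

    def transform(self, data):
        return data.rolling(self.window, min_periods=1).mean()


class Threshold(Transformation):
    """Hypothetical step that returns boolean labels."""

    def __init__(self, limit: float):
        self.limit = limit

    def transform(self, data):
        return data > self.limit


class Pipeline(Transformation):
    """Stand-in pipeline: feeds each step's output into the next step."""

    def __init__(self, steps):
        self.steps = list(steps)

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data


# One column per location; the values are made up.
series = pd.DataFrame({"0": [1, 2, 9, 10, 2, 1]})
labels = Pipeline([Smooth(window=2), Threshold(limit=5)]).transform(series)
print(labels)
```

Because the pipeline itself exposes `transform()`, it can be nested inside another pipeline in this sketch; whether the repository's `Pipeline` supports nesting is not stated in this README.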

Lastly, the Ensemble class implements the routine for the majority vote of each single labeling method in the ensemble. The code can be extended to use more labeling methods. Each method would only need to inherit from Transformation.

If another ensemble voting mechanism is desired, a new Ensemble class can be implemented where you specify your voting approach in the transform() method. This way, our code is open to new implementations and variations.
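The voting step can be illustrated independently of the repository. The sketch below assumes that each labeling method yields a boolean table (dates as rows, locations as columns) and that `n_min` is the minimum number of methods that must agree; the helper `vote` and the sample data are hypothetical and are not the `Ensemble` implementation.

```python
import pandas as pd


def vote(label_frames, n_min):
    """Flag a cell as outbreak when at least n_min methods flag it."""
    # Summing the frames counts, per date and location, how many
    # labeling methods voted "outbreak".
    votes = sum(frame.astype(int) for frame in label_frames)
    return votes >= n_min


index = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"])
method_a = pd.DataFrame({"0": [True, True, False]}, index=index)
method_b = pd.DataFrame({"0": [True, False, False]}, index=index)
method_c = pd.DataFrame({"0": [False, True, False]}, index=index)
frames = [method_a, method_b, method_c]

# With three methods, n_min=2 corresponds to a majority vote.
majority = vote(frames, n_min=2)
# A stricter alternative rule: all methods must agree.
unanimous = vote(frames, n_min=len(frames))
print(majority["0"].tolist(), unanimous["0"].tolist())
```

A different voting rule, for example one that weights methods unequally, would only change the body of `vote` in this sketch; in the repository the equivalent change would go into the `transform()` method of a new ensemble class.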

Below, you can find a shortened and commented version of paper_labels.py to illustrate how generating labels with our ensemble approach works.

```python
import pandas as pd

from epylabel.labeler import Bcp, Changerate, Ensemble, Shapelet, WaveFinder
from epylabel.pipeline import Pipeline
from paper_labels import StandardForm

# Instantiate single labeling methods with adequate parameters
cr = Changerate()
bcp = Bcp()
wv = WaveFinder()
sp = Shapelet()

# Instantiate ensemble
ens = Ensemble(n_min=2)

# Download RKI COVID-19 data
data_rki_url = (
    "https://raw.githubusercontent.com/robert-koch-institut/"
    "COVID-19_7-Tage-Inzidenz_in_Deutschland/main/"
    "COVID-19-Faelle_7-Tage-Inzidenz_Deutschland.csv"
)
data_rki = pd.read_csv(data_rki_url)

# Rearrange data
data_wide = Pipeline([StandardForm()]).transform(data_rki)
data_wide_faelle = Pipeline([StandardForm("Faelle_neu")]).transform(data_rki)

# Label data with single labeling methods
bcp_labels = Pipeline([cr, bcp]).transform(data_wide_faelle)
sp_labels = Pipeline([sp]).transform(data_wide)
wv_labels = Pipeline([wv]).transform(data_wide)

# Combine labeling methods in ensemble
bcp_sp_wv_labels = Pipeline([ens]).transform(bcp_labels, sp_labels, wv_labels)
```

Data

The code in this repository depends on reported COVID-19 cases in Germany. The main script paper_labels.py, which is explained in more detail in the Code section above, downloads data from the Robert Koch Institute's Open Data repository on GitHub and then produces the labels described in the manuscript.

Three datasets are downloaded to build time series of newly reported cases. New cases are in the CSV column Faelle_neu. The region identifiers, named Bundesland_id for federal states and Landkreis_id for counties, are renamed to location by the script. The reporting date Meldedatum is renamed to target and the case numbers to value. Without regional stratification, i.e., for the Germany-wide time series, the column location gets the value 0. Age stratification of the data is ignored.

The repository uses the latest data from the RKI dataset "7-Tage-Inzidenz der COVID-19-Fälle in Deutschland", provided on GitHub:

https://github.com/robert-koch-institut/COVID-19_7-Tage-Inzidenz_in_Deutschland

All versions of the data, which is currently updated daily, are also published on Zenodo.org:

Robert Koch-Institut (2024): 7-Tage-Inzidenz der COVID-19-Fälle in Deutschland, Berlin: Zenodo. DOI: 10.5281/zenodo.7129007

| Description | URL |
| --- | --- |
| COVID-19 cases in Germany per county | https://raw.githubusercontent.com/robert-koch-institut/COVID-19_7-Tage-Inzidenz_in_Deutschland/main/COVID-19-Faelle_7-Tage-Inzidenz_Landkreise.csv |
| COVID-19 cases in Germany per federal state | https://raw.githubusercontent.com/robert-koch-institut/COVID-19_7-Tage-Inzidenz_in_Deutschland/main/COVID-19-Faelle_7-Tage-Inzidenz_Bundeslaender.csv |
| COVID-19 cases in Germany without stratification | https://raw.githubusercontent.com/robert-koch-institut/COVID-19_7-Tage-Inzidenz_in_Deutschland/main/COVID-19-Faelle_7-Tage-Inzidenz_Deutschland.csv |

After the transformation, the data has the following structure:

| Column | Datatype | Description |
| --- | --- | --- |
| value | integer | Number of reported COVID-19 cases |
| target | string | Reporting date (yyyy-mm-dd) |
| location | string | Five-digit community identification code for counties, two-digit code for federal states, and 0 for the whole of Germany |
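The renaming described in this section can be sketched with pandas. This is an illustration on made-up rows, not the `StandardForm` implementation; only the column names are taken from the text, and the variable names are hypothetical.

```python
import pandas as pd

# Synthetic rows shaped like the federal-state file (values are made up).
raw = pd.DataFrame(
    {
        "Meldedatum": ["2022-01-01", "2022-01-02", "2022-01-01", "2022-01-02"],
        "Bundesland_id": ["01", "01", "02", "02"],
        "Faelle_neu": [10, 12, 7, 9],
    }
)

# Rename to the common schema and fix the column order.
long = raw.rename(
    columns={
        "Meldedatum": "target",
        "Bundesland_id": "location",
        "Faelle_neu": "value",
    }
)[["value", "target", "location"]]

# The Germany-wide file has no region column, so location is set to 0
# (kept as a string here, matching the datatype listed in the table).
germany = pd.DataFrame({"Meldedatum": ["2022-01-01"], "Faelle_neu": [17]})
germany_long = germany.rename(
    columns={"Meldedatum": "target", "Faelle_neu": "value"}
).assign(location="0")[["value", "target", "location"]]

print(long)
print(germany_long)
```

For the county file, the same mapping would apply with `Landkreis_id` in place of `Bundesland_id`.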

Formatting

Data is downloaded as a UTF-8 encoded .csv file with values separated by commas (",").
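When reading such a file yourself, it can help to state the format explicitly. The sketch below parses a made-up in-memory sample instead of downloading anything; reading the identifier column as a string is a suggestion of this example, motivated by region codes that begin with a zero, and is not something the README prescribes.

```python
import io

import pandas as pd

# Made-up sample in the described format: UTF-8, comma-separated.
sample = (
    "Meldedatum,Landkreis_id,Faelle_neu\n"
    "2022-01-01,01001,4\n"
    "2022-01-02,01001,6\n"
).encode("utf-8")

cases = pd.read_csv(
    io.BytesIO(sample),
    sep=",",
    encoding="utf-8",
    dtype={"Landkreis_id": str},  # keep leading zeros such as "01001"
)
print(cases.dtypes)
```

Without the `dtype` argument, pandas would infer an integer column and `01001` would become `1001`, which no longer matches the five-digit code.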

Collaborate

If you want to participate in our project, feel free to fork this repo and send us pull requests. To make sure everything is working please use pre-commit. It will run a few tests and lints before a commit can be made. To install pre-commit, run

pre-commit install

Publication platforms

This software publication is available on Zenodo.org, GitHub.com and OpenCoDE:

  • https://zenodo.org/communities/robertkochinstitut
  • https://github.com/robert-koch-institut
  • https://gitlab.opencode.de/robert-koch-institut

License

Epylabel: Ensemble-labeling of infectious diseases time series is free and open-source software, published under the terms of the MIT license.

Owner

  • Name: Robert Koch-Institut
  • Login: robert-koch-institut
  • Kind: organization
  • Location: Berlin

The RKI is the German federal government's central institution in the field of disease surveillance and prevention.

Citation (citation.cff)

cff-version: 1.2.0
type: software
title: 'Epylabel: Ensemble-labeling of infectious diseases time series'
abstract: >-
  This repository contains the code for the manuscript Ensemble-labeling of
  infectious diseases time series to evaluate early warning systems with which
  you can reproduce the manuscript's results and figures.
date-released: '2024-07-19'
keywords:
  - COVID-19
  - SARS-CoV-2
  - Inzidenz
  - Incidence
  - 7-Tage-Inzidenz
  - Infections
  - Infektion
  - Gesundheitsberichterstattung
  - Public health surveillance
  - Epidemiologie
  - Epidemiology
  - Germany
  - Open Data
  - Open Source
  - Python
  - R
  - RKI
message: Cite me!
url: https://robert-koch-institut.github.io/epylabel
license: MIT
doi: 10.5281/zenodo.12665040
version: '1.0'
authors:
  - family-names: Hicketier
    given-names: Andreas
    affiliation: Robert Koch-Institut
    orcid: 0009-0000-5882-852X
    email: hicketiera@rki.de
  - family-names: Bach
    given-names: Moritz
    affiliation: Robert Koch-Institut
    orcid: 0009-0003-3062-0585
  - family-names: Oedi
    given-names: Philip
    affiliation: Robert Koch-Institut
    orcid: 0009-0001-7112-505X
  - family-names: Ullrich
    given-names: Alexander
    affiliation: Robert Koch-Institut
    orcid: 0000-0002-4894-6124
  - family-names: Abbood
    given-names: Auss
    affiliation: Robert Koch-Institut
    orcid: 0000-0002-4428-168X

GitHub Events

Total
  • Watch event: 1
  • Push event: 2
  • Pull request event: 2
Last Year
  • Watch event: 1
  • Push event: 2
  • Pull request event: 2

Issues and Pull Requests


All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 17 minutes
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 17 minutes
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • RKIOpenData (1)

Dependencies

pyproject.toml pypi
.github/workflows/Build_and_deploy_website.yml actions
  • robert-koch-institut/OpenData-Website main composite
.github/workflows/Create_release_with_latest_tag.yml actions
  • actions/checkout v4 composite
  • robert-koch-institut/OpenData-Workflows/Create_release_on_tag_push main composite
.github/workflows/OpenData_Workfow.yml actions
.github/workflows/Sync_OpenData_repo_to_OpenCoDE.yml actions
  • robert-koch-institut/OpenData-Workflows/Sync_OpenData_repo_to_OpenCoDE main composite
environment.yml pypi