pandemic_alert_model_social_media

This repository contains the code and data for "Development of an early alert model for pandemic situations in Germany".

https://github.com/danqi123/pandemic_alert_model_social_media


Repository

Basic Info

Host: GitHub
Owner: danqi123
Language: Python
Default Branch: main
Size: 14.2 MB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created about 3 years ago · Last pushed over 2 years ago


pandemic_alert_social_media

==============================

This repository contains the code and data for the manuscript "Development of an early alert model for pandemic situations in Germany".

The paper has three parts:
1. Generation of the symptom corpus and preparation of Google Trends/Twitter longitudinal datasets with multidimensional symptom features.
2. Log-linear regression model for up-/down-trend analysis.
3. Random Forest and LSTMs for up-/down-trend forecasting.

Project Organization

├── LICENSE
├── Makefile                        <- Makefile with commands like `make data` or `make train`
├── README.md                       <- The top-level README for developers using this project.
├── data_repo
│   ├── Gold_standard               <- The final results of trend analysis (surveillance data)
│   ├── Combined                    <- The final results of trend analysis (Combined trace)
│   ├── Google_Trends               <- The final results of trend analysis (Google Trends)
│   ├── Twitter                     <- The final results of trend analysis (Twitter)
│   ├── Knowledge_graph             <- The final results of the hypergeometric test.
│   ├── processed                   <- The retrieval of Google Trends/ Twitter longitudinal datasets and symptoms with German translations.
│   └── raw                         <- The raw data of surveillance gold standards (confirmed cases, deaths, and hospitalization.)
│
├── models                          <- Contains the optimal hyperparameters for constructing retrained models.
│
├── reports                         <- Generated analysis as HTML, PDF, LaTeX, etc.
│   ├── data                        <- The final results of trend forecasting (Google and Combined trace)
│   └── figures                     <- Generated figures to be used in reporting
│
├── requirements.txt                <- The requirements file for reproducing the analysis environment, e.g.
│                                      generated with `pip freeze > requirements.txt`
├── CITATION.cff                    <- citation information of the package.
├── setup.py                        <- makes project pip installable (pip install -e .) so src can be imported
├── src                             <- Source code for use in this project.
│   ├── __init__.py                 <- Makes src a Python module
│   ├── pytrends                    <- The package used for retrieving daily Google Trends data. source: https://github.com/GeneralMills/pytrends/tree/master/pytrend
│   ├── date.py                     <- Scripts to process dates.
│   ├── startup.py                  <- Initialize important variables and folders.
│   ├── disease_network.py          <- Source scripts to generate the COVID-19 symptom corpus.
│   ├── knowledge_graph.py          <- Click command-line interface to generate the COVID-19 symptom corpus.
│   ├── google_trends_daily.py      
│   ├── google_trends.py            <- Scripts to retrieve Google Trends data.
│   ├── twitter_api.py              <- Scripts to retrieve Twitter data.
│   ├── log_linear_regression.py    <- Source scripts for the log-linear regression model.
│   ├── cli_trend_analysis.py       <- Click command-line interface to run the log-linear regression model.
│   ├── RF_data_preprocessing.py    <- Scripts to preprocess data for performing Random Forest.  
│   ├── Random_Forest_optuna.py     <- Scripts to tune hyperparameters in Random Forest.
│   ├── Random_Forest.py            <- Run Random Forest models and evaluate on the test set.
│   ├── LSTM_data_preprocessing.py  <- Scripts to preprocess data for LSTMs.
│   ├── LSTM_train_optuna.py        <- Scripts to tune hyperparameters in LSTMs.
│   ├── LSTM_train.py               <- Scripts to train LSTMs, evaluate on the test set, and perform SHAP algorithm.
└── tox.ini                         <- tox file with settings for running tox; see tox.readthedocs.io

Generate a knowledge graph to get the symptom corpus.

input data: symptoms from symptom ontology (https://www.ebi.ac.uk/ols/ontologies/symp)

          convert .owl to .json: (http://vowl.visualdataweb.org/webvowl-old/webvowl-old.html)
          SYMP_ONTOLOGY = "data_repo/raw/symp.json"

output data: the top German symptoms from the hypergeometric test, i.e. those with a low p-value and a high number of co-occurrences in the SCAIView knowledge software (https://academia.scaiview.com/)

(/data_repo/Knowledge_graph/COVID/COVID_symptoms_from_hypergeometrictest.json)

1. Request SCAIView and get symptom_disease related IDs

python3 knowledge_graph.py get_symptom_disease_IDs COVID

2. Get the number of document counts of each disease

python3 knowledge_graph.py get_disease_count

3. Get the number of document counts of each symptom

python3 knowledge_graph.py get_symptoms_count

4. Get disease symptoms dict with corresponding p_values from hypergeometric test

python3 knowledge_graph.py perform_disease_hypergeo_test COVID 0.05 -v
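
The hypergeometric test in this step can be pictured with a small standard-library sketch. It is not the code in disease_network.py: the function name, the argument names, and the choice of a one-sided upper-tail p-value are assumptions made for illustration. It asks how likely it is to see at least the observed number of co-occurring documents if the symptom and the disease were mentioned independently.

```python
from math import comb


def hypergeom_pvalue(total_docs, disease_docs, symptom_docs, co_occurrences):
    """Upper-tail probability P(X >= co_occurrences).

    X counts documents mentioning the disease among `symptom_docs` documents
    drawn without replacement from `total_docs`, of which `disease_docs`
    mention the disease. All argument names are hypothetical.
    """
    upper = min(disease_docs, symptom_docs)
    # Number of ways to draw at least `co_occurrences` disease documents.
    favourable = sum(
        comb(disease_docs, i) * comb(total_docs - disease_docs, symptom_docs - i)
        for i in range(co_occurrences, upper + 1)
    )
    return favourable / comb(total_docs, symptom_docs)


# Toy counts, not taken from SCAIView: 10 documents, 4 on the disease,
# 3 on the symptom, 2 mentioning both.
p_value = hypergeom_pvalue(10, 4, 3, 2)
print(round(p_value, 4))  # 0.3333
```

Exact binomial coefficients become slow for corpus-sized counts; `scipy.stats.hypergeom.sf(k - 1, M, n, N)` (scipy is listed in requirements.txt) computes the same tail probability efficiently.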

5. Get top disease-related symptoms (COVID)

python3 knowledge_graph.py get_top_relevant_symptoms 0.05 50
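
As a rough illustration of this selection step, the sketch below keeps symptoms whose p-value is below a threshold and ranks them by co-occurrence count. Reading the two command arguments as a p-value cut-off (0.05) and a maximum number of symptoms (50) is an assumption, and the function name and input layout are hypothetical.

```python
def top_relevant_symptoms(stats, p_threshold=0.05, top_n=50):
    """Return symptom names with p < p_threshold, most co-occurrences first.

    `stats` maps a symptom name to a (p_value, co_occurrence_count) pair;
    this layout is an assumption made for the example.
    """
    significant = [
        (name, count) for name, (p_value, count) in stats.items()
        if p_value < p_threshold
    ]
    # Sort by co-occurrence count, highest first.
    significant.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in significant[:top_n]]


# Toy values for illustration only.
example = {
    "fever": (0.001, 500),
    "cough": (0.01, 800),
    "headache": (0.2, 900),   # dropped: p-value above the threshold
    "fatigue": (0.04, 100),
}
print(top_relevant_symptoms(example, top_n=2))  # ['cough', 'fever']
```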

6. Plot the symptoms in descending order of co-occurrence with COVID-19 in PubMed/PMC.

python3 knowledge_graph.py show_plot COVID_sort_pvalue_occurances.csv 25

7. Symptom translation

DeepL is used to translate the top English symptoms from the knowledge graph into German.
Note: the provided file is an example containing the German terms we retrieved up to June 2022. If you translate the terms into another language (e.g. French) or retrieve new data, replace the file at this path:
(SCAIVIEW_SYMPTOM = "/data_repo/processed/symptom_translations.csv")

8. Get German symptom terms

input: the translated German symptoms.
output: a .json file containing the German symptom corpus.
python3 knowledge_graph.py get_covid_symptoms

Social media and gold standard longitudinal datasets

1. Download gold standard data from the German RKI.

 Surveillance data can be retrieved from the Robert Koch-Institut (RKI) GitHub repository (https://github.com/orgs/robert-koch-institut/repositories)
 The surveillance data from 2020-03-01 to 2022-06-28 has been downloaded and saved in /data/raw/.

2. Retrieve social media data from Google Trends and Twitter with the symptom queries from the knowledge graph.

 Google Trends: src/scripts/google_trends.py
 Twitter: src/scripts/twitter_api.py (requires credentials for the academic Twitter developer API)
 Note: as an example, we retrieved Google Trends and Twitter data from Jan 2020 to Jun 2022;
 the data are stored in 'data/processed/daily_google_german.csv' and 'data/processed/daily_twitter_german.csv'.

Trend analysis

Background of trend analysis

STL decomposition to extract the trend from the raw time series (https://www.statsmodels.org/dev/examples/notebooks/generated/stl_decomposition.html)
- For Google Trends and Twitter, STL period: 30
- For RKI confirmed cases, deaths, and hospitalization, STL period: 7
Log-linear regression model:
- window size: 14 days
- stride: 1 day
- alpha: 0.05
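
The rolling log-linear regression with these settings can be sketched as follows. This is one plausible reading, not the code in log_linear_regression.py: each 14-day window is fitted with log(y) = a + b*t by least squares, and the window is labelled by the sign of the slope when its two-sided t-test is significant at alpha = 0.05. The sketch skips the STL step and expects an already extracted, strictly positive trend series. The function names, the "up"/"down"/"none" labels, and the hard-coded critical value are assumptions.

```python
import math

# Two-sided 5% critical value of Student's t with 12 degrees of freedom
# (14 observations minus 2 fitted parameters). Change it if the window changes.
T_CRIT_DF12 = 2.179


def classify_window(values, t_crit=T_CRIT_DF12):
    """Fit log(y) = a + b*t and return 'up', 'down', or 'none'."""
    n = len(values)
    t = list(range(n))
    y = [math.log(v) for v in values]  # requires strictly positive values
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    # Residual sum of squares and standard error of the slope.
    sse = sum((yi - (intercept + slope * ti)) ** 2 for ti, yi in zip(t, y))
    se = math.sqrt(sse / (n - 2) / sxx)
    if se == 0.0:
        t_stat = math.copysign(math.inf, slope) if slope != 0 else 0.0
    else:
        t_stat = slope / se
    if t_stat > t_crit:
        return "up"
    if t_stat < -t_crit:
        return "down"
    return "none"


def rolling_trends(series, window=14, stride=1):
    """Label every window of `window` days, moving forward `stride` days."""
    return [
        classify_window(series[i:i + window])
        for i in range(0, len(series) - window + 1, stride)
    ]


# Toy series growing by 10% per day: every window is an up-trend.
growth = [100 * 1.1 ** day for day in range(20)]
print(rolling_trends(growth))
```

Fixing the critical value keeps the sketch free of dependencies; with scipy available, `scipy.stats.linregress` returns the slope together with its p-value and avoids the hard-coded constant.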

1. Get surveillance gold standard trends generated from the log-linear regression model and save the trends into csv files (flag argument: RKI_case/ RKI_death/ RKI_hospitalization; stl_number: 7/ 7/ 7)

python3 cli_trend_analysis.py generate_gold_standard_trend RKI_case 14 7
python3 cli_trend_analysis.py generate_gold_standard_trend RKI_death 14 7
python3 cli_trend_analysis.py generate_gold_standard_trend RKI_hospitalization 14 7

2. Get up- and down-trends of surveillance gold standard

python3 cli_trend_analysis.py get_trends_from_gold_standard RKI_case -f
python3 cli_trend_analysis.py get_trends_from_gold_standard RKI_death -f 
python3 cli_trend_analysis.py get_trends_from_gold_standard RKI_hospitalization -f
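
One way to picture how per-window labels turn into up- and down-trend periods is to merge consecutive windows that share a label. Whether cli_trend_analysis.py does exactly this is an assumption; the function name and the "up"/"down"/"none" labels are hypothetical.

```python
from itertools import groupby


def label_runs(labels):
    """Merge consecutive identical labels into (label, first_index, last_index).

    Windows labelled 'none' are skipped, so only up- and down-trend
    periods are returned.
    """
    runs = []
    index = 0
    for label, group in groupby(labels):
        length = sum(1 for _ in group)
        if label != "none":
            runs.append((label, index, index + length - 1))
        index += length
    return runs


# Toy window labels for illustration.
window_labels = ["none", "up", "up", "none", "down", "down", "down", "up"]
print(label_runs(window_labels))
# [('up', 1, 2), ('down', 4, 6), ('up', 7, 7)]
```

The first index of each run can then serve as the onset of a trend period.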

3. Perform symptom-level trend analysis (Google Trends and Twitter)

python3 cli_trend_analysis.py generate_proxy_trend Google_Trends daily_google_german.csv 14 30
python3 cli_trend_analysis.py generate_proxy_trend Twitter daily_twitter_german.csv 14 30

4. Get evaluation metrics of individual symptoms and save the .csv file (Google Trends and Twitter)

python3 cli_trend_analysis.py generate_evaluation_metrics RKI_case Google_Trends 2022-03-01
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_hospitalization Google_Trends 2022-03-01
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_death Google_Trends 2022-03-01

python3 cli_trend_analysis.py generate_evaluation_metrics RKI_case Twitter 2022-03-01
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_hospitalization Twitter 2022-03-01
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_death Twitter 2022-03-01

5. Get the top 20 symptoms (based on the result of the hypergeometric test) for each digital trace (Google Trends and Twitter)

python3 cli_trend_analysis.py get_symptoms 20 google -f
python3 cli_trend_analysis.py get_symptoms 20 Twitter -f

6. Perform trend analysis for each digital trace (Google Trends, Twitter, and Combined) and get up- and down-trends

python3 cli_trend_analysis.py combined_proxy 20 Google_Trends 0.05 2022-03-01 -r -f
python3 cli_trend_analysis.py combined_proxy 20 Twitter 0.05 2022-03-01 -r -f
python3 cli_trend_analysis.py get_combined_P_trends 0.05 2022-03-01 -f

7. Get evaluation metrics for each digital trace (Google Trends, Twitter, and Combined trace)

python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Google_Trends RKI_case 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Google_Trends RKI_death 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Google_Trends RKI_hospitalization 2022-03-01

python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Twitter RKI_case 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Twitter RKI_death 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Twitter RKI_hospitalization 2022-03-01

python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Combined RKI_case 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Combined RKI_death 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Combined RKI_hospitalization 2022-03-01

8. Visualization of trends of surveillance data and up-/down-trends of Google Trends and Combined trace

python3 cli_trend_analysis.py visualize_trend

9. Pairwise event visualization

python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_case Up_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_case Down_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_death Up_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_death Down_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_hospitalization Up_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_hospitalization Down_trends 2020_2022 2022-03-01

* This function will also print out the percentage of onsets of up-trends.

Trend forecasting

Random Forest

Note: The forecasting horizon is chosen based on the results of the trend analysis, and a split date separates the training and test sets.
- training length: 28 days
- forecasting horizon: 14 days (consistent with the window size of the log-linear regression model)

1. Prepare the datasets for each feature space (Google Trends/Combined)

Note: proxy: Google; Combined
      gold_standard: RKI_case; RKI_hospitalization

python3 RF_data_preprocessing.py --proxy=Google --gold_standard=RKI_case --time_start='2020-03-01' --time_end='2022-06-15' --split_date='2022-04-01' --forecasting_horizon=14 --training_length=28

python3 RF_data_preprocessing.py --proxy=Google --gold_standard=RKI_hospitalization --time_start='2020-03-01' --time_end='2022-06-15' --split_date='2022-04-01' --forecasting_horizon=14 --training_length=28

python3 RF_data_preprocessing.py --proxy=Combined --gold_standard=RKI_case --time_start='2020-03-01' --time_end='2022-06-15' --split_date='2022-04-01' --forecasting_horizon=14 --training_length=28

python3 RF_data_preprocessing.py --proxy=Combined --gold_standard=RKI_hospitalization --time_start='2020-03-01' --time_end='2022-06-15'  --split_date='2022-04-01' --forecasting_horizon=14 --training_length=28
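
The roles of `--training_length`, `--forecasting_horizon`, and `--split_date` can be illustrated with a small sketch. It assumes that each sample pairs 28 consecutive days of features with the trend label 14 days after the last input day, and that samples are assigned to the train or test set by the date of their target. The actual logic in RF_data_preprocessing.py may differ, and all function names here are hypothetical.

```python
from datetime import date, timedelta


def make_windows(dates, features, labels, training_length=28, forecasting_horizon=14):
    """Build (target_date, input_window, target_label) samples."""
    samples = []
    for end in range(training_length - 1, len(dates) - forecasting_horizon):
        start = end - training_length + 1
        target = end + forecasting_horizon  # day to be forecast
        samples.append((dates[target], features[start:end + 1], labels[target]))
    return samples


def split_by_date(samples, split_date):
    """Samples whose target date is before split_date go to the training set."""
    train = [s for s in samples if s[0] < split_date]
    test = [s for s in samples if s[0] >= split_date]
    return train, test


# Toy data: 60 days starting on 2022-03-01, one feature value per day.
days = [date(2022, 3, 1) + timedelta(days=i) for i in range(60)]
toy_features = list(range(60))
toy_labels = [i % 2 for i in range(60)]

samples = make_windows(days, toy_features, toy_labels)
train, test = split_by_date(samples, date(2022, 4, 20))
print(len(samples), len(train), len(test))  # 19 9 10
```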

2. Run Random Forest models

Note: proxy: Google; Combined
      gold_standard: RKI_case; RKI_hospitalization

python3 Random_Forest_optuna.py --proxy=Google --gold_standard=RKI_case --forecasting_horizon=14 --training_length=28 --number_trial=90 --cv_initial_window=90 --cv_step_length=70 --cv_test_window=30

python3 Random_Forest.py --proxy=Google --gold_standard=RKI_case --forecasting_horizon=14 --mode=train --n_estimators=*** --max_depth=*** --min_samples_split=*** --min_samples_leaf=*** --max_features=***

python3 Random_Forest.py --proxy=Google --gold_standard=RKI_case --forecasting_horizon=14 --mode=test
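
The `--cv_initial_window`, `--cv_step_length`, and `--cv_test_window` options suggest an expanding-window cross-validation during hyperparameter tuning. The sketch below shows that general scheme using only the standard library. Interpreting the flags this way is an assumption, and the function is not taken from Random_Forest_optuna.py.

```python
def expanding_window_splits(n_samples, initial_window=90, step_length=70, test_window=30):
    """Return (train_indices, test_indices) pairs for time-ordered samples.

    The training set always starts at index 0 and grows by `step_length`
    per fold; the test set is the `test_window` samples that follow it.
    """
    splits = []
    train_end = initial_window
    while train_end + test_window <= n_samples:
        train_idx = range(0, train_end)
        test_idx = range(train_end, train_end + test_window)
        splits.append((train_idx, test_idx))
        train_end += step_length
    return splits


# With 300 time-ordered samples the defaults give three folds.
for train_idx, test_idx in expanding_window_splits(300):
    print(len(train_idx), test_idx.start, test_idx.stop)
# 90 90 120
# 160 160 190
# 230 230 260
```

Because every test fold lies strictly after its training fold, no future observations leak into training, which is the usual reason for preferring this scheme over shuffled k-fold on time series.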

LSTMs

Note: type: Google_confirmed_cases; Google_hospitalization; Combined_confirmed_cases; Combined_hospitalization

python3 LSTM_train_optuna.py --type=Google_confirmed_cases --forecasting_horizon=14 --tranining_length=28 --mode=train --GPU=0

python3 LSTM_train.py

All source code that is specific to this project.

Citation

Wang, D., Lentzen, M., Botz, J. et al. Development of an early alert model for pandemic situations in Germany. Sci Rep 13, 20780 (2023). https://doi.org/10.1038/s41598-023-48096-3

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Owner

Name: Danqi
Login: danqi123
Kind: user
Location: Sankt Augustin, Germany
Company: Fraunhofer SCAI

Website: https://danqi123.github.io/
Twitter: danqi1013
Repositories: 2
Profile: https://github.com/danqi123

* Biomedical data scientist * Python/ ML/Data mining/Information retrieval/ time series data engineering

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  Development of an early alert model for pandemic
  situations in Germany
message: >-
  If you use this software, please cite it using the
  metadata from this file. DOI: https://doi.org/10.21203/rs.3.rs-3108281/v1
type: software
authors:
  - given-names: Danqi
    family-names: Wang
    affiliation: Fraunhofer SCAI
repository-code: >-
  https://github.com/danqi123/pandemic_alert_model_social_media
abstract: >-
  The repository contains codes and data for the manuscript
  "Development of an early alert model for pandemic
  situations in Germany". 
keywords:
  - pandemic
  - alert
  - social media data
  - statistical model
  - machine learning
  
license: MIT
version: 1.0.0
date-released: '2023-06-25'

Dependencies

alert_model/requirements.txt pypi

Deprecated ==1.2.14
Jinja2 ==3.1.2
Mako ==1.2.4
MarkupSafe ==2.1.3
PyYAML ==6.0
SQLAlchemy ==2.0.17
alembic ==1.11.1
arrow ==1.2.3
binaryornot ==0.4.4
certifi ==2023.5.7
chardet ==5.1.0
charset-normalizer ==3.1.0
click ==8.1.3
cmaes ==0.9.1
colorlog ==6.7.0
cookiecutter ==2.1.1
greenlet ==2.0.2
idna ==3.4
jinja2-time ==0.2.0
joblib ==1.2.0
kaleido ==0.2.1
numpy ==1.24.3
oauthlib ==3.2.2
optuna ==3.2.0
packaging ==23.1
pandas ==2.0.2
patsy ==0.5.3
plotly ==5.15.0
python-dateutil ==2.8.2
python-slugify ==8.0.1
pytz ==2023.3
requests ==2.31.0
requests-oauthlib ==1.3.1
scikit-base ==0.4.6
scikit-learn ==1.2.2
scipy ==1.10.1
six ==1.16.0
sktime ==0.19.2
statsmodels ==0.14.0
tenacity ==8.2.2
text-unidecode ==1.3
threadpoolctl ==3.1.0
tqdm ==4.65.0
tweepy ==4.9.0
typing_extensions ==4.6.3
tzdata ==2023.3
urllib3 ==2.0.3
wrapt ==1.15.0