pandemic_alert_model_social_media
This repository contains the code and data for "Development of an early alert model for pandemic situations in Germany".
https://github.com/danqi123/pandemic_alert_model_social_media
Repository
Basic Info
- Host: GitHub
- Owner: danqi123
- Language: Python
- Default Branch: main
- Size: 14.2 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
pandemic_alert_social_media
==============================
This repository contains the code and data for the manuscript "Development of an early alert model for pandemic situations in Germany".
The paper has three parts:
1. Generation of the symptom corpus and preparation of Google Trends/Twitter longitudinal datasets with multidimensional symptom features.
2. A log-linear regression model for up-/down-trend analysis.
3. Random Forest and LSTM models for up-/down-trend forecasting.
Project Organization
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data_repo
│ ├── Gold_standard <- The final results of trend analysis (surveillance data)
│ ├── Combined <- The final results of trend analysis (Combined trace)
│ ├── Google_Trends <- The final results of trend analysis (Google Trends)
│ ├── Twitter <- The final results of trend analysis (Twitter)
│ ├── Knowledge_graph <- The final results of the hypergeometric test.
│ ├── processed <- The retrieval of Google Trends/ Twitter longitudinal datasets and symptoms with German translations.
│ └── raw <- The raw data of surveillance gold standards (confirmed cases, deaths, and hospitalization.)
│
├── models <- Contains the optimal hyperparameters for constructing retrained models.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ ├── data <- The final results of trend forecasting (Google and Combined trace)
│ └── figures <- Generated figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
├── CITATION.cff <- citation information of the package.
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ ├── pytrends <- The package used for retrieving daily Google Trends data. source: https://github.com/GeneralMills/pytrends/tree/master/pytrend
│ ├── date.py <- Scripts to process dates.
│ ├── startup.py <- Initialize important variables and folders.
│ ├── disease_network.py <- Source Scripts to generate COVID-19 symptom corpus.
│ ├── knowledge_graph.py <- Click command-line interface to generate the COVID-19 symptom corpus.
│ ├── google_trends_daily.py
│ ├── google_trends.py <- Scripts to retrieve Google Trends data.
│ ├── twitter_api.py <- Scripts to retrieve Twitter data.
│ ├── log_linear_regression.py <- Source Scripts to perform log-linear regression model.
│ ├── cli_trend_analysis.py <- Click command-line interface to run the log-linear regression model.
│ ├── RF_data_preprocessing.py <- Scripts to preprocess data for performing Random Forest.
│ ├── Random_Forest_optuna.py <- Scripts to tune hyperparameters in Random Forest.
│ ├── Random_Forest.py <- Run Random Forest models and evaluate on the test set.
│ ├── LSTM_data_preprocessing.py <- Scripts to preprocess data for LSTMs.
│ ├── LSTM_train_optuna.py <- Scripts to tune hyperparameters in LSTMs.
│ ├── LSTM_train.py <- Scripts to train LSTMs, evaluate on the test set, and perform SHAP algorithm.
└── tox.ini <- tox file with settings for running tox; see tox.readthedocs.io
Generate a knowledge graph to get the symptom corpus.
input data: symptoms from symptom ontology (https://www.ebi.ac.uk/ols/ontologies/symp)
convert .owl to .json: (http://vowl.visualdataweb.org/webvowl-old/webvowl-old.html)
SYMP_ONTOLOGY = "data_repo/raw/symp.json"
output data: the top German symptoms from the hypergeometric test, i.e. those with a low p-value and a high number of co-occurrences in the SCAIView knowledge software (https://academia.scaiview.com/)
(/data_repo/Knowledge_graph/COVID/COVIDsymptomsfrom_hypergeometrictest.json)
1. Query SCAIView to get the symptom- and disease-related IDs
python3 knowledge_graph.py get_symptom_disease_IDs COVID
2. Get the document count for each disease
python3 knowledge_graph.py get_disease_count
3. Get the document count for each symptom
python3 knowledge_graph.py get_symptoms_count
4. Get a dictionary of disease symptoms with the corresponding p-values from the hypergeometric test
python3 knowledge_graph.py perform_disease_hypergeo_test COVID 0.05 -v
5. Get top disease-related symptoms (COVID)
python3 knowledge_graph.py get_top_relevant_symptoms 0.05 50
6. Plot the symptoms in descending order of co-occurrence with COVID-19 in PubMed/PMC.
python3 knowledge_graph.py show_plot COVID_sort_pvalue_occurances.csv 25
7. Symptom translation
Use DeepL to translate the top English symptoms from the knowledge graph into German.
Note: the repository includes, as an example, the German terms retrieved up to June 2022. If you translate the terms into another language (e.g. French) or retrieve new data, replace the file at this path:
(SCAIVIEW_SYMPTOM = "/data_repo/processed/symptom_translations.csv")
8. Get German symptom terms
input: the translated German symptoms.
output: a .json file containing the German symptom corpus.
python3 knowledge_graph.py get_covid_symptoms
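The hypergeometric test in step 4 can be illustrated with a small sketch. It assumes the usual enrichment setup: out of all indexed documents, some mention the disease, some mention the symptom, and the p-value is the probability of seeing at least the observed number of co-occurrences by chance. The function name, argument names, and the toy counts are hypothetical and are not taken from `knowledge_graph.py`.

```python
from math import comb


def hypergeom_enrichment_pvalue(total_docs, disease_docs, symptom_docs, co_occurrences):
    """Return P(X >= co_occurrences) for a hypergeometric X.

    X counts how many of the `symptom_docs` documents also mention the disease
    when `disease_docs` of the `total_docs` documents mention it.
    """
    upper = min(disease_docs, symptom_docs)
    denominator = comb(total_docs, symptom_docs)
    tail = sum(
        comb(disease_docs, k) * comb(total_docs - disease_docs, symptom_docs - k)
        for k in range(co_occurrences, upper + 1)
    )
    return tail / denominator


# Toy counts (hypothetical): 1000 documents, 50 mention the disease,
# 40 mention the symptom, and 10 mention both.
p_value = hypergeom_enrichment_pvalue(1000, 50, 40, 10)
is_enriched = p_value < 0.05  # same threshold as in the commands above
```

Exact integer arithmetic keeps the sketch dependency-free, but it becomes slow for literature-scale counts. With SciPy installed, `scipy.stats.hypergeom.sf(co_occurrences - 1, total_docs, disease_docs, symptom_docs)` computes the same tail probability.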
Social media and gold standard longitudinal datasets
1. Download the gold standard data from the German RKI.
Surveillance data can be retrieved from the Robert Koch-Institut (RKI) GitHub repository (https://github.com/orgs/robert-koch-institut/repositories)
The surveillance data from 2020-03-01 to 2022-06-28 is downloaded and saved in /data/raw/.
2. Retrieve social media data from Google Trends and Twitter with the symptom queries from the knowledge graph.
Google Trends: src/scripts/google_trends.py
Twitter: src/scripts/twitter_api.py (requires credentials for the academic Twitter developer API)
Note: as an example, we retrieved Google Trends and Twitter data from Jan 2020 to Jun 2022.
The data is stored in 'data/processed/daily_google_german.csv' and 'data/processed/daily_twitter_german.csv'.
Trend analysis
Background of trend analysis
- STL decomposition to extract the trend from the raw time series data (https://www.statsmodels.org/dev/examples/notebooks/generated/stl_decomposition.html)
- For Google Trends and Twitter, STL period: 30
- For RKI confirmed cases, deaths, and hospitalization, STL period: 7
- Log-linear regression model:
- window size: 14 days
- stride: 1 day
- alpha: 0.05
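The windowed log-linear regression described above can be sketched as follows. This is a minimal illustration, not the code in `log_linear_regression.py`: it assumes the trend is classified by fitting a straight line to `log(value + 1)` over each 14-day window and testing the slope with a two-sided t-test at alpha = 0.05. The `+ 1` offset (to tolerate zeros), the function names, and the "up"/"down"/"none" labels are assumptions.

```python
import math

# Two-sided 5% critical value of Student's t with 12 degrees of freedom
# (14 observations minus 2 fitted parameters).
T_CRIT_DF12 = 2.179


def classify_window(values, t_crit=T_CRIT_DF12):
    """Label one 14-day window as 'up', 'down', or 'none'."""
    n = len(values)
    if n != 14:
        raise ValueError("the hard-coded critical value assumes a 14-day window")
    xs = list(range(n))
    ys = [math.log(v + 1.0) for v in values]  # +1 offset is an assumption
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0.0:
        significant = slope != 0.0  # perfect fit: any non-zero slope counts
    else:
        significant = abs(slope / std_err) > t_crit
    if not significant:
        return "none"
    return "up" if slope > 0 else "down"


def rolling_trends(series, window=14, stride=1):
    """Classify every window of the series, moving `stride` days at a time."""
    return [
        classify_window(series[start:start + window])
        for start in range(0, len(series) - window + 1, stride)
    ]


# Toy series growing 10% per day: every window should be an up-trend.
growth = [100 * 1.1 ** day for day in range(20)]
labels = rolling_trends(growth)
```

For window sizes other than 14, the critical value would have to be looked up for the matching degrees of freedom, for example with `scipy.stats.t.ppf(0.975, window - 2)`.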
1. Get the surveillance gold standard trends generated by the log-linear regression model and save them to CSV files (flag argument: RKI_case/ RKI_death/ RKI_hospitalization; stl_number: 7/ 7/ 7)
python3 cli_trend_analysis.py generate_gold_standard_trend RKI_case 14 7
python3 cli_trend_analysis.py generate_gold_standard_trend RKI_death 14 7
python3 cli_trend_analysis.py generate_gold_standard_trend RKI_hospitalization 14 7
2. Get up- and down-trends of surveillance gold standard
python3 cli_trend_analysis.py get_trends_from_gold_standard RKI_case -f
python3 cli_trend_analysis.py get_trends_from_gold_standard RKI_death -f
python3 cli_trend_analysis.py get_trends_from_gold_standard RKI_hospitalization -f
3. Perform symptom-level trend analysis (Google Trends and Twitter)
python3 cli_trend_analysis.py generate_proxy_trend Google_Trends daily_google_german.csv 14 30
python3 cli_trend_analysis.py generate_proxy_trend Twitter daily_twitter_german.csv 14 30
4. Get evaluation metrics of individual symptoms and save the .csv file (Google Trends and Twitter)
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_case Google_Trends 2022-03-01
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_hospitalization Google_Trends 2022-03-01
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_death Google_Trends 2022-03-01
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_case Twitter 2022-03-01
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_hospitalization Twitter 2022-03-01
python3 cli_trend_analysis.py generate_evaluation_metrics RKI_death Twitter 2022-03-01
5. Get the top 20 symptoms (based on the result of the hypergeometric test) for each digital trace (Google Trends and Twitter)
python3 cli_trend_analysis.py get_symptoms 20 google -f
python3 cli_trend_analysis.py get_symptoms 20 Twitter -f
6. Perform trend analysis for each digital trace (Google Trends, Twitter, and Combined) and get up- and down-trends
python3 cli_trend_analysis.py combined_proxy 20 Google_Trends 0.05 2022-03-01 -r -f
python3 cli_trend_analysis.py combined_proxy 20 Twitter 0.05 2022-03-01 -r -f
python3 cli_trend_analysis.py get_combined_P_trends 0.05 2022-03-01 -f
7. Get evaluation metrics for each digital trace (Google Trends, Twitter, and Combined trace)
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Google_Trends RKI_case 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Google_Trends RKI_death 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Google_Trends RKI_hospitalization 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Twitter RKI_case 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Twitter RKI_death 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Twitter RKI_hospitalization 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Combined RKI_case 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Combined RKI_death 2022-03-01
python3 cli_trend_analysis.py generate_metrics_for_combined_proxy_or_combinedP Combined RKI_hospitalization 2022-03-01
8. Visualization of trends of surveillance data and up-/down-trends of Google Trends and Combined trace
python3 cli_trend_analysis.py visualize_trend
9. Pairwise event visualization
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_case Up_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_case Down_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_death Up_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_death Down_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_hospitalization Up_trends 2020_2022 2022-03-01
python3 cli_trend_analysis.py plot_pairwise_trend_event RKI_hospitalization Down_trends 2020_2022 2022-03-01
* This function will also print out the percentage of onsets of up-trends.
Trend forecasting
Random Forest
Note: The forecasting horizon is set based on the result of the trend analysis. We set the time points used to split the train/test sets.
- training length: 28 days
- forecasting horizon: 14 days, consistent with the time interval in the log-linear regression model
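To make the roles of the 28-day training length, the 14-day forecasting horizon, and the split date concrete, here is a minimal sketch of how such samples could be built. It assumes each sample pairs 28 consecutive days of features with the trend label 14 days after the window ends, and that a sample belongs to the test set when its target date falls on or after the split date. These conventions and all names are assumptions and may differ from `RF_data_preprocessing.py`.

```python
from datetime import date, timedelta


def make_samples(features, labels, dates, training_length=28, forecasting_horizon=14):
    """Build (feature_window, target_label, target_date) tuples."""
    samples = []
    last_start = len(features) - training_length - forecasting_horizon
    for start in range(last_start + 1):
        end = start + training_length            # exclusive end of the input window
        target = end + forecasting_horizon - 1   # index of the day to forecast
        samples.append((features[start:end], labels[target], dates[target]))
    return samples


def split_by_date(samples, split_date):
    """Samples whose target date is before `split_date` go to the training set."""
    train = [s for s in samples if s[2] < split_date]
    test = [s for s in samples if s[2] >= split_date]
    return train, test


# Toy data (hypothetical): 60 days starting on 2022-03-01.
dates = [date(2022, 3, 1) + timedelta(days=i) for i in range(60)]
features = [float(i) for i in range(60)]
labels = ["up" if i % 2 else "down" for i in range(60)]

samples = make_samples(features, labels, dates)
train, test = split_by_date(samples, date(2022, 4, 20))
```

Splitting on the target date rather than the window start keeps every test label on or after the split date, although test windows may still overlap days that also appear in training windows.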
1. Prepare the dataset for each feature space (Google Trends/Combined)
Note: proxy: Google; Combined
gold_standard: RKI_case; RKI_hospitalization
python3 RF_data_preprocessing.py --proxy=Google --gold_standard=RKI_case --time_start='2020-03-01' --time_end='2022-06-15' --split_date='2022-04-01' --forecasting_horizon=14 --training_length=28
python3 RF_data_preprocessing.py --proxy=Google --gold_standard=RKI_hospitalization --time_start='2020-03-01' --time_end='2022-06-15' --split_date='2022-04-01' --forecasting_horizon=14 --training_length=28
python3 RF_data_preprocessing.py --proxy=Combined --gold_standard=RKI_case --time_start='2020-03-01' --time_end='2022-06-15' --split_date='2022-04-01' --forecasting_horizon=14 --training_length=28
python3 RF_data_preprocessing.py --proxy=Combined --gold_standard=RKI_hospitalization --time_start='2020-03-01' --time_end='2022-06-15' --split_date='2022-04-01' --forecasting_horizon=14 --training_length=28
2. Run Random Forest models
Note: proxy: Google; Combined
gold_standard: RKI_case; RKI_hospitalization
python3 Random_Forest_optuna.py --proxy=Google --gold_standard=RKI_case --forecasting_horizon=14 --training_length=28 --number_trial=90 --cv_initial_window=90 --cv_step_length=70 --cv_test_window=30
python3 Random_Forest.py --proxy=Google --gold_standard=RKI_case --forecasting_horizon=14 --mode=train --n_estimators=*** --max_depth=*** --min_samples_split=*** --min_samples_leaf=*** --max_features=***
python3 Random_Forest.py --proxy=Google --gold_standard=RKI_case --forecasting_horizon=14 --mode=test
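The `--cv_initial_window`, `--cv_step_length`, and `--cv_test_window` options suggest an expanding-window cross-validation during hyperparameter tuning. The sketch below shows one plausible reading of these parameters: the training set starts with 90 samples and grows by 70 per fold, and each fold is validated on the next 30 samples. The exact semantics in `Random_Forest_optuna.py` are not documented here, so treat the function and its behavior as an assumption.

```python
def expanding_window_splits(n_samples, initial_window=90, step_length=70, test_window=30):
    """Yield (train_indices, test_indices) ranges for expanding-window CV."""
    train_end = initial_window
    while train_end + test_window <= n_samples:
        # Train on everything up to train_end, validate on the block that follows.
        yield range(0, train_end), range(train_end, train_end + test_window)
        train_end += step_length


# Toy example (hypothetical): 300 time-ordered samples.
folds = list(expanding_window_splits(300))
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in folds]
```

Unlike shuffled k-fold splitting, every validation block lies strictly after its training block, which avoids leaking future observations into the model during tuning.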
LSTMs
Note: type: Google_confirmed_cases; Google_hospitalization; Combined_confirmed_cases; Combined_hospitalization
python3 LSTM_train_optuna.py --type=Google_confirmed_cases --forecasting_horizon=14 --tranining_length=28 --mode=train --GPU=0
python3 LSTM_train.py
All source code that is specific to this project.
Citation
Wang, D., Lentzen, M., Botz, J. et al. Development of an early alert model for pandemic situations in Germany. Sci Rep 13, 20780 (2023). https://doi.org/10.1038/s41598-023-48096-3
Project based on the cookiecutter data science project template. #cookiecutterdatascience
Owner
- Name: Danqi
- Login: danqi123
- Kind: user
- Location: Sankt Augustin, Germany
- Company: Fraunhofer SCAI
- Website: https://danqi123.github.io/
- Twitter: danqi1013
- Repositories: 2
- Profile: https://github.com/danqi123
* Biomedical data scientist * Python/ ML/Data mining/Information retrieval/ time series data engineering
Citation (CITATION.cff)
cff-version: 1.2.0
title: >-
Development of an early alert model for pandemic
situations in Germany
message: >-
If you use this software, please cite it using the
metadata from this file. DOI: https://doi.org/10.21203/rs.3.rs-3108281/v1
type: software
authors:
- given-names: Danqi
family-names: Wang
affiliation: Fraunhofer SCAI
repository-code: >-
https://github.com/danqi123/pandemic_alert_model_social_media
abstract: >-
The repository contains codes and data for the manuscript
"Development of an early alert model for pandemic
situations in Germany".
keywords:
- pandemic
- alert
- social media data
- statistical model
- machine learning
license: MIT
version: 1.0.0
date-released: '2023-06-25'
Dependencies
- Deprecated ==1.2.14
- Jinja2 ==3.1.2
- Mako ==1.2.4
- MarkupSafe ==2.1.3
- PyYAML ==6.0
- SQLAlchemy ==2.0.17
- alembic ==1.11.1
- arrow ==1.2.3
- binaryornot ==0.4.4
- certifi ==2023.5.7
- chardet ==5.1.0
- charset-normalizer ==3.1.0
- click ==8.1.3
- cmaes ==0.9.1
- colorlog ==6.7.0
- cookiecutter ==2.1.1
- greenlet ==2.0.2
- idna ==3.4
- jinja2-time ==0.2.0
- joblib ==1.2.0
- kaleido ==0.2.1
- numpy ==1.24.3
- oauthlib ==3.2.2
- optuna ==3.2.0
- packaging ==23.1
- pandas ==2.0.2
- patsy ==0.5.3
- plotly ==5.15.0
- python-dateutil ==2.8.2
- python-slugify ==8.0.1
- pytz ==2023.3
- requests ==2.31.0
- requests-oauthlib ==1.3.1
- scikit-base ==0.4.6
- scikit-learn ==1.2.2
- scipy ==1.10.1
- six ==1.16.0
- sktime ==0.19.2
- statsmodels ==0.14.0
- tenacity ==8.2.2
- text-unidecode ==1.3
- threadpoolctl ==3.1.0
- tqdm ==4.65.0
- tweepy ==4.9.0
- typing_extensions ==4.6.3
- tzdata ==2023.3
- urllib3 ==2.0.3
- wrapt ==1.15.0