https://github.com/calgo-lab/gwl-forecast-pipeline

Model pipeline for groundwater level prediction in Germany

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: found 2 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (13.5%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: calgo-lab
License: mpl-2.0
Language: Python
Default Branch: main
Size: 103 KB

Statistics

Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed over 3 years ago

https://github.com/calgo-lab/gwl-forecast-pipeline/blob/main/

# Groundwater Level Forecast Pipeline

Data- and Machine-Learning-Pipeline for the prediction of Groundwater Levels in Germany.

## About

The goal of this study is to develop a _global_ groundwater level time series forecasting model
for Germany using Convolutional Neural Networks (CNN) in a sequence-to-sequence
(seq2seq) setup. The model takes observed groundwater level sequences as well as exogenous
attributes, such as weather data and geomorphic features, as inputs in the form of temporal 2D
map data sequences (3D input) to predict a future sequence of groundwater levels at a given location.
Model inputs are gridded groundwater level observations as well as geological map data on relief,
soil and hydrogeologic attributes. Further inputs are gridded
weather parameters (humidity, temperature and precipitation), vegetation data, and land cover
data. This set of input parameters covers a wide range of possible influencing factors and
gives the model the chance to learn seasonal impacts, natural effects like inflows and outflows,
as well as human impact on the groundwater, such as withdrawals in urban, industrial or agricultural areas.
The 3D input data enables the model to extract spatio-temporal
relationships, while the global approach helps it learn the underlying relationships across a great
variety of hydrogeologic conditions.
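
To make the input layout concrete, the following sketch builds dummy arrays with the shapes implied by the description above. The channel counts and variable names are illustrative assumptions and are not taken from the package; the actual axis order used by the pipeline may differ.

```python
import numpy as np

LAG = 4          # number of past weeks fed to the model
LEAD = 1         # number of future weeks to predict
RASTER_SIZE = 5  # edge length of the map window around a well, in cells
N_TEMPORAL = 4   # hypothetical number of temporal channels (e.g. GWL + weather)
N_STATIC = 3     # hypothetical number of static channels (e.g. relief, soil)

# One sample: a sequence of 2D maps per temporal channel (time, y, x, channel).
temporal_input = np.zeros((LAG, RASTER_SIZE, RASTER_SIZE, N_TEMPORAL), dtype=np.float32)
# Static maps have no time axis (y, x, channel).
static_input = np.zeros((RASTER_SIZE, RASTER_SIZE, N_STATIC), dtype=np.float32)
# Target: the groundwater level at the well for each forecast step.
target = np.zeros((LEAD,), dtype=np.float32)

print(temporal_input.shape, static_input.shape, target.shape)
```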

This Python package provides the functionality of the entire machine learning pipeline, offering
a convenient interface for:
* Downloading Raw Data
* Preparing Raw Data (Data Harmonization)
* Preprocessing Data (Feature Engineering and Normalization)
* Training Models
* Hyperparameter Optimization
* Running Forecasts
* Model Evaluation

## Installation

### Requirements

#### RAM

Most stages, such as Data Loading, Data Preprocessing and Model Training, are optimized for low RAM usage
via stream processing and file system caches. RAM usage is configurable via
several parameters (documented below). We recommend at least 64 GB of RAM.

#### Disk Space

About 110 GB of free disk space is needed for raw and processed data. When file system caches
are activated (to save RAM during data preprocessing and training), disk usage increases with
the selected number of wells, the selected training period and the selected raster size.
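
As a rough illustration of how these three factors scale the cache size, the sketch below estimates the size of a dense float32 array. The function name, the channel count and the assumption of one raster stack per well and week are hypothetical; the actual cache layout of the pipeline is not described here and may store more (for example overlapping lag windows).

```python
def estimate_cache_bytes(n_wells, n_weeks, raster_size, n_channels, bytes_per_value=4):
    """Rough size of a dense float32 cache: one raster stack per well and week."""
    return n_wells * n_weeks * raster_size ** 2 * n_channels * bytes_per_value

# Hypothetical example: 1000 wells, 14 years of weekly data, 5x5 rasters, 10 channels.
size = estimate_cache_bytes(n_wells=1000, n_weeks=14 * 52, raster_size=5, n_channels=10)
print(f"{size / 1e9:.2f} GB")  # 0.73 GB
```

Note that the estimate grows quadratically with the raster size and linearly with the number of wells and weeks.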

### OPTION A: Install via `pip`

#### Requirements

##### Python

Installing the package requires Python `3.8` or higher.
 
##### Linux

While many parts of the pipeline are OS-agnostic, the _Data Preparation Stage_ relies on subprocesses
that are specific to Linux.
Geospatial data is rasterized with the help of the [`gdal`-library](https://gdal.org/).
To run the _Data Preparation Stage_, the packages `gdal-bin` and `libgdal-dev` must be
installed on your system.

##### GPU / CUDA

For GPU acceleration, [`CUDA`](https://developer.nvidia.com/cuda-downloads) and [`cuDNN`](https://developer.nvidia.com/cudnn) must be installed on your system.

##### Installation
Install the package via `pip`:
```shell
pip install git+https://github.com/calgo-lab/gwl-forecast-pipeline
```

### OPTION B: Use docker container

The docker container comes with all dependencies.

```shell
git clone https://github.com/calgo-lab/gwl-forecast-pipeline .
docker build -t calgo-lab/gwl-forecast-pipeline .
```

## Configuration / Setup

### Data Sets and Data Access

Most of the data sets used are publicly available for direct download. An exception is
the [EU-DEM](https://land.copernicus.eu/imagery-in-situ/eu-dem) data set, which can be downloaded after a free registration.
The data sets on groundwater levels and groundwater well meta data are not published. To gain access, contact the authors of the study.


### Settings

1. Create a `settings.ini` file declaring the following config variables:

```ini
[settings]
EUDEM_ELEVATION_URL=...
EUDEM_SLOPE_URL=...
EUDEM_ASPECT_URL=...
GWL_URL=...
DROP_GWL_PERIODS_URL=...
BENCHMARK_RESULTS_URL=...
WELL_META_URL=...
DATA_PATH=...
MODEL_PATH=...
PREPROCESSOR_CACHE_PATH=...
PREDICTION_RESULT_PATH=...
HYPEROPT_RESULT_PATH=...
SCORE_RESULT_PATH=...
```

| VARIABLE                  | SEMANTICS                                                                                           | Required | Default         |
|---------------------------|-----------------------------------------------------------------------------------------------------|----------|-----------------|
| `EUDEM_ELEVATION_URL`     | URL for EU-DEM elevation raw data (see section on data download below)                              | yes      |                 |
| `EUDEM_SLOPE_URL`         | URL for EU-DEM slope raw data (see section on data download below)                                  | yes      |                 |
| `EUDEM_ASPECT_URL`        | URL for EU-DEM aspect raw data (see section on data download below)                                 | yes      |                 |
| `GWL_URL`                 | URL for groundwater level raw time series data, provided by BGR                                     | yes      |                 |
| `DROP_GWL_PERIODS_URL`    | URL for data on detected implausible periods of groundwater level time series to be removed         | yes      |                 |
| `WELL_META_URL`           | URL for static features and meta data on groundwater wells, provided by BGR                         | yes      |                 |
| `BENCHMARK_RESULTS_URL`   | URL for benchmark wells and scores by [Wunsch et al.](https://doi.org/10.1038/s41467-022-28770-2)   | yes      |                 |
| `HYRAS_HURS_URL`          | URL for HYRAS humidity raw data                                                                     | no       | see `config.py` |
| `HYRAS_TAS_URL`           | URL for HYRAS mean air temperature raw data                                                         | no       | see `config.py` |
| `HYRAS_PR_URL`            | URL for HYRAS precipitation raw data                                                                | no       | see `config.py` |
| `SWR_1000_URL`            | URL for Mean Annual Rate of Percolation data                                                        | no       | see `config.py` |
| `GWN_1000_URL`            | URL for Mean Annual Groundwater Recharge Rate data                                                  | no       | see `config.py` |
| `HUEK_250_URL`            | URL for hydrogeologic overview map data                                                             | no       | see `config.py` |
| `LAI_URL`                 | URL for Copernicus Leaf Area Index raw data                                                         | no       | see `config.py` |
| `CLC_URL`                 | URL for Copernicus Land Cover raw data                                                              | no       | see `config.py` |
| `DATA_PATH`               | path for data storage                                                                               | yes      |                 |
| `MODEL_PATH`              | path for model storage                                                                              | yes      |                 |
| `PREPROCESSOR_CACHE_PATH` | path to store preprocessed data                                                                     | yes      |                 |
| `PREDICTION_RESULT_PATH`  | path to store predictions                                                                           | yes      |                 |
| `HYPEROPT_RESULT_PATH`    | path to store results of hyperparameter optimization                                                | yes      |                 |
| `SCORE_RESULT_PATH`       | path to store scores                                                                                | yes      |                 |
| `CSV_LOGGER_PATH`         | path for csv logging of model training                                                              | no       | '' (None)       |
| `TENSORBOARD_PATH`        | path for tensorboard logs of model training                                                         | no       | '' (None)       |
| `GPU`                     | whether to use a GPU for model training (0 or 1)                                                    | no       | 0 (False)       |
| `DATA_IN_MEMORY`          | whether to keep complete set of preprocessed data in RAM for faster model training (0 or 1)         | no       | 0 (False)       |
| `LOGGING_CONF`            | path to logging configuration file                                                                  | no       | '' (None)       |

2. Set an environment variable pointing to the directory holding the `settings.ini`-file.
```shell
export SETTINGS_PATH=path/to/settings.ini
```
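
Before running the pipeline, it can help to check that all variables marked as required in the table above are present. The following is a minimal sketch using only the standard library; `missing_settings` and `REQUIRED_KEYS` are hypothetical helpers and not part of the package, and the check does not validate the values themselves.

```python
import configparser

# Variables marked "yes" in the Required column of the table above.
REQUIRED_KEYS = [
    "EUDEM_ELEVATION_URL", "EUDEM_SLOPE_URL", "EUDEM_ASPECT_URL",
    "GWL_URL", "DROP_GWL_PERIODS_URL", "WELL_META_URL", "BENCHMARK_RESULTS_URL",
    "DATA_PATH", "MODEL_PATH", "PREPROCESSOR_CACHE_PATH",
    "PREDICTION_RESULT_PATH", "HYPEROPT_RESULT_PATH", "SCORE_RESULT_PATH",
]

def missing_settings(ini_text):
    """Return the required keys that are absent or empty in the [settings] section."""
    parser = configparser.ConfigParser(interpolation=None)  # URLs may contain '%'
    parser.optionxform = str  # keep keys upper-case instead of lower-casing them
    parser.read_string(ini_text)
    section = parser["settings"] if parser.has_section("settings") else {}
    return [key for key in REQUIRED_KEYS if not section.get(key, "").strip()]

example = """
[settings]
DATA_PATH=/data/gwl
MODEL_PATH=/data/models
"""
print(missing_settings(example))
```

To check a real file, read its contents first, for example with `pathlib.Path("settings.ini").read_text()`.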

#### Logging

Many package functions log their activities at DEBUG level. Custom logging can be defined via a [logging config file](https://docs.python.org/3/howto/logging.html#configuring-logging).
The name of the package's parent logger is `'gwl_forecast_pipeline'`. Register your logging configuration file in the `LOGGING_CONF` variable in `settings.ini`.
If no logging config file is provided, all functions log to `stdout` by default.
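
The exact file format expected by `LOGGING_CONF` is not stated here. As an alternative illustration, the sketch below configures the package's parent logger in-process with the standard library's `dictConfig`; only the logger name `'gwl_forecast_pipeline'` comes from the text above, while the handler and formatter names are arbitrary choices.

```python
import logging
import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(name)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",  # write to stdout instead of stderr
            "formatter": "plain",
            "level": "DEBUG",
        },
    },
    "loggers": {
        # Child loggers such as 'gwl_forecast_pipeline.xyz' inherit this level.
        "gwl_forecast_pipeline": {
            "handlers": ["console"],
            "level": "DEBUG",
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING)
logger = logging.getLogger("gwl_forecast_pipeline")
```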

### Download the raw data

Run the `gwl_download_data` command in your shell to download and extract the raw data. Make sure you have configured
all download URLs in `settings.ini` and set the `SETTINGS_PATH` environment variable beforehand. Read the section below on how to obtain the EU-DEM data. You will need about **30 GB of free disk space** for the raw data.
The files expected to be downloaded, with their respective data sets and file sizes, are listed in the table below.
The downloaded files will be stored in `DATA_PATH`.

```shell
gwl_download_data
```

| Dataset            | File                                                   | Size (MB) |
|--------------------|--------------------------------------------------------|-----------|
| HYRAS (HURS)       | hurs_hyras_5_1951_2020_v5-0_de.nc                      | 920       |
| HYRAS (TAS)        | tas_hyras_5_1951_2020_v5-0_de.nc                       | 349       |
| HYRAS (PR)         | pr_hyras_1_1931_2020_v5-0_de.nc                        | 11195     |
| EU-DEM (ELEVATION) | eu_dem_v11_E40N20.TIF                                  | 4906      |
| EU-DEM (ELEVATION) | eu_dem_v11_E40N30.TIF                                  | 4038      |
| EU-DEM (SLOPE)     | EUD_CP-SLOP_4500025000-AA.tif                          | 441       |
| EU-DEM (SLOPE)     | EUD_CP-SLOP_4500035000-AA.tif                          | 116       |
| EU-DEM (ASPECT)    | EUD_CP-ASPC_4500025000-AA.tif                          | 1405      |
| EU-DEM (ASPECT)    | EUD_CP-ASPC_4500035000-AA.tif                          | 1194      |
| HUEK250            | huek250__25832_v103_poly.dbf                           | 58        |
| HUEK250            | huek250__25832_v103_poly.shp                           | 170       |
| GWN1000            | GWN1000__3034_v1_raster1.tif                           | 0.5       |
| SWR1000            | swr1000_250.tif                                        | 13        |
| CLC                | U2018_CLC2012_V2020_20u1.tif                           | 206       |
| LAI                | 12 files (lai_01/w001001.adf, lai_02/w001001.adf, ...) | ~1080     |
| GWL                | groundwater_levels.feather                             | 1394      |
| WELL META          | gwl_germany_features.csv                               | 11        |


#### EU-DEM

To access data from the EU-DEM data set, follow these steps:

1. Register on https://land.copernicus.eu
2. Go to https://land.copernicus.eu/imagery-in-situ/eu-dem/eu-dem-v1-0-and-derived-products
3. Choose the following products and select the following map tiles for download (Download Tab):

    #### EU-DEM Elevation (v1.0) tiles:
    * EU-DEM 45000-35000: `EUD_CP-DEMS_4500035000-AA`
    * EU-DEM 45000-25000: `EUD_CP-DEMS_4500025000-AA`

    #### EU-DEM Slope (v1.0) tiles:
    * slope 4500035000: `EUD_CP-SLOP_4500035000-AA`
    * slope 4500025000: `EUD_CP-SLOP_4500025000-AA`

    #### EU-DEM Aspect (v1.0) tiles:
    * Aspect 45000-35000: `EUD_CP-ASPC_4500035000-AA`
    * Aspect 45000-25000: `EUD_CP-ASPC_4500025000-AA`
4. For each product you are provided with a download link combining the 2 tiles in one compressed archive. Place the download links into the 
respective variables in the `settings.ini`-file. 

### Prepare the raw data

The data preparation stage harmonizes the raw data to a common format (GeoTIFF), a
common temporal resolution (1 week), a common Coordinate Reference System (EPSG:3034), a common spatial resolution (1 km x 1 km) and a common
spatial and temporal extent.

To obtain the harmonized data, run the `gwl_prepare_data`-command. This process may take several hours.
The resulting data is stored in `DATA_PATH`.

```shell
gwl_prepare_data
```
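
To illustrate what harmonizing to a weekly temporal resolution means, the following sketch aggregates a daily series to weekly values with `pandas`. Aggregating by the mean and using Sunday-anchored weeks are assumptions made for this example; the aggregation actually used by `gwl_prepare_data` is not documented here.

```python
import pandas as pd

# Two weeks of hypothetical daily values, starting on Monday 2000-01-03.
days = pd.date_range("2000-01-03", periods=14, freq="D")
daily = pd.Series([float(i) for i in range(14)], index=days)

# "W" creates weekly bins ending on Sunday; each bin is reduced to its mean.
weekly = daily.resample("W").mean()
print(weekly)
```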

## Usage

### Load data

The raw data is loaded lazily. The result of the `DataLoader.load_data()` function
is a triple of (meta data, generator for static raster features, generator for temporal raster features).
Use `max_chunk_size=` to define the number of samples per iteration in the generators and
thereby control the RAM consumption of the data loading process.

```python
import pandas as pd
from gwl_forecast_pipeline import DataLoader

WELL_IDS = ['BB_25470023', 'BB_25470024']
START = pd.Timestamp(2000, 1, 1) 
END = pd.Timestamp(2014, 1, 1)
RASTER_SIZE = 5  # km

data_loader = DataLoader()
raw_data = data_loader.load_data(WELL_IDS, START, END, RASTER_SIZE, max_chunk_size=3500)
```
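
The idea behind chunked, lazy loading can be illustrated with a small generator. This is a generic sketch and not the package's implementation; `iter_chunks` is a hypothetical name. Only one chunk is held in memory at a time, so a smaller `max_chunk_size` lowers peak RAM usage at the cost of more iterations.

```python
from itertools import islice

def iter_chunks(samples, max_chunk_size):
    """Yield lists of at most `max_chunk_size` items without materializing all samples."""
    iterator = iter(samples)
    while True:
        chunk = list(islice(iterator, max_chunk_size))
        if not chunk:
            return
        yield chunk

chunks = iter_chunks(range(10), max_chunk_size=4)
print([len(chunk) for chunk in chunks])  # [4, 4, 2]
```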

### Pre-process data

Data preprocessing involves feature engineering and normalization. The
pre-processed data are `numpy` arrays, divided
into separate arrays: categorical static raster features (int32), numeric static raster features (float32),
groundwater level raster features (float32), exogenous temporal raster features (float32), and the target variable (float32).
The return value of the `preprocess` function is of a custom type, `DataContainer`, which
bundles all arrays and abstracts from the storage. The arrays are stored either in memory
or in the file system under `PREPROCESSOR_CACHE_PATH`. This is controlled by the parameter `use_fs_buffer=`.
Once pre-processed, the data and the fitted preprocessor can be reused. The preprocessor is considered
part of the model and is therefore stored under `MODEL_PATH` by the name of the model. The preprocessor requires
a `ModelConfig` object, which holds information on the raster size and data normalization.


```python
from gwl_forecast_pipeline import (
    Preprocessor,
    CNNModelConfig,
)

model_conf = CNNModelConfig(
   name='my_model',
   raster_size=RASTER_SIZE,
   target_normalized=True,
   scale_per_group=True,
   # ... there are way more params which are not of interest here
)

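try:
    # reuse a previously stored preprocessor for this model
    preprocessor = Preprocessor.from_cache(model_conf)
except Exception:
    # otherwise start with an unfitted preprocessor
    preprocessor = Preprocessor(model_conf)

# raw_*_data are assumed to come from DataLoader.load_data(), one call per split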
train_data = preprocessor.preprocess(raw_train_data, fit=True, use_fs_buffer=True)
val_data = preprocessor.preprocess(raw_val_data, fit=False, use_fs_buffer=True)
test_data = preprocessor.preprocess(raw_test_data, fit=False, use_fs_buffer=True)

# store fitted preprocessor instance for reuse
preprocessor.store()
```
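
The `fit=True` / `fit=False` pattern above means that normalization statistics are estimated on the training data only and then reused for validation and test data. The sketch below illustrates this with a per-group standardization, loosely corresponding to `scale_per_group=True`. `GroupStandardScaler` is a hypothetical class written for this example; the package's `Preprocessor` may normalize differently.

```python
import numpy as np

class GroupStandardScaler:
    """Standardize values per group (e.g. per well) with statistics from fit()."""

    def fit(self, groups, values):
        groups = np.asarray(groups)
        values = np.asarray(values, dtype=np.float64)
        self.stats_ = {}
        for group in np.unique(groups).tolist():  # plain Python keys
            group_values = values[groups == group]
            std = group_values.std()
            # Guard against constant series to avoid division by zero.
            self.stats_[group] = (group_values.mean(), std if std > 0 else 1.0)
        return self

    def _lookup(self, groups):
        groups = np.asarray(groups).tolist()
        mean = np.array([self.stats_[g][0] for g in groups])
        std = np.array([self.stats_[g][1] for g in groups])
        return mean, std

    def transform(self, groups, values):
        mean, std = self._lookup(groups)
        return (np.asarray(values, dtype=np.float64) - mean) / std

    def inverse_transform(self, groups, values):
        mean, std = self._lookup(groups)
        return np.asarray(values, dtype=np.float64) * std + mean

# Fit on training data only, then apply the same statistics to unseen data.
scaler = GroupStandardScaler().fit(["A", "A", "B", "B"], [1.0, 3.0, 10.0, 30.0])
scaled = scaler.transform(["A", "B"], [4.0, 20.0])
restored = scaler.inverse_transform(["A", "B"], scaled)
print(scaled, restored)
```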

### Train a new model


Two model types are available, configured via `CNNModelConfig` or `ConvLSTMModelConfig`. The trained model is stored
in the file system under `MODEL_PATH`. There is also a `fit_model` function to train an existing model.
```python
from gwl_forecast_pipeline import (
    fit_new_model,
    CNNModelConfig,
    ConvLSTMModelConfig,
    FEATURES,
)

model_conf = CNNModelConfig(
    name='my_model',
    lag=4, # number of lag observations in weeks
    lead=1, # length of predicted sequence in weeks
    loss='mse', # name of the loss function, choices are standard TensorFlow losses and custom loss-function: "mean_group_mse"
    epochs=10, 
    batch_size=512, 
    learning_rate=.0001,
    batch_norm=True,
    dropout=.5, # drop out rate in encoder, decoder and dense layers
    n_dense_layers=1, # number of final dense layers after the encoder-decoder
    n_encoder_layers=2, # number of network layers in the encoder
    n_decoder_layers=2, # number of network layers in the decoder
    n_nodes=32, # number of nodes in the first encoder layer, number of nodes in subsequent layers is derived from this number
    dropout_embedding=.2, # dropout rate after the embedding layers
    dropout_static_features=.33, # dropout rate for static features
    dropout_temporal_features=.25, # dropout rate for temporal features
    pre_decoder_dropout=.25, # dropout rate between encoder and decoder
    early_stop_patience=10, # number of epochs for early stopping patience
    weighted_feature=FEATURES.ROCK_TYPE,  # name of a static categorical feature to weigh samples
    sample_weights={0: 1., 1: 1., 2: 1.5, 3: 2., 4: 1.}, # {channel: weight}, channels can be found in constants-module or raw data
    # if weighted_feature and sample_weights are provided, a weighted MSE will be applied as loss-function
)

history = fit_new_model(train_data, model_conf, val_data=val_data)
```
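
The weighted MSE mentioned in the last comment can be sketched as follows. This is an illustration in `numpy`, not the package's loss function: `weighted_mse` is a hypothetical name, and normalizing by the sum of weights is an assumption.

```python
import numpy as np

def weighted_mse(y_true, y_pred, classes, class_weights):
    """Mean squared error where each sample is weighted by its class weight."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    weights = np.array([class_weights[c] for c in classes], dtype=np.float64)
    # Assumption: normalize by the sum of weights (a weighted average).
    return float(np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights))

sample_weights = {0: 1.0, 1: 1.0, 2: 1.5, 3: 2.0, 4: 1.0}  # {channel: weight}
loss = weighted_mse(
    y_true=[1.0, 2.0, 3.0],
    y_pred=[1.0, 1.0, 1.0],
    classes=[0, 2, 3],  # hypothetical class of the weighted feature per sample
    class_weights=sample_weights,
)
print(loss)
```

With all weights equal, the function reduces to the ordinary MSE.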

### Make predictions and obtain scores

The `predict` function runs model inference and returns the predicted values as well as
the true values, indexed by well_id, timestamp and forecast horizon.
The `score` function evaluates the predictions by NSE, nRMSE and rMBE.

```python
from gwl_forecast_pipeline import (
    predict,
    score,
)

predictions = predict(model_conf, test_data)
predictions['y'] = preprocessor.inverse_transform_gwl(
   predictions.index.get_level_values('proj_id'), 
   predictions[['y']].values,
)
predictions['y_hat'] = preprocessor.inverse_transform_gwl(
   predictions.index.get_level_values('proj_id'), 
   predictions[['y_hat']].values,
)
scores = score(predictions)
```
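
For reference, the three metrics can be written down in a few lines of `numpy`. The NSE (Nash-Sutcliffe efficiency) follows its standard definition. For nRMSE and rMBE several normalizations are in use; the sketch below assumes normalization by the observed range, which may differ from what `score` computes.

```python
import numpy as np

def nse(y, y_hat):
    """Nash-Sutcliffe efficiency: 1 minus squared error relative to the variance of y."""
    y, y_hat = np.asarray(y, dtype=np.float64), np.asarray(y_hat, dtype=np.float64)
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))

def nrmse(y, y_hat):
    """RMSE divided by the observed range (assumed normalization)."""
    y, y_hat = np.asarray(y, dtype=np.float64), np.asarray(y_hat, dtype=np.float64)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)) / (y.max() - y.min()))

def rmbe(y, y_hat):
    """Mean bias (prediction minus observation) divided by the observed range (assumed)."""
    y, y_hat = np.asarray(y, dtype=np.float64), np.asarray(y_hat, dtype=np.float64)
    return float(np.mean(y_hat - y) / (y.max() - y.min()))

y_obs = [1.0, 2.0, 3.0, 4.0]
y_sim = [1.5, 2.5, 3.5, 4.5]  # constant over-prediction by 0.5
print(nse(y_obs, y_sim), nrmse(y_obs, y_sim), rmbe(y_obs, y_sim))
```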

### Hyperparameter Optimization

t.b.d.

Owner

Name: Cognitive Algorithms Lab
Login: calgo-lab
Kind: organization
Location: Germany

Repositories: 2
Profile: https://github.com/calgo-lab

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Dependencies

Dockerfile docker

nvidia/cuda 11.2.0-cudnn8-runtime-ubuntu20.04 build

requirements.txt pypi

PyYAML *
StrEnum *
binpacking *
dask *
geocube *
geopandas *
joblib *
netCDF4 *
numpy *
pandas *
patool *
pyarrow *
pydantic *
pyproj *
python-decouple *
rasterio *
requests *
rioxarray *
scikit-learn *
tensorflow *
tqdm *
xarray *

requirements_dev.txt pypi

absl-py ==1.4.0 development
affine ==2.4.0 development
appdirs ==1.4.4 development
astunparse ==1.6.3 development
attrs ==22.2.0 development
binpacking ==1.5.2 development
cachetools ==5.3.0 development
certifi ==2022.12.7 development
cftime ==1.6.2 development
charset-normalizer ==3.0.1 development
click ==8.1.3 development
click-plugins ==1.1.1 development
cligj ==0.7.2 development
dask ==2023.2.0 development
fiona ==1.9.1 development
flatbuffers ==23.1.21 development
future ==0.18.3 development
gast ==0.4.0 development
geocube ==0.3.3 development
geopandas ==0.12.2 development
google-auth ==2.16.1 development
google-auth-oauthlib ==0.4.6 development
google-pasta ==0.2.0 development
grpcio ==1.51.1 development
h5py ==3.8.0 development
idna ==3.4 development
importlib-metadata ==6.0.0 development
joblib ==1.2.0 development
keras ==2.11.0 development
libclang ==15.0.6.1 development
markdown ==3.4.1 development
markupsafe ==2.1.2 development
munch ==2.5.0 development
netcdf4 ==1.6.2 development
numpy ==1.24.2 development
oauthlib ==3.2.2 development
odc-geo ==0.3.3 development
opt-einsum ==3.3.0 development
packaging ==23.0 development
pandas ==1.5.3 development
patool ==1.12 development
protobuf ==3.19.6 development
pyarrow ==11.0.0 development
pyasn1 ==0.4.8 development
pyasn1-modules ==0.2.8 development
pydantic ==1.10.5 development
pyparsing ==3.0.9 development
pyproj ==3.4.1 development
python-dateutil ==2.8.2 development
python-decouple ==3.7 development
pytz ==2022.7.1 development
pyyaml ==6.0 development
rasterio ==1.3.6 development
requests ==2.28.2 development
requests-oauthlib ==1.3.1 development
rioxarray ==0.13.3 development
rsa ==4.9 development
scikit-learn ==1.2.1 development
scipy ==1.10.1 development
shapely ==2.0.1 development
six ==1.16.0 development
snuggs ==1.4.7 development
strenum ==0.4.9 development
tensorboard ==2.11.2 development
tensorboard-data-server ==0.6.1 development
tensorboard-plugin-wit ==1.8.1 development
tensorflow ==2.11.0 development
tensorflow-estimator ==2.11.0 development
tensorflow-io-gcs-filesystem ==0.30.0 development
termcolor ==2.2.0 development
threadpoolctl ==3.1.0 development
tqdm ==4.64.1 development
typing-extensions ==4.5.0 development
urllib3 ==1.26.14 development
werkzeug ==2.2.3 development
wheel ==0.38.4 development
wrapt ==1.14.1 development
xarray ==2023.1.0 development
zipp ==3.14.0 development
