gates_lpdm_emulator

https://github.com/elenafillo/gates_lpdm_emulator

Scientific Fields

Earth and Environmental Sciences Physical Sciences - 30% confidence

Last synced: 6 months ago · JSON representation ·

Repository

Basic Info

Host: GitHub
Owner: elenafillo
License: gpl-3.0
Language: Jupyter Notebook
Default Branch: new_load_functions
Size: 253 MB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 2

Created over 2 years ago · Last pushed 6 months ago

Metadata Files

Readme License Citation

GATESLPDMemulator

This repo implements the model described at Enabling Fast Greenhouse Gas Emissions Inference from Satellites with GATES: a Graph-Neural-Network Atmospheric Transport Emulation System (egusphere-2025-2392)

Replicating the paper

1) Download all the necessary data 2) Set up the environment 3) Run launchtrain.sh to train a model 4) Run launchemulate.sh to make predictions at 200x200 size 5) Run integrate_fps.py file to bias correct (make sure to edit the paths at the top!) 6) Use the ACRC repo with the bias-corrected, integrated footprints to do inversion

Environment - check this section!

See environmentshort.yml This file does not contain torch and related packages - this is because you will need to install separately a CUDA-enabled version or not depending on where you are running the code. BluePebble has pytorch+cuda pre-installed, which you can load when you submit jobs to the queue (see the launchtrain.sh file). To run notebooks or files on the login node, you will need torch and associated packages installed in a different environment, which I manually import when running notebooks with the following line. Alternatively you could have two parallel envs (graphnet to run on cluster, and graphnet+torch to run on login) but that might get more confusing if you need to install packages! sys.path.insert(0, "/path/to/environment_with_torch/env_name/lib/python3.8/site-packages/") import torch

Loading data

`LoadBaseSatelliteData` loads data from the directories (provided or default). It does not crop or interpolate

original_data = LoadBaseSatelliteData(year=2016, region="SAHARA", freq=40, verbose=True, load_everything=True) originaldata has attributes originaldata.fpdatafull , originaldata.metfile, originaldata.topogfile and originaldata.landcoverfile

`LoadSquareSatelliteData` loads data, interpolating to the time of the footprints and cropping to a square of size size x size around the footprint's measurement point

cropped_data = LoadSquareSatelliteData(year=2016, region="SAHARA", freq=40, size=200, verbose=True, load_everything=True) croppeddata has attributes croppeddata.fpdata (np array), croppeddata.met (Dataset interpolated in time and cropped in space), and cropped_data.topog (Dataset cropped in space, with variables topog and landcover).

Notes: - size should be even! - freq reduces the time frequency of the data before loading to reduce computational expense (ie freq=3 will only load one in every three footprints and corresponding data) - sometimes the cropped area escapes the actual footprint domain. - if deleteoutofdomain=True, these footprints are deleted from the dataset - if the footprints that escape the domain are kept, use filloutofdomain_with to specify if the out-of-domain areas should be filled with "zeros" or "nans"

Setting up data

Extracting inputs with `get_square_satellite_inputs()`

Extract the inputs with shape (time, lat, lon, variable), for multiple atmospheric levels and times if passed. The parameter time_deltas allows extracting and interpolating meteorology further back in time, for t-delta Hours where t is the footprint measurement time. e.g. time_deltas=[6] means that the meteorological variables will be returned at t, and t-6h.

``` variables = {"xwind":[3,15], "ywind":[3], "surfaceairpressure":[]} staticvariables=["latcoords", "loncoords", "xcoords", "ycoords", "topog", "landcover"] inputs, inputnames = getsquaresatelliteinputs(croppeddata, variables, staticvariables=staticvariables, timedeltas=[6], returnvariable_names=True)

```

Variables

Meteorological (time-dependent)
- air_pressure
- air_temperature
- atmosphereboundarylayer_thickness: height of the layer of the atmosphere within which the effects of friction are significant (roughly the lowest one or two kilometers of the atmosphere). 2D - no height component
- surfaceairpressure: 2D - no height component
- upwardairvelocity
- x_wind
- y_wind
- wind_angle
- wind_speed
Static
- topog
- landcover
- sinlatcoords/sinloncoords/coslatcoords/cosloncoords - sin and cos of coordinates. This is to encode cyclical variables when using a sphere (ie the whole world), but probably is not as useful when only using a reduced domain
- latcoords/loncoords - coordinates at each node
- distance_centre - Euclidean distance from the release point, calculated with x/y coords rather than actual distance - this is to speed up calculation, as the lat/lon frame of reference (and therefore distances) changes only slightly for each footprint
- xcoords/ycoords - numerical indeces of each node in a x/y style, passing centered_coords=True returns 0,0 as the center (so negative x coordinates are west, negative y coordinates are south), otherwise 0,0 is the South-West corner and all x/y coords are positive, with the release point at int(size/2), int(size/2). For best practice and better inference across sizes pass centered_coords=True
- binary_centre - zero for all nodes except the release point which is 1

Preparing dataset

The FootprintsDataset object sets up the inputs and outputs to be loaded to the DataLoader, and makes any needed transformations. The transformations I'm doing currently are:
outputs:
- logv4: Takes log of output data where non-zero, and offsets by minimum value so its above. train_dataset.fp contains the transformed data, and train_dataset.fp_untransformed the original footprint data
inputs
- clever_transform_3 - applies a sklearn preprocessing.StandardScaler() to each variable and level, across all timedeltas (eg all the xwind data at level 3 is scaled together, so is at level 9 etc). Applies min-max transform to landcover. The transformers are stored in a dictionary of format "{"variablename":{level1:transformer, level2:transformer...},...}" stored at traindataset.transformers. The feature names need to be passed to input_names for this transform to work

The test dataset can be transformed using the trained transformers from the train dataset by passing a test_mode dictionary as shown below

``` traindataset = FootprintsDataset(inputs=inputs, fp=croppeddata.fpdata, inputnames=inputnames, inputtransforms=["clevertransform3"], output_transforms= ["logv4"])

testdataset = FootprintsDataset(inputs=inputs, fp=croppeddata.fpdata, inputnames=inputnames, inputnames=names, testmode=traindataset.transform_parameters)

trainloader = DataLoader(traindataset, batchsize=5, shuffle=True) trainloader = DataLoader(testdataset, batchsize=5, shuffle=False) ```

Model

Check model_description.md for more info on the architecture!

The model operates on two levels: lat-lon grid (of same resolution and shape for the inputs and the outputs) and an intermediate abstract layer, with nodes arranged in hexagons. In the current setup, the model builts a grid and mesh pair using the location of a "reference footprint", and all predictions are done on this grid. An improvement would be to explore a way to select the best reference footrpint, or to find a way to do this dynamically for each footprint grid, _ = getgrid(data, parameters.get("gridreference_fp"))

Constructing a model

Create a model with the following:

``` auxdim = len(inputvariables["others"]) featuredim=np.shape(inputs)[-1]-auxdim

model = GraphSatelliteForecaster(grid, wholeworld=False, featuredim=featuredim, auxdim=auxdim, numblocks=4, nodedim=64, edgedim=64, hiddenlayersprocessornode=3, hiddenlayersprocessoredge=2, hiddenlayersdecoder=1, hiddendimprocessornode=16, hiddendimprocessoredge=16, hiddendimdecoder=16, resolution=4, output_dim=1)

optimizer = optim.AdamW(model.parameters(), lr=lr) ``Parameters: - grid: list of lat-lon tuples for each of the nodes, outputted with get_all_inputs_graphnet_satellite_v4 - whole_world: wether to create a mesh graph that spans the whole globe, or only mesh nodes above the grid nodes passed in grid. Always False at this stage - feature_dim and aux_dim: length of meteorological and non-met time inputs, respectively. Legacy from original code, right now they make no difference as long as feature_dim+aux_dim=total number of dims. feature_dim+aux_dim = length of the green node features in the Encoder in the diagram in model_description.md - resolution: resolution of the mesh grid, as determined by the [h3 library](https://h3geo.org/docs/core-library/restable). The lower the resolution, the bigger the hexagons are. Resolution of 4 (used throughout the models) covers between one and three grid nodes, resolution of 3 covers around 10-15 grid nodes, resolution of 5 covers none or one grid nodes. Resolution of 5 and below do not work (as the mesh needs to cover the domain completely and this resolution is too fine-grained to). - NN parameters - please refer to diagram above which I need to label at some point: -numblocks- number of processor blocks (pink cubes). These update the mesh edges and nodes sequentially -nodedim- size of mesh node feature array (ie length of yellow mesh node feature in diagram above) -edgedim- size of mesh edge feature array (ie length of dashed yellow mesh edge feature above) -hiddenlayersprocessornode- number of hidden layers in the Node Encoder and Node Updater, each of sizehiddendimprocessornode(blue blocks in Node Encoder and Node Updater) -hiddenlayersprocessoredge- number of hidden layers in the Edge Encoder and Edge Updater, each of sizehiddendimprocessoredge(blue blocks in Edge Encoder and Edge Updater) -hiddenlayersdecoder- number of hidden layers in the Decoder, each of sizehiddendimdecoder(blue block in Node Decoder) -outputdim` - dimension of Decoder Output (flat green square in Decoder above)

Training

Use the general_train.py file to train a model. Set up all your model parameters using the parametertemplatetrain.txt file (note that modelname should be unique!). A new folder will be created, and populated with the following: ``` . └── modelname/ ├── gridmodelname.pickle ├── trainingsettingsmodelname.json ├── modelnameupdates.txt ├── transformparametersmodelname.pickle ├── trainingimgs/ │ ├── modelname0.png │ ├── modelname1.png │ └── ... ├── modelname50.pt ├── modelname_100.pt └── ... ```

## Predicting: same model and size If you want to use a model modelname that you have trained to predict footprints, in the same domain and size, you can pass the parameter file directly to generalmake_prediction.py

python general_make_prediction.py training_settings_model_name.json --file_path /path/to/model

This will create a new folder in the directory called predictions. Each file in predictions is a .nc array with - coords lat, lon and time - attributes: - fp - the true footprints, in the original dataspace - transfp: the true footprints, in the transformed dataspace - predictions: the predicted footprints, as outputted by the model in the transformed dataspace - transpredictions: the predicted footprints, in the original dataspace

Predicting: different size, different domain etc

To use a trained model model_name to predict but with a different set-up (e.g. predict at 200x200 when you trained at 50x50), you will need to create a folder newmodelname: - grid of lat/lon values (used during training for consistency) - trained input and output transformers - input, dataset and model parameters . └── model_name/ ├── grid_new_model_name.pickle # this is needed if the emulating size is different to the training size ├── training_settings_new_model_name.json # copy the parameter file from the reference model and add a section Add the following to training_settings_new_model_name.json, to indicate which model should be used to emulate, and the size that it was trained on "reference_model": { "model_name":"model_name_to_predict_from", "domain_size": 50 }

Inference and evaluation - WIP!

After inference, the footprints might need bias correcting. For the inversion, they also need to be integrated into the original domain shape. Use the integrate_fps.py file to apply bias correction, and save the footprints in the original footprint domain

Inversion - TO DO!

Owner

Login: elenafillo
Kind: user

Repositories: 4
Profile: https://github.com/elenafillo

Citation (CITATION.cff)

cff-version: 1.1.0
authors:
- family-names: Fillola
  given-names: Elena
  orcid: https://orcid.org/0000-0003-4706-9833
- family-names: Clark
  given-names: Jeff
  orcid: https://orcid.org/0000-0003-0118-3999
- family-names: Keshtmand
  given-names: Nawid
title: "GATES: A Graph-Neural-Network Atmospheric Transport Emulation System"
version: v0.1.0-beta
date-released: 2025-07-31
url: "https://github.com/elenafillo/GATES_LPDM_emulator"

GitHub Events

Total

Release event: 2
Watch event: 1
Delete event: 1
Push event: 22
Create event: 2

Last Year

Release event: 2
Watch event: 1
Delete event: 1
Push event: 22
Create event: 2

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

gates_lpdm_emulator

Science Score: 44.0%

Scientific Fields

Repository

Basic Info

Statistics

Metadata Files

README.md

GATESLPDMemulator

Replicating the paper

Environment - check this section!

Loading data

`LoadBaseSatelliteData` loads data from the directories (provided or default). It does not crop or interpolate

`LoadSquareSatelliteData` loads data, interpolating to the time of the footprints and cropping to a square of size size x size around the footprint's measurement point

Setting up data

Extracting inputs with `get_square_satellite_inputs()`

Variables

- binary_centre - zero for all nodes except the release point which is 1

Preparing dataset

Model

Constructing a model

Training

Predicting: different size, different domain etc

Inference and evaluation - WIP!

Inversion - TO DO!

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

gates_lpdm_emulator

Science Score: 44.0%

Scientific Fields

Repository

Basic Info

Statistics

Metadata Files

README.md

GATESLPDMemulator

Replicating the paper

Environment - check this section!

Loading data

LoadBaseSatelliteData loads data from the directories (provided or default). It does not crop or interpolate

LoadSquareSatelliteData loads data, interpolating to the time of the footprints and cropping to a square of size size x size around the footprint's measurement point

Setting up data

Extracting inputs with get_square_satellite_inputs()

Variables

- binary_centre - zero for all nodes except the release point which is 1

Preparing dataset

Model

Constructing a model

Training

Predicting: different size, different domain etc

Inference and evaluation - WIP!

Inversion - TO DO!

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

`LoadBaseSatelliteData` loads data from the directories (provided or default). It does not crop or interpolate

`LoadSquareSatelliteData` loads data, interpolating to the time of the footprints and cropping to a square of size size x size around the footprint's measurement point

Extracting inputs with `get_square_satellite_inputs()`