gates_lpdm_emulator
Repository
Basic Info
- Host: GitHub
- Owner: elenafillo
- License: gpl-3.0
- Language: Jupyter Notebook
- Default Branch: new_load_functions
- Size: 253 MB
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
GATES_LPDM_emulator
This repo implements the model described in "Enabling Fast Greenhouse Gas Emissions Inference from Satellites with GATES: a Graph-Neural-Network Atmospheric Transport Emulation System" (egusphere-2025-2392).
Replicating the paper
1) Download all the necessary data
2) Set up the environment
3) Run launchtrain.sh to train a model
4) Run launchemulate.sh to make predictions at 200x200 size
5) Run the integrate_fps.py file to apply the bias correction (make sure to edit the paths at the top!)
6) Use the ACRC repo with the bias-corrected, integrated footprints to do the inversion
Environment - check this section!
See environmentshort.yml
This file does not contain torch and related packages, because you need to install either a CUDA-enabled or a CPU-only version separately, depending on where you run the code. BluePebble has pytorch+cuda pre-installed, which you can load when you submit jobs to the queue (see the launchtrain.sh file). To run notebooks or files on the login node, you need torch and its associated packages installed in a different environment, which I import manually when running notebooks with the lines below. Alternatively, you could keep two parallel envs (graphnet to run on the cluster, and graphnet+torch to run on the login node), but that might get more confusing when you need to install packages!
import sys
sys.path.insert(0, "/path/to/environment_with_torch/env_name/lib/python3.8/site-packages/")
import torch
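A minimal sketch of a guarded version of the path trick above: it only touches sys.path when the package cannot already be found. The helper name ensure_importable is hypothetical and not part of this repo, and the path in the usage line is a placeholder.

```python
import importlib.util
import sys


def ensure_importable(module_name, extra_site_packages):
    """Return True if module_name can be found, adding extra_site_packages to sys.path only if needed."""
    # Already available in the active environment: leave sys.path alone
    if importlib.util.find_spec(module_name) is not None:
        return True
    # Otherwise fall back to the site-packages folder of the second environment
    if extra_site_packages not in sys.path:
        sys.path.insert(0, extra_site_packages)
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None


# Placeholder path: replace with the environment that contains torch
found = ensure_importable("json", "/path/to/environment_with_torch/env_name/lib/python3.8/site-packages/")
```

Checking the return value before `import torch` gives a clearer error than a bare ImportError when the path is wrong.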
Loading data
LoadBaseSatelliteData loads data from the directories (provided or default). It does not crop or interpolate
original_data = LoadBaseSatelliteData(year=2016, region="SAHARA", freq=40, verbose=True, load_everything=True)
original_data has the attributes original_data.fpdatafull, original_data.metfile, original_data.topogfile and original_data.landcoverfile
LoadSquareSatelliteData loads data, interpolating to the time of the footprints and cropping to a square of size size x size around the footprint's measurement point
cropped_data = LoadSquareSatelliteData(year=2016, region="SAHARA", freq=40, size=200, verbose=True, load_everything=True)
cropped_data has the attributes cropped_data.fpdata (np array), cropped_data.met (Dataset interpolated in time and cropped in space), and cropped_data.topog (Dataset cropped in space, with variables topog and landcover).
Notes:
- size should be even!
- freq reduces the time frequency of the data before loading to reduce computational expense (i.e. freq=3 will only load one in every three footprints and the corresponding data)
- sometimes the cropped area extends beyond the actual footprint domain:
  - if deleteoutofdomain=True, these footprints are deleted from the dataset
  - if the footprints that escape the domain are kept, use filloutofdomain_with to specify whether the out-of-domain areas should be filled with "zeros" or "nans"
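The notes above can be illustrated with a small pure-Python sketch. It is not the loader's actual code: crop_square, escapes_domain and their argument names are hypothetical, and the field is a plain nested list rather than a Dataset.

```python
import math


def escapes_domain(n_rows, n_cols, row, col, size):
    """True if a size x size window around (row, col) leaves the domain."""
    half = size // 2
    return row - half < 0 or col - half < 0 or row + half > n_rows or col + half > n_cols


def crop_square(field, row, col, size, fill_out_of_domain_with="zeros"):
    """Crop a size x size window, with the measurement point at index (size/2, size/2)."""
    if size % 2 != 0:
        raise ValueError("size should be even")
    fill = 0.0 if fill_out_of_domain_with == "zeros" else math.nan
    half = size // 2
    n_rows, n_cols = len(field), len(field[0])
    return [
        [
            field[r][c] if 0 <= r < n_rows and 0 <= c < n_cols else fill
            for c in range(col - half, col + half)
        ]
        for r in range(row - half, row + half)
    ]


field = [[r * 10 + c for c in range(4)] for r in range(4)]
inside = crop_square(field, 1, 1, 2)   # fully inside the domain
corner = crop_square(field, 0, 0, 2)   # partly outside, padded with zeros

# freq=3 keeps one in every three footprints
footprint_ids = list(range(10))
kept = footprint_ids[::3]
```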
Setting up data
Extracting inputs with get_square_satellite_inputs()
Extract the inputs with shape (time, lat, lon, variable), for multiple atmospheric levels and times if passed.
The parameter time_deltas allows extracting and interpolating meteorology further back in time, at t-delta hours, where t is the footprint measurement time. For example, time_deltas=[6] means that the meteorological variables will be returned at t and at t-6h.
```
# Levels to extract for each meteorological variable (empty list for the 2D surface_air_pressure)
variables = {"x_wind": [3, 15], "y_wind": [3], "surface_air_pressure": []}
static_variables = ["latcoords", "loncoords", "xcoords", "ycoords", "topog", "landcover"]
inputs, input_names = get_square_satellite_inputs(
    cropped_data,
    variables,
    static_variables=static_variables,
    time_deltas=[6],
    return_variable_names=True,
)
```
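To get a feel for how many features a given request produces, the sketch below enumerates one name per variable, level and time offset, plus the static fields. The naming scheme and ordering are assumptions made for illustration; the names actually returned by get_square_satellite_inputs may differ.

```python
def expected_input_names(variables, static_variables, time_deltas):
    """List one hypothetical feature name per (variable, level, time offset)."""
    offsets = [0] + list(time_deltas)  # 0 stands for the footprint time t
    names = []
    for offset in offsets:
        suffix = "t" if offset == 0 else f"t-{offset}h"
        for variable, levels in variables.items():
            # An empty level list is treated as a single 2D field
            for level in (levels or [None]):
                level_tag = "" if level is None else f"_level{level}"
                names.append(f"{variable}{level_tag}_{suffix}")
    # Static fields do not change in time, so they appear once
    names.extend(static_variables)
    return names


variables = {"x_wind": [3, 15], "y_wind": [3], "surface_air_pressure": []}
static_variables = ["latcoords", "loncoords", "xcoords", "ycoords", "topog", "landcover"]
names = expected_input_names(variables, static_variables, time_deltas=[6])
```

Under these assumptions the last axis of the inputs would have 4 meteorological fields at 2 times plus 6 static fields, i.e. 14 features.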
Variables
- Meteorological (time-dependent)
  - air_pressure
  - air_temperature
  - atmosphere_boundary_layer_thickness: height of the layer of the atmosphere within which the effects of friction are significant (roughly the lowest one or two kilometers of the atmosphere). 2D, no height component
  - surface_air_pressure: 2D, no height component
  - upward_air_velocity
  - x_wind
  - y_wind
  - wind_angle
  - wind_speed
- Static
  - topog
  - landcover
  - sinlatcoords/sinloncoords/coslatcoords/cosloncoords - sin and cos of the coordinates. This encodes cyclical variables when using a sphere (i.e. the whole world), but is probably not as useful when only using a reduced domain
  - latcoords/loncoords - coordinates at each node
  - distance_centre - Euclidean distance from the release point, calculated with x/y coords rather than actual distance. This speeds up the calculation, as the lat/lon frame of reference (and therefore the distances) changes only slightly for each footprint
  - xcoords/ycoords - numerical indices of each node in an x/y style. Passing centered_coords=True returns 0,0 as the center (so negative x coordinates are west, negative y coordinates are south); otherwise 0,0 is the South-West corner and all x/y coords are positive, with the release point at int(size/2), int(size/2). For best practice and better inference across sizes, pass centered_coords=True
  - binary_centre - zero for all nodes except the release point, which is 1
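The index-based static variables can be sketched in a few lines. This is an illustration of the definitions in the list above, not the repo's implementation; static_index_features is a hypothetical helper and row 0 is assumed to be the southern edge.

```python
import math


def static_index_features(size, centered_coords=True):
    """Build xcoords, ycoords, distance_centre and binary_centre for a size x size grid."""
    half = int(size / 2)  # index of the release point along each axis
    offset = half if centered_coords else 0
    grid = []
    for y in range(size):
        row = []
        for x in range(size):
            row.append({
                "xcoords": x - offset,
                "ycoords": y - offset,
                # Distance in index space, not in km
                "distance_centre": math.hypot(x - half, y - half),
                "binary_centre": 1 if (x == half and y == half) else 0,
            })
        grid.append(row)
    return grid


centered = static_index_features(4, centered_coords=True)
cornered = static_index_features(4, centered_coords=False)
```

With centered_coords=True the release point sits at (0, 0) whatever the size, which is why the centered form transfers better between a 50x50 and a 200x200 grid.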
Preparing dataset
The FootprintsDataset object sets up the inputs and outputs to be loaded into the DataLoader, and applies any needed transformations. The transformations I'm currently using are:
- outputs:
  - logv4: takes the log of the output data where it is non-zero, and offsets it by the minimum value so that it is above zero. train_dataset.fp contains the transformed data, and train_dataset.fp_untransformed the original footprint data
- inputs:
  - clever_transform_3: applies an sklearn preprocessing.StandardScaler() to each variable and level, across all time deltas (e.g. all the x_wind data at level 3 is scaled together, as is the data at level 9, etc.). Applies a min-max transform to landcover. The transformers are stored in a dictionary of the format {"variablename": {level1: transformer, level2: transformer, ...}, ...} at train_dataset.transformers. The feature names need to be passed to input_names for this transform to work
The test dataset can be transformed using the trained transformers from the train dataset by passing a test_mode dictionary as shown below
```
from torch.utils.data import DataLoader

# Fit the input and output transforms on the training data
train_dataset = FootprintsDataset(inputs=inputs, fp=cropped_data.fpdata, input_names=input_names, input_transforms=["clever_transform_3"], output_transforms=["logv4"])

# Reuse the transform parameters fitted on the training data
test_dataset = FootprintsDataset(inputs=inputs, fp=cropped_data.fpdata, input_names=input_names, test_mode=train_dataset.transform_parameters)

train_loader = DataLoader(train_dataset, batch_size=5, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=5, shuffle=False)
```
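The idea of fitting transforms on the training set and reusing them on the test set can be sketched without any dependencies. This is not the code behind logv4 or clever_transform_3: the function names are hypothetical, the exact offset used by logv4 is an assumption (here a shift of 1 keeps non-zero values strictly positive), and the scaler mimics a standard scaler with population standard deviation.

```python
import math
import statistics


def fit_log_offset(train_values):
    """Smallest log of the non-zero training values, used as the offset."""
    return min(math.log(v) for v in train_values if v > 0)


def log_transform(values, min_log, shift=1.0):
    """Log of non-zero values, offset so they sit above zero; zeros stay zero."""
    return [math.log(v) - min_log + shift if v > 0 else 0.0 for v in values]


def inverse_log_transform(transformed, min_log, shift=1.0):
    """Map transformed values back to the original dataspace."""
    return [math.exp(t + min_log - shift) if t > 0 else 0.0 for t in transformed]


def fit_scalers(train_columns):
    """Fit one (mean, std) pair per variable and level: {variable: {level: (mean, std)}}."""
    return {
        variable: {
            level: (statistics.fmean(vals), statistics.pstdev(vals) or 1.0)
            for level, vals in levels.items()
        }
        for variable, levels in train_columns.items()
    }


def apply_scalers(columns, scalers):
    """Scale data with parameters fitted elsewhere (e.g. on the training set)."""
    return {
        variable: {
            level: [(v - scalers[variable][level][0]) / scalers[variable][level][1] for v in vals]
            for level, vals in levels.items()
        }
        for variable, levels in columns.items()
    }


train_fp = [0.0, 1e-4, 1e-2]
min_log = fit_log_offset(train_fp)
train_fp_transformed = log_transform(train_fp, min_log)

scalers = fit_scalers({"x_wind": {3: [1.0, 3.0]}})
test_scaled = apply_scalers({"x_wind": {3: [2.0, 4.0]}}, scalers)
```

Reusing min_log and scalers for the test set keeps both sets in the same transformed dataspace. In this sketch, test values below the smallest non-zero training value could map to zero or below, so the inverse would not recover them.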
Model
Check model_description.md for more info on the architecture!
The model operates on two levels: a lat-lon grid (of the same resolution and shape for the inputs and the outputs) and an intermediate abstract layer, with nodes arranged in hexagons. In the current setup, the model builds a grid and mesh pair using the location of a "reference footprint", and all predictions are made on this grid. An improvement would be to explore a way to select the best reference footprint, or to find a way to do this dynamically for each footprint.
grid, _ = getgrid(data, parameters.get("gridreference_fp"))
Constructing a model
Create a model with the following:
```
import numpy as np
import torch.optim as optim

aux_dim = len(input_variables["others"])
feature_dim = np.shape(inputs)[-1] - aux_dim

model = GraphSatelliteForecaster(
    grid, whole_world=False, feature_dim=feature_dim, aux_dim=aux_dim,
    num_blocks=4, node_dim=64, edge_dim=64,
    hidden_layers_processor_node=3, hidden_layers_processor_edge=2, hidden_layers_decoder=1,
    hidden_dim_processor_node=16, hidden_dim_processor_edge=16, hidden_dim_decoder=16,
    resolution=4, output_dim=1,
)
# lr is the learning rate and must be defined beforehand
optimizer = optim.AdamW(model.parameters(), lr=lr)
```
Parameters:
- grid: list of lat-lon tuples for each of the nodes, returned by get_all_inputs_graphnet_satellite_v4
- whole_world: whether to create a mesh graph that spans the whole globe, or only mesh nodes above the grid nodes passed in grid. Always False at this stage
- feature_dim and aux_dim: length of the meteorological and non-met time inputs, respectively. This is a legacy from the original code; right now they make no difference as long as feature_dim + aux_dim equals the total number of input dimensions, i.e. the length of the green node features in the Encoder in the diagram in model_description.md
- resolution: resolution of the mesh grid, as determined by the [h3 library](https://h3geo.org/docs/core-library/restable). The lower the resolution, the bigger the hexagons are. At resolution 4 (used throughout the models) each hexagon covers between one and three grid nodes, at resolution 3 around 10-15 grid nodes, and at resolution 5 none or one grid node. Resolutions of 5 and finer do not work, as the mesh needs to cover the domain completely and these resolutions are too fine-grained to do so.
- NN parameters (please refer to the diagram above, which I need to label at some point):
  - num_blocks - number of processor blocks (pink cubes). These update the mesh edges and nodes sequentially
  - node_dim - size of the mesh node feature array (i.e. length of the yellow mesh node feature in the diagram above)
  - edge_dim - size of the mesh edge feature array (i.e. length of the dashed yellow mesh edge feature above)
  - hidden_layers_processor_node - number of hidden layers in the Node Encoder and Node Updater, each of size hidden_dim_processor_node (blue blocks in Node Encoder and Node Updater)
  - hidden_layers_processor_edge - number of hidden layers in the Edge Encoder and Edge Updater, each of size hidden_dim_processor_edge (blue blocks in Edge Encoder and Edge Updater)
  - hidden_layers_decoder - number of hidden layers in the Decoder, each of size hidden_dim_decoder (blue block in Node Decoder)
  - output_dim - dimension of the Decoder output (flat green square in the Decoder above)
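As a rough illustration of how the "hidden layers" and "hidden dim" parameters combine, the sketch below lists the layer widths of a plain multilayer perceptron and counts its weights and biases. It assumes fully connected layers and made-up input and output widths; the real encoders, updaters and decoder in this repo may be wired differently (see model_description.md).

```python
def mlp_layer_sizes(input_dim, hidden_layers, hidden_dim, output_dim):
    """Layer widths of an MLP with hidden_layers hidden layers of size hidden_dim."""
    return [input_dim] + [hidden_dim] * hidden_layers + [output_dim]


def count_mlp_parameters(sizes):
    """Weights plus biases of the fully connected layers between consecutive widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Hypothetical node updater: node_dim=64 in and out,
# hidden_layers_processor_node=3, hidden_dim_processor_node=16
node_updater_sizes = mlp_layer_sizes(64, 3, 16, 64)
node_updater_parameters = count_mlp_parameters(node_updater_sizes)
```

Under these assumptions, the hidden widths (16) are much smaller than node_dim (64), so most of the parameters sit in the first and last layers.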
Training
Use the general_train.py file to train a model. Set up all your model parameters using the parametertemplatetrain.txt file (note that model_name should be unique!). A new folder will be created and populated with the following:
```
.
└── model_name/
    ├── grid_model_name.pickle
    ├── training_settings_model_name.json
    ├── model_name_updates.txt
    ├── transform_parameters_model_name.pickle
    ├── training_imgs/
    │   ├── model_name_0.png
    │   ├── model_name_1.png
    │   └── ...
    ├── model_name_50.pt
    ├── model_name_100.pt
    └── ...
```
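A small sketch for picking the most recent checkpoint out of such a folder. It assumes checkpoints are named after the model followed by an underscore and the epoch number with a .pt extension, as the listing above suggests; latest_checkpoint is a hypothetical helper, not part of the repo.

```python
import re
from pathlib import Path


def latest_checkpoint(model_dir, model_name):
    """Return the path of the checkpoint with the highest epoch number, or None."""
    pattern = re.compile(rf"^{re.escape(model_name)}_(\d+)\.pt$")
    best = None
    for path in Path(model_dir).iterdir():
        match = pattern.match(path.name)
        # Compare epochs numerically so that 100 sorts after 50
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), path)
    return None if best is None else best[1]
```

Comparing the epoch as an integer avoids the usual pitfall of sorting file names as strings, where "50" would come after "100".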
Predicting: same model and size
If you want to use a model model_name that you have trained to predict footprints in the same domain and at the same size, you can pass the parameter file directly to general_make_prediction.py
python general_make_prediction.py training_settings_model_name.json --file_path /path/to/model
This will create a new folder called predictions in the directory. Each file in predictions is a .nc file with
- coords lat, lon and time
- attributes:
  - fp: the true footprints, in the original dataspace
  - transfp: the true footprints, in the transformed dataspace
  - predictions: the predicted footprints, as output by the model in the transformed dataspace
  - transpredictions: the predicted footprints, in the original dataspace
Predicting: different size, different domain etc
To use a trained model model_name to predict with a different set-up (e.g. predict at 200x200 when you trained at 50x50), you will need to create a folder new_model_name containing:
- grid of lat/lon values (used during training for consistency)
- trained input and output transformers
- input, dataset and model parameters
.
└── new_model_name/
    ├── grid_new_model_name.pickle # this is needed if the emulating size is different to the training size
    └── training_settings_new_model_name.json # copy the parameter file from the reference model and add a section
Add the following to training_settings_new_model_name.json, to indicate which model should be used to emulate, and the size that it was trained on
"reference_model": {
"model_name":"model_name_to_predict_from",
"domain_size": 50
}
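Copying the settings file and adding this section can be scripted. The sketch below assumes the settings file is plain JSON with a top-level object; add_reference_model is a hypothetical helper, and the file names in the usage comment are placeholders.

```python
import json
from pathlib import Path


def add_reference_model(settings_path, new_settings_path, reference_model_name, domain_size):
    """Copy a training settings file and add the reference_model section to the copy."""
    settings = json.loads(Path(settings_path).read_text())
    settings["reference_model"] = {
        "model_name": reference_model_name,  # model to emulate from
        "domain_size": domain_size,          # size the reference model was trained on
    }
    Path(new_settings_path).write_text(json.dumps(settings, indent=4))
    return settings


# Example call (placeholder paths):
# add_reference_model("training_settings_model_name.json",
#                     "training_settings_new_model_name.json",
#                     "model_name", 50)
```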
Inference and evaluation - WIP!
After inference, the footprints might need bias correcting. For the inversion, they also need to be integrated into the original domain shape. Use the integrate_fps.py file to apply the bias correction and save the footprints in the original footprint domain.
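The integration step can be pictured as the reverse of the cropping: each square prediction is written back into an array with the shape of the original domain. The sketch below only illustrates that idea with nested lists. It is not the code in integrate_fps.py, it does not attempt the bias correction, and place_in_domain is a hypothetical name.

```python
def place_in_domain(window, row, col, n_rows, n_cols):
    """Write a square window centred on (row, col) into a zero-filled n_rows x n_cols domain."""
    half = len(window) // 2
    domain = [[0.0] * n_cols for _ in range(n_rows)]
    for i, window_row in enumerate(window):
        for j, value in enumerate(window_row):
            r, c = row - half + i, col - half + j
            # Cells that fall outside the original domain are dropped
            if 0 <= r < n_rows and 0 <= c < n_cols:
                domain[r][c] = value
    return domain


window = [[1.0, 2.0], [3.0, 4.0]]
integrated = place_in_domain(window, row=1, col=1, n_rows=3, n_cols=3)
```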
Inversion - TO DO!
Owner
- Login: elenafillo
- Kind: user
- Repositories: 4
- Profile: https://github.com/elenafillo
Citation (CITATION.cff)
cff-version: 1.1.0
authors:
  - family-names: Fillola
    given-names: Elena
    orcid: https://orcid.org/0000-0003-4706-9833
  - family-names: Clark
    given-names: Jeff
    orcid: https://orcid.org/0000-0003-0118-3999
  - family-names: Keshtmand
    given-names: Nawid
title: "GATES: A Graph-Neural-Network Atmospheric Transport Emulation System"
version: v0.1.0-beta
date-released: 2025-07-31
url: "https://github.com/elenafillo/GATES_LPDM_emulator"
GitHub Events
Total
- Release event: 2
- Watch event: 1
- Delete event: 1
- Push event: 22
- Create event: 2
Last Year
- Release event: 2
- Watch event: 1
- Delete event: 1
- Push event: 22
- Create event: 2