spinacc

A spinup acceleration procedure for land surface models (LSM). Developer team: Mandresy Rasolonjatovo, Tianzhang Cai, Matthew Archer, Daniel Goll

https://github.com/calipso-project/spinacc

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: wiley.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.7%) to scientific vocabulary
Last synced: 7 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: CALIPSO-project
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 5.73 MB
Statistics
  • Stars: 4
  • Watchers: 6
  • Forks: 0
  • Open Issues: 49
  • Releases: 0
Created about 5 years ago · Last pushed 9 months ago
Metadata Files
Readme Changelog Contributing Citation

README.md

SPINacc

A spinup acceleration tool for the ORCHIDEE family of land surface models (LSMs).

Concept: The proposed machine-learning (ML)-enabled spin-up acceleration procedure (MLA) predicts the steady state of any land pixel of the full model domain after training on a representative subset of pixels. As the computational cost of the current generation of LSMs scales linearly with the number of pixels and years simulated, MLA reduces the computation time quasi-linearly with the number of pixels predicted by ML.

Documentation of the aims, concepts, and workflows is provided in Sun et al. (2022).


Contents

The SPINacc package includes:

  • main.py - The main Python module that steers the execution of SPINacc.
  • DEF_*/ - Directories with configuration files for each of the supported ORCHIDEE versions.
    • config.py - Settings to configure the machine learning performance.
    • varlist.json - Configures paths to ORCHIDEE forcing output and climate data.
    • varlist-explained.md - Documentation of data sources used in SPINacc.
  • Tools/* - Modules called by main.py.
  • AuxilaryTools/SteadyState_checker.py - Tool to assess the state of equilibration in ORCHIDEE simulations.
  • tests/ - Reproducibility and regression tests.
  • ORCHIDEE_cecill.txt - ORCHIDEE's license file.
  • job - Job file for a bash environment.
  • job_tcsh - Job file for a tcsh environment.

Usage

Running SPINacc

Here are the steps to launch SPINacc end-to-end, including the optional tests.

SPINacc has been tested and developed using Python==3.9.*.

Installation

  1. Navigate to the location in which you wish to install and clone the repo: git clone git@github.com:CALIPSO-project/SPINacc.git
  2. Create a virtual environment and activate it: python3 -m venv ./venv3 && source ./venv3/bin/activate
  3. Install all relevant dependencies: cd SPINacc && pip install -r requirements.txt

Get data from Zenodo

These instructions apply regardless of the system you work on; however, if you already have access to the datasets on the Obelix supercomputer, it is likely that SPINacc will run with minimal modification (see Running on the Obelix Supercomputer if you believe this is the case). We provide a Zenodo repository that contains the forcing data as well as reference output for reproducibility testing.

It includes:

  • ORCHIDEE_forcing_data - Explained in DEF_Trunk/varlist-explained.md.
  • reference data - Necessary to run the reproducibility checks (now OUTDATED; see Reproducibility tests).

The setup-data.sh script has been provided to automate the download of the associated Zenodo repository and to set the paths to the forcing data and climate data in DEF_Trunk/varlist.json. The Zenodo repository does not include the climate data files (variable name twodeg); without these, initialisation will fail and SPINacc will be unable to proceed. The climate data will be made available upon request to Daniel Goll (https://www.lsce.ipsl.fr/en/pisp/daniel-goll/).

To ensure the script works without error, set the MYTWODEG and MYFORCING paths appropriately. The MYFORCING path points to where the forcing data should be extracted; the default location is ORCHIDEE_forcing_data in the project root.

The script runs the sed command to replace all occurrences of /home/surface5/vbastri/ in DEF_Trunk/varlist.json with the path to the downloaded and extracted ORCHIDEE_forcing_data, i.e. /your/path/to/forcing/vlad_files/vlad_files/. This can be done manually if desired.
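The substitution that setup-data.sh performs can be sketched as follows (the target path is illustrative; the script derives it from MYFORCING):

```shell
# Illustrative: preview the substitution on a sample line; setup-data.sh
# applies the same expression to DEF_Trunk/varlist.json with `sed -i`.
FORCING_DIR="/your/path/to/forcing/vlad_files/vlad_files"
echo '"/home/surface5/vbastri/climate.nc"' \
  | sed "s|/home/surface5/vbastri|${FORCING_DIR}|g"
```

Using `|` as the sed delimiter avoids having to escape the slashes inside the paths.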

Running SPINacc

These instructions are designed to get up and running with SPINacc quickly and then run the accompanying tests. See the section below on Obtaining 'best' performance for a more detailed overview of how to optimally adjust ML performance.

  1. In DEF_Trunk/config.py, modify the results_dir variable to point to a different path if desired. To run SPINacc end-to-end, ensure that the tasks are set as follows:

    ```
    tasks = [
        1,  # test clustering
        2,  # clustering
        4,  # ML
        5,  # evaluation / visualisation
    ]
    # 3 = compress forcing (omitted in this example)
    ```

    If running from scratch, ensure that `start_from_scratch` is set to `True` in `config.py`. The `start_from_scratch` step creates a `packdata.nc` file and only needs to be done once for a given version of ORCHIDEE. It is also possible to run just a single task, if desired.

  2. Then run: python main.py DEF_Trunk/. By default, main.py looks for the DEF_Trunk directory. SPINacc supports passing other configuration / job directories as arguments to main.py (e.g. python main.py DEF_CNP2/). It is helpful to create copies of the default configurations and modify them for your own purposes, to avoid continuously stashing work.

    Results are located in your output directory under MLacc_results.csv. Visualisations of R2, slope and dNRMSE for each component can be found in Eval_all_biomassCpool.png, Eval_all_litterCpool.png and Eval_all_somCpool.png.

    For other versions of ORCHIDEE, i.e. CNP2, outputs will be structured similarly.

Set up baseline reproducibility checks

It is possible to run a set of baseline checks that compare the code to the reference output. As of January 2025, the reference dataset has been updated and is now stored at https://github.com/ma595/SPINacc-results for CNP2 and Trunk. We are working towards a new Zenodo release. These tests are useful for ensuring that regressions have not been introduced unexpectedly during development.

  1. Begin by downloading the reference output from GitHub.

    git clone https://github.com/ma595/SPINacc-results

  2. In DEF_Trunk/config.py set the reference_dir variable to point to SPINacc-results/Trunk.

  3. [Optional] To execute the reproducibility checks at runtime ensure that True values are set in all relevant steps in DEF_Trunk/config.py.

  4. Alternatively, the tests can be executed after the successful completion of a run by doing the following:

    pytest --trunk=DEF_Trunk/ -v --capture=sys

    It is possible to point to a different output directory with the --trunk flag.

    To run a single test do:

    pytest --trunk=DEF_Trunk -v --capture=sys ./tests/test_task4.py

    The command line arguments -v and --capture=sys make test output more visible to users.

  5. The configuration config.py on the main branch should be configured correctly; if not, ensure that the following assignments have been made.

    ```
    kmeans_clusters = 4
    max_kmeans_clusters = 9
    random_seed = 1000

    algorithms = ['bt',]
    take_year_average = True
    take_unique = False
    smote_bat = True
    sel_most_PFTs = False
    ```

    The SPINacc-results repo also contains the settings used to obtain the reference output: https://github.com/ma595/SPINacc-results/tree/main/jobs/DEF_Trunk

  6. The checks are as follows:

- `test_init.py`: Performs a recursive comparison of `packdata.nc` against the reference `packdata.nc`.
- `test_task1.py`: Checks `dist_all.npy` against the reference.
- `test_task2.py`: Checks `IDloc.npy`, `IDSel.npy` and `IDx.npy` against the reference.
- `test_task3.py`: Currently not checked.
- `test_task4.py`: Compares the new `MLacc_results.csv` across all components. Tolerance is 1e-2.
- `test_task4_2.py`: Compares the updated restart file `SBG_FGSPIN.340Y.ORC22v8034_22501231_stomate_rest.nc` to the reference.

Automatic testing

An automated test that runs the entire DEF_Trunk pipeline from end-to-end is executed when a release is tagged. It can be forced to run using GitHub's command line tool gh (see the official documentation for how to install it on your system). The workflow's runs can then be inspected as follows:

gh run list --workflow=build-and-run.yml

Configuration of SPINacc

The following settings can change the performance of SPINacc:

  • algorithms: ML algorithms. Multiple can be selected for any given run. The results will be stacked in the MLacc_results.csv. Options include:
    • bt: Bagging tree
    • rf: Random forest
    • nn: Neural network
    • ridge : Ridge regression
    • best : A 'shotgun' approach that selects the best-performing ML algorithm for the given target variable. This is assessed based on performance on a subset of the data (see select_best_model in train.py), so worse performance may be exhibited on some variables compared to selecting bt directly.
  • take_year_average (required): If True, all annual data is averaged into a single year's worth of data. If False, all years are used; this has the effect of multiplying the quantity of training data, X, for a given target variable Y, by the number of years.
  • smote_bat (required): Synthetic minority oversampling (SMOTE).
  • take_unique (default - True): Takes unique pixels only from the output of the clustering step; this reduces the number of selected pixels by removing duplicates. This function was kept for correspondence with a previous implementation of SPINacc.
  • old_cluster (default - True): If True, the clustering step uses the old clustering method, i.e. it randomly samples Nc examples, or takes all samples if the number of samples is less than Nc. If old_cluster = False, the new clustering method takes max(Nc, 20% subset of locations).
  • sel_most_PFT_sites (default - False): If True and old_cluster = False, it will preferentially select samples that contain more PFTs using the 20% rule detailed previously. If old_cluster = True and sel_most_PFT_sites = True, an error is thrown.
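The effect of take_year_average on the quantity of training data can be illustrated with a toy array (the shapes below are hypothetical; SPINacc's real arrays come from the forcing data):

```python
import numpy as np

# Hypothetical forcing: 11 years x 100 pixels x 5 predictor variables
years, pixels, nvar = 11, 100, 5
X = np.random.rand(years, pixels, nvar)

# take_year_average = True: average over years -> one sample per pixel
X_avg = X.mean(axis=0)
print(X_avg.shape)   # (100, 5)

# take_year_average = False: keep every year -> years * pixels samples
X_all = X.reshape(years * pixels, nvar)
print(X_all.shape)   # (1100, 5)
```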

We recommend always setting parallel = True in config.py to speed up the execution of SPINacc. Serial and parallel execution give exactly the same results; however, it may sometimes be useful to turn parallelism off for debugging purposes.

Obtaining best performance

The following settings are recommended to obtain best machine learning performance with SPINacc. Note that training time will be longer with take_year_average set to False.

```
algorithms = ["best"]
take_year_average = False  # this will take much longer to finish
take_unique = True
smote_bat = True
```

A new clustering approach is still being tested to see if performance is improved. See PR #93. To test the new implementation set the following:

```
sel_most_PFTs = True
old_cluster = False
```

Running on the Obelix Supercomputer

If you are already using the Obelix supercomputer, it is likely that SPINacc will work without much adjustment to the varlist.json file.

Jobs can be submitted using the provided PBS script, job:

  • In job, set: setenv dirpython '/your/path/to/SPINacc/' and setenv dirdef 'DEF_Trunk/'
  • Then launch your first job using qsub -q short job, for task 1.
  • For tasks 3 and 4, it is better to use qsub -q medium job.

Overview of the individual tasks

An overview of the tasks is provided as follows:

Task 0: Initialisation

Extracts climatic variables over 11 years and stores them in a packdata.nc file. Subsequent steps are unable to proceed unless this step completes successfully.

Task 1: Optional clustering step

Evaluates the impact of varying the number of K-means clusters on model performance, setting a default of 4 clusters and producing a ‘dist_all.png’ graph.


Task 2: Clustering

Performs the clustering using a K-means algorithm and saves the locations of the selected pixels (files starting with 'ID'). The location of the selected pixels (red) for a given PFT, together with all pixels whose cover fraction exceeds the 'clusterthres' threshold defined in varlist.json, are plotted in the 'ClustRes_PFT*.png' figures. An example for PFT2 is shown here:

(Figure: ClustRes_PFT2)
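A minimal sketch of the kind of selection this task performs, using scikit-learn's KMeans on synthetic per-pixel features (the feature matrix and cluster count are illustrative, not SPINacc's actual inputs):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((500, 4))  # hypothetical climate features per pixel

# Cluster the pixels, then keep the pixel nearest each cluster centre
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
selected = []
for c, centre in enumerate(km.cluster_centers_):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(features[members] - centre, axis=1)
    selected.append(int(members[np.argmin(dists)]))
print(sorted(selected))  # indices of 4 representative pixels
```

The actual task additionally accounts for PFT cover fractions (see 'clusterthres' in varlist.json); this sketch only shows the nearest-to-centre idea.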

Task 3: Compressed forcing

Creates compressed forcing files for ORCHIDEE, containing data for selected pixels only, aligned on a global pseudo-grid for efficient pixel-level simulations, with file specifications listed in varlist.json.

Task 4: Machine learning

  • Performs the ML training on results from ORCHIDEE simulation using the compressed forcing (production mode: resp-format=compressed) or global forcing (debug mode: resp-format=global).
  • Extrapolation to a global grid.
  • Writes the state variables into global restart files for ORCHIDEE. For Trunk, this is SBG_FGSPIN.340Y.ORC22v8034_22501231_stomate_rest.nc.
  • Evaluates ML training outputs vs real model outputs and writes performance metrics to MLacc_results.csv.
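The 'bt' (bagging tree) option corresponds to an ensemble like scikit-learn's BaggingRegressor, whose default base estimator is a decision tree. A sketch on synthetic data (the features and target below are made up; SPINacc trains on ORCHIDEE state variables):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.random((200, 6))                                  # predictors
y = X @ rng.random(6) + 0.05 * rng.standard_normal(200)   # target pool

# Train on 150 pixels, evaluate on the remaining 50
model = BaggingRegressor(n_estimators=50, random_state=0)
model.fit(X[:150], y[:150])
print(f"R2 = {r2_score(y[150:], model.predict(X[150:])):.2f}")
```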

Task 5: Optional visualisation

This task visualises ML performance from Task 4, offering two evaluation modes: global pixel evaluation and leave-one-out cross-validation (LOOCV) at the training sites. It generates plots for various state variables at the PFT level, including comparisons of ML predictions with conventional spinup data.
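The LOOCV mode can be sketched with scikit-learn's LeaveOneOut splitter (the site count and features below are hypothetical):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.random((30, 4))                  # 30 hypothetical training sites
y = X.sum(axis=1) + 0.1 * rng.standard_normal(30)

# Each site is predicted by a model trained on the other 29 sites
pred = cross_val_predict(BaggingRegressor(random_state=0), X, y,
                         cv=LeaveOneOut())
print(f"LOOCV R2 = {r2_score(y, pred):.2f}")
```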


Owner

  • Name: CALIPSO-project
  • Login: CALIPSO-project
  • Kind: organization

Citation (CITATION.CFF)

cff-version: 1.0.0
message: If you use this software, please cite it as below.
title: SPINacc
doi: TBD
authors:
  - name: Daniel Goll, Yan Sun
    orcid: https://orcid.org/0000-0001-9246-9671
version: 1.0.0
license: ORCHIDEE_cecill.txt
repository-code: git@github.com:dsgoll123/SPINacc.git
date-released: 2023-01-08

GitHub Events

Total
  • Create event: 26
  • Commit comment event: 3
  • Issues event: 30
  • Watch event: 2
  • Delete event: 13
  • Member event: 1
  • Issue comment event: 41
  • Push event: 105
  • Pull request event: 30
  • Pull request review event: 24
  • Pull request review comment event: 24
Last Year
  • Create event: 26
  • Commit comment event: 3
  • Issues event: 30
  • Watch event: 2
  • Delete event: 13
  • Member event: 1
  • Issue comment event: 41
  • Push event: 105
  • Pull request event: 30
  • Pull request review event: 24
  • Pull request review comment event: 24

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 10
  • Total pull requests: 14
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 3 months
  • Total issue authors: 3
  • Total pull request authors: 3
  • Average comments per issue: 0.1
  • Average comments per pull request: 1.07
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 10
  • Pull requests: 13
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 2 months
  • Issue authors: 3
  • Pull request authors: 3
  • Average comments per issue: 0.1
  • Average comments per pull request: 0.46
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ma595 (32)
  • tztsai (4)
  • dsgoll123 (2)
  • Mandresy-code (2)
  • TomMelt (1)
Pull Request Authors
  • ma595 (21)
  • tztsai (6)
  • dsgoll123 (2)
  • Mandresy-code (2)
Top Labels
Issue Labels
iccs (4) good first issue (1)
Pull Request Labels
iccs (6) enhancement (1) good first issue (1)

Dependencies

requirements.txt pypi
  • imblearn ==0.0
  • matplotlib ==3.2.2
  • netCDF4 ==1.6.5
  • netCDF4 ==1.5.3
  • numpy ==1.18.5
  • pandas ==1.0.5
  • scikit_learn ==0.23.1
  • scipy ==1.5.0
  • semap ==1.2.1
  • torch ==2.1.1