nn-toc

A deep neural network based approach for the geospatial prediction of total organic carbon percentages in marine sediments.

https://github.com/paramnav/nn-toc

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: paramnav
  • License: other
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 304 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Global Prediction Of Total Organic Carbon In Marine Sediments Using Deep Neural Networks (nn-toc)

DOI: https://doi.org/10.3289/SW32024

Use of this codebase in research publication requires citation as follows:

Parameswaran, Naveenkumar, Gonzales, Everardo, Burwicz-Galerne, Ewa, Braack, Malte and Wallmann, Klaus (2024) Global Prediction Of Total Organic Carbon In Marine Sediments Using Deep Neural Networks (nn-toc). DOI 10.3289/SW32024


This repository provides a deep neural network based approach for the geospatial prediction of total organic carbon (TOC) percentages in marine sediments. To run the repository, the features and labels are first downloaded from the data source given below. The data is then pre-processed using the notebooks and scripts provided. We also provide scripts to train the models and to predict TOC percentages using Deep Neural Networks (DNNs), K Nearest Neighbours (KNN) and Random Forests (RF), and we compare these methods based on model performance. The uncertainty of the deep learning model is evaluated using Monte Carlo dropout. Information gain is used to quantify this uncertainty; it gives the expected knowledge gain from sampling at a given location in the ocean. The project organisation is shown below.

Project Organization

LICENSE
README.md              <- The top-level README for the repository description
data
    interim            <- Intermediate data that has been transformed. Includes preprocessed features, selected features, masks for different marine regions, etc.
    output             <- Final results from the model runs, including the prediction maps, correlation plots, model performance, feature importance, etc.
    raw                <- The original, immutable data as downloaded. Includes features and labels.

models                 <- Trained deep learning models, or model summaries

notebooks
    TOC                <- Notebooks to preprocess the data, run the models and postprocess the results
        MakeTrainingFeatures.ipynb
        MakeGlobalFeatures.ipynb
        TOC_NN_CS.ipynb
        TOC_NN_DO.ipynb
        TOC_NN_entire.ipynb
        TOC_KNN_CS.ipynb
        TOC_KNN_DO.ipynb
        TOC_RF_CS.ipynb
        TOC_RF_DO.ipynb
        Visualisation.ipynb
        InfoGain.ipynb
        ExtractLabels.ipynb
        Infogain Experiment
            TOC_NN_CS_firsthalf.ipynb
            TOC_NN_CS_secondhalf.ipynb
            TOC_NN_DO_firsthalf.ipynb
            TOC_NN_DO_secondhalf.ipynb
            TOC_NN_DO_2_3.ipynb
            TOC_NN_CS_2_3.ipynb
reports                <- Generated analysis as HTML, PDF, LaTeX, etc.
    figures            <- Generated graphics and figures to be used in reporting

requirements.txt       <- The requirements file for reproducing the environment, e.g.
                          generated with `pip freeze > requirements.txt`
src                    <- Source code for use in this project
    preprocessing      <- Scripts to preprocess features and labels
        makeGlobalFeatures.py
        makeTrainingFeatures.py

    postprocessing     <- Scripts to postprocess and visualise model results
        getmodelPerformance.py
        CurveFitting.py
        makeTrainingFeatures.py

    models             <- Scripts to train DNN models, use the trained models to make predictions, and explain the trained model
        trainDNN.py
        predictDNN.py
        explainDNN.py

With these Python files and notebooks, we provide instructions for reproducing the methods and results that are described in detail in the submitted paper (Parameswaran et al. 2024, GMD).

  • Start by cloning/forking the repo

git clone https://git.geomar.de/open-source/nn-toc.git

  • Enter the local repository

cd nn-toc/

  • Create a virtual environment using conda with the dependencies listed in the nn-toc.yml file

conda env create --file nn-toc.yml

  • Activate the virtual environment

conda activate nn-toc

  • The entire set of features and labels used in the project can be accessed from https://zenodo.org/records/11186224 (https://doi.org/10.5281/zenodo.11186224).

  • Download the data folder and place it inside nn-toc so that the models can be run, and make sure that the folder structure of data shown above is preserved.
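A quick way to confirm that the downloaded data landed in the right place is to check for the sub-folders named in the project organisation. The sketch below assumes it is run from the repository root; the helper name `missing_data_dirs` is hypothetical and not part of the repository, and the tuple of folder names should be adjusted if the archive is laid out differently.

```python
from pathlib import Path

# Sub-folders of data/ named in the project organisation (assumed complete).
EXPECTED_SUBDIRS = ("raw", "interim", "output")


def missing_data_dirs(repo_root="."):
    """Return the expected data sub-folders that are absent under repo_root/data."""
    data_dir = Path(repo_root) / "data"
    return [name for name in EXPECTED_SUBDIRS if not (data_dir / name).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing data sub-folders:", ", ".join(missing))
    else:
        print("data/ layout looks complete.")
```

An empty list means all three folders were found.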

A) Make Training Features

To pre-process the features for training the different models, we use nn-toc/notebooks/TOC/MakeTrainingFeatures.ipynb.

This notebook performs feature selection, extracts the selected features at the measurement locations, and creates and saves the feature-label dataset for the models.

This notebook uses the functions from src/preprocessing/makeTrainingFeatures.py.
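Extracting a feature at the measurement locations can be pictured as a nearest-cell lookup on a gridded feature. The following is a minimal NumPy sketch under assumed conditions: a regular latitude/longitude grid, a single feature, and a hypothetical function name `sample_nearest`. It is not the implementation in src/preprocessing/makeTrainingFeatures.py, which may work differently, and it ignores longitude wrap-around.

```python
import numpy as np


def sample_nearest(grid, grid_lats, grid_lons, point_lats, point_lons):
    """Pick the grid value of the cell centre closest to each point.

    grid has shape (len(grid_lats), len(grid_lons)); all names are illustrative.
    """
    point_lats = np.asarray(point_lats, dtype=float)
    point_lons = np.asarray(point_lons, dtype=float)
    # Index of the nearest cell centre along each axis, one per point.
    row = np.abs(grid_lats[:, None] - point_lats[None, :]).argmin(axis=0)
    col = np.abs(grid_lons[:, None] - point_lons[None, :]).argmin(axis=0)
    return grid[row, col]


# Toy 3 x 4 grid standing in for one feature raster.
lats = np.array([-10.0, 0.0, 10.0])
lons = np.array([0.0, 10.0, 20.0, 30.0])
feature = np.arange(12, dtype=float).reshape(3, 4)

# Two toy measurement locations given as (lat, lon) pairs.
values = sample_nearest(feature, lats, lons, [9.0, -8.0], [11.0, 29.0])
```

Repeating the lookup for every selected feature and stacking the results column-wise would give one feature row per measurement location.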

B) Make Global Features

The global prediction requires the features at global coverage.

To make the global features for prediction with the different models, we use nn-toc/notebooks/TOC/MakeGlobalFeatures.ipynb.

To ease the memory requirements, we create chunks of global features.

This notebook uses functions from src/preprocessing/makeGlobalFeatures.py.
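The idea of chunking can be sketched as follows: the global feature table is processed in consecutive row blocks so that only one block has to be held in memory at a time. The chunk size, array shapes and the function name `iter_chunks` are assumptions for illustration; the chunking in src/preprocessing/makeGlobalFeatures.py may be organised differently.

```python
import numpy as np


def iter_chunks(features, chunk_size):
    """Yield consecutive row blocks of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, features.shape[0], chunk_size):
        yield features[start:start + chunk_size]


# Toy stand-in for a global feature table: 10 grid cells x 3 features.
global_features = np.arange(30, dtype=float).reshape(10, 3)

chunk_shapes = [chunk.shape for chunk in iter_chunks(global_features, 4)]
```

Each chunk can then be passed to the model separately and the predictions concatenated in the same order.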

C) Running the different models.

Each method (Deep Neural Network: NN, K Nearest Neighbours: KNN, Random forests: RF) has a separate model for deep ocean and continental shelves.

The notebooks are named nn-toc/notebooks/TOC/TOC_<methodName>_<MarineRegion>.ipynb, for example TOC_NN_CS.ipynb or TOC_RF_DO.ipynb.

Each notebook loads the features, builds the model, trains the model, evaluates the model performance, computes the feature importance (only for the DNN), and predicts the total organic carbon concentrations globally using the model.

The DNN notebooks use the functions from src/models/trainDNN.py, src/models/predictDNN.py, and src/models/explainDNN.py.

Please note that some of the processes in these notebooks are GPU intensive.

The section of the notebook that compiles the ensemble of predictions from the Monte Carlo dropout is memory intensive.
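Monte Carlo dropout keeps dropout active at prediction time and repeats the forward pass many times, so that the spread of the outputs can serve as an uncertainty estimate. The sketch below illustrates this with a tiny, untrained network written in plain NumPy to stay self-contained; the weights, layer sizes, dropout rate, number of passes and function names are all assumptions, and this is not the code in src/models/predictDNN.py.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny untrained network with random weights: 5 features -> 16 hidden -> 1 output.
W1, b1 = rng.normal(size=(5, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)


def forward_with_dropout(x, drop_prob, rng):
    """One stochastic forward pass with dropout applied to the hidden layer."""
    hidden = np.maximum(x @ W1 + b1, 0.0)          # ReLU activation
    keep = rng.random(hidden.shape) >= drop_prob   # Bernoulli keep mask
    hidden = hidden * keep / (1.0 - drop_prob)     # inverted-dropout scaling
    return (hidden @ W2 + b2).ravel()


def mc_dropout_predict(x, n_passes=100, drop_prob=0.2, rng=rng):
    """Stack n_passes stochastic predictions; return ensemble, mean and std."""
    ensemble = np.stack([forward_with_dropout(x, drop_prob, rng)
                         for _ in range(n_passes)])
    return ensemble, ensemble.mean(axis=0), ensemble.std(axis=0)


x = rng.normal(size=(8, 5))                 # 8 toy locations, 5 features each
ensemble, mean, std = mc_dropout_predict(x)
```

The ensemble array has one row per forward pass and one column per location, so its size grows with both the number of passes and the number of grid cells, which is why compiling it for a global prediction is memory intensive.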

D) Information Gain

With the compiled ensemble of prediction distributions, we can obtain the information gain map. To do so, run the script:

python3 src/postprocessing/CurveFitting.py

The script uses concurrent.futures for multi-processing and is computationally intensive.
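The general pattern of spreading per-location work across processes with concurrent.futures can be sketched as below. It assumes, purely for illustration, that a normal distribution is fitted to each location's ensemble; what src/postprocessing/CurveFitting.py actually fits is not described here, and the names `fit_normal` and `fit_all` are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor
from statistics import fmean, pstdev


def fit_normal(samples):
    """Return (mean, standard deviation) of one location's ensemble samples."""
    return fmean(samples), pstdev(samples)


def fit_all(ensembles, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Fit every location's ensemble in parallel; results keep the input order."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(fit_normal, ensembles))
```

When process pools are used, `fit_all` should be called from code guarded by `if __name__ == "__main__":`, since some platforms start worker processes by re-importing the main module. Passing `ThreadPoolExecutor` as `executor_cls` is a convenient way to try the function on small inputs.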

After this, use the Jupyter notebook nn-toc/notebooks/TOC/InfoGain.ipynb.

This notebook evaluates the KL divergence (or information gain) between the true distribution and the predicted distribution from the Monte Carlo dropout ensemble.
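For intuition, the KL divergence has a closed form when both distributions are univariate normals: KL(P || Q) = ln(σ_Q/σ_P) + (σ_P² + (μ_P − μ_Q)²) / (2σ_Q²) − 1/2. The sketch below implements that formula. The Gaussian assumption, the function name `kl_normal` and the choice of which distribution plays the role of P or Q are illustrative assumptions; the notebook may use a different distribution family or a numerical estimate.

```python
import math


def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """KL(P || Q) in nats for univariate normals P and Q."""
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


# Identical distributions give zero divergence.
same = kl_normal(1.0, 0.5, 1.0, 0.5)
# A shifted and wider second distribution gives a positive divergence.
different = kl_normal(1.0, 0.5, 1.5, 1.0)
```

Evaluating such a divergence at every grid cell would yield a map in which larger values mark locations where the two distributions disagree most.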

To check whether information gain actually improves the model, we ran an experiment comparing the outputs of a model trained on points with high information gain against those of a model trained on points with low information gain. This experiment is included in /notebooks/TOC/Infogain\ Experiment/.

E) Visualisation

We visualise the results from the different methods in nn-toc/notebooks/TOC/Visualisation.ipynb, which uses the functions from src/postprocessing/visualizePredictionMap.py.

We calculate the TOC stock globally and in different marine regions and tabulate the results in the manuscript.

Owner

  • Name: Naveen Kumar
  • Login: paramnav
  • Kind: user
  • Location: Kiel, Germany
  • Company: GEOMAR Helmholtz Centre for Ocean Research

Doctoral Researcher at GEOMAR Helmholtz Zentrum Kiel, Germany; M.Sc in Computational Sciences in Engineering at TU Braunschweig; B.Tech in Civil Engineering NITK.

GitHub Events

Total
  • Push event: 8
Last Year
  • Push event: 8

Dependencies

requirements.txt pypi
  • Shapely ==2.0.4
  • cmcrameri ==1.8
  • cmocean ==3.1.3
  • geopandas ==0.14.1
  • matplotlib ==3.8.0
  • numpy ==1.25.2
  • pandas ==2.2.2
  • rasterio ==1.3.9
  • scikit_learn ==1.2.2
  • scipy ==1.13.0
  • seaborn ==0.13.2
  • setuptools ==68.2.2
  • shap ==0.42.1
  • torch ==2.1.1
  • tqdm ==4.65.0
  • xarray ==0.20.2
setup.py pypi