https://github.com/bioatmosphere/adam
Agentic Data Aggregation and Modelling
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.6%) to scientific vocabulary
Repository
Agentic Data Aggregation and Modelling
Basic Info
- Host: GitHub
- Owner: bioatmosphere
- Language: Python
- Default Branch: main
- Size: 15 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
ADAM
Agentic Data Aggregation and Modelling
Project Overview
ADAM is a data synthesis and machine learning pipeline for predicting Below-ground Net Primary Productivity (BNPP) across global ecosystems. The project integrates multiple global datasets and applies twelve machine learning models to create benchmark predictions for ecosystem productivity research.
Key Features
- Multi-source Data Integration: Combines forest, grassland, climate, soil, and elevation datasets from global sources
- 4-Stage Pipeline: Automated workflow from data retrieval to global application
- 12 Machine Learning Models: Comprehensive benchmark suite including tree-based, neural network, linear, and ensemble methods
- Advanced Data Cleaning: Automated outlier detection with multiple statistical methods
- Elevation Integration: SRTM digital elevation model as topographic predictor
- Global Predictions: Applies trained models to worldwide land cover data
- Scientific Reproducibility: Comprehensive logging and modular architecture
Quick Start
Prerequisites
- Python 3.8+
- Required packages: xarray, cartopy, scikit-learn, pandas, matplotlib, xgboost, pytorch
- GDAL for geospatial data processing
Running the Pipeline
```bash
# Run the complete 4-stage pipeline
python src/main.py --verbose

# Run individual stages
python src/main.py --stage 1  # Data retrieval
python src/main.py --stage 2  # Data integration
python src/main.py --stage 3  # Machine learning
python src/main.py --stage 4  # Global application
```
Training Individual Models
```bash
# Train specific models (run from src/models/ directory)
python RF_model.py            # Random Forest
python XGBoost_model.py       # XGBoost
python LightGBM_model.py      # LightGBM
python CatBoost_model.py      # CatBoost
python MLP_model.py           # Multi-Layer Perceptron
python TabNet_model.py        # TabNet
python TabPFN_model.py        # TabPFN (Prior-Fitted Networks)
python DeepEnsemble_model.py  # Deep Ensemble
python ElasticNet_model.py    # Elastic Net
python Ridge_model.py         # Ridge Regression
python SVM_model.py           # Support Vector Machine
python AutoGluon_model.py     # AutoGluon AutoML

# Apply models globally (run from src/models/application/ directory)
python apply_RF_globally.py       # Random Forest global predictions
python apply_MLP_globally.py      # MLP global predictions
python apply_XGBoost_globally.py  # XGBoost global predictions
```
Directory Structure
ADAM/
├── src/                          # Source code
│   ├── main.py                   # Pipeline orchestrator
│   ├── data_aggregation.py       # Data integration
│   ├── bnpp/                     # BNPP data processing
│   ├── ancillary/                # Climate & soil data
│   ├── landcover/                # Land cover processing
│   └── models/                   # ML implementations
│       ├── RF_model.py           # Random Forest model
│       ├── MLP_model.py          # Multi-Layer Perceptron
│       ├── XGBoost_model.py      # XGBoost model
│       └── application/          # Global model application
│           ├── apply_RF_globally.py
│           ├── apply_MLP_globally.py
│           └── apply_XGBoost_globally.py
├── ancillary/                    # Environmental data
│   ├── terraclimate/             # Climate variables
│   ├── soilgrids/                # Soil properties
│   ├── glass/                    # GPP satellite data
│   └── elevation_points.csv      # SRTM elevation data
├── productivity/                 # Productivity datasets
│   ├── forc/                     # ForC forest data
│   ├── grassland/                # Grassland BNPP
│   └── globe/                    # Global datasets
├── landcover/                    # Land cover data
│   ├── synmap/                   # SYNMAP vegetation
│   └── data/                     # Biome classifications
└── output/                       # Pipeline outputs
    ├── processed_data/
    ├── integrated_data/
    ├── models/
    └── global_predictions/
Data Sources
- ForC: Global forest carbon database
- TerraClimate: Climate variables (1958-2019)
- SoilGrids: Global soil information
- GLASS: Global Land Surface Satellite products
- SYNMAP: Global vegetation mapping
- SRTM: Shuttle Radar Topography Mission elevation data
- Gherardi-Sala: Grassland belowground productivity database
Pipeline Stages
- Data Retrieval: Download and process source datasets
- Data Integration: Spatially align and merge all data sources with outlier detection
- Machine Learning: Train 12 machine learning models with comprehensive benchmarking
- Global Application: Apply trained models to create worldwide BNPP maps
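The `--stage` and `--verbose` flags shown under Running the Pipeline suggest a simple dispatch pattern. The following is a minimal sketch of how such an orchestrator could select stages; the `STAGES` table and the `parse_args`, `select_stages`, and `run` functions are hypothetical names for illustration and are not taken from `src/main.py`.

```python
import argparse

# Hypothetical stage table: the labels mirror the four stages listed above,
# but the dispatch logic is an illustration, not the project's actual code.
STAGES = {
    1: "Data Retrieval",
    2: "Data Integration",
    3: "Machine Learning",
    4: "Global Application",
}


def parse_args(argv=None):
    """Parse the two flags used in the README commands."""
    parser = argparse.ArgumentParser(description="Run the ADAM pipeline (sketch)")
    parser.add_argument("--stage", type=int, choices=sorted(STAGES),
                        help="run a single stage; omit to run all four in order")
    parser.add_argument("--verbose", action="store_true",
                        help="print a progress message for each stage")
    return parser.parse_args(argv)


def select_stages(stage=None):
    """Return the stage numbers to run, in pipeline order."""
    return [stage] if stage is not None else sorted(STAGES)


def run(argv=None):
    """Walk through the selected stages and return the numbers that ran."""
    args = parse_args(argv)
    executed = []
    for number in select_stages(args.stage):
        if args.verbose:
            print(f"Stage {number}: {STAGES[number]}")
        executed.append(number)  # a real stage would do its work here
    return executed


print(run(["--stage", "2", "--verbose"]))
```

Passing an empty argument list to `run` walks through all four stages in order, which matches the behaviour of the full-pipeline command.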
Machine Learning Models
The pipeline implements twelve state-of-the-art machine learning models for comprehensive benchmarking:
Tree-Based Models
- Random Forest (src/models/RF_model.py): Ensemble method with hyperparameter tuning via GridSearchCV
- XGBoost (src/models/XGBoost_model.py): Gradient boosting with advanced regularization and early stopping
- LightGBM (src/models/LightGBM_model.py): Fast gradient boosting framework optimized for efficiency
- CatBoost (src/models/CatBoost_model.py): Gradient boosting with categorical feature handling
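As an illustration of hyperparameter tuning with GridSearchCV, here is a minimal sketch using scikit-learn on synthetic data. The parameter grid, the fold count, and the RMSE scoring choice are assumptions made for the example; the settings actually used in `RF_model.py` are not stated in this README.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the predictor table (not BNPP data).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)

# Assumed, deliberately small grid so the example runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, so RMSE is negated
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_             # flip the sign back to a positive RMSE
print(best_params, round(best_rmse, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all of the data with the selected parameters.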
Neural Network Models
- Multi-Layer Perceptron (src/models/MLP_model.py): Deep learning with PyTorch and dropout regularization
- TabNet (src/models/TabNet_model.py): Attention-based neural network for tabular data with interpretability
- TabPFN (src/models/TabPFN_model.py): Prior-Fitted Networks with zero-shot learning capabilities
- Deep Ensemble (src/models/DeepEnsemble_model.py): Ensemble of neural networks for uncertainty quantification
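To show how an ensemble can yield an uncertainty estimate, the sketch below summarizes the predictions of several independently trained members by their mean and spread. It assumes the simplest variant, in which uncertainty is the standard deviation across member predictions; the function name and the numbers are made up for illustration, and `DeepEnsemble_model.py` may use a different formulation.

```python
from statistics import mean, pstdev


def ensemble_summary(member_predictions):
    """Combine per-member predictions into a mean and a spread per sample.

    member_predictions: list of equal-length lists, one per ensemble member.
    Returns (means, spreads), each with one value per sample.
    """
    per_sample = list(zip(*member_predictions))   # regroup by sample
    means = [mean(values) for values in per_sample]
    spreads = [pstdev(values) for values in per_sample]
    return means, spreads


# Three hypothetical members predicting two samples (illustrative values only).
preds = [
    [100.0, 250.0],
    [110.0, 250.0],
    [90.0, 250.0],
]
means, spreads = ensemble_summary(preds)
print(means, spreads)
```

Members that disagree on a sample produce a larger spread, while full agreement, as for the second sample here, gives a spread of zero.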
Linear and Kernel Models
- Elastic Net (src/models/ElasticNet_model.py): Regularized linear regression with L1/L2 penalties
- Ridge Regression (src/models/Ridge_model.py): L2-regularized linear regression with cross-validation
- Support Vector Machine (src/models/SVM_model.py): Non-linear regression with RBF kernel
AutoML Framework
- AutoGluon (src/models/AutoGluon_model.py): Automated machine learning with model stacking and ensemble
Global Application (src/models/application/)
- Dedicated scripts for worldwide model application
- Climate and satellite data integration
- 0.5-degree resolution global predictions
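A 0.5-degree global grid has 360 rows and 720 columns. The sketch below builds the cell-centre coordinates and maps a point to its cell. It assumes a regular latitude/longitude grid with row 0 at the northern edge; `grid_centers` and `cell_index` are hypothetical helpers and are not part of the application scripts.

```python
def grid_centers(resolution=0.5):
    """Return cell-centre latitudes (north to south) and longitudes (west to east)."""
    n_lat = int(round(180 / resolution))
    n_lon = int(round(360 / resolution))
    lats = [90 - resolution / 2 - i * resolution for i in range(n_lat)]
    lons = [-180 + resolution / 2 + j * resolution for j in range(n_lon)]
    return lats, lons


def cell_index(lat, lon, resolution=0.5):
    """Return (row, col) of the grid cell containing a point, row 0 = northernmost."""
    n_lat = int(round(180 / resolution))
    n_lon = int(round(360 / resolution))
    # min() keeps points on the southern/eastern edge inside the last cell.
    row = min(int((90 - lat) / resolution), n_lat - 1)
    col = min(int((lon + 180) / resolution), n_lon - 1)
    return row, col


lats, lons = grid_centers()
row, col = cell_index(35.93, -84.31)   # an arbitrary example point
print(len(lats), len(lons), row, col, lats[row], lons[col])
```

Each trained model would then be evaluated once per land cell, using predictor values sampled at these cell centres.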
Data Processing and Quality Control
Outlier Detection
The pipeline includes comprehensive outlier detection with multiple methods:
- IQR Method: Interquartile Range-based statistical outliers
- Z-Score Method: Standard deviation-based outliers
- Modified Z-Score: Median Absolute Deviation-based detection
- Domain-Based: Ecological constraints on BNPP values
- Geographic: Spatial clustering of outliers
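The three statistical methods in the list above can be sketched with the standard library alone. The thresholds used here (1.5 × IQR, |z| > 3, modified |z| > 3.5) are common conventions and are assumptions, not values confirmed by the pipeline; the domain-based and geographic checks are not covered, and the sample values are made up.

```python
from statistics import mean, median, pstdev, quantiles


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


def modified_zscore_outliers(values, threshold=3.5):
    """Flag values using the median and the Median Absolute Deviation (MAD)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    # 0.6745 rescales the MAD to be comparable with a standard deviation.
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


# Illustrative values only (not real BNPP measurements).
sample = [120, 135, 150, 160, 175, 190, 210, 2500]
iqr_flags = iqr_outliers(sample)
z_flags = zscore_outliers(sample)
mz_flags = modified_zscore_outliers(sample)
print(iqr_flags, z_flags, mz_flags)
```

On this small sample the IQR and modified Z-score methods flag the value 2500, while the plain Z-score does not, because the extreme value inflates the standard deviation it is measured against. This is one reason to combine several methods.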
Environmental Predictors (18 features)
- Climate (6): Actual evapotranspiration, potential evapotranspiration, precipitation, max/min temperature, vapor pressure deficit
- Satellite (1): Yearly Gross Primary Productivity from GLASS
- Soil Properties (4): Carbon stock, clay/silt/sand content
- Soil Chemistry (3): Nitrogen content, cation exchange capacity, pH
- Soil Physics (3): Bulk density, coarse fragments, soil moisture
- Topography (1): Elevation from SRTM
Output
The pipeline generates:
- Cleaned datasets with outlier removal (1,367 samples from 1,482 original)
- 12 trained machine learning models with comprehensive summaries
- Global BNPP prediction maps at 0.5-degree resolution
- Model evaluation metrics and cross-validation results
- Feature importance analysis and visualization plots
- Comprehensive execution logs and model benchmarking
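The README does not name the evaluation metrics. As an assumption, the sketch below computes two that are commonly reported for regression benchmarks, RMSE and R², along with the share of samples retained after cleaning. The function names and the observed/predicted values are illustrative only.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between observations and predictions."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted values, for illustration only.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
error = rmse(observed, predicted)
fit = r2_score(observed, predicted)

# Share of samples kept after outlier removal, from the counts stated above.
retained_fraction = 1367 / 1482
print(error, fit, round(retained_fraction, 3))
```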
Development
For development guidance and detailed architecture information, see CLAUDE.md.
Citation
If you use this code or data in your research, please cite:
[Citation information to be added]
License
[License to be specified]
Contact
For questions or collaboration opportunities, please open an issue on this repository.
Owner
- Name: Bin Wang
- Login: bioatmosphere
- Kind: user
- Location: Oak Ridge
- Website: https://bwangecology.wordpress.com/
- Twitter: bioatmo_sphere
- Repositories: 4
- Profile: https://github.com/bioatmosphere
Studying biosphere-atmosphere interactions with interwoven theory- and data-driven approaches
GitHub Events
Total
- Push event: 9
- Create event: 2
Last Year
- Push event: 9
- Create event: 2