mangrove
Using Graph Neural Networks to regress baryonic properties directly from full dark matter merger trees.
Science Score: 41.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.9%) to scientific vocabulary
Repository
Using Graph Neural Networks to regress baryonic properties directly from full dark matter merger trees.
Basic Info
- Host: GitHub
- Owner: astrockragh
- License: MIT
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://arxiv.org/abs/2210.13473
- Size: 100 MB
Statistics
- Stars: 24
- Watchers: 3
- Forks: 2
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Mangrove!

Short story: Using Dark Matter Merger Trees to infer galaxy properties works much better than everything else out there! Check out our Google Colab Tutorial on how to use the model at https://colab.research.google.com/drive/1XhZyH71svBaPXqIovAzgFf_ByabJKJyv
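For intuition about what the model consumes: a merger tree maps naturally onto a torch_geometric Data object, with halos as nodes and progenitor-descendant links as edges. Here is a toy sketch with made-up numbers and placeholder features (the real preprocessing lives in the data folder):
```python
import torch
from torch_geometric.data import Data

# A toy merger tree: 4 halos (nodes), each with illustrative features
# (e.g. mass, concentration); real trees use the features produced by
# the preprocessing scripts in data/.
x = torch.tensor([[12.1, 5.0],
                  [11.8, 4.2],
                  [11.0, 6.1],
                  [10.5, 3.9]], dtype=torch.float)

# Directed edges from progenitor halos to their descendant halos.
edge_index = torch.tensor([[1, 2, 3],
                           [0, 0, 1]], dtype=torch.long)

# Graph-level targets: the baryonic properties to regress
# (e.g. stellar mass, SFR); the values here are placeholders.
y = torch.tensor([[10.2, 0.3]], dtype=torch.float)

tree = Data(x=x, edge_index=edge_index, y=y)
```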
For people looking to reproduce the results of our paper (https://arxiv.org/abs/2210.13473), Mangrove: Learning Galaxy Properties from Merger Trees, the folders to look in are data and dev.
You can preprocess the merger trees with data_z.py (or data_SAM.py if you want fewer nodes), although you'll have to fit a feature transformer yourself, since I found that to work best. You can find a procedure for doing so in the transform subfolder, but if you're going for speed and already know which subset of features you're interested in, I suggest fitting the transformer a different way: ours is fit for each column, which isn't fast.
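For context, here is a minimal sketch of what per-column fitting can look like, using scikit-learn's QuantileTransformer; the array and variable names are illustrative, and the actual procedure is in the transform subfolder:
```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Illustrative feature matrix: one row per tree node, one column per feature.
rng = np.random.default_rng(0)
node_features = rng.random((10_000, 7))

# Fit one transformer per column, as described above; this is the slow part.
transformers = []
for col in range(node_features.shape[1]):
    qt = QuantileTransformer(output_distribution="normal")
    qt.fit(node_features[:, [col]])
    transformers.append(qt)

# Apply the fitted transformers column by column.
transformed = np.hstack(
    [qt.transform(node_features[:, [col]]) for col, qt in enumerate(transformers)]
)
```
Note that QuantileTransformer already operates column-wise when fit on a 2D array, so if speed matters, a single fit on the full matrix (or on just the feature subset you care about) avoids the Python-level loop.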
Having restructured the merger trees, you can then do the training either as single experiments (using run_experiment.py) or as a sweep (using run_sweep.py). Everything required for training and tracking is in the dev folder: loss functions, learning-rate schedulers, models, and a script for doing the training on the GPU/CPU (the CPU version is outdated).
For examples of either a single experiment or a sweep, check out the guide.txt files in exp_example (for the single experiment) and exp_sweep_example (for a sweep).
Required dependencies
This creates a conda environment to do your coding in. Replace the anaconda module with whatever you have on your computer/cluster:
module load anaconda3/2021.5
conda create --name jtorch pytorch==1.9.0 jupyter torchvision==0.10.0 torchaudio==0.9.0 cudatoolkit=10.2 matplotlib tensorboard --channel pytorch
conda activate jtorch
pip install accelerate scikit-learn pandas
To determine the pytorch_geometric version that you need, check out https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html. I recommend installing through pip rather than conda, as in the version below, but check the docs.
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-1.9.0+cu102.html
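After installing, a quick sanity check (version numbers assume the environment above) confirms that PyTorch, CUDA, and PyTorch Geometric can all be imported together:
```python
import torch
import torch_geometric

# Expect 1.9.0, True (on a CUDA machine with cudatoolkit 10.2), and
# whatever PyG version the installation page recommended.
print(torch.__version__)
print(torch.cuda.is_available())
print(torch_geometric.__version__)
```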
Structure of the JSON experiment setup
- "experiment":(str) For logging on wandb, not necessary if local
- "group": (str) For logging locally (and grouping on wandb)
- "move": (bool) Whether or not to move a done experiment into done after running it
- "model": Model name (str), as in dev/models.py
"log": (bool) whether or not to log
"run_params": Parameters= group for doing the run
- "n_epochs": number of epochs (int)
- "n_trials": number of trials of the same experiment (int)
- "batch_size": batch size (int)
- "val_epoch": The number of epochs between validation set checks (int)
- "early_stopping": Whether to do early stopping (bool)
- "patience": patience (in epoch) for early stopping (int)
- "l1_lambda": L1 regularization (float)
- "l2_lambda": L2 regularization (float)
- "lossfunc": loss function name from dev/lossfuncs.py (str)
- "metrics": metric name from dev/metrics.py (str)
- "performanceplot": plot name from dev/evalplot.py (str)
- "save":Whether or not to save the model and results (bool)
- "seed": Setting torch/random seed or not (bool)
- "num_workers": Number of workers for dataloader, should be equal to number of cpu cores available on the system (int)
"learn_params": Parameters related to learning scheme
- "learning_rate": Learning rate (float)
- "schedule": Learning Rate scheduler name from dev/lr_schedule.py (str)
- "gup": Exponent for warmup (float). $(g{up})^{epoch}*\text{learning rate}/ ((g_{up})^{warmup})$
- "gdown": Exponent for cooldown (float). $(g{down})^{epoch}*\text{learning rate}$
- "warmup": Number of epochs to warm up for (int)
- "period": Period for cosine annealing in epochs (int)
- "eta_min": Mininum learning rate for cosine annealing (float)
"hyper_params": Hyper parameters for model
- "hidden_channels": Size of latent space (int)
- "conv_layers": Number of convolutional layers (int)
- "conv_activation": Activation function between convolutional layers (str) ['relu', 'leakyrelu']
- "decode_activation": Activation function between decode layers (str) ['relu', 'leakyrelu']
- "decode_layers": Number of decode layers (int)
- "layernorm": Whether or not to use layer normalization (bool)
- "agg": What type of global aggregation to use (str) ['sum', 'max']
- "variance": Whether or not to predict variance (bool)
- "rho": number of correlations to predict (int)
"data_params": Parameters for loading the data
- "case": Data path (str)
- "targets": Which targets to optimize for (list(int)),
- "del_feats": Which features to leave out (list(int))
- "split": Where to split between train/val (float (0;1) interval)
- "test": Whether or not to use testing mode (bool)
- "scale": Whether or not to scale targets to have mean 0, variance 1 (bool)
All in all, an experiment file looks like this:
```json { "experiment": "GraphMerge", "group": "finalGauss2dcorrall", "move":false, "model": "Sage", "log": true,
"run_params":{
"n_epochs": 500,
"n_trials": 1,
"batch_size": 256,
"val_epoch": 2,
"early_stopping": false,
"patience": 100,
"l1_lambda":0,
"l2_lambda":0,
"loss_func": "Gauss2d_corr",
"metrics": "test_multi_varrho",
"performance_plot": "multi_base",
"save":true,
"seed":true,
"num_workers": 4
}, "learnparams":{ "learningrate": 1e-2, "schedule": "onecycle", "gup":1, "gdown":0.95, "warmup":4, "period":5, "eta_min":1e-5 },
"hyper_params": {
"hidden_channels": 128,
"conv_layers": 5,
"conv_activation": "relu",
"decode_activation": "leakyrelu",
"decode_layers": 3,
"layernorm": true,
"agg": "sum",
"variance": true,
"rho": 1
},
"dataparams":{ "case": "vlargeall4tz0.0quantileraw", "targets": [0,1,2], "del_feats": [], "split": 0.8, "test": 0, "scale": 0 } } ```
Owner
- Name: Christian Kragh Jespersen
- Login: astrockragh
- Kind: user
- Company: Department of Astrophysics, Princeton University
- Website: astrockragh.github.io
- Twitter: astrockragh
- Repositories: 3
- Profile: https://github.com/astrockragh
Graduate student at Princeton (Astrophysics). Optimizing information gain from data with ML.
Citation (CITATION.md)
# Citing
To cite the code used for Mangrove, please use the following BibTeX entry:
```bibtex
@MISC{2023ascl.soft06015J,
author = {{Jespersen}, Christian Kragh and {Cranmer}, Miles and {Melchior}, Peter and {Ho}, Shirley and {Somerville}, Rachel S. and {Gabrielpillai}, Austen},
title = "{Mangrove: Infer galaxy properties using dark matter merger trees}",
keywords = {Software},
howpublished = {Astrophysics Source Code Library, record ascl:2306.015},
year = 2023,
month = jun,
eid = {ascl:2306.015},
pages = {ascl:2306.015},
archivePrefix = {ascl},
eprint = {2306.015},
adsurl = {https://ui.adsabs.harvard.edu/abs/2023ascl.soft06015J},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
```
To cite the Mangrove paper, please use the following BibTeX entry:
```bibtex
@ARTICLE{2022ApJ...941....7J,
author = {{Jespersen}, Christian Kragh and {Cranmer}, Miles and {Melchior}, Peter and {Ho}, Shirley and {Somerville}, Rachel S. and {Gabrielpillai}, Austen},
title = "{Mangrove: Learning Galaxy Properties from Merger Trees}",
journal = {\apj},
keywords = {Galaxies, Astrostatistics, Algorithms, Hydrodynamical simulations, N-body simulations, Neural networks, 573, 1882, 1883, 767, 1083, 1933, Astrophysics - Astrophysics of Galaxies, Astrophysics - Instrumentation and Methods for Astrophysics, Computer Science - Machine Learning},
year = 2022,
month = dec,
volume = {941},
number = {1},
eid = {7},
pages = {7},
doi = {10.3847/1538-4357/ac9b18},
archivePrefix = {arXiv},
eprint = {2210.13473},
primaryClass = {astro-ph.GA},
adsurl = {https://ui.adsabs.harvard.edu/abs/2022ApJ...941....7J},
adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
```
GitHub Events
Total
- Push event: 2
- Create event: 1
Last Year
- Push event: 2
- Create event: 1