SISSO++

SISSO++: A C++ Implementation of the Sure-Independence Screening and Sparsifying Operator Approach - Published in JOSS (2022)

https://gitlab.com/sissopp_developers/sissopp

Science Score: 89.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
  • Committers with academic emails
    7 of 9 committers (77.8%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Artificial Intelligence and Machine Learning Computer Science - 62% confidence
Economics Social Sciences - 40% confidence
Last synced: 4 months ago

Repository

C++ implementation of sisso.

Basic Info
  • Host: gitlab.com
  • Owner: sissopp_developers
  • License: apache-2.0
  • Default Branch: master
Statistics
  • Stars: 4
  • Forks: 4
  • Open Issues: 2
  • Releases: 0
Created over 4 years ago
Metadata Files
Readme Changelog Contributing License

README.md

C++ Implementation of SISSO with python bindings

Overview

This package provides a C++ implementation of SISSO with built-in Python bindings for an efficient Python interface. Future work will expand the Python interface to include more postprocessing analysis tools.

For a more detailed explanation please visit our documentation here

Installation

The package uses a CMake build system and is compatible with C++14 and all later versions of the C++ standard. The code is available at https://gitlab.com/sissopp_developers/sissopp

Prerequisites

To install SISSO++ the following packages are needed:

  • CMake version 3.10 and up
  • A C++ compiler (compatible with C++ 14 and later, e.g. gcc 5.0+ or icpc 17.0+)
  • BLAS/LAPACK
  • MPI

Additionally the following packages needed by SISSO++ will be installed (if they are not installed already/if they cannot be found in $PATH)

To build and use the optional python bindings the following are also needed:

The setup of the python environment can be done using anaconda with

```bash
conda create -n sissopp_env python=3.9 numpy pandas scipy seaborn scikit-learn toml
```

Installing SISSO++

SISSO++ is installed using a CMake build system, with sample configuration files located in cmake/toolchains/. For example, here is the initial_config.cmake file used to build SISSO++ and the Python bindings with the GNU compiler.

```
# Basic Flags
set(CMAKE_CXX_COMPILER g++ CACHE STRING "")
set(CMAKE_C_COMPILER gcc CACHE STRING "")
set(CMAKE_CXX_FLAGS "-O3 -march=native" CACHE STRING "")
set(CMAKE_C_FLAGS "-O3 -march=native" CACHE STRING "")

# Feature Flags
set(SISSO_BUILD_PYTHON ON CACHE BOOL "")
set(SISSO_BUILD_PARAMS ON CACHE BOOL "")
```

Because we want to build with the python bindings in this example and assuming there is no preexisting python environment, we need to first create/activate it. For this example we will use conda, but standard python installations or virtual environments are also possible.

```bash
conda create -n sissopp_env python=3.9 numpy pandas scipy seaborn scikit-learn toml
conda activate sissopp_env
```

Note: if you are using a Python environment with a local MKL installation, make sure the versions of all accessible MKL libraries are the same.

Now we can install SISSO++ using initial_config.cmake and the following commands. This assumes that the GNU compiler and MKL are used; if you are using a different compiler or BLAS library, change the flags to point to the relevant directories.

```bash
export MKLROOT=/path/to/mkl/
export BOOST_ROOT=/path/to/boost

cd ~/sissopp/
mkdir build/; cd build/;

cmake -C initial_config.cmake ../
make
make install
```

Once all the commands have run, the SISSO++ executable should be in the ~/SISSO++/main directory/bin/ directory.

Installing the Python Bindings Without Administrative Privileges

To install the Python bindings on a machine where you do not have write privileges for the default Python install directory (typical on most HPC systems), you must set PYTHON_INSTDIR to a directory where you do have write access. This can be done by modifying the cmake command to:

```bash
cmake -C initial_config.cmake -DPYTHON_INSTDIR=/path/to/python/install/directory/ ../
```

A standard local Python installation directory for pip and conda is $HOME/.local/lib/python3.X/site-packages/, where X is the minor version of Python. If you do set this variable, it is important that the directory is also included in your PYTHONPATH environment variable. This can be updated with

```bash
export PYTHONPATH=$PYTHONPATH:/path/to/python/install/directory/
```

If you are using anaconda, then this can be avoided by creating a new conda environment as detailed above.

You will need to set this variable and recompile the code (remove all build files first) if you see this error:

```bash
CMake Error at src/cmake_install.cmake:114 (file):
  file cannot create directory:
  ${PYTHON_BASE_DIR}/lib/python3.X/site-packages/sissopp.  Maybe need
  administrative privileges.
Call Stack (most recent call first):
  cmake_install.cmake:42 (include)
```
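After installing, it can help to confirm that Python actually finds the bindings in the chosen directory. The following is a minimal sketch, assuming the bindings are importable under the package name sissopp (inferred from the site-packages/sissopp path in the error message above). The helper name is_importable is hypothetical and not part of SISSO++.

```python
import importlib.util
import sys


def is_importable(package_name):
    """Return True if Python can locate the package on its current search path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "sissopp" is an assumed package name, taken from the install path
    # .../site-packages/sissopp shown in the CMake error message.
    if is_importable("sissopp"):
        print("sissopp found")
    else:
        print("sissopp not found; directories searched:")
        for entry in sys.path:
            print("  ", entry)
```

If the package is not found, the printed list shows whether the directory passed as PYTHON_INSTDIR is actually part of the search path.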

Install the Binary Without the Python Bindings

To install only the SISSO++ executable repeat the same commands as above but set USE_PYTHON in initial_config.cmake to OFF.

Running the Code

Input Files

To see a sample of the input files, look in ~/sisso++/main directory/tests/exec_test. This directory contains multiple subdirectories for different types of calculations; the default/ directory covers the most common application.

To use the code, two files are necessary: sisso.json and data.csv. data.csv stores all of the data for the calculation in CSV format. The first row of the file holds the feature metadata in the format expression (Unit). For example, if one of the primary features is the lattice constant of a material, its header would be lat_param (AA). The first column of the file contains the sample labels for all other rows and is used to set the sample IDs in the output files.

The input parameters are stored in sisso.json, here is a list of all possible variables that can be set in sisso.json

data.csv

The data file contains all relevant data and metadata to describe the individual features and samples. The first row of the file holds the feature metadata, with each header in the format expression (Unit) or expression. If no (Unit) is included in the header, the feature is considered to be dimensionless. For example, if one of the primary features is the lattice constant of a material, its header would be lat_param (AA), whereas the number of species in the material would be n_species because it is a dimensionless number.

The first column provides the label for each sample in the data file and is used to set the sample IDs in the output files. In the simplest case, this can be just a running index. The data describing the property vector is defined in the column whose expression matches the property_key field in the sisso.json file; this column is not included in the feature space. Additionally, an optional task column whose header matches the task_key field in the sisso.json file can be included in the data file. This column maps each sample to a task, identified by the label given in that column. Below is a minimal example of a data file used to learn a model for a material's volume.

```
material, Structure_Type, Volume (AA^3), lat_param (AA)
C, diamond, 45.64, 3.57
Si, diamond, 163.55, 5.47
Ge, diamond, 191.39, 5.76
Sn, diamond, 293.58, 6.65
Pb, diamond, 353.84, 7.07
LiF, rock_salt, 67.94, 4.08
NaF, rock_salt, 103.39, 4.69
KF, rock_salt, 159.00, 5.42
RbF, rock_salt, 189.01, 5.74
CsF, rock_salt, 228.33, 6.11
```
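To make the header convention concrete, here is a minimal sketch of how a header cell of the form expression (Unit) could be split into its two parts. It is an illustration only: the regular expression and the function name parse_header are assumptions, not the parser SISSO++ uses internally.

```python
import re

# Assumed pattern: an expression, optionally followed by "(Unit)".
HEADER_PATTERN = re.compile(
    r"^\s*(?P<expr>[^()]+?)\s*(?:\((?P<unit>[^()]*)\))?\s*$"
)


def parse_header(cell):
    """Split a header cell into (expression, unit); unit is None if dimensionless."""
    match = HEADER_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"Unrecognized header cell: {cell!r}")
    return match.group("expr"), match.group("unit")


if __name__ == "__main__":
    header_row = "material, Structure_Type, Volume (AA^3), lat_param (AA)"
    for cell in header_row.split(","):
        expression, unit = parse_header(cell)
        print(expression, "->", unit if unit is not None else "dimensionless")
```

Headers without a parenthesized unit, such as n_species, come back with a unit of None, mirroring the rule that they are treated as dimensionless.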

sisso.json

All input parameters that cannot be extracted from the data file are defined in the sisso.json file.

Here is a complete example of a sisso.json file where the property and task keys match those in the above data file example.

```json
{
    "data_file": "data.csv",
    "property_key": "Volume",
    "task_key": "Structure_Type",
    "opset": ["add", "sub", "mult", "div", "sq", "cb", "cbrt", "sqrt"],
    "param_opset": [],
    "calc_type": "regression",
    "desc_dim": 2,
    "n_sis_select": 5,
    "max_rung": 2,
    "max_leaves": 4,
    "n_residual": 1,
    "n_models_store": 1,
    "n_rung_store": 1,
    "n_rung_generate": 0,
    "min_abs_feat_val": 1e-5,
    "max_abs_feat_val": 1e8,
    "leave_out_inds": [0, 5],
    "leave_out_frac": 0.25,
    "fix_intercept": false,
    "max_feat_cross_correlation": 1.0,
    "nlopt_seed": 13,
    "global_param_opt": false,
    "reparam_residual": true
}
```
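The two input files have to agree with each other: property_key and task_key must name columns in data.csv. The sketch below checks this for the examples above and counts how many samples remain in each task once the leave_out_inds rows are set aside. It assumes that leave_out_inds are zero-based indexes into the data rows, which is consistent with the four training samples per task reported in the model file header; the function names are hypothetical and only a reduced set of keys is used.

```python
import csv
import io
import json
from collections import Counter

DATA_CSV = """\
material, Structure_Type, Volume (AA^3), lat_param (AA)
C, diamond, 45.64, 3.57
Si, diamond, 163.55, 5.47
Ge, diamond, 191.39, 5.76
Sn, diamond, 293.58, 6.65
Pb, diamond, 353.84, 7.07
LiF, rock_salt, 67.94, 4.08
NaF, rock_salt, 103.39, 4.69
KF, rock_salt, 159.00, 5.42
RbF, rock_salt, 189.01, 5.74
CsF, rock_salt, 228.33, 6.11
"""

# Only the keys needed for this check; the other settings are omitted.
CONFIG_JSON = """{"data_file": "data.csv", "property_key": "Volume",
"task_key": "Structure_Type", "leave_out_inds": [0, 5]}"""


def column_expression(header_cell):
    """Drop an optional trailing '(Unit)' from a header cell."""
    return header_cell.split("(")[0].strip()


def training_counts(data_text, config_text):
    """Validate the keys against the header and count training samples per task."""
    config = json.loads(config_text)
    reader = csv.reader(io.StringIO(data_text), skipinitialspace=True)
    rows = [row for row in reader if row]
    header = [column_expression(cell) for cell in rows[0]]
    for key in ("property_key", "task_key"):
        if config[key] not in header:
            raise ValueError(f"{key}={config[key]!r} is not a column in the data file")
    task_column = header.index(config["task_key"])
    left_out = set(config["leave_out_inds"])  # assumed zero-based row indexes
    return Counter(
        row[task_column] for i, row in enumerate(rows[1:]) if i not in left_out
    )


if __name__ == "__main__":
    print(dict(training_counts(DATA_CSV, CONFIG_JSON)))
```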

Performing the Calculation

Once the input files are made the code can be run using the following command

mpiexec -n 2 ~/sisso++/main directory/bin/sisso++ sisso.json

which will give the following output for the simple problem defined above

```text
time input_parsing: 0.000721931 s
time to generate feat sapce: 0.00288105 s
Projection time: 0.00304198 s
Time to get best features on rank : 1.09673e-05 s
Complete final combination/selection from all ranks: 0.00282502 s
Time for SIS: 0.00595999 s
Time for l0-norm: 0.00260496 s
Projection time: 0.000118971 s
Time to get best features on rank : 1.38283e-05 s
Complete final combination/selection from all ranks: 0.00240111 s
Time for SIS: 0.00276804 s
Time for l0-norm: 0.000256062 s
Train RMSE: 0.293788 AA^3; Test RMSE: 0.186616 AA^3
c0 + a0 * (lat_param^3)

Train RMSE: 0.0936332 AA^3; Test RMSE: 15.8298 AA^3
c0 + a0 * ((lat_param^3)^2) + a1 * (sqrt(lat_param)^3)
```

Analyzing the Results

Once the calculations are done, two sets of output files are generated. Two files that summarize the results from SIS, one computer readable and one human readable, are stored in feature_space/, and every model used as a residual for SIS is stored in models/. The human-readable file describing the selected feature space is feature_space/SIS_summary.txt, which contains the projection score (the Pearson correlation to the target property or model residual) of each selected feature.

```
FEAT_ID     Score                     Feature Expression
0           0.99997909235669924       (lat_param^3)
1           0.999036700010245471      ((lat_param^2)^2)
2           0.998534266139345261      (lat_param^2)
3           0.996929900301868899      (sqrt(lat_param)^3)
4           0.994755117666830335      lat_param
-----------------------------------------------------------------------
5           0.0318376000648976157     ((lat_param^3)^3)
6           0.00846237838476477863    ((lat_param^3)^2)
7           0.00742498801557322716    cbrt(cbrt(lat_param))
8           0.00715447033658055554    cbrt(sqrt(lat_param))
9           0.00675695980092700429    sqrt(sqrt(lat_param))
-----------------------------------------------------------------------
```

The computer-readable file is `feature_space/selected_features.txt` and contains the list of selected features, each represented by an alphanumeric code in which the integers are the indexes of features in the primary feature space and the strings represent the operators. The terms in each expression appear in the same order as they would in postfix (reverse Polish) notation.

```
FEAT_ID     Feature Postfix Expression (RPN)
0           0|cb
1           0|sq|sq
2           0|sq
3           0|sqrt|cb
4           0
-----------------------------------------------------------------------
5           0|cb|cb
6           0|cb|sq
7           0|cbrt|cbrt
8           0|sqrt|cbrt
9           0|sqrt|sqrt
-----------------------------------------------------------------------
```
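The postfix codes can be turned back into readable expressions with a small stack machine. The following sketch reproduces the unary expressions listed in SIS_summary.txt from the codes in selected_features.txt. The output formats for the unary operators are copied from the summary file; the formats and operand order for the binary operators (add, sub, mult, div) are assumptions, since no binary feature appears in this example, and the function name is hypothetical.

```python
# Formats for unary operators, matching the expressions in SIS_summary.txt.
UNARY_FORMATS = {
    "sq": "({0}^2)",
    "cb": "({0}^3)",
    "sqrt": "sqrt({0})",
    "cbrt": "cbrt({0})",
}

# Assumed formats for binary operators (not shown in the example output).
BINARY_FORMATS = {
    "add": "({0} + {1})",
    "sub": "({0} - {1})",
    "mult": "({0} * {1})",
    "div": "({0} / {1})",
}


def postfix_to_infix(code, primary_features):
    """Convert a code such as '0|sqrt|cb' into '(sqrt(lat_param)^3)'."""
    stack = []
    for token in code.split("|"):
        if token in UNARY_FORMATS:
            operand = stack.pop()
            stack.append(UNARY_FORMATS[token].format(operand))
        elif token in BINARY_FORMATS:
            right = stack.pop()  # assumed order: second operand is on top
            left = stack.pop()
            stack.append(BINARY_FORMATS[token].format(left, right))
        else:
            # Integers index into the primary feature space.
            stack.append(primary_features[int(token)])
    if len(stack) != 1:
        raise ValueError(f"Malformed postfix code: {code!r}")
    return stack[0]


if __name__ == "__main__":
    for code in ["0|cb", "0|sq|sq", "0|sqrt|cb", "0|cbrt|cbrt", "0"]:
        print(code, "->", postfix_to_infix(code, ["lat_param"]))
```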

The model output files are split into train and test files, sorted by the dimensionality of the model and by the train RMSE. The model with the lowest RMSE is stored in the lowest-numbered file. For example, train_dim_2_model_0.dat holds the best 2D model and train_dim_2_model_1.dat the second best, whereas train_dim_1_model_0.dat holds the best 1D model. Each model file has a large header containing information about the selected features and the generated model:

```
c0 + a0 * (lat_param^3)
Property Label: $Volume$; Unit of the Property: AA^3
RMSE: 0.293787533962641; Max AE: 0.56084644346538
Coefficients
Task       a0                      c0
diamond, 1.000735616997855e+00, -1.551085274074442e-01,
rock_salt, 9.998140372873336e-01, 6.405707194855371e-02,
Feature Rung, Units, and Expressions
0; 1; AA^3; 0|cb; (lat_param^3); $\left(lat_{param}^3\right)$; (lat_param).^3; lat_param
Number of Samples Per Task
Task , n_mats_train
diamond, 4
rock_salt, 4
```

The first section of the header summarizes the model: it gives a string representation of the model, defines the property's label and unit, and reports the error of the model.

```
c0 + a0 * (lat_param^3)
Property Label: $Volume$; Unit of the Property: AA^3
RMSE: 0.293787533962641; Max AE: 0.56084644346538
```

Next, the linear coefficients (named as in the first line) are listed for each task.

```
Coefficients
Task       a0                      c0
diamond, 1.000735616997855e+00, -1.551085274074442e-01,
rock_salt, 9.998140372873336e-01, 6.405707194855371e-02,
```

Then each feature in the model is described, including its units and various expressions.

```
Feature Rung, Units, and Expressions
0; 1; AA^3; 0|cb; (lat_param^3); $\left(lat_{param}^3\right)$; (lat_param).^3; lat_param
```

Finally, the number of samples in each task is given.

```
Number of Samples Per Task
Task , n_mats_train
diamond, 4
rock_salt, 4
```

The header of each test data file contains the same information as the training file, with an additional line at the end listing all indexes included in the test set:

```
Test Indexes: [ 0, 5 ]
```

These indexes can be used to reproduce the results by setting `leave_out_inds` to those listed on this line.

After this header, both files store the following data:

```
Sample ID , Property Value , Property Value (EST) , Feature 0 Value
```

With this data, one can plot and analyze the model, e.g., by using the Python bindings.
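As a sanity check on how these pieces fit together, the test error of the one-dimensional model can be recomputed by hand from the numbers quoted in this README. The sketch below assumes that the test indexes 0 and 5 refer to the first and sixth data rows of the data file example (C and LiF), and it copies the per-task coefficients from the model header. The function names are illustrative and are not part of the SISSO++ Python bindings.

```python
import math

# Per-task coefficients copied from the 1D model header: c0 + a0 * (lat_param^3).
COEFFICIENTS = {
    "diamond": {"a0": 1.000735616997855e+00, "c0": -1.551085274074442e-01},
    "rock_salt": {"a0": 9.998140372873336e-01, "c0": 6.405707194855371e-02},
}

# Assumed test samples (rows 0 and 5 of the data file example):
# (sample ID, task, Volume in AA^3, lat_param in AA)
TEST_SAMPLES = [
    ("C", "diamond", 45.64, 3.57),
    ("LiF", "rock_salt", 67.94, 4.08),
]


def predict_volume(task, lat_param):
    """Evaluate c0 + a0 * lat_param^3 with the coefficients of the given task."""
    coeffs = COEFFICIENTS[task]
    return coeffs["c0"] + coeffs["a0"] * lat_param ** 3


def rmse(samples):
    """Root-mean-square error between the tabulated and the predicted volumes."""
    squared_errors = [
        (volume - predict_volume(task, lat_param)) ** 2
        for _, task, volume, lat_param in samples
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    for sample_id, task, volume, lat_param in TEST_SAMPLES:
        print(sample_id, volume, round(predict_volume(task, lat_param), 4))
    print("Test RMSE:", round(rmse(TEST_SAMPLES), 6), "AA^3")
```

With the two-decimal values from the data file example, the result should land close to the Test RMSE reported for the 1D model in the program output, up to rounding of the inputs.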

Using the Python Library

To see how the Python interface can be used, refer to the tutorials. If you get an error about not being able to load MKL libraries, you may have to run conda install numpy to get proper linking.

JOSS Publication

SISSO++: A C++ Implementation of the Sure-Independence Screening and Sparsifying Operator Approach
Published
March 16, 2022
Volume 7, Issue 71, Page 3960
Authors
Thomas A. R. Purcell ORCID
NOMAD Laboratory at the Fritz Haber Institute of the Max Planck Society and Humboldt University, Berlin, Germany
Matthias Scheffler
NOMAD Laboratory at the Fritz Haber Institute of the Max Planck Society and Humboldt University, Berlin, Germany
Christian Carbogno ORCID
NOMAD Laboratory at the Fritz Haber Institute of the Max Planck Society and Humboldt University, Berlin, Germany
Luca M. Ghiringhelli ORCID
NOMAD Laboratory at the Fritz Haber Institute of the Max Planck Society and Humboldt University, Berlin, Germany
Editor
Jarvist Moore Frost ORCID
Tags
SISSO Symbolic Regression Physics

Committers


All Time
  • Total Commits: 2,034
  • Total Committers: 9
  • Avg Commits per committer: 226.0
  • Development Distribution Score (DDS): 0.214
Past Year
  • Commits: 54
  • Committers: 3
  • Avg Commits per committer: 18.0
  • Development Distribution Score (DDS): 0.13
Top Committers
Name Email Commits
Thomas p****l@f****e 1,598
Sebastian Eibl s****l@m****e 159
Tom Purcell p****t@a****u 155
Yi Yao y****2@g****m 70
Thomas Purcell t****l@t****e 37
Matthew Evans g****t@m****e 11
Yi Yao y****o@f****e 2
Thomas A.R. Purcell p****t@j****u 1
William Huhn w****n@a****v 1


Dependencies

requirements.txt pypi
  • matplotlib >=3.5.0
  • numpy >=1.21.2
  • pandas >=1.3.4
  • pytest >=6.2.5
  • python >=3.6,<3.10
  • scikit-learn >=1.0.1
  • scipy >=1.7.1
  • seaborn >=0.11.2
  • toml >=0.10.2