aweSOM

aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis - Published in JOSS (2025)

https://github.com/tvh0021/awesom

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
    Links to: arxiv.org, joss.theoj.org
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Materials Science (Physical Sciences) - 40% confidence
Last synced: 4 months ago

Repository

Accelerated Self-organizing Map (SOM) and Statistically Combined Ensemble (SCE)

Basic Info
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 1
  • Open Issues: 1
  • Releases: 1
Created almost 2 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

aweSOM - Accelerated Self-organizing Map (SOM) and Statistically Combined Ensemble (SCE)


This package combines a JIT-accelerated and parallelized implementation of SOM (integrating parts of POPSOM) with a GPU-accelerated implementation of SCE based on ensemble learning. It is optimized for large datasets, up to $\sim 10^8$ points.

aweSOM is developed specifically to identify intermittent structures (current sheets) in 3D plasma simulations (Ha et al., 2024). However, it can also be used for a variety of clustering and classification tasks.

Authors:

Trung Ha - University of Massachusetts-Amherst, Joonas Nättilä - University of Helsinki, Jordy Davelaar - Princeton University.

Version: 1.1.0

1. Installation

  1. Install aweSOM and required dependencies:

```bash
git clone https://github.com/tvh0021/aweSOM.git
cd aweSOM
pip install .
```

  2. Install JAX with CUDA support separately:

```bash
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```

If your system does not support CUDA, you can skip this step. SCE will automatically fall back to the CPU. However, the CPU-only version can be significantly slower for large datasets (see the performance tests).
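To confirm which backend JAX actually picked up, a quick check like the following (plain JAX calls, not part of aweSOM) shows whether SCE will run on the GPU or fall back to the CPU:

```python
# Illustrative check of the active JAX backend; not an aweSOM API.
import jax

print(jax.default_backend())  # "gpu" if CUDA was detected, otherwise "cpu"
print(jax.devices())          # the devices SCE computations will run on
```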

Experimental

Apple Silicon users can also use JAX with Metal support; follow the instructions in the JAX documentation to install the Metal backend.

2. Testing

We use pytest for the test suite. It is already listed as a dependency in the requirements.txt file and should be installed automatically with aweSOM.

To run tests for all modules in the root directory of aweSOM:

```bash
python -m pytest
```

You can also run specific test modules by specifying the path to the test file:

```bash
python -m pytest tests/[module]_test.py
```

Or run a specific test function within a module:

```bash
python -m pytest tests/[module]_test.py::test_[function]
```

If there is no GPU, or if the GPU is not CUDA-compatible, some tests in the sce_test.py module will fail. This is expected behavior; SCE computation will still fall back to the CPU.
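For reference, a common pytest pattern for guarding GPU-only tests looks like the sketch below; this is illustrative and not necessarily how sce_test.py is organized:

```python
# Hypothetical skip guard for GPU-only tests; aweSOM's sce_test.py may differ.
import pytest
import jax

requires_gpu = pytest.mark.skipif(
    jax.default_backend() != "gpu",
    reason="no CUDA-compatible GPU; SCE falls back to the CPU",
)

@requires_gpu
def test_sce_runs_on_gpu():
    assert jax.default_backend() == "gpu"
```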

3. Basic Usage - SOM

Here are the basic steps to initialize a lattice and train the SOM to classify the Iris dataset.

The full Jupyter notebook can be found here.

```python
import numpy as np
import matplotlib.pyplot as plt
from aweSOM import Lattice
```

First, load the dataset and normalize it:

```python
from sklearn.datasets import load_iris

iris = load_iris()

print("Shape of the data :", iris.data.shape)
print("Labeled classes :", iris.target_names)
print("Features in the set :", iris.feature_names)
```

```text
Shape of the data : (150, 4)
Labeled classes : ['setosa' 'versicolor' 'virginica']
Features in the set : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```

Normalize the data with a custom scaler

```python
import aweSOM.run_som as rs

iris_data_transformed = rs.manual_scaling(iris.data)
```

Initialize the lattice and train:

```python
# Create an initial SOM instance
map = Lattice(xdim=40, ydim=15, alpha_0=0.5, train=100000)

# Train the SOM with some data in the shape of N x F
true_labels = iris.target
feature_names = iris.feature_names
map.train_lattice(iris_data_transformed, feature_names, true_labels)
```

The trained SOM is stored in map.lattice.

To visualize the SOM, plot the U-matrix, which is computed at the end of training and stored in map.umat:

```python
# Compute the unique centroids
naive_centroids_matrix = map.compute_centroids()  # the centroid associated with each node
unique_centroids = map.get_unique_centroids(naive_centroids_matrix)  # the individual centroids

xdim, ydim = 40, 15  # lattice dimensions used above

plot_centroids = {}
plot_centroids['position_x'] = [x + 0.5 for x in unique_centroids['position_x']]
plot_centroids['position_y'] = [y + 0.5 for y in unique_centroids['position_y']]

X, Y = np.meshgrid(np.arange(xdim) + 0.5, np.arange(ydim) + 0.5)

plt.figure(dpi=250)
plt.pcolormesh(map.umat.T, cmap='viridis')
plt.scatter(plot_centroids['position_x'], plot_centroids['position_y'], color='red', s=10)
plt.colorbar(fraction=0.02)
plt.contour(X, Y, map.umat.T, levels=np.linspace(np.min(map.umat), np.max(map.umat), 20), colors='black', alpha=0.5)
plt.gca().set_aspect("equal")
plt.title(rf'U-matrix for {xdim}x{ydim} SOM')
```

U-matrix of a 40x15 SOM trained on the Iris dataset

There are 15 centroids in this U-matrix, and therefore 15 clusters. From the geometry of the U-matrix, we can see that there are clearly at least two clusters (separated by the large band of high-value nodes), and at most four.

Merge clusters using the cost function:

```python
merge_threshold = 0.2  # empirical tests reveal a threshold between 0.2 and 0.4 usually works best

# Plot the U-matrix with the connected components and ground-truth labels
# (if the labels were supplied during map.train_lattice)
map.plot_heat(map.umat, merge=True, merge_cost=merge_threshold)
```

U-matrix with labels

Now, we project each data point onto the lattice and get back a cluster id:

```python
final_clusters = map.assign_cluster_to_lattice(smoothing=None, merge_cost=merge_threshold)
som_labels = map.assign_cluster_to_data(map.projection_2d, final_clusters)
```

Finally, we compare the aweSOM result to the ground truth

```python
fig, axs = plt.subplots(1, 2, figsize=(20, 10))
scatter_ground = axs[0].scatter(iris.data[:, 1], iris.data[:, 2], c=iris.target, cmap='viridis')
axs[0].set_xlabel('Sepal Width')
axs[0].set_ylabel('Petal Length')
axs[0].legend(scatter_ground.legend_elements()[0], iris.target_names, loc="upper right", title="Classes")
scatter_som = axs[1].scatter(iris.data[:, 1], iris.data[:, 2], c=som_labels, cmap='viridis')
axs[1].set_xlabel('Sepal Width')
axs[1].set_ylabel('Petal Length')
axs[1].legend(scatter_som.legend_elements()[0], np.unique(final_clusters), loc="upper right", title="aweSOM")
plt.show()
```

Scatter plot comparing ground truth with aweSOM clusters

Clearly, the mapping is: {'setosa' : 2, 'versicolor' : 1, 'virginica' : 0}

```python
# Assign cluster number to class label; change manually
label_map = {'setosa': 2, 'versicolor': 1, 'virginica': 0}
correct_label = 0

for i in range(len(som_labels)):
    if int(som_labels[i]) == label_map[iris.target_names[iris.target[i]]]:
        correct_label += 1

print("Number of correct predictions: ", correct_label)
print("Accuracy = ", correct_label / len(som_labels) * 100, "%")

# Precision and recall by class
precision = np.zeros(3)
recall = np.zeros(3)

for i in range(3):
    tp = 0
    fp = 0
    fn = 0
    for j in range(len(som_labels)):
        if int(som_labels[j]) == label_map[iris.target_names[i]]:
            if iris.target[j] == i:
                tp += 1
            else:
                fp += 1
        else:
            if iris.target[j] == i:
                fn += 1
    precision[i] = tp / (tp + fp)
    recall[i] = tp / (tp + fn)

print("Precision: ", [float(np.round(precision[i], 4)) * 100 for i in range(3)], "%")
print("Recall: ", [float(np.round(recall[i], 4)) * 100 for i in range(3)], "%")
```

```text
Number of correct predictions:  141
Accuracy =  94.0 %
Precision:  [100.0, 90.2, 91.84] %
Recall:  [100.0, 92.0, 90.0] %
```

This is the performance of the aweSOM model.
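As a cross-check, scikit-learn's metrics reproduce the same numbers in a few lines (per-class values come out in cluster-id order here, not in the original class order); label_map, som_labels, np, and iris are assumed from the code above:

```python
# Cross-check with scikit-learn's metrics.
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Translate the ground-truth classes into SOM cluster ids via label_map
mapped_truth = np.array([label_map[iris.target_names[t]] for t in iris.target])
predicted = np.asarray(som_labels, dtype=int)

print("Accuracy :", accuracy_score(mapped_truth, predicted) * 100, "%")
print("Precision:", precision_score(mapped_truth, predicted, average=None) * 100)
print("Recall   :", recall_score(mapped_truth, predicted, average=None) * 100)
```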

4. Basic Usage - SCE

If a dataset is complex, a single SOM result might not be sufficiently stable. Instead, we can generate multiple SOM realizations with slightly different initial parameters, then stack the results into a set of statistically significant clusters.

Let's use the same Iris dataset:

```python
from aweSOM.run_som import save_cluster_labels

# Set a parameter space to scan
parameters = {"xdim": [38, 40, 42], "ydim": [14, 16], "alpha_0": [0.1, 0.5], "train": [10000, 50000, 100000]}
merge_threshold = 0.2

# iris_data_transformed, feature_names, and true_labels come from Section 3
for xdim in parameters["xdim"]:
    for ydim in parameters["ydim"]:
        for alpha_0 in parameters["alpha_0"]:
            for train in parameters["train"]:
                print(f'constructing aweSOM lattice for xdim={xdim}, ydim={ydim}, alpha={alpha_0}, train={train}...', flush=True)
                map = Lattice(xdim, ydim, alpha_0, train)
                map.train_lattice(iris_data_transformed, feature_names, true_labels)
                projection_2d = map.map_data_to_lattice()
                final_clusters = map.assign_cluster_to_lattice(smoothing=None, merge_cost=merge_threshold)
                som_labels = map.assign_cluster_to_data(projection_2d, final_clusters)
                save_cluster_labels(som_labels, xdim, ydim, alpha_0, train, name_of_dataset='iris')
```

This saves 36 realizations to the current working directory.
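As a quick sanity check, you can count the label files before moving them; this assumes they are written to the working directory with a "labels" prefix, which the mv command below relies on:

```python
# Count saved label files: 3 (xdim) x 2 (ydim) x 2 (alpha_0) x 3 (train) = 36
import glob

print(len(glob.glob("labels*")))  # expect 36
```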

In the terminal (skip this step if you use the pre-generated files):

```bash
cd [path_to_aweSOM]/aweSOM/examples/iris/
mkdir som_results
mv labels* som_results/
```

Then,

```bash
cd som_results/
python3 [path_to_aweSOM]/aweSOM/src/aweSOM/sce.py --subfolder SCE --dims 150
```

This will create (or append to) the multimap_mappings.txt file inside som_results/SCE/ with the $G_{\rm sum}$ value for each cluster. It also saves the mask for each cluster $C$ as a .npy file.

In its simplest form, the SCE stacking can be performed point-by-point: $V_{{\rm SCE},i} = \sum_C M_{C,i} \cdot G_{{\rm sum},C}$
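A plain NumPy transcription of this sum is sketched below, for illustration only; in the actual workflow that follows, the saved masks may already carry the $G_{\rm sum}$ weighting:

```python
# Minimal sketch of the point-by-point SCE stack, assuming `masks` is a list of
# length-N arrays (one per cluster C) and `gsum_values` the matching G_sum values.
import numpy as np

def sce_stack(masks, gsum_values):
    v_sce = np.zeros_like(masks[0], dtype=float)
    for mask_c, gsum_c in zip(masks, gsum_values):
        v_sce += mask_c * gsum_c
    return v_sce
```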

Get the list of $G_{\rm sum}$ values, sorted in descending order

```python
file_path = 'som_results/SCE/'
file_name = 'multimap_mappings.txt'

from aweSOM.make_sce_clusters import get_gsum_values, plot_gsum_values

ranked_gsum_list, map_list = get_gsum_values(file_path + file_name)
```

Add these values together

```python
sce_sum = np.zeros(len(iris_data_transformed))
for i in range(len(ranked_gsum_list)):
    current_cluster_mask = np.load(f"{file_path}/mask-{map_list[i][2]}-id{map_list[i][1]}.npy")
    sce_sum += current_cluster_mask
```

Visualize the result

```python
fig, axs = plt.subplots(1, 2, figsize=(20, 10))
scatter_ground = axs[0].scatter(iris.data[:, 1], iris.data[:, 2], c=iris.target, cmap='viridis')
axs[0].set_xlabel('Sepal Width')
axs[0].set_ylabel('Petal Length')
axs[0].legend(scatter_ground.legend_elements()[0], iris.target_names, loc="upper right", title="Classes")
scatter_sce = axs[1].scatter(iris.data[:, 1], iris.data[:, 2], c=sce_sum, cmap='viridis')
axs[1].set_xlabel('Sepal Width')
axs[1].set_ylabel('Petal Length')
plt.colorbar(scatter_sce, ax=axs[1])
plt.show()
```

Comparison between ground truth and the SCE $G_{\rm sum}$ gradient

Set a cutoff in $\sum G_{\rm sum}$ to obtain three clusters:

```python
signal_cutoff = [8000, 15000]

sce_clusters = np.zeros(len(iris_data_transformed), dtype=int)
for i in range(len(sce_sum)):
    if sce_sum[i] < signal_cutoff[0]:
        sce_clusters[i] = 0
    elif sce_sum[i] < signal_cutoff[1]:
        sce_clusters[i] = 1
    else:
        sce_clusters[i] = 2
```
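The same three-way split can be written more compactly with np.digitize, which gives identical results for these cutoffs:

```python
# Vectorized equivalent of the loop above; sce_sum and signal_cutoff as defined earlier
sce_clusters = np.digitize(sce_sum, bins=signal_cutoff)
```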

The resulting SCE quality, computed with the same code as in Section 3, is:

```text
Number of correct predictions:  142
Accuracy =  94.66666666666667 %
Precision:  [100.0, 92.0, 92.0] %
Recall:  [100.0, 92.0, 92.0] %
```

Because of the simplicity of the Iris dataset, not much improvement is made with SCE, but the result is nevertheless consistent with the single SOM result.

5. Advanced Usage - Plasma Simulation

The Jupyter notebook for the fiducial realization of the SOM lattice is located here.

6. License

This project is licensed under the MIT License - see the LICENSE file for details.

7. Contributing

Anyone is welcome to contribute! Please fork the repository and create pull requests with proposed changes.

8. Contact

Additional inquiries/questions about aweSOM should be directed to my email: tvha@umass.edu

Owner

  • Name: Trung Ha
  • Login: tvh0021
  • Kind: user

JOSS Publication

aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis
Published
April 04, 2025
Volume 10, Issue 108, Page 7613
Authors
Trung Ha ORCID
Department of Astronomy, University of Massachusetts-Amherst, Amherst, MA 01003, USA, Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA, Department of Physics, University of North Texas, Denton, TX 76203, USA
Joonas Nättilä ORCID
Department of Physics, University of Helsinki, P.O. Box 64, University of Helsinki, FI-00014, Finland, Physics Department and Columbia Astrophysics Laboratory, Columbia University, 538 West 120th Street, New York, NY 10027, USA, Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA
Jordy Davelaar ORCID
Department of Astrophysical Sciences, Peyton Hall, Princeton University, Princeton, NJ 08544, USA, NASA Hubble Fellowship Program, Einstein Fellow, Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA, Physics Department and Columbia Astrophysics Laboratory, Columbia University, 538 West 120th Street, New York, NY 10027, USA
Editor
George K. Thiruvathukal ORCID
Tags
astronomy plasma

Citation (CITATION.cff)

cff-version: "1.2.0"
authors:
- family-names: Ha
  given-names: Trung
  orcid: "https://orcid.org/0000-0001-6600-2517"
- family-names: Nättilä
  given-names: Joonas
  orcid: "https://orcid.org/0000-0002-3226-4575"
- family-names: Davelaar
  given-names: Jordy
  orcid: "https://orcid.org/0000-0002-2685-2434"
contact:
- family-names: Ha
  given-names: Trung
  orcid: "https://orcid.org/0000-0001-6600-2517"
doi: 10.5281/zenodo.15098625
message: If you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Ha
    given-names: Trung
    orcid: "https://orcid.org/0000-0001-6600-2517"
  - family-names: Nättilä
    given-names: Joonas
    orcid: "https://orcid.org/0000-0002-3226-4575"
  - family-names: Davelaar
    given-names: Jordy
    orcid: "https://orcid.org/0000-0002-2685-2434"
  date-published: 2025-04-04
  doi: 10.21105/joss.07613
  issn: 2475-9066
  issue: 108
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 7613
  title: "aweSOM: a CPU/GPU-accelerated Self-organizing Map and
    Statistically Combined Ensemble Framework for Machine-learning
    Clustering Analysis"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.07613"
  volume: 10
title: "`aweSOM`: a CPU/GPU-accelerated Self-organizing Map and
  Statistically Combined Ensemble Framework for Machine-learning
  Clustering Analysis"

GitHub Events

Total
  • Issues event: 6
  • Watch event: 2
  • Issue comment event: 7
  • Push event: 16
  • Create event: 2
Last Year
  • Issues event: 6
  • Watch event: 2
  • Issue comment event: 7
  • Push event: 16
  • Create event: 2

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 119
  • Total Committers: 1
  • Avg Commits per committer: 119.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 105
  • Committers: 1
  • Avg Commits per committer: 105.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
tvh0021 t****a@m****u 119
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: 4 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 1.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: 4 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 1.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • apizzuto (3)
  • zhangkai07 (1)