aweSOM

aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis - Published in JOSS (2025)

https://github.com/tvh0021/awesom

Science Score: 100.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
    Links to: arxiv.org, joss.theoj.org
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Materials Science (Physical Sciences) - 40% confidence
Last synced: 4 months ago

Repository

Accelerated Self-organizing Map (SOM) and Statistically Combined Ensemble (SCE)

Basic Info
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 1
  • Open Issues: 1
  • Releases: 1
Created almost 2 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

aweSOM - Accelerated Self-organizing Map (SOM) and Statistically Combined Ensemble (SCE)


This package combines a JIT-accelerated and parallelized implementation of SOM (integrating parts of POPSOM) with a GPU-accelerated implementation of SCE based on ensemble learning. It is optimized for large datasets, up to $\sim 10^8$ points.

aweSOM is developed specifically to identify intermittent structures (current sheets) in 3D plasma simulations (Ha et al., 2024). However, it can also be used for a variety of clustering and classification tasks.

Authors:

Trung Ha - University of Massachusetts-Amherst, Joonas Nättilä - University of Helsinki, Jordy Davelaar - Princeton University.

Version: 1.1.0

1. Installation

  1. Install aweSOM and required dependencies:

```bash
git clone https://github.com/tvh0021/aweSOM.git
cd aweSOM
pip install .
```

  2. Install JAX with CUDA support separately:

```bash
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```

If your system does not support CUDA, you can skip this step. SCE will automatically fall back to the CPU. However, the CPU-only version can be significantly slower for large datasets (see the performance tests).
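To confirm which backend JAX actually picked up, a quick check like the following (plain JAX calls, not part of aweSOM) shows whether SCE will run on the GPU or fall back to the CPU:

```python
# Illustrative check of the active JAX backend; not an aweSOM API.
import jax

print(jax.default_backend())  # "gpu" if CUDA was detected, otherwise "cpu"
print(jax.devices())          # the devices SCE computations will run on
```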

Experimental

Apple Silicon users can also use JAX with Metal support; follow the instructions in the JAX documentation to install the Metal backend.

2. Testing

We use pytest for the test suite. It is already listed as a dependency in the requirements.txt file and should be installed automatically with aweSOM.

To run tests for all modules in the root directory of aweSOM:

```bash
python -m pytest
```

You can also run specific test modules by specifying the path to the test file:

```bash
python -m pytest tests/[module]_test.py
```

Or run a specific test function within a module:

```bash
python -m pytest tests/[module]_test.py::test_[function]
```

If there is no GPU, or if the GPU is not CUDA-compatible, some tests in the sce_test.py module will fail. This is expected behavior; SCE computation will still fall back to the CPU.
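For reference, a common pytest pattern for guarding GPU-only tests looks like the sketch below; this is illustrative and not necessarily how sce_test.py is organized:

```python
# Hypothetical skip guard for GPU-only tests; aweSOM's sce_test.py may differ.
import pytest
import jax

requires_gpu = pytest.mark.skipif(
    jax.default_backend() != "gpu",
    reason="no CUDA-compatible GPU; SCE falls back to the CPU",
)

@requires_gpu
def test_sce_runs_on_gpu():
    assert jax.default_backend() == "gpu"
```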

3. Basic Usage - SOM

Here are the basic steps to initialize a lattice and train the SOM to classify the Iris dataset.

The full Jupyter notebook can be found here.

```python
import numpy as np
import matplotlib.pyplot as plt
from aweSOM import Lattice
```

First, load the dataset and normalize it:

```python
from sklearn.datasets import load_iris

iris = load_iris()

print("Shape of the data :", iris.data.shape)
print("Labeled classes :", iris.target_names)
print("Features in the set :", iris.feature_names)
```

```text
Shape of the data : (150, 4)
Labeled classes : ['setosa' 'versicolor' 'virginica']
Features in the set : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```

Normalize the data with a custom scaler

```python
import aweSOM.run_som as rs

iris_data_transformed = rs.manual_scaling(iris.data)
```

Initialize the lattice and train:

```python
# Create an initial SOM instance
map = Lattice(xdim=40, ydim=15, alpha_0=0.5, train=100000)

# Train the SOM with some data in the shape of N x F
true_labels = iris.target
feature_names = iris.feature_names
map.train_lattice(iris_data_transformed, feature_names, true_labels)
```

The trained SOM is stored in map.lattice.

To visualize the SOM, plot the U-matrix, which is computed at the end of training and stored in map.umat:

```python
# Compute the unique centroids
naive_centroids_matrix = map.compute_centroids()  # the centroid associated with each node
unique_centroids = map.get_unique_centroids(naive_centroids_matrix)  # the individual centroids

xdim, ydim = 40, 15  # lattice dimensions used above

plot_centroids = {}
plot_centroids['position_x'] = [x + 0.5 for x in unique_centroids['position_x']]
plot_centroids['position_y'] = [y + 0.5 for y in unique_centroids['position_y']]

X, Y = np.meshgrid(np.arange(xdim) + 0.5, np.arange(ydim) + 0.5)

plt.figure(dpi=250)
plt.pcolormesh(map.umat.T, cmap='viridis')
plt.scatter(plot_centroids['position_x'], plot_centroids['position_y'], color='red', s=10)
plt.colorbar(fraction=0.02)
plt.contour(X, Y, map.umat.T, levels=np.linspace(np.min(map.umat), np.max(map.umat), 20), colors='black', alpha=0.5)
plt.gca().set_aspect("equal")
plt.title(rf'U-matrix for {xdim}x{ydim} SOM')
```

U-matrix of a 40x15 SOM trained on the Iris dataset

There are 15 centroids in this U-matrix, and therefore 15 clusters. From the geometry of the U-matrix, we can see that there are clearly at least two clusters (separated by the large band of high-value nodes), and at most four.

Merge clusters using the cost function:

```python
merge_threshold = 0.2  # empirical tests reveal a threshold between 0.2 and 0.4 usually works best

# Plot the U-matrix with the connected components and ground-truth labels
# (if the labels were supplied during map.train_lattice)
map.plot_heat(map.umat, merge=True, merge_cost=merge_threshold)
```

U-matrix with labels

Now, we project each data point onto the lattice and get back a cluster id:

```python
final_clusters = map.assign_cluster_to_lattice(smoothing=None, merge_cost=merge_threshold)
som_labels = map.assign_cluster_to_data(map.projection_2d, final_clusters)
```

Finally, we compare the aweSOM result to the ground truth

```python
fig, axs = plt.subplots(1, 2, figsize=(20, 10))
scatter_ground = axs[0].scatter(iris.data[:, 1], iris.data[:, 2], c=iris.target, cmap='viridis')
axs[0].set_xlabel('Sepal Width')
axs[0].set_ylabel('Petal Length')
axs[0].legend(scatter_ground.legend_elements()[0], iris.target_names, loc="upper right", title="Classes")
scatter_som = axs[1].scatter(iris.data[:, 1], iris.data[:, 2], c=som_labels, cmap='viridis')
axs[1].set_xlabel('Sepal Width')
axs[1].set_ylabel('Petal Length')
axs[1].legend(scatter_som.legend_elements()[0], np.unique(final_clusters), loc="upper right", title="aweSOM")
plt.show()
```

Scatter plot comparing ground truth with aweSOM clusters

Clearly, the mapping is: {'setosa' : 2, 'versicolor' : 1, 'virginica' : 0}

```python
# Assign cluster number to class label; change manually
label_map = {'setosa': 2, 'versicolor': 1, 'virginica': 0}
correct_label = 0

for i in range(len(som_labels)):
    if int(som_labels[i]) == label_map[iris.target_names[iris.target[i]]]:
        correct_label += 1

print("Number of correct predictions: ", correct_label)
print("Accuracy = ", correct_label / len(som_labels) * 100, "%")

# Precision and recall by class
precision = np.zeros(3)
recall = np.zeros(3)

for i in range(3):
    tp = 0
    fp = 0
    fn = 0
    for j in range(len(som_labels)):
        if int(som_labels[j]) == label_map[iris.target_names[i]]:
            if iris.target[j] == i:
                tp += 1
            else:
                fp += 1
        else:
            if iris.target[j] == i:
                fn += 1
    precision[i] = tp / (tp + fp)
    recall[i] = tp / (tp + fn)

print("Precision: ", [float(np.round(precision[i], 4)) * 100 for i in range(3)], "%")
print("Recall: ", [float(np.round(recall[i], 4)) * 100 for i in range(3)], "%")
```

```text
Number of correct predictions:  141
Accuracy =  94.0 %
Precision:  [100.0, 90.2, 91.84] %
Recall:  [100.0, 92.0, 90.0] %
```

This is the performance of the aweSOM model.
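As a cross-check, scikit-learn's metrics reproduce the same numbers in a few lines (per-class values come out in cluster-id order here, not in the original class order); label_map, som_labels, np, and iris are assumed from the code above:

```python
# Cross-check with scikit-learn's metrics.
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Translate the ground-truth classes into SOM cluster ids via label_map
mapped_truth = np.array([label_map[iris.target_names[t]] for t in iris.target])
predicted = np.asarray(som_labels, dtype=int)

print("Accuracy :", accuracy_score(mapped_truth, predicted) * 100, "%")
print("Precision:", precision_score(mapped_truth, predicted, average=None) * 100)
print("Recall   :", recall_score(mapped_truth, predicted, average=None) * 100)
```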

4. Basic Usage - SCE

If a dataset is complex, a single SOM result might not be sufficiently stable. Instead, we can generate multiple SOM realizations with slightly different initial parameters, then stack the results into a set of statistically significant clusters.

Let's use the same Iris dataset:

```python
from aweSOM.run_som import save_cluster_labels

# Set a parameter space to scan
parameters = {"xdim": [38, 40, 42], "ydim": [14, 16], "alpha_0": [0.1, 0.5], "train": [10000, 50000, 100000]}
merge_threshold = 0.2

# iris_data_transformed, feature_names, and true_labels come from Section 3
for xdim in parameters["xdim"]:
    for ydim in parameters["ydim"]:
        for alpha_0 in parameters["alpha_0"]:
            for train in parameters["train"]:
                print(f'constructing aweSOM lattice for xdim={xdim}, ydim={ydim}, alpha={alpha_0}, train={train}...', flush=True)
                map = Lattice(xdim, ydim, alpha_0, train)
                map.train_lattice(iris_data_transformed, feature_names, true_labels)
                projection_2d = map.map_data_to_lattice()
                final_clusters = map.assign_cluster_to_lattice(smoothing=None, merge_cost=merge_threshold)
                som_labels = map.assign_cluster_to_data(projection_2d, final_clusters)
                save_cluster_labels(som_labels, xdim, ydim, alpha_0, train, name_of_dataset='iris')
```

This saves 36 realizations to the current working directory.
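As a quick sanity check, you can count the label files before moving them; this assumes they are written to the working directory with a "labels" prefix, which the mv command below relies on:

```python
# Count saved label files: 3 (xdim) x 2 (ydim) x 2 (alpha_0) x 3 (train) = 36
import glob

print(len(glob.glob("labels*")))  # expect 36
```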

In the terminal (skip this step if you use the pre-generated files):

```bash
cd [path_to_aweSOM]/aweSOM/examples/iris/
mkdir som_results
mv labels* som_results/
```

Then,

```bash
cd som_results/
python3 [path_to_aweSOM]/aweSOM/src/aweSOM/sce.py --subfolder SCE --dims 150
```

This will create (or append to) the multimap_mappings.txt file inside som_results/SCE/ with the $G_{\rm sum}$ value for each cluster. It also saves the mask for each cluster $C$ as a .npy file.

In its simplest form, the SCE stacking can be performed point-by-point: $V_{{\rm SCE},i} = \sum_C M_{C,i} \cdot G_{{\rm sum},C}$
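A plain NumPy transcription of this sum is sketched below, for illustration only; in the actual workflow that follows, the saved masks may already carry the $G_{\rm sum}$ weighting:

```python
# Minimal sketch of the point-by-point SCE stack, assuming `masks` is a list of
# length-N arrays (one per cluster C) and `gsum_values` the matching G_sum values.
import numpy as np

def sce_stack(masks, gsum_values):
    v_sce = np.zeros_like(masks[0], dtype=float)
    for mask_c, gsum_c in zip(masks, gsum_values):
        v_sce += mask_c * gsum_c
    return v_sce
```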

Get the list of $G_{\rm sum}$ values, sorted in descending order

```python
file_path = 'som_results/SCE/'
file_name = 'multimap_mappings.txt'

from aweSOM.make_sce_clusters import get_gsum_values, plot_gsum_values

ranked_gsum_list, map_list = get_gsum_values(file_path + file_name)
```

Add these values together

```python
sce_sum = np.zeros(len(iris_data_transformed))
for i in range(len(ranked_gsum_list)):
    current_cluster_mask = np.load(f"{file_path}/mask-{map_list[i][2]}-id{map_list[i][1]}.npy")
    sce_sum += current_cluster_mask
```

Visualize the result

```python
fig, axs = plt.subplots(1, 2, figsize=(20, 10))
scatter_ground = axs[0].scatter(iris.data[:, 1], iris.data[:, 2], c=iris.target, cmap='viridis')
axs[0].set_xlabel('Sepal Width')
axs[0].set_ylabel('Petal Length')
axs[0].legend(scatter_ground.legend_elements()[0], iris.target_names, loc="upper right", title="Classes")
scatter_sce = axs[1].scatter(iris.data[:, 1], iris.data[:, 2], c=sce_sum, cmap='viridis')
axs[1].set_xlabel('Sepal Width')
axs[1].set_ylabel('Petal Length')
plt.colorbar(scatter_sce, ax=axs[1])
plt.show()
```

Comparison between ground truth and the SCE $G_{\rm sum}$ gradient

Set a cutoff in $\sum G_{\rm sum}$ to obtain three clusters:

```python
signal_cutoff = [8000, 15000]

sce_clusters = np.zeros(len(iris_data_transformed), dtype=int)
for i in range(len(sce_sum)):
    if sce_sum[i] < signal_cutoff[0]:
        sce_clusters[i] = 0
    elif sce_sum[i] < signal_cutoff[1]:
        sce_clusters[i] = 1
    else:
        sce_clusters[i] = 2
```
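The same three-way split can be written more compactly with np.digitize, which gives identical results for these cutoffs:

```python
# Vectorized equivalent of the loop above; sce_sum and signal_cutoff as defined earlier
sce_clusters = np.digitize(sce_sum, bins=signal_cutoff)
```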

The resulting SCE quality, computed with the same code as in Section 3, is:

```text
Number of correct predictions:  142
Accuracy =  94.66666666666667 %
Precision:  [100.0, 92.0, 92.0] %
Recall:  [100.0, 92.0, 92.0] %
```

Because of the simplicity of the Iris dataset, not much improvement is made with SCE, but the result is nevertheless consistent with the single SOM result.

5. Advanced Usage - Plasma Simulation

The Jupyter notebook for the fiducial realization of the SOM lattice is located here.

6. License

This project is licensed under the MIT License - see the LICENSE file for details.

7. Contributing

Anyone is welcome to contribute! Please fork the repository and create pull requests with proposed changes.

8. Contact

Additional inquiries/questions about aweSOM should be directed to my email: tvha@umass.edu

Owner

  • Name: Trung Ha
  • Login: tvh0021
  • Kind: user

JOSS Publication

aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis
Published
April 04, 2025
Volume 10, Issue 108, Page 7613
Authors
Trung Ha ORCID
Department of Astronomy, University of Massachusetts-Amherst, Amherst, MA 01003, USA, Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA, Department of Physics, University of North Texas, Denton, TX 76203, USA
Joonas Nättilä ORCID
Department of Physics, University of Helsinki, P.O. Box 64, University of Helsinki, FI-00014, Finland, Physics Department and Columbia Astrophysics Laboratory, Columbia University, 538 West 120th Street, New York, NY 10027, USA, Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA
Jordy Davelaar ORCID
Department of Astrophysical Sciences, Peyton Hall, Princeton University, Princeton, NJ 08544, USA, NASA Hubble Fellowship Program, Einstein Fellow, Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA, Physics Department and Columbia Astrophysics Laboratory, Columbia University, 538 West 120th Street, New York, NY 10027, USA
Editor
George K. Thiruvathukal ORCID
Tags
astronomy plasma

Citation (CITATION.cff)

cff-version: "1.2.0"
authors:
- family-names: Ha
  given-names: Trung
  orcid: "https://orcid.org/0000-0001-6600-2517"
- family-names: Nättilä
  given-names: Joonas
  orcid: "https://orcid.org/0000-0002-3226-4575"
- family-names: Davelaar
  given-names: Jordy
  orcid: "https://orcid.org/0000-0002-2685-2434"
contact:
- family-names: Ha
  given-names: Trung
  orcid: "https://orcid.org/0000-0001-6600-2517"
doi: 10.5281/zenodo.15098625
message: If you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Ha
    given-names: Trung
    orcid: "https://orcid.org/0000-0001-6600-2517"
  - family-names: Nättilä
    given-names: Joonas
    orcid: "https://orcid.org/0000-0002-3226-4575"
  - family-names: Davelaar
    given-names: Jordy
    orcid: "https://orcid.org/0000-0002-2685-2434"
  date-published: 2025-04-04
  doi: 10.21105/joss.07613
  issn: 2475-9066
  issue: 108
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 7613
  title: "aweSOM: a CPU/GPU-accelerated Self-organizing Map and
    Statistically Combined Ensemble Framework for Machine-learning
    Clustering Analysis"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.07613"
  volume: 10
title: "`aweSOM`: a CPU/GPU-accelerated Self-organizing Map and
  Statistically Combined Ensemble Framework for Machine-learning
  Clustering Analysis"

GitHub Events

Total
  • Issues event: 6
  • Watch event: 2
  • Issue comment event: 7
  • Push event: 16
  • Create event: 2
Last Year
  • Issues event: 6
  • Watch event: 2
  • Issue comment event: 7
  • Push event: 16
  • Create event: 2

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 119
  • Total Committers: 1
  • Avg Commits per committer: 119.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 105
  • Committers: 1
  • Avg Commits per committer: 105.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
tvh0021 t****a@m****u 119
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: 4 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 1.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: 4 days
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 1.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • apizzuto (3)
  • zhangkai07 (1)