aweSOM
aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis - Published in JOSS (2025)
Science Score: 100.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in JOSS metadata
- ✓ Academic publication links: links to arxiv.org, joss.theoj.org
- ✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Scientific Fields
Repository
Accelerated Self-organizing Map (SOM) and Statistically Combined Ensemble (SCE)
Basic Info
- Host: GitHub
- Owner: tvh0021
- License: MIT
- Language: Python
- Default Branch: public-release
- Homepage: https://awesom.readthedocs.io/
- Size: 68.7 MB
Statistics
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 1
- Releases: 1
Metadata Files
README.md
aweSOM - Accelerated Self-organizing Map (SOM) and Statistically Combined Ensemble (SCE)
This package combines a JIT-accelerated and parallelized implementation of SOM (integrating parts of POPSOM) with a GPU-accelerated implementation of SCE that uses ensemble learning. It is optimized for large datasets, up to $\sim 10^8$ points.
aweSOM is developed specifically to identify intermittent structures (current sheets) in 3D plasma simulations (Ha et al., 2024). However, it can also be used for a variety of clustering and classification tasks.
Authors:
Trung Ha - University of Massachusetts-Amherst, Joonas Nättilä - University of Helsinki, Jordy Davelaar - Princeton University.
Version: 1.1.0
1. Installation
- Install aweSOM and required dependencies:
```bash
git clone https://github.com/tvh0021/aweSOM.git
cd aweSOM
pip install .
```
- Install JAX with CUDA support separately:
```bash
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```
If your system does not support CUDA, you can skip this step. SCE will automatically fall back to the CPU. However, the CPU-only version can be significantly slower for large datasets (see the performance tests).
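To confirm which backend JAX ended up with, you can list its compute devices (`jax.devices()` is standard JAX API; the exact device names printed are illustrative):

```python
import jax

# GPU/CUDA entries mean the GPU backend is active;
# CpuDevice entries mean SCE will run on the CPU fallback.
print(jax.devices())
```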
Experimental
Apple Silicon users can also use JAX with Metal support; follow the instructions in the JAX documentation to install the Metal backend.
2. Testing
We use pytest for the test module. The dependency is already listed in the requirements.txt file and should be installed automatically with aweSOM.
To run tests for all modules in the root directory of aweSOM:
```bash
python -m pytest
```
You can also run specific test modules by specifying the path to the test file:
```bash
python -m pytest tests/[module]_test.py
```
Or run a specific test function within a module:
```bash
python -m pytest tests/[module]_test.py::test_[function]
```
If there is no GPU, or if the GPU is not CUDA-compatible, the sce_test.py module will fail partially. This is expected behavior, and SCE computation should still fall back to the CPU.
3. Basic Usage - SOM
Here are the basic steps to initialize a lattice and train the SOM to classify the Iris dataset.
The full Jupyter notebook can be found here.
```python
import numpy as np
import matplotlib.pyplot as plt
from aweSOM import Lattice
```
First, load the dataset and normalize
```python
from sklearn.datasets import load_iris

iris = load_iris()

print("Shape of the data :", iris.data.shape)
print("Labeled classes :", iris.target_names)
print("Features in the set :", iris.feature_names)
```
```text
Shape of the data : (150, 4)
Labeled classes : ['setosa' 'versicolor' 'virginica']
Features in the set : ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```
Normalize the data with a custom scaler
```python
import aweSOM.run_som as rs

iris_data_transformed = rs.manual_scaling(iris.data)
```
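As a quick sanity check (plain NumPy, nothing aweSOM-specific), you can inspect the range of the transformed features; the exact bounds depend on how manual_scaling is defined:

```python
# Per-feature minima and maxima after scaling
print(iris_data_transformed.min(axis=0))
print(iris_data_transformed.max(axis=0))
```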
Initialize the lattice and train
```python
# Create an initial SOM instance
map = Lattice(xdim=40, ydim=15, alpha_0=0.5, train=100000)

# Train the SOM with some data in the shape of $N \times F$
true_labels = iris.target
feature_names = iris.feature_names
map.train_lattice(iris_data_transformed, feature_names, true_labels)
```
The trained SOM is saved at `map.lattice`.
To visualize the SOM, plot the U-matrix, which is saved at the end of training at `map.umat`:
```python
# Compute the unique centroids
naive_centroids_matrix = map.compute_centroids()  # return the centroid associated with each node
unique_centroids = map.get_unique_centroids(map.compute_centroids())  # return the individual centroids

# Shift centroid positions to the cell centers for plotting
plot_centroids = {}
plot_centroids['position_x'] = [x + 0.5 for x in unique_centroids['position_x']]
plot_centroids['position_y'] = [y + 0.5 for y in unique_centroids['position_y']]

xdim, ydim = 40, 15  # lattice dimensions used at initialization
X, Y = np.meshgrid(np.arange(xdim) + 0.5, np.arange(ydim) + 0.5)

plt.figure(dpi=250)
plt.pcolormesh(map.umat.T, cmap='viridis')
plt.scatter(plot_centroids['position_x'], plot_centroids['position_y'], color='red', s=10)
plt.colorbar(fraction=0.02)
plt.contour(X, Y, map.umat.T, levels=np.linspace(np.min(map.umat), np.max(map.umat), 20), colors='black', alpha=0.5)
plt.gca().set_aspect("equal")
plt.title(rf'U-matrix for {xdim}x{ydim} SOM')
```

There are 15 centroids in this U-matrix, so there are 15 clusters. From the geometry of the U-matrix, we can see that there are clearly at least two clusters (separated by the large band of high-value nodes), and at most four.
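To count the centroids programmatically (a minimal sketch, assuming unique_centroids stores parallel position lists as used in the plotting code above):

```python
# Number of unique centroids = number of clusters before merging
n_centroids = len(unique_centroids['position_x'])
print(f"{n_centroids} unique centroids on the lattice")
```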
Merge clusters using cost function
```python
merge_threshold = 0.2  # empirical tests reveal a threshold between 0.2 and 0.4 usually works best

# Plot the U-matrix with the connected components and ground truth labels
# (if the labels were supplied during map.train_lattice)
map.plot_heat(map.umat, merge=True, merge_cost=merge_threshold)
```

Now, we project each data point onto the lattice and get back cluster IDs:
```python
final_clusters = map.assign_cluster_to_lattice(smoothing=None, merge_cost=merge_threshold)
som_labels = map.assign_cluster_to_data(map.projection_2d, final_clusters)
```
Finally, we compare the aweSOM result to the ground truth
```python
fig, axs = plt.subplots(1, 2, figsize=(20, 10))

scatter_ground = axs[0].scatter(iris.data[:, 1], iris.data[:, 2], c=iris.target, cmap='viridis')
axs[0].set_xlabel('Sepal Width')
axs[0].set_ylabel('Petal Length')
axs[0].legend(scatter_ground.legend_elements()[0], iris.target_names, loc="upper right", title="Classes")

scatter_som = axs[1].scatter(iris.data[:, 1], iris.data[:, 2], c=som_labels, cmap='viridis')
axs[1].set_xlabel('Sepal Width')
axs[1].set_ylabel('Petal Length')
axs[1].legend(scatter_som.legend_elements()[0], np.unique(final_clusters), loc="upper right", title="aweSOM")

plt.show()
```

Clearly, the mapping is: {'setosa' : 2, 'versicolor' : 1, 'virginica' : 0}
```python
# Assign cluster number to class label; change manually
label_map = {'setosa': 2, 'versicolor': 1, 'virginica': 0}
correct_label = 0

for i in range(len(som_labels)):
    if int(som_labels[i]) == label_map[iris.target_names[iris.target[i]]]:
        correct_label += 1

print("Number of correct predictions: ", correct_label)
print("Accuracy = ", correct_label / len(som_labels) * 100, "%")

# Precision and recall by class
precision = np.zeros(3)
recall = np.zeros(3)

for i in range(3):
    tp = 0
    fp = 0
    fn = 0
    for j in range(len(som_labels)):
        if int(som_labels[j]) == label_map[iris.target_names[i]]:
            if iris.target[j] == i:
                tp += 1
            else:
                fp += 1
        else:
            if iris.target[j] == i:
                fn += 1
    precision[i] = tp / (tp + fp)
    recall[i] = tp / (tp + fn)

print("Precision: ", [float(np.round(precision[i], 4)) * 100 for i in range(3)], "%")
print("Recall: ", [float(np.round(recall[i], 4)) * 100 for i in range(3)], "%")
```
```text
Number of correct predictions:  141
Accuracy =  94.0 %
Precision:  [100.0, 90.2, 91.84] %
Recall:  [100.0, 92.0, 90.0] %
```
This is the performance of the aweSOM model.
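For a cross-check, the same numbers can be computed with scikit-learn's metrics (a sketch; it assumes the label_map remapping above so that predicted cluster IDs align with the true class indices):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Map SOM cluster IDs back to Iris class indices via label_map
inverse_map = {v: k for k, v in label_map.items()}
predicted = [list(iris.target_names).index(inverse_map[int(lab)]) for lab in som_labels]

print("Accuracy :", accuracy_score(iris.target, predicted))
print("Precision:", precision_score(iris.target, predicted, average=None))
print("Recall   :", recall_score(iris.target, predicted, average=None))
```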
4. Basic Usage - SCE
If a dataset is complex, a single SOM result might not be sufficiently stable. Instead, we can generate multiple SOM realizations with slightly different initial parameters, then stack the results into a set of statistically significant clusters.
Let's use the same Iris dataset:
```python
from aweSOM.run_som import save_cluster_labels

# set a parameter space to scan
parameters = {"xdim": [38, 40, 42], "ydim": [14, 16], "alpha_0": [0.1, 0.5], "train": [10000, 50000, 100000]}
merge_threshold = 0.2

for xdim in parameters["xdim"]:
    for ydim in parameters["ydim"]:
        for alpha_0 in parameters["alpha_0"]:
            for train in parameters["train"]:
                print(f'constructing aweSOM lattice for xdim={xdim}, ydim={ydim}, alpha={alpha_0}, train={train}...', flush=True)
                map = Lattice(xdim, ydim, alpha_0, train)
                map.train_lattice(iris_data_transformed, feature_names, true_labels)
                projection_2d = map.map_data_to_lattice()
                final_clusters = map.assign_cluster_to_lattice(smoothing=None, merge_cost=merge_threshold)
                som_labels = map.assign_cluster_to_data(projection_2d, final_clusters)
                save_cluster_labels(som_labels, xdim, ydim, alpha_0, train, name_of_dataset='iris')
```
This saves 36 realizations to the current working directory.
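The four nested loops can equivalently be flattened with itertools.product; a minimal sketch using the same names as above:

```python
from itertools import product

# Cartesian product of all parameter values: 3 * 2 * 2 * 3 = 36 combinations
for xdim, ydim, alpha_0, train in product(parameters["xdim"], parameters["ydim"],
                                          parameters["alpha_0"], parameters["train"]):
    map = Lattice(xdim, ydim, alpha_0, train)
    map.train_lattice(iris_data_transformed, feature_names, true_labels)
    projection_2d = map.map_data_to_lattice()
    final_clusters = map.assign_cluster_to_lattice(smoothing=None, merge_cost=merge_threshold)
    som_labels = map.assign_cluster_to_data(projection_2d, final_clusters)
    save_cluster_labels(som_labels, xdim, ydim, alpha_0, train, name_of_dataset='iris')
```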
In the terminal (skip this step if you use the pre-generated files):
```bash
cd [path_to_aweSOM]/aweSOM/examples/iris/
mkdir som_results
mv labels* som_results/
```
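Before running the SCE script, it may be worth confirming that all 36 label files made it into som_results/ (a minimal check, assuming the labels* file naming produced by save_cluster_labels):

```python
from glob import glob

# One labels* file per SOM realization; the parameter scan above produced 36
label_files = sorted(glob("som_results/labels*"))
print(f"{len(label_files)} label files found")
```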
Then,
```bash
cd som_results/
python3 [path_to_aweSOM]/aweSOM/src/aweSOM/sce.py --subfolder SCE --dims 150
```
This will create (or append to) the multimap_mappings.txt file inside som_results/SCE/ with the $G_{\rm sum}$ value for each cluster. It also saves the mask for each cluster $C$ as a .npy file.
In its simplest form, the SCE stacking can be performed point-by-point: $V_{{\rm SCE},i} = \sum_C M_{i,C} \cdot G_{{\rm sum},C}$, where $M_{i,C}$ is the mask of cluster $C$ evaluated at point $i$.
Get the list of $G_{\rm sum}$ values, sorted in descending order
```python
from aweSOM.make_sce_clusters import get_gsum_values, plot_gsum_values

file_path = 'som_results/SCE/'
file_name = 'multimap_mappings.txt'

ranked_gsum_list, map_list = get_gsum_values(file_path + file_name)
```
Add these values together
```python
sce_sum = np.zeros((len(iris_data_transformed)))

for i in range(len(ranked_gsum_list)):
    current_cluster_mask = np.load(f"{file_path}/mask-{map_list[i][2]}-id{map_list[i][1]}.npy")
    sce_sum += current_cluster_mask
```
Visualize the result
```python
fig, axs = plt.subplots(1, 2, figsize=(20, 10))

scatter_ground = axs[0].scatter(iris.data[:, 1], iris.data[:, 2], c=iris.target, cmap='viridis')
axs[0].set_xlabel('Sepal Width')
axs[0].set_ylabel('Petal Length')
axs[0].legend(scatter_ground.legend_elements()[0], iris.target_names, loc="upper right", title="Classes")

scatter_sce = axs[1].scatter(iris.data[:, 1], iris.data[:, 2], c=sce_sum, cmap='viridis')
axs[1].set_xlabel('Sepal Width')
axs[1].set_ylabel('Petal Length')
plt.colorbar(scatter_sce, ax=axs[1])

plt.show()
```

Set a cutoff in $\Sigma G_{\rm sum}$ to obtain three clusters
```python
signal_cutoff = [8000, 15000]

sce_clusters = np.zeros((len(iris_data_transformed)), dtype=int)

for i in range(len(sce_sum)):
    if sce_sum[i] < signal_cutoff[0]:
        sce_clusters[i] = 0
    elif sce_sum[i] < signal_cutoff[1]:
        sce_clusters[i] = 1
    else:
        sce_clusters[i] = 2
```
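The same thresholding can be written as a one-liner with np.digitize, which bins each value by the two cutoffs and is equivalent to the loop above:

```python
# 0 below the first cutoff, 1 between the cutoffs, 2 above the second
sce_clusters = np.digitize(sce_sum, signal_cutoff)
```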
The resulting SCE quality is (using the same code as in section 3):
```text
Number of correct predictions:  142
Accuracy =  94.66666666666667 %
Precision:  [100.0, 92.0, 92.0] %
Recall:  [100.0, 92.0, 92.0] %
```
Because of the simplicity of the Iris dataset, SCE does not improve much on the single SOM result, but it is nevertheless consistent with it.
5. Advanced Usage - Plasma Simulation
The Jupyter Notebook for the fiducial realization of the SOM lattice is located here.
6. License
This project is licensed under the MIT License - see the LICENSE file for details.
7. Contributing
Anyone is welcome to contribute! Please fork the repository and create pull requests with proposed changes.
8. Contact
Additional inquiries/questions about aweSOM should be directed to my email: tvha@umass.edu
Owner
- Name: Trung Ha
- Login: tvh0021
- Kind: user
- Repositories: 1
- Profile: https://github.com/tvh0021
JOSS Publication
aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis
Authors
Trung Ha: Department of Astronomy, University of Massachusetts-Amherst, Amherst, MA 01003, USA; Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA; Department of Physics, University of North Texas, Denton, TX 76203, USA
Joonas Nättilä: Department of Physics, University of Helsinki, P.O. Box 64, University of Helsinki, FI-00014, Finland; Physics Department and Columbia Astrophysics Laboratory, Columbia University, 538 West 120th Street, New York, NY 10027, USA; Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA
Jordy Davelaar: Department of Astrophysical Sciences, Peyton Hall, Princeton University, Princeton, NJ 08544, USA; NASA Hubble Fellowship Program, Einstein Fellow; Center for Computational Astrophysics, Flatiron Institute, 162 Fifth Avenue, New York, NY 10010, USA; Physics Department and Columbia Astrophysics Laboratory, Columbia University, 538 West 120th Street, New York, NY 10027, USA
Tags
astronomy, plasma
Citation (CITATION.cff)
cff-version: "1.2.0"
authors:
- family-names: Ha
given-names: Trung
orcid: "https://orcid.org/0000-0001-6600-2517"
- family-names: Nättilä
given-names: Joonas
orcid: "https://orcid.org/0000-0002-3226-4575"
- family-names: Davelaar
given-names: Jordy
orcid: "https://orcid.org/0000-0002-2685-2434"
contact:
- family-names: Ha
given-names: Trung
orcid: "https://orcid.org/0000-0001-6600-2517"
doi: 10.5281/zenodo.15098625
message: If you use this software, please cite our article in the
Journal of Open Source Software.
preferred-citation:
authors:
- family-names: Ha
given-names: Trung
orcid: "https://orcid.org/0000-0001-6600-2517"
- family-names: Nättilä
given-names: Joonas
orcid: "https://orcid.org/0000-0002-3226-4575"
- family-names: Davelaar
given-names: Jordy
orcid: "https://orcid.org/0000-0002-2685-2434"
date-published: 2025-04-04
doi: 10.21105/joss.07613
issn: 2475-9066
issue: 108
journal: Journal of Open Source Software
publisher:
name: Open Journals
start: 7613
title: "aweSOM: a CPU/GPU-accelerated Self-organizing Map and
Statistically Combined Ensemble Framework for Machine-learning
Clustering Analysis"
type: article
url: "https://joss.theoj.org/papers/10.21105/joss.07613"
volume: 10
title: "`aweSOM`: a CPU/GPU-accelerated Self-organizing Map and
Statistically Combined Ensemble Framework for Machine-learning
Clustering Analysis"
GitHub Events
Total
- Issues event: 6
- Watch event: 2
- Issue comment event: 7
- Push event: 16
- Create event: 2
Last Year
- Issues event: 6
- Watch event: 2
- Issue comment event: 7
- Push event: 16
- Create event: 2
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| tvh0021 | t****a@m****u | 119 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 4
- Total pull requests: 0
- Average time to close issues: 4 days
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 1.75
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 0
- Average time to close issues: 4 days
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 0
- Average comments per issue: 1.75
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- apizzuto (3)
- zhangkai07 (1)