https://github.com/alanoursland/k_factors

https://github.com/alanoursland/k_factors

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: alanoursland
  • Language: Python
  • Default Branch: main
  • Size: 205 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 12 months ago · Last pushed 10 months ago
Metadata Files
Readme

README.md

K‑Factors: A Unified Framework for Clustering with Local Subspaces

Status: Research Implementation
This repository contains a comprehensive PyTorch implementation of K-Factors and related clustering algorithms. While the code is feature-complete and tested, APIs may evolve based on research needs and community feedback.

Overview

K-Factors is a novel hard-assignment clustering algorithm that extends K-Means and K-Subspaces by sequentially extracting orthogonal basis vectors for each cluster. The soft-assignment counterpart, C-Factors, is the classic Mixture of Probabilistic PCA (MoPPCA), renamed for pedagogical symmetry with the K-Means/C-Means naming convention.

Algorithm Family at a Glance

| Method | Assignment | Cluster Representation | Key Innovation | |---------------|------------|--------------------------|------------------------------------------| | K-Means | Hard | Centroids | Simple, fast, well-understood | | K-Factors | Hard | Sequential R-dim basis | Penalty mechanism prevents direction reuse | | C-Factors | Soft (EM) | Mixture of PPCA | Classic MoPPCA with pedagogical renaming | | GMM | Soft (EM) | Full Gaussians | Flexible covariance structures |

Theoretical Foundation

The library implements clustering algorithms across two main dimensions:

Hard Assignment Soft Assignment | | Centroids: K-Means Fuzzy C-Means | | Subspaces: K-Lines/K-Subspaces (via soft assignment) | | Sequential: K-Factors C-Factors (Mixture of PPCA) | | Full Gaussian: (via hard assignment) Gaussian Mixture Model

Installation

```bash

Clone the repository

git clone https://github.com/yourusername/kfactors.git cd kfactors

Install in development mode

pip install -e .

Or install dependencies directly

pip install torch numpy matplotlib scikit-learn ```

Quick Start

Basic Usage Pattern

All algorithms follow the sklearn pattern:

```python import torch from kfactors import KMeans, KFactors, GaussianMixture

Generate sample data

X = torch.randn(1000, 20)

Fit any algorithm

model = KMeans(n_clusters=5) # or any other algorithm model.fit(X)

Get cluster assignments

labels = model.predict(X)

Get cluster centers (available for all algorithms)

centers = model.clustercenters ```

GPU Acceleration

```python

Automatic GPU usage if available

model = KMeans(n_clusters=5, device='auto')

Explicit device selection

model = KMeans(n_clusters=5, device='cuda:0')

CPU only

model = KMeans(n_clusters=5, device='cpu') ```

Core Algorithms

1. K-Means

Standard centroid-based clustering:

```python from kfactors import KMeans

kmeans = KMeans( nclusters=5, init='k-means++', # or 'random' or custom centers maxiter=100, tol=1e-4, verbose=1 ) kmeans.fit(X)

Results

labels = kmeans.labels_ centers = kmeans.clustercenters inertia = kmeans.inertia_ ```

2. K-Factors (Novel Algorithm)

Sequential subspace extraction with penalty mechanism:

```python from kfactors import KFactors

Key innovation: sequential extraction prevents direction collapse

kfactors = KFactors( nclusters=4, ncomponents=3, # dimensions extracted sequentially penaltyweight=0.9, # controls direction reuse penalty maxiter=30 ) kfactors.fit(X)

Inspect stage-wise progress

for stage in kfactors.stagehistory: print(f"Stage {stage['stage']}: {stage['iterations']} iterations")

Access the orthonormal bases

bases = kfactors.clusterbases # (nclusters, nfeatures, n_components) ```

3. C-Factors (Mixture of PPCA)

Soft clustering with low-rank Gaussian models:

```python from kfactors import CFactors

This is classical Mixture of PPCA, renamed for consistency

cfactors = CFactors( nclusters=4, ncomponents=3, # latent dimensions max_iter=100 ) cfactors.fit(X)

Transform to latent space

Z = cfactors.transform(X)

Get soft assignments

responsibilities = cfactors.predict_proba(X) ```

4. K-Subspaces

Clusters data along r-dimensional affine subspaces:

```python from kfactors import KSubspaces

Fit 2D subspaces in 10D data

ksub = KSubspaces( nclusters=4, ncomponents=2, # subspace dimension max_iter=50 ) ksub.fit(X)

Transform to local coordinates

Xlocal = ksub.transform(X) # (nsamples, n_components)

Reconstruct from local coordinates

Xreconstructed = ksub.inversetransform(X_local) ```

5. Gaussian Mixture Model

Full probabilistic clustering with various covariance types:

```python from kfactors import GaussianMixture

gmm = GaussianMixture( nclusters=3, covariancetype='full', # 'diagonal', 'spherical', 'tied' ninit=5, # number of initializations maxiter=100 ) gmm.fit(X)

Get soft assignments (probabilities)

probs = gmm.predict_proba(X)

Generate new samples

Xnew, ynew = gmm.sample(100)

Model selection

bic = gmm.bic(X) aic = gmm.aic(X) ```

6. Fuzzy C-Means

Soft clustering with fuzzy memberships:

```python from kfactors import FuzzyCMeans

fcm = FuzzyCMeans( nclusters=4, m=2.0, # fuzziness parameter (m>1) maxiter=100 ) fcm.fit(X)

Get fuzzy memberships

memberships = fcm.u_ # (nsamples, nclusters)

Evaluate fuzziness

fpc = fcm.partitioncoefficient() # closer to 1 = less fuzzy fpe = fcm.partitionentropy() # lower = less fuzzy ```

Research Applications

The learned local bases from K-Factors and C-Factors are particularly useful for:

  1. Neural Network Weight Initialization: Use cluster bases as informed priors
  2. Data Compression: Piecewise linear encoding with local coordinates
  3. Anomaly Detection: Points with large residuals indicate anomalies
  4. Transfer Learning: Learn bases on source domain, adapt on target
  5. Interpretable Dimensionality Reduction: Each cluster has its own interpretable subspace

Evaluation and Metrics

Internal Metrics (no ground truth)

```python from kfactors.utils.metrics import ( silhouettescore, daviesbouldinscore, calinskiharabasz_score )

Fit model

model = KMeans(nclusters=5) model.fit(X) labels = model.labels

Evaluate

sil = silhouettescore(X, labels) # higher is better [-1, 1] db = daviesbouldinscore(X, labels) # lower is better ch = calinskiharabasz_score(X, labels) # higher is better ```

External Metrics (with ground truth)

```python from kfactors.utils.metrics import ( adjustedrandscore, normalizedmutualinfoscore, vmeasure_score )

Compare with true labels

ari = adjustedrandscore(truelabels, predlabels) nmi = normalizedmutualinfoscore(truelabels, predlabels) homo, comp, vscore = vmeasurescore(truelabels, predlabels) ```

Visualization

Basic 2D/3D Visualization

```python from kfactors import plotclusters2d, plotclusters3d import matplotlib.pyplot as plt

For 2D data

plotclusters2d(X_2d, labels, centers=centers) plt.show()

For 3D data

fig = plt.figure() ax = fig.addsubplot(111, projection='3d') plotclusters3d(X3d, labels, centers=centers, ax=ax) plt.show() ```

Fuzzy Clustering Visualization

```python from kfactors import plotfuzzyclusters

Show membership strengths

plotfuzzyclusters(X2d, memberships, centers=centers, showall=True) plt.show() ```

Advanced Usage

Custom Algorithm with Builder

```python from kfactors import ClusteringBuilder

Build custom algorithm by mixing components

algorithm = (ClusteringBuilder() .withsubspacerepresentation(dim=3) .withhardassignment() .withpcaupdate() .withkmeansplusplusinit() .withobjectiveconvergence(tol=1e-4) .withdevice('cuda') .build(n_clusters=5))

algorithm.fit(X) ```

Data Preprocessing

```python from kfactors.utils.validation import scaledata, handlemissing_values

Scale data

Xscaled = scaledata(X, method='standard') # or 'minmax', 'robust'

Handle missing values

Xclean = handlemissingvalues(X, method='mean') # or 'dropsamples', 'median' ```

Memory-Efficient Large Data

```python from kfactors.utils.device import autoselectdevice, estimatememoryusage

Estimate memory needs

memusage = estimatememoryusage( nsamples=100000, nfeatures=1000, nclusters=10, algorithm='kfactors' )

Auto-select best device based on data size

device = autoselectdevice( data_size=X.nbytes, algorithm='kfactors' )

model = KFactors(n_clusters=10, device=device) ```

Implementation Architecture

The codebase follows a modular design that separates concerns:

kfactors/ ├── algorithms/ # Main clustering algorithms ├── base/ # Core abstractions and interfaces ├── representations/# How clusters are modeled ├── assignments/ # How points are assigned to clusters ├── updates/ # How parameters are updated ├── initialization/ # How to initialize clusters ├── utils/ # Helper functions and metrics └── visualization/ # Plotting utilities

This architecture allows for easy extension and experimentation with new clustering variants.

Performance Characteristics

| Algorithm | Time Complexity | Space Complexity | Best For | |-------------|---------------------------|--------------------|---------------------------------| | K-Means | O(nkd) | O(nd + kd) | Large data, spherical clusters | | K-Lines | O(nkd + kd²) | O(nd + kd) | Data along 1D manifolds | | K-Subspaces | O(nkdR + kd²R) | O(nd + kdR) | Linear manifold data | | K-Factors | O(nkdR² + kd²R) | O(nd + ndR + kdR) | Sequential subspace discovery | | GMM | O(nkd² + kd³) | O(nd + kd²) | Flexible cluster shapes | | C-Factors | O(nkdR + kd²R + kR³) | O(nd + kdR) | High-dim with low-rank structure|

where n=samples, k=clusters, d=features, R=subspace dimension

Algorithm Selection Guide

  • K-Means: Fast, simple, spherical clusters
  • K-Lines/K-Subspaces: Data lies along linear subspaces
  • K-Factors: Complex data with local subspace structure, need interpretable bases
  • GMM: Need probabilistic assignments, non-spherical clusters
  • C-Factors: High-dimensional data with low-rank structure
  • Fuzzy C-Means: Overlapping clusters, soft boundaries

Contributing

This is an active research project. Contributions are welcome for: - Implementing planned features - Adding new clustering variants - Improving performance - Adding visualization tools - Extending applications

Please open an issue to discuss major changes before submitting a PR.

Acknowledgements

This work builds on foundational contributions: - MacQueen (1967): K-Means clustering - Bradley & Mangasarian (2000): K-Planes/K-Subspaces - Tipping & Bishop (1999): Probabilistic PCA and Mixtures thereof - Ghahramani & Hinton (1996): Factor Analyzers

The K-Factors algorithm represents a novel synthesis of these ideas, introducing the sequential extraction mechanism with direction penalties.

Citation

If you use K-Factors in your research, please cite:

bibtex @software{kfactors2024, title = {K-Factors: A Unified Framework for Clustering with Local Subspaces}, author = {Your Name}, year = {2024}, url = {https://github.com/yourusername/kfactors} }

License

MIT License - see LICENSE file for details.

Quick Start

Basic Usage Pattern

All algorithms follow the sklearn pattern:

```python import torch from kfactors import KMeans, KFactors, GaussianMixture

Generate sample data

X = torch.randn(1000, 20)

Fit any algorithm

model = KMeans(n_clusters=5) # or any other algorithm model.fit(X)

Get cluster assignments

labels = model.predict(X)

Get cluster centers (available for all algorithms)

centers = model.clustercenters ```

GPU Acceleration

```python

Automatic GPU usage if available

model = KMeans(n_clusters=5, device='auto')

Explicit device selection

model = KMeans(n_clusters=5, device='cuda:0')

CPU only

model = KMeans(n_clusters=5, device='cpu') ```

Core Algorithms

1. K-Means

Standard centroid-based clustering:

```python from kfactors import KMeans

kmeans = KMeans( nclusters=5, init='k-means++', # or 'random' or custom centers maxiter=100, tol=1e-4, verbose=1 ) kmeans.fit(X)

Results

labels = kmeans.labels_ centers = kmeans.clustercenters inertia = kmeans.inertia_ ```

2. K-Subspaces

Clusters data along r-dimensional affine subspaces:

```python from kfactors import KSubspaces

Fit 2D subspaces in 10D data

ksub = KSubspaces( nclusters=4, ncomponents=2, # subspace dimension max_iter=50 ) ksub.fit(X)

Transform to local coordinates

Xlocal = ksub.transform(X) # (nsamples, n_components)

Reconstruct from local coordinates

Xreconstructed = ksub.inversetransform(X_local) ```

3. K-Factors

Sequential subspace extraction with penalty mechanism:

```python from kfactors import KFactors

kfactors = KFactors( nclusters=4, ncomponents=3, # dimensions per cluster penaltyweight=0.9, # penalty for reusing directions maxiter=30 ) kfactors.fit(X)

Inspect stage-wise progress

for stage in kfactors.stagehistory: print(f"Stage {stage['stage']}: {stage['iterations']} iterations") ```

4. Gaussian Mixture Model

Full probabilistic clustering with various covariance types:

```python from kfactors import GaussianMixture

gmm = GaussianMixture( nclusters=3, covariancetype='full', # 'diagonal', 'spherical', 'tied' ninit=5, # number of initializations maxiter=100 ) gmm.fit(X)

Get soft assignments (probabilities)

probs = gmm.predict_proba(X)

Generate new samples

Xnew, ynew = gmm.sample(100)

Model selection

bic = gmm.bic(X) aic = gmm.aic(X) ```

5. C-Factors (Mixture of PPCA)

Soft clustering with low-rank Gaussian models:

```python from kfactors import CFactors

cfactors = CFactors( nclusters=4, ncomponents=3, # latent dimensions max_iter=100 ) cfactors.fit(X)

Transform to latent space

Z = cfactors.transform(X)

Get soft assignments

responsibilities = cfactors.predict_proba(X) ```

6. Fuzzy C-Means

Soft clustering with fuzzy memberships:

```python from kfactors import FuzzyCMeans

fcm = FuzzyCMeans( nclusters=4, m=2.0, # fuzziness parameter (m>1) maxiter=100 ) fcm.fit(X)

Get fuzzy memberships

memberships = fcm.u_ # (nsamples, nclusters)

Evaluate fuzziness

fpc = fcm.partitioncoefficient() # closer to 1 = less fuzzy fpe = fcm.partitionentropy() # lower = less fuzzy ```

Evaluation and Metrics

Internal Metrics (no ground truth)

```python from kfactors.utils.metrics import ( silhouettescore, daviesbouldinscore, calinskiharabasz_score )

Fit model

model = KMeans(nclusters=5) model.fit(X) labels = model.labels

Evaluate

sil = silhouettescore(X, labels) # higher is better [-1, 1] db = daviesbouldinscore(X, labels) # lower is better ch = calinskiharabasz_score(X, labels) # higher is better ```

External Metrics (with ground truth)

```python from kfactors.utils.metrics import ( adjustedrandscore, normalizedmutualinfoscore, vmeasure_score )

Compare with true labels

ari = adjustedrandscore(truelabels, predlabels) nmi = normalizedmutualinfoscore(truelabels, predlabels) homo, comp, vscore = vmeasurescore(truelabels, predlabels) ```

Visualization

Basic 2D/3D Visualization

```python from kfactors import plotclusters2d, plotclusters3d import matplotlib.pyplot as plt

For 2D data

plotclusters2d(X_2d, labels, centers=centers) plt.show()

For 3D data

fig = plt.figure() ax = fig.addsubplot(111, projection='3d') plotclusters3d(X3d, labels, centers=centers, ax=ax) plt.show() ```

Fuzzy Clustering Visualization

```python from kfactors import plotfuzzyclusters

Show membership strengths

plotfuzzyclusters(X2d, memberships, centers=centers, showall=True) plt.show() ```

Decision Boundaries

```python from kfactors import plotclusterboundaries

Requires 2D data

plotclusterboundaries(X_2d, model, resolution=200) plt.show() ```

Advanced Usage

Custom Algorithm with Builder

```python from kfactors import ClusteringBuilder

Build custom algorithm by mixing components

algorithm = (ClusteringBuilder() .withsubspacerepresentation(dim=3) .withhardassignment() .withpcaupdate() .withkmeansplusplusinit() .withobjectiveconvergence(tol=1e-4) .withdevice('cuda') .build(n_clusters=5))

algorithm.fit(X) ```

Data Preprocessing

```python from kfactors.utils.validation import scaledata, handlemissing_values

Scale data

Xscaled = scaledata(X, method='standard') # or 'minmax', 'robust'

Handle missing values

Xclean = handlemissingvalues(X, method='mean') # or 'dropsamples', 'median' ```

Memory-Efficient Large Data

```python from kfactors.utils.device import autoselectdevice, estimatememoryusage

Estimate memory needs

memusage = estimatememoryusage( nsamples=100000, nfeatures=1000, nclusters=10, algorithm='kfactors' )

Auto-select best device based on data size

device = autoselectdevice( data_size=X.nbytes, algorithm='kfactors' )

model = KFactors(n_clusters=10, device=device) ```

Design Patterns

1. Consistent API

All algorithms implement: - fit(X): Train the model - predict(X): Get hard cluster assignments - fit_predict(X): Fit and return labels - cluster_centers_: Access cluster centers

2. Soft Clustering

Algorithms with soft assignment additionally provide: - predict_proba(X): Get membership probabilities - score(X): Compute log-likelihood

3. Subspace Methods

Subspace-based algorithms (K-Subspaces, K-Factors, C-Factors) provide: - transform(X): Project to local coordinates - inverse_transform(X_transformed): Reconstruct original space

4. Component Architecture

The library uses modular components: - Representations: How clusters are modeled (centroids, subspaces, Gaussians) - Assignments: How points are assigned (hard, soft, fuzzy, penalized) - Updates: How parameters are updated (mean, PCA, EM) - Initialization: How to start (random, k-means++, spectral)

Performance Tips

  1. Use GPU for large datasets: Automatic with device='auto'
  2. Preprocessing: Scale features for better convergence
  3. Multiple runs: Use n_init for probabilistic algorithms
  4. Convergence: Adjust tol and max_iter based on data size
  5. Memory: Use device.estimate_memory_usage() for large data

Algorithm Selection Guide

  • K-Means: Fast, simple, spherical clusters
  • K-Lines/K-Subspaces: Data lies along linear subspaces
  • K-Factors: Complex data with local subspace structure
  • GMM: Need probabilistic assignments, non-spherical clusters
  • C-Factors: High-dimensional data with low-rank structure
  • Fuzzy C-Means: Overlapping clusters, soft boundaries

Citation

If you use K-Factors in your research, please cite:

bibtex @software{kfactors2024, title = {K-Factors: A Unified Framework for Clustering with Local Subspaces}, author = {Alan Oursland}, year = {2024}, url = {https://github.com/alanoursland/kfactors} }

Owner

  • Name: Alan Oursland
  • Login: alanoursland
  • Kind: user

GitHub Events

Total
  • Push event: 14
  • Create event: 2
Last Year
  • Push event: 14
  • Create event: 2