torchsom: The Reference PyTorch Library for Self-Organizing Maps

Berthier, Shokry, Moreaud et al.
This paper introduces torchsom, an open-source Python library that provides a reference implementation of the Self-Organizing Map (SOM) in PyTorch. The package offers three main features: (i) dimensionality reduction, (ii) clustering, and (iii) user-friendly data visualization. It relies on a PyTorch backend, enabling (i) fast and efficient training of SOMs through GPU acceleration, and (ii) easy and scalable integration with the PyTorch ecosystem. Moreover, torchsom follows the scikit-learn API for ease of use and extensibility. The library is released under the Apache 2.0 license with 90% test coverage, and its source code and documentation are available at https://github.com/michelin/TorchSOM.

Basic Information

  • Paper ID: 2510.11147
  • Title: torchsom: The Reference PyTorch Library for Self-Organizing Maps
  • Authors: Louis Berthier, Ahmed Shokry, Maxime Moreaud, Guillaume Ramelet, Eric Moulines
  • Classification: stat.ML, cs.LG
  • Publication Date: October 13, 2025
  • Paper Link: https://arxiv.org/abs/2510.11147

Abstract

This paper introduces torchsom, an open-source Python library based on PyTorch that provides a reference implementation of Self-Organizing Maps (SOMs). The library offers three primary functionalities: (1) dimensionality reduction, (2) clustering, and (3) user-friendly data visualization. Through its PyTorch backend, the library enables (1) fast and efficient SOM training with GPU acceleration, and (2) seamless, extensible integration with the PyTorch ecosystem. Furthermore, torchsom follows the scikit-learn API design paradigm for ease of use and extensibility. The library is released under the Apache 2.0 license with 90% test coverage.

Research Background and Motivation

Problem Definition

Although Self-Organizing Maps (SOMs) remain an important and enduring machine learning technique with significant value in modern data analysis, existing Python SOM implementations suffer from notable deficiencies:

  1. Outdated Technical Architecture: Lack of GPU acceleration support
  2. Insufficient Ecosystem Integration: Difficulty integrating with modern deep learning frameworks
  3. Poor User Experience: Absence of user-friendly APIs and visualization capabilities
  4. Maintenance Issues: Existing libraries are poorly maintained with incomplete documentation

Research Significance

SOMs have broad application value across multiple domains:

  • Energy Industry: System monitoring and control
  • Biomedical Applications: Gene expression analysis, medical image processing
  • IoT Systems: Anomaly detection and pattern recognition
  • Chemical and Environmental Applications: Pollutant analysis and environmental monitoring
  • Business Cases: Market segmentation and customer analysis

Limitations of Existing Methods

Through comparative analysis of existing Python SOM libraries (MiniSom, SimpSOM, SOMPY, somoclu, som-pbc), the following issues are identified:

  1. Performance Limitations: Most are NumPy-based, lacking CUDA acceleration
  2. Incomplete Functionality: Absence of built-in clustering and advanced visualization features
  3. Insufficient Software Engineering Practices: Low test coverage, inadequate documentation
  4. Poor Extensibility: Low modularity, difficult to customize and extend

Core Contributions

  1. First Comprehensive PyTorch-based SOM Library: Provides complete SOM implementation supporting GPU acceleration and modern deep learning workflow integration
  2. Standardized API Design: Follows scikit-learn API style for consistent user experience
  3. Rich Visualization Tools: Provides 9 categories of visualization functionality supporting rectangular and hexagonal topologies
  4. Built-in Clustering Functionality: Integrates K-means, GMM, and HDBSCAN clustering algorithms
  5. High-Quality Software Engineering: 90% test coverage, complete documentation, modular design

Methodology Details

Task Definition

torchsom aims to provide a modernized SOM implementation supporting:

  • Input: High-dimensional datasets X ∈ ℝ^(N×k), where N is the number of samples and k is the feature dimension
  • Output: Trained SOM network, low-dimensional mapped representation, clustering results
  • Constraints: Preserve topological structure, minimize quantization error and topological error

Model Architecture

1. Core Module (torchsom.core)

Implements core functionality of the classical SOM algorithm:

  • fit(): Model training with automatic GPU acceleration support
  • cluster(): Clustering functionality
  • build_map(): Generate mappings suitable for visualization
  • collect_sample(): Identify optimal samples using topological and latent space distances

2. Utilities Module (torchsom.utils)

Provides foundational components for SOM parameterization and training:

  • Decay Functions: Learning rate and neighborhood width scheduling
  • Distance Metrics: Euclidean, cosine, Manhattan, Chebyshev distances
  • Neighborhood Kernels: Gaussian, Mexican hat, bubble, triangular kernel functions
  • Clustering Methods: K-means, GMM, HDBSCAN
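
The four distance metrics named above can be written out in a few lines. The following is a dependency-free sketch in plain Python, not torchsom's PyTorch code; the function names are hypothetical, and cosine distance is assumed here to mean one minus cosine similarity for non-zero vectors.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute coordinate difference (L-infinity distance)."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Any of these functions could serve as the measure used to pick a best matching unit; only the ranking of neurons by distance matters for that step.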

3. Visualization Module (torchsom.visualization)

Provides 9 categories of comprehensive visualization functionality:

  • U-matrix: Mapping topology and clustering structure
  • Hit maps: Neuron activation patterns
  • Component planes: Feature-level analysis
  • Classification/metric maps: Target statistics
  • Score/rank maps: Quality assessment
  • Training curves: Convergence monitoring
  • Clustering maps: Clustering quality metrics

Technical Innovations

1. PyTorch Integration Advantages

Weight update equation:
w_ij(t+1) = w_ij(t) + α(t) · h_ij(t) · (x - w_ij(t))

Where:

  • w_ij(t): Weight vector of the neuron at grid position (i, j) at iteration t
  • α(t): Learning rate
  • h_ij(t): Neighborhood function
  • x: Input feature vector
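
The update rule above can be traced by hand on a tiny map. The sketch below is plain Python rather than PyTorch so that it runs without dependencies; it assumes a Gaussian neighborhood and Euclidean distance, and the helper names (`bmu_index`, `gaussian_h`, `som_update`) are hypothetical, not torchsom functions.

```python
import math

def bmu_index(weights, x):
    """Grid position (i, j) of the neuron whose weights are closest to x."""
    best, best_d = None, float("inf")
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            d = math.dist(w, x)
            if d < best_d:
                best, best_d = (i, j), d
    return best

def gaussian_h(i, j, bmu, sigma):
    """Gaussian neighborhood value based on squared grid distance to the BMU."""
    d2 = (i - bmu[0]) ** 2 + (j - bmu[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

def som_update(weights, x, alpha, sigma):
    """Apply w_ij <- w_ij + alpha * h_ij * (x - w_ij) to every neuron, in place."""
    bmu = bmu_index(weights, x)
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            h = gaussian_h(i, j, bmu, sigma)
            row[j] = [wk + alpha * h * (xk - wk) for wk, xk in zip(w, x)]
    return bmu

# 2x2 map with 2-dimensional weight vectors.
weights = [[[0.0, 0.0], [1.0, 1.0]],
           [[2.0, 2.0], [3.0, 3.0]]]
bmu = som_update(weights, x=[0.2, 0.0], alpha=0.5, sigma=1.0)
```

The best matching unit moves halfway toward x (h = 1 at the BMU, α = 0.5), while the other neurons move by a smaller amount that shrinks with their grid distance from the BMU.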

2. Efficient Batch Processing Implementation

Through PyTorch's tensor operations and GPU parallel computing, batch processing is implemented to significantly enhance training efficiency.
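
As an illustration of what a batch update can look like, the sketch below implements the classical batch-SOM rule, in which each neuron becomes the neighborhood-weighted mean of all samples in the batch. This is an assumption for illustration: the source does not state which batch rule torchsom uses, the function name `batch_som_step` is hypothetical, and plain Python loops stand in for what would be single tensor operations on a GPU.

```python
import math

def batch_som_step(weights, coords, batch, sigma):
    """One batch update over a flat list of neurons.

    weights[j] is the weight vector of neuron j and coords[j] its grid position.
    """
    # 1) Find the best matching unit (BMU) of every sample.
    bmus = [min(range(len(weights)), key=lambda j: math.dist(weights[j], x))
            for x in batch]
    new_weights = []
    for j, w in enumerate(weights):
        # 2) Gaussian neighborhood between neuron j and each sample's BMU.
        h = [math.exp(-math.dist(coords[j], coords[b]) ** 2 / (2.0 * sigma ** 2))
             for b in bmus]
        total = sum(h)
        if total == 0.0:
            # No sample influences this neuron; keep its old weights.
            new_weights.append(list(w))
        else:
            # 3) Neighborhood-weighted mean of the batch.
            new_weights.append([sum(hi * x[k] for hi, x in zip(h, batch)) / total
                                for k in range(len(w))])
    return new_weights

coords = [(0, 0), (0, 1), (0, 2)]           # 1x3 grid
weights = [[0.0], [5.0], [10.0]]
batch = [[0.0], [1.0], [9.0], [10.0]]
updated = batch_som_step(weights, coords, batch, sigma=0.5)
```

Because every sample's contribution is accumulated before any weight changes, the per-sample work is independent and maps naturally onto parallel hardware.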

3. Multiple Neighborhood Functions

Supports four neighborhood functions:

  • Gaussian Kernel: h^Gaussian_ij(t) = exp(-d²_ij / (2σ(t)²))
  • Mexican Hat Kernel: Provides more complex neighborhood relationships
  • Bubble Kernel: Step function form
  • Triangular Kernel: Linear decay
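
The four kernels can be compared side by side as functions of the grid distance d to the best matching unit. Only the Gaussian form is given in the text above; the Mexican hat, bubble, and triangular forms below are common textbook variants chosen for illustration and may differ in constants from torchsom's definitions. All function names are hypothetical.

```python
import math

def gaussian(d, sigma):
    """exp(-d^2 / (2 sigma^2)), as stated in the text."""
    return math.exp(-d ** 2 / (2.0 * sigma ** 2))

def mexican_hat(d, sigma):
    """Assumed form: positive near the BMU, negative at mid range."""
    r2 = d ** 2 / sigma ** 2
    return (1.0 - r2) * math.exp(-r2 / 2.0)

def bubble(d, sigma):
    """Assumed form: step function, 1 inside radius sigma and 0 outside."""
    return 1.0 if d <= sigma else 0.0

def triangular(d, sigma):
    """Assumed form: linear decay from 1 at d = 0 to 0 at d = sigma."""
    return max(0.0, 1.0 - d / sigma)
```

All four return 1 at d = 0, so the best matching unit always receives the full update; they differ in how quickly, and with what sign, the influence falls off.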

4. Adaptive Scheduling Strategies

Implements multiple parameter decay strategies:

  • Inverse Decay: α(t+1) = α(t) · γ/(γ + t)
  • Linear Decay: α(t+1) = α(t) · (1 - t/T)
  • Asymptotic Decay: For exponential convergence
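
The inverse and linear rules above can be unrolled into a schedule to see how they behave. The sketch reads both formulas literally as one-step recurrences; the helper names are hypothetical, and asymptotic decay is left out because the text gives no formula for it.

```python
def inverse_decay(alpha_t, t, gamma):
    """alpha(t+1) = alpha(t) * gamma / (gamma + t)."""
    return alpha_t * gamma / (gamma + t)

def linear_decay(alpha_t, t, total_iters):
    """alpha(t+1) = alpha(t) * (1 - t / T)."""
    return alpha_t * (1.0 - t / total_iters)

def schedule(step_fn, alpha0, n_steps, **kwargs):
    """Unroll a one-step rule into [alpha(0), ..., alpha(n_steps)]."""
    values = [alpha0]
    for t in range(n_steps):
        values.append(step_fn(values[-1], t, **kwargs))
    return values

inv = schedule(inverse_decay, alpha0=1.0, n_steps=3, gamma=10.0)
lin = schedule(linear_decay, alpha0=1.0, n_steps=3, total_iters=10)
```

Read this way, the decay factors compound from one step to the next. Some implementations instead apply the factor to the initial value α(0) at every step, which yields a gentler schedule, so it is worth checking which convention a given library follows.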

Experimental Setup

Datasets

Synthetic datasets generated using scikit-learn's make_blobs():

  • Sample Scale: {240, 4000, 16000}
  • Feature Dimension: {4, 50, 300}
  • Grid Size: 25×15 (small), 90×70 (large)

Evaluation Metrics

  1. Quantization Error (QE): QE = (1/N) Σ ||x_i - w_BMU(x_i)||₂
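  2. Topological Error (TE): Measures how well neighborhood relationships are preserved, commonly computed as the fraction of samples whose best and second-best matching units are not adjacent on the grid (lower is better)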
  3. Runtime: Including initialization and training time
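
Both quality metrics can be computed directly from a trained map. The sketch below follows the QE formula above and, for TE, assumes the common definition: the share of samples whose best and second-best matching units are not neighbors on a rectangular grid (8-neighborhood). The function names are hypothetical and the code is plain Python, not torchsom's implementation.

```python
import math

def two_best_units(weights, x):
    """Indices of the closest and second-closest neurons to x (Euclidean)."""
    order = sorted(range(len(weights)), key=lambda j: math.dist(weights[j], x))
    return order[0], order[1]

def quantization_error(weights, data):
    """Mean distance between each sample and its best matching unit."""
    return sum(math.dist(x, weights[two_best_units(weights, x)[0]])
               for x in data) / len(data)

def topological_error(weights, coords, data):
    """Fraction of samples whose two best units are not adjacent on the grid."""
    errors = 0
    for x in data:
        first, second = two_best_units(weights, x)
        (r1, c1), (r2, c2) = coords[first], coords[second]
        if max(abs(r1 - r2), abs(c1 - c2)) > 1:
            errors += 1
    return errors / len(data)

coords = [(0, 0), (0, 1), (0, 2)]            # 1x3 grid
data = [[0.1], [1.9]]
ordered = [[0.0], [1.0], [2.0]]              # weights follow the grid order
folded = [[0.0], [2.0], [1.0]]               # same weights, scrambled on the grid

qe = quantization_error(ordered, data)
te_ordered = topological_error(ordered, coords, data)
te_folded = topological_error(folded, coords, data)
```

The two maps contain the same weight vectors and therefore share the same QE, but the folded map has a higher TE. This is why the two metrics are reported together: QE alone cannot detect a map that fits the data yet scrambles its topology.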

Comparison Methods

  • MiniSom (CPU): Most widely used SOM library
  • torchsom (CPU): CPU version implementation
  • torchsom (GPU): GPU-accelerated version

Implementation Details

  • PCA initialization
  • Rectangular topology
  • 100 training iterations
  • Gaussian neighborhood function
  • Euclidean distance

Experimental Results

Main Results

Performance Comparison (25×15 Grid)

Metric              MiniSom (CPU)   torchsom (CPU)   torchsom (GPU)
QE                  0.15-5.43       0.23-5.21        0.23-5.21
TE Improvement      Baseline        34-81% ↓         34-81% ↓
Speed Improvement   Baseline        77-99% ↑         77-99% ↑

Key Findings

  1. Topological Preservation Advantage: TE reduced by 34-81% compared to MiniSom
  2. Computational Efficiency Improvement: Training time reduced from thousands of seconds to tens of seconds
  3. Comparable Quantization Quality: Achieves comparable QE across all datasets
  4. Scalability: GPU version performs best on high-dimensional large-scale datasets

Ablation Studies

Experiments validate the contribution of each component:

  • Batch Processing Optimization: Significantly improves training speed
  • GPU Acceleration: Provides order-of-magnitude performance improvements on large-scale data
  • PyTorch Backend: Enables better memory management and parallel computing

Case Studies

Visualization analysis on the wine and Boston housing datasets demonstrates:

  • Clear Clustering Boundaries: U-matrix effectively displays clustering structure
  • Reasonable Feature Mapping: Component planes reflect feature distribution
  • Good Classification Performance: Classification maps show clear decision boundaries

Comparison of Existing SOM Libraries

Feature       torchsom       MiniSom   SimpSOM   SOMPY    somoclu
Framework     PyTorch        NumPy     NumPy     NumPy    C++
GPU Support   CUDA           No        CuPy      No       CUDA
API Design    scikit-learn   Custom    Custom    MATLAB   Custom
Visualization: Advanced, Moderate, Moderate, Basic
Clustering

Technical Advantages

  1. Modern Architecture: Based on PyTorch ecosystem
  2. Standardized Interface: Follows scikit-learn conventions
  3. Complete Functionality: Integrates training, clustering, visualization
  4. High-Quality Implementation: 90% test coverage, complete documentation

Conclusions and Discussion

Main Conclusions

  1. torchsom provides the first comprehensive PyTorch-based SOM implementation
  2. Significantly improves topological preservation and computational efficiency while maintaining comparable quantization quality
  3. Rich visualization tools fill important gaps in existing SOM libraries
  4. Standardized API design facilitates integration with modern ML workflows

Limitations

  1. GPU Dependency: Optimal performance requires CUDA support
  2. Memory Requirements: Large-scale datasets may require substantial GPU memory
  3. Hyperparameter Sensitivity: Still requires careful tuning
  4. Domain-Specific Adaptation: Certain domain-specific requirements may need additional customization

Future Directions

  1. Algorithm Extensions: Support for additional SOM variants (e.g., Growing SOM)
  2. Distributed Training: Support for multi-GPU and distributed computing
  3. Automatic Hyperparameter Tuning: Integration of hyperparameter optimization functionality
  4. Domain Specialization: Optimization for specific application domains

In-Depth Evaluation

Strengths

  1. Technical Innovation: First deep integration of SOM with modern deep learning frameworks
  2. High Engineering Quality: 90% test coverage, complete documentation, modular design
  3. Strong Practical Value: Significant performance improvements and rich functionality
  4. Good Reproducibility: Open-source implementation with detailed experimental setup

Weaknesses

  1. Limited Theoretical Contribution: Primarily an engineering implementation with limited algorithmic innovation
  2. Limited Evaluation Scope: Mainly tested on synthetic data with few real-world application cases
  3. Incomplete Comparisons: Lacks detailed comparison with all existing SOM libraries
  4. Insufficient Scalability Verification: Performance on ultra-large-scale data requires further validation

Impact

  1. Domain Contribution: Provides a modernized tool platform for SOM research
  2. Practical Value: Lowers technical barriers to SOM application
  3. Ecosystem Impact: Promotes integration of traditional ML algorithms with modern frameworks
  4. Community Value: Open-source contribution facilitates SOM technology dissemination and development

Applicable Scenarios

  1. Exploratory Data Analysis: Visualization and understanding of high-dimensional data
  2. Anomaly Detection: Industrial monitoring and quality control
  3. Clustering Analysis: Customer segmentation, market analysis
  4. Feature Learning: As a preprocessing step in deep learning pipelines
  5. Educational Research: Teaching and research platform for SOM algorithms

References

  1. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps
  2. Kohonen, T. (1990). The self-organizing map
  3. Vettigli, G. (2018). MiniSom: Minimalistic implementation of Self Organizing Maps
  4. Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python

Overall Assessment: This is a high-quality software engineering paper that significantly enhances SOM's usability and performance through modernized implementation. While algorithmic innovation is limited, its engineering value and practical significance are noteworthy, providing an excellent example of applying traditional machine learning algorithms in modern computing environments.