
Techniques of Artificial Intelligence Applied to Near-Infrared Spectra

Sow, Diallo
This article explores the application of various artificial intelligence techniques to the analysis of near-infrared (NIR) spectra of paracetamol (acetaminophen) within the spectral range of 900 nm to 1800 nm. The main objective is to evaluate the performance of several dimensionality reduction algorithms, namely Principal Component Analysis (PCA), Kernel PCA (KPCA), Sparse Kernel PCA, t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP), in modeling and interpreting spectral features. These techniques, derived from data science and machine learning, are evaluated for their ability to simplify analysis and enhance the visualization of NIR spectra in pharmaceutical applications.

Basic Information

  • Paper ID: 2510.10638
  • Title: Techniques of Artificial Intelligence Applied to Near-Infrared Spectra
  • Authors: Aminata Sow (Department of Physics, University of Science and Technology of Bamako, Mali), Tidiane Diallo (Faculty of Pharmacy, University of Science and Technology of Bamako, Mali)
  • Classification: physics.optics
  • Publication Date: October 12, 2025
  • Paper Link: https://arxiv.org/abs/2510.10638v1

Abstract

This paper explores the application of multiple artificial intelligence techniques to the near-infrared (NIR) spectroscopic analysis of acetaminophen across the spectral range of 900-1800 nm. The primary objective is to evaluate how well several dimensionality reduction algorithms, including Principal Component Analysis (PCA), Kernel PCA (KPCA), Sparse Kernel PCA, t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP), model and interpret spectral features. These techniques, derived from data science and machine learning, were evaluated for their ability to simplify analysis and enhance the visualization of NIR spectra in pharmaceutical applications.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is how to effectively process and analyze high-dimensional near-infrared spectroscopic data, particularly the challenges of dimensionality reduction and visualization of complex spectral data in pharmaceutical applications.

Importance Analysis

  1. Pharmaceutical Industry Demands: NIR spectroscopy is non-destructive, fast, and able to handle complex mixtures, which makes it an important tool for pharmaceutical quality control and component analysis.
  2. Curse of Dimensionality: NIR spectroscopic measurements typically produce high-dimensional data containing redundant or highly correlated features, which can obscure underlying structures and impair machine learning algorithm performance.
  3. Cross-disciplinary Applications: Beyond pharmaceuticals, NIR spectroscopy has widespread applications in the food industry, agriculture, and environmental science.

Limitations of Existing Methods

  • Traditional linear methods such as PCA can only capture linear relationships and cannot effectively handle complex nonlinear structures.
  • Lack of systematic comparative studies of different dimensionality reduction techniques in NIR spectral analysis.
  • Visualization and interpretation of high-dimensional spectral data remain a challenge.

Research Motivation

Building upon the authors' previous chemometric analysis work on acetaminophen NIR spectra, this research aims to explore advanced unsupervised machine learning techniques, particularly dimensionality reduction methods, to further reveal spectral behavior and latent patterns within the dataset.

Core Contributions

  1. Systematic Comparative Study: First systematic evaluation of five different dimensionality reduction algorithms (PCA, KPCA, Sparse KPCA, t-SNE, UMAP) in acetaminophen NIR spectral analysis.
  2. Nonlinear Structure Discovery: Confirmation of nonlinear structures in NIR spectral data through comparison of linear and nonlinear methods.
  3. Visualization Effect Assessment: Detailed comparison of different dimensionality reduction techniques in spectral data clustering and visualization.
  4. Preprocessing Strategy Optimization: Demonstration of the effectiveness of preprocessing methods including Standard Normal Variate (SNV) correction, detrending, and Multiplicative Scatter Correction (MSC).
  5. Clustering Performance Enhancement: Evidence that clustering in the dimensionality-reduced space outperforms clustering in the original high-dimensional space.

Methodology Details

Task Definition

The task of this research is to map high-dimensional NIR spectral data (spectral features within the 900-1800 nm range) to a low-dimensional space (2D or 3D) while preserving the important structural information in the data, which facilitates visualization and subsequent clustering analysis.

Dimensionality Reduction Algorithm Architecture

1. Principal Component Analysis (PCA)

  • Principle: Projects data onto a new set of orthogonal axes (principal components), ordered by the amount of variance captured.
  • Mathematical Foundation: Based on eigenvalue decomposition of the covariance matrix.
  • Advantages: High computational efficiency, strong interpretability.
  • Limitations: Can only capture linear relationships.
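
The eigenvalue-decomposition route described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `pca` and the synthetic matrix `X` are assumptions introduced here, and the data are random stand-ins rather than the paper's spectra.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via eigendecomposition of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                   # center each wavelength column
    cov = Xc.T @ Xc / (X.shape[0] - 1)        # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :n_components]   # coordinates on the leading axes
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, explained

# Synthetic stand-in for spectra (NOT the paper's data): 30 samples x 200 wavelengths
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 200))
X += 0.01 * rng.normal(size=X.shape)
scores, explained = pca(X)
print(scores.shape, explained.cumsum())
```

The cumulative sum of `explained` is the cumulative explained-variance ratio. When wavelengths outnumber samples, as in this dataset, a singular value decomposition of the centered matrix is usually the cheaper equivalent.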

2. Kernel Principal Component Analysis (KPCA)

  • Innovation: Uses kernel functions (e.g., the Gaussian RBF kernel) to implicitly map data into a high-dimensional feature space.
  • Implementation: Performs linear PCA in the transformed feature space.
  • Advantages: Capable of extracting nonlinear structures.
  • Application: Used to analyze nonlinear patterns in acetaminophen NIR spectra.
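
As a rough sketch of the two steps above (kernel mapping, then linear PCA in feature space), the following NumPy code builds a Gaussian RBF kernel matrix, centers it, and projects the training samples onto its leading eigenvectors. The function name, the `gamma` value, and the ring-shaped toy data are assumptions made for illustration; the summary does not report the kernel parameters the paper used.

```python
import numpy as np

def rbf_kernel_pca(X, n_components=2, gamma=1.0):
    """Kernel PCA with a Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    K = np.exp(-gamma * d2)                          # kernel (Gram) matrix
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones   # center the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)            # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam = np.maximum(eigvals[idx], 0.0)              # clip tiny negative round-off
    return eigvecs[:, idx] * np.sqrt(lam)            # projections of the training samples

# Toy nonlinear structure (NOT the paper's data): two concentric rings
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=100)
radii = np.repeat([1.0, 3.0], 50)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
Z = rbf_kernel_pca(X, n_components=2, gamma=0.5)
print(Z.shape)
```

Only the kernel matrix is ever formed, so the high-dimensional feature space is never computed explicitly. How well the components separate a given dataset depends strongly on `gamma`.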

3. Sparse Kernel Principal Component Analysis (SKPCA)

  • Technical Characteristics: Adds sparsity constraints to KPCA.
  • Advantages: Reduces the number of support vectors, improving computational efficiency and interpretability.
  • Applicable Scenarios: Large-scale or high-dimensional datasets.

4. t-distributed Stochastic Neighbor Embedding (t-SNE)

  • Design Philosophy: Uses probability distributions to model pairwise similarities between data points.
  • Optimization Objective: Minimizes the Kullback-Leibler divergence between the pairwise-similarity distributions in the original and reduced spaces.
  • Strengths: Preserves local structure, reveals clustering in data.
  • Parameter Sensitivity: Sensitive to parameters such as perplexity and learning rate.
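
To make the objective concrete, the sketch below computes the two similarity distributions and their Kullback-Leibler divergence for a given embedding. It is a simplified illustration under stated assumptions: a single shared Gaussian bandwidth `sigma` replaces the per-point bandwidths that t-SNE calibrates from the perplexity, no gradient descent is performed, and all function names and the toy data are hypothetical.

```python
import numpy as np

def sq_dists(A):
    sq = np.sum(A ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * A @ A.T, 0.0)

def high_dim_similarities(X, sigma=3.0):
    """Symmetrized Gaussian similarities P (one shared bandwidth: a simplification)."""
    W = np.exp(-sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                      # a point is not its own neighbor
    P_cond = W / W.sum(axis=1, keepdims=True)     # conditional p_{j|i}
    return (P_cond + P_cond.T) / (2.0 * len(X))   # joint p_{ij}, sums to 1

def low_dim_similarities(Y):
    """Student-t (one degree of freedom) similarities Q in the embedding."""
    W = 1.0 / (1.0 + sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Toy data (NOT the paper's spectra): two groups of 20 points in 10 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 10)), rng.normal(5.0, 1.0, (20, 10))])
P = high_dim_similarities(X)
Y_good = X[:, :2]                                        # keeps the two groups apart
Y_bad = Y_good[np.arange(40).reshape(2, 20).T.ravel()]   # deliberately mixes the groups
kl_good = kl_divergence(P, low_dim_similarities(Y_good))
kl_bad = kl_divergence(P, low_dim_similarities(Y_bad))
print(kl_good < kl_bad)
```

An embedding that keeps neighbors together yields a lower divergence than one that scatters them, which is the quantity t-SNE drives down iteratively. Because the bandwidths depend on perplexity, changing that parameter changes `P` and therefore the resulting layout.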

5. Uniform Manifold Approximation and Projection (UMAP)

  • Theoretical Foundation: Based on manifold learning and topological data analysis.
  • Implementation Approach: Constructs a weighted graph representation of the high-dimensional data, then optimizes a low-dimensional layout whose graph is as structurally similar to it as possible.
  • Advantages: Generally preserves both local and global structure better than t-SNE, with higher computational efficiency.
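
The graph-construction half of this approach can be sketched as follows. This is an illustrative approximation rather than the reference implementation: it uses exact rather than approximate nearest neighbors, covers only the high-dimensional graph (not the layout optimization), and the function name `fuzzy_neighbor_graph`, the choice `k=5`, and the random data are assumptions.

```python
import numpy as np

def fuzzy_neighbor_graph(X, k=5, n_iter=64):
    """Directed and symmetrized fuzzy k-nearest-neighbor weights (UMAP-style sketch)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    np.fill_diagonal(D, np.inf)                       # exclude self-neighbors
    nn_idx = np.argsort(D, axis=1)[:, :k]             # indices of the k nearest points
    nn_dist = np.take_along_axis(D, nn_idx, axis=1)   # their distances, ascending
    rho = nn_dist[:, 0]                               # distance to the nearest neighbor
    target = np.log2(k)
    A = np.zeros((n, n))
    for i in range(n):
        excess = nn_dist[i] - rho[i]                  # >= 0 by construction
        lo, hi = 0.0, 100.0 * nn_dist.max()
        for _ in range(n_iter):                       # bisection on the bandwidth sigma_i
            sigma = 0.5 * (lo + hi)
            w = np.exp(-excess / sigma)
            if w.sum() > target:
                hi = sigma
            else:
                lo = sigma
        A[i, nn_idx[i]] = w                           # directed edge weights of point i
    S = A + A.T - A * A.T                             # fuzzy union of the directed edges
    return A, S

# Random stand-in data (NOT the paper's spectra): 30 samples x 50 features
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))
A, S = fuzzy_neighbor_graph(X, k=5)
print(S.shape)
```

Each point's bandwidth is tuned so that its outgoing weights sum to log2(k), which adapts the graph to local density. The symmetrized matrix `S` is the structure a low-dimensional layout would then be optimized to match.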

Technical Innovations

  1. Multi-algorithm Integrated Evaluation: First systematic comparison of multiple dimensionality reduction techniques in NIR spectral analysis.
  2. Nonlinear Feature Mining: Reveals nonlinear relationships in spectral data through kernel methods and manifold learning techniques.
  3. Preprocessing and Dimensionality Reduction Integration: Combines spectral preprocessing techniques with modern dimensionality reduction methods in a single workflow.
  4. Clustering Performance Optimization: Demonstrates the importance of dimensionality reduction preprocessing in improving clustering effectiveness.

Experimental Setup

Dataset

  • Sample Type: Acetaminophen NIR spectral data.
  • Spectral Range: 900-1800 nm.
  • Sample Classification: Divided into two categories based on content values:
    • Category 1: Samples with content >95 and <1015
    • Category 2: Remaining samples
  • Data Characteristics: High-dimensional spectral data with wavelength count exceeding sample count.

Preprocessing Methods

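  1. Standard Normal Variate (SNV) Correction: Centers and scales each spectrum individually to reduce light-scattering effects.
  2. Detrending: Removes baseline drift by subtracting a fitted trend from each spectrum.
  3. Multiplicative Scatter Correction (MSC): Corrects scattering variations by regressing each spectrum against a reference spectrum.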
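
The three corrections can be sketched with NumPy as shown here. These are common textbook formulations, offered as an assumption about what the preprocessing looks like rather than the authors' exact pipeline: the polynomial degree for detrending, the use of the mean spectrum as MSC reference, and the synthetic spectra are all illustrative choices.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: center and scale each spectrum (row) individually."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def detrend(X, wavelengths, degree=2):
    """Subtract a per-spectrum polynomial baseline fitted over wavelength."""
    span = wavelengths.max() - wavelengths.min()
    t = 2.0 * (wavelengths - wavelengths.min()) / span - 1.0   # rescale to [-1, 1]
    coefs = np.polynomial.polynomial.polyfit(t, X.T, degree)   # one column per spectrum
    baseline = np.polynomial.polynomial.polyval(t, coefs)      # (n_samples, n_wavelengths)
    return X - baseline

def msc(X, reference=None):
    """Multiplicative Scatter Correction against a reference (default: mean spectrum)."""
    ref = X.mean(axis=0) if reference is None else reference
    corrected = np.empty(X.shape, dtype=float)
    for i, spectrum in enumerate(X):
        slope, intercept = np.polyfit(ref, spectrum, 1)        # spectrum ~ intercept + slope * ref
        corrected[i] = (spectrum - intercept) / slope
    return corrected

# Synthetic spectra (NOT the paper's data): one band with per-sample offset and gain
wavelengths = np.linspace(900.0, 1800.0, 256)
band = np.exp(-((wavelengths - 1400.0) / 60.0) ** 2)
rng = np.random.default_rng(0)
offsets = rng.uniform(0.0, 0.5, size=(8, 1))
gains = rng.uniform(0.8, 1.2, size=(8, 1))
X = offsets + gains * band
X_snv, X_msc, X_det = snv(X), msc(X), detrend(X, wavelengths)
```

In this toy case the spectra differ only by an additive offset and a multiplicative gain, so both SNV and MSC collapse them onto a single curve. Real spectra also carry chemical variation, which these corrections are meant to leave intact.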

Evaluation Methods

  • Visualization Quality: Assessed visually from 2D and 3D embedding plots, judging how well the clusters separate.
  • Variance Preservation: Cumulative explained-variance ratio of the first few principal components in PCA.
  • Clustering Performance: Comparison of clustering effectiveness in the original and reduced spaces.

Clustering Algorithms

  • K-means: Applied to original high-dimensional data.
  • PAM (Partitioning Around Medoids): Applied to data after t-SNE dimensionality reduction.
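
For readers unfamiliar with the two algorithms, a compact NumPy sketch follows. It is a simplified illustration, not the authors' code: k-means uses plain Lloyd iterations with random initialization, and the PAM function implements only the swap phase from a random start (the classical BUILD initialization is omitted). The two-blob array `Y` is a hypothetical stand-in for a 2D embedding.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels, centers

def pam(D, k, seed=0):
    """PAM swap phase on a precomputed distance matrix (random start, no BUILD step)."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = [int(m) for m in rng.choice(n, size=k, replace=False)]
    cost = D[:, medoids].min(axis=1).sum()         # total distance to nearest medoid
    improved = True
    while improved:
        improved = False
        for slot in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[slot] = cand                 # try swapping one medoid
                trial_cost = D[:, trial].min(axis=1).sum()
                if trial_cost < cost - 1e-12:      # keep the swap only if cost drops
                    medoids, cost, improved = trial, trial_cost, True
    return D[:, medoids].argmin(axis=1), np.array(medoids), cost

# Toy 2-D embedding (NOT the paper's data): two blobs standing in for a t-SNE output
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(10.0, 1.0, (20, 2))])
D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))
pam_labels, medoids, pam_cost = pam(D, k=2)
km_labels, centers = kmeans(Y, k=2)
```

PAM works from a distance matrix and picks actual samples as cluster centers, so it accepts any dissimilarity measure; k-means needs coordinates and averages them.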

Experimental Results

Main Results

Dimensionality Reduction Effect Comparison

  1. PCA Results:
    • First two principal components capture approximately 100% of total variance.
    • Unable to clearly separate samples into distinct clusters.
    • Highlights limitations in capturing nonlinear relationships.
  2. KPCA and Sparse KPCA:
    • Provide improved separation of overlapping spectral regions compared to linear PCA.
    • Sparse KPCA achieves this goal while using fewer support vectors.
    • Provide more interpretable and computationally efficient representations.
  3. t-SNE Performance:
    • Produces distinct and well-separated clusters.
    • Effectively preserves local neighborhood structure.
    • Sensitive to parameter settings such as perplexity.
    • Shows poor consistency in global cluster arrangement.
  4. UMAP Performance:
    • Demonstrates strong performance, generating compact and well-separated clusters.
    • Simultaneously preserves local and global relationships.
    • High computational efficiency, particularly suitable for exploratory data analysis.

Clustering Performance Comparison

  • K-means on Original Data: Poor clustering effectiveness with blurred boundaries.
  • PAM after t-SNE Dimensionality Reduction: Produces more distinct and meaningful clusters.
  • Key Finding: Applying dimensionality reduction before clustering markedly improves cluster separation in the reported plots.

Key Experimental Findings

  1. Confirmation of Nonlinear Structure: Differences in clustering patterns between linear PCA and nonlinear KPCA confirm the existence of nonlinear structures in the dataset.
  2. Necessity of Dimensionality Reduction: Direct clustering in high-dimensional space performs poorly; clustering after dimensionality reduction shows significant improvement.
  3. Algorithm Applicability: UMAP and t-SNE are most effective in revealing meaningful structures in NIR spectra.
  4. Importance of Preprocessing: Appropriate spectral preprocessing significantly impacts subsequent analysis results.

Main Research Directions

  1. Applications of NIR Spectroscopy in Pharmaceuticals:
    • Early detection of novel psychoactive substances.
    • Recent advances in biomedical and pharmaceutical applications.
  2. Food and Agricultural Applications:
    • Food quality control and component analysis.
    • Soil component research and ecosystem health monitoring.
  3. Machine Learning Applications in Spectral Analysis:
    • Supervised learning methods for predictive modeling.
    • Unsupervised learning techniques for pattern discovery and clustering.

Relationship to Prior Work

  • Continuity: Builds upon the authors' previous chemometric analysis work.
  • Extension: Extends from traditional chemometric methods to modern machine learning techniques.
  • Systematicity: First systematic comparison of multiple dimensionality reduction techniques in NIR spectral analysis.

Technical Advantages

Compared to existing work, this paper provides a more comprehensive comparison of dimensionality reduction techniques, in particular a systematic evaluation for pharmaceutical NIR spectral analysis.

Conclusions and Discussion

Main Conclusions

  1. Method Effectiveness: The evaluated dimensionality reduction techniques prove effective in simplifying high-dimensional spectral data and revealing underlying structures.
  2. Linear vs. Nonlinear: Linear methods such as PCA provide rapid and interpretable variance summaries but are limited in capturing nonlinear relationships.
  3. Optimal Methods: Nonlinear methods such as t-SNE and UMAP more effectively discover meaningful clusters and local patterns in spectra.
  4. Application Value: The combination of NIR spectroscopy with modern machine learning techniques can enhance data exploration and interpretation in pharmaceutical research.

Limitations

  1. Dataset Scale: Uses only NIR spectral data of acetaminophen; generalizability requires further verification.
  2. Parameter Sensitivity: Certain methods (e.g., t-SNE) are sensitive to parameter settings, requiring careful tuning.
  3. Lack of Quantitative Analysis: Primarily focuses on qualitative visualization effects, lacking quantitative performance metrics.
  4. Computational Complexity: Does not provide detailed analysis of computational costs for different methods.

Future Directions

  1. Extended Applications: Apply methods to NIR spectral analysis of other pharmaceuticals.
  2. Algorithm Optimization: Develop specialized dimensionality reduction algorithms tailored to NIR spectral characteristics.
  3. Real-time Applications: Explore practical applications in online quality control and process monitoring.
  4. Multimodal Fusion: Combine other analytical techniques to improve analytical accuracy.

In-depth Evaluation

Strengths

  1. Research Systematicity: First systematic comparison of multiple dimensionality reduction techniques in NIR spectral analysis, filling a research gap.
  2. Method Diversity: Covers the full range from classical linear methods to modern nonlinear techniques.
  3. Practical Application Value: Direct application value in pharmaceutical quality control.
  4. Visualization Effectiveness: Provides clear visualization results facilitating understanding of different methods' characteristics.
  5. Technical Verification: Validates the existence of nonlinear structures through comparative experiments.

Weaknesses

  1. Theoretical Depth: Lacks deep theoretical analysis of why certain methods perform better on NIR spectral data.
  2. Quantitative Assessment: Primarily relies on visual assessment, lacking objective quantitative metrics.
  3. Data Limitations: Uses only single-drug data; generalizability requires further verification.
  4. Parameter Tuning: Insufficient detail on selection and tuning of critical parameters.
  5. Computational Efficiency: No comparison of computation time and resource consumption across different methods.

Impact

  1. Academic Contribution: Brings a systematic study of modern machine learning methods to the field of NIR spectral analysis.
  2. Practical Value: Provides new technical options for quality control in the pharmaceutical industry.
  3. Method Promotion: Encourages wider adoption of dimensionality reduction techniques in spectral analysis.
  4. Interdisciplinary Fusion: Promotes the integration of optics, chemistry, and machine learning.

Applicable Scenarios

  1. Pharmaceutical Quality Control: Drug component analysis and quality detection.
  2. Food Safety Detection: Food component and quality analysis.
  3. Chemical Process Monitoring: Real-time process control and product quality monitoring.
  4. Materials Science Research: Rapid analysis of material composition and properties.

References

The paper cites 20 references covering classical and recent work in NIR spectroscopy, machine learning methods, and related application fields, which provide a solid theoretical foundation for the research.


Overall Assessment: This interdisciplinary study has practical value and systematically compares multiple dimensionality reduction techniques for NIR spectral analysis. Although its theoretical depth and quantitative analysis leave room for improvement, the systematic comparison and clear visualizations give researchers and practitioners a useful reference. The work helps integrate NIR spectroscopy with modern machine learning methods and shows good prospects in pharmaceutical and other application domains.