2025-11-18T10:52:13.210456

A mathematical theory for understanding when abstract representations emerge in neural networks

Wang, Johnston, Fusi

Recent experiments reveal that task-relevant variables are often encoded in approximately orthogonal subspaces of the neural activity space. These disentangled low-dimensional representations are observed in multiple brain areas and across different species, and are typically the result of a process of abstraction that supports simple forms of out-of-distribution generalization. The mechanisms by which such geometries emerge remain poorly understood, and the mechanisms that have been investigated are typically unsupervised (e.g., based on variational auto-encoders). Here, we show mathematically that abstract representations of latent variables are guaranteed to appear in the last hidden layer of feedforward nonlinear networks when they are trained on tasks that depend directly on these latent variables. These abstract representations reflect the structure of the desired outputs or the semantics of the input stimuli. To investigate the neural representations that emerge in these networks, we develop an analytical framework that maps the optimization over the network weights into a mean-field problem over the distribution of neural preactivations. Applying this framework to a finite-width ReLU network, we find that its hidden layer exhibits an abstract representation at all global minima of the task objective. We further extend these analyses to two broad families of activation functions and deep feedforward architectures, demonstrating that abstract representations naturally arise in all these scenarios. Together, these results provide an explanation for the widely observed abstract representations in both the brain and artificial neural networks, as well as a mathematically tractable toolkit for understanding the emergence of different kinds of representations in task-optimized, feature-learning network models.

Basic Information

Paper ID: 2510.09816
Title: A mathematical theory for understanding when abstract representations emerge in neural networks
Authors: Bin Wang, W. Jeffrey Johnston, Stefano Fusi
Affiliation: Center for Theoretical Neuroscience, Columbia University
Categories: q-bio.NC math.OC physics.bio-ph physics.data-an stat.ML
Published: October 14, 2025 (preprint)
Paper link: https://arxiv.org/abs/2510.09816

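Summary

This paper studies the mathematical mechanism by which abstract representations emerge in neural networks. Experiments show that task-relevant variables are often encoded in approximately orthogonal subspaces of the neural activity space, forming disentangled low-dimensional representations. This geometry supports simple forms of out-of-distribution generalization, but how it emerges remains poorly understood. The authors prove mathematically that abstract representations are guaranteed to appear in the last hidden layer of a feedforward nonlinear network trained on tasks that depend on the latent variables. To do so, they develop an analytical framework that maps the optimization over network weights to a mean-field problem over the distribution of neural preactivations.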

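Research Background and Motivation

Core Problems

Ubiquity of abstract representations: neuroscience experiments show that neural activity in multiple brain areas and species exhibits abstract representations, in which task-relevant variables are encoded in approximately orthogonal subspaces
Missing mechanistic understanding: although this geometry is widespread, the network mechanisms by which it emerges remain unclear
Limits of existing approaches: the mechanisms studied so far are mostly unsupervised (e.g., variational auto-encoders), yet identifiability issues make it hard to learn disentangled representations by purely unsupervised means

Significance

Theoretical significance: provides a mathematical explanation for the widely observed phenomenon of abstract representations
Practical value: understanding how representations are learned can inform the design of neural network architectures
Cross-disciplinary impact: connects representation-learning theory in neuroscience and machine learning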

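Core Contributions

Theoretical guarantee: the first mathematical proof that, in a multi-task supervised learning setting, feedforward nonlinear networks necessarily develop abstract representations
Analytical framework: a general analysis tool that maps the optimization over network weights to a mean-field problem over the distribution of neural preactivations
Robustness to the activation function: a proof that abstract representations emerge across two broad families of activation functions
Architectural extension: the analysis is extended to deep feedforward architectures
Neuroscience insight: a computational explanation for the abstract representations observed in biological neural networks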

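Method Details

Task Definition

Consider a training dataset $D = \{(x^i, y^i)\}_{i=1}^P$, where:

Inputs $x^i \in \mathbb{R}^{d_X}$ are essentially unstructured
Outputs $y^i \in \{\pm 1\}^{d_Y}$ consist of $d_Y$ binary labels that reflect the latent-variable structure
The data form $2^{d_Y}$ distinct classes, each containing $n$ samples
Total number of samples: $P = n \cdot 2^{d_Y}$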
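The task definition above can be made concrete with a small synthetic dataset. The following is a minimal sketch, not the authors' data pipeline: the function name `make_dataset`, the Gaussian class prototypes, the noise level, and all sizes are illustrative assumptions.

```python
import itertools
import numpy as np

def make_dataset(d_X=20, d_Y=3, n=5, noise=0.1, seed=0):
    """Build P = n * 2**d_Y samples, with n noisy inputs per label combination."""
    rng = np.random.default_rng(seed)
    # All 2**d_Y label combinations in {-1, +1}^{d_Y}
    labels = np.array(list(itertools.product([-1, 1], repeat=d_Y)))
    # Assumption: "unstructured" inputs are modeled as one independent
    # Gaussian prototype per class, plus small per-sample noise.
    prototypes = rng.standard_normal((len(labels), d_X))
    X = np.repeat(prototypes, n, axis=0)
    X = X + noise * rng.standard_normal(X.shape)
    Y = np.repeat(labels, n, axis=0)
    return X, Y  # shapes (P, d_X) and (P, d_Y)

X, Y = make_dataset()
print(X.shape, Y.shape)  # (40, 20) (40, 3)
```

With $d_Y = 3$ and $n = 5$ this gives $P = 5 \cdot 2^3 = 40$ samples in 8 classes.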

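Network Architecture

The analysis starts from the simplest two-layer network: $f_{W_1,W_2,b}(x) = W_2\phi(W_1x + b)$

where:

$W_1 \in \mathbb{R}^{M \times d_X}$: first-layer weight matrix
$W_2 \in \mathbb{R}^{d_Y \times M}$: second-layer weight matrix
$b \in \mathbb{R}^M$: bias vector
$\phi$: element-wise nonlinear activation function
$M$: hidden-layer width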
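The forward pass of this two-layer network can be sketched in a few lines of NumPy. The ReLU choice for $\phi$, the random initialization, and the sizes below are assumptions made only for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, W1, W2, b, phi=relu):
    """Return (output, hidden) for f(x) = W2 @ phi(W1 @ x + b)."""
    hidden = phi(W1 @ x + b)   # hidden-layer representation, shape (M,)
    return W2 @ hidden, hidden

rng = np.random.default_rng(0)
d_X, M, d_Y = 20, 64, 3        # hypothetical sizes
W1 = rng.standard_normal((M, d_X)) / np.sqrt(d_X)
W2 = rng.standard_normal((d_Y, M)) / np.sqrt(M)
b = np.zeros(M)
x = rng.standard_normal(d_X)

out, hidden = forward(x, W1, W2, b)
print(out.shape, hidden.shape)  # (3,) (64,)
```

The `hidden` vector is the quantity whose geometry the abstraction analysis concerns.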

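Loss Function

The objective is a squared-error loss with L2 regularization: $E(W_1,W_2,b) = \|Y - W_2\phi(WX)\|_F^2 + \lambda_1\|W\|_F^2 + \lambda_2\|W_2\|_F^2$

Here $Y \in \mathbb{R}^{d_Y \times P}$ stacks the labels as columns. The shorthand $W$ and $X$ is not defined in this summary; the reading consistent with the network above is $W = [W_1, b]$ and $X$ the input matrix with an appended row of ones, so that $WX = W_1 X + b\mathbf{1}^T$.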
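A direct implementation of this objective might look as follows. It is a sketch under one stated assumption: $W$ is read as the augmented matrix $[W_1, b]$ acting on inputs with an appended row of ones, which means the bias is penalized together with $W_1$. The function name `loss` and the sizes are hypothetical.

```python
import numpy as np

def loss(W1, W2, b, X, Y, lam1, lam2, phi=lambda z: np.maximum(z, 0.0)):
    """Regularized squared error. X: (d_X, P) inputs as columns, Y: (d_Y, P) labels."""
    # Assumption: W = [W1, b], so that W @ X_aug = W1 @ X + b 1^T.
    W = np.hstack([W1, b[:, None]])
    X_aug = np.vstack([X, np.ones((1, X.shape[1]))])
    residual = Y - W2 @ phi(W @ X_aug)
    return (residual ** 2).sum() + lam1 * (W ** 2).sum() + lam2 * (W2 ** 2).sum()

rng = np.random.default_rng(0)
d_X, M, d_Y, P = 5, 8, 2, 12
X = rng.standard_normal((d_X, P))
Y = rng.choice([-1.0, 1.0], size=(d_Y, P))

# With all-zero weights only the data term ||Y||_F^2 = d_Y * P remains.
zero_loss = loss(np.zeros((M, d_X)), np.zeros((d_Y, M)), np.zeros(M), X, Y, 0.1, 0.1)
print(zero_loss)  # 24.0
```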

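Measuring Abstraction

The **Parallelism Score (PS)** quantifies how abstract a representation is:

Class prototype: $r^{(y)} = \frac{1}{n}\sum_{i:y^i=y} r^i$
Coding direction: $\Delta r^{(k;\alpha)} = r^{(y_k=+1,y_{\setminus k}=\alpha)} - r^{(y_k=-1,y_{\setminus k}=\alpha)}$
Parallelism score: $PS = \frac{1}{d_Y}\sum_{k=1}^{d_Y} PS_k$

Here $PS_k$ measures how consistent the coding direction of the $k$-th latent label is across the contexts $\alpha$ (the values of the other labels). PS = 1 corresponds to a perfectly abstract representation.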
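The score can be sketched in code once $PS_k$ is given a concrete form. This summary does not spell $PS_k$ out, so the version below assumes a common choice: the mean pairwise cosine similarity between the coding directions $\Delta r^{(k;\alpha)}$ over all pairs of contexts $\alpha$. The function name `parallelism_score` is hypothetical, and the sketch needs $d_Y \geq 2$ so that at least two contexts exist.

```python
import itertools
import numpy as np

def parallelism_score(R, Y):
    """R: (P, M) hidden representations, Y: (P, d_Y) labels in {-1, +1}."""
    d_Y = Y.shape[1]
    # Class prototypes r^{(y)}: mean representation per label combination
    protos = {}
    for y in itertools.product([-1, 1], repeat=d_Y):
        mask = np.all(Y == np.array(y), axis=1)
        protos[y] = R[mask].mean(axis=0)
    scores = []
    for k in range(d_Y):
        deltas = []
        for alpha in itertools.product([-1, 1], repeat=d_Y - 1):
            plus = alpha[:k] + (1,) + alpha[k:]
            minus = alpha[:k] + (-1,) + alpha[k:]
            deltas.append(protos[plus] - protos[minus])
        # Assumed PS_k: mean pairwise cosine similarity of coding directions
        cos = [u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
               for u, v in itertools.combinations(deltas, 2)]
        scores.append(np.mean(cos))
    return float(np.mean(scores))

# A linear embedding of the labels is perfectly parallel under this definition.
Y = np.repeat(np.array(list(itertools.product([-1, 1], repeat=3))), 4, axis=0)
A = np.random.default_rng(0).standard_normal((3, 10))
print(round(parallelism_score(Y @ A, Y), 6))  # 1.0
```

In the linear example every coding direction for label $k$ equals $2A_k$ regardless of context, which is why the score reaches 1.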

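Core of the Analytical Framework

Mean-Field Transformation

The key step is to convert the original optimization problem: $\min_{W_1,W_2,b} E(W_1,W_2,b)$

into an optimization over the distribution of neural preactivations: $\min_{\rho_M} \mathcal{E}[\rho_M]$

where $\rho_M = \sum_{k=1}^M \delta_{h_k}$ is the empirical measure of the preactivation patterns, with $h_k \in \mathbb{R}^P$ the preactivations of hidden unit $k$ across the $P$ training inputs.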
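To see what an integral against $\rho_M$ means in practice, the sketch below checks that $\int \phi(h)\phi(h)^T d\rho_M(h)$ is just a sum over hidden units, and that it does not change when the units are relabeled. It assumes a ReLU $\phi$, omits the bias for brevity, and uses hypothetical sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, P = 6, 10, 16
X = rng.standard_normal((d, P))      # inputs as columns (hypothetical sizes)
W = rng.standard_normal((M, d))      # first-layer weights, bias omitted
relu = lambda z: np.maximum(z, 0.0)

H = W @ X                            # row k is the preactivation pattern h_k in R^P
# Integrating phi(h) phi(h)^T against rho_M = sum_k delta_{h_k} is a sum over units
gram_from_measure = sum(np.outer(relu(h), relu(h)) for h in H)
gram_from_matrix = relu(H).T @ relu(H)

perm = rng.permutation(M)            # relabeling hidden units leaves rho_M unchanged
gram_permuted = relu(H[perm]).T @ relu(H[perm])

print(np.allclose(gram_from_measure, gram_from_matrix),
      np.allclose(gram_from_matrix, gram_permuted))  # True True
```

This permutation invariance is what makes a measure over patterns, rather than an ordered weight matrix, a natural optimization variable.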

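Effective Energy Function

The energy function of the effective system is: $\mathcal{E}[\rho_M] = \lambda_1\int h^T K_X^\dagger h \, d\rho_M(h) + \lambda_2\,\text{tr}\left[\left(\lambda_2 I + \int\phi(h)\phi(h)^T d\rho_M(h)\right)^{-1} K_Y\right]$

where $K_X^\dagger$ denotes the pseudoinverse of $K_X$ and the second term involves a matrix inverse.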
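A numerical check helps make the effective energy less abstract. The sketch below assumes $K_X = X^T X$ and $K_Y = Y^T Y$ (Gram matrices over the $P$ samples, which this summary does not define), a ReLU $\phi$, and an input matrix with full row rank. Under those assumptions, minimizing the original loss over $W_2$ in closed form (a ridge regression) should reproduce the effective energy evaluated at the preactivation patterns.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, d_Y, P = 5, 7, 2, 12      # d < P so that X has full row rank (assumption)
lam1, lam2 = 0.3, 0.5
X = rng.standard_normal((d, P))
Y = rng.choice([-1.0, 1.0], size=(d_Y, P))
W = rng.standard_normal((M, d))
relu = lambda z: np.maximum(z, 0.0)

# Direct route: minimize the loss over W2 in closed form (ridge regression).
Phi = relu(W @ X)                                            # (M, P)
W2 = Y @ Phi.T @ np.linalg.inv(Phi @ Phi.T + lam2 * np.eye(M))
direct = (np.sum((Y - W2 @ Phi) ** 2)
          + lam1 * np.sum(W ** 2) + lam2 * np.sum(W2 ** 2))

# Effective energy: depends on W only through the patterns h_k (rows of H).
H = W @ X
K_X_pinv = np.linalg.pinv(X.T @ X, rcond=1e-10, hermitian=True)
K_Y = Y.T @ Y
weight_term = lam1 * sum(h @ K_X_pinv @ h for h in H)
gram = sum(np.outer(relu(h), relu(h)) for h in H)            # integral of phi phi^T
readout_term = lam2 * np.trace(np.linalg.inv(lam2 * np.eye(P) + gram) @ K_Y)
effective = weight_term + readout_term

print(np.isclose(direct, effective))  # True
```

The first term matches $\lambda_1\|W\|_F^2$ because $X K_X^\dagger X^T$ is the identity when $X$ has full row rank; the second follows from the Woodbury identity applied to the ridge solution.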