Language is, as commonly theorized, largely arbitrary. Yet, systematic relationships between phonetics and semantics have been observed in many specific cases. To what degree could those systematic relationships manifest themselves in large scale, quantitative investigations--both in previously identified and unidentified phenomena? This work undertakes a distributional approach to quantifying phonosemantic iconicity at scale across 6 diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). In each language, we analyze the alignment of morphemes' phonetic and semantic similarity spaces with a suite of statistical measures, and discover an array of interpretable phonosemantic alignments not previously identified in the literature, along with crosslinguistic patterns. We also analyze 5 previously hypothesized phonosemantic alignments, finding support for some such alignments and mixed results for others.
- Paper ID: 2510.14040
- Title: Quantifying Phonosemantic Iconicity Distributionally in 6 Languages
- Authors: George Flint (UC Berkeley), Kaustubh Kislay (UW Madison)
- Classification: cs.CL (Computational Linguistics)
- Code: https://github.com/roccoflint/quantifying-iconicity
Language is typically theorized as predominantly arbitrary, yet systematic relationships between phonology and semantics have been observed in numerous specific cases. This study employs a distributional approach to quantify phonosemantic iconicity at scale across six typologically diverse languages (English, Spanish, Hindi, Finnish, Turkish, and Tamil). The research analyzes the alignment of phonological and semantic similarity spaces for morphemes in each language, identifying a series of interpretable phonosemantic alignments previously unrecognized in the literature, as well as cross-linguistic patterns. Additionally, five previously hypothesized phonosemantic alignments are analyzed, with supporting evidence found for some and mixed results for others.
The central question addressed by this study is: To what extent can systematic relationships between phonology and semantics be demonstrated in large-scale quantitative investigations, including both recognized and unrecognized phenomena?
- Theoretical Significance: Challenges the traditional view of linguistic arbitrariness and explores the universality of phonosemantic iconicity
- Cross-linguistic Perspective: Validates cross-linguistic patterns of phonosemantic relationships through six typologically diverse languages
- Methodological Contribution: Provides a distributional method for large-scale quantification of phonosemantic iconicity
- Scale Limitations: Previous research has primarily focused on specific phenomena or small-scale vocabularies
- Insufficient Language Coverage: Lacks systematic cross-linguistic comparisons
- Narrow Methodology: Lacks a comprehensive suite of statistical analyses
- Proposes a distributional method for large-scale quantification of phonosemantic iconicity, combining multiple statistical measures
- Identifies interpretable phonosemantic alignments previously unrecognized in the literature through canonical correlation analysis
- Validates five previously hypothesized phonosemantic alignments, providing cross-linguistic evidence
- Constructs morphological segmentation datasets for six languages using GPT-4 few-shot learning
- Provides cross-linguistic pattern analysis of phonosemantic iconicity
Input: High-frequency vocabulary for each language (top 5,000 words)
Output: Quantification of alignment between phonological and semantic similarity spaces
Constraints: Requires morphological segmentation to avoid transitivity confounds
- Vocabulary Selection: Uses the Wordfreq module to obtain the top 5,000 high-frequency words for each language
- Morphological Segmentation:
  - Lemmatization using Stanza
  - Morphological segmentation via 10-shot GPT-4 prompt learning
  - Structured output API employed to enhance instruction-following capability
  - Validation by native speakers with error rates controlled at 0-4.67%
- Embedding Acquisition:
  - Semantic Embeddings: FastText subword embeddings for morphemes
  - Phonological Embeddings: Mean pooling of PanPhon feature vectors
- Representational Similarity Analysis (RSA)
  - Computes Spearman correlation coefficients between phonological and semantic similarity matrices
  - Detects global monotonic alignment
- Mutual Information (MI) Test
  - Discretizes similarities into 20 equal-width bins
  - Measures non-linear statistical dependencies
- k-Nearest Neighbor Overlap (kNN overlap)
  - Calculates the overlap proportion of 10 nearest neighbors for each morpheme in phonological and semantic spaces
  - Evaluates local neighborhood alignment
- Canonical Correlation Analysis (CCA)
  - Extracts the first five canonical variable pairs
  - Identifies dimensions of maximal phonosemantic alignment
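The global (RSA) and local (kNN overlap) measures listed above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: cosine similarity is an assumption (this summary does not name the similarity metric), and the toy vectors and all function names are hypothetical.

```python
import math
from itertools import combinations


def mean_pool(vectors):
    """Average equal-length feature vectors (e.g. one per phone in a morpheme)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def rsa(phon, sem):
    """Correlate the upper triangles of the two pairwise similarity matrices."""
    pairs = list(combinations(range(len(phon)), 2))
    p = [cosine(phon[i], phon[j]) for i, j in pairs]
    s = [cosine(sem[i], sem[j]) for i, j in pairs]
    return spearman(p, s)


def knn_overlap(phon, sem, k=10):
    """Mean fraction of k nearest neighbours shared between the two spaces."""
    def neighbours(space, i):
        others = [j for j in range(len(space)) if j != i]
        others.sort(key=lambda j: cosine(space[i], space[j]), reverse=True)
        return set(others[:k])

    n = len(phon)
    return sum(len(neighbours(phon, i) & neighbours(sem, i)) / k for i in range(n)) / n


# Hypothetical toy data: 5 morphemes, 3-d phonological and 2-d semantic vectors.
toy_phon = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.4], [0.0, 1.0, 0.2], [0.1, 0.8, 0.9], [0.5, 0.5, 0.0]]
toy_sem = [[0.2, 0.9], [0.8, 0.1], [0.3, 0.7], [0.9, 0.4], [0.5, 0.5]]
print("RSA rho:", rsa(toy_phon, toy_sem))
print("kNN overlap (k=2):", knn_overlap(toy_phon, toy_sem, k=2))
```

Both functions are quadratic in the number of morphemes, which is acceptable for a sketch at the scale reported here (roughly 1,200 to 2,200 morphemes per language) but would normally be vectorized in practice.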
Targeting five hypothesized phonosemantic scales:
- Magnitude-Sonority
- Angularity-Obstruency (i.e., the Kiki-Bouba effect)
- Fluidity-Continuity
- Brightness-Vowel Frontness
- Agility-Phonological Lightness
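One plausible way to test a hypothesized scale such as Magnitude-Sonority is to project each morpheme's semantic embedding onto an axis between two pole vectors and rank-correlate the result with a phonological score. The sketch below illustrates that idea only; this summary does not specify how the paper constructs its scales, so the pole vectors, the sonority scores, and all names are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def project(vec, pole_a, pole_b):
    """Coordinate of vec along the unit axis pointing from pole_b to pole_a."""
    axis = [a - b for a, b in zip(pole_a, pole_b)]
    norm = math.sqrt(sum(c * c for c in axis))
    if norm == 0:
        raise ValueError("poles must differ")
    return sum(v * c for v, c in zip(vec, axis)) / norm


# Hypothetical toy data: 2-d "semantic" vectors and made-up sonority scores.
BIG, SMALL = [1.0, 0.2], [-1.0, 0.1]
semantic = {"m1": [0.9, 0.3], "m2": [0.4, 0.8], "m3": [-0.2, 0.5], "m4": [-0.7, 0.1]}
sonority = {"m1": 0.8, "m2": 0.9, "m3": 0.3, "m4": 0.2}

names = sorted(semantic)
magnitude = [project(semantic[m], BIG, SMALL) for m in names]
rho = spearman(magnitude, [sonority[m] for m in names])
print(f"Magnitude-Sonority rho = {rho:.3f}")
```

A positive rho would mean that morphemes closer to the "big" pole tend to have higher sonority scores; the sign convention depends entirely on which pole is passed first.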
- LLM-Assisted Morphological Segmentation: First application of GPT-4 for large-scale multilingual morphological segmentation
- Multi-dimensional Statistical Analysis: Combines linear and non-linear methods for comprehensive assessment of phonosemantic alignment
- Canonical Variable Interpretation Framework: Provides an interpretable analytical approach to phonosemantic alignment
- Cross-linguistic Comparative Design: Encompasses six typologically diverse languages across four language families (Indo-European, Uralic, Turkic, and Dravidian)
- Language Selection: English, Spanish, Hindi, Finnish, Turkish, Tamil
- Data Scale: 1,217-2,153 morphemes per language
- Data Source: Eight text domains from the Wordfreq module (Wikipedia, subtitles, news, etc.)
- Global Analysis: Spearman correlation coefficient, mutual information values, kNN overlap proportions
- Subspace Analysis: Rank correlations of projected coordinates
- Significance Testing: 1,000 permutation tests with p-value threshold of 0.05
- Phonological Features: 21-dimensional PanPhon feature vectors
- Semantic Features: 300-dimensional FastText dense embeddings
- Statistical Testing: 500-point null distributions with repeated runs for stability verification
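The binned mutual information estimate and the permutation-based significance test can be sketched as follows. This is an illustration under assumptions, not the paper's code: it uses a simple plug-in MI estimator, and it shuffles a flat list of values for brevity, whereas a test on similarity matrices would more properly permute morpheme labels (rows and columns together). Function names are hypothetical; the defaults of 20 bins and 1,000 permutations follow the figures given in this summary.

```python
import math
import random
from collections import Counter


def equal_width_bins(values, n_bins=20):
    """Map each value to one of n_bins equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid division by zero for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=20):
    """Plug-in estimate of MI (in bits) between two discretised samples."""
    bx, by = equal_width_bins(x, n_bins), equal_width_bins(y, n_bins)
    n = len(bx)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def permutation_p_value(x, y, statistic, n_perm=1000, seed=0):
    """One-sided p-value: share of shuffles whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: y is a noisy copy of x, so some dependence is expected.
rng = random.Random(1)
xs = [rng.random() for _ in range(200)]
ys = [v + 0.1 * rng.random() for v in xs]
mi, p = permutation_p_value(xs, ys, mutual_information, n_perm=200)
print(f"MI = {mi:.3f} bits, p = {p:.4f}")
```

The plug-in estimator is biased upward for small samples, which is one reason to compare against a permutation null rather than against zero.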
| Language | Morphemes | RSA(ρ) | MI(bits) | kNN Overlap | CCA CV1(ρ) |
|---|---|---|---|---|---|
| English | 2,153 | -0.027 | 0.001 | 0.020* | 0.376* |
| Spanish | 1,929 | 0.021 | 0.001 | 0.032* | 0.598* |
| Hindi | 1,714 | -0.038 | 0.004 | 0.025* | 0.554* |
| Finnish | 1,719 | 0.123 | 0.015 | 0.034* | 0.519* |
| Turkish | 1,626 | 0.132 | 0.015 | 0.034* | 0.538* |
| Tamil | 1,217 | 0.034 | 0.007 | 0.039* | 0.538* |
Key Findings:
- RSA and MI values are non-significant in all six languages, so there is no evidence of a global monotonic or non-linear alignment between the two spaces
- kNN overlap is significant across all languages (p<0.001), indicating local neighborhood alignment
- First canonical variable correlations exceed 0.5 for all languages except English
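For reference, the first canonical correlation in standard CCA can be written as below. The notation is ours, not the paper's: $X \in \mathbb{R}^{n \times 21}$ holds the phonological feature vectors and $Y \in \mathbb{R}^{n \times 300}$ the semantic embeddings for $n$ morphemes, both assumed column-centered.

$$
(a_1, b_1) = \arg\max_{a \in \mathbb{R}^{21},\; b \in \mathbb{R}^{300}} \operatorname{corr}(Xa,\, Yb)
= \arg\max_{a,\, b} \frac{a^\top \Sigma_{XY}\, b}{\sqrt{a^\top \Sigma_{XX}\, a}\;\sqrt{b^\top \Sigma_{YY}\, b}},
\qquad
\Sigma_{XY} = \frac{X^\top Y}{n-1},\quad
\Sigma_{XX} = \frac{X^\top X}{n-1},\quad
\Sigma_{YY} = \frac{Y^\top Y}{n-1}.
$$

Each later pair $(a_i, b_i)$ maximizes the same objective subject to its canonical variates $Xa_i$ and $Yb_i$ being uncorrelated with all earlier ones, so the correlations are non-increasing from the first pair onward.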
| Language | Magnitude-Sonority | Angularity-Obstruency | Fluidity-Continuity | Brightness-Vowel Frontness | Agility-Phonological Lightness |
|---|---|---|---|---|---|
| English | 0.050* | 0.009 | 0.021* | -0.012 | 0.017 |
| Spanish | -0.075* | 0.111* | -0.088* | -0.025* | 0.074* |
| Hindi | 0.061* | 0.008 | 0.000 | 0.028* | 0.024* |
| Finnish | 0.018 | 0.136* | 0.105* | 0.101* | -0.001 |
| Turkish | 0.021* | 0.011 | -0.085* | 0.002 | -0.039* |
| Tamil | 0.001 | 0.113* | -0.036* | -0.006 | -0.032* |
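To make the pattern in the table above easier to read, the short sketch below tallies, for each scale, the languages with a starred positive or starred negative correlation. The values are transcribed from the table; treating the asterisk as a significance marker is an assumption, and the helper name `tally` is hypothetical.

```python
SCALES = [
    "Magnitude-Sonority",
    "Angularity-Obstruency",
    "Fluidity-Continuity",
    "Brightness-Vowel Frontness",
    "Agility-Phonological Lightness",
]

# (correlation, starred) per scale, in the column order of SCALES.
RESULTS = {
    "English": [(0.050, True), (0.009, False), (0.021, True), (-0.012, False), (0.017, False)],
    "Spanish": [(-0.075, True), (0.111, True), (-0.088, True), (-0.025, True), (0.074, True)],
    "Hindi": [(0.061, True), (0.008, False), (0.000, False), (0.028, True), (0.024, True)],
    "Finnish": [(0.018, False), (0.136, True), (0.105, True), (0.101, True), (-0.001, False)],
    "Turkish": [(0.021, True), (0.011, False), (-0.085, True), (0.002, False), (-0.039, True)],
    "Tamil": [(0.001, False), (0.113, True), (-0.036, True), (-0.006, False), (-0.032, True)],
}


def tally(results, scales):
    """Return {scale: (starred-positive languages, starred-negative languages)}."""
    out = {}
    for idx, scale in enumerate(scales):
        pos = [lang for lang, row in results.items() if row[idx][1] and row[idx][0] > 0]
        neg = [lang for lang, row in results.items() if row[idx][1] and row[idx][0] < 0]
        out[scale] = (pos, neg)
    return out


for scale, (pos, neg) in tally(RESULTS, SCALES).items():
    print(f"{scale}: positive in {pos}, negative in {neg}")
```

Reading the tally this way separates scales whose starred correlations all point in one direction from those whose sign flips between languages.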
- CV1: Tension/Directional Attachment ↔ Tension (ρ=0.376)
- CV2: Scalarity ↔ Concentration (ρ=0.318)
- CV3: Informality ↔ Articulatory Ease (ρ=0.315)
- CV4: Documentation ↔ Contractility (ρ=0.176)
- Informality-Articulatory Ease scale identified in both English and Finnish
- Hindi reveals Stillness-Resonance scale, associating sacred sounds like "ॐ" (om) with resonant phonological features
The study validates the necessity of morphological segmentation, avoiding transitivity confounds at the lexical level.
- Psycholinguistic Research: Kiki-Bouba effect, magnitude-sonority correspondence
- Computational Linguistics: Blasi et al.'s large-scale phonosemantic association research
- Sound Symbolism: Bolinger's analysis of English phonosemantic networks
- Scale Advantage: First large-scale distributional analysis across six languages
- Methodological Innovation: Combines multiple statistical methods with LLM-assisted segmentation
- Novel Findings: Identifies phonosemantic alignments not previously reported in the literature
- Phonosemantic iconicity operates primarily through specific dimensions and local neighborhoods, rather than global monotonic properties
- Supports the theoretical coexistence of linguistic arbitrariness and phonosemantic iconicity
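- Angularity-Obstruency scale receives the most consistent cross-linguistic support: it is significantly positive in Spanish, Finnish, and Tamil and never significantly negative, in line with the Kiki-Bouba effect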
- Identifies multiple novel interpretable phonosemantic alignments
- Sample Scale: Limited morpheme set size due to LLM segmentation costs
- Language Coverage: Only six languages covered; cross-linguistic patterns require further validation
- Tool Dependency: Quality of linguistic tools for low-resource languages may impact results
- Reproducibility: LLM methods make complete reproduction challenging
- Extended Language Coverage: Analyze more languages to clarify cross-linguistic variation patterns
- Multimodal Iconicity: Investigate graphical-semantic iconicity in Chinese characters and sign language iconicity
- Additional Subspace Analyses: Evaluate more manually-defined phonosemantic alignments
- Methodological Innovation: First systematic application of distributional methods to quantify phonosemantic iconicity
- Cross-linguistic Perspective: Typologically diverse design spanning four language families (Indo-European, Uralic, Turkic, and Dravidian)
- Statistical Rigor: Employs multiple complementary statistical methods, enhancing result credibility
- Interpretability: Canonical variable analysis provides intuitive explanations of phonosemantic alignments
- Empirical Findings: Both validates known phenomena and discovers novel phonosemantic alignments
- Theoretical Depth: Lacks in-depth exploration of cognitive mechanisms underlying phonosemantic iconicity
- Methodological Limitations: Morphological segmentation depends on LLM, potentially introducing systematic bias
- Result Interpretation: Semantic pole interpretations of some canonical variables are somewhat subjective
- Effect Sizes: Several statistically significant correlations are small in magnitude, which limits their practical significance
- Academic Contribution: Provides new computational methodology for sound symbolism research
- Practical Value: Applicable to language acquisition, brand naming, and other practical scenarios
- Reproducibility: Complete code and data provided, facilitating subsequent research
- Linguistic Research: Cross-linguistic comparative studies of sound symbolism
- Psycholinguistics: Investigation of relationships between phonological perception and semantic processing
- Applied Linguistics: Language teaching, brand naming, poetic analysis, etc.
- Blasi, D. E., et al. (2016). Sound–meaning association biases evidenced across thousands of languages. PNAS.
- Ćwiek, A., et al. (2021). The bouba/kiki effect is robust across cultures and writing systems. Phil. Trans. R. Soc. B.
- Bolinger, D. L. (1950). Rime, assonance, and morpheme analysis. WORD.
- Vainio, L. (2021). Magnitude sound symbolism influences vowel production. Journal of Memory and Language.
This paper contributes both methodological advances and empirical findings to phonosemantic iconicity research. While there remains room for improvement in theoretical depth and methodological refinement, its cross-linguistic perspective and computational innovations establish a solid foundation for future work in this field.