2025-11-22T17:25:16.377518

Generalized Top-k Mallows Model for Ranked Choices

Haddadan, Ahmadian
The classic Mallows model is a foundational tool for modeling user preferences. However, it has limitations in capturing real-world scenarios, where users often focus only on a limited set of preferred items and are indifferent to the rest. To address this, extensions such as the top-k Mallows model have been proposed, aligning better with practical applications. In this paper, we address several challenges related to the generalized top-k Mallows model, with a focus on analyzing buyer choices. Our key contributions are: (1) a novel sampling scheme tailored to generalized top-k Mallows models, (2) an efficient algorithm for computing choice probabilities under this model, and (3) an active learning algorithm for estimating the model parameters from observed choice data. These contributions provide new tools for analysis and prediction in critical decision-making scenarios. We present a rigorous mathematical analysis for the performance of our algorithms. Furthermore, through extensive experiments on synthetic data and real-world data, we demonstrate the scalability and accuracy of our proposed methods, and we compare the predictive power of Mallows model for top-k lists compared to the simpler Multinomial Logit model.
academic

Generalized Top-k Mallows Model for Ranked Choices

Basic Information

  • Paper ID: 2510.22040
  • Title: Generalized Top-k Mallows Model for Ranked Choices
  • Authors: Shahrzad Haddadan (Rutgers Business School), Sara Ahmadian (Google Research)
  • Classification: cs.LG, cs.DS, stat.ML
  • Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
  • Paper Link: https://arxiv.org/abs/2510.22040

Abstract

The classical Mallows model is a foundational tool for modeling user preferences, but has limitations in capturing real-world scenarios where users typically focus on a limited set of preferred items and are indifferent toward the rest. This paper addresses several challenges in the generalized top-k Mallows model, focusing on buyer choice analysis. Core contributions include: (1) a novel sampling scheme tailored for the generalized top-k Mallows model; (2) an efficient algorithm for computing choice probabilities under this model; (3) an active learning algorithm for estimating model parameters from observed choice data. These contributions provide new tools for analysis and prediction in critical decision-making scenarios, with scalability and accuracy validated through rigorous mathematical analysis and extensive experiments on both synthetic and real data.

Research Background and Motivation

Problem Definition

The core problem addressed is: How to effectively model user preferences and predict choice behavior in realistic scenarios where users provide only partial ranking information (top-k lists) rather than complete rankings.

Problem Significance

  1. Broad practical applications: Recommendation systems, advertising platforms, search engines, news aggregators, and other domains where users typically express preferences for only a limited number of items
  2. Business decision value: Accurate prediction of customer choice behavior is crucial for revenue management, product assortment optimization, and other strategic decisions
  3. Theoretical significance: Extending classical probabilistic ranking models to more realistic partial ranking scenarios

Limitations of Existing Methods

  1. Classical Mallows model: Limited to full permutations, cannot handle partial rankings
  2. Multinomial Logit (MNL) model: While simple, has limited expressiveness and satisfies the Independence of Irrelevant Alternatives (IIA) assumption, restricting modeling of complex choice behavior
  3. Existing top-k extensions: The top-k Mallows model proposed by Chierichetti et al. (2018) lacks efficient sampling algorithms and choice probability computation methods
  4. Parameter learning difficulty: Learning model parameters from choice data (rather than complete rankings) lacks theoretical guarantees

Research Motivation

The authors argue that extending the Mallows model to top-k lists can more realistically reflect user preference structures, but requires solving three key algorithmic problems: efficient sampling, choice probability computation, and parameter learning.

Core Contributions

The main contributions of this paper include:

  1. PRIM Sampling Algorithm: Proposes Profile-based Repeated Insertion Method (PRIM) for efficient sampling from the TopKGMM distribution, reducing time complexity from O(k²4^k + k²log n) to O(k2^k + k²log n)
  2. DYPCHIP Choice Probability Algorithm: Designs Dynamic Programming for Choice Probabilities (DYPCHIP) algorithm to efficiently compute choice probabilities under the top-k Mallows model
  3. BUCCHOI Active Learning Algorithm: Develops Build Center from Choices (BUCCHOI) active learning algorithm that infers the distribution center and center size k by presenting product assortments of specified sizes and observing choices, with sample complexity Θ(r²log n)
  4. Model Generalization: Extends Chierichetti et al.'s top-k Mallows model to a generalized version (TopKGMM) that associates weights with each product
  5. Empirical Validation: Demonstrates on the Sushi preference dataset that the top-k Mallows model achieves significantly higher predictive accuracy than the MNL model

Method Details

Task Definition

Input:

  • Product set N = n ∪ {∅} (containing n products and a "no purchase" option)
  • Product assortment A ⊆ n

Output:

  • Probability that a user selects product i from assortment A: C_D(i, A)

Model Assumptions:

  • User preferences follow a TopKGMM distribution D
  • User choice function: C_τ(A) = i if and only if i is the highest-ranked element in A∪{∅} relative to ranking τ

Model Architecture

1. TopKGMM Distribution Definition

Given center τ* ∈ T_{k,N} and parameters β ≥ 0, w ∈ R^{k+1}_{≥0}, p > 0, the probability of a top-k list τ is:

P_D[τ] ∝ exp(-β K_{p,w}(τ, τ*))

where the distance metric is defined as:

K_{p,w}(τ, τ*) = w_0·p·Q(τ) + Σ_{i∈[k]} w_i·(I_i(τ) + p·P_i(τ))

Key Components:

  • I_i(τ) (inversion vector): Number of elements with lower priority than i but ranked higher in τ
  • P_i(τ): Number of elements originally ranked lower than i but incomparable in τ
  • Q(τ): Number of incomparable pairs among unranked elements in τ*
  • w_i: Element weights, allowing different importance for different elements

2. Profile Concept

Profile Definition: S ⊆ k represents the intersection of a sample top-k list τ with the center τ*, i.e., τ ∩ τ* = S

Profile Probability (Lemma 3.3):

P_D[S] ∝ exp(-βf(S))·Z(S)

where:

f(S) = w_0·p·Q(S) + Σ_{j∈[k]\S} w_j(I_j(S) + p·P_j(S))

Z(S) = C(n-k, k-ℓ)·(k-ℓ)! · Π_{j=1}^ℓ Σ_{r=0}^{k-j} e^{-βw_{s_j}r}

This decomposition is the core algorithmic innovation, decomposing the complex top-k list distribution into profile selection and conditional ranking stages.

Technical Innovations

1. PRIM Sampling Algorithm (Algorithm 3-5)

Core Idea:

  1. First sample profile S according to P_DS
  2. Then construct τ given S through repeated insertion

Algorithm Flow:

1. Precompute probabilities for all profiles (O(2^γk) time, γ=min{k,n-k})
2. Sample profile S
3. Initialize: uniformly randomly sample k-ℓ elements from [n]\[k]
4. Insert elements s ∈ S in priority order, with probability of inserting s at position j:
   P(insert s at position j) ∝ exp(-βw_s·j)

Innovations:

  • Solves the open problem left by Chierichetti et al. (lack of RIM-like sampling methods)
  • Significantly reduces complexity through profile decomposition
  • Amortized cost per sample after preprocessing is only O(k log n)

2. DYPCHIP Choice Probability Algorithm (Section 4.1)

Core Idea: Use dynamic programming to compute choice probabilities given profile S

Algorithm Structure:

  • 3D DP table π_S(i, j, q):
    • i: i-th element in product assortment A
    • j: current position
    • q: q-th iteration of PRIM algorithm
    • Meaning: Probability that a_i is the highest-ranked element in A∅ and located at position j after q iterations

Recurrence Relations:

When new element not in A:
π_S(i,j,q) = π_S(i,j,q-1)·P(insert>j) + π_S(i,j-1,q-1)·P(insert≤j-1)

When new element in A:
π_S(i,j,q) = π_S(i,j,q-1)·P(insert>j)
π_S(cur,j,q) = P(insert=j)·Σ_{i<cur}Σ_{j'≥j}π_S(i,j',q-1)

Final Choice Probability:

C_D(a,A) = Σ_{S⊆[k]} P_D[S]·(π_S(a) + π̄_S(a))

Complexity: O(2^{min{k,n-k}}·k³·|A|²)

3. BUCCHOI Active Learning Algorithm (Algorithm 2)

Core Idea: Learn the center τ* by actively selecting product assortments and observing choices

Key Sub-algorithm FINDTOP (Algorithm 1):

  • Maintain matrix X_: records the difference in times product i is selected while j is not
  • Decision criterion: Y_ = X_/m; if there exists a such that Y_{aa'} > (1+|A|)/2 for all a'∈A∅{a}, then a is a top element

Theoretical Guarantee (Lemma 5.2):

  • When β ≥ log(3)/w_
  • Using Θ(ζ(r+1)²log n) samples
  • Correctly identifies top elements with probability at least 1-o(n^{-ζ})

BUCCHOI Flow:

  1. Maintain three sets: T (confirmed in τ*), B (confirmed not in τ*), U (unknown)
  2. Repeat: select assortment A of size r, call FINDTOP
  3. If top element found, move to T; otherwise move entire A to B
  4. Finally call SORTCNTR to sort elements in T

Sample Complexity: Θ(r²log n)

Experimental Setup

Datasets

1. Sushi Preference Dataset (Real Data)

  • Source: Kamishima et al. (2005)
  • Scale: 5000 user preferences, each represented as a top-10 list
  • Products: 100 different sushi types
  • Split: 80% training set, 20% test set
  • Purpose: Evaluate model prediction ability and comparison with MNL

2. Synthetic Data

  • Parameter Ranges:
    • n (number of products): 200-20000
    • k (top-k size): 6-16
    • β (decay parameter): 0.01-5
    • p (Kendall's Tau parameter): 0.01-5
    • r (assortment size): adjusted according to k
  • Purpose: Evaluate algorithm accuracy and complexity, verify theoretical results

Evaluation Metrics

1. Prediction Accuracy

  • Test error: Absolute error between empirical choice probability and predicted probability
  • Computation method: Randomly sample assortments on test set, record actual choices, compare with DYPCHIP predictions

2. Learning Accuracy

  • Kendall's Tau distance: Distance between learned center and true center K_p(τ_learned, τ_true)
  • FINDTOP accuracy: Frequency of correctly identifying top elements

3. Time Complexity

  • PRIM: Preprocessing time and amortized per-sample time
  • DYPCHIP: Total time to compute all choice probabilities
  • Learning algorithm: Number of samples needed to achieve specified accuracy

Comparison Methods

  1. Multinomial Logit (MNL): Classical choice model baseline
  2. Chierichetti et al.'s DP Sampling: Previous top-k Mallows sampling method
  3. No Clustering vs Clustering: Single TopKGMM vs mixture TopKGMM (2-5 clusters)

Implementation Details

Sushi Dataset Experiments

  • Parameter tuning: Grid search β∈{0.01,0.025,0.05,...,5}, p∈{0.05,0.1,...,2}
  • Clustering method: k-means based on K_p distance, number of clusters ∈{2,3,5}
  • Silhouette coefficient: Used to evaluate clustering quality
  • Center learning: Use BUCCHOI-II (Algorithm 7) to handle top-10 list samples

Synthetic Data Experiments

  • Repetitions: 10 runs per parameter group
  • Sample range: 50-250 (adjusted according to task)
  • Weight settings: w=2⃗1 or w=32222111, etc.
  • Computing environment: MacBook Pro M1 Max, 32GB RAM

Experimental Results

Main Results

1. TopKGMM vs MNL (Real Data)

Without Clustering (Table 1):

  • Best parameters: β=0.05, p=0.5
  • TopKGMM test error: 0.0817
  • MNL test error: 0.168
  • Relative improvement: 51.4% error reduction

Key Findings:

  • Better performance with smaller β (0.01-0.1), indicating relatively uniform distribution
  • Best performance at p=0.5, balancing different types of inconsistencies
  • TopKGMM significantly outperforms MNL, validating model expressiveness

With Clustering (Table 2, 2 clusters):

  • p=0.1, β=0.1: test error 0.0771
  • p=1.25, β=0.05: test error 0.0788
  • Observation: Slight performance improvement after clustering, but very low silhouette coefficient (<0.012), suggesting data may come from a single distribution

2. DYPCHIP Accuracy (Figure 2a)

Experimental Setup: n=200, k=6, r=4, p=0.5, w=2⃗1

  • Generate samples using PRIM, compute empirical choice frequency
  • Compare with DYPCHIP predictions
  • Repeat on 20 random assortments

Results:

  • Mean absolute error <0.02
  • Small standard deviation, indicating stable predictions
  • Validates correctness of DYPCHIP

3. PRIM Time Complexity (Table 3)

Preprocessing Time (seconds):

k810121416
Time0.0070.0350.191.068.64

Amortized Per-Sample Time (seconds):

k810121416
Time5.67e-59.59e-52.2e-47.4e-43.2e-3

Observations:

  • Preprocessing time grows exponentially with k (consistent with O(k2^k) theory)
  • Amortized time is very small, enabling efficient large-scale sampling
  • Impact of n is minimal (validates O(k log n) complexity)

4. DYPCHIP Time Complexity (Figure 3)

Experimental Parameters: β=0.6, p=0.5, w=2⃗1, r=k-2

  • k=6: approximately 0.05 seconds
  • k=8: approximately 0.5 seconds
  • k=10: approximately 5 seconds
  • k=12: approximately 50 seconds

Observations:

  • Time grows exponentially with k
  • Impact of n is relatively small
  • Computational cost becomes significant for k>12, limiting practical applications

Ablation Studies

1. FINDTOP Sample Complexity (Figure 2b, Figure 4)

Impact of Parameters:

  • Effect of β (n=900, k=10, r=5, p=1):
    • β=0.2: requires 200+ samples to reach 80% accuracy
    • β=0.6: requires 100 samples to reach 95% accuracy
    • β=1.2: requires 50 samples to reach 100% accuracy
    • Conclusion: Larger β leads to more concentrated distribution, easier learning
  • Effect of k:
    • k=12 vs k=14 (other parameters same): larger k requires more samples
    • Consistent with theoretical O(r²log n) complexity
  • Effect of r:
    • r=5 vs r=10: larger r provides more information per observation but also increases noise

2. BUCCHOI Sample Complexity (Figure 2c-d, Figure 5)

Experiment 1 (n=500, k=10, p=0.5, w=2⃗1):

  • 50 samples: Kendall's Tau distance ≈12
  • 100 samples: distance ≈3
  • 200 samples: distance ≈0 (perfect recovery)

Experiment 2 (n=300, k=8, p=2, w=32222111):

  • Non-uniform weights increase learning difficulty
  • Requires more samples to achieve same accuracy
  • Still within Θ(r²log n) range

Effect of β (Table 5, n=1000, k=12):

β0.40.60.81.01.2
50 samples12.758.257.054.350.7
100 samples3.50.350.00.00.0

Observations:

  • β≥1 requires only 100 samples for perfect center recovery
  • β=0.4 still has error even with 100 samples
  • Validates theoretical requirement β≥log(3)/w_

Case Analysis

Sushi Dataset Clustering Analysis

Positive Silhouette Coefficient Clustering:

  • p=0.1, 2 clusters: silhouette coefficient=0.0063
  • p=1.25, 2 clusters: silhouette coefficient=0.0078
  • p=5, 2 clusters: silhouette coefficient=0.0110
  • p=5, 3 clusters: silhouette coefficient=0.0023

Interpretation:

  • Extremely low silhouette coefficients indicate large within-cluster variance
  • Single TopKGMM may be more appropriate
  • Consistent with lower error without clustering

Experimental Findings

  1. Model Expressiveness: TopKGMM significantly outperforms MNL on real data, with error reduction exceeding 50%
  2. Algorithm Efficiency:
    • PRIM is efficient for sampling after preprocessing
    • DYPCHIP is practical for k<12
    • Learning algorithm has logarithmic sample complexity
  3. Parameter Sensitivity:
    • β is a critical parameter affecting distribution concentration and learning difficulty
    • p requires tuning based on data characteristics
    • Non-uniform weights increase learning complexity
  4. Theory Validation: Experimental results are highly consistent with theoretical complexity analysis

1. Choice Modeling

Multinomial Logit (MNL):

  • Proposed by Bradley & Terry (1952)
  • Advantages: simple, interpretable, satisfies IIA
  • Disadvantages: limited expressiveness

Mixed MNL:

  • Extended by McFadden & Train (2000)
  • Learning method: EM algorithm (Dempster et al. 1977)
  • Theoretical guarantees: Chierichetti et al. (2018b), Oh & Shah (2014)

Other Frameworks:

  • Mallows choice models (Désir et al. 2021)
  • Markov chain models (Blanchet et al. 2016)

2. Mallows Model

Classical Mallows Model:

  • Original definition by Mallows (1957)
  • Permutation distribution based on Kendall's Tau distance
  • Closed-form normalization constant

Top-k Extensions:

  • Early work by Fligner & Verducci (1986), Lebanon & Mao (2008)
  • TopKMM proposed by Chierichetti et al. (2018a)
  • Issues: lack of closed-form normalization constant and efficient sampling

Generalized Mallows Model:

  • Weights introduced by Fligner & Verducci (1986)
  • First extension to top-k lists in this paper

3. Mallows Model Applications to Choice

Désir et al.'s Work:

  • Désir et al. (2021) first applied Mallows to choice modeling
  • Demonstrated better prediction than MNL
  • Developed DP algorithm for choice probabilities on full permutations

Subsequent Research:

  • Revenue management and product assortment optimization (Désir et al. 2021, 2023; Feng & Tang 2022)
  • Simplified models (Feng & Tang 2022)

This Paper's Contribution: Extension to top-k lists with complete algorithmic toolkit

4. Parameter Learning

Learning from Complete Rankings:

  • Liu & Moitra (2018), Braverman & Mossel (2008)
  • Awasthi et al. (2014), Seshadri et al. (2020)

Learning from Partial Observations:

  • Pairwise comparisons (Lu & Boutilier 2014; Vitelli et al. 2018)
  • Learning from choices: limited research, lacking theoretical guarantees

This Paper's Contribution:

  • First active learning algorithm for learning top-k Mallows from choice data
  • Provides finite sample complexity guarantee Θ(r²log n)

Conclusions and Discussion

Main Conclusions

  1. Theoretical Contributions:
    • Proposes generalized top-k Mallows model (TopKGMM)
    • Enables efficient algorithm design through profile decomposition
    • Provides rigorous mathematical analysis and complexity guarantees
  2. Algorithmic Contributions:
    • PRIM: O(k2^k + k²log n) sampling algorithm
    • DYPCHIP: O(2^{min{k,n-k}}k³|A|²) choice probability algorithm
    • BUCCHOI: Active learning algorithm with Θ(r²log n) sample complexity
  3. Empirical Contributions:
    • Validates TopKGMM superiority over MNL on real data (51% error reduction)
    • Verifies algorithm accuracy and efficiency
    • Provides parameter tuning guidance

Limitations

  1. Computational Complexity:
    • DYPCHIP's exponential dependence on k limits large-k scenarios (k>15)
    • PRIM preprocessing time grows exponentially with k
    • Practical applications require relatively small k
  2. Theoretical Assumptions:
    • BUCCHOI requires β≥log(3)/w_, limiting low-β scenarios
    • Assumes ∅∉τ*, otherwise only partial center recovery possible
  3. Model Assumptions:
    • Single TopKGMM may be insufficient for capturing multiple user types
    • Parameter learning for mixture models remains open
  4. Experimental Scope:
    • Validation on only one real dataset
    • Requires more domain applications for verification

Future Directions

  1. Learning Mixture TopKGMM:
    • Learn multiple customer types from choice data
    • Similar to mixture MNL learning algorithms
  2. Reducing Computational Complexity:
    • Approximation algorithms to reduce exponential dependence on k
    • Distributed or parallel computing
  3. Model Extensions:
    • Consider contextual information (contextual Mallows)
    • Dynamic preference modeling
  4. Practical Applications:
    • Revenue management and pricing
    • Recommendation system optimization
    • A/B test design

In-Depth Evaluation

Strengths

  1. Theoretical Rigor:
    • All algorithms have rigorous correctness proofs
    • Complete complexity analysis (Theorems 3.1, 4.1, 5.1)
    • Sample complexity has probabilistic guarantees
  2. Methodological Innovation:
    • Profile decomposition is the core innovation, cleverly decomposing complex distribution into tractable parts
    • Solves the open problem left by Chierichetti et al.
    • First implementation of choice probability computation for top-k Mallows
  3. Experimental Sufficiency:
    • Synthetic data broadly covers parameter space
    • Real data validates practical effectiveness
    • Ablation studies analyze parameter impacts
    • Code is publicly available for reproducibility
  4. Practical Value:
    • Significantly outperforms MNL on real data
    • Algorithms are efficient for reasonable parameter ranges
    • Provides complete toolkit (sampling-inference-learning)
  5. Writing Clarity:
    • Clear structure and rigorous logic
    • Accurate mathematical notation definitions
    • Detailed algorithm pseudocode (appendix)

Weaknesses

  1. Computational Scalability:
    • Exponential dependence on k is a fundamental limitation
    • Impractical for k>15 scenarios
    • Lacks discussion of approximation algorithms
  2. Experimental Limitations:
    • Only one real dataset: Sushi dataset may not represent all application scenarios
    • No comparison with other top-k choice models (e.g., top-k MNL variants)
    • Lacks large-scale dataset experiments
  3. Model Assumptions:
    • Assumption ∅∉τ* limits application scope
    • Single distribution assumption may be unrealistic when clustering shows poor results
    • No discussion of robustness to model misspecification
  4. Parameter Selection:
    • β and p selection depends on grid search
    • Lacks automatic parameter selection methods
    • Limited guidance on weight w setting
  5. Theory-Practice Gap:
    • BUCCHOI theoretical requirement β≥log(3)/w_≈1.1, but experiments show β=0.05 also works
    • Theoretical complexity is worst-case, actual performance may be better

Impact

  1. Academic Contribution:
    • Fills gap in top-k Mallows choice modeling
    • Provides foundational algorithms for subsequent research
    • May inspire more Mallows-based choice model research
  2. Practical Value:
    • Recommendation systems: model user preferences for top-k recommendations
    • E-commerce: product assortment optimization
    • Search engines: understand user click behavior
  3. Reproducibility:
  4. Impact Limitations:
    • k limitation may hinder adoption in large-scale applications
    • Requires more real dataset validation for widespread application

Applicable Scenarios

Highly Applicable:

  1. Small to medium k recommendation systems (k≤12):
    • Display top-10 products and predict user selection
    • News recommendation, video recommendation
  2. Product Assortment Optimization:
    • Retailers selecting products to display
    • Menu design
  3. A/B Testing:
    • Compare attractiveness of different product assortments
    • Active learning algorithm reduces test samples

Use with Caution:

  1. Large k scenarios (k>15): High computational cost
  2. Real-time systems: DYPCHIP computation time may not meet latency requirements
  3. Extreme non-uniform weights: Increased learning complexity

Not Applicable:

  1. Complete rankings known: Use classical Mallows model
  2. Completely random user preferences: MNL may be simpler and effective
  3. Interpretability required: Mallows parameters less interpretable than MNL

Key References

  1. Mallows (1957): Original Mallows model
  2. Chierichetti et al. (2018a): Top-k Mallows model foundation
  3. Désir et al. (2021, 2023): Pioneering work on Mallows choice models
  4. Bradley & Terry (1952): MNL model foundation
  5. McFadden & Train (2000): Mixture MNL models
  6. Fligner & Verducci (1986): Generalized Mallows models

Summary

This paper makes important contributions at the intersection of top-k Mallows models and choice modeling. Through clever profile decomposition, the authors design a complete algorithmic toolkit with rigorous theoretical analysis. Experiments validate the method's effectiveness, particularly showing significant advantages over MNL on real data.

Main values include: (1) theoretically solving important open problems; (2) practically providing usable tools; (3) laying foundation for future research.

Main limitation is exponential dependence on k, restricting applications in large-k scenarios. Future research on learning mixture models and developing approximation algorithms will further enhance practical utility.

For applications requiring modeling partial ranking preferences and predicting choice behavior, the TopKGMM framework and algorithms provided by this paper are valuable tools, particularly in scenarios with small k (≤12) and high prediction accuracy requirements.