The Price-Pareto growth model of networks with community structure

Brzozowski, Gagolewski, Siudem et al.
We introduce a new analytical framework for modelling degree sequences in individual communities of real-world networks, e.g., citations to papers in different fields. Our work is inspired by Price's model and its recent generalisation called 3DSI (three dimensions of scientific impact), which assumes that citations are gained partly accidentally, and to some extent preferentially. Our generalisation is motivated by existing research indicating significant differences between how various scientific disciplines grow, namely, minding different growth ratios, average reference list lengths, and preferential citing tendencies. Extending the 3DSI model to heterogeneous networks with a community structure allows us to devise new analytical formulas for, e.g., citation number inequality and preferentiality measures. We show that the distribution of citations in a community tends to a Pareto type II distribution. We also present analytical formulas for estimating its parameters and Gini's index. The new model is validated on real citation networks.

The Price-Pareto Growth Model of Networks with Community Structure

Basic Information

  • Paper ID: 2510.13392
  • Title: The Price-Pareto growth model of networks with community structure
  • Authors: Łukasz Brzozowski, Marek Gagolewski, Grzegorz Siudem, Barbara Żogała-Siudem
  • Classification: physics.soc-ph, cs.SI, stat.AP
  • Publication Date: October 15, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.13392

Abstract

This paper proposes a novel analytical framework for modeling degree sequences of individual communities in real-world networks, such as citation patterns across different research fields. The work is inspired by the Price model and its recent generalization, the 3DSI (Three Dimensions of Scientific Impact) model, which assumes that citations are acquired partly randomly and partly through preferential attachment. The research motivation stems from existing evidence showing significant differences across scientific disciplines in growth patterns, including varying growth rates, average reference list lengths, and preferential citation tendencies. The 3DSI model is extended to heterogeneous networks with community structure, enabling the design of new analytical formulas to calculate citation inequality and preferentiality measures. The study demonstrates that citation distributions within communities tend toward Pareto II distributions and provides analytical formulas for estimating their parameters and Gini coefficients.

Research Background and Motivation

Problem Definition

This research addresses a gap in existing citation network models: they cannot handle community structure. Traditional growth models such as the Barabási-Albert model and the Price model explain scale-free properties, but they assume a relatively homogeneous network and therefore cannot capture local variability, in particular the differences between communities.

Problem Significance

  1. Disciplinary Differences: Different scientific disciplines exhibit significant variations in network growth patterns, including growth rates, average reference list lengths, and preferential citation tendencies
  2. Ubiquity of Community Structure: Community structure plays important roles in biological, urban, and social networks but is frequently overlooked in modern citation network modeling
  3. Missing Analytical Tools: Lack of analytical tools that simultaneously provide theoretical insights and handle community structure

Limitations of Existing Approaches

  1. Simple Network Models: While BA model, Price model, and 3DSI model have good analytical properties, they do not support community structure
  2. Complex Technical Models: Graph neural networks and graph variational autoencoders can handle communities but behave as black boxes and offer little theoretical insight
  3. Computationally Complex Models: Exponential random graph models are statistically precise but require substantial computation to fit real data

Core Contributions

  1. Proposes Price-Pareto Growth Model: Extends the 3DSI model to heterogeneous networks with community structure, allowing different communities to have different parameters
  2. Theoretical Analysis: Proves that citation distributions within communities converge to Pareto II distributions and derives related analytical formulas
  3. Gini Coefficient Formula: Provides exact analytical formulas for computing Gini coefficients within communities and for the entire network
  4. Parameter Estimation Methods: Develops multiple parameter estimation methods, particularly estimators based on Gini coefficients
  5. Empirical Validation: Validates model effectiveness on CORA and DBLP datasets

Methodology Details

Task Definition

  • Input: Citation networks with community structure
  • Output: Degree sequence models for each community and their parameters
  • Objective: Accurately model citation distribution characteristics within each community

Model Architecture

Review of Baseline 3DSI Model

Core assumptions of the standard 3DSI model:

  • Each iteration adds a new node that makes m citations (outgoing edges)
  • (1-ρ)m of these citations are allocated uniformly at random among existing nodes (accidental citations)
  • ρm of these citations are allocated through preferential attachment (preferential citations)

Degree recurrence relation:

d^(t)(ℓ) = d^(t-1)(ℓ) + Acc^(t)(ℓ) + ρm * [d^(t-1)(ℓ) + Acc^(t)(ℓ)] / [(t-1)m + (1-ρ)m]
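
As a worked simplification, taking the recurrence exactly as written above, the denominator collapses to a single factor, so each step rescales the current degree plus accidental income by one multiplier. The algebra below is only a restatement of that line, not an additional result from the paper:

```latex
\begin{aligned}
(t-1)m + (1-\rho)m &= (t-\rho)\,m, \\
d^{(t)}(\ell) &= \Bigl(1 + \frac{\rho m}{(t-\rho)\,m}\Bigr)\bigl[d^{(t-1)}(\ell) + \mathrm{Acc}^{(t)}(\ell)\bigr]
               = \frac{t}{t-\rho}\,\bigl[d^{(t-1)}(\ell) + \mathrm{Acc}^{(t)}(\ell)\bigr].
\end{aligned}
```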

Community Structure Extension

Key Extensions:

  1. Community Assignment: New nodes are assigned to community i with probability p_i
  2. Parameter Heterogeneity: Each community has its own m_i and ρ_i parameters
  3. Citation Rules:
    • Accidental citations are randomly selected from the entire network
    • Preferential citations are selected only from the same community
    • Self-loops are not allowed
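
The citation rules above can be sketched as a small Monte Carlo simulation. This is an illustrative sketch under several assumptions that are not in the source: citation counts are rounded to integers, the seed node is placed in community 0, multi-edges are allowed, preferential targets are drawn with weight in-degree + 1 (so that uncited nodes can still be chosen), and preferential citations are dropped when the community has no earlier members. The function name `simulate_growth` is hypothetical.

```python
import random

def simulate_growth(n_nodes, p, m, rho, seed=0):
    """Grow a directed network following the community citation rules (sketch)."""
    rng = random.Random(seed)
    community = [0]                  # assumption: seed node 0 sits in community 0
    in_degree = [0]
    members = [[] for _ in p]        # node ids per community
    members[0].append(0)
    edges = []                       # (source, target, kind) with kind in {"acc", "pref"}
    for new in range(1, n_nodes):
        c = rng.choices(range(len(p)), weights=p)[0]   # community i with probability p_i
        n_pref = round(rho[c] * m[c])                  # assumption: integer rounding
        n_acc = round(m[c]) - n_pref
        # Accidental citations: uniform over ALL existing nodes (never the new one).
        for _ in range(n_acc):
            v = rng.randrange(new)
            edges.append((new, v, "acc"))
            in_degree[v] += 1
        # Preferential citations: same community only, after accidental income is added.
        if members[c] and n_pref > 0:
            weights = [in_degree[v] + 1 for v in members[c]]   # assumption: +1 offset
            for v in rng.choices(members[c], weights=weights, k=n_pref):
                edges.append((new, v, "pref"))
                in_degree[v] += 1
        community.append(c)
        in_degree.append(0)
        members[c].append(new)
    return community, in_degree, edges

community, in_degree, edges = simulate_growth(500, [0.6, 0.4], [2, 3], [0.5, 0.7], seed=1)
```

Because targets are always drawn from nodes that existed before the new one, self-loops cannot occur. Recomputing the weights at every step keeps the sketch short but makes it quadratic in the number of nodes.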

Recurrence Formula:

d_i^(t)(ℓ) = d_i^(t-1)(ℓ) + Acc_i^(t)(ℓ) + ρ_i*m_i * [d_i^(t-1)(ℓ) + Acc_i^(t)(ℓ)] / Σ_{r=1}^{t-1}[d_i^(t-1)(r) + Acc_i^(t)(r)]

Accidental Income Calculation

Randomness in network growth is modeled through a negative binomial distribution, which yields the accidental income:

Acc_i^(t)(ℓ) = ⟨a⟩/(t-1)

where ⟨a⟩ = ⟨m⟩ - ⟨ρm⟩ is the weighted average number of accidental citations.

Closed-Form Solution

Introducing the effective parameter ν_i = ρ_i*m_i/(⟨a⟩ + ρ_i*m_i), the closed-form solution is:

d_i^(t)(ℓ) = (⟨a⟩/ν_i) * [Γ(ℓ-ν_i)*Γ(t) / (Γ(ℓ)*Γ(t-ν_i)) - 1]
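
The closed form can be evaluated numerically as in the minimal sketch below. It assumes that the "weighted average" ⟨a⟩ uses the community probabilities p_i as weights, and that ρ_i > 0 so that ν_i is non-zero; the function names are hypothetical. Log-gamma is used because Γ(t) overflows quickly for large t.

```python
import math

def mean_accidental(p, m, rho):
    """<a> = <m> - <rho m>, assuming the averages are weighted by p_i."""
    return sum(pi * (1.0 - ri) * mi for pi, mi, ri in zip(p, m, rho))

def effective_nu(a, m_i, rho_i):
    """Effective parameter nu_i = rho_i*m_i / (<a> + rho_i*m_i)."""
    return rho_i * m_i / (a + rho_i * m_i)

def expected_degree(t, ell, a, nu):
    """Closed-form d_i^(t)(ell), evaluated via log-gamma to avoid overflow."""
    log_ratio = (math.lgamma(ell - nu) + math.lgamma(t)
                 - math.lgamma(ell) - math.lgamma(t - nu))
    return (a / nu) * (math.exp(log_ratio) - 1.0)

# Illustrative (made-up) parameters for two communities.
a = mean_accidental([0.6, 0.4], [2.0, 3.0], [0.5, 0.7])
nu_0 = effective_nu(a, 2.0, 0.5)
oldest = expected_degree(100, 1, a, nu_0)     # node that joined first
newest = expected_degree(100, 100, a, nu_0)   # node that just joined
```

A quick sanity check on the formula: for ℓ = t the gamma ratio equals 1, so a node that has just joined has degree 0, and older nodes (smaller ℓ) have larger values.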

Technical Innovations

  1. Local Time Concept: Introduces local time relative to community size, enabling handling of communities with different growth rates
  2. Mixed Distribution Handling: Models network growth randomness through negative binomial distribution, precisely calculating accidental income
  3. Effective Parameters: Introduces ν_i as an "effective" version of ρ in the standard 3DSI model, simplifying analysis
  4. Asymptotic Analysis: Proves degree distribution convergence to Pareto II distribution, establishing connections between Price model and Pareto distribution

Experimental Setup

Datasets

  1. CORA Dataset:
    • 2,708 nodes, 5,429 edges
    • 7 disciplinary communities
    • Average in-degree/out-degree: 2.005
  2. DBLP v14 Author Network:
    • 481,387 nodes, 58,544,370 edges
    • 8 largest communities
    • Average in-degree/out-degree: 121.616
    • Data preprocessing: Aggregates paper citations to author citations, removes self-citations

Evaluation Metrics

  1. Degree Distribution Fitting: Compares observed values with model predictions through density functions
  2. Parameter Estimation Accuracy: Evaluates accuracy of different estimation methods
  3. Gini Coefficient: Compares theoretically computed and empirically measured Gini coefficients

Parameter Estimation Methods

Gini Coefficient-Based Estimator (primary method):

m̂_i = Ψ_i/(N_i-1)
p̂_i = N_i/N  
ρ̂_i = Σ_i(2G_i + N_i - 2G_i*N_i) / [Ψ_i(G_i + 1 - G_i*N_i)]
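
The fraction (2G_i + N_i - 2G_i*N_i)/(G_i + 1 - G_i*N_i) in the estimator is what one gets by solving the within-community Gini formula G = (N-ν)/((N-1)(2-ν)) for ν with t = N_i. The sketch below shows that inversion. It assumes the Gini index is the sample version normalized by (n-1), and it treats Σ_i and Ψ_i as given inputs exactly as they appear in the formula above, without asserting how the paper computes them; all function names are hypothetical.

```python
def gini(values):
    """Sample Gini index, normalized by (n - 1) so the maximum is 1 (assumption)."""
    x = sorted(values)
    n = len(x)
    total = sum(x)
    if n < 2 or total == 0:
        return 0.0
    weighted = sum((2 * (k + 1) - n - 1) * xk for k, xk in enumerate(x))
    return weighted / ((n - 1) * total)

def gini_model(n_i, nu):
    """Within-community Gini from the model: (N - nu) / ((N - 1) * (2 - nu))."""
    return (n_i - nu) / ((n_i - 1) * (2.0 - nu))

def nu_from_gini(g_i, n_i):
    """Invert gini_model for nu; this is the fraction inside the rho estimator."""
    return (2 * g_i + n_i - 2 * g_i * n_i) / (g_i + 1 - g_i * n_i)

def rho_hat(sigma_i, psi_i, g_i, n_i):
    """rho estimator as written above: Sigma_i * nu_hat / Psi_i."""
    return sigma_i * nu_from_gini(g_i, n_i) / psi_i

# Round trip with made-up values: model Gini for nu = 0.6, then recover nu.
g = gini_model(1000, 0.6)
nu_recovered = nu_from_gini(g, 1000)
```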

Alternative Methods:

  • Estimator based on number of edges within communities
  • Linear system solving based on in-degree equations

Experimental Results

Main Results

  1. CORA Dataset: Model performs well across all 7 communities, with excellent fitting quality particularly in distribution tails
  2. DBLP Dataset: Shows good fitting in most of the 8 communities, though certain communities (e.g., "Control theory") show poorer fitting
  3. Global Network: Standard 3DSI model and the proposed model are nearly identical in global degree sequences, except for tail differences

Parameter Estimation Results

CORA Dataset Parameters:

  • m̂_i range: 1.798-2.338
  • ρ̂_i range: 0.457-0.710
  • Gini coefficient range: 0.674-0.757

DBLP Dataset Parameters:

  • m̂_i range: 35.39-144.31
  • ρ̂_i range: 0.523-0.810
  • Gini coefficient range: 0.726-0.814

Key Findings

  1. Parameter Heterogeneity: Significant variation in ρ̂ values across different disciplines within the same network, confirming that different disciplines have different ratios of accidental to preferential citations
  2. Tail Fitting Advantage: Model shows particularly good fitting quality in distribution tails, important for understanding high-citation paper distributions
  3. Global Consistency: Weighted averages of community models are highly consistent with global 3DSI model

Theoretical Analysis

Asymptotic Properties

As t→∞, the degree distribution of community i converges to a Pareto II distribution with density:

f_i(x) = (1/⟨a⟩) * (1 + ν_i*x/⟨a⟩)^{-1-1/ν_i}

Parameters: α = 1/ν_i, λ = ⟨a⟩/ν_i
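
To make the limiting distribution concrete, the sketch below implements the density above, the matching distribution function F(x) = 1 - (1 + ν_i*x/⟨a⟩)^(-1/ν_i), and inverse-transform sampling. The distribution function and sampler follow from the standard Pareto II (Lomax) form with shape α and scale λ; the function names and the parameter values are illustrative assumptions.

```python
import random

def pareto2_pdf(x, a, nu):
    """Density (1/a) * (1 + nu*x/a)^(-1 - 1/nu) for x >= 0."""
    return (1.0 / a) * (1.0 + nu * x / a) ** (-1.0 - 1.0 / nu)

def pareto2_cdf(x, a, nu):
    """Distribution function 1 - (1 + nu*x/a)^(-1/nu)."""
    return 1.0 - (1.0 + nu * x / a) ** (-1.0 / nu)

def pareto2_sample(n, a, nu, seed=0):
    """Inverse-transform sampling with alpha = 1/nu and lambda = a/nu."""
    rng = random.Random(seed)
    alpha, lam = 1.0 / nu, a / nu
    # 1 - U lies in (0, 1], so the power is always finite.
    return [lam * ((1.0 - rng.random()) ** (-1.0 / alpha) - 1.0) for _ in range(n)]

# Illustrative values: <a> = 0.96, nu_i = 0.5 (so alpha = 2, lambda = 1.92).
draws = pareto2_sample(20000, 0.96, 0.5, seed=42)
theoretical_median = (0.96 / 0.5) * (2.0 ** 0.5 - 1.0)
```

For ν_i ≥ 0.5 the variance is infinite, so quantiles such as the median are a more stable check against data than sample moments.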

Gini Coefficient Formula

Within-Community Gini Coefficient:

G_i^(t) = (t-ν_i)/(t-1) * 1/(2-ν_i)
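
As a consistency check between this formula and the Pareto II limit, the large-t value of the Gini coefficient matches the Gini index of a Pareto II distribution with shape α = 1/ν_i, which is the standard result α/(2α-1) for α > 1 (that is, ν_i < 1). The numerical value at the end uses an illustrative ν_i = 0.6, not a value from the paper:

```latex
\begin{aligned}
\lim_{t\to\infty} G_i^{(t)}
  &= \lim_{t\to\infty} \frac{t-\nu_i}{t-1}\cdot\frac{1}{2-\nu_i}
   = \frac{1}{2-\nu_i}, \\
\left.\frac{\alpha}{2\alpha-1}\right|_{\alpha = 1/\nu_i}
  &= \frac{1/\nu_i}{2/\nu_i - 1}
   = \frac{1}{2-\nu_i}, \\
\nu_i = 0.6 \;&\Rightarrow\; \lim_{t\to\infty} G_i^{(t)} = \frac{1}{1.4} \approx 0.714 .
\end{aligned}
```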

Global Gini Coefficient: Expressed as an integral over the mixture of community distributions, which involves hypergeometric functions; the paper also provides practical approximation formulas.

Related Work

Foundational Network Growth Models

  • Price Model: First introduced preferential attachment and "rich-get-richer" phenomenon
  • Barabási-Albert Model: Popularized preferential attachment for undirected networks and analyzed the resulting scale-free degree distribution
  • Bianconi-Barabási Fitness Model: Introduced concept of node intrinsic "fitness"

Community Structure Models

  • Stochastic Block Model (SBM): Classical generative model with community structure
  • Topic Models: Such as Latent Dirichlet Allocation (LDA), predicting links based on topic similarity
  • Relational Topic Model (RTM): Combines LDA and link prediction

Modern Approaches

  • Graph Neural Networks: Such as graph convolutional networks, but lacking statistical exactness
  • Exponential Random Graph Models: Statistically rigorous framework but computationally complex
  • 3DSI Model: Direct foundation of this work, but does not support community structure

Conclusions and Discussion

Main Conclusions

  1. Successfully extends 3DSI model to networks with community structure while maintaining good analytical properties
  2. Theoretically proves that community degree distributions converge to Pareto II distributions
  3. Provides complete parameter estimation framework and Gini coefficient calculation formulas
  4. Validates model effectiveness on real data

Limitations

  1. Global Degree Sequence: Cannot obtain simple analytical representation of global degree sequences due to complexity of community mixing
  2. Model Assumptions: Assumes accidental citations are uniformly distributed across the network and preferential citations are limited within communities
  3. Parameter Independence: ν_i values are not independent across communities, increasing analytical complexity
  4. Fitting Quality: Certain real network communities cannot be perfectly fitted, reflecting unpredictability of real network behavior

Future Directions

  1. Benchmark Graph Generation: Develop algorithmic frameworks that generate benchmark graphs for testing community detection algorithms
  2. Non-Uniform Accidental Edges: Consider non-uniform distribution of accidental edges
  3. Time-Varying Parameters: Study how parameters vary with network scale
  4. Cross-Disciplinary Citations: Model temporal changes in cross-disciplinary citation trends

In-Depth Evaluation

Strengths

  1. Theoretical Rigor: Provides complete mathematical derivations and asymptotic analysis
  2. Strong Practicality: Parameter estimation methods are simple and direct, easy to apply
  3. Innovation: First to handle community structure within preferential attachment framework
  4. Sufficient Validation: Validated on two real datasets of different scales
  5. Complete Analysis: Comprehensive analysis chain from recurrence relations to closed-form solutions to asymptotic properties

Weaknesses

  1. Model Limitations: Allocation rules for accidental and preferential citations are relatively simplified
  2. Community Detection: Depends on pre-given community partitions, does not address community discovery
  3. Dynamics: Does not consider temporal evolution of community structure
  4. Validation Scope: Validated only on citation networks; applicability to other network types remains unknown

Impact

  1. Theoretical Contribution: Establishes new connections between Price model and Pareto distribution
  2. Methodology: Provides new tool for network science to model community structure
  3. Application Value: Has direct application value for scientometrics and network analysis
  4. Reproducibility: Provides clear algorithms and formulas, easy to reproduce

Applicable Scenarios

  1. Scientometrics: Analyzing citation patterns across different disciplines
  2. Social Networks: Modeling growth of social networks with group structure
  3. Benchmark Testing: Providing benchmark networks for community detection algorithms
  4. Policy Analysis: Understanding impacts of disciplinary development and resource allocation

References

Key references include:

  • Price (1965): Networks of scientific papers - Original Price model
  • Siudem et al. (2020): Three dimensions of scientific impact - 3DSI model
  • Albert & Barabási (2002): Statistical mechanics of complex networks - BA model
  • Fortunato (2010): Community detection in graphs - Community detection survey
  • Holland et al. (1983): Stochastic blockmodels - Stochastic block model

This paper makes important contributions at the intersection of network science and scientometrics, providing new theoretical tools for understanding network growth with community structure through rigorous mathematical analysis and empirical validation.