Epidemiologists and social scientists have used the Network Scale-Up Method (NSUM) for over 30 years to estimate the size of hidden subpopulations in social networks. The method operates by querying a subset of network nodes about the number of their neighbors belonging to the hidden subpopulation. Generally, NSUM assumes that social network topology and hidden subpopulation distribution are well-behaved, such that NSUM estimates approximate true values. However, analytical bounds on NSUM estimation errors have not been rigorously proven. This paper provides analytical error bounds for two of the most popular NSUM estimators. The main findings are twofold: first, when an adversary designs the network and places the hidden subpopulation, estimates may deviate from true values by a factor of Ω(√n); second, when the underlying network is randomly generated, using a sample of logarithmic size O(log n) can achieve small constant-factor error bounds with high probability.
The Network Scale-Up Method (NSUM) is an indirect survey technique used to estimate the size of hard-to-reach hidden populations in social networks, such as disease patients, disaster victims, or members of clandestine networks. The core idea of the method is to ask a subset of nodes in the network: "How many neighbors do you know?" and "How many of them belong to the hidden population?"
Practical Application Value: NSUM has broad applications in public health, social sciences, and security, such as estimating the number of AIDS patients, COVID-19 prevalence, etc.
Theoretical Gap: Despite over 30 years of NSUM usage, there is a lack of rigorous theoretical error bound analysis
Method Reliability: Theoretical guarantees are needed to ensure the accuracy and credibility of estimates
First Theoretical Error Bounds for NSUM: Provides rigorous analytical error bounds for two most popular NSUM estimators (MoR and RoS)
Adversarial Lower Bound Proof: Proves that under adversarial scenarios, any NSUM estimator has error at least Ω(√n)
Upper Bound Analysis on Random Networks: Proves that in random networks, using a sample of size O(log n) can achieve small constant-factor error bounds
Analysis for Specific Network Models: Provides improved analytical bounds for Erdős-Rényi and Scale-Free networks
Extensive Experimental Validation: Verifies theoretical analysis through numerical experiments on synthetic and real networks
Given a directed graph G = (V, E) and hidden subpopulation H ⊆ V, collect aggregated relational data (ARD) from sample set S ⊆ V to estimate prevalence ρ(I) = |H|/|V|.
Each sampled node v reports:
In-degree Rv (number of in-neighbors)
Number of in-neighbors Cv belonging to the hidden population
k additional nodes Va, each connected to a different complete subgraph node
A special node s connected to all complete subgraph nodes
By designing two different hidden population configurations I₁ = (G, {s}) and I₂ = (G, Va) that produce identical ARD but vastly different prevalence, the Ω(√n) lower bound is proven.
Key Insight: Proves that random variables Yv = Cv/Rv and Xvj (indicator variables) exhibit negative correlation, which is crucial for applying concentration inequalities.
Negative Correlation Definition: For random variables Z₁, Z₂, ..., Zn, if for any subset B ⊆ {1,2,...,n}:
This paper cites 26 related references, primarily including:
Bernard et al. (1991): Foundational work on NSUM methodology
Killworth et al. (1998): Introduction of MoR and RoS estimators
Chen et al. (2016): Related theoretical work on network scale estimation
Srivastava et al. (2024): Recent advances in NSUM trend estimation
Overall Assessment: This is a pioneering paper in NSUM theoretical analysis that fills a 30-year gap in theoretical analysis of this field, providing important theoretical foundation and guidance for practical applications.