Sparse Calibration for Microdata Synthesis#
Abstract: Survey calibration adjusts sample weights to match known population totals. When synthesizing microdata from multiple sources, we face a different problem: selecting a minimal subset of synthetic records that, when weighted, reproduce thousands of hierarchical targets. We review classical calibration methods (IPF, chi-square, entropy), their theoretical foundations and limitations, discuss sparse optimization approaches from compressed sensing and machine learning, examine PolicyEngine’s Hard Concrete L0 implementation as a case study, and present a cross-category L0 selection approach that achieves exact sparsity while preserving joint distributions.
1. Introduction#
Microdata synthesis requires calibrating synthetic populations to match official statistics at multiple geographic levels. A typical application might require matching:
~50 federal targets (total income, employment, benefits by program)
~2,500 state-level targets (50 states × 50 variables)
~150,000 county-level targets (3,000 counties × 50 variables)
Millions of tract-level targets
This scale fundamentally changes the calibration problem. Classical survey calibration assumes:
A single sample with design weights to preserve
A modest number of non-conflicting targets
Exact target satisfaction as the goal
None of these assumptions hold for synthetic data calibration at scale.
2. Classical Calibration Methods#
2.1 Iterative Proportional Fitting (IPF)#
2.1.1 Historical Development#
IPF, also known as raking or biproportional fitting, has been independently discovered multiple times across different fields:
Kruithof (1937): Developed the “double factor method” for telephone traffic analysis
Deming & Stephan (1940): Introduced IPF for the 1940 U.S. Census to reconcile sample cross-tabulations with known marginal totals
Sinkhorn (1964): Proved convergence for positive matrices in the context of doubly stochastic matrices
Bishop, Fienberg & Holland (1975): Provided comprehensive treatment in Discrete Multivariate Analysis, establishing IPF as the standard for log-linear models
The algorithm iteratively adjusts weights to match marginal totals:
for each margin m:
adjustment = target[m] / current_weighted_sum[m]
weights[in_margin_m] *= adjustment
2.1.2 Convergence Theory#
Key theoretical contributions establish convergence conditions:
Sinkhorn & Knopp (1967): Proved convergence for nonnegative matrices with “total support” (containing at least one positive diagonal)
Csiszar (1975): Established the information-theoretic foundation, showing IPF minimizes Kullback-Leibler divergence. Provided necessary and sufficient conditions for convergence with zero entries
Fienberg (1970): Gave geometric proof using differential geometry, interpreting contingency tables as manifolds
Ruschendorf (1995): Extended convergence theory to continuous bivariate densities
2.1.3 Properties and Limitations#
Properties:
Simple and fast
Converges for consistent targets
No explicit objective function
Stops when targets are hit (or fails)
Limitations:
No tradeoff mechanism: IPF provides no way to balance conflicting targets. With thousands of hierarchical targets, some will inevitably conflict
Categorical variables only: Fundamentally operates on discrete contingency tables
Empty cell problem: Zero cells can cause convergence failure
Fractional weights: Produces non-integer weights requiring “integerization” for agent-based microsimulation
Slow convergence: Linear convergence in worst case (Fienberg 1970)
2.2 Chi-Square Distance Minimization#
Chi-square calibration minimizes deviation from initial weights:
where \(d_i\) are design weights and \(b\) are targets.
Properties:
Preserves design weight structure
Quadratic programming (efficient)
Well-defined objective for tradeoffs
Limitation: Assumes meaningful initial weights. When synthesizing from multiple sources (CPS + ACS + IRS + SSA), there is no single “design weight” to preserve.
2.3 Entropy Balancing#
Entropy calibration minimizes Kullback-Leibler divergence:
Theoretical Foundation:
Ireland & Kullback (1968): Formulated IPF as minimum discrimination information estimation
Hainmueller (2012): Developed entropy balancing for causal inference, achieving exact covariate balance by reweighting
Properties:
Maximum entropy principle
Guarantees non-negative weights (unlike GREG)
Smooth, exponential tilting
Doubly robust when combined with outcome regression (Zhao & Percival 2017)
Limitation: Same as chi-square - requires meaningful initial weights. May fail to converge in sparse scenarios or when positivity is violated.
2.4 Generalized Regression Estimation (GREG)#
The GREG estimator provides a unified framework for calibration:
Deville & Särndal (1992): Showed that raking is a special case of minimizing distance between adjusted and original weights subject to calibration equations
Särndal, Swensson & Wretman (1992): Established the model-assisted paradigm where estimators are asymptotically unbiased regardless of model specification
GREG extends calibration to continuous auxiliary variables but shares limitations regarding initial weights.
3. Sparse Optimization Background#
3.1 The L0 Minimization Problem#
Finding the sparsest solution to an underdetermined system is NP-hard:
where \(\|w\|_0\) counts non-zero entries. Natarajan (1995) proved this computational intractability, motivating convex relaxations.
3.2 Compressed Sensing Foundations#
Two seminal papers established that sparse solutions can be recovered efficiently:
Candes, Romberg & Tao (2006): Proved that sparse signals can be exactly recovered from incomplete measurements via L1 minimization with high probability, provided the measurement matrix satisfies the Restricted Isometry Property (RIP)
Donoho (2006): Demonstrated that signals with sparse representations can be recovered from far fewer samples than Nyquist-Shannon requires
These theoretical guarantees justify using L1 relaxation for sparse calibration:
3.3 LASSO and Extensions#
Tibshirani (1996): Introduced LASSO, which naturally produces exact zeros via L1 penalty
Zou & Hastie (2005): Developed Elastic Net combining L1 and L2 penalties for grouping correlated variables
Candes, Wakin & Boyd (2008): Proposed iteratively reweighted L1 (IRL1) to better approximate L0
3.4 Optimization Algorithms#
Modern algorithms efficiently solve L1-penalized problems:
ADMM (Boyd et al. 2011): Decomposes problems with L1 penalties plus constraints
ISTA/FISTA (Beck & Teboulle 2009): Proximal gradient methods with soft-thresholding
Best Subset Selection (Bertsimas et al. 2016): Mixed-integer optimization can solve exact L0 problems for moderate sizes
4. Statistical Matching and Data Fusion#
4.1 The Identification Problem#
When combining data sources that don’t jointly observe all variables, the full joint distribution cannot be uniquely identified. D’Orazio, Di Zio & Scanu (2006) provide the foundational treatment in Statistical Matching: Theory and Practice.
4.2 Conditional Independence Assumption#
Statistical matching typically assumes:
This simplifies matching but may bias correlations toward zero. Singh et al. (1993) showed auxiliary information can substitute for this assumption.
4.3 Fréchet Bounds#
Without additional assumptions, Fréchet-Hoeffding bounds provide sharp limits on joint distributions given marginals:
4.4 Uncertainty Quantification#
Moriarity & Scheuren (2001, 2003): Established framework for assessing uncertainty in statistical matching
Rässler (2002): Proposed multiple imputation with informative priors to overcome conditional independence
Edwards & Tanton (2016): Addressed uncertainty estimation in spatial microsimulation
5. Small Area Estimation#
5.1 Fay-Herriot Model#
The Fay-Herriot (1979) area-level model is fundamental for borrowing strength across sparse areas:
where the shrinkage factor \(\gamma_i\) depends on relative variance of direct vs. synthetic estimators.
5.2 Hierarchical Bayesian Methods#
Molina, Nandram & Rao (2014): Hierarchical Bayes for complex nonlinear parameters like poverty indices
Provides both point estimates and proper uncertainty quantification, unlike frequentist EBLUP
5.3 Modern Synthetic Population Generation#
Recent large-scale implementations:
Nature Scientific Data (2025): U.S. national dataset generating 120M+ household synthetic population from ACS
Nature Scientific Data (2022): UK SIPHER dataset combining Census with Understanding Society survey
World Bank REaLTabFormer: Transformer architecture for hierarchical synthetic population generation
6. The Multi-Source Synthesis Problem#
When building synthetic populations from multiple data sources, we face fundamentally different constraints:
6.1 No Meaningful Initial Weights#
Consider synthesizing a population using:
Demographics from Census
Income from CPS
Tax variables from IRS SOI
Benefits from SSA administrative data
Housing from ACS
Each source has its own sampling design and weights. There is no single “initial weight” to preserve - we are creating a new population, not reweighting an existing sample.
6.2 Thousands of Potentially Conflicting Targets#
With hierarchical geography (nation → state → county → tract), targets at different levels may conflict:
# Published statistics have measurement error
sum(county_populations) ≠ state_population # off by ~0.1-1%
sum(tract_incomes) ≠ county_income # off by ~1-5%
IPF cannot handle this - it assumes exact target satisfaction. We need a method that finds the best tradeoff across all targets.
6.3 Sparsity as a First-Class Goal#
For computational efficiency, we want the smallest subset of synthetic records that adequately represents the population. This is an L0 (cardinality) constraint:
7. PolicyEngine’s Hard Concrete L0 Approach#
PolicyEngine has implemented a production-grade L0 regularization system across multiple repositories. This section documents their methodology as a case study.
7.1 Hard Concrete Distribution#
Based on Louizos, Welling & Kingma (2017), the Hard Concrete distribution provides a differentiable approximation of discrete L0 regularization.
For each weight, a gate \(z_i \in [0,1]\) is learned via:
Sampling (Training): $\(s = \sigma\left(\frac{\log u - \log(1-u) + \alpha}{\tau}\right)\)\( \)\(\bar{s} = s(\zeta - \gamma) + \gamma\)\( \)\(z = \min(1, \max(0, \bar{s}))\)$
where \(u \sim U(\epsilon, 1-\epsilon)\), \(\alpha\) are learnable logits, \(\tau\) is temperature, and \(\gamma=-0.1, \zeta=1.1\) are stretch parameters encouraging exact zeros.
L0 Penalty: $\(\mathbb{E}[\|z\|_0] = \sum_i \sigma\left(\alpha_i - \tau \log(-\gamma/\zeta)\right)\)$
7.2 Implementation Architecture#
PolicyEngine’s implementation spans several repositories:
L0 Package (l0-python):
distributions.py: HardConcrete distributiongates.py: SampleGate, FeatureGate, HybridGate for selectioncalibration.py: SparseCalibrationWeights combining gates with positive weights
MicroCalibrate (microcalibrate):
Two-phase optimization: dense reweighting then L0 regularization
Adam optimizer with log-space weight parameterization
Relative squared error loss with normalization
PolicyEngine-US-Data (policyengine-us-data):
2,813+ calibration targets from IRS, Census, CBO, healthcare sources
Group-wise loss averaging for balanced contribution across target types
Geographic stratification for congressional district analysis
7.3 Loss Function#
The core loss function uses relative squared error:
where:
\(w = \exp(\log w)\) ensures positivity
\(g \sim \text{HardConcrete}\) are learned gates
The \(+1\) offset prevents division by zero
Normalization makes targets of different magnitudes comparable
7.4 Two-Phase Optimization#
Phase 1 (Dense):
Optimize weights using Adam with learning rate ~0.001
Optional dropout regularization
~2,000 epochs
Phase 2 (Sparse):
Add Hard Concrete gates with L0 penalty
Higher learning rate (~0.2) for gate parameters
~4,000 epochs (doubled due to increased difficulty)
7.5 Design Rationale#
PolicyEngine chose gradient descent over IPF for several reasons:
Automatic tradeoffs: When targets conflict, the loss function finds the best compromise
No tolerance setting: One objective, no per-target tolerances to tune
End-to-end sparsity: L0 gates are jointly optimized with weights
Scalability: Handles 2,800+ targets efficiently with sparse matrices
Key hyperparameters (from production use):
l0_lambda = 2.6e-7: Regularization strengthtemperature = 0.25: Low for hard gate decisionsinit_mean = 0.999: Start with most weights active
8. Gradient Descent Calibration#
8.1 Unified Loss Formulation#
One approach is to use gradient descent with a unified loss function:
where:
\(M\) is the number of targets (potentially thousands)
\(x_{ij}\) is the value of target variable \(j\) for record \(i\)
\(b_j\) is target \(j\)
The normalization by \(b_j + \epsilon\) makes targets of different magnitudes comparable
8.2 Advantages#
Automatic tradeoffs: When targets conflict, gradient descent finds the best compromise
No tolerance setting: One loss function, no per-target tolerances to tune
Differentiable sparsity: Can use L0 relaxations (Hard Concrete distribution) for end-to-end optimization
8.3 Disadvantages#
Slow: Requires 1000+ epochs for convergence
Approximate zeros: Differentiable L0 relaxations produce near-zero, not exact zero weights
Hyperparameter sensitivity: Learning rate, temperature, regularization strength all matter
9. Cross-Category L0 Selection#
We propose an alternative that achieves exact sparsity while preserving calibration accuracy.
9.1 Key Insight: Cross-Category Structure#
For categorical constraints (state, age group, income bracket), records belong to discrete cross-categories:
Record 1: (state=CA, age=25-34, income=50k-75k)
Record 2: (state=CA, age=25-34, income=75k-100k)
Record 3: (state=NY, age=35-44, income=50k-75k)
...
When selecting a subset for sparsity, we must preserve the joint distribution across all categorical dimensions, not just the marginals.
9.2 Algorithm#
def sparse_calibrate(data, targets, target_sparsity):
# Step 1: Identify cross-categories
# Each record belongs to exactly one (state × age × income × ...) cell
cross_cats = group_by_all_categorical_constraints(data)
# Step 2: Select proportionally from each cross-category
k = int(len(data) * (1 - target_sparsity))
selected = []
for cell in cross_cats:
n_keep = max(1, int(len(cell) * k / len(data)))
selected.extend(random_sample(cell, n_keep))
# Step 3: Calibrate selected records via IPF
weights = ipf_calibrate(data[selected], targets)
return weights
9.3 Properties#
Exact sparsity: Achieves exactly the target sparsity level
Preserved joint distribution: Cross-category selection ensures the selected subset has the correct joint distribution for all categorical constraints
Fast: O(n) selection + O(iterations × constraints) IPF
Exact zeros: Non-selected records have exactly zero weight
9.4 Limitations#
Categorical constraints only: The cross-category approach requires discrete categories
Continuous targets are secondary: After selection, continuous targets may have higher error
IPF limitations apply: Still assumes non-conflicting targets within the selected subset
10. Imputation Methods for Synthesis#
Beyond calibration, microdata synthesis often requires imputation for variables not jointly observed.
10.1 Random Forest Imputation#
RF-based methods show strong performance for tabular data:
Handles mixed types, interactions, nonlinearity
~30% improvement over conventional methods (benchmarks)
PolicyEngine uses Quantile Regression Forests to preserve conditional distributions
10.2 Copula-Based Methods#
Copulas model marginal distributions and dependence structure separately:
Gaussian copula (arXiv 2019): Handles continuous, ordinal, categorical jointly
Relaxes normality assumptions for multivariate mixed data
10.3 Hot Deck Imputation#
Despite ML advances, nearest neighbor hot deck remains competitive:
Preserves observed value distributions
No parametric assumptions
StatMatch R package implements k-NN and probabilistic methods
11. When to Use Each Approach#
Scenario |
Recommended Method |
|---|---|
Simple calibration (5-10 margins) |
IPF or Chi-square |
Preserve design weights |
Chi-square or Entropy |
Thousands of hierarchical targets |
Gradient Descent with L0 |
Need exact sparsity + categorical accuracy |
Cross-Category L0 |
Conflicting targets that need tradeoffs |
Gradient Descent |
Fast iteration, simple constraints |
Cross-Category L0 |
Borrowing strength across sparse areas |
Fay-Herriot / Hierarchical Bayes |
Multi-source data fusion |
Statistical matching + calibration |
12. Open Questions#
Hybrid approaches: Can we combine cross-category selection with gradient descent for final calibration?
Continuous target handling: How should we modify cross-category selection to better handle continuous targets (total income, total benefits)?
Hierarchical consistency: When targets are hierarchical (state = sum of counties), how do we propagate consistency constraints?
Uncertainty quantification: How do we quantify uncertainty in calibrated estimates when sparsity discards information? Multiple imputation and Fréchet bounds provide partial answers.
Adaptive sparsity: Can sparsity vary by geography (more records for larger counties)?
Temperature scheduling: What annealing schedule optimizes the tradeoff between exploration and exploitation in Hard Concrete gates?
Best subset feasibility: For what problem sizes can exact L0 optimization (Bertsimas et al. 2016) replace approximate methods?
13. Conclusion#
Large-scale microdata synthesis requires rethinking classical calibration. IPF’s “hit target or fail” approach cannot handle thousands of potentially conflicting targets. Chi-square and entropy methods assume meaningful initial weights that don’t exist for multi-source synthesis.
Gradient descent with unified loss functions provides automatic tradeoffs but is slow and produces approximate sparsity. PolicyEngine’s Hard Concrete L0 implementation demonstrates that production-scale calibration with 2,800+ targets is feasible, though requiring careful hyperparameter tuning and two-phase optimization.
Cross-category L0 selection achieves exact sparsity with categorical accuracy but struggles with continuous targets. The optimal approach likely combines elements of multiple methods: cross-category selection for efficient L0 sparsity, followed by gradient-based refinement for continuous targets and conflicting constraints, with hierarchical Bayesian methods for uncertainty quantification.
References#
Classical Calibration#
Deming, W.E. & Stephan, F.F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Annals of Mathematical Statistics, 11, 427-444.
Deville, J.C. & Särndal, C.E. (1992). Calibration Estimators in Survey Sampling. Journal of the American Statistical Association, 87(418), 376-382.
Hainmueller, J. (2012). Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies. Political Analysis, 20(1), 25-46.
Ireland, C.T. & Kullback, S. (1968). Contingency Tables with Given Marginals. Biometrika, 55(1), 179-188.
Särndal, C.E., Swensson, B. & Wretman, J. (1992). Model Assisted Survey Sampling. Springer.
Convergence Theory#
Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35, 876-879.
Sinkhorn, R. & Knopp, P. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2), 343-348.
Csiszar, I. (1975). I-Divergence Geometry of Probability Distributions and Minimization Problems. Annals of Probability, 3(1), 146-158.
Fienberg, S.E. (1970). An Iterative Procedure for Estimation in Contingency Tables. Annals of Mathematical Statistics, 41(3), 907-917.
Bishop, Y.M.M., Fienberg, S.E. & Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press.
Sparse Optimization#
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B, 58, 267-288.
Candes, E.J., Romberg, J. & Tao, T. (2006). Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information. IEEE Transactions on Information Theory, 52(2), 489-509.
Donoho, D.L. (2006). Compressed Sensing. IEEE Transactions on Information Theory, 52(4), 1289-1306.
Candes, E.J., Wakin, M.B. & Boyd, S.P. (2008). Enhancing Sparsity by Reweighted ℓ1 Minimization. Journal of Fourier Analysis and Applications, 14(5), 877-905.
Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. (2011). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1), 1-122.
Bertsimas, D., King, A. & Mazumder, R. (2016). Best Subset Selection via a Modern Optimization Lens. Annals of Statistics, 44(2), 813-852.
L0 Regularization#
Louizos, C., Welling, M. & Kingma, D.P. (2017). Learning Sparse Neural Networks through L0 Regularization. arXiv:1712.01312.
Statistical Matching#
D’Orazio, M., Di Zio, M. & Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley.
Rässler, S. (2002). Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches. Springer.
Moriarity, C. & Scheuren, F. (2001). Statistical Matching: A Paradigm for Assessing the Uncertainty in the Procedure. Journal of Official Statistics, 17, 407-422.
Small Area Estimation#
Fay, R.E. & Herriot, R.A. (1979). Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data. Journal of the American Statistical Association, 74, 269-277.
Molina, I., Nandram, B. & Rao, J.N.K. (2014). Small area estimation of general parameters with application to poverty indicators: A hierarchical Bayes approach. Annals of Applied Statistics, 8(2), 852-885.
Edwards, K.L. & Tanton, R. (2016). Estimating Uncertainty in Spatial Microsimulation Approaches to Small Area Estimation. Computers, Environment and Urban Systems.
Imputation#
Andridge, R.R. & Little, R.J.A. (2010). A Review of Hot Deck Imputation for Survey Non-response. International Statistical Review, 78(1), 40-64.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.