Reweighting Module#
The Reweighter class implements sparse optimization for calibrating microdata to population targets.
Overview#
Reweighting assigns weights to synthetic microdata records so that the weighted totals match official population statistics (margins, also called targets), while using as few records as the chosen sparsity objective allows.
Mathematical Formulation#
The reweighting problem is formulated as:
minimize    ||w||_p
subject to  A @ w = b
            w >= 0
where:
w: weight vector (decision variables)
A: constraint matrix (indicator matrix for margins)
b: target vector (population totals)
p: sparsity norm (0, 1, or 2)
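To make the roles of A and b concrete, the sketch below builds them for the small dataset used in Basic Usage. It is a plain-Python illustration, not the library's internal code: `build_constraints` is a hypothetical helper, and the row order (one row per margin category, in the order the targets are listed) is an assumption.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
Targets = Dict[str, Dict[str, float]]


def build_constraints(
    records: List[Record], targets: Targets
) -> Tuple[List[List[float]], List[float], List[Tuple[str, str]]]:
    """Build the indicator matrix A and target vector b (hypothetical helper).

    Each (margin, category) pair in `targets` becomes one row of A;
    A[row][i] is 1.0 when record i falls in that category, else 0.0.
    """
    A, b, labels = [], [], []
    for margin, categories in targets.items():
        for category, total in categories.items():
            A.append([1.0 if rec[margin] == category else 0.0 for rec in records])
            b.append(float(total))
            labels.append((margin, category))
    return A, b, labels


records = [
    {"state": "CA", "age_group": "young"},
    {"state": "CA", "age_group": "old"},
    {"state": "NY", "age_group": "young"},
    {"state": "NY", "age_group": "old"},
    {"state": "TX", "age_group": "young"},
    {"state": "TX", "age_group": "old"},
]
targets = {
    "state": {"CA": 100, "NY": 50, "TX": 50},
    "age_group": {"young": 120, "old": 80},
}

A, b, labels = build_constraints(records, targets)
for label, row, total in zip(labels, A, b):
    print(label, row, total)
```

Each row of A picks out the records that count toward one target, so `A @ w = b` says that the weights in every category add up to that category's population total.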
Key Features#
Multiple sparsity objectives: L0, L1, L2 norms
Geographic hierarchies: State, county, tract level targeting
Multiple backends: scipy (default), cvxpy (optional)
Efficient sparse solutions: L0 uses iterative reweighted L1 (IRL1)
Installation#
The reweighter requires scipy (included in base dependencies). For additional optimization capabilities, install cvxpy:
pip install "microplex[cvxpy]"
Basic Usage#
from microplex import Reweighter
import pandas as pd
# Load synthetic microdata
data = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX", "TX"],
    "age_group": ["young", "old", "young", "old", "young", "old"],
    "income": [50000, 60000, 55000, 65000, 48000, 58000],
    "weight": [1, 1, 1, 1, 1, 1],  # Initial uniform weights
})
# Define population targets
targets = {
    "state": {"CA": 100, "NY": 50, "TX": 50},
    "age_group": {"young": 120, "old": 80},
}
# Fit and transform
reweighter = Reweighter(sparsity="l1")
weighted_data = reweighter.fit_transform(data, targets)
print(weighted_data["weight"])
API Reference#
Reweighter#
class Reweighter:
    def __init__(
        self,
        backend: Literal["scipy", "cvxpy"] = "scipy",
        sparsity: Literal["l0", "l1", "l2"] = "l1",
        tol: float = 1e-4,
        max_iter: int = 1000,
    )
Parameters:
backend: Optimization backend (“scipy” or “cvxpy”)
sparsity: Sparsity objective (“l0”, “l1”, or “l2”)
    l0: Minimize number of non-zero weights (sparsest)
    l1: Minimize sum of weights (sparse)
    l2: Minimize squared weights (dense)
tol: Convergence tolerance
max_iter: Maximum optimization iterations
Methods#
fit#
def fit(
    self,
    data: pd.DataFrame,
    targets: Dict[str, Dict[str, float]],
    weight_col: str = "weight",
) -> "Reweighter"
Fit weights to match population targets.
Parameters:
data: DataFrame with microdata records
targets: Nested dict {margin_var: {category: count}}
weight_col: Name of weight column (optional)
Returns: self
Example:
targets = {
    "state": {"CA": 1000, "NY": 500},
    "age_group": {"18-64": 1000, "65+": 500},
}
reweighter.fit(data, targets)
transform#
def transform(
    self,
    data: pd.DataFrame,
    weight_col: str = "weight",
    drop_zeros: bool = False,
) -> pd.DataFrame
Apply fitted weights to data.
Parameters:
data: DataFrame to reweight
weight_col: Weight column name
drop_zeros: If True, remove zero-weight records
Returns: DataFrame with updated weights
fit_transform#
def fit_transform(
    self,
    data: pd.DataFrame,
    targets: Dict[str, Dict[str, float]],
    weight_col: str = "weight",
    drop_zeros: bool = False,
) -> pd.DataFrame
Fit and transform in one call (convenience method).
get_sparsity_stats#
def get_sparsity_stats(self) -> Dict[str, Union[int, float]]
Get statistics about fitted weights.
Returns: Dictionary with:
n_records: Total records
n_nonzero: Records with positive weight
sparsity: Fraction of zero weights
max_weight: Maximum weight value
total_weight: Sum of all weights
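The relationship between these fields can be sketched in a few lines of plain Python. This is an illustration of the definitions above, not the library's implementation: `sparsity_stats` is a hypothetical stand-in, and the `zero_tol` threshold for treating a weight as zero is an assumption.

```python
from typing import Dict, Sequence, Union


def sparsity_stats(
    weights: Sequence[float], zero_tol: float = 0.0
) -> Dict[str, Union[int, float]]:
    """Compute the documented summary fields from a list of weights.

    zero_tol is an assumed cut-off: weights at or below it count as zero.
    """
    n_records = len(weights)
    n_nonzero = sum(1 for w in weights if w > zero_tol)
    return {
        "n_records": n_records,
        "n_nonzero": n_nonzero,
        # Fraction of records that ended up with zero weight
        "sparsity": (n_records - n_nonzero) / n_records if n_records else 0.0,
        "max_weight": max(weights, default=0.0),
        "total_weight": float(sum(weights)),
    }


stats = sparsity_stats([100.0, 0.0, 20.0, 30.0, 0.0, 50.0])
print(stats)
```

Here two of six records have zero weight, so `sparsity` is one third and `n_nonzero` is 4.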
Sparsity Objectives#
L0 (Minimize Count)#
Objective: min ||w||_0 = min (number of non-zero weights)
Use case: Extreme sparsity - use fewest possible records
Algorithm: Iterative Reweighted L1 (IRL1)
reweighter = Reweighter(sparsity="l0")
Properties:
Produces sparsest solutions
Non-convex (uses approximation)
Reduces downstream computation, since fewer records carry weight (the fit itself may be slower than L1)
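For reference, the reweighted ℓ1 scheme of Candès et al. (2008), written for the constraints used in this module, has the form below. This is the textbook formulation; the exact update rule, the damping constant ε, and the stopping criterion used by Reweighter are not documented here and should be treated as assumptions.

```latex
\begin{aligned}
w^{(k+1)} &= \arg\min_{w \ge 0} \; \sum_i c_i^{(k)} \, w_i
  \quad \text{subject to } A w = b, \\
c_i^{(k+1)} &= \frac{1}{w_i^{(k+1)} + \varepsilon},
  \qquad c_i^{(0)} = 1 .
\end{aligned}
```

Records with small weights receive a large cost in the next round and are pushed toward zero, which is how a sequence of linear programs approximates the non-convex L0 objective.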
L1 (Minimize Sum)#
Objective: min ||w||_1 = min sum(w_i)
Use case: Sparse solutions with computational guarantees
Algorithm: Linear programming (scipy.optimize.linprog)
reweighter = Reweighter(sparsity="l1")
Properties:
Convex optimization (globally optimal)
Naturally sparse solutions
Fast and reliable
L2 (Minimize Squares)#
Objective: min ||w||_2^2 = min sum(w_i^2)
Use case: Smooth weight distributions
Algorithm: Quadratic programming (scipy.optimize.minimize)
reweighter = Reweighter(sparsity="l2")
Properties:
Convex optimization
Dense solutions (most records used)
Penalizes large weights
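To see how the three objectives rank candidate solutions, the sketch below scores two hand-picked weight vectors that both satisfy the Basic Usage targets. The vectors are illustrative and were not produced by Reweighter; the helper names are hypothetical.

```python
import math

# Constraint rows for the Basic Usage example, in the order
# CA, NY, TX, young, old (columns are the six records).
A = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
b = [100, 50, 50, 120, 80]


def is_feasible(w, tol=1e-9):
    """True when w >= 0 and every row of A @ w matches b within tol."""
    if any(x < 0 for x in w):
        return False
    return all(
        math.isclose(sum(a * x for a, x in zip(row, w)), target, abs_tol=tol)
        for row, target in zip(A, b)
    )


def norms(w):
    """Values of the three objectives for a weight vector."""
    return {
        "l0": sum(1 for x in w if x != 0),
        "l1": sum(abs(x) for x in w),
        "l2_squared": sum(x * x for x in w),
    }


spread = [60, 40, 30, 20, 30, 20]  # every record keeps some weight
sparse = [100, 0, 20, 30, 0, 50]   # two records dropped

for name, w in [("spread", spread), ("sparse", sparse)]:
    print(name, is_feasible(w), norms(w))
```

Both vectors are feasible and both sum to 200, because every record belongs to exactly one state and the state targets fix the total. They differ in the other two objectives: the sparse vector uses 4 records instead of 6, while the spread vector has the smaller sum of squares (7800 versus 13800).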
Advanced Examples#
Geographic Hierarchy#
# State and county level targets
targets = {
    "state": {
        "CA": 39_500_000,
        "NY": 19_500_000,
    },
    "county": {
        "Los Angeles": 10_000_000,
        "Orange": 3_100_000,
        "San Diego": 3_300_000,
        "New York": 1_600_000,
        "Kings": 2_600_000,
    },
}
reweighter = Reweighter(sparsity="l1")
weighted = reweighter.fit_transform(data, targets)
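Because counties nest inside states, the two margins constrain each other: the county targets for a state cannot add up to more than the state target, and they must add up to exactly the state target if every record in that state falls in a targeted county. A minimal sanity check along those lines is sketched below. The `county_to_state` lookup and the helper function are hypothetical and not part of the documented API.

```python
# Hypothetical lookup from county to state (not part of the documented API).
county_to_state = {
    "Los Angeles": "CA",
    "Orange": "CA",
    "San Diego": "CA",
    "New York": "NY",
    "Kings": "NY",
}

targets = {
    "state": {"CA": 39_500_000, "NY": 19_500_000},
    "county": {
        "Los Angeles": 10_000_000,
        "Orange": 3_100_000,
        "San Diego": 3_300_000,
        "New York": 1_600_000,
        "Kings": 2_600_000,
    },
}


def county_totals_by_state(targets, county_to_state):
    """Sum the county targets within each state."""
    totals = {state: 0 for state in targets["state"]}
    for county, count in targets["county"].items():
        totals[county_to_state[county]] += count
    return totals


totals = county_totals_by_state(targets, county_to_state)
for state, state_target in targets["state"].items():
    covered = totals[state]
    print(f"{state}: counties cover {covered:,} of {state_target:,}")
    # County targets larger than the state target can never be satisfied.
    assert covered <= state_target, f"{state}: county targets exceed state target"
```

In this example the listed counties cover only part of each state, so the data would also need records from the remaining counties for both margins to be met at once.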
Multiple Margin Variables#
# Match multiple demographic margins
targets = {
    "state": {"CA": 1000, "NY": 500, "TX": 500},
    "age_group": {"0-17": 400, "18-64": 1200, "65+": 400},
    "sex": {"M": 1000, "F": 1000},
}
reweighter = Reweighter(sparsity="l0")
weighted = reweighter.fit_transform(data, targets)
# Check how many records used
stats = reweighter.get_sparsity_stats()
print(f"Used {stats['n_nonzero']} records")
Comparing Sparsity Methods#
import pandas as pd
sparsities = []
# Fit once per objective and collect the resulting weight statistics
for method in ["l0", "l1", "l2"]:
    rw = Reweighter(sparsity=method)
    result = rw.fit_transform(data, targets)
    stats = rw.get_sparsity_stats()
    sparsities.append({
        "method": method.upper(),
        "n_nonzero": stats["n_nonzero"],
        "max_weight": stats["max_weight"],
    })
df = pd.DataFrame(sparsities)
print(df)
Drop Zero-Weight Records#
# Remove records with zero weight to reduce dataset size
weighted = reweighter.fit_transform(data, targets, drop_zeros=True)
print(f"Original records: {len(data)}")
print(f"Retained records: {len(weighted)}")
Integration with Synthesizer#
Combine synthesis and reweighting for end-to-end microdata creation:
from microplex import Synthesizer, Reweighter
# Step 1: Synthesize microdata
synth = Synthesizer(
    target_vars=["income"],
    condition_vars=["age", "education", "state"],
)
synth.fit(training_data, epochs=100)
synthetic = synth.generate(demographics, n=10000)
# Step 2: Reweight to population targets
targets = {
    "state": {"CA": 4000, "NY": 3000, "TX": 3000},
}
reweighter = Reweighter(sparsity="l0")
calibrated = reweighter.fit_transform(synthetic, targets)
# Step 3: Analyze
stats = reweighter.get_sparsity_stats()
print(f"Final dataset: {stats['n_nonzero']} weighted records")
Performance Considerations#
Computational Complexity#
L1/L2: Polynomial time (efficient for large problems)
L0: Iterative approximation (may be slower)
Problem Size#
Small (<10k records, <10 margins): All methods work well
Medium (10k-100k records, 10-50 margins): L1 recommended
Large (>100k records, >50 margins): L1 with sparse backends
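The preference for sparse backends at scale follows from the shape of the constraint matrix A: it has one column per record and one row per target category, but each record matches only one category per margin variable, so almost all entries are zero. A rough estimate is sketched below, using illustrative sizes that are assumptions rather than measurements; `constraint_matrix_size` is a hypothetical helper.

```python
def constraint_matrix_size(n_records, n_target_rows, n_margin_vars, bytes_per_value=8):
    """Estimate storage for the indicator matrix A.

    Assumes each record matches exactly one category of every margin
    variable, so each column of A holds n_margin_vars ones.
    """
    total_entries = n_records * n_target_rows
    nonzeros = n_records * n_margin_vars
    return {
        "dense_bytes": total_entries * bytes_per_value,
        "nonzeros": nonzeros,
        "density": nonzeros / total_entries,
    }


# Illustrative sizes only: 500k records, 200 target rows, 5 margin variables.
size = constraint_matrix_size(n_records=500_000, n_target_rows=200, n_margin_vars=5)
print(f"dense float64 storage: {size['dense_bytes'] / 1e6:.0f} MB")
print(f"non-zero entries: {size['nonzeros']:,} ({size['density']:.1%} of the matrix)")
```

Under these assumptions a dense matrix would need about 800 MB while only 2.5% of its entries are non-zero, which is the case sparse matrix formats are designed for.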
Tips for Large Datasets#
Use L1 for speed and reliability
Consider cvxpy backend for complex constraints
Pre-filter data to relevant categories
Use hierarchical reweighting (state → county → tract)
Optimization Backends#
scipy (default)#
Pros:
No additional dependencies
Fast for L1/L2
Stable and well-tested
Cons:
L0 uses approximation
Limited to standard problem forms
cvxpy (optional)#
Pros:
More flexible problem formulations
Multiple solver options (ECOS, SCS, etc.)
Better handling of complex constraints
Cons:
Requires additional installation
Can be slower for simple problems
Installation:
pip install cvxpy
Usage:
reweighter = Reweighter(backend="cvxpy", sparsity="l1")
Error Handling#
Common Errors#
ValueError: Data contains categories not in targets
# Solution: Ensure all data categories have targets
targets = {
    "state": {"CA": 100, "NY": 50, "TX": 50, "FL": 25}
}
ValueError: Reweighter not fitted
# Solution: Call fit() before transform()
reweighter.fit(data, targets)
result = reweighter.transform(data)
ValueError: Data length doesn’t match fitted length
# Solution: Use same data for fit and transform
reweighter.fit(data, targets)
result = reweighter.transform(data) # Same data
Validation#
Check that weights match targets:
weighted = reweighter.fit_transform(data, targets)
# Verify state targets
for state, target in targets["state"].items():
    actual = weighted[weighted["state"] == state]["weight"].sum()
    error = abs(actual - target) / target * 100
    print(f"{state}: {actual:.0f} (target: {target}, error: {error:.2f}%)")
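The same check can be extended to every margin and reduced to a single number. The sketch below is a plain-Python version that works on a list of records and a list of weights rather than a DataFrame; `max_relative_error` is a hypothetical helper, and it assumes every target is non-zero.

```python
def max_relative_error(records, weights, targets):
    """Largest relative gap between a weighted total and its target.

    Assumes every target is non-zero (the error is divided by the target).
    """
    worst = 0.0
    for margin, categories in targets.items():
        for category, target in categories.items():
            actual = sum(
                w for rec, w in zip(records, weights) if rec[margin] == category
            )
            worst = max(worst, abs(actual - target) / target)
    return worst


records = [
    {"state": "CA", "age_group": "young"},
    {"state": "CA", "age_group": "old"},
    {"state": "NY", "age_group": "young"},
    {"state": "NY", "age_group": "old"},
    {"state": "TX", "age_group": "young"},
    {"state": "TX", "age_group": "old"},
]
targets = {
    "state": {"CA": 100, "NY": 50, "TX": 50},
    "age_group": {"young": 120, "old": 80},
}

exact = [100, 0, 20, 30, 0, 50]  # hand-picked weights that hit every target
off = [100, 0, 20, 30, 0, 45]    # last weight is 5 short

print(max_relative_error(records, exact, targets))
print(max_relative_error(records, off, targets))
```

The first vector matches every target, so its error is zero. In the second, TX totals 45 against a target of 50, giving a worst-case relative error of 10%.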
References#
Iterative Reweighted L1: Candès et al. (2008) “Enhancing Sparsity by Reweighted ℓ1 Minimization”
Survey Calibration: Deville & Särndal (1992) “Calibration Estimators in Survey Sampling”
Sparse Optimization: Boyd & Vandenberghe (2004) “Convex Optimization”