Hierarchical Microdata Synthesis

Hierarchical Microdata Synthesis#

The Problem#

Real microdata has hierarchical structure:

Household
├── Tax Unit 1
│   ├── Person 1 (Head)
│   ├── Person 2 (Spouse)
│   └── Person 3 (Dependent)
└── Tax Unit 2
    └── Person 4 (Adult child)

Variables exist at each level:

Household: Total income, dwelling type, region
Tax Unit: Tax liability, EITC eligibility, filing status
Person: Age, earnings, education, employment

Relationships matter:

Spouse ages are correlated
Children’s ages depend on parent ages
All members must share the same household income
Tax unit aggregates must sum to household totals

The question: Should microplex flatten this to single records or directly model the hierarchy?

Approach 1: Flattening#

The Idea#

Represent each person as a row with household/tax unit variables copied:

person_id | age | household_income | tax_unit_income | region
----------|-----|------------------|-----------------|--------
1         | 42  | 120000          | 100000          | CA
2         | 40  | 120000          | 100000          | CA
3         | 8   | 120000          | 100000          | CA
4         | 22  | 120000          | 20000           | CA

Implementation#

Current microplex already supports this:

from microplex import Synthesizer

# Flatten CPS to person-level
flattened = []
for household in cps_households:
    for person in household.members:
        flattened.append({
            # Person-level
            "age": person.age,
            "earnings": person.earnings,
            "education": person.education,

            # Copied from household
            "household_income": household.income,
            "household_size": len(household.members),
            "region": household.region,

            # Copied from tax unit
            "tax_unit_income": person.tax_unit.income,
            "filing_status": person.tax_unit.filing_status,
        })

# Synthesize
synth = Synthesizer(
    target_vars=["earnings", "household_income", "tax_unit_income"],
    condition_vars=["age", "education", "region"],
)
synth.fit(pd.DataFrame(flattened))

Pros#

Simple: Works with existing microplex API
No structural constraints: Just learn conditional distributions
Proven: This is how most survey microdata is released (PUMS, CPS)
Fast: Single model, no hierarchical sampling

Cons#

Loses within-household correlations: Can’t ensure spouse ages are realistic
No consistency enforcement: Can generate different household_income for same household
Inefficient: Copies household-level variables N times (storage + memory)
Can’t enforce aggregation: Tax unit incomes might not sum to household income

Approach 2: Hierarchical Synthesis#

2a. Top-Down Sequential Sampling#

Generate in order: household → tax units → persons

# Pseudocode
class HierarchicalSynthesizer:
    def __init__(self):
        # Three separate models
        self.household_model = Synthesizer(
            target_vars=["household_income", "n_tax_units"],
            condition_vars=["region", "year"],
        )
        self.tax_unit_model = Synthesizer(
            target_vars=["tax_unit_income", "filing_status", "n_persons"],
            condition_vars=["household_income", "region"],
        )
        self.person_model = Synthesizer(
            target_vars=["age", "earnings", "education"],
            condition_vars=["tax_unit_income", "filing_status"],
        )

    def generate(self, n_households):
        # 1. Sample households
        households = self.household_model.generate(n_households)

        # 2. For each household, sample tax units
        tax_units = []
        for hh in households:
            n_tu = int(hh["n_tax_units"])
            tu = self.tax_unit_model.generate(
                conditions=pd.DataFrame([hh] * n_tu)
            )
            tax_units.append(tu)

        # 3. For each tax unit, sample persons
        persons = []
        for tu in tax_units:
            n_persons = int(tu["n_persons"])
            p = self.person_model.generate(
                conditions=pd.DataFrame([tu] * n_persons)
            )
            persons.append(p)

        return households, tax_units, persons

Pros:

Natural modeling of causality (household determines tax units, etc.)
Can enforce aggregation constraints at each level
Memory efficient (no duplication)

Cons:

Complex: Three models to train
Error propagation: Mistakes at household level affect all downstream
Training data requirements: Need nested structure in training data

2b. Copula-Based Hierarchical Synthesis#

Use copulas to model cross-level dependencies.

From Copula-Based Transferable Models:

Separate marginal distributions from dependency structure
Model household structure using vine copulas
Can transfer learned dependencies to new populations

# Conceptual (would need implementation)
class CopulaHierarchicalSynthesizer:
    def __init__(self):
        # Learn marginals at each level
        self.household_marginals = fit_marginals(household_vars)
        self.person_marginals = fit_marginals(person_vars)

        # Learn dependency structure
        self.within_household_copula = VineCopula()
        self.cross_level_copula = VineCopula()

    def generate(self, n):
        # Sample from copula, then transform to marginals
        u = self.copula.sample(n)
        x = self.marginals.inverse_cdf(u)
        return x

Pros:

Preserves all correlations (within and across levels)
Mathematically rigorous
Transferable to new populations

Cons:

High-dimensional copulas are complex
Hard to enforce hard constraints (e.g., child age < parent age)
Training requires significant sample size

2c. Graph Neural Networks for Household Structure#

From University of Oxford student project:

Represent households as graphs
Nodes = persons
Edges = relationships (spouse, parent-child)
Use GNN to predict realistic household compositions

# Conceptual
class GNNHouseholdSynthesizer:
    def __init__(self):
        self.person_generator = Synthesizer(target_vars=["age", "earnings"])
        self.structure_predictor = GraphNeuralNetwork()

    def generate(self, n_households):
        # 1. Generate persons independently
        persons = self.person_generator.generate(n_total_persons)

        # 2. Use GNN to cluster into realistic households
        household_graphs = self.structure_predictor.predict(persons)

        return household_graphs

Pros:

Learns complex household structure patterns
Can enforce relationship constraints
State-of-the-art for spatial microsimulation

Cons:

Bleeding edge (not production-ready)
Requires graph-structured training data
Computationally expensive

Approach 3: Hybrid (Recommended for microplex v1)#

The Pragmatic Middle Ground#

For synthesis: Use flattened representation For consistency: Add post-processing to fix violations

class HybridSynthesizer:
    def __init__(self):
        self.base = Synthesizer(
            target_vars=["age", "earnings", "household_income"],
            condition_vars=["education", "region"],
        )

    def generate(self, conditions, household_sizes):
        # 1. Generate persons with flattened approach
        synthetic = self.base.generate(conditions)

        # 2. Post-process to enforce consistency
        synthetic = self._enforce_household_consistency(
            synthetic, household_sizes
        )

        return synthetic

    def _enforce_household_consistency(self, data, household_sizes):
        # Group by household
        households = []
        start = 0
        for size in household_sizes:
            hh = data.iloc[start:start+size].copy()

            # All members get same household income
            hh["household_income"] = hh["household_income"].mean()

            # Ensure spouse ages are reasonable
            if size >= 2:
                ages = hh["age"].values[:2]
                if abs(ages[0] - ages[1]) > 20:
                    # Adjust to be within 20 years
                    hh.loc[1, "age"] = ages[0] + np.random.randint(-10, 10)

            households.append(hh)
            start += size

        return pd.concat(households)

Why This Works#

Leverages existing microplex: No architectural changes needed
Handles 90% of constraints: Post-processing fixes obvious violations
Incremental path: Can add more sophisticated approaches later
Practical: Microsimulation models mostly treat households atomically anyway

Limitations#

Post-processing can’t enforce complex joint distributions
May slightly distort learned correlations
Not guaranteed to match all training data properties

How PolicyEngine Handles This#

PolicyEngine uses a different approach entirely - it doesn’t synthesize microdata at all. Instead:

Source microdata (CPS/ACS) already has hierarchical structure

Entity definitions in PolicyEngine Core:

class Person(Entity):
    key = "person"

class TaxUnit(GroupEntity):
    key = "tax_unit"
    roles = [Head, Spouse, Dependent]

class Household(GroupEntity):
    key = "household"

Projectors handle cross-level calculations:
- entity_to_person_projector: Broadcast household income to all members
- first_person_to_entity_projector: Use head’s age for tax unit age

Reweighting operates at household level:

# Reweight to match state populations
reweight(microdata, targets={"state": {...}})

Key insight: For tax-benefit microsimulation, you rarely need to synthesize new household structures. You just reweight existing ones to match population margins.

Recommendations#

For microplex v1: Hybrid Flattening + Post-Processing#

# Example workflow
from microplex import Synthesizer, enforce_hierarchy

# 1. Train on flattened data
synth = Synthesizer(
    target_vars=["age", "earnings", "household_income"],
    condition_vars=["education", "region"],
)
synth.fit(flattened_cps)

# 2. Generate
synthetic = synth.generate(new_demographics)

# 3. Enforce consistency (new utility function)
synthetic = enforce_hierarchy(
    synthetic,
    household_id="household_id",
    shared_vars=["household_income", "region"],
    relationship_constraints={
        "spouse_age_diff": {"max": 20},
        "child_age": {"max_parent_diff": 18},
    }
)

Implementation effort: Low (1-2 weeks)

Modify Synthesizer.generate() to accept household_id
Add enforce_hierarchy() utility function
Document hierarchical use cases

For microplex v2: Top-Down Hierarchical#

When you need true joint modeling:

from microplex import HierarchicalSynthesizer

synth = HierarchicalSynthesizer(
    levels={
        "household": {
            "target_vars": ["income", "size"],
            "condition_vars": ["region"],
        },
        "person": {
            "target_vars": ["age", "earnings"],
            "condition_vars": ["household_income", "household_size"],
            "parent_level": "household",
        },
    }
)

Implementation effort: Medium (4-6 weeks)

New HierarchicalSynthesizer class
Training pipeline for nested models
Aggregation constraint enforcement

For microplex v3: Copula/GNN Approaches#

If you need:

Perfect correlation preservation
Transferability across populations
State-of-the-art quality

See research papers:

Implementation effort: High (3-6 months)

Practical Considerations#

Storage Format#

For hierarchical data, use denormalized tables:

persons.parquet:
  person_id | household_id | tax_unit_id | age | earnings

households.parquet:
  household_id | income | size | region

tax_units.parquet:
  tax_unit_id | household_id | income | filing_status

Then join when needed:

# For synthesis, flatten
training_data = (
    persons
    .merge(households, on="household_id")
    .merge(tax_units, on="tax_unit_id")
)

# For microsimulation, keep normalized
sim.load_persons(persons)
sim.load_households(households)
sim.calculate("tax_liability")

Reweighting Level#

Always reweight at the household level:

from microplex import Reweighter

# CORRECT: Household-level reweighting
reweighter = Reweighter()
weighted = reweighter.fit_transform(
    households,  # One row per household
    targets={"state": {...}},
)

# WRONG: Person-level reweighting (violates hierarchy)
weighted = reweighter.fit_transform(
    persons,  # Persons in same household could get different weights
    targets={"state": {...}},
)

Memory Efficiency#

For large populations (100M+ persons):

Don’t duplicate household vars in memory:

# Bad: 100M rows × 50 household vars = huge
flat = persons.merge(households)

# Good: Store separately, join on demand
person_features = synth.generate(persons)
full_data = person_features.merge(
    households[["household_id", "income"]],
    on="household_id"
)

Generate in batches:

for batch in batches(demographics, size=1_000_000):
    synthetic_batch = synth.generate(batch)
    synthetic_batch.to_parquet(f"output_{i}.parquet")

Code Sketch: Hybrid Approach#

# microplex/hierarchy.py

import pandas as pd
import numpy as np
from typing import Dict, List, Optional

def enforce_hierarchy(
    data: pd.DataFrame,
    household_id: str,
    shared_vars: List[str],
    relationship_constraints: Optional[Dict] = None,
) -> pd.DataFrame:
    """
    Enforce hierarchical consistency in flattened microdata.

    Ensures:
    - Household-level variables are identical for all members
    - Relationship constraints are satisfied (spouse ages, child ages, etc.)

    Args:
        data: Flattened microdata with person records
        household_id: Column identifying household membership
        shared_vars: Variables that must be identical within household
        relationship_constraints: Rules for relationships (optional)

    Returns:
        Corrected microdata

    Example:
        >>> synthetic = synth.generate(demographics)
        >>> consistent = enforce_hierarchy(
        ...     synthetic,
        ...     household_id="household_id",
        ...     shared_vars=["household_income", "region"],
        ...     relationship_constraints={
        ...         "spouse_age_diff": {"max": 20},
        ...     }
        ... )
    """
    result = []

    for hh_id, group in data.groupby(household_id):
        hh = group.copy()

        # Enforce shared variables (use mean/mode)
        for var in shared_vars:
            if var in hh.columns:
                if hh[var].dtype in [np.float64, np.float32, np.int64, np.int32]:
                    # Numeric: use mean
                    hh[var] = hh[var].mean()
                else:
                    # Categorical: use mode
                    hh[var] = hh[var].mode()[0]

        # Apply relationship constraints
        if relationship_constraints:
            hh = _apply_constraints(hh, relationship_constraints)

        result.append(hh)

    return pd.concat(result, ignore_index=True)


def _apply_constraints(
    household: pd.DataFrame,
    constraints: Dict,
) -> pd.DataFrame:
    """Apply relationship constraints within household."""

    # Spouse age difference
    if "spouse_age_diff" in constraints and len(household) >= 2:
        max_diff = constraints["spouse_age_diff"]["max"]
        ages = household["age"].values[:2]

        if abs(ages[0] - ages[1]) > max_diff:
            # Adjust second person's age
            household.loc[household.index[1], "age"] = (
                ages[0] + np.random.randint(-max_diff//2, max_diff//2)
            )

    # Child age constraints
    if "child_age" in constraints and len(household) > 2:
        max_parent_diff = constraints["child_age"]["max_parent_diff"]
        parent_ages = household["age"].values[:2]
        max_parent_age = max(parent_ages)

        for i in range(2, len(household)):
            child_age = household.iloc[i]["age"]
            if child_age > max_parent_age - max_parent_diff:
                # Adjust child age to be reasonable
                household.loc[household.index[i], "age"] = max(
                    0,
                    max_parent_age - max_parent_diff - np.random.randint(0, 5)
                )

    return household


# Add to Synthesizer class
from microplex.synthesizer import Synthesizer

def generate_with_hierarchy(
    self,
    conditions: pd.DataFrame,
    household_id: str,
    shared_vars: List[str],
    seed: Optional[int] = None,
) -> pd.DataFrame:
    """
    Generate synthetic data with hierarchical consistency enforced.

    Convenience method that combines generate() + enforce_hierarchy().
    """
    synthetic = self.generate(conditions, seed=seed)
    return enforce_hierarchy(synthetic, household_id, shared_vars)

# Monkey-patch for backwards compatibility
Synthesizer.generate_with_hierarchy = generate_with_hierarchy

Hierarchical Microdata Synthesis

Contents

Hierarchical Microdata Synthesis#

The Problem#

Approach 1: Flattening#

The Idea#

Implementation#

Pros#

Cons#

Approach 2: Hierarchical Synthesis#

2a. Top-Down Sequential Sampling#

2b. Copula-Based Hierarchical Synthesis#

2c. Graph Neural Networks for Household Structure#

Approach 3: Hybrid (Recommended for microplex v1)#

The Pragmatic Middle Ground#

Why This Works#

Limitations#

How PolicyEngine Handles This#

Recommendations#

For microplex v1: Hybrid Flattening + Post-Processing#

For microplex v2: Top-Down Hierarchical#

For microplex v3: Copula/GNN Approaches#

Practical Considerations#

Storage Format#

Reweighting Level#

Memory Efficiency#

Code Sketch: Hybrid Approach#

Future Roadmap#

v1.0 (Current): Flattening#

v1.5: Hybrid Post-Processing#

v2.0: Native Hierarchical Synthesis#

v3.0: Advanced Methods#

References#

Hierarchical Synthesis Methods#

Graph Neural Networks#

Census/Government Approaches#

Microsimulation Models#

Review Articles#