Hierarchical Microdata Synthesis#

The Problem#

Real microdata has hierarchical structure:

Household
├── Tax Unit 1
│   ├── Person 1 (Head)
│   ├── Person 2 (Spouse)
│   └── Person 3 (Dependent)
└── Tax Unit 2
    └── Person 4 (Adult child)

Variables exist at each level:

  • Household: Total income, dwelling type, region

  • Tax Unit: Tax liability, EITC eligibility, filing status

  • Person: Age, earnings, education, employment

Relationships matter:

  • Spouse ages are correlated

  • Children’s ages depend on parent ages

  • All members must share the same household income

  • Tax unit aggregates must sum to household totals
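The last two rules can be checked mechanically on person-level records. The following is a minimal sketch, assuming pandas and the column names `household_id`, `tax_unit_id`, `household_income`, and `tax_unit_income`; the helper `check_hierarchy` is hypothetical and not part of microplex.

```python
import pandas as pd


def check_hierarchy(persons: pd.DataFrame) -> dict:
    """Return the household_ids that violate each consistency rule."""
    # Rule 1: household_income must take a single value per household.
    n_values = persons.groupby("household_id")["household_income"].nunique()
    inconsistent = n_values[n_values > 1].index.tolist()

    # Rule 2: tax unit incomes must sum to the household income.
    # Count each tax unit once, not once per member.
    tax_units = persons.drop_duplicates(subset=["household_id", "tax_unit_id"])
    tu_total = tax_units.groupby("household_id")["tax_unit_income"].sum()
    hh_income = persons.groupby("household_id")["household_income"].first()
    bad_sum = tu_total[(tu_total - hh_income).abs() > 1e-6].index.tolist()

    return {"inconsistent_income": inconsistent, "bad_aggregation": bad_sum}


# Household 1 is consistent; household 2 breaks both rules (toy data).
persons = pd.DataFrame({
    "household_id":     [1, 1, 1, 1, 2, 2],
    "tax_unit_id":      [1, 1, 1, 2, 3, 3],
    "household_income": [120_000, 120_000, 120_000, 120_000, 50_000, 60_000],
    "tax_unit_income":  [100_000, 100_000, 100_000, 20_000, 40_000, 40_000],
})
report = check_hierarchy(persons)
```

A check like this is useful on both the training data and any synthetic output, since it makes the two hard constraints testable rather than implicit.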

The question: Should microplex flatten this to single records or directly model the hierarchy?

Approach 1: Flattening#

The Idea#

Represent each person as a row with household/tax unit variables copied:

person_id | age | household_income | tax_unit_income | region
----------|-----|------------------|-----------------|--------
1         | 42  | 120000          | 100000          | CA
2         | 40  | 120000          | 100000          | CA
3         | 8   | 120000          | 100000          | CA
4         | 22  | 120000          | 20000           | CA

Implementation#

Current microplex already supports this:

import pandas as pd

from microplex import Synthesizer

# Flatten CPS to person-level
flattened = []
for household in cps_households:
    for person in household.members:
        flattened.append({
            # Person-level
            "age": person.age,
            "earnings": person.earnings,
            "education": person.education,

            # Copied from household
            "household_income": household.income,
            "household_size": len(household.members),
            "region": household.region,

            # Copied from tax unit
            "tax_unit_income": person.tax_unit.income,
            "filing_status": person.tax_unit.filing_status,
        })

# Synthesize
synth = Synthesizer(
    target_vars=["earnings", "household_income", "tax_unit_income"],
    condition_vars=["age", "education", "region"],
)
synth.fit(pd.DataFrame(flattened))

Pros#

  1. Simple: Works with existing microplex API

  2. No structural constraints: Just learn conditional distributions

  3. Proven: This is how most survey microdata is released (PUMS, CPS)

  4. Fast: Single model, no hierarchical sampling

Cons#

  1. Loses within-household correlations: Can’t ensure spouse ages are realistic

  2. No consistency enforcement: Can generate different household_income values for members of the same household

  3. Inefficient: Copies household-level variables N times (storage + memory)

  4. Can’t enforce aggregation: Tax unit incomes might not sum to household income
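The second con is easy to reproduce. The sketch below stands in for a per-row synthesizer by drawing `household_income` independently for every person from a toy lognormal distribution (an assumption for illustration, not microplex output), then counts how many households end up with more than one value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 1,000 households of three persons each.
household_id = np.repeat(np.arange(1000), 3)

# Each row gets its own draw, as a row-wise model would produce.
synthetic = pd.DataFrame({
    "household_id": household_id,
    "household_income": rng.lognormal(mean=11.0, sigma=0.5, size=len(household_id)),
})

# Share of households whose members disagree on household_income.
n_distinct = synthetic.groupby("household_id")["household_income"].nunique()
share_inconsistent = float((n_distinct > 1).mean())
```

With a continuous variable, essentially every household is inconsistent, which is why flattened output needs a post-processing step before it can be treated as hierarchical data.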

Approach 2: Hierarchical Synthesis#

2a. Top-Down Sequential Sampling#

Generate in order: household → tax units → persons

# Pseudocode
class HierarchicalSynthesizer:
    def __init__(self):
        # Three separate models
        self.household_model = Synthesizer(
            target_vars=["household_income", "n_tax_units"],
            condition_vars=["region", "year"],
        )
        self.tax_unit_model = Synthesizer(
            target_vars=["tax_unit_income", "filing_status", "n_persons"],
            condition_vars=["household_income", "region"],
        )
        self.person_model = Synthesizer(
            target_vars=["age", "earnings", "education"],
            condition_vars=["tax_unit_income", "filing_status"],
        )

    def generate(self, n_households):
        # 1. Sample households (assumes generate() returns a DataFrame)
        households = self.household_model.generate(n_households)

        # 2. For each household row, sample its tax units
        tax_units = []
        for _, hh in households.iterrows():
            n_tu = int(hh["n_tax_units"])
            tu = self.tax_unit_model.generate(
                conditions=pd.DataFrame([hh] * n_tu)
            )
            tax_units.append(tu)
        tax_units = pd.concat(tax_units, ignore_index=True)

        # 3. For each tax unit row, sample its persons
        persons = []
        for _, tu in tax_units.iterrows():
            n_persons = int(tu["n_persons"])
            p = self.person_model.generate(
                conditions=pd.DataFrame([tu] * n_persons)
            )
            persons.append(p)
        persons = pd.concat(persons, ignore_index=True)

        return households, tax_units, persons

Pros:

  • Natural modeling of causality (household determines tax units, etc.)

  • Can enforce aggregation constraints at each level

  • Memory efficient (no duplication)

Cons:

  • Complex: Three models to train

  • Error propagation: Mistakes at household level affect all downstream

  • Training data requirements: Need nested structure in training data
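To make the aggregation point concrete, here is a toy top-down draw using NumPy only. The distributions are placeholders rather than fitted models; the idea shown is that splitting the household total with Dirichlet shares guarantees the tax unit incomes sum to the household income by construction.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_household(rng: np.random.Generator) -> dict:
    """Toy top-down draw: household first, then its tax units."""
    # Level 1: household income and number of tax units (toy distributions).
    household_income = float(rng.lognormal(mean=11.0, sigma=0.5))
    n_tax_units = int(rng.integers(1, 4))  # 1, 2, or 3

    # Level 2: Dirichlet shares are positive and sum to 1, so the
    # tax unit incomes add up to the household total.
    shares = rng.dirichlet(np.ones(n_tax_units))
    tax_unit_incomes = household_income * shares

    return {
        "household_income": household_income,
        "tax_unit_incomes": tax_unit_incomes,
    }


households = [sample_household(rng) for _ in range(1000)]
```

In a real pipeline the shares would come from a model conditioned on household characteristics, but the same normalization step would keep the constraint exact.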

2b. Copula-Based Hierarchical Synthesis#

Use copulas to model cross-level dependencies.

From Copula-Based Transferable Models:

  • Separate marginal distributions from dependency structure

  • Model household structure using vine copulas

  • Can transfer learned dependencies to new populations

# Conceptual (would need implementation)
class CopulaHierarchicalSynthesizer:
    def __init__(self):
        # Learn marginals at each level
        self.household_marginals = fit_marginals(household_vars)
        self.person_marginals = fit_marginals(person_vars)

        # Learn dependency structure
        self.within_household_copula = VineCopula()
        self.cross_level_copula = VineCopula()

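    def generate(self, n):
        # Sample uniforms from the fitted copula, then map them through the
        # inverse CDFs of the fitted marginals. `self.copula` and
        # `self.marginals` stand for the combined copula and marginal
        # objects built in __init__ (placeholders in this sketch).
        u = self.copula.sample(n)
        x = self.marginals.inverse_cdf(u)
        return x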

Pros:

  • Can preserve dependencies within and across levels, to the extent the chosen copula family fits the data

  • Mathematically rigorous

  • Transferable to new populations

Cons:

  • High-dimensional copulas are complex

  • Hard to enforce hard constraints (e.g., child age < parent age)

  • Training requires significant sample size
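The "separate marginals from dependency structure" idea can be shown with a two-variable Gaussian copula, which is much simpler than the vine copulas mentioned above and is swapped in here purely for illustration. The sketch uses NumPy and the standard library, fits on toy spouse ages (not survey data), and samples new pairs through the empirical marginals.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()
normal_cdf = np.vectorize(_STD_NORMAL.cdf)
normal_inv_cdf = np.vectorize(_STD_NORMAL.inv_cdf)


def fit_gaussian_copula(data: np.ndarray) -> np.ndarray:
    """Estimate the copula correlation matrix from rank-based normal scores."""
    n_rows = data.shape[0]
    ranks = data.argsort(axis=0).argsort(axis=0) + 1   # 1..n per column
    normal_scores = normal_inv_cdf(ranks / (n_rows + 1))
    return np.corrcoef(normal_scores, rowvar=False)


def sample_gaussian_copula(corr, data, n, rng):
    """Draw correlated uniforms, then apply each empirical inverse CDF."""
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n)
    u = normal_cdf(z)
    columns = [np.quantile(data[:, j], u[:, j]) for j in range(data.shape[1])]
    return np.column_stack(columns)


rng = np.random.default_rng(0)

# Toy "observed" spouse ages with strong positive dependence.
head_age = rng.normal(45, 12, size=5000).clip(18, 90)
spouse_age = (head_age + rng.normal(-2, 4, size=5000)).clip(18, 90)
observed = np.column_stack([head_age, spouse_age])

corr = fit_gaussian_copula(observed)
synthetic = sample_gaussian_copula(corr, observed, n=2000, rng=rng)
```

Because sampled values are quantiles of the observed columns, the marginals are reproduced closely, while the dependence is only as good as the Gaussian assumption. Hard rules such as a minimum parent-child age gap still need a separate check.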

2c. Graph Neural Networks for Household Structure#

From University of Oxford student project:

  • Represent households as graphs

  • Nodes = persons

  • Edges = relationships (spouse, parent-child)

  • Use GNN to predict realistic household compositions

# Conceptual
class GNNHouseholdSynthesizer:
    def __init__(self):
        self.person_generator = Synthesizer(target_vars=["age", "earnings"])
        self.structure_predictor = GraphNeuralNetwork()

    def generate(self, n_households):
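        # 1. Generate persons independently
        # (n_total_persons is the total person count implied by
        # n_households; it is left undefined in this sketch)
        persons = self.person_generator.generate(n_total_persons)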

        # 2. Use GNN to cluster into realistic households
        household_graphs = self.structure_predictor.predict(persons)

        return household_graphs

Pros:

  • Learns complex household structure patterns

  • Can enforce relationship constraints

  • State-of-the-art for spatial microsimulation

Cons:

  • Bleeding edge (not production-ready)

  • Requires graph-structured training data

  • Computationally expensive
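The graph representation itself needs no neural network. The sketch below covers only the data structure and a rule check, not a GNN: persons are nodes keyed by id, and edges carry a relationship label. The class name `HouseholdGraph` and both thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class HouseholdGraph:
    ages: Dict[int, int] = field(default_factory=dict)                # node -> age
    edges: List[Tuple[int, int, str]] = field(default_factory=list)  # (src, dst, relation)

    def violations(self, max_spouse_gap: int = 20, min_parent_gap: int = 15) -> List[str]:
        """List relationship edges whose age gap breaks a rule."""
        problems = []
        for src, dst, relation in self.edges:
            gap = self.ages[src] - self.ages[dst]
            if relation == "spouse" and abs(gap) > max_spouse_gap:
                problems.append(f"spouses {src} and {dst} are {abs(gap)} years apart")
            if relation == "parent" and gap < min_parent_gap:
                problems.append(f"parent {src} is only {gap} years older than {dst}")
        return problems


# The four-person household from the flattened example table.
household = HouseholdGraph(
    ages={1: 42, 2: 40, 3: 8, 4: 22},
    edges=[(1, 2, "spouse"), (1, 3, "parent"), (2, 3, "parent"),
           (1, 4, "parent"), (2, 4, "parent")],
)
ok = household.violations()

# Adding an implausible child produces one violation.
household.ages[5] = 35
household.edges.append((1, 5, "parent"))
bad = household.violations()
```

Explicit edges avoid guessing relationships from row order, which is the main weakness of rule checks on flattened records.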

How PolicyEngine Handles This#

PolicyEngine takes a different approach entirely and doesn’t synthesize microdata at all. Instead:

  1. Source microdata (CPS/ACS) already has hierarchical structure

  2. Entity definitions in PolicyEngine Core:

    class Person(Entity):
        key = "person"
    
    class TaxUnit(GroupEntity):
        key = "tax_unit"
        roles = [Head, Spouse, Dependent]
    
    class Household(GroupEntity):
        key = "household"
    
  3. Projectors handle cross-level calculations:

    • entity_to_person_projector: Broadcast household income to all members

    • first_person_to_entity_projector: Use head’s age for tax unit age

  4. Reweighting operates at household level:

    # Reweight to match state populations
    reweight(microdata, targets={"state": {...}})
    

Key insight: For tax-benefit microsimulation, you rarely need to synthesize new household structures. You just reweight existing ones to match population margins.
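The cross-level projections in step 3 amount to index arithmetic over a membership array. The following NumPy sketch illustrates the idea only; it is not PolicyEngine Core's implementation, and the array names are made up for this example.

```python
import numpy as np

# One entry per person: the position of that person's household.
person_household = np.array([0, 0, 0, 0, 1, 1])
person_earnings = np.array([70_000, 30_000, 0, 20_000, 45_000, 5_000])

# Person -> household: sum member values into one slot per household.
household_income = np.bincount(person_household, weights=person_earnings)

# Household -> person: broadcast by indexing with the membership array.
income_per_person = household_income[person_household]
```

Keeping values at their native level and projecting on demand is what lets the hierarchy stay consistent without copying household variables onto every person row.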

Recommendations#

For microplex v1: Hybrid Flattening + Post-Processing#

# Example workflow
from microplex import Synthesizer, enforce_hierarchy

# 1. Train on flattened data
synth = Synthesizer(
    target_vars=["age", "earnings", "household_income"],
    condition_vars=["education", "region"],
)
synth.fit(flattened_cps)

# 2. Generate
synthetic = synth.generate(new_demographics)

# 3. Enforce consistency (new utility function)
synthetic = enforce_hierarchy(
    synthetic,
    household_id="household_id",
    shared_vars=["household_income", "region"],
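    relationship_constraints={
        "spouse_age_diff": {"max": 20},
        # Despite the name, max_parent_diff is used as the minimum age gap
        # between the older parent and a child (see _apply_constraints)
        "child_age": {"max_parent_diff": 18},
    }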
)

Implementation effort: Low (1-2 weeks)

  • Modify Synthesizer.generate() to accept household_id

  • Add enforce_hierarchy() utility function

  • Document hierarchical use cases

For microplex v2: Top-Down Hierarchical#

When you need true joint modeling:

from microplex import HierarchicalSynthesizer

synth = HierarchicalSynthesizer(
    levels={
        "household": {
            "target_vars": ["income", "size"],
            "condition_vars": ["region"],
        },
        "person": {
            "target_vars": ["age", "earnings"],
            "condition_vars": ["household_income", "household_size"],
            "parent_level": "household",
        },
    }
)

Implementation effort: Medium (4-6 weeks)

  • New HierarchicalSynthesizer class

  • Training pipeline for nested models

  • Aggregation constraint enforcement
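One piece of the nested training pipeline is working out the order in which levels must be sampled from their `parent_level` entries. This is a small standard-library sketch; the function `sampling_order` is hypothetical and only assumes the `levels` dictionary shape shown above.

```python
from typing import Dict, List


def sampling_order(levels: Dict[str, dict]) -> List[str]:
    """Order levels so every parent is sampled before its children."""
    order: List[str] = []
    remaining = dict(levels)
    while remaining:
        # A level is ready when it has no parent or its parent is already placed.
        ready = [
            name for name, spec in remaining.items()
            if spec.get("parent_level") is None or spec["parent_level"] in order
        ]
        if not ready:
            raise ValueError(f"cycle or unknown parent among: {sorted(remaining)}")
        for name in ready:
            order.append(name)
            del remaining[name]
    return order


levels = {
    "person": {"parent_level": "tax_unit"},
    "household": {},
    "tax_unit": {"parent_level": "household"},
}
order = sampling_order(levels)
```

Validating the configuration up front gives a clear error for a misspelled or circular `parent_level` instead of a failure partway through generation.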

For microplex v3: Copula/GNN Approaches#

If you need:

  • Near-complete correlation preservation

  • Transferability across populations

  • State-of-the-art quality

See the work cited under 2b (copulas) and 2c (graph neural networks) above.

Implementation effort: High (3-6 months)

Practical Considerations#

Storage Format#

For hierarchical data, use normalized tables (one table per level, linked by ID columns):

persons.parquet:
  person_id | household_id | tax_unit_id | age | earnings

households.parquet:
  household_id | income | size | region

tax_units.parquet:
  tax_unit_id | household_id | income | filing_status

Then join when needed:

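# For synthesis, flatten. households and tax_units both have an "income"
# column, so rename before merging to avoid income_x / income_y.
training_data = (
    persons
    .merge(households.rename(columns={"income": "household_income"}),
           on="household_id")
    .merge(tax_units.rename(columns={"income": "tax_unit_income"}),
           on=["tax_unit_id", "household_id"])
)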

# For microsimulation, keep normalized
sim.load_persons(persons)
sim.load_households(households)
sim.calculate("tax_liability")
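Before flattening, it is worth confirming that the keys behave as expected. This sketch uses the `validate` and `indicator` parameters of pandas `merge` on toy tables shaped like the ones above: `validate="many_to_one"` raises if a household appears twice in the household table, and `indicator=True` exposes persons with no matching household.

```python
import pandas as pd

persons = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "household_id": [1, 1, 1, 1],
    "tax_unit_id": [1, 1, 1, 2],
    "age": [42, 40, 8, 22],
})
households = pd.DataFrame({
    "household_id": [1],
    "income": [120_000],
    "region": ["CA"],
})

# Left join keeps every person; _merge marks rows without a household.
flat = persons.merge(
    households, on="household_id", how="left",
    validate="many_to_one", indicator=True,
)
orphans = flat.loc[flat["_merge"] == "left_only", "person_id"].tolist()

# A duplicated household key is rejected instead of silently doubling rows.
duplicated = pd.concat([households, households], ignore_index=True)
try:
    persons.merge(duplicated, on="household_id", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```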

Reweighting Level#

Always reweight at the household level:

from microplex import Reweighter

# CORRECT: Household-level reweighting
reweighter = Reweighter()
weighted = reweighter.fit_transform(
    households,  # One row per household
    targets={"state": {...}},
)

# WRONG: Person-level reweighting (violates hierarchy)
weighted = reweighter.fit_transform(
    persons,  # Persons in same household could get different weights
    targets={"state": {...}},
)
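To show why the household is the right unit, here is a one-margin ratio adjustment (simple post-stratification) written in plain pandas. It is not the microplex `Reweighter`, and the target counts are made-up numbers: household weights are scaled to hit state totals, and persons then inherit their household's weight.

```python
import pandas as pd

households = pd.DataFrame({
    "household_id": [1, 2, 3],
    "state": ["CA", "CA", "NY"],
    "weight": [100.0, 300.0, 200.0],
})
persons = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "household_id": [1, 1, 2, 3, 3],
})
targets = {"CA": 800.0, "NY": 300.0}  # hypothetical household counts by state

# Scale each household weight so that state totals match the targets.
state_total = households.groupby("state")["weight"].transform("sum")
households["new_weight"] = (
    households["weight"] * households["state"].map(targets) / state_total
)

# Persons inherit the weight of their household, so members never disagree.
persons = persons.merge(
    households[["household_id", "new_weight"]],
    on="household_id", how="left", validate="many_to_one",
)
```

Because weights are attached once per household and only projected down, every member of a household carries the same weight by construction.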

Memory Efficiency#

For large populations (100M+ persons):

  1. Don’t duplicate household vars in memory:

    # Bad: 100M rows × 50 household vars = huge
    flat = persons.merge(households)
    
    # Good: Store separately, join on demand
    person_features = synth.generate(persons)
    full_data = person_features.merge(
        households[["household_id", "income"]],
        on="household_id"
    )
    
  2. Generate in batches:

    # batches() is a helper that yields successive slices of the frame
    for i, batch in enumerate(batches(demographics, size=1_000_000)):
        synthetic_batch = synth.generate(batch)
        synthetic_batch.to_parquet(f"output_{i}.parquet")
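The `batches` helper used in the loop is not defined on this page. One plausible definition, assuming the input is a pandas DataFrame, is a generator over row slices:

```python
from typing import Iterator

import pandas as pd


def batches(frame: pd.DataFrame, size: int) -> Iterator[pd.DataFrame]:
    """Yield consecutive row slices of at most `size` rows."""
    for start in range(0, len(frame), size):
        yield frame.iloc[start:start + size]


demographics = pd.DataFrame({"age": range(10)})
sizes = [len(batch) for batch in batches(demographics, size=4)]
```

If hierarchical consistency is enforced per batch, the slices should be cut on household boundaries so that no household is split across two batches; this simple version slices by row position only.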
    

Code Sketch: Hybrid Approach#

# microplex/hierarchy.py

import pandas as pd
import numpy as np
from typing import Dict, List, Optional

def enforce_hierarchy(
    data: pd.DataFrame,
    household_id: str,
    shared_vars: List[str],
    relationship_constraints: Optional[Dict] = None,
) -> pd.DataFrame:
    """
    Enforce hierarchical consistency in flattened microdata.

    Ensures:
    - Household-level variables are identical for all members
    - Relationship constraints are satisfied (spouse ages, child ages, etc.)

    Args:
        data: Flattened microdata with person records
        household_id: Column identifying household membership
        shared_vars: Variables that must be identical within household
        relationship_constraints: Rules for relationships (optional)

    Returns:
        Corrected microdata

    Example:
        >>> synthetic = synth.generate(demographics)
        >>> consistent = enforce_hierarchy(
        ...     synthetic,
        ...     household_id="household_id",
        ...     shared_vars=["household_income", "region"],
        ...     relationship_constraints={
        ...         "spouse_age_diff": {"max": 20},
        ...     }
        ... )
    """
    result = []

    for hh_id, group in data.groupby(household_id):
        hh = group.copy()

        # Enforce shared variables (use mean/mode)
        for var in shared_vars:
            if var in hh.columns:
                if pd.api.types.is_numeric_dtype(hh[var]):
                    # Numeric: use the household mean
                    hh[var] = hh[var].mean()
                else:
                    # Categorical: use the most frequent value
                    hh[var] = hh[var].mode()[0]

        # Apply relationship constraints
        if relationship_constraints:
            hh = _apply_constraints(hh, relationship_constraints)

        result.append(hh)

    return pd.concat(result, ignore_index=True)


def _apply_constraints(
    household: pd.DataFrame,
    constraints: Dict,
) -> pd.DataFrame:
    """Apply relationship constraints within household.

    Assumes rows are ordered head, spouse, then children; the flattened
    data carries no explicit relationship column.
    """

    # Spouse age difference
    if "spouse_age_diff" in constraints and len(household) >= 2:
        max_diff = constraints["spouse_age_diff"]["max"]
        ages = household["age"].values[:2]

        if abs(ages[0] - ages[1]) > max_diff:
            # Redraw the second person's age within max_diff // 2 years
            # of the first (upper bound of randint is exclusive)
            half = max_diff // 2
            household.loc[household.index[1], "age"] = (
                ages[0] + np.random.randint(-half, half + 1)
            )

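    # Child age constraints: despite its name, max_parent_diff is the
    # minimum age gap required between the older parent and each child
    if "child_age" in constraints and len(household) > 2:
        max_parent_diff = constraints["child_age"]["max_parent_diff"]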
        parent_ages = household["age"].values[:2]
        max_parent_age = max(parent_ages)

        for i in range(2, len(household)):
            child_age = household.iloc[i]["age"]
            if child_age > max_parent_age - max_parent_diff:
                # Adjust child age to be reasonable
                household.loc[household.index[i], "age"] = max(
                    0,
                    max_parent_age - max_parent_diff - np.random.randint(0, 5)
                )

    return household


# Add to Synthesizer class
from microplex.synthesizer import Synthesizer

def generate_with_hierarchy(
    self,
    conditions: pd.DataFrame,
    household_id: str,
    shared_vars: List[str],
    seed: Optional[int] = None,
    relationship_constraints: Optional[Dict] = None,
) -> pd.DataFrame:
    """
    Generate synthetic data with hierarchical consistency enforced.

    Convenience method that combines generate() + enforce_hierarchy().
    """
    synthetic = self.generate(conditions, seed=seed)
    return enforce_hierarchy(
        synthetic, household_id, shared_vars, relationship_constraints
    )

# Attach the function to Synthesizer as a method (monkey-patch), so the
# class itself does not need to be edited
Synthesizer.generate_with_hierarchy = generate_with_hierarchy
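The shared-variable step does not need a Python loop over households. As an alternative sketch, `groupby(...).transform` applies the same mean/mode rule in one pass per variable and, unlike the loop, keeps the original row order and index. The function name `enforce_shared_vars` is hypothetical, and the mode lookup assumes each household has at least one non-missing value.

```python
from typing import List

import pandas as pd


def enforce_shared_vars(
    data: pd.DataFrame,
    household_id: str,
    shared_vars: List[str],
) -> pd.DataFrame:
    """Make each shared variable constant within a household."""
    out = data.copy()
    grouped = data.groupby(household_id)
    for var in shared_vars:
        if pd.api.types.is_numeric_dtype(data[var]):
            # Numeric: broadcast the household mean to every member.
            out[var] = grouped[var].transform("mean")
        else:
            # Categorical: broadcast the most frequent value.
            out[var] = grouped[var].transform(lambda s: s.mode().iloc[0])
    return out


synthetic = pd.DataFrame({
    "household_id": [1, 1, 2, 2, 2],
    "household_income": [100.0, 120.0, 50.0, 50.0, 50.0],
    "region": ["CA", "CA", "NY", "NY", "TX"],
})
consistent = enforce_shared_vars(
    synthetic, "household_id", ["household_income", "region"]
)
```

Relationship constraints still need per-household logic, so this only replaces the first half of `enforce_hierarchy()`.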

Future Roadmap#

v1.0 (Current): Flattening#

  • ✅ Works with existing API

  • ✅ Documented hierarchical patterns

  • enforce_hierarchy() utility

v1.5: Hybrid Post-Processing#

  • generate_with_hierarchy() method

  • Automatic constraint detection

  • Validation metrics for hierarchical quality

v2.0: Native Hierarchical Synthesis#

  • HierarchicalSynthesizer class

  • Top-down sampling

  • Aggregation constraint enforcement

v3.0: Advanced Methods#

  • Copula-based synthesis

  • GNN household structure prediction

  • Transferable models across populations
