API reference#

Synthesizer#

class microplex.Synthesizer(target_vars, condition_vars, discrete_vars=None, n_layers=6, hidden_dim=64, zero_inflated=True, log_transform=True, variance_regularization=0.1, sample_clipping=3.0)#

Bases: object

Conditional microdata synthesizer using normalizing flows.

Learns P(target_vars | condition_vars) from training data, then generates synthetic target variables for new observations.

Key features:
  • Handles zero-inflated variables (common in economic data)

  • Preserves joint correlations between target variables

  • Supports sample weights for survey data

  • Reproducible generation with seed parameter

Example

>>> synth = Synthesizer(
...     target_vars=["income", "expenditure"],
...     condition_vars=["age", "education", "region"],
... )
>>> synth.fit(training_data, weight_col="weight")
>>> synthetic = synth.generate(new_demographics)

fit(data, weight_col='weight', epochs=100, batch_size=256, learning_rate=0.001, verbose=True)#

Fit synthesizer on training data.

Uses a two-stage approach:
  1. Binary models predict P(positive | context) for each variable

  2. Normalizing flow learns P(value | context) for positive cases

Parameters:
  • data (DataFrame) – DataFrame with target and condition variables

  • weight_col (str | None) – Name of weight column (None if unweighted)

  • epochs (int) – Number of training epochs

  • batch_size (int) – Training batch size

  • learning_rate (float) – Optimizer learning rate

  • verbose (bool) – Whether to print progress

Return type:

Self

Returns:

self

generate(conditions, seed=None)#

Generate synthetic target variables for given conditions.

Two-stage generation:
  1. Sample zero/non-zero indicators for each target variable

  2. For non-zero cases, sample from the flow and apply the inverse transform

Parameters:
  • conditions (DataFrame) – DataFrame with condition variables

  • seed (int | None) – Random seed for reproducibility

Return type:

DataFrame

Returns:

DataFrame with conditions + synthetic target variables
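
The two-stage generation described above can be sketched in plain NumPy. This is a minimal illustration and not the library's implementation: `two_stage_sample` and `lognormal_stand_in` are hypothetical names, and the log-normal draw merely stands in for "sample from the flow and inverse transform".

```python
import numpy as np

def two_stage_sample(p_positive, sample_positive, seed=None):
    """Illustrative two-stage draw for one zero-inflated variable.

    p_positive: array of P(value > 0 | context), one entry per row.
    sample_positive: hypothetical callable (rng, n) -> n positive draws.
    """
    rng = np.random.default_rng(seed)
    p_positive = np.asarray(p_positive, dtype=float)
    # Stage 1: Bernoulli draw of the zero/non-zero indicator.
    is_positive = rng.random(p_positive.shape[0]) < p_positive
    # Stage 2: fill in values only where the indicator is on.
    values = np.zeros(p_positive.shape[0])
    values[is_positive] = sample_positive(rng, int(is_positive.sum()))
    return values

def lognormal_stand_in(rng, n):
    # Assumption: a log-normal draw replaces the trained flow here.
    return np.exp(rng.normal(loc=10.0, scale=1.0, size=n))

p = np.array([0.0, 1.0, 0.5, 0.5])
draw_a = two_stage_sample(p, lognormal_stand_in, seed=0)
draw_b = two_stage_sample(p, lognormal_stand_in, seed=0)
```

Reusing the same seed reproduces the same draw, which is the behaviour the seed parameter is documented to provide.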

sample(n, seed=None)#

Generate fully synthetic records (both conditions and targets).

Used in full synthesis mode: samples conditions from the training distribution, then generates targets conditioned on them.

Parameters:
  • n (int) – Number of synthetic records to generate

  • seed (int | None) – Random seed for reproducibility

Return type:

DataFrame

Returns:

DataFrame with all variables (conditions + targets)

save(path)#

Save fitted model to disk.

Return type:

None

classmethod load(path)#

Load fitted model from disk.

Return type:

Self

Transforms#

Data transformations for microdata synthesis.

Handles common patterns in survey/administrative data:
  • Zero-inflated variables (many observations are exactly 0)

  • Heavy-tailed distributions (log transform)

  • Standardization (for neural network training)

class microplex.transforms.ZeroInflatedTransform#

Handle zero-inflated variables by splitting into indicator and values.

Common in economic data where many people have $0 for a given category (e.g., capital gains, medical expenditures, business income).

split(x)#

Split data into zero indicator and positive values.

Parameters:

x (ndarray) – Array of values (may contain zeros)

Returns:
  • indicator – Binary array (1 if positive, 0 if zero)

  • positive_values – Array of only the positive values

combine(indicator, positive_values)#

Recombine indicator and positive values.

Parameters:
  • indicator (ndarray) – Binary array (1 if should be positive)

  • positive_values (ndarray) – Array of positive values to fill in

Return type:

ndarray

Returns:

Combined array with zeros and positive values
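
As a rough sketch of the split/combine round trip, assuming plain NumPy arrays and the semantics documented above (the standalone functions below are illustrative stand-ins, not the class's actual methods):

```python
import numpy as np

def split(x):
    """Return (indicator, positive_values) for a 1-D array."""
    x = np.asarray(x, dtype=float)
    indicator = (x > 0).astype(int)   # 1 if positive, 0 if zero
    positive_values = x[x > 0]        # only the positive entries, in order
    return indicator, positive_values

def combine(indicator, positive_values):
    """Place positive_values back at the positions where indicator == 1."""
    indicator = np.asarray(indicator)
    out = np.zeros(indicator.shape, dtype=float)
    out[indicator == 1] = positive_values
    return out

x = np.array([0.0, 1200.0, 0.0, 35.5])
indicator, positive_values = split(x)
restored = combine(indicator, positive_values)
```

Note that `positive_values` is shorter than `x`, so `combine` needs the indicator to know where each value belongs.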

class microplex.transforms.LogTransform(offset=0.0)#

Log transformation for heavy-tailed distributions.

Most income/expenditure variables are approximately log-normal, so log-transforming before modeling improves results.

forward(x)#

Apply log transform: log(x + offset).

Return type:

ndarray

inverse(y)#

Inverse log transform: exp(y) - offset.

Return type:

ndarray
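
The forward and inverse formulas above can be checked with a short NumPy sketch. The function names are illustrative only. With the default `offset=0.0`, an input of exactly zero maps to negative infinity, so zeros either need a positive offset (as below) or must be split out beforehand.

```python
import numpy as np

def log_forward(x, offset=0.0):
    # log(x + offset), as in the forward() description
    return np.log(np.asarray(x, dtype=float) + offset)

def log_inverse(y, offset=0.0):
    # exp(y) - offset, as in the inverse() description
    return np.exp(np.asarray(y, dtype=float)) - offset

incomes = np.array([0.0, 10.0, 1_000.0, 250_000.0])
y = log_forward(incomes, offset=1.0)
recovered = log_inverse(y, offset=1.0)
```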

class microplex.transforms.Standardizer#

Standardize data to zero mean and unit variance.

Supports sample weights for proper handling of survey data.

fit(x, weights=None)#

Compute (weighted) mean and standard deviation.

Parameters:
  • x (ndarray) – Data array

  • weights (ndarray | None) – Sample weights (optional)

Return type:

Self

Returns:

self

transform(x)#

Standardize: (x - mean) / std.

Return type:

ndarray

inverse_transform(y)#

Inverse standardize: y * std + mean.

Return type:

ndarray
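
A minimal sketch of weighted standardization follows. It assumes the weighted variance is the weight-averaged squared deviation from the weighted mean; the library may use a different estimator, for example one with a degrees-of-freedom correction.

```python
import numpy as np

def weighted_mean_std(x, weights=None):
    """Weighted mean and (population-style) standard deviation."""
    x = np.asarray(x, dtype=float)
    mean = np.average(x, weights=weights)
    var = np.average((x - mean) ** 2, weights=weights)
    return mean, np.sqrt(var)

x = np.array([1.0, 2.0, 3.0, 10.0])
w = np.array([3.0, 1.0, 1.0, 1.0])   # survey weights

mean, std = weighted_mean_std(x, w)
z = (x - mean) / std                 # transform
x_back = z * std + mean              # inverse_transform
```

Under this assumption the standardized values have weighted mean 0 and weighted variance 1, while their unweighted moments generally differ.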

class microplex.transforms.VariableTransformer(zero_inflated=True, log_transform=True, standardize=True)#

Complete transformation pipeline for a single variable.

Combines zero-inflation handling, log transform, and standardization. Designed for variables that:
  • Are often zero (zero_inflated=True)

  • Are heavy-tailed when positive (log_transform=True)

  • Need standardization for neural network training

fit(x, weights=None)#

Fit the transformer on training data.

Parameters:
  • x (ndarray) – Data array

  • weights (ndarray | None) – Sample weights

Return type:

Self

Returns:

self

transform(x)#

Transform data.

For zero-inflated variables, returns NaN at zero positions (to distinguish from transformed values that equal 0).

Parameters:

x (ndarray | Tensor) – Data array or tensor

Return type:

ndarray | Tensor

Returns:

Transformed data

inverse_transform(y)#

Inverse transform data.

Parameters:

y (ndarray) – Transformed data (NaN indicates original zeros)

Return type:

ndarray

Returns:

Original scale data
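
The NaN convention can be illustrated with a small NumPy sketch. This assumes the pipeline order log, then standardize, applied to positive values only. The functions and the way `mean` and `std` are obtained are illustrative, not the class's internals.

```python
import numpy as np

def transform(x, mean, std):
    """Log + standardize positives; mark original zeros with NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    pos = x > 0
    out[pos] = (np.log(x[pos]) - mean) / std
    return out

def inverse_transform(y, mean, std):
    """Undo the transform; NaN entries become zeros again."""
    y = np.asarray(y, dtype=float)
    out = np.zeros(y.shape)
    observed = ~np.isnan(y)
    out[observed] = np.exp(y[observed] * std + mean)
    return out

x = np.array([0.0, 1.0, np.e, 0.0])
logs = np.log(x[x > 0])              # statistics come from positives only
mean, std = logs.mean(), logs.std()

y = transform(x, mean, std)
x_back = inverse_transform(y, mean, std)
```

Here the value 1.0 has a log of exactly 0, which shows why a sentinel other than 0 is needed to mark original zeros.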

class microplex.transforms.MultiVariableTransformer(var_names, zero_inflated=True, log_transform=True)#

Transform multiple variables at once.

Fits separate transformers for each variable, enabling independent handling of different distribution types.

fit(data, weight_col='weight')#

Fit transformers for all variables.

Parameters:
  • data (dict[str, ndarray]) – Dict with variable names as keys, arrays as values

  • weight_col (str) – Name of weight column (optional)

Return type:

Self

Returns:

self

transform(data)#

Transform all variables.

Parameters:

data (dict[str, ndarray]) – Dict with variable arrays

Return type:

dict[str, ndarray]

Returns:

Dict with transformed arrays

inverse_transform(data)#

Inverse transform all variables.

Parameters:

data (dict[str, ndarray]) – Dict with transformed arrays

Return type:

dict[str, ndarray]

Returns:

Dict with original scale arrays

Flows#

Normalizing flow models for conditional generation.

Implements Conditional Masked Autoregressive Flow (MAF) for learning the joint distribution of tax variables conditioned on demographics.

class microplex.flows.MADE(n_features, n_context, hidden_dim, n_hidden=2)#

Masked Autoencoder for Distribution Estimation (MADE).

Enforces the autoregressive property: output[i] depends only on input[:i]. Used as the conditioner network in MAF.

forward(x, context)#

Forward pass through MADE.

Parameters:
  • x (Tensor) – Input features [batch, n_features]

  • context (Tensor) – Context features [batch, n_context]

Returns:
  • mu – Mean parameters [batch, n_features]

  • log_scale – Log scale parameters [batch, n_features]
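
The autoregressive property is typically obtained by masking weight matrices. The sketch below builds masks for a single hidden layer following the standard MADE degree scheme. It ignores the context input, and `made_masks` is a hypothetical helper, not part of this module.

```python
import numpy as np

def made_masks(n_features, hidden_dim):
    """Binary masks for input->hidden and hidden->output weights."""
    in_deg = np.arange(1, n_features + 1)
    # Hidden degrees cycle through 1..n_features-1.
    hid_deg = np.arange(hidden_dim) % max(n_features - 1, 1) + 1
    # A hidden unit may see inputs with degree <= its own degree.
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)
    # An output may see hidden units with strictly smaller degree.
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(float)
    return mask_in, mask_out

mask_in, mask_out = made_masks(n_features=4, hidden_dim=8)
# connectivity[i, j] > 0 means output i has a path from input j.
connectivity = mask_out @ mask_in
```

The resulting connectivity matrix is strictly lower triangular, so output i has no path from input i or any later input.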

class microplex.flows.AffineCouplingLayer(n_features, n_context, hidden_dim)#

Affine coupling layer using MADE as the conditioner.

Transform: z = (x - mu(x, context)) / exp(log_scale(x, context)). This is invertible, and the Jacobian is easy to compute.

forward(x, context)#

Forward transformation: x -> z.

Parameters:
  • x (Tensor) – Input [batch, n_features]

  • context (Tensor) – Context [batch, n_context]

Returns:
  • z – Transformed output

  • log_det – Log determinant of Jacobian

inverse(z, context)#

Inverse transformation: z -> x.

Must be done autoregressively since mu, log_scale depend on x.

Parameters:
  • z (Tensor) – Latent space input

  • context (Tensor) – Context features

Returns:

x – Reconstructed input
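
To make the forward/inverse asymmetry concrete, here is a NumPy sketch with a toy conditioner. `toy_conditioner` is a hypothetical stand-in for MADE. It is autoregressive by construction, but it is not the network this class uses.

```python
import numpy as np

def toy_conditioner(x, context):
    """Hypothetical conditioner: output i uses only x[:, :i] and context."""
    prefix = np.cumsum(x, axis=1) - x          # sum of columns before i
    shift = context.sum(axis=1, keepdims=True)
    mu = 0.5 * prefix + shift
    log_scale = 0.1 * np.tanh(prefix)
    return mu, log_scale

def forward(x, context):
    # x -> z in one pass, since all of x is known.
    mu, log_scale = toy_conditioner(x, context)
    z = (x - mu) / np.exp(log_scale)
    log_det = -log_scale.sum(axis=1)           # triangular Jacobian
    return z, log_det

def inverse(z, context):
    # z -> x one feature at a time: x[:, i] needs x[:, :i] first.
    x = np.zeros_like(z)
    for i in range(z.shape[1]):
        mu, log_scale = toy_conditioner(x, context)
        x[:, i] = z[:, i] * np.exp(log_scale[:, i]) + mu[:, i]
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
context = rng.normal(size=(5, 2))
z, log_det = forward(x, context)
x_rec = inverse(z, context)
```

The forward pass needs one conditioner evaluation, while the inverse needs one per feature, which is why sampling from a MAF is slower than density evaluation.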

class microplex.flows.ConditionalMAF(n_features, n_context, n_layers=4, hidden_dim=64)#

Conditional Masked Autoregressive Flow.

Stacks multiple affine coupling layers with permutations between them to model complex distributions.

log_prob(x, context, mask=None, dim_weights=None)#

Compute log probability of x given context, with optional masking.

When mask is provided, the loss is computed only on observed (mask=1) dimensions. This enables training on multi-survey data with missing values.

Parameters:
  • x (Tensor) – Data [batch, n_features]

  • context (Tensor) – Context [batch, n_context]

  • mask (Tensor) – Optional observation mask [batch, n_features], 1=observed, 0=missing

  • dim_weights (Tensor) – Optional per-dimension weights [n_features] for balancing sparse observations (inverse frequency weighting)

Return type:

Tensor

Returns:

Log probability [batch]
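
One way to picture the masking is sketched below. It assumes the log probability decomposes into per-dimension terms that are weighted and then summed over observed dimensions only. How the flow actually computes and combines these terms is not specified here, so `masked_log_prob` is purely illustrative.

```python
import numpy as np

def masked_log_prob(per_dim_log_prob, mask=None, dim_weights=None):
    """Sum per-dimension log-density terms over observed dimensions."""
    terms = np.asarray(per_dim_log_prob, dtype=float)
    if dim_weights is not None:
        terms = terms * np.asarray(dim_weights, dtype=float)[None, :]
    if mask is not None:
        # np.where (not multiplication) so NaN at a missing entry cannot leak.
        terms = np.where(np.asarray(mask) == 1, terms, 0.0)
    return terms.sum(axis=1)

per_dim = np.array([[-1.0, -2.0, np.nan],    # third feature missing
                    [-0.5, -0.5, -0.5]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])
dim_weights = np.array([1.0, 1.0, 2.0])      # up-weight the sparse feature

lp = masked_log_prob(per_dim, mask, dim_weights)
```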

sample(context, clip_z=None)#

Sample from the flow given context.

Parameters:
  • context (Tensor) – Context [batch, n_context]

  • clip_z (float) – If provided, clip base samples to [-clip_z, clip_z]

Return type:

Tensor

Returns:

Samples [batch, n_features]
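
The effect of `clip_z` on the base samples can be sketched as follows. This assumes a standard normal base distribution, which is the usual choice for MAF; `sample_base` is a hypothetical helper.

```python
import numpy as np

def sample_base(n_samples, n_features, clip_z=None, seed=None):
    """Draw base samples and optionally clip them to [-clip_z, clip_z]."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, n_features))
    if clip_z is not None:
        z = np.clip(z, -clip_z, clip_z)
    return z

z_clipped = sample_base(10_000, 2, clip_z=3.0, seed=0)
z_free = sample_base(10_000, 2, clip_z=None, seed=0)
```

Clipping moves extreme draws to the boundary instead of redrawing them, so a small amount of probability mass accumulates at exactly ±clip_z.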

fit(X, context, epochs=100, batch_size=256, lr=0.001, weight_decay=1e-05, verbose=True, verbose_freq=10, clip_grad=5.0, device='cpu')#

Train the flow on data.

Parameters:
  • X (ndarray) – Training data [n_samples, n_features]

  • context (ndarray) – Conditioning data [n_samples, n_context]

  • epochs (int) – Number of training epochs

  • batch_size (int) – Batch size

  • lr (float) – Learning rate

  • weight_decay (float) – L2 regularization

  • verbose (bool) – Print progress

  • verbose_freq (int) – Print every N epochs

  • clip_grad (float) – Gradient clipping norm

  • device (str) – Device to train on

Return type:

ConditionalMAF

Returns:

self for chaining

generate(context, clip_z=3.0, device='cpu')#

Generate samples given context (numpy interface).

Parameters:
  • context (ndarray) – Conditioning data [n_samples, n_context]

  • clip_z (float) – Clip base distribution samples to avoid outliers

  • device (str) – Device to use

Return type:

ndarray

Returns:

Generated samples [n_samples, n_features]

Discrete models#

Models for discrete/categorical variables.

Handles binary variables (yes/no indicators) and categorical variables (multi-class) separately from continuous variables.

class microplex.discrete.BinaryModel(n_context, hidden_dim=32)#

Model for binary variables (0/1).

Examples: has_income, is_employed, owns_home

forward(context)#

Predict probability of 1 given context.

Parameters:

context (Tensor) – Conditioning features [batch, n_context]

Return type:

Tensor

Returns:

Probability of 1 [batch, 1]

class microplex.discrete.CategoricalModel(n_context, n_categories, hidden_dim=32)#

Model for categorical variables (multiple classes).

Examples: education_level, region, industry

forward(context)#

Predict category probabilities given context.

Parameters:

context (Tensor) – Conditioning features [batch, n_context]

Return type:

Tensor

Returns:

Category probabilities [batch, n_categories]

class microplex.discrete.DiscreteModelCollection(n_context, binary_vars, categorical_vars, hidden_dim=32)#

Collection of discrete variable models.

Manages multiple binary and categorical models for different variables.

forward(context)#

Predict probabilities for all discrete variables.

Parameters:

context (Tensor) – Conditioning features

Returns:

Dict of {var_name: probabilities}

sample(context)#

Sample all discrete variables.

Parameters:

context (Tensor) – Conditioning features

Returns:

Dict of {var_name: samples}

log_prob(context, targets)#

Compute log probability of discrete variables.

Parameters:
  • context (Tensor) – Conditioning features

  • targets (dict) – Dict of {var_name: values}

Return type:

Tensor

Returns:

Total log probability [batch]
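
A sketch of how a total log probability could be assembled is given below. It assumes the discrete variables are modelled as conditionally independent given the context, so their log probabilities add; the variable names and probabilities are made up for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def bernoulli_log_prob(p, y):
    # log P(y) for binary y with P(y = 1) = p
    return y * np.log(p) + (1 - y) * np.log(1 - p)

def categorical_log_prob(probs, y):
    # log of the probability assigned to the observed class per row
    return np.log(probs[np.arange(len(y)), y])

p_employed = np.array([0.9, 0.2])        # binary model output
employed = np.array([1, 0])
region_probs = softmax(np.array([[0.0, 0.0],
                                 [np.log(3.0), 0.0]]))
region = np.array([0, 1])

total = (bernoulli_log_prob(p_employed, employed)
         + categorical_log_prob(region_probs, region))
```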