API reference
Synthesizer
- class microplex.Synthesizer(target_vars, condition_vars, discrete_vars=None, n_layers=6, hidden_dim=64, zero_inflated=True, log_transform=True, variance_regularization=0.1, sample_clipping=3.0)
Bases: object
Conditional microdata synthesizer using normalizing flows.
Learns P(target_vars | condition_vars) from training data, then generates synthetic target variables for new observations.
Key features:
- Handles zero-inflated variables (common in economic data)
- Preserves joint correlations between target variables
- Supports sample weights for survey data
- Reproducible generation with seed parameter
Example
>>> synth = Synthesizer(
...     target_vars=["income", "expenditure"],
...     condition_vars=["age", "education", "region"],
... )
>>> synth.fit(training_data, weight_col="weight")
>>> synthetic = synth.generate(new_demographics)
- fit(data, weight_col='weight', epochs=100, batch_size=256, learning_rate=0.001, verbose=True)
Fit synthesizer on training data.
Uses a two-stage approach:
1. Binary models predict P(positive | context) for each variable
2. Normalizing flow learns P(value | context) for positive cases
- Parameters:
data (DataFrame) – DataFrame with target and condition variables
weight_col (str | None) – Name of weight column (None if unweighted)
epochs (int) – Number of training epochs
batch_size (int) – Training batch size
learning_rate (float) – Optimizer learning rate
verbose (bool) – Whether to print progress
- Return type:
Self
- Returns:
self
- generate(conditions, seed=None)
Generate synthetic target variables for given conditions.
Two-stage generation:
1. Sample zero/non-zero indicators for each target variable
2. For non-zero cases, sample from the flow and apply the inverse transform
- Parameters:
conditions (DataFrame) – DataFrame with condition variables
seed (int | None) – Random seed for reproducibility
- Return type:
DataFrame
- Returns:
DataFrame with conditions + synthetic target variables
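The two-stage generation described above can be sketched for a single variable as follows. This is a minimal, dependency-free illustration and not the library's implementation: `generate_two_stage`, `p_positive`, `mu` and `sigma` are hypothetical names, and a log-normal draw stands in for the normalizing flow.

```python
import math
import random


def generate_two_stage(p_positive, mu, sigma, n, seed=None):
    """Toy two-stage sampler for one zero-inflated variable.

    p_positive, mu and sigma are hypothetical fitted quantities; a
    log-normal stands in for the normalizing flow.
    """
    rng = random.Random(seed)  # private RNG, so a seed reproduces the draws
    values = []
    for _ in range(n):
        # Stage 1: sample the zero / non-zero indicator.
        if rng.random() >= p_positive:
            values.append(0.0)
            continue
        # Stage 2: sample in log space, then invert the log transform.
        values.append(math.exp(rng.gauss(mu, sigma)))
    return values


draws = generate_two_stage(p_positive=0.3, mu=10.0, sigma=1.0, n=1000, seed=42)
share_zero = sum(v == 0.0 for v in draws) / len(draws)
```

Using a private `random.Random(seed)` instance keeps repeated calls with the same seed identical without touching global random state, which is one plausible way to get the reproducibility the `seed` parameter promises.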
- sample(n, seed=None)
Generate fully synthetic records (both conditions and targets).
For full synthesis mode: samples conditions from the training distribution, then generates targets conditioned on them.
- Parameters:
n (int) – Number of synthetic records to generate
seed (int | None) – Random seed for reproducibility
- Return type:
DataFrame
- Returns:
DataFrame with all variables (conditions + targets)
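The reference does not say how conditions are drawn from the training distribution. One plausible reading, shown here purely as an assumption, is a weighted bootstrap of the training rows; `sample_conditions` and the example rows are hypothetical.

```python
import random


def sample_conditions(training_rows, weights, n, seed=None):
    """Draw n condition rows with replacement, proportional to weight."""
    rng = random.Random(seed)
    picked = rng.choices(training_rows, weights=weights, k=n)
    return [dict(row) for row in picked]  # copy so callers can mutate safely


training_rows = [
    {"age": 25, "region": "north"},
    {"age": 40, "region": "south"},
    {"age": 63, "region": "north"},
]
weights = [1.0, 3.0, 0.5]  # hypothetical survey weights

conditions = sample_conditions(training_rows, weights, n=5, seed=0)
```

The sampled rows would then be passed to the conditional generator to fill in the target variables.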
- save(path)
Save fitted model to disk.
- Return type:
None
- classmethod load(path)
Load fitted model from disk.
- Return type:
Self
Transforms
Data transformations for microdata synthesis.
Handles common patterns in survey/administrative data:
- Zero-inflated variables (many observations are exactly 0)
- Heavy-tailed distributions (log transform)
- Standardization (for neural network training)
- class microplex.transforms.ZeroInflatedTransform
Handle zero-inflated variables by splitting into indicator and values.
Common in economic data where many people have $0 for a given category (e.g., capital gains, medical expenditures, business income).
- split(x)
Split data into zero indicator and positive values.
- Parameters:
x (ndarray) – Array of values (may contain zeros)
- Returns:
indicator – Binary array (1 if positive, 0 if zero)
positive_values – Array of only the positive values
- combine(indicator, positive_values)
Recombine indicator and positive values.
- Parameters:
indicator (ndarray) – Binary array (1 if should be positive)
positive_values (ndarray) – Array of positive values to fill in
- Return type:
ndarray
- Returns:
Combined array with zeros and positive values
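As a rough illustration of the split/combine round trip, the sketch below uses plain Python lists in place of ndarray. The free functions `split` and `combine` mirror the method names above but are stand-ins, not the class itself.

```python
def split(x):
    """Return (indicator, positive_values) for a list of non-negative values."""
    indicator = [1 if v > 0 else 0 for v in x]
    positive_values = [v for v in x if v > 0]
    return indicator, positive_values


def combine(indicator, positive_values):
    """Place positive values back where the indicator is 1, zeros elsewhere."""
    remaining = iter(positive_values)
    return [next(remaining) if flag else 0.0 for flag in indicator]


x = [0.0, 1200.0, 0.0, 35.5]
indicator, positive_values = split(x)
restored = combine(indicator, positive_values)
```

The positive values keep their original order, so `combine` only needs the indicator to know where each one belongs.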
- class microplex.transforms.LogTransform(offset=0.0)
Log transformation for heavy-tailed distributions.
Most income/expenditure variables are approximately log-normal, so log-transforming before modeling typically yields a more symmetric, roughly Gaussian shape that is easier to fit.
- forward(x)
Apply log transform: log(x + offset).
- Return type:
ndarray
- inverse(y)
Inverse log transform: exp(y) - offset.
- Return type:
ndarray
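The forward and inverse formulas above can be checked with a short round trip. This sketch uses the standard library only; the functions are illustrative stand-ins, and `offset=1.0` is chosen here (an assumption, not the default) so that a zero input stays finite.

```python
import math


def forward(x, offset=0.0):
    """log(x + offset), element-wise; requires x + offset > 0."""
    return [math.log(v + offset) for v in x]


def inverse(y, offset=0.0):
    """exp(y) - offset, element-wise."""
    return [math.exp(v) - offset for v in y]


incomes = [0.0, 1_000.0, 250_000.0]
y = forward(incomes, offset=1.0)
back = inverse(y, offset=1.0)
```

With `offset=0.0` the input must be strictly positive, which is consistent with applying the log only to the positive part of a zero-inflated variable.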
- class microplex.transforms.Standardizer
Standardize data to zero mean and unit variance.
Supports sample weights for proper handling of survey data.
- fit(x, weights=None)
Compute (weighted) mean and standard deviation.
- Parameters:
x (ndarray) – Data array
weights (ndarray | None) – Sample weights (optional)
- Return type:
Self
- Returns:
self
- transform(x)
Standardize: (x - mean) / std.
- Return type:
ndarray
- inverse_transform(y)
Inverse standardize: y * std + mean.
- Return type:
ndarray
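A minimal sketch of weighted standardization follows. It assumes the weighted population variance (weights normalized by their sum, no degrees-of-freedom correction); the reference does not state which estimator the class uses, and the function names are stand-ins.

```python
import math


def fit(x, weights=None):
    """Return the (weighted) mean and standard deviation of x."""
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, x)) / total
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, x)) / total
    return mean, math.sqrt(var)


def transform(x, mean, std):
    return [(v - mean) / std for v in x]


def inverse_transform(y, mean, std):
    return [v * std + mean for v in y]


x = [1.0, 2.0, 3.0, 10.0]
weights = [4.0, 3.0, 2.0, 1.0]
mean, std = fit(x, weights)
z = transform(x, mean, std)
x_back = inverse_transform(z, mean, std)
```

Under this estimator the standardized values have weighted mean 0 and weighted variance 1, rather than unweighted ones.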
- class microplex.transforms.VariableTransformer(zero_inflated=True, log_transform=True, standardize=True)
Complete transformation pipeline for a single variable.
Combines zero-inflation handling, log transform, and standardization. Designed for variables that:
- Are often zero (zero_inflated=True)
- Are heavy-tailed when positive (log_transform=True)
- Need standardization for neural network training
- fit(x, weights=None)
Fit the transformer on training data.
- Parameters:
x (ndarray) – Data array
weights (ndarray | None) – Sample weights
- Return type:
Self
- Returns:
self
- transform(x)
Transform data.
For zero-inflated variables, returns NaN at zero positions (to distinguish from transformed values that equal 0).
- Parameters:
x (ndarray | Tensor) – Data array or tensor
- Return type:
ndarray | Tensor
- Returns:
Transformed data
- inverse_transform(y)
Inverse transform data.
- Parameters:
y (ndarray) – Transformed data (NaN indicates original zeros)
- Return type:
ndarray
- Returns:
Original scale data
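The NaN convention matters because a positive value can legitimately standardize to exactly 0. The sketch below chains the three steps for one variable under simplifying assumptions (unweighted statistics, zero offset, plain lists); the function names are stand-ins for the methods above.

```python
import math


def transform(x, mean, std):
    """Zeros become NaN; positive values are logged, then standardized."""
    return [math.nan if v == 0 else (math.log(v) - mean) / std for v in x]


def inverse_transform(y, mean, std):
    """NaN maps back to 0; everything else is de-standardized and exponentiated."""
    return [0.0 if math.isnan(v) else math.exp(v * std + mean) for v in y]


x = [0.0, 1.0, 100.0, 0.0, 10_000.0]

# Fit on the logs of the positive values only.
logs = [math.log(v) for v in x if v > 0]
mean = sum(logs) / len(logs)
std = math.sqrt(sum((v - mean) ** 2 for v in logs) / len(logs))

y = transform(x, mean, std)
restored = inverse_transform(y, mean, std)
```

Here the value 100 sits at the mean of the logged positives, so it transforms to approximately 0, while the true zeros are carried as NaN and restored exactly.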
- class microplex.transforms.MultiVariableTransformer(var_names, zero_inflated=True, log_transform=True)
Transform multiple variables at once.
Fits separate transformers for each variable, enabling independent handling of different distribution types.
- fit(data, weight_col='weight')
Fit transformers for all variables.
- Parameters:
data (dict[str, ndarray]) – Dict with variable names as keys, arrays as values
weight_col (str) – Name of weight column (optional)
- Return type:
Self
- Returns:
self
- transform(data)
Transform all variables.
- Parameters:
data (dict[str, ndarray]) – Dict with variable arrays
- Return type:
dict[str, ndarray]
- Returns:
Dict with transformed arrays
- inverse_transform(data)
Inverse transform all variables.
- Parameters:
data (dict[str, ndarray]) – Dict with transformed arrays
- Return type:
dict[str, ndarray]
- Returns:
Dict with original scale arrays
Flows
Normalizing flow models for conditional generation.
Implements Conditional Masked Autoregressive Flow (MAF) for learning the joint distribution of tax variables conditioned on demographics.
- class microplex.flows.MADE(n_features, n_context, hidden_dim, n_hidden=2)
Masked Autoencoder for Distribution Estimation (MADE).
Implements autoregressive property: output[i] only depends on input[:i]. Used as the conditioner network in MAF.
- forward(x, context)
Forward pass through MADE.
- Parameters:
x (Tensor) – Input features [batch, n_features]
context (Tensor) – Context features [batch, n_context]
- Returns:
mu – Mean parameters [batch, n_features]
log_scale – Log scale parameters [batch, n_features]
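The autoregressive property is usually enforced with binary weight masks. The sketch below builds masks for a single hidden layer using the standard MADE degree rule and then checks which inputs each output can reach. It is a simplified illustration: the class above has `n_hidden=2` layers and also takes context, which is assumed here to be unmasked, and `made_masks` and `dependencies` are hypothetical helpers.

```python
def made_masks(n_features, n_hidden_units):
    """Build input->hidden and hidden->output masks for one hidden layer."""
    in_deg = list(range(1, n_features + 1))
    # Hidden degrees cycle through 1 .. n_features - 1.
    span = max(1, n_features - 1)
    hid_deg = [1 + (k % span) for k in range(n_hidden_units)]
    out_deg = list(range(1, n_features + 1))
    # Hidden unit k may see input j when hid_deg[k] >= in_deg[j].
    mask_in = [[int(h >= d) for d in in_deg] for h in hid_deg]
    # Output i may see hidden unit k when out_deg[i] > hid_deg[k].
    mask_out = [[int(o > h) for h in hid_deg] for o in out_deg]
    return mask_in, mask_out


def dependencies(mask_in, mask_out):
    """deps[i][j] == 1 when output i is connected to input j via some hidden unit."""
    n_hidden = len(mask_in)
    n_in = len(mask_in[0])
    return [
        [int(any(row_out[k] and mask_in[k][j] for k in range(n_hidden)))
         for j in range(n_in)]
        for row_out in mask_out
    ]


mask_in, mask_out = made_masks(n_features=3, n_hidden_units=4)
deps = dependencies(mask_in, mask_out)
```

The resulting dependency matrix is strictly lower triangular, which is exactly the statement that output[i] depends only on input[:i].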
- class microplex.flows.AffineCouplingLayer(n_features, n_context, hidden_dim)
Affine coupling layer using MADE as the conditioner.
Transform: z = (x - mu(x, context)) / exp(log_scale(x, context))
This is invertible, and because mu and log_scale for dimension i depend only on x[:i], the Jacobian is triangular and its log-determinant is -sum(log_scale).
- forward(x, context)
Forward transformation: x -> z.
- Parameters:
x (Tensor) – Input [batch, n_features]
context (Tensor) – Context [batch, n_context]
- Returns:
z – Transformed output
log_det – Log determinant of Jacobian
- inverse(z, context)
Inverse transformation: z -> x.
Must be done autoregressively since mu, log_scale depend on x.
- Parameters:
z (Tensor) – Latent space input
context (Tensor) – Context features
- Returns:
x – Reconstructed input
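To make the forward and inverse passes concrete, the sketch below uses a toy conditioner in place of MADE. The `conditioner` function is hypothetical; the only property it shares with the real network is that entry i of mu and log_scale depends on x[:i] and the context, which is what the inverse loop relies on.

```python
import math


def conditioner(x, context):
    """Toy autoregressive conditioner: entry i uses only x[:i] and context."""
    mu, log_scale = [], []
    for i in range(len(x)):
        prefix = sum(x[:i])
        mu.append(0.5 * prefix + context)
        log_scale.append(0.1 * math.tanh(prefix))
    return mu, log_scale


def forward(x, context):
    """x -> z in a single pass, plus the log-determinant of the Jacobian."""
    mu, log_scale = conditioner(x, context)
    z = [(xi - m) / math.exp(s) for xi, m, s in zip(x, mu, log_scale)]
    log_det = -sum(log_scale)
    return z, log_det


def inverse(z, context):
    """z -> x one dimension at a time."""
    x = [0.0] * len(z)
    for i in range(len(z)):
        # x[:i] is already final, so mu[i] and log_scale[i] are correct here.
        mu, log_scale = conditioner(x, context)
        x[i] = z[i] * math.exp(log_scale[i]) + mu[i]
    return x


x = [0.3, -1.2, 2.0]
z, log_det = forward(x, context=0.7)
x_rec = inverse(z, context=0.7)
```

The forward pass needs one conditioner call, while the inverse needs one per dimension, which is why sampling from this kind of flow is slower than density evaluation.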
- class microplex.flows.ConditionalMAF(n_features, n_context, n_layers=4, hidden_dim=64)
Conditional Masked Autoregressive Flow.
Stacks multiple affine coupling layers with permutations between them to model complex distributions.
- log_prob(x, context, mask=None, dim_weights=None)
Compute log probability of x given context, with optional masking.
When mask is provided, only computes loss on observed (mask=1) dimensions. This enables training on multi-survey data with missing values.
- Parameters:
x (Tensor) – Data [batch, n_features]
context (Tensor) – Context [batch, n_context]
mask (Tensor) – Optional observation mask [batch, n_features], 1=observed, 0=missing
dim_weights (Tensor) – Optional per-dimension weights [n_features] for balancing sparse observations (inverse frequency weighting)
- Return type:
Tensor
- Returns:
Log probability [batch]
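One way to picture the masked log probability is as a per-dimension sum in which missing dimensions are skipped and the rest are reweighted. The sketch below does this for a single record and a single affine layer with a standard normal base; how the library combines terms across stacked layers is not documented here, so treat `masked_log_prob` as a hypothetical illustration.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)


def masked_log_prob(z, log_scale, mask=None, dim_weights=None):
    """Sum per-dimension log densities over observed dimensions only."""
    n = len(z)
    if mask is None:
        mask = [1] * n
    if dim_weights is None:
        dim_weights = [1.0] * n
    total = 0.0
    for zi, si, mi, wi in zip(z, log_scale, mask, dim_weights):
        if not mi:
            continue  # skip missing entries, so NaN placeholders are harmless
        # log N(zi; 0, 1) plus this dimension's log-Jacobian term (-log_scale).
        per_dim = -0.5 * (zi * zi + LOG_2PI) - si
        total += wi * per_dim
    return total


z = [0.5, -1.0, 2.0]
log_scale = [0.1, 0.0, -0.2]
full = masked_log_prob(z, log_scale)
observed_only = masked_log_prob([0.5, math.nan, 2.0], log_scale, mask=[1, 0, 1])
```

Skipping masked entries, rather than multiplying by zero, avoids NaN propagation when missing values are stored as NaN.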
- sample(context, clip_z=None)
Sample from the flow given context.
- Parameters:
context (Tensor) – Context [batch, n_context]
clip_z (float) – If provided, clip base samples to [-clip_z, clip_z]
- Return type:
Tensor
- Returns:
Samples [batch, n_features]
- fit(X, context, epochs=100, batch_size=256, lr=0.001, weight_decay=1e-05, verbose=True, verbose_freq=10, clip_grad=5.0, device='cpu')
Train the flow on data.
- Parameters:
X (ndarray) – Training data [n_samples, n_features]
context (ndarray) – Conditioning data [n_samples, n_context]
epochs (int) – Number of training epochs
batch_size (int) – Batch size
lr (float) – Learning rate
weight_decay (float) – L2 regularization
verbose (bool) – Print progress
verbose_freq (int) – Print every N epochs
clip_grad (float) – Gradient clipping norm
device (str) – Device to train on
- Returns:
self for chaining
- generate(context, clip_z=3.0, device='cpu')
Generate samples given context (numpy interface).
- Parameters:
context (ndarray) – Conditioning data [n_samples, n_context]
clip_z (float) – Clip base distribution samples to avoid outliers
device (str) – Device to use
- Return type:
ndarray
- Returns:
Generated samples [n_samples, n_features]
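The effect of clip_z in sample and generate can be sketched in a few lines: standard normal base draws are clamped to [-clip_z, clip_z] before being pushed through the inverse flow. The helper `sample_base` is hypothetical and only covers the base-distribution step.

```python
import random


def sample_base(n_features, clip_z=None, seed=None):
    """Draw standard normal base samples, optionally clamped to [-clip_z, clip_z]."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
    if clip_z is not None:
        z = [max(-clip_z, min(clip_z, v)) for v in z]
    return z


clipped = sample_base(10_000, clip_z=3.0, seed=0)
unclipped = sample_base(10_000, clip_z=None, seed=0)
```

Clamping at 3.0 leaves the large majority of draws untouched (a standard normal falls outside ±3 roughly 0.3% of the time) while keeping extreme base values from turning into extreme outputs, especially after an exponential inverse transform.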
Discrete models
Models for discrete/categorical variables.
Handles binary variables (yes/no indicators) and categorical variables (multi-class) separately from continuous variables.
- class microplex.discrete.BinaryModel(n_context, hidden_dim=32)
Model for binary variables (0/1).
Examples: has_income, is_employed, owns_home
- forward(context)
Predict probability of 1 given context.
- Parameters:
context (Tensor) – Conditioning features [batch, n_context]
- Return type:
Tensor
- Returns:
Probability of 1 [batch, 1]
- class microplex.discrete.CategoricalModel(n_context, n_categories, hidden_dim=32)
Model for categorical variables (multiple classes).
Examples: education_level, region, industry
- forward(context)
Predict category probabilities given context.
- Parameters:
context (Tensor) – Conditioning features [batch, n_context]
- Return type:
Tensor
- Returns:
Category probabilities [batch, n_categories]
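The reference only states that these models return probabilities. A common construction, assumed here rather than confirmed, is a sigmoid over one logit for the binary case and a softmax over n_categories logits for the categorical case. The logits below are made-up numbers standing in for network outputs.

```python
import math
import random


def sigmoid(logit):
    """Map a single logit to a probability of 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def softmax(logits):
    """Map logits to category probabilities; subtracting the max avoids overflow."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)

p_one = sigmoid(0.8)                # hypothetical logit for a binary variable
is_one = int(rng.random() < p_one)  # Bernoulli draw

probs = softmax([2.0, 0.5, -1.0])   # hypothetical logits for three categories
category = rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Sampling then reduces to a Bernoulli draw for binary variables and a weighted choice over category indices for categorical ones.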
- class microplex.discrete.DiscreteModelCollection(n_context, binary_vars, categorical_vars, hidden_dim=32)
Collection of discrete variable models.
Manages multiple binary and categorical models for different variables.
- forward(context)
Predict probabilities for all discrete variables.
- Parameters:
context (Tensor) – Conditioning features
- Returns:
Dict of {var_name: probabilities}
- sample(context)
Sample all discrete variables.
- Parameters:
context (Tensor) – Conditioning features
- Returns:
Dict of {var_name: samples}
- log_prob(context, targets)
Compute log probability of discrete variables.
- Parameters:
context (Tensor) – Conditioning features
targets (dict) – Dict of {var_name: values}
- Return type:
Tensor
- Returns:
Total log probability [batch]
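A total log probability over several discrete variables can be sketched as a sum of per-variable terms. This assumes the variables are conditionally independent given the context, which the reference does not state. `discrete_log_prob` and the example probabilities are hypothetical and stand in for the outputs of the binary and categorical models.

```python
import math


def discrete_log_prob(binary_probs, categorical_probs, targets):
    """Sum Bernoulli and categorical log probabilities for one record."""
    total = 0.0
    for name, p in binary_probs.items():
        # Bernoulli: log p if the target is 1, log (1 - p) otherwise.
        total += math.log(p if targets[name] == 1 else 1.0 - p)
    for name, probs in categorical_probs.items():
        # Categorical: log of the probability assigned to the observed class.
        total += math.log(probs[targets[name]])
    return total


binary_probs = {"is_employed": 0.8}
categorical_probs = {"region": [0.5, 0.3, 0.2]}
targets = {"is_employed": 1, "region": 2}

lp = discrete_log_prob(binary_probs, categorical_probs, targets)
```

For this record the result is log(0.8) + log(0.2); negating and averaging it over a batch would give the usual cross-entropy training loss.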