Tumult Synthetics API reference

Note

The features described on this page are only available in a paid version of the Tumult Platform. To learn more, please contact us at info@tmlt.io.

Tumult Synthetics is a differentially private synthetic data generator. It relies on Tumult Analytics to specify privacy guarantees, and provides the tmlt.synthetics.generate_synthetic_data() function to generate the synthetic data.

Caution

The public API of Tumult Synthetics is still under active development, and will change over upcoming releases.

Below is a minimal example of synthetic data generation. A more detailed introduction to Tumult Synthetics can be found in the tutorial and in the optimization guide.

>>> from pyspark.sql import SparkSession, Row
>>> from tmlt.synthetics import (
...     generate_synthetic_data,
...     Count,
...     Sum,
...     FixedMarginals,
...     ClampingBounds,
... )
>>> from tmlt.analytics import ApproxDPBudget, AddRowsWithID, KeySet, Session
>>> import pandas as pd
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create toy data
>>> data = [
...     Row(id=1, state='California', city='San Francisco', gender='Male', age=25, occupation='Engineer', salary=100000),
...     Row(id=2, state='California', city='Sunnyvale', gender='Female', age=30, occupation='Teacher', salary=70000),
...     Row(id=3, state='Washington', city='Seattle', gender='Other', age=40, occupation='Doctor', salary=150000),
...     Row(id=4, state='California', city='San Francisco', gender='Male', age=35, occupation='Unemployed', salary=0),
... ]
>>> df = spark.createDataFrame(data)
>>>
>>> # Define keysets and domain
>>> state_city_df = pd.DataFrame(
...     [("California", "San Francisco"), ("California", "Sunnyvale"), ("Washington", "Seattle")],
...     columns=["state", "city"]
... )
>>> keyset = (
...     KeySet.from_dataframe(spark.createDataFrame(state_city_df))
...     * KeySet.from_dict({"gender": ["Male", "Female", "Other"]})
...     * KeySet.from_dict({"age": list(range(18, 101))})
...     * KeySet.from_dict({"occupation": ["Engineer", "Teacher", "Doctor", "Unemployed"]})
... )
>>> clamping_bounds = ClampingBounds({"salary": (0, 200000)})
>>>
>>> # Define measurement strategies
>>> measurement_strategies = [
...     FixedMarginals(marginals=[
...         Count(groupby_columns=['state', 'city']),
...         Count(groupby_columns=['age']),
...         Count(groupby_columns=['gender', 'occupation']),
...         Sum(groupby_columns=['state', 'occupation'], measure_column='salary')
...     ])
... ]
>>>
>>> # Set up the session
>>> session = (
...     Session.Builder()
...     .with_privacy_budget(ApproxDPBudget(epsilon=100, delta=0))
...     .with_private_dataframe("protected_data", df, AddRowsWithID(id_column="id"))
...     .build()
... )
>>>
>>> # Generate synthetic data
>>> synthetic_data = generate_synthetic_data(
...     session=session,
...     source_id="protected_data",
...     keyset=keyset,
...     measurement_strategies=measurement_strategies,
...     clamping_bounds=clamping_bounds,
...     max_rows_per_id=1
... )
>>>
>>> # Access the synthetic data
>>> synthetic_data.show(5)  
+----------+-------------+------+---+----------+------------------+
|     state|         city|gender|age|occupation|            salary|
+----------+-------------+------+---+----------+------------------+
|California|San Francisco|Female| 40|   Teacher|27883.202202662666|
|California|San Francisco|  Male| 30|  Engineer|40116.144665117616|
|California|    Sunnyvale|  Male| 35|Unemployed|20954.157816173778|
|Washington|      Seattle| Other| 25|    Doctor| 226830.4953160459|
+----------+-------------+------+---+----------+------------------+

Specifying metadata

Before generating synthetic data, users must specify metadata about the input data: the possible values of each column, using the KeySet or BinningSpec classes, and clamping bounds for numeric columns, using the classes below.

ClampingBounds(bounds_per_column[, weight])

A configuration for clamping bounds of each numeric column to generate.

AutomaticBounds([groupby_columns, weight, ...])

Clamping bounds for a magnitude column are automatically set using DP.

PerGroupClampingBounds(groupby_columns, ...)

Clamping bounds for a magnitude column specified separately for each group.
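
To make the two kinds of metadata concrete, here is a plain-Python sketch that does not use Tumult Synthetics at all. It rebuilds the domain described by the keyset in the example at the top of this page as a Cartesian product, and shows what clamping a value to fixed bounds means. The helper clamp() is hypothetical and only illustrates the idea; it is an assumption that out-of-range values are clipped to the nearest bound, and nothing here reflects how KeySet or ClampingBounds are implemented.

```python
from itertools import product

# Valid (state, city) pairs, as in the example keyset.
state_city = [
    ("California", "San Francisco"),
    ("California", "Sunnyvale"),
    ("Washington", "Seattle"),
]
genders = ["Male", "Female", "Other"]
ages = list(range(18, 101))  # 18 through 100 inclusive: 83 values
occupations = ["Engineer", "Teacher", "Doctor", "Unemployed"]

# Multiplying keysets corresponds to taking a Cartesian product of their keys.
domain = [
    (state, city, gender, age, occupation)
    for (state, city), gender, age, occupation in product(
        state_city, genders, ages, occupations
    )
]
print(len(domain))  # 3 * 3 * 83 * 4 = 2988


def clamp(value, lower, upper):
    """Hypothetical helper: clip a value into the range [lower, upper]."""
    return max(lower, min(upper, value))


# Same bounds as ClampingBounds({"salary": (0, 200000)}) in the example.
salary_bounds = (0, 200000)
clamped = [clamp(s, *salary_bounds) for s in [-500, 70000, 250000]]
print(clamped)  # [0, 70000, 200000]
```

Because state and city come from a single list of pairs rather than two independent lists, combinations such as ("Washington", "Sunnyvale") never enter the domain.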

Specifying the strategy

An important parameter in synthetic data generation is the measurement strategy: which statistics are measured on the sensitive data using differential privacy, and are then used as input to the model. The classes below are used to define this strategy.

MeasurementStrategy()

A strategy for measuring differentially private aggregates.

FixedMarginals(marginals[, weight])

A fixed workload of weighted count and sum marginals.

AdaptiveMarginals(total_by, ...[, weight, ...])

An adaptive workload with multiple cases based on the total noisy count.

WeightedMarginal(groupby_columns[, weight])

Interface for specifying a weighted DP groupby aggregate.

Count(groupby_columns[, weight, count_column])

A marginal count.

Sum(groupby_columns, measure_column[, weight])

A marginal sum.
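
As an illustration of what count and sum marginals measure, the sketch below computes them exactly, without any noise, on a reduced version of the toy data from the example at the top of this page. It uses only the Python standard library. The functions count_marginal() and sum_marginal() are hypothetical names introduced for this illustration; Tumult Synthetics computes such aggregates with differential privacy, and the mechanism it uses is not shown here.

```python
from collections import Counter, defaultdict

# A subset of the columns from the toy data in the example above.
rows = [
    {"state": "California", "city": "San Francisco", "occupation": "Engineer", "salary": 100000},
    {"state": "California", "city": "Sunnyvale", "occupation": "Teacher", "salary": 70000},
    {"state": "Washington", "city": "Seattle", "occupation": "Doctor", "salary": 150000},
    {"state": "California", "city": "San Francisco", "occupation": "Unemployed", "salary": 0},
]


def count_marginal(rows, groupby_columns):
    """Hypothetical helper: number of rows per combination of group-by values."""
    return Counter(tuple(row[c] for c in groupby_columns) for row in rows)


def sum_marginal(rows, groupby_columns, measure_column, lower, upper):
    """Hypothetical helper: per-group sum of a measure column clamped to [lower, upper]."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[c] for c in groupby_columns)
        totals[key] += max(lower, min(upper, row[measure_column]))
    return dict(totals)


# Non-private analogue of Count(groupby_columns=['state', 'city']).
counts = count_marginal(rows, ["state", "city"])
print(counts[("California", "San Francisco")])  # 2

# Non-private analogue of
# Sum(groupby_columns=['state', 'occupation'], measure_column='salary').
sums = sum_marginal(rows, ["state", "occupation"], "salary", 0, 200000)
print(sums[("California", "Engineer")])  # 100000.0
```

The differentially private versions of these aggregates return noisy values, so results measured on real data will generally differ from the exact numbers printed here.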

Generating the data

Once the metadata and the strategy have been defined, they are passed to the generate_synthetic_data() function, which produces the synthetic data.

generate_synthetic_data(session, keyset, ...)

Generates and returns differentially private synthetic data.

Internal classes

These classes should not be used directly.

KeySetsRepository

A collection of possibly overlapping KeySets.

MarginalAnswer

Output from computing a weighted marginal.