Tumult Synthetics API reference
Note
The features described in this page are only available on a paid version of the Tumult Platform. If you would like to hear more, please contact us at info@tmlt.io.
Tumult Synthetics is a differentially private synthetic data generator. It
relies on Tumult Analytics to specify privacy guarantees, and provides the
tmlt.synthetics.generate_synthetic_data() function to perform the synthetic
data generation.
Caution
The public API of Tumult Synthetics is still under active development, and will change over upcoming releases.
Below is a minimal example of synthetic data generation. A more detailed introduction to Tumult Synthetics can be found in the tutorial and in the optimization guide.
>>> from pyspark.sql import SparkSession, Row
>>> from tmlt.synthetics import (
...     generate_synthetic_data,
...     Count,
...     Sum,
...     FixedMarginals,
...     ClampingBounds,
... )
>>> from tmlt.analytics import ApproxDPBudget, AddRowsWithID, KeySet, Session
>>> import pandas as pd
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create toy data
>>> data = [
...     Row(id=1, state='California', city='San Francisco', gender='Male', age=25, occupation='Engineer', salary=100000),
...     Row(id=2, state='California', city='Sunnyvale', gender='Female', age=30, occupation='Teacher', salary=70000),
...     Row(id=3, state='Washington', city='Seattle', gender='Other', age=40, occupation='Doctor', salary=150000),
...     Row(id=4, state='California', city='San Francisco', gender='Male', age=35, occupation='Unemployed', salary=0),
... ]
>>> df = spark.createDataFrame(data)
>>>
>>> # Define keysets and domain
>>> state_city_df = pd.DataFrame(
...     [("California", "San Francisco"), ("California", "Sunnyvale"), ("Washington", "Seattle")],
...     columns=["state", "city"]
... )
>>> keyset = (
...     KeySet.from_dataframe(spark.createDataFrame(state_city_df))
...     * KeySet.from_dict({"gender": ["Male", "Female", "Other"]})
...     * KeySet.from_dict({"age": list(range(18, 101))})
...     * KeySet.from_dict({"occupation": ["Engineer", "Teacher", "Doctor", "Unemployed"]})
... )
>>> clamping_bounds = ClampingBounds({"salary": (0, 200000)})
>>>
>>> # Define measurement strategies
>>> measurement_strategies = [
...     FixedMarginals(marginals=[
...         Count(groupby_columns=['state', 'city']),
...         Count(groupby_columns=['age']),
...         Count(groupby_columns=['gender', 'occupation']),
...         Sum(groupby_columns=['state', 'occupation'], measure_column='salary')
...     ])
... ]
>>>
>>> # Set up the session
>>> session = (
...     Session.Builder()
...     .with_privacy_budget(ApproxDPBudget(epsilon=100, delta=0))
...     .with_private_dataframe("protected_data", df, AddRowsWithID(id_column="id"))
...     .build()
... )
>>>
>>> # Generate synthetic data
>>> synthetic_data = generate_synthetic_data(
...     session=session,
...     source_id="protected_data",
...     keyset=keyset,
...     measurement_strategies=measurement_strategies,
...     clamping_bounds=clamping_bounds,
...     max_rows_per_id=1
... )
>>>
>>> # Access the synthetic data
>>> synthetic_data.show(5)
+----------+-------------+------+---+----------+------------------+
|     state|         city|gender|age|occupation|            salary|
+----------+-------------+------+---+----------+------------------+
|California|San Francisco|Female| 40|   Teacher|27883.202202662666|
|California|San Francisco|  Male| 30|  Engineer|40116.144665117616|
|California|    Sunnyvale|  Male| 35|Unemployed|20954.157816173778|
|Washington|      Seattle| Other| 25|    Doctor| 226830.4953160459|
+----------+-------------+------+---+----------+------------------+
Specifying metadata
Before synthetic data generation, users must specify metadata about the input
data: possible values for each column using the KeySet or BinningSpec classes,
and clamping bounds for numeric columns using the classes below.
| ClampingBounds | A configuration for clamping bounds of each numeric column to generate. |
| | Clamping bounds for a magnitude column that are set automatically using DP. |
| | Clamping bounds for a magnitude column, specified separately for each group. |
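To make the two kinds of metadata concrete, the following is a minimal plain-Python sketch. It does not call the Tumult Synthetics or Tumult Analytics APIs; the names `domain`, `clamp`, and `salary_bounds` are hypothetical and used only for illustration. It assumes that multiplying key sets behaves like a cross product of their allowed values, and that clamping forces each numeric value into the interval given by its bounds.

```python
from itertools import product

# Allowed (state, city) pairs, copied from the example above.
state_city = [
    ("California", "San Francisco"),
    ("California", "Sunnyvale"),
    ("Washington", "Seattle"),
]
genders = ["Male", "Female", "Other"]
ages = list(range(18, 101))  # 18 through 100 inclusive
occupations = ["Engineer", "Teacher", "Doctor", "Unemployed"]

# Assumption: the combined key set is the cross product of its factors.
# Only listed (state, city) pairs appear, so ("California", "Seattle")
# is not part of the domain.
domain = [
    (state, city, gender, age, occupation)
    for (state, city), gender, age, occupation in product(
        state_city, genders, ages, occupations
    )
]
domain_size = len(domain)


def clamp(value, lower, upper):
    """Force value into the closed interval [lower, upper]."""
    return max(lower, min(value, upper))


# Same bounds as the salary column in the example above.
salary_bounds = (0, 200000)
clamped = [clamp(s, *salary_bounds) for s in (-500, 70000, 250000)]

print(domain_size)  # 3 * 3 * 83 * 4 = 2988
print(clamped)      # [0, 70000, 200000]
```

Under these assumptions, the example key set describes 2988 possible combinations of categorical values, which may help when judging how finely a domain can be specified for a given amount of data.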
Specifying the strategy
An important parameter in synthetic data generation is the measurement strategy: which statistics are measured on the sensitive data using differential privacy, and are then used as input to the model. The classes below are used to define this strategy.
| | A strategy for measuring differentially private aggregates. |
| FixedMarginals | A fixed workload of weighted count and sum marginals. |
| | An adaptive workload with multiple cases based on the total noisy count. |
| | Interface for specifying a weighted DP groupby aggregate. |
| Count | A marginal count. |
| Sum | A marginal sum. |
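As a conceptual illustration of what a marginal count and a marginal sum measure, here is a plain-Python sketch on the toy records from the example above. It is not how Tumult Synthetics computes its measurements: the helper names (`marginal_count`, `marginal_sum`, `laplace_noise`), the choice of Laplace noise, and the `epsilon` value are all assumptions made for illustration, and Python's `random` module is not suitable for real privacy guarantees.

```python
import random
from collections import Counter, defaultdict

# Toy records copied from the example above (id, gender, and age omitted).
rows = [
    {"state": "California", "city": "San Francisco", "occupation": "Engineer", "salary": 100000},
    {"state": "California", "city": "Sunnyvale", "occupation": "Teacher", "salary": 70000},
    {"state": "Washington", "city": "Seattle", "occupation": "Doctor", "salary": 150000},
    {"state": "California", "city": "San Francisco", "occupation": "Unemployed", "salary": 0},
]


def marginal_count(records, groupby_columns):
    """Exact number of records for each combination of groupby values."""
    return Counter(tuple(r[c] for c in groupby_columns) for r in records)


def marginal_sum(records, groupby_columns, measure_column, lower, upper):
    """Exact sum of the clamped measure column for each group."""
    totals = defaultdict(float)
    for r in records:
        key = tuple(r[c] for c in groupby_columns)
        totals[key] += max(lower, min(r[measure_column], upper))
    return dict(totals)


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as a difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


counts = marginal_count(rows, ["state", "city"])
sums = marginal_sum(rows, ["state", "occupation"], "salary", 0, 200000)

# Hypothetical noise step: one record changes a count by at most 1,
# so the noise scale here is 1 / epsilon. A real mechanism would also
# add noise to groups in the key set that have no records.
rng = random.Random(0)
epsilon = 1.0  # hypothetical budget share for this single marginal
noisy_counts = {k: v + laplace_noise(1 / epsilon, rng) for k, v in counts.items()}

print(counts[("California", "San Francisco")])  # 2
print(sums[("Washington", "Doctor")])           # 150000.0
```

The exact marginals are what the strategy asks to be measured; only noisy versions of them would be released and used as model input.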
Generating the data
Once the metadata and the strategy have been defined, they are passed to the
generate_synthetic_data() function, which then produces the synthetic data.
| generate_synthetic_data | Generates and returns differentially private synthetic data. |
Internal classes
These classes should not be used directly.
| | A collection of possibly overlapping KeySets. |
| | Output from computing a weighted marginal. |