synthetics#
Tumult Synthetics is a differentially private synthetic data generator built on Tumult Analytics.
Caution
The public API of Tumult Synthetics is still under active development, and will change over upcoming releases.
Example
>>> from pyspark.sql import SparkSession, Row
>>> from tmlt.synthetics import (
... generate_synthetic_data,
... Count,
... Sum,
... FixedMarginals,
... ClampingBounds,
... )
>>> from tmlt.analytics import ApproxDPBudget, AddRowsWithID, KeySet, Session
>>> import pandas as pd
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create toy data
>>> data = [
... Row(id=1, state='California', city='San Francisco', gender='Male', age=25, occupation='Engineer', salary=100000),
... Row(id=2, state='California', city='Sunnyvale', gender='Female', age=30, occupation='Teacher', salary=70000),
... Row(id=3, state='Washington', city='Seattle', gender='Other', age=40, occupation='Doctor', salary=150000),
... Row(id=4, state='California', city='San Francisco', gender='Male', age=35, occupation='Unemployed', salary=0),
... ]
>>> df = spark.createDataFrame(data)
>>>
>>> # Define keysets and domain
>>> state_city_df = pd.DataFrame(
... [("California", "San Francisco"), ("California", "Sunnyvale"), ("Washington", "Seattle")],
... columns=["state", "city"]
... )
>>> keyset = (
... KeySet.from_dataframe(spark.createDataFrame(state_city_df))
... * KeySet.from_dict({"gender": ["Male", "Female", "Other"]})
... * KeySet.from_dict({"age": list(range(18, 101))})
... * KeySet.from_dict({"occupation": ["Engineer", "Teacher", "Doctor", "Unemployed"]})
... )
>>> clamping_bounds = ClampingBounds({"salary": (0, 200000)})
>>>
>>> # Define measurement strategies
>>> measurement_strategies = [
... FixedMarginals(marginals=[
... Count(groupby_columns=['state', 'city']),
... Count(groupby_columns=['age']),
... Count(groupby_columns=['gender', 'occupation']),
... Sum(groupby_columns=['state', 'occupation'], measure_column='salary')
... ])
... ]
>>>
>>> # Set up the session
>>> session = (
... Session.Builder()
... .with_privacy_budget(ApproxDPBudget(epsilon=100, delta=0))
... .with_private_dataframe("protected_data", df, AddRowsWithID(id_column="id"))
... .build()
... )
>>>
>>> # Generate synthetic data
>>> synthetic_data = generate_synthetic_data(
... session=session,
... source_id="protected_data",
... keyset=keyset,
... measurement_strategies=measurement_strategies,
... clamping_bounds=clamping_bounds,
... max_rows_per_id=1
... )
>>>
>>> # Access the synthetic data
>>> synthetic_data.show(5)
+----------+-------------+------+---+----------+------------------+
|     state|         city|gender|age|occupation|            salary|
+----------+-------------+------+---+----------+------------------+
|California|San Francisco|Female| 40|   Teacher|27883.202202662666|
|California|San Francisco|  Male| 30|  Engineer|40116.144665117616|
|California|    Sunnyvale|  Male| 35|Unemployed|20954.157816173778|
|Washington|      Seattle| Other| 25|    Doctor| 226830.4953160459|
+----------+-------------+------+---+----------+------------------+
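The keyset in the example is built as a cross product, so the domain grows multiplicatively with each factor. The following is a minimal pure-Python sketch of that arithmetic. It does not use the KeySet API, and the variable names are illustrative only.

```python
from itertools import product

# Factors of the example keyset, written out as plain Python lists.
state_city = [
    ("California", "San Francisco"),
    ("California", "Sunnyvale"),
    ("Washington", "Seattle"),
]
genders = ["Male", "Female", "Other"]
ages = list(range(18, 101))  # 18 through 100 inclusive, 83 values
occupations = ["Engineer", "Teacher", "Doctor", "Unemployed"]

# The cross product pairs every valid (state, city) with every other value.
domain = [
    (state, city, gender, age, occupation)
    for (state, city), gender, age, occupation in product(
        state_city, genders, ages, occupations
    )
]

domain_size = len(domain)
print(domain_size)  # 3 * 3 * 83 * 4 = 2988
```

Because only the listed (state, city) pairs enter the product, combinations such as ("California", "Seattle") never appear in the domain.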
Functions#
- generate_synthetic_data(session, keyset, measurement_strategies, clamping_bounds=None, binning_specs=None, count_structural_zeroes=None, sum_structural_zeroes=None, source_id=None, privacy_budget=None, split_columns=None, max_rows_per_id=None, model_iterations=3000)#
Generate synthetic data using differential privacy techniques.
- Parameters:
session (tmlt.analytics.session.Session) – An Analytics Session to use for computing the aggregates on the private data.
keyset (tmlt.analytics.keyset.KeySet) – Full domain for the synthetic data, before removing structural zeroes.
measurement_strategies (Sequence[tmlt.analytics.synthetics._strategy.MeasurementStrategy]) – Strategies for measuring data aggregates.
clamping_bounds (Optional[tmlt.analytics.synthetics._clamping_bounds.ClampingBounds]) – Clamping bounds for the numeric columns to generate. An alternative to using binning. Note that these columns should not have binning specs or be included in count marginals, but rather should be included in sum marginals.
binning_specs (Optional[Dict[str, tmlt.analytics.binning_spec.BinningSpec]]) – Specifications for how to bin numeric or timestamp/date columns for measurement. Note that these columns should not have clamping bounds or be included in sum marginals, but rather should be included in count marginals. Null values will always be included in the domain for binned columns.
count_structural_zeroes (Optional[List[pyspark.sql.DataFrame]]) – Structural zeroes for the count measurements.
sum_structural_zeroes (Optional[Dict[str, List[pyspark.sql.DataFrame]]]) – Structural zeroes for the sum measurements.
source_id (Optional[str]) – Source ID of the private table in the session to query.
privacy_budget (Optional[tmlt.analytics.privacy_budget.ApproxDPBudget]) – Privacy budget to use for synthetic data generation.
split_columns (Optional[List[str]]) – Columns to split the data into subsets for model fitting.
max_rows_per_id (Optional[int]) – Maximum number of rows per ID in the protected data.
model_iterations (int) – Number of iterations to use for fitting the model.
- Returns:
A DataFrame containing the generated synthetic data.
- Return type:
pyspark.sql.DataFrame
Classes#
AutomaticBounds | Clamping bounds for a magnitude column that are automatically set using DP.
ClampingBounds | A configuration for clamping bounds of each numeric column to generate.
PerGroupClampingBounds | Clamping bounds for a magnitude column specified separately for each group.
KeySetsRepository | A collection of possibly overlapping KeySets.
AdaptiveMarginals | An adaptive workload with multiple cases based on the total noisy count.
FixedMarginals | A fixed workload of weighted count and sum marginals.
MeasurementStrategy | A strategy for measuring differentially private aggregates.
Count | A marginal count.
MarginalAnswer | Output from computing a weighted marginal.
Sum | A marginal sum.
WeightedMarginal | Interface for specifying a weighted DP groupby aggregate.
- class AutomaticBounds(groupby_columns=None, weight=1, low_column_name='low', high_column_name='high')#
Clamping bounds for a magnitude column that are automatically set using DP.
- Parameters:
- property groupby_columns: List[str]#
List of columns that determine the groups.
- Return type:
List[str]
- property low_column_name: str#
Name of the column that contains the lower clamping bounds.
- Return type:
str
- property high_column_name: str#
Name of the column that contains the upper clamping bounds.
- Return type:
str
- __init__(groupby_columns=None, weight=1, low_column_name='low', high_column_name='high')#
Constructor.
- Parameters:
groupby_columns (Optional[List[str]]) – Clamping bounds are automatically discovered for each unique combination of values of these columns.
weight (int) – Relative weight that determines how much privacy budget should be allocated to discover clamping bounds.
low_column_name (str) – (Only applicable if groupby_columns is specified) Name of the column in the result DataFrame that contains the lower clamping bounds. If None, the default name “low” is used.
high_column_name (str) – (Only applicable if groupby_columns is specified) Name of the column in the result DataFrame that contains the upper clamping bounds. If None, the default name “high” is used.
- class ClampingBounds(bounds_per_column, weight=0)#
A configuration for clamping bounds of each numeric column to generate.
Clamping bounds can be provided explicitly or set to be discovered automatically using DP.
- Parameters:
bounds_per_column (Dict[str, Union[Tuple[float, float], PerGroupClampingBounds, AutomaticBounds]])
weight (int)
- property bounds_per_column: Dict[str, Tuple[float, float] | AutomaticBounds | PerGroupClampingBounds]#
Returns the dictionary mapping column names to clamping bounds.
- Return type:
Dict[str, Union[Tuple[float, float], AutomaticBounds, PerGroupClampingBounds]]
- __init__(bounds_per_column, weight=0)#
Constructor.
- Parameters:
bounds_per_column (Dict[str, Union[Tuple[float, float], PerGroupClampingBounds, AutomaticBounds]]) – A dictionary mapping magnitude column names to clamping bounds specified as a tuple of two floats (low and high), a PerGroupClampingBounds object that specifies clamping bounds for each combination of values for some categorical “grouping” columns, or an AutomaticBounds object that specifies that clamping bounds should be automatically discovered from the private data using DP.
weight (int) – Relative weight that determines how much privacy budget should be used to discover clamping bounds for columns with AutomaticBounds.
- __getitem__(column)#
Returns clamping bounds for a column.
If the bounds for column are specified to be distinct per group, this method returns a PerGroupClampingBounds object. Otherwise, it returns a tuple (low, high).
If the bounds for column are set to be automatically discovered, this method raises an error unless the bounds have first been discovered by calling the discover_unknown_bounds method.
- Parameters:
column (str) – The name of the column for which to get the clamping bounds.
- Return type:
Union[Tuple[float, float], PerGroupClampingBounds]
- discover_unknown_bounds(session, source_id, budget, keysets)#
Discover clamping bounds for columns with AutomaticBounds.
- Parameters:
session (tmlt.analytics.session.Session)
source_id (str)
keysets (tmlt.analytics.synthetics._keysets_repository.KeySetsRepository)
- class PerGroupClampingBounds(groupby_columns, dataframe, low_column_name='low', high_column_name='high')#
Clamping bounds for a magnitude column specified separately for each group.
- Parameters:
groupby_columns (List[str])
dataframe (pyspark.sql.DataFrame)
low_column_name (str)
high_column_name (str)
- property groupby_columns: List[str]#
List of columns that determine the groups.
- Return type:
List[str]
- property dataframe: pyspark.sql.DataFrame#
Returns DataFrame with clamping bounds for each group.
- Return type:
pyspark.sql.DataFrame
- property low_column_name: str#
Name of the column that contains the lower clamping bounds.
- Return type:
str
- property high_column_name: str#
Name of the column that contains the upper clamping bounds.
- Return type:
str
- __init__(groupby_columns, dataframe, low_column_name='low', high_column_name='high')#
Constructor.
- Parameters:
groupby_columns (List[str]) – Clamping bounds are specified separately for each unique combination of values of these columns.
dataframe (DataFrame) – DataFrame with clamping bounds for each group.
low_column_name (str) – Name of the column in dataframe that contains the lower clamping bounds.
high_column_name (str) – Name of the column in dataframe that contains the upper clamping bounds.
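To illustrate what per-group bounds mean, here is a minimal pure-Python sketch in which each group has its own (low, high) pair and every value is clamped with the bounds of its group. It does not use PerGroupClampingBounds or Spark; the bounds, the grouping column, and the rows are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-group bounds keyed by the grouping column's value.
# In the real class these would be rows of a Spark DataFrame with
# "low" and "high" columns.
bounds_by_occupation: Dict[str, Tuple[float, float]] = {
    "Engineer": (50_000.0, 200_000.0),
    "Teacher": (30_000.0, 120_000.0),
    "Unemployed": (0.0, 0.0),
}


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


# Hypothetical (occupation, salary) rows.
rows: List[Tuple[str, float]] = [
    ("Engineer", 250_000.0),
    ("Teacher", 70_000.0),
    ("Unemployed", 1_000.0),
]

# Each salary is clamped using the bounds of its own group.
clamped = [
    (occupation, clamp(salary, *bounds_by_occupation[occupation]))
    for occupation, salary in rows
]
print(clamped)
```

Per-group bounds let a narrow range be used where values are known to be small, instead of a single wide range for the whole column.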
- class KeySetsRepository(keysets)#
A collection of possibly overlapping KeySets.
See the module docstring for more information.
- Parameters:
keysets (List[tmlt.analytics.keyset.KeySet])
- property columns: Set[str]#
Set of column names spanned by all KeySets in this repository.
- Return type:
Set[str]
- __init__(keysets)#
Constructor.
- join(dataframe, on)#
Returns a new KeySetsRepository by joining each keyset with dataframe.
- Parameters:
dataframe (pyspark.sql.DataFrame) – DataFrame to join with each keyset.
on (List[str]) – List of columns to join on.
- Return type:
KeySetsRepository
- class AdaptiveMarginals(total_by, total_count_budget_fraction, cases, weight=1, count_column='count')#
Bases:
MeasurementStrategy
An adaptive workload with multiple cases based on the total noisy count.
In particular, this workload specifies different sets of marginals to compute for different subsets of the data based on the size of the subset.
For example, in a dataset of website visits for a month for a set of websites, we might want to compute finer granularity marginals for websites that get a large number of visits while only computing coarser marginals for websites that get relatively fewer visits. We could do this using an AdaptiveMarginals as follows:
>>> adaptive_counts = AdaptiveMarginals(
...     total_by=["website_id"],
...     total_count_budget_fraction=0.15,
...     cases=[
...         AdaptiveMarginals.Case(
...             threshold=1000,
...             marginals=[
...                 Count(["website_id", "day"]),
...                 Count(["website_id", "day", "hour"]),
...             ],
...         ),
...         AdaptiveMarginals.Case(
...             threshold=500,
...             marginals=[
...                 Count(["website_id", "day"]),
...             ],
...         ),
...         AdaptiveMarginals.Case(
...             default=True,
...             marginals=[
...                 Count(["website_id", "week"]),
...             ],
...         ),
...     ],
... )
The adaptive strategy defined in the example above does the following:
- Computes the total count of visits for each website.
- Divides the websites into three cases:
  - Case 1: Websites with more than 1000 visits. Computes the count of visits per day and per hour.
  - Case 2: Websites with between 500 and 1000 visits. Computes the count of visits per day.
  - Default case: Websites with fewer than 500 visits. Computes the count of visits per week.
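The case selection described above can be sketched in plain Python. This is an illustration only, not the library's implementation: it assumes cases are checked from the highest threshold down and that a website must strictly exceed a threshold to match, which the text above does not specify for totals exactly equal to a threshold. The noisy totals are made up.

```python
from typing import Dict, List, Optional, Tuple

# (threshold, description) pairs, ordered from the highest threshold down.
# A threshold of None marks the default case.
CASES: List[Tuple[Optional[float], str]] = [
    (1000, "per-day and per-hour counts"),
    (500, "per-day counts"),
    (None, "per-week counts"),
]


def select_case(noisy_total: float) -> str:
    """Return the first case whose threshold the noisy total exceeds."""
    for threshold, description in CASES:
        if threshold is None or noisy_total > threshold:
            return description
    raise ValueError("no default case defined")


# Hypothetical noisy total visit counts per website.
noisy_totals: Dict[str, float] = {"site_a": 4200.0, "site_b": 730.0, "site_c": 12.0}

assignment = {site: select_case(total) for site, total in noisy_totals.items()}
print(assignment)
```

Because the decision is made on noisy totals, a website whose true count is close to a threshold may land in either neighboring case.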
- Parameters:
total_by (List[str])
total_count_budget_fraction (float)
cases (List[AdaptiveMarginals.Case])
weight (int)
count_column (str)
- class Case(marginals, threshold=None, default=False)#
A case in the adaptive workload.
- Parameters:
marginals (List[tmlt.analytics.synthetics._weighted_marginal.WeightedMarginal])
threshold (Optional[float])
default (bool)
- property weight: int#
Weight for determining privacy budget allocated for this strategy.
- Return type:
int
- __init__(total_by, total_count_budget_fraction, cases, weight=1, count_column='count')#
Constructor.
- compute(session, source_id, budget, keysets, clamping_bounds)#
Compute marginals adaptively based on the total noisy count.
- Parameters:
session (tmlt.analytics.session.Session)
source_id (str)
keysets (tmlt.analytics.synthetics._keysets_repository.KeySetsRepository)
clamping_bounds (tmlt.analytics.synthetics._clamping_bounds.ClampingBounds)
- Return type:
List[tmlt.analytics.synthetics._weighted_marginal.MarginalAnswer]
- class FixedMarginals(marginals, weight=1)#
Bases:
MeasurementStrategy
A fixed workload of weighted count and sum marginals.
- Parameters:
marginals (List[tmlt.analytics.synthetics._weighted_marginal.WeightedMarginal])
weight (int)
- property weight: int#
Weight for determining privacy budget allocated for this strategy.
- Return type:
int
- __init__(marginals, weight=1)#
Constructor.
- Parameters:
marginals (List[WeightedMarginal]) – List of weighted marginals to compute. The fraction of the total privacy budget for this strategy to use for a specific marginal is determined by its relative weight.
weight (int) – Weight for this strategy. When evaluating multiple measurement strategies, the total privacy budget is divided among the strategies in proportion to their weight.
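The two levels of weighting (strategies first, then marginals within a strategy) can be sketched as follows. This is a simplified illustration, not the library's code: it tracks epsilon only, ignores delta and any budget set aside for discovering clamping bounds, and the helper split_by_weight and all names in the dictionaries are hypothetical.

```python
from typing import Dict


def split_by_weight(total: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Divide total among the keys in proportion to their weights."""
    total_weight = sum(weights.values())
    return {name: total * weight / total_weight for name, weight in weights.items()}


# First level: two hypothetical strategies with weights 1 and 3 share epsilon = 8.
strategy_budgets = split_by_weight(8.0, {"strategy_a": 1, "strategy_b": 3})

# Second level: the marginals inside strategy_b share that strategy's budget.
marginal_budgets = split_by_weight(
    strategy_budgets["strategy_b"],
    {"count_by_age": 1, "sum_salary": 2},
)

print(strategy_budgets)  # {'strategy_a': 2.0, 'strategy_b': 6.0}
print(marginal_budgets)  # {'count_by_age': 2.0, 'sum_salary': 4.0}
```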
- compute(session, source_id, budget, keysets, clamping_bounds)#
Returns the results from computing all of the specified marginals.
- Parameters:
session (tmlt.analytics.session.Session) – An Analytics Session to use for computing the marginals.
source_id (str) – Source ID of the private table in the session to query.
budget (tmlt.analytics.privacy_budget.ApproxDPBudget) – Total privacy budget to use across all marginals. This budget is distributed among the marginals in proportion to their weights.
keysets (tmlt.analytics.synthetics._keysets_repository.KeySetsRepository) – Repository of KeySets to use for the marginals.
clamping_bounds (tmlt.analytics.synthetics._clamping_bounds.ClampingBounds) – Clamping bounds to use for answering Sum queries.
- Return type:
List[tmlt.analytics.synthetics._weighted_marginal.MarginalAnswer]
- class MeasurementStrategy#
Bases:
abc.ABC
A strategy for measuring differentially private aggregates.
- abstract property weight: int#
Weight for determining privacy budget allocated for this strategy.
- Return type:
int
- abstract compute(session, source_id, budget, keysets, clamping_bounds)#
Compute answers using this strategy.
- Parameters:
session (tmlt.analytics.session.Session)
source_id (str)
keysets (tmlt.analytics.synthetics._keysets_repository.KeySetsRepository)
clamping_bounds (tmlt.analytics.synthetics._clamping_bounds.ClampingBounds)
- Return type:
List[tmlt.analytics.synthetics._weighted_marginal.MarginalAnswer]
- class Count(groupby_columns, weight=1, count_column='count')#
Bases:
WeightedMarginal
A marginal count.
- __init__(groupby_columns, weight=1, count_column='count')#
Constructor.
- Parameters:
- answer(source_id, keyset, session, budget, clamping_bounds)#
Returns an answer to the Count marginal.
- Parameters:
source_id (str)
keyset (tmlt.analytics.keyset.KeySet)
session (tmlt.analytics.session.Session)
clamping_bounds (tmlt.analytics.synthetics._clamping_bounds.ClampingBounds)
- Return type:
List[MarginalAnswer]
- class MarginalAnswer#
Output from computing a weighted marginal.
- marginal: WeightedMarginal#
The marginal that was computed.
- budget: tmlt.analytics.privacy_budget.ApproxDPBudget#
Privacy budget used to compute the marginal.
- answer: pyspark.sql.DataFrame#
The computed marginal.
- class Sum(groupby_columns, measure_column, weight=1)#
Bases:
WeightedMarginal
A marginal sum.
- property output_column: str#
The name of the column containing the sum of the measure column.
- Return type:
str
- __init__(groupby_columns, measure_column, weight=1)#
Constructor.
Note
The lower and upper clamping bounds determine the sensitivity of the sum query, and therefore the scale of the noise that needs to be added when computing this marginal using some privacy budget. Consequently, the choice of these clamping bounds has a significant impact on the accuracy of the sum query.
- Parameters:
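The trade-off described in the note can be made concrete with a small sketch. It assumes a textbook Laplace mechanism in which each ID contributes at most max_rows_per_id rows, so that the sensitivity of a clamped sum is max_rows_per_id * max(|low|, |high|). The mechanism Tumult Synthetics actually uses is not stated here, and the function names and salary values are illustrative.

```python
from typing import List, Tuple


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def laplace_scale(
    low: float, high: float, epsilon: float, max_rows_per_id: int = 1
) -> float:
    """Noise scale under the assumed Laplace mechanism: sensitivity / epsilon."""
    sensitivity = max_rows_per_id * max(abs(low), abs(high))
    return sensitivity / epsilon


salaries: List[float] = [100_000.0, 70_000.0, 150_000.0, 0.0]
tight: Tuple[float, float] = (0.0, 120_000.0)
wide: Tuple[float, float] = (0.0, 200_000.0)

# Tight bounds distort large values (bias) but need less noise.
clamped_sum_tight = sum(clamp(s, *tight) for s in salaries)
scale_tight = laplace_scale(*tight, epsilon=1.0)

# Wide bounds keep every value intact but need more noise.
clamped_sum_wide = sum(clamp(s, *wide) for s in salaries)
scale_wide = laplace_scale(*wide, epsilon=1.0)

print(clamped_sum_tight, scale_tight)  # 290000.0 120000.0
print(clamped_sum_wide, scale_wide)    # 320000.0 200000.0
```

With the tight bounds, the sum loses 30000 to clamping but the noise scale drops from 200000 to 120000. Choosing bounds means balancing these two sources of error.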
- answer(source_id, keyset, session, budget, clamping_bounds)#
Returns a query object corresponding to the sum to compute.
- Parameters:
source_id (str)
keyset (tmlt.analytics.keyset.KeySet)
session (tmlt.analytics.session.Session)
clamping_bounds (tmlt.analytics.synthetics._clamping_bounds.ClampingBounds)
- Return type:
List[MarginalAnswer]
- class WeightedMarginal(groupby_columns, weight=1.0)#
Bases:
abc.ABC
Interface for specifying a weighted DP groupby aggregate.
The weight determines the fraction of the total privacy budget used for measuring this marginal. Weights are relative: a marginal's share of the budget is its weight divided by the sum of the weights of all marginals being computed.
- abstract property output_column: str#
The name of the output column produced by the marginal.
- Return type:
str
- __init__(groupby_columns, weight=1.0)#
Constructor.
- Parameters:
weight (float) – Relative weight used to determine the fraction of privacy budget to allocate for this marginal when computing multiple marginals with some fixed total budget. For example, when computing two weighted marginals, WM1 with a weight of 1 and WM2 with a weight of 2, using a total privacy budget of epsilon=3, WM1 is computed with an epsilon of 1 and WM2 is computed with an epsilon of 2. Defaults to 1.
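The WM1/WM2 arithmetic above can be checked with a few lines of Python. This is a sketch of the proportional split only. The helper name split is hypothetical, and delta is ignored.

```python
from typing import Dict


def split(total_epsilon: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Give each marginal total_epsilon * weight / (sum of all weights)."""
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight for name, w in weights.items()}


epsilons = split(3.0, {"WM1": 1.0, "WM2": 2.0})
print(epsilons)  # {'WM1': 1.0, 'WM2': 2.0}

# Only the ratio between weights matters: doubling both gives the same split.
same_split = split(3.0, {"WM1": 2.0, "WM2": 4.0})
print(same_split)  # {'WM1': 1.0, 'WM2': 2.0}
```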
- abstract answer(source_id, keyset, session, budget, clamping_bounds)#
Returns the answer to this marginal evaluated using session.
- Parameters:
source_id (str)
keyset (tmlt.analytics.keyset.KeySet)
session (tmlt.analytics.session.Session)
clamping_bounds (tmlt.analytics.synthetics._clamping_bounds.ClampingBounds)
- Return type:
List[MarginalAnswer]