synthetics#

Tumult Synthetics is a differentially private synthetic data generator built on Tumult Analytics.

Caution

The public API of Tumult Synthetics is still under active development and will change in upcoming releases.

Example

>>> from pyspark.sql import SparkSession, Row
>>> from tmlt.synthetics import (
...     generate_synthetic_data,
...     Count,
...     Sum,
...     FixedMarginals,
...     ClampingBounds,
... )
>>> from tmlt.analytics import ApproxDPBudget, AddRowsWithID, KeySet, Session
>>> import pandas as pd
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create toy data
>>> data = [
...     Row(id=1, state='California', city='San Francisco', gender='Male', age=25, occupation='Engineer', salary=100000),
...     Row(id=2, state='California', city='Sunnyvale', gender='Female', age=30, occupation='Teacher', salary=70000),
...     Row(id=3, state='Washington', city='Seattle', gender='Other', age=40, occupation='Doctor', salary=150000),
...     Row(id=4, state='California', city='San Francisco', gender='Male', age=35, occupation='Unemployed', salary=0),
... ]
>>> df = spark.createDataFrame(data)
>>>
>>> # Define keysets and domain
>>> state_city_df = pd.DataFrame(
...     [("California", "San Francisco"), ("California", "Sunnyvale"), ("Washington", "Seattle")],
...     columns=["state", "city"]
... )
>>> keyset = (
...     KeySet.from_dataframe(spark.createDataFrame(state_city_df))
...     * KeySet.from_dict({"gender": ["Male", "Female", "Other"]})
...     * KeySet.from_dict({"age": list(range(18, 101))})
...     * KeySet.from_dict({"occupation": ["Engineer", "Teacher", "Doctor", "Unemployed"]})
... )
>>> clamping_bounds = ClampingBounds({"salary": (0, 200000)})
>>>
>>> # Define measurement strategies
>>> measurement_strategies = [
...     FixedMarginals(marginals=[
...         Count(groupby_columns=['state', 'city']),
...         Count(groupby_columns=['age']),
...         Count(groupby_columns=['gender', 'occupation']),
...         Sum(groupby_columns=['state', 'occupation'], measure_column='salary')
...     ])
... ]
>>>
>>> # Set up the session
>>> session = (
...     Session.Builder()
...     .with_privacy_budget(ApproxDPBudget(epsilon=100, delta=0))
...     .with_private_dataframe("protected_data", df, AddRowsWithID(id_column="id"))
...     .build()
... )
>>>
>>> # Generate synthetic data
>>> synthetic_data = generate_synthetic_data(
...     session=session,
...     source_id="protected_data",
...     keyset=keyset,
...     measurement_strategies=measurement_strategies,
...     clamping_bounds=clamping_bounds,
...     max_rows_per_id=1
... )
>>>
>>> # Access the synthetic data
>>> synthetic_data.show(5)  
+----------+-------------+------+---+----------+------------------+
|     state|         city|gender|age|occupation|            salary|
+----------+-------------+------+---+----------+------------------+
|California|San Francisco|Female| 40|   Teacher|27883.202202662666|
|California|San Francisco|  Male| 30|  Engineer|40116.144665117616|
|California|    Sunnyvale|  Male| 35|Unemployed|20954.157816173778|
|Washington|      Seattle| Other| 25|    Doctor| 226830.4953160459|
+----------+-------------+------+---+----------+------------------+

Functions#

generate_synthetic_data(session, keyset, measurement_strategies, clamping_bounds=None, binning_specs=None, count_structural_zeroes=None, sum_structural_zeroes=None, source_id=None, privacy_budget=None, split_columns=None, max_rows_per_id=None, model_iterations=3000)#

Generate synthetic data using differential privacy techniques.

Parameters:
  • session (tmlt.analytics.session.Session) – An Analytics Session to use for computing the aggregates on the private data.

  • keyset (tmlt.analytics.keyset.KeySet) – Full domain for the synthetic data, before removing structural zeroes.

  • measurement_strategies (Sequence[tmlt.analytics.synthetics._strategy.MeasurementStrategy]) – Strategies for measuring data aggregates.

  • clamping_bounds (Optional[tmlt.analytics.synthetics._clamping_bounds.ClampingBounds]) – Clamping bounds for the numeric columns to generate. An alternative to using binning. Note that these columns should not have binning specs or be included in count marginals, but rather should be included in sum marginals.

  • binning_specs (Optional[Dict[str, tmlt.analytics.binning_spec.BinningSpec]]) – Specifications for how to bin numeric or timestamp/date columns for measurement. Note that these columns should not have clamping bounds or be included in sum marginals, but rather should be included in count marginals. Null values will always be included in the domain for binned columns.

  • count_structural_zeroes (Optional[List[pyspark.sql.DataFrame]]) – Structural zeroes for the count measurements.

  • sum_structural_zeroes (Optional[Dict[str, List[pyspark.sql.DataFrame]]]) – Structural zeroes for the sum measurements.

  • source_id (Optional[str]) – Source ID of the private table in the session to query.

  • privacy_budget (Optional[tmlt.analytics.privacy_budget.ApproxDPBudget]) – Privacy budget to use for synthetic data generation.

  • split_columns (Optional[List[str]]) – Columns to split the data into subsets for model fitting.

  • max_rows_per_id (Optional[int]) – Maximum number of rows per ID in the protected data.

  • model_iterations (int) – Number of iterations to use for fitting the model.

Returns:

A DataFrame containing the generated synthetic data.

Return type:

pyspark.sql.DataFrame
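As the binning_specs and clamping_bounds parameters describe, a numeric column is handled either by binning it (so it can appear in count marginals) or by clamping it (so it can appear in sum marginals). The plain-Python sketch below illustrates the binning idea only; the real configuration goes through tmlt.analytics BinningSpec objects, and the function and bin edges here are purely hypothetical.

```python
def bin_value(value, edges):
    """Return the (low, high) bin containing value, or None if out of range.

    Hypothetical illustration of binning a numeric column so it can be
    treated as categorical and included in count marginals; the actual
    behavior is configured via tmlt.analytics BinningSpec objects.
    """
    for low, high in zip(edges, edges[1:]):
        if low <= value < high:
            return (low, high)
    return None

# Bin ages into a few coarse buckets for use in a count marginal.
age_edges = [18, 30, 40, 50, 101]
bin_value(25, age_edges)
```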

Classes#

AutomaticBounds

Clamping bounds for a magnitude column that are automatically discovered using DP.

ClampingBounds

A configuration for clamping bounds of each numeric column to generate.

PerGroupClampingBounds

Clamping bounds for a magnitude column specified separately for each group.

KeySetsRepository

A collection of possibly overlapping KeySets.

AdaptiveMarginals

An adaptive workload with multiple cases based on the total noisy count.

FixedMarginals

A fixed workload of weighted count and sum marginals.

MeasurementStrategy

A strategy for measuring differentially private aggregates.

Count

A marginal count.

MarginalAnswer

Output from computing a weighted marginal.

Sum

A marginal sum.

WeightedMarginal

Interface for specifying a weighted DP groupby aggregate.

class AutomaticBounds(groupby_columns=None, weight=1, low_column_name='low', high_column_name='high')#

Clamping bounds for a magnitude column that are automatically discovered using DP.

Parameters:
  • groupby_columns (Optional[List[str]])

  • weight (int)

  • low_column_name (str)

  • high_column_name (str)

property groupby_columns: List[str]#

List of columns that determine the groups.

Return type:

List[str]

property low_column_name: str#

Name of the column that contains the lower clamping bounds.

Return type:

str

property high_column_name: str#

Name of the column that contains the upper clamping bounds.

Return type:

str

property weight: int#

How much privacy budget should be used to discover bounds.

Return type:

int

__init__(groupby_columns=None, weight=1, low_column_name='low', high_column_name='high')#

Constructor.

Parameters:
  • groupby_columns (Optional[List[str]]) – Clamping bounds are automatically discovered for each unique combination of values of these columns.

  • weight (int) – Relative weight that determines how much privacy budget should be allocated to discover clamping bounds.

  • low_column_name (str) – (Only applicable if groupby_columns is specified) Name of the column in the result DataFrame that contains the lower clamping bounds. If None, the default name “low” is used.

  • high_column_name (str) – (Only applicable if groupby_columns is specified) Name of the column in the result DataFrame that contains the upper clamping bounds. If None, the default name “high” is used.

class ClampingBounds(bounds_per_column, weight=0)#

A configuration for clamping bounds of each numeric column to generate.

ClampingBounds can be explicitly provided or set to be automatically discovered using DP.

property weight: int#

How much privacy budget should be used to discover bounds.

Return type:

int

property bounds_per_column: Dict[str, Tuple[float, float] | AutomaticBounds | PerGroupClampingBounds]#

Returns the dictionary mapping column names to clamping bounds.

Return type:

Dict[str, Union[Tuple[float, float], AutomaticBounds, PerGroupClampingBounds]]

__init__(bounds_per_column, weight=0)#

Constructor.

Parameters:
  • bounds_per_column (Dict[str, Union[Tuple[float, float], PerGroupClampingBounds, AutomaticBounds]]) – A dictionary mapping magnitude column names to clamping bounds specified as a tuple of two floats (low and high), a PerGroupClampingBounds object that specifies clamping bounds for each combination of values for some categorical “grouping” columns, or an AutomaticBounds object that specifies that clamping bounds should be automatically discovered from the private data using DP.

  • weight (int) – Relative weight that determines how much privacy budget should be used to discover clamping bounds for columns with AutomaticBounds.
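To illustrate what clamping does, the plain-Python sketch below clamps a magnitude column to fixed bounds before summing. This is an illustration of the concept only, not the library's implementation; the salary values are made up.

```python
def clamp(value, low, high):
    # Clamp a single value into [low, high], as is done before a DP sum
    # so that any one row's contribution is bounded.
    return max(low, min(high, value))

salaries = [100000, 70000, 150000, 0, 500000]
low, high = 0, 200000  # e.g. bounds from ClampingBounds({"salary": (0, 200000)})
clamped_sum = sum(clamp(s, low, high) for s in salaries)
```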

__getitem__(column)#

Returns clamping bounds for a column.

If the bounds for column are specified to be distinct per group, this method returns a PerGroupClampingBounds object. Otherwise, it returns a tuple (low, high).

If the bounds for column are set to be automatically discovered, this method will raise an error unless the bounds have been discovered by calling the discover_unknown_bounds method first.

Parameters:

column (str) – The name of the column for which to get the clamping bounds.

Return type:

Union[Tuple[float, float], PerGroupClampingBounds]

discover_unknown_bounds(session, source_id, budget, keysets)#

Discover clamping bounds for columns with AutomaticBounds.

class PerGroupClampingBounds(groupby_columns, dataframe, low_column_name='low', high_column_name='high')#

Clamping bounds for a magnitude column specified separately for each group.

property groupby_columns: List[str]#

List of columns that determine the groups.

Return type:

List[str]

property dataframe: pyspark.sql.DataFrame#

Returns DataFrame with clamping bounds for each group.

Return type:

pyspark.sql.DataFrame

property low_column_name: str#

Name of the column that contains the lower clamping bounds.

Return type:

str

property high_column_name: str#

Name of the column that contains the upper clamping bounds.

Return type:

str

__init__(groupby_columns, dataframe, low_column_name='low', high_column_name='high')#

Constructor.

Parameters:
  • groupby_columns (List[str]) – Clamping bounds are specified separately for each unique combination of values of these columns.

  • dataframe (DataFrame) – DataFrame with clamping bounds for each group.

  • low_column_name (str) – Name of the column in dataframe that contains the lower clamping bounds.

  • high_column_name (str) – Name of the column in dataframe that contains the upper clamping bounds.
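PerGroupClampingBounds expects a Spark DataFrame of one row per group. The stdlib sketch below only shows the expected row shape and a per-group lookup, using the default "low"/"high" column names; the occupations and bounds are made-up examples, and in practice this table would be a Spark DataFrame passed to the constructor.

```python
# Hypothetical per-occupation salary bounds, one row per group, with the
# default "low"/"high" column names expected by PerGroupClampingBounds.
bounds_rows = [
    {"occupation": "Engineer", "low": 0.0, "high": 250000.0},
    {"occupation": "Teacher", "low": 0.0, "high": 120000.0},
]

def bounds_for_group(rows, occupation):
    # Return (low, high) for a group, mirroring a per-group bounds lookup.
    for row in rows:
        if row["occupation"] == occupation:
            return (row["low"], row["high"])
    raise KeyError(occupation)

bounds_for_group(bounds_rows, "Teacher")
```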

class KeySetsRepository(keysets)#

A collection of possibly overlapping KeySets.

See the module docstring for more information.

Parameters:

keysets (List[tmlt.analytics.keyset.KeySet])

property columns: Set[str]#

Set of column names spanned by all KeySets in this repository.

Return type:

Set[str]

__init__(keysets)#

Constructor.

Parameters:

keysets (List[KeySet]) – The KeySets to include in the repository.

join(dataframe, on)#

Returns a new KeySetsRepository by joining each keyset with dataframe.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – DataFrame to join with each keyset.

  • on (List[str]) – List of columns to join on.

Return type:

KeySetsRepository

__getitem__(columns)#

Returns a maximal KeySet with the given list of columns.

Parameters:

columns (List[str]) – List of columns that the KeySet should contain.

Return type:

tmlt.analytics.keyset.KeySet

class AdaptiveMarginals(total_by, total_count_budget_fraction, cases, weight=1, count_column='count')#

Bases: MeasurementStrategy

An adaptive workload with multiple cases based on the total noisy count.

In particular, this workload specifies different sets of marginals to compute for different subsets of the data based on the size of the subset.

For example, in a dataset of website visits for a month for a set of websites, we might want to compute finer granularity marginals for websites that get a large number of visits while only computing coarser marginals for websites that get relatively fewer visits. We could do this using an AdaptiveMarginals as follows:

>>> adaptive_counts = AdaptiveMarginals(
...     total_by=["website_id"],
...     total_count_budget_fraction=0.15,
...     cases=[
...         AdaptiveMarginals.Case(
...             threshold=1000,
...             marginals=[
...                 Count(["website_id", "day"]),
...                 Count(["website_id", "day", "hour"]),
...             ],
...         ),
...         AdaptiveMarginals.Case(
...             threshold=500,
...             marginals=[
...                 Count(["website_id", "day"]),
...             ],
...         ),
...         AdaptiveMarginals.Case(
...             default=True,
...             marginals=[
...                 Count(["website_id", "week"]),
...             ],
...         ),
...     ],
... )

The adaptive strategy defined in the example above does the following:

  • Computes the total count of visits for each website.

  • Divides the websites into three cases:

    • Case 1: Websites with more than 1000 visits.

      • Computes the count of visits per day and per hour.

    • Case 2: Websites with between 500 and 1000 visits.

      • Computes the count of visits per day.

    • Default case: Websites with fewer than 500 visits.

      • Computes the count of visits per week.
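The case-selection logic above can be sketched in plain Python: cases are checked from the largest threshold down, the first threshold exceeded by the noisy total wins, and anything else falls through to the default case. This is an illustration of the idea only, not the library's implementation, and the case names are hypothetical.

```python
def select_case(noisy_total, cases):
    # cases: (threshold, name) pairs ordered from largest threshold down;
    # the first threshold exceeded wins, otherwise the default case applies.
    for threshold, name in cases:
        if noisy_total > threshold:
            return name
    return "default"

# Mirrors the AdaptiveMarginals example above (names illustrative only).
website_cases = [(1000, "per-day-and-hour"), (500, "per-day")]
```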

class Case(marginals, threshold=None, default=False)#

A case in the adaptive workload.

property weight: int#

Weight for determining privacy budget allocated for this strategy.

Return type:

int

__init__(total_by, total_count_budget_fraction, cases, weight=1, count_column='count')#

Constructor.

compute(session, source_id, budget, keysets, clamping_bounds)#

Compute marginals adaptively based on the total noisy count.

Return type:

List[tmlt.analytics.synthetics._weighted_marginal.MarginalAnswer]

class FixedMarginals(marginals, weight=1)#

Bases: MeasurementStrategy

A fixed workload of weighted count and sum marginals.

property weight: int#

Weight for determining privacy budget allocated for this strategy.

Return type:

int

__init__(marginals, weight=1)#

Constructor.

Parameters:
  • marginals (List[WeightedMarginal]) – List of weighted marginals to compute. The fraction of the total privacy budget for this strategy to use for a specific marginal is determined by its relative weight.

  • weight (int) – Weight for this strategy. When evaluating multiple measurement strategies, the total privacy budget is divided among the strategies in proportion to their weight.

compute(session, source_id, budget, keysets, clamping_bounds)#

Returns the results from computing all of the specified marginals.

Return type:

List[tmlt.analytics.synthetics._weighted_marginal.MarginalAnswer]

class MeasurementStrategy#

Bases: abc.ABC

A strategy for measuring differentially private aggregates.

abstract property weight: int#

Weight for determining privacy budget allocated for this strategy.

Return type:

int

abstract compute(session, source_id, budget, keysets, clamping_bounds)#

Compute answers using this strategy.

Return type:

List[tmlt.analytics.synthetics._weighted_marginal.MarginalAnswer]

class Count(groupby_columns, weight=1, count_column='count')#

Bases: WeightedMarginal

A marginal count.

Parameters:
  • groupby_columns (List[str])

  • weight (int)

  • count_column (str)

property output_column: str#

The name of the count column.

Return type:

str

__init__(groupby_columns, weight=1, count_column='count')#

Constructor.

Parameters:
  • groupby_columns (List[str]) – List of columns to group by.

  • weight (int) – Relative weight used to determine the fraction of privacy budget to allocate for this marginal when computing multiple marginals with some fixed total budget. Defaults to 1.

  • count_column (str) – What the output column containing counts should be named. Defaults to “count”.
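A Count marginal is the DP analogue of an ordinary groupby count. The stdlib sketch below shows the non-private computation that the marginal approximates before noise is added; the rows are made-up examples.

```python
from collections import Counter

rows = [
    ("California", "San Francisco"),
    ("California", "Sunnyvale"),
    ("California", "San Francisco"),
    ("Washington", "Seattle"),
]
# Non-private analogue of Count(groupby_columns=["state", "city"]):
# one count per unique combination of the groupby columns.
exact_counts = Counter(rows)
```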

answer(source_id, keyset, session, budget, clamping_bounds)#

Returns an answer to the Count marginal.

Return type:

List[MarginalAnswer]

class MarginalAnswer#

Output from computing a weighted marginal.

marginal: WeightedMarginal#

The marginal that was computed.

budget: tmlt.analytics.privacy_budget.ApproxDPBudget#

Privacy budget used to compute the marginal.

noise_scale: float#

Scale of the noise added to the marginal.

answer: pyspark.sql.DataFrame#

The computed marginal.

class Sum(groupby_columns, measure_column, weight=1)#

Bases: WeightedMarginal

A marginal sum.

Parameters:
  • groupby_columns (List[str])

  • measure_column (str)

  • weight (int)

property output_column: str#

The name of the column containing the sum of the measure column.

Return type:

str

__init__(groupby_columns, measure_column, weight=1)#

Constructor.

Note

The lower and upper clamping bounds determine the sensitivity of the sum query, and therefore the scale of the noise that needs to be added when computing this marginal using some privacy budget. Consequently, the choice of these clamping bounds has a significant impact on the accuracy of the sum query.
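To make the note above concrete: under add/remove-one-row neighbors, a sum clamped to [low, high] changes by at most max(|low|, |high|), so a Laplace mechanism would use a noise scale proportional to that bound divided by the allocated epsilon. The sketch below illustrates this relationship only; Tumult's actual mechanisms and privacy accounting may differ.

```python
def sum_noise_scale(low, high, epsilon):
    # Sensitivity of a clamped sum when one row is added or removed is
    # max(|low|, |high|); the Laplace noise scale is sensitivity / epsilon.
    sensitivity = max(abs(low), abs(high))
    return sensitivity / epsilon

# Wider clamping bounds mean proportionally more noise for the same budget.
sum_noise_scale(0, 200000, 1.0)
```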

Parameters:
  • groupby_columns (List[str]) – List of columns to group by.

  • measure_column (str) – The column to sum.

  • weight (int) – Relative weight used to determine the fraction of privacy budget to allocate for this marginal when computing multiple marginals with some fixed total budget. Defaults to 1.

answer(source_id, keyset, session, budget, clamping_bounds)#

Returns an answer to the Sum marginal.

Return type:

List[MarginalAnswer]

class WeightedMarginal(groupby_columns, weight=1.0)#

Bases: abc.ABC

Interface for specifying a weighted DP groupby aggregate.

The weight is used to determine the fraction of the total privacy budget to be used for measuring this marginal. The weight is relative to the weights of the other marginals.

Parameters:
  • groupby_columns (List[str])

  • weight (float)

abstract property output_column: str#

The name of the output column produced by the marginal.

Return type:

str

__init__(groupby_columns, weight=1.0)#

Constructor.

Parameters:
  • groupby_columns (List[str]) – List of columns to group by.

  • weight (float) – Relative weight used to determine the fraction of privacy budget to allocate for this marginal when computing multiple marginals with some fixed total budget. For example, when computing two weighted marginals, WM1 with a weight of 1 and WM2 with a weight of 2, using a total privacy budget of epsilon=3, WM1 is computed with an epsilon of 1 and WM2 is computed with an epsilon of 2. Defaults to 1.
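The weight arithmetic described above can be sketched directly; this is a plain illustration of proportional budget splitting, not Tumult's accounting code, and the marginal names are taken from the example in the parameter description.

```python
def split_budget(total_epsilon, weights):
    # Divide a total epsilon among marginals in proportion to their weights.
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight for name, w in weights.items()}

# Matches the example above: WM1 (weight 1) and WM2 (weight 2) under epsilon=3.
split_budget(3.0, {"WM1": 1, "WM2": 2})
```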

abstract answer(source_id, keyset, session, budget, clamping_bounds)#

Returns the answer to this marginal evaluated using session.

Return type:

List[MarginalAnswer]