generate_synthetic_data

from tmlt.synthetics import generate_synthetic_data
tmlt.synthetics.generate_synthetic_data(session, keyset, measurement_strategies, clamping_bounds=None, binning_specs=None, sum_structural_zeroes=None, source_id=None, privacy_budget=None, split_columns=None, max_rows_per_id=None, model_iterations=3000, max_rows_per_batch=None)

Generates and returns differentially private synthetic data.

Parameters:
  • session (Session) – The Session that defines the privacy guarantees applied to the sensitive data and is used to compute the DP aggregates.

  • keyset (KeySet) – Full domain for the synthetic data, before removing structural zeroes.

  • measurement_strategies (Sequence[MeasurementStrategy]) – Strategies for measuring data aggregates.

  • clamping_bounds (Optional[ClampingBounds]) – Clamping bounds for the numeric columns to generate, as an alternative to binning. Columns with clamping bounds should not also have binning specs or be included in count marginals; include them in sum marginals instead.

  • binning_specs (Optional[Dict[str, BinningSpec]]) – Specifications for how to bin numeric or timestamp/date columns for measurement. Binned columns should not also have clamping bounds or be included in sum marginals; include them in count marginals instead. Null values are always included in the domain of binned columns.

  • sum_structural_zeroes (Optional[Dict[str, List[DataFrame]]]) – Structural zeroes for the sum measurements.

  • source_id (Optional[str]) – Source ID of the private table in the session to query.

  • privacy_budget (Optional[ApproxDPBudget]) – Privacy budget to use for synthetic data generation.

  • split_columns (Optional[List[str]]) – Columns used to split the data into subsets for model fitting.

  • max_rows_per_id (Optional[int]) – Maximum number of rows per ID in the protected data.

  • model_iterations (int) – Number of iterations to use for fitting the model.

  • max_rows_per_batch (Optional[int]) – Maximum number of rows per batch when generating synthetic data. Lower values use less memory but may be slower, degrade utility, or both. Defaults to 1,000,000.
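
The constraints described for clamping_bounds and binning_specs can be checked before calling the function. The following is a minimal sketch, not part of tmlt.synthetics: check_column_roles is a hypothetical helper, the column names are illustrative, and it assumes you have already collected the relevant column names as plain strings. How to extract those names from your own ClampingBounds and measurement strategy objects is left out here.

```python
from typing import Iterable, List


def check_column_roles(
    clamped: Iterable[str],
    binned: Iterable[str],
    count_marginal_columns: Iterable[str],
    sum_marginal_columns: Iterable[str],
) -> List[str]:
    """Return human-readable conflicts; an empty list means none were found.

    Hypothetical pre-check based only on the parameter descriptions above.
    """
    clamped_set, binned_set = set(clamped), set(binned)
    count_set, sum_set = set(count_marginal_columns), set(sum_marginal_columns)

    problems: List[str] = []
    # A column should use clamping bounds or a binning spec, not both.
    for col in sorted(clamped_set & binned_set):
        problems.append(f"{col}: has both clamping bounds and a binning spec")
    # Clamped columns belong in sum marginals, not count marginals.
    for col in sorted(clamped_set & count_set):
        problems.append(f"{col}: clamped column appears in a count marginal")
    # Binned columns belong in count marginals, not sum marginals.
    for col in sorted(binned_set & sum_set):
        problems.append(f"{col}: binned column appears in a sum marginal")
    return problems


# Illustrative column names only: "income" is deliberately misconfigured.
for problem in check_column_roles(
    clamped=["income"],
    binned=["age", "income"],
    count_marginal_columns=["age", "state"],
    sum_marginal_columns=["income"],
):
    print(problem)
```

Returning a list of messages rather than raising on the first conflict lets all misconfigured columns be reported at once.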

Return type:
  DataFrame

Returns:
  A DataFrame containing the generated synthetic data.
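
As a rough illustration of how a call might be assembled, the sketch below collects keyword arguments using the parameter names from the signature above. build_generation_kwargs and run_generation are hypothetical helpers, not part of the library, and the sketch assumes the session, keyset, and measurement strategies have already been constructed elsewhere. Optional arguments left as None are omitted so that the function's own defaults apply.

```python
from typing import Any, Dict, Optional, Sequence


def build_generation_kwargs(
    session: Any,
    keyset: Any,
    measurement_strategies: Sequence[Any],
    privacy_budget: Optional[Any] = None,
    split_columns: Optional[Sequence[str]] = None,
    max_rows_per_batch: Optional[int] = None,
    model_iterations: int = 3000,
) -> Dict[str, Any]:
    """Hypothetical helper: gather arguments under the documented names."""
    kwargs: Dict[str, Any] = {
        "session": session,
        "keyset": keyset,
        "measurement_strategies": list(measurement_strategies),
        "model_iterations": model_iterations,
    }
    optional = {
        "privacy_budget": privacy_budget,
        "split_columns": list(split_columns) if split_columns is not None else None,
        "max_rows_per_batch": max_rows_per_batch,
    }
    # Drop unset options so the library's defaults are used for them.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


def run_generation(**kwargs: Any) -> Any:
    """Hypothetical wrapper that forwards the arguments to the library."""
    # Imported lazily so build_generation_kwargs works without the library.
    from tmlt.synthetics import generate_synthetic_data

    return generate_synthetic_data(**kwargs)
```

With real objects in hand, the call would look like run_generation(**build_generation_kwargs(session, keyset, strategies, max_rows_per_batch=500_000)), which returns the synthetic DataFrame described above.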