Optimizing synthetic data generation#
In this topic guide, we will outline a few common strategies to optimize the privacy-utility trade-off when using Tumult Synthetics to generate differentially private synthetic data.
Choosing the measurement strategy#
The measurement strategy passed in the measurement_strategy argument of
generate_synthetic_data() is one of the key choices that impacts the utility of
the generated synthetic data. Put simply, the measurement strategy determines
which distributional properties of the original dataset will be best captured.
Single-column vs. multi-column marginals#
The simplest option, useful when building a first prototype, is to measure
the column-wise distribution of all categorical columns. This is done using
FixedMarginals with one Count marginal for each column. This will preserve the
distribution of values within individual columns, but not the relationships
between groups of columns.
Depending on the use case for synthetic data, it is often desirable to preserve
correlations between pairs of columns. Using a Count marginal initialized with
categorical columns col_a and col_b will preserve the correlations between these
columns. Similarly, computing the marginal distribution across more columns will
attempt to preserve the correlations between all these columns at once.
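To make the distinction concrete, here is a minimal sketch in plain Python. It does not use the Tumult Synthetics API and adds no differential privacy; the records and the marginal() helper are hypothetical and only illustrate what single-column and two-column counts can and cannot capture.

```python
from collections import Counter

# Hypothetical toy records with two categorical columns, col_a and col_b.
records = [
    {"col_a": "x", "col_b": "u"},
    {"col_a": "x", "col_b": "u"},
    {"col_a": "y", "col_b": "v"},
    {"col_a": "y", "col_b": "v"},
]

def marginal(rows, columns):
    """Count how often each combination of values of `columns` occurs."""
    return Counter(tuple(row[c] for c in columns) for row in rows)

# Single-column marginals: each column's distribution on its own.
count_a = marginal(records, ["col_a"])
count_b = marginal(records, ["col_b"])

# Two-column marginal: the joint distribution of col_a and col_b.
count_ab = marginal(records, ["col_a", "col_b"])

# The single-column marginals cannot tell that ("x", "v") never occurs;
# the two-column marginal can.
print(count_a, count_b, count_ab)
```

In this toy dataset, col_a fully determines col_b. Only the two-column count records that relationship.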
It can thus be tempting to measure all the columns at once, or all pairs of columns in the input dataset. However, two fundamental trade-offs mean that such naive strategies are rarely optimal.
When measuring a marginal distribution across a large number of columns, many (if not almost all) combinations of column values will be associated with very few data points. When differentially private noise is added to these counts, the noise will tend to overwhelm the signal, leading to poor utility. We discuss this in more detail in the next section.
The privacy budget is split across all the marginals that are being measured. Therefore, if there are a very large number of marginals, the privacy budget associated with each will be very small, leading to very noisy results: the distributions and correlations will be captured inaccurately. We discuss this in more detail in a follow-up section.
Generally, it is a good practice to start with a simple strategy, and iteratively add more complex marginals as needed.
The impact of sparsity#
When computing a Count marginal, for a fixed privacy budget, the noise added to
the count of each combination of column
values will have a fixed variance. Therefore, it will have a small impact if the
real counts are large, but can lead to significant inaccuracy if the real counts
are very small. When deciding whether or not a marginal is worth including, it
is worth considering its sparsity: if most combinations of column values appear
only a few times in the dataset, the additional marginal will likely not bring a
significant gain in utility.
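As a rough illustration of this effect, the sketch below compares the standard deviation of the noise to counts of different sizes. The source does not say which noise distribution is used, so this assumes Laplace noise with scale 1/epsilon (a common choice for counts); the budget value and the laplace_std() helper are illustrative only.

```python
import math

def laplace_std(epsilon, sensitivity=1.0):
    """Standard deviation of Laplace noise with scale sensitivity / epsilon."""
    return math.sqrt(2) * sensitivity / epsilon

# Assumed budget for one marginal (illustrative value only).
epsilon = 0.5
noise_std = laplace_std(epsilon)

# Same noise, very different relative impact depending on the true count.
relative_error = {count: noise_std / count for count in (3, 30, 3000)}
for count, err in relative_error.items():
    print(f"true count {count:>5}: noise std is {err:.1%} of the count")
```

Under these assumptions, the noise is almost as large as a true count of 3 but negligible next to a true count of 3000.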
A common way to deal with sparsity issues is to use binning: if a column
contains dates, timestamps, or numeric values, the binning_specs argument of
generate_synthetic_data() can be used to specify how the data is binned.
Increasing the bin size can help group more values together, reducing sparsity
at the cost of less granular measurements. Trying out different binning
strategies can be a good way of iteratively improving the utility of the
synthetic data.
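The effect of bin size on sparsity can be sketched in plain Python. This example does not use binning_specs or any Tumult API; the data, the sparse_fraction() helper, and the threshold of 5 are hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical numeric column: every age from 0 to 99 appears exactly twice.
ages = [(i * 37) % 100 for i in range(200)]

def sparse_fraction(values, bin_width, threshold=5):
    """Fraction of non-empty bins holding fewer than `threshold` values."""
    bins = Counter(v // bin_width for v in values)
    return sum(1 for n in bins.values() if n < threshold) / len(bins)

fine = sparse_fraction(ages, bin_width=1)     # one bin per year of age
coarse = sparse_fraction(ages, bin_width=20)  # five 20-year bins

print(f"sparse bins with 1-year bins:  {fine:.0%}")
print(f"sparse bins with 20-year bins: {coarse:.0%}")
```

With one bin per year, every bin holds only two records. With 20-year bins, each bin holds 40 records, at the price of losing the distinction between ages within a bin.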
Splitting the privacy budget#
By default, the privacy budget will be split evenly across all marginals. This
is a good first choice when prototyping, but rarely an optimal strategy. Instead
of using this default splitting strategy, users may specify the weight
argument in each marginal: then, the budget is split proportionally to this
parameter, so marginals with larger weights will be measured more accurately.
This can be useful to preserve certain particularly important correlations with
a higher accuracy, at the cost of lower accuracy of less-important marginals.
This can also be used to counterbalance the effects of sparsity, by using more
budget on the marginals where each combination of column values is associated
with only a few records.
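The proportional split can be sketched as simple arithmetic. This assumes a single epsilon budget divided linearly by weight; the actual accounting in the library may differ depending on the privacy definition in use, and the marginal names and split_budget() helper are hypothetical.

```python
def split_budget(total_epsilon, weights):
    """Split a total budget proportionally to per-marginal weights."""
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight for name, w in weights.items()}

# Hypothetical marginals and weights; the names are illustrative only.
weights = {"age": 1, "state": 1, "age x state": 2}
budgets = split_budget(total_epsilon=1.0, weights=weights)

# The doubly weighted marginal gets half of the budget, the others a quarter each.
print(budgets)
```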
Choosing clamping and truncation bounds#
To generate synthetic data with numeric columns, it is necessary to specify
clamping bounds using the ClampingBounds class. All input records whose value
for the numeric column is outside of these bounds will be modified to fit within
these bounds, as explained in the tutorial about clamping bounds. The trade-off
described in that tutorial also applies to synthetic data generation: wider
clamping bounds will lead to more noise being added to the corresponding Sum
marginal, but tighter bounds can introduce bias if too many values are clamped.
As a rule of thumb, clamping bounds should generally cover about 95% of the
input values, or more if there are no large outliers. When it is impossible to
determine clamping bounds a priori, the get_bounds() aggregation can be used on
the Session before calling generate_synthetic_data() to automatically determine
reasonable clamping bounds, at a small cost in privacy budget. Afterwards, it
can be worthwhile to try different clamping bounds to see which ones provide the
best trade-off between noise and bias.
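The trade-off between noise and bias can be illustrated without any library. The sketch below uses made-up values and a hypothetical clamping_report() helper; it assumes that the noise added to a sum scales with the largest absolute clamping bound, which is the standard sensitivity of a clamped sum but is not confirmed by the source as the exact calibration used.

```python
def clamp(value, lower, upper):
    """Clamp a value into [lower, upper]."""
    return max(lower, min(upper, value))

# Hypothetical incomes (in thousands) with one large outlier.
incomes = [20, 25, 30, 32, 35, 38, 40, 45, 50, 200]

def clamping_report(values, lower, upper):
    """Return (bias from clamping, sensitivity that drives the noise scale)."""
    clamped_sum = sum(clamp(v, lower, upper) for v in values)
    bias = clamped_sum - sum(values)
    sensitivity = max(abs(lower), abs(upper))
    return bias, sensitivity

tight = clamping_report(incomes, 0, 40)    # least noise, largest bias
medium = clamping_report(incomes, 0, 60)   # bias comes only from the outlier
wide = clamping_report(incomes, 0, 200)    # no bias, but the most noise

for name, (bias, sensitivity) in [("tight", tight), ("medium", medium), ("wide", wide)]:
    print(f"{name:>6}: bias={bias:>5}, sensitivity={sensitivity}")
```

Bounds of 0 to 60 cover 9 of the 10 values here, so only the outlier is clamped; which setting is best depends on how large the noise is relative to the true sum.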
When generating synthetic data on a table initialized with privacy IDs, using
the AddRowsWithID protected change, the max_rows_per_id argument must be used
when calling generate_synthetic_data(). This argument determines how many rows
per privacy ID are kept in the original data prior to measurement, and involves
a trade-off similar to that of clamping bounds: larger values will lead to more
noise but keep more data, while smaller values will reduce the noise but might
introduce bias when dropping records. The same 95% rule of thumb applies, and it
is also valuable to experiment with different values to determine an optimal
choice.
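To get a feel for how much data a given truncation threshold discards, the sketch below counts surviving rows for a few thresholds. It is plain Python on made-up data with a hypothetical rows_kept() helper; how the library selects which rows to keep is not described in the source, so the sketch only counts them.

```python
from collections import Counter

# Hypothetical table: one entry per row, holding that row's privacy ID.
row_ids = ["a"] * 1 + ["b"] * 2 + ["c"] * 2 + ["d"] * 3 + ["e"] * 12

def rows_kept(ids, max_rows_per_id):
    """Number of rows that survive keeping at most max_rows_per_id rows per ID."""
    per_id = Counter(ids)
    return sum(min(n, max_rows_per_id) for n in per_id.values())

total = len(row_ids)
kept = {k: rows_kept(row_ids, k) for k in (1, 3, 12)}

for k, n in kept.items():
    print(f"max_rows_per_id={k:>2}: keeps {n} of {total} rows")
```

A threshold of 3 leaves four of the five IDs untouched and only truncates the one heavy contributor.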
Using public data#
Sensitive datasets frequently contain columns with data that is not sensitive in nature, and can be derived from other columns. For example, a dataset could have a column containing ZIP codes and another containing U.S. states: the latter can be determined from the former using a public (non-sensitive) lookup table.
In such situations, it is often simpler to use only the most granular column (or the foreign key to the public table) as a categorical column during synthetic data generation, and re-add all other attributes using the public lookup table afterwards.
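As a minimal sketch of this approach, the derived column can be re-attached with a simple lookup after generation. The rows, the tiny lookup excerpt, and the add_state() helper below are hypothetical stand-ins, not output or functions of Tumult Synthetics.

```python
# Public, non-sensitive lookup table (a tiny hypothetical excerpt).
ZIP_TO_STATE = {"10001": "NY", "94105": "CA", "60601": "IL"}

# Stand-in for synthetic output generated with only the ZIP code column.
synthetic_rows = [{"zip": "10001"}, {"zip": "94105"}, {"zip": "10001"}]

def add_state(rows, lookup):
    """Re-attach the derived column; unknown keys map to None."""
    return [{**row, "state": lookup.get(row["zip"])} for row in rows]

enriched = add_state(synthetic_rows, ZIP_TO_STATE)
print(enriched)
```

Because the state is derived from the ZIP code after generation, the two columns are always consistent with each other, and no privacy budget is spent on the state column.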
Resolving inconsistencies#
Synthetic data generated with a simple strategy might have inconsistencies: values or combinations of values that cannot exist in the original data. These inconsistencies can be of various kinds, and there are multiple ways to resolve them.
Removing null values#
When using binning on a column, values that are not covered by the specified
bins will be converted to null values during the measurement stage. Due to
differentially private noise, this can lead to null values appearing in the
output, even if there are none in the input data. To remove these values, the
simplest option is to modify the synthetic data to convert nulls into random
values instead, or remove the corresponding records entirely.
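Both options can be sketched in a few lines of plain Python. The column values, the replacement range, and the drop_nulls() and fill_nulls() helpers are hypothetical; a real pipeline would apply the same idea to whatever dataframe holds the synthetic data.

```python
import random

# Stand-in for synthetic output where binning produced some nulls (None).
synthetic_ages = [34, None, 51, 27, None, 45]

def drop_nulls(values):
    """Option 1: remove records whose value is null."""
    return [v for v in values if v is not None]

def fill_nulls(values, lower, upper, rng):
    """Option 2: replace nulls with a random integer in [lower, upper]."""
    return [rng.randint(lower, upper) if v is None else v for v in values]

rng = random.Random(0)  # seeded only to make the example reproducible
dropped = drop_nulls(synthetic_ages)
filled = fill_nulls(synthetic_ages, 0, 100, rng)

print(dropped)
print(filled)
```

Dropping records changes the size of the dataset, while filling keeps the size but injects values that were never measured, so either choice can affect utility.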
Post-processing numeric columns#
Clamping bounds specified in the clamping_bounds argument of
generate_synthetic_data() are used to clamp the values of the input data before
performing differentially private measurements, but they are not used during the
generation stage. Therefore, it is possible that values outside the clamping
bounds appear in the output data, and these values might be nonsensical (e.g.
age values above 150). Just like null values, such inconsistencies can be
removed after synthetic data generation, by clamping these values to fixed
bounds, or removing the corresponding records entirely. Note that this may add
bias to the statistical properties of the synthetic data, so it is important to
evaluate the impact of this step by measuring utility afterwards.
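A minimal sketch of both post-processing options, together with a simple check of how each one shifts the mean, might look as follows. The values and the bounds of 0 and 120 are assumptions chosen for illustration.

```python
# Stand-in for a synthetic numeric column with out-of-range values.
synthetic_ages = [23, -4, 67, 171, 45]
LOWER, UPPER = 0, 120  # assumed plausible range, chosen for illustration

# Option 1: clamp out-of-range values to the bounds.
clamped = [min(max(a, LOWER), UPPER) for a in synthetic_ages]

# Option 2: remove the corresponding records entirely.
filtered = [a for a in synthetic_ages if LOWER <= a <= UPPER]

# Compare a simple statistic before and after to see the bias introduced.
mean_before = sum(synthetic_ages) / len(synthetic_ages)
mean_clamped = sum(clamped) / len(clamped)
mean_filtered = sum(filtered) / len(filtered)

print(mean_before, mean_clamped, mean_filtered)
```

Here both options lower the mean, which is the kind of shift worth checking against the utility metrics that matter for the use case.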
Preventing impossible combinations of values#
Depending on how the metadata was specified, the synthetic data might contain combinations of categorical column values that cannot appear in the original data. There are two main ways to prevent this:
The KeySet used to specify possible column values can be initialized with only some combinations of values. Only these combinations will then be present in the output. More information can be found in the tutorial about group-by queries with multiple columns.
The count_structural_zeroes argument of generate_synthetic_data() can be used to specify combinations of categorical column values that cannot appear in the data. Using this argument has the same effect as removing the corresponding combinations from the input KeySet, but it can be simpler and more efficient when there are a small number of impossible combinations.
Similarly, generate_synthetic_data() also accepts a sum_structural_zeroes
argument, which ensures that certain combinations of categorical values are
always associated with a value of 0 in the specified numeric column.
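The idea behind structural zeroes can be sketched in plain Python as removing impossible combinations from the full cross product of column values. This is only a conceptual illustration: it does not use KeySet or the count_structural_zeroes argument, whose exact input formats are not shown here, and the domains and impossible combinations are hypothetical.

```python
from itertools import product

# Hypothetical categorical domains.
age_groups = ["child", "adult"]
employment = ["employed", "unemployed", "n/a"]

# Combinations assumed impossible in the real data (structural zeroes).
impossible = {("child", "employed"), ("child", "unemployed"), ("adult", "n/a")}

# Removing them from the full cross product leaves only valid combinations,
# so no count (noisy or not) is ever produced for an impossible one.
allowed = [combo for combo in product(age_groups, employment)
           if combo not in impossible]

print(allowed)
```

When only a few combinations are impossible, listing them is shorter than enumerating every valid combination, which matches the convenience argument made above.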