SuppressionRate

from tmlt.tune import SuppressionRate
class tmlt.tune.SuppressionRate(join_columns, *, name=None, description=None, baseline=None, output=None, grouping_columns=None)

Bases: JoinedOutputMetric

Computes the fraction of values in the baseline output that are missing from the DP output.

This metric counts how many values of join_columns appear in the baseline output but not in the DP output (such values are called suppressed), and returns the ratio of this number to the total number of values of join_columns in the baseline output.

More formally, let \(s\) be the number of combinations of values of join_columns appearing in the baseline output but not in the DP output, and \(b\) be the number of combinations of values of join_columns appearing in the baseline output. The metric returns \(s/b\); if \(b=0\), \(s\) must also be 0, and the metric returns 0.
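
The definition above can be sketched in plain Python, without Spark. This only illustrates the arithmetic and is not the library's implementation; the helper name suppression_rate and its arguments are hypothetical.

```python
def suppression_rate(baseline_keys, dp_keys):
    """Return s/b for two collections of join-column value combinations.

    Hypothetical helper for illustration only; not part of tmlt.tune.
    """
    baseline = set(baseline_keys)
    dp = set(dp_keys)
    b = len(baseline)  # combinations present in the baseline output
    if b == 0:
        # Nothing can be suppressed when the baseline is empty.
        return 0.0
    s = len(baseline - dp)  # combinations in the baseline but not in DP
    return s / b


rate = suppression_rate(
    baseline_keys=[("a1",), ("a2",), ("a3",), ("b",)],
    dp_keys=[("a1",), ("a2",), ("a3",), ("c",)],
)
print(rate)  # 0.25
```

Here ("b",) is the only suppressed combination, so \(s=1\) and \(b=4\). The combination ("c",), which appears only in the DP side, does not affect the result.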

If grouping_columns is defined, then the DP output and the baseline output are both grouped by these columns, the suppression rate is calculated separately for each group, and the metric returns a DataFrame. Otherwise, the metric returns a single number.
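
The grouped behavior can be sketched the same way in plain Python. This is an illustration under stated assumptions, not the library's code: the helper grouped_suppression_rate is hypothetical, rows are represented as dicts, the result is a dict rather than a DataFrame, and only groups present in the baseline are reported (how the library treats groups that appear only in the DP output is not stated here).

```python
from collections import defaultdict


def grouped_suppression_rate(baseline_rows, dp_rows, join_columns, grouping_columns):
    """Hypothetical helper: suppression rate per group, for rows given as dicts."""

    def keys_by_group(rows):
        # Map each group to the set of join-column combinations it contains.
        groups = defaultdict(set)
        for row in rows:
            group = tuple(row[c] for c in grouping_columns)
            groups[group].add(tuple(row[c] for c in join_columns))
        return groups

    baseline = keys_by_group(baseline_rows)
    dp = keys_by_group(dp_rows)
    rates = {}
    for group, baseline_keys in baseline.items():
        # Assumption: a group absent from the DP output is fully suppressed.
        suppressed = baseline_keys - dp.get(group, set())
        rates[group] = len(suppressed) / len(baseline_keys)
    return rates


baseline_rows = [
    {"G": "g1", "A": "a1"},
    {"G": "g1", "A": "a2"},
    {"G": "g2", "A": "a1"},
    {"G": "g2", "A": "a2"},
]
dp_rows = [
    {"G": "g1", "A": "a1"},
    {"G": "g2", "A": "a1"},
    {"G": "g2", "A": "a2"},
]
rates = grouped_suppression_rate(baseline_rows, dp_rows, ["A"], ["G"])
print(rates)  # {('g1',): 0.5, ('g2',): 0.0}
```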

In each group (or globally if grouping_columns is None), each combination of values of join_columns must appear in at most one row of the DP output and at most one row of the baseline output. Otherwise, the metric raises an error.
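
Before running the metric, it can help to check this uniqueness requirement on a small sample of the data. The sketch below uses plain Python; the helper find_duplicate_keys is hypothetical and is not how the library performs its own validation.

```python
from collections import Counter


def find_duplicate_keys(rows, join_columns):
    """Hypothetical helper: join-column combinations that appear in more than one row."""
    counts = Counter(tuple(row[c] for c in join_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


rows = [
    {"A": "a1", "X": 100},
    {"A": "a2", "X": 100},
    {"A": "a1", "X": 50},  # second row with A == "a1"
]
duplicates = find_duplicate_keys(rows, ["A"])
print(duplicates)  # [('a1',)]
```

An empty result means that every combination of join-column values is unique in the sample. When grouping_columns is set, include those columns in the key as well, since the requirement applies within each group.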

Example

>>> dp_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a2", "a3", "c"],
...             "X": [50, 110, 100, 50]
...         }
...     )
... )
>>> dp_outputs = {"O": dp_df}
>>> baseline_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a2", "a3", "b"],
...             "X": [100, 100, 100, 50]
...         }
...     )
... )
>>> baseline_outputs = {"default": {"O": baseline_df}}
>>> metric = SuppressionRate(
...     join_columns=["A"]
... )
>>> metric.join_columns
['A']
>>> metric(dp_outputs, baseline_outputs).value
0.25
compute_suppression_rate(joined_output, result_column_name)

Computes the suppression rate given the joined DP and baseline outputs.