SpuriousRate

from tmlt.tune import SpuriousRate
class tmlt.tune.SpuriousRate(join_columns, *, name=None, description=None, baseline=None, output=None, grouping_columns=None)

Bases: JoinedOutputMetric

Computes the fraction of values in the DP output that do not appear in the baseline output.

This metric counts how many values of join_columns appear in the DP output but not in the baseline output (such values are called spurious), and returns the ratio between this number and the total number of values of join_columns in the DP output.

More formally, let \(s\) be the number of combinations of values of join_columns appearing in the DP output but not in the baseline output, and \(d\) be the number of combinations of values of join_columns appearing in the DP output. The metric returns \(s/d\); if \(d=0\), \(s\) must also be 0, and the metric returns 0.
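
The definition of \(s/d\) can be illustrated with a minimal pandas sketch. This is not the library's implementation: the function name spurious_rate_sketch is hypothetical, and the sketch works on plain pandas DataFrames rather than Spark DataFrames.

```python
import pandas as pd


def spurious_rate_sketch(dp, baseline, join_columns):
    """Illustrative only: compute s/d as defined above (hypothetical helper)."""
    # Distinct combinations of join_columns in each output.
    dp_keys = dp[join_columns].drop_duplicates()
    baseline_keys = baseline[join_columns].drop_duplicates()

    d = len(dp_keys)
    if d == 0:
        # No DP values at all, so s is also 0 and the rate is defined as 0.
        return 0.0

    # A left join with an indicator column marks DP keys with no baseline match.
    merged = dp_keys.merge(
        baseline_keys, on=join_columns, how="left", indicator=True
    )
    s = int((merged["_merge"] == "left_only").sum())
    return s / d


dp = pd.DataFrame({"A": ["a1", "a2", "a3", "c"], "X": [50, 110, 100, 50]})
baseline = pd.DataFrame({"A": ["a1", "a2", "a3", "b"], "X": [100, 100, 100, 50]})
rate = spurious_rate_sketch(dp, baseline, ["A"])
```

Here only "c" appears in the DP data without a baseline match, so \(s=1\), \(d=4\), and rate is 0.25.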

If grouping_columns is defined, then the DP output and the baseline output are both grouped by these columns, the spurious rate is calculated separately for each group, and the metric returns a DataFrame. Otherwise, the metric returns a single number.
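
The grouped behavior can be sketched in pandas as follows. This is an assumption-laden illustration, not the library's code: the function name grouped_spurious_rate_sketch and the result column name spurious_rate are hypothetical, and groups that appear only in the baseline are simply omitted here, which may differ from what the library does.

```python
import pandas as pd


def grouped_spurious_rate_sketch(dp, baseline, join_columns, grouping_columns):
    """Illustrative only: one spurious rate per group (hypothetical helper)."""
    # Match rows on the grouping columns together with the join columns.
    keys = list(grouping_columns) + [
        c for c in join_columns if c not in grouping_columns
    ]
    dp_keys = dp[keys].drop_duplicates()
    baseline_keys = baseline[keys].drop_duplicates()

    merged = dp_keys.merge(baseline_keys, on=keys, how="left", indicator=True)
    merged["spurious"] = merged["_merge"] == "left_only"

    # The mean of a boolean column is the fraction of spurious keys, i.e. s/d.
    return (
        merged.groupby(list(grouping_columns))["spurious"]
        .mean()
        .reset_index(name="spurious_rate")
    )


dp_grouped = pd.DataFrame(
    {"G": ["g1", "g1", "g2", "g2"], "A": ["a1", "a2", "a1", "c"]}
)
baseline_grouped = pd.DataFrame(
    {"G": ["g1", "g1", "g2", "g2"], "A": ["a1", "a2", "a1", "b"]}
)
result = grouped_spurious_rate_sketch(dp_grouped, baseline_grouped, ["A"], ["G"])
```

In this toy data, group g1 has no spurious values (rate 0.0), while group g2 has one spurious value out of two (rate 0.5).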

In each group (or globally if grouping_columns is None), each combination of values of join_columns must appear in at most one row of the DP output and in at most one row of the baseline output. Otherwise, the metric returns an error.
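
One way to check this uniqueness requirement before running the metric is sketched below in pandas. The helper has_duplicate_keys is hypothetical and is not part of tmlt.tune; it only illustrates the condition stated above.

```python
import pandas as pd


def has_duplicate_keys(df, join_columns, grouping_columns=None):
    """Illustrative only: True if a key combination occurs in more than one row."""
    grouping_columns = list(grouping_columns or [])
    # Uniqueness is checked per group, so grouping columns are part of the key.
    keys = grouping_columns + [
        c for c in join_columns if c not in grouping_columns
    ]
    return bool(df.duplicated(subset=keys).any())


valid_output = pd.DataFrame({"A": ["a1", "a2"], "X": [50, 110]})
invalid_output = pd.DataFrame({"A": ["a1", "a1"], "X": [50, 110]})

valid_has_duplicates = has_duplicate_keys(valid_output, ["A"])
invalid_has_duplicates = has_duplicate_keys(invalid_output, ["A"])
```

The second DataFrame repeats the value "a1" in column A, so it would violate the requirement when join_columns is ["A"].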

Example

>>> dp_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a2", "a3", "c"],
...             "X": [50, 110, 100, 50]
...         }
...     )
... )
>>> dp_outputs = {"O": dp_df}
>>> baseline_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a2", "a3", "b"],
...             "X": [100, 100, 100, 50]
...         }
...     )
... )
>>> baseline_outputs = {"default": {"O": baseline_df}}
>>> metric = SpuriousRate(
...     join_columns=["A"]
... )
>>> metric.join_columns
['A']
>>> metric(dp_outputs, baseline_outputs).value
0.25

compute_spurious_rate(joined_output, result_column_name)

Computes spurious rate given DP and baseline outputs.