SpuriousRate
from tmlt.tune import SpuriousRate
- class tmlt.tune.SpuriousRate(join_columns, *, name=None, description=None, baseline=None, output=None, grouping_columns=None)
Bases: JoinedOutputMetric
Computes the fraction of values in the DP output but not in the baseline output.
This metric counts how many values of join_columns appear in the DP output but not in the baseline output (such values are called spurious), and returns the ratio between this number and the total number of values of join_columns in the DP output.
More formally, let \(s\) be the number of combinations of values of join_columns appearing in the DP output but not the baseline output, and \(d\) be the number of combinations of values of join_columns appearing in the DP output. The metric returns \(s/d\); if \(d=0\), \(s\) must also be 0, and the metric returns 0.
If grouping_columns is defined, then the DP output and the baseline output are both grouped by these columns, the spurious rate is calculated separately for each group, and the metric returns a DataFrame. Otherwise, the metric returns a single number.
In each group (or globally if grouping_columns is None), each combination of values of join_columns must appear in at most one row of the DP output and in at most one row of the baseline output. Otherwise, the metric returns an error.
Example
>>> dp_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a2", "a3", "c"],
...             "X": [50, 110, 100, 50]
...         }
...     )
... )
>>> dp_outputs = {"O": dp_df}
>>> baseline_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a2", "a3", "b"],
...             "X": [100, 100, 100, 50]
...         }
...     )
... )
>>> baseline_outputs = {"default": {"O": baseline_df}}
>>> metric = SpuriousRate(
...     join_columns=["A"]
... )
>>> metric.join_columns
['A']
>>> metric(dp_outputs, baseline_outputs).value
0.25
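To see where the 0.25 comes from, the \(s/d\) definition can be sketched in plain Python with sets. This is only an illustration of the formula, not the library's implementation: the function name spurious_rate and the representation of each combination of join_columns values as a tuple are assumptions made for this sketch.

```python
def spurious_rate(dp_keys, baseline_keys):
    """Return s/d for two collections of join-column value combinations.

    Hypothetical helper for illustration; not part of tmlt.tune.
    """
    dp = set(dp_keys)
    baseline = set(baseline_keys)
    d = len(dp)  # combinations appearing in the DP output
    if d == 0:
        # With no DP combinations, s is necessarily 0; the rate is defined as 0.
        return 0.0
    s = len(dp - baseline)  # combinations in the DP output but not the baseline
    return s / d


# Values of column "A" from the example above, one tuple per combination.
dp_keys = [("a1",), ("a2",), ("a3",), ("c",)]
baseline_keys = [("a1",), ("a2",), ("a3",), ("b",)]

rate = spurious_rate(dp_keys, baseline_keys)
print(rate)  # 0.25: only "c" is spurious, out of 4 DP values
```

The value "b" appears only in the baseline output, so it does not affect the result; the X column plays no role either, since only join_columns are compared.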
- compute_spurious_rate(joined_output, result_column_name)
Computes spurious rate given DP and baseline outputs.
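The structure of joined_output is not described here, so the following is only a rough illustration of the per-group calculation that applies when grouping_columns is set. It uses plain Python on lists of row dictionaries rather than Spark DataFrames; the function name, the row layout, and the dictionary return type are all assumptions for this sketch and do not reflect the library's internals.

```python
from collections import defaultdict


def spurious_rate_by_group(dp_rows, baseline_rows, join_columns, grouping_columns):
    """Return {group: s/d} for every group present in the DP rows.

    Hypothetical helper for illustration; not part of tmlt.tune.
    """

    def key(row, columns):
        return tuple(row[c] for c in columns)

    # Collect the join-column combinations seen in each group of the baseline.
    baseline_keys = defaultdict(set)
    for row in baseline_rows:
        baseline_keys[key(row, grouping_columns)].add(key(row, join_columns))

    # Do the same for the DP output.
    dp_keys = defaultdict(set)
    for row in dp_rows:
        dp_keys[key(row, grouping_columns)].add(key(row, join_columns))

    # Each group listed here has at least one DP combination, so d > 0.
    return {
        group: len(keys - baseline_keys.get(group, set())) / len(keys)
        for group, keys in dp_keys.items()
    }


dp_rows = [
    {"G": "g1", "A": "a1"},
    {"G": "g1", "A": "c"},
    {"G": "g2", "A": "a1"},
    {"G": "g2", "A": "a2"},
]
baseline_rows = [
    {"G": "g1", "A": "a1"},
    {"G": "g1", "A": "b"},
    {"G": "g2", "A": "a1"},
    {"G": "g2", "A": "a2"},
]

rates = spurious_rate_by_group(dp_rows, baseline_rows, ["A"], ["G"])
print(rates)  # {('g1',): 0.5, ('g2',): 0.0}
```

In group g1, "c" is spurious out of two DP values, giving 0.5; in g2 every DP value also appears in the baseline, giving 0.0. How the real metric treats groups that appear only in the baseline output is not specified in this section, and the sketch simply omits them.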