HighRelativeErrorRate#

from tmlt.tune import HighRelativeErrorRate
class tmlt.tune.HighRelativeErrorRate(relative_error_threshold, measure_column, join_columns, grouping_columns=None, *, name=None, description=None, baseline=None, output=None)#

Bases: JoinedOutputMetric

Computes the fraction of values whose relative error is at or above a fixed threshold.

This metric matches values of measure_column between the DP output and the baseline output using an inner, 1-to-1 join on join_columns, computes the relative error of each DP value, and returns the fraction of values whose relative error is at or above the specified threshold.

More formally, let \(J\) be all combinations of values of join_columns appearing in both the DP output and the baseline output. For all \(j \in J\), let \({DP}_j\) be the corresponding value of measure_column in the DP output, and \(B_j\) the corresponding value of measure_column in the baseline output. Let \(I\) be the set of indices \(i \in J\) such that \({DP}_i\) and \(B_i\) are valid numeric values (neither NaN nor null).

The high relative error rate is defined as:

\[\frac{ \text{card}\left(\left\{ i \in I \text{ such that } \left|\frac{{DP}_i-B_i}{B_i}\right|\ge\texttt{relative_error_threshold} \right\}\right) }{\text{card}(I)}\]

where \(\text{card}\) denotes the cardinality of a set. Whenever \(B_i=0\), the relative error \(\left|\frac{{DP}_i-B_i}{B_i}\right|\) evaluates to \(0\) if \({DP}_i=0\), and to \(\infty\) otherwise. If \(I\) is empty, the metric returns NaN.
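As an illustrative sketch (not the library's implementation), the definition can be applied by hand to a few already-joined value pairs, here using the same values and 0.25 threshold as the example below; NaN/null filtering is omitted:

>>> import math
>>> def relative_error(dp, b):
...     if b == 0:
...         return 0.0 if dp == 0 else math.inf
...     return abs((dp - b) / b)
>>> pairs = [(50, 100), (110, 100), (100, 100)]  # (DP value, baseline value)
>>> sum(relative_error(dp, b) >= 0.25 for dp, b in pairs) / len(pairs)
0.3333333333333333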

If grouping_columns is defined, then the DP output and the baseline output are both grouped by these columns, the high relative error rate is calculated separately for each group, and the metric returns a DataFrame. Otherwise, the metric returns a single number.

In each group (or globally, if grouping_columns is None), each combination of values of join_columns must appear in at most one row of the DP output and at most one row of the baseline output. Otherwise, the metric raises an error.
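For illustration, a grouped invocation might look like the following sketch; the grouping column G and its data are invented for this example, the parameter usage is inferred from the signature above, and the exact schema of the returned DataFrame is not shown:

>>> grouped_dp_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "G": ["g1", "g1", "g2"],
...             "A": ["a1", "a2", "a3"],
...             "X": [50, 110, 100],
...         }
...     )
... )
>>> grouped_baseline_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "G": ["g1", "g1", "g2"],
...             "A": ["a1", "a2", "a3"],
...             "X": [100, 100, 100],
...         }
...     )
... )
>>> grouped_metric = HighRelativeErrorRate(
...     measure_column="X",
...     relative_error_threshold=0.25,
...     join_columns=["A"],
...     grouping_columns=["G"],
... )
>>> grouped_result = grouped_metric(
...     {"O": grouped_dp_df}, {"default": {"O": grouped_baseline_df}}
... ).value  # one high relative error rate per value of G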

Note

This metric only measures error for rows that can be mapped 1-to-1 between the DP output and the baseline output (according to the values in join_columns). This ignores the error from rows that appear in only one of the two tables; to capture this kind of error, use SuppressionRate and/or SpuriousRate.

Example

>>> dp_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a2", "a3"],
...             "X": [50, 110, 100]
...         }
...     )
... )
>>> dp_outputs = {"O": dp_df}
>>> baseline_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a2", "a3"],
...             "X": [100, 100, 100]
...         }
...     )
... )
>>> baseline_outputs = {"default": {"O": baseline_df}}
>>> metric = HighRelativeErrorRate(
...     measure_column="X",
...     relative_error_threshold=0.25,
...     join_columns=["A"]
... )
>>> metric.relative_error_threshold
0.25
>>> metric.join_columns
['A']
>>> result = metric(dp_outputs, baseline_outputs).value
>>> round(result, 3)
0.333
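
In this example, the relative errors are \(|50-100|/100 = 0.5\), \(|110-100|/100 = 0.1\), and \(|100-100|/100 = 0\); only the first is at or above the 0.25 threshold, so the high relative error rate is \(1/3 \approx 0.333\).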
property relative_error_threshold: float#

Returns the relative error threshold.

compute_high_re(joined_output, result_column_name)#

Computes the high relative error rate from a joined DataFrame.