_spurious
Metric functions relating to spurious rate.
Classes

SpuriousRate: Computes the fraction of groups in the DP output but not in the baseline output.

- class SpuriousRate(output, join_columns, *, name=None, description=None, baselines=None)
Bases: tmlt.analytics.metrics._base.SingleBaselineMetric
Computes the fraction of groups in the DP output but not in the baseline output.
Note
This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.
Note
Below, released means that the group is in the DP output, and spurious means that the group is not in the baseline output.
How it works:
The algorithm takes two dictionaries as input:
- dp_outputs: A dictionary containing the differentially private (DP) outputs, where keys represent output identifiers and values represent the corresponding DP output. The DP output data is generated by a differentially private mechanism.
- baseline_outputs: A dictionary containing the baseline outputs, where keys represent output identifiers and values represent the corresponding baseline table (baseline).
Before performing computations, the algorithm checks whether the count of rows in the DP output (the released count) is zero. If so, it returns NaN, indicating that no computation can be performed due to the absence of released data.
If not, the algorithm performs a left anti-join between the DP and baseline tables based on join_columns. This returns all rows from the DP output (left dataframe) that have no match in the baseline output (right dataframe). The count of these rows is the spurious released count.
After performing the join, the algorithm computes the spurious rate by dividing the spurious released count by the released count, using the formula \(\text{spurious released count} / \text{released count}\). The result represents the proportion of released data points in the DP output that have no corresponding data points in the baseline output.
Example
>>> dp_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a1", "a2", "c"],
...             "X": [50, 110, 100, 50]
...         }
...     )
... )
>>> dp_outputs = {"O": dp_df}
>>> baseline_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a1", "a2", "b"],
...             "X": [100, 100, 100, 50]
...         }
...     )
... )
>>> baseline_outputs = {"default": {"O": baseline_df}}
>>> metric = SpuriousRate(
...     output="O",
...     join_columns=["A"]
... )
>>> metric.join_columns
['A']
>>> metric(dp_outputs, baseline_outputs)[0].value
0.25
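Here, the row with group "c" appears in the DP output but has no match in the baseline output, so one of the four released rows is spurious, giving a spurious rate of 1/4 = 0.25.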
Methods

output: Returns the name of the run output or view name.
join_columns: Returns the name of the join columns.
format(): Returns a string representation of this object.
format_as_table_row(): Return a table row summarizing the metric result.
check_compatibility_with_program(): Checks if the metric is compatible with the program.
check_compatibility_with_outputs(): Check that a particular set of outputs is compatible with the metric.
compute_for_baseline(): Computes spurious rate given DP and baseline outputs.
check_compatibility_with_data(): Check that the outputs have all the structure the metric expects.
name: Returns the name of the metric.
description: Returns the description of the metric.
baselines: Returns the baselines used for the metric.
format_as_dataframe(): Returns the results of this metric formatted as a dataframe.
__call__(): Computes the given metric on the given DP and baseline outputs.
- __init__(output, join_columns, *, name=None, description=None, baselines=None)
Constructor.
- Parameters
output (str) – The output to compute the spurious rate for.
join_columns (List[str]) – The columns to join the DP output with the baseline output on.
name (Optional[str], default: None) – A name for the metric.
description (Optional[str], default: None) – A description of the metric.
baselines (Union[str, List[str], None], default: None) – The name of the baseline program(s) used for the error report. If None, use all baselines specified as custom baseline and baseline options on the tuner class. If no baselines are specified on the tuner class, use the default baseline. If a string, use only that baseline. If a list, use only those baselines.
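As a usage sketch, a metric restricted to a single baseline might be constructed as follows; the baseline name "default" is illustrative and must correspond to a baseline defined for your tuner.

metric = SpuriousRate(
    output="O",
    join_columns=["A"],
    name="spurious rate for O",
    description="Fraction of released groups absent from the baseline.",
    baselines="default",  # illustrative baseline name
)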
- format(value)
Returns a string representation of this object.
- format_as_table_row(result)
Return a table row summarizing the metric result.
- Parameters
result (tmlt.analytics.metrics._base.MetricResult) –
- check_compatibility_with_program(program, output_views)
Checks if the metric is compatible with the program.
This is a dynamic check, performed by verifying whether the output attribute of the metric object is present in the annotations of the Outputs attribute of the program. If the output attribute is not found in the annotations, a ValueError is raised.
- Parameters
program (Type[tmlt.analytics.program.SessionProgram]) –
output_views (List[str]) –
- check_compatibility_with_outputs(outputs, output_name)
Check that a particular set of outputs is compatible with the metric.
Should throw a ValueError if the metric is not compatible.
- Parameters
outputs (Dict[str, pyspark.sql.DataFrame]) –
output_name (str) –
- compute_for_baseline(baseline_name, dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)
Computes spurious rate given DP and baseline outputs.
- Parameters
baseline_name (str) –
dp_outputs (Dict[str, pyspark.sql.DataFrame]) –
baseline_outputs (Dict[str, pyspark.sql.DataFrame]) –
unprotected_inputs (Optional[Dict[str, pyspark.sql.DataFrame]]) –
program_parameters (Optional[Dict[str, Any]]) –
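Note that, as the parameter types indicate, compute_for_baseline receives the outputs of a single baseline, so its baseline_outputs argument is a flat mapping from output name to dataframe; __call__ and check_compatibility_with_data below instead take a nested mapping keyed by baseline name.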
- check_compatibility_with_data(dp_outputs, baseline_outputs)
Check that the outputs have all the structure the metric expects.
Should throw a ValueError if the metric is not compatible.
- Parameters
dp_outputs (Dict[str, pyspark.sql.DataFrame]) –
baseline_outputs (Dict[str, Dict[str, pyspark.sql.DataFrame]]) –
- property baselines
Returns the baselines used for the metric.
- format_as_dataframe(result)
Returns the results of this metric formatted as a dataframe.
- Parameters
result (tmlt.analytics.metrics.MetricResult) –
- __call__(dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)
Computes the given metric on the given DP and baseline outputs.
- Parameters
dp_outputs (Dict[str, pyspark.sql.DataFrame]) – The differentially private outputs of the program.
baseline_outputs (Dict[str, Dict[str, pyspark.sql.DataFrame]]) – The outputs of the baseline programs.
unprotected_inputs (Optional[Dict[str, pyspark.sql.DataFrame]]) – Optional public dataframes used in error computation.
program_parameters (Optional[Dict[str, Any]]) – Optional program specific parameters used in error computation.
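Tying this together, a direct invocation mirrors the example above: the call returns one result per baseline, and each result exposes the computed rate through its value attribute. This is a minimal sketch, assuming the dictionaries from the earlier example are in scope.

results = metric(dp_outputs, baseline_outputs)
for result in results:
    # One result per baseline; .value holds the spurious rate.
    print(result.value)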