_suppression#

Metric functions relating to suppression rate.

Classes#

SuppressionRate

Computes the fraction of groups in the baseline output but not in the DP output.

class SuppressionRate(output, join_columns, *, name=None, description=None, baselines=None)#

Bases: tmlt.analytics.metrics._base.SingleBaselineMetric

Computes the fraction of groups in the baseline output but not in the DP output.

Note

This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.

Note

Below, released means that the group is in the DP output, and spurious means that the group is not in the output of the baseline.

How it works:

The algorithm takes two dictionaries as input:

dp_outputs: A dictionary containing the differentially private (DP) outputs, where each key is an output identifier and each value is the corresponding DP output table. The DP output data is generated by a differentially private mechanism.

baseline_outputs: A dictionary containing the baseline outputs, where each key is an output identifier and each value is the corresponding baseline table.

Before performing any computation, the algorithm checks whether the baseline output has zero rows, that is, whether the non-spurious count is zero. If so, it returns NaN, because the rate is undefined when the baseline contains no non-spurious data. Otherwise, the algorithm performs a left anti-join between the baseline and DP tables on join_columns. This returns all rows from the baseline (left dataframe) that have no match in the DP output (right dataframe). The count of these rows is the non-spurious non-released count.

After performing the join, the algorithm computes the suppression rate by dividing the non-spurious non-released count by the total count of non-spurious data points (non-spurious count), using the formula \(\text{non-spurious non-released count} / \text{non-spurious count}\). The result represents the proportion of non-spurious data points in the baseline outputs that are not released in the DP outputs.
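The steps above can be sketched in plain Python. This is only an illustration of the anti-join and division described here, not the library's implementation (which operates on Spark DataFrames); the function name `suppression_rate` and the list-of-dicts row representation are assumptions made for this sketch.

```python
import math
from typing import Any, Dict, List

Row = Dict[str, Any]


def suppression_rate(
    baseline_rows: List[Row], dp_rows: List[Row], join_columns: List[str]
) -> float:
    """Fraction of baseline rows whose join key has no match in the DP rows."""
    non_spurious_count = len(baseline_rows)
    if non_spurious_count == 0:
        # Empty baseline: the rate is undefined, so return NaN.
        return math.nan

    # Join keys that appear in the DP output (the released groups).
    released_keys = {tuple(row[c] for c in join_columns) for row in dp_rows}

    # Left anti-join: keep baseline rows with no matching key in the DP output.
    non_released = [
        row
        for row in baseline_rows
        if tuple(row[c] for c in join_columns) not in released_keys
    ]
    return len(non_released) / non_spurious_count


# Hypothetical toy data: both "b" rows are missing from the DP output.
baseline = [{"A": "a1"}, {"A": "a2"}, {"A": "b"}, {"A": "b"}]
dp = [{"A": "a1"}, {"A": "a2"}, {"A": "c"}]
print(suppression_rate(baseline, dp, ["A"]))  # 0.5
```

Note that the count is over rows, not distinct groups, which matches the description of the anti-join above.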

Example

>>> dp_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a1", "a2", "c"],
...             "X": [50, 110, 100, 50]
...         }
...     )
... )
>>> dp_outputs = {"O": dp_df}
>>> baseline_df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a1", "a2", "b"],
...             "X": [100, 100, 100, 50]
...         }
...     )
... )
>>> baseline_outputs = {"default": {"O": baseline_df}}

>>> metric = SuppressionRate(
...     output="O",
...     join_columns=["A"]
... )
>>> metric.join_columns
['A']
>>> metric(dp_outputs, baseline_outputs)[0].value
0.25
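
The value in this example can be traced by hand, assuming (as the description of the anti-join states) that rows rather than distinct groups are counted. The baseline has four rows with A values a1, a1, a2, and b, so the non-spurious count is 4. Only the row with A = b has no match among the DP output's A values (a1, a1, a2, c), so the non-spurious non-released count is 1:

\[
\frac{\text{non-spurious non-released count}}{\text{non-spurious count}} = \frac{1}{4} = 0.25
\]

The group c appears only in the DP output, so it is spurious and enters neither count.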

Methods#
`output()`	Returns the name of the run output or view name.
`join_columns()`	Returns the names of the join columns.
`format()`	Returns a string representation of this object.
`format_as_table_row()`	Return a table row summarizing the metric result.
`check_compatibility_with_program()`	Checks if the metric is compatible with the program.
`check_compatibility_with_outputs()`	Check that a particular set of outputs is compatible with the metric.
`compute_for_baseline()`	Computes suppression rate given DP and baseline outputs.
`check_compatibility_with_data()`	Check that the outputs have all the structure the metric expects.
`name()`	Returns the name of the metric.
`description()`	Returns the description of the metric.
`baselines()`	Returns the baselines used for the metric.
`format_as_dataframe()`	Returns the results of this metric formatted as a dataframe.
`__call__()`	Computes the given metric on the given DP and baseline outputs.

Parameters

output (str) –
join_columns (List[str]) –
name (Optional[str]) –
description (Optional[str]) –
baselines (Optional[Union[str, List[str]]]) –

__init__(output, join_columns, *, name=None, description=None, baselines=None)#

Constructor.

Parameters

output (str) – Which output to compute the suppression rate for.
join_columns (List[str]) – The columns to join on.
name (Optional[str], default: None) – A name for the metric.
description (Optional[str], default: None) – A description of the metric.
baselines (Optional[Union[str, List[str]]], default: None) – The name of the baseline program(s) used for the error report. If None, use all baselines specified as custom baselines and baseline options on the tuner class. If no baselines are specified on the tuner class, use the default baseline. If a string, use only that baseline. If a list, use only those baselines.
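
The selection rules for baselines can be read as a small decision procedure. The sketch below is a hypothetical helper, not part of the library: the name `resolve_baselines`, the `tuner_baselines` argument, and the use of `"default"` as the default baseline name are all assumptions made for illustration.

```python
from typing import List, Optional, Union


def resolve_baselines(
    requested: Optional[Union[str, List[str]]],
    tuner_baselines: List[str],
    default_baseline: str = "default",  # assumed name of the default baseline
) -> List[str]:
    """Return the baseline names a metric would use under the rules above."""
    if requested is None:
        # Use every baseline configured on the tuner, or fall back to the default.
        return list(tuner_baselines) if tuner_baselines else [default_baseline]
    if isinstance(requested, str):
        # A single name selects only that baseline.
        return [requested]
    # A list selects exactly the named baselines.
    return list(requested)


print(resolve_baselines(None, []))
print(resolve_baselines("custom", ["default", "custom"]))
```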

property output#

Returns the name of the run output or view name.

Return type: str

property join_columns#

Returns the names of the join columns.

Return type: List[str]

format(value)#

Returns a string representation of this object.

format_as_table_row(result)#

Return a table row summarizing the metric result.

Parameters: result (tmlt.analytics.metrics._base.MetricResult) –
Return type: pandas.DataFrame

check_compatibility_with_program(program, output_views)#

Checks if the metric is compatible with the program.

This is a dynamic check and is performed by verifying whether the output attribute of the metric object is present in the annotations of the Outputs attribute of the program. If the output attribute is not found in the annotations, a ValueError is raised.

Parameters

program (Type[tmlt.analytics.program.SessionProgram]) –
output_views (List[str]) –
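
A rough sketch of the annotation check described above, using only standard Python introspection. `ExampleProgram` and `check_output_is_declared` are hypothetical names for illustration; the real method also receives output_views, whose role is not described here and is therefore left out of the sketch.

```python
class ExampleProgram:
    """Hypothetical stand-in for a SessionProgram subclass."""

    class Outputs:
        # Annotation only; a real program would annotate with a DataFrame type.
        O: object


def check_output_is_declared(program: type, output: str) -> None:
    """Raise ValueError if `output` is not annotated on `program.Outputs`."""
    declared = getattr(program.Outputs, "__annotations__", {})
    if output not in declared:
        raise ValueError(
            f"Output {output!r} is not declared in {program.__name__}.Outputs"
        )


check_output_is_declared(ExampleProgram, "O")  # passes silently
try:
    check_output_is_declared(ExampleProgram, "missing")
except ValueError as err:
    print(err)
```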

check_compatibility_with_outputs(outputs, output_name)#

Check that a particular set of outputs is compatible with the metric.

Should throw a ValueError if the metric is not compatible.

Parameters

outputs (Dict[str, pyspark.sql.DataFrame]) –
output_name (str) –

compute_for_baseline(baseline_name, dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)#

Computes suppression rate given DP and baseline outputs.

Parameters

baseline_name (str) –
dp_outputs (Dict[str, pyspark.sql.DataFrame]) –
baseline_outputs (Dict[str, pyspark.sql.DataFrame]) –
unprotected_inputs (Optional[Dict[str, pyspark.sql.DataFrame]]) –
program_parameters (Optional[Dict[str, Any]]) –

check_compatibility_with_data(dp_outputs, baseline_outputs)#

Check that the outputs have all the structure the metric expects.

Should throw a ValueError if the metric is not compatible.

Parameters

dp_outputs (Dict[str, pyspark.sql.DataFrame]) –
baseline_outputs (Dict[str, Dict[str, pyspark.sql.DataFrame]]) –

property name#

Returns the name of the metric.

Return type: str

property description#

Returns the description of the metric.

Return type: str

property baselines#

Returns the baselines used for the metric.

Return type: Optional[Union[str, List[str]]]

format_as_dataframe(result)#

Returns the results of this metric formatted as a dataframe.

Parameters: result (tmlt.analytics.metrics.MetricResult) –
Return type: MetricResultDataframe

__call__(dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)#

Computes the given metric on the given DP and baseline outputs.

Parameters

dp_outputs (Dict[str, pyspark.sql.DataFrame]) – The differentially private outputs of the program.
baseline_outputs (Dict[str, Dict[str, pyspark.sql.DataFrame]]) – The outputs of the baseline programs.
unprotected_inputs (Optional[Dict[str, pyspark.sql.DataFrame]]) – Optional public dataframes used in error computation.
program_parameters (Optional[Dict[str, Any]]) – Optional program specific parameters used in error computation.

Return type

List[tmlt.analytics.metrics.MetricResult]
