_user_defined#

Metric functions for wrapping user-defined functions.

Classes#

CustomGroupedMetric

Wrapper to define a metric that operates on a grouped, joined output table.

CustomSingleOutputMetric

Wrapper to allow users to define a metric that operates on a single output table.

CustomMultiBaselineMetric

Wrapper to turn a function into a metric using DP outputs and multiple baselines’ outputs.

class CustomGroupedMetric(func, output, join_columns, *, name, description=None, baselines=None, grouping_columns=None, join_how='inner', dropna_columns=None, indicator_column_name=None)#

Bases: tmlt.analytics.metrics._base.GroupedMetric

Wrapper to define a metric that operates on a grouped, joined output table.

Turns a function that calculates error on a GroupedData (produced by joining DP and baseline tables and grouping by a user-specified set of columns) into a Metric.

The GroupedData available to the user-defined function will have all the join columns, and versions of all other columns with _dp and _baseline suffixes, indicating which version of the data they came from.

The user-defined function should return either a single value (if there are no grouping columns) or, if grouped, a DataFrame with a column whose name is the value of output_column_name and that contains the metric’s result values.
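The column layout described above can be pictured with a small, library-free sketch. It models tables as lists of dictionaries and performs an inner join by hand. The helper name join_with_suffixes and the sample data are hypothetical and are not part of Tumult Analytics.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def join_with_suffixes(
    dp_rows: List[Row], baseline_rows: List[Row], join_columns: List[str]
) -> List[Row]:
    """Inner-join two small tables, suffixing every non-join column."""
    joined = []
    for dp_row in dp_rows:
        for base_row in baseline_rows:
            # Keep only pairs of rows that agree on every join column.
            if all(dp_row[c] == base_row[c] for c in join_columns):
                row = {c: dp_row[c] for c in join_columns}
                for name, value in dp_row.items():
                    if name not in join_columns:
                        row[f"{name}_dp"] = value
                for name, value in base_row.items():
                    if name not in join_columns:
                        row[f"{name}_baseline"] = value
                joined.append(row)
    return joined


dp_rows = [{"A": 1, "count": 12}, {"A": 2, "count": 7}]
baseline_rows = [{"A": 1, "count": 10}, {"A": 3, "count": 4}]
joined = join_with_suffixes(dp_rows, baseline_rows, ["A"])
print(joined)  # [{'A': 1, 'count_dp': 12, 'count_baseline': 10}]
```

A user-defined function would then compare columns such as count_dp and count_baseline within each group.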

Note

This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.

Example

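>>> from typing import List
>>> import pandas as pd
>>> from pyspark.sql import GroupedData, SparkSession
>>> from pyspark.sql import functions as sf
>>> from pyspark.sql.functions import col
>>> from tmlt.analytics.metrics import CustomGroupedMetric  # import path assumed
>>> spark = SparkSession.builder.getOrCreate()
>>> dp_df = spark.createDataFrame(pd.DataFrame([{"A": 1, "B": "a"}]))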
>>> dp_outputs = {"O": dp_df}
>>> baseline_df = spark.createDataFrame(pd.DataFrame([{"A": 5}]))
>>> baseline_outputs = {"default": {"O": baseline_df}}
>>> def size_difference(grouped_output: GroupedData,
...                     grouping_columns: List[str],
...                     output_column_name: str):
...     in_dp = (col("indicator") == "both") | (col("indicator") == "dp")
...     in_baseline = ((col("indicator") == "both") |
...          (col("indicator") == "baseline"))
...     dp_count = sf.sum(sf.when(in_dp, sf.lit(1)).otherwise(0))
...     baseline_count = sf.sum(sf.when(in_baseline, sf.lit(1)).otherwise(0))
...     grouped_size_difference = grouped_output.agg(
...         sf.abs(dp_count - baseline_count).alias(output_column_name)
...     )
...     if not grouping_columns:
...         size_difference = grouped_size_difference.head(1)[0]
...         if (
...             len(size_difference) == 0 or
...             size_difference[output_column_name] is None
...         ):  # this means the column is empty
...             return 0
...         return size_difference[output_column_name]
...     else:
...         return grouped_size_difference
>>> metric = CustomGroupedMetric(
...     func=size_difference,
...     name="Output size difference",
...     description="Difference in number of rows.",
...     output="O",
...     join_columns=["A"],
...     indicator_column_name="indicator",
... )
>>> result = metric(dp_outputs, baseline_outputs)[0].value
>>> result
0
>>> metric.format(result)
'0'

Methods#

func()

Returns function to be applied.

format()

Converts value to human-readable format.

check_compatibility_with_program()

Checks if the metric is compatible with the program.

compute_on_grouped_output()

Computes metric value from the joined, grouped DP and baseline output.

grouping_columns()

Returns the names of the grouping columns.

check_compatibility_with_outputs()

Check that a particular set of outputs is compatible with the metric.

format_as_table_row()

Return a table row summarizing the metric result.

format_as_dataframe()

Returns the results of this metric formatted as a dataframe.

output()

Returns the name of the run output or view name.

join_columns()

Returns the names of the join columns.

indicator_column_name()

Returns the name of the indicator column.

check_join_key_uniqueness()

Check if the join keys uniquely identify rows in the joined DataFrame.

compute_for_baseline()

Computes metric value.

check_compatibility_with_data()

Check that the outputs have all the structure the metric expects.

name()

Returns the name of the metric.

description()

Returns the description of the metric.

baselines()

Returns the baselines used for the metric.

__call__()

Computes the given metric on the given DP and baseline outputs.

__init__(func, output, join_columns, *, name, description=None, baselines=None, grouping_columns=None, join_how='inner', dropna_columns=None, indicator_column_name=None)#

Constructor.

Parameters
  • func (Callable) – Function for computing a metric value from a grouped, joined DP and baseline output. Should return either a single value (if there are no grouping columns) or, if grouped, a DataFrame with a column whose name is the value of output_column_name and that contains the metric’s result values.

  • output (str) – The output to compute the metric for.

  • join_columns (List[str]) – The columns to join on.

  • name (str) – A name for the metric.

  • description (Optional[str], default: None) – A description of the metric.

  • baselines (Optional[Union[str, List[str]]], default: None) – The name of the baseline program(s) used for the error report. If None, use all baselines specified as custom baselines and baseline options on the tuner class. If no baselines are specified on the tuner class, use the default baseline. If a string, use only that baseline. If a list, use only those baselines.

  • grouping_columns (Optional[List[str]], default: None) – A set of columns that will be used to group the DP and baseline outputs. The error metric should be calculated for each group and returned in a table. If grouping columns are None, the metric will be calculated over the whole output and returned as a single number.

  • join_how (str, default: 'inner') – The type of join to perform (e.g. “inner”).

  • dropna_columns (Optional[List[str]], default: None) – If specified, rows with nulls in these columns will be dropped.

  • indicator_column_name (Optional[str], default: None) – If specified, a column with this name is added to the joined data. It contains “dp”, “baseline”, or “both” to indicate where the values in the row came from.
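To illustrate how the indicator column interacts with the join type, here is a library-free sketch of an outer join on a single key. It assumes an outer join keeps keys found on either side; the helper outer_join_with_indicator and the column name "key" are hypothetical and only model the behavior described above.

```python
from typing import Any, Dict, Iterable, List


def outer_join_with_indicator(
    dp_keys: Iterable[Any],
    baseline_keys: Iterable[Any],
    indicator_column_name: str = "indicator",
) -> List[Dict[str, Any]]:
    """Outer-join two key lists and record where each key was found."""
    dp_set, baseline_set = set(dp_keys), set(baseline_keys)
    rows = []
    for key in sorted(dp_set | baseline_set):
        if key in dp_set and key in baseline_set:
            source = "both"
        elif key in dp_set:
            source = "dp"
        else:
            source = "baseline"
        rows.append({"key": key, indicator_column_name: source})
    return rows


rows = outer_join_with_indicator([1, 2], [2, 3, 4])
# Count the rows contributed by each side, as a size-difference metric might.
dp_count = sum(r["indicator"] in ("dp", "both") for r in rows)
baseline_count = sum(r["indicator"] in ("baseline", "both") for r in rows)
size_difference = abs(dp_count - baseline_count)
print(size_difference)  # 1
```

With an inner join, only the "both" rows would survive, so the indicator is most informative for joins that keep unmatched rows.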

property func#

Returns function to be applied.

Return type

Callable

format(value)#

Converts value to human-readable format.

Parameters

value (Any) –

check_compatibility_with_program(program, output_views)#

Checks if the metric is compatible with the program.

compute_on_grouped_output(grouped_output, baseline_name, unprotected_inputs=None, program_parameters=None)#

Computes metric value from the joined, grouped DP and baseline output.

If grouping columns are empty, the grouped output will have one group that is the entire dataset.
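The single-group behavior can be sketched without Spark. The helper below is a hypothetical stand-in for grouping, not the library's implementation: it keys each row by the tuple of its grouping-column values, so an empty list of grouping columns maps every row to the same empty key.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def group_rows(
    rows: List[Row], grouping_columns: List[str]
) -> Dict[Tuple[Any, ...], List[Row]]:
    """Group rows by the values of the given columns."""
    groups: Dict[Tuple[Any, ...], List[Row]] = defaultdict(list)
    for row in rows:
        # With no grouping columns the key is always the empty tuple.
        key = tuple(row[c] for c in grouping_columns)
        groups[key].append(row)
    return dict(groups)


rows = [
    {"region": "east", "err": 1},
    {"region": "west", "err": 3},
    {"region": "east", "err": 2},
]
by_region = group_rows(rows, ["region"])
whole = group_rows(rows, [])
print(sorted(by_region))  # [('east',), ('west',)]
print(list(whole))        # [()]
```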

property grouping_columns#

Returns the names of the grouping columns.

Return type

List[str]

check_compatibility_with_outputs(outputs, output_name)#

Check that a particular set of outputs is compatible with the metric.

Should throw a ValueError if the metric is not compatible.
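As a rough illustration of that contract, the sketch below raises ValueError when a named output is missing. The function check_output_present and its error message are hypothetical; the checks the library actually performs are not described here.

```python
from typing import Any, Dict


def check_output_present(outputs: Dict[str, Any], output_name: str) -> None:
    """Raise ValueError when the named output is missing."""
    if output_name not in outputs:
        available = ", ".join(sorted(outputs)) or "<none>"
        raise ValueError(
            f"Output '{output_name}' not found; available outputs: {available}"
        )


check_output_present({"O": []}, "O")  # compatible: returns silently

try:
    check_output_present({"O": []}, "P")
except ValueError as err:
    message = str(err)
    print(message)
```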

format_as_table_row(result)#

Return a table row summarizing the metric result.

Parameters

result (tmlt.analytics.metrics.MetricResult) –

Return type

pandas.DataFrame

format_as_dataframe(result)#

Returns the results of this metric formatted as a dataframe.

Parameters

result (tmlt.analytics.metrics.MetricResult) –

Return type

pandas.DataFrame

property output#

Returns the name of the run output or view name.

Return type

str

property join_columns#

Returns the names of the join columns.

Return type

List[str]

property indicator_column_name#

Returns the name of the indicator column.

Return type

Optional[str]

check_join_key_uniqueness(joined_output)#

Check if the join keys uniquely identify rows in the joined DataFrame.

Parameters

joined_output (pyspark.sql.DataFrame) –
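One way to picture a join-key uniqueness check is to count how often each key combination occurs. The sketch below does this on a list of dictionaries. The helper duplicated_join_keys is hypothetical, and how the real method reports duplicates is not stated in this reference.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def duplicated_join_keys(
    rows: List[Dict[str, Any]], join_columns: List[str]
) -> List[Tuple[Any, ...]]:
    """Return the join-key combinations that appear in more than one row."""
    counts = Counter(tuple(row[c] for c in join_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"A": 1, "B": "x"},
    {"A": 1, "B": "y"},
    {"A": 2, "B": "x"},
]
print(duplicated_join_keys(rows, ["A"]))       # [(1,)]
print(duplicated_join_keys(rows, ["A", "B"]))  # []
```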

compute_for_baseline(baseline_name, dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)#

Computes metric value.

check_compatibility_with_data(dp_outputs, baseline_outputs)#

Check that the outputs have all the structure the metric expects.

Should throw a ValueError if the metric is not compatible.

property name#

Returns the name of the metric.

Return type

str

property description#

Returns the description of the metric.

Return type

str

property baselines#

Returns the baselines used for the metric.

Return type

Optional[Union[str, List[str]]]

__call__(dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)#

Computes the given metric on the given DP and baseline outputs.

Parameters
  • dp_outputs (Dict[str, pyspark.sql.DataFrame]) – The differentially private outputs of the program.

  • baseline_outputs (Dict[str, Dict[str, pyspark.sql.DataFrame]]) – The outputs of the baseline programs.

  • unprotected_inputs (Optional[Dict[str, pyspark.sql.DataFrame]]) – Optional public dataframes used in error computation.

  • program_parameters (Optional[Dict[str, Any]]) – Optional program specific parameters used in error computation.

Return type

List[tmlt.analytics.metrics.MetricResult]

class CustomSingleOutputMetric(func, output, *, name, description=None, baselines=None)#

Bases: tmlt.analytics.metrics._base.SingleBaselineMetric

Wrapper to allow users to define a metric that operates on a single output table.

Turns a function that calculates error on two dataframes (one DP, one baseline) into a Metric.
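The wrapping idea can be sketched in plain Python. The classes below are hypothetical stand-ins, not the library's implementation: tables are lists of dictionaries rather than Spark DataFrames, and the sketch assumes baseline outputs are keyed by baseline name and then by output name, as in the example that follows.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Table = List[Dict[str, Any]]


@dataclass
class SketchResult:
    """Hypothetical result record: one value per baseline."""

    baseline: str
    value: Any


class SketchSingleOutputMetric:
    """Hypothetical wrapper that applies func to one named output."""

    def __init__(self, func: Callable[[Table, Table], Any], output: str, *, name: str):
        self.func = func
        self.output = output
        self.name = name

    def __call__(
        self,
        dp_outputs: Dict[str, Table],
        baseline_outputs: Dict[str, Dict[str, Table]],
    ) -> List[SketchResult]:
        dp_table = dp_outputs[self.output]
        # Apply the user function once per baseline, on the chosen output only.
        return [
            SketchResult(baseline, self.func(dp_table, outputs[self.output]))
            for baseline, outputs in baseline_outputs.items()
        ]


def size_difference(dp_table: Table, baseline_table: Table) -> int:
    return len(baseline_table) - len(dp_table)


metric = SketchSingleOutputMetric(size_difference, "O", name="Output size difference")
results = metric({"O": [{"A": 5}]}, {"default": {"O": [{"A": 5}]}})
print(results[0].value)  # 0
```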

Note

This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.

Example

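>>> import pandas as pd
>>> from pyspark.sql import DataFrame, SparkSession
>>> from tmlt.analytics.metrics import CustomSingleOutputMetric  # import path assumed
>>> spark = SparkSession.builder.getOrCreate()
>>> dp_df = spark.createDataFrame(pd.DataFrame({"A": [5]}))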
>>> dp_outputs = {"O": dp_df}
>>> baseline_df = spark.createDataFrame(pd.DataFrame({"A": [5]}))
>>> baseline_outputs = {"default": {"O": baseline_df}}
>>> def size_difference(dp_outputs: DataFrame, baseline_outputs: DataFrame):
...     return baseline_outputs.count() - dp_outputs.count()
>>> metric = CustomSingleOutputMetric(
...     func=size_difference,
...     name="Output size difference",
...     description="Difference in number of rows.",
...     output="O",
... )
>>> result = metric(dp_outputs, baseline_outputs)[0].value
>>> result
0
>>> metric.format(result)
'0'

Methods#

output()

Returns the name of the run output or view name.

func()

Returns function to be applied.

format()

Converts value to human-readable format.

format_as_table_row()

Return a table row summarizing the metric result.

check_compatibility_with_program()

Checks if the metric is compatible with the program.

check_compatibility_with_outputs()

Check that a particular set of outputs is compatible with the metric.

compute_for_baseline()

Returns the metric value given the DP outputs and the baseline outputs.

check_compatibility_with_data()

Check that the outputs have all the structure the metric expects.

name()

Returns the name of the metric.

description()

Returns the description of the metric.

baselines()

Returns the baselines used for the metric.

format_as_dataframe()

Returns the results of this metric formatted as a dataframe.

__call__()

Computes the given metric on the given DP and baseline outputs.

__init__(func, output, *, name, description=None, baselines=None)#

Constructor.

Parameters
  • func (Callable) – Function for computing a metric value from DP outputs and a single baseline’s outputs.

  • output (str) – The output to calculate the metric over. This is required, even if the program produces a single output.

  • name (str) – A name for the metric.

  • description (Optional[str], default: None) – A description of the metric.

  • baselines (Optional[Union[str, List[str]]], default: None) – The name of the baseline program(s) used for the error report. If None, use all baselines specified as custom baselines and baseline options on the tuner class. If no baselines are specified on the tuner class, use the default baseline. If a string, use only that baseline. If a list, use only those baselines.

property output#

Returns the name of the run output or view name.

Return type

str

property func#

Returns function to be applied.

Return type

Callable

format(value)#

Converts value to human-readable format.

Parameters

value (Any) –

format_as_table_row(result)#

Return a table row summarizing the metric result.

Parameters

result (tmlt.analytics.metrics._base.MetricResult) –

Return type

pandas.DataFrame

check_compatibility_with_program(program, output_views)#

Checks if the metric is compatible with the program.

check_compatibility_with_outputs(outputs, output_name)#

Check that a particular set of outputs is compatible with the metric.

Should throw a ValueError if the metric is not compatible.

compute_for_baseline(baseline_name, dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)#

Returns the metric value given the DP outputs and the baseline outputs.

check_compatibility_with_data(dp_outputs, baseline_outputs)#

Check that the outputs have all the structure the metric expects.

Should throw a ValueError if the metric is not compatible.

property name#

Returns the name of the metric.

Return type

str

property description#

Returns the description of the metric.

Return type

str

property baselines#

Returns the baselines used for the metric.

Return type

Optional[Union[str, List[str]]]

format_as_dataframe(result)#

Returns the results of this metric formatted as a dataframe.

Parameters

result (tmlt.analytics.metrics.MetricResult) –

Return type

MetricResultDataframe

__call__(dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)#

Computes the given metric on the given DP and baseline outputs.

Parameters
  • dp_outputs (Dict[str, pyspark.sql.DataFrame]) – The differentially private outputs of the program.

  • baseline_outputs (Dict[str, Dict[str, pyspark.sql.DataFrame]]) – The outputs of the baseline programs.

  • unprotected_inputs (Optional[Dict[str, pyspark.sql.DataFrame]]) – Optional public dataframes used in error computation.

  • program_parameters (Optional[Dict[str, Any]]) – Optional program specific parameters used in error computation.

Return type

List[tmlt.analytics.metrics.MetricResult]

class CustomMultiBaselineMetric(output, func, *, name, description=None, baselines=None)#

Bases: tmlt.analytics.metrics._base.MultiBaselineMetric

Wrapper to turn a function into a metric using DP outputs and multiple baselines’ outputs.

Note

This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.

Example

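>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.analytics.metrics import (  # import path assumed
...     AbsoluteError,
...     CustomMultiBaselineMetric,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> dp_df = spark.createDataFrame(pd.DataFrame({"A": [5]}))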
>>> dp_outputs = {"O": dp_df}
>>> baseline_df1 = spark.createDataFrame(pd.DataFrame({"A": [5]}))
>>> baseline_df2 = spark.createDataFrame(pd.DataFrame({"A": [6]}))
>>> baseline_outputs = {
...     "O": {"baseline1": baseline_df1, "baseline2": baseline_df2}
... }
>>> _func = lambda dp_outputs, baseline_outputs: {
...     output_key: {
...         baseline_key: AbsoluteError(output_key).compute_on_scalar(
...             dp_output.first().A, baseline_output.first().A
...         )
...         for baseline_key, baseline_output
...         in baseline_outputs[output_key].items()
...     }
...     for output_key, dp_output in dp_outputs.items()
... }
>>> metric = CustomMultiBaselineMetric(
...     output="O",
...     func=_func,
...     name="Custom Metric",
...     description="Custom Description",
... )
>>> result = metric.compute_for_multiple_baselines(dp_outputs, baseline_outputs)
>>> result
{'O': {'baseline1': 0, 'baseline2': 1}}

Methods#

output()

Returns the name of the run output or view name.

func()

Returns function to be applied.

format()

Converts value to human-readable format.

format_as_table_row()

Return a table row summarizing the metric result.

check_compatibility_with_program()

Checks if the metric is compatible with the program.

check_compatibility_with_data()

Check that the outputs have all the structure the metric expects.

compute_for_multiple_baselines()

Returns the metric value given the DP and multiple baseline outputs.

compute()

Computes the given metric on the given DP and baseline outputs.

name()

Returns the name of the metric.

description()

Returns the description of the metric.

baselines()

Returns the baselines used for the metric.

format_as_dataframe()

Returns the results of this metric formatted as a dataframe.

__call__()

Computes the given metric on the given DP and baseline outputs.

__init__(output, func, *, name, description=None, baselines=None)#

Constructor.

Parameters
  • output (str) – The output to compute the metric for.

  • func (Callable) – Function for computing a metric value from DP outputs and multiple baseline outputs.

  • name (str) – A name for the metric.

  • description (Optional[str], default: None) – A description of the metric.

  • baselines (Optional[Union[str, List[str]]], default: None) – The name of the baseline program(s) used for the error report. If None, use all baselines specified as custom baselines and baseline options on the tuner class. If no baselines are specified on the tuner class, use the default baseline. If a string, use only that baseline. If a list, use only those baselines.

property output#

Returns the name of the run output or view name.

Return type

str

property func#

Returns function to be applied.

Return type

Callable

format(value)#

Converts value to human-readable format.

Parameters

value (Any) –

format_as_table_row(result)#

Return a table row summarizing the metric result.

Parameters

result (tmlt.analytics.metrics._base.MetricResult) –

Return type

pandas.DataFrame

check_compatibility_with_program(program, output_views)#

Checks if the metric is compatible with the program.

check_compatibility_with_data(dp_outputs, baseline_outputs)#

Check that the outputs have all the structure the metric expects.

Should throw a ValueError if the metric is not compatible.

compute_for_multiple_baselines(dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)#

Returns the metric value given the DP and multiple baseline outputs.

compute(dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)#

Computes the given metric on the given DP and baseline outputs.

The baseline_outputs will already be filtered to only include the baselines that the metric is supposed to use.
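The filtering step mentioned above might look roughly like the following sketch. The helper select_baselines is hypothetical. It assumes that None means "keep every baseline that was passed in", that a string selects one baseline, and that a list selects several. The library's own resolution against the tuner class is not modeled here.

```python
from typing import Any, Dict, List, Optional, Union


def select_baselines(
    baseline_outputs: Dict[str, Any],
    baselines: Optional[Union[str, List[str]]] = None,
) -> Dict[str, Any]:
    """Keep only the baselines a metric is configured to use."""
    if baselines is None:
        return dict(baseline_outputs)
    names = [baselines] if isinstance(baselines, str) else list(baselines)
    missing = [name for name in names if name not in baseline_outputs]
    if missing:
        raise ValueError(f"Unknown baselines: {missing}")
    return {name: baseline_outputs[name] for name in names}


all_outputs = {"default": {"O": "table-1"}, "custom": {"O": "table-2"}}
print(sorted(select_baselines(all_outputs)))         # ['custom', 'default']
print(list(select_baselines(all_outputs, "custom")))  # ['custom']
```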

Parameters
  • dp_outputs (Dict[str, pyspark.sql.DataFrame]) – The differentially private outputs of the program.

  • baseline_outputs (Dict[str, Dict[str, pyspark.sql.DataFrame]]) – The outputs of the baseline programs, after filtering to only include the baselines that the metric is supposed to use.

  • unprotected_inputs (Optional[Dict[str, pyspark.sql.DataFrame]]) – Optional public dataframes used in error computation.

  • program_parameters (Optional[Dict[str, Any]]) – Optional program specific parameters used in error computation.

Return type

List[tmlt.analytics.metrics.MetricResult]

property name#

Returns the name of the metric.

Return type

str

property description#

Returns the description of the metric.

Return type

str

property baselines#

Returns the baselines used for the metric.

Return type

Optional[Union[str, List[str]]]

format_as_dataframe(result)#

Returns the results of this metric formatted as a dataframe.

Parameters

result (tmlt.analytics.metrics.MetricResult) –

Return type

MetricResultDataframe

__call__(dp_outputs, baseline_outputs, unprotected_inputs=None, program_parameters=None)#

Computes the given metric on the given DP and baseline outputs.

Parameters
  • dp_outputs (Dict[str, pyspark.sql.DataFrame]) – The differentially private outputs of the program.

  • baseline_outputs (Dict[str, Dict[str, pyspark.sql.DataFrame]]) – The outputs of the baseline programs.

  • unprotected_inputs (Optional[Dict[str, pyspark.sql.DataFrame]]) – Optional public dataframes used in error computation.

  • program_parameters (Optional[Dict[str, Any]]) – Optional program specific parameters used in error computation.

Return type

List[tmlt.analytics.metrics.MetricResult]