aggregations

Derived measurements for computing noisy aggregates on Spark DataFrames.

Functions

create_count_measurement()

Returns a noisy count measurement.

create_count_distinct_measurement()

Returns a noisy count_distinct measurement.

create_sum_measurement()

Returns a noisy sum measurement.

create_average_measurement()

Returns a noisy average measurement.

create_variance_measurement()

Returns a noisy variance measurement.

create_standard_deviation_measurement()

Returns a noisy standard deviation measurement.

create_quantile_measurement()

Returns a noisy quantile measurement.

get_midpoint()

Returns the midpoint of lower and upper.

create_partition_selection_measurement()

Returns a partition selection measurement.

create_bound_selection_measurement()

Returns a bound selection measurement.

create_count_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, d_in=1, groupby_transformation=None, count_column=None)

Returns a noisy count measurement.

This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is calibrated to the specified noise_mechanism so that this guarantee holds.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, the “rho” parameter if output_measure is RhoZCDP, and (“epsilon”, “delta”) if output_measure is ApproxDP.

Note

ApproxDP budgets with delta>0 are not yet supported.

Parameters
Return type

tmlt.core.measurements.base.Measurement
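
The privacy contract above can be illustrated with a minimal, library-independent sketch. It assumes a pure-DP budget (d_out read as epsilon), a row-level add/remove metric under which a count changes by at most d_in, and the textbook Laplace calibration scale = d_in / epsilon. The names `laplace_scale`, `sample_laplace`, and `noisy_count_sketch` are hypothetical and are not part of tmlt.core; the library's own calibration may differ.

```python
import random


def laplace_scale(d_in, epsilon):
    """Textbook Laplace scale for a query with sensitivity d_in (assumption)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return d_in / epsilon


def sample_laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count_sketch(rows, epsilon, d_in=1, rng=None):
    """Hypothetical stand-in: true count plus calibrated Laplace noise."""
    rng = rng or random.Random()
    return len(rows) + sample_laplace(laplace_scale(d_in, epsilon), rng)


print(noisy_count_sketch(["a", "b", "c"], epsilon=1.0))
```

Halving epsilon doubles the noise scale in this sketch, which is the trade-off the d_out parameter controls.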

create_count_distinct_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, d_in=1, groupby_transformation=None, count_column=None)

Returns a noisy count_distinct measurement.

This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is calibrated to the specified noise_mechanism so that this guarantee holds.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, the “rho” parameter if output_measure is RhoZCDP, and (“epsilon”, “delta”) if output_measure is ApproxDP.

Note

ApproxDP budgets with delta>0 are not yet supported.

Parameters
Return type

tmlt.core.measurements.base.Measurement

create_sum_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, sum_column=None)

Returns a noisy sum measurement.

This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is calibrated to the specified noise_mechanism so that this guarantee holds.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, the “rho” parameter if output_measure is RhoZCDP, and (“epsilon”, “delta”) if output_measure is ApproxDP.

Note

ApproxDP budgets with delta>0 are not yet supported.

Parameters
  • input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input Spark DataFrames.

  • input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.

  • output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.ApproxDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP, RhoZCDP, or ApproxDP).

  • d_out (tmlt.core.measures.PrivacyBudgetInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP, “rho” if it is RhoZCDP, and (“epsilon”, “delta”) if it is ApproxDP.

  • noise_mechanism (NoiseMechanism) – Noise mechanism to be applied to the sum(s).

  • measure_column (str) – Column to be summed.

  • lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound on measure_column.

  • upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound on measure_column.

  • d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.

  • groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy sums for each group obtained by applying the groupby transformation. If None, this measurement outputs a single number - the noisy sum.

  • sum_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for sums in the DataFrame output by the measurement. If None, this column will be named “sum(<measure_column>)”.

Return type

tmlt.core.measurements.base.Measurement
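
The role of the lower and upper clipping bounds can be sketched without the library. The example assumes a row-level add/remove metric, under which adding or removing one clipped row changes the sum by at most max(|lower|, |upper|). The helper names are hypothetical, and the sensitivity the library actually derives may differ for other input metrics.

```python
def clip(value, lower, upper):
    """Clamp a single value into [lower, upper]."""
    return min(max(value, lower), upper)


def clipped_sum(values, lower, upper):
    """Sum after clipping; this is the quantity that noise would be added to."""
    return sum(clip(v, lower, upper) for v in values)


def sum_sensitivity(lower, upper, d_in=1):
    # Assumed sensitivity under add/remove of up to d_in rows:
    # each row contributes at most max(|lower|, |upper|) in magnitude.
    return d_in * max(abs(lower), abs(upper))


print(clipped_sum([-5, 2, 40], lower=0, upper=10))
print(sum_sensitivity(lower=0, upper=10))
```

Tighter bounds reduce the sensitivity, and therefore the noise, at the cost of more bias from clipping.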

create_average_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, average_column=None, keep_intermediates=False, sum_column=None, count_column=None)

Returns a noisy average measurement.

This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is calibrated to the specified noise_mechanism so that this guarantee holds.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, the “rho” parameter if output_measure is RhoZCDP, and (“epsilon”, “delta”) if output_measure is ApproxDP.

Note

ApproxDP budgets with delta>0 are not yet supported.

Parameters
  • input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.

  • input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.

  • output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.ApproxDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP, RhoZCDP, or ApproxDP).

  • d_out (tmlt.core.measures.PrivacyBudgetInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP, “rho” if it is RhoZCDP, and (“epsilon”, “delta”) if it is ApproxDP.

  • noise_mechanism (NoiseMechanism) – Noise mechanism to apply.

  • measure_column (str) – Name of the column to compute the average of.

  • lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound for measure_column.

  • upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound for measure_column.

  • d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.

  • groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy averages for each group obtained from the groupby transformation. If None, this measurement outputs a single number - the noisy average.

  • average_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy average in the DataFrame output by the measurement. If None, this column will be named “avg(<measure_column>)”.

  • keep_intermediates (bool) – If True, intermediates (noisy sum of deviations and noisy count) will also be output in addition to the noisy average.

  • sum_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums in the DataFrame output by the measurement. If None, this column will be named “sum(<measure_column>)”.

  • count_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate counts in the DataFrame output by the measurement. If None, this column will be named “count”.

Return type

Union[tmlt.core.measurements.postprocess.PostProcess, tmlt.core.measurements.converters.PureDPToApproxDP]
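
The intermediates mentioned under keep_intermediates (a noisy sum of deviations and a noisy count) suggest how an average can be recovered by post-processing. The sketch below assumes the deviations are taken from the midpoint of lower and upper, that a non-positive noisy count falls back to the midpoint, and that the result is clamped to [lower, upper]. All three are assumptions for illustration, and `average_from_intermediates` is a hypothetical name.

```python
def average_from_intermediates(noisy_sod, noisy_count, lower, upper):
    """Recover an average from a noisy sum of deviations and a noisy count."""
    midpoint = (lower + upper) / 2
    if noisy_count <= 0:
        # Assumed fallback when noise drives the count to zero or below.
        return float(midpoint)
    estimate = midpoint + noisy_sod / noisy_count
    # Clamp so that noise cannot push the answer outside the clipping range.
    return min(max(estimate, float(lower)), float(upper))


# Noise-free check with values [1, 2, 3] and bounds [0, 10]:
# deviations from the midpoint 5 are -4, -3, -2, so the sum is -9.
print(average_from_intermediates(-9.0, 3.0, 0, 10))
```

Because the function only consumes already-noisy values, it adds no privacy cost of its own.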

create_variance_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, variance_column=None, keep_intermediates=False, sum_of_deviations_column=None, sum_of_squared_deviations_column=None, count_column=None)

Returns a noisy variance measurement.

This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is calibrated to the specified noise_mechanism so that this guarantee holds.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, the “rho” parameter if output_measure is RhoZCDP, and (“epsilon”, “delta”) if output_measure is ApproxDP.

Note

ApproxDP budgets with delta>0 are not yet supported.

Parameters
  • input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.

  • input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.

  • output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.ApproxDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP, RhoZCDP, or ApproxDP).

  • d_out (tmlt.core.measures.PrivacyBudgetInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP, “rho” if it is RhoZCDP, and (“epsilon”, “delta”) if it is ApproxDP.

  • noise_mechanism (NoiseMechanism) – Noise mechanism to apply.

  • measure_column (str) – Name of the column to compute the variance of.

  • lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound for measure_column.

  • upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound for measure_column.

  • d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.

  • groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with a noisy variance for each group obtained from the groupby transformation. If None, this measurement outputs a single number - the noisy variance.

  • variance_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy variance in the DataFrame output by the measurement. If None, this column will be named “var(<measure_column>)”.

  • keep_intermediates (bool) – If True, intermediates (noisy sum of deviations, noisy sum of squared deviations and noisy count) will also be output in addition to the noisy variance.

  • sum_of_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of deviations in the DataFrame output by the measurement. If None, this column will be named “sod(<measure_column>)”.

  • sum_of_squared_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of squared deviations in the DataFrame output by the measurement. If None, this column will be named “sos(<measure_column>)”.

  • count_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate counts in the DataFrame output by the measurement. If None, this column will be named “count”.

Return type

Union[tmlt.core.measurements.postprocess.PostProcess, tmlt.core.measurements.converters.PureDPToApproxDP]
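
The three intermediates listed under keep_intermediates are enough to compute a variance, since variance is unchanged by shifting every value by a constant. The following sketch assumes deviations are measured from the midpoint of lower and upper, uses the population form E[d²] − (E[d])², and clamps the result to [0, ((upper − lower)/2)²]. The function name and the fallback for a non-positive count are hypothetical choices, not the library's documented behavior.

```python
def variance_from_intermediates(noisy_sod, noisy_sos, noisy_count, lower, upper):
    """Population-style variance from noisy sums of (squared) deviations."""
    if noisy_count <= 0:
        return 0.0  # assumed fallback; the library may choose differently
    mean_dev = noisy_sod / noisy_count
    mean_sq_dev = noisy_sos / noisy_count
    estimate = mean_sq_dev - mean_dev ** 2
    # No data set confined to [lower, upper] has variance above this bound.
    max_variance = ((upper - lower) / 2) ** 2
    return min(max(estimate, 0.0), max_variance)


# Noise-free check with values [1, 2, 3] and bounds [0, 10]:
# deviations from 5 are -4, -3, -2; their squares sum to 29.
print(variance_from_intermediates(-9.0, 29.0, 3.0, 0, 10))
```

Clamping at zero matters in practice because independent noise on the two sums can make the raw estimate negative.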

create_standard_deviation_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, standard_deviation_column=None, keep_intermediates=False, sum_of_deviations_column=None, sum_of_squared_deviations_column=None, count_column=None)

Returns a noisy standard deviation measurement.

This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is calibrated to the specified noise_mechanism so that this guarantee holds.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, the “rho” parameter if output_measure is RhoZCDP, and (“epsilon”, “delta”) if output_measure is ApproxDP.

Note

ApproxDP budgets with delta>0 are not yet supported.

Parameters
  • input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.

  • input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.

  • output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.ApproxDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP, RhoZCDP, or ApproxDP).

  • d_out (tmlt.core.measures.PrivacyBudgetInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP, “rho” if it is RhoZCDP, and (“epsilon”, “delta”) if it is ApproxDP.

  • noise_mechanism (NoiseMechanism) – Noise mechanism to apply.

  • measure_column (str) – Name of the column to compute the standard deviation of.

  • lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound for measure_column.

  • upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound for measure_column.

  • d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.

  • groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy standard deviations for each group obtained by applying the groupby transformation. If None, this measurement outputs a single number - the noisy standard deviation of measure_column.

  • standard_deviation_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy standard deviation in the DataFrame output by the measurement. If None, this column will be named “stddev(<measure_column>)”.

  • keep_intermediates (bool) – If True, intermediates (noisy sum of deviations, noisy sum of squared deviations and noisy count) will also be output in addition to the noisy standard deviation.

  • sum_of_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of deviations in the DataFrame output by the measurement. If None, this column will be named “sod(<measure_column>)”.

  • sum_of_squared_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of squared deviations in the DataFrame output by the measurement. If None, this column will be named “sos(<measure_column>)”.

  • count_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate counts in the DataFrame output by the measurement. If None, this column will be named “count”.

Return type

Union[tmlt.core.measurements.postprocess.PostProcess, tmlt.core.measurements.converters.PureDPToApproxDP]
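
Since the intermediates are the same as for the variance, one plausible reading is that the standard deviation is obtained by post-processing a noisy variance. The sketch below shows only that final step and is an assumption about the approach, not a description of the library's code; `std_from_variance` is a hypothetical name.

```python
import math


def std_from_variance(noisy_variance):
    """Square root of a noisy variance, guarding against negative inputs."""
    # Noise can make the variance estimate slightly negative, so clamp first.
    return math.sqrt(max(noisy_variance, 0.0))


print(std_from_variance(4.0))
```

Taking a square root of an already-private value is post-processing, so it does not consume additional privacy budget.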

create_quantile_measurement(input_domain, input_metric, output_measure, d_out, measure_column, quantile, lower, upper, d_in=1, groupby_transformation=None, quantile_column=None)

Returns a noisy quantile measurement.

This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, the “rho” parameter if output_measure is RhoZCDP, and (“epsilon”, “delta”) if output_measure is ApproxDP.

Note

ApproxDP budgets with delta>0 are not yet supported.

Parameters
Return type

Union[tmlt.core.measurements.postprocess.PostProcess, tmlt.core.measurements.converters.PureDPToApproxDP]
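
This function takes no noise_mechanism argument, and the page does not say which mechanism it uses. For readers unfamiliar with private quantiles, here is a textbook exponential-mechanism sketch: it picks an interval between adjacent clipped values with probability proportional to its width times exp(epsilon/2 · utility), then samples uniformly inside it. The function name, the utility (negative rank distance), and the epsilon/2 factor are assumptions for illustration and may not match the library.

```python
import math
import random


def dp_quantile_sketch(values, quantile, lower, upper, epsilon, rng=None):
    """Textbook exponential-mechanism quantile (illustrative only)."""
    if not lower < upper:
        raise ValueError("lower must be strictly less than upper")
    rng = rng or random.Random()
    # Clip to [lower, upper], sort, and use the bounds as outer endpoints.
    clipped = sorted(min(max(v, lower), upper) for v in values)
    points = [lower] + clipped + [upper]
    n = len(clipped)
    target_rank = quantile * n
    # Interval i is [points[i], points[i + 1]]; i records lie at or below it.
    log_weights = []
    for i in range(n + 1):
        width = points[i + 1] - points[i]
        if width <= 0:
            log_weights.append(-math.inf)  # empty interval, never chosen
        else:
            utility = -abs(i - target_rank)
            log_weights.append(math.log(width) + (epsilon / 2) * utility)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_weights)
    weights = [math.exp(w - top) for w in log_weights]
    chosen = rng.choices(range(n + 1), weights=weights)[0]
    return rng.uniform(points[chosen], points[chosen + 1])


print(dp_quantile_sketch(list(range(101)), 0.5, 0, 100, epsilon=1.0))
```

The output always lies in [lower, upper], which is why the bounds are required even though no clipping-based sensitivity is computed.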

get_midpoint(lower, upper, integer_midpoint=False)

Returns the midpoint of lower and upper.

If integer_midpoint is True, the midpoint is rounded to the nearest integer using round().

Examples

>>> get_midpoint(1, 2)
(1.5, 3/2)
>>> get_midpoint(1, 5)
(3.0, 3)
>>> get_midpoint("0.2", "0.3")
(0.25, 1/4)
>>> get_midpoint(1, 9, integer_midpoint=True)
(5, 5)

Parameters
  • lower (tmlt.core.utils.exact_number.ExactNumberInput) –

  • upper (tmlt.core.utils.exact_number.ExactNumberInput) –

  • integer_midpoint (bool) –

Return type

Tuple[Union[float, int], tmlt.core.utils.exact_number.ExactNumber]
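
The doctest outputs above pair a float or int with an exact rational. A minimal stand-in that reproduces those four examples can be written with the standard library's `fractions.Fraction` in place of `ExactNumber`. `midpoint_sketch` is a hypothetical name, and applying `round()` to the exact value (rather than to a float) is an assumption.

```python
from fractions import Fraction


def midpoint_sketch(lower, upper, integer_midpoint=False):
    """Return (approximate, exact) midpoint, using Fraction as the exact type."""
    exact = (Fraction(lower) + Fraction(upper)) / 2
    if integer_midpoint:
        # Python's round() resolves ties to the nearest even integer.
        rounded = round(exact)
        return rounded, Fraction(rounded)
    return float(exact), exact


print(midpoint_sketch(1, 2))
print(midpoint_sketch("0.2", "0.3"))
print(midpoint_sketch(1, 9, integer_midpoint=True))
```

Passing decimal bounds as strings, as in the "0.2" and "0.3" example, keeps the arithmetic exact; a float such as 0.2 would carry its binary rounding error into the result.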

create_partition_selection_measurement(input_domain, epsilon, delta, d_in=1, count_column=None)

Returns a partition selection measurement.

A partition selection measurement created by this function will have a privacy guarantee such that measurement.privacy_function(d_in) = (epsilon, delta).

Parameters
  • input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of the input Spark DataFrames. Input cannot contain floating point columns.

  • epsilon (tmlt.core.utils.exact_number.ExactNumberInput) – The epsilon portion of the (epsilon, delta) privacy budget that you want this measurement to satisfy.

  • delta (tmlt.core.utils.exact_number.ExactNumberInput) – The delta portion of the (epsilon, delta) privacy budget that you want this measurement to satisfy.

  • d_in (tmlt.core.utils.exact_number.ExactNumberInput) – The given d_in such that measurement.privacy_function(d_in) = (epsilon, delta).

  • count_column (Optional[str]) – Column name for output group counts. If None, output column will be named “count”.

Return type

tmlt.core.measurements.spark_measurements.GeometricPartitionSelection

create_bound_selection_measurement(input_domain, output_measure, d_out, bound_column, threshold, d_in=1)

Returns a bound selection measurement.

A bound selection measurement created by this function will have a privacy guarantee such that measurement.privacy_function(d_in) = d_out.

Parameters
  • input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of the input Spark DataFrames.

  • output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.ApproxDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee.

  • d_out (tmlt.core.measures.PrivacyBudgetInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP, “rho” if it is RhoZCDP, and (“epsilon”, “delta”) if it is ApproxDP.

  • bound_column (str) – Column name to calculate the bounds for. The column must be an integer or floating point column.

  • threshold (float) – The threshold for the bound selection measurement.

  • d_in (tmlt.core.utils.exact_number.ExactNumberInput) – The given d_in such that measurement.privacy_function(d_in) = d_out.

Return type

tmlt.core.measurements.base.Measurement

Classes

NoiseMechanism

Enumerating noise mechanisms.

class NoiseMechanism

Bases: enum.Enum

Enumeration of the available noise mechanisms.

check_output_measure(output_measure)

Checks if the specified output measure is supported.

Parameters

output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) –

Return type

None

supported_output_measure()

Returns a list of output measures supported by this noise mechanism.

Return type

List[Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]]

name()

The name of the Enum member.

value()

The value of the Enum member.