aggregations#

Derived measurements for computing noisy aggregates on Spark DataFrames.

Functions#

`create_count_measurement()`	Returns a noisy count measurement.
`create_count_distinct_measurement()`	Returns a noisy count_distinct measurement.
`create_sum_measurement()`	Returns a noisy sum measurement.
`create_average_measurement()`	Returns a noisy average measurement.
`create_variance_measurement()`	Returns a noisy variance measurement.
`create_standard_deviation_measurement()`	Returns a noisy standard deviation measurement.
`create_quantile_measurement()`	Returns a noisy quantile measurement.
`get_midpoint()`	Returns the midpoint of lower and upper.

create_count_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, d_in=1, groupby_transformation=None, count_column=None)#

Returns a noisy count measurement.

This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are d_out-close under the output_measure. The noise scale is calibrated for the specified noise_mechanism so that this privacy property holds.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, otherwise it is interpreted as the “rho” parameter (if output_measure is RhoZCDP).
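As a rough illustration of how a noise scale can be derived from d_in and d_out, the sketch below uses the textbook calibrations for Laplace noise under pure DP and Gaussian noise under zCDP. The helper names (`laplace_scale`, `gaussian_sigma`, `noisy_count`) are hypothetical and not part of this module, and the sketch assumes the count changes by at most d_in between neighboring inputs.

```python
import math
import random


def laplace_scale(d_in: float, epsilon: float) -> float:
    """Laplace scale b for a query with sensitivity d_in under epsilon-DP."""
    return d_in / epsilon


def gaussian_sigma(d_in: float, rho: float) -> float:
    """Gaussian standard deviation for sensitivity d_in under rho-zCDP."""
    return d_in / math.sqrt(2 * rho)


def noisy_count(rows, epsilon: float, d_in: float = 1, rng=None) -> float:
    """True row count plus Laplace noise calibrated to (d_in, epsilon)."""
    rng = rng or random.Random()
    scale = laplace_scale(d_in, epsilon)
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return len(rows) + noise


print(laplace_scale(d_in=1, epsilon=0.5))  # 2.0
print(gaussian_sigma(d_in=1, rho=0.5))     # 1.0
```

A smaller d_out (stronger privacy) or a larger d_in both increase the scale, which is why the two parameters are specified together.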

Parameters

input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input Spark DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply to count(s).
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy counts for each group obtained by applying the groupby transformation. Otherwise, this measurement outputs a single number - the noisy count.
count_column (Optional[str]) – If a groupby_transformation is provided, this is the column name to be used for counts in the dataframe output by the measurement. If None, this column will be named “count”.

Return type

tmlt.core.measurements.base.Measurement
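The difference between the grouped and ungrouped output can be sketched in plain Python. This is only an illustration: `grouped_noisy_counts` is a hypothetical helper, the `"group"` key is an invented column name, and the sketch assumes the set of group keys is fixed in advance so that empty groups still receive a noisy count.

```python
import random
from collections import Counter


def grouped_noisy_counts(rows, group_keys, epsilon, count_column="count", rng=None):
    """Return one record per group key, each with a Laplace-noised count."""
    rng = rng or random.Random()
    true_counts = Counter(rows)
    scale = 1 / epsilon  # assumes each row affects exactly one group (d_in = 1)
    result = []
    for key in group_keys:
        # Difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        result.append({"group": key, count_column: true_counts.get(key, 0) + noise})
    return result


rows = ["a", "a", "b"]
table = grouped_noisy_counts(rows, ["a", "b", "c"], epsilon=1.0, rng=random.Random(0))
print([record["group"] for record in table])  # ['a', 'b', 'c']
```

Iterating over the fixed key list, rather than over the groups present in the data, keeps the set of output rows independent of the input.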

create_count_distinct_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, d_in=1, groupby_transformation=None, count_column=None)#

Returns a noisy count_distinct measurement.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, otherwise it is interpreted as the “rho” parameter (if output_measure is RhoZCDP).

Parameters

input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input Spark DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions with respect to d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if output_measure is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply to count(s).
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy counts for each group obtained by applying the groupby transformation. Otherwise, this measurement outputs a single number - the noisy count of distinct items.
count_column (Optional[str]) – If a groupby_transformation is provided, this is the column name to be used for counts in the dataframe output by the measurement. If None, this column will be named “count”.

Return type

tmlt.core.measurements.base.Measurement
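To make the underlying query concrete, here is a minimal sketch of the exact (noise-free) distinct count and why it is a natural target for additive noise. `count_distinct` is a hypothetical helper, and rows are modeled as tuples for illustration.

```python
def count_distinct(rows):
    """Number of distinct rows, with each row modeled as a hashable tuple."""
    return len(set(rows))


rows = [(1, "a"), (1, "a"), (2, "b")]
print(count_distinct(rows))  # 2

# Adding a single row changes the distinct count by at most 1,
# so noise can be calibrated the same way as for an ordinary count.
with_new_row = rows + [(3, "c")]
with_duplicate = rows + [(1, "a")]
print(count_distinct(with_new_row) - count_distinct(rows))    # 1
print(count_distinct(with_duplicate) - count_distinct(rows))  # 0
```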

create_sum_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, sum_column=None)#

Returns a noisy sum measurement.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, otherwise it is interpreted as the “rho” parameter (if output_measure is RhoZCDP).

Parameters

input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input Spark DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to be applied to the sum(s).
measure_column (str) – Column to be summed.
lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound on measure_column.
upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound on measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy sums for each group obtained by applying the groupby transformation. If None, this measurement outputs a single number - the noisy sum.
sum_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for sums in the DataFrame output by the measurement. If None, this column will be named “sum(<measure_column>)”.

Return type

tmlt.core.measurements.base.Measurement
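The role of the lower and upper clipping bounds can be illustrated with a small sketch. The helper names are hypothetical, and the sensitivity formula assumes neighboring inputs differ by adding or removing one row.

```python
def clipped_sum(values, lower, upper):
    """Sum of values after clamping each one into [lower, upper]."""
    return sum(min(max(v, lower), upper) for v in values)


def sum_sensitivity(lower, upper):
    """Largest change in the clipped sum from adding or removing one row."""
    return max(abs(lower), abs(upper))


print(clipped_sum([-5, 3, 12], lower=0, upper=10))  # 13
print(sum_sensitivity(lower=-2, upper=10))          # 10
```

Without clipping, a single extreme value could shift the sum arbitrarily, so no finite noise scale would suffice. Tighter bounds reduce the required noise but bias the result when real values fall outside them.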

create_average_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, average_column=None, keep_intermediates=False, sum_column=None, count_column=None)#

Returns a noisy average measurement.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, otherwise it is interpreted as the “rho” parameter (if output_measure is RhoZCDP).

Parameters

input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply.
measure_column (str) – Name of the column to compute the average of.
lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound for measure_column.
upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound for measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy averages for each group obtained from the groupby transformation. If None, this measurement outputs a single number - the noisy average.
average_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy average in the DataFrame output by the measurement. If None, this column will be named “avg(<measure_column>)”.
keep_intermediates (bool) – If True, intermediates (noisy sum of deviations and noisy count) will also be output in addition to the noisy average.
sum_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums in the DataFrame output by the measurement. If None, this column will be named “sum(<measure_column>)”.
count_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate counts in the DataFrame output by the measurement. If None, this column will be named “count”.

Return type

tmlt.core.measurements.postprocess.PostProcess
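The intermediates named above (noisy sum of deviations and noisy count) suggest that the average is assembled by post-processing. The sketch below shows one way this could work, assuming deviations are taken from the midpoint of lower and upper. `noisy_average` and its `add_noise` callback are hypothetical, and privacy budget splitting is omitted.

```python
def noisy_average(values, lower, upper, add_noise=lambda x: x):
    """Average built from a noisy sum of deviations and a noisy count."""
    midpoint = (lower + upper) / 2
    clipped = [min(max(v, lower), upper) for v in values]
    # Deviations from the midpoint lie in [-(upper - lower) / 2, (upper - lower) / 2].
    noisy_sod = add_noise(sum(v - midpoint for v in clipped))
    noisy_count = add_noise(len(clipped))
    if noisy_count <= 0:
        return midpoint  # fall back when the noisy count is unusable
    average = midpoint + noisy_sod / noisy_count
    return min(max(average, lower), upper)


# With the default identity "noise", this is the exact clipped mean.
print(noisy_average([1, 2, 3, 100], lower=0, upper=10))  # 4.0
```

Centering on the midpoint bounds each row's contribution by half the width of the clipping range, which is never larger than the bound for the raw values.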

create_variance_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, variance_column=None, keep_intermediates=False, sum_of_deviations_column=None, sum_of_squared_deviations_column=None, count_column=None)#

Returns a noisy variance measurement.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, otherwise it is interpreted as the “rho” parameter (if output_measure is RhoZCDP).

Parameters

input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply.
measure_column (str) – Name of the column to compute the variance of.
lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound for measure_column.
upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound for measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with a noisy variance for each group obtained from the groupby transformation. If None, this measurement outputs a single number - the noisy variance.
variance_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy variance in the DataFrame output by the measurement. If None, this column will be named “var(<measure_column>)”.
keep_intermediates (bool) – If True, intermediates (noisy sum of deviations, noisy sum of squared deviations and noisy count) will also be output in addition to the noisy variance.
sum_of_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of deviations in the DataFrame output by the measurement. If None, this column will be named “sod(<measure_column>)”.
sum_of_squared_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of squared deviations in the DataFrame output by the measurement. If None, this column will be named “sos(<measure_column>)”.
count_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate counts in the DataFrame output by the measurement. If None, this column will be named “count”.

Return type

tmlt.core.measurements.postprocess.PostProcess
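One way the three intermediates could combine into a variance is sketched below. It assumes both kinds of deviation are measured from the midpoint of lower and upper, which the text above does not confirm. `variance_from_intermediates` is a hypothetical helper and the inputs shown are noise-free.

```python
import statistics


def variance_from_intermediates(sum_of_deviations, sum_of_squared_deviations, count):
    """Population variance from sums of (squared) deviations about a fixed point."""
    mean_deviation = sum_of_deviations / count
    return sum_of_squared_deviations / count - mean_deviation ** 2


values = [1, 2, 3, 10]      # already within the clipping range [0, 10]
midpoint = (0 + 10) / 2
sod = sum(v - midpoint for v in values)
sos = sum((v - midpoint) ** 2 for v in values)

print(variance_from_intermediates(sod, sos, len(values)))  # 12.5
print(statistics.pvariance(values))                        # 12.5
```

The identity used is Var(X) = E[(X - m)^2] - (E[X - m])^2, which holds for any fixed reference point m.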

create_standard_deviation_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, standard_deviation_column=None, keep_intermediates=False, sum_of_deviations_column=None, sum_of_squared_deviations_column=None, count_column=None)#

Returns a noisy standard deviation measurement.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, otherwise it is interpreted as the “rho” parameter (if output_measure is RhoZCDP).

Parameters

input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply.
measure_column (str) – Name of the column to compute the standard deviation of.
lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound for measure_column.
upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound for measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy standard deviations for each group obtained by applying the groupby transformation. If None, this measurement outputs a single number - the noisy standard deviation of measure_column.
standard_deviation_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy standard deviation in the DataFrame output by the measurement. If None, this column will be named “stddev(<measure_column>)”.
keep_intermediates (bool) – If True, intermediates (noisy sum of deviations, noisy sum of squared deviations and noisy count) will also be output in addition to the noisy standard deviation.
sum_of_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of deviations in the DataFrame output by the measurement. If None, this column will be named “sod(<measure_column>)”.
sum_of_squared_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of squared deviations in the DataFrame output by the measurement. If None, this column will be named “sos(<measure_column>)”.
count_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate counts in the DataFrame output by the measurement. If None, this column will be named “count”.

Return type

tmlt.core.measurements.postprocess.PostProcess
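Since the standard deviation shares its intermediates with the variance, a plausible final step is a square root applied as post-processing. The sketch below assumes negative noisy variances are clamped to zero first, and `stddev_from_variance` is a hypothetical helper.

```python
import math


def stddev_from_variance(noisy_variance):
    """Square root of a noisy variance, clamping negative values to zero."""
    return math.sqrt(max(noisy_variance, 0.0))


print(stddev_from_variance(12.25))  # 3.5
print(stddev_from_variance(-0.8))   # 0.0
```

Noise can push a variance estimate below zero, so some guard of this kind is needed before taking the root. Post-processing does not consume additional privacy budget.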

create_quantile_measurement(input_domain, input_metric, output_measure, d_out, measure_column, quantile, lower, upper, d_in=1, groupby_transformation=None, quantile_column=None)#

Returns a noisy quantile measurement.

Note

d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, otherwise it is interpreted as the “rho” parameter (if output_measure is RhoZCDP).

Parameters

input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions w.r.t. d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
measure_column (str) – Name of the column to compute the quantile of.
quantile (float) – The quantile to produce.
lower (Union[int, float]) – Lower clipping bound for measure_column.
upper (Union[int, float]) – Upper clipping bound for measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are d_out apart for inputs that are d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy quantiles for each group obtained by applying the groupby transformation. If None, this measurement outputs a single number - the noisy quantile.
quantile_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy quantile in the DataFrame output by the measurement. If None, this column will be named “q_(<quantile>)_(<measure_column>)”.

Return type

tmlt.core.measurements.postprocess.PostProcess
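Unlike the other functions, this signature takes no noise_mechanism. A common approach to private quantiles is the exponential mechanism over the intervals between sorted data points. The sketch below illustrates that general technique under pure DP. It is an assumption about the approach, not a description of this function's implementation, and `noisy_quantile` is a hypothetical name.

```python
import math
import random


def noisy_quantile(values, quantile, lower, upper, epsilon, rng=None):
    """Pick a value in [lower, upper] near the requested quantile."""
    if lower >= upper:
        raise ValueError("lower must be strictly less than upper")
    rng = rng or random.Random()
    # Clip, sort, and add the bounds as sentinel endpoints.
    clipped = sorted(min(max(v, lower), upper) for v in values)
    points = [lower] + clipped + [upper]
    n = len(clipped)
    target_rank = quantile * n
    # Interval i spans points[i]..points[i + 1] and has rank i.
    log_weights = []
    for i in range(n + 1):
        width = points[i + 1] - points[i]
        if width <= 0:
            log_weights.append(-math.inf)  # empty interval, never selected
        else:
            utility = -abs(i - target_rank)
            log_weights.append(math.log(width) + epsilon * utility / 2)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_weights)
    weights = [math.exp(w - top) for w in log_weights]
    chosen = rng.choices(range(n + 1), weights=weights)[0]
    # Release a uniformly random point inside the chosen interval.
    return rng.uniform(points[chosen], points[chosen + 1])


estimate = noisy_quantile(list(range(101)), 0.5, 0, 100, epsilon=100, rng=random.Random(0))
print(0 <= estimate <= 100)  # True
```

Intervals whose rank is close to the target rank get exponentially more weight, and wider intervals get proportionally more, so the output concentrates near the true quantile as epsilon grows.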

get_midpoint(lower, upper, integer_midpoint=False)#

Returns the midpoint of lower and upper.

If integer_midpoint is True, the midpoint is rounded to the nearest integer using round().

Examples

>>> get_midpoint(1, 2)
(1.5, 3/2)
>>> get_midpoint(1, 5)
(3.0, 3)
>>> get_midpoint("0.2", "0.3")
(0.25, 1/4)
>>> get_midpoint(1, 9, integer_midpoint=True)
(5, 5)

Parameters

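lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower bound of the range.
upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper bound of the range.
integer_midpoint (bool) – If True, the midpoint is rounded to the nearest integer using round(). Defaults to False.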

Return type

Tuple[Union[float, int], tmlt.core.utils.exact_number.ExactNumber]
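The pair returned in the examples above (a numeric value alongside an exact representation) can be imitated with Python's `fractions.Fraction`. This is a sketch only: `midpoint` is a hypothetical stand-in, and `Fraction` stands in for ExactNumber, which the real function returns.

```python
from fractions import Fraction


def midpoint(lower, upper, integer_midpoint=False):
    """Return (numeric midpoint, exact midpoint) of lower and upper."""
    # Fraction accepts ints and decimal strings such as "0.2" exactly.
    exact = (Fraction(lower) + Fraction(upper)) / 2
    if integer_midpoint:
        rounded = round(exact)  # built-in round(): ties go to the even integer
        return rounded, Fraction(rounded)
    return float(exact), exact


print(midpoint(1, 2))                         # (1.5, Fraction(3, 2))
print(midpoint("0.2", "0.3"))                 # (0.25, Fraction(1, 4))
print(midpoint(1, 9, integer_midpoint=True))  # (5, Fraction(5, 1))
```

Passing decimal strings rather than floats avoids binary rounding error: `Fraction("0.2")` is exactly 1/5, whereas the float `0.2` is not.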

Classes#

`NoiseMechanism`	Enumerating noise mechanisms.

class NoiseMechanism#

Bases: enum.Enum

Enumerating noise mechanisms.

name(self)#: The name of the Enum member.

value(self)#: The value of the Enum member.
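The `name` and `value` attributes are inherited from `enum.Enum`. The sketch below shows how they behave on a stand-in enum. `ExampleMechanism` and its members are hypothetical and chosen for illustration, since the members of NoiseMechanism are not listed here.

```python
from enum import Enum


class ExampleMechanism(Enum):
    """Hypothetical stand-in enum; member names and values are illustrative."""

    LAPLACE = 1
    GAUSSIAN = 2


mechanism = ExampleMechanism.LAPLACE
print(mechanism.name)   # LAPLACE
print(mechanism.value)  # 1

# Members can be looked up by name or by value.
print(ExampleMechanism["GAUSSIAN"] is ExampleMechanism(2))  # True
```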
