aggregations
Derived measurements for computing noisy aggregates on Spark DataFrames.
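Functions
create_count_measurement | Returns a noisy count measurement.
create_count_distinct_measurement | Returns a noisy count_distinct measurement.
create_sum_measurement | Returns a noisy sum measurement.
create_average_measurement | Returns a noisy average measurement.
create_variance_measurement | Returns a noisy variance measurement.
create_standard_deviation_measurement | Returns a noisy standard deviation measurement.
create_quantile_measurement | Returns a noisy quantile measurement.
get_midpoint | Returns the midpoint of lower and upper.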
- create_count_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, d_in=1, groupby_transformation=None, count_column=None)
Returns a noisy count measurement.
This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is computed for the specified noise_mechanism so that this privacy property is guaranteed.
Note
d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, and as the “rho” parameter if output_measure is RhoZCDP.
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input Spark DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions with respect to d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply to count(s).
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are at most d_out apart for inputs that are at most d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy counts for each group obtained by applying the groupby transformation. Otherwise, this measurement outputs a single number - the noisy count.
count_column (Optional[str]) – If a groupby_transformation is provided, this is the column name to be used for counts in the dataframe output by the measurement. If None, this column will be named “count”.
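The privacy contract for create_count_measurement ties the noise scale to d_in, d_out and the noise mechanism. As a rough illustration, the sketch below shows the textbook calibration, assuming a count that changes by at most d_in between d_in-close inputs, Laplace noise under PureDP, and Gaussian noise under RhoZCDP. The helper names are hypothetical and this is not the library's implementation, which may use discrete noise distributions and a different calibration.

```python
from fractions import Fraction


def laplace_scale(d_in, epsilon):
    """Textbook Laplace scale b = sensitivity / epsilon for pure DP."""
    sensitivity = Fraction(d_in)  # assumption: a count changes by at most d_in
    return sensitivity / Fraction(epsilon)


def gaussian_variance(d_in, rho):
    """Textbook Gaussian variance sigma^2 = sensitivity^2 / (2 * rho) for rho-zCDP."""
    sensitivity = Fraction(d_in)  # same assumption as above
    return sensitivity**2 / (2 * Fraction(rho))


print(laplace_scale(1, "0.5"))      # 2
print(gaussian_variance(1, "0.5"))  # 1
```

Exact rationals are used here only to mirror the exact-number style of the d_in and d_out parameters; passing decimal strings avoids binary floating-point rounding.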
- Return type
- create_count_distinct_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, d_in=1, groupby_transformation=None, count_column=None)
Returns a noisy count_distinct measurement.
This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is computed for the specified noise_mechanism so that this privacy property is guaranteed.
Note
d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, and as the “rho” parameter if output_measure is RhoZCDP.
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input Spark DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions with respect to d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if output_measure is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply to count(s).
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are at most d_out apart for inputs that are at most d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy counts for each group obtained by applying the groupby transformation. Otherwise, this measurement outputs a single number - the noisy count of distinct items.
count_column (Optional[str]) – If a groupby_transformation is provided, this is the column name to be used for counts in the dataframe output by the measurement. If None, this column will be named “count”.
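To make the grouped output shape of create_count_distinct_measurement concrete, here is a plain-Python sketch that counts distinct values per group and adds Laplace noise. It does not use Spark or the library. The function name, the "group" output key, the (group, value) row layout, and the choice to emit one row per supplied group key (including keys with no records) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def grouped_noisy_distinct_counts(rows, group_keys, scale, count_column="count", rng=None):
    """Return one dict per group key with a Laplace-noised distinct count."""
    rng = rng or random.Random()
    distinct = defaultdict(set)
    for key, value in rows:
        distinct[key].add(value)  # duplicates within a group collapse here
    output = []
    for key in group_keys:
        # The difference of two i.i.d. exponentials with mean `scale`
        # is Laplace-distributed with that scale.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        output.append({"group": key, count_column: len(distinct[key]) + noise})
    return output


rows = [("a", 1), ("a", 1), ("a", 2), ("b", 7)]
for row in grouped_noisy_distinct_counts(rows, ["a", "b", "c"], scale=2.0):
    print(row)
```

The exact distinct counts in this example are 2, 1 and 0 for groups "a", "b" and "c"; each printed value is that count plus random noise, so the output varies between runs.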
- Return type
- create_sum_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, sum_column=None)
Returns a noisy sum measurement.
This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is computed for the specified noise_mechanism so that this privacy property is guaranteed.
Note
d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, and as the “rho” parameter if output_measure is RhoZCDP.
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input Spark DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions with respect to d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to be applied to the sum(s).
measure_column (str) – Column to be summed.
lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound on measure_column.
upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound on measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are at most d_out apart for inputs that are at most d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy sums for each group obtained by applying the groupby transformation. If None, this measurement outputs a single number - the noisy sum.
sum_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for sums in the DataFrame output by the measurement. If None, this column will be named “sum(<measure_column>)”.
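The lower and upper clipping bounds of create_sum_measurement are what make a bounded-sensitivity sum possible. The following sketch shows the idea in plain Python, assuming records are added or removed (as under a symmetric-difference style metric); the function names are hypothetical and the library's own sensitivity computation may differ.

```python
def clipped_sum(values, lower, upper):
    """Sum the values after clamping each one to [lower, upper]."""
    return sum(min(max(v, lower), upper) for v in values)


def sum_sensitivity(lower, upper, d_in=1):
    """Largest change in the clipped sum when d_in records are added or removed."""
    return d_in * max(abs(lower), abs(upper))


print(clipped_sum([-5, 3, 12], lower=0, upper=10))  # 13
print(sum_sensitivity(lower=-2, upper=10))          # 10
```

Tighter bounds reduce the sensitivity, and with it the noise needed, at the cost of more bias from clipping.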
- Return type
- create_average_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, average_column=None, keep_intermediates=False, sum_column=None, count_column=None)
Returns a noisy average measurement.
This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is computed for the specified noise_mechanism so that this privacy property is guaranteed.
Note
d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, and as the “rho” parameter if output_measure is RhoZCDP.
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions with respect to d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply.
measure_column (str) – Name of the column to compute the average of.
lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound for measure_column.
upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound for measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are at most d_out apart for inputs that are at most d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy averages for each group obtained from the groupby transformation. If None, this measurement outputs a single number - the noisy average.
average_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy average in the DataFrame output by the measurement. If None, this column will be named “avg(<measure_column>)”.
keep_intermediates (bool) – If True, intermediates (noisy sum of deviations and noisy count) will also be output in addition to the noisy average.
sum_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums in the DataFrame output by the measurement. If None, this column will be named “sum(<measure_column>)”.
count_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate counts in the DataFrame output by the measurement. If None, this column will be named “count”.
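The keep_intermediates description for create_average_measurement names a sum of deviations and a count as the intermediates. One plausible reading, sketched below without noise, is that deviations are taken from the midpoint of [lower, upper] and the average is rebuilt as midpoint + sum_of_deviations / count. This is an assumption about the post-processing, not a statement of the library's code; the function names and the guard against counts below 1 are illustrative.

```python
def sum_of_deviations(values, lower, upper):
    """Clip values to [lower, upper] and sum their deviations from the midpoint."""
    midpoint = (lower + upper) / 2
    clipped = [min(max(v, lower), upper) for v in values]
    return sum(v - midpoint for v in clipped), midpoint


def average_from_intermediates(sod, count, midpoint):
    """Rebuild the average; treat counts below 1 as 1 to avoid dividing by zero."""
    safe_count = max(count, 1)
    return midpoint + sod / safe_count


sod, midpoint = sum_of_deviations([2, 4, 9], lower=0, upper=10)
print(average_from_intermediates(sod, 3, midpoint))  # 5.0
```

Centering on the midpoint means a single record shifts the sum by at most (upper - lower) / 2, which is never larger than max(|lower|, |upper|).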
- Return type
- create_variance_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, variance_column=None, keep_intermediates=False, sum_of_deviations_column=None, sum_of_squared_deviations_column=None, count_column=None)
Returns a noisy variance measurement.
This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is computed for the specified noise_mechanism so that this privacy property is guaranteed.
Note
d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, and as the “rho” parameter if output_measure is RhoZCDP.
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions with respect to d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply.
measure_column (str) – Name of the column to compute the variance of.
lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound for measure_column.
upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound for measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are at most d_out apart for inputs that are at most d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with a noisy variance for each group obtained from the groupby transformation. If None, this measurement outputs a single number - the noisy variance.
variance_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy variance in the DataFrame output by the measurement. If None, this column will be named “var(<measure_column>)”.
keep_intermediates (bool) – If True, intermediates (noisy sum of deviations, noisy sum of squared deviations and noisy count) will also be output in addition to the noisy variance.
sum_of_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of deviations in the DataFrame output by the measurement. If None, this column will be named “sod(<measure_column>)”.
sum_of_squared_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of squared deviations in the DataFrame output by the measurement. If None, this column will be named “sos(<measure_column>)”.
count_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate counts in the DataFrame output by the measurement. If None, this column will be named “count”.
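The three intermediates listed for create_variance_measurement (sum of deviations, sum of squared deviations, count) are enough to recover a variance. The sketch below uses the identity Var(x) = mean((x - c)^2) - (mean(x - c))^2, which holds for any fixed center c. It is shown without noise and assumes both sums use the same center; how the library actually defines and combines its intermediates is not specified in this section, so treat the function name and the clamping at zero as illustrative choices.

```python
import statistics


def variance_from_intermediates(sod, sos, count):
    """Population variance from sum of deviations, sum of squared deviations, count."""
    n = max(count, 1)  # guard: a noisy count could be zero or negative
    variance = sos / n - (sod / n) ** 2
    return max(variance, 0.0)  # noise could push the estimate below zero


values = [2, 4, 9]
center = 5  # e.g. the midpoint of lower=0 and upper=10
sod = sum(v - center for v in values)
sos = sum((v - center) ** 2 for v in values)

print(variance_from_intermediates(sod, sos, len(values)))
print(statistics.pvariance(values))
```

Both printed values agree up to floating-point rounding, since the identity is exact when no noise is added.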
- Return type
- create_standard_deviation_measurement(input_domain, input_metric, output_measure, d_out, noise_mechanism, measure_column, lower, upper, d_in=1, groupby_transformation=None, standard_deviation_column=None, keep_intermediates=False, sum_of_deviations_column=None, sum_of_squared_deviations_column=None, count_column=None)
Returns a noisy standard deviation measurement.
This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure. The noise scale is computed for the specified noise_mechanism so that this privacy property is guaranteed.
Note
d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, and as the “rho” parameter if output_measure is RhoZCDP.
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (one of PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions with respect to d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
noise_mechanism (NoiseMechanism) – Noise mechanism to apply.
measure_column (str) – Name of the column to compute the standard deviation of.
lower (tmlt.core.utils.exact_number.ExactNumberInput) – Lower clipping bound for measure_column.
upper (tmlt.core.utils.exact_number.ExactNumberInput) – Upper clipping bound for measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are at most d_out apart for inputs that are at most d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy standard deviations for each group obtained by applying the groupby transformation. If None, this measurement outputs a single number - the noisy standard deviation of measure_column.
standard_deviation_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy standard deviation in the DataFrame output by the measurement. If None, this column will be named “stddev(<measure_column>)”.
keep_intermediates (bool) – If True, intermediates (noisy sum of deviations, noisy sum of squared deviations and noisy count) will also be output in addition to the noisy standard deviation.
sum_of_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of deviations in the DataFrame output by the measurement. If None, this column will be named “sod(<measure_column>)”.
sum_of_squared_deviations_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate sums of squared deviations in the DataFrame output by the measurement. If None, this column will be named “sos(<measure_column>)”.
count_column (Optional[str]) – If a groupby_transformation is supplied and keep_intermediates is True, this is the column name to be used for intermediate counts in the DataFrame output by the measurement. If None, this column will be named “count”.
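Since create_standard_deviation_measurement shares its intermediates with the variance measurement, a natural post-processing step is a square root of a variance estimate. The sketch below assumes that reading and clamps negative estimates to zero first, because added noise can make a variance estimate negative. The function name and the clamping rule are illustrative assumptions, not the library's documented behavior.

```python
import math


def stddev_from_variance(noisy_variance):
    """Square root of a variance estimate, clamped at zero to stay real-valued."""
    return math.sqrt(max(noisy_variance, 0.0))


print(stddev_from_variance(9.0))   # 3.0
print(stddev_from_variance(-0.4))  # 0.0
```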
- Return type
- create_quantile_measurement(input_domain, input_metric, output_measure, d_out, measure_column, quantile, lower, upper, d_in=1, groupby_transformation=None, quantile_column=None)
Returns a noisy quantile measurement.
This function constructs a measurement M with the following privacy contract: for any two inputs x, x’ that are d_in-close under the input_metric, M(x) and M(x’) are sampled from distributions that are at most d_out apart under the output_measure.
Note
d_out is interpreted as the “epsilon” parameter if output_measure is PureDP, and as the “rho” parameter if output_measure is RhoZCDP.
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.
input_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) – Distance metric on input DataFrames.
output_measure (Union[tmlt.core.measures.PureDP, tmlt.core.measures.RhoZCDP]) – Desired privacy guarantee (PureDP or RhoZCDP).
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Desired distance between output distributions with respect to d_in. This is interpreted as “epsilon” if output_measure is PureDP and as “rho” if it is RhoZCDP.
measure_column (str) – Name of the column to compute the quantile of.
quantile (float) – The quantile to produce.
lower (Union[int, float]) – Lower clipping bound for measure_column.
upper (Union[int, float]) – Upper clipping bound for measure_column.
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under the input_metric. The returned measurement is guaranteed to have output distributions that are at most d_out apart for inputs that are at most d_in apart. Defaults to 1.
groupby_transformation (Optional[tmlt.core.transformations.spark_transformations.groupby.GroupBy]) – If provided, this measurement returns a DataFrame with noisy quantiles for each group obtained by applying groupby. If None, this measurement outputs a single number - the noisy quantile.
quantile_column (Optional[str]) – If a groupby_transformation is supplied, this is the column name to be used for noisy quantile in the DataFrame output by the measurement. If None, this column will be named “q_(<quantile>)_(<measure_column>)”.
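Unlike the other aggregations, create_quantile_measurement takes no noise_mechanism, and this section does not say how the noisy quantile is selected. One common technique for private quantiles is the exponential mechanism over the gaps between sorted, clipped values. The sketch below illustrates that technique under stated assumptions: quantile lies in [0, 1], records are added or removed one at a time (d_in = 1), and epsilon is a pure-DP budget. The function name is hypothetical and nothing here is claimed to match the library's implementation.

```python
import math
import random


def noisy_quantile_sketch(values, quantile, lower, upper, epsilon, rng=None):
    """Exponential-mechanism quantile over values clipped to [lower, upper]."""
    if not 0 <= quantile <= 1:
        raise ValueError("quantile must be in [0, 1]")
    if not lower < upper:
        raise ValueError("lower must be strictly less than upper")
    rng = rng or random.Random()
    # Clip and sort the data, then add the bounds as sentinels.
    points = [lower] + sorted(min(max(v, lower), upper) for v in values) + [upper]
    n = len(values)
    target = quantile * n
    # Interval i spans points[i]..points[i + 1] and has i data points below it.
    weights = []
    for i in range(n + 1):
        width = points[i + 1] - points[i]
        score = -abs(i - target)  # changes by at most 1 per added/removed record
        weights.append(width * math.exp(epsilon * score / 2))
    chosen = rng.choices(range(n + 1), weights=weights)[0]
    # Return a uniformly random point inside the chosen interval.
    return rng.uniform(points[chosen], points[chosen + 1])


print(noisy_quantile_sketch(range(1, 10), 0.5, lower=0, upper=10, epsilon=1.0))
```

The result always lies in [lower, upper]; larger epsilon concentrates it near the true quantile, while wide bounds give far-away intervals more weight.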
- Return type
- get_midpoint(lower, upper, integer_midpoint=False)
Returns the midpoint of lower and upper.
If integer_midpoint is True, the midpoint is rounded to the nearest integer using round().
Examples
>>> get_midpoint(1, 2)
(1.5, 3/2)
>>> get_midpoint(1, 5)
(3.0, 3)
>>> get_midpoint("0.2", "0.3")
(0.25, 1/4)
>>> get_midpoint(1, 9, integer_midpoint=True)
(5, 5)
- Parameters
lower (tmlt.core.utils.exact_number.ExactNumberInput) –
upper (tmlt.core.utils.exact_number.ExactNumberInput) –
integer_midpoint (bool) –
- Return type
Tuple[Union[float, int], tmlt.core.utils.exact_number.ExactNumber]
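The return type of get_midpoint pairs a float or int with an exact number. A minimal sketch of that behavior using Python's standard fractions.Fraction in place of ExactNumber is shown below; midpoint_sketch is a hypothetical name, and the real function's rounding and conversion details may differ.

```python
from fractions import Fraction


def midpoint_sketch(lower, upper, integer_midpoint=False):
    """Return (numeric midpoint, exact midpoint) of lower and upper."""
    exact = (Fraction(lower) + Fraction(upper)) / 2
    if integer_midpoint:
        rounded = round(exact)  # built-in round() on a Fraction returns an int
        return rounded, Fraction(rounded)
    return float(exact), exact


print(midpoint_sketch(1, 2))                         # (1.5, Fraction(3, 2))
print(midpoint_sketch("0.2", "0.3"))                 # (0.25, Fraction(1, 4))
print(midpoint_sketch(1, 9, integer_midpoint=True))  # (5, Fraction(5, 1))
```

Passing decimal strings such as "0.2" keeps the arithmetic exact, whereas the float 0.2 would first be converted to its nearest binary value.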
Classes
NoiseMechanism | Enumerating noise mechanisms.