dataframe#

Measurements on Pandas DataFrames.

Classes#

Aggregate

Aggregate a Pandas DataFrame.

AggregateByColumn

Apply Aggregate measurements to columns of a Pandas DataFrame.

class Aggregate(input_domain, input_metric, output_measure, output_schema)#

Bases: tmlt.core.measurements.base.Measurement

Aggregate a Pandas DataFrame.

This measurement requires that the output schema be specified as a pyspark.sql.types.StructType so that it can be used as a UDF in Spark.

__init__(input_domain, input_metric, output_measure, output_schema)#

Constructor.

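Parameters
  • input_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain) – Input domain for the measurement.

  • input_metric (tmlt.core.metrics.Metric) – Distance metric on input domain.

  • output_measure (tmlt.core.measures.Measure) – Distance measure on output.

  • output_schema (pyspark.sql.types.StructType) – Output schema.
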
property input_domain#

Return input domain for the measurement.

Return type

tmlt.core.domains.pandas_domains.PandasDataFrameDomain

property output_schema#

Return the output schema.

Return type

pyspark.sql.types.StructType

abstract __call__(df)#

Perform measurement.

Parameters

df (pandas.DataFrame) –

Return type

pandas.DataFrame

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_measure#

Distance measure on output.

Return type

tmlt.core.measures.Measure

property is_interactive#

Returns True if and only if the measurement is interactive.

Return type

bool

privacy_function(d_in)#

Returns the smallest d_out satisfied by the measurement.

See the privacy and stability tutorial for more information.

Parameters

d_in (Any) – Distance between inputs under input_metric.

Raises

NotImplementedError – If not overridden.

Return type

Any

privacy_relation(d_in, d_out)#

Return True if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_measure.

Return type

bool
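
The link between privacy_function() and privacy_relation() can be illustrated with a small stand-in class. This is a minimal sketch, not tmlt.core code: ToyMeasurement and its epsilon_per_unit parameter are hypothetical, and distances are plain Fraction values rather than the library's own number types. It assumes that a larger d_out is always a weaker guarantee, so the relation holds exactly when d_out is at least the smallest satisfied value.

```python
from fractions import Fraction


class ToyMeasurement:
    """Hypothetical stand-in for a measurement (not the tmlt.core class)."""

    def __init__(self, epsilon_per_unit):
        # Assumed privacy cost incurred per unit of input distance.
        self.epsilon_per_unit = Fraction(epsilon_per_unit)

    def privacy_function(self, d_in):
        # Smallest d_out this toy measurement satisfies for the given d_in.
        return Fraction(d_in) * self.epsilon_per_unit

    def privacy_relation(self, d_in, d_out):
        # Close inputs give close outputs whenever d_out covers the
        # smallest satisfied value.
        return self.privacy_function(d_in) <= Fraction(d_out)


toy = ToyMeasurement(epsilon_per_unit="1/2")
smallest = toy.privacy_function(2)       # 2 * 1/2
holds = toy.privacy_relation(2, 1)       # d_out equals the smallest value
fails = toy.privacy_relation(2, "3/4")   # d_out is below the smallest value
```

In this sketch, a measurement that only overrides privacy_relation() could still answer yes/no questions about a proposed d_out even when no closed-form privacy_function() is available, which is the case the NotImplementedError above leaves open.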

class AggregateByColumn(input_domain, column_to_aggregation, hint=None)#

Bases: Aggregate

Apply Aggregate measurements to columns of a Pandas DataFrame.

__init__(input_domain, column_to_aggregation, hint=None)#

Constructor.

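Parameters
  • input_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain) – Input domain for the measurement.

  • column_to_aggregation (Dict[str, tmlt.core.measurements.pandas_measurements.series.Aggregate]) – Dictionary from column names to aggregation measurements.

  • hint – Optional source of per-measurement d_outs, used by privacy_relation() when a composed measurement's privacy_function() raises NotImplementedError. Defaults to None.
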
property column_to_aggregation#

Returns dictionary from column names to aggregation measurements.

Return type

Dict[str, tmlt.core.measurements.pandas_measurements.series.Aggregate]

privacy_function(d_in)#

Returns the smallest d_out satisfied by the measurement.

The result is the sum of privacy_function(d_in) over all composed measurements.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Raises

NotImplementedError – If any of the measurements raise NotImplementedError.

Return type

tmlt.core.utils.exact_number.ExactNumber
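
The summation described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's implementation: ToyColumnAggregate and summed_privacy_function are hypothetical names, the column names are made up, and Fraction stands in for ExactNumber.

```python
from fractions import Fraction


class ToyColumnAggregate:
    """Hypothetical per-column measurement (not the tmlt.core class)."""

    def __init__(self, epsilon_per_unit=None):
        # None models a measurement without a privacy function.
        self.epsilon_per_unit = epsilon_per_unit

    def privacy_function(self, d_in):
        if self.epsilon_per_unit is None:
            raise NotImplementedError("no privacy function for this aggregate")
        return Fraction(d_in) * Fraction(self.epsilon_per_unit)


def summed_privacy_function(column_to_aggregation, d_in):
    # Add up each component's d_out. A NotImplementedError raised by any
    # component propagates to the caller unchanged.
    return sum(
        (agg.privacy_function(d_in) for agg in column_to_aggregation.values()),
        Fraction(0),
    )


total = summed_privacy_function(
    {"age": ToyColumnAggregate("1/2"), "income": ToyColumnAggregate("1/4")},
    d_in=2,
)
```

Using exact rational arithmetic here mirrors the reason for an exact number type: the per-column values add up without floating-point rounding.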

privacy_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

Let d_outs be the values returned by the privacy_function() of each composed measurement on d_in, or the d_outs supplied by the hint if any of those calls raises NotImplementedError.

Let total_d_out be the sum of d_outs.

This returns True if total_d_out <= d_out (the input argument) and each composed measurement satisfies its privacy_relation() from d_in to its own entry in d_outs.

Parameters
  • d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

  • d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between outputs under output_measure.

Return type

bool
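
The check described for this privacy_relation() can be sketched as follows. This is a hedged illustration, not the library's code: ToyAggregate and composed_privacy_relation are hypothetical, Fraction stands in for ExactNumber, and the hint is assumed to be a callable taking (d_in, d_out) and returning a dictionary from column name to d_out. Re-raising when no hint is available is also an assumption.

```python
from fractions import Fraction


class ToyAggregate:
    """Hypothetical per-column measurement (not the tmlt.core class)."""

    def __init__(self, epsilon_per_unit, has_privacy_function=True):
        self.epsilon_per_unit = Fraction(epsilon_per_unit)
        self.has_privacy_function = has_privacy_function

    def privacy_function(self, d_in):
        if not self.has_privacy_function:
            raise NotImplementedError
        return Fraction(d_in) * self.epsilon_per_unit

    def privacy_relation(self, d_in, d_out):
        return Fraction(d_in) * self.epsilon_per_unit <= Fraction(d_out)


def composed_privacy_relation(column_to_aggregation, d_in, d_out, hint=None):
    try:
        d_outs = {
            col: agg.privacy_function(d_in)
            for col, agg in column_to_aggregation.items()
        }
    except NotImplementedError:
        if hint is None:
            raise  # assumed behavior when no hint was supplied
        # Assumed hint shape: callable (d_in, d_out) -> {column: d_out}.
        d_outs = {col: Fraction(v) for col, v in hint(d_in, d_out).items()}
    total_d_out = sum(d_outs.values(), Fraction(0))
    # Both conditions must hold: the budget covers the total, and every
    # component satisfies its own relation at its share of the budget.
    return total_d_out <= Fraction(d_out) and all(
        agg.privacy_relation(d_in, d_outs[col])
        for col, agg in column_to_aggregation.items()
    )


def even_split(d_in, d_out):
    # Hypothetical hint: give each of the two columns half of d_out.
    return {"age": Fraction(d_out) / 2, "income": Fraction(d_out) / 2}


aggs = {
    "age": ToyAggregate(1),
    "income": ToyAggregate(1, has_privacy_function=False),
}
ok = composed_privacy_relation(aggs, 1, 2, hint=even_split)
too_small = composed_privacy_relation(aggs, 1, 1, hint=even_split)
```

With d_out = 2 each column receives 1, which covers its cost; with d_out = 1 each column receives 1/2, so the per-column relation fails even though the total still fits.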

__call__(df)#

Perform the aggregation.

Parameters

df (pandas.DataFrame) – The DataFrame to aggregate.

Return type

pandas.DataFrame

property input_domain#

Return input domain for the measurement.

Return type

tmlt.core.domains.pandas_domains.PandasDataFrameDomain

property output_schema#

Return the output schema.

Return type

pyspark.sql.types.StructType

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_measure#

Distance measure on output.

Return type

tmlt.core.measures.Measure

property is_interactive#

Returns True if and only if the measurement is interactive.

Return type

bool