dataframe#
Measurements on Pandas DataFrames.
Classes#
Aggregate | Aggregate a Pandas DataFrame. |
AggregateByColumn | Apply Aggregate measurements to columns of a Pandas DataFrame. |
- class Aggregate(input_domain, input_metric, output_measure, output_schema)#
Bases:
tmlt.core.measurements.base.Measurement
Aggregate a Pandas DataFrame.
This measurement requires the output schema to be specified as a pyspark.sql.types.StructType so that it can be used as a UDF in Spark.
- Parameters:
input_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain)
input_metric (Union[tmlt.core.metrics.HammingDistance, tmlt.core.metrics.SymmetricDifference])
output_measure (tmlt.core.measures.Measure)
output_schema (pyspark.sql.types.StructType)
- property input_domain: tmlt.core.domains.pandas_domains.PandasDataFrameDomain#
Return input domain for the measurement.
- property output_schema: pyspark.sql.types.StructType#
Return the output schema.
- Return type:
pyspark.sql.types.StructType
- property input_metric: tmlt.core.metrics.Metric#
Distance metric on input domain.
- Return type:
tmlt.core.metrics.Metric
- property output_measure: tmlt.core.measures.Measure#
Distance measure on output.
- Return type:
tmlt.core.measures.Measure
- __init__(input_domain, input_metric, output_measure, output_schema)#
Constructor.
- Parameters:
input_domain (PandasDataFrameDomain) – Input domain.
input_metric (Union[HammingDistance, SymmetricDifference]) – Input metric.
output_measure (Measure) – Output measure.
output_schema (StructType) – Spark StructType compatible with the output.
- abstract __call__(df)#
Perform measurement.
- Parameters:
df (pandas.DataFrame)
- Return type:
- privacy_function(d_in)#
Returns the smallest d_out satisfied by the measurement.
See the privacy and stability tutorial for more information.
- Parameters:
d_in (Any) – Distance between inputs under input_metric.
- Raises:
NotImplementedError – If not overridden.
- Return type:
Any
- privacy_relation(d_in, d_out)#
Return True if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters:
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_measure.
- Return type:
bool
- class AggregateByColumn(input_domain, column_to_aggregation, hint=None)#
Bases:
Aggregate
Apply Aggregate measurements to columns of a Pandas DataFrame.
- Parameters:
input_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain)
column_to_aggregation (Mapping[str, tmlt.core.measurements.pandas_measurements.series.Aggregate])
hint (Optional[Callable[[tmlt.core.utils.exact_number.ExactNumberInput, tmlt.core.utils.exact_number.ExactNumberInput], Dict[str, tmlt.core.utils.exact_number.ExactNumberInput]]])
- property column_to_aggregation: Dict[str, tmlt.core.measurements.pandas_measurements.series.Aggregate]#
Returns dictionary from column names to aggregation measurements.
- Return type:
Dict[str, tmlt.core.measurements.pandas_measurements.series.Aggregate]
- property input_domain: tmlt.core.domains.pandas_domains.PandasDataFrameDomain#
Return input domain for the measurement.
- property output_schema: pyspark.sql.types.StructType#
Return the output schema.
- Return type:
pyspark.sql.types.StructType
- property input_metric: tmlt.core.metrics.Metric#
Distance metric on input domain.
- Return type:
tmlt.core.metrics.Metric
- property output_measure: tmlt.core.measures.Measure#
Distance measure on output.
- Return type:
tmlt.core.measures.Measure
- __init__(input_domain, column_to_aggregation, hint=None)#
Constructor.
- Parameters:
input_domain (PandasDataFrameDomain) – Input domain.
column_to_aggregation (Mapping[str, Aggregate]) – A dictionary mapping column names to aggregation measurements. The provided measurements must all have PureDP or all have RhoZCDP as their output_measure.
hint (Optional[Callable[[Union[ExactNumber, float, int, str, Fraction, Expr], Union[ExactNumber, float, int, str, Fraction, Expr]], Dict[str, Union[ExactNumber, float, int, str, Fraction, Expr]]]]) – An optional hint. A hint is only required if the privacy_function() of one or more of the measurements raises NotImplementedError. The hint takes the same arguments as privacy_relation() and should return a d_out for each aggregation to be composed, where the d_outs sum to at most the d_out passed into the hint.
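As an illustration of the hint contract described above, here is a minimal sketch of a hint that splits the available d_out evenly across columns. The column names and the function name `even_split_hint` are hypothetical, and the sketch assumes the returned dictionary is keyed by column name. It uses `fractions.Fraction` (one of the accepted input types) so the arithmetic stays exact.

```python
from fractions import Fraction
from typing import Dict

# Hypothetical column names; in practice these would match the keys of
# column_to_aggregation.
COLUMNS = ["age", "income"]


def even_split_hint(d_in, d_out) -> Dict[str, Fraction]:
    """Return one d_out per column, splitting the total d_out evenly.

    d_in is accepted because the hint takes the same arguments as
    privacy_relation(), but this simple sketch does not use it.
    """
    share = Fraction(d_out) / len(COLUMNS)
    return {column: share for column in COLUMNS}


budgets = even_split_hint(1, Fraction(1, 2))
# The per-column d_outs must not sum to more than the d_out passed in.
assert sum(budgets.values()) <= Fraction(1, 2)
```

An even split is only one possible policy; a hint could equally give more of the budget to columns whose aggregations need it.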
- privacy_function(d_in)#
Returns the smallest d_out satisfied by the measurement.
Returns the sum of the privacy_function()s on d_in for all composed measurements.
- Parameters:
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Raises:
NotImplementedError – If any of the measurements raise NotImplementedError.
- Return type:
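The summation described for privacy_function() can be sketched in plain Python. `StubAggregate` and `summed_privacy_function` are hypothetical stand-ins, not tmlt.core classes; the stub assumes a simple measurement whose d_out scales linearly with d_in.

```python
from fractions import Fraction


class StubAggregate:
    """Hypothetical stand-in for a per-column aggregation measurement."""

    def __init__(self, epsilon, has_privacy_function=True):
        self.epsilon = Fraction(epsilon)
        self.has_privacy_function = has_privacy_function

    def privacy_function(self, d_in):
        if not self.has_privacy_function:
            raise NotImplementedError()
        # Assumed for illustration: d_out grows linearly with d_in.
        return Fraction(d_in) * self.epsilon


def summed_privacy_function(column_to_aggregation, d_in):
    """Sum each measurement's privacy_function(d_in).

    A NotImplementedError raised by any measurement propagates to the caller.
    """
    return sum(m.privacy_function(d_in) for m in column_to_aggregation.values())


measurements = {
    "age": StubAggregate(Fraction(1, 4)),
    "income": StubAggregate(Fraction(1, 2)),
}
total = summed_privacy_function(measurements, 1)
```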
- privacy_relation(d_in, d_out)#
Returns True only if outputs are close under close inputs.
Let d_outs be the d_outs from the privacy_function()s of all composed measurements, or the d_outs from the hint if any of them raises NotImplementedError, and let total_d_out be the sum of d_outs.
This returns True if total_d_out <= d_out (the input argument) and each composed measurement satisfies its privacy_relation() from d_in to its d_out from d_outs.
- Parameters:
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between outputs under output_measure.
- Return type:
bool
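The check described for privacy_relation() can be sketched as follows. This is an illustrative reading of the description, not the library's implementation: `StubAggregate` and `composed_privacy_relation` are hypothetical names, the stub assumes d_out scales linearly with d_in, and re-raising when no hint is supplied is an assumption.

```python
from fractions import Fraction


class StubAggregate:
    """Hypothetical stand-in for a per-column aggregation measurement."""

    def __init__(self, epsilon, has_privacy_function=True):
        self.epsilon = Fraction(epsilon)
        self.has_privacy_function = has_privacy_function

    def privacy_function(self, d_in):
        if not self.has_privacy_function:
            raise NotImplementedError()
        return Fraction(d_in) * self.epsilon

    def privacy_relation(self, d_in, d_out):
        return Fraction(d_in) * self.epsilon <= Fraction(d_out)


def composed_privacy_relation(column_to_aggregation, d_in, d_out, hint=None):
    """Return True if the per-column d_outs fit within d_out and each holds."""
    try:
        d_outs = {
            column: m.privacy_function(d_in)
            for column, m in column_to_aggregation.items()
        }
    except NotImplementedError:
        if hint is None:
            raise  # Assumption: without a hint there is nothing to fall back on.
        d_outs = hint(d_in, d_out)
    total_d_out = sum(Fraction(value) for value in d_outs.values())
    return total_d_out <= Fraction(d_out) and all(
        m.privacy_relation(d_in, d_outs[column])
        for column, m in column_to_aggregation.items()
    )


def halves_hint(d_in, d_out):
    """Hypothetical hint: give each of the two columns half of d_out."""
    return {"age": Fraction(d_out) / 2, "income": Fraction(d_out) / 2}


with_functions = {
    "age": StubAggregate(Fraction(1, 4)),
    "income": StubAggregate(Fraction(1, 4)),
}
needs_hint = {
    "age": StubAggregate(Fraction(1, 4)),
    "income": StubAggregate(Fraction(1, 4), has_privacy_function=False),
}
```

With these stubs, a d_out of 1/2 is accepted in both configurations, while a d_out of 1/3 is rejected: in the hinted case the per-column share of 1/6 is too small for each measurement's own privacy_relation().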
- __call__(df)#
Perform the aggregation.
- Parameters:
df (pandas.DataFrame) – The DataFrame to aggregate.
- Return type: