dataframe#
Measurements on Pandas DataFrames.
Classes#
Aggregate | Aggregate a Pandas DataFrame.
AggregateByColumn | Apply Aggregate measurements to columns of a Pandas DataFrame.
- class Aggregate(input_domain, input_metric, output_measure, output_schema)#
Bases:
tmlt.core.measurements.base.Measurement
Aggregate a Pandas DataFrame.
This measurement requires the output schema be specified as a pyspark.sql.types.StructType so that it can be used as a UDF in Spark.
- Parameters
input_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain) –
input_metric (Union[tmlt.core.metrics.HammingDistance, tmlt.core.metrics.SymmetricDifference]) –
output_measure (tmlt.core.measures.Measure) –
output_schema (pyspark.sql.types.StructType) –
- __init__(input_domain, input_metric, output_measure, output_schema)#
Constructor.
- Parameters
input_domain (PandasDataFrameDomain) – Input domain.
input_metric (Union[HammingDistance, SymmetricDifference]) – Input metric.
output_schema (StructType) – Spark StructType compatible with the output.
- property input_domain#
Return input domain for the measurement.
- property output_schema#
Return the output schema.
- Return type
- abstract __call__(df)#
Perform measurement.
- Parameters
df (pandas.DataFrame) –
- Return type
- property input_metric#
Distance metric on input domain.
- Return type
- property output_measure#
Distance measure on output.
- Return type
- privacy_function(d_in)#
Returns the smallest d_out satisfied by the measurement.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
- Raises
NotImplementedError – If not overridden.
- Return type
Any
- privacy_relation(d_in, d_out)#
Return True if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_measure.
- Return type
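The link between privacy_function() and privacy_relation() can be illustrated with a small stand-alone sketch. It does not use tmlt.core: the class ToyMeasurement and its epsilon_per_unit parameter are hypothetical, and the sketch assumes that the relation holds exactly when the smallest satisfied d_out does not exceed the requested d_out.

```python
from fractions import Fraction


class ToyMeasurement:
    """Hypothetical stand-in for a measurement; not part of tmlt.core."""

    def __init__(self, epsilon_per_unit):
        # Assumed privacy cost incurred per unit of input distance.
        self.epsilon_per_unit = Fraction(epsilon_per_unit)

    def privacy_function(self, d_in):
        # Smallest d_out this toy measurement satisfies for the given d_in.
        return self.epsilon_per_unit * Fraction(d_in)

    def privacy_relation(self, d_in, d_out):
        # Assumption: the relation holds when the smallest satisfied d_out
        # does not exceed the requested d_out.
        return self.privacy_function(d_in) <= Fraction(d_out)


toy = ToyMeasurement(epsilon_per_unit=Fraction(1, 2))
smallest_d_out = toy.privacy_function(2)
holds_with_budget_1 = toy.privacy_relation(2, 1)
holds_with_small_budget = toy.privacy_relation(2, Fraction(1, 2))
```

Exact rational arithmetic is used here so that the comparison at the budget boundary is not affected by floating-point rounding.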
- class AggregateByColumn(input_domain, column_to_aggregation, hint=None)#
Bases:
Aggregate
Apply Aggregate measurements to columns of a Pandas DataFrame.
- Parameters
input_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain) –
column_to_aggregation (Mapping[str, tmlt.core.measurements.pandas_measurements.series.Aggregate]) –
hint (Optional[Callable[[tmlt.core.utils.exact_number.ExactNumberInput, tmlt.core.utils.exact_number.ExactNumberInput], Dict[str, tmlt.core.utils.exact_number.ExactNumberInput]]]) –
- __init__(input_domain, column_to_aggregation, hint=None)#
Constructor.
- Parameters
input_domain (PandasDataFrameDomain) – Input domain.
column_to_aggregation (Mapping[str, Aggregate]) – A dictionary mapping column names to aggregation measurements. The provided measurements must all have PureDP or all have RhoZCDP as their output_measure.
hint (Optional[Callable[[ExactNumberInput, ExactNumberInput], Dict[str, ExactNumberInput]]], default: None) – An optional hint, where ExactNumberInput stands for Union[ExactNumber, float, int, str, Fraction, Expr]. A hint is only required if the privacy_function() of one or more of the measurements raises NotImplementedError. The hint takes the same arguments as privacy_relation() and should return a d_out for each aggregation to be composed, where the d_outs sum to at most the d_out passed into the hint.
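As an illustration of what a hint might look like, the following sketch builds a callable that takes (d_in, d_out) and splits d_out evenly across columns. The helper make_even_split_hint and the column names are hypothetical and not part of the library; an even split is only one possible allocation.

```python
from fractions import Fraction


def make_even_split_hint(columns):
    """Build a hint that divides d_out evenly across the given columns."""
    columns = list(columns)

    def hint(d_in, d_out):
        # d_in is accepted to match the (d_in, d_out) signature, but an
        # even split does not need it.
        share = Fraction(d_out) / len(columns)
        return {column: share for column in columns}

    return hint


# Hypothetical column names, chosen only for illustration.
hint = make_even_split_hint(["age", "income", "height"])
d_outs = hint(1, Fraction(3, 2))
```

Because each share is an exact Fraction, the returned values sum to exactly the d_out that was passed in, so the total never exceeds the budget.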
- property column_to_aggregation#
Returns dictionary from column names to aggregation measurements.
- Return type
Dict[str, tmlt.core.measurements.pandas_measurements.series.Aggregate]
- privacy_function(d_in)#
Returns the smallest d_out satisfied by the measurement.
Returns the sum of the privacy_function()s on d_in for all composed measurements.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Raises
NotImplementedError – If any of the measurements raise NotImplementedError.
- Return type
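The summation described for this method can be sketched without the library. The classes FixedCostAggregation and OpaqueAggregation and the function composed_privacy_function below are hypothetical stand-ins; they only mirror the documented behavior (sum the per-column d_outs, and let NotImplementedError propagate).

```python
from fractions import Fraction


class FixedCostAggregation:
    """Hypothetical per-column measurement; not part of tmlt.core."""

    def __init__(self, cost_per_unit):
        self.cost_per_unit = Fraction(cost_per_unit)

    def privacy_function(self, d_in):
        return self.cost_per_unit * Fraction(d_in)


class OpaqueAggregation:
    """Hypothetical measurement that cannot report its smallest d_out."""

    def privacy_function(self, d_in):
        raise NotImplementedError()


def composed_privacy_function(column_to_aggregation, d_in):
    # Sum the per-column d_outs; a NotImplementedError raised by any
    # component propagates to the caller unchanged.
    return sum(
        (agg.privacy_function(d_in) for agg in column_to_aggregation.values()),
        Fraction(0),
    )


total = composed_privacy_function(
    {"a": FixedCostAggregation(1), "b": FixedCostAggregation(Fraction(1, 2))}, 2
)

try:
    composed_privacy_function({"a": OpaqueAggregation()}, 2)
    raised = False
except NotImplementedError:
    raised = True
```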
- privacy_relation(d_in, d_out)#
Returns True only if outputs are close under close inputs.
Let d_outs be the d_out from the privacy_function()s of all composed measurements, or the d_outs from the hint if one of them raises NotImplementedError, and let total_d_out be the sum of d_outs.
This returns True if total_d_out <= d_out (the input argument) and each composed measurement satisfies its privacy_relation() from d_in to its d_out from d_outs.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between outputs under output_measure.
- Return type
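The check described for this method can be written out as a minimal sketch. ToyAggregation, composed_privacy_relation, and even_split are hypothetical and do not use tmlt.core; re-raising NotImplementedError when no hint is supplied is an assumption of this sketch, not documented behavior.

```python
from fractions import Fraction


class ToyAggregation:
    """Hypothetical per-column measurement; not part of tmlt.core."""

    def __init__(self, cost_per_unit, has_privacy_function=True):
        self.cost_per_unit = Fraction(cost_per_unit)
        self.has_privacy_function = has_privacy_function

    def privacy_function(self, d_in):
        if not self.has_privacy_function:
            raise NotImplementedError()
        return self.cost_per_unit * Fraction(d_in)

    def privacy_relation(self, d_in, d_out):
        return self.cost_per_unit * Fraction(d_in) <= Fraction(d_out)


def composed_privacy_relation(column_to_aggregation, d_in, d_out, hint=None):
    try:
        # Preferred path: take each component's smallest d_out.
        d_outs = {
            column: agg.privacy_function(d_in)
            for column, agg in column_to_aggregation.items()
        }
    except NotImplementedError:
        # Fallback path: ask the hint for a per-column allocation.
        if hint is None:
            raise  # Assumption of this sketch: no hint means no answer.
        d_outs = hint(d_in, d_out)
    total_d_out = sum((Fraction(v) for v in d_outs.values()), Fraction(0))
    # The total must fit in the budget and every component must satisfy
    # its own relation at its allocated d_out.
    return total_d_out <= Fraction(d_out) and all(
        agg.privacy_relation(d_in, d_outs[column])
        for column, agg in column_to_aggregation.items()
    )


def even_split(d_in, d_out):
    # Hypothetical hint: give each of the two columns half of d_out.
    return {"a": Fraction(d_out) / 2, "b": Fraction(d_out) / 2}


aggs = {"a": ToyAggregation(1), "b": ToyAggregation(1, has_privacy_function=False)}
within_budget = composed_privacy_relation(aggs, 1, 2, hint=even_split)
over_budget = composed_privacy_relation(aggs, 1, 1, hint=even_split)
```

In the second call the allocations sum to the budget, but each column receives only 1/2 while needing 1, so the per-measurement check fails even though the total fits.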
- __call__(df)#
Perform the aggregation.
- Parameters
df (pandas.DataFrame) – The DataFrame to aggregate.
- Return type
- property input_domain#
Return input domain for the measurement.
- property output_schema#
Return the output schema.
- Return type
- property input_metric#
Distance metric on input domain.
- Return type
- property output_measure#
Distance measure on output.
- Return type