dataframe
Measurements on Pandas DataFrames.
Classes
Aggregate | Aggregate a Pandas DataFrame.
AggregateByColumn | Apply Aggregate measurements to columns of a Pandas DataFrame.
- class Aggregate(input_domain, input_metric, output_measure, output_schema)
Bases: tmlt.core.measurements.base.Measurement
Aggregate a Pandas DataFrame.
This measurement requires the output schema to be specified as a pyspark.sql.types.StructType so that it can be used as a UDF in Spark.
- Parameters
input_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain) –
input_metric (Union[tmlt.core.metrics.HammingDistance, tmlt.core.metrics.SymmetricDifference]) –
output_measure (tmlt.core.measures.Measure) –
output_schema (pyspark.sql.types.StructType) –
- __init__(input_domain, input_metric, output_measure, output_schema)
Constructor.
- Parameters
input_domain (PandasDataFrameDomain) – Input domain.
input_metric (Union[HammingDistance, SymmetricDifference]) – Input metric.
output_schema (StructType) – Spark StructType compatible with the output.
- property input_domain
Return input domain for the measurement.
- property output_schema
Return the output schema.
- Return type
pyspark.sql.types.StructType
- abstract __call__(df)
Perform measurement.
- Parameters
df (pandas.DataFrame) –
- Return type
- property input_metric
Distance metric on input domain.
- Return type
- property output_measure
Distance measure on output.
- Return type
- privacy_function(d_in)
Returns the smallest d_out satisfied by the measurement.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
- Raises
NotImplementedError – If not overridden.
- Return type
Any
- privacy_relation(d_in, d_out)
Return True if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_measure.
- Return type
bool
- class AggregateByColumn(input_domain, column_to_aggregation, hint=None)
Bases: Aggregate
Apply Aggregate measurements to columns of a Pandas DataFrame.
- Parameters
input_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain) –
column_to_aggregation (Mapping[str, tmlt.core.measurements.pandas_measurements.series.Aggregate]) –
hint (Optional[Callable[[tmlt.core.utils.exact_number.ExactNumberInput, tmlt.core.utils.exact_number.ExactNumberInput], Dict[str, tmlt.core.utils.exact_number.ExactNumberInput]]]) –
- __init__(input_domain, column_to_aggregation, hint=None)
Constructor.
- Parameters
input_domain (PandasDataFrameDomain) – Input domain.
column_to_aggregation (Mapping[str, Aggregate]) – A dictionary mapping column names to aggregation measurements. The provided measurements must all have PureDP or all have RhoZCDP as their output_measure.
hint (Optional[Callable[[Union[ExactNumber, float, int, str, Fraction, Expr], Union[ExactNumber, float, int, str, Fraction, Expr]], Dict[str, Union[ExactNumber, float, int, str, Fraction, Expr]]]], default: None) – An optional hint. A hint is only required if the privacy_function() of one or more of the measurements raises NotImplementedError. The hint takes the same arguments as privacy_relation() and should return a d_out for each aggregation to be composed, where the d_outs sum to at most the d_out passed into the hint.
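The hint contract described above can be illustrated with a small sketch that splits the d_out it receives evenly across columns. This is plain Python rather than tmlt.core code: `make_even_split_hint` and the column names are hypothetical, and `Fraction` stands in for the exact numeric inputs the library accepts.

```python
from fractions import Fraction
from typing import Callable, Dict, List

Hint = Callable[[Fraction, Fraction], Dict[str, Fraction]]


def make_even_split_hint(columns: List[str]) -> Hint:
    """Build a hint that gives each column an equal share of d_out.

    Hypothetical helper for illustration; it is not part of tmlt.core.
    """

    def hint(d_in: Fraction, d_out: Fraction) -> Dict[str, Fraction]:
        # d_in is accepted because the hint takes the same arguments as
        # privacy_relation(); this simple strategy does not use it.
        share = Fraction(d_out) / len(columns)
        return {column: share for column in columns}

    return hint


hint = make_even_split_hint(["age", "income"])
d_outs = hint(Fraction(1), Fraction(1, 2))
# The shares must not add up to more than the d_out given to the hint.
assert sum(d_outs.values()) <= Fraction(1, 2)
```

An even split is only one possible strategy; a hint may weight columns differently as long as the returned d_outs stay within the d_out it was given.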
- property column_to_aggregation
Returns dictionary from column names to aggregation measurements.
- Return type
Dict[str, tmlt.core.measurements.pandas_measurements.series.Aggregate]
- privacy_function(d_in)
Returns the smallest d_out satisfied by the measurement.
Returns the sum of the privacy_function() results on d_in for all composed measurements.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Raises
NotImplementedError – If any of the measurements raise NotImplementedError.
- Return type
- privacy_relation(d_in, d_out)
Returns True only if outputs are close under close inputs.
Let d_outs be the d_out values returned by the privacy_function() of each composed measurement, or the d_outs from the hint if one of them raises NotImplementedError, and let total_d_out be the sum of d_outs.
This returns True if total_d_out <= d_out (the input argument) and each composed measurement satisfies its privacy_relation() from d_in to its d_out from d_outs.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
d_out (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between outputs under output_measure.
- Return type
bool
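The composition rules for privacy_function() and privacy_relation() described in the two entries above can be sketched in plain Python. This is a simplified model under stated assumptions, not the library's implementation: `ToyMeasurement`, `composed_privacy_function`, `composed_privacy_relation`, and `even_split` are hypothetical names, each toy measurement is assumed to cost `rate * d_in`, and re-raising when no hint is supplied is an assumption of this sketch.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Dict, Optional


@dataclass
class ToyMeasurement:
    """Hypothetical stand-in for a measurement that costs rate * d_in."""

    rate: Fraction
    has_privacy_function: bool = True

    def privacy_function(self, d_in: Fraction) -> Fraction:
        if not self.has_privacy_function:
            raise NotImplementedError()
        return self.rate * d_in

    def privacy_relation(self, d_in: Fraction, d_out: Fraction) -> bool:
        return self.rate * d_in <= d_out


Hint = Callable[[Fraction, Fraction], Dict[str, Fraction]]


def composed_privacy_function(
    measurements: Dict[str, ToyMeasurement], d_in: Fraction
) -> Fraction:
    # Sum of the per-column results; NotImplementedError propagates if any raise.
    return sum((m.privacy_function(d_in) for m in measurements.values()), Fraction(0))


def composed_privacy_relation(
    measurements: Dict[str, ToyMeasurement],
    d_in: Fraction,
    d_out: Fraction,
    hint: Optional[Hint] = None,
) -> bool:
    try:
        d_outs = {col: m.privacy_function(d_in) for col, m in measurements.items()}
    except NotImplementedError:
        if hint is None:
            raise  # assumption: without a hint there is nothing to fall back on
        d_outs = hint(d_in, d_out)
    total_d_out = sum(d_outs.values(), Fraction(0))
    # Both conditions must hold: the budget is not exceeded, and every
    # measurement satisfies its own relation for its share.
    return total_d_out <= d_out and all(
        m.privacy_relation(d_in, d_outs[col]) for col, m in measurements.items()
    )


measurements = {
    "count": ToyMeasurement(Fraction(1, 4)),
    "total": ToyMeasurement(Fraction(1, 2), has_privacy_function=False),
}


def even_split(d_in: Fraction, d_out: Fraction) -> Dict[str, Fraction]:
    return {col: Fraction(d_out) / len(measurements) for col in measurements}


accepted = composed_privacy_relation(measurements, Fraction(1), Fraction(1), hint=even_split)
rejected = composed_privacy_relation(measurements, Fraction(1), Fraction(1, 2), hint=even_split)
```

In this toy setup the first call succeeds because each column receives a share of 1/2, which covers both rates. The second fails because the "total" column needs 1/2 but is only allotted 1/4.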
- __call__(df)
Perform the aggregation.
- Parameters
df (pandas.DataFrame) – The DataFrame to aggregate.
- Return type
- property input_domain
Return input domain for the measurement.
- property output_schema
Return the output schema.
- Return type
pyspark.sql.types.StructType
- property input_metric
Distance metric on input domain.
- Return type
- property output_measure
Distance measure on output.
- Return type