filter#

Transformations for filtering Spark DataFrames.

See the architecture overview for more information on transformations.

Classes#

Filter

Keeps only selected rows in a Spark DataFrame using an expression.

class Filter(domain, metric, filter_expr)#

Bases: tmlt.core.transformations.base.Transformation

Keeps only selected rows in a Spark DataFrame using an expression.

Example

>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
>>> # Create the transformation
>>> filter_transformation = Filter(
...     domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     metric=SymmetricDifference(),
...     filter_expr="A = 'a1' or B = 'b2'",
... )
>>> # Apply transformation to data
>>> filtered_spark_dataframe = filter_transformation(spark_dataframe)
>>> print_sdf(filtered_spark_dataframe)
    A   B
0  a1  b1
1  a3  b2
2  a3  b2

Transformation Contract:

>>> filter_transformation.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> filter_transformation.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> filter_transformation.input_metric
SymmetricDifference()
>>> filter_transformation.output_metric
SymmetricDifference()

Stability Guarantee:

Filter’s stability_function() is the identity function.

>>> filter_transformation.stability_function(1)
1
>>> filter_transformation.stability_function(123)
123
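
The identity stability function reflects that filtering cannot move two DataFrames further apart under SymmetricDifference: a row present in one input but not the other is either kept or dropped, so it adds at most one to the output distance. The following is a minimal plain-Python sketch of that reasoning, with rows modelled as tuples. The helpers symmetric_difference, filter_rows, and keep are hypothetical illustrations and are not part of tmlt.core.

```python
from collections import Counter


def symmetric_difference(left, right):
    """Size of the multiset symmetric difference between two row collections."""
    left_counts, right_counts = Counter(left), Counter(right)
    # Counter subtraction keeps only positive counts, so each side
    # contributes the rows it has in excess of the other.
    only_left = left_counts - right_counts
    only_right = right_counts - left_counts
    return sum((only_left + only_right).values())


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]


def keep(row):
    """Plain-Python equivalent of the expression "A = 'a1' or B = 'b2'"."""
    a, b = row
    return a == "a1" or b == "b2"


df1 = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a3", "b2")]
df2 = [("a1", "b1"), ("a3", "b2"), ("a4", "b2")]

d_in = symmetric_difference(df1, df2)
d_out = symmetric_difference(filter_rows(df1, keep), filter_rows(df2, keep))

# The distance after filtering never exceeds the distance before it.
print(d_in, d_out)  # 3 2
```

Here the inputs differ in three rows and the filtered outputs differ in two, which is consistent with a bound of d_out = d_in.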
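Parameters
  • domain (SparkDataFrameDomain) – Domain of the input Spark DataFrames. As the contract above shows, it is also the output domain.

  • metric – Distance metric on the input DataFrames, for example SymmetricDifference(). It is also the output metric.

  • filter_expr (str) – Expression selecting the rows to keep, for example "A = 'a1' or B = 'b2'".

__init__(domain, metric, filter_expr)#

Constructor. Takes the parameters listed above.

property filter_expr#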

Returns the filter expression.

Return type

str

stability_function(d_in)#

Returns the smallest d_out satisfied by the transformation.

See the architecture overview for more information.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber

__call__(sdf)#

Returns the filtered DataFrame.

Parameters

sdf (pyspark.sql.DataFrame) – Spark DataFrame to filter.

Return type

pyspark.sql.DataFrame

property input_domain#

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain#

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric#

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool
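
One way to read the relation: because stability_function(d_in) returns the smallest d_out the transformation satisfies, the relation should hold whenever d_out is at least that value. The sketch below assumes this reading and uses plain integers compared with <=. The free functions are hypothetical stand-ins, not the library's implementation.

```python
def stability_function(d_in):
    """Identity stability function, as documented for Filter."""
    return d_in


def stability_relation(d_in, d_out):
    """True when d_out is at least the smallest distance guaranteed for d_in."""
    return stability_function(d_in) <= d_out


checks = [
    stability_relation(1, 1),  # the tight bound holds
    stability_relation(1, 5),  # a looser bound also holds
    stability_relation(3, 2),  # a d_out that is too small does not hold
]
print(checks)  # [True, True, False]
```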

__or__(other: Transformation) → Transformation#
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.
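
To illustrate what chaining can mean for a stability guarantee, here is a toy model in plain Python. ToyTransformation is a hypothetical class, not tmlt.core's. It only assumes that chaining applies the components in order and composes their stability functions in the same order.

```python
class ToyTransformation:
    """Hypothetical stand-in pairing a row function with a stability function."""

    def __init__(self, apply, stability_function):
        self._apply = apply
        self.stability_function = stability_function

    def __call__(self, rows):
        return self._apply(rows)

    def __or__(self, other):
        # Run self first, then other; the stability functions compose the same way.
        return ToyTransformation(
            lambda rows: other(self(rows)),
            lambda d_in: other.stability_function(self.stability_function(d_in)),
        )


keep_b2 = ToyTransformation(
    lambda rows: [row for row in rows if row[1] == "b2"],
    lambda d_in: d_in,  # filtering: identity stability
)
duplicate = ToyTransformation(
    lambda rows: [row for row in rows for _ in range(2)],
    lambda d_in: 2 * d_in,  # each differing row appears twice downstream
)

chained = keep_b2 | duplicate
result = chained([("a1", "b1"), ("a3", "b2")])
print(result)                         # [('a3', 'b2'), ('a3', 'b2')]
print(chained.stability_function(3))  # 6
```

In this toy model the filter contributes nothing to the chained bound, so the result is driven entirely by the second component.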