filter#
Transformations for filtering Spark DataFrames.
See the architecture overview for more information on transformations.
Classes#
Filter | Keeps only selected rows in a Spark DataFrame using an expression.
- class Filter(domain, metric, filter_expr)#
Bases: tmlt.core.transformations.base.Transformation
Keeps only selected rows in a Spark DataFrame using an expression.
Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
>>> # Create the transformation
>>> filter_transformation = Filter(
...     domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     metric=SymmetricDifference(),
...     filter_expr="A = 'a1' or B = 'b2'",
... )
>>> # Apply transformation to data
>>> filtered_spark_dataframe = filter_transformation(spark_dataframe)
>>> print_sdf(filtered_spark_dataframe)
    A   B
0  a1  b1
1  a3  b2
2  a3  b2
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain (matches input domain)
Input metric - SymmetricDifference or IfGroupedBy
Output metric - SymmetricDifference or IfGroupedBy (matches input metric)
>>> filter_transformation.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> filter_transformation.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> filter_transformation.input_metric
SymmetricDifference()
>>> filter_transformation.output_metric
SymmetricDifference()
- Stability Guarantee:
Filter’s stability_function() is the identity function.
>>> filter_transformation.stability_function(1)
1
>>> filter_transformation.stability_function(123)
123
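The identity stability can be illustrated without Spark. The sketch below is not Tumult Core code: it assumes SymmetricDifference counts the rows by which two multisets of rows differ, and the helpers `symmetric_difference` and `keep` are hypothetical names introduced only for this illustration. Dropping rows with a per-row predicate can only remove differing rows, never create new ones, so the distance between the filtered outputs does not exceed the distance between the inputs.

```python
from collections import Counter


def symmetric_difference(left, right):
    """Size of the multiset symmetric difference between two lists of rows."""
    left_counts, right_counts = Counter(left), Counter(right)
    # Counter subtraction keeps only positive counts, so each term counts
    # the rows present in one multiset but missing from the other.
    return sum((left_counts - right_counts).values()) + sum(
        (right_counts - left_counts).values()
    )


def keep(row):
    """Plain-Python stand-in for filter_expr="A = 'a1' or B = 'b2'"."""
    a, b = row
    return a == "a1" or b == "b2"


rows = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a3", "b2")]
# A neighboring dataset: ("a2", "b1") removed and ("a4", "b2") added.
neighbor = [("a1", "b1"), ("a3", "b2"), ("a3", "b2"), ("a4", "b2")]

d_in = symmetric_difference(rows, neighbor)
d_out = symmetric_difference(
    [row for row in rows if keep(row)],
    [row for row in neighbor if keep(row)],
)

# The filtered outputs are never farther apart than the inputs.
assert d_out <= d_in
```

Here the inputs differ in two rows while the filtered outputs differ in one, which is consistent with a stability function that returns d_in unchanged as an upper bound.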
- Parameters
domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.IfGroupedBy]) –
filter_expr (str) –
- __init__(domain, metric, filter_expr)#
Constructor.
- Parameters
filter_expr (str) – A SQL expression string specifying the filter to apply to the data. The language is the same as the one used by pyspark.sql.DataFrame.filter().
domain (SparkDataFrameDomain) – Domain of the input/output Spark DataFrames.
metric (Union[SymmetricDifference, IfGroupedBy]) – Distance metric for the input and output Spark DataFrames. If the metric is IfGroupedBy, the innermost metric must be SymmetricDifference.
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type
- __call__(sdf)#
Returns the filtered DataFrame.
- Parameters
sdf (pyspark.sql.DataFrame) –
- Return type
- property input_domain#
Return input domain for the transformation.
- Return type
- property input_metric#
Distance metric on input domain.
- Return type
- property output_domain#
Return output domain for the transformation.
- Return type
- property output_metric#
Distance metric on output domain.
- Return type
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
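As a rough sketch of how stability_relation() relates to stability_function(), the toy class below assumes the relation holds exactly when d_out is at least the smallest bound returned by the stability function. `IdentityStability` is a hypothetical stand-in that uses plain integers; it is not the Tumult Core implementation, which compares distances under the output metric.

```python
class IdentityStability:
    """Hypothetical stand-in for a transformation with identity stability."""

    def stability_function(self, d_in):
        # Smallest d_out guaranteed for inputs at distance d_in.
        return d_in

    def stability_relation(self, d_in, d_out):
        # Assumption: any d_out at least as large as the tightest bound
        # is also a valid bound on the output distance.
        return self.stability_function(d_in) <= d_out


toy = IdentityStability()
print(toy.stability_relation(1, 1))  # tight bound
print(toy.stability_relation(1, 5))  # looser bound
print(toy.stability_relation(3, 2))  # bound that is too small
```

Under this assumption, the relation accepts the tight bound and anything looser, and rejects a d_out smaller than d_in.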
- __or__(other: Transformation) -> Transformation #
- __or__(other: tmlt.core.measurements.base.Measurement) -> tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
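A minimal sketch of what chaining with `|` might look like, using a toy class rather than Tumult Core itself. `ToyTransformation` and its constructor arguments are hypothetical, and the sketch assumes that the left-hand component runs first and that the chained stability function is the composition of the two stability functions.

```python
class ToyTransformation:
    """Hypothetical stand-in; not the tmlt.core Transformation class."""

    def __init__(self, func, stability):
        self.func = func            # maps input data to output data
        self.stability = stability  # maps d_in to d_out

    def __call__(self, data):
        return self.func(data)

    def stability_function(self, d_in):
        return self.stability(d_in)

    def __or__(self, other):
        # Assumption: `self` is applied first, then `other`; the stability
        # functions compose in the same order.
        return ToyTransformation(
            lambda data: other(self(data)),
            lambda d_in: other.stability_function(self.stability_function(d_in)),
        )


# A filter-like step (identity stability) followed by a step that
# duplicates every row (doubles the distance).
keep_b2 = ToyTransformation(
    lambda rows: [row for row in rows if row[1] == "b2"], lambda d_in: d_in
)
duplicate = ToyTransformation(lambda rows: rows + rows, lambda d_in: 2 * d_in)

chained = keep_b2 | duplicate
result = chained([("a1", "b1"), ("a3", "b2")])
```

Because the filter-like step leaves distances unchanged, the chained stability in this toy example is determined entirely by the second step.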