select#

Transformations for selecting columns from Spark DataFrames.

See the architecture overview for more information.

Classes#

Select

Keep a subset of columns from a Spark DataFrame.

class Select(input_domain, metric, columns)#

Bases: tmlt.core.transformations.base.Transformation

Keep a subset of columns from a Spark DataFrame.

Example

>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
>>> drop_b = Select(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     columns=["A"],
...     metric=SymmetricDifference(),
... )
>>> # Apply transformation to data
>>> spark_dataframe_without_b = drop_b(spark_dataframe)
>>> print_sdf(spark_dataframe_without_b)
    A
0  a1
1  a2
2  a3
3  a3

Transformation Contract:

>>> drop_b.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> drop_b.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False)})
>>> drop_b.input_metric
SymmetricDifference()
>>> drop_b.output_metric
SymmetricDifference()

Stability Guarantee:

Select’s stability_function() returns d_in.

>>> drop_b.stability_function(1)
1
>>> drop_b.stability_function(2)
2
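
The guarantee above can be illustrated without Spark. The following is a minimal sketch, not the library's implementation: it models a DataFrame as a list of row tuples, and the helper names `symmetric_difference` and `select` are hypothetical. It suggests why dropping columns never increases the symmetric-difference distance between two inputs, which is consistent with a stability function that returns d_in.

```python
from collections import Counter


def symmetric_difference(rows_a, rows_b):
    """Size of the multiset symmetric difference between two lists of rows."""
    count_a, count_b = Counter(rows_a), Counter(rows_b)
    # Counter subtraction keeps only positive counts, so each side
    # contributes the rows it has in excess of the other.
    return sum(((count_a - count_b) + (count_b - count_a)).values())


def select(rows, header, columns):
    """Keep only the named columns of each row (plain-Python stand-in)."""
    indices = [header.index(name) for name in columns]
    return [tuple(row[i] for i in indices) for row in rows]


header = ["A", "B"]
left = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a3", "b2")]
# Neighbor that differs in a kept column: the distance is preserved.
right = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b2")]
# Neighbor that differs only in the dropped column: the distance shrinks.
right_b_only = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a3", "b9")]

d_in = symmetric_difference(left, right)
d_out = symmetric_difference(
    select(left, header, ["A"]), select(right, header, ["A"])
)
d_in_b_only = symmetric_difference(left, right_b_only)
d_out_b_only = symmetric_difference(
    select(left, header, ["A"]), select(right_b_only, header, ["A"])
)
print(d_in, d_out, d_in_b_only, d_out_b_only)
```

In the first pair the output distance equals the input distance, so d_in is the tightest bound that holds for every pair of inputs; in the second pair the outputs become identical, so the bound is not always attained.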
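Parameters
  • input_domain – Domain of the input DataFrames.

  • metric – Distance metric for input and output DataFrames.

  • columns – Names of the columns to keep.
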
__init__(input_domain, metric, columns)#

Constructor.

property columns#

Returns columns being selected.

Return type

List[str]

stability_function(d_in)#

Returns the smallest d_out satisfied by the transformation.

See the architecture overview for more information.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber

__call__(sdf)#

Selects columns.

Parameters

sdf (pyspark.sql.DataFrame) –

Return type

pyspark.sql.DataFrame

property input_domain#

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain#

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric#

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool
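
One way to read the relationship between stability_function() and stability_relation() is sketched below. This is an illustration under a stated assumption, namely that the relation holds exactly when d_out is at least the bound returned by the stability function; the class `SelectSketch` is hypothetical and does not use the library.

```python
from fractions import Fraction


class SelectSketch:
    """Hypothetical stand-in whose stability function is the identity."""

    def stability_function(self, d_in):
        # Select's stability_function() returns d_in unchanged.
        return Fraction(d_in)

    def stability_relation(self, d_in, d_out):
        # Assumption: the relation is satisfied whenever d_out is at least
        # the smallest bound reported by stability_function.
        return self.stability_function(d_in) <= Fraction(d_out)


sketch = SelectSketch()
checks = [
    sketch.stability_relation(1, 1),  # bound met exactly
    sketch.stability_relation(2, 1),  # d_out is too small
    sketch.stability_relation(1, 3),  # looser d_out is still valid
]
print(checks)
```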

__or__(other: Transformation) → Transformation#
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.
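
The chaining idea can be sketched in plain Python. The `Step` class below is hypothetical and only mimics the `|` operator: it runs the left component first, then the right, and composes the stability bounds in the same order. It makes no claim about how the library builds chained components internally.

```python
class Step:
    """Hypothetical component: a function on data plus a stability function."""

    def __init__(self, apply, stability):
        self.apply = apply
        self.stability = stability

    def __call__(self, data):
        return self.apply(data)

    def __or__(self, other):
        # Run self first, then other; the stability bounds compose the same way.
        return Step(
            apply=lambda data: other(self(data)),
            stability=lambda d_in: other.stability(self.stability(d_in)),
        )


# Keeps column "A" only; one changed input row changes at most one output row.
keep_a = Step(lambda rows: [{"A": r["A"]} for r in rows], lambda d_in: d_in)
# Emits every row twice; one changed input row changes two output rows.
duplicate = Step(lambda rows: rows + rows, lambda d_in: 2 * d_in)

chained = keep_a | duplicate
result = chained([{"A": "a1", "B": "b1"}])
bound = chained.stability(1)
print(result, bound)
```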