select
Transformations for selecting columns from Spark DataFrames.
See the architecture overview for more information.
Classes
Select: Keep a subset of columns from a Spark DataFrame.
- class Select(input_domain, metric, columns)
Bases: tmlt.core.transformations.base.Transformation
Keep a subset of columns from a Spark DataFrame.
Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
>>> drop_b = Select(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     columns=["A"],
...     metric=SymmetricDifference(),
... )
>>> # Apply transformation to data
>>> spark_dataframe_without_b = drop_b(spark_dataframe)
>>> print_sdf(spark_dataframe_without_b)
    A
0  a1
1  a2
2  a3
3  a3
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain
Input metric - SymmetricDifference, HammingDistance, or IfGroupedBy
Output metric - SymmetricDifference, HammingDistance, or IfGroupedBy (matches input metric)
>>> drop_b.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> drop_b.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False)})
>>> drop_b.input_metric
SymmetricDifference()
>>> drop_b.output_metric
SymmetricDifference()
- Stability Guarantee:
Select’s stability_function() returns d_in.
>>> drop_b.stability_function(1)
1
>>> drop_b.stability_function(2)
2
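The guarantee can be checked by hand on a small example. The sketch below is a plain-Python model, not the library's implementation: rows are dicts, a DataFrame is a list of rows, and `select_columns` and `symmetric_difference` are hypothetical helpers written for illustration. It shows that dropping a column cannot push two datasets further apart under symmetric difference.

```python
from collections import Counter


def select_columns(rows, columns):
    """Keep only the named columns from each row (a row is a dict)."""
    return [{c: row[c] for c in columns} for row in rows]


def symmetric_difference(rows_a, rows_b):
    """Count rows that appear in one multiset of rows but not the other."""
    count_a = Counter(tuple(sorted(r.items())) for r in rows_a)
    count_b = Counter(tuple(sorted(r.items())) for r in rows_b)
    return sum(((count_a - count_b) + (count_b - count_a)).values())


# Two datasets that differ in a single row (only the B value changes).
x = [{"A": "a1", "B": "b1"}, {"A": "a2", "B": "b1"}, {"A": "a3", "B": "b2"}]
y = [{"A": "a1", "B": "b1"}, {"A": "a2", "B": "b2"}, {"A": "a3", "B": "b2"}]

d_in = symmetric_difference(x, y)
d_out = symmetric_difference(select_columns(x, ["A"]), select_columns(y, ["A"]))

# The output distance never exceeds the input distance.
assert d_out <= d_in
```

Here the differing row is removed from x and added to y, so d_in is 2. After keeping only column A the two datasets are identical, so d_out is 0, which is within the bound of d_in.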
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) –
columns (List[str]) –
- __init__(input_domain, metric, columns)
Constructor.
- Parameters
input_domain (SparkDataFrameDomain) – Domain of input DataFrame.
metric (Union[SymmetricDifference, HammingDistance, IfGroupedBy]) – Distance metric for input and output DataFrames.
columns (List[str]) – A list of existing column names to keep.
- stability_function(d_in)
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type
- __call__(sdf)
Selects columns.
- Parameters
sdf (pyspark.sql.DataFrame) –
- Return type
- property input_domain
Return input domain for the transformation.
- Return type
- property input_metric
Distance metric on input domain.
- Return type
- property output_domain
Return output domain for the transformation.
- Return type
- property output_metric
Distance metric on output domain.
- Return type
- stability_relation(d_in, d_out)
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
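One way to read the relation, assuming it is derived from the stability function (an assumption here, not something this page states): the pair (d_in, d_out) is accepted when d_out is at least the smallest output distance the transformation guarantees. For Select, whose stability function returns d_in, that reduces to a simple comparison. The functions below are plain-Python stand-ins, not library calls.

```python
def stability_function(d_in):
    """Stand-in for Select's stability function, which returns d_in."""
    return d_in


def stability_relation(d_in, d_out):
    """Assumed rule: accept d_out when it is at least stability_function(d_in)."""
    return stability_function(d_in) <= d_out


tight = stability_relation(1, 1)      # d_out equals the guaranteed bound
loose = stability_relation(1, 3)      # d_out is larger than needed
too_small = stability_relation(2, 1)  # d_out is below the guaranteed bound
```

Under this reading, any d_out at or above d_in is accepted for Select, and anything smaller is rejected.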
- __or__(other: Transformation) → Transformation
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
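To illustrate what chaining means for stability, here is a plain-Python sketch. `Step` is a hypothetical stand-in class, not part of the library, and the rule it uses for the chained stability function (feed the first step's d_out into the second step) is an assumption about how composition behaves, not a description of the library's internals.

```python
class Step:
    """Hypothetical stand-in for a transformation: a function plus a stability rule."""

    def __init__(self, func, stability):
        self.func = func
        self.stability = stability

    def __call__(self, rows):
        return self.func(rows)

    def stability_function(self, d_in):
        return self.stability(d_in)

    def __or__(self, other):
        # Run self first, then other; compose the stability rules in the same order.
        return Step(
            lambda rows: other(self(rows)),
            lambda d_in: other.stability_function(self.stability_function(d_in)),
        )


# Keeping column A leaves distances unchanged; duplicating every row doubles them.
select_a = Step(lambda rows: [{"A": r["A"]} for r in rows], lambda d_in: d_in)
duplicate = Step(lambda rows: [r for r in rows for _ in range(2)], lambda d_in: 2 * d_in)

chained = select_a | duplicate
result = chained([{"A": "a1", "B": "b1"}])
```

In this model the chained step first drops column B and then duplicates the remaining row, and its stability function maps d_in to 2 * d_in because the select step contributes a factor of one.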