id
Add a column containing a unique id for each row in a Spark DataFrame.
See the architecture overview for more information on transformations.
Classes
AddUniqueColumn | Adds a column containing a unique ID for each row.
- class AddUniqueColumn(input_domain, column)
Bases: tmlt.core.transformations.base.Transformation
Adds a column containing a unique ID for each row.
Examples
>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| NaN|
|  a1| 0.1|
|  a2|null|
+----+----+
>>> add_unique_column = AddUniqueColumn(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True),
...         }
...     ),
...     column="ID",
... )
>>> # Apply transformation to data
>>> output_dataframe = add_unique_column(spark_dataframe)
>>> output_dataframe.sort("A").show(truncate=False)
+----+----+--------------------------------+
|A   |B   |ID                              |
+----+----+--------------------------------+
|null|NaN |5B6E756C6C2C224E614E222C2231225D|
|a1  |0.1 |5B226131222C22302E31222C2231225D|
|a2  |null|5B226132222C6E756C6C2C2231225D  |
+----+----+--------------------------------+
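The ID values in the example output above look like hex-encoded text. The following sketch uses only the Python standard library to decode them, and then shows one way such values could be rebuilt. The helper names `decode_id` and `encode_id` are hypothetical, and reading the trailing `"1"` as a per-row occurrence counter is an assumption drawn from the decoded strings, not a statement about how the library computes IDs.

```python
import json

# ID values copied verbatim from the example output.
ids = [
    "5B6E756C6C2C224E614E222C2231225D",
    "5B226131222C22302E31222C2231225D",
    "5B226132222C6E756C6C2C2231225D",
]


def decode_id(hex_id: str) -> str:
    """Decode a hex-encoded ID into the text it represents."""
    return bytes.fromhex(hex_id).decode("utf-8")


def encode_id(values, occurrence: int) -> str:
    """Hypothetical inverse: hex-encode a compact JSON array.

    `values` holds the row's column values as strings (or None for null);
    `occurrence` is assumed to number identical rows starting at 1.
    """
    payload = json.dumps([*values, str(occurrence)], separators=(",", ":"))
    return payload.encode("utf-8").hex().upper()


decoded = [decode_id(i) for i in ids]
for hex_id, text in zip(ids, decoded):
    print(f"{hex_id} -> {text}")
```

Each decoded string is a compact JSON array holding the row's values followed by the string `"1"`. That would keep IDs distinct even when two rows are otherwise identical, provided the last element really is a counter.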
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain
Input metric - SymmetricDifference
Output metric - IfGroupedBy over SymmetricDifference
>>> add_unique_column.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=False, size=64)})
>>> add_unique_column.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=False, size=64), 'ID': SparkStringColumnDescriptor(allow_null=False)})
>>> add_unique_column.input_metric
SymmetricDifference()
>>> add_unique_column.output_metric
IfGroupedBy(column='ID', inner_metric=SymmetricDifference())
- Stability Guarantee:
AddUniqueColumn’s stability_function() returns d_in.
>>> add_unique_column.stability_function(1)
1
>>> add_unique_column.stability_function(2)
2
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of input DataFrames.
column (str) – Name of the ID column to add.
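To see why a stability of d_in is plausible for the contract above, here is a toy model in plain Python. It does not use Spark or the library. It assumes that the k-th copy of a row receives the ID `(row, k)`, and that the output metric counts the IDs whose groups differ between the two outputs. All function names are hypothetical.

```python
from collections import Counter


def symmetric_difference(rows_a, rows_b) -> int:
    """Number of rows to add or remove to turn one multiset into the other."""
    a, b = Counter(rows_a), Counter(rows_b)
    return sum(((a - b) + (b - a)).values())


def add_ids(rows):
    """Toy model: tag the k-th copy of each row with the ID (row, k)."""
    seen = Counter()
    out = []
    for row in rows:
        seen[row] += 1
        out.append((row, (row, seen[row])))
    return out


def groups_that_differ(out_a, out_b) -> int:
    """Toy grouped metric: count IDs whose groups of rows are not identical."""

    def by_id(out):
        groups = {}
        for row, row_id in out:
            groups.setdefault(row_id, Counter())[row] += 1
        return groups

    ga, gb = by_id(out_a), by_id(out_b)
    return sum(1 for key in ga.keys() | gb.keys() if ga.get(key) != gb.get(key))


x = [("a1", 0.1), ("a1", 0.1), ("a2", None)]
y = [("a1", 0.1), ("a3", 0.5)]

d_in = symmetric_difference(x, y)
d_out = groups_that_differ(add_ids(x), add_ids(y))
assert d_out <= d_in  # consistent with a stability function that returns d_in
```

In this model every ID identifies exactly one row, so each added or removed input row changes exactly one ID group. Under these assumptions the output distance cannot exceed the input distance.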
- __init__(input_domain, column)
Constructor.
- Parameters
input_domain (SparkDataFrameDomain) – Domain of input DataFrames.
- stability_function(d_in)
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type
- __call__(sdf)
Returns DataFrame with ID column added.
- Parameters
sdf (pyspark.sql.DataFrame) –
- Return type
- property input_domain
Return input domain for the transformation.
- Return type
- property input_metric
Distance metric on input domain.
- Return type
- property output_domain
Return output domain for the transformation.
- Return type
- property output_metric
Distance metric on output domain.
- Return type
- stability_relation(d_in, d_out)
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
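The relationship between stability_function and stability_relation can be sketched in a few lines. This is a simplified model that uses plain integers and assumes the relation holds exactly when the smallest achievable d_out is no larger than the requested one. The library works with exact numbers and metric-specific comparisons, so treat this as an illustration only.

```python
def stability_function(d_in: int) -> int:
    """Mirror the stability guarantee stated for AddUniqueColumn: return d_in."""
    if d_in < 0:
        raise ValueError("d_in must be non-negative")
    return d_in


def stability_relation(d_in: int, d_out: int) -> bool:
    """Assumed relation: True when stability_function(d_in) fits within d_out."""
    return stability_function(d_in) <= d_out


checks = {
    (1, 1): stability_relation(1, 1),  # tight bound
    (1, 2): stability_relation(1, 2),  # looser d_out is still acceptable
    (2, 1): stability_relation(2, 1),  # d_out smaller than the guarantee
}
```

The design point is that the function gives the tightest bound, while the relation accepts any bound at least that large.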
- __or__(other: Transformation) → Transformation
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
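As a rough illustration of chaining with `|`, the sketch below defines a hypothetical `ToyTransformation` class that is not part of the library. It models only how stability functions might compose when two components are chained. A real implementation would also need to check that the first component's output domain and metric match the second component's input domain and metric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyTransformation:
    """Hypothetical stand-in that carries only a name and a stability function."""

    name: str
    stability_function: Callable[[int], int]

    def __or__(self, other):
        # Chain: the output distance of `self` becomes the input distance of `other`.
        if not isinstance(other, ToyTransformation):
            return NotImplemented
        return ToyTransformation(
            name=f"{self.name} | {other.name}",
            stability_function=lambda d_in: other.stability_function(
                self.stability_function(d_in)
            ),
        )


# The first step has stability d_in, as stated for AddUniqueColumn above.
add_id = ToyTransformation("add_id", lambda d_in: d_in)
# The second step is a made-up transformation that doubles distances.
doubling = ToyTransformation("doubling", lambda d_in: 2 * d_in)

chained = add_id | doubling
```

Here `chained.stability_function(3)` evaluates the first function and feeds its result to the second, which is the usual way stability bounds propagate through a pipeline.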