id

Add a column containing a unique id for each row in a Spark DataFrame.

See the architecture overview for more information on transformations.

Classes

AddUniqueColumn

Adds a column containing a unique ID for each row.

class AddUniqueColumn(input_domain, column)

Bases: tmlt.core.transformations.base.Transformation

Adds a column containing a unique ID for each row.

Examples

>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| NaN|
|  a1| 0.1|
|  a2|null|
+----+----+

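>>> # Imports (module paths are assumed from the package layout)
>>> from tmlt.core.domains.spark_domains import (
...     SparkDataFrameDomain,
...     SparkFloatColumnDescriptor,
...     SparkStringColumnDescriptor,
... )
>>> from tmlt.core.transformations.spark_transformations.id import AddUniqueColumn
>>> add_unique_column = AddUniqueColumn(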
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True),
...         }
...     ),
...     column="ID",
... )
>>> # Apply transformation to data
>>> output_dataframe = add_unique_column(spark_dataframe)
>>> output_dataframe.sort("A").show(truncate=False)
+----+----+--------------------------------+
|A   |B   |ID                              |
+----+----+--------------------------------+
|null|NaN |5B6E756C6C2C224E614E222C2231225D|
|a1  |0.1 |5B226131222C22302E31222C2231225D|
|a2  |null|5B226132222C6E756C6C2C2231225D  |
+----+----+--------------------------------+
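
The ID values in the output above happen to be valid hexadecimal strings. As a quick way to see what they contain, the sketch below decodes them with the Python standard library. It only inspects the example output; treating the IDs as hex-encoded JSON arrays is an observation about these three values, not a documented guarantee, and `decode_id` is a hypothetical helper introduced here.

```python
import json

# ID values copied verbatim from the example output above.
example_ids = [
    "5B6E756C6C2C224E614E222C2231225D",
    "5B226131222C22302E31222C2231225D",
    "5B226132222C6E756C6C2C2231225D",
]


def decode_id(hex_id: str) -> list:
    """Decode a hex string to UTF-8 text and parse that text as JSON."""
    return json.loads(bytes.fromhex(hex_id).decode("utf-8"))


decoded = [decode_id(hex_id) for hex_id in example_ids]
for hex_id, values in zip(example_ids, decoded):
    print(hex_id, "->", values)
```

For these rows, each decoded array holds the row's values rendered as strings (or null) followed by the string "1". The trailing element looks like a counter that would distinguish duplicate rows, but that reading is a guess.
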

Transformation Contract:

>>> add_unique_column.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=False, size=64)})
>>> add_unique_column.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=False, size=64), 'ID': SparkStringColumnDescriptor(allow_null=False)})
>>> add_unique_column.input_metric
SymmetricDifference()
>>> add_unique_column.output_metric
IfGroupedBy(column='ID', inner_metric=SymmetricDifference())

Stability Guarantee:

AddUniqueColumn’s stability_function() returns d_in.

>>> add_unique_column.stability_function(1)
1
>>> add_unique_column.stability_function(2)
2
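
To make the stability guarantee concrete, the following is a rough pure-Python illustration, not the library's implementation. It assumes that a symmetric-difference distance counts the rows, with multiplicity, that appear in one input but not the other. Both functions and the sample rows are hypothetical stand-ins.

```python
from collections import Counter


def symmetric_difference(rows_a, rows_b) -> int:
    """Count rows (with multiplicity) that appear in one input but not the other."""
    count_a, count_b = Counter(rows_a), Counter(rows_b)
    return sum(((count_a - count_b) + (count_b - count_a)).values())


def stability_function(d_in: int) -> int:
    """Stand-in for AddUniqueColumn.stability_function, which returns d_in."""
    return d_in


# Two hypothetical inputs that differ by a single extra row.
rows_a = [("a1", 0.1), ("a2", None)]
rows_b = [("a1", 0.1), ("a2", None), ("a3", 0.5)]

d_in = symmetric_difference(rows_a, rows_b)
d_out = stability_function(d_in)
print(d_in, d_out)
```

Under these assumptions, inputs that differ by one row give a d_in of 1, and the bound on the output distance is the same value.
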

__init__(input_domain, column)

Constructor.

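Parameters
  • input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) – Domain of the input DataFrame.

  • column (str) – Name of the ID column to add.
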
property column

Returns the name of the ID column to add.

Return type

str

stability_function(d_in)

Returns the smallest d_out satisfied by the transformation.

See the architecture overview for more information.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber

__call__(sdf)

Returns the DataFrame with the ID column added.

Parameters

sdf (pyspark.sql.DataFrame) – Input DataFrame to transform.

Return type

pyspark.sql.DataFrame

property input_domain

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool
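
One way to picture how stability_relation relates to stability_function is sketched below. It assumes the relation holds exactly when the smallest d_out returned by the stability function is no larger than the d_out being checked. Plain integers stand in for ExactNumber values, and both functions are hypothetical stand-ins rather than the library's code.

```python
def stability_function(d_in: int) -> int:
    """Stand-in for AddUniqueColumn.stability_function, which returns d_in."""
    return d_in


def stability_relation(d_in: int, d_out: int) -> bool:
    """Assumed relation: true when the smallest valid d_out does not exceed d_out."""
    return stability_function(d_in) <= d_out


print(stability_relation(1, 1))
print(stability_relation(2, 1))
print(stability_relation(1, 3))
```

With this reading, any d_out at or above stability_function(d_in) satisfies the relation, and anything below it does not.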

__or__(other: Transformation) → Transformation
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.
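
As a loose illustration of what chaining with `|` can mean, the toy sketch below composes two components so that the left one runs first and the stability functions are applied in the same order. `ToyTransformation`, `add_id`, and `double_rows` are hypothetical names invented for this sketch. The left-to-right ordering and the composition of stability functions are assumptions, not a description of the library's internals.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyTransformation:
    """Hypothetical stand-in pairing a function on rows with a stability function."""

    apply: Callable[[list], list]
    stability: Callable[[int], int]

    def __call__(self, rows: list) -> list:
        return self.apply(rows)

    def __or__(self, other: "ToyTransformation") -> "ToyTransformation":
        # Run self first, then other; compose the stability functions the same way.
        return ToyTransformation(
            apply=lambda rows: other(self(rows)),
            stability=lambda d_in: other.stability(self.stability(d_in)),
        )


# Toy component that appends a string derived from the row; stability d_in.
add_id = ToyTransformation(
    apply=lambda rows: [(*row, repr(row)) for row in rows],
    stability=lambda d_in: d_in,
)

# Toy component that emits every row twice; stability 2 * d_in.
double_rows = ToyTransformation(
    apply=lambda rows: [row for row in rows for _ in range(2)],
    stability=lambda d_in: 2 * d_in,
)

chained = add_id | double_rows
result = chained([("a1",), ("a2",)])
print(len(result), chained.stability(3))
```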