nan#

Transformations to drop or replace NaNs, nulls, and infs in Spark DataFrames.

See the architecture overview for more information on transformations.

Classes#

DropInfs

Drops rows containing +inf or -inf in one or more specified columns.

DropNaNs

Drops rows containing NaNs in one or more specified columns.

DropNulls

Drops rows containing nulls in one or more specified columns.

ReplaceInfs

Replaces +inf and -inf in one or more specified columns.

ReplaceNaNs

Replaces NaNs in one or more specified columns.

ReplaceNulls

Replaces nulls in one or more specified columns.
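
The examples below assume a running Spark session, an input DataFrame named spark_dataframe, and a print_sdf helper for readable output. None of these are defined by this module; the following sketch shows one way such a setup might look (import paths, the sample data, and the helper are illustrative and may differ in your version of the library):

>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import (
...     SparkDataFrameDomain,
...     SparkFloatColumnDescriptor,
...     SparkStringColumnDescriptor,
... )
>>> from tmlt.core.metrics import HammingDistance, SymmetricDifference
>>> from tmlt.core.transformations.spark_transformations.nan import (
...     DropInfs,
...     DropNaNs,
...     DropNulls,
...     ReplaceInfs,
...     ReplaceNaNs,
...     ReplaceNulls,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> # Data matching the DropInfs example input; the other examples use similar frames.
>>> spark_dataframe = spark.createDataFrame(
...     [("a1", 0.1), ("a2", float("-inf")), ("a3", float("nan")), ("a4", float("inf"))],
...     ["A", "B"],
... )
>>> def print_sdf(sdf):
...     # Illustrative helper: collect to pandas and print with a stable row order.
...     print(sdf.toPandas().sort_values("A", ignore_index=True))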

class DropInfs(input_domain, metric, columns)#

Bases: tmlt.core.transformations.base.Transformation

Drops rows containing +inf or -inf in one or more specified columns.

Examples

>>> # Example input
>>> print_sdf(spark_dataframe)
    A    B
0  a1  0.1
1  a2 -inf
2  a3  NaN
3  a4  inf
>>> drop_b_infs = DropInfs(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     columns=["B"],
... )
>>> # Apply transformation to data
>>> output_dataframe = drop_b_infs(spark_dataframe)
>>> print_sdf(output_dataframe)
    A    B
0  a1  0.1
1  a3  NaN
Transformation Contract:
>>> drop_b_infs.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=False, size=64)})
>>> drop_b_infs.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=False, size=64)})
>>> drop_b_infs.input_metric
SymmetricDifference()
>>> drop_b_infs.output_metric
SymmetricDifference()
Stability Guarantee:

DropInfs’s stability_function() returns d_in.

>>> drop_b_infs.stability_function(1)
1
>>> drop_b_infs.stability_function(2)
2
__init__(input_domain, metric, columns)#

Constructor.
property columns#

Returns the columns to check for +inf and -inf.

Return type

List[str]

stability_function(d_in)#

Returns the smallest d_out satisfied by the transformation.

See the architecture overview for more information.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber

__call__(sdf)#

Drops rows containing +inf or -inf in self.columns.

Parameters

sdf (pyspark.sql.DataFrame) –

Return type

pyspark.sql.DataFrame

property input_domain#

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain#

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric#

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool
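
For a transformation whose stability function is the identity, as above, the relation is expected to hold exactly when d_out is at least d_in. A sketch, assuming stability_relation compares stability_function(d_in) against d_out:

>>> drop_b_infs.stability_relation(1, 1)
True
>>> drop_b_infs.stability_relation(2, 1)
False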

__or__(other: Transformation) → Transformation#
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.
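
As a sketch of chaining (assuming the | operator composes transformations as the signatures above describe), drop_b_infs can feed a DropNaNs built on its output domain; DropNaNs is documented below:

>>> drop_b_nans_next = DropNaNs(
...     input_domain=drop_b_infs.output_domain,
...     metric=SymmetricDifference(),
...     columns=["B"],
... )
>>> drop_infs_then_nans = drop_b_infs | drop_b_nans_next
>>> # Both steps are 1-stable, so the chain should also map d_in to d_in.
>>> drop_infs_then_nans.stability_function(1)
1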

class DropNaNs(input_domain, metric, columns)#

Bases: tmlt.core.transformations.base.Transformation

Drops rows containing NaNs in one or more specified columns.

Examples

>>> # Example input
>>> print_sdf(spark_dataframe)
    A    B
0  a1  0.1
1  a2  1.1
2  a3  NaN
3  a4  inf
>>> drop_b_nans = DropNaNs(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     columns=["B"],
... )
>>> # Apply transformation to data
>>> output_dataframe = drop_b_nans(spark_dataframe)
>>> print_sdf(output_dataframe)
    A    B
0  a1  0.1
1  a2  1.1
2  a4  inf
Transformation Contract:
>>> drop_b_nans.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=False, size=64)})
>>> drop_b_nans.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=False, allow_inf=True, allow_null=False, size=64)})
>>> drop_b_nans.input_metric
SymmetricDifference()
>>> drop_b_nans.output_metric
SymmetricDifference()
Stability Guarantee:

DropNaNs’s stability_function() returns d_in.

>>> drop_b_nans.stability_function(1)
1
>>> drop_b_nans.stability_function(2)
2
__init__(input_domain, metric, columns)#

Constructor.
property columns#

Returns the columns to check for NaNs.

Return type

List[str]

stability_function(d_in)#

Returns the smallest d_out satisfied by the transformation.

See the architecture overview for more information.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber

__call__(sdf)#

Drops rows containing NaNs in self.columns.

Parameters

sdf (pyspark.sql.DataFrame) –

Return type

pyspark.sql.DataFrame

property input_domain#

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain#

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric#

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool

__or__(other: Transformation) → Transformation#
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.

class DropNulls(input_domain, metric, columns)#

Bases: tmlt.core.transformations.base.Transformation

Drops rows containing nulls in one or more specified columns.

Examples

>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| NaN|
|  a1| 0.1|
|  a2|null|
+----+----+

>>> drop_b_nulls = DropNulls(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(allow_null=True),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_null=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     columns=["B"],
... )
>>> # Apply transformation to data
>>> output_dataframe = drop_b_nulls(spark_dataframe)
>>> output_dataframe.sort("A").show()
+----+---+
|   A|  B|
+----+---+
|null|NaN|
|  a1|0.1|
+----+---+
Transformation Contract:
>>> drop_b_nulls.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=True, size=64)})
>>> drop_b_nulls.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=False, size=64)})
>>> drop_b_nulls.input_metric
SymmetricDifference()
>>> drop_b_nulls.output_metric
SymmetricDifference()
Stability Guarantee:

DropNulls’s stability_function() returns d_in.

>>> drop_b_nulls.stability_function(1)
1
>>> drop_b_nulls.stability_function(2)
2
__init__(input_domain, metric, columns)#

Constructor.
property columns#

Returns the columns to check for nulls.

Return type

List[str]

stability_function(d_in)#

Returns the smallest d_out satisfied by the transformation.

See the architecture overview for more information.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber

__call__(sdf)#

Drops rows containing nulls in self.columns.

Parameters

sdf (pyspark.sql.DataFrame) –

Return type

pyspark.sql.DataFrame

property input_domain#

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain#

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric#

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool

__or__(other: Transformation) → Transformation#
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.

class ReplaceInfs(input_domain, metric, replace_map)#

Bases: tmlt.core.transformations.base.Transformation

Replaces +inf and -inf in one or more specified columns.

Examples

>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+---------+
|   A|        B|
+----+---------+
|null|      NaN|
|  a1| Infinity|
|  a2|     null|
|  a3|-Infinity|
+----+---------+

>>> replace_infs = ReplaceInfs(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(allow_null=True),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_null=True, allow_inf=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     replace_map={"B": (-100.0, 100.0)},
... )
>>> # Apply transformation to data
>>> output_dataframe = replace_infs(spark_dataframe)
>>> output_dataframe.sort("A").show()
+----+------+
|   A|     B|
+----+------+
|null|   NaN|
|  a1| 100.0|
|  a2|  null|
|  a3|-100.0|
+----+------+
Transformation Contract:
>>> replace_infs.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=True, size=64)})
>>> replace_infs.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=True, size=64)})
>>> replace_infs.input_metric
SymmetricDifference()
>>> replace_infs.output_metric
SymmetricDifference()
Stability Guarantee:

ReplaceInfs’s stability_function() returns d_in.

>>> replace_infs.stability_function(1)
1
>>> replace_infs.stability_function(2)
2
__init__(input_domain, metric, replace_map)#

Constructor.
property replace_map#

Returns mapping used to replace infinite values.

Return type

Dict[str, Tuple[float, float]]
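
For illustration, inspecting this property on the transformation constructed in the example above should echo the mapping that was passed in (the output shown is a sketch):

>>> replace_infs.replace_map
{'B': (-100.0, 100.0)}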

stability_function(d_in)#

Returns the smallest d_out satisfied by the transformation.

See the architecture overview for more information.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber

__call__(sdf)#

Returns DataFrame with +inf and -inf replaced in specified columns.

Parameters

sdf (pyspark.sql.DataFrame) –

Return type

pyspark.sql.DataFrame

property input_domain#

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain#

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric#

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool

__or__(other: Transformation) → Transformation#
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.

class ReplaceNaNs(input_domain, metric, replace_map)#

Bases: tmlt.core.transformations.base.Transformation

Replaces NaNs in one or more specified columns.

Examples

>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| NaN|
|  a1| 0.1|
|  a2|null|
+----+----+

>>> replace_nans = ReplaceNaNs(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(allow_null=True),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_null=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     replace_map={"B": 0.0},
... )
>>> # Apply transformation to data
>>> output_dataframe = replace_nans(spark_dataframe)
>>> output_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| 0.0|
|  a1| 0.1|
|  a2|null|
+----+----+
Transformation Contract:
>>> replace_nans.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=True, size=64)})
>>> replace_nans.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=False, allow_inf=False, allow_null=True, size=64)})
>>> replace_nans.input_metric
SymmetricDifference()
>>> replace_nans.output_metric
SymmetricDifference()
Stability Guarantee:

ReplaceNaNs’s stability_function() returns d_in.

>>> replace_nans.stability_function(1)
1
>>> replace_nans.stability_function(2)
2
__init__(input_domain, metric, replace_map)#

Constructor.
property replace_map#

Returns mapping used to replace NaNs.

Return type

Dict[str, Any]

stability_function(d_in)#

Returns the smallest d_out satisfied by the transformation.

See the architecture overview for more information.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber

__call__(sdf)#

Returns DataFrame with NaNs replaced in specified columns.

Parameters

sdf (pyspark.sql.DataFrame) –

Return type

pyspark.sql.DataFrame

property input_domain#

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain#

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric#

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool

__or__(other: Transformation) → Transformation#
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.

class ReplaceNulls(input_domain, metric, replace_map)#

Bases: tmlt.core.transformations.base.Transformation

Replaces nulls in one or more specified columns.

Examples

>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| NaN|
|  a1| 0.1|
|  a2|null|
+----+----+

>>> replace_nulls = ReplaceNulls(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(allow_null=True),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_null=True),
...         }
...     ),
...     metric=HammingDistance(),
...     replace_map={"A": "a0", "B": 0.0},
... )
>>> # Apply transformation to data
>>> output_dataframe = replace_nulls(spark_dataframe)
>>> output_dataframe.sort("A").show()
+---+---+
|  A|  B|
+---+---+
| a0|NaN|
| a1|0.1|
| a2|0.0|
+---+---+
Transformation Contract:
>>> replace_nulls.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=True, size=64)})
>>> replace_nulls.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=False, size=64)})
>>> replace_nulls.input_metric
HammingDistance()
>>> replace_nulls.output_metric
HammingDistance()
Stability Guarantee:

ReplaceNulls’s stability_function() returns d_in.

>>> replace_nulls.stability_function(1)
1
>>> replace_nulls.stability_function(2)
2
__init__(input_domain, metric, replace_map)#

Constructor.
property replace_map#

Returns mapping used to replace nulls.

Return type

Dict[str, Any]

stability_function(d_in)#

Returns the smallest d_out satisfied by the transformation.

See the architecture overview for more information.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber

__call__(sdf)#

Returns DataFrame with nulls replaced in specified columns.

Parameters

sdf (pyspark.sql.DataFrame) –

Return type

pyspark.sql.DataFrame

property input_domain#

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain#

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric#

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool

__or__(other: Transformation) → Transformation#
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.