nan#
Transformations to drop or replace NaNs, nulls, and infs in Spark DataFrames.
See the architecture overview for more information on transformations.
Classes#
DropInfs | Drops rows containing +inf or -inf in one or more specified columns.
DropNaNs | Drops rows containing NaNs in one or more specified columns.
DropNulls | Drops rows containing nulls in one or more specified columns.
ReplaceInfs | Replaces +inf and -inf in one or more specified columns.
ReplaceNaNs | Replaces NaNs in one or more specified columns.
ReplaceNulls | Replaces nulls in one or more specified columns.
- class DropInfs(input_domain, metric, columns)#
Bases:
tmlt.core.transformations.base.Transformation
Drops rows containing +inf or -inf in one or more specified columns.
Examples
>>> # Example input
>>> print_sdf(spark_dataframe)
    A    B
0  a1  0.1
1  a2 -inf
2  a3  NaN
3  a4  inf
>>> drop_b_infs = DropInfs(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     columns=["B"],
... )
>>> # Apply transformation to data
>>> output_dataframe = drop_b_infs(spark_dataframe)
>>> print_sdf(output_dataframe)
    A    B
0  a1  0.1
1  a3  NaN
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain
Input metric - SymmetricDifference or IfGroupedBy
Output metric - SymmetricDifference or IfGroupedBy
>>> drop_b_infs.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=False, size=64)})
>>> drop_b_infs.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=False, size=64)})
>>> drop_b_infs.input_metric
SymmetricDifference()
>>> drop_b_infs.output_metric
SymmetricDifference()
- Stability Guarantee:
DropInfs’s stability_function() returns d_in.
>>> drop_b_infs.stability_function(1)
1
>>> drop_b_infs.stability_function(2)
2
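The guarantee above says the output distance is bounded by the input distance. The following plain-Python sketch illustrates why a row-wise filter behaves this way under a symmetric-difference style distance. It models DataFrames as lists of tuples, and the helpers `symmetric_difference` and `drop_infs` are hypothetical names introduced for illustration; this is not the tmlt.core implementation.

```python
import math
from collections import Counter


def symmetric_difference(rows_a, rows_b):
    """Size of the multiset symmetric difference between two lists of rows."""
    a, b = Counter(rows_a), Counter(rows_b)
    return sum(((a - b) + (b - a)).values())


def drop_infs(rows, col):
    """Keep only rows whose value at index `col` is neither +inf nor -inf."""
    return [r for r in rows if not math.isinf(r[col])]


# Two neighboring datasets that differ in exactly two rows.
x = [("a1", 0.1), ("a2", -math.inf), ("a4", math.inf)]
y = [("a1", 0.1), ("a2", -math.inf), ("a5", 2.0)]

d_in = symmetric_difference(x, y)
d_out = symmetric_difference(drop_infs(x, 1), drop_infs(y, 1))
print(d_in, d_out)
```

Because each row is kept or dropped independently of the others, a row that differs between the inputs can contribute at most one differing row to the outputs.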
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.IfGroupedBy]) –
columns (List[str]) –
- __init__(input_domain, metric, columns)#
Constructor.
- Parameters
input_domain (SparkDataFrameDomain) – Domain of the input Spark DataFrames.
metric (Union[SymmetricDifference, IfGroupedBy]) – Distance metric for the input and output Spark DataFrames. If the metric is IfGroupedBy, its inner metric must be SymmetricDifference.
columns (List[str]) – Columns to drop +inf and -inf from.
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type
- __call__(sdf)#
Drops rows containing +inf or -inf in self.columns.
- Parameters
sdf (pyspark.sql.DataFrame) –
- Return type
- property input_domain#
Return input domain for the transformation.
- Return type
- property input_metric#
Distance metric on input domain.
- Return type
- property output_domain#
Return output domain for the transformation.
- Return type
- property output_metric#
Distance metric on output domain.
- Return type
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
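One common way to relate stability_relation to stability_function, shown here as an assumption rather than as tmlt.core's code, is to accept a pair (d_in, d_out) whenever d_out is at least the bound the stability function returns for d_in. The two module-level functions below are hypothetical stand-ins for the methods of a 1-stable transformation.

```python
def stability_function(d_in):
    """Sketch of a 1-stable transformation: the output bound equals d_in."""
    return d_in


def stability_relation(d_in, d_out):
    """True when d_out is at least the bound given by stability_function."""
    return stability_function(d_in) <= d_out


ok = stability_relation(1, 1)       # d_out matches the bound
too_small = stability_relation(2, 1)  # d_out is below the bound
print(ok, too_small)
```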
- __or__(other: Transformation) Transformation #
- __or__(other: tmlt.core.measurements.base.Measurement) tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
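Chaining with `|` feeds the output of one component into the next. The sketch below mimics that operator with a hypothetical `Step` class in plain Python. The class and the two lambda-based steps are illustrative assumptions, not tmlt.core types, and any domain or metric compatibility checks the library performs are not reproduced here.

```python
import math


class Step:
    """Hypothetical stand-in for a transformation: wraps a function on row lists."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __or__(self, other):
        # `self | other` applies self first, then other.
        return Step(lambda rows: other(self(rows)))


drop_infs = Step(lambda rows: [r for r in rows if not math.isinf(r[1])])
drop_nans = Step(lambda rows: [r for r in rows if not math.isnan(r[1])])

pipeline = drop_infs | drop_nans
rows = [("a1", 0.1), ("a2", -math.inf), ("a3", math.nan), ("a4", math.inf)]
result = pipeline(rows)
print(result)
```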
- class DropNaNs(input_domain, metric, columns)#
Bases:
tmlt.core.transformations.base.Transformation
Drops rows containing NaNs in one or more specified columns.
Examples
>>> # Example input
>>> print_sdf(spark_dataframe)
    A    B
0  a1  0.1
1  a2  1.1
2  a3  NaN
3  a4  inf
>>> drop_b_nans = DropNaNs(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     columns=["B"],
... )
>>> # Apply transformation to data
>>> output_dataframe = drop_b_nans(spark_dataframe)
>>> print_sdf(output_dataframe)
    A    B
0  a1  0.1
1  a2  1.1
2  a4  inf
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain
Input metric - SymmetricDifference or IfGroupedBy
Output metric - SymmetricDifference or IfGroupedBy
>>> drop_b_nans.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=False, size=64)})
>>> drop_b_nans.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=False, allow_inf=True, allow_null=False, size=64)})
>>> drop_b_nans.input_metric
SymmetricDifference()
>>> drop_b_nans.output_metric
SymmetricDifference()
- Stability Guarantee:
DropNaNs’s stability_function() returns d_in.
>>> drop_b_nans.stability_function(1)
1
>>> drop_b_nans.stability_function(2)
2
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.IfGroupedBy]) –
columns (List[str]) –
- __init__(input_domain, metric, columns)#
Constructor.
- Parameters
input_domain (SparkDataFrameDomain) – Domain of the input Spark DataFrames.
metric (Union[SymmetricDifference, IfGroupedBy]) – Distance metric for the input and output Spark DataFrames. If the metric is IfGroupedBy, its inner metric must be SumOf or RootSumOfSquared over SymmetricDifference.
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type
- __call__(sdf)#
Drops rows containing NaNs in self.columns.
- Parameters
sdf (pyspark.sql.DataFrame) –
- Return type
- property input_domain#
Return input domain for the transformation.
- Return type
- property input_metric#
Distance metric on input domain.
- Return type
- property output_domain#
Return output domain for the transformation.
- Return type
- property output_metric#
Distance metric on output domain.
- Return type
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
- __or__(other: Transformation) Transformation #
- __or__(other: tmlt.core.measurements.base.Measurement) tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
- class DropNulls(input_domain, metric, columns)#
Bases:
tmlt.core.transformations.base.Transformation
Drops rows containing nulls in one or more specified columns.
Examples
>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| NaN|
|  a1| 0.1|
|  a2|null|
+----+----+
>>> drop_b_nulls = DropNulls(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(allow_null=True),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_null=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     columns=["B"],
... )
>>> # Apply transformation to data
>>> output_dataframe = drop_b_nulls(spark_dataframe)
>>> output_dataframe.sort("A").show()
+----+---+
|   A|  B|
+----+---+
|null|NaN|
|  a1|0.1|
+----+---+
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain
Input metric - SymmetricDifference or IfGroupedBy
Output metric - SymmetricDifference or IfGroupedBy
>>> drop_b_nulls.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=True, size=64)})
>>> drop_b_nulls.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=False, size=64)})
>>> drop_b_nulls.input_metric
SymmetricDifference()
>>> drop_b_nulls.output_metric
SymmetricDifference()
- Stability Guarantee:
DropNulls’s stability_function() returns d_in.
>>> drop_b_nulls.stability_function(1)
1
>>> drop_b_nulls.stability_function(2)
2
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.IfGroupedBy]) –
columns (List[str]) –
- __init__(input_domain, metric, columns)#
Constructor.
- Parameters
input_domain (SparkDataFrameDomain) – Domain of the input Spark DataFrames.
metric (Union[SymmetricDifference, IfGroupedBy]) – Distance metric for the input and output Spark DataFrames. If the metric is IfGroupedBy, its inner metric must be SumOf or RootSumOfSquared over SymmetricDifference.
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type
- __call__(sdf)#
Drops rows containing nulls in self.columns.
- Parameters
sdf (pyspark.sql.DataFrame) –
- Return type
- property input_domain#
Return input domain for the transformation.
- Return type
- property input_metric#
Distance metric on input domain.
- Return type
- property output_domain#
Return output domain for the transformation.
- Return type
- property output_metric#
Distance metric on output domain.
- Return type
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
- __or__(other: Transformation) Transformation #
- __or__(other: tmlt.core.measurements.base.Measurement) tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
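The three Drop classes above differ only in which special value they test for. As a rough plain-Python analogy (Spark nulls are modeled as `None`, and the helper names are hypothetical), the three predicates are independent of one another, so dropping one kind of value leaves the other two in place:

```python
import math


def is_null(v):
    """None stands in for a Spark null in this sketch."""
    return v is None


def is_nan(v):
    return isinstance(v, float) and math.isnan(v)


def is_inf(v):
    return isinstance(v, float) and math.isinf(v)


values = [0.1, None, math.nan, math.inf, -math.inf]

without_nulls = [v for v in values if not is_null(v)]
without_nans = [v for v in values if not is_nan(v)]
without_infs = [v for v in values if not is_inf(v)]

print(len(without_nulls), len(without_nans), len(without_infs))
```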
- class ReplaceInfs(input_domain, metric, replace_map)#
Bases:
tmlt.core.transformations.base.Transformation
Replaces +inf and -inf in one or more specified columns.
Examples
>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+---------+
|   A|        B|
+----+---------+
|null|      NaN|
|  a1| Infinity|
|  a2|     null|
|  a3|-Infinity|
+----+---------+
>>> replace_infs = ReplaceInfs(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(allow_null=True),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_null=True, allow_inf=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     replace_map={"B": (-100.0, 100.0)},
... )
>>> # Apply transformation to data
>>> output_dataframe = replace_infs(spark_dataframe)
>>> output_dataframe.sort("A").show()
+----+------+
|   A|     B|
+----+------+
|null|   NaN|
|  a1| 100.0|
|  a2|  null|
|  a3|-100.0|
+----+------+
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain
Input metric - SymmetricDifference, HammingDistance, or IfGroupedBy
Output metric - SymmetricDifference, HammingDistance, or IfGroupedBy
>>> replace_infs.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=True, allow_null=True, size=64)})
>>> replace_infs.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=True, size=64)})
>>> replace_infs.input_metric
SymmetricDifference()
>>> replace_infs.output_metric
SymmetricDifference()
- Stability Guarantee:
ReplaceInfs’s stability_function() returns d_in.
>>> replace_infs.stability_function(1)
1
>>> replace_infs.stability_function(2)
2
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) –
- __init__(input_domain, metric, replace_map)#
Constructor.
- Parameters
input_domain (SparkDataFrameDomain) – Domain of the input Spark DataFrames.
metric (Union[SymmetricDifference, HammingDistance, IfGroupedBy]) – Distance metric for the input and output Spark DataFrames.
replace_map (Dict[str, Tuple[float, float]]) – Dictionary mapping column names to a tuple. The first value in the tuple is used to replace -inf in that column, and the second value is used to replace +inf in that column.
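The tuple ordering in replace_map is easy to get backwards, so here is a plain-Python sketch of the semantics described above: the first tuple element replaces -inf and the second replaces +inf. Rows are modeled as dictionaries, `None` stands in for a Spark null, and `replace_infs` is a hypothetical helper rather than the library's implementation.

```python
import math


def replace_infs(rows, replace_map):
    """Replace -inf/+inf per column; NaN, None, and finite values pass through."""
    out = []
    for row in rows:
        new_row = dict(row)
        for col, (neg_value, pos_value) in replace_map.items():
            v = new_row[col]
            if isinstance(v, float) and math.isinf(v):
                # First tuple element is for -inf, second is for +inf.
                new_row[col] = neg_value if v < 0 else pos_value
        out.append(new_row)
    return out


rows = [
    {"A": "a1", "B": math.inf},
    {"A": "a2", "B": None},
    {"A": "a3", "B": -math.inf},
]
replaced = replace_infs(rows, {"B": (-100.0, 100.0)})
print(replaced)
```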
- property replace_map#
Returns mapping used to replace infinite values.
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type
- __call__(sdf)#
Returns DataFrame with +inf and -inf replaced in specified columns.
- Parameters
sdf (pyspark.sql.DataFrame) –
- Return type
- property input_domain#
Return input domain for the transformation.
- Return type
- property input_metric#
Distance metric on input domain.
- Return type
- property output_domain#
Return output domain for the transformation.
- Return type
- property output_metric#
Distance metric on output domain.
- Return type
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
- __or__(other: Transformation) Transformation #
- __or__(other: tmlt.core.measurements.base.Measurement) tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
- class ReplaceNaNs(input_domain, metric, replace_map)#
Bases:
tmlt.core.transformations.base.Transformation
Replaces NaNs in one or more specified columns.
Examples
>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| NaN|
|  a1| 0.1|
|  a2|null|
+----+----+
>>> replace_nans = ReplaceNaNs(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(allow_null=True),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_null=True),
...         }
...     ),
...     metric=SymmetricDifference(),
...     replace_map={"B": 0.0},
... )
>>> # Apply transformation to data
>>> output_dataframe = replace_nans(spark_dataframe)
>>> output_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| 0.0|
|  a1| 0.1|
|  a2|null|
+----+----+
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain
Input metric - SymmetricDifference, HammingDistance, or IfGroupedBy
Output metric - SymmetricDifference, HammingDistance, or IfGroupedBy
>>> replace_nans.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=True, size=64)})
>>> replace_nans.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=False, allow_inf=False, allow_null=True, size=64)})
>>> replace_nans.input_metric
SymmetricDifference()
>>> replace_nans.output_metric
SymmetricDifference()
- Stability Guarantee:
ReplaceNaNs’s stability_function() returns d_in.
>>> replace_nans.stability_function(1)
1
>>> replace_nans.stability_function(2)
2
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) –
replace_map (Dict[str, Any]) –
- __init__(input_domain, metric, replace_map)#
Constructor.
- Parameters
input_domain (SparkDataFrameDomain) – Domain of the input Spark DataFrames.
metric (Union[SymmetricDifference, HammingDistance, IfGroupedBy]) – Distance metric for the input and output Spark DataFrames.
replace_map (Dict[str, Any]) – Dictionary mapping each column name to the value used to replace NaNs in that column.
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type
- __call__(sdf)#
Returns DataFrame with NaNs replaced in specified columns.
- Parameters
sdf (pyspark.sql.DataFrame) –
- Return type
- property input_domain#
Return input domain for the transformation.
- Return type
- property input_metric#
Distance metric on input domain.
- Return type
- property output_domain#
Return output domain for the transformation.
- Return type
- property output_metric#
Distance metric on output domain.
- Return type
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
- __or__(other: Transformation) Transformation #
- __or__(other: tmlt.core.measurements.base.Measurement) tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
- class ReplaceNulls(input_domain, metric, replace_map)#
Bases:
tmlt.core.transformations.base.Transformation
Replaces nulls in one or more specified columns.
Examples
>>> # Example input
>>> spark_dataframe.sort("A").show()
+----+----+
|   A|   B|
+----+----+
|null| NaN|
|  a1| 0.1|
|  a2|null|
+----+----+
>>> replace_nulls = ReplaceNulls(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(allow_null=True),
...             "B": SparkFloatColumnDescriptor(allow_nan=True, allow_null=True),
...         }
...     ),
...     metric=HammingDistance(),
...     replace_map={"A": "a0", "B": 0.0},
... )
>>> # Apply transformation to data
>>> output_dataframe = replace_nulls(spark_dataframe)
>>> output_dataframe.sort("A").show()
+---+---+
|  A|  B|
+---+---+
| a0|NaN|
| a1|0.1|
| a2|0.0|
+---+---+
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain
Input metric - SymmetricDifference, HammingDistance, or IfGroupedBy
Output metric - SymmetricDifference, HammingDistance, or IfGroupedBy
>>> replace_nulls.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=True), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=True, size=64)})
>>> replace_nulls.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=False, size=64)})
>>> replace_nulls.input_metric
HammingDistance()
>>> replace_nulls.output_metric
HammingDistance()
- Stability Guarantee:
ReplaceNulls’s stability_function() returns d_in.
>>> replace_nulls.stability_function(1)
1
>>> replace_nulls.stability_function(2)
2
- Parameters
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.HammingDistance, tmlt.core.metrics.IfGroupedBy]) –
replace_map (Dict[str, Any]) –
- __init__(input_domain, metric, replace_map)#
Constructor.
- Parameters
input_domain (SparkDataFrameDomain) – Domain of the input Spark DataFrames.
metric (Union[SymmetricDifference, HammingDistance, IfGroupedBy]) – Distance metric for the input and output Spark DataFrames.
replace_map (Dict[str, Any]) – Dictionary mapping each column name to the value used to replace nulls in that column.
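As the example above shows, replacing nulls leaves NaNs untouched. A plain-Python sketch of that per-column behavior follows, with `None` standing in for a Spark null and `replace_nulls` as a hypothetical helper, not the library's implementation.

```python
import math


def replace_nulls(rows, replace_map):
    """Replace None per column using replace_map; NaN is left as-is."""
    return [
        {
            col: (replace_map[col] if value is None and col in replace_map else value)
            for col, value in row.items()
        }
        for row in rows
    ]


rows = [
    {"A": None, "B": math.nan},
    {"A": "a1", "B": 0.1},
    {"A": "a2", "B": None},
]
filled = replace_nulls(rows, {"A": "a0", "B": 0.0})
print(filled)
```

Columns that are absent from replace_map keep their nulls in this sketch.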
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type
- __call__(sdf)#
Returns DataFrame with nulls replaced in specified columns.
- Parameters
sdf (pyspark.sql.DataFrame) –
- Return type
- property input_domain#
Return input domain for the transformation.
- Return type
- property input_metric#
Distance metric on input domain.
- Return type
- property output_domain#
Return output domain for the transformation.
- Return type
- property output_metric#
Distance metric on output domain.
- Return type
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
- __or__(other: Transformation) Transformation #
- __or__(other: tmlt.core.measurements.base.Measurement) tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.