# NaNs, nulls, and infs#

This page describes how Tumult Core handles NaNs, nulls, and infs.

## Preliminaries#

`SparkDataFrameDomain`

s are constructed by specifying column constraints using `SparkColumnDescriptor`

s that describe the data type as well as some metadata about what special values are
permitted on each column. In particular, all `SparkColumnDescriptor`

s allow
specifying a flag `allow_null`

to indicate if null (`None`

) values are permitted
in a column; additionally, `SparkFloatColumnDescriptor`

allows specifying
`allow_inf`

and `allow_nan`

to indicate if a column with floating
point values can contain (+/-)`inf`

or `NaN`

respectively.

Tumult Core also supports transformations and measurements on Pandas DataFrames and Series using `pandas_domains`

and `numpy_domains`

. `PandasDataFrameDomain`

s are constructed by specifying a `PandasSeriesDomain`

for each column, which in turn is specified by a `NumpyDomain`

. Unlike `SparkIntegerColumnDescriptor`

and `SparkFloatColumnDescriptor`

, `NumpyIntegerDomain`

and `NumpyFloatDomain`

do not permit `null`

values. Pandas domains are only used for Quantiles (discussed below) currently.

## Comparison Operators#

This section summarizes the behavior of comparison operators in Spark when one or both of the operands are special (null, NaN or inf). We will reference these operators to explain how our components handle these values.

### Nulls#

Comparisons (using `<, >, =, <=, >=`

) between a null value and any other value evaluates to null.
Spark’s null-safe equality operator `<=>`

allows safely comparing potentially null values such that
`X <=> Y`

evaluates to True if `X`

and `Y`

are both non-null values and `X = Y`

, or `X`

and `Y`

are both nulls.

Note

Python’s `==`

operator is equivalent to `=`

in Spark. For example, `df1`

and `df2`

are equivalent below:

```
df1 = dataframe.filter("D = E")
df2 = dataframe.filter(col(D) == col(E))
```

The null-safe equality operator `<=>`

corresponds to `eqNullSafe`

method . Concretely, `df3`

and `df4`

are equivalent below:

```
df3 = dataframe.filter("D <=> E")
df4 = dataframe.filter(col(D).eqNullSafe(col(E)))
```

### NaNs and infs#

`inf = inf`

evaluates to True. Consequently,`inf <= inf`

and`inf >= inf`

also evaluate to True.`NaN = NaN`

evaluates to True (unlike standard floating point implementations including python’s). For any non-null numeric value (including`inf`

)`X`

(incl.`inf`

),`NaN > X`

evaluates to True and`X > NaN`

evaluates to False.`NaN = X`

evaluates to False for all non-null values except`NaN`

.`inf > X`

evaluates to True for all non-null numeric values (except`inf`

and`nan`

).`-inf < X`

evaluates to True for all non-null numeric values (except`-inf`

).

## Filter#

A `Filter`

transformation can be constructed with a SQL filter expression that may refer to one or more columns in the input domain and contain comparison operators (`<, <=, >, >=, =, <=>`

) and logical operators (`AND, OR or NOT`

).

The following table describes how the logical operators behave when one or both values are `null`

(note that `AND`

and `OR`

are commutative):

X |
Y |
X AND Y |
X OR Y |
NOT X |
---|---|---|---|---|

NULL |
True |
NULL |
True |
NULL |

NULL |
False |
False |
NULL |
NULL |

NULL |
NULL |
NULL |
NULL |
NULL |

Comparison between two columns work according to the comparison semantics described above. The following expressions demonstrate how a column can be compared against a special literal value:

`"X = 'INF'"`

evaluates to True only if X is`inf`

`"X = '-INF'"`

evaluates to True only if X is`-inf`

`"X = 'NaN'"`

evaluates to True only if X is`NaN`

`"X <=> NULL"`

evaluates to True only if X is`null`

Note

Since `X = NULL`

evaluates to `NULL`

for any value of `X`

, using the filter expression `"NOT X = NULL"`

results in an empty DataFrame. In order to filter out rows where `X`

is null, filtering with `"NOT X <=> NULL"`

would work; however, `DropNulls`

is better suited for this since it also modifies the domain to indicate that nulls are absent from column `X`

in the output.

## PartitionByKeys#

For a `PartitionByKeys`

transformation, the `list_values`

corresponding to partition keys can contain `Inf`

, `NaN`

or `null`

. A partition corresponding to a particular key is obtained by comparing row values in key columns with the key values.

## Quantile#

A `NoisyQuantile`

measurement requires a `PandasSeriesDomain`

(over `NumpyIntegerDomain`

or `NumpyFloatDomain`

) as its input domain. Additionally, if the input domain is a `PandasSeriesDomain`

over `NumpyFloatDomain`

, it should also disallow `NaN`

s.

When constructing quantile measurements that work on `SparkDataFrameDomain`

s (with `create_quantile_measurement()`

for example), the input domain must disallow `nulls`

and `NaNs`

on the measure column. More generally, `ApplyInPandas`

does not support `aggregation_function`

s that operate on numeric nullable columns.

## GroupBy#

For a `GroupBy`

transformation, a group key can contain a null
only if the input domain permits nulls in the corresponding `SparkColumnDescriptor`

.
A group key containing a null (or one that is a null – when grouping by a single column) is treated
like any other value - i.e. all rows with this key are grouped together.
Since `GroupBy`

does not permit grouping on `SparkFloatColumnDescriptor`

columns, group keys cannot contain NaNs or infs.

## Joins#

Both `PrivateJoin`

and `PublicJoin`

use the `=`

semantics described above by default.
Consequently, all null values on the join columns are dropped. In order to join on nulls, construct the transformation with `join_on_nulls=True`

to use the `<=>`

semantics.

## Removing NaNs, nulls, and infs#

Tumult Core provides transformations to drop or replace NaNs, nulls, and infs. In particular,
`ReplaceNulls`

, `ReplaceNaNs`

, and `ReplaceInfs`

allow replacing these values on one or more columns; `DropNulls`

, `DropNaNs`

, and `DropInfs`

allow dropping rows containing these values in one or more columns.

## Sum and SumGrouped#

`Sum`

and `SumGrouped`

aggregations require NaNs and nulls to be disallowed
from the measure column. Consequently, derived measurements (requiring sums) like `create_average_measurement()`

, `create_standard_deviation_measurement()`

and `create_variance_measurement()`

also require that the measure column disallow NaNs and nulls.

+/- `inf`

values are correctly clipped to the upper and lower clipping bounds specified
on the aggregations.

## CountDistinct#

`CountDistinct`

uses the `<=>`

semantics described above . For example, the following rows are considered identical by this transformation:

`(NULL, NaN, Inf)`

`(NULL, NaN, Inf)`