NaNs, nulls, and infs#

This page describes how Tumult Core handles NaNs, nulls, and infs.

Preliminaries#

SparkDataFrameDomains are constructed by specifying column constraints using SparkColumnDescriptors that describe the data type as well as some metadata about what special values are permitted on each column. In particular, all SparkColumnDescriptors allow specifying a flag allow_null to indicate if null (None) values are permitted in a column; additionally, SparkFloatColumnDescriptor allows specifying allow_inf and allow_nan to indicate if a column with floating point values can contain (+/-)inf or NaN respectively.
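
For example, the sketch below constructs a domain for a DataFrame with a nullable string column and a float column that permits nulls and NaNs but not infs. The descriptor flags are the ones named above; import paths may vary slightly across versions.

from tmlt.core.domains.spark_domains import (
    SparkDataFrameDomain,
    SparkFloatColumnDescriptor,
    SparkStringColumnDescriptor,
)

# "A" may contain nulls; "B" may contain nulls and NaNs, but not infs.
domain = SparkDataFrameDomain(
    {
        "A": SparkStringColumnDescriptor(allow_null=True),
        "B": SparkFloatColumnDescriptor(
            allow_null=True, allow_nan=True, allow_inf=False
        ),
    }
)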

Tumult Core also supports transformations and measurements on Pandas DataFrames and Series using pandas_domains and numpy_domains. PandasDataFrameDomains are constructed by specifying a PandasSeriesDomain for each column, which in turn is specified by a NumpyDomain. Unlike SparkIntegerColumnDescriptor and SparkFloatColumnDescriptor, NumpyIntegerDomain and NumpyFloatDomain do not permit null values. Currently, Pandas domains are only used for quantile measurements (discussed below).
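
For instance, a Series domain over float data that disallows NaNs (as the quantile measurement discussed below requires) might be built as follows; this assumes NumpyFloatDomain exposes an allow_nan flag, consistent with the quantile requirements described later on this page.

from tmlt.core.domains.numpy_domains import NumpyFloatDomain
from tmlt.core.domains.pandas_domains import PandasSeriesDomain

# Float Series domain with NaNs disallowed.
series_domain = PandasSeriesDomain(NumpyFloatDomain(allow_nan=False))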

Comparison Operators#

This section summarizes the behavior of comparison operators in Spark when one or both of the operands are special (null, NaN, or inf). We will reference these operators to explain how our components handle these values.

Nulls#

Comparisons (using <, >, =, <=, >=) between a null value and any other value evaluate to null. Spark's null-safe equality operator <=> allows safely comparing potentially null values: X <=> Y evaluates to True if X and Y are both non-null values and X = Y, or if X and Y are both null.

Note

Python’s == operator is equivalent to = in Spark. For example, df1 and df2 are equivalent below:

from pyspark.sql.functions import col

df1 = dataframe.filter("D = E")
df2 = dataframe.filter(col("D") == col("E"))

The null-safe equality operator <=> corresponds to the eqNullSafe method. Concretely, df3 and df4 are equivalent below:

df3 = dataframe.filter("D <=> E")
df4 = dataframe.filter(col("D").eqNullSafe(col("E")))

NaNs and infs#

  1. inf = inf evaluates to True. Consequently, inf <= inf and inf >= inf also evaluate to True.

  2. NaN = NaN evaluates to True (unlike standard floating-point implementations, including Python's). For any non-null numeric value X (including inf), NaN > X evaluates to True and X > NaN evaluates to False. NaN = X evaluates to False for all non-null values X except NaN.

  3. inf > X evaluates to True for all non-null numeric values X (except inf and NaN). -inf < X evaluates to True for all non-null numeric values X (except -inf).
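
These semantics are easy to verify directly in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float("nan"), float("inf"))], ["X", "Y"])

# NaN compares greater than every other non-null value, including inf.
df.filter("X > Y").count()  # 1

# inf = inf holds, so inf >= inf does too.
df.filter("Y >= Y").count()  # 1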

Filter#

A Filter transformation can be constructed with a SQL filter expression that may refer to one or more columns in the input domain and contain comparison operators (<, <=, >, >=, =, <=>) and logical operators (AND, OR, NOT).

The following table describes how the logical operators behave when one or both values are null (note that AND and OR are commutative):

Logical Operators and NULLs#

X    | Y     | X AND Y | X OR Y | NOT X
-----|-------|---------|--------|------
NULL | True  | NULL    | True   | NULL
NULL | False | False   | NULL   | NULL
NULL | NULL  | NULL    | NULL   | NULL

Comparisons between two columns work according to the comparison semantics described above. The following expressions demonstrate how a column can be compared against a special literal value:

  • "X = 'INF'" evaluates to True only if X is inf

  • "X = '-INF'" evaluates to True only if X is -inf

  • "X = 'NaN'" evaluates to True only if X is NaN

  • "X <=> NULL" evaluates to True only if X is null

Note

Since X = NULL evaluates to NULL for any value of X, using the filter expression "NOT X = NULL" results in an empty DataFrame. In order to filter out rows where X is null, filtering with "NOT X <=> NULL" would work; however, DropNulls is better suited for this since it also modifies the domain to indicate that nulls are absent from column X in the output.
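
The difference is easy to observe directly:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (None,)], ["X"])

df.filter("NOT X = NULL").count()    # 0: X = NULL is NULL for every row, so NOT discards them all
df.filter("NOT X <=> NULL").count()  # 1: keeps the rows where X is not null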

PartitionByKeys#

For a PartitionByKeys transformation, the list_values corresponding to partition keys can contain inf, NaN, or null. A partition corresponding to a particular key is obtained by comparing row values in the key columns with the key values.
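
For illustration, a set of partition keys over two hypothetical key columns ["A", "B"] might look like the following; only the list_values argument named above is shown, with the transformation's other constructor arguments omitted.

# Each tuple is one partition key for key columns ["A", "B"]; special
# values are legal key components.
list_values = [
    ("x", float("inf")),
    ("y", float("nan")),
    (None, 0.0),
]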

Quantile#

A NoisyQuantile measurement requires a PandasSeriesDomain (over NumpyIntegerDomain or NumpyFloatDomain) as its input domain. Additionally, if the input domain is a PandasSeriesDomain over NumpyFloatDomain, it should also disallow NaNs.

When constructing quantile measurements that work on SparkDataFrameDomains (with create_quantile_measurement(), for example), the input domain must disallow nulls and NaNs on the measure column. More generally, ApplyInPandas does not support aggregation_functions that operate on nullable numeric columns.

GroupBy#

For a GroupBy transformation, a group key can contain a null only if the input domain permits nulls in the corresponding SparkColumnDescriptor. A group key containing a null (or a key that is null, when grouping by a single column) is treated like any other value; that is, all rows with this key are grouped together. Since GroupBy does not permit grouping on SparkFloatColumnDescriptor columns, group keys cannot contain NaNs or infs.
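
This matches Spark's own grouping behavior, where null keys form a single group:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(None, 1), (None, 2), ("a", 3)], ["k", "v"])

# The two null-keyed rows land in one group with count 2.
df.groupBy("k").count().show()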

Joins#

Both PrivateJoin and PublicJoin use the = semantics described above by default. Consequently, all null values on the join columns are dropped. In order to join on nulls, construct the transformation with join_on_nulls=True to use the <=> semantics.
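
The underlying Spark behavior looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([("a",), (None,)], ["k"])
right = spark.createDataFrame([("a",), (None,)], ["k"])

left.join(right, left.k == right.k).count()           # 1: null keys never match
left.join(right, left.k.eqNullSafe(right.k)).count()  # 2: nulls join to nulls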

Removing NaNs, nulls, and infs#

Tumult Core provides transformations to drop or replace NaNs, nulls, and infs. In particular, ReplaceNulls, ReplaceNaNs, and ReplaceInfs allow replacing these values on one or more columns; DropNulls, DropNaNs, and DropInfs allow dropping rows containing these values in one or more columns.
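
As a minimal sketch, reusing the domain built in the Preliminaries example: this assumes the drop transformations take the input domain, input metric, and the affected columns, and that the replace transformations take a map from column to replacement value; check the tmlt.core API reference for the exact module paths and signatures.

from tmlt.core.metrics import SymmetricDifference
from tmlt.core.transformations.spark_transformations.nan import (
    DropNulls,
    ReplaceNulls,
)

# Assumed signatures: drop rows with nulls in "B", or replace those nulls
# with 0.0. Both also update the output domain to mark "B" as non-null.
drop_nulls = DropNulls(domain, SymmetricDifference(), ["B"])
replace_nulls = ReplaceNulls(domain, SymmetricDifference(), {"B": 0.0})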

Sum and SumGrouped#

Sum and SumGrouped aggregations require that NaNs and nulls be disallowed on the measure column. Consequently, derived measurements (requiring sums) like create_average_measurement(), create_standard_deviation_measurement(), and create_variance_measurement() also require that the measure column disallow NaNs and nulls.

Values of +inf and -inf are clipped to the lower and upper clipping bounds specified on these aggregations.
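
In terms of the column descriptors from the Preliminaries, a valid float measure column for these aggregations would look like:

from tmlt.core.domains.spark_domains import SparkFloatColumnDescriptor

# Nulls and NaNs must be disallowed; infs may be permitted, since they
# are clipped to the aggregation's bounds.
measure_column = SparkFloatColumnDescriptor(
    allow_null=False, allow_nan=False, allow_inf=True
)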

CountDistinct#

CountDistinct uses the <=> semantics described above. For example, the following rows are considered identical by this transformation:

  • (NULL, NaN, Inf)

  • (NULL, NaN, Inf)
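
Plain Spark exhibits the same grouping behavior, since DISTINCT treats nulls as equal to nulls and NaNs as equal to NaNs:

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
schema = StructType(
    [StructField(c, DoubleType()) for c in ["A", "B", "C"]]
)
rows = [(None, float("nan"), float("inf"))] * 2
df = spark.createDataFrame(rows, schema)

df.distinct().count()  # 1: the two rows are considered identical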