NaNs, nulls, and infs#
This page describes how Tumult Core handles NaNs, nulls, and infs.
Preliminaries#
SparkDataFrameDomains are constructed by specifying column constraints using SparkColumnDescriptors, which describe the data type of each column as well as which special values are permitted in it. In particular, all SparkColumnDescriptors allow
specifying a flag allow_null to indicate whether null (None) values are permitted
in a column; additionally, SparkFloatColumnDescriptor allows specifying
allow_inf and allow_nan to indicate whether a column with floating
point values can contain (+/-)inf or NaN, respectively.
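For example, a domain for a DataFrame with a nullable string column and a float column that may contain NaNs (but not nulls or infs) could be constructed roughly as follows; this is a minimal sketch, and the import path is an assumption about Tumult Core's module layout:

```python
from tmlt.core.domains.spark_domains import (
    SparkDataFrameDomain,
    SparkFloatColumnDescriptor,
    SparkStringColumnDescriptor,
)

# "A" may contain nulls; "B" may contain NaNs, but not nulls or infs.
domain = SparkDataFrameDomain(
    {
        "A": SparkStringColumnDescriptor(allow_null=True),
        "B": SparkFloatColumnDescriptor(
            allow_null=False, allow_nan=True, allow_inf=False
        ),
    }
)
```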
Tumult Core also supports transformations and measurements on Pandas DataFrames and Series using pandas_domains and numpy_domains. PandasDataFrameDomains are constructed by specifying a PandasSeriesDomain for each column, which in turn is specified by a NumpyDomain. Unlike SparkIntegerColumnDescriptor and SparkFloatColumnDescriptor, NumpyIntegerDomain and NumpyFloatDomain do not permit null values. Currently, Pandas domains are only used for quantiles (discussed below).
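For illustration, a pandas Series domain over floats might be built as follows; this is a sketch, and both the module paths and the allow_nan flag are assumptions based on the descriptions above:

```python
from tmlt.core.domains.numpy_domains import NumpyFloatDomain
from tmlt.core.domains.pandas_domains import PandasSeriesDomain

# Numpy domains never allow nulls; whether NaNs are allowed is a separate
# flag. A float series domain suitable for quantiles disallows NaNs.
quantile_input_domain = PandasSeriesDomain(NumpyFloatDomain(allow_nan=False))
```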
Comparison Operators#
This section summarizes the behavior of comparison operators in Spark when one or both of the operands are special (null, NaN or inf). We will reference these operators to explain how our components handle these values.
Nulls#
Comparisons (using <, >, =, <=, >=) between a null value and any other value evaluate to null.
Spark’s null-safe equality operator <=> allows safely comparing potentially null values:
X <=> Y evaluates to True if X and Y are both non-null and X = Y, or if X and Y are both null.
Note
Python’s == operator is equivalent to = in Spark. For example, with col imported from pyspark.sql.functions, df1 and df2 below are equivalent:
df1 = dataframe.filter("D = E")
df2 = dataframe.filter(col("D") == col("E"))
The null-safe equality operator <=> corresponds to the eqNullSafe method. Concretely, df3 and df4 below are equivalent:
df3 = dataframe.filter("D <=> E")
df4 = dataframe.filter(col("D").eqNullSafe(col("E")))
NaNs and infs#
- inf = inf evaluates to True. Consequently, inf <= inf and inf >= inf also evaluate to True.
- NaN = NaN evaluates to True (unlike standard floating point implementations, including Python’s).
- For any non-null numeric value X (including inf), NaN > X evaluates to True and X > NaN evaluates to False.
- NaN = X evaluates to False for all non-null values except NaN.
- inf > X evaluates to True for all non-null numeric values (except inf and NaN).
- -inf < X evaluates to True for all non-null numeric values (except -inf).
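These semantics can be checked directly in PySpark; in Spark, NaN compares greater than every other numeric value, including inf:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float("nan"), float("inf"))], ["X", "Y"])

# NaN > inf in Spark, so the row is kept.
df.filter("X > Y").count()  # 1
# NaN = NaN in Spark (unlike Python), so this also keeps the row.
df.filter("X = X").count()  # 1
```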
Filter#
A Filter transformation can be constructed with a SQL filter expression that may refer to one or more columns in the input domain and contain comparison operators (<, <=, >, >=, =, <=>) and logical operators (AND, OR, or NOT).
The following table describes how the logical operators behave when one or both values are null (note that AND and OR are commutative):
| X | Y | X AND Y | X OR Y | NOT X |
|---|---|---|---|---|
| NULL | True | NULL | True | NULL |
| NULL | False | False | NULL | NULL |
| NULL | NULL | NULL | NULL | NULL |
Comparisons between two columns work according to the comparison semantics described above. The following expressions demonstrate how a column can be compared against a special literal value:
"X = 'INF'"evaluates to True only if X isinf"X = '-INF'"evaluates to True only if X is-inf"X = 'NaN'"evaluates to True only if X isNaN"X <=> NULL"evaluates to True only if X isnull
Note
Since X = NULL evaluates to NULL for any value of X, using the filter expression "NOT X = NULL" results in an empty DataFrame. In order to filter out rows where X is null, filtering with "NOT X <=> NULL" would work; however, DropNulls is better suited for this since it also modifies the domain to indicate that nulls are absent from column X in the output.
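For example, the difference between the two filter expressions is visible in plain PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (None,)], "X INT")

# "NOT X = NULL" evaluates to NULL for every row, so no rows are kept.
df.filter("NOT X = NULL").count()    # 0
# The null-safe comparison keeps the row where X is not null.
df.filter("NOT X <=> NULL").count()  # 1
```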
PartitionByKeys#
For a PartitionByKeys transformation, the list_values corresponding to partition keys can contain inf, NaN, or null values. The partition corresponding to a particular key is obtained by comparing row values in the key columns with the key values.
Quantile#
A NoisyQuantile measurement requires a PandasSeriesDomain (over NumpyIntegerDomain or NumpyFloatDomain) as its input domain. Additionally, if the input domain is a PandasSeriesDomain over NumpyFloatDomain, it should also disallow NaNs.
When constructing quantile measurements that work on SparkDataFrameDomains (with create_quantile_measurement() for example), the input domain must disallow nulls and NaNs on the measure column. More generally, ApplyInPandas does not support aggregation_functions that operate on numeric nullable columns.
GroupBy#
For a GroupBy transformation, a group key can contain a null
only if the input domain permits nulls in the corresponding SparkColumnDescriptor.
A group key containing a null (or a key that is itself null, when grouping by a single column) is treated
like any other value: all rows with that key are grouped together.
Since GroupBy does not permit grouping on SparkFloatColumnDescriptor
columns, group keys cannot contain NaNs or infs.
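As an illustration of the same grouping semantics in plain PySpark, rows with a null key form a single group:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), (None, 2), (None, 3)], ["K", "V"])

# The two rows whose key is null end up in one group,
# just like the row whose key is "a".
df.groupBy("K").count().show()
```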
Joins#
Both PrivateJoin and PublicJoin use the = semantics described above by default.
Consequently, rows with null values in the join columns are dropped. To join on nulls, construct the transformation with join_on_nulls=True to use the <=> semantics.
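The two join semantics can be compared in plain PySpark; the null-safe variant is what join_on_nulls=True requests:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([("a", 1), (None, 2)], ["K", "V1"])
right = spark.createDataFrame([("a", 10), (None, 20)], ["K", "V2"])

# Default equality join: null keys never match, so those rows are dropped.
left.join(right, left["K"] == right["K"]).count()           # 1
# Null-safe join: null keys match each other.
left.join(right, left["K"].eqNullSafe(right["K"])).count()  # 2
```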
Removing NaNs, nulls, and infs#
Tumult Core provides transformations to drop or replace NaNs, nulls, and infs. In particular,
ReplaceNulls, ReplaceNaNs, and ReplaceInfs allow replacing these values on one or more columns; DropNulls, DropNaNs, and DropInfs allow dropping rows containing these values in one or more columns.
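A rough sketch of how such a cleanup pipeline might be assembled is shown below; the module paths and constructor arguments (input domain, metric, columns to drop on, and replacement map) are assumptions about Core's API and may not match exactly:

```python
# A sketch, not verbatim API: module paths and argument names are assumptions.
from tmlt.core.domains.spark_domains import (
    SparkDataFrameDomain,
    SparkFloatColumnDescriptor,
)
from tmlt.core.metrics import SymmetricDifference
from tmlt.core.transformations.spark_transformations.nan import (
    DropNulls,
    ReplaceNaNs,
)

domain = SparkDataFrameDomain(
    {"B": SparkFloatColumnDescriptor(allow_null=True, allow_nan=True)}
)
# Drop rows where "B" is null; the output domain then disallows nulls in "B".
drop_nulls = DropNulls(domain, SymmetricDifference(), ["B"])
# Replace NaNs in "B" with 0.0; the output domain then disallows NaNs in "B".
replace_nans = ReplaceNaNs(drop_nulls.output_domain, SymmetricDifference(), {"B": 0.0})
```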
Sum and SumGrouped#
Sum and SumGrouped aggregations require NaNs and nulls to be disallowed
in the measure column. Consequently, derived measurements that require sums, such as create_average_measurement(), create_standard_deviation_measurement(), and create_variance_measurement(), also require that the measure column disallow NaNs and nulls.
(+/-)inf values are correctly clipped to the upper and lower clipping bounds specified
on these aggregations.
CountDistinct#
CountDistinct uses the <=> semantics described above. For example, the following rows are considered identical by this transformation:
- (NULL, NaN, Inf)
- (NULL, NaN, Inf)