NaNs, nulls, and infs#
This page describes how Tumult Core handles NaNs, nulls, and infs.
Preliminaries#
SparkDataFrameDomains are constructed by specifying column constraints using SparkColumnDescriptors that describe the data type as well as some metadata about what special values are permitted on each column. In particular, all SparkColumnDescriptors allow specifying a flag allow_null to indicate if null (None) values are permitted in a column; additionally, SparkFloatColumnDescriptor allows specifying allow_inf and allow_nan to indicate if a column with floating point values can contain (+/-)inf or NaN respectively.
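For illustration, a domain for a DataFrame with a non-nullable string column and a float column that may contain nulls, NaNs, and infs might be constructed as sketched below (the import path is an assumption; the allow_* flags are the ones described above):
# Sketch: a SparkDataFrameDomain where column "A" forbids nulls and column "B"
# permits nulls, NaNs, and infs. The module path is assumed.
from tmlt.core.domains.spark_domains import (
    SparkDataFrameDomain,
    SparkFloatColumnDescriptor,
    SparkStringColumnDescriptor,
)

domain = SparkDataFrameDomain(
    {
        "A": SparkStringColumnDescriptor(allow_null=False),
        "B": SparkFloatColumnDescriptor(allow_null=True, allow_nan=True, allow_inf=True),
    }
)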
Tumult Core also supports transformations and measurements on Pandas DataFrames and Series using pandas_domains and numpy_domains. PandasDataFrameDomains are constructed by specifying a PandasSeriesDomain for each column, which in turn is specified by a NumpyDomain. Unlike SparkIntegerColumnDescriptor and SparkFloatColumnDescriptor, NumpyIntegerDomain and NumpyFloatDomain do not permit null values. Currently, Pandas domains are only used for quantiles (discussed below).
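A corresponding sketch for the Pandas side, building a series domain over floats that disallows NaNs (the NumpyFloatDomain arguments shown are assumptions):
# Sketch: a PandasSeriesDomain over floats with NaNs and infs disallowed;
# the argument names are assumptions.
from tmlt.core.domains.numpy_domains import NumpyFloatDomain
from tmlt.core.domains.pandas_domains import PandasSeriesDomain

series_domain = PandasSeriesDomain(NumpyFloatDomain(allow_nan=False, allow_inf=False))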
Comparison Operators#
This section summarizes the behavior of comparison operators in Spark when one or both of the operands are special (null, NaN or inf). We will reference these operators to explain how our components handle these values.
Nulls#
Comparisons (using <, >, =, <=, >=) between a null value and any other value evaluate to null. Spark's null-safe equality operator <=> allows safely comparing potentially null values, such that X <=> Y evaluates to True if X and Y are both non-null values and X = Y, or if X and Y are both nulls.
Note
Python's == operator is equivalent to = in Spark. For example, df1 and df2 are equivalent below (col is pyspark.sql.functions.col):
df1 = dataframe.filter("D = E")
df2 = dataframe.filter(col("D") == col("E"))
The null-safe equality operator <=> corresponds to the eqNullSafe method. Concretely, df3 and df4 are equivalent below:
df3 = dataframe.filter("D <=> E")
df4 = dataframe.filter(col("D").eqNullSafe(col("E")))
NaNs and infs#
- inf = inf evaluates to True. Consequently, inf <= inf and inf >= inf also evaluate to True.
- NaN = NaN evaluates to True (unlike standard floating point implementations, including Python's).
- For any non-null numeric value X other than NaN (including inf), NaN > X evaluates to True and X > NaN evaluates to False.
- NaN = X evaluates to False for all non-null values except NaN.
- inf > X evaluates to True for all non-null numeric values (except inf and NaN).
- -inf < X evaluates to True for all non-null numeric values (except -inf).
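These rules can be checked directly in plain PySpark (this sketch is not a Tumult Core component):
# Demonstrates Spark's equality and ordering semantics for NaN, inf, and null.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (float("inf"),), (float("nan"),), (None,)], ["X"])
df.filter("X = 'NaN'").show()  # only the NaN row: NaN = NaN is True in Spark
df.filter("X > 1.0").show()    # the inf and NaN rows: NaN sorts above all other numeric values
df.filter("X = X").show()      # every non-null row; the null row evaluates to null and is dropped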
Filter#
A Filter transformation can be constructed with a SQL filter expression that may refer to one or more columns in the input domain and contain comparison operators (<, <=, >, >=, =, <=>) and logical operators (AND, OR, NOT).
The following table describes how the logical operators behave when one or both values are null (note that AND and OR are commutative):
| X    | Y     | X AND Y | X OR Y | NOT X |
|------|-------|---------|--------|-------|
| NULL | True  | NULL    | True   | NULL  |
| NULL | False | False   | NULL   | NULL  |
| NULL | NULL  | NULL    | NULL   | NULL  |
Comparisons between two columns work according to the comparison semantics described above. The following expressions demonstrate how a column can be compared against a special literal value:
- "X = 'INF'" evaluates to True only if X is inf
- "X = '-INF'" evaluates to True only if X is -inf
- "X = 'NaN'" evaluates to True only if X is NaN
- "X <=> NULL" evaluates to True only if X is null
Note
Since X = NULL evaluates to NULL for any value of X, using the filter expression "NOT X = NULL" results in an empty DataFrame. In order to filter out rows where X is null, filtering with "NOT X <=> NULL" would work; however, DropNulls is better suited for this since it also modifies the domain to indicate that nulls are absent from column X in the output.
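A quick plain-PySpark illustration of this pitfall (Spark's filter drops rows where the predicate evaluates to NULL as well as False):
# "NOT X = NULL" evaluates to NULL on every row, so nothing passes the filter;
# the null-safe form keeps the non-null rows as intended.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (None,)], ["X"])
df.filter("NOT X = NULL").show()    # empty DataFrame
df.filter("NOT X <=> NULL").show()  # keeps only the row where X is 1.0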
PartitionByKeys#
For a PartitionByKeys transformation, the list_values corresponding to partition keys can contain inf, NaN, or null. A partition corresponding to a particular key is obtained by comparing row values in the key columns with the key values.
Quantile#
A NoisyQuantile measurement requires a PandasSeriesDomain (over NumpyIntegerDomain or NumpyFloatDomain) as its input domain. Additionally, if the input domain is a PandasSeriesDomain over NumpyFloatDomain, it should also disallow NaNs.
When constructing quantile measurements that work on SparkDataFrameDomains (with create_quantile_measurement() for example), the input domain must disallow nulls and NaNs on the measure column. More generally, ApplyInPandas does not support aggregation_functions that operate on nullable numeric columns.
GroupBy#
For a GroupBy transformation, a group key can contain a null only if the input domain permits nulls in the corresponding SparkColumnDescriptor.
A group key containing a null (or a key that is itself null, when grouping by a single column) is treated like any other value, i.e. all rows with this key are grouped together.
Since GroupBy does not permit grouping on SparkFloatColumnDescriptor columns, group keys cannot contain NaNs or infs.
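For intuition, plain Spark groups null keys the same way:
# Rows whose group key is null form a single group, just like any other key value.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), (None, 2), (None, 3)], ["key", "val"])
df.groupBy("key").count().show()  # two groups: key "a" with count 1, the null key with count 2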
Joins#
Both PrivateJoin and PublicJoin use the = semantics described above by default. Consequently, all null values on the join columns are dropped. In order to join on nulls, construct the transformation with join_on_nulls=True to use the <=> semantics.
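The difference between the two behaviors corresponds to the following plain-Spark comparison (shown only to illustrate the = versus <=> semantics, not how these transformations are constructed):
# With = semantics, rows whose join key is null never match;
# with null-safe equality (<=>), null keys match null keys.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([("a", 1), (None, 2)], ["k", "x"])
right = spark.createDataFrame([("a", 10), (None, 20)], ["k", "y"])
left.join(right, left["k"] == right["k"]).show()           # only the "a" rows match
left.join(right, left["k"].eqNullSafe(right["k"])).show()  # the null keys also match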
Removing NaNs, nulls, and infs#
Tumult Core provides transformations to drop or replace NaNs, nulls, and infs. In particular, ReplaceNulls, ReplaceNaNs, and ReplaceInfs allow replacing these values on one or more columns; DropNulls, DropNaNs, and DropInfs allow dropping rows containing these values in one or more columns.
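As a sketch only (the module path and keyword names below are assumptions, not verbatim Tumult Core signatures), dropping rows with nulls in a column might look like:
# Sketch: drop rows whose column "B" is null. Constructor argument names
# (input_domain, metric, columns) are assumptions; check the API reference.
from tmlt.core.domains.spark_domains import (
    SparkDataFrameDomain,
    SparkFloatColumnDescriptor,
    SparkStringColumnDescriptor,
)
from tmlt.core.metrics import SymmetricDifference
from tmlt.core.transformations.spark_transformations.nan import DropNulls

domain = SparkDataFrameDomain(
    {
        "A": SparkStringColumnDescriptor(allow_null=False),
        "B": SparkFloatColumnDescriptor(allow_null=True, allow_nan=True),
    }
)
drop_nulls = DropNulls(input_domain=domain, metric=SymmetricDifference(), columns=["B"])
# The output domain marks column "B" as no longer allowing nulls.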
Sum and SumGrouped#
Sum and SumGrouped aggregations require NaNs and nulls to be disallowed in the measure column. Consequently, derived measurements (requiring sums) like create_average_measurement(), create_standard_deviation_measurement(), and create_variance_measurement() also require that the measure column disallow NaNs and nulls.
+/-inf values are correctly clipped to the upper and lower clipping bounds specified on the aggregations.
CountDistinct#
CountDistinct uses the <=> semantics described above. For example, the following rows are considered identical by this transformation:
(NULL, NaN, Inf)
(NULL, NaN, Inf)