Nulls, NaNs, and infinite values#
This topic guide details how null values, NaN values, and infinite values (positive and negative infinity) are handled in Tumult Analytics.
Constructing Sessions with Null, NaN, and infinite values#
By default, Tumult Analytics allows you to use DataFrames that contain null values, NaNs, or infinite values. Just pass in a DataFrame that contains those values:
session_with_nulls = Session.from_dataframe(
privacy_budget=PureDPBudget(2),
source_id="my_private_data",
dataframe=spark_df,
protected_change=AddOneRow(),
)
spark_df.show()
+-----+---+--------+
| name|age| grade|
+-----+---+--------+
| null| 20| 4.0|
| bob| 30| NaN|
|carol| 40|Infinity|
+-----+---+--------+
Null, NaN, and infinity values are also allowed in public DataFrames of a session.
Null, NaN, and infinite values in maps and flat maps#
By default, all columns created by
tmlt.analytics.query_builder.QueryBuilder.map()
or
tmlt.analytics.query_builder.QueryBuilder.flat_map()
are assumed to contain
null values. If those columns are DECIMAL columns, they are also
assumed to potentially contain NaN or infinite values.
query = QueryBuilder("private").map(
f=lambda row: {"new": row["B"]*1.5},
new_column_types={"new": "DECIMAL"},
augment=True,
).query_expr
print(query.schema_new_columns)
Schema({'new': ColumnDescriptor(column_type=ColumnType.DECIMAL,
allow_null=True,
allow_nan=True,
allow_inf=True)})
If you pass in a full tmlt.analytics.query_builder.ColumnDescriptor
, then you can specify whether new
columns can contain null, NaN, or infinite values:
from tmlt.analytics.query_builder import ColumnDescriptor, ColumnType
new_column_types = {'new': ColumnDescriptor(
column_type=ColumnType.DECIMAL,
allow_null=False,
allow_nan=False,
allow_inf=False,
)}
query = QueryBuilder("private").map(
f=lambda row: {"new": row["B"]*1.5},
new_column_types=new_column_types,
augment=True,
).query_expr
print(query.schema_new_columns)
Schema({'new': ColumnDescriptor(column_type=ColumnType.DECIMAL,
allow_null=False,
allow_nan=False,
allow_inf=False)})
If you do this, it is your responsibility to ensure that the mapping function does not create null, NaN, or infinite values. Tumult Analytics’ will raise an error if there are null values in columns marked as allow_null=False, NaN values in columns marked as allow_nan=False, or infinite values in columns marked as allow_inf=False.
Special case: null values in grouping columns#
Tumult Analytics does not allow you to replace null values in
a flat map grouping column
,
because this could violate Tumult Analytics’ stability guarantee.
If your flat map transformation could create null values, you cannot replace
them later.
Null, NaN, and infinite values and aggregations#
Analytics automatically transforms your data when you perform a numerical aggregation - a sum, variance, average, standard deviation, or quantile - on columns that contain null, NaN, or infinite values. This section explains how Analytics handles aggregations when data contains null, NaN, or infinite values.
tmlt.analytics.query_builder.QueryBuilder.count()
and
tmlt.analytics.query_builder.QueryBuilder.count_distinct()
do not have special behavior for rows containing nulls, NaNs, or infinite values.
Rows with those values are counted the same as rows without any of those values.
Null and NaN values in aggregations#
By default, all rows that contain a null or NaN value in the measure_column are dropped immediately before aggregation. The following example uses a Session with an infinite budget to demonstrate this:
session = Session.from_dataframe(
privacy_budget=PureDPBudget(float("inf")),
source_id="checkouts",
dataframe=private_data,
protected_change=AddOneRow(),
)
private_data.show()
+-----+-------+-----------+
| name| genre|checked_out|
+-----+-------+-----------+
|Ambar|Unknown| null|
|Tessa|Unknown| 3|
+-----+-------+-----------+
query = QueryBuilder("checkouts").groupby(
KeySet.from_dict({
"genre": [
"Unknown",
]
})
).average(
column="checked_out",
low=0,
high=30,
)
answer = session.evaluate(query, privacy_budget=PureDPBudget(float("inf")))
answer.show()
+-------+-------------------+
| genre|checked_out_average|
+-------+-------------------+
|Unknown| 3.0|
+-------+-------------------+
The row where the genre “Unknown” had null
books checked out has been dropped,
so the average number of Unknown books checked out is 3 - even though originally
there were two rows with Unknown books checked out (one with 3 books
checked out, and one with a null value).
If we instead replace all null values with 0, we get a different result:
query = QueryBuilder("checkouts").replace_null_and_nan({
"checked_out": 0,
}).groupby(
KeySet.from_dict({
"genre": [
"Unknown",
]
})
).average(
column="checked_out",
low=0,
high=30,
)
answer = session.evaluate(query, privacy_budget=PureDPBudget(float("inf")))
answer.show()
+-------+-------------------+
| genre|checked_out_average|
+-------+-------------------+
|Unknown| 1.5|
+-------+-------------------+
If you want to treat null values as zeroes, you must explicitly replace them before performing your query.
The same principles apply for NaN values.
Infinite values in aggregations#
When you perform a sum, variance, average, standard deviation, or quantile query on data containing infinite values, Analytics clamps those infinite values to the query’s upper and lower bounds. Positive infinity is clamped to the upper bound, and negative infinity is clamped to the lower bound. The following example uses a Session with an infinite budget to demonstrate this:
session = Session.from_dataframe(
privacy_budget=PureDPBudget(float("inf")),
source_id="checkouts",
dataframe=private_data,
protected_change=AddOneRow(),
)
private_data.show()
+-------+---------------+-----------+
| name| genre|checked_out|
+-------+---------------+-----------+
| Ambar|Science fiction| 5.0|
| Tessa|Science fiction| -Infinity|
|Alfredo|Science fiction| Infinity|
+-------+---------------+-----------+
query = QueryBuilder("checkouts").groupby(
KeySet.from_dict({
"genre": [
"Science fiction",
]
})
).sum(
column="checked_out",
low=0,
high=30,
)
answer = session.evaluate(query, privacy_budget=PureDPBudget(float("inf")))
answer.show()
+---------------+---------------+
| genre|checked_out_sum|
+---------------+---------------+
|Science fiction| 35.0|
+---------------+---------------+
Tessa’s -Infinity
books checked out became 0, and Alfredo’s Infinity
books checked out became 30, for a total of 35 (5 + 0 + 30).
The example below uses different query bounds:
query = QueryBuilder("checkouts").groupby(
KeySet.from_dict({
"genre": [
"Science fiction",
]
})
).average(
column="checked_out",
low=-15,
high=10,
)
answer = session.evaluate(query, privacy_budget=PureDPBudget(float("inf")))
answer.show()
+---------------+-------------------+
| genre|checked_out_average|
+---------------+-------------------+
|Science fiction| 0.0|
+---------------+-------------------+
In this example, Tessa’s -Infinity
books checked out becomes -15, and
Alfredo’s Infinity
books checked out becomes 10. The average number
of books checked out is therefore 0 (5 + 10 + -15, divided by 3).
If you want infinite values to be treated differently, then you should
explicitly drop infinite values (with
tmlt.analytics.query_builder.QueryBuilder.drop_infinity()
) or
replace them (with
tmlt.analytics.query_builder.QueryBuilder.replace_infinity()
) before
performing your aggregation.