query_expr#
Building blocks of the Tumult Analytics query language. Not for direct use.
Deprecated since version 0.14: This module will be removed in an upcoming release. Import mechanism enums from tmlt.analytics.query_builder instead. QueryExpr will be removed from the Tumult Analytics public API.
Defines the QueryExpr class, which represents expressions in the Tumult Analytics query language. Most users should not construct or deconstruct QueryExpr and its subclasses directly; tmlt.analytics.query_builder.QueryBuilder (to create them) and tmlt.analytics.session.Session (to consume them) provide more user-friendly interfaces.
Classes#
AnalyticsDefault | Default values for each type of column in Tumult Analytics. |
AverageMechanism | Possible mechanisms for the average() aggregation. |
CountDistinctMechanism | Enumerating the possible mechanisms used for the count_distinct aggregation. |
CountMechanism | Possible mechanisms for the count() aggregation. |
DropInfinity | Returns data with rows that contain +inf/-inf dropped. |
DropNullAndNan | Returns data with rows that contain null or NaN value dropped. |
EnforceConstraint | Enforces a constraint on the data. |
Filter | Returns the subset of the rows that satisfy the condition. |
FlatMap | Applies a flat map function to each row of a relation. |
GetBounds | Returns approximate upper and lower bounds of a column. |
GetGroups | Returns groups based on the geometric partition selection for these columns. |
GroupByBoundedAverage | Returns bounded average of a column for each combination of groupby domains. |
GroupByBoundedSTDEV | Returns bounded stdev of a column for each combination of groupby domains. |
GroupByBoundedSum | Returns the bounded sum of a column for each combination of groupby domains. |
GroupByBoundedVariance | Returns bounded variance of a column for each combination of groupby domains. |
GroupByCount | Returns the count of each combination of the groupby domains. |
GroupByCountDistinct | Returns the count of distinct rows in each groupby domain value. |
GroupByQuantile | Returns the quantile of a column for each combination of the groupby domains. |
JoinPrivate | Returns the join of two private tables. |
JoinPublic | Returns the join of a private and public table. |
Map | Applies a map function to each row of a relation. |
PrivateSource | Loads the private source. |
QueryExpr | A query expression, base class for relational operators. |
QueryExprVisitor | A base class for implementing visitors for QueryExpr. |
Rename | Returns the dataframe with columns renamed. |
ReplaceInfinity | Returns data with +inf and -inf expressions replaced by defaults. |
ReplaceNullAndNan | Returns data with null and NaN expressions replaced by a default. |
Select | Returns a subset of the columns. |
StdevMechanism | Possible mechanisms for the stdev() aggregation. |
SumMechanism | Possible mechanisms for the sum() aggregation. |
SuppressAggregates | Remove all counts that are less than the threshold. |
VarianceMechanism | Possible mechanisms for the variance() aggregation. |
- class AnalyticsDefault#
Default values for each type of column in Tumult Analytics.
- INTEGER = 0#
The default value used for integers (0).
- DECIMAL = 0.0#
The default value used for floats (0.0).
- VARCHAR = ''#
The default value used for VARCHARs (the empty string).
- DATE#
The default value used for dates (datetime.date.fromtimestamp(0)). See fromtimestamp().
- TIMESTAMP#
The default value used for timestamps (datetime.datetime.fromtimestamp(0)). See fromtimestamp().
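The defaults listed above can be reproduced with the standard library alone. The following is a minimal sketch that does not import Tumult Analytics; the dictionary name DEFAULTS_SKETCH is hypothetical. Because fromtimestamp(0) converts the POSIX epoch to local time, the resulting date and timestamp depend on the time zone of the machine running the code.

```python
import datetime

# Hypothetical stand-in for the documented defaults; not the library's class.
DEFAULTS_SKETCH = {
    "INTEGER": 0,
    "DECIMAL": 0.0,
    "VARCHAR": "",
    # Local-time conversion of the POSIX epoch, so the value varies by time zone.
    "DATE": datetime.date.fromtimestamp(0),
    "TIMESTAMP": datetime.datetime.fromtimestamp(0),
}

for name, value in DEFAULTS_SKETCH.items():
    print(f"{name}: {value!r}")
```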
- class AverageMechanism#
Bases:
enum.Enum
Possible mechanisms for the average() aggregation.
Currently, the average() aggregation uses an additive noise mechanism to achieve differential privacy.
- DEFAULT#
The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.
- LAPLACE#
Laplace and/or double-sided geometric noise is used, depending on the column type.
- GAUSSIAN#
Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.
- name()#
The name of the Enum member.
- value()#
The value of the Enum member.
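To make "additive noise, depending on the column type" concrete, here is a minimal sketch of the two noise shapes named under LAPLACE: continuous Laplace noise for floating-point values and double-sided geometric noise for integers. The function names and the scale parameter are assumptions made for illustration; this is not the library's implementation, and naive floating-point sampling like this is not suitable for real differential privacy guarantees.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def double_sided_geometric_noise(scale: float, rng: random.Random) -> int:
    """Sample integer noise as the difference of two geometric draws."""
    p = 1.0 - math.exp(-1.0 / scale)  # success probability of each draw

    def geometric() -> int:
        # Failures before the first success, via inverse transform sampling.
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        return int(math.floor(math.log(u) / math.log(1.0 - p)))

    return geometric() - geometric()


def add_noise(value, scale: float, rng: random.Random):
    """Pick the noise shape from the value's type (illustrative rule only)."""
    if isinstance(value, int):
        return value + double_sided_geometric_noise(scale, rng)
    return value + laplace_noise(scale, rng)


rng = random.Random(0)
noisy_count = add_noise(100, 2.0, rng)   # integer in, integer out
noisy_mean = add_noise(41.5, 2.0, rng)   # float in, float out
print(noisy_count, noisy_mean)
```

Integer inputs stay integers after noise is added, which is the practical reason to pair a discrete distribution with integer-valued columns.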
- class CountDistinctMechanism#
Bases:
enum.Enum
Enumerating the possible mechanisms used for the count_distinct aggregation.
Currently, the count_distinct() aggregation uses an additive noise mechanism to achieve differential privacy.
- DEFAULT#
The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.
- LAPLACE#
Double-sided geometric noise is used.
- GAUSSIAN#
The discrete Gaussian mechanism is used. Not compatible with pure DP.
- name()#
The name of the Enum member.
- value()#
The value of the Enum member.
- class CountMechanism#
Bases:
enum.Enum
Possible mechanisms for the count() aggregation.
Currently, the count() aggregation uses an additive noise mechanism to achieve differential privacy.
- DEFAULT#
The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.
- LAPLACE#
Double-sided geometric noise is used.
- GAUSSIAN#
The discrete Gaussian mechanism is used. Not compatible with pure DP.
- name()#
The name of the Enum member.
- value()#
The value of the Enum member.
- class DropInfinity#
Bases:
QueryExpr
Returns data with rows that contain +inf/-inf dropped.
- columns: List[str]#
Columns in which to look for infinite values.
If this list is empty, all columns are checked, so a row is dropped if any of its columns contains an infinite value.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
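The column-selection rule for DropInfinity can be sketched in plain Python. This is only an illustration of the documented semantics on a list of dictionaries; the function drop_infinity and the Row alias are hypothetical and are not part of Tumult Analytics.

```python
import math
from typing import Any, Dict, List

Row = Dict[str, Any]


def drop_infinity(rows: List[Row], columns: List[str]) -> List[Row]:
    """Keep only rows with no +inf/-inf in the checked columns."""

    def is_infinite(value: Any) -> bool:
        return isinstance(value, float) and math.isinf(value)

    kept = []
    for row in rows:
        # An empty column list means every column is checked.
        checked = columns if columns else list(row)
        if not any(is_infinite(row[c]) for c in checked):
            kept.append(row)
    return kept


rows = [
    {"A": 1.0, "B": 2.0},
    {"A": float("inf"), "B": 3.0},
    {"A": 4.0, "B": float("-inf")},
]
only_a = drop_infinity(rows, ["A"])   # the -inf in column B is ignored
all_columns = drop_infinity(rows, [])  # any infinite value drops the row
print(len(only_a), len(all_columns))
```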
- class DropNullAndNan#
Bases:
QueryExpr
Returns data with rows that contain null or NaN value dropped.
Warning
After a DropNullAndNan query has been performed for a column, Tumult Analytics will raise an error if you use a KeySet for that column that contains null values.
- columns: List[str]#
Columns in which to look for nulls and NaNs.
If this list is empty, all columns are checked, so a row is dropped if any of its columns contains a null or NaN value.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class EnforceConstraint#
Bases:
QueryExpr
Enforces a constraint on the data.
- constraint: tmlt.analytics.constraints.Constraint#
A constraint to be enforced.
- options: Dict[str, Any]#
Options to be used when enforcing the constraint.
Appropriate values here vary depending on the constraint. These options are to support advanced use cases, and generally should not be used.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class Filter#
Bases:
QueryExpr
Returns the subset of the rows that satisfy the condition.
- condition: str#
A SQL expression string specifying the filter to apply to the data.
For example, the string “A > B” matches rows where column A is greater than column B.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
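The effect of a condition string such as "A > B" can be tried out with any SQL engine. The sketch below uses the standard-library sqlite3 module purely as a stand-in; Tumult Analytics is not involved here, SQL dialects differ between engines, and the table name t is made up for the example.

```python
import sqlite3

condition = "A > B"  # same form as the condition string described above

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A INTEGER, B INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(3, 1), (1, 3), (2, 2)])

# The condition is used as the WHERE clause. Interpolating a string into SQL
# like this is only acceptable for trusted input.
matching = conn.execute(f"SELECT A, B FROM t WHERE {condition}").fetchall()
conn.close()

print(matching)
```

Only the row where A is strictly greater than B is returned; the row with equal values is filtered out.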
- class FlatMap#
Bases:
QueryExpr
Applies a flat map function to each row of a relation.
- f: Callable[[Row], List[Row]]#
The flat map function.
- schema_new_columns: tmlt.analytics._schema.Schema#
The expected schema for new columns produced by f.
If schema_new_columns has a grouping_column, this FlatMap produces a column that must eventually be grouped by. That column must also be the only column in the schema.
- augment: bool#
Whether to keep the existing columns.
If True, schema = old schema + schema_new_columns, otherwise only keeps the new columns (schema = schema_new_columns).
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
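The interaction between f and augment can be sketched on plain dictionaries. The helper flat_map, the Row alias, and the example function split_tags below are hypothetical and only mirror the documented behaviour; the real operator works on a relation inside a Session and also validates the new columns against schema_new_columns.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def flat_map(rows: List[Row], f: Callable[[Row], List[Row]], augment: bool) -> List[Row]:
    """Apply f to each row and collect every row it returns."""
    out: List[Row] = []
    for row in rows:
        for new_row in f(row):
            # augment=True keeps the old columns next to the new ones;
            # augment=False keeps only the new columns.
            out.append({**row, **new_row} if augment else dict(new_row))
    return out


def split_tags(row: Row) -> List[Row]:
    """Emit one output row per non-empty tag (zero rows if there are none)."""
    return [{"tag": t} for t in row["tags"].split(",") if t]


rows = [{"id": 1, "tags": "a,b"}, {"id": 2, "tags": ""}]
augmented = flat_map(rows, split_tags, augment=True)
new_only = flat_map(rows, split_tags, augment=False)
print(augmented)
print(new_only)
```

A flat map can return zero rows for an input row, so the second input row disappears from both outputs.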
- class GetBounds#
Bases:
QueryExpr
Returns approximate upper and lower bounds of a column.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class GetGroups#
Bases:
QueryExpr
Returns groups based on the geometric partition selection for these columns.
- columns: List[str] | None = None#
The columns used for geometric partition selection.
If this list is empty or None, all of the columns in the table are used for partition selection.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class GroupByBoundedAverage#
Bases:
QueryExpr
Returns bounded average of a column for each combination of groupby domains.
If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the average is calculated.
- groupby_keys: tmlt.analytics.keyset.KeySet | List[str]#
The keys, or columns list to collect keys from, to be grouped on.
- mechanism: AverageMechanism#
Choice of noise mechanism.
By DEFAULT, the framework automatically selects an appropriate mechanism.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class GroupByBoundedSTDEV#
Bases:
QueryExpr
Returns bounded stdev of a column for each combination of groupby domains.
If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the standard deviation is calculated.
- groupby_keys: tmlt.analytics.keyset.KeySet | List[str]#
The keys, or columns list to collect keys from, to be grouped on.
- mechanism: StdevMechanism#
Choice of noise mechanism.
By DEFAULT, the framework automatically selects an appropriate mechanism.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class GroupByBoundedSum#
Bases:
QueryExpr
Returns the bounded sum of a column for each combination of groupby domains.
If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the sum is calculated.
- groupby_keys: tmlt.analytics.keyset.KeySet | List[str]#
The keys, or columns list to collect keys from, to be grouped on.
- mechanism: SumMechanism#
Choice of noise mechanism.
By DEFAULT, the framework automatically selects an appropriate mechanism.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class GroupByBoundedVariance#
Bases:
QueryExpr
Returns bounded variance of a column for each combination of groupby domains.
If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the variance is calculated.
- groupby_keys: tmlt.analytics.keyset.KeySet | List[str]#
The keys, or columns list to collect keys from, to be grouped on.
- mechanism: VarianceMechanism#
Choice of noise mechanism.
By DEFAULT, the framework automatically selects an appropriate mechanism.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class GroupByCount#
Bases:
QueryExpr
Returns the count of each combination of the groupby domains.
- groupby_keys: tmlt.analytics.keyset.KeySet | List[str]#
The keys, or columns list to collect keys from, to be grouped on.
- mechanism: CountMechanism#
Choice of noise mechanism.
By DEFAULT, the framework automatically selects an appropriate mechanism.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class GroupByCountDistinct#
Bases:
QueryExpr
Returns the count of distinct rows in each groupby domain value.
- groupby_keys: tmlt.analytics.keyset.KeySet | List[str]#
The keys, or columns list to collect keys from, to be grouped on.
- columns_to_count: List[str] | None = None#
The columns that are compared when determining if two rows are distinct.
If empty, will count all distinct rows.
- mechanism: CountDistinctMechanism#
Choice of noise mechanism.
By DEFAULT, the framework automatically selects an appropriate mechanism.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
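The role of columns_to_count can be illustrated with an exact, non-private count in plain Python. The function count_distinct_by_group is hypothetical, groups by a list of column names rather than a KeySet, and adds no noise, so its output is not differentially private.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def count_distinct_by_group(
    rows: List[Row],
    group_columns: List[str],
    columns_to_count: Optional[List[str]] = None,
) -> Dict[Tuple, int]:
    """Count distinct rows (or distinct column combinations) per group."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in group_columns)
        # With no columns_to_count, whole rows are compared.
        compared = columns_to_count if columns_to_count else sorted(row)
        seen[key].add(tuple(row[c] for c in compared))
    return {key: len(values) for key, values in seen.items()}


rows = [
    {"state": "NC", "user": "ann", "item": "book"},
    {"state": "NC", "user": "ann", "item": "pen"},
    {"state": "NC", "user": "bob", "item": "pen"},
    {"state": "VA", "user": "cat", "item": "pen"},
    {"state": "VA", "user": "cat", "item": "pen"},
]
distinct_rows = count_distinct_by_group(rows, ["state"])
distinct_users = count_distinct_by_group(rows, ["state"], ["user"])
print(distinct_rows, distinct_users)
```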
- class GroupByQuantile#
Bases:
QueryExpr
Returns the quantile of a column for each combination of the groupby domains.
If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the quantile is calculated.
- groupby_keys: tmlt.analytics.keyset.KeySet | List[str]#
The keys, or columns list to collect keys from, to be grouped on.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class JoinPrivate#
Bases:
QueryExpr
Returns the join of two private tables.
Before performing the join, each table is truncated based on the corresponding TruncationStrategy. For a more detailed overview of JoinPrivate's behavior, see join_private().
- truncation_strategy_left: tmlt.analytics.truncation_strategy.TruncationStrategy.Type | None = None#
Truncation strategy to be used for the left table.
- truncation_strategy_right: tmlt.analytics.truncation_strategy.TruncationStrategy.Type | None = None#
Truncation strategy to be used for the right table.
- join_columns: List[str] | None = None#
The columns used for joining the tables, or None to use all common columns.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class JoinPublic#
Bases:
QueryExpr
Returns the join of a private and public table.
- public_table: pyspark.sql.DataFrame | str#
A DataFrame or public source to join with.
- join_columns: List[str] | None = None#
The columns used for joining the tables, or None to use all common columns.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- __eq__(other)#
Returns true iff self == other.
For the purposes of this equality operation, two dataframes are equal if they contain the same data, in any order.
Calling this on a JoinPublic that includes a very large dataframe could take a long time or consume a lot of resources, and is not recommended.
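An order-insensitive comparison of this kind can be sketched with collections.Counter. The helper same_data_any_order is hypothetical and works on lists of tuples rather than Spark DataFrames; treating duplicates as significant (a multiset comparison) is an assumption, since the text above does not say how repeated rows are handled.

```python
from collections import Counter
from typing import List, Tuple


def same_data_any_order(left: List[Tuple], right: List[Tuple]) -> bool:
    """Compare two collections of rows as multisets, ignoring row order."""
    return Counter(left) == Counter(right)


a = [(1, "x"), (2, "y")]
b = [(2, "y"), (1, "x")]  # same rows, different order
c = [(1, "x")]            # one row missing

print(same_data_any_order(a, b), same_data_any_order(a, c))
```

Every row has to be materialized and hashed for such a comparison, which is consistent with the warning above about very large dataframes.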
- class Map#
Bases:
QueryExpr
Applies a map function to each row of a relation.
- f: Callable[[Row], Row]#
The map function.
- schema_new_columns: tmlt.analytics._schema.Schema#
The expected schema for new columns produced by f.
- augment: bool#
Whether to keep the existing columns.
If True, schema = old schema + schema_new_columns, otherwise only keeps the new columns (schema = schema_new_columns).
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class PrivateSource#
Bases:
QueryExpr
Loads the private source.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class QueryExpr#
Bases:
abc.ABC
A query expression, base class for relational operators.
In most cases, QueryExpr should not be manipulated directly, but rather created using tmlt.analytics.query_builder.QueryBuilder and then consumed by tmlt.analytics.session.Session. While they can be created and modified directly, this is an advanced usage and is not recommended for typical users.
QueryExpr objects are organized in a tree, where each node is an operator that returns a relation.
- abstract accept(visitor)#
Dispatch methods on a visitor based on the QueryExpr type.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class QueryExprVisitor#
Bases:
abc.ABC
A base class for implementing visitors for QueryExpr.
Methods#
Visit a PrivateSource.
Visit a Rename.
Visit a Filter.
Visit a Select.
Visit a Map.
Visit a FlatMap.
Visit a JoinPrivate.
Visit a JoinPublic.
Visit a ReplaceNullAndNan.
Visit a ReplaceInfinity.
Visit a DropNullAndNan.
Visit a DropInfinity.
Visit a EnforceConstraint.
Visit a GetGroups.
Visit a GetBounds.
Visit a GroupByCount.
Visit a GroupByCountDistinct.
Visit a GroupByQuantile.
Visit a GroupByBoundedSum.
Visit a GroupByBoundedAverage.
Visit a GroupByBoundedVariance.
Visit a GroupByBoundedSTDEV.
Visit a SuppressAggregates.
- abstract visit_private_source(expr)#
Visit a PrivateSource.
- Parameters:
expr (PrivateSource) –
- Return type:
Any
- abstract visit_join_private(expr)#
Visit a JoinPrivate.
- Parameters:
expr (JoinPrivate) –
- Return type:
Any
- abstract visit_join_public(expr)#
Visit a JoinPublic.
- Parameters:
expr (JoinPublic) –
- Return type:
Any
- abstract visit_replace_null_and_nan(expr)#
Visit a ReplaceNullAndNan.
- Parameters:
expr (ReplaceNullAndNan) –
- Return type:
Any
- abstract visit_replace_infinity(expr)#
Visit a ReplaceInfinity.
- Parameters:
expr (ReplaceInfinity) –
- Return type:
Any
- abstract visit_drop_null_and_nan(expr)#
Visit a DropNullAndNan.
- Parameters:
expr (DropNullAndNan) –
- Return type:
Any
- abstract visit_drop_infinity(expr)#
Visit a DropInfinity.
- Parameters:
expr (DropInfinity) –
- Return type:
Any
- abstract visit_enforce_constraint(expr)#
Visit a EnforceConstraint.
- Parameters:
expr (EnforceConstraint) –
- Return type:
Any
- abstract visit_groupby_count(expr)#
Visit a GroupByCount.
- Parameters:
expr (GroupByCount) –
- Return type:
Any
- abstract visit_groupby_count_distinct(expr)#
Visit a GroupByCountDistinct.
- Parameters:
expr (GroupByCountDistinct) –
- Return type:
Any
- abstract visit_groupby_quantile(expr)#
Visit a GroupByQuantile.
- Parameters:
expr (GroupByQuantile) –
- Return type:
Any
- abstract visit_groupby_bounded_sum(expr)#
Visit a GroupByBoundedSum.
- Parameters:
expr (GroupByBoundedSum) –
- Return type:
Any
- abstract visit_groupby_bounded_average(expr)#
Visit a GroupByBoundedAverage.
- Parameters:
expr (GroupByBoundedAverage) –
- Return type:
Any
- abstract visit_groupby_bounded_variance(expr)#
Visit a GroupByBoundedVariance.
- Parameters:
expr (GroupByBoundedVariance) –
- Return type:
Any
- abstract visit_groupby_bounded_stdev(expr)#
Visit a GroupByBoundedSTDEV.
- Parameters:
expr (GroupByBoundedSTDEV) –
- Return type:
Any
- abstract visit_suppress_aggregates(expr)#
Visit a SuppressAggregates.
- Parameters:
expr (SuppressAggregates) –
- Return type:
Any
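The accept/visit pairing described above is the classic visitor pattern: each node's accept() calls the visitor method that matches the node's type. The sketch below shows the dispatch on a two-node toy tree. All names in it (ToyExpr, ToySource, ToyFilter, ToyVisitor, DescribeVisitor, and their fields) are hypothetical stand-ins so that the example runs without Tumult Analytics; a real visitor would subclass QueryExprVisitor and implement every abstract visit method.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


class ToyVisitor(ABC):
    """Toy counterpart of a visitor base class: one method per node type."""

    @abstractmethod
    def visit_source(self, expr: "ToySource") -> Any: ...

    @abstractmethod
    def visit_filter(self, expr: "ToyFilter") -> Any: ...


class ToyExpr(ABC):
    """Toy counterpart of an expression base class."""

    @abstractmethod
    def accept(self, visitor: ToyVisitor) -> Any:
        """Dispatch to the visitor method matching this node's type."""


@dataclass(frozen=True)
class ToySource(ToyExpr):
    source_id: str

    def accept(self, visitor: ToyVisitor) -> Any:
        return visitor.visit_source(self)


@dataclass(frozen=True)
class ToyFilter(ToyExpr):
    child: ToyExpr
    condition: str

    def accept(self, visitor: ToyVisitor) -> Any:
        return visitor.visit_filter(self)


class DescribeVisitor(ToyVisitor):
    """Walks the tree bottom-up and renders it as a string."""

    def visit_source(self, expr: ToySource) -> str:
        return f"source({expr.source_id})"

    def visit_filter(self, expr: ToyFilter) -> str:
        # Recurse into the child by asking it to accept this same visitor.
        return f"filter({expr.child.accept(self)}, {expr.condition!r})"


tree = ToyFilter(child=ToySource("my_table"), condition="A > B")
description = tree.accept(DescribeVisitor())
print(description)
```

Keeping the traversal logic in the visitor means a new operation over the tree is a new visitor class, with no change to the node classes.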
- class Rename#
Bases:
QueryExpr
Returns the dataframe with columns renamed.
- column_mapper: Dict[str, str]#
The mapping of old column names to new column names.
This mapping can contain all column names or just a subset. If it contains a subset of columns, it will only rename those columns and keep the other column names the same.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
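The subset behaviour of column_mapper amounts to a lookup with a fallback to the original name. The following sketch applies it to a single dictionary row; rename_columns is a hypothetical helper, not the library's implementation.

```python
from typing import Any, Dict


def rename_columns(row: Dict[str, Any], column_mapper: Dict[str, str]) -> Dict[str, Any]:
    """Rename mapped columns and leave every other column name unchanged."""
    return {column_mapper.get(name, name): value for name, value in row.items()}


row = {"A": 1, "B": 2, "C": 3}
renamed = rename_columns(row, {"A": "X"})  # only A is mentioned in the mapping
print(renamed)
```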
- class ReplaceInfinity#
Bases:
QueryExpr
Returns data with +inf and -inf expressions replaced by defaults.
- replace_with: Dict[str, Tuple[float, float]]#
New values to replace with, by column. The first value for each column will be used to replace -infinity, and the second value will be used to replace +infinity.
If this dictionary is empty, all columns of type DECIMAL will be changed, with infinite values replaced with a default value (see the AnalyticsDefault class variables).
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
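The ordering of each replace_with pair (first value for -infinity, second for +infinity) is easy to get backwards, so here is a minimal sketch on plain dictionaries. The function replace_infinity is hypothetical, and it does not model the empty-dictionary case in which defaults are applied to all DECIMAL columns.

```python
import math
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def replace_infinity(rows: List[Row], replace_with: Dict[str, Tuple[float, float]]) -> List[Row]:
    """Replace -inf with the first value of each pair and +inf with the second."""
    out = []
    for row in rows:
        new_row = dict(row)
        for column, (neg_value, pos_value) in replace_with.items():
            value = new_row[column]
            if isinstance(value, float) and math.isinf(value):
                new_row[column] = neg_value if value < 0 else pos_value
        out.append(new_row)
    return out


rows = [{"A": float("-inf")}, {"A": float("inf")}, {"A": 1.5}]
replaced = replace_infinity(rows, {"A": (-100.0, 100.0)})
print(replaced)
```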
- class ReplaceNullAndNan#
Bases:
QueryExpr
Returns data with null and NaN expressions replaced by a default.
Warning
After a ReplaceNullAndNan query has been performed for a column, Tumult Analytics will raise an error if you use a KeySet for that column that contains null values.
- replace_with: Mapping[str, int | float | str | datetime.date | datetime.datetime]#
New values to replace with, by column.
If this dictionary is empty, all columns will be changed, with values replaced by a default value for each column's type (see the AnalyticsDefault class variables).
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class Select#
Bases:
QueryExpr
Returns a subset of the columns.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
- class StdevMechanism#
Bases:
enum.Enum
Possible mechanisms for the stdev() aggregation.
Currently, the stdev() aggregation uses an additive noise mechanism to achieve differential privacy.
- DEFAULT#
The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.
- LAPLACE#
Laplace and/or double-sided geometric noise is used, depending on the column type.
- GAUSSIAN#
Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.
- name()#
The name of the Enum member.
- value()#
The value of the Enum member.
- class SumMechanism#
Bases:
enum.Enum
Possible mechanisms for the sum() aggregation.
Currently, the sum() aggregation uses an additive noise mechanism to achieve differential privacy.
- DEFAULT#
The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.
- LAPLACE#
Laplace and/or double-sided geometric noise is used, depending on the column type.
- GAUSSIAN#
Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.
- name()#
The name of the Enum member.
- value()#
The value of the Enum member.
- class SuppressAggregates#
Bases:
QueryExpr
Remove all counts that are less than the threshold.
- child: QueryExpr#
The aggregate on which to suppress small counts.
Currently, only GroupByCount is supported.
- accept(visitor)#
Visit this QueryExpr with visitor.
- Parameters:
visitor (QueryExprVisitor) –
- Return type:
Any
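Suppression is a post-processing filter on the output of a count. The sketch below shows the rule "remove all counts that are less than the threshold" on a list of result rows; the helper name, the count column name, and the threshold parameter are assumptions made for the example.

```python
from typing import Any, Dict, List


def suppress_small_counts(
    rows: List[Dict[str, Any]], threshold: float, count_column: str = "count"
) -> List[Dict[str, Any]]:
    """Keep only rows whose count is at least the threshold."""
    return [row for row in rows if row[count_column] >= threshold]


counts = [
    {"state": "NC", "count": 12},
    {"state": "VA", "count": 3},
    {"state": "SC", "count": 5},
]
kept = suppress_small_counts(counts, threshold=5)
print(kept)
```

A count exactly equal to the threshold is kept, because only counts strictly less than the threshold are removed.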
- class VarianceMechanism#
Bases:
enum.Enum
Possible mechanisms for the variance() aggregation.
Currently, the variance() aggregation uses an additive noise mechanism to achieve differential privacy.
- DEFAULT#
The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.
- LAPLACE#
Laplace and/or double-sided geometric noise is used, depending on the column type.
- GAUSSIAN#
Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.
- name()#
The name of the Enum member.
- value()#
The value of the Enum member.