query_expr#

Building blocks of the Tumult Analytics query language. Not for direct use.

Defines the QueryExpr class, which represents expressions in the Tumult Analytics query language. Most users should not construct or deconstruct QueryExpr and its subclasses directly. Instead, create them with tmlt.analytics.query_builder.QueryBuilder and consume them with tmlt.analytics.session.Session, which provide more user-friendly interfaces.

Data#

Row#

Type alias for dictionary with string keys.

Classes#

CountMechanism

Possible mechanisms for the count() aggregation.

CountDistinctMechanism

Possible mechanisms for the count_distinct() aggregation.

SumMechanism

Possible mechanisms for the sum() aggregation.

AverageMechanism

Possible mechanisms for the average() aggregation.

VarianceMechanism

Possible mechanisms for the variance() aggregation.

StdevMechanism

Possible mechanisms for the stdev() aggregation.

QueryExpr

A query expression, base class for relational operators.

PrivateSource

Loads the private source.

GetGroups

Returns groups based on the geometric partition selection for these columns.

Rename

Returns the dataframe with columns renamed.

Filter

Returns the subset of the rows that satisfy the condition.

Select

Returns a subset of the columns.

Map

Applies a map function to each row of a relation.

FlatMap

Applies a flat map function to each row of a relation.

JoinPrivate

Returns the join of two private tables.

JoinPublic

Returns the join of a private and public table.

AnalyticsDefault

Default values for each type of column in Tumult Analytics.

ReplaceNullAndNan

Returns data with null and NaN expressions replaced by a default.

ReplaceInfinity

Returns data with +inf and -inf expressions replaced by defaults.

DropNullAndNan

Returns data with rows that contain null or NaN values dropped.

DropInfinity

Returns data with rows that contain +inf/-inf dropped.

EnforceConstraint

Enforces a constraint on the data.

GroupByCount

Returns the count of each combination of the groupby domains.

GroupByCountDistinct

Returns the count of distinct rows in each groupby domain value.

GroupByQuantile

Returns the quantile of a column for each combination of the groupby domains.

GroupByBoundedSum

Returns the bounded sum of a column for each combination of groupby domains.

GroupByBoundedAverage

Returns the bounded average of a column for each combination of groupby domains.

GroupByBoundedVariance

Returns the bounded variance of a column for each combination of groupby domains.

GroupByBoundedSTDEV

Returns the bounded stdev of a column for each combination of groupby domains.

QueryExprVisitor

A base class for implementing visitors for QueryExpr.

class CountMechanism#

Bases: enum.Enum

Possible mechanisms for the count() aggregation.

Currently, the count() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Double-sided geometric noise is used.

GAUSSIAN#

The discrete Gaussian mechanism is used. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.
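As an illustration of the LAPLACE entry above: double-sided geometric noise is the integer-valued analogue of Laplace noise. The following is a minimal sketch of one standard way to sample it, as the difference of two geometric variables. It is not the Tumult Analytics implementation; the function name and the epsilon and sensitivity parameters are assumptions made for this example, and floating-point sampling like this is not suitable for production privacy guarantees.

```python
import math
import random


def sample_double_sided_geometric(epsilon: float, sensitivity: int = 1, rng=random) -> int:
    """Draw integer noise with P(k) proportional to exp(-epsilon * |k| / sensitivity)."""
    alpha = math.exp(-epsilon / sensitivity)

    def geometric() -> int:
        # Inverse-transform sampling: number of failures before the first
        # success, where each trial succeeds with probability 1 - alpha.
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        return math.floor(math.log(u) / math.log(alpha))

    # The difference of two i.i.d. geometric variables is two-sided geometric.
    return geometric() - geometric()


rng = random.Random(0)  # seeded only to make the sketch reproducible
true_count = 42
noisy_count = true_count + sample_double_sided_geometric(epsilon=1.0, rng=rng)
print(noisy_count)
```

Because the noise is integer-valued, a noisy count stays an integer, which is one reason this distribution is commonly used for counting queries.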

class CountDistinctMechanism#

Bases: enum.Enum

Possible mechanisms for the count_distinct() aggregation.

Currently, the count_distinct() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Double-sided geometric noise is used.

GAUSSIAN#

The discrete Gaussian mechanism is used. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.
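Since these classes derive from enum.Enum, name and value are read as attributes on a member rather than called. A short sketch using a hypothetical stand-in enum (SketchMechanism is not part of the library, and the real member values are not shown in this reference, so enum.auto() is an assumption):

```python
import enum


class SketchMechanism(enum.Enum):
    """Hypothetical stand-in for a mechanism enum; values are assumed."""

    DEFAULT = enum.auto()
    LAPLACE = enum.auto()
    GAUSSIAN = enum.auto()


member = SketchMechanism.LAPLACE
print(member.name)   # LAPLACE
print(member.value)  # 2 here, because enum.auto() numbers members from 1

# Members can also be looked up by name.
by_name = SketchMechanism["GAUSSIAN"]
print(by_name is SketchMechanism.GAUSSIAN)  # True
```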

class SumMechanism#

Bases: enum.Enum

Possible mechanisms for the sum() aggregation.

Currently, the sum() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Laplace and/or double-sided geometric noise is used, depending on the column type.

GAUSSIAN#

Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class AverageMechanism#

Bases: enum.Enum

Possible mechanisms for the average() aggregation.

Currently, the average() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Laplace and/or double-sided geometric noise is used, depending on the column type.

GAUSSIAN#

Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class VarianceMechanism#

Bases: enum.Enum

Possible mechanisms for the variance() aggregation.

Currently, the variance() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Laplace and/or double-sided geometric noise is used, depending on the column type.

GAUSSIAN#

Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class StdevMechanism#

Bases: enum.Enum

Possible mechanisms for the stdev() aggregation.

Currently, the stdev() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Laplace and/or double-sided geometric noise is used, depending on the column type.

GAUSSIAN#

Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class QueryExpr#

Bases: abc.ABC

A query expression, base class for relational operators.

In most cases, QueryExpr should not be manipulated directly, but rather created using tmlt.analytics.query_builder.QueryBuilder and then consumed by tmlt.analytics.session.Session. While QueryExpr objects can be created and modified directly, this is advanced usage and is not recommended for typical users.

QueryExpr objects are organized in a tree, where each node is an operator that returns a relation.

abstract accept(visitor)#

Dispatch methods on a visitor based on the QueryExpr type.

Parameters

visitor (QueryExprVisitor) –

Return type

Any
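The tree structure and accept() dispatch described above follow the visitor pattern. The sketch below illustrates the pattern with small stand-in classes (SketchExpr, SketchSource, SketchFilter, SketchVisitor, and Describe are all hypothetical names, not the library's classes): each node's accept() calls the visitor method that matches its own type, and the visitor recurses into children.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


class SketchExpr(ABC):
    """Stand-in for QueryExpr: a tree node that accepts a visitor."""

    @abstractmethod
    def accept(self, visitor: "SketchVisitor") -> Any:
        ...


@dataclass
class SketchSource(SketchExpr):
    source_id: str

    def accept(self, visitor: "SketchVisitor") -> Any:
        return visitor.visit_source(self)


@dataclass
class SketchFilter(SketchExpr):
    child: SketchExpr
    condition: str

    def accept(self, visitor: "SketchVisitor") -> Any:
        return visitor.visit_filter(self)


class SketchVisitor(ABC):
    """Stand-in for QueryExprVisitor: one abstract method per node type."""

    @abstractmethod
    def visit_source(self, expr: SketchSource) -> Any:
        ...

    @abstractmethod
    def visit_filter(self, expr: SketchFilter) -> Any:
        ...


class Describe(SketchVisitor):
    """Renders the tree as a string by recursing through child nodes."""

    def visit_source(self, expr: SketchSource) -> str:
        return f"source({expr.source_id})"

    def visit_filter(self, expr: SketchFilter) -> str:
        return f"filter({expr.child.accept(self)}, {expr.condition!r})"


tree = SketchFilter(child=SketchSource("my_table"), condition="A > B")
print(tree.accept(Describe()))  # filter(source(my_table), 'A > B')
```

Adding a new operation over the tree then means writing a new visitor subclass, without changing the node classes.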

class PrivateSource#

Bases: QueryExpr

Loads the private source.

source_id :str#

The ID for the private source to load.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class GetGroups#

Bases: QueryExpr

Returns groups based on the geometric partition selection for these columns.

child :QueryExpr#

The QueryExpr to get groups for.

columns :Optional[List[str]]#

The columns used for geometric partition selection.

If empty or None, all of the columns in the table are used for partition selection.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class Rename#

Bases: QueryExpr

Returns the dataframe with columns renamed.

child :QueryExpr#

The QueryExpr to apply Rename to.

column_mapper :Dict[str, str]#

The mapping of old column names to new column names.

This mapping can contain all column names or just a subset. If it contains a subset of columns, it will only rename those columns and keep the other column names the same.
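The partial-mapping behavior can be sketched on plain dictionaries. This is an illustration only: rename_columns is a hypothetical helper operating on lists of dict rows, not on the Spark data the library works with.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def rename_columns(rows: List[Row], column_mapper: Dict[str, str]) -> List[Row]:
    """Rename mapped columns; columns absent from the mapper keep their names."""
    return [
        {column_mapper.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


rows = [{"A": 1, "B": 2, "C": 3}]
renamed = rename_columns(rows, {"A": "X"})
print(renamed)  # [{'X': 1, 'B': 2, 'C': 3}]
```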

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class Filter#

Bases: QueryExpr

Returns the subset of the rows that satisfy the condition.

child :QueryExpr#

The QueryExpr to filter.

condition :str#

A SQL expression string specifying the filter to apply to the data.

For example, the string “A > B” matches rows where column A is greater than column B.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class Select#

Bases: QueryExpr

Returns a subset of the columns.

child :QueryExpr#

The QueryExpr to apply the select on.

columns :List[str]#

The columns to select.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class Map#

Bases: QueryExpr

Applies a map function to each row of a relation.

child :QueryExpr#

The QueryExpr to apply the map on.

f :Callable[[Row], Row]#

The map function.

schema_new_columns :tmlt.analytics._schema.Schema#

The expected schema for new columns produced by f.

augment :bool#

Whether to keep the existing columns.

If True, schema = old schema + schema_new_columns, otherwise only keeps the new columns (schema = schema_new_columns).
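The effect of augment can be sketched on plain dictionary rows. apply_map and add_total are hypothetical helpers for illustration, and schema validation against schema_new_columns is omitted.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def apply_map(rows: List[Row], f: Callable[[Row], Row], augment: bool) -> List[Row]:
    """Apply f to each row, optionally keeping the original columns."""
    out = []
    for row in rows:
        new_columns = f(row)
        # augment=True: old columns plus new ones; otherwise only the new ones.
        out.append({**row, **new_columns} if augment else new_columns)
    return out


def add_total(row: Row) -> Row:
    return {"total": row["A"] + row["B"]}


rows = [{"A": 1, "B": 2}]
augmented = apply_map(rows, add_total, augment=True)
replaced = apply_map(rows, add_total, augment=False)
print(augmented)  # [{'A': 1, 'B': 2, 'total': 3}]
print(replaced)   # [{'total': 3}]
```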

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

__eq__(other)#

Returns true iff self == other.

This uses the bytecode of self.f and other.f to determine if the two functions are equal.

Parameters

other (object) –

Return type

bool
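Python functions compare by identity, so two separately defined functions with the same body are not equal under ==. Comparing compiled code works around this. The sketch below shows the general idea; the source states only that bytecode is used, so the choice of attributes compared here (co_code and co_consts) and the helper name same_bytecode are assumptions, and closures or globals are not considered.

```python
from typing import Callable


def same_bytecode(f: Callable, g: Callable) -> bool:
    """Compare compiled code and constants rather than object identity."""
    return (
        f.__code__.co_code == g.__code__.co_code
        and f.__code__.co_consts == g.__code__.co_consts
    )


f1 = lambda row: {"x": row["a"] + 1}
f2 = lambda row: {"x": row["a"] + 1}
f3 = lambda row: {"x": row["a"] * 2}

print(f1 == f2)               # False: distinct function objects
print(same_bytecode(f1, f2))  # True: identical compiled bodies
print(same_bytecode(f1, f3))  # False: different operation and constant
```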

class FlatMap#

Bases: QueryExpr

Applies a flat map function to each row of a relation.

child :QueryExpr#

The QueryExpr to apply the flat map on.

f :Callable[[Row], List[Row]]#

The flat map function.

schema_new_columns :tmlt.analytics._schema.Schema#

The expected schema for new columns produced by f.

If schema_new_columns has a grouping_column, this FlatMap produces a column that must eventually be grouped by. That column must also be the only column in the schema.

augment :bool#

Whether to keep the existing columns.

If True, schema = old schema + schema_new_columns, otherwise only keeps the new columns (schema = schema_new_columns).

max_rows :Optional[int]#

The enforced limit on the number of rows produced by each f(row).
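A sketch of how a per-row limit bounds the output of a flat map, using plain dictionary rows. apply_flat_map and split_words are hypothetical helpers; keeping the first max_rows results is an assumption, since this reference does not say which rows are kept, and augment handling is omitted.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def apply_flat_map(
    rows: List[Row], f: Callable[[Row], List[Row]], max_rows: Optional[int]
) -> List[Row]:
    """Apply f to each row and cap how many output rows each input row yields."""
    out: List[Row] = []
    for row in rows:
        produced = f(row)
        if max_rows is not None:
            produced = produced[:max_rows]  # assumption: keep the first max_rows
        out.extend(produced)
    return out


def split_words(row: Row) -> List[Row]:
    return [{"word": word} for word in row["text"].split()]


rows = [{"text": "a b c"}, {"text": "d"}]
limited = apply_flat_map(rows, split_words, max_rows=2)
print(limited)  # [{'word': 'a'}, {'word': 'b'}, {'word': 'd'}]
```

Capping the per-row output bounds how much any single input row can contribute to downstream results.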

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

__eq__(other)#

Returns true iff self == other.

This uses the bytecode of self.f and other.f to determine if the two functions are equal.

Parameters

other (object) –

Return type

bool

class JoinPrivate#

Bases: QueryExpr

Returns the join of two private tables.

Before performing the join, each table is truncated based on the corresponding TruncationStrategy. For a more detailed overview of JoinPrivate’s behavior, see join_private().

child :QueryExpr#

The QueryExpr to join with the right operand.

right_operand_expr :QueryExpr#

The QueryExpr for the private source to join with.

truncation_strategy_left :Optional[tmlt.analytics.truncation_strategy.TruncationStrategy.Type]#

Truncation strategy to be used for the left table.

truncation_strategy_right :Optional[tmlt.analytics.truncation_strategy.TruncationStrategy.Type]#

Truncation strategy to be used for the right table.

join_columns :Optional[List[str]]#

The columns used for joining the tables, or None to use all common columns.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class JoinPublic#

Bases: QueryExpr

Returns the join of a private and public table.

child :QueryExpr#

The QueryExpr to join with public_table.

public_table :Union[pyspark.sql.DataFrame, str]#

A DataFrame or public source to join with.

join_columns :Optional[List[str]]#

The columns used for joining the tables, or None to use all common columns.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

__eq__(other)#

Returns true iff self == other.

For the purposes of this equality operation, two dataframes are equal if they contain the same data, in any order.

Calling this on a JoinPublic that includes a very large dataframe could take a long time or consume a lot of resources, and is not recommended.

Parameters

other (object) –

Return type

bool
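Order-insensitive equality can be sketched as a multiset comparison over rows. This is an illustration on lists of tuples, not the library's DataFrame comparison; the helper name same_rows_any_order is hypothetical, and treating duplicate rows as significant is an assumption.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple

RowTuple = Tuple[Hashable, ...]


def same_rows_any_order(left: Iterable[RowTuple], right: Iterable[RowTuple]) -> bool:
    """True when both sides hold the same rows with the same multiplicities."""
    return Counter(left) == Counter(right)


a = [(1, "x"), (2, "y"), (2, "y")]
b = [(2, "y"), (1, "x"), (2, "y")]
c = [(1, "x"), (2, "y")]

print(same_rows_any_order(a, b))  # True: same rows, different order
print(same_rows_any_order(a, c))  # False: one duplicate row is missing
```

A comparison like this has to examine every row on both sides, which is consistent with the cost warning above for very large dataframes.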

class AnalyticsDefault#

Default values for each type of column in Tumult Analytics.

INTEGER = 0#

The default value used for integers (0).

DECIMAL = 0.0#

The default value used for floats (0.0).

VARCHAR = ''#

The default value used for VARCHARs (the empty string).

DATE#

The default value used for dates (datetime.date.fromtimestamp(0)).

See fromtimestamp().

TIMESTAMP#

The default value used for timestamps (datetime.datetime.fromtimestamp(0)).

See fromtimestamp().
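For reference, this is how the two standard-library calls named above behave. Both interpret timestamp 0 in the local time zone when no tz argument is given, so the printed local values can fall on 1969-12-31 or 1970-01-01 depending on where the code runs. The UTC variant is shown only for comparison and is not something this reference says the library uses.

```python
import datetime

# Local-time interpretation of the Unix epoch (result depends on the time zone).
epoch_date = datetime.date.fromtimestamp(0)
epoch_timestamp = datetime.datetime.fromtimestamp(0)
print(epoch_date, epoch_timestamp)

# Time-zone-aware variant, shown for comparison only.
utc_timestamp = datetime.datetime.fromtimestamp(0, tz=datetime.timezone.utc)
print(utc_timestamp)  # 1970-01-01 00:00:00+00:00
```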

class ReplaceNullAndNan#

Bases: QueryExpr

Returns data with null and NaN expressions replaced by a default.

Warning

After a ReplaceNullAndNan query has been performed for a column, Tumult Analytics will raise an error if you use a KeySet for that column that contains null values.

child :QueryExpr#

The QueryExpr to replace null/NaN values in.

replace_with :Mapping[str, Union[int, float, str, datetime.date, datetime.datetime]]#

New values to replace with, by column.

If this dictionary is empty, all columns will be changed, with values replaced by a default value for each column’s type (see the AnalyticsDefault class variables).

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class ReplaceInfinity#

Bases: QueryExpr

Returns data with +inf and -inf expressions replaced by defaults.

child :QueryExpr#

The QueryExpr to replace +inf and -inf values in.

replace_with :Dict[str, Tuple[float, float]]#

New values to replace with, by column. The first value for each column will be used to replace -infinity, and the second value will be used to replace +infinity.

If this dictionary is empty, all columns of type DECIMAL will be changed, with infinite values replaced with a default value (see the AnalyticsDefault class variables).
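The (negative, positive) tuple convention can be sketched on plain dictionary rows. replace_infinity is a hypothetical helper for illustration; it covers only the explicit-mapping case and does not model the empty-dictionary default behavior.

```python
import math
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def replace_infinity(
    rows: List[Row], replace_with: Dict[str, Tuple[float, float]]
) -> List[Row]:
    """Replace -inf with the first tuple value and +inf with the second."""
    out = []
    for row in rows:
        new_row = dict(row)
        for column, (neg_value, pos_value) in replace_with.items():
            value = new_row[column]
            if value == -math.inf:
                new_row[column] = neg_value
            elif value == math.inf:
                new_row[column] = pos_value
        out.append(new_row)
    return out


rows = [{"X": -math.inf}, {"X": 1.5}, {"X": math.inf}]
cleaned = replace_infinity(rows, {"X": (-100.0, 100.0)})
print(cleaned)  # [{'X': -100.0}, {'X': 1.5}, {'X': 100.0}]
```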

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class DropNullAndNan#

Bases: QueryExpr

Returns data with rows that contain null or NaN values dropped.

Warning

After a DropNullAndNan query has been performed for a column, Tumult Analytics will raise an error if you use a KeySet for that column that contains null values.

child :QueryExpr#

The QueryExpr in which to drop nulls/NaNs.

columns :List[str]#

Columns in which to look for nulls and NaNs.

If this list is empty, all columns are checked, so a row is dropped if any of its columns contains a null or NaN value.
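The difference between a specific column list and an empty list can be sketched on plain dictionary rows. drop_null_and_nan and is_null_or_nan are hypothetical helpers for illustration, with None standing in for null.

```python
import math
from typing import Any, Dict, List

Row = Dict[str, Any]


def is_null_or_nan(value: Any) -> bool:
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_null_and_nan(rows: List[Row], columns: List[str]) -> List[Row]:
    """Keep only rows with no null or NaN in the checked columns."""
    out = []
    for row in rows:
        to_check = columns if columns else list(row)  # empty list: check every column
        if not any(is_null_or_nan(row[column]) for column in to_check):
            out.append(row)
    return out


rows = [
    {"A": 1.0, "B": None},
    {"A": float("nan"), "B": "x"},
    {"A": 2.0, "B": "y"},
]
only_a = drop_null_and_nan(rows, ["A"])
all_columns = drop_null_and_nan(rows, [])
print(only_a)       # [{'A': 1.0, 'B': None}, {'A': 2.0, 'B': 'y'}]
print(all_columns)  # [{'A': 2.0, 'B': 'y'}]
```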

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class DropInfinity#

Bases: QueryExpr

Returns data with rows that contain +inf/-inf dropped.

child :QueryExpr#

The QueryExpr in which to drop +inf/-inf.

columns :List[str]#

Columns in which to look for infinite values.

If this list is empty, all columns are checked, so a row is dropped if any of its columns contains an infinite value.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class EnforceConstraint#

Bases: QueryExpr

Enforces a constraint on the data.

child :QueryExpr#

The QueryExpr to which the constraint will be applied.

constraint :tmlt.analytics.constraints.Constraint#

A constraint to be enforced.

options :Dict[str, Any]#

Options to be used when enforcing the constraint.

Appropriate values here vary depending on the constraint. These options are to support advanced use cases, and generally should not be used.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class GroupByCount#

Bases: QueryExpr

Returns the count of each combination of the groupby domains.

child :QueryExpr#

The QueryExpr to measure.

groupby_keys :tmlt.analytics.keyset.KeySet#

The keys of the resulting aggregated data.

output_column :str = count#

The name of the column to store the counts in.

mechanism :CountMechanism#

Choice of noise mechanism.

By DEFAULT, the framework automatically selects an appropriate mechanism.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class GroupByCountDistinct#

Bases: QueryExpr

Returns the count of distinct rows in each groupby domain value.

child :QueryExpr#

The QueryExpr to measure.

groupby_keys :tmlt.analytics.keyset.KeySet#

The keys of the resulting aggregated data.

columns_to_count :Optional[List[str]]#

The columns that are compared when determining if two rows are distinct.

If empty, will count all distinct rows.

output_column :str = count_distinct#

The name of the column to store the distinct counts in.

mechanism :CountDistinctMechanism#

Choice of noise mechanism.

By DEFAULT, the framework automatically selects an appropriate mechanism.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class GroupByQuantile#

Bases: QueryExpr

Returns the quantile of a column for each combination of the groupby domains.

If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the quantile is calculated.

child :QueryExpr#

The QueryExpr to measure.

groupby_keys :tmlt.analytics.keyset.KeySet#

The keys of the resulting aggregated data.

measure_column :str#

The column to compute the quantile over.

quantile :float#

The quantile to compute (between 0 and 1).

low :float#

The lower bound for clamping the measure_column. Should be less than high.

high :float#

The upper bound for clamping the measure_column. Should be greater than low.

output_column :str = quantile#

The name of the column to store the quantiles in.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class GroupByBoundedSum#

Bases: QueryExpr

Returns the bounded sum of a column for each combination of groupby domains.

If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the sum is calculated.

child :QueryExpr#

The QueryExpr to measure.

groupby_keys :tmlt.analytics.keyset.KeySet#

The keys of the resulting aggregated data.

measure_column :str#

The column to compute the sum over.

low :float#

The lower bound for clamping the measure_column. Should be less than high.

high :float#

The upper bound for clamping the measure_column. Should be greater than low.

output_column :str = sum#

The name of the column to store the sums in.

mechanism :SumMechanism#

Choice of noise mechanism.

By DEFAULT, the framework automatically selects an appropriate mechanism.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any
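The roles of low and high, and the dropping of null, NaN, and infinite values, can be sketched without grouping or noise. bounded_sum is a hypothetical helper for illustration and omits the differentially private noise the real query adds; None stands in for null.

```python
import math
from typing import Iterable, Optional


def bounded_sum(values: Iterable[Optional[float]], low: float, high: float) -> float:
    """Sum values after dropping null/NaN/inf and clamping to [low, high]."""
    total = 0.0
    for value in values:
        if value is None or math.isnan(value) or math.isinf(value):
            continue  # dropped before the sum, as described above
        total += min(max(value, low), high)  # clamp to the bounds
    return total


values = [3.0, -50.0, 1000.0, None, float("nan"), float("inf")]
result = bounded_sum(values, low=0.0, high=10.0)
print(result)  # 13.0 (3.0 kept, -50.0 clamped to 0.0, 1000.0 clamped to 10.0)
```

Clamping limits how much any single value can change the sum, which is what makes it possible to calibrate noise to the bounds.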

class GroupByBoundedAverage#

Bases: QueryExpr

Returns the bounded average of a column for each combination of groupby domains.

If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the average is calculated.

child :QueryExpr#

The QueryExpr to measure.

groupby_keys :tmlt.analytics.keyset.KeySet#

The keys of the resulting aggregated data.

measure_column :str#

The column to compute the average over.

low :float#

The lower bound for clamping the measure_column. Should be less than high.

high :float#

The upper bound for clamping the measure_column. Should be greater than low.

output_column :str = average#

The name of the column to store the averages in.

mechanism :AverageMechanism#

Choice of noise mechanism.

By DEFAULT, the framework automatically selects an appropriate mechanism.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class GroupByBoundedVariance#

Bases: QueryExpr

Returns the bounded variance of a column for each combination of groupby domains.

If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the variance is calculated.

child :QueryExpr#

The QueryExpr to measure.

groupby_keys :tmlt.analytics.keyset.KeySet#

The keys of the resulting aggregated data.

measure_column :str#

The column to compute the variance over.

low :float#

The lower bound for clamping the measure_column. Should be less than high.

high :float#

The upper bound for clamping the measure_column. Should be greater than low.

output_column :str = variance#

The name of the column to store the variances in.

mechanism :VarianceMechanism#

Choice of noise mechanism.

By DEFAULT, the framework automatically selects an appropriate mechanism.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class GroupByBoundedSTDEV#

Bases: QueryExpr

Returns the bounded stdev of a column for each combination of groupby domains.

If the column to be measured contains null, NaN, or positive or negative infinity, those values will be dropped (as if dropped explicitly via DropNullAndNan and DropInfinity) before the standard deviation is calculated.

child :QueryExpr#

The QueryExpr to measure.

groupby_keys :tmlt.analytics.keyset.KeySet#

The keys of the resulting aggregated data.

measure_column :str#

The column to compute the standard deviation over.

low :float#

The lower bound for clamping the measure_column. Should be less than high.

high :float#

The upper bound for clamping the measure_column. Should be greater than low.

output_column :str = stdev#

The name of the column to store the standard deviations in.

mechanism :StdevMechanism#

Choice of noise mechanism.

By DEFAULT, the framework automatically selects an appropriate mechanism.

accept(visitor)#

Visit this QueryExpr with visitor.

Parameters

visitor (QueryExprVisitor) –

Return type

Any

class QueryExprVisitor#

Bases: abc.ABC

A base class for implementing visitors for QueryExpr.

Methods#

visit_private_source()

Visit a PrivateSource.

visit_rename()

Visit a Rename.

visit_filter()

Visit a Filter.

visit_select()

Visit a Select.

visit_map()

Visit a Map.

visit_flat_map()

Visit a FlatMap.

visit_join_private()

Visit a JoinPrivate.

visit_join_public()

Visit a JoinPublic.

visit_replace_null_and_nan()

Visit a ReplaceNullAndNan.

visit_replace_infinity()

Visit a ReplaceInfinity.

visit_drop_null_and_nan()

Visit a DropNullAndNan.

visit_drop_infinity()

Visit a DropInfinity.

visit_enforce_constraint()

Visit an EnforceConstraint.

visit_get_groups()

Visit a GetGroups.

visit_groupby_count()

Visit a GroupByCount.

visit_groupby_count_distinct()

Visit a GroupByCountDistinct.

visit_groupby_quantile()

Visit a GroupByQuantile.

visit_groupby_bounded_sum()

Visit a GroupByBoundedSum.

visit_groupby_bounded_average()

Visit a GroupByBoundedAverage.

visit_groupby_bounded_variance()

Visit a GroupByBoundedVariance.

visit_groupby_bounded_stdev()

Visit a GroupByBoundedSTDEV.

abstract visit_private_source(expr)#

Visit a PrivateSource.

Parameters

expr (PrivateSource) –

Return type

Any

abstract visit_rename(expr)#

Visit a Rename.

Parameters

expr (Rename) –

Return type

Any

abstract visit_filter(expr)#

Visit a Filter.

Parameters

expr (Filter) –

Return type

Any

abstract visit_select(expr)#

Visit a Select.

Parameters

expr (Select) –

Return type

Any

abstract visit_map(expr)#

Visit a Map.

Parameters

expr (Map) –

Return type

Any

abstract visit_flat_map(expr)#

Visit a FlatMap.

Parameters

expr (FlatMap) –

Return type

Any

abstract visit_join_private(expr)#

Visit a JoinPrivate.

Parameters

expr (JoinPrivate) –

Return type

Any

abstract visit_join_public(expr)#

Visit a JoinPublic.

Parameters

expr (JoinPublic) –

Return type

Any

abstract visit_replace_null_and_nan(expr)#

Visit a ReplaceNullAndNan.

Parameters

expr (ReplaceNullAndNan) –

Return type

Any

abstract visit_replace_infinity(expr)#

Visit a ReplaceInfinity.

Parameters

expr (ReplaceInfinity) –

Return type

Any

abstract visit_drop_null_and_nan(expr)#

Visit a DropNullAndNan.

Parameters

expr (DropNullAndNan) –

Return type

Any

abstract visit_drop_infinity(expr)#

Visit a DropInfinity.

Parameters

expr (DropInfinity) –

Return type

Any

abstract visit_enforce_constraint(expr)#

Visit an EnforceConstraint.

Parameters

expr (EnforceConstraint) –

Return type

Any

abstract visit_get_groups(expr)#

Visit a GetGroups.

Parameters

expr (GetGroups) –

Return type

Any

abstract visit_groupby_count(expr)#

Visit a GroupByCount.

Parameters

expr (GroupByCount) –

Return type

Any

abstract visit_groupby_count_distinct(expr)#

Visit a GroupByCountDistinct.

Parameters

expr (GroupByCountDistinct) –

Return type

Any

abstract visit_groupby_quantile(expr)#

Visit a GroupByQuantile.

Parameters

expr (GroupByQuantile) –

Return type

Any

abstract visit_groupby_bounded_sum(expr)#

Visit a GroupByBoundedSum.

Parameters

expr (GroupByBoundedSum) –

Return type

Any

abstract visit_groupby_bounded_average(expr)#

Visit a GroupByBoundedAverage.

Parameters

expr (GroupByBoundedAverage) –

Return type

Any

abstract visit_groupby_bounded_variance(expr)#

Visit a GroupByBoundedVariance.

Parameters

expr (GroupByBoundedVariance) –

Return type

Any

abstract visit_groupby_bounded_stdev(expr)#

Visit a GroupByBoundedSTDEV.

Parameters

expr (GroupByBoundedSTDEV) –

Return type

Any