Building queries#

The QueryBuilder class allows users to construct differentially private queries using a PySpark-like syntax.

QueryBuilder implements transformations such as joins, maps, or filters. Using a transformation method returns a new QueryBuilder with that transformation applied. To re-use the transformations in a QueryBuilder as the base for multiple queries, users can create a view using create_view() and write queries on that view.

QueryBuilder instances can also have an aggregation like count() applied to them, potentially after a groupby(), yielding an object that can be passed to evaluate() to obtain differentially private results to the query.

QueryBuilder initialization#

QueryBuilder initialization is very simple: the only argument of the constructor is the table on which to apply the query.

QueryBuilder(source_id)

High-level interface for specifying DP queries.

Transformations#

QueryBuilders implement a variety of transformations, which all yield a new QueryBuilder instance with that transformation applied. At this stage, the query cannot yet be evaluated in a differentially private manner, but users can create views using create_view() on a transformation.

Schema manipulation#

Transformations that manipulate table schemas.

QueryBuilder.select(columns)

Selects the specified columns, dropping the others.

QueryBuilder.rename(column_mapper)

Renames one or more columns in the table.

Special value handling#

Transformations that replace or remove special column values such as null values, NaN values, or infinity values.

QueryBuilder.replace_null_and_nan([replace_with])

Replaces null and NaN values in specified columns.

QueryBuilder.drop_null_and_nan(columns)

Removes rows containing null or NaN values.

QueryBuilder.replace_infinity([replace_with])

Replaces +inf and -inf values in specified columns.

QueryBuilder.drop_infinity(columns)

Remove rows containing infinite values.

Filters and maps#

Transformations that remove or modify rows of private tables, according to user-specified predicates or functions.

QueryBuilder.filter(condition)

Filter rows matching a condition.

QueryBuilder.map(f, new_column_types[, augment])

Applies a mapping function to each row.

QueryBuilder.flat_map(f, new_column_types[, ...])

Applies a mapping function to each row, returning zero or more rows.

QueryBuilder.flat_map_by_id(f, new_column_types)

Applies a transformation to each group of records sharing an ID.

Binning#

A transformation that groups together nearby values in numerical, date, or timestamp columns, according to user-specified bins.

QueryBuilder.bin_column(column, spec[, name])

Creates a new column by assigning the values in a given column to bins.

BinningSpec(bin_edges[, names, right, ...])

A spec object defining an operation where values are assigned to bins.

Constraints#

enforce() truncates the sensitive data to limit the maximum impact of the protected change. More information about it can be found in the Working with privacy IDs tutorial.

QueryBuilder.enforce(constraint)

Enforces a Constraint on the table.

Constraint()

Base class representing a known, enforceable fact about a table.

MaxRowsPerID(max)

A constraint limiting the number of rows associated with each ID in a table.

MaxGroupsPerID(grouping_column, max)

A constraint limiting the number of distinct groups per ID.

MaxRowsPerGroupPerID(grouping_column, max)

A constraint limiting rows per unique (ID, grouping column) pair in a table.

Joins#

Transformations that join the sensitive data with public, non-sensitive data, or with another sensitive data source.

QueryBuilder.join_public(public_table[, ...])

Joins the table with a DataFrame or a public source.

QueryBuilder.join_private(right_operand[, ...])

Join the table with another QueryBuilder.

Group-by#

A transformation that groups the data by the value of one or more columns. The group-by keys can be specified using a KeySet; more information about it can be found in the Group-by queries tutorial. The transformation returns a GroupedQueryBuilder, a object representing a partial query on which only aggregations can be run.

QueryBuilder.groupby(by)

Groups the query by the given set of keys, returning a GroupedQueryBuilder.

KeySet

A class containing a set of values for specific columns.

GroupedQueryBuilder

A QueryBuilder that is grouped by a set of columns and can be aggregated.

Aggregations#

These aggregations return a Query that can be evaluated with differential privacy. They can be used after a groupby() operation on a GroupedQueryBuilder, or on a QueryBuilder directly.

QueryBuilder.count([name, mechanism])

Returns a count query ready to be evaluated.

QueryBuilder.count_distinct([columns, name, ...])

Returns a count_distinct query ready to be evaluated.

QueryBuilder.sum(column, low, high[, name, ...])

Returns a sum query ready to be evaluated.

QueryBuilder.average(column, low, high[, ...])

Returns an average query ready to be evaluated.

QueryBuilder.variance(column, low, high[, ...])

Returns a variance query ready to be evaluated.

QueryBuilder.stdev(column, low, high[, ...])

Returns a standard deviation query ready to be evaluated.

QueryBuilder.quantile(column, quantile, low, ...)

Returns a quantile query ready to be evaluated.

QueryBuilder.median(column, low, high[, name])

Returns a quantile query requesting a median value, ready to be evaluated.

QueryBuilder.min(column, low, high[, name])

Returns a quantile query requesting a minimum value, ready to be evaluated.

QueryBuilder.max(column, low, high[, name])

Returns a quantile query requesting a maximum value, ready to be evaluated.

QueryBuilder.histogram(column, bin_edges[, name])

Returns a count query containing the frequency of values in specified column.

QueryBuilder.get_groups([columns])

Returns a query that gets combinations of values in the listed columns.

QueryBuilder.get_bounds(column[, ...])

Returns a query that gets approximate upper and lower bounds for a column.

Queries and post-processing#

These classes are returned by aggregations, and can be passed to evaluate(). Some of them, notably group-by counts, support additional post-processing operations that can be performed at the same time as query evaluation.

Query

A complete query, ready to be evaluated in a differentially private manner.

GroupbyCountQuery

Stores the plan for a differentially private groupby count calculation.

GroupbyCountQuery.suppress(threshold)

Returns a new query with an added postprocessing thresholding step.

Evaluating queries#

The evaluate() method is the main function used to compute queries with differential privacy. QueryBuilders can also be used to create views using create_view.

The Session also provides methods to add public tables and perform parallel partitioning.

Session.evaluate(query_expr, privacy_budget)

Answers a query within the given privacy budget and returns a Spark dataframe.

Session.create_view(query_expr, source_id, cache)

Creates a new view from a transformation and possibly cache it.

Session.delete_view(source_id)

Deletes a view and decaches it if it was cached.

Session.add_public_dataframe(source_id, ...)

Adds a public data source to the session.

Session.partition_and_create(source_id, ...)

Returns new sessions from a partition mapped to split name/source_id.

Session.stop()

Closes out this Session, allowing other Sessions to become active.

Column types and descriptors#

Objects and classes used to describe the schema of tables in a Session.

ColumnType

The supported SQL92 column types used by Tumult Analytics.

ColumnDescriptor

Information about a column.

AnalyticsDefault

Default values for each type of column in Tumult Analytics.