Building queries#

The QueryBuilder class allows users to construct differentially private queries using a PySpark-like syntax.

QueryBuilder implements transformations such as joins, maps, or filters. Using a transformation method returns a new QueryBuilder with that transformation applied. To re-use the transformations in a QueryBuilder as the base for multiple queries, users can create a view using create_view() and write queries on that view.

QueryBuilder instances can also have an aggregation like count() applied to them, potentially after a groupby(), yielding an object that can be passed to evaluate() to obtain differentially private results to the query.

QueryBuilder initialization#

QueryBuilder initialization is very simple: the only argument of the constructor is the table on which to apply the query.

QueryBuilder(source_id)

High-level interface for specifying DP queries.

Transformations#

QueryBuilders implement a variety of transformations, which all yield a new QueryBuilder instance with that transformation applied. At this stage, the query cannot yet be evaluated in a differentially private manner, but users can create views using create_view() on a transformation.

Schema manipulation#

Transformations that manipulate table schemas.

`QueryBuilder.select`(columns)	Selects the specified columns, dropping the others.
`QueryBuilder.rename`(column_mapper)	Renames one or more columns in the table.

Special value handling#

Transformations that replace or remove special column values such as null values, NaN values, or infinity values.

`QueryBuilder.replace_null_and_nan`([replace_with])	Replaces null and NaN values in specified columns.
`QueryBuilder.drop_null_and_nan`(columns)	Removes rows containing null or NaN values.
`QueryBuilder.replace_infinity`([replace_with])	Replaces +inf and -inf values in specified columns.
`QueryBuilder.drop_infinity`(columns)	Remove rows containing infinite values.

Filters and maps#

Transformations that remove or modify rows of private tables, according to user-specified predicates or functions.

`QueryBuilder.filter`(condition)	Filter rows matching a condition.
`QueryBuilder.map`(f, new_column_types[, augment])	Applies a mapping function to each row.
`QueryBuilder.flat_map`(f, new_column_types[, ...])	Applies a mapping function to each row, returning zero or more rows.
`QueryBuilder.flat_map_by_id`(f, new_column_types)	Applies a transformation to each group of records sharing an ID.

Binning#

A transformation that groups together nearby values in numerical, date, or timestamp columns, according to user-specified bins.

`QueryBuilder.bin_column`(column, spec[, name])	Creates a new column by assigning the values in a given column to bins.
`BinningSpec`(bin_edges[, names, right, ...])	A spec object defining an operation where values are assigned to bins.

Constraints#

enforce() truncates the sensitive data to limit the maximum impact of the protected change. More information about it can be found in the Working with privacy IDs tutorial.

`QueryBuilder.enforce`(constraint)	Enforces a `Constraint` on the table.
`Constraint`()	Base class representing a known, enforceable fact about a table.
`MaxRowsPerID`(max)	A constraint limiting the number of rows associated with each ID in a table.
`MaxGroupsPerID`(grouping_column, max)	A constraint limiting the number of distinct groups per ID.
`MaxRowsPerGroupPerID`(grouping_column, max)	A constraint limiting rows per unique (ID, grouping column) pair in a table.

Joins#

Transformations that join the sensitive data with public, non-sensitive data, or with another sensitive data source.

`QueryBuilder.join_public`(public_table[, ...])	Joins the table with a DataFrame or a public source.
`QueryBuilder.join_private`(right_operand[, ...])	Join the table with another `QueryBuilder`.

Group-by#

A transformation that groups the data by the value of one or more columns. The group-by keys can be specified using a KeySet; more information about it can be found in the Group-by queries tutorial. The transformation returns a GroupedQueryBuilder, a object representing a partial query on which only aggregations can be run.

QueryBuilder.groupby(by)

Groups the query by the given set of keys, returning a GroupedQueryBuilder.

`KeySet`	A class containing a set of values for specific columns.
`GroupedQueryBuilder`	A QueryBuilder that is grouped by a set of columns and can be aggregated.

Aggregations#

These aggregations return a Query that can be evaluated with differential privacy. They can be used after a groupby() operation on a GroupedQueryBuilder, or on a QueryBuilder directly.

`QueryBuilder.count`([name, mechanism])	Returns a count query ready to be evaluated.
`QueryBuilder.count_distinct`([columns, name, ...])	Returns a count_distinct query ready to be evaluated.
`QueryBuilder.sum`(column, low, high[, name, ...])	Returns a sum query ready to be evaluated.
`QueryBuilder.average`(column, low, high[, ...])	Returns an average query ready to be evaluated.
`QueryBuilder.variance`(column, low, high[, ...])	Returns a variance query ready to be evaluated.
`QueryBuilder.stdev`(column, low, high[, ...])	Returns a standard deviation query ready to be evaluated.
`QueryBuilder.quantile`(column, quantile, low, ...)	Returns a quantile query ready to be evaluated.
`QueryBuilder.median`(column, low, high[, name])	Returns a quantile query requesting a median value, ready to be evaluated.
`QueryBuilder.min`(column, low, high[, name])	Returns a quantile query requesting a minimum value, ready to be evaluated.
`QueryBuilder.max`(column, low, high[, name])	Returns a quantile query requesting a maximum value, ready to be evaluated.
`QueryBuilder.histogram`(column, bin_edges[, name])	Returns a count query containing the frequency of values in specified column.
`QueryBuilder.get_groups`([columns])	Returns a query that gets combinations of values in the listed columns.
`QueryBuilder.get_bounds`(column[, ...])	Returns a query that gets approximate upper and lower bounds for a column.

Queries and post-processing#

These classes are returned by aggregations, and can be passed to evaluate(). Some of them, notably group-by counts, support additional post-processing operations that can be performed at the same time as query evaluation.

`Query`	A complete query, ready to be evaluated in a differentially private manner.
`GroupbyCountQuery`	Stores the plan for a differentially private groupby count calculation.

GroupbyCountQuery.suppress(threshold)

Returns a new query with an added postprocessing thresholding step.

Evaluating queries#

The evaluate() method is the main function used to compute queries with differential privacy. QueryBuilders can also be used to create views using create_view.

The Session also provides methods to add public tables and perform parallel partitioning.

`Session.evaluate`(query_expr, privacy_budget)	Answers a query within the given privacy budget and returns a Spark dataframe.
`Session.create_view`(query_expr, source_id, cache)	Creates a new view from a transformation and possibly cache it.
`Session.delete_view`(source_id)	Deletes a view and decaches it if it was cached.
`Session.add_public_dataframe`(source_id, ...)	Adds a public data source to the session.
`Session.partition_and_create`(source_id, ...)	Returns new sessions from a partition mapped to split name/`source_id`.
`Session.stop`()	Closes out this Session, allowing other Sessions to become active.

Column types and descriptors#

Objects and classes used to describe the schema of tables in a Session.

`ColumnType`	The supported SQL92 column types used by Tumult Analytics.
`ColumnDescriptor`	Information about a column.
`AnalyticsDefault`	Default values for each type of column in Tumult Analytics.

Tumult Platform