Building queries
The QueryBuilder class allows users to construct differentially private
queries using a PySpark-like syntax.
QueryBuilder implements transformations such as joins, maps, or filters.
Using a transformation method returns a new QueryBuilder with that
transformation applied. To re-use the transformations in a QueryBuilder as
the base for multiple queries, users can create a view using create_view()
and write queries on that view.
QueryBuilder instances can also have an aggregation like count() applied to
them, potentially after a groupby(), yielding an object that can be passed to
evaluate() to obtain differentially private results to the query.
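The chaining behavior described above can be pictured with a toy model. The sketch below is not Tumult Analytics code: `ToyBuilder` and its methods are hypothetical names, written in plain Python, that only mimic the idea that each transformation returns a new builder and leaves the original untouched, so one base can feed several queries.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyBuilder:
    """Hypothetical stand-in for a query builder; NOT the real QueryBuilder."""

    table: str
    steps: Tuple[str, ...] = ()

    def filter(self, condition: str) -> "ToyBuilder":
        # Return a new builder; the current one is never mutated.
        return ToyBuilder(self.table, self.steps + (f"filter({condition})",))

    def count(self) -> Tuple[str, ...]:
        # An aggregation closes the plan; here the plan is just a tuple of steps.
        return (f"table({self.table})",) + self.steps + ("count()",)


base = ToyBuilder("members")
adults = base.filter("age >= 18")  # base is unchanged and can be reused
plan = adults.count()
```

Because the dataclass is frozen, reusing `base` for a second query cannot accidentally pick up the filter added for the first one.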
QueryBuilder initialization
The QueryBuilder constructor takes a single argument: the table on which to
apply the query.
| High-level interface for specifying DP queries. |
Transformations
QueryBuilders implement a variety of transformations, which all yield a new
QueryBuilder instance with that transformation applied. At this stage, the query
cannot yet be evaluated in a differentially private manner, but users can create
views using create_view() on a transformation.
Schema manipulation
Transformations that manipulate table schemas.
| Selects the specified columns, dropping the others. |
| Renames one or more columns in the table. |
Special value handling
Transformations that replace or remove special column values such as null, NaN, or infinite values.
| Replaces null and NaN values in specified columns. |
| Removes rows containing null or NaN values. |
| Replaces +inf and -inf values in specified columns. |
| Removes rows containing infinite values. |
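To make the difference between replacing and dropping special values concrete, here is a small plain-Python sketch. It operates on lists of dictionaries rather than on Spark tables, and the function names ending in `_toy` are hypothetical helpers written for illustration, not library calls.

```python
import math
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def is_null_or_nan(value: Optional[float]) -> bool:
    return value is None or math.isnan(value)


def replace_null_and_nan_toy(rows: List[Row], column: str, default: float) -> List[Row]:
    # Substitute a default for None/NaN, leaving other values as they are.
    return [
        {**row, column: default} if is_null_or_nan(row[column]) else dict(row)
        for row in rows
    ]


def drop_infinity_toy(rows: List[Row], column: str) -> List[Row]:
    # Keep only rows whose value is not +inf or -inf (None is kept here).
    return [
        row for row in rows
        if row[column] is None or not math.isinf(row[column])
    ]


rows = [{"x": 1.0}, {"x": None}, {"x": float("nan")}, {"x": float("inf")}]
replaced = replace_null_and_nan_toy(rows, "x", 0.0)  # same number of rows
finite = drop_infinity_toy(replaced, "x")            # the +inf row is removed
```

Replacing keeps the row count unchanged, while dropping shrinks the table. That distinction matters when later aggregations depend on how many rows remain.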
Filters and maps
Transformations that remove or modify rows of private tables, according to user-specified predicates or functions.
| Filters rows matching a condition. |
| Applies a mapping function to each row. |
| Applies a mapping function to each row, returning zero or more rows. |
| Applies a transformation to each group of records sharing an ID. |
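The three row-level operations differ mainly in how many output rows each input row can produce. The following sketch shows that contrast on plain Python lists. The helper names are hypothetical and are not the library's methods. A differentially private system also has to bound how many rows a single input can generate, which is omitted here.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def filter_rows(rows: List[Row], keep: Callable[[Row], bool]) -> List[Row]:
    # Zero or one output row per input row.
    return [row for row in rows if keep(row)]


def map_rows(rows: List[Row], f: Callable[[Row], Row]) -> List[Row]:
    # Exactly one output row per input row.
    return [f(row) for row in rows]


def flat_map_rows(rows: List[Row], f: Callable[[Row], List[Row]]) -> List[Row]:
    # Zero or more output rows per input row.
    return [out for row in rows for out in f(row)]


rows = [{"name": "a", "genres": "x,y"}, {"name": "b", "genres": ""}]
split = flat_map_rows(
    rows,
    lambda r: [
        {"name": r["name"], "genre": g}
        for g in str(r["genres"]).split(",")
        if g  # skip empty entries, so row "b" yields no output rows
    ],
)
```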
Binning
A transformation that groups together nearby values in numerical, date, or timestamp columns, according to user-specified bins.
| Creates a new column by assigning the values in a given column to bins. |
| A spec object defining an operation where values are assigned to bins. |
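Assigning values to bins can be sketched in plain Python as follows. This is an illustration only: `assign_bin` is a hypothetical helper, and it assumes half-open bins of the form `[low, high)`. The endpoint convention and bin labels used by the library may differ.

```python
from bisect import bisect_right
from typing import List, Optional


def assign_bin(value: float, edges: List[float]) -> Optional[str]:
    """Map value to a label such as '[0, 18)'; None if it falls outside all bins."""
    if value < edges[0] or value >= edges[-1]:
        return None
    # Index of the last edge that is <= value.
    i = bisect_right(edges, value) - 1
    return f"[{edges[i]}, {edges[i + 1]})"


edges = [0, 18, 65, 120]  # sorted bin edges (assumed example values)
labels = [assign_bin(age, edges) for age in (5, 18, 64, 120)]
```

Binning before a group-by keeps the set of group keys small and known in advance, which is why it pairs naturally with grouped aggregations.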
Constraints
enforce() truncates the sensitive data to limit the maximum impact of the
protected change. More information about it can be found in the
Working with privacy IDs tutorial.
| Enforces a constraint on the table. |
| Base class representing a known, enforceable fact about a table. |
| A constraint limiting the number of rows associated with each ID in a table. |
| A constraint limiting the number of distinct groups per ID. |
| A constraint limiting rows per unique (ID, grouping column) pair in a table. |
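The effect of a rows-per-ID limit can be illustrated with a short sketch. It is a plain-Python toy, not the library's truncation routine: `truncate_rows_per_id` is a hypothetical helper that keeps the first rows it encounters for each ID, whereas the library may select which rows to keep differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (privacy ID, value)


def truncate_rows_per_id(rows: List[Row], max_rows: int) -> List[Row]:
    """Keep at most max_rows rows for each ID, in input order."""
    seen: Dict[str, int] = defaultdict(int)
    kept = []
    for row in rows:
        user_id = row[0]
        if seen[user_id] < max_rows:
            kept.append(row)
            seen[user_id] += 1
    return kept


rows = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "d")]
truncated = truncate_rows_per_id(rows, max_rows=2)  # u1 loses its third row
```

After truncation, removing one ID changes the table by at most `max_rows` rows, which is what bounds that ID's impact on any later aggregation.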
Joins
Transformations that join the sensitive data with public, non-sensitive data, or with another sensitive data source.
| Joins the table with a DataFrame or a public source. |
| Joins the table with another sensitive data source. |
Group-by
A transformation that groups the data by the value of one or more columns. The
group-by keys can be specified using a KeySet; more information about it can
be found in the Group-by queries tutorial. The transformation returns a
GroupedQueryBuilder, an object representing a partial query on which only
aggregations can be run.
| Groups the query by the given set of keys, returning a GroupedQueryBuilder. |
| A class containing a set of values for specific columns. |
| A QueryBuilder that is grouped by a set of columns and can be aggregated. |
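One way to picture grouping by a fixed set of keys is shown below. The sketch is plain Python with hypothetical helper names. It assumes the keys are the cross product of per-column value lists, and it computes exact counts with no noise. The point is only that the output keys come from the key set, not from the data.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple


def keys_from_dict(domains: Dict[str, List[str]]) -> List[Tuple[str, ...]]:
    # Cross product of the listed values, one tuple per group-by key.
    return list(product(*domains.values()))


def grouped_counts(
    rows: List[Dict[str, str]], domains: Dict[str, List[str]]
) -> Dict[Tuple[str, ...], int]:
    columns = list(domains)
    observed = Counter(tuple(row[c] for c in columns) for row in rows)
    # Only keys from the key set are reported; absent ones get a count of 0.
    return {key: observed.get(key, 0) for key in keys_from_dict(domains)}


rows = [
    {"genre": "fiction", "level": "adult"},
    {"genre": "fiction", "level": "adult"},
    {"genre": "poetry", "level": "child"},  # "poetry" is not in the key set
]
counts = grouped_counts(
    rows, {"genre": ["fiction", "history"], "level": ["adult", "child"]}
)
```

Fixing the keys up front means the mere presence or absence of a group in the output reveals nothing about the data.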
Aggregations
These aggregations return a Query that can be evaluated with differential
privacy. They can be used after a groupby() operation on a
GroupedQueryBuilder, or on a QueryBuilder directly.
| Returns a count query ready to be evaluated. |
| Returns a count_distinct query ready to be evaluated. |
| Returns a sum query ready to be evaluated. |
| Returns an average query ready to be evaluated. |
| Returns a variance query ready to be evaluated. |
| Returns a standard deviation query ready to be evaluated. |
| Returns a quantile query ready to be evaluated. |
| Returns a quantile query requesting a median value, ready to be evaluated. |
| Returns a quantile query requesting a minimum value, ready to be evaluated. |
| Returns a quantile query requesting a maximum value, ready to be evaluated. |
| Returns a count query containing the frequency of values in the specified column. |
| Returns a query that gets combinations of values in the listed columns. |
| Returns a query that gets approximate upper and lower bounds for a column. |
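For intuition about what evaluating a count with differential privacy involves, here is a textbook Laplace-mechanism sketch in plain Python. It is an assumption-laden illustration, not the library's implementation: the function names are hypothetical, the protected change is assumed to be adding or removing one row, and the noise distribution the library actually uses is not stated here and may differ.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    # A count changes by at most 1 when one row is added or removed,
    # so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only to make the sketch repeatable
samples = [noisy_count(100, epsilon=1.0, rng=rng) for _ in range(20000)]
mean = sum(samples) / len(samples)  # close to 100, since the noise is centered at 0
```

A smaller epsilon gives a larger noise scale, so stronger privacy comes at the cost of less accurate answers.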
Queries and post-processing
These classes are returned by aggregations, and can be passed to evaluate().
Some of them, notably group-by counts, support additional post-processing
operations that can be performed at the same time as query evaluation.
| A complete query, ready to be evaluated in a differentially private manner. |
| Stores the plan for a differentially private groupby count calculation. |
| Returns a new query with an added postprocessing thresholding step. |
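A thresholding step of this kind can be sketched as a simple filter over already-noised counts. The helper below is hypothetical and assumes that groups strictly below the threshold are dropped while groups at or above it are kept. The library's exact boundary behavior is not confirmed here.

```python
from typing import Dict


def suppress_below(noisy_counts: Dict[str, float], threshold: float) -> Dict[str, float]:
    # Post-processing: drop groups whose noisy count is below the threshold.
    return {k: v for k, v in noisy_counts.items() if v >= threshold}


released = suppress_below({"a": 12.3, "b": -0.7, "c": 5.0}, threshold=5)
```

Since the filter only looks at the noisy outputs, it consumes no additional privacy budget. It mainly removes small or negative counts that are dominated by noise.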
Evaluating queries
The evaluate() method is the main function used to compute queries with
differential privacy. QueryBuilders can also be used to create views using
create_view().
The Session also provides methods to add public tables and perform parallel
partitioning.
| Answers a query within the given privacy budget and returns a Spark DataFrame. |
| Creates a new view from a transformation and possibly caches it. |
| Deletes a view and decaches it if it was cached. |
| Adds a public data source to the session. |
| Returns new sessions from a partition mapped to split name/ |
| Closes out this Session, allowing other Sessions to become active. |
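Answering queries "within the given privacy budget" implies some form of budget bookkeeping. The toy class below sketches that idea under basic sequential composition with a pure epsilon budget. `ToySession` and `spend` are hypothetical names and do not reflect the real Session interface.

```python
class ToySession:
    """Hypothetical budget tracker; NOT the real Session class."""

    def __init__(self, total_epsilon: float) -> None:
        self.remaining = total_epsilon

    def spend(self, epsilon: float) -> None:
        # Under sequential composition, the epsilons of successive queries add up.
        if epsilon > self.remaining:
            raise RuntimeError("not enough privacy budget left")
        self.remaining -= epsilon


session = ToySession(total_epsilon=1.0)
session.spend(0.25)  # first query
session.spend(0.5)   # second query; 0.25 of the budget is left
```

Refusing a query that would exceed the remaining budget is what keeps the total privacy loss of the session bounded.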
Column types and descriptors
Objects and classes used to describe the schema of tables in a Session.
| The supported SQL92 column types used by Tumult Analytics. |
| Information about a column. |
| Default values for each type of column in Tumult Analytics. |