query_builder#

An API for building differentially private queries from basic operations.

The QueryBuilder class allows users to construct differentially private queries using SQL-like commands. These queries can then be used with Session to obtain results or construct views on which further queries can be run.

The QueryBuilder class can apply transformations, such as joins or filters, as well as compute aggregations like counts, sums, and standard deviations. See QueryBuilder for a full list of supported transformations and aggregations.

After each transformation, the QueryBuilder is modified and returned with that transformation applied. To re-use the transformations in a QueryBuilder as the base for multiple queries, create a view using create_view() and write queries on that view.

At any point, a QueryBuilder instance can have a group by followed by an aggregation applied to it, yielding a QueryExpr object. This QueryExpr can then be passed to evaluate() to obtain differentially private results to the query.

Data#

Row#

Type alias for a dictionary with string keys.
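Because Row is only a type alias, any plain dictionary with string keys can serve as one, for example as the input and output of the callables passed to map() and flat_map(). A minimal stand-in sketch (the alias itself lives in this module; the helper name double_b is made up for illustration):

```python
from typing import Any, Dict

# Stand-in mirroring the documented alias.
Row = Dict[str, Any]

# A map() callable takes one Row and returns one Row.
def double_b(row: Row) -> Row:
    return {"B_doubled": row["B"] * 2}

print(double_b({"A": "0", "B": 1, "X": 0}))  # {'B_doubled': 2}
```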

Classes#

ColumnDescriptor

Information about a column.

ColumnType

The supported SQL92 column types for Analytics data.

QueryBuilder

High-level interface for specifying DP queries.

GroupedQueryBuilder

A QueryBuilder that is grouped by a set of columns and can be aggregated.

class ColumnDescriptor#

Information about a column.

ColumnDescriptors have the following attributes:

column_type#

A ColumnType, specifying what type this column has.

allow_null#

bool. If True, this column allows null values.

allow_nan#

bool. If True, this column allows NaN values.

allow_inf#

bool. If True, this column allows infinite values.
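Together, the four attributes describe a column's type and which special values it may contain. As an illustration only, here is a stand-in dataclass with the documented attributes; the real class is tmlt.analytics.query_builder.ColumnDescriptor, and the constructor defaults shown here are an assumption:

```python
from dataclasses import dataclass
from enum import Enum

class ColumnType(Enum):  # stand-in for the documented enum
    INTEGER = "INTEGER"
    DECIMAL = "DECIMAL"
    VARCHAR = "VARCHAR"
    DATE = "DATE"
    TIMESTAMP = "TIMESTAMP"

@dataclass
class ColumnDescriptor:  # stand-in with the documented attributes
    column_type: ColumnType
    allow_null: bool = False
    allow_nan: bool = False
    allow_inf: bool = False

# A DECIMAL column that permits NaN but not null or infinite values:
desc = ColumnDescriptor(ColumnType.DECIMAL, allow_nan=True)
print(desc.column_type.name, desc.allow_nan)  # DECIMAL True
```

This is the shape used when passing new_column_types to map() or flat_map(), as in the flat_map() example below.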

class ColumnType#

Bases: enum.Enum

The supported SQL92 column types for Analytics data.

INTEGER#

Integer column type.

DECIMAL#

Floating-point column type.

VARCHAR#

String column type.

DATE#

Date column type.

TIMESTAMP#

Timestamp column type.

name(self)#

The name of the Enum member.

value(self)#

The value of the Enum member.
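name and value behave exactly as on any enum.Enum. A quick stand-in illustration (the member names match those documented above; the underlying values are an assumption):

```python
from enum import Enum

class ColumnType(Enum):
    # Stand-in members with the documented names.
    INTEGER = "INTEGER"
    DECIMAL = "DECIMAL"
    VARCHAR = "VARCHAR"

print(ColumnType.VARCHAR.name)                      # VARCHAR
print(ColumnType["INTEGER"] is ColumnType.INTEGER)  # True
```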

class QueryBuilder(source_id)#

High-level interface for specifying DP queries.

Each instance corresponds to applying a transformation; the full graph of QueryBuilder objects can be traversed from the root data source to any node.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a query
>>> query = QueryBuilder("my_private_data").count()
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      3

Methods#

query_expr()

Returns the query_expr being built.

join_public()

Updates the current query to join with a dataframe or public source.

join_private()

Updates the current query to join with another QueryBuilder.

replace_null_and_nan()

Updates the current query to replace null and NaN values in some columns.

replace_infinity()

Updates the current query to replace +inf and -inf values in some columns.

drop_null_and_nan()

Updates the current query to drop rows containing null or NaN values.

drop_infinity()

Updates the current query to drop rows containing infinite values.

rename()

Updates the current query to rename the columns.

filter()

Updates the current query to filter for rows matching a predicate.

select()

Updates the current query to select certain columns.

map()

Updates the current query to apply a mapping function to each row.

flat_map()

Updates the current query to apply a flat map.

bin_column()

Create a new column by assigning the values in a given column to bins.

histogram()

Returns a count query containing the frequency of values in a specified column.

groupby()

Groups the query by the given set of keys, returning a GroupedQueryBuilder.

count()

Returns a count query that is ready to be evaluated.

count_distinct()

Returns a count_distinct query that is ready to be evaluated.

quantile()

Returns a quantile query that is ready to be evaluated.

min()

Returns a quantile query requesting a minimum value, ready to be evaluated.

max()

Returns a quantile query requesting a maximum value, ready to be evaluated.

median()

Returns a quantile query requesting a median value, ready to be evaluated.

sum()

Returns a sum query that is ready to be evaluated.

average()

Returns an average query that is ready to be evaluated.

variance()

Returns a variance query that is ready to be evaluated.

stdev()

Returns a standard deviation query that is ready to be evaluated.

Parameters

source_id (str) –

__init__(source_id)#

Constructor.

Parameters

source_id (str) – The source id used in the query_expr.

property query_expr(self)#

Returns the query_expr being built.

join_public(self, public_table, join_columns=None)#

Updates the current query to join with a dataframe or public source.

This operation performs a natural join between two tables. This means that the resulting table will contain all columns unique to each input table, along with one copy of each common column. In most cases, the columns have the same names they did in the input tables.

By default, the input tables are joined on all common columns (i.e., columns whose names and data types match). However, if join_columns is given, the tables will be joined only on the given columns, and the remaining common columns will be disambiguated in the resulting table by the addition of a _left or _right suffix to their names. If given, join_columns must contain a non-empty subset of the tables’ common columns. For example, two tables with columns A,B,C and A,B,D would by default be joined on columns A and B, resulting in a table with columns A,B,C,D; if join_columns=["B"] were given when performing this join, the resulting table would have columns A_left,A_right,B,C,D. The order of columns in the resulting table is not guaranteed.

Note that columns must share both names and data types for them to be used in joining. If this condition is not met, one of the data sources must be transformed to be eligible for joining (e.g., by using rename() or map()).

Every row within a join group (i.e., every row that shares values in the join columns) from the private table will be joined with every row from that same group in the public table. For example, if a group has \(X\) rows in the private table and \(Y\) rows in the public table, then the output table will contain \(X*Y\) rows for this group.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> public_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> # Create a query joining with public_data as a dataframe:
>>> query = (
...     QueryBuilder("my_private_data")
...     .join_public(public_data)
...     .groupby(KeySet.from_dict({"C": [0, 1, 2]}))
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("C").toPandas()
   C  count
0  0      1
1  1      3
2  2      2
>>> # Alternatively, the dataframe can be added to the Session as a public
>>> # source, and its source ID can be used to perform the join:
>>> sess.add_public_dataframe(
...     source_id="my_public_data", dataframe=public_data
... )
>>> query = (
...     QueryBuilder("my_private_data")
...     .join_public("my_public_data")
...     .groupby(KeySet.from_dict({"C": [0, 1, 2]}))
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("C").toPandas()
   C  count
0  0      1
1  1      3
2  2      2
Parameters
  • public_table (Union[pyspark.sql.DataFrame, str]) – A dataframe or source ID for a public source to natural join with private data.

  • join_columns (Optional[Sequence[str]]) – The columns to join on. If join_columns is not specified, the tables will be joined on all common columns.

Return type

QueryBuilder

join_private(self, right_operand, truncation_strategy_left, truncation_strategy_right, join_columns=None)#

Updates the current query to join with another QueryBuilder.

This operation is a natural join, with the same behavior and requirements as join_public(). However, there is a key difference: before the join is performed, each table is truncated based on the corresponding TruncationStrategy.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> from tmlt.analytics.query_builder import TruncationStrategy
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> sess.create_view(
...     QueryBuilder("my_private_data")
...     .select(["A", "X"])
...     .rename({"X": "C"})
...     .query_expr,
...     source_id="my_private_view",
...     cache=False
... )
>>> # A query where only one record with each join key is kept on the left
>>> # table, but two are kept on the right table.
>>> query_drop_excess = (
...     QueryBuilder("my_private_data")
...     .join_private(
...         QueryBuilder("my_private_view"),
...         truncation_strategy_left=TruncationStrategy.DropExcess(1),
...         truncation_strategy_right=TruncationStrategy.DropExcess(2),
...     )
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query_drop_excess,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      3
>>> # A query where all records that share a join key with another record in
>>> # their table are dropped, in both the left and right tables.
>>> query_drop_non_unique = (
...     QueryBuilder("my_private_data")
...     .join_private(
...         QueryBuilder("my_private_view"),
...         truncation_strategy_left=TruncationStrategy.DropNonUnique(),
...         truncation_strategy_right=TruncationStrategy.DropNonUnique(),
...     )
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query_drop_non_unique,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      1
Parameters
  • right_operand (QueryBuilder) – The QueryBuilder for the table to join with.

  • truncation_strategy_left (TruncationStrategy) – The truncation strategy to apply to the left (current) table before joining.

  • truncation_strategy_right (TruncationStrategy) – The truncation strategy to apply to the right table before joining.

  • join_columns (Optional[Sequence[str]]) – The columns to join on. If join_columns is not specified, the tables will be joined on all common columns.

Return type

QueryBuilder

replace_null_and_nan(self, replace_with=None)#

Updates the current query to replace null and NaN values in some columns.

Note: if null values are replaced in a column, then Analytics will raise an error if you use a KeySet with a null value for that column.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
      A    B    X
0  None  0.0  0.0
1     1  NaN  1.1
2     2  2.0  NaN
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'DECIMAL', 'X': 'DECIMAL'}
>>> # Building a query with a replace_null_and_nan transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .replace_null_and_nan(
...         replace_with={
...             "A": "new_value",
...             "B": 1234,
...             "X": 56.78,
...         },
...     )
...     .groupby(KeySet.from_dict({"A": ["new_value", "1", "2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
           A  count
0          1      1
1          2      1
2  new_value      1
Parameters

replace_with (Optional[Mapping[str, Union[int, float, str, datetime.date, datetime.datetime]]]) – A dictionary mapping column names to values used to replace null and NaN values. If None (or empty), all columns will have null and NaN values replaced with Analytics defaults; see tmlt.analytics.query_expr.AnalyticsDefault.

Return type

QueryBuilder

replace_infinity(self, replace_with=None)#

Updates the current query to replace +inf and -inf values in some columns.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
    A    B    X
0  a1  0.0  0.0
1  a1  NaN -inf
2  a2  2.0  inf
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'DECIMAL', 'X': 'DECIMAL'}
>>> # Building a query with a replace_infinity transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .replace_infinity(
...         replace_with={
...             "X": (-100, 100),
...         },
...     )
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      2
1  a2      1
Parameters

replace_with (Optional[Dict[str, Tuple[float, float]]]) – A dictionary mapping column names to values used to replace -inf and +inf. If None (or empty), all columns will have infinite values replaced with Analytics defaults; see tmlt.analytics.query_expr.AnalyticsDefault.

Return type

QueryBuilder

drop_null_and_nan(self, columns)#

Updates the current query to drop rows containing null or NaN values.

Note: if null and NaN values are dropped from a column, then Analytics will raise an error if you use a KeySet that contains a null value for that column.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
    A    B    X
0  a1  2.0  0.0
1  a1  NaN  1.1
2  a2  2.0  NaN
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'DECIMAL', 'X': 'DECIMAL'}
>>> # Count query on the original data
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"], "B": [None, 2]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
    A    B  count
0  a1  NaN      1
1  a1  2.0      1
2  a2  NaN      0
3  a2  2.0      1
>>> # Building a query with a transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .drop_null_and_nan(columns=["B"])
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      1
1  a2      1
Parameters

columns (Optional[List[str]]) – A list of columns in which to look for invalid values. If None (or empty), then all columns will be considered (so if any column has an invalid value, then the row will be dropped).

Return type

QueryBuilder

drop_infinity(self, columns)#

Updates the current query to drop rows containing infinite values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.sort("A", "B", "X").toPandas()
    A  B    X
0  a1  1  1.1
1  a1  2  0.0
2  a2  2  inf
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'DECIMAL'}
>>> # Count query on the original data
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      2
1  a2      1
>>> # Building a query with a drop_infinity transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .drop_infinity(columns=["X"])
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      2
1  a2      0
Parameters

columns (Optional[List[str]]) – A list of columns in which to look for invalid values. If None (or empty), then all columns will be considered (so if any column has an invalid value, then the row will be dropped).

Return type

QueryBuilder

rename(self, column_mapper)#

Updates the current query to rename the columns.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a query with a rename transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .rename({"X": "C"})
...     .groupby(KeySet.from_dict({"C": [0, 1]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("C").toPandas()
   C  count
0  0      1
1  1      2
Parameters

column_mapper (Dict[str, str]) – A mapping of columns to new column names. Columns not specified in the mapper will remain the same.

Return type

QueryBuilder

filter(self, predicate)#

Updates the current query to filter for rows matching a predicate.

The predicate parameter accepts the same syntax as in PySpark’s filter() method: valid expressions are those that can be used in a WHERE clause in Spark SQL. Examples of valid predicates include:

  • age < 42

  • age BETWEEN 17 AND 42

  • age < 42 OR (age < 60 AND gender IS NULL)

  • LENGTH(name) > 17

  • favorite_color IN ('blue', 'red')

  • date = '2022-03-14'

  • time < '2022-01-01T12:45:00'

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a query with a filter transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .filter("A == '0'")
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      1
Parameters

predicate (str) – A string of SQL expressions specifying the filter to apply to the data. For example, the string “A > B” matches rows where column A is greater than column B.

Return type

QueryBuilder

select(self, columns)#

Updates the current query to select certain columns.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Create a new view using a select query
>>> query = (
...     QueryBuilder("my_private_data")
...     .select(["A", "B"])
... )
>>> sess.create_view(query, "selected_data", cache=True)
>>> # Inspect the schema of the resulting view
>>> sess.get_schema("selected_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER'}
Parameters

columns (Sequence[str]) – The columns to select.

Return type

QueryBuilder

map(self, f, new_column_types, augment=False)#

Updates the current query to apply a mapping function to each row.

If you provide only a ColumnType for the new column types, Analytics assumes that all new columns created may contain null values (and that DECIMAL columns may contain NaN or infinite values).

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a query with a map transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .map(
...         lambda row: {"new": row["B"]*2},
...         new_column_types={"new": 'INTEGER'},
...         augment=True
...     )
...     .groupby(KeySet.from_dict({"new": [0, 1, 2, 3, 4]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("new").toPandas()
   new  count
0    0      1
1    1      0
2    2      1
3    3      0
4    4      1
Parameters
  • f (Callable[[Row], Row]) – The function to be applied to each row. The function’s input is a dictionary matching each column name to its value for that row. This function should return a dictionary, which should always have the same keys regardless of input, and the values in that dictionary should match the column type specified in new_column_types. The function should not have any side effects (in particular, f cannot raise exceptions).

  • new_column_types (Mapping[str, Union[tmlt.analytics._schema.ColumnDescriptor, tmlt.analytics._schema.ColumnType]]) – Mapping from column names to types, for new columns produced by f. Using tmlt.analytics.query_builder.ColumnDescriptor is preferred.

  • augment (bool) – If True, add new columns to the existing dataframe (so new schema = old schema + schema_new_columns). If False, the new dataframe will have schema = schema_new_columns.

Return type

QueryBuilder

flat_map(self, f, max_num_rows, new_column_types, augment=False, grouping=False)#

Updates the current query to apply a flat map.

If you provide only a ColumnType for the new column types, Analytics assumes that all new columns created may contain null values (and that DECIMAL columns may contain NaN or infinite values).

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
3  1  3  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a query with a flat map transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .flat_map(
...         lambda row: [{"i_B": i} for i in range(int(row["B"])+1)],
...         max_num_rows=3,
...         new_column_types={"i_B": ColumnDescriptor(
...             ColumnType.INTEGER,
...             allow_null=False,
...         )},
...         augment=True,
...         grouping=False
...     )
...     .groupby(KeySet.from_dict({"B": [0, 1, 2, 3]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("B").toPandas()
   B  count
0  0      1
1  1      2
2  2      3
3  3      3
Parameters
  • f (Callable[[Row], List[Row]]) – The function to be applied to each row. The function’s input is a dictionary matching a column name to its value for that row. This function should return a list of dictionaries. Those dictionaries should always have the same keys regardless of input, and the values in those dictionaries should match the column type specified in new_column_types. The function should not have any side effects (in particular, f cannot raise exceptions).

  • max_num_rows (int) – The enforced limit on the number of rows from each f(row). If f produces more rows than this, only the first max_num_rows rows will be in the output.

  • new_column_types (Mapping[str, Union[str, tmlt.analytics._schema.ColumnDescriptor, tmlt.analytics._schema.ColumnType]]) – Mapping from column names to types, for new columns produced by f. Using tmlt.analytics.query_builder.ColumnDescriptor is preferred.

  • augment (bool) – If True, add new columns to the existing dataframe (so new schema = old schema + schema_new_columns). If False, the new dataframe will have schema = schema_new_columns.

  • grouping (bool) – Whether this produces a new column that we want to groupby. If True, this requires that any groupby aggregations following this query include the new column as a groupby column. Only one new column is supported, and the new column must have distinct values for each input row.

Return type

QueryBuilder

bin_column(self, column, spec, name=None)#

Create a new column by assigning the values in a given column to bins.

Example

>>> from tmlt.analytics.binning_spec import BinningSpec
>>> sess = Session.from_dataframe(
...     PureDPBudget(float("inf")),
...     source_id="private_data",
...     dataframe=my_private_data
... )
>>> my_private_data.toPandas()
   age  income
0   11       0
1   17       6
2   30      54
3   18      14
4   59     126
5   48     163
6   76     151
7   91      18
8   48      97
9   53      85
>>> age_binspec = BinningSpec(
...     [0, 18, 65, 100], include_both_endpoints=False
... )
>>> income_tax_rate_binspec = BinningSpec(
...     [0, 10, 40, 86, 165], names=[10, 12, 22, 24]
... )
>>> keys = KeySet.from_dict(
...     {
...         "age_binned": age_binspec.bins(),
...         "marginal_tax_rate": income_tax_rate_binspec.bins()
...     }
... )
>>> query = (
...     QueryBuilder("private_data")
...     .bin_column("age", age_binspec)
...     .bin_column(
...         "income", income_tax_rate_binspec, name="marginal_tax_rate"
...     )
...     .groupby(keys).count()
... )
>>> answer = sess.evaluate(query, PureDPBudget(float("inf")))
>>> answer.sort("age_binned", "marginal_tax_rate").toPandas()
   age_binned  marginal_tax_rate  count
0     (0, 18]                 10      2
1     (0, 18]                 12      1
2     (0, 18]                 22      0
3     (0, 18]                 24      0
4    (18, 65]                 10      0
5    (18, 65]                 12      0
6    (18, 65]                 22      2
7    (18, 65]                 24      3
8   (65, 100]                 10      0
9   (65, 100]                 12      1
10  (65, 100]                 22      0
11  (65, 100]                 24      1
Parameters
  • column (str) – Name of the column used to assign bins.

  • spec (tmlt.analytics.binning_spec.BinningSpec) – A BinningSpec that defines the binning operation to be performed.

  • name (Optional[str]) – The name of the column that will be created. If None (the default), the new column will be named by appending _binned to the input column name.

Return type

QueryBuilder

histogram(self, column, bin_edges, name=None)#

Returns a count query containing the frequency of values in a specified column.

Example

>>> from tmlt.analytics.binning_spec import BinningSpec
>>> private_data = spark.createDataFrame(
...     pd.DataFrame(
...         {
...          "income_thousands": [83, 85, 86, 73, 82, 95,
...                               74, 92, 71, 86, 97]
...         }
...     )
... )
>>> session = Session.from_dataframe(
...     privacy_budget=PureDPBudget(epsilon=float('inf')),
...     source_id="private_data",
...     dataframe=private_data,
... )
>>> income_binspec = BinningSpec(
...     bin_edges=[i for i in range(70,110,10)],
...     include_both_endpoints=False
... )
>>> binned_income_count_query = (
...     QueryBuilder("private_data")
...     .histogram("income_thousands", income_binspec, "income_binned")
... )
>>> binned_income_counts = session.evaluate(
...     binned_income_count_query,
...     privacy_budget=PureDPBudget(epsilon=10),
... )
>>> print(binned_income_counts.sort("income_binned").toPandas())
  income_binned  count
0      (70, 80]      3
1      (80, 90]      5
2     (90, 100]      3
Parameters
  • column (str) – Name of the column used to assign bins.

  • bin_edges (Union[Sequence[tmlt.analytics.binning_spec.BinT], tmlt.analytics.binning_spec.BinningSpec]) – The bin edges for the histogram; provided as either a BinningSpec or as a list of supported data types. Values outside the range of the provided bins, None types, and NaN values are all mapped to None (null in Spark).

  • name (Optional[str]) – The name of the column that will be created. If None (the default), the new column will be named by appending _binned to the input column name.

Return type

tmlt.analytics.query_expr.QueryExpr

groupby(self, keys)#

Groups the query by the given set of keys, returning a GroupedQueryBuilder.

Examples

>>> from tmlt.analytics.keyset import KeySet
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}

Answering a query with the exact groupby domain:

>>> groupby_keys = KeySet.from_dict({"A": ["0", "1"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2

Answering a query with an omitted domain value:

>>> groupby_keys = KeySet.from_dict({"A": ["0"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   A  count
0  0      1

Answering a query with an added domain value:

>>> groupby_keys = KeySet.from_dict({"A": ["0", "1", "2"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2
2  2      0

Answering a query with a multi-column domain:

>>> groupby_keys = KeySet.from_dict(
...    {"A": ["0", "1"], "B": [0, 1, 2]}
... )
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  0      0
1  0  1      1
2  0  2      0
3  1  0      1
4  1  1      0
5  1  2      1
Answering a query with a multi-column domain and structural zeros:
>>> # Suppose it is known that A and B cannot be equal. This set of
>>> # groupby keys prevents those impossible values from being computed.
>>> keys_df = pd.DataFrame({
...     "A": ["0", "0", "1", "1"],
...     "B": [1, 2, 0, 2],
... })
>>> groupby_keys = KeySet.from_dataframe(spark.createDataFrame(keys_df))
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  1      1
1  0  2      0
2  1  0      1
3  1  2      1
Parameters

keys (tmlt.analytics.keyset.KeySet) – A KeySet giving the set of groupby keys to be used when performing an aggregation.

Return type

GroupedQueryBuilder

count(self, name=None, mechanism=CountMechanism.DEFAULT)#

Returns a count query that is ready to be evaluated.

Note that differentially private counts may return values that are not possible for a non-DP query, including negative values. You can enforce non-negativity once the query returns its results; see the example below.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a count query
>>> query = (
...     QueryBuilder("my_private_data")
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      3
>>> # Ensuring all results are non-negative
>>> import pyspark.sql.functions as sf
>>> answer = answer.withColumn(
...     "count", sf.when(sf.col("count") < 0, 0).otherwise(
...             sf.col("count")
...     )
... )
>>> answer.toPandas()
   count
0      3
Parameters
  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count”.

  • mechanism (tmlt.analytics.query_expr.CountMechanism) – Choice of noise mechanism. By DEFAULT, the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

count_distinct(self, cols=None, name=None, mechanism=CountDistinctMechanism.DEFAULT)#

Returns a count_distinct query that is ready to be evaluated.

Note that differentially private counts may return values that are not possible for a non-DP query, including negative values. You can enforce non-negativity once the query returns its results; see the example below.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .count_distinct()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count_distinct
0               3
>>> # Ensuring all results are non-negative
>>> import pyspark.sql.functions as sf
>>> answer = answer.withColumn(
...     "count_distinct", sf.when(
...         sf.col("count_distinct") < 0, 0
...     ).otherwise(
...         sf.col("count_distinct")
...     )
... )
>>> answer.toPandas()
   count_distinct
0               3
Parameters
  • cols (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.

  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.

  • mechanism (tmlt.analytics.query_expr.CountDistinctMechanism) – Choice of noise mechanism. By DEFAULT, the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

quantile(self, column, quantile, low, high, name=None)#

Returns a quantile query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.
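The drop described above can be sketched in plain Python (this sketch is not the Analytics API itself; the library performs the equivalent removal internally before computing the quantile):

```python
import math

# A column of measurements containing a null (None) and a NaN value.
values = [1.0, float("nan"), 2.0, None, 3.0]

# Equivalent of a drop_null_and_nan query: discard None and NaN so
# only well-defined numbers reach the quantile computation.
cleaned = [v for v in values if v is not None and not math.isnan(v)]
print(sorted(cleaned))  # [1.0, 2.0, 3.0]
```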

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .quantile(column="B", quantile=0.6, low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
   B_quantile(0.6)
0         1.331107
Parameters
  • column (str) – The column to compute the quantile over.

  • quantile (float) – A number between 0 and 1 specifying the quantile to compute. For example, 0.5 would compute the median.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_quantile({quantile})”.

Return type

tmlt.analytics.query_expr.QueryExpr

min(self, column, low, high, name=None)#

Returns a quantile query requesting a minimum value, ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a min query
>>> query = (
...     QueryBuilder("my_private_data")
...     .min(column="B", low=0, high=5, name="min_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
      min_B
0  0.213415
Parameters
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_min”.

Return type

tmlt.analytics.query_expr.QueryExpr

max(self, column, low, high, name=None)#

Returns a quantile query requesting a maximum value, ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a max query
>>> query = (
...     QueryBuilder("my_private_data")
...     .max(column="B", low=0, high=5, name="max_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
      max_B
0  2.331871
Parameters
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_max”.

Return type

tmlt.analytics.query_expr.QueryExpr

median(self, column, low, high, name=None)#

Returns a quantile query requesting a median value, ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a median query
>>> query = (
...     QueryBuilder("my_private_data")
...     .median(column="B", low=0, high=5, name="median_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
   median_B
0  1.221197
Parameters
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_median”.

Return type

tmlt.analytics.query_expr.QueryExpr

sum(self, column, low, high, name=None, mechanism=SumMechanism.DEFAULT)#

Returns a sum query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Note

Regarding the clamping params:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.
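Point 2 can be made concrete with a small plain-Python sketch (clamp is a hypothetical helper, not part of the Analytics API; the library clamps internally before adding noise):

```python
def clamp(value, low, high):
    # Clamp value into the closed interval [low, high].
    return min(max(value, low), high)

# With bounds low=0 and high=2, out-of-range values are pulled to
# the nearest bound before the noisy sum is computed.
data = [-5, 1, 3]
clamped = [clamp(v, 0, 2) for v in data]
print(clamped, sum(clamped))  # [0, 1, 2] 3
```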

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a sum query
>>> query = (
...     QueryBuilder("my_private_data")
...     .sum(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_sum
0      3
Parameters
  • column (str) – The column to compute the sum over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_sum”.

  • mechanism (tmlt.analytics.query_expr.SumMechanism) – Choice of noise mechanism. By DEFAULT, the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

average(self, column, low, high, name=None, mechanism=AverageMechanism.DEFAULT)#

Returns an average query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Note

Regarding the clamping params:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building an average query
>>> query = (
...     QueryBuilder("my_private_data")
...     .average(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_average
0        1.0
Parameters
  • column (str) – The column to compute the average over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_average”.

  • mechanism (tmlt.analytics.query_expr.AverageMechanism) – Choice of noise mechanism. By DEFAULT, the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

variance(self, column, low, high, name=None, mechanism=VarianceMechanism.DEFAULT)#

Returns a variance query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Note

Regarding the clamping params:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a variance query
>>> query = (
...     QueryBuilder("my_private_data")
...     .variance(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_variance
0    0.666667
Parameters
  • column (str) – The column to compute the variance over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_variance”.

  • mechanism (tmlt.analytics.query_expr.VarianceMechanism) – Choice of noise mechanism. By DEFAULT, the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

stdev(self, column, low, high, name=None, mechanism=StdevMechanism.DEFAULT)#

Returns a standard deviation query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Note

Regarding the clamping params:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a standard deviation query
>>> query = (
...     QueryBuilder("my_private_data")
...     .stdev(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
    B_stdev
0  0.816497
Parameters
  • column (str) – The column to compute the stdev over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_stdev”.

  • mechanism (tmlt.analytics.query_expr.StdevMechanism) – Choice of noise mechanism. By DEFAULT, the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

class GroupedQueryBuilder(source_id, query_expr, groupby_keys)#

A QueryBuilder that is grouped by a set of columns and can be aggregated.

__init__(source_id, query_expr, groupby_keys)#

Constructor.

Do not construct directly; use groupby().

count(self, name=None, mechanism=CountMechanism.DEFAULT)#

Returns a count query that is ready to be evaluated.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby count query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2
Parameters
  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count”.

  • mechanism (tmlt.analytics.query_expr.CountMechanism) – Choice of noise mechanism. By DEFAULT, the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

count_distinct(self, cols=None, name=None, mechanism=CountDistinctMechanism.DEFAULT)#

Returns a count_distinct query that is ready to be evaluated.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .count_distinct(["B", "X"])
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count_distinct(B, X)
0  0                     1
1  1                     2
Parameters
  • cols (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.

  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.

  • mechanism (tmlt.analytics.query_expr.CountDistinctMechanism) – Choice of noise mechanism. By DEFAULT, the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

quantile(self, column, quantile, low, high, name=None)#

Returns a quantile query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .quantile(column="B", quantile=0.6, low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A  B_quantile(0.6)
0  0         1.331107
1  1         1.331107
Parameters
  • column (str) – The column to compute the quantile over.

  • quantile (float) – A number between 0 and 1 specifying the quantile to compute. For example, 0.5 would compute the median.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_quantile({quantile})”.

Return type

tmlt.analytics.query_expr.QueryExpr

min(self, column, low, high, name=None)#

Returns a quantile query requesting a minimum value, ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby min query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .min(column="B", low=0, high=5, name="min_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A     min_B
0  0  0.213415
1  1  0.213415
Parameters
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_min”.

Return type

tmlt.analytics.query_expr.QueryExpr

max(self, column, low, high, name=None)#

Returns a quantile query requesting a maximum value, ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby max query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .max(column="B", low=0, high=5, name="max_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A     max_B
0  0  2.331871
1  1  2.331871
Parameters
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_max”.

Return type

tmlt.analytics.query_expr.QueryExpr

median(self, column, low, high, name=None)#

Returns a quantile query requesting a median value, ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby median query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .median(column="B", low=0, high=5, name="median_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A  median_B
0  0  1.221197
1  1  1.221197
Parameters
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_median”.

Return type

tmlt.analytics.query_expr.QueryExpr

sum(self, column, low, high, name=None, mechanism=SumMechanism.DEFAULT)#

Returns a sum query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Note

Regarding the clamping params:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby sum query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .sum(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_sum
0  0      1
1  1      2
Parameters
  • column (str) – The column to compute the sum over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f”{column}_sum”.

  • mechanism (tmlt.analytics.query_expr.SumMechanism) – Choice of noise mechanism. By DEFAULT, the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

average(self, column, low, high, name=None, mechanism=AverageMechanism.DEFAULT)#

Returns an average query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Note

Regarding the clamping params:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby average query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .average(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_average
0  0        1.0
1  1        1.0

Parameters
  • column (str) – The column to compute the average over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_average".

  • mechanism (tmlt.analytics.query_expr.AverageMechanism) – Choice of noise mechanism. When set to AverageMechanism.DEFAULT (the default), the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

variance(self, column, low, high, name=None, mechanism=VarianceMechanism.DEFAULT)#

Returns a variance query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Note

Regarding the clamping parameters:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. There is a trade-off: the narrower the range, the less noise must be added; wider bounds keep more of the data intact, but require more noise in the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby variance query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .variance(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_variance
0  0         1.0
1  1         1.0

Parameters
  • column (str) – The column to compute the variance over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_variance".

  • mechanism (tmlt.analytics.query_expr.VarianceMechanism) – Choice of noise mechanism. When set to VarianceMechanism.DEFAULT (the default), the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr

stdev(self, column, low, high, name=None, mechanism=StdevMechanism.DEFAULT)#

Returns a standard deviation query that is ready to be evaluated.

Note that if the column being measured contains NaN or null values, a drop_null_and_nan query will be performed first. If the column being measured contains infinite values, a drop_infinity query will be performed first.

Note

Regarding the clamping parameters:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. There is a trade-off: the narrower the range, the less noise must be added; wider bounds keep more of the data intact, but require more noise in the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby standard deviation query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .stdev(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_stdev
0  0      1.0
1  1      1.0

Parameters
  • column (str) – The column to compute the standard deviation over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low < high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_stdev".

  • mechanism (tmlt.analytics.query_expr.StdevMechanism) – Choice of noise mechanism. When set to StdevMechanism.DEFAULT (the default), the framework automatically selects an appropriate mechanism.

Return type

tmlt.analytics.query_expr.QueryExpr
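As a sanity check on how these aggregations relate: ignoring the added noise, the standard deviation is the square root of the variance computed on the same clamped data. A minimal non-private sketch using the standard library (this uses the population statistics; it does not show which estimator the library itself computes):

```python
import statistics

data = [0, 1, 2, 1]  # already within the clamping bounds (0, 2)
var = statistics.pvariance(data)
sd = statistics.pstdev(data)
# stdev == sqrt(variance) on the same data
assert abs(sd - var ** 0.5) < 1e-12
print(var, sd)  # 0.5 0.7071067811865476
```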