query_builder#

An API for building differentially private queries from basic operations.

The QueryBuilder class allows users to construct differentially private queries using SQL-like commands. These queries can then be used with a Session to obtain results or construct views on which further queries can be run.

The QueryBuilder class can apply transformations, such as joins or filters, as well as compute aggregations like counts, sums, and standard deviations. See QueryBuilder for a full list of supported transformations and aggregations.

After each transformation, the QueryBuilder is modified and returned with that transformation applied. To re-use the transformations in a QueryBuilder as the base for multiple queries, create a view using create_view() and write queries on that view.

At any point, a QueryBuilder instance can have an aggregation like count() applied to it, potentially after a groupby(), yielding an object that can be passed to evaluate() to obtain differentially private results to the query.

Data#

Row#

Type alias for a dictionary with string keys.

Classes#

AnalyticsDefault

Default values for each type of column in Tumult Analytics.

AverageMechanism

Possible mechanisms for the average() aggregation.

CountDistinctMechanism

Enumerating the possible mechanisms used for the count_distinct aggregation.

CountMechanism

Possible mechanisms for the count() aggregation.

StdevMechanism

Possible mechanisms for the stdev() aggregation.

SumMechanism

Possible mechanisms for the sum() aggregation.

VarianceMechanism

Possible mechanisms for the variance() aggregation.

ColumnDescriptor

Information about a column.

ColumnType

The supported SQL92 column types for Analytics data.

Query

Instances of the Query class represent expressions within Tumult Analytics.

GroupbyCountQuery

Stores the plan for a differentially private groupby count calculation.

QueryBuilder

High-level interface for specifying DP queries.

GroupedQueryBuilder

A QueryBuilder that is grouped by a set of columns and can be aggregated.

class AnalyticsDefault#

Default values for each type of column in Tumult Analytics.

INTEGER = 0#

The default value used for integers (0).

DECIMAL = 0.0#

The default value used for floats (0).

VARCHAR = ''#

The default value used for VARCHARs (the empty string).

DATE#

The default value used for dates (datetime.date.fromtimestamp(0)).

See fromtimestamp().

TIMESTAMP#

The default value used for timestamps (datetime.datetime.fromtimestamp(0)).

See fromtimestamp().

class AverageMechanism#

Bases: enum.Enum

Possible mechanisms for the average() aggregation.

Currently, the average() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Laplace and/or double-sided geometric noise is used, depending on the column type.

GAUSSIAN#

Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class CountDistinctMechanism#

Bases: enum.Enum

Enumerating the possible mechanisms used for the count_distinct aggregation.

Currently, the count_distinct() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Double-sided geometric noise is used.

GAUSSIAN#

The discrete Gaussian mechanism is used. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class CountMechanism#

Bases: enum.Enum

Possible mechanisms for the count() aggregation.

Currently, the count() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Double-sided geometric noise is used.

GAUSSIAN#

The discrete Gaussian mechanism is used. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class StdevMechanism#

Bases: enum.Enum

Possible mechanisms for the stdev() aggregation.

Currently, the stdev() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Laplace and/or double-sided geometric noise is used, depending on the column type.

GAUSSIAN#

Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class SumMechanism#

Bases: enum.Enum

Possible mechanisms for the sum() aggregation.

Currently, the sum() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Laplace and/or double-sided geometric noise is used, depending on the column type.

GAUSSIAN#

Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class VarianceMechanism#

Bases: enum.Enum

Possible mechanisms for the variance() aggregation.

Currently, the variance() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT#

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE#

Laplace and/or double-sided geometric noise is used, depending on the column type.

GAUSSIAN#

Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class ColumnDescriptor#

Information about a column.

ColumnDescriptors have the following attributes:

column_type#

A ColumnType, specifying what type this column has.

allow_null#

bool. If True, this column allows null values.

allow_nan#

bool. If True, this column allows NaN values.

allow_inf#

bool. If True, this column allows infinite values.

class ColumnType#

Bases: enum.Enum

The supported SQL92 column types for Analytics data.

INTEGER#

Integer column type.

DECIMAL#

Floating-point column type.

VARCHAR#

String column type.

DATE#

Date column type.

TIMESTAMP#

Timestamp column type.

name()#

The name of the Enum member.

value()#

The value of the Enum member.

class Query#

Instances of the Query class represent expressions within Tumult Analytics.

A Query and its subclasses should not be directly constructed or deconstructed; use QueryBuilder to create them. A Query can be passed to evaluate() in order to evaluate the query and obtain differentially private results.

__eq__(other)#

Determines query equitability based on the underlying query expression.

Parameters:

other (Any) –

class GroupbyCountQuery#

Bases: Query

Stores the plan for a differentially private groupby count calculation.

__eq__(other)#

Determines query equitability based on the underlying query expression.

Parameters:

other (Any) –

suppress(threshold)#

Returns a SuppressAggregates query that is ready to be evaluated.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby count query and suppressing results < 1
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1", "2"]}))
...     .count()
...     .suppress(1)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2
Parameters:

threshold (float) – Threshold value. All results with a lower value in the aggregated column will be suppressed.

Return type:

Query

class QueryBuilder(source_id)#

High-level interface for specifying DP queries.

Each instance corresponds to applying a transformation. The full graph of QueryBuilder objects can be traversed from root to a node.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a query
>>> query = QueryBuilder("my_private_data").count()
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      3
Methods#

join_public()

Joins the table with a DataFrame or a public source.

join_private()

Join the table with another QueryBuilder.

replace_null_and_nan()

Replaces null and NaN values in specified columns.

replace_infinity()

Replaces +inf and -inf values in specified columns.

drop_null_and_nan()

Removes rows containing null or NaN values.

drop_infinity()

Remove rows containing infinite values.

rename()

Renames one or more columns in the table.

filter()

Filter rows matching a condition.

select()

Selects the specified columns, dropping the others.

map()

Applies a mapping function to each row.

flat_map()

Applies a mapping function to each row, returning zero or more rows.

flat_map_by_id()

Applies a transformation to each group of records sharing an ID.

bin_column()

Creates a new column by assigning the values in a given column to bins.

histogram()

Returns a count query containing the frequency of values in specified column.

enforce()

Enforces a Constraint on the table.

get_groups()

Returns a query that gets combinations of values in the listed columns.

get_bounds()

Returns a query that gets approximate upper and lower bounds for a column.

groupby()

Groups the query by the given set of keys, returning a GroupedQueryBuilder.

count()

Returns a count query ready to be evaluated.

count_distinct()

Returns a count_distinct query ready to be evaluated.

quantile()

Returns a quantile query ready to be evaluated.

min()

Returns a quantile query requesting a minimum value, ready to be evaluated.

max()

Returns a quantile query requesting a maximum value, ready to be evaluated.

median()

Returns a quantile query requesting a median value, ready to be evaluated.

sum()

Returns a sum query ready to be evaluated.

average()

Returns an average query ready to be evaluated.

variance()

Returns a variance query ready to be evaluated.

stdev()

Returns a standard deviation query ready to be evaluated.

Parameters:

source_id (str) –

__init__(source_id)#

Constructor.

Parameters:

source_id (str) – The source id used in the query_expr.

join_public(public_table, join_columns=None, how='inner')#

Joins the table with a DataFrame or a public source.

This operation is either an inner or left join.

This operation performs a natural join between two tables. This means that the resulting table will contain all columns unique to each input table, along with one copy of each common column. In most cases, the columns have the same names they did in the input tables.

By default, the input tables are joined on all common columns (i.e., columns whose names and data types match). However if join_columns is given, the tables will be joined only on the given columns, and the remaining common columns will be disambiguated in the resulting table by the addition of a _left or _right suffix to their names. If given, join_columns must contain a non-empty subset of the tables’ common columns. For example, two tables with columns A,B,C and A,B,D would by default be joined on columns A and B, resulting in a table with columns A,B,C,D; if join_columns=["B"] were given when performing this join, the resulting table would have columns A_left,A_right,B,C,D. The order of columns in the resulting table is not guaranteed.

Note

Columns must share both names and data types for them to be used in joining. If this condition is not met, one of the data sources must be transformed to be eligible for joining (e.g., by using rename() or map()).

Every row within a join group (i.e., every row that shares values in the join columns) from the private table will be joined with every row from that same group in the public table. For example, if a group has \(X\) rows in the private table and \(Y\) rows in the public table, then the output table will contain \(X*Y\) rows for this group.

Note

Tables with a ProtectedChange of AddRowsWithID must include the privacy ID column in the join columns.

An illustrated example can be found in the Simple transformations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> public_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Create a query joining with public_data as a dataframe:
>>> query = (
...     QueryBuilder("my_private_data")
...     .join_public(public_data)
...     .groupby(KeySet.from_dict({"C": [0, 1, 2]}))
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("C").toPandas()
   C  count
0  0      1
1  1      3
2  2      2
>>> # Alternatively, the dataframe can be added to the Session as a public
>>> # source, and its source ID can be used to perform the join:
>>> sess.add_public_dataframe(
...     source_id="my_public_data", dataframe=public_data
... )
>>> query = (
...     QueryBuilder("my_private_data")
...     .join_public("my_public_data")
...     .groupby(KeySet.from_dict({"C": [0, 1, 2]}))
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("C").toPandas()
   C  count
0  0      1
1  1      3
2  2      2
>>> # The join can also be a left join. This is helpful to keep
>>> # records which are not included in the table. One use for this
>>> # is to find the number of records which are or are not
>>> # included in a KeySet.
>>> from pyspark.sql import functions as sf
>>> my_keyset = KeySet.from_dict({"A": ["0"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .join_public(
...         my_keyset.dataframe().withColumn("Indicator", sf.lit(1)),
...         how="left",
...     )
...     .groupby(KeySet.from_dict({"Indicator": [1, None]}))
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()  # pandas converts int to float if there are NaNs
   Indicator  count
0        NaN      2
1        1.0      1
Parameters:
  • public_table (Union[pyspark.sql.DataFrame, str]) – A dataframe or source ID for a public source to natural join with private data.

  • join_columns (Optional[Sequence[str]]) – The columns to join on. If join_columns is not specified, the tables will be joined on all common columns.

  • how (str) – The type of join to perform. Must be one of “inner” or “left”.

Return type:

QueryBuilder

join_private(right_operand, truncation_strategy_left=None, truncation_strategy_right=None, join_columns=None)#

Join the table with another QueryBuilder.

The current query can also be joined with a named private table (represented as a string).

This operation is an inner, natural join, with the same behavior and requirements as join_public().

Private joins on tables with a ProtectedChange that protects adding or removing rows (e.g. AddMaxRows) require truncation, which is specified using the two TruncationStrategy arguments.

In contrast, operations on tables with a AddRowsWithID ProtectedChange do not require a TruncationStrategy, as no truncation is necessary to perform the join. In this case, the join columns must include the privacy ID columns of both tables, and these privacy ID columns must have the same name, and be in the same ID space.

An example of a private join can be found in the Doing more with privacy IDs tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> from tmlt.analytics.query_builder import TruncationStrategy
>>> sess.create_view(
...     QueryBuilder("my_private_data")
...     .select(["A", "X"])
...     .rename({"X": "C"}),
...     source_id="my_private_view",
...     cache=False
... )
>>> # A query where only one row with each join key is kept on the left
>>> # table, but two are kept on the right table.
>>> query_drop_excess = (
...     QueryBuilder("my_private_data")
...     .join_private(
...         QueryBuilder("my_private_view"),
...         truncation_strategy_left=TruncationStrategy.DropExcess(1),
...         truncation_strategy_right=TruncationStrategy.DropExcess(2),
...     )
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query_drop_excess,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      3
>>> # A query where all rows that share a join key with another row in
>>> # their table are dropped, in both the left and right tables.
>>> query_drop_non_unique = (
...     QueryBuilder("my_private_data")
...     .join_private(
...         QueryBuilder("my_private_view"),
...         truncation_strategy_left=TruncationStrategy.DropNonUnique(),
...         truncation_strategy_right=TruncationStrategy.DropNonUnique(),
...     )
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query_drop_non_unique,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      1
Parameters:
  • right_operand (Union[QueryBuilder, str]) – QueryBuilder object representing the table to be joined with. When calling query_a.join_private(query_b, ...), we refer to query_a as the left table and query_b as the right table. query_a.join_private("table") is shorthand for query_a.join_private(QueryBuilder("table")).

  • truncation_strategy_left (Optional[tmlt.analytics.truncation_strategy.TruncationStrategy.Type]) – Strategy for truncation of the left table.

  • truncation_strategy_right (Optional[tmlt.analytics.truncation_strategy.TruncationStrategy.Type]) – Strategy for truncation of the right table.

  • join_columns (Optional[Sequence[str]]) – The columns to join on. If join_columns is not specified, the tables will be joined on all common columns.

Return type:

QueryBuilder

replace_null_and_nan(replace_with=None)#

Replaces null and NaN values in specified columns.

Note

Null values cannot be replaced in the ID column of a table initialized with a AddRowsWithID ProtectedChange, nor on a column generated by a flat_map() with the grouping parameter set to True.

Warning

If null values are replaced in a column, then Analytics will raise an error if a KeySet is used with a null value for that column.

Example

>>> my_private_data.toPandas()
      A    B    X
0  None  0.0  0.0
1     1  NaN  1.1
2     2  2.0  NaN
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a query with a replace_null_and_nan transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .replace_null_and_nan(
...         replace_with={
...             "A": "new_value",
...             "B": 1234,
...             "X": 56.78,
...         },
...     )
...     .groupby(KeySet.from_dict({"A": ["new_value", "1", "2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
           A  count
0          1      1
1          2      1
2  new_value      1
Parameters:

replace_with (Optional[Mapping[str, Union[int, float, str, datetime.date, datetime.datetime]]]) – A dictionary mapping column names to values used to replace null and NaN values. If None (or empty), all columns will have null and NaN values replaced with Analytics defaults; see AnalyticsDefault.

Return type:

QueryBuilder

replace_infinity(replace_with=None)#

Replaces +inf and -inf values in specified columns.

Example

>>> my_private_data.toPandas()
    A    B    X
0  a1  0.0  0.0
1  a1  NaN -inf
2  a2  2.0  inf
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a query with a replace_infinity transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .replace_infinity(
...         replace_with={
...             "X": (-100, 100),
...         },
...     )
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      2
1  a2      1
Parameters:

replace_with (Optional[Dict[str, Tuple[float, float]]]) – A dictionary mapping column names to values used to replace -inf and +inf. If None (or empty), all columns will have infinite values replaced with Analytics defaults; see AnalyticsDefault.

Return type:

QueryBuilder

drop_null_and_nan(columns)#

Removes rows containing null or NaN values.

Note

Null values cannot be dropped in the ID column of a table initialized with a AddRowsWithID ProtectedChange, nor on a column generated by a flat_map() with the grouping parameter set to True.

Warning

If null and NaN values are dropped from a column, then Analytics will raise an error if a KeySet that contains a null value is used for that column.

Example

>>> my_private_data.toPandas()
    A    B    X
0  a1  2.0  0.0
1  a1  NaN  1.1
2  a2  2.0  NaN
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Count query on the original data
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"], "B": [None, 2]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
    A    B  count
0  a1  NaN      1
1  a1  2.0      1
2  a2  NaN      0
3  a2  2.0      1
>>> # Building a query with a transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .drop_null_and_nan(columns=["B"])
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      1
1  a2      1
Parameters:

columns (Optional[List[str]]) – A list of columns in which to look for null and NaN values. If None or an empty list, then all columns will be considered, meaning that if any column has a null/NaN value then the row it is in will be dropped.

Return type:

QueryBuilder

drop_infinity(columns)#

Remove rows containing infinite values.

Example

>>> my_private_data.toPandas()
    A  B    X
0  a1  1  1.1
1  a1  2  0.0
2  a2  2  inf
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Count query on the original data
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      2
1  a2      1
>>> # Building a query with a drop_infinity transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .drop_infinity(columns=["X"])
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      2
1  a2      0
Parameters:

columns (Optional[List[str]]) – A list of columns in which to look for positive and negative infinities. If None or an empty list, then all columns will be considered, meaning that if any column has an infinite value then the row it is in will be dropped.

Return type:

QueryBuilder

rename(column_mapper)#

Renames one or more columns in the table.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a query with a rename transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .rename({"X": "C"})
...     .groupby(KeySet.from_dict({"C": [0, 1]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("C").toPandas()
   C  count
0  0      1
1  1      2
Parameters:

column_mapper (Dict[str, str]) – A mapping of columns to new column names. Columns not specified in the mapper will remain the same.

Return type:

QueryBuilder

filter(condition)#

Filter rows matching a condition.

The condition parameter accepts the same syntax as in PySpark’s filter() method: valid expressions are those that can be used in a WHERE clause in Spark SQL. Examples of valid conditions include:

  • age < 42

  • age BETWEEN 17 AND 42

  • age < 42 OR (age < 60 AND gender IS NULL)

  • LENGTH(name) > 17

  • favorite_color IN ('blue', 'red')

  • date = '2022-03-14'

  • time < '2022-01-01T12:45:00'

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a query with a filter transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .filter("A == '0'")
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      1
Parameters:

condition (str) – A string of SQL expressions specifying the filter to apply to the data. For example, the string “A > B” matches rows where column A is greater than column B.

Return type:

QueryBuilder

select(columns)#

Selects the specified columns, dropping the others.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> sess.get_column_types("my_private_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> # Create a new view using a select query
>>> query = (
...     QueryBuilder("my_private_data")
...     .select(["A", "B"])
... )
>>> sess.create_view(query, "selected_data", cache=True)
>>> # Inspect the schema of the resulting view
>>> sess.get_column_types("selected_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER}
Parameters:

columns (Sequence[str]) – The columns to select.

Return type:

QueryBuilder

map(f, new_column_types, augment=False)#

Applies a mapping function to each row.

If the new column types are specified using ColumnType and not ColumnDescriptor, Tumult Analytics assumes that all new columns created may contain null values, and that DECIMAL columns may contain NaN or infinite values.

An illustrated example can be found in the Simple transformations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a query with a map transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .map(
...         lambda row: {"new": row["B"]*2},
...         new_column_types={"new": ColumnType.INTEGER},
...         augment=True
...     )
...     .groupby(KeySet.from_dict({"new": [0, 1, 2, 3, 4]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("new").toPandas()
   new  count
0    0      1
1    1      0
2    2      1
3    3      0
4    4      1
Parameters:
  • f (Callable[[Row], Row]) – The function to be applied to each row. The function’s input is a dictionary matching each column name to its value for that row. This function should return a dictionary, which should always have the same keys regardless of input, and the values in that dictionary should match the column types specified in new_column_types. The function should not have any side effects (in particular, f cannot raise exceptions).

  • new_column_types (Mapping[str, Union[tmlt.analytics._schema.ColumnDescriptor, tmlt.analytics._schema.ColumnType]]) – Mapping from column names to types, for new columns produced by f. Using ColumnDescriptor is preferred.

  • augment (bool) – If True, add new columns to the existing dataframe (so new schema = old schema + schema_new_columns). If False, make the new dataframe with schema = schema_new_columns

Return type:

QueryBuilder

flat_map(f, new_column_types, augment=False, grouping=False, max_rows=None, max_num_rows=None)#

Applies a mapping function to each row, returning zero or more rows.

If the new column types are specified using ColumnType and not ColumnDescriptor, Tumult Analytics assumes that all new columns created may contain null values, and that DECIMAL columns may contain NaN or infinite values.

The max_rows argument is ignored if the table was initialized with the AddRowsWithID ProtectedChange. Otherwise, it is required (and enforced).

The Simple transformations and Doing more with privacy IDs tutorials contain illustrated examples of flat maps.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
3  1  3  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a query with a flat map transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .flat_map(
...         lambda row: [{"i_B": i} for i in range(int(row["B"])+1)],
...         new_column_types={"i_B": ColumnDescriptor(
...             ColumnType.INTEGER, allow_null=False,
...         )},
...         augment=True,
...         grouping=False,
...         max_rows=3,
...     )
...     .groupby(KeySet.from_dict({"B": [0, 1, 2, 3]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("B").toPandas()
   B  count
0  0      1
1  1      2
2  2      3
3  3      3
Parameters:
  • f (Callable[[Row], List[Row]]) – The function to be applied to each row. The function’s input is a dictionary matching a column name to its value for that row. This function should return a list of dictionaries. Those dictionaries should always have the same keys regardless of input, and the values in those dictionaries should match the column types specified in new_column_types. The function should not have any side effects (in particular, f must not raise exceptions), and must be deterministic (running it multiple times on a fixed input should always return the same output).

  • new_column_types (Mapping[str, Union[str, tmlt.analytics._schema.ColumnDescriptor, tmlt.analytics._schema.ColumnType]]) – Mapping from column names to types, for new columns produced by f. Using ColumnDescriptor is preferred.

  • augment (bool) – If True, add new columns to the existing dataframe (so new schema = old schema + schema_new_columns). If False, make the new dataframe with schema = schema_new_columns

  • grouping (bool) – Whether this produces a new column that we want to groupby. If True, this requires that any groupby aggregations following this query include the new column as a groupby column. Only one new column is supported, and the new column must have distinct values for each input row.

  • max_rows (Optional[int]) – The enforced limit on the number of rows from each f(row). If f produces more rows than this, only the first max_rows rows will be in the output.

  • max_num_rows (Optional[int]) – Deprecated synonym for max_rows.

Return type:

QueryBuilder

flat_map_by_id(f, new_column_types)#

Applies a transformation to each group of records sharing an ID.

Transforms groups of records that all share a common ID into a new group of records with that same ID based on a user-provided function. The number of rows produced does not have to match the number of input rows. The ID column is automatically added to the output of f, but all other input columns are lost unless f copies them into its output.

Note that this transformation is only valid on tables with the AddRowsWithID protected change.

If you provide only a ColumnType for the new column types, Analytics assumes that all new columns created may contain null values (and that DECIMAL columns may contain NaN or infinite values).

Example

>>> my_private_data.toPandas()
  id  A
0  0  1
1  1  0
2  1  1
3  1  4
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddRowsWithID("id"),
... )
>>> # Using flat_map_by_id, each ID's records are pre-summed before
>>> # computing a total sum, allowing less data loss than truncating
>>> # and clamping each row individually without having to add
>>> # more noise.
>>> query = (
...     QueryBuilder("my_private_data")
...     .flat_map_by_id(
...         lambda rows: [{"per_id_sum": sum(r["A"] for r in rows)}],
...         new_column_types={
...             "per_id_sum": ColumnDescriptor(
...                 ColumnType.INTEGER, allow_null=False,
...             )
...         },
...     )
...     .enforce(MaxRowsPerID(1))
...     .sum("per_id_sum", low=0, high=5, name="sum")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   sum
0    6
Parameters:
  • f (Callable[[List[Row]], List[Row]]) – The function to be applied to each group of rows. The function’s input is a list of dictionaries, each with one key/value pair per column. This function should return a list of dictionaries. Those dictionaries must each have one key/value pair for each column types specified in new_column_types, and the values’ types must match the column types. The function must not have any side effects (in particular, it must not raise exceptions), and must be deterministic (running it multiple times on a fixed input should always return the same output).

  • new_column_types (Mapping[str, Union[str, tmlt.analytics._schema.ColumnDescriptor, tmlt.analytics._schema.ColumnType]]) – Mapping from column names to types for the new columns produced by f. Using ColumnDescriptor is preferred. Note that while the result of this transformation includes the ID column, the ID column must not be in new_column_types, and must not be included in the output rows from f.

Return type:

QueryBuilder

bin_column(column, spec, name=None)#

Creates a new column by assigning the values in a given column to bins.

An illustrated example can be found in the Simple transformations tutorial.

Example

>>> my_private_data.toPandas()
   age  income
0   11       0
1   17       6
2   30      54
3   18      14
4   59     126
5   48     163
6   76     151
7   91      18
8   48      97
9   53      85
>>> from tmlt.analytics.binning_spec import BinningSpec
>>> sess = Session.from_dataframe(
...     PureDPBudget(float("inf")),
...     source_id="private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> age_binspec = BinningSpec(
...     [0, 18, 65, 100], include_both_endpoints=False
... )
>>> income_tax_rate_binspec = BinningSpec(
...     [0, 10, 40, 86, 165], names=[10, 12, 22, 24]
... )
>>> keys = KeySet.from_dict(
...     {
...         "age_binned": age_binspec.bins(),
...         "marginal_tax_rate": income_tax_rate_binspec.bins()
...     }
... )
>>> query = (
...     QueryBuilder("private_data")
...     .bin_column("age", age_binspec)
...     .bin_column(
...         "income", income_tax_rate_binspec, name="marginal_tax_rate"
...     )
...     .groupby(keys).count()
... )
>>> answer = sess.evaluate(query, PureDPBudget(float("inf")))
>>> answer.sort("age_binned", "marginal_tax_rate").toPandas()
   age_binned  marginal_tax_rate  count
0     (0, 18]                 10      2
1     (0, 18]                 12      1
2     (0, 18]                 22      0
3     (0, 18]                 24      0
4    (18, 65]                 10      0
5    (18, 65]                 12      0
6    (18, 65]                 22      2
7    (18, 65]                 24      3
8   (65, 100]                 10      0
9   (65, 100]                 12      1
10  (65, 100]                 22      0
11  (65, 100]                 24      1
Parameters:
  • column (str) – Name of the column used to assign bins.

  • spec (tmlt.analytics.binning_spec.BinningSpec) – A BinningSpec that defines the binning operation to be performed.

  • name (Optional[str]) – The name of the column that will be created. If None (the default), the input column name with _binned appended to it.

Return type:

QueryBuilder

histogram(column, bin_edges, name=None)#

Returns a count query containing the frequency of values in specified column.

Example

>>> from tmlt.analytics.binning_spec import BinningSpec
>>> private_data = spark.createDataFrame(
...     pd.DataFrame(
...         {
...          "income_thousands": [83, 85, 86, 73, 82, 95,
...                               74, 92, 71, 86, 97]
...         }
...     )
... )
>>> session = Session.from_dataframe(
...     privacy_budget=PureDPBudget(epsilon=float('inf')),
...     source_id="private_data",
...     dataframe=private_data,
...     protected_change=AddOneRow(),
... )
>>> income_binspec = BinningSpec(
...     bin_edges=[i for i in range(70,110,10)],
...     include_both_endpoints=False
... )
>>> binned_income_count_query = (
...     QueryBuilder("private_data")
...     .histogram("income_thousands", income_binspec, "income_binned")
... )
>>> binned_income_counts = session.evaluate(
...     binned_income_count_query,
...     privacy_budget=PureDPBudget(epsilon=10),
... )
>>> print(binned_income_counts.sort("income_binned").toPandas())
  income_binned  count
0      (70, 80]      3
1      (80, 90]      5
2     (90, 100]      3
Parameters:
  • column (str) – Name of the column used to assign bins.

  • bin_edges (Union[Sequence[tmlt.analytics.binning_spec.BinT], tmlt.analytics.binning_spec.BinningSpec]) – The bin edges for the histogram; provided as either a BinningSpec or as a list of supported data types. Values outside the range of the provided bins, None types, and NaN values are all mapped to None (null in Spark).

  • name (Optional[str]) – The name of the column that will be created. If None (the default), the input column name with _binned appended to it.

Return type:

Query

enforce(constraint)#

Enforces a Constraint on the table.

This method can be used to enforce constraints on the current table. See the constraints module for information about the available constraints and what they are used for.

Example

>>> my_private_data.toPandas()
  id  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess = (
...     Session.Builder()
...     .with_privacy_budget(PureDPBudget(float("inf")))
...     .with_id_space("a")
...     .with_private_dataframe(
...         "my_private_data",
...         my_private_data,
...         protected_change=AddRowsWithID("id", "a"),
...     )
...     .build()
... )
>>> # No ID contributes more than 2 rows, so no rows are dropped when
>>> # enforcing the constraint
>>> query = QueryBuilder("my_private_data").enforce(MaxRowsPerID(2)).count()
>>> sess.evaluate(query, sess.remaining_privacy_budget).toPandas()
   count
0      3
>>> # ID 1 contributes more than one row, so one of the rows with ID 1 will
>>> # be dropped when enforcing the constraint
>>> query = QueryBuilder("my_private_data").enforce(MaxRowsPerID(1)).count()
>>> sess.evaluate(query, sess.remaining_privacy_budget).toPandas()
   count
0      2
Parameters:

constraint (tmlt.analytics.constraints.Constraint) – The constraint to enforce.

Return type:

QueryBuilder

get_groups(columns=None)#

Returns a query that gets combinations of values in the listed columns.

Note

Because this uses differential privacy, it won’t include all of the values in the input dataset columns, and may even return no results at all on datasets that have few values for each set of group keys.

Example

>>> my_private_data = spark.createDataFrame(
...     pd.DataFrame(
...         [["0", 1, 0] for _ in range(10000)]
...         + [["1", 2, 1] for _ in range(10000)],
...         columns=["A", "B", "X"],
...     )
... )
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=ApproxDPBudget(1, 1e-5),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a get_groups query
>>> query = (
...     QueryBuilder("my_private_data")
...     .get_groups()
... )
>>> # Answering the query
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.toPandas()
   A  B  X
0  0  1  0
1  1  2  1
Parameters:

columns (Optional[List[str]]) – Name of the column used to assign bins. If empty or none are provided, all of the columns in the table will be used, excluding any column marked as a privacy ID in a table with a ProtectedChange of AddRowsWithID.

Return type:

Query

get_bounds(column, lower_bound_column=None, upper_bound_column=None)#

Returns a query that gets approximate upper and lower bounds for a column.

The bounds are selected to give good performance when used as upper and lower bounds in other aggregations. They may not be close to the actual maximum and minimum values, and are not designed to give a tight representation of the data distribution. For any purpose other than providing a lower and upper bound to other aggregations we suggest using the quantile aggregation instead.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

The algorithm is approximate, and differentially private, so the bounds may not be tight, and not all input values may fall between them.

Example

>>> my_private_data = spark.createDataFrame(
...     pd.DataFrame(
...         [[i] for i in range(100)],
...         columns=["X"],
...     )
... )
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=PureDPBudget(float('inf')),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a get_groups query
>>> query = (
...     QueryBuilder("my_private_data")
...     .get_bounds("X")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.toPandas()
   X_upper_bound  X_lower_bound
0            128           -128
Parameters:
  • column (str) – Name of the column whose bounds we want to get.

  • lower_bound_column (Optional[str]) – Name of the column to store the lower bound. Defaults to f"{column}_lower_bound".

  • upper_bound_column (Optional[str]) – Name of the column to store the upper bound. Defaults to f"{column}_upper_bound".

Return type:

Query

groupby(by)#

Groups the query by the given set of keys, returning a GroupedQueryBuilder.

The aggregation will return one row for each key in the KeySet; other values present in the data will be discarded.

More information can be found in the Group-by queries tutorial.

Examples

>>> from tmlt.analytics.keyset import KeySet
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
Answering a query with the exact groupby domain:
>>> groupby_keys = KeySet.from_dict({"A": ["0", "1"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2
Answering a query with omitted and added domain values:
>>> groupby_keys = KeySet.from_dict({"A": ["0", "2"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   A  count
0  0      1
1  2      0
Answering a query with a multi-column domain:
>>> groupby_keys = KeySet.from_dict(
...    {"A": ["0", "1"], "B": [0, 1, 2]}
... )
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  0      0
1  0  1      1
2  0  2      0
3  1  0      1
4  1  1      0
5  1  2      1
Answering a query with a multi-column domain and structural zeros:
>>> # Suppose it is known that A and B cannot be equal. This set of
>>> # groupby keys prevents those impossible values from being computed.
>>> keys_df = pd.DataFrame({
...     "A": ["0", "0", "1", "1"],
...     "B": [1, 2, 0, 2],
... })
>>> groupby_keys = KeySet.from_dataframe(spark.createDataFrame(keys_df))
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  1      1
1  0  2      0
2  1  0      1
3  1  2      1
Parameters:

by (Union[tmlt.analytics.keyset.KeySet, List[str], str]) – A KeySet which defines the columns to group on and the possible values for each column.

Return type:

GroupedQueryBuilder

count(name=None, mechanism=CountMechanism.DEFAULT)#

Returns a count query ready to be evaluated.

Note

Differentially private counts may return values that are not possible for a non-DP query - including negative values. You can enforce non-negativity once the query returns its results; see the example below.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a count query
>>> query = (
...     QueryBuilder("my_private_data")
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      3
>>> # Ensuring all results are non-negative
>>> import pyspark.sql.functions as sf
>>> answer = answer.withColumn(
...     "count", sf.when(sf.col("count") < 0, 0).otherwise(
...             sf.col("count")
...     )
... )
>>> answer.toPandas()
   count
0      3
Parameters:
  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count”.

  • mechanism (tmlt.analytics._query_expr.CountMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

Query

count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)#

Returns a count_distinct query ready to be evaluated.

Note

Differentially private counts may returns values that are not possible for a non-DP query - including negative values. You can enforce non-negativity once the query returns its results; see the example below.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .count_distinct()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count_distinct
0               3
>>> # Ensuring all results are non-negative
>>> import pyspark.sql.functions as sf
>>> answer = answer.withColumn(
...     "count_distinct", sf.when(
...         sf.col("count_distinct") < 0, 0
...     ).otherwise(
...         sf.col("count_distinct")
...     )
... )
>>> answer.toPandas()
   count_distinct
0               3
Parameters:
  • columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.

  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.

  • mechanism (tmlt.analytics._query_expr.CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

  • cols (Optional[List[str]]) – Deprecated; use columns instead.

Return type:

Query

quantile(column, quantile, low, high, name=None)#

Returns a quantile query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .quantile(column="B", quantile=0.6, low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
   B_quantile(0.6)
0         1.331107
Parameters:
  • column (str) – The column to compute the quantile over.

  • quantile (float) – A number between 0 and 1 specifying the quantile to compute. For example, 0.5 would compute the median.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_quantile({quantile})".

Return type:

Query

min(column, low, high, name=None)#

Returns a quantile query requesting a minimum value, ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .min(column="B", low=0, high=5, name="min_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
      min_B
0  0.213415
Parameters:
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_min".

Return type:

Query

max(column, low, high, name=None)#

Returns a quantile query requesting a maximum value, ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .max(column="B", low=0, high=5, name="max_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
      max_B
0  2.331871
Parameters:
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_max".

Return type:

Query

median(column, low, high, name=None)#

Returns a quantile query requesting a median value, ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .median(column="B", low=0, high=5, name="median_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
   median_B
0  1.221197
Parameters:
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_median".

Return type:

Query

sum(column, low, high, name=None, mechanism=SumMechanism.DEFAULT)#

Returns a sum query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

More information can be found in the Numerical aggregations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a sum query
>>> query = (
...     QueryBuilder("my_private_data")
...     .sum(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_sum
0      3
Parameters:
  • column (str) – The column to compute the sum over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_sum".

  • mechanism (tmlt.analytics._query_expr.SumMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

Query

average(column, low, high, name=None, mechanism=AverageMechanism.DEFAULT)#

Returns an average query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

More information can be found in the Numerical aggregations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building an average query
>>> query = (
...     QueryBuilder("my_private_data")
...     .average(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_average
0        1.0
Parameters:
  • column (str) – The column to compute the average over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_average".

  • mechanism (tmlt.analytics._query_expr.AverageMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

Query

variance(column, low, high, name=None, mechanism=VarianceMechanism.DEFAULT)#

Returns a variance query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Consult the Numerical aggregations tutorial for more information.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a variance query
>>> query = (
...     QueryBuilder("my_private_data")
...     .variance(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_variance
0         1.0
Parameters:
  • column (str) – The column to compute the variance over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_variance".

  • mechanism (tmlt.analytics._query_expr.VarianceMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

Query

stdev(column, low, high, name=None, mechanism=StdevMechanism.DEFAULT)#

Returns a standard deviation query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

More information can be found in the Numerical aggregations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a standard deviation query
>>> query = (
...     QueryBuilder("my_private_data")
...     .stdev(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_stdev
0      1.0
Parameters:
  • column (str) – The column to compute the stdev over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_stdev".

  • mechanism (tmlt.analytics._query_expr.StdevMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

Query

class GroupedQueryBuilder(source_id, query_expr, groupby_keys)#

A QueryBuilder that is grouped by a set of columns and can be aggregated.

Methods#

count()

Returns a GroupedCountQuery with a count query.

count_distinct()

Returns a Query with a count_distinct query.

quantile()

Returns a Query with a quantile query.

min()

Returns a Query with a quantile query requesting a minimum value.

max()

Returns a Query with a quantile query requesting a maximum value.

median()

Returns a Query with a quantile query requesting a median value.

sum()

Returns a Query with a sum query.

average()

Returns a Query with an average query.

variance()

Returns a Query with a variance query.

stdev()

Returns a Query with a standard deviation query.

get_bounds()

Returns an Query that gets approximate upper and lower bounds for a column.

__init__(source_id, query_expr, groupby_keys)#

Constructor.

Do not construct directly; use groupby().

Return type:

None

count(name=None, mechanism=CountMechanism.DEFAULT)#

Returns a GroupedCountQuery with a count query.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby count query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2
Parameters:
  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count”.

  • mechanism (tmlt.analytics._query_expr.CountMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

GroupbyCountQuery

count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)#

Returns a Query with a count_distinct query.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .count_distinct(["B", "X"])
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count_distinct(B, X)
0  0                     1
1  1                     2
Parameters:
  • columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.

  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.

  • mechanism (tmlt.analytics._query_expr.CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

  • cols (Optional[List[str]]) – Deprecated; use columns instead.

Return type:

Query

quantile(column, quantile, low, high, name=None)#

Returns a Query with a quantile query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .quantile(column="B", quantile=0.6, low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A  B_quantile(0.6)
0  0         1.331107
1  1         1.331107
Parameters:
  • column (str) – The column to compute the quantile over.

  • quantile (float) – A number between 0 and 1 specifying the quantile to compute. For example, 0.5 would compute the median.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_quantile({quantile})".

Return type:

Query

min(column, low, high, name=None)#

Returns a Query with a quantile query requesting a minimum value.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .min(column="B", low=0, high=5, name="min_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A     min_B
0  0  0.213415
1  1  0.213415
Parameters:
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_min".

Return type:

Query

max(column, low, high, name=None)#

Returns a Query with a quantile query requesting a maximum value.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .max(column="B", low=0, high=5, name="max_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A     max_B
0  0  2.331871
1  1  2.331871
Parameters:
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_max".

Return type:

Query

median(column, low, high, name=None)#

Returns a Query with a quantile query requesting a median value.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .median(column="B", low=0, high=5, name="median_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A  median_B
0  0  1.221197
1  1  1.221197
Parameters:
  • column (str) – The column to compute the quantile over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_median".

Return type:

Query

sum(column, low, high, name=None, mechanism=SumMechanism.DEFAULT)#

Returns a Query with a sum query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

More information can be found in the Numerical aggregations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby sum query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .sum(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_sum
0  0      1
1  1      2
Parameters:
  • column (str) – The column to compute the sum over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_sum".

  • mechanism (tmlt.analytics._query_expr.SumMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

Query

average(column, low, high, name=None, mechanism=AverageMechanism.DEFAULT)#

Returns a Query with an average query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

More information can be found in the Numerical aggregations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby average query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .average(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_average
0  0        1.0
1  1        1.0
Parameters:
  • column (str) – The column to compute the average over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_average".

  • mechanism (tmlt.analytics._query_expr.AverageMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

Query

variance(column, low, high, name=None, mechanism=VarianceMechanism.DEFAULT)#

Returns a Query with a variance query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

More information can be found in the Numerical aggregations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby variance query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .variance(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_variance
0  0         NaN
1  1         1.0
Parameters:
  • column (str) – The column to compute the variance over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_variance".

  • mechanism (tmlt.analytics._query_expr.VarianceMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

Query

stdev(column, low, high, name=None, mechanism=StdevMechanism.DEFAULT)#

Returns a Query with a standard deviation query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

  1. The values for low and high are a choice the caller must make.

  2. All data will be clamped to lie within this range.

  3. The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.

  4. The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

More information can be found in the Numerical aggregations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby standard deviation query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .stdev(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_stdev
0  0      NaN
1  1      1.0
Parameters:
  • column (str) – The column to compute the stdev over.

  • low (float) – The lower bound for clamping.

  • high (float) – The upper bound for clamping. Must be such that low is less than high.

  • name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_stdev".

  • mechanism (tmlt.analytics._query_expr.StdevMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

Query

get_bounds(column, lower_bound_column=None, upper_bound_column=None)#

Returns an Query that gets approximate upper and lower bounds for a column.

The bounds are selected to give good performance when used as upper and lower bounds in other aggregations. They may not be close to the actual maximum and minimum values, and are not designed to give a tight representation of the data distribution. For any purpose other than providing a lower and upper bound to other aggregations we suggest using the quantile aggregation instead.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

The algorithm is approximate, and differentially private, so the bounds may not be tight, and not all input values may fall between them.

Example

>>> my_private_data = spark.createDataFrame(
...     pd.DataFrame(
...         [["0", 1, 0], ["1", 0, 10], ["1", 2, 10], ["2", 2, 1]],
...         columns=["A", "B", "X"]
...     )
... )
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=PureDPBudget(float('inf')),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a get_groups query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .get_bounds(column="X")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.sort("A").toPandas()
   A  X_upper_bound  X_lower_bound
0  0              1             -1
1  1             16            -16
Parameters:
  • column (str) – Name of the column whose bounds we want to get.

  • lower_bound_column (Optional[str]) – Name of the column to store the lower bound. Defaults to f"{column}_lower_bound".

  • upper_bound_column (Optional[str]) – Name of the column to store the upper bound. Defaults to f"{column}_upper_bound".

Return type:

Query