QueryBuilder.get_bounds

from tmlt.analytics import QueryBuilder
QueryBuilder.get_bounds(column, lower_bound_column=None, upper_bound_column=None)

Returns a query that gets approximate upper and lower bounds for a column.

The bounds are selected to give good performance when used as upper and lower bounds in other aggregations. They may not be close to the actual maximum and minimum values, and are not designed to give a tight representation of the data distribution. For any purpose other than providing a lower and upper bound to other aggregations we suggest using the quantile aggregation instead.
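To make the role of the bounds concrete, here is a minimal plain-Python sketch of how a bounded aggregation uses them. It assumes the usual clamping behaviour of differentially private sums (each value is clamped into the interval before summing); `clamped_sum` is a hypothetical helper for illustration, not part of the library.

```python
def clamped_sum(values, lower, upper):
    """Sum the values after clamping each one into [lower, upper]."""
    return sum(min(max(v, lower), upper) for v in values)


values = list(range(100))  # the integers 0..99
true_sum = sum(values)

# Bounds that cover the data leave the sum unchanged.
covered = clamped_sum(values, -128, 128)

# Bounds that are too narrow silently bias the result downwards.
too_narrow = clamped_sum(values, 0, 50)

# Under this assumption, one added or removed row can change the clamped
# sum by at most max(|lower|, |upper|), so wider bounds mean more noise.
sensitivity_covered = max(abs(-128), abs(128))
```

Here `covered` equals the true sum of 4950, while `too_narrow` is 3725. Good bounds are therefore a trade-off: wide enough to avoid clamping most values, narrow enough to keep the added noise small.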

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.
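As a rough plain-Python analogue of this preprocessing, the sketch below drops null, NaN, and infinite values before any bounds are computed. The helper name is hypothetical, and it merges the two steps into one filter; the library applies drop_null_and_nan() and drop_infinity() as separate queries on the Spark data.

```python
import math


def drop_null_nan_and_infinity(values):
    """Keep only finite, non-null numbers."""
    return [v for v in values if v is not None and math.isfinite(v)]


raw = [1.0, None, float("nan"), float("inf"), -2.5, float("-inf")]
cleaned = drop_null_nan_and_infinity(raw)
```

Only `1.0` and `-2.5` survive the filter, so those are the only values that would contribute to the bounds.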

Note

The algorithm is approximate and differentially private, so the bounds may not be tight, and some input values may fall outside them.

Example

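>>> # Assumed setup: the rendered example relies on these names already existing.
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.analytics import AddOneRow, PureDPBudget, QueryBuilder, Session
>>> spark = SparkSession.builder.getOrCreate()
>>> my_private_data = spark.createDataFrame(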
...     pd.DataFrame(
...         [[i] for i in range(100)],
...         columns=["X"],
...     )
... )
>>> sess = Session.from_dataframe(
...     privacy_budget=PureDPBudget(float('inf')),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a get_bounds query
>>> query = (
...     QueryBuilder("my_private_data")
...     .get_bounds("X")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.toPandas()
   X_upper_bound  X_lower_bound
0            128           -128
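The output above is symmetric around zero and a power of two, even though the data only spans 0 to 99. The following non-private sketch reproduces that shape. It is an illustration inferred from the example output only, not the library's algorithm: the function name is hypothetical, and how the real algorithm treats a value exactly equal to a power of two is not something this sketch can tell you.

```python
def symmetric_power_of_two_bound(values):
    """Return (-b, b) for the smallest power of two b with |v| <= b for all v."""
    largest = max(abs(v) for v in values)
    bound = 1
    while bound < largest:
        bound *= 2
    return -bound, bound


lower, upper = symmetric_power_of_two_bound(range(100))
```

For the values 0..99 this gives `(-128, 128)`, matching the example. With a finite privacy budget the real query adds randomness, so its output can differ from this idealised calculation.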
Parameters:
  • column (str) – Name of the column whose bounds we want to get.

  • lower_bound_column (Optional[str]) – Name of the column to store the lower bound. Defaults to f"{column}_lower_bound".

  • upper_bound_column (Optional[str]) – Name of the column to store the upper bound. Defaults to f"{column}_upper_bound".

Return type:

Query

from tmlt.analytics import GroupedQueryBuilder
GroupedQueryBuilder.get_bounds(column, lower_bound_column=None, upper_bound_column=None)

Returns a query that gets approximate upper and lower bounds for a column, computed separately for each group.

The bounds are selected to give good performance when used as upper and lower bounds in other aggregations. They may not be close to the actual maximum and minimum values, and are not designed to give a tight representation of the data distribution. For any purpose other than providing a lower and upper bound to other aggregations we suggest using the quantile aggregation instead.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

The algorithm is approximate and differentially private, so the bounds may not be tight, and some input values may fall outside them.

Example

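>>> # Assumed setup: the rendered example relies on these names already existing.
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.analytics import (
...     AddOneRow, KeySet, PureDPBudget, QueryBuilder, Session
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> my_private_data = spark.createDataFrame(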
...     pd.DataFrame(
...         [["0", 1, 0], ["1", 0, 10], ["1", 2, 10], ["2", 2, 1]],
...         columns=["A", "B", "X"]
...     )
... )
>>> sess = Session.from_dataframe(
...     privacy_budget=PureDPBudget(float('inf')),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a grouped get_bounds query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .get_bounds(column="X")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.sort("A").toPandas()
   A  X_upper_bound  X_lower_bound
0  0              1             -1
1  1             16            -16
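The result has one row per key in the KeySet, so the row with A = "2" in the input does not appear. If a later aggregation needs a single pair of bounds rather than one pair per group, one possible approach (an assumption for illustration, not a library recommendation) is to take the widest range across groups. The sketch below works on plain tuples copied from the output above.

```python
# Rows mirroring the output above: (A, X_upper_bound, X_lower_bound).
group_bounds = [("0", 1, -1), ("1", 16, -16)]

# The widest range that covers every group's bounds.
overall_lower = min(lower for _, _, lower in group_bounds)
overall_upper = max(upper for _, upper, _ in group_bounds)
```

This yields an overall range of -16 to 16. Taking the widest range avoids extra clamping in any group, at the cost of more noise for groups whose own bounds were narrower.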
Parameters:
  • column (str) – Name of the column whose bounds we want to get.

  • lower_bound_column (Optional[str]) – Name of the column to store the lower bound. Defaults to f"{column}_lower_bound".

  • upper_bound_column (Optional[str]) – Name of the column to store the upper bound. Defaults to f"{column}_upper_bound".

Return type:

Query