QueryBuilder.get_bounds#
from tmlt.analytics import QueryBuilder
- QueryBuilder.get_bounds(column, lower_bound_column=None, upper_bound_column=None)#
Returns a query that gets approximate upper and lower bounds for a column.
The bounds are selected to give good performance when used as upper and lower bounds in other aggregations. They may not be close to the actual minimum and maximum values, and are not designed to give a tight representation of the data distribution. For any purpose other than providing lower and upper bounds to other aggregations, we suggest using the quantile aggregation instead.
Note
If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.
Note
The algorithm is approximate and differentially private, so the bounds may not be tight, and not all input values may fall between them.
Example
>>> my_private_data = spark.createDataFrame(
...     pd.DataFrame(
...         [[i] for i in range(100)],
...         columns=["X"],
...     )
... )
>>> sess = Session.from_dataframe(
...     privacy_budget=PureDPBudget(float('inf')),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a get_bounds query
>>> query = (
...     QueryBuilder("my_private_data")
...     .get_bounds("X")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.toPandas()
   X_upper_bound  X_lower_bound
0            128           -128
- Parameters:
  - column (str) – Name of the column whose bounds we want to get.
  - lower_bound_column (Optional[str]) – Name of the column to store the lower bound. Defaults to f"{column}_lower_bound".
  - upper_bound_column (Optional[str]) – Name of the column to store the upper bound. Defaults to f"{column}_upper_bound".
- Return type:
  Query
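To see why loose bounds are still useful downstream, consider how differentially private aggregations typically use them: each value is clamped into [lower, upper] before noise calibrated to the bounds' magnitude is added. The following is a minimal pure-Python sketch of that idea — the function names `clamp` and `dp_sum` are illustrative, not Tumult Analytics APIs, and this is not Tumult's actual mechanism.

```python
import random

def clamp(x, lower, upper):
    # Restrict a value to the [lower, upper] range.
    return max(lower, min(upper, x))

def dp_sum(values, lower, upper, epsilon):
    # Clamp each value so one row's influence is bounded by
    # max(|lower|, |upper|), then add Laplace noise scaled to that
    # sensitivity. A sketch only, not Tumult's implementation.
    sensitivity = max(abs(lower), abs(upper))
    clamped_total = sum(clamp(v, lower, upper) for v in values)
    # A Laplace sample is the difference of two exponential samples.
    rate = epsilon / sensitivity
    noise = random.expovariate(rate) - random.expovariate(rate)
    return clamped_total + noise

data = list(range(100))
# Bounds like (-128, 128) from get_bounds still capture every value,
# so clamping introduces no bias; looseness only widens the noise.
print(dp_sum(data, -128, 128, epsilon=1.0))
```

Because the bounds are only used for clamping and noise calibration, they do not need to be tight — they just need to cover most of the data.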
GroupedQueryBuilder.get_bounds#
from tmlt.analytics import GroupedQueryBuilder
- GroupedQueryBuilder.get_bounds(column, lower_bound_column=None, upper_bound_column=None)#
Returns a query that gets approximate upper and lower bounds for a column.
The bounds are selected to give good performance when used as upper and lower bounds in other aggregations. They may not be close to the actual minimum and maximum values, and are not designed to give a tight representation of the data distribution. For any purpose other than providing lower and upper bounds to other aggregations, we suggest using the quantile aggregation instead.
Note
If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.
Note
The algorithm is approximate and differentially private, so the bounds may not be tight, and not all input values may fall between them.
Example
>>> my_private_data = spark.createDataFrame(
...     pd.DataFrame(
...         [["0", 1, 0], ["1", 0, 10], ["1", 2, 10], ["2", 2, 1]],
...         columns=["A", "B", "X"],
...     )
... )
>>> sess = Session.from_dataframe(
...     privacy_budget=PureDPBudget(float('inf')),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a get_bounds query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .get_bounds(column="X")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.sort("A").toPandas()
   A  X_upper_bound  X_lower_bound
0  0              1             -1
1  1             16            -16
- Parameters:
  - column (str) – Name of the column whose bounds we want to get.
  - lower_bound_column (Optional[str]) – Name of the column to store the lower bound. Defaults to f"{column}_lower_bound".
  - upper_bound_column (Optional[str]) – Name of the column to store the upper bound. Defaults to f"{column}_upper_bound".
- Return type:
  Query
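The example outputs above (±128 and ±16) are symmetric powers of two, which hints at why the bounds are loose: the mechanism selects from a coarse grid of candidate magnitudes rather than estimating the exact min and max. The sketch below shows a noiseless version of that idea — grow a symmetric power-of-two bound until it covers most of the data. The function name and the coverage threshold are illustrative assumptions; Tumult's actual algorithm is differentially private, so its counts are noisy and the returned bounds are approximate.

```python
def symmetric_power_of_two_bound(values, coverage=0.95):
    # Pick the smallest power-of-two magnitude B such that at least
    # `coverage` of the values fall in [-B, B]. Illustrative only:
    # the real DP algorithm perturbs these counts with noise, so not
    # all input values are guaranteed to fall between the bounds.
    n = len(values)
    bound = 1
    while sum(1 for v in values if -bound <= v <= bound) < coverage * n:
        bound *= 2
    return -bound, bound

print(symmetric_power_of_two_bound(list(range(100))))  # (-128, 128)
print(symmetric_power_of_two_bound([10, 10]))          # (-16, 16)
```

This also explains the second Note above: because candidate bounds jump by factors of two and the selection is noisy, the result may overshoot the true range (128 vs. a true maximum of 99) or, occasionally, exclude some input values.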