QueryBuilder.get_bounds#
from tmlt.analytics import QueryBuilder
- QueryBuilder.get_bounds(column, lower_bound_column=None, upper_bound_column=None)#
Returns a query that gets approximate upper and lower bounds for a column.
The bounds are selected to give good performance when used as lower and upper bounds in other aggregations. They may not be close to the actual minimum and maximum values, and they are not designed to give a tight representation of the data distribution. For any purpose other than providing lower and upper bounds to other aggregations, we suggest using the quantile aggregation instead.
Note
If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.
Note
The algorithm is approximate and differentially private, so the bounds may not be tight, and not all input values may fall between them.
Example
>>> my_private_data = spark.createDataFrame(
...     pd.DataFrame(
...         [[i] for i in range(100)],
...         columns=["X"],
...     )
... )
>>> sess = Session.from_dataframe(
...     privacy_budget=PureDPBudget(float('inf')),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a get_bounds query
>>> query = (
...     QueryBuilder("my_private_data")
...     .get_bounds("X")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.toPandas()
   X_upper_bound  X_lower_bound
0            128           -128
- Parameters:
- Return type:
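The bounds returned by get_bounds are typically fed into a clipping-based aggregation. The following sketch (assuming the session and data from the example above, and a split of the privacy budget chosen for illustration) shows one way to spend part of a budget discovering bounds and the rest on a sum that uses them:

>>> # Spend part of the budget finding bounds for "X"
>>> bounds = sess.evaluate(
...     QueryBuilder("my_private_data").get_bounds("X"),
...     PureDPBudget(1),
... ).toPandas()
>>> low = bounds["X_lower_bound"][0]
>>> high = bounds["X_upper_bound"][0]
>>> # Use the discovered bounds to clip values in a sum query
>>> sum_query = (
...     QueryBuilder("my_private_data")
...     .sum("X", low=low, high=high)
... )
>>> total = sess.evaluate(sum_query, PureDPBudget(1))

Because the bounds are powers of two rather than tight estimates of the minimum and maximum, the clipped sum may add somewhat more noise than hand-tuned bounds would, but it avoids choosing bounds by inspecting the private data directly.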
from tmlt.analytics import GroupedQueryBuilder
- GroupedQueryBuilder.get_bounds(column, lower_bound_column=None, upper_bound_column=None)#
Returns a query that gets approximate upper and lower bounds for a column.
The bounds are selected to give good performance when used as lower and upper bounds in other aggregations. They may not be close to the actual minimum and maximum values, and they are not designed to give a tight representation of the data distribution. For any purpose other than providing lower and upper bounds to other aggregations, we suggest using the quantile aggregation instead.
Note
If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.
Note
The algorithm is approximate and differentially private, so the bounds may not be tight, and not all input values may fall between them.
Example
>>> my_private_data = spark.createDataFrame(
...     pd.DataFrame(
...         [["0", 1, 0], ["1", 0, 10], ["1", 2, 10], ["2", 2, 1]],
...         columns=["A", "B", "X"],
...     )
... )
>>> sess = Session.from_dataframe(
...     privacy_budget=PureDPBudget(float('inf')),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a grouped get_bounds query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .get_bounds(column="X")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.sort("A").toPandas()
   A  X_upper_bound  X_lower_bound
0  0              1             -1
1  1             16            -16
- Parameters:
- Return type: