QueryBuilder.groupby
from tmlt.analytics import QueryBuilder
- QueryBuilder.groupby(by)
Groups the query by the given set of keys, returning a GroupedQueryBuilder.
The aggregation will return one row for each key in the KeySet; other values present in the data will be discarded.
More information can be found in the Group-by queries tutorial.
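As a rough mental model of the rule above, the following plain-Python sketch counts rows against a fixed list of keys. It is not how Tumult Analytics computes results (there is no Spark, no noise, and no privacy accounting), and the helper name `count_by_keys` is hypothetical. It only illustrates that every requested key gets a row and that values outside the key list are dropped.

```python
from collections import Counter


def count_by_keys(values, keys):
    """Hypothetical helper: count occurrences of each key in `keys`.

    Illustrates the grouping semantics only; no noise is added.
    """
    observed = Counter(values)
    # One output row per requested key, in the order given. Keys that never
    # appear get a count of 0, and values not in `keys` are ignored.
    return [(key, observed.get(key, 0)) for key in keys]


# Column A of the example table in the Examples section of this page.
column_a = ["0", "1", "1"]

exact_domain = count_by_keys(column_a, ["0", "1"])
shifted_domain = count_by_keys(column_a, ["0", "2"])
print(exact_domain)    # [('0', 1), ('1', 2)]
print(shifted_domain)  # [('0', 1), ('2', 0)]
```

In the second call, the two rows with A equal to "1" contribute nothing, while the key "2" still appears with a count of 0.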
Warning
A KeySet containing null values for a column cannot be used on a table where that column has had null values dropped or replaced.
Examples
>>> from tmlt.analytics import KeySet
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
- Answering a query with the exact groupby domain:
>>> groupby_keys = KeySet.from_dict({"A": ["0", "1"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2
- Answering a query with omitted and added domain values:
>>> groupby_keys = KeySet.from_dict({"A": ["0", "2"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   A  count
0  0      1
1  2      0
- Answering a query with a multi-column domain:
>>> groupby_keys = KeySet.from_dict(
...     {"A": ["0", "1"], "B": [0, 1, 2]}
... )
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  0      0
1  0  1      1
2  0  2      0
3  1  0      1
4  1  1      0
5  1  2      1
- Answering a query with a multi-column domain and structural zeros:
>>> # Suppose it is known that A and B cannot be equal. This set of
>>> # groupby keys prevents those impossible values from being computed.
>>> keys_df = pd.DataFrame({
...     "A": ["0", "0", "1", "1"],
...     "B": [1, 2, 0, 2],
... })
>>> groupby_keys = KeySet.from_dataframe(spark.createDataFrame(keys_df))
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  1      1
1  0  2      0
2  1  0      1
3  1  2      1
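When the impossible combinations follow a simple rule, the explicit key list can be generated instead of typed by hand. The sketch below uses only the standard library to enumerate every (A, B) pair and keep those where the two values differ. The function name `keys_without_equal_pairs` is hypothetical, and comparing `int(a)` with `b` is an assumption based on A holding numeric strings and B holding integers in this example. The resulting dictionary has the same shape as the one passed to pd.DataFrame above.

```python
from itertools import product


def keys_without_equal_pairs(a_values, b_values):
    """Hypothetical helper: column lists for all (A, B) pairs with A != B."""
    pairs = [
        (a, b)
        for a, b in product(a_values, b_values)
        # A holds numeric strings and B holds ints in this example, so
        # convert before comparing (an assumption about the data).
        if int(a) != b
    ]
    return {"A": [a for a, _ in pairs], "B": [b for _, b in pairs]}


key_columns = keys_without_equal_pairs(["0", "1"], [0, 1, 2])
print(key_columns)
# {'A': ['0', '0', '1', '1'], 'B': [1, 2, 0, 2]}
```

Generating the keys this way keeps the structural-zero rule in one place, which may be easier to maintain than a hand-written list as the column domains grow.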