QueryBuilder.groupby

from tmlt.analytics import QueryBuilder

QueryBuilder.groupby(by)

Groups the query by the given set of keys, returning a GroupedQueryBuilder.

The aggregation returns exactly one row for each key in the KeySet. Keys that do not appear in the data still get a row, and values present in the data but absent from the KeySet are discarded.

More information can be found in the Group-by queries tutorial.

Warning

A KeySet containing null values for a column cannot be used on a table where that column has had null values dropped or replaced.

Examples

>>> import pandas as pd
>>> from tmlt.analytics import AddOneRow, KeySet, PureDPBudget, Session
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )

Answering a query with the exact groupby domain:

>>> groupby_keys = KeySet.from_dict({"A": ["0", "1"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2

Answering a query with omitted and added domain values:

>>> groupby_keys = KeySet.from_dict({"A": ["0", "2"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  2      0
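
The result above can be reproduced without noise in plain Python. This is only a sketch of the semantics described earlier (one output row per key, values outside the key set discarded, missing keys reported as zero). It does not use Tumult Analytics, and the helper name count_by_keys is hypothetical.

```python
from collections import Counter


def count_by_keys(values, keys):
    """Noise-free group-by count restricted to a fixed list of keys (illustrative only)."""
    counts = Counter(values)
    # One output row per key: keys absent from the data get a count of 0,
    # and data values that are not keys (here "1") are never reported.
    return [(key, counts.get(key, 0)) for key in keys]


column_a = ["0", "1", "1"]  # column A of my_private_data
result = count_by_keys(column_a, ["0", "2"])
print(result)  # [('0', 1), ('2', 0)]
```

With a finite privacy budget the counts would be noisy, but the set of rows returned would still be determined by the keys alone.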

Answering a query with a multi-column domain:

>>> groupby_keys = KeySet.from_dict(
...    {"A": ["0", "1"], "B": [0, 1, 2]}
... )
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  0      0
1  0  1      1
2  0  2      0
3  1  0      1
4  1  1      0
5  1  2      1

Answering a query with a multi-column domain and structural zeros:

>>> # Suppose it is known that A and B cannot be equal. This set of
>>> # groupby keys prevents those impossible values from being computed.
>>> keys_df = pd.DataFrame({
...     "A": ["0", "0", "1", "1"],
...     "B": [1, 2, 0, 2],
... })
>>> groupby_keys = KeySet.from_dataframe(spark.createDataFrame(keys_df))
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  1      1
1  0  2      0
2  1  0      1
3  1  2      1
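
The four keys above are a subset of the six combinations covered by the dictionary-style domain in the previous example. The following plain-Python sketch shows that relationship. It does not call Tumult Analytics, and the variable names are illustrative assumptions.

```python
from itertools import product

a_values = ["0", "1"]
b_values = [0, 1, 2]

# A dictionary-style domain covers every combination of the column values.
full_domain = list(product(a_values, b_values))

# Structural zeros: drop the combinations known to be impossible (A equal to B).
restricted_domain = [(a, b) for a, b in full_domain if int(a) != b]

print(len(full_domain))   # 6
print(restricted_domain)  # [('0', 1), ('0', 2), ('1', 0), ('1', 2)]
```

Building the key set from an explicit list of allowed combinations means no rows are produced for the excluded pairs.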

Parameters: by (Union[KeySet, List[str], str]) – A KeySet which defines the columns to group on and the possible values for each column.
Return type: GroupedQueryBuilder
