
from tmlt.analytics import QueryBuilder
QueryBuilder.count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)

Returns a count_distinct query ready to be evaluated.


Differentially private counts may return values that are not possible for a non-DP query, including negative values. You can enforce non-negativity once the query returns its results; see the example below.


>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .count_distinct()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count_distinct
0               3
>>> # Ensuring all results are non-negative
>>> import pyspark.sql.functions as sf
>>> answer = answer.withColumn(
...     "count_distinct", sf.when(
...         sf.col("count_distinct") < 0, 0
...     ).otherwise(
...         sf.col("count_distinct")
...     )
... )
>>> answer.toPandas()
   count_distinct
0               3
Parameters:

  • columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.

  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.

  • mechanism (CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

  • cols (Optional[List[str]]) – Deprecated; use columns instead.
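
The default naming rule for the output column can be sketched in plain Python. This is only an illustration of the documented behavior: `default_output_name` is a hypothetical helper, not part of tmlt.analytics, and it assumes the column names are joined with a comma and a space, as in "count_distinct(A, B, C)".

```python
from typing import List, Optional


def default_output_name(columns: Optional[List[str]] = None) -> str:
    """Hypothetical helper mirroring the documented default for `name`."""
    if not columns:
        # No columns given: the query counts distinct rows.
        return "count_distinct"
    # Columns given: list them inside the parentheses, in order.
    return f"count_distinct({', '.join(columns)})"


print(default_output_name())                 # count_distinct
print(default_output_name(["A", "B", "C"]))  # count_distinct(A, B, C)
```

Passing an explicit `name` to `count_distinct` avoids depending on this default when downstream code refers to the column by name.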

Return type:


from tmlt.analytics import GroupedQueryBuilder
GroupedQueryBuilder.count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)

Returns a Query that counts distinct values within each group.


>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .count_distinct(["B", "X"])
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count_distinct(B, X)
0  0                     1
1  1                     2
Parameters:

  • columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.

  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.

  • mechanism (CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

  • cols (Optional[List[str]]) – Deprecated; use columns instead.
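
To sanity-check results obtained with an infinite privacy budget, it can help to compute the exact (non-private) answer by hand. The sketch below does so in plain Python on a copy of the example table. `exact_count_distinct` and `ROWS` are hypothetical names introduced for illustration, and the value types in `ROWS` are assumed. It adds no noise and does not use tmlt.analytics.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional

# The example table from this page, as plain dictionaries (types assumed).
ROWS = [
    {"A": "0", "B": 1, "X": 0},
    {"A": "0", "B": 1, "X": 0},
    {"A": "1", "B": 0, "X": 1},
    {"A": "1", "B": 2, "X": 1},
]


def exact_count_distinct(
    rows: List[dict],
    columns: Optional[List[str]] = None,
    group_by: Optional[str] = None,
) -> Dict[Hashable, int]:
    """Exact distinct count per group; the key is None when ungrouped."""
    groups = defaultdict(set)
    for row in rows:
        # With no columns given, every column takes part: distinct rows.
        names = columns if columns else sorted(row)
        key = row[group_by] if group_by is not None else None
        groups[key].add(tuple(row[c] for c in names))
    return {key: len(values) for key, values in groups.items()}


print(exact_count_distinct(ROWS))                            # {None: 3}
print(exact_count_distinct(ROWS, ["B", "X"], group_by="A"))  # {'0': 1, '1': 2}
```

These values match the example outputs on this page. Unlike a query grouped by a KeySet, this sketch reports only the groups that actually occur in the data.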

Return type:


from tmlt.analytics import CountDistinctMechanism
class tmlt.analytics.CountDistinctMechanism(value)

Bases: Enum

An enumeration of the possible mechanisms used for the count_distinct aggregation.

Currently, the count_distinct() aggregation uses an additive noise mechanism to achieve differential privacy.


DEFAULT

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.


LAPLACE

Double-sided geometric noise is used.


GAUSSIAN

The discrete Gaussian mechanism is used. Not compatible with pure DP.
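
To give an intuition for why a noisy count can come out negative, here is a minimal sketch of double-sided geometric noise using only the standard library. It relies on the textbook construction (the difference of two independent geometric variables with ratio exp(-epsilon / sensitivity)) and is not the library's implementation. `sample_double_sided_geometric` and its parameters are hypothetical names.

```python
import math
import random


def sample_double_sided_geometric(epsilon, sensitivity=1, rng=random):
    """Draw integer noise with P(k) proportional to alpha ** abs(k)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if math.isinf(epsilon):
        # Infinite budget: no noise is needed.
        return 0
    alpha = math.exp(-epsilon / sensitivity)

    def geometric():
        # Number of failures before the first success, with success
        # probability 1 - alpha, sampled by inverse transform.
        u = 1.0 - rng.random()  # uniform on (0, 1]
        return int(math.floor(math.log(u) / math.log(alpha)))

    # The difference of two i.i.d. geometric draws is symmetric around 0.
    return geometric() - geometric()


true_count = 3
noisy_count = true_count + sample_double_sided_geometric(epsilon=0.5)
# Clamping after the fact, as in the count_distinct example on this page.
print(max(0, noisy_count))
```

Because the noise is symmetric around zero, small true counts combined with a small epsilon regularly yield negative noisy values, which is why post-hoc clamping is sometimes applied.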