QueryBuilder.count_distinct

from tmlt.analytics import QueryBuilder
QueryBuilder.count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)

Returns a count_distinct query ready to be evaluated.

Note

Differentially private counts may return values that are not possible for a non-DP query, including negative values. You can enforce non-negativity once the query returns its results; see the example below.

Example

>>> from tmlt.analytics import AddOneRow, PureDPBudget, QueryBuilder, Session
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .count_distinct()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count_distinct
0               3
>>> # Ensuring all results are non-negative
>>> import pyspark.sql.functions as sf
>>> answer = answer.withColumn(
...     "count_distinct", sf.when(
...         sf.col("count_distinct") < 0, 0
...     ).otherwise(
...         sf.col("count_distinct")
...     )
... )
>>> answer.toPandas()
   count_distinct
0               3

Parameters:
  • columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.

  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.

  • mechanism (CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

  • cols (Optional[List[str]]) – Deprecated; use columns instead.

Return type:

Query
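To make the columns and name parameters concrete, the following is a minimal, noise-free sketch in plain Python. It does not use or reproduce the library's implementation: default_name and exact_count_distinct are hypothetical helpers introduced here only for illustration, and the rows mirror the example table above (with A stored as strings, which is an assumption).

```python
from typing import Any, Dict, List, Optional, Sequence

Row = Dict[str, Any]


def default_name(columns: Optional[List[str]] = None) -> str:
    """Hypothetical helper mirroring the documented default output column name."""
    if not columns:
        return "count_distinct"
    return f"count_distinct({', '.join(columns)})"


def exact_count_distinct(
    rows: Sequence[Row], columns: Optional[List[str]] = None
) -> int:
    """Noise-free distinct count over the chosen columns (every column if none given)."""
    if not rows:
        return 0
    keys = columns if columns else sorted(rows[0].keys())
    # A set of tuples keeps one entry per distinct combination of values.
    return len({tuple(row[k] for k in keys) for row in rows})


# Same four rows as the example table above.
ROWS = [
    {"A": "0", "B": 1, "X": 0},
    {"A": "0", "B": 1, "X": 0},
    {"A": "1", "B": 0, "X": 1},
    {"A": "1", "B": 2, "X": 1},
]

print(default_name())                     # count_distinct
print(default_name(["B", "X"]))           # count_distinct(B, X)
print(exact_count_distinct(ROWS))         # 3 (distinct rows)
print(exact_count_distinct(ROWS, ["A"]))  # 2 (distinct values of A)
```

With a finite privacy budget, the value returned by the library is this exact count plus noise, so it will generally differ from the numbers printed here.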

from tmlt.analytics import GroupedQueryBuilder
GroupedQueryBuilder.count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)

Returns a Query that computes a count_distinct aggregation for each group.

Example

>>> from tmlt.analytics import AddOneRow, KeySet, PureDPBudget, QueryBuilder, Session
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .count_distinct(["B", "X"])
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count_distinct(B, X)
0  0                     1
1  1                     2

Parameters:
  • columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.

  • name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.

  • mechanism (CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

  • cols (Optional[List[str]]) – Deprecated; use columns instead.

Return type:

Query
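The grouped result in the example above can be checked by hand with a small noise-free sketch in plain Python. This is an illustration only, not the library's implementation: exact_grouped_count_distinct is a hypothetical helper, and the sketch assumes that the output groups are exactly the listed keys (rows with any other key are ignored, and a listed key with no rows counts as 0).

```python
from typing import Any, Dict, List, Sequence, Set, Tuple

Row = Dict[str, Any]


def exact_grouped_count_distinct(
    rows: Sequence[Row],
    group_column: str,
    keys: List[Any],
    columns: List[str],
) -> Dict[Any, int]:
    """Noise-free distinct count of `columns` within each listed group key."""
    # Start every listed key with an empty set so absent groups report 0.
    seen: Dict[Any, Set[Tuple[Any, ...]]] = {key: set() for key in keys}
    for row in rows:
        key = row[group_column]
        if key in seen:  # assumption: rows with unlisted keys are ignored
            seen[key].add(tuple(row[c] for c in columns))
    return {key: len(values) for key, values in seen.items()}


# Same four rows as the example table above, with A stored as strings.
ROWS = [
    {"A": "0", "B": 1, "X": 0},
    {"A": "0", "B": 1, "X": 0},
    {"A": "1", "B": 0, "X": 1},
    {"A": "1", "B": 2, "X": 1},
]

print(exact_grouped_count_distinct(ROWS, "A", ["0", "1"], ["B", "X"]))
# {'0': 1, '1': 2}
```

Group "0" contains the single distinct pair (1, 0), while group "1" contains (0, 1) and (2, 1), which matches the output shown for the infinite-budget query.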

from tmlt.analytics import CountDistinctMechanism
class tmlt.analytics.CountDistinctMechanism(value)

Bases: Enum

Enumerates the possible mechanisms used for the count_distinct aggregation.

Currently, the count_distinct() aggregation uses an additive noise mechanism to achieve differential privacy.

DEFAULT = 1

The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.

LAPLACE = 2

Double-sided geometric noise, the integer-valued analogue of Laplace noise, is used.

GAUSSIAN = 3

The discrete Gaussian mechanism is used. Not compatible with pure DP.
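To illustrate what an additive noise mechanism does to a count, and why noisy counts can be negative, here is a minimal sketch of double-sided geometric noise in plain Python. It is not the library's implementation: two_sided_geometric and noisy_count are hypothetical helpers, a sensitivity of 1 is assumed, and sampling with floating-point arithmetic as done here is not suitable for real privacy guarantees.

```python
import math
import random


def two_sided_geometric(
    epsilon: float, rng: random.Random, sensitivity: int = 1
) -> int:
    """Sample integer noise with P(k) proportional to exp(-epsilon * |k| / sensitivity)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    alpha = math.exp(-epsilon / sensitivity)
    if alpha == 0.0:
        # Infinite (or extremely large) epsilon: no noise is added.
        return 0

    def geometric() -> int:
        # Inverse-transform sample of a geometric variable on {0, 1, 2, ...}.
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        return int(math.floor(math.log(u) / math.log(alpha)))

    # The difference of two i.i.d. geometric variables is two-sided geometric.
    return geometric() - geometric()


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> int:
    """Add double-sided geometric noise to an exact count (sensitivity 1 assumed)."""
    return true_count + two_sided_geometric(epsilon, rng)


rng = random.Random(0)
samples = [noisy_count(3, 0.5, rng) for _ in range(10_000)]
clamped = [max(0, s) for s in samples]  # enforce non-negativity after the fact

print("smallest noisy count:", min(samples))
print("smallest clamped count:", min(clamped))
```

Because the noise is symmetric around zero, a small true count such as 3 is regularly pushed below zero at small epsilon. Clamping the released value afterwards is post-processing of the noisy result and does not consume additional privacy budget.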