QueryBuilder.count_distinct
from tmlt.analytics import QueryBuilder
- QueryBuilder.count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)
Returns a count_distinct query ready to be evaluated.
Note
Differentially private counts may return values that are not possible for a non-DP query, including negative values. You can enforce non-negativity once the query returns its results; see the example below.
Example
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .count_distinct()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count_distinct
0               3
>>> # Ensuring all results are non-negative
>>> import pyspark.sql.functions as sf
>>> answer = answer.withColumn(
...     "count_distinct", sf.when(
...         sf.col("count_distinct") < 0, 0
...     ).otherwise(
...         sf.col("count_distinct")
...     )
... )
>>> answer.toPandas()
   count_distinct
0               3
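For intuition about what this query measures, here is a minimal plain-Python sketch of the exact (non-private) distinct count and of the non-negativity clamp. It makes no Tumult Analytics calls. The helper names `exact_count_distinct` and `clamp_non_negative` are hypothetical and not part of the library, and the string cell types are an assumption; the rows mirror the example table.

```python
from typing import Dict, List, Optional

# Rows mirroring the example table (columns A, B, X).
# Storing the cells as strings is an assumption made for this sketch.
ROWS: List[Dict[str, str]] = [
    {"A": "0", "B": "1", "X": "0"},
    {"A": "0", "B": "1", "X": "0"},
    {"A": "1", "B": "0", "X": "1"},
    {"A": "1", "B": "2", "X": "1"},
]


def exact_count_distinct(
    rows: List[Dict[str, str]], columns: Optional[List[str]] = None
) -> int:
    """Hypothetical helper: exact distinct count, with no noise added."""
    if not rows:
        return 0
    # With no columns given, every column is used, so whole rows are compared.
    selected = columns if columns else sorted(rows[0])
    return len({tuple(row[c] for c in selected) for row in rows})


def clamp_non_negative(noisy_count: int) -> int:
    """Hypothetical helper: replace a negative noisy count with 0."""
    return max(noisy_count, 0)


print(exact_count_distinct(ROWS))         # number of distinct rows
print(exact_count_distinct(ROWS, ["A"]))  # number of distinct values of A
print(clamp_non_negative(-2))             # a negative noisy count becomes 0
```

The clamp is applied after the query has been answered, so it only post-processes the released value and does not consume additional privacy budget.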
- Parameters:
columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.
name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.
mechanism (CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.
cols (Optional[List[str]]) – Deprecated; use columns instead.
- Return type:
from tmlt.analytics import GroupedQueryBuilder
- GroupedQueryBuilder.count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)
Returns a Query that computes a count_distinct for each group.
Example
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .count_distinct(["B", "X"])
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count_distinct(B, X)
0  0                     1
1  1                     2
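The grouped result can be checked by hand with a small plain-Python sketch that counts distinct (B, X) pairs per value of A and builds the default output column name described under Parameters. The helpers `default_column_name` and `grouped_count_distinct` are hypothetical illustrations, not library functions, and the sketch assumes that only the listed group keys are reported and that a listed key with no matching rows has an exact count of 0.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

# Rows mirroring the example table; string cell types are an assumption.
ROWS: List[Dict[str, str]] = [
    {"A": "0", "B": "1", "X": "0"},
    {"A": "0", "B": "1", "X": "0"},
    {"A": "1", "B": "0", "X": "1"},
    {"A": "1", "B": "2", "X": "1"},
]


def default_column_name(columns: Optional[Sequence[str]] = None) -> str:
    """Hypothetical helper: default name of the aggregation column."""
    if not columns:
        return "count_distinct"
    return f"count_distinct({', '.join(columns)})"


def grouped_count_distinct(
    rows: List[Dict[str, str]],
    group_column: str,
    keys: Sequence[str],
    columns: Sequence[str],
) -> Dict[str, int]:
    """Hypothetical helper: exact distinct count of `columns` per group key."""
    seen: Dict[str, Set[Tuple[str, ...]]] = {key: set() for key in keys}
    for row in rows:
        key = row[group_column]
        if key in seen:  # rows whose key is not listed are ignored
            seen[key].add(tuple(row[c] for c in columns))
    return {key: len(values) for key, values in seen.items()}


counts = grouped_count_distinct(ROWS, "A", ["0", "1"], ["B", "X"])
print(default_column_name(["B", "X"]), counts)
```

With an infinite privacy budget, as in the example, the private answer is expected to coincide with these exact counts; with a finite budget each group's count would be perturbed by noise.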
- Parameters:
columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.
name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.
mechanism (CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.
cols (Optional[List[str]]) – Deprecated; use columns instead.
- Return type:
from tmlt.analytics import CountDistinctMechanism
- class tmlt.analytics.CountDistinctMechanism(value)
Bases: Enum
Enumerating the possible mechanisms used for the count_distinct aggregation.
Currently, the count_distinct() aggregation uses an additive noise mechanism to achieve differential privacy.
- DEFAULT = 1
The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.
- LAPLACE = 2
Double-sided geometric noise is used.
- GAUSSIAN = 3
The discrete Gaussian mechanism is used. Not compatible with pure DP.
The discrete Gaussian mechanism is used. Not compatible with pure DP.
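To illustrate why an additive noise mechanism can produce negative counts, the following is a minimal standard-library sketch of double-sided geometric noise. It is not the library's implementation: the function name `sample_double_sided_geometric`, the use of `epsilon` directly as the noise parameter (which assumes a sensitivity of 1), and the sampling method are all assumptions made for illustration.

```python
import math
import random


def sample_double_sided_geometric(epsilon: float, rng: random.Random) -> int:
    """Draw integer noise Z with P(Z = z) proportional to exp(-epsilon * |z|).

    Hypothetical helper; assumes the query has sensitivity 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    alpha = math.exp(-epsilon)
    if alpha == 0.0:
        # Infinite (or extremely large) epsilon: no noise is added.
        return 0

    def geometric() -> int:
        # Inverse-transform sampling: number of failures before the first
        # success, where each trial succeeds with probability 1 - alpha.
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        return math.floor(math.log(u) / math.log(alpha))

    # The difference of two i.i.d. geometric draws is double-sided geometric.
    return geometric() - geometric()


rng = random.Random(0)
true_count = 3
noisy_counts = [
    true_count + sample_double_sided_geometric(0.5, rng) for _ in range(10)
]
print(noisy_counts)  # integer values scattered around the true count
```

Because the noise is symmetric around zero and unbounded, a small true count combined with a small `epsilon` can yield a negative noisy count, which is the situation the non-negativity post-processing addresses.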