QueryBuilder.sum
from tmlt.analytics import QueryBuilder
- QueryBuilder.sum(column, low, high, name=None, mechanism=SumMechanism.DEFAULT)
Returns a sum query ready to be evaluated.
Note
If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.
Note
Regarding the clamping bounds:
The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.
More information can be found in the Numerical aggregations tutorial.
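The trade-off described in this note can be illustrated with a small, library-independent sketch. It assumes the textbook Laplace mechanism for a sum under an add-or-remove-one-row neighbouring relation, where the noise scale is max(|low|, |high|) / epsilon. The helper names clamp, clamped_sum, and laplace_scale are hypothetical and are not part of tmlt.analytics, whose internal noise calibration may differ.

```python
from typing import Iterable


def clamp(value: float, low: float, high: float) -> float:
    """Force a single value into the closed range [low, high]."""
    return max(low, min(high, value))


def clamped_sum(values: Iterable[float], low: float, high: float) -> float:
    """Sum the values after clamping each one to [low, high]."""
    if not low < high:
        raise ValueError("low must be less than high")
    return sum(clamp(v, low, high) for v in values)


def laplace_scale(low: float, high: float, epsilon: float) -> float:
    """Textbook Laplace noise scale for a clamped sum (an assumption).

    Adding or removing one row changes the clamped sum by at most
    max(|low|, |high|), so the scale grows with the bounds.
    """
    return max(abs(low), abs(high)) / epsilon


column_b = [1, 0, 2, 50]  # 50 is an outlier

# Narrow bounds: the outlier is clamped to 2, and little noise is needed.
narrow = (clamped_sum(column_b, 0, 2), laplace_scale(0, 2, epsilon=1.0))

# Wide bounds: the outlier is kept in full, but the noise scale is 25x larger.
wide = (clamped_sum(column_b, 0, 50), laplace_scale(0, 50, epsilon=1.0))
```

Under these assumptions, the narrow bounds bias the true sum (the outlier contributes 2 instead of 50) in exchange for a much smaller noise scale.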
Example
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a sum query
>>> query = (
...     QueryBuilder("my_private_data")
...     .sum(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_sum
0      3
- Parameters:
column (str) – The column to compute the sum over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_sum".
mechanism (SumMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.
- Return type:
from tmlt.analytics import GroupedQueryBuilder
- GroupedQueryBuilder.sum(column, low, high, name=None, mechanism=SumMechanism.DEFAULT)
Returns a query that computes the sum for each group, ready to be evaluated.
Note
If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.
Note
Regarding the clamping bounds:
The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.
More information can be found in the Numerical aggregations tutorial.
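The notes above describe preprocessing that happens before a grouped sum is computed. A plain-Python sketch of the same order of operations (drop null, NaN, and infinite values; clamp; then sum per group key) follows. The function grouped_clamped_sum and its arguments are hypothetical, no noise is added, and this is not how tmlt.analytics implements the query. It also assumes that a requested group key with no matching rows yields a sum of 0.

```python
import math
from typing import Dict, Iterable, List, Optional, Tuple


def grouped_clamped_sum(
    rows: Iterable[Tuple[str, Optional[float]]],
    keys: List[str],
    low: float,
    high: float,
) -> Dict[str, float]:
    """Noise-free illustration of a grouped, clamped sum."""
    if not low < high:
        raise ValueError("low must be less than high")
    # Every requested key appears in the output, even with no matching rows.
    sums: Dict[str, float] = {key: 0 for key in keys}
    for key, value in rows:
        if key not in sums:
            continue  # rows whose key was not requested are ignored
        if value is None or math.isnan(value) or math.isinf(value):
            continue  # mirrors dropping null, NaN, and infinite values
        sums[key] += max(low, min(high, value))
    return sums


rows = [("0", 1), ("1", 0), ("1", 2), ("1", float("nan")), ("0", float("inf"))]
result = grouped_clamped_sum(rows, keys=["0", "1", "2"], low=0, high=2)
```

Here the NaN and infinite rows are skipped rather than clamped, so the groups "0" and "1" receive the same sums as in the example below, and the empty group "2" is still reported.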
Example
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a groupby sum query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .sum(column="B", low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_sum
0  0      1
1  1      2
- Parameters:
column (str) – The column to compute the sum over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_sum".
mechanism (SumMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.
- Return type:
from tmlt.analytics import SumMechanism
- class tmlt.analytics.SumMechanism(value)
Bases: Enum
Possible mechanisms for the sum() aggregation.
Currently, the sum() aggregation uses an additive noise mechanism to achieve differential privacy.
- DEFAULT = 1
The framework automatically selects an appropriate mechanism. This choice might change over time as additional optimizations are added to the library.
- LAPLACE = 2
Laplace and/or double-sided geometric noise is used, depending on the column type.
- GAUSSIAN = 3
Discrete and/or continuous Gaussian noise is used, depending on the column type. Not compatible with pure DP.
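The LAPLACE entry distinguishes between noise for different column types. The sketch below shows textbook samplers for both cases using only the standard library: continuous Laplace noise as the difference of two exponential draws, and double-sided geometric noise as the difference of two geometric draws. The function names and the scale parameterization are assumptions made for illustration. This is not the library's implementation, and Python's random module is not suitable for real privacy guarantees.

```python
import math
import random


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Continuous Laplace(0, scale) noise, e.g. for a float column."""
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def sample_geometric(alpha: float, rng: random.Random) -> int:
    """Failures before the first success: P(k) = (1 - alpha) * alpha**k."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.floor(math.log(u) / math.log(alpha)))


def sample_double_sided_geometric(scale: float, rng: random.Random) -> int:
    """Integer noise with P(k) proportional to exp(-|k| / scale)."""
    alpha = math.exp(-1 / scale)
    return sample_geometric(alpha, rng) - sample_geometric(alpha, rng)


rng = random.Random(0)  # seeded only to make the illustration repeatable
true_sum = 3
# Integer column: the noisy answer stays an integer.
noisy_int_sum = true_sum + sample_double_sided_geometric(scale=2.0, rng=rng)
# Float column: the noisy answer is a real number.
noisy_float_sum = float(true_sum) + sample_laplace(scale=2.0, rng=rng)
```

The point of the integer variant is that the released value keeps the type of the underlying column; the GAUSSIAN entry follows the same split with discrete and continuous Gaussian noise.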