Tutorial 2: Working with privacy budgets

In our first tutorial, we saw how to run a simple aggregation query on private data. When loading the private data into a Session, we had to specify a privacy budget. This raises two kinds of questions.

  1. What is a privacy budget, what formal guarantee does it provide, and how can we choose the value of this parameter?

  2. How do we work with privacy budgets using Tumult Analytics, and what is the privacy promise of the interface?

In this tutorial, we will focus on the second question; the only thing you will need to know is that a smaller privacy budget translates to a stronger privacy guarantee. If you want to first learn more about privacy budget fundamentals, you can consult the following resources.

  • If you’re interested in understanding the formal guarantee of privacy budgets, you can consult this explainer. It presents an intuitive interpretation of differential privacy using betting odds, and formalizes it using a Bayesian attacker model.

  • If you would like to know what privacy parameters are commonly used for data publication, you can consult this list of real-world use cases.

These are only optional reading! The one-sentence summary above (smaller budget = better privacy) is enough to follow the rest of this tutorial. Let’s get started!

Setup

Just like earlier, we import Python packages…

import os
from pyspark import SparkFiles
from pyspark.sql import SparkSession
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.query_builder import QueryBuilder
from tmlt.analytics.session import Session

… and download the dataset, in case we haven’t already done so.

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.addFile(
    "https://tumult-public.s3.amazonaws.com/library-members.csv"
)
members_df = spark.read.csv(
    SparkFiles.get("library-members.csv"), header=True, inferSchema=True
)

Creating a Session with a fixed budget

Let’s initialize our Session. We will allocate a fixed privacy budget of epsilon=2.5 to it, using the classical (“pure”) differential privacy definition.

budget = PureDPBudget(epsilon=2.5)  # maximum budget consumed in the Session
session = Session.from_dataframe(
    privacy_budget=budget,
    source_id="members",
    dataframe=members_df,
)

Initializing a Session with a finite privacy budget gives a simple interface promise: all queries evaluated on this Session, taken together, will provide differentially private results with at most epsilon=2.5. This parameter measures the potential privacy loss: a lower epsilon gives a stricter limit on the privacy loss, and therefore a higher level of protection. Here, the corresponding interface promise is a privacy guarantee: it enforces a minimum level of protection on the private data. For more information about this promise and its caveats, you can consult the relevant topic guide.

Now, how does the Session enforce that guarantee in practice?

Consuming the budget by evaluating queries

Each time we evaluate a query in our Session, we will consume some of the overall budget, and we will need to specify how much of this budget we want to consume. Let’s start with a simple example: how many minors are members of the library? We will answer that question using a simple filter query, consuming epsilon=1 out of our total budget.

minor_query = QueryBuilder("members").filter("age < 18").count()
minor_count = session.evaluate(
    minor_query,
    privacy_budget=PureDPBudget(epsilon=1),
)
minor_count.show()
+-----+
|count|
+-----+
|13817|
+-----+

Now, evaluating that query consumed some of our privacy budget. To see this, we can consult the Session’s remaining_privacy_budget:

print(session.remaining_privacy_budget)
PureDPBudget(epsilon=1.5)

We consumed a budget of 1 out of a total of 2.5, so there is 1.5 left. Let’s try another query: how many library members have a Master’s degree or a higher level of formal education?

edu_query = (
    QueryBuilder("members")
    .filter("education_level IN ('masters-degree', 'doctorate-professional')")
    .count()
)
edu_count = session.evaluate(
    edu_query,
    privacy_budget=PureDPBudget(epsilon=1),
)
edu_count.show()
+-----+
|count|
+-----+
| 4765|
+-----+

You can probably guess how much budget we have left:

print(session.remaining_privacy_budget)
PureDPBudget(epsilon=0.5)

Now, what happens if we try to consume more budget than what we have left?

total_count = session.evaluate(
    QueryBuilder("members").count(),
    privacy_budget=PureDPBudget(epsilon=1),
)
Traceback (most recent call last):
RuntimeError: Cannot answer query without exceeding privacy budget: it needs
approximately 1.000, but the remaining budget is approximately 0.500 (difference: 5.000e-01)

The evaluate call raises a RuntimeError. This is how the Session enforces its privacy promise: it makes sure that the queries cannot consume more than the initial privacy budget.

Note that since the call to evaluate was rejected by the Session, it did not consume any privacy budget.

print(session.remaining_privacy_budget)
PureDPBudget(epsilon=0.5)
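
The accounting behaviour shown so far can be mimicked in a few lines of plain Python. The sketch below is purely illustrative: ToyBudgetLedger is a hypothetical class written for this explanation, not part of Tumult Analytics, and it ignores everything a real Session does apart from adding up epsilons. It assumes that the costs of successive queries simply add up, and that a request exceeding the remaining budget is rejected without being charged, as in the example above.

```python
from fractions import Fraction


class ToyBudgetLedger:
    """Hypothetical stand-in for a Session's budget accounting (not a Tumult API)."""

    def __init__(self, total_epsilon):
        # Fractions avoid floating-point drift when subtracting repeatedly.
        self.remaining = Fraction(str(total_epsilon))

    def spend(self, epsilon):
        """Deduct epsilon, or raise without deducting anything if it does not fit."""
        cost = Fraction(str(epsilon))
        if cost > self.remaining:
            raise RuntimeError(
                f"needs {float(cost):.3f}, but only {float(self.remaining):.3f} remains"
            )
        self.remaining -= cost
        return self.remaining


ledger = ToyBudgetLedger(2.5)
ledger.spend(1)  # first query: 1.5 left
ledger.spend(1)  # second query: 0.5 left
try:
    ledger.spend(1)  # third query does not fit and is rejected
    rejected = False
except RuntimeError:
    rejected = True
print(rejected, float(ledger.remaining))  # prints: True 0.5
```

The rejected request leaves the ledger untouched, which mirrors the unchanged remaining_privacy_budget printed above.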

If we don’t consume this leftover budget, that’s OK: the privacy promise is still enforced. But of course, this is somewhat “wasteful”: we could have answered more queries, or allocated more budget to answer previous queries more accurately. Here, let us simply modify the last query to use all the budget that we have left.

total_count = session.evaluate(
    QueryBuilder("members").count(),
    privacy_budget=session.remaining_privacy_budget,
)
total_count.show()
+-----+
|count|
+-----+
|54215|
+-----+
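
To make the link between budget and accuracy concrete, here is a minimal sketch based on the textbook Laplace mechanism, in which a count is released with added noise of scale 1/epsilon. This is an assumption made for illustration only: this tutorial does not say which noise distribution Tumult Analytics uses for counts, and laplace_scale and noise_std are hypothetical helpers, not library functions.

```python
import math


def laplace_scale(epsilon, sensitivity=1.0):
    """Scale b of the noise in the textbook Laplace mechanism: b = sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def noise_std(epsilon, sensitivity=1.0):
    """Standard deviation of Laplace noise with scale b, which is sqrt(2) * b."""
    return math.sqrt(2) * laplace_scale(epsilon, sensitivity)


# Compare the budgets used in this tutorial (assumed sensitivity of 1 for a count).
for eps in (0.5, 1.0, 2.5):
    print(f"epsilon={eps}: scale={laplace_scale(eps):.2f}, std={noise_std(eps):.2f}")
```

Under this model, halving epsilon doubles the noise scale, so a count evaluated with epsilon=0.5 would be about twice as noisy as one evaluated with epsilon=1.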

Now, suppose you have a fixed privacy budget, and your task is to publish the results of multiple queries. How should you split the privacy budget across the different queries? To learn more about this question, you can consult our longer topic guide about privacy budget fundamentals.
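
As a starting point, one simple strategy is to divide the budget in proportion to how much accuracy each query needs. The sketch below assumes that per-query costs add up to the total, as they did in the examples above. The split_budget helper, the query names, and the weights are all hypothetical, and nothing here is part of the Tumult Analytics API.

```python
from fractions import Fraction


def split_budget(total_epsilon, weights):
    """Split total_epsilon across queries in proportion to their weights."""
    total = Fraction(str(total_epsilon))
    weight_sum = sum(Fraction(str(w)) for w in weights.values())
    if total <= 0 or weight_sum <= 0:
        raise ValueError("the budget and the sum of weights must be positive")
    # Exact fractions guarantee that the shares add up to the total.
    return {
        name: total * Fraction(str(w)) / weight_sum
        for name, w in weights.items()
    }


# Hypothetical plan: the total count is weighted twice as much as the other two.
shares = split_budget(2.5, {"minors": 1, "graduates": 1, "total": 2})
for name, eps in shares.items():
    print(name, float(eps))
```

Each share could then be converted to a float and wrapped in a PureDPBudget before being passed to session.evaluate, in the same way as the fixed epsilon values used earlier.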