Tutorial 1: First steps with Tumult Analytics#
In this first tutorial, we will demonstrate how to load data, run a simple aggregation query, and get our first differentially private results. You can run this tutorial (as well as the next ones) as you go: simply follow the Installation instructions, and use the copy/paste button of each code block to reproduce it.
Throughout these tutorials, we’ll imagine we are the data protection officer for a fictional institution, the Pierre-Simon Laplace Public Library. We want to publish statistics about how our library serves the needs of its community. Of course, we have the privacy of our members at heart, so we want to make sure that the data we release does not reveal anything about specific people.
This is a perfect use case for differential privacy: it will allow us to publish useful insights about groups, while protecting data about individuals. Importantly, Tumult Analytics does not require you to have an in-depth understanding of differential privacy. In these tutorials, we will gloss over all the details of what happens behind the scenes, and focus on how to accomplish common tasks. To learn more about the trade-offs involved in parameter setting and mechanism design, you can consult our Topic guides.
First, let’s import some Python packages.
import os

from pyspark import SparkFiles
from pyspark.sql import SparkSession

from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.query_builder import QueryBuilder
from tmlt.analytics.session import Session
Next, we initialize the Spark session.
spark = SparkSession.builder.getOrCreate()
When using Java 11, some additional configuration must be passed to Spark, so the previous code block would instead be:
spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
    .getOrCreate()
)
Then, we need to load our first dataset, containing information about the
members of our public library. Here, we get the data from a public
repository, and load it into a Spark DataFrame.

spark.sparkContext.addFile(
    "https://tumult-public.s3.amazonaws.com/library-members.csv"
)
members_df = spark.read.csv(
    SparkFiles.get("library-members.csv"), header=True, inferSchema=True
)
For more information about loading data files into Spark, see the Spark data sources documentation.
Creating a Session#
To compute queries using Tumult Analytics, we must first encapsulate the data
in a Session. The following snippet instantiates a Session on a Spark
DataFrame with our private data, using the from_dataframe method.

session = Session.from_dataframe(
    privacy_budget=PureDPBudget(3),
    source_id="members",
    dataframe=members_df,
)
Note that in addition to the data itself, we needed to provide the Session builder with a couple of additional pieces of information.
- privacy_budget specifies what privacy guarantee this Session will provide. We will discuss this in more detail in the next tutorial.
- source_id is the identifier for the DataFrame. We will then use it to refer to this DataFrame when constructing queries.
For a more complete description of the various ways a Session can be initialized, you can consult the relevant topic guide.
Evaluating queries in a Session#
Now that we have our Session, we can ask our first query. How many members does
our library have? To answer this question with a query, we will use the
QueryBuilder class.

count_query = QueryBuilder("members").count()

The first part, QueryBuilder("members"), specifies which private data we
want to run the query on; this corresponds to the source_id parameter from
earlier. Then, the count() statement requests the total number of records in
the table.
After creating our query, we need to actually run it on the data, using the
evaluate method of our Session.
This requires us to allocate some privacy budget to this evaluation: here, let’s
evaluate the query with differential privacy, using ε=1.
total_count = session.evaluate(
    count_query,
    privacy_budget=PureDPBudget(epsilon=1),
)
The results of the query are returned as a Spark DataFrame. We can see them
using the show() method of this DataFrame.

total_count.show()

+-----+
|count|
+-----+
|54215|
+-----+
If you’re running this code along with the tutorial, you might see different values! This is a central characteristic of differential privacy: it injects some randomization (we call this noise) in the execution of the query. Let’s evaluate the same query again to demonstrate this.
total_count = session.evaluate(
    count_query,
    privacy_budget=PureDPBudget(1),
)
total_count.show()
+-----+
|count|
+-----+
|54218|
+-----+
The query result is slightly different from the previous one.
The noise added to the computation of the query can depend on the privacy parameters, the type of aggregation, and the data itself. But in many cases, the result will still convey accurate insights about the original data. Here, that’s the case: we can verify this by running a count query directly on the original dataframe, which gives us the true result.
total_count = members_df.count()
print(total_count)
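To build some intuition for where this randomness comes from, here is a small plain-Python sketch of the Laplace mechanism, a classic way to achieve pure differential privacy for counting queries. This is only an illustration of the general technique, not Tumult Analytics' internal implementation, and the true count used below is a made-up stand-in.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = random.random() - 0.5  # uniform in [-0.5, 0.5)
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count: int, epsilon: float) -> int:
    # Adding or removing one person's record changes a count by at most 1
    # (sensitivity 1), so the noise scale is 1 / epsilon.
    return round(true_count + laplace_noise(1.0 / epsilon))

true_count = 54213  # hypothetical true number of members
for _ in range(3):
    # Each evaluation adds fresh noise, so repeated runs differ slightly.
    print(noisy_count(true_count, epsilon=1.0))
```

Notice the trade-off this sketch makes visible: a larger ε means a smaller noise scale, and therefore results closer to the true count, at the cost of a weaker privacy guarantee.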
We just evaluated our first differentially private query using Tumult Analytics. In the next tutorial, we’ll say a bit more about how privacy budgets work in practice, and evaluate some more complicated queries.