Basics of error measurement#
Note
This tutorial uses features that are only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.
In previous tutorials, we saw how to write differentially private queries using the Tumult Analytics API. Differential privacy adds noise to the data and also involves truncation, so before publishing or sharing DP statistics, we typically want to evaluate how accurate these statistics are and carefully select the parameters used in our Tumult Analytics program.
That’s exactly what this tutorial series covers: measuring error and tuning parameters.
In this tutorial, we will walk you through the first step of the process of measuring and optimizing error: converting a simple program to have a fixed structure, and evaluating its error. Let’s get started!
Setup#
First, let’s import the Python packages we will use and download some sample data.
Throughout this tutorial section, we will use two tables: members
contains information about the members of our fictional public library, and nc-zip-codes
lists all the ZIP codes in North Carolina, where our library is located.
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark import SparkFiles
from pyspark.sql import DataFrame, SparkSession
from tmlt.analytics import (
AddOneRow,
KeySet,
PureDPBudget,
QueryBuilder,
Session,
SessionProgram,
)
from tmlt.tune import SessionProgramTuner, Tunable
spark = SparkSession.builder.getOrCreate()
spark.sparkContext.addFile(
"https://tumult-public.s3.amazonaws.com/demos/library/v2/members.csv"
)
members_df = spark.read.csv(
SparkFiles.get("members.csv"), header=True, inferSchema=True
)
# ZIP code data is based on https://worldpopulationreview.com/zips/north-carolina
spark.sparkContext.addFile(
"https://tumult-public.s3.amazonaws.com/nc-zip-codes.csv"
)
nc_zip_codes_df = spark.read.csv(
SparkFiles.get("nc-zip-codes.csv"), header=True, inferSchema=True
)
nc_zip_codes_df = nc_zip_codes_df.withColumnRenamed("Zip Code", "zip_code")
nc_zip_codes_df = nc_zip_codes_df.withColumn("zip_code", nc_zip_codes_df.zip_code.cast('string'))
nc_zip_codes_df = nc_zip_codes_df.fillna(0)
nc_zip_codes_df = nc_zip_codes_df.select("zip_code")
A simple query#
Suppose we want to calculate how many books have been borrowed from our library, broken down by ZIP code. We know how to do this, thanks to what we learned in previous tutorials. Below, we use two input Spark DataFrames: members_df, which contains the list of members of our fictional public library along with information such as how many books each one has borrowed, and nc_zip_codes_df, which contains a list of ZIP codes in North Carolina. The former contains sensitive data and needs to be protected; the latter is public.
session = Session.from_dataframe(
source_id="members",
dataframe=members_df,
protected_change=AddOneRow(),
privacy_budget=PureDPBudget(epsilon=3),
)
zip_code_keys = KeySet.from_dataframe(nc_zip_codes_df.select("zip_code"))
query = (
QueryBuilder("members")
.groupby(zip_code_keys)
.sum("books_borrowed", low=0, high=500)
)
books_by_zip_code = session.evaluate(query, session.remaining_privacy_budget)
If we run this program, we get a DataFrame with noisy sums, which we can plot using a visualization library. Let's consider only the ten ZIP codes in which the most books have been borrowed, and display their noisy sums as a bar chart.
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql.functions import col
sns.set_theme(style="whitegrid")
top_10_zipcodes = books_by_zip_code.orderBy(col("books_borrowed_sum").desc()).limit(10)
data_to_plot = top_10_zipcodes.toPandas()
g = sns.barplot(x="zip_code", y="books_borrowed_sum", data=data_to_plot, color="#1f77b4")
g.set_xticklabels(
data_to_plot["zip_code"], rotation=45, horizontalalignment="right"
)
plt.title("The ten ZIP codes in which most books have been borrowed")
plt.xlabel("ZIP code")
plt.ylabel("Total books borrowed")
plt.tight_layout()
plt.show()
However, it's hard to know how trustworthy these results are. Let's find out by computing the error of the query above. To get a sense of the "typical" error in these noisy statistics, we will compute the relative error for each ZIP code and then report the median of those errors.
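To make the metric concrete, here is a rough sketch of the relative error for a single ZIP code. The numbers are made up for illustration, and the library's exact definition may handle edge cases (such as a true sum of zero) differently.
true_sum = 400   # hypothetical non-private sum of books borrowed in one ZIP code
noisy_sum = 480  # hypothetical differentially private answer for the same ZIP code
relative_error = abs(noisy_sum - true_sum) / abs(true_sum)
print(relative_error)  # 0.2: the noisy answer is off by 20%
The median relative error is then simply the median of this quantity computed over all ZIP codes.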
From Session to SessionProgram#
To get an error report, we must first convert our Session
into a SessionProgram
. In particular, we will create a subclass of SessionProgram
, which defines both the structure of our Session
and the logic for evaluating queries within the Session.
class BooksByZipCodeProgram(SessionProgram):
    class ProtectedInputs:
        members: DataFrame

    class UnprotectedInputs:
        nc_zip_codes: DataFrame

    class Outputs:
        books_by_zip_code: DataFrame

    def session_interaction(self, session):
        nc_zip_codes_df = session.public_source_dataframes["nc_zip_codes"]
        nc_zip_code_keys = KeySet.from_dataframe(nc_zip_codes_df.select("zip_code"))
        query = (
            QueryBuilder("members")
            .groupby(nc_zip_code_keys)
            .sum("books_borrowed", low=0, high=500)
        )
        budget = session.remaining_privacy_budget
        return {"books_by_zip_code": session.evaluate(query, budget)}
A subclass of SessionProgram
has two main parts: an input/output specification, and a session_interaction()
method. Let’s look at each component separately.
- First, we list our program's inputs (distinguishing the protected ones from the non-protected ones) and outputs.
  - The ProtectedInputs nested class contains one class variable for each input DataFrame that contains sensitive data (and therefore should be protected by differential privacy). All protected inputs must be DataFrames. In this case, the only protected input is the members DataFrame.
  - The UnprotectedInputs nested class contains one class variable for each input DataFrame that contains public data (and does not require differential privacy). All unprotected inputs must be DataFrames. These DataFrames can be accessed directly by the program. In this case, the only unprotected input is the publicly available nc_zip_codes DataFrame.
  - The Outputs nested class contains one class variable for each output DataFrame. All outputs must be DataFrames. In this case, we have only one output, books_by_zip_code, which will hold the results obtained by the program.
- Second, we write our program logic inside a session_interaction() method, which defines the interaction between this SessionProgram and a Session. This method:
  - Uses a Session to evaluate queries as we would in any other Tumult Analytics program. Unlike other Tumult Analytics programs, we don't construct the Session directly; instead, the Session is provided to session_interaction() as a method parameter.
  - Returns a dictionary of output DataFrames keyed by their names (which must correspond to the names in the Outputs class).
Initializing and running a SessionProgram#
Let’s initialize our BooksByZipCodeProgram
using its Builder (which is inherited from SessionProgram
). The Builder interface is very similar to Session.Builder
. We use builder methods to pass in the protected and unprotected dataframes, and the privacy budget we want to enforce.
zip_code_program = (
BooksByZipCodeProgram.Builder()
.with_private_dataframe(
source_id="members",
dataframe=members_df,
protected_change=AddOneRow()
)
.with_public_dataframe(
source_id="nc_zip_codes",
dataframe=nc_zip_codes_df,
)
.with_privacy_budget(PureDPBudget(3))
.build()
)
Once the program is initialized, instead of evaluating individual queries on it as we would with a Session, we run the whole program at once using the run() method.
Note
The only way that a SessionProgram can access its input data is through the Session
provided as a parameter in the session_interaction()
method. Its privacy guarantee is exactly the same as a Session
’s - the data is protected with the privacy loss budget we specify when instantiating the program (using the same syntax as the Session
).
outputs = zip_code_program.run()
outputs["books_by_zip_code"].show(10)
+--------+------------------+
|zip_code|books_borrowed_sum|
+--------+------------------+
| 27006| 359|
| 27007| -218|
| 27009| 477|
| 27011| -145|
| 27012| 424|
| 27013| 8|
| 27014| -302|
| 27016| 278|
| 27017| -322|
| 27018| 274|
+--------+------------------+
only showing top 10 rows
One characteristic of a SessionProgram
is that it can only be run once: running it again is forbidden.
>>> zip_code_program.run()
Traceback (most recent call last):
...
RuntimeError: The SessionProgram has already been run. It can only be run once.
First look at error#
Now that we have clearly defined our program, and specified its inputs, outputs, and query logic, we can use a SessionProgramTuner
to measure its error.
In particular, we will write a subclass of SessionProgramTuner
, and pass our BooksByZipCodeProgram
as a class argument.
We will also specify a set of error metrics in the metrics
class variable to measure how accurate our output is.
For example, we will use the built-in metric MedianRelativeError
to measure the program’s median relative error.
from tmlt.tune import MedianRelativeError

class BooksTuner(SessionProgramTuner, program=BooksByZipCodeProgram):
    metrics = [
        MedianRelativeError(
            output="books_by_zip_code",
            measure_column="books_borrowed_sum",
            join_columns=["zip_code"],
        ),
    ]
- The MedianRelativeError metric requires three parameters:
  - output: the name of the output DataFrame whose error we want to measure. In this case, it is our only output table: books_by_zip_code.
  - measure_column: the column in the output DataFrame whose error we want to measure. In this case, it is the sum column from our query, books_borrowed_sum.
  - join_columns: the columns that can be used to uniquely identify an output row. In this case, it is the zip_code column, because we have one output row per ZIP code.
- The measure_column and join_columns parameters are used behind the scenes to measure error:
  - First, we automatically generate the baseline output: a DataFrame containing the "true" (non-noisy) query results.
  - Then, we join the noisy answers to the baseline on the join_columns.
  - Finally, we calculate the relative error for each row between the measure_column in the noisy and baseline DataFrames, and return the median. A rough sketch of this computation is shown after this list.
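For intuition, here is a minimal sketch of what that computation could look like if done by hand with Spark. It reuses the outputs dictionary from the run above and the raw members_df, and it deliberately glosses over details of the real baseline, such as clamping values to the [0, 500] range, including ZIP codes with no members, or the exact handling of rows whose true sum is zero.
from pyspark.sql import functions as sf

# Baseline: the "true" (non-noisy) sums, computed directly on the raw data.
baseline = members_df.groupBy("zip_code").agg(
    sf.sum("books_borrowed").alias("true_sum")
)

# Join the noisy answers to the baseline on the join column.
joined = outputs["books_by_zip_code"].join(baseline, on="zip_code", how="inner")

# Relative error per row, then the median across all rows.
errors = joined.withColumn(
    "relative_error",
    sf.abs(sf.col("books_borrowed_sum") - sf.col("true_sum")) / sf.abs(sf.col("true_sum")),
)
print(errors.approxQuantile("relative_error", [0.5], 0.0)[0])
The SessionProgramTuner does all of this for us, so we never have to write this comparison logic ourselves.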
Our BooksTuner can be initialized just like a Session, but it does not provide any privacy guarantees: error information is not differentially private! This is true even though we provide a budget to the tuner: the budget is used to compute the noisy answers, which we then compare to the true, un-noised answers. For this reason, it is good practice to use different datasets for tuning and deployment; historical datasets or synthetic datasets are common choices.
simple_tuner = (
BooksTuner.Builder()
.with_private_dataframe(
source_id="members",
dataframe=members_df,
protected_change=AddOneRow()
)
.with_public_dataframe(
source_id="nc_zip_codes",
dataframe=nc_zip_codes_df,
)
.with_privacy_budget(PureDPBudget(3))
.build()
)
Now that our BooksTuner
is initialized, we can get our very first error report by calling the error_report()
method.
error_report = simple_tuner.error_report()
error_report.show()
Error report ran with budget PureDPBudget(epsilon=3) and no parameters.
Metric results:
+---------+----------+------------+--------------------------------------------------------------------------------+
| Value | Metric | Baseline | Description |
+=========+==========+============+================================================================================+
| 0.199 | mre | default | Median relative error for column books_borrowed_sum of table books_by_zip_code |
+---------+----------+------------+--------------------------------------------------------------------------------+
We computed our first error metric!
You probably have lots of questions, like “how do I choose which metrics to use to quantify error”, or “how do I measure error with many parameters at once”, or “how do I determine what baseline is used behind the scenes to measure error”. Great news — that’s exactly the topic of the next tutorials in this series.