Tuning parameters#

Note

This tutorial uses features that are only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.

In the previous two tutorials in this series, we explored how to measure the error of Tumult Analytics programs using both built-in and custom metrics. Measuring error is especially useful for optimizing the privacy/utility trade-offs of our differentially private mechanisms. As its name suggests, the SessionProgramTuner class is designed to help us do just that. Let’s see how it works!

Setup#

As in earlier tutorials, we import the necessary packages and download the data.

import matplotlib.pyplot as plt
import seaborn as sns

from pyspark import SparkFiles
from pyspark.sql import SparkSession, DataFrame
from tmlt.analytics.keyset import KeySet
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.protected_change import AddOneRow, AddMaxRows
from tmlt.analytics.query_builder import QueryBuilder
from tmlt.analytics.session import Session
from tmlt.analytics.program import SessionProgram
from tmlt.analytics.tuner import SessionProgramTuner, Tunable
from tmlt.analytics.metrics import MedianRelativeError, QuantileRelativeError

# Analytics' multi-error reports, which we will see in this tutorial,
# have a built-in progress bar that requires Spark's native progress bar
# to be disabled. Here, we turn it off by passing the relevant option
# to the SparkSession builder.
spark = (
    SparkSession
    .builder
    .config("spark.ui.showConsoleProgress", "false")
    .getOrCreate()
)
spark.sparkContext.addFile(
    "https://tumult-public.s3.amazonaws.com/demos/library/v2/members.csv"
)
members_df = spark.read.csv(
    SparkFiles.get("members.csv"), header=True, inferSchema=True
)

# ZIP code data is based on https://worldpopulationreview.com/zips/north-carolina
spark.sparkContext.addFile(
    "https://tumult-public.s3.amazonaws.com/nc-zip-codes.csv"
)
nc_zip_codes_df = spark.read.csv(
    SparkFiles.get("nc-zip-codes.csv"), header=True, inferSchema=True
)
nc_zip_codes_df = nc_zip_codes_df.withColumnRenamed("Zip Code", "zip_code")
nc_zip_codes_df = nc_zip_codes_df.withColumn("zip_code", nc_zip_codes_df.zip_code.cast('string'))
nc_zip_codes_df = nc_zip_codes_df.fillna(0)

Parametrizing the SessionProgram#

We previously used a SessionProgram that calculated the total number of books borrowed by library members, grouped by ZIP code. The query included clamping bounds:

nc_zip_code_keys = KeySet.from_dataframe(nc_zip_codes_df.select("zip_code"))
query = (
    QueryBuilder("members")
    .groupby(nc_zip_code_keys)
    .sum("books_borrowed", low=0, high=500)
)

The upper clamping bound of 500 was chosen arbitrarily. To tune this value, we first need to parametrize it: we declare the parameter as part of our SessionProgram, then use it in our query.

class BooksByZipCodeProgram(SessionProgram):
    class ProtectedInputs:
        members: DataFrame
    class UnprotectedInputs:
        nc_zip_codes: DataFrame
    class Outputs:
        books_by_zip_code: DataFrame
    class Parameters:
        upper_clamping_bound: int

    def session_interaction(self, session: Session):
        nc_zip_codes_df = self.unprotected_inputs["nc_zip_codes"]
        nc_zip_code_keys = KeySet.from_dataframe(nc_zip_codes_df.select("zip_code"))
        query = (
            QueryBuilder("members")
            .groupby(nc_zip_code_keys)
            .sum("books_borrowed", low=0, high=self.parameters["upper_clamping_bound"])
        )
        budget = session.remaining_privacy_budget
        return {"books_by_zip_code": session.evaluate(query, budget)}

Since our program now has a parameter, we must specify the parameter’s value when we build it.

zip_code_program = (
    BooksByZipCodeProgram.Builder()
    .with_private_dataframe(
        source_id="members",
        dataframe=members_df,
        protected_change=AddOneRow()
    )
    .with_public_dataframe(
        source_id="nc_zip_codes",
        dataframe=nc_zip_codes_df
    )
    .with_privacy_budget(PureDPBudget(3))
    .with_parameter("upper_clamping_bound", 500)
    .build()
)

But we don’t want to specify it just yet — we first want to try different values for it, and observe which ones give us acceptable accuracy.

Running multiple error reports at once#

Let’s define a SessionProgramTuner with our program class to compute two metrics: the median and 75th percentile relative error. We’ve already seen similar metrics, but this time we give each of them a short name.

class SimpleTuner(SessionProgramTuner, program=BooksByZipCodeProgram):
    metrics = [
        MedianRelativeError(
            output="books_by_zip_code",
            measure_column="books_borrowed_sum",
            join_columns=["zip_code"],
            name="mre"
        ),
        QuantileRelativeError(
            output="books_by_zip_code",
            measure_column="books_borrowed_sum",
            quantile=0.75,
            join_columns=["zip_code"],
            name="qre_0.75",
        ),
    ]

Now that we have defined our class, let’s build it. Unlike a SessionProgram, a SessionProgramTuner does not require concrete values for every parameter when it is built. Instead, when initializing it, we can specify one or more parameters as Tunable.

tuner = (
    SimpleTuner.Builder()
    .with_private_dataframe(source_id="members", dataframe=members_df, protected_change=AddOneRow())
    .with_public_dataframe(source_id="nc_zip_codes", dataframe=nc_zip_codes_df)
    .with_privacy_budget(PureDPBudget(3))
    .with_parameter("upper_clamping_bound", Tunable("upper_clamping_bound"))
    .build()
)

In this case, we have designated the upper clamping bound as a Tunable parameter. However, as we will see later, Tunable objects can also be used to parametrize other inputs to the program, such as the privacy budget or the input data.

Now that it’s built, we can run an error report on our SessionProgramTuner object by specifying concrete values for all Tunable parameters.

tuner.error_report({"upper_clamping_bound": 500})
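
This call returns a single ErrorReport object. As a minimal sketch, and assuming the ErrorReport exposes a show() method for displaying its contents, we could also keep a reference to the report and print it:

# Sketch: assumes ErrorReport provides a show() method for pretty-printing;
# adapt this to whatever inspection method your version of the API offers.
report = tuner.error_report({"upper_clamping_bound": 500})
report.show()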

More usefully, we can generate many error reports at once:

multi_error_report = tuner.multi_error_report([
    {"upper_clamping_bound": bound} for bound in [25, 75, 150, 250, 400]
])
Running a total of 5 error reports.
Done!

multi_error_report() returns a MultiErrorReport, which can be iterated over to get individual ErrorReport objects, and which provides a dataframe() method to collect all of the metrics into a single Pandas DataFrame.

df = multi_error_report.dataframe()
print(df)
   upper_clamping_bound    mre_default  qre_0.75_default
0                    25          0.451             0.537
1                    75          0.224             0.362
2                   150          0.132             0.319
3                   250          0.142             0.567
4                   400          0.167             0.689
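
Since dataframe() gives us an ordinary Pandas DataFrame, we can also post-process the results with standard Pandas operations. For example, here is a small sketch that picks out the clamping bound with the lowest median relative error:

# Select the row whose median relative error is smallest.
best_row = df.loc[df["mre_default"].idxmin()]
print(best_row["upper_clamping_bound"], best_row["mre_default"])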

Looking at these results, we can see that the error seems to be smallest for upper clamping bounds between 150 and 250. We can run a second error report that focuses on this part of the search space:

multi_error_report = tuner.multi_error_report([
    {"upper_clamping_bound": bound} for bound in [150, 175, 200, 225, 250]
])
Running a total of 5 error reports.
Done!

We can then plot the results to visualize how the upper clamping bound affects the error.

df = multi_error_report.dataframe()
sns.lineplot(data=df, x="upper_clamping_bound", y="mre_default", label="median")
sns.lineplot(data=df, x="upper_clamping_bound", y="qre_0.75_default", label="75th percentile")
plt.ylabel("Relative Error")
plt.xlabel("Upper Clamping Bound")
plt.title("Relative Error vs. Upper Clamping Bound")
plt.show()
A line chart plotting the median and 75th percentile relative error against upper clamping bounds in the range 150 to 250.

Setting the clamping bound to about 175 seems to give us the best results.
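
Once we have settled on a value, we can plug it back into the original program. The sketch below rebuilds BooksByZipCodeProgram with the tuned bound; the final run() call is an assumption about how the program is executed, so adapt it to however you run SessionPrograms elsewhere.

tuned_program = (
    BooksByZipCodeProgram.Builder()
    .with_private_dataframe(
        source_id="members",
        dataframe=members_df,
        protected_change=AddOneRow()
    )
    .with_public_dataframe(
        source_id="nc_zip_codes",
        dataframe=nc_zip_codes_df
    )
    .with_privacy_budget(PureDPBudget(3))
    .with_parameter("upper_clamping_bound", 175)
    .build()
)
# Assumption: SessionProgram objects expose a run() method that returns
# the dictionary of output DataFrames produced by session_interaction().
outputs = tuned_program.run()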

Tuning: not just for parameters!#

We now have a better sense of how our upper clamping bound affects the error of our program. But what about its privacy budget? How will the error change if we use a different privacy budget? We can use a Tunable to find out.

tuner = (
    SimpleTuner.Builder()
    .with_private_dataframe(
        source_id="members",
        dataframe=members_df,
        protected_change=AddOneRow()
    )
    .with_public_dataframe(
        source_id="nc_zip_codes",
        dataframe=nc_zip_codes_df
    )
    .with_privacy_budget(Tunable("budget"))
    .with_parameter("upper_clamping_bound", Tunable("upper_clamping_bound"))
    .build()
)

We then specify concrete values for all of our Tunable parameters when calling methods like error_report() and multi_error_report().

multi_error_report = tuner.multi_error_report([
    {"budget": PureDPBudget(epsilon), "upper_clamping_bound": bound}
    for epsilon in [1, 2, 3, 4, 5]
    for bound in [150, 175, 200]
])
Running a total of 15 error reports.
Done!

We can then plot the privacy/utility trade-off, showing how the error varies with the upper clamping bound for each privacy budget.

df = multi_error_report.dataframe()
# Convert 'budget' column to string representation focusing on the 'epsilon' attribute
df['budget'] = df['budget'].apply(lambda x: f"epsilon={x.epsilon}")
fig, axes = plt.subplots(2, 1, figsize=(10, 10))
sns.lineplot(data=df, x="upper_clamping_bound", y="mre_default", hue="budget", ax=axes[0])
sns.lineplot(data=df, x="upper_clamping_bound", y="qre_0.75_default", hue="budget", ax=axes[1])
axes[0].set_ylabel("Median Relative Error")
axes[1].set_ylabel("75th Percentile Relative Error")
axes[0].set_xlabel("Upper Clamping Bound")
axes[1].set_xlabel("Upper Clamping Bound")
plt.show()
Two line charts plotting relative error against upper clamping bounds in the range 150 to 200. The top chart shows median relative error, and the bottom chart shows 75th percentile relative error. Each chart has a line for each privacy budget, with the x-axis representing the upper clamping bound and the y-axis representing the relative error.
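
To complement the plots, we can again use plain Pandas on the same DataFrame (with the budget column already converted to string labels) to find, for each budget, the clamping bound that minimizes the median relative error:

# For each privacy budget, keep the row with the smallest median relative error.
best_per_budget = df.loc[df.groupby("budget")["mre_default"].idxmin()]
print(best_per_budget[["budget", "upper_clamping_bound", "mre_default"]])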

A Tunable can also replace a DataFrame argument (to run the program on different datasets with the same tuner), or a protected change (to evaluate the impact of changing how the data is protected).

For example, if we wanted to fix the clamping bound parameter, but tune both the DataFrame and protected change used by the program, we would supply a Tunable for each of these arguments, like so:

tuner = (
    SimpleTuner.Builder()
    .with_private_dataframe(
        source_id="members",
        dataframe=Tunable("members_df"),
        protected_change=Tunable("protected_change")
    )
    .with_public_dataframe(
        source_id="nc_zip_codes",
        dataframe=nc_zip_codes_df
    )
    .with_privacy_budget(PureDPBudget(3))
    .with_parameter("upper_clamping_bound", 500)
    .build()
)

We can then run the error report with different Tunable values:

multi_error_report = tuner.multi_error_report([
    {"members_df": df, "protected_change": protected_change}
    for df in [members_df, members_df.sample(fraction=0.5)]
    for protected_change in [AddOneRow(), AddMaxRows(2)]
])
Running a total of 4 error reports.
Done!

For more information, consult the API reference for Tunable and SessionProgramTuner.