Simple Data Analysis with Tumult Core#
In this tutorial you will learn how to:
compute a basic query on a dataset, and
observe the privacy properties of this computation.
In this tutorial, we show how to count the number of records in a dataset whose age is greater than 18. Tumult Core can handle multiple types of data, but at present it primarily uses Spark DataFrames. Before we do anything, we need to create a spark session and read in the data:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
This creates an Core-ready Spark Session. For more details on using Spark sessions with Core, or to troubleshoot, see the Spark Topic Guide.
from pyspark.sql.types import *
spark_schema = StructType(
[StructField("Name", StringType()), StructField("Age", IntegerType())]
)
df = spark.createDataFrame([("Alice", 30), ("Bob", 15), ("Carlos", 50)], schema=spark_schema)
We build queries out of basic transformations and measurements. Transformations are functions that transform the data, but are not private on their own. Measurements are randomized mechanisms with privacy properties. We can build complex measurements by combining transformations with simple measurements. Additionally we can combine transformations to produce more complex transformations.
To compute the number of records with age over 18, we combine the following 3 components:
(Transformation) Filter out records with age < 18.
(Transformation) Count the total number of records.
(Measurement) Add noise to the count to produce a noisy count.
Note that while the noise added by the measurement is necessary to guarantee privacy, the transformations also have properties that are tracked and contribute to the privacy guarantee. For this reason, constructing the entire analysis task from transformations and measurements is needed for the privacy guarantee to hold (rather than, e.g., performing steps (1) and (2) in pure Spark).
We begin by constructing the full measurement (steps 1-3 above), running the measurement on our data, and printing the privacy guarantee of the measurement. Next, we will walk through and explain each step in this process.
from tmlt.core.transformations.spark_transformations.filter import Filter
from tmlt.core.transformations.spark_transformations.agg import Count
from tmlt.core.domains.spark_domains import convert_spark_schema, SparkDataFrameDomain
from tmlt.core.metrics import SymmetricDifference
from tmlt.core.measurements.noise_mechanisms import AddGeometricNoise
from tmlt.core.utils.misc import print_sdf
tumult_schema = SparkDataFrameDomain(convert_spark_schema(spark_schema))
over_18_measurement = (
Filter(filter_expr="Age >= 18", domain=tumult_schema, metric=SymmetricDifference())
| Count(input_domain=tumult_schema, input_metric=SymmetricDifference())
| AddGeometricNoise(2)
)
print("Noisy count of records with age >= 18:")
print(over_18_measurement(df))
print("Privacy loss (epsilon):")
print(over_18_measurement.privacy_function(1))
Noisy count of records with age >= 18:
5
Privacy loss (epsilon):
1/2
The first step is to construct the filter
component.
tumult_schema = SparkDataFrameDomain(convert_spark_schema(spark_schema))
filter = Filter(filter_expr="Age >= 18", domain=tumult_schema, metric=SymmetricDifference())
This component also requires a schema, but the format is slightly different from the Spark schema, so we used a conversion function.
The filter
transformation
created above is a function that can be
run on our Spark DataFrame. The component filters out records with age
less than 18, as well as tracking other properties necessary to ensure
the privacy guarantee holds when we eventually create a measurement.
print_sdf(filter(df))
Name Age
0 Alice 30
1 Carlos 50
Next, we construct the
count
component.
count = Count(input_domain=tumult_schema, input_metric=SymmetricDifference())
Like the filter
transformation we constructed above, count
can
be run on the data, and will produce the exact count of records in the
dataset.
print(count(df))
3
However, we want to count the number of records in the filtered dataset,
not the original dataset. To do this, we create a new transformation
that performs both the filter and the count. We can combine
transformations into new transformations using the chain operator,
|
.
filter_and_count = filter | count
filter_and_count
is a new transformation that chains together the
filter and count transformations, as we can verify below:
print(filter_and_count(df))
2
Finally, we create a measurement to add noise
in a privacy-preserving
way. The following measurement produces a noisy number by adding
geometric noise with scale alpha
.
add_noise = AddGeometricNoise(2)
To create a measurement that filters and counts before adding noise, we
chain our previous filter_and_count
transformation with the
add_noise
measurement we just created.
over_18_measurement = filter_and_count | add_noise
If we apply our over_18_measurement
to our dataset, we see a noisy
count of the number of records with age over 18.
print(over_18_measurement(df))
2
This measurement has a privacy guarantee, which is automatically
calculated from properties of its constituent parts. You can see the
privacy guarantee of the measurement using the privacy_function
member.
print(over_18_measurement.privacy_function(1))
1/2
The privacy guarantee says, informally, that if you call this function on similar dataframes, you will get statistically similar noisy counts. The privacy_function
quantifies this guarantee precisely. By calling this function with an input of 1, we learn how statistically similar the outputs will be for two dataframes that differ by 1 row. The function return value tells us that the noisy counts satisfy \(\epsilon\)-differential privacy with \(\epsilon = 1/2\).
If we call this function with an input of 2 (dataframes differing by 2 rows), we learn how statistically similar the outputs will be for two dataframes that differ by 2 rows. That is, we learn that the group privacy guarantee: the mechanism satisfies \(\epsilon\)-differential privacy for groups of size 2, with \(\epsilon = 1\).
print(over_18_measurement.privacy_function(2))
1