Tumult Core Architecture#

Tumult Core is a collection of composable components for implementing algorithms to perform differentially private computations. The design of Tumult Core is based on the design proposed in the OpenDP White Paper. On this page, we give an overview of this design. Readers who want more information should refer to the linked white paper.

Transformation and Measurements#

The main building blocks of Tumult Core are transformations and measurements. In this section we begin by defining a transformation, then later define a measurement, which shares many of the same concepts.

A transformation is a function that takes an input from some defined input domain and produces an output in some given output domain. Each transformation additionally has an input metric – a function defining distances 1 between elements in the input domain – and an output metric – a function defining distances between elements in the output domain.

Note

In Tumult core, subclasses of Transformation and Measurement are constructors for various types of transformations; instances of these classes are the transformations and measurements themselves. For example, the FlatMap class does not define a single transformation. It is a constructor that can create many distinct transformations, all with different domains, metrics, and stability relations.

For example, the following code uses the Filter transformation constructor to create a transformation that filters out the records with ages of at least 18 years. This transformation constructor takes only a single domain and metric, because the input and output metric/domain must match. Even if some metric and domain information is inferred by the constructor, we can always inspect the constructed transformation to see its input domain, input metric, output domain, and output metric.

spark_schema = StructType(
    [StructField("Name", StringType()), StructField("Age", IntegerType())]
)
df = spark.createDataFrame([("Alice", 30), ("Bob", 15), ("Carlos", 50)], schema=spark_schema)
tumult_schema = SparkDataFrameDomain(convert_spark_schema(spark_schema))
filter_transformation = Filter(filter_expr="Age >= 18", domain=tumult_schema, metric=SymmetricDifference())

print(filter_transformation.input_domain)
print(filter_transformation.output_domain)
print(filter_transformation.input_metric)
print(filter_transformation.output_metric)
SparkDataFrameDomain(schema={'Name': SparkStringColumnDescriptor(allow_null=True), 'Age': SparkIntegerColumnDescriptor(allow_null=True, size=32)})
SparkDataFrameDomain(schema={'Name': SparkStringColumnDescriptor(allow_null=True), 'Age': SparkIntegerColumnDescriptor(allow_null=True, size=32)})
SymmetricDifference()
SymmetricDifference()

Finally, each transformation has a stability function (this statement is not completely true, see A note on privacy/stability functions and relations). This is a function that, given an input distance d_in, returns an output distance d_out satisfying the following:

For any pair of inputs in the input domain whose distance is at most d_in in the input metric, the corresponding outputs (the result of the transformation applied to each of the two inputs) will have distance at most d_out in the output metric.

Ideally, we want the stability function to give the smallest such d_out satisfying the statement above, but this is not required. Although this guarantee on its own is not a privacy guarantee, we will see in the following section that it allows us to construct complex measurements with provable privacy guarantees.

print(filter_transformation.stability_function(1))
1

Measurements have many things in common with transformations, for example they both have input domains and input metrics. One key difference is that measurements produced random outputs, while transformations are deterministic. Since the output is random, a measurement has an output measure rather than an output metric.

The measure defines a distance between distributions of outputs over the output domain of the measurement. For example, if \(P\) and \(Q\) denote two distributions over some output domain, the distance between \(P\) and \(Q\) in the PureDP output measure is \(D_{\infty}(X \| Y)\), where \(D_{\infty}(P \| Q) = \sup _{x \in \operatorname{supp} (Q)} \log \frac{P(x)}{Q(x)}\) is the Rényi divergence of infinite order.

add_noise = AddGeometricNoise(2)

print(add_noise.input_domain)
print(add_noise.input_metric)
print(add_noise.output_measure)
NumpyIntegerDomain(size=64)
AbsoluteDifference()
PureDP()

Like transformations, measurements have a guarantee that relates a distance in the input metric to a distance in the output measure. We call this guarantee the privacy function, and it works similarly to a stability function. The privacy function takes an input distance d_in and returns an output distance d_out satisfying the following:

For any pair of inputs in the input domain whose distance is at most d_in in the input metric, the distribution of the corresponding random outputs (the result of the measurements applied to each of the two inputs), will have distance at most d_out in the output measure.

Like stability functions, we want the privacy function to give the smallest such d_out satisfying the statement above in order to get the best possible privacy guarantee, but this is not required. The privacy function of a measurement can give us a standard \(\epsilon\)-differential privacy guarantee, but is more general and can give other provable guarantees (see Tumult Core Privacy Guarantee). In addition, the privacy function can be used to build more complex measurements, as we will see in the next section.

print(add_noise.privacy_function(1))
1/2

A note on privacy/stability functions and relations#

In some cases, privacy and stability functions are not sufficient to capture the guarantees we want to make about the transformations or measurements. For example, the approximate differential privacy measure has two parameters: \(\epsilon\) and \(\delta\). Some measurements that use this privacy measure actually satisfy a continuum of \((\epsilon, \delta)\)-differential privacy guarantees that are not comparable, and it is therefore not possible for the privacy function to give the best guarantee.

Because of cases like these, we also consider more general versions of the stability and privacy functions, called stability relations and privacy relations respectively. A stability relation is a function that, given both an input distance d_in and an output distance d_out, returns either True or False. If the relation returns True, it means that the transformation satisfies the following:

For any pair of inputs in the input domain whose distance is at most d_in in the input metric, the corresponding outputs (the result of the transformation applied to each of the two inputs) will have distance at most d_out in the output metric.

Similarly, a privacy relation takes an input distance d_in and an output distance d_out, and returns True or False. If it returns True, the measurement satisfies the following:

For any pair of inputs in the input domain whose distance is at most d_in in the input metric, the distribution of the corresponding random outputs (the result of the measurements applied to each of the two inputs), will have distance at most d_out in the output measure.

Every transformation has a stability relation and every measurement has a privacy relation. If the stability and privacy functions are defined, these relations are defined using the corresponding functions. For transformations and measurements for which the stability and privacy functions are insufficient, only the stability and privacy relations are defined. Transformations and measurements with only stability and privacy relations have the drawback that they take extra work to chain (see Combinators for an overview on chaining).

print(filter_transformation.stability_relation(1,2))
print(filter_transformation.stability_function(1))
True
1
print(add_noise.privacy_relation(1,1))
print(add_noise.privacy_function(1))
True
1/2

Combinators#

The power of Tumult Core lies in the ways that we can combine components to produce larger and more complex components. The first way that we can do this is by chaining components. The ChainTT component combines two transformations into a single transformation and ChainTM combines a transformation and measurement into a new measurement. These components behave like function composition, e.g. ChainTT applies the first transformation to the input, then passes the output to the second transformation, then returns the output of the second transformation. Most importantly though, ChainTT and ChainTM have their own stability and privacy relations (and functions, if the subcomponents have stability/privacy functions) that are derived automatically 2 from their constituent pieces. This is what allows us to build complex measurements with privacy relations that are automatically determined.

The following example uses ChainTT 3 to combine our previous filter transformation with a new count transformation.

count = Count(input_domain=tumult_schema, input_metric=SymmetricDifference())
filter_and_count = ChainTT(filter_transformation, count)

print(filter_and_count.input_domain)
print(filter_and_count.output_domain)
print(filter_and_count.input_metric)
print(filter_and_count.output_metric)
print(filter_and_count.stability_function(1))
SparkDataFrameDomain(schema={'Name': SparkStringColumnDescriptor(allow_null=True), 'Age': SparkIntegerColumnDescriptor(allow_null=True, size=32)})
NumpyIntegerDomain(size=64)
SymmetricDifference()
AbsoluteDifference()
1

This new transformation can be chained with our previously created measurement using ChainTM to create a new measurement.

measurement = ChainTM(filter_and_count, add_noise)

print(measurement.input_domain)
print(measurement.input_metric)
print(measurement.output_measure)
print(measurement.privacy_function(1))
SparkDataFrameDomain(schema={'Name': SparkStringColumnDescriptor(allow_null=True), 'Age': SparkIntegerColumnDescriptor(allow_null=True, size=32)})
SymmetricDifference()
PureDP()
1/2

Additionally, Tumult Core provides components for composing measurements. These components are specific to particular privacy measures, and leverage the composition properties of that measure. For example, the sequential composition property of \(\epsilon\)-differential privacy is leveraged in the Composition class. Another example of a measurement combinator is ParallelComposition, which composes measurements that are applied to a series of datasets with a bound on the contribution of a single user across all the datasets. Note that this class is actually a different type of measurement: one that supports interactivity (discussed in the next section). Tumult Core could additionally define a measurement for composing measurements in a non-interactive way, but the interactive versions provide the same functionality and more.

Interactivity#

Tumult Core also supports interactivity. Instances of the class Queryable (that we refer to as queryables) are objects that can queried interactively. Queryables have some state, including, e.g. the private data and the remaining privacy budget afforded to the queryable. Queryables are not instantiated directly, but rather by evaluating an interactive measurement which gives the privacy guarantee of the queryable. An interactive measurement (a Measurement with is_interactive set to True) is a type of measurement that produces a queryable, rather than directly producing some private output. Unlike non-interactive measurements, the output measure of an interactive measurement applies to the transcript resulting from an interaction between a user and the produced queryable (all queries made to the queryable, along with the responses from the queryable).

Tumult Core supports various types of interactive components, such as privacy filters [RVRU16] (SequentialComposition). Like the non-interactive components of Tumult Core, interactive components are composable and extensible. For example, queryables can evaluate interactive measurements that spawn new queryables without knowing anything about the behavior of the spawned queryable. This allows for rich interactions between the user and the private data whose privacy properties are derived from the constituent interactive measurements.

One complexity surrounding interactivity is that the privacy guarantee of the concurrent composition of queryables is not known for some privacy definitions. Concurrent composition occurs when queries to the composed queryables are interleaved (i.e. ask a query of queryable 1, ask a query of qeuryable 2, then again ask a query of queryable 1). Although differential privacy supports concurrent composition, other privacy notions such as zCDP [BS16] have not been shown to support concurrent composition (see [VW21]). Tumult Core currently maintains a consistent approach to concurrent composition: it is not permitted regardless of the privacy measure.

The modularity of the Tumult Core design of interactivity, combined with the restriction on concurrent composition makes interactivity somewhat complicated to work with. For this reason, Tumult Core provides the PrivacyAccountant interface for working with interactivity that hides some of this complexity and manages details for the user.

Our model for interactive is based on the interactive mechanisms defined by [VW21]. Compared to this work, we use slightly different terminology and roughly split [VW21]’s interactive mechanism into an instantiation phase (Tumult Core interactive measurement) and the interaction phase (Tumult Core queryable).

Footnotes

1

Although it is useful to think of this function as defining distances, in reality these distances need not satisfy the triangle inequality, and do not even need to be numbers (e.g. see DictMetric) – “distances” are any set with a partial ordering.

2

When both the constituent components do not have stability/privacy functions defined, it’s necessary to provide a chaining hint to the chaining component in order for it to construct it’s stability/privacy guarantee. Providing a good hint can be challenging, but since most components have stability or privacy functions defined, chaining hints are beyond the scope of this article.

3

In these examples we explicitly use ChainTT and ChainTM. More commonly, we use the operator | for chaining, which automatically selects between the two chaining combinators.