Tumult Core Privacy Guarantee#
The privacy guarantee of a Core Measurement \(\mathcal{M}\) is the following. Let \(r\) denote the privacy relation of \(\mathcal{M}\), let \(I\) denote the input domain, let \(d\) denote the input metric, and let \(D\) denote the output measure. Then, for any pair of elements \(x, y \in I\) and for all distances \(\alpha\) and \(\beta\) such that \(d(x,y) \le \alpha\), \(r(\alpha, \beta) = \text{True}\) implies \(D(\mathcal{M}(x), \mathcal{M}(y)) \le \beta\).
This privacy guarantee generalizes \(\epsilon\)-differential privacy, and we can recover the standard \(\epsilon\)-differential privacy guarantee for specific settings of the parameters. Suppose the input metric is SymmetricDifference and the output measure is defined by \(D(X, Y) = \epsilon\) whenever \(D_{\infty}(X \| Y) = \epsilon\), where \(D_{\infty}\) denotes the max divergence (in Core, this is when the output measure of the Measurement is PureDP). Then, \(r(1, \epsilon) = \text{True}\) implies that the mechanism satisfies \(\epsilon\)-differential privacy.
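To make this instantiation concrete, here is a minimal, self-contained sketch of a Laplace counting mechanism together with its privacy relation. This is not the Tumult Core API; the class name, the `epsilon` parameter, and the `privacy_relation` method are hypothetical and only illustrate how \(r(\alpha, \beta)\) connects input distance to output distance.

```python
import numpy as np

class LaplaceCount:
    """Illustrative-only counting mechanism satisfying pure DP.

    Input metric: symmetric difference between multisets of records.
    Output measure: max divergence between output distributions.
    """

    def __init__(self, epsilon: float):
        self.epsilon = epsilon

    def __call__(self, records: list) -> float:
        # A counting query changes by at most 1 per unit of symmetric
        # difference, so Laplace noise with scale 1 / epsilon gives epsilon-DP.
        return len(records) + np.random.laplace(scale=1.0 / self.epsilon)

    def privacy_relation(self, alpha: float, beta: float) -> bool:
        # r(alpha, beta) is True exactly when inputs at symmetric-difference
        # distance at most alpha yield outputs at max divergence at most beta.
        return beta >= alpha * self.epsilon


mechanism = LaplaceCount(epsilon=0.5)
# r(1, 0.5) is True, so the mechanism satisfies 0.5-differential privacy
# for neighboring datasets (symmetric difference at most 1).
assert mechanism.privacy_relation(1, 0.5)
```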
The rest of this section discusses qualifications to the privacy guarantee with respect to side channel information. In addition, see Known Vulnerabilities for vulnerabilities in Tumult Core that may affect the privacy guarantee.
Pseudo-side channel information#
The privacy guarantee in the previous section applies to the abstract outputs of the mechanism. That is, the output of \(\mathcal{M}\) may be a list of numbers or a multiset of records. However, the implementations in Tumult Core are Python objects that may contain additional information that could leak private data. An example of pseudo-side channel information is the ordering of records in a Spark DataFrame, which is meant to represent an unordered multiset of records.
We call pseudo-side channel information distinguishing if it can be used to learn about the private input. In Tumult Core, protecting against leakage of distinguishing pseudo-side channel information is not part of the privacy guarantee. However, we make a best-effort attempt to make sure that all pseudo-side channel information released by measurements is not distinguishing.
Specific protections against pseudo-side channel leakage#
For Spark DataFrame outputs, we perform the following mitigations against leaking distinguishing pseudo-side channel information (a sketch of the corresponding PySpark operations follows the list).
Materialize the DataFrame: Spark DataFrames are computed lazily, and a Spark DataFrame with random noise would compute a new sample of the noise each time an action is performed (e.g., printing the DataFrame). To prevent this, we eagerly compute the DataFrame and save it.
Repartition the DataFrame: Spark DataFrames are partitioned, and the number and content of these partitions are potentially distinguishing pseudo-side channel information. To prevent this, we randomly repartition the output DataFrame.
Sort each partition: The records in a Spark DataFrame have an order, but our privacy guarantee is on the unordered multiset of records that this DataFrame represents. To prevent the ordering from leaking private data, after repartitioning the DataFrame as described above, we sort each partition.
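As a rough illustration, and not the exact code Tumult Core runs, the three mitigations above correspond to standard PySpark operations along the following lines (the helper name and the partition count are assumptions made for the sketch).

```python
from pyspark.sql import DataFrame

def sanitize_output(noisy_df: DataFrame, num_partitions: int = 8) -> DataFrame:
    """Hypothetical helper sketching the three mitigations described above."""
    # Materialize: force eager computation so the noise is sampled exactly
    # once, rather than re-sampled on every action against a lazy plan.
    materialized = noisy_df.cache()
    materialized.count()  # triggers computation and caching

    # Repartition: redistribute records so partition boundaries no longer
    # reflect how the private input was partitioned. (This is only an
    # approximation of the randomized repartitioning described above.)
    repartitioned = materialized.repartition(num_partitions)

    # Sort each partition: impose a canonical order within every partition
    # so record order does not depend on the private input.
    return repartitioned.sortWithinPartitions(*repartitioned.columns)
```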
Postprocessing udfs and pseudo-side channel information#
Some parts of the Tumult Core code accept user code when constructing pre/postprocessing functions (PostProcess, NonInteractivePostProcess, DecorateQueryable). For these measurements, our privacy guarantee relies on the assumption that these functions do not use distinguishing pseudo-side channel information. Tumult Core makes a best-effort attempt to ensure that all pseudo-side channel information is not distinguishing; see Specific protections against pseudo-side channel leakage for details.
More formally, our privacy guarantee is based on the assumption that these functions are well defined on the abstract domains. Let \(f\) be the function, and let \(A\) be the abstract input domain of \(f\). That is, \(A\) contains the abstract elements represented by the objects passed to the implementation of \(f\) (in the case of a Spark DataFrame, these elements are unordered multisets of records). Likewise, let \(B\) be the abstract output domain of \(f\). Then for any \(x, y \in A\) such that \(x = y\), it must be the case that \(f(x) = f(y)\).
Suppose, for example, that Tumult Core did not protect against information leakage via the ordering of the records in a Spark DataFrame, and that the ordering revealed something about the private data. Then, for some record \(r\), consider a postprocessing function \(f\) on a Spark DataFrame that outputs 1 if the first record of the DataFrame is \(r\), and 0 otherwise. Such a function would break the Tumult Core privacy guarantee, because it uses distinguishing pseudo-side channel information. This function is also not well defined on the abstract input domain: there exist two DataFrames \(D, D'\) that represent the same multiset of records (and are therefore equal in the abstract domain), but \(f(D) \ne f(D')\) because \(r\) is the first record of \(D\) but not of \(D'\). This example is hypothetical, since Tumult Core does protect against information leakage via the ordering.
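The following self-contained sketch restates that hypothetical example, with Python lists standing in for Spark DataFrames (list order plays the role of the pseudo-side channel). The function and variable names are illustrative only.

```python
from collections import Counter

# Two concrete representations of the same abstract multiset of records:
# equal as multisets, different only in their (pseudo-side channel) ordering.
d1 = ["alice", "bob", "carol"]
d2 = ["carol", "alice", "bob"]
assert Counter(d1) == Counter(d2)

def bad_postprocess(records: list) -> int:
    # NOT well defined on the abstract domain: depends on record order,
    # which is distinguishing pseudo-side channel information.
    return 1 if records[0] == "alice" else 0

def good_postprocess(records: list) -> int:
    # Well defined on the abstract domain: depends only on the multiset.
    return 1 if "alice" in records else 0

assert bad_postprocess(d1) != bad_postprocess(d2)    # violates abstract equality
assert good_postprocess(d1) == good_postprocess(d2)  # preserves abstract equality
```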
Side channel information#
Side channel information includes any information that can be learned from running a measurement that is not explicitly part of the output of the measurement. Examples include the amount of time it takes for a measurement to run, or the amount of memory consumed while running the measurement. Note that the amount of time the measurement takes to run could be measured indirectly by the user: if user code adds timestamped entries to a logfile at different points in the measurement, the resulting logfile could leak private data, and this leakage is not protected against by the Tumult Core guarantee.
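As a hypothetical illustration (none of the names below are part of Tumult Core), user-supplied code could record wall-clock timestamps and thereby capture a timing side channel in its log:

```python
import logging
import time

logging.basicConfig(filename="run.log", level=logging.INFO)

def timestamping_preprocess(records: list) -> list:
    # Logs a wall-clock timestamp before and after a (stand-in) expensive step.
    logging.info("preprocess started at %f", time.time())
    result = sorted(records)  # stand-in for data-dependent work
    logging.info("preprocess finished at %f", time.time())
    # The gap between the two log entries reveals how long the step took,
    # which may depend on the private data; the privacy guarantee places no
    # bound on what this timing information can reveal.
    return result
```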
The privacy guarantee of Core Measurements applies only to the explicit output; it does not extend to any side channel information. Additionally, Tumult Core makes no attempt to make side channel information non-distinguishing.