Tumult Analytics’ privacy promise#

This topic guide outlines the “privacy promise” provided by Tumult Analytics, along with its caveats. This guarantee is based on one of the core abstractions of Tumult Analytics: Sessions.

At a high level, a Session allows you to evaluate queries on private data in a way that satisfies differential privacy. When creating a Session, private data must first be loaded into it, along with a privacy budget. You can then evaluate queries on your private data, consuming at most the privacy budget provided at initialization time.

The privacy promise in more detail#

A Session is initialized with:

  • one or more private data sources (data you wish to query in a differentially private way);

  • one or more public data sources (data that does not require privacy protection, but may be used as auxiliary input to your computation);

  • a privacy definition along with its associated privacy parameters (e.g. tutorials use PureDBBudget, corresponding to pure differential privacy, and Tumult Analytics also supports zero-concentrated differential privacy).

After initialization, the Session guarantees that the answers returned by calling evaluate() to evaluate queries satisfy the corresponding privacy definition with respect to the private data, using the specified parameters. For example, a Session initialized with PureDPBudget(1) provides \({\varepsilon}\)-differential privacy with \({\varepsilon}=1\).

Subtlety 1: unit of privacy#

By default, the privacy guarantee prevents an attacker for learning whether one individual row was added or removed in each private dataset. If the data of a single individual can span multiple rows in the same private dataset, then this individual is not covered by the privacy promise, only individual rows are.

If you know that a single individual can appear in at most k rows in an input dataset, you can load it into a Session using the optional stability=k parameter. Then, Tumult Analytics will hide the addition or removal of up to k rows at once from the corresponding private dataset, providing individual-level protection.

For this reason, you should be careful of what kind of pre-processing is done to the private data before loading it into a Session. If you start from a DataFrame where each individual appears in a single record, but this property stops being true before the data is loaded into a Session, then you might not get the expected privacy guarantees.

Subtlety 2: covered inputs & outputs#

A Session only provides guarantees on the private datasets, and this guarantee only covers data returned by evaluate calls. Use of the private data in any other way is not protected by the Session.

This means that you should not directly use private data; instead, you should only access it indirectly by executing evaluate() on well-specified queries. In particular, public sources and parameters like groupby information or clamping bounds are not protected. They can reveal private information if the private data is used directly to determine them.

Subtlety 3: adversarial model#

Tumult Analytics, and in particular the Session interface, is designed to make it easy to obtain expected differential privacy guarantees, and difficult to accidentally break these guarantees. However, this library was not designed to defend against actively malicious users. In particular:

  1. Do not inspect the private state of a Session or other objects. The privacy guarantees of a Session only apply to the public API. Inspecting a Session’s private state and using this information to tailor your analysis workflow will break the privacy guarantee.

  2. Do not use map() or flat_map() operations with side-effects. These operations allow you to transform data using arbitrary user-defined functions (UDFs). When using map or flatmap, a Session’s privacy guarantee only holds if the UDFs do not have side-effects with externally-observable behaviors. For example, a UDF could be designed to throw an exception if a specific record is found in the data. This would reveal information about the private data and break the privacy promise.

  3. Do not release side-channel information. The privacy guarantee only applies to the output of calls to evaluate(). Information such as how long a query ran or how much memory it required might reveal private information. Do not use this library in an untrusted context where protection against such side-channels is important.