Working with Sessions

This topic guide covers how to work with one of the core abstractions of Tumult Analytics: Session. In particular, we will demonstrate the different ways that a Session can be initialized and examined. For a simple end-to-end usage example of a Session, a better place to start is the privacy budget tutorial.

At a high level, a Session allows you to evaluate queries on private data in a way that satisfies differential privacy. When you create a Session, you load private data into it and assign it a total privacy budget. You can then spend portions of that budget to evaluate queries and obtain differentially private results. Tumult Analytics’ privacy promise and its caveats are described in detail in the privacy promise topic guide.

Constructing a Session

There are two ways to construct a Session:

  • directly by initializing it from a data source

  • or using a Session Builder.

Both options are described below – for even more details, consult the Session API Reference.

Initializing from a data source

Sessions are constructed from Spark DataFrames. For example, given a DataFrame named spark_df, you can construct a Session using from_dataframe() as follows:

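from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.protected_change import AddOneRow
from tmlt.analytics.session import Session

# spark_df is an existing Spark DataFrame holding the private data.
session_from_dataframe = Session.from_dataframe(
    privacy_budget=PureDPBudget(2),
    source_id="my_private_data",
    dataframe=spark_df,
    protected_change=AddOneRow(),
)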

When you load a Spark DataFrame into a Session, you don’t need to specify the schema of the source; it is automatically inferred from the DataFrame’s schema. Recall from the first steps tutorial that source_id is simply a unique identifier for the private data that is used when constructing queries.

Using a Session Builder

For analysis use cases involving only one private data source, from_dataframe() is a convenient way of initializing a Session. However, when you have multiple sources of data, a Session Builder may be used instead. First, create your Builder:

session_builder = Session.Builder()

Next, add a private source to it:

session_builder = session_builder.with_private_dataframe(
    source_id="my_private_data",
    dataframe=spark_df,
    protected_change=AddOneRow(),
)

You may add additional private sources to the Session, although this is a more advanced and uncommon use case. Suppose you had additional private data stored in a CSV file:

name, salary
alice, 52000
bob, 75000
carol, 96000
...

First load the data into a Spark dataframe, then add it to the Session:

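# The sample file has a space after each comma, so ask Spark to strip
# leading whitespace; otherwise the second column would be named " salary".
salary_df = spark.read.csv(
    private_csv_path, header=True, inferSchema=True, ignoreLeadingWhiteSpace=True
)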
session_builder = session_builder.with_private_dataframe(
    source_id="my_other_private_data",
    dataframe=salary_df,
    protected_change=AddOneRow(),
)

Any data file format supported by Spark can be used with Tumult Analytics this way. See the Spark data sources documentation for more details on what formats are supported and the available options for them.
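To try the CSV example above locally, you need a file at private_csv_path. The following is a minimal sketch that writes the three sample rows to a temporary file and reads them back using only the Python standard library. The file name salaries.csv, the temporary directory, and the helper names write_sample_csv and read_rows are assumptions made for this illustration; none of them come from Tumult Analytics or Spark.

```python
import csv
import os
import tempfile

# The three sample rows shown above, including the space after each comma.
SAMPLE_CSV = "name, salary\nalice, 52000\nbob, 75000\ncarol, 96000\n"


def write_sample_csv(directory):
    """Write the sample rows to <directory>/salaries.csv and return the path."""
    path = os.path.join(directory, "salaries.csv")
    with open(path, "w", newline="") as f:
        f.write(SAMPLE_CSV)
    return path


def read_rows(path):
    """Read the file back, dropping the space that follows each comma."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, skipinitialspace=True))


# Hypothetical location for the private CSV file used in the example above.
private_csv_path = write_sample_csv(tempfile.mkdtemp())
rows = read_rows(private_csv_path)
print(rows[0])  # {'name': 'alice', 'salary': '52000'}
```

Note that the standard-library reader returns every field as a string. When reading the same file with Spark, the ignoreLeadingWhiteSpace option of the CSV reader plays the role that skipinitialspace plays here.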

A more common use case is to register public data with your Session (e.g., for use in join operations with the private source).

session_builder = session_builder.with_public_dataframe(
    source_id="my_public_data",
    dataframe=public_df,
)

Public sources can also be added after a Session is created, using the add_public_dataframe() method.

When using a Session Builder, you must specify the overall privacy budget separately:

session_builder = session_builder.with_privacy_budget(PureDPBudget(1))

Once your Session is configured, the final step is to build it:

session = session_builder.build()
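The steps above follow a general builder pattern: each with_* call records one piece of configuration and returns the builder, and build() checks the configuration and produces the final object. The sketch below is a toy, pure-Python model of that pattern. ToySession and ToySessionBuilder are hypothetical names, the rules it enforces (unique source IDs, a budget is required) are assumptions drawn from the text above, and it is not how Tumult Analytics implements Session.Builder.

```python
class ToySession:
    """Hypothetical stand-in for a built Session; it only stores names."""

    def __init__(self, private_sources, public_sources, epsilon):
        # Sorted only so that the printed output is deterministic.
        self.private_sources = sorted(private_sources)
        self.public_sources = sorted(public_sources)
        self.epsilon = epsilon


class ToySessionBuilder:
    """Collects configuration step by step, then validates it in build()."""

    def __init__(self):
        self._private = {}
        self._public = {}
        self._epsilon = None

    def _check_new_id(self, source_id):
        # Assumption: a source ID identifies exactly one data source.
        if source_id in self._private or source_id in self._public:
            raise ValueError(f"duplicate source_id: {source_id!r}")

    def with_private_data(self, source_id, rows):
        self._check_new_id(source_id)
        self._private[source_id] = rows
        return self  # returning self allows the calls to be chained

    def with_public_data(self, source_id, rows):
        self._check_new_id(source_id)
        self._public[source_id] = rows
        return self

    def with_budget(self, epsilon):
        self._epsilon = epsilon
        return self

    def build(self):
        # Assumption: building without a budget is an error.
        if self._epsilon is None:
            raise ValueError("a privacy budget is required")
        return ToySession(self._private, self._public, self._epsilon)


toy_session = (
    ToySessionBuilder()
    .with_private_data("my_private_data", [])
    .with_private_data("my_other_private_data", [])
    .with_public_data("my_public_data", [])
    .with_budget(1)
    .build()
)
print(toy_session.private_sources)
print(toy_session.public_sources)
```

Deferring validation to build() is what lets the configuration calls appear in any order, which is why the budget can be set last in the walkthrough above.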

Examining a Session’s state

After creation, a Session exposes several pieces of information. You can list the string identifiers of available private or public data sources using private_sources or public_sources, respectively.

print(session.private_sources)
print(session.public_sources)

['my_other_private_data', 'my_private_data']
['my_public_data']

These IDs will typically be used when constructing queries, to specify which data source a query refers to. They can also be used to access schema information about individual data sources, through get_schema().

print(session.get_schema('my_private_data'))

Schema({'name': ColumnDescriptor(column_type=ColumnType.VARCHAR, allow_null=True, allow_nan=False, allow_inf=False),
  'age': ColumnDescriptor(column_type=ColumnType.INTEGER, allow_null=True, allow_nan=False, allow_inf=False),
  'grade': ColumnDescriptor(column_type=ColumnType.DECIMAL, allow_null=True, allow_nan=True, allow_inf=True)})

As you can see, Schemas contain information about what columns are in the data, what their types are, and whether each column can contain null, NaN, or infinite values.
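To make the three flags concrete, the sketch below checks individual values against a plain-dictionary description of the same three columns. It is pure Python and does not use Tumult Analytics. TOY_SCHEMA and value_allowed are hypothetical names invented for this illustration, and the check covers only the null, NaN, and infinity flags, not the column types.

```python
import math

# Hypothetical plain-dict mirror of the printed schema: for each column,
# whether null (None), NaN, and infinite values are permitted.
TOY_SCHEMA = {
    "name": {"allow_null": True, "allow_nan": False, "allow_inf": False},
    "age": {"allow_null": True, "allow_nan": False, "allow_inf": False},
    "grade": {"allow_null": True, "allow_nan": True, "allow_inf": True},
}


def value_allowed(column, value):
    """Return True if the value passes the column's null/NaN/inf flags."""
    flags = TOY_SCHEMA[column]
    if value is None:
        return flags["allow_null"]
    if isinstance(value, float) and math.isnan(value):
        return flags["allow_nan"]
    if isinstance(value, float) and math.isinf(value):
        return flags["allow_inf"]
    # Ordinary values are not restricted by these three flags.
    return True


print(value_allowed("grade", float("nan")))  # True
print(value_allowed("age", float("nan")))    # False
print(value_allowed("age", None))            # True
```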

You can access the underlying DataFrames of public sources directly using public_source_dataframes. Note that there is no corresponding accessor for private source DataFrames; after creating a Session, the private data should not be inspected or modified.

The last key piece of information a Session exposes is how much privacy budget the Session has left. As you evaluate queries, the Session’s remaining budget will decrease. The currently available privacy budget can be accessed through remaining_privacy_budget. For example, we can inspect the budget of our Session created from the Builder above:

print(session.remaining_privacy_budget)

PureDPBudget(epsilon=1)

We have not evaluated any queries yet using this Session, so the remaining budget is the same as the total budget that we initialized the Session with earlier.
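The bookkeeping behind a decreasing budget can be sketched in a few lines of plain Python. Under pure differential privacy, the epsilons of successive queries add up (sequential composition), so the remaining budget is the total minus what has been spent so far. ToyBudget below is a hypothetical class written for this illustration; it is not how Tumult Analytics tracks its budget internally.

```python
from fractions import Fraction


class ToyBudget:
    """Toy tracker for a pure-DP budget under sequential composition."""

    def __init__(self, total_epsilon):
        # Fractions avoid floating-point rounding in the running total.
        self.remaining = Fraction(total_epsilon)

    def spend(self, epsilon):
        """Deduct epsilon from the remaining budget, refusing to overspend."""
        epsilon = Fraction(epsilon)
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise ValueError("insufficient privacy budget")
        self.remaining -= epsilon
        return self.remaining


budget = ToyBudget(1)
budget.spend(Fraction(1, 4))  # a first query using a quarter of the budget
print(budget.remaining)  # 3/4
```

In this toy model, a request for more than the remaining budget is rejected outright rather than partially granted, so the total spent can never exceed the budget the tracker was created with.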