program#
SessionProgram and SessionProgram.Builder interfaces.
SessionPrograms are used to define structured DP programs that rely on the
privacy protection provided by the Session API. By defining a standard
interface for creating and running these programs, we can build higher-level
tools, such as SessionProgramTuner, that interact with them in a consistent
way.
The SessionProgram class is an abstract base class that defines the interface
for a structured DP program. It is designed to be subclassed to define
specific programs.
Every SessionProgram has three minimal requirements:
- Defines at least one protected input: a DataFrame that can only be accessed
  through the Session API.
- Defines at least one output: a DataFrame that is produced by the program.
- Defines a session_interaction() method that takes a session as an argument
  and returns a dictionary containing the expected outputs.
>>> class MinimalProgram(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame  # DataFrame type annotation is required
...     class Outputs:
...         total_count: DataFrame  # required here too
...     def session_interaction(self, session: Session):
...         count_query = QueryBuilder("protected_df").count()
...         budget = self.privacy_budget  # session.remaining_privacy_budget also works
...         total_count = session.evaluate(count_query, budget)
...         return {"total_count": total_count}
Once a program is defined, it can be instantiated using the automatically
generated builder for that class, which has an interface very similar to
Session.Builder.
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> program = (
...     MinimalProgram.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .build()
... )
The program can then be run to produce the expected outputs.
>>> program.run()
{'total_count': DataFrame[count: bigint]}
Each instance of a program has the same privacy guarantee as a Session with
the same privacy budget, protected DataFrames, and protected changes. Because
of this, each instance of a program can only be run once. To run the program
again, you must create a new instance of the program.
>>> program.run()
Traceback (most recent call last):
...
RuntimeError: run cannot be called more than once
Most of the time, you will also want to define Parameters and/or
UnprotectedInputs for your program. This can be done by adding them to the
program class, similar to ProtectedInputs and Outputs.
Additionally, once you have structured your program into a SessionProgram,
you will want to take advantage of tools like SessionProgramTuner. See the
tutorials starting at Basics of error measurement for more examples of how to
take advantage of SessionProgram and related classes.
Classes#
SessionProgram: Base class for defining a structured DP program that uses the Session API.
NamedValue: A parameter value associated with a human-readable name.
- class SessionProgram(builder)#
Bases:
abc.ABC
Base class for defining a structured DP program that uses the Session API.
Note
This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.
Warning
SessionPrograms should not be directly constructed. Instead, users should
create a subclass of SessionProgram, then create an instance of their
SessionProgram using the automatically generated Builder attribute of that
subclass.
- Parameters:
  builder (SessionProgram)
- class Builder#
Automatically generated builder for initializing a SessionProgram.
A subclass of this class is automatically generated for each subclass of
SessionProgram. It has a similar interface to Session.Builder.
Note
This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.
- with_private_dataframe(source_id, dataframe, protected_change)#
Adds a Spark DataFrame as a private source.
Not all Spark column types are supported in private sources; see
tmlt.analytics.session.SUPPORTED_SPARK_TYPES for information about which
types are supported.
- Parameters:
  source_id (str) – Source id for the private source dataframe.
  dataframe (pyspark.sql.DataFrame) – Private source dataframe to perform
  queries on, corresponding to the source_id.
  protected_change (tmlt.analytics.protected_change.ProtectedChange) – A
  ProtectedChange specifying what changes to the input data should be
  protected.
- Return type:
- with_public_dataframe(source_id, dataframe)#
Adds a public dataframe.
- Parameters:
source_id (str)
dataframe (pyspark.sql.DataFrame)
- Return type:
- with_parameter(name, value)#
Set the value of a parameter.
- Parameters:
name (str)
value (Any)
- Return type:
- build()#
Returns an instance of the matching SessionProgram subtype.
- Return type:
- with_privacy_budget(privacy_budget)#
Set the privacy budget for the object being built.
- Parameters:
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget)
- with_id_space(id_space)#
Adds an identifier space.
This defines a space of identifiers that map 1-to-1 to the identifiers being
protected by a table with the AddRowsWithID protected change. Any table with
such a protected change must be a member of some identifier space.
- Parameters:
id_space (str)
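For illustration, a hedged sketch of how an identifier space fits together with a protected table (ProgramWithIDs, the "user_id" column, and the "user_ids" space name are hypothetical; AddRowsWithID is assumed to take the ID column name and the identifier space name):

>>> program = (
...     ProgramWithIDs.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_id_space("user_ids")
...     .with_private_dataframe(
...         "protected_df", protected_df, AddRowsWithID("user_id", "user_ids")
...     )
...     .build()
... )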
- class ProtectedInputs#
Annotation class for protected inputs to a SessionProgram.
The ProtectedInputs class enumerates the expected protected DataFrames that
will be used in the program. These are the DataFrames that will be protected
by differential privacy according to their protected change and the privacy
budget provided in the builder.
Each protected DataFrame can be specified in the builder using
with_private_dataframe(). They are then accessible in the
session_interaction() method as a private source with the same name in the
given Session.
Example
>>> class ProgramWithProtectedInputs(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame
...     class Outputs:
...         total_count: DataFrame
...     def session_interaction(self, session: Session):
...         print("Private sources:", session.private_sources)
...         count_query = QueryBuilder("protected_df").count()
...         total_count = session.evaluate(count_query, self.privacy_budget)
...         return {"total_count": total_count}
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> program = (
...     ProgramWithProtectedInputs.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .build()
... )
>>> program.run()
Private sources: ['protected_df']
{'total_count': DataFrame[count: bigint]}
- class UnprotectedInputs#
An annotation class for unprotected inputs to a SessionProgram.
The UnprotectedInputs class enumerates the expected unprotected DataFrames
that will be used by the program. These DataFrames are not protected by
differential privacy, and can be accessed directly by the program. They are
typically used to specify public information used in a public join or in a
KeySet.
Each unprotected DataFrame can be specified in the builder using
with_public_dataframe(). They are then accessible in the
session_interaction() method as a public source with the same name in the
given session, or through the unprotected_inputs property.
Example
>>> class ProgramWithUnprotectedInputs(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame
...     class UnprotectedInputs:
...         public_df: DataFrame
...     class Outputs:
...         total_count: DataFrame
...     def session_interaction(self, session: Session):
...         print("Public sources:", session.public_sources)
...         assert session.public_source_dataframes == {
...             "public_df": public_df
...         }
...         assert self.unprotected_inputs == {"public_df": public_df}
...         count_query = QueryBuilder("protected_df").count()
...         total_count = session.evaluate(count_query, self.privacy_budget)
...         return {"total_count": total_count}
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> public_df = spark.createDataFrame([(1, 2), (3, 4)], ["c", "d"])
>>> program = (
...     ProgramWithUnprotectedInputs.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .with_public_dataframe("public_df", public_df)
...     .build()
... )
>>> program.run()
Public sources: ['public_df']
{'total_count': DataFrame[count: bigint]}
- class Parameters#
Annotation class for parameters to a SessionProgram.
The Parameters class enumerates the expected parameters that will be used by
the program. These parameters are arbitrary (typically simple) Python objects
that are most often used to configure the behavior of the program, such as
setting thresholds, clamping bounds, or budget allocations, or choosing among
algorithms.
Each parameter can be specified in the builder using with_parameter(). They
are then accessible for use in session_interaction() through the parameters
property.
Example
>>> class ProgramWithParameters(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame
...     class Outputs:
...         a_sum: DataFrame
...     class Parameters:
...         low: int
...         high: int
...     def session_interaction(self, session: Session):
...         low = self.parameters["low"]
...         high = self.parameters["high"]
...         sum_query = QueryBuilder("protected_df").sum("a", low, high)
...         a_sum = session.evaluate(sum_query, self.privacy_budget)
...         return {"a_sum": a_sum}
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> program = (
...     ProgramWithParameters.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .with_parameter("low", 0)
...     .with_parameter("high", 5)
...     .build()
... )
>>> program.run()
{'a_sum': DataFrame[a_sum: bigint]}
- class Outputs#
Annotation class for the outputs of a SessionProgram.
These outputs are expected to be returned by the session_interaction() method
as a dictionary, where the keys are the names of the outputs and the values
are the corresponding DataFrames.
Example
>>> class ProgramWithOutputs(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame
...     class Outputs:
...         total_count: DataFrame
...     def session_interaction(self, session: Session):
...         count_query = QueryBuilder("protected_df").count()
...         total_count = session.evaluate(count_query, self.privacy_budget)
...         return {"total_count": total_count}
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> program = (
...     ProgramWithOutputs.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .build()
... )
>>> program.run()
{'total_count': DataFrame[count: bigint]}
- property privacy_budget: tmlt.analytics.privacy_budget.PrivacyBudget#
Privacy budget for this program.
- Return type:
- property unprotected_inputs: Dict[str, pyspark.sql.DataFrame]#
Unprotected inputs for this program.
- Return type:
Dict[str, pyspark.sql.DataFrame]
- classmethod output_types()#
Returns a dictionary associating each program output name with its type.
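As a sketch, calling this on the MinimalProgram class defined above would map each declared output name to its annotated DataFrame type (the exact printed representation of the result is not shown in this documentation and may differ):

>>> MinimalProgram.output_types()  # maps "total_count" to the DataFrame type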
- run()#
Runs the program and returns its outputs.
Note that this method can only be called once. If you need to run the program again, you must create a new instance of the program.
- Return type:
Dict[str, pyspark.sql.DataFrame]
- abstract session_interaction(session)#
The interaction with the Session that this program performs.
This method should be overridden by subclasses to generate the expected outputs of the program using the given session. The method should return a dictionary of the expected outputs, where the keys are the names of the outputs and the values are the corresponding DataFrames.
Warning
Do not call this method directly. Instead, call the run() method.
- Parameters:
session (tmlt.analytics.session.Session) – The Session to interact with. It will be initialized with the protected and unprotected DataFrames as well as the privacy budget.
- Return type:
Dict[str, pyspark.sql.DataFrame]
- class NamedValue#
A parameter value associated with a human-readable name.
A NamedValue can be used as a parameter in SessionProgramTuner methods. This
name is then used when printing error reports and converting them to
DataFrames, which can be useful when using parameters that do not have a
simple string representation.
When using a NamedValue to specify a parameter of a SessionProgram, the name
is ignored: initializing the program using
.with_parameter("param", NamedValue(42)) is exactly equivalent to
initializing it using .with_parameter("param", 42). Similarly, a parameter
passed as a NamedValue is unwrapped before being passed to a view(), a custom
metric(), or a custom baseline().
- value: Any#
The value passed as a parameter to the program.
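For illustration, a hedged sketch of passing named parameters to the ProgramWithParameters class defined earlier (this assumes NamedValue accepts the human-readable name alongside the value, e.g. as a name argument, which the class description implies but does not show; as noted above, the names are ignored when the program itself runs and only matter for tuner reports):

>>> program = (
...     ProgramWithParameters.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .with_parameter("low", NamedValue(0, name="clamp_low"))
...     .with_parameter("high", NamedValue(5, name="clamp_high"))
...     .build()
... )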