program#
SessionProgram and SessionProgram.Builder interfaces.
SessionPrograms are used to define structured DP programs that rely on the
privacy protection provided by the Session API. By defining a standard
interface for creating and running these programs, we can build higher-level
tools, such as SessionProgramTuner, that interact with them in a consistent
way.
The SessionProgram class is an abstract base class that defines the interface
for a structured DP program. It is designed to be subclassed to define
specific programs.
Every SessionProgram has three minimal requirements:
- Defines at least one protected input: a DataFrame that can only be accessed
  through the Session API.
- Defines at least one output: a DataFrame that is produced by the program.
- Defines a session_interaction() method that takes a session as an argument
  and returns a dictionary containing the expected outputs.
>>> class MinimalProgram(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame  # DataFrame type annotation is required
...     class Outputs:
...         total_count: DataFrame  # required here too
...     def session_interaction(self, session: Session):
...         count_query = QueryBuilder("protected_df").count()
...         budget = self.privacy_budget  # session.remaining_privacy_budget also works
...         total_count = session.evaluate(count_query, budget)
...         return {"total_count": total_count}
Once a program is defined, it can be instantiated using the automatically
generated builder for that class, which has an interface very similar to
Session.Builder.
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> program = (
...     MinimalProgram.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .build()
... )
The program can then be run to produce the expected outputs.
>>> program.run()
{'total_count': DataFrame[count: bigint]}
Each instance of a program has the same privacy guarantee as a Session with
the same privacy budget, protected DataFrames, and protected changes. Because
of this, each instance of a program can only be run once. To run the program
again, you must create a new instance of the program.
>>> program.run()
Traceback (most recent call last):
...
RuntimeError: run cannot be called more than once
Most of the time, you will also want to define Parameters and/or
UnprotectedInputs for your program. This can be done by adding them to the
program class, similar to ProtectedInputs and Outputs.
Additionally, once you have structured your program into a SessionProgram,
you will want to take advantage of tools like SessionProgramTuner. See the
tutorials starting at Basics of error measurement for more examples of how to
take advantage of SessionProgram and related classes.
Classes#
SessionProgram: Base class for defining a structured DP program that uses the Session API.
NamedValue: A parameter value associated with a human-readable name.
- class SessionProgram(builder)#
Bases:
abc.ABC
Base class for defining a structured DP program that uses the Session API.
Note
This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.
Warning
SessionPrograms should not be directly constructed. Instead, users should
create a subclass of SessionProgram, then create an instance of their
SessionProgram using the automatically generated Builder attribute of that
subclass.
- Parameters:
  builder (SessionProgram)
- class Builder#
Automatically generated builder for initializing a SessionProgram.
A subclass of this class is automatically generated for each subclass of
SessionProgram. It has a similar interface to Session.Builder.
Note
This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.
- with_private_dataframe(source_id, dataframe, protected_change)#
Adds a Spark DataFrame as a private source.
Not all Spark column types are supported in private sources; see
tmlt.analytics.session.SUPPORTED_SPARK_TYPES for information about which
types are supported.
- Parameters:
  source_id (str) – Source id for the private source dataframe.
  dataframe (pyspark.sql.DataFrame) – Private source dataframe to perform
  queries on, corresponding to the source_id.
  protected_change (tmlt.analytics.protected_change.ProtectedChange) – A
  ProtectedChange specifying what changes to the input data should be
  protected.
- Return type:
- with_public_dataframe(source_id, dataframe)#
Adds a public dataframe.
- Parameters:
source_id (str)
dataframe (pyspark.sql.DataFrame)
- Return type:
- with_parameter(name, value)#
Set the value of a parameter.
- Parameters:
name (str)
value (Any)
- Return type:
- build()#
Returns an instance of the matching SessionProgram subtype.
- Return type:
- with_privacy_budget(privacy_budget)#
Set the privacy budget for the object being built.
- Parameters:
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget)
- with_id_space(id_space)#
Adds an identifier space.
This defines a space of identifiers that map 1-to-1 to the identifiers being
protected by a table with the AddRowsWithID protected change. Any table with
such a protected change must be a member of some identifier space.
- Parameters:
id_space (str)
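For illustration, a hedged sketch of how an identifier space fits together with a protected table (ProgramWithIDs, the "user_id" column, and the "user_ids" space name are hypothetical; AddRowsWithID is assumed to take the ID column name and the identifier space name):

>>> program = (
...     ProgramWithIDs.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_id_space("user_ids")
...     .with_private_dataframe(
...         "protected_df", protected_df, AddRowsWithID("user_id", "user_ids")
...     )
...     .build()
... )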
- class ProtectedInputs#
Annotation class for protected inputs to a SessionProgram.
The ProtectedInputs class enumerates the expected protected DataFrames that
will be used in the program. These are the DataFrames that will be protected
by differential privacy according to their protected change and the privacy
budget provided in the builder.
Each protected DataFrame can be specified in the builder using
with_private_dataframe(). They are then accessible in the
session_interaction() method as a private source with the same name in the
given Session.
Example
>>> class ProgramWithProtectedInputs(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame
...     class Outputs:
...         total_count: DataFrame
...     def session_interaction(self, session: Session):
...         print("Private sources:", session.private_sources)
...         count_query = QueryBuilder("protected_df").count()
...         total_count = session.evaluate(count_query, self.privacy_budget)
...         return {"total_count": total_count}
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> program = (
...     ProgramWithProtectedInputs.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .build()
... )
>>> program.run()
Private sources: ['protected_df']
{'total_count': DataFrame[count: bigint]}
- class UnprotectedInputs#
An annotation class for unprotected inputs to a SessionProgram.
The UnprotectedInputs class enumerates the expected unprotected DataFrames
that will be used by the program. These DataFrames are not protected by
differential privacy, and can be accessed directly by the program. They are
typically used to specify public information used in a public join or in a
KeySet.
Each unprotected DataFrame can be specified in the builder using
with_public_dataframe(). They are then accessible in the
session_interaction() method as a public source with the same name in the
given session, or through the unprotected_inputs property.
Example
>>> class ProgramWithUnprotectedInputs(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame
...     class UnprotectedInputs:
...         public_df: DataFrame
...     class Outputs:
...         total_count: DataFrame
...     def session_interaction(self, session: Session):
...         print("Public sources:", session.public_sources)
...         assert session.public_source_dataframes == {
...             "public_df": public_df
...         }
...         assert self.unprotected_inputs == {"public_df": public_df}
...         count_query = QueryBuilder("protected_df").count()
...         total_count = session.evaluate(count_query, self.privacy_budget)
...         return {"total_count": total_count}
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> public_df = spark.createDataFrame([(1, 2), (3, 4)], ["c", "d"])
>>> program = (
...     ProgramWithUnprotectedInputs.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .with_public_dataframe("public_df", public_df)
...     .build()
... )
>>> program.run()
Public sources: ['public_df']
{'total_count': DataFrame[count: bigint]}
- class Parameters#
Annotation class for parameters to a SessionProgram.
The Parameters class enumerates the expected parameters that will be used by
the program. These parameters are arbitrary (typically simple) Python objects
that are most often used to configure the behavior of the program, such as
setting thresholds, clamping bounds, or budget allocations, or choosing among
algorithms.
Each parameter can be specified in the builder using with_parameter(). They
are then accessible for use in session_interaction() through the parameters
property.
Example
>>> class ProgramWithParameters(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame
...     class Outputs:
...         a_sum: DataFrame
...     class Parameters:
...         low: int
...         high: int
...     def session_interaction(self, session: Session):
...         low = self.parameters["low"]
...         high = self.parameters["high"]
...         sum_query = QueryBuilder("protected_df").sum("a", low, high)
...         a_sum = session.evaluate(sum_query, self.privacy_budget)
...         return {"a_sum": a_sum}
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> program = (
...     ProgramWithParameters.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .with_parameter("low", 0)
...     .with_parameter("high", 5)
...     .build()
... )
>>> program.run()
{'a_sum': DataFrame[a_sum: bigint]}
- class Outputs#
Annotation class for the outputs of a SessionProgram.
These outputs are expected to be returned by the session_interaction() method
as a dictionary, where the keys are the names of the outputs and the values
are the corresponding DataFrames.
Example
>>> class ProgramWithOutputs(SessionProgram):
...     class ProtectedInputs:
...         protected_df: DataFrame
...     class Outputs:
...         total_count: DataFrame
...     def session_interaction(self, session: Session):
...         count_query = QueryBuilder("protected_df").count()
...         total_count = session.evaluate(count_query, self.privacy_budget)
...         return {"total_count": total_count}
>>> protected_df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
>>> program = (
...     ProgramWithOutputs.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .build()
... )
>>> program.run()
{'total_count': DataFrame[count: bigint]}
- property privacy_budget: tmlt.analytics.privacy_budget.PrivacyBudget#
Privacy budget for this program.
- Return type:
- property unprotected_inputs: Dict[str, pyspark.sql.DataFrame]#
Unprotected inputs for this program.
- Return type:
Dict[str, pyspark.sql.DataFrame]
- classmethod output_types()#
Returns a dictionary associating each program output name with its type.
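As a sketch, calling this on the MinimalProgram class defined above would map each declared output name to its annotated DataFrame type (the exact printed representation of the result is not shown in this documentation and may differ):

>>> MinimalProgram.output_types()  # maps "total_count" to the DataFrame type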
- run()#
Runs the program and returns its outputs.
Note that this method can only be called once. If you need to run the program again, you must create a new instance of the program.
- Return type:
Dict[str, pyspark.sql.DataFrame]
- abstract session_interaction(session)#
The interaction with the Session that this program performs.
This method should be overridden by subclasses to generate the expected outputs of the program using the given session. The method should return a dictionary of the expected outputs, where the keys are the names of the outputs and the values are the corresponding DataFrames.
Warning
Do not call this method directly. Instead, call the run() method.
- Parameters:
session (tmlt.analytics.session.Session) – The Session to interact with. It will be initialized with the protected and unprotected DataFrames as well as the privacy budget.
- Return type:
Dict[str, pyspark.sql.DataFrame]
- class NamedValue#
A parameter value associated with a human-readable name.
A NamedValue can be used as a parameter in SessionProgramTuner methods. This
name is then used when printing error reports and converting them to
DataFrames, which can be useful when using parameters that do not have a
simple string representation.
When using a NamedValue to specify a parameter of a SessionProgram, the name
is ignored: initializing the program using
.with_parameter("param", NamedValue(42)) is exactly equivalent to
initializing it using .with_parameter("param", 42). Similarly, a parameter
passed as a NamedValue is unwrapped before being passed to a view(), a custom
metric(), or a custom baseline().
- value: Any#
The value passed as a parameter to the program.
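For illustration, a hedged sketch of passing named parameters to the ProgramWithParameters class defined earlier (this assumes NamedValue accepts the human-readable name alongside the value, e.g. as a name argument, which the class description implies but does not show; as noted above, the names are ignored when the program itself runs and only matter for tuner reports):

>>> program = (
...     ProgramWithParameters.Builder()
...     .with_privacy_budget(PureDPBudget(epsilon=1))
...     .with_private_dataframe("protected_df", protected_df, AddOneRow())
...     .with_parameter("low", NamedValue(0, name="clamp_low"))
...     .with_parameter("high", NamedValue(5, name="clamp_high"))
...     .build()
... )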