NoPrivacySession#

from tmlt.tune import NoPrivacySession

class tmlt.tune.NoPrivacySession(accountant, public_sources, session_data, options)#

Bases: Session

Session-like class to evaluate queries without privacy guarantees.

Note

The features described in this page are only available on a paid version of the Tumult Platform. If you would like to hear more, please contact us at info@tmlt.io.

A NoPrivacySession can be used as a plug-and-play replacement for a Session to evaluate queries without some or all of the differential privacy features. This is useful for generating baseline outputs for DP programs, which are used for error measurement and tuning. See the API reference for more information.

NoPrivacySession has the exact same API as Session, except it has additional options to configure how queries are evaluated. See NoPrivacySession.Options for more information about available options. All valid Tumult Analytics programs can be converted to use non-private query evaluation by replacing the Session with a NoPrivacySession; the only thing that changes is how the queries are evaluated.

As its name suggests, a NoPrivacySession does not provide any privacy guarantees. Only use it during tuning, not deployment, and do not share or publish its outputs. Additionally, there should be little reason to use NoPrivacySession directly; instead, use tools in SessionProgramTuner which rely on NoPrivacySession internally and allow for the same level of configurability.

Warning

NoPrivacySessions should not be directly constructed. Instead, they should be created using NoPrivacySession.from_dataframe() or with a NoPrivacySession.Builder.

class Builder#

Bases: Builder

Builder for NoPrivacySession.

with_privacy_budget(privacy_budget)#

Sets the privacy budget applied to this NoPrivacySession.

Privacy budget accounting works in the same way as in Session: it gets “spent” in the same way, and query evaluation stops when none is left, even though the NoPrivacySession does not provide any privacy guarantee. This is to ensure compatibility with programs that rely on privacy budget accounting behavior.

get_class_type()#: Returns NoPrivacySession type.

build()#

Builds NoPrivacySession with specified configuration.

Return type: NoPrivacySession

class Options(enforce_keysets=False, enforce_clamping_bounds=False, enforce_constraints=False, enforce_flat_map_truncation=False, enforce_private_join_truncation=False, enforce_suppression=False)#

Bases: object

Configuration for how a NoPrivacySession evaluates queries.

All enforcement properties default to False when a NoPrivacySession is created. This means that, by default, a NoPrivacySession ignores keysets, clamping bounds, constraints, truncations, and suppression when evaluating queries.

These options for a NoPrivacySession can be modified at any point after a NoPrivacySession is created. In particular, this allows you to evaluate different queries with different options using the same NoPrivacySession.

Example

>>> spark_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> # Set up session
>>> sess = NoPrivacySession.from_dataframe(
...     privacy_budget=PureDPBudget(3),
...     source_id="my_private_data",
...     dataframe=spark_data,
...     protected_change=AddOneRow(),
... )
>>> # By default enforce_keysets is False
>>> sess.options.enforce_keysets
False
>>> keyset = KeySet.from_dict({"A": ["0", "2"]})
>>> sess.evaluate(
...     QueryBuilder("my_private_data").groupby(keyset).count(),
...     PureDPBudget(1),
... ).toPandas()
   A  count
0  0      1
1  1      2
>>> sess.options.enforce_keysets = True
>>> # Subsequent queries will use the provided keyset to answer GroupBy
>>> # queries
>>> sess.evaluate(
...     QueryBuilder("my_private_data").groupby(keyset).count(),
...     PureDPBudget(1),
... ).toPandas()
   A  count
0  0      1
1  2      0
>>> # The `enforce_keysets` option can be set to False once again
>>> # All subsequent queries will ignore the keyset (unless the option
>>> # is set to `True` again)
>>> sess.options.enforce_keysets = False
>>> sess.evaluate(
...     QueryBuilder("my_private_data").groupby(keyset).count(),
...     PureDPBudget(1),
... ).toPandas()
   A  count
0  0      1
1  1      2

property enforce_keysets: bool#

Whether KeySets will be used to answer GroupBy queries.

This option affects:

groupby()
- When False, the keys parameter is ignored, and no differentially private mechanism is used to generate a keyset if keys aren’t provided.

Defaults to False.
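The effect of this option can be sketched in plain Python. This is only a toy model that mirrors the example above, not the library's implementation: the `grouped_count` helper and the list-based data are hypothetical stand-ins for the Spark table and the KeySet.

```python
from collections import Counter

# Toy stand-in for column "A" of the private table in the example above.
rows_a = ["0", "1", "1"]
# Hypothetical keyset, mirroring KeySet.from_dict({"A": ["0", "2"]}).
keyset = ["0", "2"]


def grouped_count(values, keys, enforce_keysets):
    """Count rows per group, optionally restricting the output to `keys`."""
    counts = Counter(values)
    if enforce_keysets:
        # One output row per key in the keyset; absent groups count as 0.
        return {key: counts.get(key, 0) for key in keys}
    # Keyset ignored: one output row per group actually present in the data.
    return dict(sorted(counts.items()))


ignored = grouped_count(rows_a, keyset, enforce_keysets=False)
enforced = grouped_count(rows_a, keyset, enforce_keysets=True)
print(ignored)   # {'0': 1, '1': 2}
print(enforced)  # {'0': 1, '2': 0}
```

With the keyset ignored, the group "1" appears even though it is not in the keyset; with it enforced, "1" is dropped and "2" is reported with a count of 0.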

property enforce_clamping_bounds: bool#

Whether clamping bounds will be enforced when answering queries.

This option affects:

All aggregations with low and high parameters, e.g. sum() and quantile()

Defaults to False.
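As a rough illustration of what clamping changes, the sketch below sums column B from the examples on this page with and without bounds. The `clamped_sum` helper is hypothetical and only models the idea; it is not how the library computes aggregations.

```python
def clamped_sum(values, low, high, enforce_clamping_bounds):
    """Sum `values`, clamping each one to [low, high] only when enforced."""
    if enforce_clamping_bounds:
        values = [min(max(value, low), high) for value in values]
    return sum(values)


# Column "B" of the example table used throughout this page.
column_b = [1, 0, 2]

unclamped = clamped_sum(column_b, low=0, high=1, enforce_clamping_bounds=False)
clamped = clamped_sum(column_b, low=0, high=1, enforce_clamping_bounds=True)
print(unclamped)  # 3
print(clamped)    # 2
```

The value 2 is clamped down to 1 only when the bounds are enforced, so the two sums differ by exactly the amount that clamping removes.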

property enforce_constraints: bool#

Whether Constraints will be enforced.

This option affects:

enforce()
- When False, the constraint is not applied to the private data. Note that enforce() must still be called with the required constraints to avoid the same errors as in a private session.

Defaults to False.

property enforce_flat_map_truncation: bool#

Whether output of flat maps will be truncated.

This option affects:

flat_map()
- When False, the max_rows parameter is ignored.

Defaults to False.
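A minimal sketch of flat-map truncation, assuming that max_rows limits the number of output rows produced from each input row. The `flat_map` helper and the `expand` function below are hypothetical and stand in for the real query machinery.

```python
def flat_map(rows, f, max_rows, enforce_flat_map_truncation):
    """Apply `f` to each row and concatenate the resulting lists.

    When truncation is enforced, at most `max_rows` outputs are kept
    per input row; otherwise every output is kept.
    """
    output = []
    for row in rows:
        produced = list(f(row))
        if enforce_flat_map_truncation:
            produced = produced[:max_rows]
        output.extend(produced)
    return output


def expand(n):
    """Hypothetical mapping: emit the value n exactly n times."""
    return [n] * n


rows = [1, 3, 2]
untruncated = flat_map(rows, expand, max_rows=2, enforce_flat_map_truncation=False)
truncated = flat_map(rows, expand, max_rows=2, enforce_flat_map_truncation=True)
print(untruncated)  # [1, 3, 3, 3, 2, 2]
print(truncated)    # [1, 3, 3, 2, 2]
```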

property enforce_private_join_truncation: bool#

Whether truncation will be used in private joins.

This option affects:

join_private()
- When False, the truncation_strategy_left and truncation_strategy_right parameters are ignored.

Defaults to False.

property enforce_suppression: bool#

Whether suppression will be enforced when running SuppressAggregates.

This option affects:

suppress()
- The original results of the query are returned when suppression is not enforced.

Defaults to False.
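The following toy model illustrates the difference, assuming that suppression drops groups whose aggregate falls below a threshold. The `suppress` function here is a hypothetical plain-Python helper, and the threshold semantics are an assumption rather than a statement of the library's exact behavior.

```python
def suppress(counts, threshold, enforce_suppression):
    """Drop groups whose count is below `threshold`, only when enforced."""
    if not enforce_suppression:
        # Suppression not enforced: the original results come back unchanged.
        return dict(counts)
    return {key: count for key, count in counts.items() if count >= threshold}


# Exact grouped counts for column "A" of the example table.
counts = {"0": 1, "1": 2}

original = suppress(counts, threshold=2, enforce_suppression=False)
suppressed = suppress(counts, threshold=2, enforce_suppression=True)
print(original)    # {'0': 1, '1': 2}
print(suppressed)  # {'1': 2}
```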

__eq__(other)#

Returns whether this object is equal to another object.

Return type: bool

classmethod from_dataframe(privacy_budget, source_id, dataframe, protected_change)#

Initializes a NoPrivacySession from a Spark DataFrame.

Only one data source is supported with this initialization method; if you need multiple data sources, use NoPrivacySession.Builder.

Not all Spark column types are supported in sources; see ColumnType for information about which types are supported.

Example

>>> spark_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> # Declare budget for the session.
>>> session_budget = PureDPBudget(1)
>>> # Set up Session
>>> sess = NoPrivacySession.from_dataframe(
...     privacy_budget=session_budget,
...     source_id="my_private_data",
...     dataframe=spark_data,
...     protected_change=AddOneRow(),
... )
>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data") 
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}

Parameters:

privacy_budget (PrivacyBudget) – The privacy budget for the session. If a non-infinite budget is provided, it will be replaced with an infinite budget of the same type.
source_id (str) – The source ID for the source DataFrame.
dataframe (DataFrame) – The source DataFrame to perform queries on, corresponding to the source_id.
protected_change (ProtectedChange) – A ProtectedChange specifying what changes to the input data the resulting NoPrivacySession should protect.

Return type:

NoPrivacySession

evaluate(query_expr, privacy_budget)#

Answers a query without any privacy guarantees, returning a Spark DataFrame.

Note that query evaluation behavior depends on options (see NoPrivacySession.Options for more information).

The type of privacy_budget must match the type your NoPrivacySession was initialized with (i.e., you cannot evaluate a query using RhoZCDPBudget if this NoPrivacySession was initialized with a PureDPBudget, and vice versa). Even though it does not provide any privacy guarantees, NoPrivacySession keeps track of its privacy budget just like a Session. In particular, this method “spends” the specified privacy_budget.

Example

>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data") 
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=1)
>>> # Evaluate Queries
>>> filter_query = QueryBuilder("my_private_data").filter("A > 0")
>>> count_query = filter_query.groupby(KeySet.from_dict({"X": [0, 1]})).count()
>>> count_answer = sess.evaluate(
...     query_expr=count_query,
...     privacy_budget=PureDPBudget(0.5),
... )
>>> sum_query = filter_query.sum(column="B", low=0, high=1)
>>> sum_answer = sess.evaluate(
...     query_expr=sum_query,
...     privacy_budget=PureDPBudget(0.5),
... )
>>> count_answer # TODO(#798): Seed randomness and change to toPandas()
DataFrame[X: bigint, count: bigint]
>>> sum_answer # TODO(#798): Seed randomness and change to toPandas()
DataFrame[B_sum: bigint]
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=0)

Parameters:

query_expr (Query) – One query expression to answer.
privacy_budget (PrivacyBudget) – The privacy budget used for the query.

Return type:

Any

create_view(query_expr, source_id, cache)#

Creates a new view from a transformation and optionally caches it.

Example

>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data") 
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> public_spark_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> sess.add_public_dataframe("my_public_data", public_spark_data)
>>> # Create a view
>>> join_query = (
...     QueryBuilder("my_private_data")
...     .join_public("my_public_data")
...     .select(["A", "B", "C"])
... )
>>> sess.create_view(
...     join_query,
...     source_id="private_public_join",
...     cache=True
... )
>>> sess.private_sources
['private_public_join', 'my_private_data']
>>> sess.get_column_types("private_public_join") 
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'C': ColumnType.INTEGER}
>>> # Delete the view
>>> sess.delete_view("private_public_join")
>>> sess.private_sources
['my_private_data']

Parameters:

query_expr (QueryBuilder) – A query that performs a transformation.
source_id (str) – The name, or unique identifier, of the view.
cache (bool) – Whether or not to cache the view.

delete_view(source_id)#

Deletes a view and decaches it if it was cached.

Parameters: source_id (str) – The name of the view.

partition_and_create(source_id, privacy_budget, column, splits)#

Returns a new NoPrivacySession for each partition.

This works exactly like Session.partition_and_create(), but returns NoPrivacySessions instead of Sessions.

Example

This example partitions the session into two sessions, one with A = “0” and one with A = “1”. Due to parallel composition, each of these Sessions is given the same budget, while that budget is deducted only once from the original Session.

Unlike Session.partition_and_create(), the new Sessions are of type NoPrivacySession, and so the result of the count query is exact.

>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data") 
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=1)
>>> # Partition the Session
>>> new_sessions = sess.partition_and_create(
...     "my_private_data",
...     privacy_budget=PureDPBudget(0.75),
...     column="A",
...     splits={"part0":"0", "part1":"1"}
... )
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=0.25)
>>> # The new sessions are NoPrivacySessions
>>> isinstance(new_sessions["part0"], NoPrivacySession)
True
>>> new_sessions["part0"].private_sources
['part0']
>>> new_sessions["part0"].get_column_types("part0") 
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> new_sessions["part0"].remaining_privacy_budget
PureDPBudget(epsilon=0.75)
>>> new_sessions["part1"].private_sources
['part1']
>>> new_sessions["part1"].get_column_types("part1") 
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> new_sessions["part1"].remaining_privacy_budget
PureDPBudget(epsilon=0.75)

When you are done with a new session, you can use the NoPrivacySession.stop() method to allow the next one to become active:

>>> new_sessions["part0"].stop()
>>> new_sessions["part1"].private_sources
['part1']
>>> count_query = QueryBuilder("part1").count()
>>> count_answer = new_sessions["part1"].evaluate(
...     count_query,
...     PureDPBudget(0.75),
... )

>>> # The result is exact, because new_sessions["part1"] is a NoPrivacySession
>>> count_answer.toPandas() 
   count
0      2

Parameters:

source_id (str) – The private source to partition.
privacy_budget (PrivacyBudget) – Privacy budget to pass to each new session.
column (str) – The name of the column to partition on.
splits (Union[Dict[str, str], Dict[str, int]]) – Mapping of split name to the value of the partition column for that split. Each split name becomes the source_id in the corresponding new session.

Return type:

Dict[str, NoPrivacySession]
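The budget arithmetic in the partitioning example can be modeled with a small ledger. This is a sketch under stated assumptions: `ToyLedger` is a hypothetical class that tracks a single epsilon value and is not part of tmlt.tune.

```python
from dataclasses import dataclass


@dataclass
class ToyLedger:
    """Illustrative epsilon ledger; not the library's accountant."""

    remaining: float

    def partition(self, epsilon, split_names):
        """Deduct `epsilon` once and give each split its own ledger."""
        if epsilon > self.remaining:
            raise ValueError("not enough budget to partition")
        # The parent pays for the partition a single time...
        self.remaining -= epsilon
        # ...and, by parallel composition, every child starts with that amount.
        return {name: ToyLedger(epsilon) for name in split_names}


parent = ToyLedger(remaining=1.0)
children = parent.partition(0.75, ["part0", "part1"])
print(parent.remaining)  # 0.25
print({name: child.remaining for name, child in children.items()})
# {'part0': 0.75, 'part1': 0.75}
```

The numbers match the example: the parent drops from 1 to 0.25, and both children start at 0.75.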

stop()#

Closes out this NoPrivacySession, allowing others to become active.

Return type: None

property options: Options#: Returns the query evaluation options for this NoPrivacySession.

property remaining_privacy_budget: PrivacyBudget#

Returns the remaining privacy budget.

Privacy budget accounting works in the same way as in Session: it gets “spent” in the same way, and query evaluation stops when none is left, even though the NoPrivacySession does not provide any privacy guarantee.
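To make the accounting behavior concrete, here is a minimal model in plain Python. `ToyAccountant` and `BudgetExhaustedError` are hypothetical names used only for illustration; the real library's accountant and the error it raises may differ.

```python
class BudgetExhaustedError(RuntimeError):
    """Hypothetical error type; the real library's exception may differ."""


class ToyAccountant:
    """Illustrative epsilon tracker that returns exact (noise-free) answers."""

    def __init__(self, epsilon):
        self.remaining = epsilon

    def evaluate(self, query, epsilon):
        # Refuse the query if it would overspend the remaining budget.
        if epsilon > self.remaining:
            raise BudgetExhaustedError("not enough privacy budget left")
        self.remaining -= epsilon
        # Budget is spent, yet the answer is exact: no noise is added.
        return query()


accountant = ToyAccountant(epsilon=1.0)
first = accountant.evaluate(lambda: 3, epsilon=0.5)
second = accountant.evaluate(lambda: 2, epsilon=0.5)
try:
    accountant.evaluate(lambda: 1, epsilon=0.5)
    refused = False
except BudgetExhaustedError:
    refused = True
print(accountant.remaining, refused)  # 0.0 True
```

Budget is consumed even though no noise is added, so a program that depends on budget exhaustion behaves the same way in this model as it would with real accounting.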