no_privacy_session#
Interactive query evaluation without any privacy guarantees.

NoPrivacySession can be used to evaluate queries without differential privacy while using the same syntax as the Session. It is primarily meant to be used by SessionProgramTuner to compute baselines and error metrics for tuning SessionPrograms; in most cases, using the SessionProgramTuner (instead of using a NoPrivacySession directly) is a better choice.
Classes#
- class NoPrivacySession(accountant, public_sources, session_data, options)#
Bases: tmlt.analytics.session.Session

A Session-like interface for evaluating queries without privacy guarantees.

Note
This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.
A NoPrivacySession can be used as a plug-and-play replacement for a Session to evaluate queries without some or all of the differential privacy features. This is useful for generating baseline outputs for DP programs, which are used for error measurement and tuning. See tuner for more information.

NoPrivacySession has the exact same API as Session, except it has additional options to configure how queries are evaluated. See NoPrivacySession.Options for more information about the available options. All valid Tumult Analytics programs can be converted to use non-private query evaluation by replacing the Session with a NoPrivacySession; the only thing that changes is how the queries are evaluated.

As its name suggests, a NoPrivacySession does not provide any privacy guarantees. Only use it during tuning, not deployment, and do not share or publish its outputs. Additionally, there should be little reason to use NoPrivacySession directly; instead, use the tools in tuner, which rely on NoPrivacySession internally and allow for the same level of configurability.

Warning
NoPrivacySessions should not be constructed directly. Instead, they should be created using NoPrivacySession.from_dataframe() or with a NoPrivacySession.Builder.
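For a concrete sense of the plug-and-play swap, here is a minimal sketch (imports are omitted, as in the other examples on this page, and spark_data stands in for any private Spark DataFrame). The only change from a private program is constructing a NoPrivacySession instead of a Session:

>>> # Identical to a Session-based program, except for the class name.
>>> sess = NoPrivacySession.from_dataframe(  # was: Session.from_dataframe(
...     privacy_budget=PureDPBudget(1),
...     source_id="my_private_data",
...     dataframe=spark_data,
...     protected_change=AddOneRow(),
... )
>>> # Queries are written and evaluated exactly as with a Session,
>>> # but the answers are computed without noise.
>>> answer = sess.evaluate(
...     QueryBuilder("my_private_data").count(),
...     PureDPBudget(0.5),
... )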
- Builder: Builder for NoPrivacySession.
- Options: Configuration for how a NoPrivacySession evaluates queries.
- options: Returns the query evaluation options for this NoPrivacySession.
- remaining_privacy_budget: Returns the remaining privacy budget.
- private_sources: Returns the IDs of the private sources.
- public_sources: Returns the IDs of the public sources.
- public_source_dataframes: Returns a dictionary of public source DataFrames.
- from_dataframe(): Initializes a NoPrivacySession from a Spark DataFrame.
- evaluate(): Answers a query without any privacy guarantees, returning a Spark DataFrame.
- create_view(): Creates a new view from a transformation and possibly caches it.
- delete_view(): Deletes a view and decaches it if it was cached.
- partition_and_create(): Returns new NoPrivacySessions for each partition.
- describe(): Describes this session, or one of its tables, or the result of a query.
- get_schema(): Returns the schema for any data source.
- get_column_types(): Returns the column types for any data source.
- get_grouping_column(): Returns an optional column that must be grouped by in this query.
- get_id_column(): Returns the ID column of a table, if it has one.
- get_id_space(): Returns the ID space of a table, if it has one.
- add_public_dataframe(): Adds a public data source to the session.
- stop(): Closes out this session, allowing other sessions to become active.
- Parameters:
accountant (tmlt.core.measurements.interactive_measurements.PrivacyAccountant)
public_sources (Dict[str, pyspark.sql.DataFrame])
session_data (Dict[tmlt.analytics._table_identifier.Identifier, Any])
options (Options)
- class Builder#
Bases: tmlt.analytics.session.Session.Builder

Builder for NoPrivacySession.

- with_privacy_budget(privacy_budget)#
Sets the privacy budget applied to this NoPrivacySession.
Privacy budget accounting works in the same way as in Session: it gets “spent” in the same way, and query evaluation stops when none is left, even though the NoPrivacySession does not provide any privacy guarantee. This is to ensure compatibility with programs that rely on privacy budget accounting behavior.
- Parameters:
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget)
- get_class_type()#
Returns the NoPrivacySession type.
- build()#
Builds a NoPrivacySession with the specified configuration.
- Return type:
NoPrivacySession
- with_private_dataframe(source_id, dataframe, protected_change)#
Adds a Spark DataFrame as a private source.
Not all Spark column types are supported in private sources; see tmlt.analytics.session.SUPPORTED_SPARK_TYPES for information about which types are supported.
- Parameters:
source_id (str) – Source id for the private source dataframe.
dataframe (pyspark.sql.DataFrame) – Private source dataframe to perform queries on, corresponding to the source_id.
protected_change (tmlt.analytics.protected_change.ProtectedChange) – A ProtectedChange specifying what changes to the input data should be protected.
- with_public_dataframe(source_id, dataframe)#
Adds a public dataframe.
- Parameters:
source_id (str)
dataframe (pyspark.sql.DataFrame)
- with_id_space(id_space)#
Adds an identifier space.
This defines a space of identifiers that map 1-to-1 to the identifiers being protected by a table with the AddRowsWithID protected change. Any table with such a protected change must be a member of some identifier space.
- Parameters:
id_space (str)
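Putting the builder methods together, here is a minimal sketch of constructing a NoPrivacySession with two private tables that share an identifier space, plus one public table. The table names, DataFrames, and the keyword arguments passed to AddRowsWithID are illustrative assumptions rather than part of this reference; imports are omitted as in the other examples:

>>> # Hypothetical sketch: two private sources protected at the ID level,
>>> # one public source, and a nominal budget for accounting purposes.
>>> sess = (
...     NoPrivacySession.Builder()
...     .with_privacy_budget(PureDPBudget(2))
...     .with_id_space("customer_ids")
...     .with_private_dataframe(
...         "purchases",
...         purchases_df,
...         protected_change=AddRowsWithID(id_column="customer_id", id_space="customer_ids"),
...     )
...     .with_private_dataframe(
...         "visits",
...         visits_df,
...         protected_change=AddRowsWithID(id_column="customer_id", id_space="customer_ids"),
...     )
...     .with_public_dataframe("stores", stores_df)
...     .build()
... )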
- class Options(enforce_keysets=False, enforce_clamping_bounds=False, enforce_constraints=False, enforce_flat_map_truncation=False, enforce_private_join_truncation=False, enforce_suppression=False)#
Configuration for how a NoPrivacySession evaluates queries.

All enforcement properties default to False when a NoPrivacySession is created. This means that, by default, a NoPrivacySession ignores keysets, clamping bounds, constraints, and truncations when evaluating queries.

These options for a NoPrivacySession can be modified at any point after a NoPrivacySession is created. In particular, this allows you to evaluate different queries with different options using the same NoPrivacySession.

Example
>>> spark_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> # Set up session
>>> sess = NoPrivacySession.from_dataframe(
...     privacy_budget=PureDPBudget(3),
...     source_id="my_private_data",
...     dataframe=spark_data,
...     protected_change=AddOneRow(),
... )
>>> # By default enforce_keysets is False
>>> sess.options.enforce_keysets
False
>>> keyset = KeySet.from_dict({"A": ["0", "2"]})
>>> sess.evaluate(
...     QueryBuilder("my_private_data").groupby(keyset).count(),
...     PureDPBudget(1),
... ).toPandas()
   A  count
0  0      1
1  1      2
>>> sess.options.enforce_keysets = True
>>> # Subsequent queries will use the provided keyset to answer GroupBy
>>> # queries
>>> sess.evaluate(
...     QueryBuilder("my_private_data").groupby(keyset).count(),
...     PureDPBudget(1),
... ).toPandas()
   A  count
0  0      1
1  2      0
>>> # The `enforce_keysets` option can be set to False once again
>>> # All subsequent queries will ignore the keyset (unless the option
>>> # is set to `True` again)
>>> sess.options.enforce_keysets = False
>>> sess.evaluate(
...     QueryBuilder("my_private_data").groupby(keyset).count(),
...     PureDPBudget(1),
... ).toPandas()
   A  count
0  0      1
1  1      2
- Parameters:
- property enforce_keysets: bool#
Whether KeySets will be used to answer GroupBy queries.
When this option is False, the keys parameter isn't used, and a differentially private mechanism won't be used to generate a keyset if keys aren't provided.
Defaults to False.
- Return type:
bool
- property enforce_clamping_bounds: bool#
Whether clamping bounds will be enforced when answering queries.
This option affects all aggregations with low and high parameters, e.g. sum() and quantile().
Defaults to False.
- Return type:
bool
- property enforce_constraints: bool#
Whether Constraints will be enforced.
When this option is False, constraints are not applied to the private data. Note that enforce() still must be called with the required constraints to avoid the same errors as in a private session.
Defaults to False.
- Return type:
bool
- property enforce_flat_map_truncation: bool#
Whether output of flat maps will be truncated.
When this option is False, flat maps don't use the max_rows parameter.
Defaults to False.
- Return type:
bool
- property enforce_private_join_truncation: bool#
Whether truncation will be used in private joins.
When this option is False, private joins don't use the truncation_strategy_left and truncation_strategy_right parameters.
Defaults to False.
- Return type:
bool
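As a further illustration of per-query configuration, here is a hedged sketch of toggling enforce_clamping_bounds between two evaluations. It assumes a NoPrivacySession named sess over the my_private_data table from the examples above, with enough budget remaining; the comments describe the documented behavior rather than specific outputs:

>>> # With enforce_clamping_bounds False (the default), the low/high bounds
>>> # passed to sum() are ignored, so the raw column sum is returned.
>>> sum_query = QueryBuilder("my_private_data").sum(column="B", low=0, high=1)
>>> sess.options.enforce_clamping_bounds = False
>>> unclamped = sess.evaluate(sum_query, PureDPBudget(0.5))
>>> # With the option enabled, values are clamped to [0, 1] before summing,
>>> # matching what a private Session would compute, minus the noise.
>>> sess.options.enforce_clamping_bounds = True
>>> clamped = sess.evaluate(sum_query, PureDPBudget(0.5))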
- property options: Options#
Returns the query evaluation options for this NoPrivacySession.
- Return type:
Options
- property remaining_privacy_budget: tmlt.analytics.privacy_budget.PrivacyBudget#
Returns the remaining privacy budget.
Privacy budget accounting works in the same way as in Session: it gets “spent” in the same way, and query evaluation stops when none is left, even though the NoPrivacySession does not provide any privacy guarantee.
- Return type:
tmlt.analytics.privacy_budget.PrivacyBudget
- property public_source_dataframes: Dict[str, pyspark.sql.DataFrame]#
Returns a dictionary of public source DataFrames.
- Return type:
Dict[str, pyspark.sql.DataFrame]
- classmethod from_dataframe(privacy_budget, source_id, dataframe, protected_change)#
Initializes a NoPrivacySession from a Spark DataFrame.
Only one data source is supported with this initialization method; if you need multiple data sources, use NoPrivacySession.Builder.
Not all Spark column types are supported in sources; see SUPPORTED_SPARK_TYPES for information about which types are supported.
Example
>>> spark_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> # Declare budget for the session.
>>> session_budget = PureDPBudget(1)
>>> # Set up Session
>>> sess = NoPrivacySession.from_dataframe(
...     privacy_budget=session_budget,
...     source_id="my_private_data",
...     dataframe=spark_data,
...     protected_change=AddOneRow(),
... )
>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
- Parameters:
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) – The privacy budget for the session. If a non-infinite budget is provided, it will be replaced with an infinite budget of the same type.
source_id (str) – The source ID for the source DataFrame.
dataframe (pyspark.sql.DataFrame) – The source DataFrame to perform queries on, corresponding to the source_id.
protected_change (tmlt.analytics.protected_change.ProtectedChange) – A ProtectedChange specifying what changes to the input data the resulting NoPrivacySession should protect.
- Return type:
NoPrivacySession
- evaluate(query_expr, privacy_budget)#
Answers a query without any privacy guarantees, returning a Spark DataFrame.
Note that query evaluation behavior depends on options (see NoPrivacySession.Options for more information).
The type of privacy_budget must match the type your NoPrivacySession was initialized with (i.e., you cannot evaluate a query using a RhoZCDPBudget if this NoPrivacySession was initialized with a PureDPBudget, and vice versa). And, even though it does not provide any privacy guarantees, a NoPrivacySession keeps track of its privacy budget just like a Session. In particular, this method “spends” the specified privacy_budget.
Example
>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=1)
>>> # Evaluate Queries
>>> filter_query = QueryBuilder("my_private_data").filter("A > 0")
>>> count_query = filter_query.groupby(KeySet.from_dict({"X": [0, 1]})).count()
>>> count_answer = sess.evaluate(
...     query_expr=count_query,
...     privacy_budget=PureDPBudget(0.5),
... )
>>> sum_query = filter_query.sum(column="B", low=0, high=1)
>>> sum_answer = sess.evaluate(
...     query_expr=sum_query,
...     privacy_budget=PureDPBudget(0.5),
... )
>>> count_answer  # TODO(#798): Seed randomness and change to toPandas()
DataFrame[X: bigint, count: bigint]
>>> sum_answer  # TODO(#798): Seed randomness and change to toPandas()
DataFrame[B_sum: bigint]
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=0)
- Parameters:
query_expr (tmlt.analytics.query_builder.Query) – One query expression to answer.
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) – The privacy budget used for the query.
- Return type:
Any
- create_view(query_expr, source_id, cache)#
Creates a new view from a transformation and possibly caches it.
Example
>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> public_spark_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> sess.add_public_dataframe("my_public_data", public_spark_data)
>>> # Create a view
>>> join_query = (
...     QueryBuilder("my_private_data")
...     .join_public("my_public_data")
...     .select(["A", "B", "C"])
... )
>>> sess.create_view(
...     join_query,
...     source_id="private_public_join",
...     cache=True
... )
>>> sess.private_sources
['private_public_join', 'my_private_data']
>>> sess.get_column_types("private_public_join")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'C': ColumnType.INTEGER}
>>> # Delete the view
>>> sess.delete_view("private_public_join")
>>> sess.private_sources
['my_private_data']
- Parameters:
query_expr (tmlt.analytics.query_builder.QueryBuilder) – A query that performs a transformation.
source_id (str) – The name, or unique identifier, of the view.
cache (bool) – Whether or not to cache the view.
- delete_view(source_id)#
Deletes a view and decaches it if it was cached.
- Parameters:
source_id (str) – The name of the view.
- partition_and_create(source_id, privacy_budget, column, splits)#
Returns new NoPrivacySessions for each partition.
This works exactly like Session.partition_and_create(), but returns NoPrivacySessions instead of Sessions.
Example
This example partitions the session into two sessions, one with A = “0” and one with A = “1”. Due to parallel composition, each of these sessions is given the same budget, while that budget is only deducted once from the original session.
Unlike Session.partition_and_create(), the new sessions are of type NoPrivacySession, and so the result of the count query is exact.
>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=1)
>>> # Partition the Session
>>> new_sessions = sess.partition_and_create(
...     "my_private_data",
...     privacy_budget=PureDPBudget(0.75),
...     column="A",
...     splits={"part0": "0", "part1": "1"}
... )
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=0.25)
>>> # The new sessions are NoPrivacySessions
>>> isinstance(new_sessions["part0"], NoPrivacySession)
True
>>> new_sessions["part0"].private_sources
['part0']
>>> new_sessions["part0"].get_column_types("part0")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> new_sessions["part0"].remaining_privacy_budget
PureDPBudget(epsilon=0.75)
>>> new_sessions["part1"].private_sources
['part1']
>>> new_sessions["part1"].get_column_types("part1")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> new_sessions["part1"].remaining_privacy_budget
PureDPBudget(epsilon=0.75)
When you are done with a new session, you can use the NoPrivacySession.stop() method to allow the next one to become active:
>>> new_sessions["part0"].stop()
>>> new_sessions["part1"].private_sources
['part1']
>>> count_query = QueryBuilder("part1").count()
>>> count_answer = new_sessions["part1"].evaluate(
...     count_query,
...     PureDPBudget(0.75),
... )
>>> # The result is exact, because new_sessions["part1"] is a NoPrivacySession
>>> count_answer.toPandas()
   count
0      2
- Parameters:
source_id (str) – The private source to partition.
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) – Privacy budget to pass to each new session.
column (str) – The name of the column to partition on.
splits (Union[Dict[str, str], Dict[str, int]]) – Mapping of split name to value of partition. The split name is the source_id in the new session.
- Return type:
Dict[str, NoPrivacySession]
- describe(obj=None)#
Describes this session, or one of its tables, or the result of a query.
If obj is not specified, session.describe() will describe the Session and all of the tables it contains.
If obj is a QueryBuilder or Query, session.describe(obj) will describe the table that would result from that query if it were applied to the Session.
If obj is a string, session.describe(obj) will describe the table with that name. This is a shorthand for session.describe(QueryBuilder(obj)).
Examples
>>> # describe a session, "sess"
>>> sess.describe()
The session has a remaining privacy budget of PureDPBudget(epsilon=1).
The following private tables are available:
Table 'my_private_data' (no constraints):
Column Name    Column Type    Nullable
-------------  -------------  ----------
A              VARCHAR        True
B              INTEGER        True
X              INTEGER        True
>>> # describe a query object
>>> query = QueryBuilder("my_private_data").drop_null_and_nan(["B", "X"])
>>> sess.describe(query)
Column Name    Column Type    Nullable
-------------  -------------  ----------
A              VARCHAR        True
B              INTEGER        False
X              INTEGER        False
>>> # describe a table by name
>>> sess.describe("my_private_data")
Column Name    Column Type    Nullable
-------------  -------------  ----------
A              VARCHAR        True
B              INTEGER        True
X              INTEGER        True
- Parameters:
obj (Optional[Union[tmlt.analytics.query_builder.QueryBuilder, tmlt.analytics.query_builder.GroupedQueryBuilder, tmlt.analytics.query_builder.Query, str]]) – The table or query to be described, or None to describe the whole Session.
- Return type:
None
- get_schema(source_id)#
Returns the schema for any data source.
This includes information on whether the columns are nullable.
- Parameters:
source_id (str) – The ID for the data source whose schema is being retrieved.
- Return type:
- get_column_types(source_id)#
Returns the column types for any data source.
This does not include information on whether the columns are nullable.
- Parameters:
source_id (str)
- Return type:
- get_grouping_column(source_id)#
Returns an optional column that must be grouped by in this query.
When a groupby aggregation is appended to any query on this table, it must include this column as a groupby column.
- get_id_column(source_id)#
Returns the ID column of a table, if it has one.
- get_id_space(source_id)#
Returns the ID space of a table, if it has one.
- add_public_dataframe(source_id, dataframe)#
Adds a public data source to the session.
Not all Spark column types are supported in public sources; see SUPPORTED_SPARK_TYPES for information about which types are supported.
Example
>>> public_spark_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> # Add public data
>>> sess.add_public_dataframe(
...     source_id="my_public_data", dataframe=public_spark_data
... )
>>> sess.public_sources
['my_public_data']
>>> sess.get_column_types("my_public_data")
{'A': ColumnType.VARCHAR, 'C': ColumnType.INTEGER}
- Parameters:
source_id (str) – The name of the public data source.
dataframe (pyspark.sql.DataFrame) – The public data source corresponding to the source_id.
- stop()#
Closes out this session, allowing other sessions to become active.
- Return type:
None