no_privacy_session#
Interactive query evaluation without any privacy guarantees.

NoPrivacySession can be used to evaluate queries without differential privacy while using the same syntax as the Session. It is primarily meant to be used by SessionProgramTuner to compute baselines and error metrics for tuning SessionPrograms; in most cases, using the SessionProgramTuner (instead of using a NoPrivacySession directly) is a better choice.
Classes#
- class NoPrivacySession(accountant, public_sources, session_data, options)#
Bases: tmlt.analytics.session.Session

A Session-like interface for evaluating queries without privacy guarantees.

Note
This is only available on a paid version of Tumult Analytics. If you would like to hear more, please contact us at info@tmlt.io.
A NoPrivacySession can be used as a plug-and-play replacement for a Session to evaluate queries without some or all of the differential privacy features. This is useful for generating baseline outputs for DP programs, which are used for error measurement and tuning. See tuner for more information.

NoPrivacySession has the exact same API as Session, except it has additional options to configure how queries are evaluated. See NoPrivacySession.Options for more information about the available options. All valid Tumult Analytics programs can be converted to use non-private query evaluation by replacing the Session with a NoPrivacySession; the only thing that changes is how the queries are evaluated.

As its name suggests, a NoPrivacySession does not provide any privacy guarantees. Only use it during tuning, not deployment, and do not share or publish its outputs. Additionally, there should be little reason to use NoPrivacySession directly; instead, use the tools in tuner, which rely on NoPrivacySession internally and allow for the same level of configurability.

Warning
NoPrivacySessions should not be constructed directly. Instead, they should be created using NoPrivacySession.from_dataframe() or with a NoPrivacySession.Builder.
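For a concrete sense of the plug-and-play swap, here is a minimal sketch (imports are omitted, as in the other examples on this page, and spark_data stands in for any private Spark DataFrame). The only change from a private program is constructing a NoPrivacySession instead of a Session:

>>> # Identical to a Session-based program, except for the class name.
>>> sess = NoPrivacySession.from_dataframe(  # was: Session.from_dataframe(
...     privacy_budget=PureDPBudget(1),
...     source_id="my_private_data",
...     dataframe=spark_data,
...     protected_change=AddOneRow(),
... )
>>> # Queries are written and evaluated exactly as with a Session,
>>> # but the answers are computed without noise.
>>> answer = sess.evaluate(
...     QueryBuilder("my_private_data").count(),
...     PureDPBudget(0.5),
... )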
- Builder: Builder for NoPrivacySession.
- Options: Configuration for how a NoPrivacySession evaluates queries.
- options: Returns the query evaluation options for this NoPrivacySession.
- remaining_privacy_budget: Returns the remaining privacy budget.
- private_sources: Returns the IDs of the private sources.
- public_sources: Returns the IDs of the public sources.
- public_source_dataframes: Returns a dictionary of public source DataFrames.
- from_dataframe(): Initializes a NoPrivacySession from a Spark DataFrame.
- evaluate(): Answers a query without any privacy guarantees, returning a Spark DataFrame.
- create_view(): Creates a new view from a transformation and possibly caches it.
- delete_view(): Deletes a view and decaches it if it was cached.
- partition_and_create(): Returns new NoPrivacySessions for each partition.
- describe(): Describes this session, or one of its tables, or the result of a query.
- get_schema(): Returns the schema for any data source.
- get_column_types(): Returns the column types for any data source.
- get_grouping_column(): Returns an optional column that must be grouped by in this query.
- get_id_column(): Returns the ID column of a table, if it has one.
- get_id_space(): Returns the ID space of a table, if it has one.
- add_public_dataframe(): Adds a public data source to the session.
- stop(): Closes out this session, allowing other sessions to become active.
- Parameters:
accountant (tmlt.core.measurements.interactive_measurements.PrivacyAccountant)
public_sources (Dict[str, pyspark.sql.DataFrame])
session_data (Dict[tmlt.analytics._table_identifier.Identifier, Any])
options (Options)
- class Builder#
Bases: tmlt.analytics.session.Session.Builder

Builder for NoPrivacySession.

- with_privacy_budget(privacy_budget)#
Sets the privacy budget applied to this NoPrivacySession.
Privacy budget accounting works in the same way as in Session: it gets “spent” in the same way, and query evaluation stops when none is left, even though the NoPrivacySession does not provide any privacy guarantee. This is to ensure compatibility with programs that rely on privacy budget accounting behavior.
- Parameters:
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget)
- get_class_type()#
Returns the NoPrivacySession type.
- build()#
Builds a NoPrivacySession with the specified configuration.
- Return type:
NoPrivacySession
- with_private_dataframe(source_id, dataframe, protected_change)#
Adds a Spark DataFrame as a private source.
Not all Spark column types are supported in private sources; see tmlt.analytics.session.SUPPORTED_SPARK_TYPES for information about which types are supported.
- Parameters:
source_id (str) – Source id for the private source dataframe.
dataframe (pyspark.sql.DataFrame) – Private source dataframe to perform queries on, corresponding to the source_id.
protected_change (tmlt.analytics.protected_change.ProtectedChange) – A ProtectedChange specifying what changes to the input data should be protected.
- with_public_dataframe(source_id, dataframe)#
Adds a public dataframe.
- Parameters:
source_id (str)
dataframe (pyspark.sql.DataFrame)
- with_id_space(id_space)#
Adds an identifier space.
This defines a space of identifiers that map 1-to-1 to the identifiers being protected by a table with the AddRowsWithID protected change. Any table with such a protected change must be a member of some identifier space.
- Parameters:
id_space (str)
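Putting the builder methods together, here is a minimal sketch of constructing a NoPrivacySession with two private tables that share an identifier space, plus one public table. The table names, DataFrames, and the keyword arguments passed to AddRowsWithID are illustrative assumptions rather than part of this reference; imports are omitted as in the other examples:

>>> # Hypothetical sketch: two private sources protected at the ID level,
>>> # one public source, and a nominal budget for accounting purposes.
>>> sess = (
...     NoPrivacySession.Builder()
...     .with_privacy_budget(PureDPBudget(2))
...     .with_id_space("customer_ids")
...     .with_private_dataframe(
...         "purchases",
...         purchases_df,
...         protected_change=AddRowsWithID(id_column="customer_id", id_space="customer_ids"),
...     )
...     .with_private_dataframe(
...         "visits",
...         visits_df,
...         protected_change=AddRowsWithID(id_column="customer_id", id_space="customer_ids"),
...     )
...     .with_public_dataframe("stores", stores_df)
...     .build()
... )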
- class Options(enforce_keysets=False, enforce_clamping_bounds=False, enforce_constraints=False, enforce_flat_map_truncation=False, enforce_private_join_truncation=False, enforce_suppression=False)#
Configuration for how a NoPrivacySession evaluates queries.

All enforcement properties default to False when a NoPrivacySession is created. This means that, by default, a NoPrivacySession ignores keysets, clamping bounds, constraints, and truncations when evaluating queries.

These options for a NoPrivacySession can be modified at any point after a NoPrivacySession is created. In particular, this allows you to evaluate different queries with different options using the same NoPrivacySession.

Example
>>> spark_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> # Set up session
>>> sess = NoPrivacySession.from_dataframe(
...     privacy_budget=PureDPBudget(3),
...     source_id="my_private_data",
...     dataframe=spark_data,
...     protected_change=AddOneRow(),
... )
>>> # By default enforce_keysets is False
>>> sess.options.enforce_keysets
False
>>> keyset = KeySet.from_dict({"A": ["0", "2"]})
>>> sess.evaluate(
...     QueryBuilder("my_private_data").groupby(keyset).count(),
...     PureDPBudget(1),
... ).toPandas()
   A  count
0  0      1
1  1      2
>>> sess.options.enforce_keysets = True
>>> # Subsequent queries will use the provided keyset to answer GroupBy
>>> # queries
>>> sess.evaluate(
...     QueryBuilder("my_private_data").groupby(keyset).count(),
...     PureDPBudget(1),
... ).toPandas()
   A  count
0  0      1
1  2      0
>>> # The `enforce_keysets` option can be set to False once again
>>> # All subsequent queries will ignore the keyset (unless the option
>>> # is set to `True` again)
>>> sess.options.enforce_keysets = False
>>> sess.evaluate(
...     QueryBuilder("my_private_data").groupby(keyset).count(),
...     PureDPBudget(1),
... ).toPandas()
   A  count
0  0      1
1  1      2
- Parameters:
- property enforce_keysets: bool#
Whether KeySets will be used to answer GroupBy queries.
When this option is False, the keys parameter isn't used, and a differentially private mechanism won't be used to generate a keyset if keys aren't provided.
Defaults to False.
- Return type:
bool
- property enforce_clamping_bounds: bool#
Whether clamping bounds will be enforced when answering queries.
This option affects all aggregations with low and high parameters, e.g. sum() and quantile().
Defaults to False.
- Return type:
bool
- property enforce_constraints: bool#
Whether Constraints will be enforced.
When this option is False, constraints are not applied to the private data. Note that enforce() still must be called with the required constraints to avoid the same errors as in a private session.
Defaults to False.
- Return type:
bool
- property enforce_flat_map_truncation: bool#
Whether output of flat maps will be truncated.
When this option is False, flat maps don't use the max_rows parameter.
Defaults to False.
- Return type:
bool
- property enforce_private_join_truncation: bool#
Whether truncation will be used in private joins.
When this option is False, private joins don't use the truncation_strategy_left and truncation_strategy_right parameters.
Defaults to False.
- Return type:
bool
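As a further illustration of per-query configuration, here is a hedged sketch of toggling enforce_clamping_bounds between two evaluations. It assumes a NoPrivacySession named sess over the my_private_data table from the examples above, with enough budget remaining; the comments describe the documented behavior rather than specific outputs:

>>> # With enforce_clamping_bounds False (the default), the low/high bounds
>>> # passed to sum() are ignored, so the raw column sum is returned.
>>> sum_query = QueryBuilder("my_private_data").sum(column="B", low=0, high=1)
>>> sess.options.enforce_clamping_bounds = False
>>> unclamped = sess.evaluate(sum_query, PureDPBudget(0.5))
>>> # With the option enabled, values are clamped to [0, 1] before summing,
>>> # matching what a private Session would compute, minus the noise.
>>> sess.options.enforce_clamping_bounds = True
>>> clamped = sess.evaluate(sum_query, PureDPBudget(0.5))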
- property options: Options#
Returns the query evaluation options for this NoPrivacySession.
- Return type:
Options
- property remaining_privacy_budget: tmlt.analytics.privacy_budget.PrivacyBudget#
Returns the remaining privacy budget.
Privacy budget accounting works in the same way as in Session: it gets “spent” in the same way, and query evaluation stops when none is left, even though the NoPrivacySession does not provide any privacy guarantee.
- Return type:
tmlt.analytics.privacy_budget.PrivacyBudget
- property public_source_dataframes: Dict[str, pyspark.sql.DataFrame]#
Returns a dictionary of public source DataFrames.
- Return type:
Dict[str, pyspark.sql.DataFrame]
- classmethod from_dataframe(privacy_budget, source_id, dataframe, protected_change)#
Initializes a NoPrivacySession from a Spark DataFrame.
Only one data source is supported with this initialization method; if you need multiple data sources, use NoPrivacySession.Builder.
Not all Spark column types are supported in sources; see SUPPORTED_SPARK_TYPES for information about which types are supported.
Example
>>> spark_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> # Declare budget for the session.
>>> session_budget = PureDPBudget(1)
>>> # Set up Session
>>> sess = NoPrivacySession.from_dataframe(
...     privacy_budget=session_budget,
...     source_id="my_private_data",
...     dataframe=spark_data,
...     protected_change=AddOneRow(),
... )
>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
- Parameters:
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) – The privacy budget for the session. If a non-infinite budget is provided, it will be replaced with an infinite budget of the same type.
source_id (str) – The source ID for the source DataFrame.
dataframe (pyspark.sql.DataFrame) – The source DataFrame to perform queries on, corresponding to the source_id.
protected_change (tmlt.analytics.protected_change.ProtectedChange) – A ProtectedChange specifying what changes to the input data the resulting NoPrivacySession should protect.
- Return type:
NoPrivacySession
- evaluate(query_expr, privacy_budget)#
Answers a query without any privacy guarantees, returning a Spark DataFrame.
Note that query evaluation behavior depends on options (see NoPrivacySession.Options for more information).
The type of privacy_budget must match the type your NoPrivacySession was initialized with (i.e., you cannot evaluate a query using a RhoZCDPBudget if this NoPrivacySession was initialized with a PureDPBudget, and vice versa). And, even though it does not provide any privacy guarantees, a NoPrivacySession keeps track of its privacy budget just like a Session. In particular, this method “spends” the specified privacy_budget.
Example
>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=1)
>>> # Evaluate Queries
>>> filter_query = QueryBuilder("my_private_data").filter("A > 0")
>>> count_query = filter_query.groupby(KeySet.from_dict({"X": [0, 1]})).count()
>>> count_answer = sess.evaluate(
...     query_expr=count_query,
...     privacy_budget=PureDPBudget(0.5),
... )
>>> sum_query = filter_query.sum(column="B", low=0, high=1)
>>> sum_answer = sess.evaluate(
...     query_expr=sum_query,
...     privacy_budget=PureDPBudget(0.5),
... )
>>> count_answer  # TODO(#798): Seed randomness and change to toPandas()
DataFrame[X: bigint, count: bigint]
>>> sum_answer  # TODO(#798): Seed randomness and change to toPandas()
DataFrame[B_sum: bigint]
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=0)
- Parameters:
query_expr (tmlt.analytics.query_builder.Query) – One query expression to answer.
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) – The privacy budget used for the query.
- Return type:
Any
- create_view(query_expr, source_id, cache)#
Creates a new view from a transformation and possibly caches it.
Example
>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> public_spark_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> sess.add_public_dataframe("my_public_data", public_spark_data)
>>> # Create a view
>>> join_query = (
...     QueryBuilder("my_private_data")
...     .join_public("my_public_data")
...     .select(["A", "B", "C"])
... )
>>> sess.create_view(
...     join_query,
...     source_id="private_public_join",
...     cache=True
... )
>>> sess.private_sources
['private_public_join', 'my_private_data']
>>> sess.get_column_types("private_public_join")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'C': ColumnType.INTEGER}
>>> # Delete the view
>>> sess.delete_view("private_public_join")
>>> sess.private_sources
['my_private_data']
- Parameters:
query_expr (tmlt.analytics.query_builder.QueryBuilder) – A query that performs a transformation.
source_id (str) – The name, or unique identifier, of the view.
cache (bool) – Whether or not to cache the view.
- delete_view(source_id)#
Deletes a view and decaches it if it was cached.
- Parameters:
source_id (str) – The name of the view.
- partition_and_create(source_id, privacy_budget, column, splits)#
Returns new NoPrivacySessions for each partition.
This works exactly like Session.partition_and_create(), but returns NoPrivacySessions instead of Sessions.
Example
This example partitions the session into two sessions, one with A = “0” and one with A = “1”. Due to parallel composition, each of these sessions is given the same budget, while that budget is only deducted once from the original session.
Unlike Session.partition_and_create(), the new sessions are of type NoPrivacySession, and so the result of the count query is exact.
>>> sess.private_sources
['my_private_data']
>>> sess.get_column_types("my_private_data")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=1)
>>> # Partition the Session
>>> new_sessions = sess.partition_and_create(
...     "my_private_data",
...     privacy_budget=PureDPBudget(0.75),
...     column="A",
...     splits={"part0": "0", "part1": "1"}
... )
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=0.25)
>>> # The new sessions are NoPrivacySessions
>>> isinstance(new_sessions["part0"], NoPrivacySession)
True
>>> new_sessions["part0"].private_sources
['part0']
>>> new_sessions["part0"].get_column_types("part0")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> new_sessions["part0"].remaining_privacy_budget
PureDPBudget(epsilon=0.75)
>>> new_sessions["part1"].private_sources
['part1']
>>> new_sessions["part1"].get_column_types("part1")
{'A': ColumnType.VARCHAR, 'B': ColumnType.INTEGER, 'X': ColumnType.INTEGER}
>>> new_sessions["part1"].remaining_privacy_budget
PureDPBudget(epsilon=0.75)
When you are done with a new session, you can use the NoPrivacySession.stop() method to allow the next one to become active:
>>> new_sessions["part0"].stop()
>>> new_sessions["part1"].private_sources
['part1']
>>> count_query = QueryBuilder("part1").count()
>>> count_answer = new_sessions["part1"].evaluate(
...     count_query,
...     PureDPBudget(0.75),
... )
>>> # The result is exact, because new_sessions["part1"] is a NoPrivacySession
>>> count_answer.toPandas()
   count
0      2
- Parameters:
source_id (str) – The private source to partition.
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) – Privacy budget to pass to each new session.
column (str) – The name of the column to partition on.
splits (Union[Dict[str, str], Dict[str, int]]) – Mapping of split name to value of partition. The split name is the source_id in the new session.
- Return type:
Dict[str, NoPrivacySession]
- describe(obj=None)#
Describes this session, or one of its tables, or the result of a query.
If obj is not specified, session.describe() will describe the Session and all of the tables it contains.
If obj is a QueryBuilder or Query, session.describe(obj) will describe the table that would result from that query if it were applied to the Session.
If obj is a string, session.describe(obj) will describe the table with that name. This is a shorthand for session.describe(QueryBuilder(obj)).
Examples
>>> # describe a session, "sess"
>>> sess.describe()
The session has a remaining privacy budget of PureDPBudget(epsilon=1).
The following private tables are available:
Table 'my_private_data' (no constraints):
Column Name    Column Type    Nullable
-------------  -------------  ----------
A              VARCHAR        True
B              INTEGER        True
X              INTEGER        True
>>> # describe a query object
>>> query = QueryBuilder("my_private_data").drop_null_and_nan(["B", "X"])
>>> sess.describe(query)
Column Name    Column Type    Nullable
-------------  -------------  ----------
A              VARCHAR        True
B              INTEGER        False
X              INTEGER        False
>>> # describe a table by name
>>> sess.describe("my_private_data")
Column Name    Column Type    Nullable
-------------  -------------  ----------
A              VARCHAR        True
B              INTEGER        True
X              INTEGER        True
- Parameters:
obj (Optional[Union[tmlt.analytics.query_builder.QueryBuilder, tmlt.analytics.query_builder.GroupedQueryBuilder, tmlt.analytics.query_builder.Query, str]]) – The table or query to be described, or None to describe the whole Session.
- Return type:
None
- get_schema(source_id)#
Returns the schema for any data source.
This includes information on whether the columns are nullable.
- Parameters:
source_id (str) – The ID for the data source whose schema is being retrieved.
- Return type:
- get_column_types(source_id)#
Returns the column types for any data source.
This does not include information on whether the columns are nullable.
- Parameters:
source_id (str)
- Return type:
- get_grouping_column(source_id)#
Returns an optional column that must be grouped by in this query.
When a groupby aggregation is appended to any query on this table, it must include this column as a groupby column.
- get_id_column(source_id)#
Returns the ID column of a table, if it has one.
- get_id_space(source_id)#
Returns the ID space of a table, if it has one.
- add_public_dataframe(source_id, dataframe)#
Adds a public data source to the session.
Not all Spark column types are supported in public sources; see SUPPORTED_SPARK_TYPES for information about which types are supported.
Example
>>> public_spark_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> # Add public data
>>> sess.add_public_dataframe(
...     source_id="my_public_data", dataframe=public_spark_data
... )
>>> sess.public_sources
['my_public_data']
>>> sess.get_column_types("my_public_data")
{'A': ColumnType.VARCHAR, 'C': ColumnType.INTEGER}
- Parameters:
source_id (str) – The name of the public data source.
dataframe (pyspark.sql.DataFrame) – The public data source corresponding to the source_id.
- stop()#
Closes out this session, allowing other sessions to become active.
- Return type:
None