session#
Interactive query evaluation using a differential privacy framework.
Session provides an interface for managing data sources and performing differentially private queries on them. A simple session with a single private data source can be created using Session.from_dataframe(), or a more complex one with multiple data sources can be constructed using Session.Builder. Queries can then be evaluated on the data using Session.evaluate().
A Session is initialized with a PrivacyBudget, and ensures that queries evaluated on the private data do not consume more than this budget. By default, a Session enforces this privacy guarantee at the row level: the queries prevent an attacker from learning whether an individual row has been added or removed in each of the private tables, provided that the private data is not used elsewhere in the computation of the queries.
More details on the exact privacy promise provided by Session can be found in the Privacy promise topic guide.
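As a quick orientation, here is a minimal sketch of that workflow. It assumes a running Spark session; the table name, columns, and data are illustrative only:
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.analytics.privacy_budget import PureDPBudget
>>> from tmlt.analytics.protected_change import AddOneRow
>>> from tmlt.analytics.query_builder import QueryBuilder
>>> from tmlt.analytics.session import Session
>>> spark = SparkSession.builder.getOrCreate()
>>> # Hypothetical private data: one row per person.
>>> members = spark.createDataFrame(
...     pd.DataFrame({"age": [34, 41, 27], "zip": ["02139", "02139", "02142"]})
... )
>>> # The Session gets a total budget of epsilon=1 and protects the
>>> # addition or removal of any single row of the private table.
>>> sess = Session.from_dataframe(
...     privacy_budget=PureDPBudget(1),
...     source_id="members",
...     dataframe=members,
...     protected_change=AddOneRow(),
... )
>>> # Spend half of the budget on a noisy count of all rows.
>>> noisy_count = sess.evaluate(QueryBuilder("members").count(), PureDPBudget(0.5))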
Data#
- SUPPORTED_SPARK_TYPES#
Set of Spark data types supported by Tumult Analytics.
Support for Spark data types in Analytics is currently as follows:
Type                 Supported
LongType             yes
IntegerType          yes, by coercion to LongType
DoubleType           yes
FloatType            yes, by coercion to DoubleType
StringType           yes
DateType             yes
TimestampType        yes
Other Spark types    no
Columns with unsupported types must be dropped or converted to supported ones before loading the data into Analytics.
- TYPE_COERCION_MAP : Dict[pyspark.sql.types.DataType, pyspark.sql.types.DataType]#
Mapping describing how Spark’s data types are coerced by Tumult Analytics.
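Columns whose types are not listed above must be converted or dropped before loading. As a minimal hypothetical sketch using ordinary PySpark, assuming raw_data is a Spark DataFrame with an unsupported DecimalType column price and an unsupported MapType column metadata:
>>> from pyspark.sql.functions import col
>>> prepared = (
...     raw_data
...     .withColumn("price", col("price").cast("double"))  # DecimalType -> DoubleType
...     .drop("metadata")                                   # drop the unsupported column
... )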
Classes#
Session – Allows differentially private query evaluation on sensitive data.
- class Session(accountant, public_sources)#
Allows differentially private query evaluation on sensitive data.
Sessions should not be directly constructed. Instead, they should be created using from_dataframe() or with a Builder.
Classes#
Builder – Builder for Session.
Methods#
from_dataframe() – Initializes a DP session from a Spark dataframe.
private_sources – Returns the IDs of the private sources.
public_sources – Returns the IDs of the public sources.
public_source_dataframes – Returns a dictionary of public source DataFrames.
remaining_privacy_budget – Returns the remaining privacy budget left in the session.
describe() – Describes this session, or one of its tables, or the result of a query.
get_schema() – Returns the schema for any data source.
get_column_types() – Returns the column types for any data source.
get_grouping_column() – Returns an optional column that must be grouped by in this query.
get_id_column() – Returns the ID column of a table, if it has one.
get_id_space() – Returns the ID space of a table, if it has one.
add_public_dataframe() – Adds a public data source to the session.
evaluate() – Answers a query within the given privacy budget and returns a Spark dataframe.
create_view() – Creates a new view from a transformation and possibly caches it.
delete_view() – Deletes a view and decaches it if it was cached.
partition_and_create() – Returns new sessions from a partition, mapped to split name/source_id.
stop() – Closes out this session, allowing other sessions to become active.
- Parameters
accountant (tmlt.core.measurements.interactive_measurements.PrivacyAccountant) –
public_sources (Dict[str, pyspark.sql.DataFrame]) –
- class Builder#
Builder for Session.
- with_private_dataframe(source_id, dataframe, protected_change)#
Adds a Spark DataFrame as a private source.
Not all Spark column types are supported in private sources; see tmlt.analytics.session.SUPPORTED_SPARK_TYPES for information about which types are supported.
- Parameters
source_id (str) – Source id for the private source dataframe.
dataframe (pyspark.sql.DataFrame) – Private source dataframe to perform queries on, corresponding to the source_id.
protected_change (tmlt.analytics.protected_change.ProtectedChange) – A ProtectedChange specifying what changes to the input data should be protected.
- with_public_dataframe(source_id, dataframe)#
Adds a public dataframe.
- Parameters
source_id (str) –
dataframe (pyspark.sql.DataFrame) –
- with_id_space(id_space)#
Adds an identifier space.
This defines a space of identifiers that map 1-to-1 to the identifiers being protected by a table with the AddRowsWithID protected change. Any table with such a protected change must be a member of some identifier space; the sketch after these builder methods shows how this fits together.
- Parameters
id_space (str) –
- with_privacy_budget(privacy_budget)#
Set the privacy budget for the object being built.
- Parameters
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) –
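As a rough sketch of how these builder methods combine. The table names, columns, and data are hypothetical; it is also assumed that the builder methods chain, that the builder is finalized with a build() call, and that AddRowsWithID accepts the protected identifier column and its identifier space as shown:
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.analytics.privacy_budget import PureDPBudget
>>> from tmlt.analytics.protected_change import AddRowsWithID
>>> from tmlt.analytics.session import Session
>>> spark = SparkSession.builder.getOrCreate()
>>> # Hypothetical private table: one row per store visit, keyed by user ID.
>>> visits = spark.createDataFrame(
...     pd.DataFrame({"user_id": ["u1", "u1", "u2"], "store": ["a", "b", "a"]})
... )
>>> # Hypothetical public table describing the stores.
>>> stores = spark.createDataFrame(
...     pd.DataFrame({"store": ["a", "b"], "city": ["Boston", "Cambridge"]})
... )
>>> sess = (
...     Session.Builder()
...     .with_privacy_budget(PureDPBudget(2))
...     .with_id_space("users")  # identifier space for the protected user IDs
...     .with_private_dataframe(
...         source_id="visits",
...         dataframe=visits,
...         protected_change=AddRowsWithID(id_column="user_id", id_space="users"),
...     )
...     .with_public_dataframe(source_id="stores", dataframe=stores)
...     .build()
... )
>>> sess.private_sources
['visits']
>>> sess.public_sources
['stores']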
- __init__(accountant, public_sources)#
Initializes a DP session from a queryable.
Warning
This constructor is not intended to be used directly. Use Session.Builder or the from_ constructors instead.
- classmethod from_dataframe(privacy_budget, source_id, dataframe, protected_change)#
Initializes a DP session from a Spark dataframe.
Only one private data source is supported with this initialization method; if you need multiple data sources, use Builder.
Not all Spark column types are supported in private sources; see SUPPORTED_SPARK_TYPES for information about which types are supported.
Example
>>> spark_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> # Declare budget for the session.
>>> session_budget = PureDPBudget(1)
>>> # Set up Session
>>> sess = Session.from_dataframe(
...     privacy_budget=session_budget,
...     source_id="my_private_data",
...     dataframe=spark_data,
...     protected_change=AddOneRow(),
... )
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
- Parameters
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) – The total privacy budget allocated to this session.
source_id (str) – The source id for the private source dataframe.
dataframe (pyspark.sql.DataFrame) – The private source dataframe to perform queries on, corresponding to the source_id.
protected_change (tmlt.analytics.protected_change.ProtectedChange) – A
ProtectedChange
specifying what changes to the input data the resultingSession
should protect.
- Return type
Session
- property public_source_dataframes#
Returns a dictionary of public source DataFrames.
- Return type
Dict[str, pyspark.sql.DataFrame]
- property remaining_privacy_budget#
Returns the remaining privacy budget left in the session.
The type of the budget (e.g., PureDP or rho-zCDP) will be the same as the type of the budget the Session was initialized with.
- Return type
tmlt.analytics.privacy_budget.PrivacyBudget
- describe(obj=None)#
Describes this session, or one of its tables, or the result of a query.
If obj is not specified, session.describe() will describe the Session and all of the tables it contains.
If obj is a QueryBuilder or QueryExpr, session.describe(obj) will describe the table that would result from that query if it were applied to the Session.
If obj is a string, session.describe(obj) will describe the table with that name. This is a shorthand for session.describe(QueryBuilder(obj)).
Examples
>>> # describe a session, "sess"
>>> sess.describe()
The session has a remaining privacy budget of PureDPBudget(epsilon=1).
The following private tables are available:
Table 'my_private_data' (no constraints):
Columns:
- 'A'  VARCHAR
- 'B'  INTEGER
- 'X'  INTEGER
>>> # describe a query object
>>> query = QueryBuilder("my_private_data").drop_null_and_nan(["B", "X"])
>>> sess.describe(query)
Columns:
- 'A'  VARCHAR
- 'B'  INTEGER, not null
- 'X'  INTEGER, not null
>>> # describe a table by name
>>> sess.describe("my_private_data")
Columns:
- 'A'  VARCHAR
- 'B'  INTEGER
- 'X'  INTEGER
- Parameters
obj (Optional[Union[tmlt.analytics.query_expr.QueryExpr, tmlt.analytics.query_builder.QueryBuilder, tmlt.analytics.query_builder.GroupedQueryBuilder, tmlt.analytics.query_builder.AggregatedQueryBuilder, str]]) – The table or query to be described, or None to describe the whole Session.
- Return type
None
- get_schema(source_id)#
Returns the schema for any data source.
This includes information on whether the columns are nullable.
- Parameters
source_id (str) – The ID for the data source whose column types are being retrieved.
- Return type
- get_column_types(source_id)#
Returns the column types for any data source.
This does not include information on whether the columns are nullable.
- Parameters
source_id (str) –
- Return type
- get_grouping_column(source_id)#
Returns an optional column that must be grouped by in this query.
When a groupby aggregation is appended to any query on this table, it must include this column as a groupby column.
- get_id_column(source_id)#
Returns the ID column of a table, if it has one.
- get_id_space(source_id)#
Returns the ID space of a table, if it has one.
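For instance, continuing the hypothetical Builder sketch shown earlier, where the 'visits' table was registered with AddRowsWithID(id_column="user_id", id_space="users"), one would expect:
>>> sess.get_id_column("visits")
'user_id'
>>> sess.get_id_space("visits")
'users'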
- add_public_dataframe(source_id, dataframe)#
Adds a public data source to the session.
Not all Spark column types are supported in public sources; see SUPPORTED_SPARK_TYPES for information about which types are supported.
Example
>>> public_spark_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> # Add public data
>>> sess.add_public_dataframe(
...     source_id="my_public_data", dataframe=public_spark_data
... )
>>> sess.public_sources
['my_public_data']
>>> sess.get_schema('my_public_data').column_types
{'A': 'VARCHAR', 'C': 'INTEGER'}
- Parameters
source_id (str) – The name of the public data source.
dataframe (pyspark.sql.DataFrame) – The public data source corresponding to the source_id.
- evaluate(query_expr, privacy_budget)#
Answers a query within the given privacy budget and returns a Spark dataframe.
The type of privacy budget that you use must match the type your Session was initialized with (i.e., you cannot evaluate a query using RhoZCDPBudget if the Session was initialized with a PureDPBudget, and vice versa).
Example
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=1)
>>> # Evaluate Queries
>>> filter_query = QueryBuilder("my_private_data").filter("A > 0")
>>> count_query = filter_query.groupby(KeySet.from_dict({"X": [0, 1]})).count()
>>> count_answer = sess.evaluate(
...     query_expr=count_query,
...     privacy_budget=PureDPBudget(0.5),
... )
>>> sum_query = filter_query.sum(column="B", low=0, high=1)
>>> sum_answer = sess.evaluate(
...     query_expr=sum_query,
...     privacy_budget=PureDPBudget(0.5),
... )
>>> count_answer  # TODO(#798): Seed randomness and change to toPandas()
DataFrame[X: bigint, count: bigint]
>>> sum_answer  # TODO(#798): Seed randomness and change to toPandas()
DataFrame[B_sum: bigint]
- Parameters
query_expr (Union[tmlt.analytics.query_expr.QueryExpr, tmlt.analytics.query_builder.AggregatedQueryBuilder]) – One query expression to answer.
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) – The privacy budget used for the query.
- Return type
Any
- create_view(query_expr, source_id, cache)#
Creates a new view from a transformation and possibly caches it.
Example
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> public_spark_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> sess.add_public_dataframe("my_public_data", public_spark_data)
>>> # Create a view
>>> join_query = (
...     QueryBuilder("my_private_data")
...     .join_public("my_public_data")
...     .select(["A", "B", "C"])
... )
>>> sess.create_view(
...     join_query,
...     source_id="private_public_join",
...     cache=True
... )
>>> sess.private_sources
['private_public_join', 'my_private_data']
>>> sess.get_schema("private_public_join").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'C': 'INTEGER'}
>>> # Delete the view
>>> sess.delete_view("private_public_join")
>>> sess.private_sources
['my_private_data']
- Parameters
query_expr (Union[tmlt.analytics.query_expr.QueryExpr, tmlt.analytics.query_builder.QueryBuilder]) – A query that performs a transformation.
source_id (str) – The name, or unique identifier, of the view.
cache (bool) – Whether or not to cache the view.
- delete_view(source_id)#
Deletes a view and decaches it if it was cached.
- Parameters
source_id (str) – The name of the view.
- partition_and_create(source_id, privacy_budget, column, splits)#
Returns new sessions from a partition, mapped to split name/source_id.
The type of privacy budget that you use must match the type your Session was initialized with (i.e., you cannot use a RhoZCDPBudget to partition your Session if the Session was created using a PureDPBudget, and vice versa).
The sessions returned must be used in the order that they were created. Using this session again or calling stop() will stop all partition sessions.
Example
This example partitions the session into two sessions, one with A = "0" and one with A = "1". Due to parallel composition, each of these sessions is given the same budget, while only one instance of that budget is deducted from the original session.
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=1)
>>> # Partition the Session
>>> new_sessions = sess.partition_and_create(
...     "my_private_data",
...     privacy_budget=PureDPBudget(0.75),
...     column="A",
...     splits={"part0": "0", "part1": "1"}
... )
>>> sess.remaining_privacy_budget
PureDPBudget(epsilon=0.25)
>>> new_sessions["part0"].private_sources
['part0']
>>> new_sessions["part0"].get_schema("part0").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> new_sessions["part0"].remaining_privacy_budget
PureDPBudget(epsilon=0.75)
>>> new_sessions["part1"].private_sources
['part1']
>>> new_sessions["part1"].get_schema("part1").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> new_sessions["part1"].remaining_privacy_budget
PureDPBudget(epsilon=0.75)
When you are done with a new session, you can use the stop() method to allow the next one to become active:
>>> new_sessions["part0"].stop()
>>> new_sessions["part1"].private_sources
['part1']
>>> count_query = QueryBuilder("part1").count()
>>> count_answer = new_sessions["part1"].evaluate(
...     count_query,
...     PureDPBudget(0.75),
... )
>>> count_answer.toPandas()
   count
0    ...
- Parameters
source_id (str) – The private source to partition.
privacy_budget (tmlt.analytics.privacy_budget.PrivacyBudget) – Privacy budget to pass to each new session.
column (str) – The name of the column to partition on.
splits (Union[Dict[str, str], Dict[str, int]]) – Mapping of split name to value of partition. The split name becomes the source_id of the corresponding new session.
- Return type
Dict[str, Session]
- stop()#
Closes out this session, allowing other sessions to become active.
- Return type
None