QueryBuilder.join_private#

from tmlt.analytics import QueryBuilder
QueryBuilder.join_private(right_operand, truncation_strategy_left=None, truncation_strategy_right=None, join_columns=None)#

Join the table with another QueryBuilder.

The current query can also be joined with a named private table (represented as a string).

This operation is an inner, natural join, with the same behavior and requirements as join_public().

Private joins on tables with a ProtectedChange that protects adding or removing rows (e.g. AddMaxRows) require truncation, which is specified using the two TruncationStrategy arguments.

In contrast, operations on tables with a AddRowsWithID ProtectedChange do not require a TruncationStrategy, as no truncation is necessary to perform the join. In this case, the join columns must include the privacy ID columns of both tables, and these privacy ID columns must have the same name, and be in the same ID space.

An example of a private join can be found in the Doing more with privacy IDs tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> from tmlt.analytics import TruncationStrategy
>>> sess.create_view(
...     QueryBuilder("my_private_data")
...     .select(["A", "X"])
...     .rename({"X": "C"}),
...     source_id="my_private_view",
...     cache=False
... )
>>> # A query where only one row with each join key is kept on the left
>>> # table, but two are kept on the right table.
>>> query_drop_excess = (
...     QueryBuilder("my_private_data")
...     .join_private(
...         QueryBuilder("my_private_view"),
...         truncation_strategy_left=TruncationStrategy.DropExcess(1),
...         truncation_strategy_right=TruncationStrategy.DropExcess(2),
...     )
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query_drop_excess,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      3
>>> # A query where all rows that share a join key with another row in
>>> # their table are dropped, in both the left and right tables.
>>> query_drop_non_unique = (
...     QueryBuilder("my_private_data")
...     .join_private(
...         QueryBuilder("my_private_view"),
...         truncation_strategy_left=TruncationStrategy.DropNonUnique(),
...         truncation_strategy_right=TruncationStrategy.DropNonUnique(),
...     )
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query_drop_non_unique,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      1
Parameters:
  • right_operand (Union[QueryBuilder, str]) – QueryBuilder object representing the table to be joined with. When calling query_a.join_private(query_b, ...), we refer to query_a as the left table and query_b as the right table. query_a.join_private("table") is shorthand for query_a.join_private(QueryBuilder("table")).

  • truncation_strategy_left (Optional[Type]) – Strategy for truncation of the left table.

  • truncation_strategy_right (Optional[Type]) – Strategy for truncation of the right table.

  • join_columns (Optional[Sequence[str]]) – The columns to join on. If join_columns is not specified, the tables will be joined on all common columns.

Return type:

QueryBuilder

from tmlt.analytics import TruncationStrategy
class tmlt.analytics.TruncationStrategy#

Bases: object

Strategies for performing truncation in private joins.

These are used to determine the sensitivity of a private join between two tables having AddMaxRows as a protected change. The formula for the sensitivity of the table resulting from a private join is:

\(sensitivity=(T_{left}*S_{right}*M_{left}) + (T_{right}*S_{left}*M_{right})\)

where:

  • \(T_{left}\) and \(T_{right}\) are the truncation thresholds for the left and right truncation strategies, respectively. This value is 1 for DropNonUnique.

  • \(S_{left}\) and \(S_{right}\) are the stability of the left and right truncation strategies, respectively. This value is 2 for DropExcess and 1 for DropNonUnique.

  • \(M_{left}\) and \(M_{right}\) are the max_rows parameters of the AddMaxRows protected changes of the left and right tables, respectively.

class Type#

Bases: ABC

Type of TruncationStrategy variants.

class DropExcess(max_rows)#

Bases: Type

Drop rows with matching join keys above a threshold.

This truncation strategy drops rows such that no more than max_rows rows have the same join key. Which rows are kept is deterministic and does not depend on the order in which they appear in the private data. For example, using the DropExcess(1) strategy while joining on columns A and B in the below table:

A

B

Val

a

b

1

a

c

2

a

b

3

b

a

4

causes it to be treated as one of the below tables:

A

B

Val

a

b

1

a

c

2

b

a

4

A

B

Val

a

b

3

a

c

2

b

a

4

This is generally the preferred truncation strategy, even when the DropNonUnique strategy could also be used, because it results in fewer dropped rows.

max_rows: int#

Maximum number of rows to keep.

__post_init__()#

Check arguments to constructor.

class DropNonUnique#

Bases: Type

Drop all rows with non-unique join keys.

This truncation strategy drops all rows which share join keys with another row in the dataset. It is similar to the DropExcess(1) strategy, but doesn’t keep any of the rows with duplicate join keys. For example, using the DropNonUnique strategy while joining on columns A and B in the below table:

A

B

Val

a

b

1

a

c

2

a

b

3

b

a

4

causes it to be treated as:

A

B

Val

a

c

2

b

a

4

This truncation strategy results in less noise than DropExcess(1). However, it also drops more rows in datasets where many rows have non-unique join keys. In most cases, DropExcess is the preferred strategy.