QueryBuilder.join_public#

from tmlt.analytics import QueryBuilder

QueryBuilder.join_public(public_table, join_columns=None, how='inner')#

Joins the table with a DataFrame or a public source.

This operation is either an inner or left join.

This operation performs a natural join between two tables. This means that the resulting table will contain all columns unique to each input table, along with one copy of each common column. In most cases, the columns have the same names they did in the input tables.

By default, the input tables are joined on all common columns (i.e., columns whose names and data types match). However if join_columns is given, the tables will be joined only on the given columns, and the remaining common columns will be disambiguated in the resulting table by the addition of a _left or _right suffix to their names. If given, join_columns must contain a non-empty subset of the tables’ common columns. For example, two tables with columns A,B,C and A,B,D would by default be joined on columns A and B, resulting in a table with columns A,B,C,D; if join_columns=["B"] were given when performing this join, the resulting table would have columns A_left,A_right,B,C,D. The order of columns in the resulting table is not guaranteed.

Note

Columns must share both names and data types for them to be used in joining. If this condition is not met, one of the data sources must be transformed to be eligible for joining (e.g., by using rename() or map()).

Every row within a join group (i.e., every row that shares values in the join columns) from the private table will be joined with every row from that same group in the public table. For example, if a group has \(X\) rows in the private table and \(Y\) rows in the public table, then the output table will contain \(X*Y\) rows for this group.

Note

Tables with a ProtectedChange of AddRowsWithID must include the privacy ID column in the join columns.

An illustrated example can be found in the Simple transformations tutorial.

Example

>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> public_data.toPandas()
   A  C
0  0  0
1  0  1
2  1  1
3  1  2
>>> budget = PureDPBudget(float("inf"))
>>> sess = Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Create a query joining with public_data as a dataframe:
>>> query = (
...     QueryBuilder("my_private_data")
...     .join_public(public_data)
...     .groupby(KeySet.from_dict({"C": [0, 1, 2]}))
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("C").toPandas()
   C  count
0  0      1
1  1      3
2  2      2
>>> # Alternatively, the dataframe can be added to the Session as a public
>>> # source, and its source ID can be used to perform the join:
>>> sess.add_public_dataframe(
...     source_id="my_public_data", dataframe=public_data
... )
>>> query = (
...     QueryBuilder("my_private_data")
...     .join_public("my_public_data")
...     .groupby(KeySet.from_dict({"C": [0, 1, 2]}))
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("C").toPandas()
   C  count
0  0      1
1  1      3
2  2      2
>>> # The join can also be a left join. This is helpful to keep
>>> # records which are not included in the table. One use for this
>>> # is to find the number of records which are or are not
>>> # included in a KeySet.
>>> from pyspark.sql import functions as sf
>>> my_keyset = KeySet.from_dict({"A": ["0"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .join_public(
...         my_keyset.dataframe().withColumn("Indicator", sf.lit(1)),
...         how="left",
...     )
...     .groupby(KeySet.from_dict({"Indicator": [1, None]}))
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()  # pandas converts int to float if there are NaNs
   Indicator  count
0        NaN      2
1        1.0      1

Parameters:

public_table (Union[DataFrame, str]) – A dataframe or source ID for a public source to natural join with private data.
join_columns (Optional[Sequence[str]]) – The columns to join on. If join_columns is not specified, the tables will be joined on all common columns.
how (str) – The type of join to perform. Must be one of “inner” or “left”.

Return type:

QueryBuilder

Tumult Platform

QueryBuilder.join_public#