QueryBuilder.join_public#
from tmlt.analytics import QueryBuilder
- QueryBuilder.join_public(public_table, join_columns=None, how='inner')#
Joins the table with a DataFrame or a public source.
This operation is either an inner or left join.
This operation performs a natural join between two tables. This means that the resulting table will contain all columns unique to each input table, along with one copy of each common column. In most cases, the columns have the same names they did in the input tables.
By default, the input tables are joined on all common columns (i.e., columns whose names and data types match). However if
join_columns
is given, the tables will be joined only on the given columns, and the remaining common columns will be disambiguated in the resulting table by the addition of a_left
or_right
suffix to their names. If given,join_columns
must contain a non-empty subset of the tables’ common columns. For example, two tables with columnsA,B,C
andA,B,D
would by default be joined on columnsA
andB
, resulting in a table with columnsA,B,C,D
; ifjoin_columns=["B"]
were given when performing this join, the resulting table would have columnsA_left,A_right,B,C,D
. The order of columns in the resulting table is not guaranteed.Note
Columns must share both names and data types for them to be used in joining. If this condition is not met, one of the data sources must be transformed to be eligible for joining (e.g., by using
rename()
ormap()
).Every row within a join group (i.e., every row that shares values in the join columns) from the private table will be joined with every row from that same group in the public table. For example, if a group has \(X\) rows in the private table and \(Y\) rows in the public table, then the output table will contain \(X*Y\) rows for this group.
Note
Tables with a
ProtectedChange
ofAddRowsWithID
must include the privacy ID column in the join columns.An illustrated example can be found in the Simple transformations tutorial.
Example
>>> my_private_data.toPandas() A B X 0 0 1 0 1 1 0 1 2 1 2 1 >>> public_data.toPandas() A C 0 0 0 1 0 1 2 1 1 3 1 2 >>> budget = PureDPBudget(float("inf")) >>> sess = Session.from_dataframe( ... privacy_budget=budget, ... source_id="my_private_data", ... dataframe=my_private_data, ... protected_change=AddOneRow(), ... ) >>> # Create a query joining with public_data as a dataframe: >>> query = ( ... QueryBuilder("my_private_data") ... .join_public(public_data) ... .groupby(KeySet.from_dict({"C": [0, 1, 2]})) ... .count() ... ) >>> answer = sess.evaluate( ... query, ... PureDPBudget(float("inf")) ... ) >>> answer.sort("C").toPandas() C count 0 0 1 1 1 3 2 2 2 >>> # Alternatively, the dataframe can be added to the Session as a public >>> # source, and its source ID can be used to perform the join: >>> sess.add_public_dataframe( ... source_id="my_public_data", dataframe=public_data ... ) >>> query = ( ... QueryBuilder("my_private_data") ... .join_public("my_public_data") ... .groupby(KeySet.from_dict({"C": [0, 1, 2]})) ... .count() ... ) >>> answer = sess.evaluate( ... query, ... PureDPBudget(float("inf")) ... ) >>> answer.sort("C").toPandas() C count 0 0 1 1 1 3 2 2 2 >>> # The join can also be a left join. This is helpful to keep >>> # records which are not included in the table. One use for this >>> # is to find the number of records which are or are not >>> # included in a KeySet. >>> from pyspark.sql import functions as sf >>> my_keyset = KeySet.from_dict({"A": ["0"]}) >>> query = ( ... QueryBuilder("my_private_data") ... .join_public( ... my_keyset.dataframe().withColumn("Indicator", sf.lit(1)), ... how="left", ... ) ... .groupby(KeySet.from_dict({"Indicator": [1, None]})) ... .count() ... ) >>> answer = sess.evaluate( ... query, ... PureDPBudget(float("inf")) ... ) >>> answer.toPandas() # pandas converts int to float if there are NaNs Indicator count 0 NaN 2 1 1.0 1
- Parameters:
public_table (
Union
[DataFrame
,str
]) – A dataframe or source ID for a public source to natural join with private data.join_columns (
Optional
[Sequence
[str
]]) – The columns to join on. Ifjoin_columns
is not specified, the tables will be joined on all common columns.how (
str
) – The type of join to perform. Must be one of “inner” or “left”.
- Return type: