join#

Utilities related to joining dataframes.

Functions#

natural_join_columns()

Returns the columns to join on to perform a natural join.

columns_after_join()

Return the expected output columns and their origin from joining two dataframes.

domain_after_join()

Returns the domain of the join of two dataframes.

join()

Returns the join of two dataframes.

natural_join_columns(left_columns, right_columns)#

Returns the columns to join on to perform a natural join.

The columns to join on are the ones that are in both left_columns and right_columns, in the order they appear in left_columns.

Note

Returns an empty list if there are no common columns.

Examples

>>> natural_join_columns(["a", "b", "c"], ["b", "c", "d"])
['b', 'c']
>>> natural_join_columns(["a", "b", "c"], ["d", "e", "f"])
[]
Parameters
  • left_columns (List[str]) – Columns of the left dataframe.

  • right_columns (List[str]) – Columns of the right dataframe.

Return type

List[str]

columns_after_join(left_columns, right_columns, on=None)#

Return the expected output columns and their origin from joining two dataframes.

The keys are the names of the output columns in the order they will appear in the output dataframe. The values are tuples of the form (left_column, right_column), where left_column and right_column are the names of the columns in the left and right dataframes that the output column is derived from. If the output column is not derived from a column in the left or right dataframe, the corresponding value is None.

The output columns are ordered as follows:

  • Join columns (in the order given by the user, or the order they are in the left table if not provided) appear first.

  • Columns of left table (with _left appended as required) appear next in the input order. (excluding join columns)

  • Columns of the right table (with _right appended as required) appear last in the input order. (excluding join columns)

_left and _right are appended to output column names if they are in both left_columns and right_columns, but are not joined on.

Also does input validation. Checks:

  • The join involves at least one column.

  • Join columns are in both tables.

  • None of the column names are duplicated in any of the inputs.

  • No name collisions when adding _left or _right to a column name.

Examples

>>> columns_after_join(["a", "b", "c"], ["b", "c", "d"])
{'b': ('b', 'b'), 'c': ('c', 'c'), 'a': ('a', None), 'd': (None, 'd')}
>>> columns_after_join(["a", "b", "c"], ["b", "c", "d"], ["b"])
{'b': ('b', 'b'), 'a': ('a', None), 'c_left': ('c', None), 'c_right': (None, 'c'), 'd': (None, 'd')}
>>> columns_after_join(["a", "b", "c"], ["d", "e", "f"])
Traceback (most recent call last):
...
ValueError: Join must involve at least one column.
>>> columns_after_join(["a", "b", "c"], ["a", "b", "c"], ["d"])
Traceback (most recent call last):
...
ValueError: Join column 'd' not in the left table.
>>> columns_after_join(["a", "b", "c"], ["a", "b", "c"], ["a", "a"])
Traceback (most recent call last):
...
ValueError: Join columns (`on`) contain duplicates.
>>> columns_after_join(["a", "b"], ["a", "b", "b_right"], ["a"])
Traceback (most recent call last):
...
ValueError: Name collision, 'b_right' would appear more than once in the output.
Parameters
  • left_columns (List[str]) – Columns of the left dataframe.

  • right_columns (List[str]) – Columns of the right dataframe.

  • on (Optional[List[str]]) – Columns to join on. If None, join on all columns with the same name.

Return type

Dict[str, Union[Tuple[str, str], Tuple[str, None], Tuple[None, str]]]

domain_after_join(left_domain, right_domain, on=None, how='inner', nulls_are_equal=False)#

Returns the domain of the join of two dataframes.

Also does input validation. Checks:

  • All checks from columns_after_join().

  • how is one of “left”, “right”, “inner”, or “outer”.

  • Join columns have the same data type.

  • Left and right domains are SparkDataFrameDomains.

Note

This takes into account extra metadata about the columns, such as whether nulls/ infs are allowed, and what kind of join is performed.

See NaNs, nulls, and infs for more information about comparisons involving special values.

Parameters
  • left_domain (tmlt.core.domains.base.Domain) – Domain of the left dataframe.

  • right_domain (tmlt.core.domains.base.Domain) – Domain of the right dataframe.

  • on (Optional[List[str]]) – Columns to join on. If None, join on all columns with the same name.

  • how (str) – Join type. Must be one of “left”, “right”, “inner”, “outer”. This defaults to “inner”.

  • nulls_are_equal (bool) – If True, treats null values as equal. Defaults to False.

Return type

tmlt.core.domains.spark_domains.SparkDataFrameDomain

join(left, right, on=None, how='inner', nulls_are_equal=False)#

Returns the join of two dataframes.

Parameters
  • left (pyspark.sql.DataFrame) – Left dataframe.

  • right (pyspark.sql.DataFrame) – Right dataframe.

  • on (Optional[List[str]]) – Columns to join on. If None, join on all columns with the same name.

  • how (str) – Join type. Must be one of “left”, “right”, “inner”, “outer”. If None, defaults to “inner”.

  • nulls_are_equal (bool) – If True, treats null values as equal. Defaults to False.

Return type

pyspark.sql.DataFrame