keyset#

A KeySet specifies a list of values for one or more columns.

For example, a KeySet could specify the values ["a1", "a2"] for column A and the values [0, 1, 2, 3] for column B.

Currently, KeySets are used as a simpler way to specify domains for groupby transformations.

Classes#

KeySet

A class containing a set of values for specific columns.

class KeySet(dataframe)#

A class containing a set of values for specific columns.

Note that once a column has had its null values dropped or replaced, Analytics will raise an error if you use a KeySet that contains a null value for that column.

Parameters

dataframe (Union[pyspark.sql.DataFrame, Callable[[], pyspark.sql.DataFrame]]) –

__init__(dataframe)#

Construct a new KeySet.

The from_dict() and from_dataframe() methods are preferred over directly using the constructor to create new KeySets. Directly constructing KeySets skips checks that guarantee the uniqueness of output rows.

Parameters

dataframe (Union[pyspark.sql.dataframe.DataFrame, Callable[[], pyspark.sql.dataframe.DataFrame]]) –

Return type

None

dataframe(self)#

Return the dataframe associated with this KeySet.

This dataframe contains every combination of values being selected in the KeySet. Its rows are guaranteed to be unique as long as the KeySet was created with from_dict() or from_dataframe() rather than by calling the constructor directly.

Return type

pyspark.sql.DataFrame

classmethod from_dict(cls, domains)#

Create a KeySet from a dictionary.

The domains dictionary should map column names to the desired values for those columns. The KeySet returned is the cross-product of those columns. Duplicate values in the column domains are allowed, but only one of the duplicates is kept.
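
The cross-product and de-duplication behaviour described above can be sketched in plain Python. This is only an illustration of the semantics, not the library's implementation: the helper name cross_product_rows is hypothetical, and plain tuples stand in for Spark rows.

```python
from itertools import product


def cross_product_rows(domains):
    """Return (column names, rows) covering every combination of column values."""
    columns = list(domains)
    # dict.fromkeys drops duplicate values while keeping first-seen order,
    # mirroring "only one of the duplicates is kept".
    unique_values = [list(dict.fromkeys(domains[col])) for col in columns]
    return columns, list(product(*unique_values))


# "a1" appears twice in the domain for A, but contributes only once.
columns, rows = cross_product_rows({"A": ["a1", "a2", "a1"], "B": ["b1", "b2"]})
for row in rows:
    print(dict(zip(columns, row)))
```

With two distinct values for A and two for B, the sketch yields four rows, one per combination.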

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": ["b1", "b2"],
... }
>>> keyset = KeySet.from_dict(domains)
>>> keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2

Parameters

domains – A dictionary mapping column names to the desired values for those columns.

Return type

KeySet

classmethod from_dataframe(cls, dataframe)#

Create a KeySet from a dataframe.

This DataFrame should contain every combination of values being selected in the KeySet. If there are duplicate rows in the dataframe, only one copy of each will be kept.
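
As a rough illustration of the duplicate handling, the sketch below removes repeated rows from a list of tuples. It is a plain-Python analogy under stated assumptions, not the library's code: the helper unique_rows and the sample rows are invented, and tuples stand in for dataframe rows.

```python
def unique_rows(rows):
    """Drop repeated rows, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))


# Columns are (A, B). The row ("a1", 0) appears twice.
rows = [("a1", 0), ("a1", 1), ("a1", 0), ("a2", 1)]
deduplicated = unique_rows(rows)
print(deduplicated)
```

Note that the sample rows are not a full cross-product: ("a2", 0) is absent, so that combination is simply not among the selected keys.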

When creating KeySets with this method, it is the responsibility of the caller to ensure that the given dataframe remains valid for the lifetime of the KeySet. If the dataframe becomes invalid, for example because its Spark session is closed, this method or any uses of the resulting dataframe may raise exceptions or have other unanticipated effects.

Parameters

dataframe – The dataframe containing every combination of values being selected in the KeySet.

Return type

KeySet

filter(self, expr)#

Filter this KeySet using some expression.

This method accepts the same syntax as pyspark.sql.DataFrame.filter(): valid expressions are those that can be used in a WHERE clause in Spark SQL. Examples of valid predicates include:

  • age < 42

  • age BETWEEN 17 AND 42

  • age < 42 OR (age < 60 AND gender IS NULL)

  • LENGTH(name) > 17

  • favorite_color IN ('blue', 'red')
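
The predicates listed above are ordinary SQL, so their meaning can be explored without a Spark session. The sketch below is not how KeySet.filter() is implemented: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL, and the people table, its rows, and the names_matching helper are all invented for illustration. SQLite and Spark SQL agree on these particular predicates, but they can differ elsewhere.

```python
import sqlite3

# Hypothetical sample data: (name, age, gender, favorite_color).
people = [
    ("Ann", 30, "F", "blue"),
    ("Bob", 50, None, "red"),
    ("Christopher Alexander", 70, "M", "green"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, age INTEGER, gender TEXT, favorite_color TEXT)"
)
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", people)


def names_matching(predicate):
    """Return the names of rows satisfying a WHERE-clause predicate."""
    # The predicate is interpolated directly, so only pass trusted strings.
    query = f"SELECT name FROM people WHERE {predicate} ORDER BY name"
    return [name for (name,) in conn.execute(query)]


predicates = [
    "age < 42",
    "age BETWEEN 17 AND 42",
    "age < 42 OR (age < 60 AND gender IS NULL)",
    "LENGTH(name) > 17",
    "favorite_color IN ('blue', 'red')",
]
results = {p: names_matching(p) for p in predicates}
for predicate, names in results.items():
    print(f"{predicate!r}: {names}")
```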

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": [0, 1, 2, 3],
... }
>>> keyset = KeySet.from_dict(domains)
>>> filtered_keyset = keyset.filter("B < 2")
>>> filtered_keyset.dataframe().sort("A", "B").toPandas()
    A  B
0  a1  0
1  a1  1
2  a2  0
3  a2  1
>>> filtered_keyset = keyset.filter(keyset.dataframe().A != "a1")
>>> filtered_keyset.dataframe().sort("A", "B").toPandas()
    A  B
0  a2  0
1  a2  1
2  a2  2
3  a2  3

Parameters

expr (Union[pyspark.sql.Column, str]) –

Return type

KeySet

__getitem__(self, cols)#

KeySet[col, col, …] returns a KeySet with those columns only.

The returned KeySet contains all unique combinations of values in the given columns that were present in the original KeySet.
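
The difference between "combinations present in the original" and a fresh cross-product is easiest to see when the original is not itself a cross-product. The following plain-Python sketch shows the idea. It is an analogy only: the project helper is hypothetical and tuples stand in for dataframe rows.

```python
def project(columns, rows, keep):
    """Keep only the `keep` columns and drop repeated combinations."""
    indices = [columns.index(name) for name in keep]
    projected = [tuple(row[i] for i in indices) for row in rows]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(keep), list(dict.fromkeys(projected))


columns = ["A", "B", "C"]
rows = [
    ("a1", "b1", "c1"),
    ("a1", "b1", "c2"),
    ("a2", "b2", "c1"),
]
new_columns, new_rows = project(columns, rows, ["A", "B"])
print(new_columns, new_rows)
```

Here ("a1", "b2") and ("a2", "b1") never occur in the original rows, so they do not appear in the projection either.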

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": ["b1", "b2"],
...     "C": ["c1", "c2"],
...     "D": [0, 1, 2, 3]
... }
>>> keyset = KeySet.from_dict(domains)
>>> a_b_keyset = keyset["A", "B"]
>>> a_b_keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
>>> a_b_keyset = keyset[["A", "B"]]
>>> a_b_keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
>>> a_keyset = keyset["A"]
>>> a_keyset.dataframe().sort("A").toPandas()
    A
0  a1
1  a2

Parameters

cols (Union[str, Tuple[str, ...], List[str]]) –

Return type

KeySet

__mul__(self, other)#

A product (KeySet * KeySet) returns the cross-product of both KeySets.
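
A minimal plain-Python sketch of this cross-product follows. It is illustrative only, not the library's implementation: the multiply helper is hypothetical, each key set is modelled as a (columns, rows) pair of tuples, and the two operands are assumed to have no column names in common (this page does not say how overlapping columns are handled).

```python
from itertools import product


def multiply(left, right):
    """Pair every row on the left with every row on the right."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    rows = [l_row + r_row for l_row, r_row in product(left_rows, right_rows)]
    return left_cols + right_cols, rows


# The left operand selects only two (A, B) combinations, not all four.
keys_ab = (["A", "B"], [("a1", 0), ("a2", 1)])
keys_c = (["C"], [("c1",), ("c2",)])
prod_cols, prod_rows = multiply(keys_ab, keys_c)
print(prod_cols)
for row in prod_rows:
    print(row)
```

The result has 2 × 2 = 4 rows: each existing (A, B) combination is paired with each value of C, and no new (A, B) combinations are created.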

Example

>>> keyset1 = KeySet.from_dict({"A": ["a1", "a2"]})
>>> keyset2 = KeySet.from_dict({"B": ["b1", "b2"]})
>>> product = keyset1 * keyset2
>>> product.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2

Parameters

other (KeySet) –

Return type

KeySet

__eq__(self, other)#

Override equality.

Two KeySets are equal if their dataframes contain the same values for the same columns (in any order).
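
One way to picture this order-insensitive comparison is to treat each key set as a set of rows, where every row records which value belongs to which column. The sketch below is a plain-Python analogy, not the library's implementation. The normalize helper and the list-of-dicts representation are assumptions made for illustration.

```python
def normalize(rows):
    """Turn a list of {column: value} rows into an order-independent set."""
    return {frozenset(row.items()) for row in rows}


keys1 = [{"A": "a1"}, {"A": "a2"}]
keys3 = [{"A": "a2"}, {"A": "a1"}]        # same rows, different order
different = [{"B": "a1"}, {"B": "a2"}]    # same values, different column

same_rows_any_order = normalize(keys1) == normalize(keys3)
same_values_other_column = normalize(keys1) == normalize(different)
print(same_rows_any_order, same_values_other_column)
```

Because each value stays attached to its column name, identical values under a different column do not compare equal.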

Example

>>> keyset1 = KeySet.from_dict({"A": ["a1", "a2"]})
>>> keyset2 = KeySet.from_dict({"A": ["a1", "a2"]})
>>> keyset3 = KeySet.from_dict({"A": ["a2", "a1"]})
>>> keyset1 == keyset2
True
>>> keyset1 == keyset3
True
>>> different_keyset = KeySet.from_dict({"B": ["a1", "a2"]})
>>> keyset1 == different_keyset
False

Parameters

other (object) –

Return type

bool

schema(self)#

Return a Schema based on the KeySet.

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": [0, 1, 2, 3],
... }
>>> keyset = KeySet.from_dict(domains)
>>> schema = keyset.schema()
>>> schema 
Schema({'A': ColumnDescriptor(column_type=ColumnType.VARCHAR, allow_null=True, allow_nan=False, allow_inf=False),
        'B': ColumnDescriptor(column_type=ColumnType.INTEGER, allow_null=True, allow_nan=False, allow_inf=False)})

Return type

tmlt.analytics._schema.Schema