keyset#

A KeySet specifies a list of values for one or more columns.

KeySets are used as input to the groupby() method to build group-by queries. An introduction to KeySets can be found in the Group-by queries tutorial.

Data#

LOW_SIZE#

An arbitrary threshold below which a KeySet is considered small.

Classes#

class KeySet#

Bases: abc.ABC

A class containing a set of values for specific columns.

An introduction to KeySet initialization and manipulation can be found in the Group-by queries tutorial.

Warning

If a column has null values dropped or replaced, then Analytics will raise an error if you use a KeySet that contains a null value for that column.

Note

The from_dict() and from_dataframe() methods are the preferred way to construct KeySets. Directly constructing KeySets skips checks that guarantee the uniqueness of output rows, and __init__ methods are not guaranteed to work the same way between releases.

Methods#

from_dict()        Creates a KeySet from a dictionary.
from_dataframe()   Creates a KeySet from a dataframe.
from_tuples()      Creates a KeySet from a list of tuples and column names.
dataframe()        Returns the dataframe associated with this KeySet.
__getitem__()      KeySet[col, col, ...] returns a KeySet with those columns only.
__eq__()           Returns whether two KeySets are equal.
schema()           Returns the KeySet’s schema.
__mul__()          A product (KeySet * KeySet) returns the cross-product of both KeySets.
columns()          Returns the list of columns used in this KeySet.
filter()           Filters this KeySet using some condition.
size()             Returns the size of this KeySet.
cache()            Caches the KeySet’s dataframe in memory.
uncache()          Removes the KeySet’s dataframe from memory and disk.

classmethod from_dict(domains)#

Creates a KeySet from a dictionary.

The domains dictionary should map column names to the desired values for those columns. The KeySet returned is the cross-product of those columns. Duplicate values in the column domains are allowed, but only one of the duplicates is kept.

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": ["b1", "b2"],
... }
>>> keyset = KeySet.from_dict(domains)
>>> keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
Parameters:

domains (Mapping[str, Union[Iterable[Optional[str]], Iterable[Optional[int]], Iterable[Optional[datetime.date]]]])

Return type:

KeySet
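
The cross-product and de-duplication behavior described above can be modeled in plain Python. This is only a sketch of the semantics, not of how the library implements them: `model_from_dict` is a hypothetical helper, and it uses neither Spark nor the KeySet class.

```python
from itertools import product
from typing import Dict, Iterable, List, Tuple


def model_from_dict(domains: Dict[str, Iterable]) -> Tuple[List[str], List[tuple]]:
    """Hypothetical plain-Python model of the from_dict() semantics."""
    columns = list(domains)
    # dict.fromkeys drops duplicate values while preserving first-seen order.
    unique_values = [list(dict.fromkeys(domains[col])) for col in columns]
    # The rows are the cross-product of the de-duplicated column domains.
    rows = list(product(*unique_values))
    return columns, rows


# "a1" appears twice in the domain of A, but only one copy is kept.
columns, rows = model_from_dict({"A": ["a1", "a2", "a1"], "B": ["b1", "b2"]})
print(columns)
print(rows)
```

In this model the duplicate "a1" contributes no extra rows, so the result has the same four rows as the example above.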

classmethod from_dataframe(dataframe)#

Creates a KeySet from a dataframe.

The dataframe should contain every combination of values being selected in the KeySet. If there are duplicate rows in the dataframe, only one copy of each will be kept.

When creating a KeySet with this method, the caller is responsible for ensuring that the given dataframe remains valid for the lifetime of the KeySet. If the dataframe becomes invalid, for example because its Spark session is closed, this method or any later use of the resulting dataframe may raise exceptions or have other unanticipated effects.

Parameters:

dataframe (pyspark.sql.DataFrame)

Return type:

KeySet
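
Unlike from_dict(), this method keeps only the combinations that are actually listed, with duplicate rows collapsed. The sketch below models that with plain tuples standing in for dataframe rows; `model_from_rows` is a hypothetical helper, and no Spark dataframe or KeySet is involved.

```python
from typing import List, Sequence


def model_from_rows(rows: Sequence[tuple]) -> List[tuple]:
    """Hypothetical model: keep one copy of each row, in first-seen order."""
    return list(dict.fromkeys(rows))


# ("a1", "b1") is listed twice; combinations such as ("a1", "b3") are never listed.
input_rows = [("a1", "b1"), ("a2", "b1"), ("a1", "b1"), ("a3", "b3")]
keys = model_from_rows(input_rows)
print(keys)
```

The modeled result contains three rows, and no combination that was absent from the input appears in it.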

classmethod from_tuples(tuples, columns)#

Creates a KeySet from a list of tuples and column names.

Example

>>> tuples = [
...   ("a1", "b1"),
...   ("a2", "b1"),
...   ("a3", "b3"),
... ]
>>> keyset = KeySet.from_tuples(tuples, ["A", "B"])
>>> keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a2  b1
2  a3  b3
Parameters:

tuples

columns

Return type:

KeySet

abstract dataframe()#

Returns the dataframe associated with this KeySet.

This dataframe contains every combination of values being selected in the KeySet, and its rows are guaranteed to be unique as long as the KeySet was constructed safely.

Return type:

pyspark.sql.DataFrame

abstract __getitem__(columns)#

KeySet[col, col, ...] returns a KeySet with those columns only.

The returned KeySet contains all unique combinations of values in the given columns that were present in the original KeySet.

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": ["b1", "b2"],
...     "C": ["c1", "c2"],
...     "D": [0, 1, 2, 3]
... }
>>> keyset = KeySet.from_dict(domains)
>>> a_b_keyset = keyset["A", "B"]
>>> a_b_keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
>>> a_b_keyset = keyset[["A", "B"]]
>>> a_b_keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
>>> a_keyset = keyset["A"]
>>> a_keyset.dataframe().sort("A").toPandas()
    A
0  a1
1  a2
Parameters:

columns (Union[str, Tuple[str], Sequence[str]])

Return type:

KeySet
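
When the original KeySet is not a full cross-product, selecting columns keeps only the value combinations that actually occurred. A minimal plain-Python model of that projection follows; `model_project` is a hypothetical helper that works on tuples rather than on a KeySet.

```python
from typing import List, Sequence


def model_project(
    columns: Sequence[str], rows: Sequence[tuple], selected: Sequence[str]
) -> List[tuple]:
    """Hypothetical model of KeySet[...]: project rows, then de-duplicate."""
    indices = [columns.index(col) for col in selected]
    projected = (tuple(row[i] for i in indices) for row in rows)
    # dict.fromkeys keeps one copy of each projected row, in first-seen order.
    return list(dict.fromkeys(projected))


columns = ["A", "B"]
rows = [("a1", "b1"), ("a2", "b1"), ("a3", "b3")]
b_rows = model_project(columns, rows, ["B"])
print(b_rows)
```

Here "b1" occurs in two input rows but appears once in the projection.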

__eq__(other)#

Returns whether two KeySets are equal.

Two KeySets are equal if their dataframes contain the same values for the same columns (in any order).

Example

>>> keyset1 = KeySet.from_dict({"A": ["a1", "a2"]})
>>> keyset2 = KeySet.from_dict({"A": ["a1", "a2"]})
>>> keyset3 = KeySet.from_dict({"A": ["a2", "a1"]})
>>> keyset1 == keyset2
True
>>> keyset1 == keyset3
True
>>> different_keyset = KeySet.from_dict({"B": ["a1", "a2"]})
>>> keyset1 == different_keyset
False
Parameters:

other (object)

Return type:

bool
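
The order-insensitive comparison can be modeled by treating each KeySet as a set of rows. The sketch below is a plain-Python model with hypothetical helpers (`as_key_set`, `model_equal`). It assumes that "in any order" covers both row order and column order; that reading is an assumption, not something the text above confirms.

```python
from typing import FrozenSet, Sequence, Set


def as_key_set(columns: Sequence[str], rows: Sequence[tuple]) -> Set[FrozenSet]:
    """Represent each row as a frozenset of (column, value) pairs."""
    return {frozenset(zip(columns, row)) for row in rows}


def model_equal(
    columns1: Sequence[str],
    rows1: Sequence[tuple],
    columns2: Sequence[str],
    rows2: Sequence[tuple],
) -> bool:
    """Hypothetical model of KeySet equality, ignoring row and column order."""
    # Assumption: column order is ignored as well as row order.
    if set(columns1) != set(columns2):
        return False
    return as_key_set(columns1, rows1) == as_key_set(columns2, rows2)


same = model_equal(["A"], [("a1",), ("a2",)], ["A"], [("a2",), ("a1",)])
different = model_equal(["A"], [("a1",), ("a2",)], ["B"], [("a1",), ("a2",)])
print(same, different)
```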

abstract schema()#

Returns the KeySet’s schema.

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": [0, 1, 2, 3],
... }
>>> keyset = KeySet.from_dict(domains)
>>> schema = keyset.schema()
>>> schema
{'A': ColumnDescriptor(column_type=ColumnType.VARCHAR, allow_null=True, allow_nan=False, allow_inf=False),
 'B': ColumnDescriptor(column_type=ColumnType.INTEGER, allow_null=True, allow_nan=False, allow_inf=False)}
Return type:

Dict[str, tmlt.analytics._schema.ColumnDescriptor]

__mul__(other)#

A product (KeySet * KeySet) returns the cross-product of both KeySets.

Example

>>> keyset1 = KeySet.from_dict({"A": ["a1", "a2"]})
>>> keyset2 = KeySet.from_dict({"B": ["b1", "b2"]})
>>> product = keyset1 * keyset2
>>> product.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
Parameters:

other (KeySet)

Return type:

KeySet

abstract columns()#

Returns the list of columns used in this KeySet.

Return type:

List[str]

abstract filter(condition)#

Filters this KeySet using some condition.

This method accepts the same syntax as pyspark.sql.DataFrame.filter(): valid conditions are those that can be used in a WHERE clause in Spark SQL. Examples of valid conditions include:

  • age < 42

  • age BETWEEN 17 AND 42

  • age < 42 OR (age < 60 AND gender IS NULL)

  • LENGTH(name) > 17

  • favorite_color IN ('blue', 'red')

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": [0, 1, 2, 3],
... }
>>> keyset = KeySet.from_dict(domains)
>>> filtered_keyset = keyset.filter("B < 2")
>>> filtered_keyset.dataframe().sort("A", "B").toPandas()
    A  B
0  a1  0
1  a1  1
2  a2  0
3  a2  1
>>> filtered_keyset = keyset.filter(keyset.dataframe().A != "a1")
>>> filtered_keyset.dataframe().sort("A", "B").toPandas()
    A  B
0  a2  0
1  a2  1
2  a2  2
3  a2  3
Parameters:

condition (Union[pyspark.sql.Column, str])

Return type:

KeySet

abstract size()#

Returns the size of this KeySet.

Note

A KeySet with no rows and no columns has a size of 1, because a query grouped on such an empty KeySet returns one row (aggregating data over the entire dataset). A KeySet with no rows but at least one column has a size of 0.

Return type:

int
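
The two empty cases in the note can be captured in a small plain-Python model. `model_size` is a hypothetical helper operating on column names and row tuples, not the library's implementation.

```python
from typing import Sequence


def model_size(columns: Sequence[str], rows: Sequence[tuple]) -> int:
    """Hypothetical model of the size rule described in the note."""
    if not columns:
        # No columns: a grouped query aggregates the whole dataset into one row.
        return 1
    # Otherwise the size is the number of distinct rows.
    return len(set(rows))


no_columns = model_size([], [])
no_rows = model_size(["A"], [])
two_rows = model_size(["A"], [("a1",), ("a2",)])
print(no_columns, no_rows, two_rows)
```

The size-1 convention mirrors Python's own `itertools.product()`, which yields a single empty tuple when called with no iterables.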

cache()#

Caches the KeySet’s dataframe in memory.

Return type:

None

uncache()#

Removes the KeySet’s dataframe from memory and disk.

Return type:

None