KeySet#

from tmlt.analytics import KeySet

class tmlt.analytics.KeySet(op_tree, columns)#

Bases: object

A class containing a set of values for specific columns.

An introduction to KeySet initialization and manipulation can be found in the Group-by queries tutorial.

Warning

If null values in a column have been dropped or replaced, Analytics raises an error when you use a KeySet that contains a null value for that column.

static from_dataframe(df)#

Creates a KeySet from a dataframe.

The dataframe should contain every combination of values being selected in the KeySet. If there are duplicate rows in the dataframe, only one copy of each is kept.

When creating KeySets with this method, it is the responsibility of the caller to ensure that the given dataframe remains valid for the lifetime of the KeySet. If the dataframe becomes invalid, for example because its Spark session is closed, this method or any later use of the resulting KeySet may raise exceptions or have other unanticipated effects.

Return type:: KeySet

static from_tuples(tuples, columns)#

Creates a KeySet from a list of tuples and column names.

Return type:: KeySet

Example

>>> tuples = [
...   ("a1", "b1"),
...   ("a2", "b1"),
...   ("a3", "b3"),
... ]
>>> keyset = KeySet.from_tuples(tuples, ["A", "B"])
>>> keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a2  b1
2  a3  b3

static from_dict(domains)#

Creates a KeySet from a dictionary.

The domains dictionary should map column names to the desired values for those columns. The KeySet returned is the cross-product of those columns. Duplicate values in the column domains are allowed, but only one of the duplicates is kept.

Return type:: KeySet

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": ["b1", "b2"],
... }
>>> keyset = KeySet.from_dict(domains)
>>> keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
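
The cross-product and duplicate-handling rules described above can be modeled in plain Python. This is only a sketch of the documented behavior, not how Analytics builds a KeySet internally; the helper name `cross_product_rows` is hypothetical.

```python
from itertools import product


def cross_product_rows(domains):
    """Model the rows of KeySet.from_dict(domains) as a set of tuples."""
    columns = list(domains)
    # Drop duplicate values within each column's domain, keeping first occurrences.
    unique_values = [list(dict.fromkeys(domains[col])) for col in columns]
    # Every combination of one value per column is a row of the result.
    return columns, set(product(*unique_values))


# "a1" appears twice in the domain of A, but it only counts once.
columns, rows = cross_product_rows({"A": ["a1", "a2", "a1"], "B": ["b1", "b2"]})
print(columns)       # ['A', 'B']
print(sorted(rows))  # four rows, the same as in the example above
```

In this model the number of rows is the product of the number of distinct values per column, which is worth keeping in mind when the domains are large.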

__mul__(other)#

The Cartesian product of the two KeySet factors.

Example

>>> keyset1 = KeySet.from_tuples([("a1",), ("a2",)], columns=["A"])
>>> keyset2 = KeySet.from_tuples([("b1",), ("b2",)], columns=["B"])
>>> product = keyset1 * keyset2
>>> product.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2

__sub__(other)#

Removes rows in this KeySet that match rows in another KeySet.

Equivalent to a left anti-join between this KeySet and other.

other must have a subset of the columns of this KeySet. Any rows in this KeySet where the values in those columns match values in other are removed.

Return type:: KeySet

Example

>>> keyset1 = KeySet.from_dict({"A": [1, 2], "B": ["a", "b"]})
>>> result = keyset1 - KeySet.from_tuples([(1, "b")], columns=["A", "B"])
>>> result.dataframe().sort("A", "B").toPandas()
   A  B
0  1  a
1  2  a
2  2  b
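
The example above subtracts a KeySet with the same columns. To illustrate the case where other has only a subset of the columns, here is a plain-Python model of the anti-join rule on lists of dictionaries. The helper `subtract_rows` is hypothetical and does not attempt to model how Analytics treats null values.

```python
def subtract_rows(rows, other_rows):
    """Model `keyset - other` on lists of dicts (hypothetical helper)."""
    if not other_rows:
        return list(rows)
    other_columns = list(other_rows[0])
    # Project each row of `other` onto a tuple for fast membership tests.
    to_remove = {tuple(r[c] for c in other_columns) for r in other_rows}
    # Keep only rows whose values in those columns do not appear in `other`.
    return [
        r for r in rows
        if tuple(r[c] for c in other_columns) not in to_remove
    ]


rows = [{"A": a, "B": b} for a in (1, 2) for b in ("a", "b")]
# `other` has only column A, so every row with A == 1 is removed.
remaining = subtract_rows(rows, [{"A": 1}])
print(remaining)  # [{'A': 2, 'B': 'a'}, {'A': 2, 'B': 'b'}]
```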

__getitem__(desired_columns)#

KeySet[col, col, ...] returns a KeySet with those columns only.

The returned KeySet contains all unique combinations of values in the given columns that were present in the original KeySet.

Return type:: KeySet

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": ["b1", "b2"],
...     "C": ["c1", "c2"],
...     "D": [0, 1, 2, 3]
... }
>>> keyset = KeySet.from_dict(domains)
>>> a_b_keyset = keyset["A", "B"]
>>> a_b_keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
>>> a_b_keyset = keyset[["A", "B"]]
>>> a_b_keyset.dataframe().sort("A", "B").toPandas()
    A   B
0  a1  b1
1  a1  b2
2  a2  b1
3  a2  b2
>>> a_keyset = keyset["A"]
>>> a_keyset.dataframe().sort("A").toPandas()
    A
0  a1
1  a2

join(other)#

The inner natural join of two KeySet objects.

The two KeySets are inner joined on columns with matching names, treating nulls as equal to one another.

Example

>>> keyset1 = KeySet.from_tuples([("a1",), ("a2",)], columns=["A"])
>>> keyset2 = KeySet.from_tuples(
...     [("a2", "b1"), ("a3", "b2")], columns=["A", "B"]
... )
>>> keyset1.join(keyset2).dataframe().sort("A", "B").toPandas()
    A   B
0  a2  b1
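
The null-handling rule is easy to miss, because an ordinary SQL equality comparison does not treat NULL as equal to NULL. The following plain-Python sketch models the documented behavior on lists of dictionaries, with None standing in for null. The helper `natural_join` is hypothetical and is not the library's implementation.

```python
def natural_join(left, right):
    """Model KeySet.join on lists of dicts (hypothetical helper)."""
    if not left or not right:
        return []
    # Join on the columns that appear in both inputs.
    shared = [c for c in left[0] if c in right[0]]
    joined = []
    for l_row in left:
        for r_row in right:
            # Python's == treats None as equal to None, which matches the
            # documented rule that nulls are equal to one another.
            if all(l_row[c] == r_row[c] for c in shared):
                joined.append({**l_row, **r_row})
    return joined


left = [{"A": "a1"}, {"A": None}]
right = [{"A": None, "B": "b1"}, {"A": "a3", "B": "b2"}]
result = natural_join(left, right)
print(result)  # [{'A': None, 'B': 'b1'}]
```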

filter(condition)#

Filters this KeySet using some condition.

This method accepts the same syntax as pyspark.sql.DataFrame.filter(): valid conditions are those that can be used in a WHERE clause in Spark SQL. Examples of valid conditions include:

age < 42
age BETWEEN 17 AND 42
age < 42 OR (age < 60 AND gender IS NULL)
LENGTH(name) > 17
favorite_color IN ('blue', 'red')

Example

>>> domains = {
...     "A": ["a1", "a2"],
...     "B": [0, 1, 2, 3],
... }
>>> keyset = KeySet.from_dict(domains)
>>> filtered_keyset = keyset.filter("B < 2")
>>> filtered_keyset.dataframe().sort("A", "B").toPandas()
    A  B
0  a1  0
1  a1  1
2  a2  0
3  a2  1
>>> import pyspark.sql.functions as sf
>>> filtered_keyset = keyset.filter(sf.col("A") != "a1")
>>> filtered_keyset.dataframe().sort("A", "B").toPandas()
    A  B
0  a2  0
1  a2  1
2  a2  2
3  a2  3

Parameters:: condition (Union[Column, str]) – A SQL expression string or a PySpark Column specifying the filter to apply to the KeySet.
Return type:: KeySet

columns()#

Returns the list of columns used in this KeySet.

Return type:: list[str]

schema()#

Returns the KeySet’s schema.

Return type:: dict[str, ColumnDescriptor]

Example

>>> keys = [
...     ("a1", 0),
...     ("a2", None),
... ]
>>> keyset = KeySet.from_tuples(keys, columns=["A", "B"])
>>> schema = keyset.schema()
>>> schema 
{'A': ColumnDescriptor(column_type=ColumnType.VARCHAR, allow_null=False, allow_nan=False, allow_inf=False),
 'B': ColumnDescriptor(column_type=ColumnType.INTEGER, allow_null=True, allow_nan=False, allow_inf=False)}

dataframe()#

Returns the dataframe associated with this KeySet.

This dataframe contains every combination of values being selected in the KeySet, and its rows are guaranteed to be unique.

Return type:: DataFrame

size()#

Returns the number of groups included in this KeySet.

Note that in some situations this method may need to count the elements in the KeySet’s dataframe, which can be extremely slow.

Return type:: int

cache()#

Caches the KeySet’s dataframe in memory.

Return type:: None

uncache()#

Removes the KeySet’s dataframe from memory and disk.

Return type:: None
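
When a KeySet is reused several times, one way to pair cache() with uncache() is a small context manager. This is a sketch only: the `cached` helper is hypothetical and not part of Analytics, and `RecordingKeySet` is a stand-in that records calls so the example runs without Spark.

```python
from contextlib import contextmanager


@contextmanager
def cached(keyset):
    """Cache a KeySet for the duration of a with-block (hypothetical helper)."""
    keyset.cache()
    try:
        yield keyset
    finally:
        # Release the cached dataframe even if the block raises.
        keyset.uncache()


class RecordingKeySet:
    """Stand-in for KeySet that only records cache()/uncache() calls."""

    def __init__(self):
        self.calls = []

    def cache(self):
        self.calls.append("cache")

    def uncache(self):
        self.calls.append("uncache")


ks = RecordingKeySet()
with cached(ks):
    ks.calls.append("work")  # queries that reuse the KeySet would go here
print(ks.calls)  # ['cache', 'work', 'uncache']
```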

is_equivalent(other)#

Determine if another KeySet is equivalent to this one, if possible.

This method is an alternative to KeySet.__eq__() that is guaranteed to never evaluate the full KeySet dataframe. This makes it cheap to call, but also means that it cannot always determine whether two KeySets are equivalent. If the KeySets are neither definitely equivalent nor easily shown to not be equivalent, this method returns None.

Return type:: Optional[bool]
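
Because the result has three possible values, callers need to handle None explicitly rather than testing it for truthiness. A minimal sketch of one possible pattern follows: use the cheap check first and fall back to full equality only when it is undetermined. The helper `keysets_equal` is hypothetical, and `FakeKeySet` is a stand-in so the example runs without Spark.

```python
def keysets_equal(a, b):
    """Compare two KeySets, preferring the cheap check (hypothetical helper)."""
    quick = a.is_equivalent(b)
    if quick is not None:
        # The cheap check reached a definite answer.
        return quick
    # Undetermined: fall back to full equality, which may be slow.
    return a == b


class FakeKeySet:
    """Stand-in for KeySet with a preset is_equivalent() answer."""

    def __init__(self, rows, quick_answer):
        self.rows = set(rows)
        self.quick_answer = quick_answer  # True, False, or None

    def is_equivalent(self, other):
        return self.quick_answer

    def __eq__(self, other):
        return self.rows == other.rows


# Undetermined quick check: the full comparison decides.
undetermined = keysets_equal(FakeKeySet({1, 2}, None), FakeKeySet({2, 1}, None))
# Definite quick check: the full comparison is never consulted.
definite = keysets_equal(FakeKeySet({1}, False), FakeKeySet({1}, False))
print(undetermined, definite)  # True False
```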

__eq__(other)#

Determine if another KeySet is equal to this one.

Two KeySets are equal if they contain the same values for the same columns; the rows and columns may appear in any order.

Example

>>> ks1 = KeySet.from_dict({"A": [1, 2], "B": [3, 4]})
>>> ks2 = KeySet.from_dict({"B": [3, 4], "A": [1, 2]})
>>> ks3 = KeySet.from_dict({"B": [4, 3], "A": [2, 1]})
>>> ks4 = KeySet.from_dict({"B": [4, 5], "A": [1, 2]})
>>> ks1 == ks2
True
>>> ks1 == ks3
True
>>> ks1 == ks4
False

__hash__()#

Hash the KeySet based on its schema.
