grouped_dataframe#

Grouped DataFrame aware of group keys when performing aggregations.

Classes#

GroupedDataFrame

Grouped DataFrame implementation supporting explicit group keys.

class GroupedDataFrame(dataframe, group_keys)#

Grouped DataFrame implementation supporting explicit group keys.

A GroupedDataFrame object encapsulates the Spark DataFrame to be grouped as well as the group keys. The output of an aggregation on a GroupedDataFrame object is guaranteed to have exactly one row for each group key, unless there are no group keys, in which case it will have a single row.

__init__(dataframe, group_keys)#

Constructor.

Parameters
  • dataframe (pyspark.sql.DataFrame) – DataFrame to perform the groupby on.

  • group_keys (pyspark.sql.DataFrame) – DataFrame where each row corresponds to a group key. Duplicate rows are silently dropped.

property group_keys(self)#

Returns DataFrame containing group keys.

Return type

pyspark.sql.DataFrame

property groupby_columns(self)#

Returns the list of column names used to group.

Return type

List[str]

select(self, columns)#

Returns a new GroupedDataFrame object with specified subset of columns.

Note

columns must contain the groupby columns.

Parameters

columns (List[str]) – List of column names to keep. This must include the groupby columns.

Return type

GroupedDataFrame
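The requirement that the selected columns include the groupby columns can be sketched as a plain-Python validation (an illustrative helper, not part of the actual class):

```python
def check_select(columns, groupby_columns):
    # The selected columns must include every groupby column;
    # otherwise the group structure would be lost.
    missing = set(groupby_columns) - set(columns)
    if missing:
        raise ValueError(f"select() must keep groupby columns: {missing}")
    return columns
```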

agg(self, func, fill_value)#

Applies the given Spark column expression to each group.

The output DataFrame is guaranteed to have exactly one row for each group key. For group keys corresponding to empty groups, the output column will contain the supplied fill_value. The output DataFrame is also sorted by the groupby columns.

Parameters
  • func (pyspark.sql.Column) – Function to apply to each group.

  • fill_value (Any) – Output value for empty groups.

Return type

pyspark.sql.DataFrame
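The fill-and-sort semantics described above can be illustrated with a small pandas analogue (a sketch of the behavior only, not the actual pyspark implementation; the `keys` and `df` frames and the sum aggregation are invented for illustration):

```python
import pandas as pd

# Explicit group keys, including "c", which has no rows in the data.
keys = pd.DataFrame({"key": ["a", "b", "c"]})
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

# Aggregate, then left-join onto the keys so every key yields a row;
# empty groups receive the fill value. Output is sorted by the key.
agg = df.groupby("key", as_index=False)["value"].sum()
result = keys.merge(agg, on="key", how="left").fillna({"value": 0})
result = result.sort_values("key").reset_index(drop=True)
```

The left join onto the key frame is what guarantees one output row per group key even for empty groups, mirroring the fill_value behavior described above.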

apply_in_pandas(self, aggregation_function, aggregation_output_schema)#

Returns the DataFrame obtained by applying the aggregation function to each group.

Each group is passed to aggregation_function as a pandas DataFrame, and the returned pandas DataFrames are stacked into a single Spark DataFrame.

The output DataFrame is guaranteed to have exactly one row for each group key. For group keys corresponding to empty groups, the aggregation function is applied to an empty pandas DataFrame with the expected schema. The output DataFrame is also sorted by the groupby columns.

Parameters
  • aggregation_function (Callable[[pandas.DataFrame], pandas.DataFrame]) – Function to apply to each group. It receives the group as a pandas DataFrame and must return a pandas DataFrame.

  • aggregation_output_schema (pyspark.sql.types.StructType) – Spark schema of the DataFrames returned by the aggregation function.

Return type

pyspark.sql.DataFrame
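The per-group application, including the empty-group case, can be sketched in pandas (a pandas analogue of the semantics, not the real Spark-backed method; the frames and the counting aggregation function are invented for illustration):

```python
import pandas as pd

# Explicit group keys; "c" has no matching rows in the data.
keys = pd.DataFrame({"key": ["a", "b", "c"]})
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

def agg_fn(group: pd.DataFrame) -> pd.DataFrame:
    # One output row per group; an empty group yields count 0.
    return pd.DataFrame({"n": [len(group)]})

pieces = []
for key in keys["key"]:
    # Empty groups are passed as an empty frame with the expected columns.
    group = df[df["key"] == key]
    out = agg_fn(group)
    out.insert(0, "key", key)
    pieces.append(out)

# Stack the per-group results and sort by the groupby column.
result = pd.concat(pieces, ignore_index=True).sort_values("key")
```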

get_groups(self)#

Returns the groups as a dictionary mapping group keys to DataFrames.

Return type

Dict[pyspark.sql.Row, pyspark.sql.DataFrame]
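The key-to-group mapping can be sketched in pandas (an illustrative analogue of the semantics; the real method returns pyspark Rows and DataFrames, and the `keys` and `df` frames here are invented):

```python
import pandas as pd

keys = pd.DataFrame({"key": ["a", "b"]})
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

# Map each group key (as a tuple of key values) to its sub-frame.
groups = {
    tuple(row): df[df["key"] == row["key"]]
    for _, row in keys.iterrows()
}
```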