grouped_dataframe#

Grouped DataFrame aware of group keys when performing aggregations.

Classes#

GroupedDataFrame

Grouped DataFrame implementation supporting explicit group keys.

class GroupedDataFrame(dataframe, group_keys)#

Grouped DataFrame implementation supporting explicit group keys.

A GroupedDataFrame object encapsulates the Spark DataFrame to be grouped as well as the group keys. The output of an aggregation on a GroupedDataFrame object is guaranteed to have exactly one row for each group key, unless there are no group keys, in which case it will have a single row.

__init__(dataframe, group_keys)#

Constructor.

Parameters
  • dataframe (pyspark.sql.DataFrame) – DataFrame to perform the groupby on.

  • group_keys (pyspark.sql.DataFrame) – DataFrame where each row corresponds to a group key. Duplicate rows are silently dropped.

property group_keys(self)#

Returns DataFrame containing group keys.

Return type

pyspark.sql.DataFrame

property groupby_columns(self)#

Returns the list of column names used to group.

Return type

List[str]

select(self, columns)#

Returns a new GroupedDataFrame object with specified subset of columns.

Note

columns must contain the groupby columns.

Parameters

columns (List[str]) – List of column names to keep. This must include the groupby columns.

Return type

GroupedDataFrame
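The requirement that the selected columns include the groupby columns can be sketched as a plain-Python validation (an illustrative helper, not part of the actual class):

```python
def check_select(columns, groupby_columns):
    # The selected columns must include every groupby column;
    # otherwise the group structure would be lost.
    missing = set(groupby_columns) - set(columns)
    if missing:
        raise ValueError(f"select() must keep groupby columns: {missing}")
    return columns
```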

agg(self, func, fill_value)#

Applies the given Spark column expression to each group.

The output DataFrame is guaranteed to have exactly one row for each group key. For group keys corresponding to empty groups, the output column will contain the supplied fill_value. The output DataFrame is also sorted by the groupby columns.

Parameters
  • func (pyspark.sql.Column) – Function to apply to each group.

  • fill_value (Any) – Output value for empty groups.

Return type

pyspark.sql.DataFrame
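The fill-and-sort semantics described above can be illustrated with a small pandas analogue (a sketch of the behavior only, not the actual pyspark implementation; the `keys` and `df` frames and the sum aggregation are invented for illustration):

```python
import pandas as pd

# Explicit group keys, including "c", which has no rows in the data.
keys = pd.DataFrame({"key": ["a", "b", "c"]})
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

# Aggregate, then left-join onto the keys so every key yields a row;
# empty groups receive the fill value. Output is sorted by the key.
agg = df.groupby("key", as_index=False)["value"].sum()
result = keys.merge(agg, on="key", how="left").fillna({"value": 0})
result = result.sort_values("key").reset_index(drop=True)
```

The left join onto the key frame is what guarantees one output row per group key even for empty groups, mirroring the fill_value behavior described above.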

apply_in_pandas(self, aggregation_function, aggregation_output_schema)#

Returns the DataFrame obtained by applying the aggregation function to each group.

Each group is passed to aggregation_function as a pandas DataFrame, and the returned pandas DataFrames are stacked into a single Spark DataFrame.

The output DataFrame is guaranteed to have exactly one row for each group key. For group keys corresponding to empty groups, the aggregation function is applied to an empty pandas DataFrame with the expected schema. The output DataFrame is also sorted by the groupby columns.

Parameters
  • aggregation_function (Callable[[pandas.DataFrame], pandas.DataFrame]) – Function to apply to each group. It receives the group as a pandas DataFrame and must return a pandas DataFrame.

  • aggregation_output_schema (pyspark.sql.types.StructType) – Spark schema of the DataFrames returned by the aggregation function.

Return type

pyspark.sql.DataFrame
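The per-group application, including the empty-group case, can be sketched in pandas (a pandas analogue of the semantics, not the real Spark-backed method; the frames and the counting aggregation function are invented for illustration):

```python
import pandas as pd

# Explicit group keys; "c" has no matching rows in the data.
keys = pd.DataFrame({"key": ["a", "b", "c"]})
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

def agg_fn(group: pd.DataFrame) -> pd.DataFrame:
    # One output row per group; an empty group yields count 0.
    return pd.DataFrame({"n": [len(group)]})

pieces = []
for key in keys["key"]:
    # Empty groups are passed as an empty frame with the expected columns.
    group = df[df["key"] == key]
    out = agg_fn(group)
    out.insert(0, "key", key)
    pieces.append(out)

# Stack the per-group results and sort by the groupby column.
result = pd.concat(pieces, ignore_index=True).sort_values("key")
```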

get_groups(self)#

Returns the groups as a dictionary mapping group keys to DataFrames.

Return type

Dict[pyspark.sql.Row, pyspark.sql.DataFrame]
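The key-to-group mapping can be sketched in pandas (an illustrative analogue of the semantics; the real method returns pyspark Rows and DataFrames, and the `keys` and `df` frames here are invented):

```python
import pandas as pd

keys = pd.DataFrame({"key": ["a", "b"]})
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

# Map each group key (as a tuple of key values) to its sub-frame.
groups = {
    tuple(row): df[df["key"] == row["key"]]
    for _, row in keys.iterrows()
}
```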