grouped_dataframe#
Grouped DataFrame aware of group keys when performing aggregations.
Classes#
GroupedDataFrame: Grouped DataFrame implementation supporting explicit group keys.
- class GroupedDataFrame(dataframe, group_keys)#
Grouped DataFrame implementation supporting explicit group keys.
A GroupedDataFrame object encapsulates the Spark DataFrame to be grouped, along with the group keys. The output of an aggregation on a GroupedDataFrame object is guaranteed to have exactly one row for each group key, unless there are no group keys, in which case it will have a single row.
- Parameters:
dataframe (pyspark.sql.DataFrame) –
group_keys (pyspark.sql.DataFrame) –
- property group_keys: pyspark.sql.DataFrame#
Returns DataFrame containing group keys.
- Return type:
pyspark.sql.DataFrame
- property groupby_columns: List[str]#
Returns the list of column names used for grouping.
- Return type:
List[str]
- __init__(dataframe, group_keys)#
Constructor.
- select(columns)#
Returns a new GroupedDataFrame object with specified subset of columns.
Note
columns must contain the groupby columns.
- Parameters:
columns (List[str]) – List of column names to keep. This must include the groupby columns.
- Return type:
GroupedDataFrame
- agg(func, fill_value)#
Applies given spark function (column expression) to each group.
The output DataFrame is guaranteed to have exactly one row for each group key. For group keys corresponding to empty groups, the output column will contain the supplied fill_value. The output DataFrame is also sorted by the groupby columns.
- Parameters:
func (pyspark.sql.Column) – Function to apply to each group.
fill_value (Any) – Output value for empty groups.
- Return type:
pyspark.sql.DataFrame
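The empty-group guarantee above is the key difference from a plain groupby-aggregate. A pure-pandas sketch of these semantics (not the library's implementation; `agg_sketch` is a hypothetical name introduced here for illustration):

```python
import pandas as pd

def agg_sketch(df, group_keys, column, func, fill_value):
    """Sketch of agg(): one output row per group key, fill_value for empty groups."""
    key_cols = list(group_keys.columns)
    # Aggregate only the groups that actually appear in the data.
    present = df.groupby(key_cols)[column].agg(func).reset_index()
    # Left-join onto the full key set so keys with no matching rows still get a row.
    out = group_keys.merge(present, on=key_cols, how="left")
    out[column] = out[column].fillna(fill_value)
    # The output is sorted by the groupby columns.
    return out.sort_values(key_cols).reset_index(drop=True)

group_keys = pd.DataFrame({"city": ["A", "B", "C"]})
data = pd.DataFrame({"city": ["A", "A", "B"], "n": [1, 2, 5]})
result = agg_sketch(data, group_keys, "n", "sum", fill_value=0)
# One row per key: A -> 3, B -> 5, and the empty group C -> 0
```

Note how "C" appears in the output even though no data row has that key; an ordinary `groupby` would silently drop it.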
- apply_in_pandas(aggregation_function, aggregation_output_schema)#
Returns DataFrame obtained by applying aggregation function to each group.
Each group is passed to the aggregation_function as a pandas DataFrame and the returned pandas DataFrames are stacked into a single Spark DataFrame.
The output DataFrame is guaranteed to have exactly one row for each group key. For group keys corresponding to empty groups, the aggregation function is applied to an empty pandas DataFrame with the expected schema. The output DataFrame is also sorted by the groupby columns.
- Parameters:
aggregation_function (Callable[[pandas.DataFrame], pandas.DataFrame]) – Aggregation function to be applied to each group.
aggregation_output_schema (pyspark.sql.types.StructType) – Expected Spark schema for the output of the aggregation function.
- Return type:
pyspark.sql.DataFrame
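The behavior described above, including handing an empty pandas DataFrame to the function for keys with no rows, can be sketched in pure pandas (a sketch of the documented semantics, not the library's implementation; `apply_in_pandas_sketch` is a hypothetical name):

```python
import pandas as pd

def apply_in_pandas_sketch(df, group_keys, aggregation_function):
    """Sketch of apply_in_pandas(): call the function once per group key."""
    key_cols = list(group_keys.columns)
    pieces = []
    for _, key_row in group_keys.iterrows():
        mask = (df[key_cols] == key_row[key_cols]).all(axis=1)
        group = df[mask]  # an empty DataFrame (with the right columns) for empty groups
        result = aggregation_function(group)
        for col in key_cols:  # re-attach the key columns to the function's output
            result[col] = key_row[col]
        pieces.append(result)
    # Stack the per-group results and sort by the groupby columns.
    return pd.concat(pieces, ignore_index=True).sort_values(key_cols).reset_index(drop=True)

group_keys = pd.DataFrame({"city": ["A", "B", "C"]})
data = pd.DataFrame({"city": ["A", "A", "B"], "n": [1, 2, 5]})
result = apply_in_pandas_sketch(
    data, group_keys, lambda g: pd.DataFrame({"total": [g["n"].sum()]})
)
# Exactly one row per key; the function sees an empty frame for "C", so total == 0
```

Because the function must also handle the empty-group case, aggregation functions passed here should not assume at least one input row.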
- get_groups()#
Returns the groups as a dictionary of DataFrames.
- Return type:
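A pandas sketch of what such a grouping could look like, keyed here by a tuple of key values (an assumption for illustration; the library may key the dictionary differently, and `get_groups_sketch` is a hypothetical name):

```python
import pandas as pd

def get_groups_sketch(df, group_keys):
    """Sketch of get_groups(): one DataFrame per group key, empty ones included."""
    key_cols = list(group_keys.columns)
    groups = {}
    for _, key_row in group_keys.iterrows():
        mask = (df[key_cols] == key_row[key_cols]).all(axis=1)
        # Keys with no matching rows map to an empty DataFrame with the same columns.
        groups[tuple(key_row[key_cols])] = df[mask].reset_index(drop=True)
    return groups

group_keys = pd.DataFrame({"city": ["A", "B", "C"]})
data = pd.DataFrame({"city": ["A", "A", "B"], "n": [1, 2, 5]})
groups = get_groups_sketch(data, group_keys)
# Three entries, one per key; the "C" entry is an empty DataFrame
```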