truncation#
Functions for truncating Spark DataFrames.
Functions#
- truncate_large_groups: Order rows by a hash function and keep at most threshold rows for each group.
- drop_large_groups: Drop all rows for groups that have more than threshold rows.
- limit_keys_per_group: Order keys by a hash function and keep at most threshold keys for each group.
- truncate_large_groups(df, grouping_columns, threshold)#
Order rows by a hash function and keep at most threshold rows for each group. See the implementation sketch after this entry.

Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b3
>>> print_sdf(truncate_large_groups(spark_dataframe, ["A"], 3))
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b3
>>> print_sdf(truncate_large_groups(spark_dataframe, ["A"], 2))
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b3
>>> print_sdf(truncate_large_groups(spark_dataframe, ["A"], 1))
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
- Parameters:
df (pyspark.sql.DataFrame) – DataFrame to truncate.
grouping_columns (List[str]) – Columns defining the groups.
threshold (int) – Maximum number of rows to include for each group.
- Return type:
pyspark.sql.DataFrame
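To make the behavior concrete, here is a minimal sketch of how this kind of hash-ordered truncation could be written with standard PySpark window functions. The helper name truncate_large_groups_sketch, the _rank column, and the choice of hashing all columns are illustrative assumptions, not the library's actual implementation.

# Hedged sketch of hash-ordered group truncation (assumes PySpark 3.x).
from typing import List

from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F

def truncate_large_groups_sketch(
    df: DataFrame, grouping_columns: List[str], threshold: int
) -> DataFrame:
    """Keep at most `threshold` rows per group, ordered by a row hash."""
    # Rank rows within each group by a hash of the full row so the choice
    # of surviving rows is deterministic but independent of input order.
    order = Window.partitionBy(*grouping_columns).orderBy(F.hash(*df.columns))
    return (
        df.withColumn("_rank", F.row_number().over(order))
        .filter(F.col("_rank") <= threshold)
        .drop("_rank")
    )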
- drop_large_groups(df, grouping_columns, threshold)#
Drop all rows for groups that have more than threshold rows. See the implementation sketch after this entry.

Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b3
>>> print_sdf(drop_large_groups(spark_dataframe, ["A"], 3))
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b3
>>> print_sdf(drop_large_groups(spark_dataframe, ["A"], 2))
    A   B
0  a1  b1
1  a2  b1
>>> print_sdf(drop_large_groups(spark_dataframe, ["A"], 1))
    A   B
0  a1  b1
1  a2  b1
- Parameters:
df (pyspark.sql.DataFrame) – DataFrame to truncate.
grouping_columns (List[str]) – Columns defining the groups.
threshold (int) – Threshold for dropping groups. If more than threshold rows belong to the same group, all rows in that group are dropped.
- Return type:
pyspark.sql.DataFrame
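As with the previous function, the following is a hedged sketch of how dropping over-threshold groups could be expressed with a PySpark window aggregate. The helper name and the _group_size column are illustrative assumptions, not the library's actual implementation.

# Hedged sketch of dropping groups larger than the threshold (assumes PySpark 3.x).
from typing import List

from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F

def drop_large_groups_sketch(
    df: DataFrame, grouping_columns: List[str], threshold: int
) -> DataFrame:
    """Drop every row of any group containing more than `threshold` rows."""
    # Attach each group's size to its rows, then keep only small-enough groups.
    group = Window.partitionBy(*grouping_columns)
    return (
        df.withColumn("_group_size", F.count(F.lit(1)).over(group))
        .filter(F.col("_group_size") <= threshold)
        .drop("_group_size")
    )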
- limit_keys_per_group(df, grouping_columns, key_columns, threshold)#
Order keys by a hash function and keep at most threshold keys for each group. See the implementation sketch after this entry.

Note

After truncation there may still be an unbounded number of rows per key, but at most threshold keys per group.

Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b3
5  a4  b1
6  a4  b2
7  a4  b3
>>> print_sdf(
...     limit_keys_per_group(
...         df=spark_dataframe,
...         grouping_columns=["A"],
...         key_columns=["B"],
...         threshold=2,
...     )
... )
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b3
5  a4  b2
6  a4  b3
>>> print_sdf(
...     limit_keys_per_group(
...         df=spark_dataframe,
...         grouping_columns=["A"],
...         key_columns=["B"],
...         threshold=1,
...     )
... )
    A   B
0  a1  b1
1  a2  b1
2  a3  b3
3  a4  b3
- Parameters:
df (pyspark.sql.DataFrame) – DataFrame to truncate.
grouping_columns (List[str]) – Columns defining the groups.
key_columns (List[str]) – Columns defining the keys.
threshold (int) – Maximum number of keys to include for each group.
- Return type:
pyspark.sql.DataFrame
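Finally, a minimal sketch of how per-group key limiting could be implemented: rank the distinct keys within each group by a hash of the key columns and keep only rows whose key ranks at or below the threshold. The helper name and the _key_rank column are illustrative assumptions, not the library's actual implementation.

# Hedged sketch of limiting distinct keys per group (assumes PySpark 3.x).
from typing import List

from pyspark.sql import DataFrame, Window
from pyspark.sql import functions as F

def limit_keys_per_group_sketch(
    df: DataFrame,
    grouping_columns: List[str],
    key_columns: List[str],
    threshold: int,
) -> DataFrame:
    """Keep at most `threshold` distinct keys per group; rows per kept key stay unbounded."""
    # dense_rank gives every row sharing a key the same rank; ordering by the
    # key hash (with the key columns as a tie-breaker) decides which keys survive.
    order = Window.partitionBy(*grouping_columns).orderBy(
        F.hash(*key_columns), *key_columns
    )
    return (
        df.withColumn("_key_rank", F.dense_rank().over(order))
        .filter(F.col("_key_rank") <= threshold)
        .drop("_key_rank")
    )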