groupby#

Transformations for performing groupby on Spark DataFrames.

Functions#

create_groupby_from_column_domains()

Returns GroupBy transformation with Cartesian product of column domains as keys.

create_groupby_from_list_of_keys()

Returns a GroupBy transformation using user-supplied list of group keys.

compute_full_domain_df()

Returns a DataFrame containing the Cartesian product of given column domains.

create_groupby_from_column_domains(input_domain, input_metric, use_l2, column_domains)#

Returns GroupBy transformation with Cartesian product of column domains as keys.

Example
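The doctests on this page assume the following setup (a minimal sketch: the import paths follow tmlt.core's module layout, and print_sdf is assumed to be a small documentation helper that prints a DataFrame as sorted pandas output):

>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import count
>>> from tmlt.core.domains.spark_domains import (
...     SparkDataFrameDomain,
...     SparkStringColumnDescriptor,
... )
>>> from tmlt.core.metrics import SymmetricDifference
>>> from tmlt.core.transformations.spark_transformations.groupby import (
...     GroupBy,
...     compute_full_domain_df,
...     create_groupby_from_column_domains,
...     create_groupby_from_list_of_keys,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_dataframe = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "A": ["a1", "a2", "a3", "a3"],
...             "B": ["b1", "b1", "b2", "b2"],
...             "C": ["c1", "c2", "c1", "c1"],
...         }
...     )
... )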

>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B   C
0  a1  b1  c1
1  a2  b1  c2
2  a3  b2  c1
3  a3  b2  c1
>>> groupby_B_C = create_groupby_from_column_domains(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...             "C": SparkStringColumnDescriptor(),
...         }
...     ),
...     input_metric=SymmetricDifference(),
...     use_l2=False,
...     column_domains={
...         "B": ["b1", "b2"],
...         "C": ["c1", "c2"],
...     }
... )
>>> # Apply transformation to data
>>> grouped_dataframe = groupby_B_C(spark_dataframe)
>>> groups_df = grouped_dataframe.agg(count("*").alias("count"), fill_value=0)
>>> print(groups_df.toPandas().sort_values(["B", "C"], ignore_index=True))
    B   C  count
0  b1  c1      1
1  b1  c2      1
2  b2  c1      2
3  b2  c2      0
>>> # Note that the group key ("b2", "c2") does not appear in the DataFrame
>>> # but appears in the aggregation output with the given fill value.
Parameters

  • input_domain (SparkDataFrameDomain) – Domain of input DataFrames.

  • input_metric – Distance metric on input DataFrames.

  • use_l2 (bool) – Whether the output metric uses RootSumOfSquared instead of SumOf.

  • column_domains – Mapping from column name to the list of values that column can take; the group keys are the Cartesian product of these lists.

Return type

GroupBy

Note

column_domains must be public.

create_groupby_from_list_of_keys(input_domain, input_metric, use_l2, groupby_columns, keys)#

Returns a GroupBy transformation using user-supplied list of group keys.

Example

>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B   C
0  a1  b1  c1
1  a2  b1  c2
2  a3  b2  c1
3  a3  b2  c1
>>> groupby_B_C = create_groupby_from_list_of_keys(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...             "C": SparkStringColumnDescriptor(),
...         }
...     ),
...     input_metric=SymmetricDifference(),
...     use_l2=False,
...     groupby_columns=["B", "C"],
...     keys=[("b1", "c1"), ("b2", "c2")]
... )
>>> # Apply transformation to data
>>> grouped_dataframe = groupby_B_C(spark_dataframe)
>>> groups_df = grouped_dataframe.agg(count("*").alias("count"), fill_value=0)
>>> print(groups_df.toPandas().sort_values(["B", "C"], ignore_index=True))
    B   C  count
0  b1  c1      1
1  b2  c2      0
>>> # Note that there is no record corresponding to the key ("b1", "c2")
>>> # since we did not specify this key while constructing the GroupBy even
>>> # though this key appears in the input DataFrame.
Parameters

  • input_domain (SparkDataFrameDomain) – Domain of input DataFrames.

  • input_metric – Distance metric on input DataFrames.

  • use_l2 (bool) – Whether the output metric uses RootSumOfSquared instead of SumOf.

  • groupby_columns (List[str]) – Columns to group by.

  • keys – List of tuples of group key values, ordered to match groupby_columns.

Return type

GroupBy

Note

keys must be a public list of tuples with no duplicates.

compute_full_domain_df(column_domains)#

Returns a DataFrame containing the Cartesian product of given column domains.

Parameters

column_domains (Mapping[str, Union[List[str], List[Optional[str]], List[int], List[Optional[int]], List[datetime.date], List[Optional[datetime.date]]]]) – Mapping from column name to the list of values that column can take.

Return type

pyspark.sql.DataFrame
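
For illustration, a minimal sketch of this helper's behavior, assuming the setup above (row order in the returned DataFrame is not guaranteed, hence the sort):

>>> full_domain_df = compute_full_domain_df(
...     column_domains={"B": ["b1", "b2"], "C": ["c1", "c2"]}
... )
>>> print(full_domain_df.toPandas().sort_values(["B", "C"], ignore_index=True))
    B   C
0  b1  c1
1  b1  c2
2  b2  c1
3  b2  c2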

Classes#

GroupBy

Groups a Spark DataFrame by given group keys.

class GroupBy(input_domain, input_metric, use_l2, group_keys)#

Bases: tmlt.core.transformations.base.Transformation

Groups a Spark DataFrame by given group keys.

Example
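This example groups a two-column DataFrame; it assumes the same setup as above, but with:

>>> spark_dataframe = spark.createDataFrame(
...     pd.DataFrame({"A": ["a1", "a2", "a3", "a3"], "B": ["b1", "b1", "b2", "b2"]})
... )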

>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
>>> groupby_B = GroupBy(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     input_metric=SymmetricDifference(),
...     use_l2=False,
...     group_keys=spark.createDataFrame(
...         pd.DataFrame(
...             {
...                 "B":["b1", "b2"]
...             }
...         )
...     )
... )
>>> # Apply transformation to data
>>> grouped_dataframe = groupby_B(spark_dataframe)
>>> counts_df = grouped_dataframe.agg(count("*").alias("count"), fill_value=0)
>>> print(counts_df.sort("B").toPandas())
    B  count
0  b1      2
1  b2      2
Transformation Contract:
>>> groupby_B.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> groupby_B.output_domain
SparkGroupedDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)}, groupby_columns=['B'])
>>> groupby_B.input_metric
SymmetricDifference()
>>> groupby_B.output_metric
SumOf(inner_metric=SymmetricDifference())
Stability Guarantee:

GroupBy's stability_function() returns d_in if the input_metric is SymmetricDifference or IfGroupedBy; otherwise, it returns 2 * d_in.

>>> groupby_B.stability_function(1)
1
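
The factor of 2 in the other case reflects that, under HammingDistance, changing a single row can move it between groups and therefore perturb two group counts at once. A sketch of this case, assuming HammingDistance (from tmlt.core.metrics) is an accepted input metric here:

>>> from tmlt.core.metrics import HammingDistance
>>> groupby_B_hamming = GroupBy(
...     input_domain=groupby_B.input_domain,
...     input_metric=HammingDistance(),
...     use_l2=False,
...     group_keys=groupby_B.group_keys,
... )
>>> groupby_B_hamming.stability_function(1)
2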
Methods#

use_l2()

Returns whether the output metric will use RootSumOfSquared.

group_keys()

Returns DataFrame containing group keys.

groupby_columns()

Returns list of columns to groupby.

stability_function()

Returns the smallest d_out satisfied by the transformation.

__call__()

Performs groupby.

input_domain()

Return input domain for the transformation.

input_metric()

Distance metric on input domain.

output_domain()

Return output domain for the transformation.

output_metric()

Distance metric on output domain.

stability_relation()

Returns True only if close inputs produce close outputs.

__or__()

Return this transformation chained with another component.

__init__(input_domain, input_metric, use_l2, group_keys)#

Constructor.

Parameters

  • input_domain (SparkDataFrameDomain) – Domain of input DataFrames.

  • input_metric – Distance metric on input DataFrames.

  • use_l2 (bool) – Whether the output metric uses RootSumOfSquared instead of SumOf.

  • group_keys (pyspark.sql.DataFrame) – DataFrame whose rows are the group keys and whose columns are the columns to group by.

Note

group_keys must be public.

property use_l2#

Returns whether the output metric will use RootSumOfSquared.

Return type

bool

property group_keys#

Returns DataFrame containing group keys.

Return type

pyspark.sql.DataFrame

property groupby_columns#

Returns list of columns to groupby.

Return type

List[str]

stability_function(d_in)#

Returns the smallest d_out satisfied by the transformation.

Parameters

d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.

Return type

tmlt.core.utils.exact_number.ExactNumber
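
Continuing the GroupBy example above: with SymmetricDifference as the input metric, the stability function is the identity, e.g.:

>>> groupby_B.stability_function(2)
2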

__call__(sdf)#

Performs groupby.

Parameters

sdf (pyspark.sql.DataFrame) – The Spark DataFrame to group.

Return type

tmlt.core.utils.grouped_dataframe.GroupedDataFrame

property input_domain#

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric#

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain#

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric#

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)#

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool

__or__(other: Transformation) → Transformation#
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.