truncation
Transformations for truncating Spark DataFrames.
Classes
LimitRowsPerGroup | Keep at most k rows per group.
LimitKeysPerGroup | Keep at most k keys per group.
LimitRowsPerKeyPerGroup | For each group, limit k rows per key.
- class LimitRowsPerGroup(input_domain, output_metric, grouping_column, threshold)
Bases: tmlt.core.transformations.base.Transformation
Keep at most k rows per group.
See truncate_large_groups() for more information about truncation.
Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b2
5  a4  b1
6  a4  b2
7  a4  b3
8  a4  b4
>>> truncate = LimitRowsPerGroup(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     output_metric=SymmetricDifference(),
...     grouping_column="A",
...     threshold=2,
... )
>>> # Apply transformation to data
>>> truncated_spark_dataframe = truncate(spark_dataframe)
>>> print_sdf(truncated_spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a4  b3
5  a4  b4
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain (matches input domain)
Input metric - IfGroupedBy on the grouping column, with inner metric SymmetricDifference
Output metric - SymmetricDifference or IfGroupedBy on the grouping column, with inner metric SymmetricDifference
>>> truncate.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.input_metric
IfGroupedBy(column='A', inner_metric=SymmetricDifference())
>>> truncate.output_metric
SymmetricDifference()
- Stability Guarantee:
LimitRowsPerGroup's stability_function() returns threshold * d_in if output_metric is SymmetricDifference() and d_in otherwise.
>>> truncate.stability_function(1)
2
>>> truncate.stability_function(2)
4
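To make the threshold * d_in guarantee concrete, here is a minimal pure-Python sketch; it is not the library's implementation. The helper names `toy_limit_rows_per_group` and `symmetric_difference` are hypothetical, rows are modelled as `(A, B)` tuples, and the sketch keeps the first `threshold` rows of each group in sorted order. That selection rule is an assumption: the Spark example above keeps a different subset of group a4, so the library evidently chooses rows differently.

```python
from collections import Counter, defaultdict


def toy_limit_rows_per_group(rows, threshold):
    """Keep at most `threshold` rows for each value of the grouping column.

    Rows are (group, value) tuples. Which rows survive is an assumption of
    this sketch: the first `threshold` rows of each group in sorted order.
    """
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append((group, value))
    kept = []
    for group in sorted(groups):
        kept.extend(sorted(groups[group])[:threshold])
    return kept


def symmetric_difference(left, right):
    """Size of the multiset symmetric difference between two row lists."""
    a, b = Counter(left), Counter(right)
    return sum(((a - b) + (b - a)).values())


rows = [
    ("a1", "b1"), ("a2", "b1"),
    ("a3", "b2"), ("a3", "b2"), ("a3", "b2"),
    ("a4", "b1"), ("a4", "b2"), ("a4", "b3"), ("a4", "b4"),
]
# A neighbouring input with one whole group removed, i.e. d_in = 1 when
# distance is counted in groups of column A.
neighbour = [row for row in rows if row[0] != "a4"]

threshold = 2
out_rows = toy_limit_rows_per_group(rows, threshold)
out_neighbour = toy_limit_rows_per_group(neighbour, threshold)

# Removing one group changes at most `threshold` rows of the output.
d_out = symmetric_difference(out_rows, out_neighbour)
print(len(out_rows), d_out)  # 6 2
```

Because every group contributes at most `threshold` rows after truncation, adding or removing one group can move the output by at most `threshold` rows, which is where the factor in the guarantee comes from.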
- Parameters:
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain)
output_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.IfGroupedBy])
grouping_column (str)
threshold (int)
- property threshold: int
Returns the maximum number of rows per group after truncation.
- Return type:
int
- property input_domain: tmlt.core.domains.base.Domain
Return input domain for the transformation.
- Return type:
tmlt.core.domains.base.Domain
- property input_metric: tmlt.core.metrics.Metric
Distance metric on input domain.
- Return type:
tmlt.core.metrics.Metric
- property output_domain: tmlt.core.domains.base.Domain
Return output domain for the transformation.
- Return type:
tmlt.core.domains.base.Domain
- property output_metric: tmlt.core.metrics.Metric
Distance metric on output domain.
- Return type:
tmlt.core.metrics.Metric
- __init__(input_domain, output_metric, grouping_column, threshold)
Constructor.
- Parameters:
input_domain (SparkDataFrameDomain) – Domain of input DataFrame.
output_metric (Union[SymmetricDifference, IfGroupedBy]) – Distance metric for output DataFrames. This should be SymmetricDifference() or IfGroupedBy(grouping_column, SymmetricDifference()).
grouping_column (str) – Name of column defining the groups to truncate.
threshold (int) – The maximum number of rows per group after truncation.
- stability_function(d_in)
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters:
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- __call__(sdf)
Returns a truncated dataframe.
- Parameters:
sdf (pyspark.sql.DataFrame)
- Return type:
pyspark.sql.DataFrame
- stability_relation(d_in, d_out)
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters:
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type:
bool
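One way to read the relationship between the two stability methods is sketched below. It assumes, for illustration only, that the relation holds exactly when `d_out` is at least the value returned by `stability_function(d_in)`, and it uses plain integers where the library accepts exact-number inputs. `ToyTruncation` is a hypothetical stand-in, not a class from tmlt.core.

```python
class ToyTruncation:
    """Hypothetical stand-in mirroring LimitRowsPerGroup's stability methods."""

    def __init__(self, threshold, output_is_symmetric_difference=True):
        self.threshold = threshold
        self.output_is_symmetric_difference = output_is_symmetric_difference

    def stability_function(self, d_in):
        """Smallest d_out guaranteed for inputs at distance d_in."""
        if self.output_is_symmetric_difference:
            return self.threshold * d_in
        return d_in

    def stability_relation(self, d_in, d_out):
        """True only if d_out is at least the smallest guaranteed d_out.

        Assumption of this sketch: the relation is derived directly from
        the stability function.
        """
        return self.stability_function(d_in) <= d_out


toy = ToyTruncation(threshold=2)
print(toy.stability_function(1))     # 2
print(toy.stability_relation(1, 2))  # True
print(toy.stability_relation(1, 1))  # False
```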
- __or__(other: Transformation) → Transformation
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
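The chaining behaviour can be pictured with a toy model. The sketch below is not tmlt.core code: `ToyComponent` is hypothetical, the stability numbers are placeholders, and it assumes that chaining with `|` applies the left component first and composes the stability functions in the same order.

```python
class ToyComponent:
    """Hypothetical component with a call behaviour and a stability function."""

    def __init__(self, apply, stability):
        self._apply = apply
        self._stability = stability

    def __call__(self, data):
        return self._apply(data)

    def stability_function(self, d_in):
        return self._stability(d_in)

    def __or__(self, other):
        # Chain: run self first, then pass its output (and its d_out) to other.
        return ToyComponent(
            apply=lambda data: other(self(data)),
            stability=lambda d_in: other.stability_function(
                self.stability_function(d_in)
            ),
        )


# Two toy steps; the factor 2 in each stability function is a placeholder.
keep_two = ToyComponent(lambda rows: rows[:2], lambda d_in: 2 * d_in)
duplicate = ToyComponent(lambda rows: rows + rows, lambda d_in: 2 * d_in)

chained = keep_two | duplicate
print(chained(["r1", "r2", "r3"]))    # ['r1', 'r2', 'r1', 'r2']
print(chained.stability_function(1))  # 4
```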
- class LimitKeysPerGroup(input_domain, output_metric, grouping_column, key_column, threshold)
Bases: tmlt.core.transformations.base.Transformation
Keep at most k keys per group.
See limit_keys_per_group() for more information about truncation.
Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b2
5  a4  b1
6  a4  b2
7  a4  b3
8  a4  b4
>>> truncate = LimitKeysPerGroup(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     output_metric=IfGroupedBy("B", SumOf(IfGroupedBy("A", SymmetricDifference()))),
...     grouping_column="A",
...     key_column="B",
...     threshold=2,
... )
>>> # Apply transformation to data
>>> truncated_spark_dataframe = truncate(spark_dataframe)
>>> print_sdf(truncated_spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b2
5  a4  b3
6  a4  b4
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain (matches input domain)
Input metric - IfGroupedBy on the grouping column, with inner metric SymmetricDifference
Output metric - IfGroupedBy on the grouping column, with inner metric SymmetricDifference, or IfGroupedBy on the key column, with inner metric as a SumOf or RootSumOfSquared over an IfGroupedBy on the grouping column, with inner metric SymmetricDifference
>>> truncate.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.input_metric
IfGroupedBy(column='A', inner_metric=SymmetricDifference())
>>> truncate.output_metric
IfGroupedBy(column='B', inner_metric=SumOf(inner_metric=IfGroupedBy(column='A', inner_metric=SymmetricDifference())))
- Stability Guarantee:
LimitKeysPerGroup's stability_function() returns d_in if output_metric is IfGroupedBy(grouping_column, SymmetricDifference()), sqrt(threshold) * d_in if output_metric is IfGroupedBy(key_column, RootSumOfSquared(IfGroupedBy(grouping_column, SymmetricDifference()))), and threshold * d_in otherwise.
>>> truncate.stability_function(1)
2
>>> truncate.stability_function(2)
4
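The three cases of this guarantee can be written out as a small function. This is a sketch under assumptions, not the library's code: the output metric is passed as a hypothetical string tag rather than a metric object, and `math.sqrt` returns a float where the library works with exact numbers. The companion helper `toy_limit_keys_per_group` (also hypothetical) illustrates the truncation itself on `(A, B)` tuples and keeps the first `threshold` keys of each group in sorted order, which is an assumed selection rule.

```python
import math
from collections import defaultdict


def toy_keys_stability(output_metric_kind, threshold, d_in):
    """Stability of keys-per-group truncation for each output metric case.

    `output_metric_kind` is a hypothetical tag:
      "grouping"     -> IfGroupedBy(grouping_column, SymmetricDifference())
      "key_root_sum" -> IfGroupedBy(key_column, RootSumOfSquared(...))
      "key_sum"      -> IfGroupedBy(key_column, SumOf(...))
    """
    if output_metric_kind == "grouping":
        return d_in
    if output_metric_kind == "key_root_sum":
        return math.sqrt(threshold) * d_in
    return threshold * d_in


def toy_limit_keys_per_group(rows, threshold):
    """Keep every row whose key is among the first `threshold` keys of its group."""
    keys = defaultdict(set)
    for group, key in rows:
        keys[group].add(key)
    allowed = {group: set(sorted(ks)[:threshold]) for group, ks in keys.items()}
    return [(group, key) for group, key in rows if key in allowed[group]]


rows = [
    ("a1", "b1"), ("a2", "b1"),
    ("a3", "b2"), ("a3", "b2"), ("a3", "b2"),
    ("a4", "b1"), ("a4", "b2"), ("a4", "b3"), ("a4", "b4"),
]
kept = toy_limit_keys_per_group(rows, threshold=2)
# All three a3 rows survive (one key); a4 keeps two of its four keys.
print(len(kept))                                 # 7
print(toy_keys_stability("key_sum", 2, 1))       # 2
print(toy_keys_stability("grouping", 2, 1))      # 1
print(toy_keys_stability("key_root_sum", 4, 1))  # 2.0
```

Note that the limit applies to distinct keys, not rows: group a3 keeps all three of its rows because they share a single key.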
- Parameters:
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain)
output_metric (tmlt.core.metrics.IfGroupedBy)
grouping_column (str)
key_column (str)
threshold (int)
- property threshold: int
Returns the maximum number of keys per group after truncation.
- Return type:
int
- property input_domain: tmlt.core.domains.base.Domain
Return input domain for the transformation.
- Return type:
tmlt.core.domains.base.Domain
- property input_metric: tmlt.core.metrics.Metric
Distance metric on input domain.
- Return type:
tmlt.core.metrics.Metric
- property output_domain: tmlt.core.domains.base.Domain
Return output domain for the transformation.
- Return type:
tmlt.core.domains.base.Domain
- property output_metric: tmlt.core.metrics.Metric
Distance metric on output domain.
- Return type:
tmlt.core.metrics.Metric
- __init__(input_domain, output_metric, grouping_column, key_column, threshold)
Constructor.
- Parameters:
input_domain (SparkDataFrameDomain) – Domain of input DataFrame.
output_metric (IfGroupedBy) – Distance metric for output DataFrames. This should be IfGroupedBy(key_column, SumOf(IfGroupedBy(grouping_column, SymmetricDifference()))) or IfGroupedBy(key_column, RootSumOfSquared(IfGroupedBy(grouping_column, SymmetricDifference()))) or IfGroupedBy(grouping_column, SymmetricDifference()).
grouping_column (str) – Name of column defining the groups to truncate.
key_column (str) – Name of column defining the keys.
threshold (int) – The maximum number of keys per group after truncation.
- stability_function(d_in)
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters:
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- __call__(sdf)
Returns a truncated dataframe.
- Parameters:
sdf (pyspark.sql.DataFrame)
- Return type:
pyspark.sql.DataFrame
- stability_relation(d_in, d_out)
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters:
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type:
bool
- __or__(other: Transformation) → Transformation
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
- class LimitRowsPerKeyPerGroup(input_domain, input_metric, grouping_column, key_column, threshold)
Bases: tmlt.core.transformations.base.Transformation
For each group, limit k rows per key.
See truncate_large_groups() for more information about truncation.
Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b2
5  a4  b1
6  a4  b2
7  a4  b3
8  a4  b4
>>> truncate = LimitRowsPerKeyPerGroup(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     input_metric=IfGroupedBy("B", SumOf(IfGroupedBy("A", SymmetricDifference()))),
...     grouping_column="A",
...     key_column="B",
...     threshold=2,
... )
>>> # Apply transformation to data
>>> truncated_spark_dataframe = truncate(spark_dataframe)
>>> print_sdf(truncated_spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a4  b1
5  a4  b2
6  a4  b3
7  a4  b4
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain (matches input domain)
Input metric - IfGroupedBy on the grouping column, with inner metric SymmetricDifference, or IfGroupedBy on the key column, with inner metric as a SumOf or RootSumOfSquared over an IfGroupedBy on the grouping column, with inner metric SymmetricDifference
Output metric - SymmetricDifference or IfGroupedBy on the key column, with inner metric as a RootSumOfSquared, with inner metric SymmetricDifference or IfGroupedBy on the grouping column, with inner metric SymmetricDifference
>>> truncate.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.input_metric
IfGroupedBy(column='B', inner_metric=SumOf(inner_metric=IfGroupedBy(column='A', inner_metric=SymmetricDifference())))
>>> truncate.output_metric
SymmetricDifference()
- Stability Guarantee:
LimitRowsPerKeyPerGroup's stability_function() returns d_in if input_metric is IfGroupedBy(grouping_column, SymmetricDifference()) and threshold * d_in otherwise.
>>> truncate.stability_function(1)
2
>>> truncate.stability_function(2)
4
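The per-key row limit can be illustrated in plain Python. The sketch below is not the library's implementation: `toy_limit_rows_per_key_per_group` is a hypothetical helper that models rows as `(A, B)` tuples and keeps the first `threshold` rows it encounters for each (group, key) pair, an assumed selection rule.

```python
from collections import Counter


def toy_limit_rows_per_key_per_group(rows, threshold):
    """Keep at most `threshold` rows for each (group, key) pair."""
    seen = Counter()
    kept = []
    for group, key in rows:
        if seen[(group, key)] < threshold:
            seen[(group, key)] += 1
            kept.append((group, key))
    return kept


rows = [
    ("a1", "b1"), ("a2", "b1"),
    ("a3", "b2"), ("a3", "b2"), ("a3", "b2"),
    ("a4", "b1"), ("a4", "b2"), ("a4", "b3"), ("a4", "b4"),
]
kept = toy_limit_rows_per_key_per_group(rows, threshold=2)
# Only the (a3, b2) pair exceeds the limit, so exactly one row is dropped.
print(kept.count(("a3", "b2")))  # 2
print(len(kept))                 # 8
```

Unlike the keys-per-group limit, group a4 is untouched here: it has four keys, but each (group, key) pair appears only once.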
- Parameters:
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain)
input_metric (tmlt.core.metrics.IfGroupedBy)
grouping_column (str)
key_column (str)
threshold (int)
- property threshold: int
Returns the maximum number of rows each unique (key, grouping column value) pair may appear in after truncation.
- Return type:
int
- property input_domain: tmlt.core.domains.base.Domain
Return input domain for the transformation.
- Return type:
tmlt.core.domains.base.Domain
- property input_metric: tmlt.core.metrics.Metric
Distance metric on input domain.
- Return type:
tmlt.core.metrics.Metric
- property output_domain: tmlt.core.domains.base.Domain
Return output domain for the transformation.
- Return type:
tmlt.core.domains.base.Domain
- property output_metric: tmlt.core.metrics.Metric
Distance metric on output domain.
- Return type:
tmlt.core.metrics.Metric
- __init__(input_domain, input_metric, grouping_column, key_column, threshold)
Constructor.
- Parameters:
input_domain (SparkDataFrameDomain) – Domain of input DataFrame.
input_metric (IfGroupedBy) – Distance metric for input DataFrames. This should be IfGroupedBy(key_column, SumOf(IfGroupedBy(grouping_column, SymmetricDifference()))) or IfGroupedBy(key_column, RootSumOfSquared(IfGroupedBy(grouping_column, SymmetricDifference()))) or IfGroupedBy(grouping_column, SymmetricDifference()).
grouping_column (str) – Name of column defining the groups to truncate.
key_column (str) – Name of column defining the keys.
threshold (int) – The maximum number of rows each unique (key, grouping column value) pair may appear in after truncation.
- stability_function(d_in)
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters:
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- __call__(sdf)
Returns a truncated dataframe.
- Parameters:
sdf (pyspark.sql.DataFrame)
- Return type:
pyspark.sql.DataFrame
- stability_relation(d_in, d_out)
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters:
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type:
bool
- __or__(other: Transformation) → Transformation
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.