truncation
Transformations for truncating Spark DataFrames.
Classes
LimitRowsPerGroup | Keep at most k rows per group.
LimitKeysPerGroup | Keep at most k keys per group.
LimitRowsPerKeyPerGroup | For each group, limit k rows per key.
- class LimitRowsPerGroup(input_domain, output_metric, grouping_column, threshold)
Bases: tmlt.core.transformations.base.Transformation
Keep at most k rows per group.
See truncate_large_groups() for more information about truncation.
Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b2
5  a4  b1
6  a4  b2
7  a4  b3
8  a4  b4
>>> truncate = LimitRowsPerGroup(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     output_metric=SymmetricDifference(),
...     grouping_column="A",
...     threshold=2,
... )
>>> # Apply transformation to data
>>> truncated_spark_dataframe = truncate(spark_dataframe)
>>> print_sdf(truncated_spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a4  b3
5  a4  b4
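The effect of keeping at most k rows per group can be modelled without Spark. The following is a minimal pure-Python sketch, not the library's implementation: the helper name `limit_rows_per_group` is hypothetical, and it assumes the first `threshold` rows of each group in input order are kept. The library may select different rows (the example above keeps b3 and b4 for group a4), but the per-group row counts come out the same.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (A, B), mirroring the two string columns in the example


def limit_rows_per_group(rows: List[Row], threshold: int) -> List[Row]:
    """Keep at most `threshold` rows for each value of the grouping column A.

    Assumption: rows are kept in input order; the real transformation may
    choose a different subset of each oversized group.
    """
    kept_per_group: Dict[str, int] = defaultdict(int)
    result: List[Row] = []
    for row in rows:
        group = row[0]
        if kept_per_group[group] < threshold:
            kept_per_group[group] += 1
            result.append(row)
    return result


rows = [
    ("a1", "b1"), ("a2", "b1"),
    ("a3", "b2"), ("a3", "b2"), ("a3", "b2"),
    ("a4", "b1"), ("a4", "b2"), ("a4", "b3"), ("a4", "b4"),
]
truncated = limit_rows_per_group(rows, threshold=2)

# Count the surviving rows per group.
rows_per_group: Dict[str, int] = defaultdict(int)
for group, _ in truncated:
    rows_per_group[group] += 1
```

Groups a1 and a2 pass through unchanged, while a3 and a4 are each cut down to two rows.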
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain (matches input domain)
Input metric - IfGroupedBy on the grouping column, with inner metric SymmetricDifference
Output metric - SymmetricDifference or IfGroupedBy on the grouping column, with inner metric SymmetricDifference
>>> truncate.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.input_metric
IfGroupedBy(column='A', inner_metric=SymmetricDifference())
>>> truncate.output_metric
SymmetricDifference()
- Stability Guarantee:
LimitRowsPerGroup's stability_function() returns threshold * d_in if output_metric is SymmetricDifference() and d_in otherwise.
>>> truncate.stability_function(1)
2
>>> truncate.stability_function(2)
4
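The stability rule above is simple enough to restate as arithmetic. The sketch below is illustrative only: the function name and the boolean flag are hypothetical, and plain integers stand in for the exact numbers the library accepts.

```python
def limit_rows_per_group_stability(
    d_in: int, threshold: int, output_is_symmetric_difference: bool
) -> int:
    """Restate the documented rule for LimitRowsPerGroup.

    threshold * d_in when the output metric is SymmetricDifference(),
    d_in otherwise (the IfGroupedBy output metric).
    """
    if output_is_symmetric_difference:
        return threshold * d_in
    return d_in


# Same inputs as the doctest above: threshold=2, SymmetricDifference output.
d_out_1 = limit_rows_per_group_stability(1, 2, True)
d_out_2 = limit_rows_per_group_stability(2, 2, True)
# With the grouped output metric the distance is unchanged.
d_out_grouped = limit_rows_per_group_stability(2, 2, False)
```

With `threshold=2` this gives 2 and 4 for `d_in` of 1 and 2, matching the doctest output.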
- Parameters:
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain)
output_metric (Union[tmlt.core.metrics.SymmetricDifference, tmlt.core.metrics.IfGroupedBy])
grouping_column (str)
threshold (int)
- property threshold: int
Returns the maximum number of rows per group after truncation.
- Return type: int
- property input_domain: tmlt.core.domains.base.Domain
Return input domain for the transformation.
- Return type: tmlt.core.domains.base.Domain
- property input_metric: tmlt.core.metrics.Metric
Distance metric on input domain.
- Return type: tmlt.core.metrics.Metric
- property output_domain: tmlt.core.domains.base.Domain
Return output domain for the transformation.
- Return type: tmlt.core.domains.base.Domain
- property output_metric: tmlt.core.metrics.Metric
Distance metric on output domain.
- Return type: tmlt.core.metrics.Metric
- __init__(input_domain, output_metric, grouping_column, threshold)
Constructor.
- Parameters:
input_domain (SparkDataFrameDomain) – Domain of input DataFrame.
output_metric (Union[SymmetricDifference, IfGroupedBy]) – Distance metric for output DataFrames. This should be SymmetricDifference() or IfGroupedBy(grouping_column, SymmetricDifference()).
grouping_column (str) – Name of column defining the groups to truncate.
threshold (int) – The maximum number of rows per group after truncation.
- stability_function(d_in)
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters:
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type: tmlt.core.utils.exact_number.ExactNumber
- __call__(sdf)
Returns a truncated dataframe.
- Parameters:
sdf (pyspark.sql.DataFrame)
- Return type: pyspark.sql.DataFrame
- stability_relation(d_in, d_out)
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters:
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type: bool
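Since stability_function() returns the smallest d_out the transformation satisfies, one natural way to read stability_relation() is as a comparison against that value. The sketch below shows this reading under stated assumptions: distances are plain integers compared with `<=`, and the two free functions are hypothetical stand-ins rather than the library's methods.

```python
def stability_function(d_in: int, threshold: int = 2) -> int:
    """Toy stability function: threshold * d_in (the SymmetricDifference case)."""
    return threshold * d_in


def stability_relation(d_in: int, d_out: int) -> bool:
    """True when the smallest guaranteed d_out does not exceed the requested d_out."""
    return stability_function(d_in) <= d_out


# With threshold=2, inputs at distance 1 yield outputs at distance at most 2.
ok = stability_relation(1, 2)         # 2 <= 2
too_tight = stability_relation(1, 1)  # 2 <= 1 fails
```

Any `d_out` at or above the value returned by the stability function satisfies the relation; anything smaller does not.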
- __or__(other: Transformation) → Transformation
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
- class LimitKeysPerGroup(input_domain, output_metric, grouping_column, key_column, threshold)
Bases: tmlt.core.transformations.base.Transformation
Keep at most k keys per group.
See limit_keys_per_group() for more information about truncation.
Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b2
5  a4  b1
6  a4  b2
7  a4  b3
8  a4  b4
>>> truncate = LimitKeysPerGroup(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     output_metric=IfGroupedBy("B", SumOf(IfGroupedBy("A", SymmetricDifference()))),
...     grouping_column="A",
...     key_column="B",
...     threshold=2,
... )
>>> # Apply transformation to data
>>> truncated_spark_dataframe = truncate(spark_dataframe)
>>> print_sdf(truncated_spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b2
5  a4  b3
6  a4  b4
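Limiting keys differs from limiting rows: every row of a retained key survives, however many there are. The pure-Python sketch below models that behaviour. The helper `limit_keys_per_group` here is a hypothetical stand-in, and it assumes the first `threshold` distinct keys seen in each group are the ones kept; the library may retain different keys (the example above keeps b3 and b4 for group a4).

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (grouping column A, key column B)


def limit_keys_per_group(rows: List[Row], threshold: int) -> List[Row]:
    """Keep all rows whose key is among the retained keys of their group.

    Assumption: the first `threshold` distinct keys per group (in input
    order) are retained.
    """
    kept_keys: Dict[str, List[str]] = {}
    result: List[Row] = []
    for group, key in rows:
        keys = kept_keys.setdefault(group, [])
        if key not in keys and len(keys) < threshold:
            keys.append(key)
        if key in keys:
            result.append((group, key))
    return result


rows = [
    ("a1", "b1"), ("a2", "b1"),
    ("a3", "b2"), ("a3", "b2"), ("a3", "b2"),
    ("a4", "b1"), ("a4", "b2"), ("a4", "b3"), ("a4", "b4"),
]
truncated = limit_keys_per_group(rows, threshold=2)

# Distinct keys that survive in each group.
keys_per_group: Dict[str, set] = {}
for group, key in truncated:
    keys_per_group.setdefault(group, set()).add(key)
```

Group a3 keeps all three of its rows because they share a single key, while a4 drops from four keys to two.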
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain (matches input domain)
Input metric - IfGroupedBy on the grouping column, with inner metric SymmetricDifference
Output metric - IfGroupedBy on the grouping column, with inner metric SymmetricDifference, or IfGroupedBy on the key column, with inner metric as a SumOf or RootSumOfSquared over an IfGroupedBy on the grouping column, with inner metric SymmetricDifference
>>> truncate.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.input_metric
IfGroupedBy(column='A', inner_metric=SymmetricDifference())
>>> truncate.output_metric
IfGroupedBy(column='B', inner_metric=SumOf(inner_metric=IfGroupedBy(column='A', inner_metric=SymmetricDifference())))
- Stability Guarantee:
LimitKeysPerGroup's stability_function() returns d_in if output_metric is IfGroupedBy(grouping_column, SymmetricDifference()), sqrt(threshold) * d_in if output_metric is IfGroupedBy(key_column, RootSumOfSquared(IfGroupedBy(grouping_column, SymmetricDifference()))), and threshold * d_in otherwise.
>>> truncate.stability_function(1)
2
>>> truncate.stability_function(2)
4
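The three cases of this guarantee can be written out as a small function. This is a sketch for illustration: the function name and the string tags `"grouping"`, `"root_sum_of_squared"`, and `"sum_of"` are hypothetical labels for the three output metrics, and `math.sqrt` returns a float approximation where the library works with exact numbers.

```python
import math


def limit_keys_per_group_stability(
    d_in: float, threshold: int, output_metric: str
) -> float:
    """Restate the documented rule for LimitKeysPerGroup.

    "grouping":            IfGroupedBy(grouping_column, SymmetricDifference())
    "root_sum_of_squared": IfGroupedBy(key_column, RootSumOfSquared(...))
    anything else:         treated as the SumOf case
    """
    if output_metric == "grouping":
        return d_in
    if output_metric == "root_sum_of_squared":
        return math.sqrt(threshold) * d_in
    return threshold * d_in


# The doctest above uses the SumOf output metric with threshold=2.
sum_of_1 = limit_keys_per_group_stability(1, 2, "sum_of")
sum_of_2 = limit_keys_per_group_stability(2, 2, "sum_of")
# A threshold of 4 gives a clean square root for the RootSumOfSquared case.
rss = limit_keys_per_group_stability(1, 4, "root_sum_of_squared")
grouped = limit_keys_per_group_stability(3, 2, "grouping")
```

The RootSumOfSquared case grows with the square root of the threshold rather than linearly, so it gives a smaller bound than SumOf whenever the threshold exceeds 1.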
- Parameters:
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain)
output_metric (tmlt.core.metrics.IfGroupedBy)
grouping_column (str)
key_column (str)
threshold (int)
- property threshold: int
Returns the maximum number of keys per group after truncation.
- Return type: int
- property input_domain: tmlt.core.domains.base.Domain
Return input domain for the transformation.
- Return type: tmlt.core.domains.base.Domain
- property input_metric: tmlt.core.metrics.Metric
Distance metric on input domain.
- Return type: tmlt.core.metrics.Metric
- property output_domain: tmlt.core.domains.base.Domain
Return output domain for the transformation.
- Return type: tmlt.core.domains.base.Domain
- property output_metric: tmlt.core.metrics.Metric
Distance metric on output domain.
- Return type: tmlt.core.metrics.Metric
- __init__(input_domain, output_metric, grouping_column, key_column, threshold)
Constructor.
- Parameters:
input_domain (SparkDataFrameDomain) – Domain of input DataFrame.
output_metric (IfGroupedBy) – Distance metric for output DataFrames. This should be IfGroupedBy(key_column, SumOf(IfGroupedBy(grouping_column, SymmetricDifference()))) or IfGroupedBy(key_column, RootSumOfSquared(IfGroupedBy(grouping_column, SymmetricDifference()))) or IfGroupedBy(grouping_column, SymmetricDifference()).
grouping_column (str) – Name of column defining the groups to truncate.
key_column (str) – Name of column defining the keys.
threshold (int) – The maximum number of keys per group after truncation.
- stability_function(d_in)
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters:
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type: tmlt.core.utils.exact_number.ExactNumber
- __call__(sdf)
Returns a truncated dataframe.
- Parameters:
sdf (pyspark.sql.DataFrame)
- Return type: pyspark.sql.DataFrame
- stability_relation(d_in, d_out)
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters:
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type: bool
- __or__(other: Transformation) → Transformation
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
- class LimitRowsPerKeyPerGroup(input_domain, input_metric, grouping_column, key_column, threshold)
Bases: tmlt.core.transformations.base.Transformation
For each group, limit k rows per key.
See truncate_large_groups() for more information about truncation.
Example
>>> # Example input
>>> print_sdf(spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a3  b2
5  a4  b1
6  a4  b2
7  a4  b3
8  a4  b4
>>> truncate = LimitRowsPerKeyPerGroup(
...     input_domain=SparkDataFrameDomain(
...         {
...             "A": SparkStringColumnDescriptor(),
...             "B": SparkStringColumnDescriptor(),
...         }
...     ),
...     input_metric=IfGroupedBy("B", SumOf(IfGroupedBy("A", SymmetricDifference()))),
...     grouping_column="A",
...     key_column="B",
...     threshold=2,
... )
>>> # Apply transformation to data
>>> truncated_spark_dataframe = truncate(spark_dataframe)
>>> print_sdf(truncated_spark_dataframe)
    A   B
0  a1  b1
1  a2  b1
2  a3  b2
3  a3  b2
4  a4  b1
5  a4  b2
6  a4  b3
7  a4  b4
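Here the limit applies to each (group, key) pair rather than to the group as a whole. A minimal pure-Python model of that rule follows; the helper name `limit_rows_per_key_per_group` is hypothetical, and keeping rows in input order is an assumption of the sketch, not a statement about how the library chooses rows.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (grouping column A, key column B)


def limit_rows_per_key_per_group(rows: List[Row], threshold: int) -> List[Row]:
    """Keep at most `threshold` rows for each (group, key) pair.

    Assumption: the first `threshold` rows of each pair in input order are kept.
    """
    kept_per_pair: Dict[Row, int] = defaultdict(int)
    result: List[Row] = []
    for group, key in rows:
        if kept_per_pair[(group, key)] < threshold:
            kept_per_pair[(group, key)] += 1
            result.append((group, key))
    return result


rows = [
    ("a1", "b1"), ("a2", "b1"),
    ("a3", "b2"), ("a3", "b2"), ("a3", "b2"),
    ("a4", "b1"), ("a4", "b2"), ("a4", "b3"), ("a4", "b4"),
]
truncated = limit_rows_per_key_per_group(rows, threshold=2)
```

Only the pair (a3, b2) exceeds the threshold, so it loses one row; group a4 keeps all four rows because each of its keys appears once.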
- Transformation Contract:
Input domain - SparkDataFrameDomain
Output domain - SparkDataFrameDomain (matches input domain)
Input metric - IfGroupedBy on the grouping column, with inner metric SymmetricDifference, or IfGroupedBy on the key column, with inner metric as a SumOf or RootSumOfSquared over an IfGroupedBy on the grouping column, with inner metric SymmetricDifference
Output metric - SymmetricDifference, or IfGroupedBy on the key column, with inner metric as a RootSumOfSquared, with inner metric SymmetricDifference, or IfGroupedBy on the grouping column, with inner metric SymmetricDifference
>>> truncate.input_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.output_domain
SparkDataFrameDomain(schema={'A': SparkStringColumnDescriptor(allow_null=False), 'B': SparkStringColumnDescriptor(allow_null=False)})
>>> truncate.input_metric
IfGroupedBy(column='B', inner_metric=SumOf(inner_metric=IfGroupedBy(column='A', inner_metric=SymmetricDifference())))
>>> truncate.output_metric
SymmetricDifference()
- Stability Guarantee:
LimitRowsPerKeyPerGroup's stability_function() returns d_in if input_metric is IfGroupedBy(grouping_column, SymmetricDifference()) and threshold * d_in otherwise.
>>> truncate.stability_function(1)
2
>>> truncate.stability_function(2)
4
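Unlike the other two classes, this rule branches on the input metric rather than the output metric. A short sketch of the documented rule, with a hypothetical function name and flag and with plain integers standing in for exact numbers:

```python
def limit_rows_per_key_per_group_stability(
    d_in: int, threshold: int, input_is_grouped_by_grouping_column: bool
) -> int:
    """Restate the documented rule for LimitRowsPerKeyPerGroup.

    d_in when the input metric is
    IfGroupedBy(grouping_column, SymmetricDifference()),
    threshold * d_in for the other supported input metrics.
    """
    if input_is_grouped_by_grouping_column:
        return d_in
    return threshold * d_in


# The doctest above uses an IfGroupedBy-on-key input metric with threshold=2.
keyed_1 = limit_rows_per_key_per_group_stability(1, 2, False)
keyed_2 = limit_rows_per_key_per_group_stability(2, 2, False)
grouped = limit_rows_per_key_per_group_stability(2, 2, True)
```

The first two values reproduce the doctest results of 2 and 4.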
- Parameters:
input_domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain)
input_metric (tmlt.core.metrics.IfGroupedBy)
grouping_column (str)
key_column (str)
threshold (int)
- property threshold: int
Returns the maximum number of rows each unique (key, grouping column value) pair may appear in after truncation.
- Return type: int
- property input_domain: tmlt.core.domains.base.Domain
Return input domain for the transformation.
- Return type: tmlt.core.domains.base.Domain
- property input_metric: tmlt.core.metrics.Metric
Distance metric on input domain.
- Return type: tmlt.core.metrics.Metric
- property output_domain: tmlt.core.domains.base.Domain
Return output domain for the transformation.
- Return type: tmlt.core.domains.base.Domain
- property output_metric: tmlt.core.metrics.Metric
Distance metric on output domain.
- Return type: tmlt.core.metrics.Metric
- __init__(input_domain, input_metric, grouping_column, key_column, threshold)
Constructor.
- Parameters:
input_domain (SparkDataFrameDomain) – Domain of input DataFrame.
input_metric (IfGroupedBy) – Distance metric for input DataFrames. This should be IfGroupedBy(key_column, SumOf(IfGroupedBy(grouping_column, SymmetricDifference()))) or IfGroupedBy(key_column, RootSumOfSquared(IfGroupedBy(grouping_column, SymmetricDifference()))) or IfGroupedBy(grouping_column, SymmetricDifference()).
grouping_column (str) – Name of column defining the groups to truncate.
key_column (str) – Name of column defining the keys.
threshold (int) – The maximum number of rows each unique (key, grouping column value) pair may appear in after truncation.
- stability_function(d_in)
Returns the smallest d_out satisfied by the transformation.
See the architecture overview for more information.
- Parameters:
d_in (tmlt.core.utils.exact_number.ExactNumberInput) – Distance between inputs under input_metric.
- Return type: tmlt.core.utils.exact_number.ExactNumber
- __call__(sdf)
Returns a truncated dataframe.
- Parameters:
sdf (pyspark.sql.DataFrame)
- Return type: pyspark.sql.DataFrame
- stability_relation(d_in, d_out)
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters:
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type: bool
- __or__(other: Transformation) → Transformation
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.