persist#
Transformations for persisting and un-persisting Spark DataFrames.
See the architecture overview for more information.
Classes#
Persist | Persists a Spark DataFrame.
Unpersist | Unpersists a Spark DataFrame.
SparkAction | Triggers an action on a Spark DataFrame.
- class Persist(domain, metric)#
Bases:
tmlt.core.transformations.base.Transformation
Persists a Spark DataFrame.
This is an identity transformation that marks the input Spark DataFrame to be stored when evaluated by Spark.
Note
This transformation does not eagerly evaluate and store the input DataFrame. Spark only stores it when an action (like collect) is performed. To persist eagerly, chain this transformation with a SparkAction.
- Parameters
domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (tmlt.core.metrics.Metric) –
- __init__(domain, metric)#
Constructor.
- Parameters
domain (SparkDataFrameDomain) – Input/Output domain.
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
The returned d_out is d_in.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
- Return type
Any
- __call__(data)#
Returns input DataFrame.
- Parameters
data (pyspark.sql.DataFrame) –
- Return type
pyspark.sql.DataFrame
- property input_domain#
Return the input domain for the transformation.
- Return type
tmlt.core.domains.base.Domain
- property input_metric#
Distance metric on input domain.
- Return type
tmlt.core.metrics.Metric
- property output_domain#
Return the output domain for the transformation.
- Return type
tmlt.core.domains.base.Domain
- property output_metric#
Distance metric on output domain.
- Return type
tmlt.core.metrics.Metric
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
bool
- __or__(other: Transformation) → Transformation#
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
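As a rough illustration of how stability_function and stability_relation fit together for an identity transformation such as Persist, the sketch below uses plain numbers as distances. IdentityStability is a hypothetical class written for this example, not part of tmlt.core, and checking the relation by comparing the tightest bound against d_out is an assumption made here, not something this page states.

```python
class IdentityStability:
    """Hypothetical stand-in for an identity transformation's stability logic."""

    def stability_function(self, d_in):
        # An identity transformation returns its input unchanged, so the
        # tightest output distance bound equals the input distance bound.
        return d_in

    def stability_relation(self, d_in, d_out):
        # Assumption: a claimed d_out holds whenever it is at least as large
        # as the tightest bound given by stability_function.
        return self.stability_function(d_in) <= d_out


toy = IdentityStability()
tightest = toy.stability_function(2)      # 2
loose_ok = toy.stability_relation(2, 3)   # True: 3 is a valid, looser bound
too_tight = toy.stability_relation(2, 1)  # False: 1 is below the tightest bound
```

In this toy model, any d_out at or above d_in satisfies the relation, which matches the statement that the smallest d_out returned is d_in itself.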
- class Unpersist(domain, metric)#
Bases:
tmlt.core.transformations.base.Transformation
Unpersists a Spark DataFrame.
This is an identity transformation that marks a persisted DataFrame to be evicted. If the input DataFrame is not persisted, this has no effect.
- Parameters
domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (tmlt.core.metrics.Metric) –
- __init__(domain, metric)#
Constructor.
- Parameters
domain (SparkDataFrameDomain) – Input/Output domain.
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
The returned d_out is d_in.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
- Return type
Any
- __call__(data)#
Returns input DataFrame.
- Parameters
data (pyspark.sql.DataFrame) –
- Return type
pyspark.sql.DataFrame
- property input_domain#
Return the input domain for the transformation.
- Return type
tmlt.core.domains.base.Domain
- property input_metric#
Distance metric on input domain.
- Return type
tmlt.core.metrics.Metric
- property output_domain#
Return the output domain for the transformation.
- Return type
tmlt.core.domains.base.Domain
- property output_metric#
Distance metric on output domain.
- Return type
tmlt.core.metrics.Metric
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
bool
- __or__(other: Transformation) → Transformation#
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
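The interplay between marking, storing, and evicting described for Persist and Unpersist can be sketched without Spark. ToyFrame below is a hypothetical stand-in for a lazily evaluated DataFrame. It is neither the pyspark API nor tmlt.core code, and it only mimics the behaviour described on this page: persisting marks the data, an action stores it, and unpersisting evicts it.

```python
class ToyFrame:
    """Hypothetical lazily evaluated frame (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute  # deferred computation, run only by an action
        self.marked = False      # set by persist(), cleared by unpersist()
        self.cache = None        # filled only when an action runs while marked
        self.runs = 0            # how many times the computation actually ran

    def persist(self):
        # Marking only: nothing is computed or stored yet.
        self.marked = True
        return self

    def unpersist(self):
        # Evicts stored rows; harmless if the frame was never persisted.
        self.marked = False
        self.cache = None
        return self

    def collect(self):
        # An "action": reuse stored rows if present, otherwise recompute.
        if self.cache is not None:
            return self.cache
        self.runs += 1
        rows = self._compute()
        if self.marked:
            self.cache = rows
        return rows


frame = ToyFrame(lambda: [1, 2, 3])
frame.persist()
stored_before_action = frame.cache is not None  # False: persist alone stores nothing
frame.collect()
frame.collect()
runs_while_persisted = frame.runs               # 1: the second collect reused the cache
frame.unpersist()
frame.collect()
runs_after_unpersist = frame.runs               # 2: eviction forced a recomputation
```

The first collect after persist plays the role that an action plays in Spark: until it runs, the mark has no visible effect.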
- class SparkAction(domain, metric)#
Bases:
tmlt.core.transformations.base.Transformation
Triggers an action on a Spark DataFrame.
This is intended to be used after Persist to eagerly evaluate and store a Transformation's output.
- Parameters
domain (tmlt.core.domains.spark_domains.SparkDataFrameDomain) –
metric (tmlt.core.metrics.Metric) –
- __init__(domain, metric)#
Constructor.
- Parameters
domain (SparkDataFrameDomain) – Input/Output domain.
- stability_function(d_in)#
Returns the smallest d_out satisfied by the transformation.
The returned d_out is d_in.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
- Return type
Any
- __call__(data)#
Returns input DataFrame.
- Parameters
data (pyspark.sql.DataFrame) –
- Return type
pyspark.sql.DataFrame
- property input_domain#
Return the input domain for the transformation.
- Return type
tmlt.core.domains.base.Domain
- property input_metric#
Distance metric on input domain.
- Return type
tmlt.core.metrics.Metric
- property output_domain#
Return the output domain for the transformation.
- Return type
tmlt.core.domains.base.Domain
- property output_metric#
Distance metric on output domain.
- Return type
tmlt.core.metrics.Metric
- stability_relation(d_in, d_out)#
Returns True only if close inputs produce close outputs.
See the privacy and stability tutorial for more information.
- Parameters
d_in (Any) – Distance between inputs under input_metric.
d_out (Any) – Distance between outputs under output_metric.
- Return type
bool
- __or__(other: Transformation) → Transformation#
- __or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement
Return this transformation chained with another component.
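To make the chaining advice concrete (Persist followed by SparkAction via the | operator), here is a minimal pure-Python sketch of how an __or__ overload can compose two components. Step, mark, and act are hypothetical names invented for this example. The real classes perform domain and metric checks that are omitted here, and the way tmlt.core builds the chained object is not shown on this page.

```python
class Step:
    """Hypothetical component supporting `|` chaining (not tmlt.core's classes)."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def __call__(self, data):
        return self.fn(data)

    def __or__(self, other):
        # Chaining returns a new component that runs self first, then other.
        return Step(f"{self.name} | {other.name}", lambda data: other(self(data)))


log = []


def mark(data):
    # Stand-in for Persist: record the mark and return the data unchanged.
    log.append("marked for storage")
    return data


def act(data):
    # Stand-in for SparkAction: record the action and return the data unchanged.
    log.append("action triggered")
    return data


eager_persist = Step("Persist", mark) | Step("SparkAction", act)
result = eager_persist([1, 2, 3])
```

Both steps are identities on the data, so result equals the input. The only observable effect is the order of side effects recorded in log, which mirrors why the action is chained after the persist step and not before it.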