persist

Transformations for persisting and un-persisting Spark DataFrames.

See the architecture overview for more information.

Classes

Persist

Persists a Spark DataFrame.

Unpersist

Unpersists a Spark DataFrame.

SparkAction

Triggers an action on a Spark DataFrame.

class Persist(domain, metric)

Bases: tmlt.core.transformations.base.Transformation

Persists a Spark DataFrame.

This is an identity transformation that marks the input Spark DataFrame to be stored when evaluated by Spark.

Note

This transformation does not eagerly evaluate and store the input DataFrame. Spark only stores it when an action (like collect) is performed. If you want to persist eagerly, chain this transformation with a SparkAction.
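The difference between marking a DataFrame for storage and actually storing it can be sketched with a toy model. Everything below is a hypothetical stand-in written in plain Python (`ToyFrame`, `toy_persist`, and `toy_spark_action` are illustrative names, not PySpark or Tumult Core APIs); it only mimics the lazy-versus-eager behavior described in the note.

```python
class ToyFrame:
    """Hypothetical stand-in for a Spark DataFrame (not a PySpark class)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.marked = False  # set by persist(): storage has been requested
        self.stored = False  # set by the first action that runs after persist()

    def persist(self):
        # Lazy: only records the request; nothing is stored yet.
        self.marked = True
        return self

    def count(self):
        # An action: evaluates the data and, if marked, stores it.
        if self.marked:
            self.stored = True
        return len(self._rows)


def toy_persist(frame):
    """Identity on the data; only marks the frame for storage."""
    return frame.persist()


def toy_spark_action(frame):
    """Identity on the data; runs an action to force evaluation."""
    frame.count()
    return frame


lazy = toy_persist(ToyFrame([1, 2, 3]))
eager = toy_spark_action(toy_persist(ToyFrame([1, 2, 3])))

print(lazy.marked, lazy.stored)    # True False
print(eager.marked, eager.stored)  # True True
```

In this model, persisting alone leaves the frame marked but not stored; following it with an action is what materializes the data.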

Parameters

__init__(domain, metric)

Constructor.

Parameters

stability_function(d_in)

Returns the smallest d_out satisfied by the transformation.

The returned d_out is d_in.

Parameters

d_in (Any) – Distance between inputs under input_metric.

Return type

Any

__call__(data)

Returns input DataFrame.

Parameters

data (pyspark.sql.DataFrame) – The input DataFrame.

Return type

pyspark.sql.DataFrame

property input_domain

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool
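As a minimal sketch of how the two stability methods relate for an identity transformation, the snippet below uses plain numbers as distances. It assumes that distances are compared with `<=` and that the relation is derived from the stability function; both are simplifications for illustration, not a description of the library's internals.

```python
def stability_function(d_in):
    # Identity transformation: the tightest output distance equals d_in.
    return d_in


def stability_relation(d_in, d_out):
    # Assumption: the relation holds when the tightest bound fits within d_out.
    return stability_function(d_in) <= d_out


print(stability_function(2))     # 2
print(stability_relation(2, 2))  # True
print(stability_relation(2, 3))  # True
print(stability_relation(2, 1))  # False
```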

__or__(other: Transformation) → Transformation
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.
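Chaining with `|` can be illustrated with a small operator-overloading sketch. `ToyTransformation` is a hypothetical class written for this example; it is not the Tumult Core `Transformation` and skips the domain and metric checks a real chain would involve.

```python
class ToyTransformation:
    """Hypothetical stand-in showing `|` chaining; not the tmlt.core class."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __call__(self, data):
        return self.func(data)

    def __or__(self, other):
        # Run this component first, then feed its output to `other`.
        return ToyTransformation(
            f"{self.name} | {other.name}",
            lambda data: other(self(data)),
        )


double = ToyTransformation("double", lambda rows: [2 * r for r in rows])
persist = ToyTransformation("persist", lambda rows: rows)  # identity
action = ToyTransformation("action", lambda rows: rows)    # identity

pipeline = double | persist | action
print(pipeline.name)        # double | persist | action
print(pipeline([1, 2, 3]))  # [2, 4, 6]
```

Because the persist and action stand-ins are identities, appending them to a chain leaves the data unchanged, which mirrors how the classes on this page are meant to be slotted into a longer pipeline.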

class Unpersist(domain, metric)

Bases: tmlt.core.transformations.base.Transformation

Unpersists a Spark DataFrame.

This is an identity transformation that marks a persisted DataFrame to be evicted. If the input DataFrame is not persisted, this has no effect.
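The "no effect if not persisted" behavior can be pictured with a toy cache. `ToyCache` is a hypothetical helper for illustration only and does not correspond to any Spark or Tumult Core object.

```python
class ToyCache:
    """Hypothetical model of which frames are currently stored."""

    def __init__(self):
        self._stored = set()

    def persist(self, name):
        self._stored.add(name)

    def unpersist(self, name):
        # discard() silently ignores names that are not stored,
        # so unpersisting a non-persisted frame changes nothing.
        self._stored.discard(name)

    def is_stored(self, name):
        return name in self._stored


cache = ToyCache()
cache.persist("df_a")
cache.unpersist("df_a")  # evicts df_a
cache.unpersist("df_b")  # df_b was never persisted: no effect, no error

print(cache.is_stored("df_a"))  # False
print(cache.is_stored("df_b"))  # False
```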

Parameters

__init__(domain, metric)

Constructor.

Parameters

stability_function(d_in)

Returns the smallest d_out satisfied by the transformation.

The returned d_out is d_in.

Parameters

d_in (Any) – Distance between inputs under input_metric.

Return type

Any

__call__(data)

Returns input DataFrame.

Parameters

data (pyspark.sql.DataFrame) – The input DataFrame.

Return type

pyspark.sql.DataFrame

property input_domain

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool

__or__(other: Transformation) → Transformation
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.

class SparkAction(domain, metric)

Bases: tmlt.core.transformations.base.Transformation

Triggers an action on a Spark DataFrame.

This is intended to be used after Persist to eagerly evaluate and store a Transformation's output.

Parameters

__init__(domain, metric)

Constructor.

Parameters

stability_function(d_in)

Returns the smallest d_out satisfied by the transformation.

The returned d_out is d_in.

Parameters

d_in (Any) – Distance between inputs under input_metric.

Return type

Any

__call__(data)

Returns input DataFrame.

Parameters

data (pyspark.sql.DataFrame) – The input DataFrame.

Return type

pyspark.sql.DataFrame

property input_domain

Return input domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property input_metric

Distance metric on input domain.

Return type

tmlt.core.metrics.Metric

property output_domain

Return output domain for the transformation.

Return type

tmlt.core.domains.base.Domain

property output_metric

Distance metric on output domain.

Return type

tmlt.core.metrics.Metric

stability_relation(d_in, d_out)

Returns True only if close inputs produce close outputs.

See the privacy and stability tutorial for more information.

Parameters
  • d_in (Any) – Distance between inputs under input_metric.

  • d_out (Any) – Distance between outputs under output_metric.

Return type

bool

__or__(other: Transformation) → Transformation
__or__(other: tmlt.core.measurements.base.Measurement) → tmlt.core.measurements.base.Measurement

Return this transformation chained with another component.