testing#

Utilities for testing.

Functions#

assert_dataframe_equal()

Assert that two DataFrames are equal, ignoring the order of rows.

pandas_to_spark_dataframe()

Convert a Pandas dataframe into a Spark dataframe in a more general way.

get_all_props()

Returns all properties and fields of a component.

assert_property_immutability()

Raises an error if the property is mutable.

create_mock_transformation()

Returns a mocked Transformation with the given properties.

create_mock_queryable()

Returns a mocked Queryable.

create_mock_measurement()

Returns a mocked Measurement with the given properties.

parametrize()

Parametrize a test using Case.

run_test_using_ks_test()

Runs given KSTestCase.

run_test_using_chi_squared_test()

Runs given ChiSquaredTestCase.

get_values_summing_to_loc()

Returns a list of n values that sum to loc.

get_sampler()

Returns a sampler function.

get_noise_scales()

Get noise scale per output column for an aggregation.

get_prob_functions()

Returns probability mass/density functions for different noise mechanisms.

assert_dataframe_equal(actual, expected)#

Assert that two DataFrames are equal, ignoring the order of rows.

If both inputs are Pandas DataFrames, this method uses pandas.testing.assert_frame_equal() with check_dtype=False, so two DataFrames that differ only in the type of a column are considered equal.

If both inputs are Spark DataFrames, then on PySpark 3.5 and above, this method uses pyspark.testing.assertDataFrameEqual() with its default options (which correctly compares NaNs and nulls). On older versions of Spark, this method falls back to comparing the dataframes in Pandas, in which case null and NaN values may be considered equal to one another.

If one input is a Spark DataFrame and the other is a Pandas DataFrame, the Spark DataFrame is converted to Pandas before comparison.

Parameters:
Return type:

None
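The pandas branch of this comparison can be illustrated with a minimal sketch: sort both frames by all columns, reset the indexes, and delegate to pandas' assert_frame_equal with check_dtype=False. This is an illustration of the behavior described above, not the library's actual implementation, and the helper name is invented for the example.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def assert_rows_equal_ignoring_order(actual: pd.DataFrame, expected: pd.DataFrame) -> None:
    """Sketch of an order-insensitive DataFrame comparison (hypothetical helper).

    Sorts both frames by all columns and resets the index before delegating
    to pandas' assert_frame_equal with check_dtype=False, so row order and
    column dtypes do not affect the result.
    """
    cols = sorted(actual.columns)
    a = actual[cols].sort_values(by=cols).reset_index(drop=True)
    e = expected[cols].sort_values(by=cols).reset_index(drop=True)
    assert_frame_equal(a, e, check_dtype=False)


# Rows in a different order, and one column with a different dtype:
df1 = pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0]})
df2 = pd.DataFrame({"x": [2, 1], "y": [4, 3]})
assert_rows_equal_ignoring_order(df1, df2)  # passes
```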

pandas_to_spark_dataframe(spark, df, domain)#

Convert a Pandas dataframe into a Spark dataframe in a more general way.

This function avoids some edge cases that spark.createDataFrame(pandas_df) doesn’t handle correctly, mostly surrounding dataframes with no rows or no columns. Note that domain must be a SparkDataFrameDomain; the less-restrictive type annotation makes it easier to call this function with the input/output domain taken from a transformation that is known to only allow SparkDataFrameDomains.

Parameters:
Return type:

pyspark.sql.DataFrame

get_all_props(Component)#

Returns all properties and fields of a component.

Parameters:

Component (type)

Return type:

List[Tuple[str]]
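One way such an enumeration could work is with the standard inspect module: walk the class members and collect those that are property objects. This is a self-contained sketch under that assumption, not the actual get_all_props implementation, and the function name here is invented; the single-element tuples mirror the documented List[Tuple[str]] return type.

```python
import inspect
from typing import List, Tuple


def list_properties(cls: type) -> List[Tuple[str]]:
    """Sketch: collect the names of all properties defined on a class.

    Wraps each name in a single-element tuple to mirror the documented
    List[Tuple[str]] return type (an assumption about its shape).
    """
    return [
        (name,)
        for name, member in inspect.getmembers(cls)
        if isinstance(member, property)
    ]


class Example:
    def __init__(self) -> None:
        self._v = 1

    @property
    def value(self) -> int:
        return self._v


list_properties(Example)  # [("value",)]
```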

assert_property_immutability(component, prop_name)#

Raises an error if the property is mutable.

Parameters:
  • component (Any) – Privacy framework component whose attribute is to be checked.

  • prop_name (str) – Name of property to be checked.

Return type:

None

create_mock_transformation(input_domain=NumpyIntegerDomain(), input_metric=AbsoluteDifference(), output_domain=NumpyIntegerDomain(), output_metric=AbsoluteDifference(), return_value=0, stability_function_implemented=False, stability_function_return_value=ExactNumber(1), stability_relation_return_value=True)#

Returns a mocked Transformation with the given properties.

Parameters:
  • input_domain (tmlt.core.domains.base.Domain) – Input domain for the mock.

  • input_metric (tmlt.core.metrics.Metric) – Input metric for the mock.

  • output_domain (tmlt.core.domains.base.Domain) – Output domain for the mock.

  • output_metric (tmlt.core.metrics.Metric) – Output metric for the mock.

  • return_value (Any) – Return value for the Transformation’s __call__.

  • stability_function_implemented (bool) – If False, the Transformation’s stability function raises a NotImplementedError with the message “TEST” when called.

  • stability_function_return_value (Any) – Return value for the Transformation’s stability function.

  • stability_relation_return_value (bool) – Return value for the Transformation’s stability relation.

Return type:

unittest.mock.Mock
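A mock with this shape can be built directly with unittest.mock, which may help clarify what the returned object does. The sketch below follows the parameter semantics documented above (domains and metrics omitted for brevity); the helper name is invented, and this is not the library's own implementation.

```python
from unittest.mock import Mock


def make_mock_transformation(
    return_value=0,
    stability_function_implemented=False,
    stability_function_return_value=1,
    stability_relation_return_value=True,
) -> Mock:
    """Sketch of a mocked transformation (hypothetical helper).

    Mirrors the documented behavior: calling the mock yields return_value,
    and the stability function either returns a fixed value or raises
    NotImplementedError("TEST") depending on stability_function_implemented.
    """
    mock = Mock()
    mock.return_value = return_value  # controls the result of mock(...)
    if stability_function_implemented:
        mock.stability_function.return_value = stability_function_return_value
    else:
        mock.stability_function.side_effect = NotImplementedError("TEST")
    mock.stability_relation.return_value = stability_relation_return_value
    return mock


t = make_mock_transformation(return_value=42)
t("ignored data")           # 42
t.stability_relation(1, 1)  # True
```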

create_mock_queryable(return_value=0)#

Returns a mocked Queryable.

Parameters:

return_value (Any) – Return value for the Queryable’s __call__.

Return type:

unittest.mock.Mock

create_mock_measurement(input_domain=NumpyIntegerDomain(), input_metric=AbsoluteDifference(), output_measure=PureDP(), is_interactive=False, return_value=np.int64(0), privacy_function_implemented=False, privacy_function_return_value=ExactNumber(1), privacy_relation_return_value=True)#

Returns a mocked Measurement with the given properties.

Parameters:
  • input_domain (tmlt.core.domains.base.Domain) – Input domain for the mock.

  • input_metric (tmlt.core.metrics.Metric) – Input metric for the mock.

  • output_measure (tmlt.core.measures.Measure) – Output measure for the mock.

  • is_interactive (bool) – Whether the mock should be interactive.

  • return_value (Any) – Return value for the Measurement’s __call__.

  • privacy_function_implemented (bool) – If False, the Measurement’s privacy function raises a NotImplementedError with the message “TEST” when called.

  • privacy_function_return_value (Any) – Return value for the Measurement’s privacy function.

  • privacy_relation_return_value (bool) – Return value for the Measurement’s privacy relation.

Return type:

unittest.mock.Mock

parametrize(*cases, **kwargs)#

Parametrize a test using Case.

Provides a wrapper around pytest.mark.parametrize to allow passing a collection of instances of Case. The argument list provided to the test function is the union of all arguments provided to the Cases parametrized over; if a Case does not specify a particular argument, None is passed. As an example:

>>> @parametrize(
...     Case()(x=5, y=3, z=2),
...     Case("custom")(x=1),
...     Case("large", marks=pytest.mark.slow)(x=500, y=10, z=50),
... )
... def test_func(x, y, z):
...     print(x, y, z)

This parametrization would produce 3 concrete tests:

  • One with parameters x=5, y=3, z=2, with no marks and the automatically-generated name from pytest (5-3-2).

  • One with parameters x=1, y=None, z=None and the custom name custom.

  • One with parameters x=500, y=10, z=50, the custom name large, and the slow pytest mark.

In addition to the series of Cases, parametrize() accepts keyword arguments to be passed on to pytest.mark.parametrize, allowing the use of options like indirect. The argnames, argvalues, and ids options may not be passed this way, as their values are generated by parametrize().

Parameters:
  • cases (Union[Case, _NestedCases]) – A collection of test cases.

  • kwargs (Any) – Keyword arguments to be passed through to pytest.mark.parametrize.

Return type:

Callable

run_test_using_ks_test(case, p_threshold, noise_scale_fudge_factor)#

Runs given KSTestCase.

Parameters:
Return type:

None

run_test_using_chi_squared_test(case, p_threshold, noise_scale_fudge_factor)#

Runs given ChiSquaredTestCase.

Parameters:
Return type:

None

get_values_summing_to_loc(loc: int, n: int) → List[int]#
get_values_summing_to_loc(loc: float, n: int) → List[float]

Returns a list of n values that sum to loc.

Parameters:
  • loc – Value to which the returned list sums. If this is a float, a list of floats is returned; otherwise it must be an int, and a list of ints is returned.

  • n – Desired list size.
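One simple way to produce such a list is to spread loc evenly across n entries and put any integer remainder on the last element. This is a sketch of the idea, not the library's actual algorithm, and the function name is invented for the example.

```python
from typing import List, Union


def values_summing_to(loc: Union[int, float], n: int) -> List:
    """Sketch: build a list of n values that sum to loc (hypothetical helper).

    For ints, uses floor division and adds the remainder to the last
    element; for floats, returns n equal shares of loc.
    """
    if isinstance(loc, int):
        base = loc // n
        values = [base] * n
        values[-1] += loc - base * n  # absorb the remainder
    else:
        values = [loc / n] * n
    return values


values_summing_to(10, 3)  # [3, 3, 4]
```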

get_sampler(measurement, dataset, post_processor, iterations=1)#

Returns a sampler function.

A sampler function takes no arguments and produces a dictionary of numpy arrays containing samples obtained by performing a groupby-aggregation on the given dataset.

Parameters:
Return type:

Callable[[], Dict[str, numpy.ndarray]]

get_noise_scales(agg, budget, dataset, noise_mechanism)#

Get noise scale per output column for an aggregation.

Parameters:
Return type:

Dict[str, tmlt.core.utils.exact_number.ExactNumber]

get_prob_functions(noise_mechanism, locations)#

Returns probability mass/density functions for different noise mechanisms.

Parameters:
Return type:

Dict[str, Dict[str, Callable]]

Classes#

FakeAggregate

Dummy Pandas Series aggregation for testing purposes.

PySparkTest

Create a pyspark testing base class for all tests.

TestComponent

Helper class for component tests.

Case

A test case, for use with parametrize().

FixedGroupDataSet

Encapsulates a Spark DataFrame with specified number of identical groups.

KSTestCase

Test case for run_test_using_ks_test().

ChiSquaredTestCase

Test case for run_test_using_chi_squared_test().

class FakeAggregate#

Bases: tmlt.core.measurements.pandas_measurements.dataframe.Aggregate

Dummy Pandas Series aggregation for testing purposes.

property input_domain: tmlt.core.domains.pandas_domains.PandasDataFrameDomain#

Return input domain for the measurement.

Return type:

tmlt.core.domains.pandas_domains.PandasDataFrameDomain

property output_schema: pyspark.sql.types.StructType#

Return the output schema.

Return type:

pyspark.sql.types.StructType

property input_metric: tmlt.core.metrics.Metric#

Distance metric on input domain.

Return type:

tmlt.core.metrics.Metric

property output_measure: tmlt.core.measures.Measure#

Distance measure on output.

Return type:

tmlt.core.measures.Measure

property is_interactive: bool#

Returns true iff the measurement is interactive.

Return type:

bool

__init__()#

Constructor.

Return type:

None

privacy_relation(_, __)#

Always returns False, for testing purposes.

Parameters:
  • _ (tmlt.core.utils.exact_number.ExactNumberInput)

  • __ (tmlt.core.utils.exact_number.ExactNumberInput)

Return type:

bool

__call__(data)#

Perform dummy measurement.

Parameters:

data (pandas.DataFrame)

Return type:

pandas.DataFrame

privacy_function(d_in)#

Returns the smallest d_out satisfied by the measurement.

See the privacy and stability tutorial for more information.

Parameters:

d_in (Any) – Distance between inputs under input_metric.

Raises:

NotImplementedError – If not overridden.

Return type:

Any

class PySparkTest(methodName='runTest')#

Bases: unittest.TestCase

Create a pyspark testing base class for all tests.

All the unit test methods in the same test class can share or reuse the same spark context.

property spark: pyspark.sql.SparkSession#

Returns the spark session.

Return type:

pyspark.sql.SparkSession

__init__(methodName='runTest')#

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

classmethod suppress_py4j_logging()#

Remove noise in the logs irrelevant to testing.

Return type:

None

classmethod setUpClass()#

Set up the SparkSession.

Return type:

None

classmethod tearDownClass()#

Tears down SparkSession.

Return type:

None

classmethod assert_frame_equal_with_sort(first_df, second_df, sort_columns=None, **kwargs)#

Asserts that the two data frames are equal.

Wrapper around the pandas test function. Both dataframes are sorted before comparison, since row ordering in Spark is not guaranteed.

Parameters:
  • first_df (pandas.DataFrame) – First dataframe to compare.

  • second_df (pandas.DataFrame) – Second dataframe to compare.

  • sort_columns (Optional[Sequence[str]]) – Names of columns to sort on. By default, sorts by all columns.

  • **kwargs (Any) – Keyword arguments that will be passed to assert_frame_equal().

Return type:

None

class TestComponent(methodName='runTest')#

Bases: PySparkTest

Helper class for component tests.

property spark: pyspark.sql.SparkSession#

Returns the spark session.

Return type:

pyspark.sql.SparkSession

__init__(methodName='runTest')#

Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.

setUp()#

Common setup for all component tests.

Return type:

None

classmethod suppress_py4j_logging()#

Remove noise in the logs irrelevant to testing.

Return type:

None

classmethod setUpClass()#

Set up the SparkSession.

Return type:

None

classmethod tearDownClass()#

Tears down SparkSession.

Return type:

None

classmethod assert_frame_equal_with_sort(first_df, second_df, sort_columns=None, **kwargs)#

Asserts that the two data frames are equal.

Wrapper around the pandas test function. Both dataframes are sorted before comparison, since row ordering in Spark is not guaranteed.

Parameters:
  • first_df (pandas.DataFrame) – First dataframe to compare.

  • second_df (pandas.DataFrame) – Second dataframe to compare.

  • sort_columns (Optional[Sequence[str]]) – Names of columns to sort on. By default, sorts by all columns.

  • **kwargs (Any) – Keyword arguments that will be passed to assert_frame_equal().

Return type:

None

class Case(id=None, **kwargs)#

A test case, for use with parametrize().

Each instance of Case corresponds to a single pytest.param. The Case constructor arguments are passed to pytest.param as keyword arguments, while those passed to Case.__call__() are used as arguments to the test function. Some examples:

>>> # Simplest case -- a single parameter using the default name generated
>>> # by pytest
>>> _ = Case()(x=1)
>>> # Multiple parameters
>>> _ = Case()(x=1, y=2, z=3)
>>> # Passing a custom name for the test
>>> _ = Case("dict")(d={1:2, 3:4})
>>> # Using pytest marks
>>> _ = Case(marks=pytest.mark.xfail)(x=1)

For usage information, see parametrize().

Parameters:
  • id (Optional[str])

  • kwargs (Any)

property id: str | None#

The ID for this test case.

Return type:

Optional[str]

property args: Dict[str, Any]#

The arguments passed to the test function in this test case.

Return type:

Dict[str, Any]

property kwargs: Dict[str, Any]#

The keyword arguments passed to pytest.param for this test case.

Return type:

Dict[str, Any]

__init__(id=None, **kwargs)#

Constructor.

Parameters:
  • id (Optional[str]) – An ID for this test case, or None to use the default one generated by pytest.

  • kwargs (Any) – Additional keyword arguments, passed to pytest.param when running this test case. A common use would be to set pytest marks, e.g. marks=pytest.mark.xfail.

__call__(**kwargs)#

Set the parameters to be passed to the test function for this test case.

Parameters:

kwargs (Any)

Return type:

Case

class FixedGroupDataSet#

Encapsulates a Spark DataFrame with specified number of identical groups.

The DataFrame contains columns ‘A’ and ‘B’: column ‘A’ holds the group index, and column ‘B’ holds the values of the measure column (to be aggregated).

group_vals: List[float] | List[int]#

Values for each group.

num_groups: int#

Number of identical groups.

float_measure_column: bool = False#

If True, measure column has floating point values.

property domain: tmlt.core.domains.spark_domains.SparkDataFrameDomain#

Return dataframe domain.

Return type:

tmlt.core.domains.spark_domains.SparkDataFrameDomain

property lower: tmlt.core.utils.exact_number.ExactNumber#

Returns a lower bound on the values in B.

Return type:

tmlt.core.utils.exact_number.ExactNumber

property upper: tmlt.core.utils.exact_number.ExactNumber#

Returns an upper bound on the values in B.

Return type:

tmlt.core.utils.exact_number.ExactNumber

groupby(noise_mechanism)#

Returns appropriate GroupBy transformation.

Parameters:

noise_mechanism (tmlt.core.measurements.aggregations.NoiseMechanism)

Return type:

tmlt.core.transformations.spark_transformations.groupby.GroupBy

get_dataframe()#

Returns dataframe.

Return type:

pyspark.sql.DataFrame

class KSTestCase(sampler=None, locations=None, scales=None, cdfs=None)#

Test case for run_test_using_ks_test().

Parameters:
  • sampler (Optional[Callable[[], Dict[str, numpy.ndarray]]])

  • locations (Optional[Dict[str, Union[str, float]]])

  • scales (Optional[Dict[str, tmlt.core.utils.exact_number.ExactNumberInput]])

  • cdfs (Optional[Dict[str, Callable]])

__init__(sampler=None, locations=None, scales=None, cdfs=None)#

Constructor.

Parameters:
Return type:

None

classmethod from_dict(d)#

Transforms a dictionary into a KSTestCase.

Parameters:

d (Dict[str, Any])

Return type:

KSTestCase

class ChiSquaredTestCase(sampler=None, locations=None, scales=None, cmfs=None, pmfs=None)#

Test case for run_test_using_chi_squared_test().

Parameters:
  • sampler (Optional[Callable[[], Dict[str, numpy.ndarray]]])

  • locations (Optional[Dict[str, int]])

  • scales (Optional[Dict[str, tmlt.core.utils.exact_number.ExactNumberInput]])

  • cmfs (Optional[Dict[str, Callable]])

  • pmfs (Optional[Dict[str, Callable]])

__init__(sampler=None, locations=None, scales=None, cmfs=None, pmfs=None)#

Constructor.

Parameters:
Return type:

None

classmethod from_dict(d)#

Transforms a dictionary into a ChiSquaredTestCase.

Parameters:

d (Dict[str, Any])

Return type:

ChiSquaredTestCase