testing#
Utilities for testing.
Functions#
- assert_dataframe_equal – Assert that two DataFrames are equal, ignoring the order of rows.
- pandas_to_spark_dataframe – Convert a Pandas dataframe into a Spark dataframe in a more general way.
- get_all_props – Returns all properties and fields of a component.
- assert_property_immutability – Raises error if property is mutable.
- create_mock_transformation – Returns a mocked Transformation with the given properties.
- create_mock_queryable – Returns a mocked Queryable.
- create_mock_measurement – Returns a mocked Measurement with the given properties.
- parametrize – Parametrize a test using Case.
- run_test_using_ks_test – Runs the given KSTestCase.
- run_test_using_chi_squared_test – Runs the given ChiSquaredTestCase.
- get_values_summing_to_loc – Returns a list of n values that sum to loc.
- get_sampler – Returns a sampler function.
- get_noise_scales – Get noise scale per output column for an aggregation.
- Returns probability mass/density functions for different noise mechanisms.
- assert_dataframe_equal(actual, expected)#
Assert that two DataFrames are equal, ignoring the order of rows.
If both inputs are Pandas DataFrames, this method uses pandas.testing.assert_frame_equal() with check_dtype=False, so two DataFrames that differ only in the type of a column are considered equal.
If both inputs are Spark DataFrames, then on PySpark 3.5 and above this method uses pyspark.testing.assertDataFrameEqual() with its default options (which correctly compares NaNs and nulls). On older versions of Spark, this method falls back on comparing the dataframes in Pandas in all cases, and null and NaN values may be considered equal to one another.
If one input is a Spark DataFrame and the other is a Pandas DataFrame, the Spark DataFrame is converted to Pandas before comparison.
- Parameters:
actual (Union[pyspark.sql.DataFrame, pandas.DataFrame])
expected (Union[pyspark.sql.DataFrame, pandas.DataFrame])
- Return type:
None
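For instance, a minimal usage sketch (the import path and the local SparkSession setup are assumptions, not taken from this page):
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.utils.testing import assert_dataframe_equal  # import path assumed
>>> spark = SparkSession.builder.getOrCreate()
>>> # Same rows in a different order; the Spark frame is converted to Pandas
>>> # and compared ignoring row order, so this assertion passes.
>>> actual = spark.createDataFrame(pd.DataFrame({"A": [2, 1], "B": [20, 10]}))
>>> expected = pd.DataFrame({"A": [1, 2], "B": [10, 20]})
>>> assert_dataframe_equal(actual, expected)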
- pandas_to_spark_dataframe(spark, df, domain)#
Convert a Pandas dataframe into a Spark dataframe in a more general way.
This function avoids some edge cases that spark.createDataFrame(pandas_df) doesn’t handle correctly, mostly surrounding dataframes with no rows or no columns. Note that domain must be a SparkDataFrameDomain; the less-restrictive type annotation makes it easier to use based on the input/output domain taken from a transformation that is known to only allow SparkDataFrameDomains.
- Parameters:
spark (pyspark.sql.SparkSession)
df (pandas.DataFrame)
domain (tmlt.core.domains.base.Domain)
- Return type:
pyspark.sql.DataFrame
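As a sketch of typical usage; the domain construction and import paths below are assumptions based on the type references above, not taken from this page:
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import (
...     SparkDataFrameDomain,
...     SparkIntegerColumnDescriptor,
... )
>>> from tmlt.core.utils.testing import pandas_to_spark_dataframe  # import path assumed
>>> spark = SparkSession.builder.getOrCreate()
>>> domain = SparkDataFrameDomain({"A": SparkIntegerColumnDescriptor()})
>>> # A dataframe with no rows: one of the edge cases that
>>> # spark.createDataFrame handles poorly.
>>> empty = pd.DataFrame({"A": pd.Series([], dtype="int64")})
>>> sdf = pandas_to_spark_dataframe(spark, empty, domain)
>>> sdf.count()
0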
- get_all_props(Component)#
Returns all properties and fields of a component.
- assert_property_immutability(component, prop_name)#
Raises error if property is mutable.
- Parameters:
component (Any) – Privacy framework component whose attribute is to be checked.
prop_name (str) – Name of property to be checked.
- Return type:
None
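These two helpers are typically used together in component tests. The sketch below assumes that get_all_props yields property names as strings; MyComponent and component are hypothetical stand-ins for the class and instance under test:
>>> from tmlt.core.utils.testing import (  # import path assumed
...     assert_property_immutability,
...     get_all_props,
... )
>>> # ``MyComponent`` and ``component`` are hypothetical placeholders.
>>> def test_properties_are_immutable(component):
...     for prop_name in get_all_props(MyComponent):  # assumed to yield names
...         assert_property_immutability(component, prop_name)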
- create_mock_transformation(input_domain=NumpyIntegerDomain(), input_metric=AbsoluteDifference(), output_domain=NumpyIntegerDomain(), output_metric=AbsoluteDifference(), return_value=0, stability_function_implemented=False, stability_function_return_value=ExactNumber(1), stability_relation_return_value=True)#
Returns a mocked Transformation with the given properties.
- Parameters:
input_domain (tmlt.core.domains.base.Domain) – Input domain for the mock.
input_metric (tmlt.core.metrics.Metric) – Input metric for the mock.
output_domain (tmlt.core.domains.base.Domain) – Output domain for the mock.
output_metric (tmlt.core.metrics.Metric) – Output metric for the mock.
return_value (Any) – Return value for the Transformation’s __call__.
stability_function_implemented (bool) – If False, raises a NotImplementedError with the message “TEST” when the stability function is called.
stability_function_return_value (Any) – Return value for the Transformation’s stability function.
stability_relation_return_value (bool) – Return value for the Transformation’s stability relation.
- Return type:
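For example, a test might build a mock whose stability function is implemented and returns a fixed value. This is a sketch using only the parameters documented above; the import path is assumed, and the expected results are inferred from the parameter descriptions:
>>> from tmlt.core.utils.exact_number import ExactNumber
>>> from tmlt.core.utils.testing import create_mock_transformation  # import path assumed
>>> mock = create_mock_transformation(
...     return_value=42,
...     stability_function_implemented=True,
...     stability_function_return_value=ExactNumber(2),
... )
>>> mock(0)  # __call__ returns return_value regardless of input
42
>>> mock.stability_function(ExactNumber(1)) == ExactNumber(2)
True
>>> mock.stability_relation(ExactNumber(1), ExactNumber(2))
True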
- create_mock_queryable(return_value=0)#
Returns a mocked Queryable.
- Parameters:
return_value (Any) – Return value for the Queryable’s __call__.
- Return type:
- create_mock_measurement(input_domain=NumpyIntegerDomain(), input_metric=AbsoluteDifference(), output_measure=PureDP(), is_interactive=False, return_value=np.int64(0), privacy_function_implemented=False, privacy_function_return_value=ExactNumber(1), privacy_relation_return_value=True)#
Returns a mocked Measurement with the given properties.
- Parameters:
input_domain (tmlt.core.domains.base.Domain) – Input domain for the mock.
input_metric (tmlt.core.metrics.Metric) – Input metric for the mock.
output_measure (tmlt.core.measures.Measure) – Output measure for the mock.
is_interactive (bool) – Whether the mock should be interactive.
return_value (Any) – Return value for the Measurement’s __call__.
privacy_function_implemented (bool) – If False, raises a NotImplementedError with the message “TEST” when the privacy function is called.
privacy_function_return_value (Any) – Return value for the Measurement’s privacy function.
privacy_relation_return_value (bool) – Return value for the Measurement’s privacy relation.
- Return type:
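A companion sketch for measurements, again using only the documented parameters (import path assumed; behavior inferred from the descriptions above):
>>> from tmlt.core.measures import RhoZCDP
>>> from tmlt.core.utils.testing import create_mock_measurement  # import path assumed
>>> measurement = create_mock_measurement(
...     output_measure=RhoZCDP(),
...     return_value=7,
...     privacy_function_implemented=False,
... )
>>> measurement(object())  # __call__ returns return_value
7
>>> measurement.privacy_relation(1, 1)  # privacy_relation_return_value
True
>>> measurement.privacy_function(1)  # unimplemented, per the flag above
Traceback (most recent call last):
...
NotImplementedError: TEST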
- parametrize(*cases, **kwargs)#
Parametrize a test using Case.
Provides a wrapper around pytest.mark.parametrize to allow passing a collection of instances of Case. The argument list provided to the test function is the union of all arguments provided to the Cases parametrized over; if a Case does not specify a particular argument, None is passed. As an example:
>>> @parametrize(
...     Case()(x=5, y=3, z=2),
...     Case("custom")(x=1),
...     Case("large", marks=pytest.mark.slow)(x=500, y=10, z=50),
... )
... def test_func(x, y, z):
...     print(x, y, z)
This parametrization would produce 3 concrete tests:
- One with parameters x=5, y=3, z=2, with no marks and the automatically-generated name from pytest (5-3-2).
- One with parameters x=1, y=None, z=None and the custom name custom.
- One with parameters x=500, y=10, z=50, the custom name large, and the slow pytest mark.
In addition to the series of Cases, parametrize() accepts keyword arguments to be passed on to pytest.mark.parametrize, allowing the use of options like indirect. The argnames, argvalues, and ids options may not be passed this way, as their values are generated by parametrize().
- Parameters:
cases (Union[Case, _NestedCases]) – A collection of test cases.
kwargs (Any) – Keyword arguments to be passed through to pytest.mark.parametrize.
- Return type:
Callable
- run_test_using_ks_test(case, p_threshold, noise_scale_fudge_factor)#
Runs the given KSTestCase.
- Parameters:
case (KSTestCase)
p_threshold (float)
noise_scale_fudge_factor (float)
- Return type:
None
- run_test_using_chi_squared_test(case, p_threshold, noise_scale_fudge_factor)#
Runs the given ChiSquaredTestCase.
- Parameters:
case (ChiSquaredTestCase)
p_threshold (float)
noise_scale_fudge_factor (float)
- Return type:
None
- get_values_summing_to_loc(loc: int, n: int) → List[int]#
- get_values_summing_to_loc(loc: float, n: int) → List[float]
Returns a list of n values that sum to loc.
- Parameters:
loc – Value that the returned list sums to. If this is a float, a list of floats will be returned; otherwise this must be an int, and a list of ints will be returned.
n – Desired list size.
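A brief sketch of the documented contract; the particular values returned are unspecified, so only the length and the sum are checked, and the import path is assumed:
>>> from tmlt.core.utils.testing import get_values_summing_to_loc  # import path assumed
>>> values = get_values_summing_to_loc(10, n=4)
>>> len(values) == 4 and sum(values) == 10
True
>>> float_values = get_values_summing_to_loc(2.5, n=3)
>>> abs(sum(float_values) - 2.5) < 1e-9
True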
- get_sampler(measurement, dataset, post_processor, iterations=1)#
Returns a sampler function.
A sampler function takes no arguments and produces a numpy array containing samples obtained by performing groupby-agg on the given dataset.
- Parameters:
measurement (tmlt.core.measurements.base.Measurement) – Measurement to sample from.
dataset (FixedGroupDataSet) – FixedGroupDataSet object containing DataFrame to perform measurement on.
post_processor (Callable[[pyspark.sql.DataFrame], pyspark.sql.DataFrame]) – Function to process measurement’s output DataFrame and select relevant columns.
iterations (int) – Number of iterations of groupby-agg.
- Return type:
Callable[[], Dict[str, numpy.ndarray]]
- get_noise_scales(agg, budget, dataset, noise_mechanism)#
Get noise scale per output column for an aggregation.
- Parameters:
agg (str)
budget (tmlt.core.utils.exact_number.ExactNumberInput)
dataset (FixedGroupDataSet)
noise_mechanism (tmlt.core.measurements.aggregations.NoiseMechanism)
- Return type:
Classes#
- FakeAggregate – Dummy Pandas Series aggregation for testing purposes.
- PySparkTest – Create a pyspark testing base class for all tests.
- TestComponent – Helper class for component tests.
- Case – A test case, for use with parametrize().
- FixedGroupDataSet – Encapsulates a Spark DataFrame with a specified number of identical groups.
- KSTestCase – Test case for run_test_using_ks_test().
- ChiSquaredTestCase – Test case for run_test_using_chi_squared_test().
- class FakeAggregate#
Bases:
tmlt.core.measurements.pandas_measurements.dataframe.Aggregate
Dummy Pandas Series aggregation for testing purposes.
- property input_domain: tmlt.core.domains.pandas_domains.PandasDataFrameDomain#
Return input domain for the measurement.
- property output_schema: pyspark.sql.types.StructType#
Return the output schema.
- Return type:
pyspark.sql.types.StructType
- property input_metric: tmlt.core.metrics.Metric#
Distance metric on input domain.
- Return type:
tmlt.core.metrics.Metric
- property output_measure: tmlt.core.measures.Measure#
Distance measure on output.
- Return type:
tmlt.core.measures.Measure
- __init__()#
Constructor.
- Return type:
None
- privacy_relation(_, __)#
Always returns False, for testing purposes.
- Parameters:
_ (tmlt.core.utils.exact_number.ExactNumberInput)
__ (tmlt.core.utils.exact_number.ExactNumberInput)
- Return type:
bool
- __call__(data)#
Perform dummy measurement.
- Parameters:
data (pandas.DataFrame)
- Return type:
- privacy_function(d_in)#
Returns the smallest d_out satisfied by the measurement.
See the privacy and stability tutorial for more information.
- Parameters:
d_in (Any) – Distance between inputs under input_metric.
- Raises:
NotImplementedError – If not overridden.
- Return type:
Any
- class PySparkTest(methodName='runTest')#
Bases:
unittest.TestCase
Create a pyspark testing base class for all tests.
All the unit test methods in the same test class can share or reuse the same spark context.
- property spark: pyspark.sql.SparkSession#
Returns the spark session.
- Return type:
pyspark.sql.SparkSession
- __init__(methodName='runTest')#
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- classmethod suppress_py4j_logging()#
Remove noise in the logs irrelevant to testing.
- Return type:
None
- classmethod setUpClass()#
Sets up the SparkSession.
- Return type:
None
- classmethod tearDownClass()#
Tears down the SparkSession.
- Return type:
None
- classmethod assert_frame_equal_with_sort(first_df, second_df, sort_columns=None, **kwargs)#
Asserts that the two data frames are equal.
Wrapper around pandas test function. Both dataframes are sorted since the ordering in Spark is not guaranteed.
- Parameters:
first_df (pandas.DataFrame) – First dataframe to compare.
second_df (pandas.DataFrame) – Second dataframe to compare.
sort_columns (Optional[Sequence[str]]) – Names of columns to sort on. By default, sorts by all columns.
**kwargs (Any) – Keyword arguments that will be passed to assert_frame_equal().
- Return type:
None
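A typical test class might look like the following sketch; the class and test names are illustrative, and the import path is assumed:
>>> import pandas as pd
>>> from tmlt.core.utils.testing import PySparkTest  # import path assumed
>>> class TestMyComponent(PySparkTest):
...     def test_roundtrip(self):
...         sdf = self.spark.createDataFrame(pd.DataFrame({"A": [1, 2]}))
...         # The helper sorts both frames, so row order does not matter.
...         self.assert_frame_equal_with_sort(
...             sdf.toPandas(), pd.DataFrame({"A": [2, 1]})
...         )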
- class TestComponent(methodName='runTest')#
Bases:
PySparkTest
Helper class for component tests.
- property spark: pyspark.sql.SparkSession#
Returns the spark session.
- Return type:
pyspark.sql.SparkSession
- __init__(methodName='runTest')#
Create an instance of the class that will use the named test method when executed. Raises a ValueError if the instance does not have a method with the specified name.
- setUp()#
Common setup for all component tests.
- Return type:
None
- classmethod suppress_py4j_logging()#
Remove noise in the logs irrelevant to testing.
- Return type:
None
- classmethod setUpClass()#
Sets up the SparkSession.
- Return type:
None
- classmethod tearDownClass()#
Tears down the SparkSession.
- Return type:
None
- classmethod assert_frame_equal_with_sort(first_df, second_df, sort_columns=None, **kwargs)#
Asserts that the two data frames are equal.
Wrapper around pandas test function. Both dataframes are sorted since the ordering in Spark is not guaranteed.
- Parameters:
first_df (pandas.DataFrame) – First dataframe to compare.
second_df (pandas.DataFrame) – Second dataframe to compare.
sort_columns (Optional[Sequence[str]]) – Names of columns to sort on. By default, sorts by all columns.
**kwargs (Any) – Keyword arguments that will be passed to assert_frame_equal().
- Return type:
None
- class Case(id=None, **kwargs)#
A test case, for use with parametrize().
Each instance of Case corresponds to a single pytest.param. The Case constructor arguments are passed to pytest.param as keyword arguments, while those passed to Case.__call__() are used as arguments to the test function. Some examples:
>>> # Simplest case -- a single parameter using the default name generated
>>> # by pytest
>>> _ = Case()(x=1)
>>> # Multiple parameters
>>> _ = Case()(x=1, y=2, z=3)
>>> # Passing a custom name for the test
>>> _ = Case("dict")(d={1:2, 3:4})
>>> # Using pytest marks
>>> _ = Case(marks=pytest.mark.xfail)(x=1)
For usage information, see parametrize().
- Parameters:
id (Optional[str])
kwargs (Any)
- property args: Dict[str, Any]#
The arguments passed to the test function in this test case.
- Return type:
Dict[str, Any]
- property kwargs: Dict[str, Any]#
The keyword arguments passed to pytest.param for this test case.
- Return type:
Dict[str, Any]
- __init__(id=None, **kwargs)#
Constructor.
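Assuming Case.__call__() returns the case itself, as the chained-call examples above suggest, the args property can be inspected like this sketch (import path assumed):
>>> from tmlt.core.utils.testing import Case  # import path assumed
>>> case = Case("custom")(x=1, y=2)
>>> case.args == {"x": 1, "y": 2}
True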
- class FixedGroupDataSet#
Encapsulates a Spark DataFrame with a specified number of identical groups.
The DataFrame contains columns A and B – column ‘A’ corresponds to group index and column ‘B’ corresponds to the measure column (to be aggregated).
- property domain: tmlt.core.domains.spark_domains.SparkDataFrameDomain#
Return dataframe domain.
- Return type:
tmlt.core.domains.spark_domains.SparkDataFrameDomain
- property lower: tmlt.core.utils.exact_number.ExactNumber#
Returns a lower bound on the values in B.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- property upper: tmlt.core.utils.exact_number.ExactNumber#
Returns an upper bound on the values in B.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- groupby(noise_mechanism)#
Returns appropriate GroupBy transformation.
- Parameters:
noise_mechanism (tmlt.core.measurements.aggregations.NoiseMechanism)
- Return type:
tmlt.core.transformations.spark_transformations.groupby.GroupBy
- get_dataframe()#
Returns dataframe.
- Return type:
pyspark.sql.DataFrame
- class KSTestCase(sampler=None, locations=None, scales=None, cdfs=None)#
Test case for run_test_using_ks_test().
- Parameters:
- __init__(sampler=None, locations=None, scales=None, cdfs=None)#
Constructor.
- class ChiSquaredTestCase(sampler=None, locations=None, scales=None, cmfs=None, pmfs=None)#
Test case for run_test_using_chi_squared_test().
- Parameters:
- __init__(sampler=None, locations=None, scales=None, cmfs=None, pmfs=None)#
Constructor.