metrics#
Module containing metrics used for constructing measurements and transformations.
Classes#
Metric | Base class for input/output metrics.
NullMetric | Metric for use when distance is undefined.
ExactNumberMetric | A metric whose distances are exact numbers.
AbsoluteDifference | The absolute value of the difference of two values.
SymmetricDifference | The number of elements that are in only one of two sets.
HammingDistance | The number of elements that are different between two sets of the same size.
AggregationMetric | Distances resulting from aggregating distances of its components.
SumOf | Distances resulting from summing distances of its components.
RootSumOfSquared | The square root of the sum of the squares of component distances.
OnColumn | The value of a metric applied to a single column treated as a vector.
OnColumns | A tuple containing the values of multiple OnColumn metrics.
IfGroupedBy | Distance between two DataFrames that shall be grouped by a given attribute.
DictMetric | Distance between two dictionaries with identical sets of keys.
AddRemoveKeys | The number of keys that dictionaries of dataframes differ by.
- class Metric#
Bases:
abc.ABC
Base class for input/output metrics.
- abstract validate(value)#
Raises an error if value is not a valid distance.
- Parameters:
value (Any) – A distance between two datasets under this metric.
- Return type:
None
- abstract compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (Any)
value2 (Any)
- Return type:
bool
- abstract supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain)
- Return type:
bool
- abstract distance(value1, value2, domain)#
Returns the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any)
value2 (Any)
domain (tmlt.core.domains.base.Domain)
- Return type:
Any
- class NullMetric#
Bases:
Metric
Metric for use when distance is undefined.
- abstract validate(value)#
Raises an error if value is not a valid distance.
This method is not implemented.
- Parameters:
value (Any) – A distance between two datasets under this metric.
- Return type:
None
- abstract compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
This method is not implemented.
- Parameters:
value1 (Any) – A distance between two datasets under this metric.
value2 (Any) – A distance between two datasets under this metric.
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- abstract distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
Any
- class ExactNumberMetric#
Bases:
Metric
A metric whose distances are exact numbers.
- abstract distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- abstract validate(value)#
Raises an error if value is not a valid distance.
- Parameters:
value (Any) – A distance between two datasets under this metric.
- Return type:
None
- abstract compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (Any)
value2 (Any)
- Return type:
bool
- abstract supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain)
- Return type:
bool
- class AbsoluteDifference#
Bases:
ExactNumberMetric
The absolute value of the difference of two values.
Example
>>> AbsoluteDifference().distance(
...     np.int64(20), np.int64(82), NumpyIntegerDomain()
... )
62
>>> # 1.2 is first converted to rational 5404319552844595/4503599627370496
>>> AbsoluteDifference().distance(
...     np.float64(1.2), np.float64(1.0), NumpyFloatDomain()
... )
900719925474099/4503599627370496
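The float result in the example follows from converting each float to the exact rational it stores before subtracting. The arithmetic can be reproduced with Python's standard fractions module. This is a minimal sketch, not the library's implementation, and the helper name absolute_difference is hypothetical.

```python
from fractions import Fraction


def absolute_difference(value1, value2):
    """Exact |value1 - value2|; floats are converted to the rationals they store."""
    return abs(Fraction(value1) - Fraction(value2))


print(absolute_difference(20, 82))    # 62
print(absolute_difference(1.2, 1.0))  # 900719925474099/4503599627370496
```

The second result is slightly larger than 1/5 because the float 1.2 is stored as a binary fraction slightly above 1.2.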
- validate(value)#
Raises an error if value is not a valid distance.
value must be a nonnegative real or infinity.
- Parameters:
value (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (tmlt.core.utils.exact_number.ExactNumberInput)
value2 (tmlt.core.utils.exact_number.ExactNumberInput)
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- class SymmetricDifference#
Bases:
ExactNumberMetric
The number of elements that are in only one of two sets.
This metric is compatible with spark dataframes, pandas dataframes, and pandas series. It ignores ordering and, in the case of pandas, indices. That is, it treats each collection as a multiset of items. For non-grouped data, it treats each record as an item.
For grouped data there are a few cases:
If the group keys are different, the distance is infinity
The distance between two groups with the same multi-set of records is 0
The distance between two groups where exactly one is empty is 1
The distance between two groups with different records (where neither is empty) is 2
Examples
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import (
...     SparkColumnsDescriptor,
...     SparkIntegerColumnDescriptor,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> domain = SparkDataFrameDomain(
...     {
...         "A": SparkIntegerColumnDescriptor(),
...         "B": SparkIntegerColumnDescriptor(),
...     }
... )
>>> df1 = spark.createDataFrame(
...     pd.DataFrame({"A": [1, 1, 1, 2, 3], "B": [2, 2, 2, 4, 3]})
... )
>>> df2 = spark.createDataFrame(pd.DataFrame({"A": [1, 2, 1], "B": [2, 4, 1]}))
>>> SymmetricDifference().distance(df1, df2, domain)
4
>>> group_keys = spark.createDataFrame(pd.DataFrame({"B": [1, 2, 4]}))
>>> domain = SparkGroupedDataFrameDomain(
...     {
...         "A": SparkIntegerColumnDescriptor(),
...         "B": SparkIntegerColumnDescriptor(),
...     },
...     ["B"],
... )
>>> grouped_df1 = GroupedDataFrame(df1, group_keys)
>>> grouped_df2 = GroupedDataFrame(df2, group_keys)
>>> SymmetricDifference().distance(grouped_df1, grouped_df2, domain)
3
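Both results in the examples can be checked by hand with multiset arithmetic. The sketch below uses collections.Counter on plain tuples in place of dataframes. The function names are hypothetical, this is not the library's implementation, and the case of differing group keys (distance infinity) is not modeled because both inputs share one list of keys.

```python
from collections import Counter


def symmetric_difference(records1, records2):
    """Number of records that appear in only one of the two multisets."""
    c1, c2 = Counter(records1), Counter(records2)
    return sum(((c1 - c2) + (c2 - c1)).values())


def grouped_symmetric_difference(records1, records2, key_index, group_keys):
    """Per-group rule from the text: 0 if equal, 1 if exactly one is empty, else 2."""
    total = 0
    for key in group_keys:
        g1 = Counter(r for r in records1 if r[key_index] == key)
        g2 = Counter(r for r in records2 if r[key_index] == key)
        if g1 == g2:
            continue
        total += 1 if (not g1 or not g2) else 2
    return total


# Rows are (A, B) tuples taken from df1 and df2 in the example.
rows1 = [(1, 2), (1, 2), (1, 2), (2, 4), (3, 3)]
rows2 = [(1, 2), (2, 4), (1, 1)]
print(symmetric_difference(rows1, rows2))  # 4
print(grouped_symmetric_difference(rows1, rows2, key_index=1, group_keys=[1, 2, 4]))  # 3
```

In the grouped case, the record (3, 3) does not contribute because 3 is not one of the group keys.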
- validate(value)#
Raises an error if value is not a valid distance.
value must be a nonnegative integer or infinity.
- Parameters:
value (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (tmlt.core.utils.exact_number.ExactNumberInput)
value2 (tmlt.core.utils.exact_number.ExactNumberInput)
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- class HammingDistance#
Bases:
ExactNumberMetric
The number of elements that are different between two sets of the same size.
This metric is compatible with spark dataframes, pandas dataframes, and pandas series. It ignores ordering and, in the case of pandas, indices. That is, it treats each collection as a multiset of records.
If the sets are not the same size, the distance is infinity.
Example
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import SparkColumnsDescriptor
>>> from tmlt.core.domains.spark_domains import (
...     SparkIntegerColumnDescriptor,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> domain = SparkDataFrameDomain(
...     {
...         "A": SparkIntegerColumnDescriptor(),
...         "B": SparkIntegerColumnDescriptor(),
...     }
... )
>>> df1 = spark.createDataFrame(
...     pd.DataFrame({"A": [1, 1, 1, 3], "B": [2, 2, 2, 4]})
... )
>>> df2 = spark.createDataFrame(pd.DataFrame({"A": [1, 2], "B": [2, 4]}))
>>> HammingDistance().distance(df1, df2, domain)
oo
>>> df3 = spark.createDataFrame(
...     pd.DataFrame({"A": [1, 2, 3, 1], "B": [2, 4, 4, 2]})
... )
>>> HammingDistance().distance(df1, df3, domain)
1
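For two multisets of equal size, the number of differing records equals the number of records left over after removing everything the two have in common. The sketch below illustrates this on plain tuples. The name hamming_distance is hypothetical, math.inf stands in for the library's exact infinity (printed as oo in the example), and this is not the library's implementation.

```python
import math
from collections import Counter


def hamming_distance(records1, records2):
    """Records that differ between two equal-size multisets; inf if sizes differ."""
    if len(records1) != len(records2):
        return math.inf
    # Counter subtraction keeps only positive counts: records in 1 missing from 2.
    return sum((Counter(records1) - Counter(records2)).values())


# Rows are (A, B) tuples taken from df1, df2, and df3 in the example.
rows1 = [(1, 2), (1, 2), (1, 2), (3, 4)]
rows2 = [(1, 2), (2, 4)]
rows3 = [(1, 2), (2, 4), (3, 4), (1, 2)]
print(hamming_distance(rows1, rows2))  # inf
print(hamming_distance(rows1, rows3))  # 1
```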
- validate(value)#
Raises an error if value is not a valid distance.
value must be a nonnegative integer or infinity.
- Parameters:
value (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (tmlt.core.utils.exact_number.ExactNumberInput)
value2 (tmlt.core.utils.exact_number.ExactNumberInput)
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- class AggregationMetric(inner_metric)#
Bases:
ExactNumberMetric
Distances resulting from aggregating distances of its components.
Components may be elements of a series, groups of a grouped dataframe, or elements of a list. This metric is parameterized by an inner_metric that is used to compute the distances of the components. See SumOf or RootSumOfSquared for example usage.
If the values are grouped dataframes, the groups must be the same for both values, or the distance is infinity.
If the values are pandas series or lists, they must be the same size, or the distance is infinity. The index of the series is ignored.
- Parameters:
inner_metric (Union[AbsoluteDifference, SymmetricDifference, HammingDistance, IfGroupedBy])
- property inner_metric: AbsoluteDifference | SymmetricDifference | HammingDistance | IfGroupedBy#
Returns metric to be used for summing.
- Return type:
Union[AbsoluteDifference, SymmetricDifference, HammingDistance, IfGroupedBy]
- __init__(inner_metric)#
Constructor.
- Parameters:
inner_metric (Union[AbsoluteDifference, SymmetricDifference, HammingDistance, IfGroupedBy]) – Metric to be applied to the components.
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
value2 (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- abstract validate(value)#
Raises an error if value is not a valid distance.
- Parameters:
value (Any) – A distance between two datasets under this metric.
- Return type:
None
- class SumOf(inner_metric)#
Bases:
AggregationMetric
Distances resulting from summing distances of its components.
These components may be elements of a series, groups of a grouped dataframe, or elements of a list. This metric is parameterized by an inner_metric that is used to compute the distances of the components.
Example
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import SparkColumnsDescriptor
>>> from tmlt.core.domains.spark_domains import (
...     SparkIntegerColumnDescriptor,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> # Symmetric difference on SparkGroupedDataFrame
>>> group_keys = spark.createDataFrame(pd.DataFrame({"A": [1, 2]}))
>>> domain = SparkGroupedDataFrameDomain(
...     {
...         "A": SparkIntegerColumnDescriptor(),
...         "B": SparkIntegerColumnDescriptor(),
...     },
...     ["A"],
... )
>>> df1 = GroupedDataFrame(
...     spark.createDataFrame(
...         pd.DataFrame({"A": [1, 1, 2, 3], "B": [1, 1, 2, 4]})
...     ),
...     group_keys,
... )
>>> df2 = GroupedDataFrame(
...     spark.createDataFrame(
...         pd.DataFrame({"A": [1, 2, 2, 3], "B": [1, 3, 4, 5]})
...     ),
...     group_keys,
... )
>>> SumOf(SymmetricDifference()).distance(df1, df2, domain)
4
>>> # Using HammingDistance gives a distance of infinity since the groups
>>> # are different sizes, despite the fact that the two dataframes are the
>>> # same size.
>>> SumOf(HammingDistance()).distance(df1, df2, domain)
oo
>>> # Absolute difference on pandas series first converts the floats to
>>> # rationals, then exactly computes the distance.
>>> domain = PandasSeriesDomain(NumpyFloatDomain())
>>> series1 = pd.Series([1.2, 0.8])
>>> series2 = pd.Series([0.3, 1.4])
>>> SumOf(AbsoluteDifference()).distance(series1, series2, domain)
27021597764222973/18014398509481984
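The series result in the example can be reproduced with exact rational arithmetic from the standard library. The sketch below sums component-wise exact absolute differences over plain lists. The names exact_abs_diff and sum_of are hypothetical, math.inf stands in for the library's exact infinity, and this is not the library's implementation.

```python
import math
from fractions import Fraction


def exact_abs_diff(a, b):
    """Exact |a - b| after converting floats to the rationals they store."""
    return abs(Fraction(a) - Fraction(b))


def sum_of(values1, values2, inner_distance):
    """Sum of component distances; inf when the inputs differ in length."""
    if len(values1) != len(values2):
        return math.inf
    return sum(inner_distance(a, b) for a, b in zip(values1, values2))


series1 = [1.2, 0.8]
series2 = [0.3, 1.4]
total = sum_of(series1, series2, exact_abs_diff)
print(total)  # 27021597764222973/18014398509481984
```

Summing the floats directly would round at each step; keeping every term as a Fraction is what makes the final value exact.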
- Parameters:
inner_metric (Union[AbsoluteDifference, SymmetricDifference, HammingDistance, IfGroupedBy])
- property inner_metric: AbsoluteDifference | SymmetricDifference | HammingDistance | IfGroupedBy#
Returns metric to be used for summing.
- Return type:
Union[AbsoluteDifference, SymmetricDifference, HammingDistance, IfGroupedBy]
- __init__(inner_metric)#
Constructor.
- Parameters:
inner_metric (Union[AbsoluteDifference, SymmetricDifference, HammingDistance, IfGroupedBy]) – Metric to be applied to the components.
- validate(value)#
Raises an error if value is not a valid distance.
value must be a valid distance for inner_metric.
- Parameters:
value (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
value2 (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- class RootSumOfSquared(inner_metric)#
Bases:
AggregationMetric
The square root of the sum of the squares of component distances.
These components may be elements of a series, groups of a grouped dataframe, or elements of a list. This metric is parameterized by an inner_metric that is used to compute the distances of the components.
Example
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import SparkColumnsDescriptor
>>> from tmlt.core.domains.spark_domains import (
...     SparkIntegerColumnDescriptor,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> # Symmetric difference on SparkGroupedDataFrame
>>> group_keys = spark.createDataFrame(pd.DataFrame({"A": [1, 2]}))
>>> domain = SparkGroupedDataFrameDomain(
...     {
...         "A": SparkIntegerColumnDescriptor(),
...         "B": SparkIntegerColumnDescriptor(),
...     },
...     ["A"],
... )
>>> df1 = GroupedDataFrame(
...     spark.createDataFrame(
...         pd.DataFrame({"A": [1, 1, 2, 3], "B": [1, 1, 2, 4]})
...     ),
...     group_keys,
... )
>>> df2 = GroupedDataFrame(
...     spark.createDataFrame(
...         pd.DataFrame({"A": [1, 2, 2, 3], "B": [1, 3, 4, 5]})
...     ),
...     group_keys,
... )
>>> RootSumOfSquared(SymmetricDifference()).distance(df1, df2, domain)
sqrt(10)
>>> # Using HammingDistance gives a distance of infinity since the groups
>>> # are different sizes, despite the fact that the two dataframes are the
>>> # same size.
>>> RootSumOfSquared(HammingDistance()).distance(df1, df2, domain)
oo
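The sqrt(10) in the example comes from per-group symmetric differences of 1 and 3. The sketch below recomputes them on plain tuples, with the groups written out by hand. The function names are hypothetical, the result is a float approximation rather than the exact value the library reports, and this is not the library's implementation.

```python
import math
from collections import Counter


def symmetric_difference(records1, records2):
    """Number of records that appear in only one of the two multisets."""
    c1, c2 = Counter(records1), Counter(records2)
    return sum(((c1 - c2) + (c2 - c1)).values())


def root_sum_of_squared(component_distances):
    """Square root of the sum of squared component distances (as a float)."""
    return math.sqrt(sum(d * d for d in component_distances))


# Rows are (A, B) tuples, already split by group key A in {1, 2}.
# Rows with A == 3 are dropped because 3 is not a group key.
groups1 = {1: [(1, 1), (1, 1)], 2: [(2, 2)]}
groups2 = {1: [(1, 1)], 2: [(2, 3), (2, 4)]}
per_group = [symmetric_difference(groups1[k], groups2[k]) for k in (1, 2)]
result = root_sum_of_squared(per_group)
print(per_group)  # [1, 3]
```

The same per-group distances summed directly give 4, which matches the SumOf result for the same data.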
- Parameters:
inner_metric (Union[AbsoluteDifference, SymmetricDifference, HammingDistance, IfGroupedBy])
- property inner_metric: AbsoluteDifference | SymmetricDifference | HammingDistance | IfGroupedBy#
Returns metric to be used for summing.
- Return type:
Union[AbsoluteDifference, SymmetricDifference, HammingDistance, IfGroupedBy]
- __init__(inner_metric)#
Constructor.
- Parameters:
inner_metric (Union[AbsoluteDifference, SymmetricDifference, HammingDistance, IfGroupedBy]) – Metric to be applied to the components.
- validate(value)#
Raises an error if value is not a valid distance.
value must be a nonnegative real or infinity.
- Parameters:
value (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
value2 (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- class OnColumn(column, metric)#
Bases:
ExactNumberMetric
The value of a metric applied to a single column treated as a vector.
Example
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import (
...     SparkIntegerColumnDescriptor,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> domain = SparkDataFrameDomain(
...     {
...         "A": SparkIntegerColumnDescriptor(),
...         "B": SparkIntegerColumnDescriptor(),
...     }
... )
>>> value1 = spark.createDataFrame(
...     pd.DataFrame({"A": [1, 23], "B": [3, 1]})
... )
>>> value2 = spark.createDataFrame(
...     pd.DataFrame({"A": [2, 20], "B": [1, 8]})
... )
>>> OnColumn("A", SumOf(AbsoluteDifference())).distance(value1, value2, domain)
4
>>> OnColumn("B", RootSumOfSquared(AbsoluteDifference())).distance(
...     value1, value2, domain
... )
sqrt(53)
- Parameters:
column (str)
metric (Union[SumOf, RootSumOfSquared])
- property metric: SumOf | RootSumOfSquared#
Return the metric to apply.
- Return type:
Union[SumOf, RootSumOfSquared]
- __init__(column, metric)#
Constructor.
- Parameters:
column (str) – The column to apply the metric to.
metric (Union[SumOf, RootSumOfSquared]) – The metric to apply.
- validate(value)#
Raises an error if value is not a valid distance.
value must be a valid distance for metric.
- Parameters:
value (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (tmlt.core.utils.exact_number.ExactNumberInput)
value2 (tmlt.core.utils.exact_number.ExactNumberInput)
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- class OnColumns(on_columns)#
Bases:
Metric
A tuple containing the values of multiple OnColumn metrics.
Example
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import (
...     SparkIntegerColumnDescriptor,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> domain = SparkDataFrameDomain(
...     {
...         "A": SparkIntegerColumnDescriptor(),
...         "B": SparkIntegerColumnDescriptor(),
...     }
... )
>>> metric = OnColumns(
...     [
...         OnColumn("A", SumOf(AbsoluteDifference())),
...         OnColumn("B", RootSumOfSquared(AbsoluteDifference())),
...     ]
... )
>>> value1 = spark.createDataFrame(
...     pd.DataFrame({"A": [1, 23], "B": [3, 1]})
... )
>>> value2 = spark.createDataFrame(
...     pd.DataFrame({"A": [2, 20], "B": [1, 8]})
... )
>>> metric.distance(value1, value2, domain)
(4, sqrt(53))
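The tuple in the example holds one distance per column, each computed by treating that column as a vector. The sketch below reproduces the arithmetic with plain dictionaries of lists standing in for dataframes. All function names are hypothetical, the square root is a float approximation of the exact sqrt(53), and this is not the library's implementation.

```python
import math


def sum_of_abs(col1, col2):
    """Sum of absolute differences between paired column entries."""
    return sum(abs(a - b) for a, b in zip(col1, col2))


def root_sum_of_squared_abs(col1, col2):
    """Square root of the sum of squared differences between paired entries."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(col1, col2)))


def on_columns(table1, table2, column_metrics):
    """Apply each (column, metric) pair and collect the distances in a tuple."""
    return tuple(metric(table1[c], table2[c]) for c, metric in column_metrics)


value1 = {"A": [1, 23], "B": [3, 1]}
value2 = {"A": [2, 20], "B": [1, 8]}
distance = on_columns(
    value1, value2, [("A", sum_of_abs), ("B", root_sum_of_squared_abs)]
)
print(distance[0])  # 4
```

Column A contributes |1 - 2| + |23 - 20| = 4, and column B contributes the square root of 2² + 7² = 53.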
- Parameters:
on_columns (Sequence[OnColumn])
- property on_columns: List[OnColumn]#
Return the OnColumn metrics to apply.
- Return type:
List[OnColumn]
- __init__(on_columns)#
Constructor.
- validate(value)#
Raises an error if value is not a valid distance.
value must be a tuple with one value for each metric in on_columns.
Each value must be a valid distance for the corresponding metric.
- Parameters:
value (Tuple[tmlt.core.utils.exact_number.ExactNumberInput, Ellipsis]) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (Tuple[tmlt.core.utils.exact_number.ExactNumberInput, Ellipsis]) – A distance between two datasets under this metric.
value2 (Tuple[tmlt.core.utils.exact_number.ExactNumberInput, Ellipsis]) – A distance between two datasets under this metric.
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
Tuple[tmlt.core.utils.exact_number.ExactNumber, Ellipsis]
- class IfGroupedBy(column, inner_metric)#
Bases:
ExactNumberMetric
Distance between two DataFrames that shall be grouped by a given attribute.
This metric is an upper bound on the distance for any fixed set of grouping keys. This assumes that the distance between two empty groups is zero, and the inner metric must satisfy this property.
The grouping column cannot contain floating point values.
Examples
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import (
...     SparkIntegerColumnDescriptor,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> domain = SparkDataFrameDomain(
...     {
...         "A": SparkIntegerColumnDescriptor(),
...         "B": SparkIntegerColumnDescriptor(),
...         "C": SparkIntegerColumnDescriptor(),
...     },
... )
>>> metric = IfGroupedBy("C", RootSumOfSquared(SymmetricDifference()))
>>> value1 = spark.createDataFrame(
...     pd.DataFrame({"A": [1, 1, 3], "B": [2, 1, 4], "C": [1, 1, 2]}),
... )
>>> value2 = spark.createDataFrame(
...     pd.DataFrame({"A": [2, 1], "B": [1, 1], "C": [1, 1]})
... )
>>> metric.distance(value1, value2, domain)
sqrt(5)
>>> metric = IfGroupedBy("C", SymmetricDifference())
>>> value1 = spark.createDataFrame(
...     pd.DataFrame({"A": [1, 1, 3], "B": [2, 1, 4], "C": [1, 1, 2]}),
... )
>>> value2 = spark.createDataFrame(
...     pd.DataFrame({"A": [1, 1], "B": [2, 1], "C": [1, 1]})
... )
>>> metric.distance(value1, value2, domain)
1
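Both example results can be traced by grouping the rows on column C and comparing the groups key by key. The sketch below does this on plain tuples. The function names are hypothetical, the square root is a float approximation of the exact sqrt(5), and the second function applies the 0/1/2 rule that SymmetricDifference uses for grouped data. It is an illustration of the documented examples, not the library's implementation.

```python
import math
from collections import Counter, defaultdict


def group_by(rows, key_index):
    """Map each key value to the multiset of rows carrying that key."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[key_index]][row] += 1
    return groups


def grouped_root_sum_of_squared(rows1, rows2, key_index):
    """Root of the summed squares of per-group symmetric differences."""
    g1, g2 = group_by(rows1, key_index), group_by(rows2, key_index)
    squares = 0
    for key in set(g1) | set(g2):
        a, b = g1.get(key, Counter()), g2.get(key, Counter())
        diff = sum(((a - b) + (b - a)).values())
        squares += diff * diff
    return math.sqrt(squares)


def grouped_symmetric_difference(rows1, rows2, key_index):
    """Per group: 0 if equal, 1 if exactly one side is empty, else 2."""
    g1, g2 = group_by(rows1, key_index), group_by(rows2, key_index)
    total = 0
    for key in set(g1) | set(g2):
        a, b = g1.get(key, Counter()), g2.get(key, Counter())
        if a != b:
            total += 1 if (not a or not b) else 2
    return total


# Rows are (A, B, C) tuples from the example; C is at index 2.
value1 = [(1, 2, 1), (1, 1, 1), (3, 4, 2)]
value2 = [(2, 1, 1), (1, 1, 1)]
value3 = [(1, 2, 1), (1, 1, 1)]
print(grouped_symmetric_difference(value1, value3, key_index=2))  # 1
```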
- Parameters:
column (str)
inner_metric (Union[SumOf, RootSumOfSquared, SymmetricDifference])
- property inner_metric: SumOf | RootSumOfSquared | SymmetricDifference#
Metric to be applied for corresponding groups.
- Return type:
Union[SumOf, RootSumOfSquared, SymmetricDifference]
- __init__(column, inner_metric)#
Constructor.
- Parameters:
column (str) – Column that the DataFrame shall be grouped by.
inner_metric (Union[SumOf, RootSumOfSquared, SymmetricDifference]) – Metric to be applied to corresponding groups in the DataFrame.
- validate(value)#
Raises an error if value is not a valid distance.
value must be a valid distance for inner_metric.
- Parameters:
value (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
value2 (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber
- class DictMetric(key_to_metric)#
Bases:
Metric
Distance between two dictionaries with identical sets of keys.
Example
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import (
...     SparkIntegerColumnDescriptor,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> metric = DictMetric(
...     {"x": AbsoluteDifference(), "y": SymmetricDifference()}
... )
>>> domain = DictDomain(
...     {
...         "x": NumpyIntegerDomain(),
...         "y": SparkDataFrameDomain(
...             {
...                 "A": SparkIntegerColumnDescriptor(),
...                 "B": SparkIntegerColumnDescriptor(),
...             }
...         ),
...     }
... )
>>> df1 = spark.createDataFrame(
...     pd.DataFrame({"A": [1, 1, 3], "B": [2, 1, 4]})
... )
>>> df2 = spark.createDataFrame(pd.DataFrame({"A": [2, 1], "B": [1, 1]}))
>>> value1 = {"x": np.int64(1), "y": df1}
>>> value2 = {"x": np.int64(10), "y": df2}
>>> metric.distance(value1, value2, domain)
{'x': 9, 'y': 3}
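The result in the example is a dictionary with one distance per key, each computed by the metric registered for that key. The sketch below mirrors this with plain functions standing in for metrics and lists of tuples standing in for dataframes. The names dict_distance and symmetric_difference are hypothetical, raising ValueError on mismatched keys is an assumption about error handling, and this is not the library's implementation.

```python
from collections import Counter


def symmetric_difference(records1, records2):
    """Number of records that appear in only one of the two multisets."""
    c1, c2 = Counter(records1), Counter(records2)
    return sum(((c1 - c2) + (c2 - c1)).values())


def dict_distance(value1, value2, key_to_distance):
    """Apply the distance function registered for each key."""
    if set(value1) != set(key_to_distance) or set(value2) != set(key_to_distance):
        raise ValueError("both values must have exactly the registered keys")
    return {k: f(value1[k], value2[k]) for k, f in key_to_distance.items()}


key_to_distance = {"x": lambda a, b: abs(a - b), "y": symmetric_difference}
value1 = {"x": 1, "y": [(1, 2), (1, 1), (3, 4)]}
value2 = {"x": 10, "y": [(2, 1), (1, 1)]}
print(dict_distance(value1, value2, key_to_distance))  # {'x': 9, 'y': 3}
```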
- Parameters:
key_to_metric (Mapping[Any, Metric])
- property key_to_metric: Dict[Any, Metric]#
Returns mapping from keys to metrics.
- Return type:
Dict[Any, Metric]
- __init__(key_to_metric)#
Constructor.
- validate(value)#
Raises an error if value is not a valid distance.
value must be a dictionary with the same keys as key_to_metric.
Each value in the dictionary must be a valid distance under the corresponding metric.
- Parameters:
value (Dict[Any, Any]) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (Dict[Any, Any]) – A distance between two datasets under this metric.
value2 (Dict[Any, Any]) – A distance between two datasets under this metric.
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
Dict[Any, Any]
- __getitem__(key)#
Returns metric associated with given key.
- Parameters:
key (Any)
- Return type:
Metric
- class AddRemoveKeys(df_to_key_column)#
Bases:
Metric
The number of keys that dictionaries of dataframes differ by.
This metric can be thought of as an extension of IfGroupedBy with inner metric SymmetricDifference, except it is applied to a dictionary of dataframes, instead of a single dataframe.
AddRemoveKeys(X) can be described in the following way:
Sum over each key that appears in the key column in either neighbor, where the key column for dataframe df is given by X[df]:
0 if both neighbors “match” for X[df] = key
1 if only one neighbor has records for X[df] = key
2 if both neighbors have records for X[df] = key, but they don’t “match”
The key column cannot contain floating point values, and all dataframes must have the same type for the key column. The key columns for the different dataframes may have different names.
Examples
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.core.domains.spark_domains import (
...     SparkIntegerColumnDescriptor,
...     SparkStringColumnDescriptor,
... )
>>> spark = SparkSession.builder.getOrCreate()
>>> domain = DictDomain(
...     {
...         1: SparkDataFrameDomain(
...             {
...                 "A": SparkIntegerColumnDescriptor(),
...                 "B": SparkIntegerColumnDescriptor(),
...             },
...         ),
...         2: SparkDataFrameDomain(
...             {
...                 "C": SparkIntegerColumnDescriptor(),
...                 "D": SparkStringColumnDescriptor(),
...             },
...         ),
...     }
... )
>>> metric = AddRemoveKeys({1: "A", 2: "C"})
>>> # key=1 matches, key=2 is only in value1, key=3 is only in value2, key=4
>>> # differs
>>> value1 = {
...     1: spark.createDataFrame(
...         pd.DataFrame(
...             {
...                 "A": [1, 1, 2],
...                 "B": [1, 1, 1],
...             }
...         )
...     ),
...     2: spark.createDataFrame(
...         pd.DataFrame(
...             {
...                 "C": [1, 4],
...                 "D": ["1", "1"],
...             }
...         )
...     )
... }
>>> value2 = {
...     1: spark.createDataFrame(
...         pd.DataFrame(
...             {
...                 "A": [1, 1, 3],
...                 "B": [1, 1, 1],
...             }
...         )
...     ),
...     2: spark.createDataFrame(
...         pd.DataFrame(
...             {
...                 "C": [1, 4],
...                 "D": ["1", "2"],
...             }
...         )
...     )
... }
>>> metric.distance(value1, value2, domain)
4
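The 0/1/2 rule can be traced on the example data with plain Python. In the sketch below, lists of tuples stand in for dataframes and key_index gives the position of each dataframe's key column. It assumes that two neighbors "match" for a key when every dataframe holds the same multiset of records for that key. The function names are hypothetical, and this is not the library's implementation.

```python
from collections import Counter


def records_for_key(frames, key_index, key):
    """Per-dataframe multiset of records whose key column equals `key`."""
    return {
        name: Counter(r for r in rows if r[key_index[name]] == key)
        for name, rows in frames.items()
    }


def add_remove_keys(frames1, frames2, key_index):
    """Sum 0, 1, or 2 over every key that appears in either neighbor."""
    keys = {
        r[key_index[name]]
        for frames in (frames1, frames2)
        for name, rows in frames.items()
        for r in rows
    }
    total = 0
    for key in keys:
        r1 = records_for_key(frames1, key_index, key)
        r2 = records_for_key(frames2, key_index, key)
        if r1 == r2:
            continue  # the neighbors match for this key
        has1 = any(r1.values())  # an empty Counter is falsy
        has2 = any(r2.values())
        total += 2 if (has1 and has2) else 1
    return total


key_index = {1: 0, 2: 0}  # key columns "A" and "C" are both at position 0
value1 = {1: [(1, 1), (1, 1), (2, 1)], 2: [(1, "1"), (4, "1")]}
value2 = {1: [(1, 1), (1, 1), (3, 1)], 2: [(1, "1"), (4, "2")]}
print(add_remove_keys(value1, value2, key_index))  # 4
```

Key 1 matches (0), keys 2 and 3 each appear in only one neighbor (1 + 1), and key 4 appears in both with different records (2).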
- Parameters:
df_to_key_column (Mapping[Any, str])
- __init__(df_to_key_column)#
Constructor.
- validate(value)#
Raises an error if value is not a valid distance.
value must be a nonnegative real or infinity.
- Parameters:
value (tmlt.core.utils.exact_number.ExactNumberInput) – A distance between two datasets under this metric.
- Return type:
None
- compare(value1, value2)#
Returns True if value1 is less than or equal to value2.
- Parameters:
value1 (tmlt.core.utils.exact_number.ExactNumberInput)
value2 (tmlt.core.utils.exact_number.ExactNumberInput)
- Return type:
bool
- supports_domain(domain)#
Return True if the metric is implemented for the passed domain.
- Parameters:
domain (tmlt.core.domains.base.Domain) – The domain to check against.
- Return type:
bool
- distance(value1, value2, domain)#
Return the metric distance between two elements of a supported domain.
- Parameters:
value1 (Any) – An element of the domain.
value2 (Any) – An element of the domain.
domain (tmlt.core.domains.base.Domain) – A domain compatible with the metric.
- Return type:
tmlt.core.utils.exact_number.ExactNumber