spark_domains#
Domains for Spark datatypes.
Data#
- SparkColumnsDescriptor#
Mapping from column name to SparkColumnDescriptor.
Functions#
- convert_spark_schema: Returns mapping from column name to SparkColumnDescriptor.
- convert_pandas_domain: Returns a mapping from column name to SparkColumnDescriptor.
- convert_numpy_domain: Returns a SparkColumnDescriptor for a NumpyDomain.
- convert_spark_schema(spark_schema)#
Returns mapping from column name to SparkColumnDescriptor.
- Parameters
spark_schema (pyspark.sql.types.StructType) –
- Return type
SparkColumnsDescriptor
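A minimal usage sketch for convert_spark_schema, using only the names documented on this page (the column values are illustrative):

from pyspark.sql.types import LongType, StringType, StructField, StructType
from tmlt.core.domains.spark_domains import convert_spark_schema

spark_schema = StructType(
    [
        StructField("age", LongType(), nullable=False),
        StructField("name", StringType(), nullable=True),
    ]
)
columns = convert_spark_schema(spark_schema)
# columns maps each column name to a SparkColumnDescriptor, e.g. the
# "name" entry should be a string descriptor permitting nulls.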
- convert_pandas_domain(pandas_domain)#
Returns a mapping from column name to SparkColumnDescriptor.
- Parameters
pandas_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain) –
- Return type
SparkColumnsDescriptor
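A hedged sketch for convert_pandas_domain. PandasDataFrameDomain comes from the parameter annotation above; PandasSeriesDomain and its constructor shape are assumptions about tmlt.core.domains.pandas_domains:

from tmlt.core.domains.numpy_domains import NumpyIntegerDomain
from tmlt.core.domains.pandas_domains import PandasDataFrameDomain, PandasSeriesDomain
from tmlt.core.domains.spark_domains import convert_pandas_domain

# Assumed constructor shape: column name -> series domain over a NumPy domain.
pandas_domain = PandasDataFrameDomain({"age": PandasSeriesDomain(NumpyIntegerDomain())})
columns = convert_pandas_domain(pandas_domain)
# columns maps "age" to an integer column descriptor.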
- convert_numpy_domain(numpy_domain)#
Returns a SparkColumnDescriptor for a NumpyDomain.
- Parameters
numpy_domain (tmlt.core.domains.numpy_domains.NumpyDomain) –
- Return type
SparkColumnDescriptor
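A sketch for convert_numpy_domain; NumpyIntegerDomain is assumed to be a concrete NumpyDomain in tmlt.core.domains.numpy_domains, per the parameter annotation:

from tmlt.core.domains.numpy_domains import NumpyIntegerDomain
from tmlt.core.domains.spark_domains import convert_numpy_domain

descriptor = convert_numpy_domain(NumpyIntegerDomain())
# descriptor is a SparkColumnDescriptor describing integer columns.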
Classes#
- SparkColumnDescriptor: Base class for describing Spark column types.
- SparkIntegerColumnDescriptor: Describes an integer attribute in Spark.
- SparkFloatColumnDescriptor: Describes a float attribute in Spark.
- SparkStringColumnDescriptor: Describes a string attribute in Spark.
- SparkDateColumnDescriptor: Describes a date attribute in Spark.
- SparkTimestampColumnDescriptor: Describes a timestamp attribute in Spark.
- SparkRowDomain: Domain of Spark DataFrame rows.
- SparkDataFrameDomain: Domain of Spark DataFrames.
- SparkGroupedDataFrameDomain: Domain of grouped DataFrames.
- class SparkColumnDescriptor#
Bases:
abc.ABC
Base class for describing Spark column types.
- allow_null#
If True, null values are permitted in the domain.
- abstract to_numpy_domain(self)#
Returns corresponding NumPy domain.
- Return type
NumpyDomain
- validate_column(self, sdf, col_name)#
Raises error if not all values in given DataFrame column match descriptor.
- Parameters
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- abstract valid_py_value(self, val)#
Returns True if val is valid for described Spark column.
- Parameters
val (Any) –
- Return type
bool
- property data_type(self)#
Returns data type associated with Spark column.
- Return type
pyspark.sql.types.DataType
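Since SparkColumnDescriptor is abstract, the interface is best exercised through a concrete subclass; a minimal sketch (the column name is illustrative):

from pyspark.sql import SparkSession
from tmlt.core.domains.spark_domains import SparkStringColumnDescriptor

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("a",), ("b",)], schema=["letter"])

descriptor = SparkStringColumnDescriptor(allow_null=False)
descriptor.validate_column(sdf, "letter")  # passes: no null values present
print(descriptor.data_type)  # the Spark type for this descriptor, e.g. StringType()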
- class SparkIntegerColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes an integer attribute in Spark.
- SIZE_TO_TYPE#
Mapping from size to Spark type.
- SIZE_TO_MIN_MAX#
Mapping from size to tuple of minimum and maximum value allowed.
- allow_null: bool = False#
If True, null values are permitted in the domain.
- size: int = 64#
Number of bits a member of the domain occupies. Must be 32 or 64.
- to_numpy_domain(self)#
Returns corresponding NumPy domain.
- Return type
NumpyIntegerDomain
- valid_py_value(self, val)#
Returns True if value is a valid Python value for the descriptor.
- Parameters
val (Any) –
- property data_type(self)#
Returns data type associated with Spark column.
- Return type
pyspark.sql.types.DataType
- validate_column(self, sdf, col_name)#
Raises error if not all values in given DataFrame column match descriptor.
- Parameters
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
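A usage sketch built from the attributes documented above (allow_null, size); the behavior in the comments follows the SIZE_TO_MIN_MAX bounds:

from tmlt.core.domains.spark_domains import SparkIntegerColumnDescriptor

descriptor = SparkIntegerColumnDescriptor(allow_null=False, size=64)
assert descriptor.valid_py_value(42)
assert not descriptor.valid_py_value(None)     # nulls are not allowed
assert not descriptor.valid_py_value(2 ** 63)  # outside the 64-bit range
numpy_domain = descriptor.to_numpy_domain()    # corresponding NumPy domain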
- class SparkFloatColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes a float attribute in Spark.
- SIZE_TO_TYPE#
Mapping from size to Spark type.
- allow_nan: bool = False#
If True, NaNs are permitted in the domain.
- allow_inf: bool = False#
If True, infs are permitted in the domain.
- allow_null: bool = False#
If True, null values are permitted in the domain.
Note
Null values are not supported in pandas.
- size: int = 64#
Number of bits a member of the domain occupies. Must be 32 or 64.
- to_numpy_domain(self)#
Returns corresponding NumPy domain.
- Return type
NumpyFloatDomain
- validate_column(self, sdf, col_name)#
Raises error if not all values in given DataFrame column match descriptor.
- Parameters
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- valid_py_value(self, val)#
Returns True if value is a valid Python value for the descriptor.
In particular, this returns True only if one of the following holds (see the sketch after this entry):
- val is float("nan") and NaN values are allowed.
- val is float("inf") or float("-inf"), and inf values are allowed.
- val is a float that can be represented in size bits.
- val is None and nulls are allowed in the domain.
- Parameters
val (Any) –
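A sketch illustrating the four conditions listed above:

from tmlt.core.domains.spark_domains import SparkFloatColumnDescriptor

descriptor = SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=False)
assert descriptor.valid_py_value(1.5)               # representable in 64 bits
assert descriptor.valid_py_value(float("nan"))      # NaN is allowed
assert not descriptor.valid_py_value(float("inf"))  # inf is not allowed
assert not descriptor.valid_py_value(None)          # null is not allowed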
- property data_type(self)#
Returns data type associated with Spark column.
- Return type
pyspark.sql.types.DataType
- class SparkStringColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes a string attribute in Spark.
- allow_null: bool = False#
If True, null values are permitted in the domain.
- to_numpy_domain(self)#
Returns corresponding NumPy domain.
- valid_py_value(self, val)#
Returns True if value is a valid Python value for the descriptor.
- Parameters
val (Any) –
- property data_type(self)#
Returns data type associated with Spark column.
- Return type
pyspark.sql.types.DataType
- validate_column(self, sdf, col_name)#
Raises error if not all values in given DataFrame column match descriptor.
- Parameters
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- class SparkDateColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes a date attribute in Spark.
- allow_null: bool = False#
If True, null values are permitted in the domain.
- to_numpy_domain(self)#
Returns corresponding NumPy domain.
Note
Date types are not supported in NumPy; this method always raises an exception.
- valid_py_value(self, val)#
Returns True if the value is a valid Python value for the descriptor.
- Parameters
val (Any) –
- property data_type(self)#
Returns data type associated with Spark column.
- Return type
pyspark.sql.types.DataType
- validate_column(self, sdf, col_name)#
Raises error if not all values in given DataFrame column match descriptor.
- Parameters
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- class SparkTimestampColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes a timestamp attribute in Spark.
- allow_null: bool = False#
If True, null values are permitted in the domain.
- to_numpy_domain(self)#
Returns corresponding NumPy domain.
Note
Timestamp types are not supported in NumPy; this method always raises an exception.
- valid_py_value(self, val)#
Returns True if the value is a valid Python value for the descriptor.
- Parameters
val (Any) –
- property data_type(self)#
Returns data type associated with Spark column.
- Return type
pyspark.sql.types.DataType
- validate_column(self, sdf, col_name)#
Raises error if not all values in given DataFrame column match descriptor.
- Parameters
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
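A sketch covering both descriptors; per the notes above, to_numpy_domain always raises for these types (the exact exception type is not documented here):

import datetime
from tmlt.core.domains.spark_domains import (
    SparkDateColumnDescriptor,
    SparkTimestampColumnDescriptor,
)

date_descriptor = SparkDateColumnDescriptor(allow_null=True)
assert date_descriptor.valid_py_value(datetime.date(2024, 1, 1))
assert date_descriptor.valid_py_value(None)  # nulls allowed here

timestamp_descriptor = SparkTimestampColumnDescriptor()
try:
    timestamp_descriptor.to_numpy_domain()
except Exception:  # documented to always raise
    pass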
- class SparkRowDomain(schema)#
Bases:
tmlt.core.domains.base.Domain
Domain of Spark DataFrame rows.
- Parameters
schema (SparkColumnsDescriptor) –
- __init__(schema)#
Constructor.
- Parameters
schema (Mapping[str, SparkColumnDescriptor]) – Mapping from column names to column descriptors.
- property schema(self)#
Returns mapping from column names to column descriptors.
- Return type
SparkColumnsDescriptor
- abstract validate(self, value)#
Raises error if value is not a row with matching schema.
- Parameters
value (Any) –
- abstract __contains__(self, value)#
Returns True if value is a row with matching schema.
- Parameters
value (Any) –
- Return type
bool
- __eq__(self, other)#
Returns True if the two domains are equivalent.
- Parameters
other (Any) –
- Return type
bool
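A construction sketch; note that validate and __contains__ are documented as abstract on this class, so membership checks belong to subclasses:

from tmlt.core.domains.spark_domains import (
    SparkIntegerColumnDescriptor,
    SparkRowDomain,
    SparkStringColumnDescriptor,
)

row_domain = SparkRowDomain(
    {
        "age": SparkIntegerColumnDescriptor(),
        "name": SparkStringColumnDescriptor(allow_null=True),
    }
)
assert set(row_domain.schema) == {"age", "name"}  # schema keys are column names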
- class SparkDataFrameDomain(schema)#
Bases:
tmlt.core.domains.base.Domain
Domain of Spark DataFrames.
- Parameters
schema (SparkColumnsDescriptor) –
- __init__(schema)#
Constructor.
- Parameters
schema (Mapping[str, SparkColumnDescriptor]) – Mapping from column names to column descriptors.
- property schema(self)#
Returns mapping from column names to column descriptors.
- Return type
SparkColumnsDescriptor
- validate(self, value)#
Raises error if value is not a DataFrame with matching schema.
- Parameters
value (Any) –
- __eq__(self, other)#
Returns True if the two domains are equivalent.
- Parameters
other (Any) –
- Return type
bool
- __getitem__(self, col_name)#
Returns column descriptor for given column.
- Parameters
col_name (str) –
- Return type
SparkColumnDescriptor
- classmethod from_spark_schema(cls, schema)#
Returns a SparkDataFrameDomain constructed from a Spark schema.
Note
If schema contains float types, NaNs and infs are allowed, since a Spark schema places no restrictions on these.
- Parameters
schema (pyspark.sql.types.StructType) – Spark schema for constructing domain.
- Return type
SparkDataFrameDomain
- property spark_schema(self)#
Returns Spark schema object according to domain.
Note
There isn’t a one-to-one correspondence between Spark schema objects and SparkDataFrameDomain objects: the domains encode additional information, such as whether NaNs or infs are allowed in float columns, that cannot be represented with a Spark schema (StructType) object, and more such information may be added in the future.
- Return type
pyspark.sql.types.StructType
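An end-to-end sketch: construct a domain, validate a DataFrame built with the domain's own Spark schema, and round-trip through from_spark_schema (column names and data are illustrative):

from pyspark.sql import SparkSession
from tmlt.core.domains.spark_domains import (
    SparkDataFrameDomain,
    SparkFloatColumnDescriptor,
    SparkStringColumnDescriptor,
)

spark = SparkSession.builder.getOrCreate()
domain = SparkDataFrameDomain(
    {
        "name": SparkStringColumnDescriptor(),
        "score": SparkFloatColumnDescriptor(),
    }
)
sdf = spark.createDataFrame([("a", 1.0), ("b", 2.5)], schema=domain.spark_schema)
domain.validate(sdf)  # raises if sdf does not match the domain
score_descriptor = domain["score"]  # __getitem__ returns the column descriptor

# Per the note on from_spark_schema, the reconstructed domain allows NaNs
# and infs in float columns, so it may be strictly larger than `domain`.
roundtrip = SparkDataFrameDomain.from_spark_schema(domain.spark_schema)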
- class SparkGroupedDataFrameDomain(schema, group_keys)#
Bases:
tmlt.core.domains.base.Domain
Domain of grouped DataFrames.
- Parameters
schema (SparkColumnsDescriptor) –
group_keys (pyspark.sql.DataFrame) –
- __init__(schema, group_keys)#
Constructor.
- Parameters
schema (Mapping[str, SparkColumnDescriptor]) – Mapping from column names to column descriptors.
group_keys (pyspark.sql.DataFrame) – DataFrame whose rows are the group keys.
- property group_keys(self)#
Returns DataFrame containing group keys as rows.
- Return type
pyspark.sql.DataFrame
- property schema(self)#
Returns mapping from column names to column descriptors.
- Return type
SparkColumnsDescriptor
- property spark_schema(self)#
Returns Spark schema object according to domain.
Note
There isn’t a one-to-one correspondence between Spark schema objects and SparkDataFrameDomain objects: the domains encode additional information, such as whether NaNs or infs are allowed in float columns, that cannot be represented with a Spark schema (StructType) object, and more such information may be added in the future.
- Return type
pyspark.sql.types.StructType
- validate(self, value)#
Raises error if value is not a GroupedDataFrame with matching group_keys.
- Parameters
value (Any) –
- get_group_domain(self)#
Return the domain for one of the groups.
- Return type
SparkDataFrameDomain
- __eq__(self, other)#
Return True if the schemas and group keys are identical.
- Parameters
other (Any) –
- Return type
bool
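A construction sketch; group_keys is a Spark DataFrame whose rows enumerate the expected groups, built here with an explicit schema so its column type matches the corresponding descriptor:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType
from tmlt.core.domains.spark_domains import (
    SparkGroupedDataFrameDomain,
    SparkIntegerColumnDescriptor,
    SparkStringColumnDescriptor,
)

spark = SparkSession.builder.getOrCreate()
schema = {
    "city": SparkStringColumnDescriptor(),
    "count": SparkIntegerColumnDescriptor(),
}
group_keys = spark.createDataFrame(
    [("x",), ("y",)],
    schema=StructType([StructField("city", StringType(), nullable=False)]),
)
grouped_domain = SparkGroupedDataFrameDomain(schema, group_keys)
group_domain = grouped_domain.get_group_domain()  # domain for a single group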