spark_domains#
Domains for Spark datatypes.
Data#
- SparkColumnsDescriptor#
Mapping from column name to SparkColumnDescriptor.
Functions#
- convert_spark_schema – Returns a mapping from column name to SparkColumnDescriptor.
- convert_pandas_domain – Returns a mapping from column name to SparkColumnDescriptor.
- convert_numpy_domain – Returns a SparkColumnDescriptor for a NumpyDomain.
- convert_spark_schema(spark_schema)#
Returns a mapping from column name to SparkColumnDescriptor.
- Parameters:
spark_schema (pyspark.sql.types.StructType)
- Return type:
SparkColumnsDescriptor
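A minimal sketch of converting a Spark schema into column descriptors; the column names and types are illustrative:

```python
from pyspark.sql.types import LongType, StringType, StructField, StructType
from tmlt.core.domains.spark_domains import convert_spark_schema

spark_schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])
descriptors = convert_spark_schema(spark_schema)
# Expected result: each field mapped to a descriptor, e.g.
# {"id": SparkIntegerColumnDescriptor(allow_null=False, size=64),
#  "name": SparkStringColumnDescriptor(allow_null=True)}
```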
- convert_pandas_domain(pandas_domain)#
Returns a mapping from column name to SparkColumnDescriptor.
- Parameters:
pandas_domain (tmlt.core.domains.pandas_domains.PandasDataFrameDomain)
- Return type:
SparkColumnsDescriptor
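A sketch of the pandas-side conversion, assuming a PandasDataFrameDomain is built from per-column PandasSeriesDomain objects wrapping NumPy domains:

```python
from tmlt.core.domains.numpy_domains import NumpyIntegerDomain, NumpyStringDomain
from tmlt.core.domains.pandas_domains import PandasDataFrameDomain, PandasSeriesDomain
from tmlt.core.domains.spark_domains import convert_pandas_domain

pandas_domain = PandasDataFrameDomain({
    "id": PandasSeriesDomain(NumpyIntegerDomain()),
    "name": PandasSeriesDomain(NumpyStringDomain()),
})
# Expected: {"id": SparkIntegerColumnDescriptor(...), "name": SparkStringColumnDescriptor(...)}
descriptors = convert_pandas_domain(pandas_domain)
```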
- convert_numpy_domain(numpy_domain)#
Returns a SparkColumnDescriptor for a NumpyDomain.
- Parameters:
numpy_domain (tmlt.core.domains.numpy_domains.NumpyDomain)
- Return type:
SparkColumnDescriptor
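For a single column, the NumPy-to-Spark direction can be sketched as:

```python
from tmlt.core.domains.numpy_domains import NumpyFloatDomain
from tmlt.core.domains.spark_domains import convert_numpy_domain

descriptor = convert_numpy_domain(NumpyFloatDomain(allow_nan=True))
# Expected: a SparkFloatColumnDescriptor that also permits NaNs.
```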
Classes#
- SparkColumnDescriptor – Base class for describing Spark column types.
- SparkIntegerColumnDescriptor – Describes an integer attribute in Spark.
- SparkFloatColumnDescriptor – Describes a float attribute in Spark.
- SparkStringColumnDescriptor – Describes a string attribute in Spark.
- SparkDateColumnDescriptor – Describes a date attribute in Spark.
- SparkTimestampColumnDescriptor – Describes a timestamp attribute in Spark.
- SparkRowDomain – Domain of Spark DataFrame rows.
- SparkDataFrameDomain – Domain of Spark DataFrames.
- SparkGroupedDataFrameDomain – Domain of grouped DataFrames.
- class SparkColumnDescriptor#
Bases:
abc.ABC
Base class for describing Spark column types.
- allow_null#
If True, null values are permitted in the domain.
- abstract property data_type: pyspark.sql.types.DataType#
Returns the data type associated with the Spark column.
- Return type:
pyspark.sql.types.DataType
- abstract to_numpy_domain()#
Returns the corresponding NumPy domain.
- Return type:
tmlt.core.domains.numpy_domains.NumpyDomain
- validate_column(sdf, col_name)#
Raises an error if not all values in the given DataFrame column match the descriptor.
- Parameters:
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- Return type:
None
- class SparkIntegerColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes an integer attribute in Spark.
- SIZE_TO_TYPE#
Mapping from size to Spark type.
- SIZE_TO_MIN_MAX#
Mapping from size to tuple of minimum and maximum value allowed.
- property data_type: pyspark.sql.types.DataType#
Returns the data type associated with the Spark column.
- Return type:
pyspark.sql.types.DataType
- to_numpy_domain()#
Returns the corresponding NumPy domain.
- Return type:
tmlt.core.domains.numpy_domains.NumpyIntegerDomain
- valid_py_value(val)#
Returns True if the value is a valid Python value for the descriptor.
- Parameters:
val (Any)
- Return type:
bool
- validate_column(sdf, col_name)#
Raises an error if not all values in the given DataFrame column match the descriptor.
- Parameters:
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- Return type:
None
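A brief sketch of using a concrete descriptor to vet individual values and a DataFrame column; the data, and the assumption that allow_null defaults to False, are illustrative:

```python
from pyspark.sql import SparkSession
from tmlt.core.domains.spark_domains import SparkIntegerColumnDescriptor

spark = SparkSession.builder.getOrCreate()
desc = SparkIntegerColumnDescriptor(size=32)

assert desc.valid_py_value(7)
assert not desc.valid_py_value(2**40)  # outside the 32-bit range
assert not desc.valid_py_value(None)   # nulls not allowed by default

sdf = spark.createDataFrame([(1,), (2,)], "x: int")
desc.validate_column(sdf, "x")  # raises only if some value violates the descriptor
```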
- class SparkFloatColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes a float attribute in Spark.
- SIZE_TO_TYPE#
Mapping from size to Spark type.
- allow_null: bool = False#
If True, null values are permitted in the domain.
Note
Nulls aren’t supported in pandas.
- property data_type: pyspark.sql.types.DataType#
Returns the data type associated with the Spark column.
- Return type:
pyspark.sql.types.DataType
- to_numpy_domain()#
Returns the corresponding NumPy domain.
- Return type:
tmlt.core.domains.numpy_domains.NumpyFloatDomain
- validate_column(sdf, col_name)#
Raises an error if not all values in the given DataFrame column match the descriptor.
- Parameters:
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- Return type:
None
- valid_py_value(val)#
Returns True if the value is a valid Python value for the descriptor.
In particular, this returns True only if one of the following is true:
- val is float("nan") and NaNs are allowed.
- val is float("inf") or float("-inf"), and infinite values are allowed.
- val is a float that can be represented in size bits.
- val is None, and nulls are allowed in the domain.
- Parameters:
val (Any)
- Return type:
bool
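A short sketch of these rules in action, assuming the descriptor exposes allow_nan and allow_inf flags alongside allow_null:

```python
from tmlt.core.domains.spark_domains import SparkFloatColumnDescriptor

desc = SparkFloatColumnDescriptor(allow_nan=True, allow_inf=False, allow_null=False)
assert desc.valid_py_value(1.5)               # an ordinary 64-bit float
assert desc.valid_py_value(float("nan"))      # NaNs explicitly allowed
assert not desc.valid_py_value(float("inf"))  # infs not allowed
assert not desc.valid_py_value(None)          # nulls not allowed
```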
- class SparkStringColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes a string attribute in Spark.
- property data_type: pyspark.sql.types.DataType#
Returns the data type associated with the Spark column.
- Return type:
pyspark.sql.types.DataType
- to_numpy_domain()#
Returns the corresponding NumPy domain.
- Return type:
tmlt.core.domains.numpy_domains.NumpyStringDomain
- valid_py_value(val)#
Returns True if the value is a valid Python value for the descriptor.
- Parameters:
val (Any)
- Return type:
bool
- validate_column(sdf, col_name)#
Raises an error if not all values in the given DataFrame column match the descriptor.
- Parameters:
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- Return type:
None
- class SparkDateColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes a date attribute in Spark.
- property data_type: pyspark.sql.types.DataType#
Returns the data type associated with the Spark column.
- Return type:
pyspark.sql.types.DataType
- to_numpy_domain()#
Returns the corresponding NumPy domain.
Note
Date types are not supported in NumPy; this method always raises an exception.
- Return type:
tmlt.core.domains.numpy_domains.NumpyDomain
- valid_py_value(val)#
Returns True if the value is a valid Python value for the descriptor.
- Parameters:
val (Any)
- Return type:
bool
- validate_column(sdf, col_name)#
Raises an error if not all values in the given DataFrame column match the descriptor.
- Parameters:
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- Return type:
None
- class SparkTimestampColumnDescriptor#
Bases:
SparkColumnDescriptor
Describes a timestamp attribute in Spark.
- property data_type: pyspark.sql.types.DataType#
Returns the data type associated with the Spark column.
- Return type:
pyspark.sql.types.DataType
- to_numpy_domain()#
Returns the corresponding NumPy domain.
Note
Timestamp types are not supported in NumPy; this method always raises an exception.
- Return type:
tmlt.core.domains.numpy_domains.NumpyDomain
- valid_py_value(val)#
Returns True if the value is a valid Python value for the descriptor.
- Parameters:
val (Any)
- Return type:
bool
- validate_column(sdf, col_name)#
Raises an error if not all values in the given DataFrame column match the descriptor.
- Parameters:
sdf (pyspark.sql.DataFrame) – Spark DataFrame to check.
col_name (str) – Name of column in sdf to be checked.
- Return type:
None
- class SparkRowDomain(schema)#
Bases:
tmlt.core.domains.base.Domain
Domain of Spark DataFrame rows.
- Parameters:
schema (SparkColumnsDescriptor)
- property schema: SparkColumnsDescriptor#
Returns mapping from column names to column descriptors.
- Return type:
SparkColumnsDescriptor
- __init__(schema)#
Constructor.
- Parameters:
schema (Mapping[str, SparkColumnDescriptor]) – Mapping from column names to column descriptors.
- abstract validate(value)#
Raises an error if the value is not a row with a matching schema.
- Parameters:
value (Any)
- Return type:
None
- abstract __contains__(value)#
Returns True if the value is a row with a matching schema.
- Parameters:
value (Any)
- Return type:
bool
- class SparkDataFrameDomain(schema)#
Bases:
tmlt.core.domains.base.Domain
Domain of Spark DataFrames.
- Parameters:
schema (SparkColumnsDescriptor)
- property schema: SparkColumnsDescriptor#
Returns mapping from column names to column descriptors.
- Return type:
SparkColumnsDescriptor
- property spark_schema: pyspark.sql.types.StructType#
Returns the Spark schema object corresponding to the domain.
Note
There isn’t a one-to-one correspondence between Spark schema objects and SparkDataFrameDomain objects: the domains encode additional information, such as whether NaNs or infs are allowed in float columns, that cannot be represented in a Spark schema (StructType) object. More such information may be added in the future.
- Return type:
pyspark.sql.types.StructType
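To illustrate the note, here is a sketch in which two distinct domains collapse to the same StructType (the constructor flags are assumed from the descriptor fields above):

```python
from tmlt.core.domains.spark_domains import (
    SparkDataFrameDomain,
    SparkFloatColumnDescriptor,
)

strict = SparkDataFrameDomain({"x": SparkFloatColumnDescriptor(allow_nan=False)})
loose = SparkDataFrameDomain({"x": SparkFloatColumnDescriptor(allow_nan=True)})
# Different domains, but the NaN restriction is lost in the Spark schema:
assert strict.spark_schema == loose.spark_schema
```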
- __init__(schema)#
Constructor.
- Parameters:
schema (Mapping[str, SparkColumnDescriptor]) – Mapping from column names to column descriptors.
- validate(value)#
Raises an error if the value is not a DataFrame with a matching schema.
- Parameters:
value (Any)
- Return type:
None
- __getitem__(col_name)#
Returns the column descriptor for the given column.
- Parameters:
col_name (str)
- Return type:
SparkColumnDescriptor
- classmethod from_spark_schema(schema)#
Returns a SparkDataFrameDomain constructed from a Spark schema.
Note
If the schema contains float columns, NaNs and infs are allowed in them, since a Spark schema places no restrictions on these.
- Parameters:
schema (pyspark.sql.types.StructType) – Spark schema for constructing the domain.
- Return type:
SparkDataFrameDomain
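A minimal end-to-end sketch: build a domain from a Spark schema and check a DataFrame against it (column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType
from tmlt.core.domains.spark_domains import SparkDataFrameDomain

spark = SparkSession.builder.getOrCreate()
spark_schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("score", DoubleType(), nullable=False),
])
domain = SparkDataFrameDomain.from_spark_schema(spark_schema)

sdf = spark.createDataFrame([("a", 1.0), ("b", 2.5)], spark_schema)
domain.validate(sdf)    # raises only if sdf does not match the domain
print(domain["score"])  # per-column descriptor via __getitem__
```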
- class SparkGroupedDataFrameDomain(schema, groupby_columns)#
Bases:
tmlt.core.domains.base.Domain
Domain of grouped DataFrames.
- Parameters:
schema (SparkColumnsDescriptor)
groupby_columns (Sequence[str])
- property schema: SparkColumnsDescriptor#
Returns mapping from column names to column descriptors.
- Return type:
SparkColumnsDescriptor
- property groupby_columns: List[str]#
Returns list of columns used for grouping.
- Return type:
List[str]
- property spark_schema: pyspark.sql.types.StructType#
Returns the Spark schema object corresponding to the domain.
Note
There isn’t a one-to-one correspondence between Spark schema objects and SparkDataFrameDomain objects: the domains encode additional information, such as whether NaNs or infs are allowed in float columns, that cannot be represented in a Spark schema (StructType) object. More such information may be added in the future.
- Return type:
pyspark.sql.types.StructType
- __init__(schema, groupby_columns)#
Constructor.
- Parameters:
schema (Mapping[str, SparkColumnDescriptor]) – Mapping from column name to column descriptors for all columns.
groupby_columns (Sequence[str]) – List of columns used for grouping.
- validate(value)#
Raises an error if the value is not a GroupedDataFrame with matching group_keys.
- Parameters:
value (Any)
- Return type:
None
- get_group_domain()#
Returns the domain for one of the groups.
- Return type:
SparkDataFrameDomain
- __eq__(other)#
Returns True if the schemas and group keys are identical.
- Parameters:
other (Any)
- Return type:
bool
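A sketch of constructing a grouped domain, following the constructor signature above; the assumption that get_group_domain returns the domain of a single group's rows is mine:

```python
from tmlt.core.domains.spark_domains import (
    SparkGroupedDataFrameDomain,
    SparkIntegerColumnDescriptor,
    SparkStringColumnDescriptor,
)

domain = SparkGroupedDataFrameDomain(
    schema={
        "zip": SparkStringColumnDescriptor(),
        "age": SparkIntegerColumnDescriptor(),
    },
    groupby_columns=["zip"],
)
group_domain = domain.get_group_domain()  # assumed: domain of the rows within one group
```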