binning_spec

A BinningSpec defines a binning operation on a column.

Classes

BinningSpec

A spec object defining an operation where values are assigned to bins.

class BinningSpec(bin_edges, names=None, right=True, include_both_endpoints=True, nan_bin=None)

Bases: Generic[BinT, BinNameT]

A spec object defining an operation where values are assigned to bins.

A BinningSpec divides values into bins based on a list of bin edges, for use with the bin_column() method. All supported data types can be binned using a BinningSpec.

Values outside the range of the provided bins, as well as None values, are mapped to None (null in Spark). NaN values are also mapped to None by default. Bin names are generated from the bin edges unless custom names are provided.

By default, the right edge of each bin is included in that bin: using edges [0, 5, 10] will lead to bins [0, 5] and (5, 10]. The first bin is closed on both sides because include_both_endpoints defaults to True. To include the left edge of each bin instead, set the right parameter to False.

Examples

>>> spec = BinningSpec([0,5,10])
>>> spec.bins()
['[0, 5]', '(5, 10]']
>>> spec(0)
'[0, 5]'
>>> spec(5)
'[0, 5]'
>>> spec(6)
'(5, 10]'
>>> spec(10)
'(5, 10]'
>>> spec(11) is None
True
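The doctest above can be reproduced without the library by a small pure-Python sketch of the default edge rule. The helper `default_bin_index` is a hypothetical stand-in written for illustration, not the library's implementation; it assumes sorted edges and the default settings (right=True, include_both_endpoints=True).

```python
from bisect import bisect_left


def default_bin_index(value, edges):
    """Return the index of the bin holding value under the default rule, else None.

    Hypothetical helper for illustration only; assumes edges are sorted ascending.
    """
    if value is None or value != value:  # None and NaN never land in a regular bin
        return None
    if value == edges[0]:
        return 0  # the outer left edge belongs to the first bin by default
    idx = bisect_left(edges, value) - 1  # right-closed bins: (a, b]
    return idx if 0 <= idx < len(edges) - 1 else None


edges = [0, 5, 10]
results = {v: default_bin_index(v, edges) for v in (0, 5, 6, 10, 11, -1)}
```

Here index 0 corresponds to '[0, 5]' and index 1 to '(5, 10]'; values that fall outside every bin map to None, mirroring spec(11) in the doctest.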
Parameters
  • bin_edges (Sequence[BinT])

  • names (Optional[Sequence[Optional[BinNameT]]])

  • right (bool)

  • include_both_endpoints (bool)

  • nan_bin (Optional[BinNameT])

__init__(bin_edges, names=None, right=True, include_both_endpoints=True, nan_bin=None)

Initialize a BinningSpec.

Parameters
  • bin_edges (Sequence[BinT]) – A list of the bin edges, sorted in ascending order.

  • names (Optional[Sequence[Optional[BinNameT]]] (default: None)) – If given, used as the names of the bins. Must be one element shorter than bin_edges. Duplicate values are allowed; they place non-contiguous ranges of values into the same bin. Floats and timestamps are allowed as bin names, but grouping on the resulting column is not.

  • right (bool (default: True)) – When True, the right edge of each bin is included in that bin; otherwise, the left edge is.

  • include_both_endpoints (bool (default: True)) – When True, the outer edges of both the first and last bins are included in their respective bins; when False, these edges are treated the same as those of the other bins, i.e. only one is included, depending on how right is set.

  • nan_bin (Optional[BinNameT] (default: None)) – If binning over a float-valued column, all NaNs will be placed in a bin with this name. The default value, None, causes these values to be placed in the same bin as out-of-range and null values.
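How these parameters interact can be sketched in plain Python. The function `sketch_bin` below is a hypothetical stand-in that follows the rules described in this list; it is not the library's code, and the age edges and bin names are invented for illustration.

```python
import math
from bisect import bisect_left, bisect_right


def sketch_bin(value, edges, names, right=True,
               include_both_endpoints=True, nan_bin=None):
    """Map value to a bin name per the documented rules (illustrative only)."""
    if len(names) != len(edges) - 1:
        raise ValueError("names must be one element shorter than edges")
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return nan_bin  # None unless a NaN bin name was given
    last = len(names) - 1
    if right:
        idx = bisect_left(edges, value) - 1  # bins are (a, b]
        if include_both_endpoints and value == edges[0]:
            idx = 0  # first bin also includes its left edge
    else:
        idx = bisect_right(edges, value) - 1  # bins are [a, b)
        if include_both_endpoints and value == edges[-1]:
            idx = last  # last bin also includes its right edge
    return names[idx] if 0 <= idx <= last else None


# Hypothetical example data: duplicate names merge non-contiguous ranges.
edges = [0, 18, 65, 120]
names = ["other", "adult", "other"]
```

With these inputs, 10 and 70 both land in "other" even though their ranges are not adjacent, and the value 18 moves from "other" to "adult" when right is set to False.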

property input_type

Return the ColumnType of the column this binning can be applied to.

Return type

tmlt.analytics._schema.ColumnType

property column_descriptor

Return the ColumnDescriptor that results from applying this binning.

Return type

tmlt.analytics._schema.ColumnDescriptor

bins(include_null=False)

Return a list of all the bin names that could result from the binning.

The returned list contains only unique elements, even if multiple bins are mapped to the same name. The NaN bin, if one was specified, is included. The null bin is included only if include_null is True.

Parameters

include_null (bool) – Whether to include the null bin in the returned list. Defaults to False.

Return type

List[Optional[BinNameT]]
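The uniqueness guarantee described above can be illustrated with a short sketch. The function `sketch_bin_names` is hypothetical, and the ordering it produces (first-seen names, then the NaN bin, then None) is an assumption made for this example rather than documented behavior.

```python
def sketch_bin_names(names, nan_bin=None, include_null=False):
    """List the distinct bin names a binning could produce (illustrative only)."""
    # dict.fromkeys drops duplicates while keeping first-seen order;
    # None is filtered out here and added back only on request.
    unique = [name for name in dict.fromkeys(names) if name is not None]
    if nan_bin is not None and nan_bin not in unique:
        unique.append(nan_bin)
    if include_null:
        unique.append(None)
    return unique


plain = sketch_bin_names(["low", "high", "low"])
with_nan = sketch_bin_names(["low", "high", "low"], nan_bin="nan")
with_null = sketch_bin_names(["low", "high", "low"], nan_bin="nan", include_null=True)
```

Three bins named "low", "high", "low" yield only two distinct names; the NaN bin and the null bin are appended only when requested.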

__call__(val)

Given a value to bin, return its bin name.

In most cases this method only needs to be used internally, but it can be called on its own to test the binning that will be performed.

Parameters

val (Optional[BinT]) – The value to be assigned to a bin.

Return type

Optional[BinNameT]