QueryBuilder.histogram

from tmlt.analytics import QueryBuilder
QueryBuilder.histogram(column, bin_edges, name=None)

Returns a count query that reports how many values of the specified column fall into each bin.

Example

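>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from tmlt.analytics import (
...     AddOneRow,
...     BinningSpec,
...     PureDPBudget,
...     QueryBuilder,
...     Session,
... )
>>> spark = SparkSession.builder.getOrCreate()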
>>> private_data = spark.createDataFrame(
...     pd.DataFrame(
...         {
...          "income_thousands": [83, 85, 86, 73, 82, 95,
...                               74, 92, 71, 86, 97]
...         }
...     )
... )
>>> session = Session.from_dataframe(
...     privacy_budget=PureDPBudget(epsilon=float('inf')),
...     source_id="private_data",
...     dataframe=private_data,
...     protected_change=AddOneRow(),
... )
>>> income_binspec = BinningSpec(
...     bin_edges=list(range(70, 110, 10)),
...     include_both_endpoints=False
... )
>>> binned_income_count_query = (
...     QueryBuilder("private_data")
...     .histogram("income_thousands", income_binspec, "income_binned")
... )
>>> binned_income_counts = session.evaluate(
...     binned_income_count_query,
...     privacy_budget=PureDPBudget(epsilon=10),
... )
>>> print(binned_income_counts.sort("income_binned").toPandas())
  income_binned  count
0      (70, 80]      3
1      (80, 90]      5
2     (90, 100]      3
Parameters:
  • column (str) – Name of the column used to assign bins.

  • bin_edges (Union[Sequence[TypeVar(BinT, str, Union[int, float], date, datetime)], BinningSpec]) – The bin edges for the histogram, given either as a BinningSpec or as a list of values of a supported data type. Values outside the range of the provided bins, None values, and NaN values are all mapped to None (null in Spark).

  • name (Optional[str]) – The name of the column that will be created. If None (the default), the name is the input column name with _binned appended.

Return type:

Query
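
The mapping described for bin_edges can be illustrated without Spark. The following is a minimal plain-Python sketch, not the library's implementation. It assumes the left-open, right-closed bins that the example output shows for include_both_endpoints=False, and assign_bin is a hypothetical helper introduced only for illustration.

```python
import math
from bisect import bisect_left
from collections import Counter
from typing import Optional, Sequence


def assign_bin(value: Optional[float], bin_edges: Sequence[float]) -> Optional[str]:
    """Hypothetical helper: label of the (left, right] bin holding value, else None."""
    # None and NaN are never placed in a bin.
    if value is None or math.isnan(value):
        return None
    # Values at or below the first edge, or above the last edge, are out of range.
    if value <= bin_edges[0] or value > bin_edges[-1]:
        return None
    # bisect_left returns the index of the first edge >= value: the bin's right edge.
    right = bisect_left(bin_edges, value)
    return f"({bin_edges[right - 1]}, {bin_edges[right]}]"


edges = list(range(70, 110, 10))  # [70, 80, 90, 100]
incomes = [83, 85, 86, 73, 82, 95, 74, 92, 71, 86, 97]

# Exact (noise-free) counts per bin for the example data.
counts = Counter(assign_bin(v, edges) for v in incomes)
```

Under these assumptions, a value equal to the lowest edge (70 here) falls outside every bin and maps to None, as do None, NaN, and anything above 100. The counts are exact because this sketch adds no noise.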