query_builder<a class="headerlink" href="#module-tmlt.analytics.query_builder" title="Permalink to this heading">#

join_private(right_operand, truncation_strategy_left=None, truncation_strategy_right=None, join_columns=None)#

Updates the current query to join with another QueryBuilder.

The current query can also join with a named private table (represented as a string).

This operation is an inner join.

This operation is a natural join, with the same behavior and requirements as join_public().

For operations on tables with a ProtectedChange that protects adding or removing rows (e.g. AddMaxRows), there is a key difference: before the join is performed, each table is truncated based on the corresponding TruncationStrategy.

In contrast, operations on tables with a AddRowsWithID ProtectedChange do not require a TruncationStrategy, as no truncation is necessary while performing the join.

Note

Tables with a ProtectedChange of AddRowsWithID must include the privacy ID column in the join columns.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> from tmlt.analytics.query_builder import TruncationStrategy
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> sess.create_view(
...     QueryBuilder("my_private_data")
...     .select(["A", "X"])
...     .rename({"X": "C"})
...     .query_expr,
...     source_id="my_private_view",
...     cache=False
... )
>>> # A query where only one row with each join key is kept on the left
>>> # table, but two are kept on the right table.
>>> query_drop_excess = (
...     QueryBuilder("my_private_data")
...     .join_private(
...         QueryBuilder("my_private_view"),
...         truncation_strategy_left=TruncationStrategy.DropExcess(1),
...         truncation_strategy_right=TruncationStrategy.DropExcess(2),
...     )
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query_drop_excess,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      3
>>> # A query where all rows that share a join key with another row in
>>> # their table are dropped, in both the left and right tables.
>>> query_drop_non_unique = (
...     QueryBuilder("my_private_data")
...     .join_private(
...         QueryBuilder("my_private_view"),
...         truncation_strategy_left=TruncationStrategy.DropNonUnique(),
...         truncation_strategy_right=TruncationStrategy.DropNonUnique(),
...     )
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query_drop_non_unique,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      1

Parameters:

right_operand (Union[QueryBuilder, str]) – QueryBuilder object representing the table to be joined with. When calling query_a.join_private(query_b, ...), we refer to query_a as the left table and query_b as the right table. query_a.join_private("table") is shorthand for query_a.join_private(QueryBuilder("table")).
truncation_strategy_left (Optional[tmlt.analytics.truncation_strategy.TruncationStrategy.Type]) – Strategy for truncation of the left table.
truncation_strategy_right (Optional[tmlt.analytics.truncation_strategy.TruncationStrategy.Type]) – Strategy for truncation of the right table.
join_columns (Optional[Sequence[str]]) – The columns to join on. If join_columns is not specified, the tables will be joined on all common columns.

Return type:

replace_null_and_nan(replace_with=None)#

Updates the current query to replace null and NaN values in some columns.

Note

Null values cannot be replaced in the ID column of a table initialized with a AddRowsWithID ProtectedChange, nor on a column generated by a flat_map() with the grouping parameter set to True.

Warning

If null values are replaced in a column, then Analytics will raise an error if you use a KeySet with a null value for that column.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
      A    B    X
0  None  0.0  0.0
1     1  NaN  1.1
2     2  2.0  NaN
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'DECIMAL', 'X': 'DECIMAL'}
>>> # Building a query with a replace_null_and_nan transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .replace_null_and_nan(
...         replace_with={
...             "A": "new_value",
...             "B": 1234,
...             "X": 56.78,
...         },
...     )
...     .groupby(KeySet.from_dict({"A": ["new_value", "1", "2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
           A  count
0          1      1
1          2      1
2  new_value      1

Parameters:: replace_with (Optional[Mapping[str, Union[int, float, str, datetime.date, datetime.datetime]]]) – A dictionary mapping column names to values used to replace null and NaN values. If None (or empty), all columns will have null and NaN values replaced with Analytics defaults; see AnalyticsDefault.
Return type:: QueryBuilder

replace_infinity(replace_with=None)#

Updates the current query to replace +inf and -inf values in some columns.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
    A    B    X
0  a1  0.0  0.0
1  a1  NaN -inf
2  a2  2.0  inf
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'DECIMAL', 'X': 'DECIMAL'}
>>> # Building a query with a replace_infinity transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .replace_infinity(
...         replace_with={
...             "X": (-100, 100),
...         },
...     )
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      2
1  a2      1

Parameters:: replace_with (Optional[Dict[str, Tuple[float, float]]]) – A dictionary mapping column names to values used to replace -inf and +inf. If None (or empty), all columns will have infinite values replaced with Analytics defaults; see AnalyticsDefault.
Return type:: QueryBuilder

drop_null_and_nan(columns)#

Updates the current query to drop rows containing null or NaN values.

Note

Null values cannot be dropped in the ID column of a table initialized with a AddRowsWithID ProtectedChange, nor on a column generated by a flat_map() with the grouping parameter set to True.

Warning

If null and NaN values are dropped from a column, then Analytics will raise an error if you use a KeySet that contains a null value for that column.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
    A    B    X
0  a1  2.0  0.0
1  a1  NaN  1.1
2  a2  2.0  NaN
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'DECIMAL'}
>>> # Count query on the original data
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"], "B": [None, 2]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
    A    B  count
0  a1  NaN      1
1  a1  2.0      1
2  a2  NaN      0
3  a2  2.0      1
>>> # Building a query with a transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .drop_null_and_nan(columns=["B"])
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      1
1  a2      1

Parameters:: columns (Optional[List[str]]) – A list of columns in which to look for null and NaN values. If None or an empty list, then all columns will be considered, meaning that if any column has a null/NaN value then the row it is in will be dropped.
Return type:: QueryBuilder

drop_infinity(columns)#

Updates the current query to drop rows containing infinite values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.sort("A", "B", "X").toPandas()
    A  B    X
0  a1  1  1.1
1  a1  2  0.0
2  a2  2  inf
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'DECIMAL'}
>>> # Count query on the original data
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      2
1  a2      1

>>> # Building a query with a drop_infinity transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .drop_infinity(columns=["X"])
...     .groupby(KeySet.from_dict({"A": ["a1", "a2"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
    A  count
0  a1      2
1  a2      0

Parameters:: columns (Optional[List[str]]) – A list of columns in which to look for positive and negative infinities. If None or an empty list, then all columns will be considered, meaning that if any column has an infinite value then the row it is in will be dropped.
Return type:: QueryBuilder

rename(column_mapper)#

Updates the current query to rename the columns.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a query with a rename transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .rename({"X": "C"})
...     .groupby(KeySet.from_dict({"C": [0, 1]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("C").toPandas()
   C  count
0  0      1
1  1      2

Parameters:: column_mapper (Dict[str, str]) – A mapping of columns to new column names. Columns not specified in the mapper will remain the same.
Return type:: QueryBuilder

filter(condition)#

Updates the current query to filter for rows matching a condition.

The condition parameter accepts the same syntax as in PySpark’s filter() method: valid expressions are those that can be used in a WHERE clause in Spark SQL. Examples of valid conditions include:

age < 42
age BETWEEN 17 AND 42
age < 42 OR (age < 60 AND gender IS NULL)
LENGTH(name) > 17
favorite_color IN ('blue', 'red')
date = '2022-03-14'
time < '2022-01-01T12:45:00'

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a query with a filter transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .filter("A == '0'")
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      1

Parameters:: condition (str) – A string of SQL expressions specifying the filter to apply to the data. For example, the string “A > B” matches rows where column A is greater than column B.
Return type:: QueryBuilder

select(columns)#

Updates the current query to select certain columns.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Create a new view using a select query
>>> query = (
...     QueryBuilder("my_private_data")
...     .select(["A", "B"])
... )
>>> sess.create_view(query, "selected_data", cache=True)
>>> # Inspect the schema of the resulting view
>>> sess.get_schema("selected_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER'}

Parameters:: columns (Sequence[str]) – The columns to select.
Return type:: QueryBuilder

map(f, new_column_types, augment=False)#

Updates the current query to apply a mapping function to each row.

If you provide only a ColumnType for the new column types, Analytics assumes that all new columns created may contain null values (and that DECIMAL columns may contain NaN or infinite values).

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a query with a map transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .map(
...         lambda row: {"new": row["B"]*2},
...         new_column_types={"new": 'INTEGER'},
...         augment=True
...     )
...     .groupby(KeySet.from_dict({"new": [0, 1, 2, 3, 4]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("new").toPandas()
   new  count
0    0      1
1    1      0
2    2      1
3    3      0
4    4      1

Parameters:

f (Callable[[Row], Row]) – The function to be applied to each row. The function’s input is a dictionary matching each column name to its value for that row. This function should return a dictionary, which should always have the same keys regardless of input, and the values in that dictionary should match the column type specified in new_column_types. The function should not have any side effects (in particular, f cannot raise exceptions).
new_column_types (Mapping[str, Union[tmlt.analytics._schema.ColumnDescriptor, tmlt.analytics._schema.ColumnType]]) – Mapping from column names to types, for new columns produced by f. Using ColumnDescriptor is preferred.
augment (bool) – If True, add new columns to the existing dataframe (so new schema = old schema + schema_new_columns). If False, make the new dataframe with schema = schema_new_columns

Return type:

flat_map(f, new_column_types, augment=False, grouping=False, max_rows=None, max_num_rows=None)#

Updates the current query to apply a flat map.

If you provide only a ColumnType for the new column types, Analytics assumes that all new columns created may contain null values (and that DECIMAL columns may contain NaN or infinite values).

Operations on tables with a AddRowsWithID ProtectedChange do not require a max_rows argument, since it is not necessary to impose a limit on the number of new rows.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
3  1  3  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a query with a flat map transformation
>>> query = (
...     QueryBuilder("my_private_data")
...     .flat_map(
...         lambda row: [{"i_B": i} for i in range(int(row["B"])+1)],
...         new_column_types={"i_B": ColumnDescriptor(
...             ColumnType.INTEGER,
...             allow_null=False,
...         )},
...         augment=True,
...         grouping=False,
...         max_rows=3,
...     )
...     .groupby(KeySet.from_dict({"B": [0, 1, 2, 3]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("B").toPandas()
   B  count
0  0      1
1  1      2
2  2      3
3  3      3

Parameters:

f (Callable[[Row], List[Row]]) – The function to be applied to each row. The function’s input is a dictionary matching a column name to its value for that row. This function should return a list of dictionaries. Those dictionaries should always have the same keys regardless of input, and the values in those dictionaries should match the column type specified in new_column_types. The function should not have any side effects (in particular, f must not raise exceptions), and must be deterministic (running it multiple times on a fixed input should always return the same output).
new_column_types (Mapping[str, Union[str, tmlt.analytics._schema.ColumnDescriptor, tmlt.analytics._schema.ColumnType]]) – Mapping from column names to types, for new columns produced by f. Using ColumnDescriptor is preferred.
augment (bool) – If True, add new columns to the existing dataframe (so new schema = old schema + schema_new_columns). If False, make the new dataframe with schema = schema_new_columns
grouping (bool) – Whether this produces a new column that we want to groupby. If True, this requires that any groupby aggregations following this query include the new column as a groupby column. Only one new column is supported, and the new column must have distinct values for each input row.
max_rows (Optional[int]) – The enforced limit on the number of rows from each f(row). If f produces more rows than this, only the first max_rows rows will be in the output.
max_num_rows (Optional[int]) – Deprecated synonym for max_rows.

Return type:

bin_column(column, spec, name=None)#

Create a new column by assigning the values in a given column to bins.

Example

>>> from tmlt.analytics.binning_spec import BinningSpec
>>> sess = Session.from_dataframe(
...     PureDPBudget(float("inf")),
...     source_id="private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   age  income
0   11       0
1   17       6
2   30      54
3   18      14
4   59     126
5   48     163
6   76     151
7   91      18
8   48      97
9   53      85
>>> age_binspec = BinningSpec(
...     [0, 18, 65, 100], include_both_endpoints=False
... )
>>> income_tax_rate_binspec = BinningSpec(
...     [0, 10, 40, 86, 165], names=[10, 12, 22, 24]
... )
>>> keys = KeySet.from_dict(
...     {
...         "age_binned": age_binspec.bins(),
...         "marginal_tax_rate": income_tax_rate_binspec.bins()
...     }
... )
>>> query = (
...     QueryBuilder("private_data")
...     .bin_column("age", age_binspec)
...     .bin_column(
...         "income", income_tax_rate_binspec, name="marginal_tax_rate"
...     )
...     .groupby(keys).count()
... )
>>> answer = sess.evaluate(query, PureDPBudget(float("inf")))
>>> answer.sort("age_binned", "marginal_tax_rate").toPandas()
   age_binned  marginal_tax_rate  count
0     (0, 18]                 10      2
1     (0, 18]                 12      1
2     (0, 18]                 22      0
3     (0, 18]                 24      0
4    (18, 65]                 10      0
5    (18, 65]                 12      0
6    (18, 65]                 22      2
7    (18, 65]                 24      3
8   (65, 100]                 10      0
9   (65, 100]                 12      1
10  (65, 100]                 22      0
11  (65, 100]                 24      1

Parameters:

column (str) – Name of the column used to assign bins.
spec (tmlt.analytics.binning_spec.BinningSpec) – A BinningSpec that defines the binning operation to be performed.
name (Optional[str]) – The name of the column that will be created. If None (the default), the input column name with _binned appended to it.

Return type:

histogram(column, bin_edges, name=None)#

Returns a count query containing the frequency of values in specified column.

Example

>>> from tmlt.analytics.binning_spec import BinningSpec
>>> private_data = spark.createDataFrame(
...     pd.DataFrame(
...         {
...          "income_thousands": [83, 85, 86, 73, 82, 95,
...                               74, 92, 71, 86, 97]
...         }
...     )
... )
>>> session = Session.from_dataframe(
...     privacy_budget=PureDPBudget(epsilon=float('inf')),
...     source_id="private_data",
...     dataframe=private_data,
...     protected_change=AddOneRow(),
... )
>>> income_binspec = BinningSpec(
...     bin_edges=[i for i in range(70,110,10)],
...     include_both_endpoints=False
... )
>>> binned_income_count_query = (
...     QueryBuilder("private_data")
...     .histogram("income_thousands", income_binspec, "income_binned")
... )
>>> binned_income_counts = session.evaluate(
...     binned_income_count_query,
...     privacy_budget=PureDPBudget(epsilon=10),
... )
>>> print(binned_income_counts.sort("income_binned").toPandas())
  income_binned  count
0      (70, 80]      3
1      (80, 90]      5
2     (90, 100]      3

Parameters:

column (str) – Name of the column used to assign bins.
bin_edges (Union[Sequence[tmlt.analytics.binning_spec.BinT], tmlt.analytics.binning_spec.BinningSpec]) – The bin edges for the histogram; provided as either a BinningSpec or as a list of supported data types. Values outside the range of the provided bins, None types, and NaN values are all mapped to None (null in Spark).
name (Optional[str]) – The name of the column that will be created. If None (the default), the input column name with _binned appended to it.

Return type:

enforce(constraint)#

Enforce a Constraint on the current table.

This method can be used to enforce constraints on the current table. See the constraints module for information about the available constraints and what they are used for.

Example

>>> my_private_data.toPandas()
  id  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess = (
...     Session.Builder()
...     .with_privacy_budget(PureDPBudget(float("inf")))
...     .with_id_space("a")
...     .with_private_dataframe(
...         "my_private_data",
...         my_private_data,
...         protected_change=AddRowsWithID("id", "a"),
...     )
...     .build()
... )
>>> # No ID contributes more than 2 rows, so no rows are dropped when
>>> # enforcing the constraint
>>> query = QueryBuilder("my_private_data").enforce(MaxRowsPerID(2)).count()
>>> sess.evaluate(query, sess.remaining_privacy_budget).toPandas()
   count
0      3
>>> # ID 1 contributes more than one row, so one of the rows with ID 1 will
>>> # be dropped when enforcing the constraint
>>> query = QueryBuilder("my_private_data").enforce(MaxRowsPerID(1)).count()
>>> sess.evaluate(query, sess.remaining_privacy_budget).toPandas()
   count
0      2

Parameters:: constraint (tmlt.analytics.constraints.Constraint) – The constraint to enforce.
Return type:: QueryBuilder

get_groups(columns=None)#

Returns a query that gets combinations of values in the listed columns.

Note

Because this uses differential privacy, it won’t include all of the values in the input dataset columns, and may even return no results at all on datasets that have few values for each set of group keys.

Example

>>> my_private_data = spark.createDataFrame(
...     pd.DataFrame(
...         [["0", 1, 0] for _ in range(10000)]
...         + [["1", 2, 1] for _ in range(10000)],
...         columns=["A", "B", "X"],
...     )
... )
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=ApproxDPBudget(1, 1e-5),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a get_groups query
>>> query = (
...     QueryBuilder("my_private_data")
...     .get_groups()
... )
>>> # Answering the query
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer.toPandas()
   A  B  X
0  0  1  0
1  1  2  1

Parameters:: columns (Optional[List[str]]) – Name of the column used to assign bins. If empty or none are provided, all of the columns in the table will be used, excluding any column marked as a privacy ID in a table with a ProtectedChange of AddRowsWithID.
Return type:: tmlt.analytics._query_expr.QueryExpr

get_bounds(column)#

Returns a query that gets approximate upper and lower bounds for a column.

The bounds are chosen so that most of the values fall between them. They can be used as the upper and lower bounds for any of the aggregations that require bounds, like sum or quantile.

Note

The algorithm is approximate, and differentially private, so the bounds may not be tight, and not all input values may fall between them.

Example

>>> my_private_data = spark.createDataFrame(
...     pd.DataFrame(
...         [[i] for i in range(100)],
...         columns=["X"],
...     )
... )
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=PureDPBudget(float('inf')),
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> # Building a get_groups query
>>> query = (
...     QueryBuilder("my_private_data")
...     .get_bounds("X")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     sess.remaining_privacy_budget
... )
>>> answer
(-128, 128)

Parameters:: column (str) – Name of the column whose bounds we want to get.
Return type:: tmlt.analytics._query_expr.QueryExpr

groupby(by)#

Groups the query by the given set of keys, returning a GroupedQueryBuilder.

Examples

>>> from tmlt.analytics.keyset import KeySet
>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}

Answering a query with the exact groupby domain:

>>> groupby_keys = KeySet.from_dict({"A": ["0", "1"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2

Answering a query with an omitted domain value:

>>> groupby_keys = KeySet.from_dict({"A": ["0"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   A  count
0  0      1

Answering a query with an added domain value:

>>> groupby_keys = KeySet.from_dict({"A": ["0", "1", "2"]})
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2
2  2      0

Answering a query with a multi-column domain:

>>> groupby_keys = KeySet.from_dict(
...    {"A": ["0", "1"], "B": [0, 1, 2]}
... )
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  0      0
1  0  1      1
2  0  2      0
3  1  0      1
4  1  1      0
5  1  2      1

Answering a query with a multi-column domain and structural zeros:

>>> # Suppose it is known that A and B cannot be equal. This set of
>>> # groupby keys prevents those impossible values from being computed.
>>> keys_df = pd.DataFrame({
...     "A": ["0", "0", "1", "1"],
...     "B": [1, 2, 0, 2],
... })
>>> groupby_keys = KeySet.from_dataframe(spark.createDataFrame(keys_df))
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(groupby_keys)
...     .count()
... )
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A", "B").toPandas()
   A  B  count
0  0  1      1
1  0  2      0
2  1  0      1
3  1  2      1

Parameters:: by (Union[tmlt.analytics.keyset.KeySet, List[str], str]) – A KeySet which defines the columns to group on and the possible values for each column.
Return type:: GroupedQueryBuilder

count(name=None, mechanism=CountMechanism.DEFAULT)#

Returns a count query ready to be evaluated.

Note

Differentially private counts may return values that are not possible for a non-DP query - including negative values. You can enforce non-negativity once the query returns its results; see the example below.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a count query
>>> query = (
...     QueryBuilder("my_private_data")
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count
0      3
>>> # Ensuring all results are non-negative
>>> import pyspark.sql.functions as sf
>>> answer = answer.withColumn(
...     "count", sf.when(sf.col("count") < 0, 0).otherwise(
...             sf.col("count")
...     )
... )
>>> answer.toPandas()
   count
0      3

Parameters:

name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count”.
mechanism (tmlt.analytics._query_expr.CountMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)#

Returns a count_distinct query ready to be evaluated.

Note

Differentially private counts may returns values that are not possible for a non-DP query - including negative values. You can enforce non-negativity once the query returns its results; see the example below.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .count_distinct()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   count_distinct
0               3
>>> # Ensuring all results are non-negative
>>> import pyspark.sql.functions as sf
>>> answer = answer.withColumn(
...     "count_distinct", sf.when(
...         sf.col("count_distinct") < 0, 0
...     ).otherwise(
...         sf.col("count_distinct")
...     )
... )
>>> answer.toPandas()
   count_distinct
0               3

Parameters:

columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.
name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.
mechanism (tmlt.analytics._query_expr.CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.
cols (Optional[List[str]]) – Deprecated; use columns instead.

Return type:

quantile(column, quantile, low, high, name=None)#

Returns a quantile query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .quantile(column="B", quantile=0.6, low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
   B_quantile(0.6)
0         1.331107

Parameters:

column (str) – The column to compute the quantile over.
quantile (float) – A number between 0 and 1 specifying the quantile to compute. For example, 0.5 would compute the median.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_quantile({quantile})".

Return type:

min(column, low, high, name=None)#

Returns a quantile query requesting a minimum value, ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .min(column="B", low=0, high=5, name="min_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
      min_B
0  0.213415

Parameters:

column (str) – The column to compute the quantile over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_min".

Return type:

max(column, low, high, name=None)#

Returns a quantile query requesting a maximum value, ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .max(column="B", low=0, high=5, name="max_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
      max_B
0  2.331871

Parameters:

column (str) – The column to compute the quantile over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_max".

Return type:

median(column, low, high, name=None)#

Returns a quantile query requesting a median value, ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .median(column="B", low=0, high=5, name="median_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas() 
   median_B
0  1.221197

Parameters:

column (str) – The column to compute the quantile over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_median".

Return type:

sum(column, low, high, name=None, mechanism=SumMechanism.DEFAULT)#

Returns a sum query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a sum query
>>> query = (
...     QueryBuilder("my_private_data")
...     .sum(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_sum
0      3

Parameters:

column (str) – The column to compute the sum over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_sum".
mechanism (tmlt.analytics._query_expr.SumMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

average(column, low, high, name=None, mechanism=AverageMechanism.DEFAULT)#

Returns an average query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building an average query
>>> query = (
...     QueryBuilder("my_private_data")
...     .average(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_average
0        1.0

Parameters:

column (str) – The column to compute the average over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_average".
mechanism (tmlt.analytics._query_expr.AverageMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

variance(column, low, high, name=None, mechanism=VarianceMechanism.DEFAULT)#

Returns a variance query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a variance query
>>> query = (
...     QueryBuilder("my_private_data")
...     .variance(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_variance
0         1.0

Parameters:

column (str) – The column to compute the variance over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_variance".
mechanism (tmlt.analytics._query_expr.VarianceMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

stdev(column, low, high, name=None, mechanism=StdevMechanism.DEFAULT)#

Returns a standard deviation query ready to be evaluated.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a standard deviation query
>>> query = (
...     QueryBuilder("my_private_data")
...     .stdev(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.toPandas()
   B_stdev
0      1.0

Parameters:

column (str) – The column to compute the stdev over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_stdev".
mechanism (tmlt.analytics._query_expr.StdevMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

class GroupedQueryBuilder(source_id, query_expr, groupby_keys)#

A QueryBuilder that is grouped by a set of columns and can be aggregated.

property query_expr: tmlt.analytics._query_expr.QueryExpr#

Get the query expression being built.

Return type:: tmlt.analytics._query_expr.QueryExpr

__init__(source_id, query_expr, groupby_keys)#

Constructor.

Do not construct directly; use groupby().

Return type:: None

count(name=None, mechanism=CountMechanism.DEFAULT)#

Returns an AggregatedQueryBuilder with a count query.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby count query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .count()
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count
0  0      1
1  1      2

Parameters:

name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count”.
mechanism (tmlt.analytics._query_expr.CountMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

count_distinct(columns=None, name=None, mechanism=CountDistinctMechanism.DEFAULT, cols=None)#

Returns an AggregatedQueryBuilder with a count_distinct query.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  0  1  0
2  1  0  1
3  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby count_distinct query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .count_distinct(["B", "X"])
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  count_distinct(B, X)
0  0                     1
1  1                     2

Parameters:

columns (Optional[List[str]]) – Columns in which to count distinct values. If none are provided, the query will count every distinct row.
name (Optional[str]) – Name for the resulting aggregation column. Defaults to “count_distinct” if no columns are provided, or “count_distinct(A, B, C)” if the provided columns are A, B, and C.
mechanism (tmlt.analytics._query_expr.CountDistinctMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.
cols (Optional[List[str]]) – Deprecated; use columns instead.

Return type:

quantile(column, quantile, low, high, name=None)#

Returns an AggregatedQueryBuilder with a quantile query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .quantile(column="B", quantile=0.6, low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A  B_quantile(0.6)
0  0         1.331107
1  1         1.331107

Parameters:

column (str) – The column to compute the quantile over.
quantile (float) – A number between 0 and 1 specifying the quantile to compute. For example, 0.5 would compute the median.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_quantile({quantile})".

Return type:

min(column, low, high, name=None)#

Returns an AggregatedQueryBuilder with a quantile query requesting a minimum value.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .min(column="B", low=0, high=5, name="min_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A     min_B
0  0  0.213415
1  1  0.213415

Parameters:

column (str) – The column to compute the quantile over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_min".

Return type:

max(column, low, high, name=None)#

Returns an AggregatedQueryBuilder with a quantile query requesting a maximum value.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .max(column="B", low=0, high=5, name="max_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A     max_B
0  0  2.331871
1  1  2.331871

Parameters:

column (str) – The column to compute the quantile over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_max".

Return type:

median(column, low, high, name=None)#

Returns an AggregatedQueryBuilder with a quantile query requesting a median value.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a quantile query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .median(column="B", low=0, high=5, name="median_B")
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas() 
   A  median_B
0  0  1.221197
1  1  1.221197

Parameters:

column (str) – The column to compute the quantile over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_median".

Return type:

sum(column, low, high, name=None, mechanism=SumMechanism.DEFAULT)#

Returns an AggregatedQueryBuilder with a sum query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby sum query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .sum(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_sum
0  0      1
1  1      2

Parameters:

column (str) – The column to compute the sum over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_sum".
mechanism (tmlt.analytics._query_expr.SumMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

average(column, low, high, name=None, mechanism=AverageMechanism.DEFAULT)#

Returns an AggregatedQueryBuilder with an average query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby average query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .average(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_average
0  0        1.0
1  1        1.0

Parameters:

column (str) – The column to compute the average over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_average".
mechanism (tmlt.analytics._query_expr.AverageMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

variance(column, low, high, name=None, mechanism=VarianceMechanism.DEFAULT)#

Returns an AggregatedQueryBuilder with a variance query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby variance query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .variance(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_variance
0  0         NaN
1  1         1.0

Parameters:

column (str) – The column to compute the variance over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_variance".
mechanism (tmlt.analytics._query_expr.VarianceMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type:

stdev(column, low, high, name=None, mechanism=StdevMechanism.DEFAULT)#

Returns an AggregatedQueryBuilder with a standard deviation query.

Note

If the column being measured contains NaN or null values, a drop_null_and_nan() query will be performed first. If the column being measured contains infinite values, a drop_infinity() query will be performed first.

Note

Regarding the clamping bounds:

The values for low and high are a choice the caller must make.
All data will be clamped to lie within this range.
The narrower the range, the less noise. Larger bounds mean more data is kept, but more noise needs to be added to the result.
The clamping bounds are assumed to be public information. Avoid using the private data to set these values.

Example

>>> budget = PureDPBudget(float("inf"))
>>> sess = tmlt.analytics.session.Session.from_dataframe(
...     privacy_budget=budget,
...     source_id="my_private_data",
...     dataframe=my_private_data,
...     protected_change=AddOneRow(),
... )
>>> my_private_data.toPandas()
   A  B  X
0  0  1  0
1  1  0  1
2  1  2  1
>>> sess.private_sources
['my_private_data']
>>> sess.get_schema("my_private_data").column_types
{'A': 'VARCHAR', 'B': 'INTEGER', 'X': 'INTEGER'}
>>> # Building a groupby standard deviation query
>>> query = (
...     QueryBuilder("my_private_data")
...     .groupby(KeySet.from_dict({"A": ["0", "1"]}))
...     .stdev(column="B",low=0, high=2)
... )
>>> # Answering the query with infinite privacy budget
>>> answer = sess.evaluate(
...     query,
...     PureDPBudget(float("inf"))
... )
>>> answer.sort("A").toPandas()
   A  B_stdev
0  0      NaN
1  1      1.0

Parameters:

column (str) – The column to compute the stdev over.
low (float) – The lower bound for clamping.
high (float) – The upper bound for clamping. Must be such that low is less than high.
name (Optional[str]) – The name to give the resulting aggregation column. Defaults to f"{column}_stdev".
mechanism (tmlt.analytics._query_expr.StdevMechanism) – Choice of noise mechanism. By default, the framework automatically selects an appropriate mechanism.

Return type: