Changelog#
Unreleased#
0.19.0 - 2024-11-21#
This release includes no user-facing API changes.
0.18.0 - 2024-11-19#
This release introduces no major API changes. However, it increases the minimum supported Python version to 3.9, and the minimum supported PySpark version to 3.3.1.
Fixed#
The columns argument to KeySet.from_tuples() is no longer required to be a tuple; any sequence type (e.g. a list) is now acceptable.
0.17.0 - 2024-11-04#
This release provides a number of quality-of-life improvements, including a new KeySet.from_tuples() method and support for basic arithmetic on privacy budgets.
Note
Tumult Analytics 0.17 will be the last minor version to support Python 3.8 and PySpark versions below 3.3.1. If you are using Python 3.8 or one of these versions of PySpark, you will need to upgrade them in order to use Tumult Analytics 0.18.0.
Changed#
The map(), flat_map(), and flat_map_by_id() transformations now more strictly check their outputs against the provided new column types. This may cause some existing programs to produce errors if they relied on the previous, less-strict behavior.
Log messages are now emitted via Python’s built-in logging module.
The supported version of typeguard has been updated to 4.*.
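To make the effect of a stricter output check concrete, here is a minimal pure-Python sketch. It does not use Tumult Analytics: check_row and strict_map are hypothetical helpers written for this illustration, and the library's actual validation rules are not described in this changelog.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def check_row(row: Row, new_column_types: Dict[str, type]) -> None:
    """Raise if a mapped row does not match the declared columns and types."""
    if set(row) != set(new_column_types):
        raise ValueError(
            f"expected columns {sorted(new_column_types)}, got {sorted(row)}"
        )
    for name, expected in new_column_types.items():
        # Exact type match: a str is not accepted where an int was declared.
        if type(row[name]) is not expected:
            raise TypeError(
                f"column {name!r}: expected {expected.__name__}, "
                f"got {type(row[name]).__name__}"
            )


def strict_map(
    rows: List[Row], f: Callable[[Row], Row], new_column_types: Dict[str, type]
) -> List[Row]:
    """Apply f to every row, validating each output before keeping it."""
    mapped_rows = []
    for row in rows:
        mapped = f(row)
        check_row(mapped, new_column_types)
        mapped_rows.append(mapped)
    return mapped_rows
```

Under a looser check, a mapping function that returned the string "60" for a column declared as an integer might have passed unnoticed; in this sketch it raises a TypeError instead.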
Added#
Privacy budgets now support division, multiplication, addition and subtraction.
KeySets can now be initialized directly from a collection of Python tuples using KeySet.from_tuples().
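As a rough illustration of why arithmetic on budgets is convenient, the sketch below splits a total budget evenly across queries. EpsilonBudget is a hypothetical class written for this example; it is not the library's PureDPBudget, and the real budget classes may behave differently in edge cases.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Union

Number = Union[int, Fraction]


@dataclass(frozen=True)
class EpsilonBudget:
    """Hypothetical pure-DP budget used only for this illustration."""

    epsilon: Fraction

    def __add__(self, other: "EpsilonBudget") -> "EpsilonBudget":
        return EpsilonBudget(self.epsilon + other.epsilon)

    def __sub__(self, other: "EpsilonBudget") -> "EpsilonBudget":
        if other.epsilon > self.epsilon:
            raise ValueError("cannot subtract a larger budget from a smaller one")
        return EpsilonBudget(self.epsilon - other.epsilon)

    def __mul__(self, factor: Number) -> "EpsilonBudget":
        return EpsilonBudget(self.epsilon * factor)

    def __truediv__(self, divisor: Number) -> "EpsilonBudget":
        return EpsilonBudget(self.epsilon / divisor)


total = EpsilonBudget(Fraction(3))
per_query = total / 3          # split evenly across three queries
remaining = total - per_query  # what is left after the first query
```

Exact fractions are used here so that dividing and then multiplying returns the original budget without floating-point drift; that is a choice made for the sketch, not a statement about the library.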
0.16.1 - 2024-09-04#
This is a maintenance release, with no externally-visible changes.
0.16.0 - 2024-08-21#
This release adds a new QueryBuilder.flat_map_by_id transformation, improved constraint support when using partition_and_create(), and performance improvements.
Added#
Added a new transformation, QueryBuilder.flat_map_by_id, which allows user-defined transformations to be applied to groups of rows sharing an ID on tables with the AddRowsWithID protected change.
Fixed#
Significantly improved the performance of coercing Session input dataframe columns to supported types.
Changed#
partition_and_create() can now be used on a table with an AddRowsWithID protected change if a MaxRowsPerID constraint is present, converting the table being partitioned into one with an AddMaxRows protected change. The behavior when using partition_and_create() on such a table with a MaxGroupsPerID constraint has not changed. If both MaxRowsPerID and MaxGroupsPerID constraints are present, the MaxRowsPerID constraint is ignored and only the MaxGroupsPerID constraint gets applied.
0.15.0 - 2024-08-12#
This release extends the get_bounds() method so it can get upper and lower bounds for each group in a dataframe.
In addition, it changes the object used to represent queries to the new Query class, and updates the format in which table schemas are returned.
Added#
Added a dependency on the library tabulate to improve table displays from describe().
Added the ability to call get_bounds() after calling groupby(), for determining upper and lower bounds for a column per group in a differentially private way.
Changed#
Backwards-incompatible: The get_bounds() query now returns a dataframe when evaluated instead of a tuple.
Backwards-incompatible: The Session.get_schema() and KeySet.schema() methods now return a normal dictionary of column names to ColumnDescriptors, rather than a specialized Schema type. This brings them more in line with the rest of the Tumult Analytics API, but could impact code that used some functionality available through the Schema type. Uses of these methods where the result is treated as a dictionary should not be impacted.
QueryBuilder now returns a Query object instead of a QueryExpr or AggregatedQueryBuilder when a query is created. This should not affect code using QueryBuilder unless it directly inspects these objects.
GroupbyCount queries now return GroupbyCountQuery, a subclass of Query that has the suppress() post-process method.
evaluate() now accepts Query objects instead of QueryExpr objects.
Replaced asserts with custom exceptions in cases where internal errors are detected. Internal errors are now raised as AnalyticsInternalError.
Updated to Tumult Core 0.16.1.
Removed#
QueryExprs (previously in tmlt.analytics.query_expr) have been removed from the Tumult Analytics public API. Queries should be created using QueryBuilder, which returns a new Query when a query is created.
Removed the query_expr attribute from the QueryBuilder class.
Removed support for Pandas 1.2 and 1.3 due to a known bug in Pandas versions below 1.4.
0.14.0 - 2024-07-18#
Tumult Analytics 0.14.0 introduces experimental support for Python 3.12. Full support for Python 3.12 and Pandas 2 will not be available until the release of PySpark 4.0. In addition, Python 3.7 is no longer supported.
This release also deprecates the tmlt.analytics.query_expr module.
Use of QueryExpr and its subtypes to create queries has been discouraged for a long time, and these types will be removed from the Tumult Analytics API in an upcoming release.
Other types from this module have been moved into the tmlt.analytics.query_builder module, though they may be imported from either until the query_expr module is removed.
Added#
Tumult Analytics now has experimental support for Python 3.12 using Pandas 2.
Changed#
Mechanism enums (e.g. CountMechanism) should now be imported from tmlt.analytics.query_builder. The current query expression module (tmlt.analytics.query_expr) will be removed from the public API in an upcoming release.
Removed#
Removed support for Python 3.7.
Deprecated#
QueryExprs (previously in tmlt.analytics.query_expr) will be removed from the Tumult Analytics public API in an upcoming release. Queries should be created using QueryBuilder instead.
0.13.0 - 2024-07-03#
This release makes some supporting classes immutable.
Changed#
Made BinningSpec immutable.
0.12.0 - 2024-06-18#
This release adds support for left public joins.
Added#
Added support for left public joins to join_public(); previously, only inner joins were supported.
0.11.0 - 2024-06-05#
This release introduces support in the query language for suppressing aggregates below a certain threshold, providing an easier and clearer way to express queries where small values must be dropped due to potentially-high noise.
For macOS users, it also introduces native support for Apple silicon, allowing Tumult Analytics to be used on ARM-based Macs without the need for Rosetta. Take a look at the updated installation guide for more information about this. If you have an existing installation that uses Rosetta, ensure that you are using a supported native Python installation when switching over. Users with Intel-based Macs should not be affected.
Added#
Added a tmlt.analytics.query_expr.SuppressAggregates query type, for suppressing aggregates less than a certain threshold. This is currently only supported for post-processing tmlt.analytics.query_expr.GroupByCount queries. These can be built using the QueryBuilder by calling AggregatedQueryBuilder.suppress after building a GroupByCount query. As part of this change, query builders now return a tmlt.analytics.query_builder.AggregatedQueryBuilder instead of a tmlt.analytics.query_expr.QueryExpr when aggregating; the tmlt.analytics.query_builder.AggregatedQueryBuilder can be passed to Session.evaluate, so most existing code should not need to be migrated.
Added cache() and uncache() methods to KeySet for caching and uncaching the underlying Spark dataframe. These methods can be used to improve performance because KeySets follow Spark’s lazy evaluation model.
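The idea behind suppressing aggregates can be sketched in a few lines of plain Python. This is only an illustration of the post-processing step on already-noisy group counts; the suppress function and the sample data below are hypothetical and do not use the Tumult Analytics API.

```python
from typing import Dict


def suppress(noisy_counts: Dict[str, int], threshold: int) -> Dict[str, int]:
    """Drop every group whose noisy count is less than the threshold."""
    return {
        group: count for group, count in noisy_counts.items() if count >= threshold
    }


# Made-up noisy counts; small or negative values are mostly noise.
noisy_counts = {"A": 120, "B": 3, "C": -2, "D": 10}
published = suppress(noisy_counts, threshold=10)
```

Because the filter runs on the noisy output rather than on the private data, it is pure post-processing and consumes no additional privacy budget.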
Changed#
PureDPBudget, ApproxDPBudget, and RhoZCDPBudget are now immutable classes.
PureDPBudget and ApproxDPBudget are no longer considered equal if they have the same epsilon and the ApproxDPBudget has a delta of zero.
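A minimal sketch of what these two changes mean for calling code, using frozen dataclasses. PureBudget and ApproxBudget are hypothetical stand-ins written for this example; they only mimic the immutability and equality behavior described above, not the library's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PureBudget:
    """Hypothetical stand-in for a pure-DP budget."""

    epsilon: float


@dataclass(frozen=True)
class ApproxBudget:
    """Hypothetical stand-in for an approximate-DP budget."""

    epsilon: float
    delta: float


pure = PureBudget(epsilon=1.0)
approx = ApproxBudget(epsilon=1.0, delta=0.0)

# Different types compare unequal even when delta is zero.
budgets_equal = pure == approx

# Immutable objects are hashable, so they can be used as dictionary keys.
labels = {pure: "pure", approx: "approx"}
```

Attempting to assign to a field, for example pure.epsilon = 2.0, raises an exception in this sketch instead of silently changing the budget.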
0.10.2 - 2024-05-31#
Changed#
Column order is now preserved when selecting columns from a KeySet.
0.10.1 - 2024-05-28#
This release contains no externally-visible changes from the previous version.
0.10.0 - 2024-05-17#
This release adds a new get_bounds() aggregation.
It also includes performance improvements for KeySets, and other quality-of-life improvements.
Added#
Added the QueryBuilder.get_bounds function, for determining upper and lower bounds for a column in a differentially private way.
Changed#
If a Builder has only one private dataframe and that dataframe uses the AddRowsWithID protected change, the relevant ID space will automatically be added to the Builder when build() is called.
KeySet is now an abstract class, in order to make some KeySet operations (column selection after cross-products) more efficient. Behavior is unchanged for users of the from_dict() and from_dataframe() constructors.
Fixed#
Stopped trying to set extra options for Java 11 and removed error when options are not set. Removed get_java_11_config().
Updated minimum supported Spark version to 3.1.1 to prevent Java 11 error.
0.9.0 - 2024-04-16#
This is a maintenance release, fixing a number of bugs and improving our API documentation.
Note that the 0.9.x release series will be the last to support Python 3.7, which has not been receiving security updates for several months. If this is a problem, please reach out to us.
Changed#
KeySet equality is now performed without converting the underlying dataframe to Pandas.
partition_and_create(): the column and splits arguments are now annotated as required.
The minimum supported version of Tumult Core is now 0.13.0.
The QueryBuilder.variance, QueryBuilder.stdev, GroupedQueryBuilder.variance, and GroupedQueryBuilder.stdev methods now calculate the sample variance or standard deviation, rather than the population variance or standard deviation.
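The difference between the two definitions can be seen with Python's standard statistics module. This sketch computes exact, non-private values on made-up data; it does not involve Tumult Analytics or any noise.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Population variance divides the sum of squared deviations by n.
population_variance = statistics.pvariance(values)  # 4.0

# Sample variance divides by n - 1 (Bessel's correction), so it is larger.
sample_variance = statistics.variance(values)  # 32 / 7

population_stdev = statistics.pstdev(values)  # 2.0
sample_stdev = statistics.stdev(values)
```

For large groups the two definitions are nearly identical, but for small groups the sample version is noticeably larger, so results computed before and after this change may differ.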
Removed#
Backwards-incompatible: The stability and grouping_column parameters to Session.from_dataframe and Session.Builder.with_private_dataframe have been removed (deprecated since 0.7.0). As a result, the protected_change parameter to those methods is now required.
Fixed#
The error message when attempting to overspend an ApproxDPBudget now more clearly indicates which component of the budget was insufficient to evaluate the query.
QueryBuilder.get_groups now automatically excludes ID columns if no columns are specified.
Flat maps now correctly ignore max_rows when it does not apply. Previously they would raise a warning saying that max_rows was ignored, but would still use it to limit the number of rows in the output.
0.8.3 - 2024-02-27#
This is a maintenance release that adds support for newer versions of Tumult Core. It contains no API changes.
0.8.2 - 2023-11-29#
This release addresses a serious security vulnerability in PyArrow: CVE-2023-47248. It is strongly recommended that all users update to this version of Analytics or apply one of the mitigations described in the GitHub Advisory.
Changed#
Increased minimum supported version of Tumult Core to 0.11.5. As a result:
Increased the minimum supported version of PyArrow to 14.0.1 for Python 3.8 and above.
Added dependency on pyarrow-hotfix on Python 3.7. Note that if you are using Python 3.7, the hotfix must be imported before using PySpark in order to be effective. Analytics imports the hotfix, so importing Analytics before using Spark will also work.
0.8.1 - 2023-10-30#
This release adds support for Python 3.11, as well as compatibility with newer versions of various dependencies, including PySpark. It also includes documentation improvements, but no API changes.
0.8.0 - 2023-08-15#
This is a maintenance release that addresses a performance regression for complex queries and improves naming consistency in some areas of the Tumult Analytics API.
Added#
Added the QueryBuilder.get_groups function, for determining groupby keys for a table in a differentially private way.
Changed#
Backwards-incompatible: Renamed DropExcess.max_records to max_rows.
Backwards-incompatible: Renamed FlatMap.max_num_rows to FlatMap.max_rows.
Changed the name of an argument for QueryBuilder.flat_map() from max_num_rows to max_rows. The old max_num_rows argument is deprecated and will be removed in a future release.
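A common pattern for renaming an argument while keeping the old name working for a while is sketched below. resolve_max_rows is a hypothetical helper written for this illustration; it is not how QueryBuilder.flat_map() is implemented, and the exact warning raised by the library is an assumption.

```python
import warnings
from typing import Optional


def resolve_max_rows(
    max_rows: Optional[int] = None, *, max_num_rows: Optional[int] = None
) -> Optional[int]:
    """Return the effective row limit, accepting the old name with a warning."""
    if max_num_rows is not None:
        if max_rows is not None:
            raise ValueError("specify max_rows or max_num_rows, not both")
        warnings.warn(
            "max_num_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return max_num_rows
    return max_rows
```

Code that still passes max_num_rows keeps working but sees a DeprecationWarning, which gives callers time to switch to max_rows before the old name is removed.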
Fixed#
Upgraded to version 0.11 of Tumult Core. This addresses a performance issue introduced in Tumult Analytics 0.7.0 where some complex queries compiled much more slowly than they had previously.
0.7.3 - 2023-07-13#
Fixed#
Fixed a crash in public and private joins.
0.7.2 - 2023-06-15#
This release adds support for running Tumult Analytics on Python 3.10. It also enables adding continuous Gaussian noise to query results, and addresses a number of bugs and API inconsistencies.
Added#
Tumult Analytics now supports Python 3.10 in addition to the previously-supported versions.
Queries evaluated with zCDP budgets can now use continuous Gaussian noise, allowing the use of Gaussian noise for queries with non-integer results.
Changed#
The QueryBuilder.replace_null_and_nan() and QueryBuilder.drop_null_and_nan() methods now accept empty column specifications on tables with an AddRowsWithID protected change. Replacing/dropping nulls on ID columns is still not allowed, but the ID column will now automatically be excluded in this case rather than raising an exception.
BinningSpec.bins() used to only include the NaN bin if the provided bin edges were floats. However, float-valued columns can be binned with integer bin edges, which resulted in a confusing situation where a BinningSpec could indicate that it would not use a NaN bin but still place values in the NaN bin. To avoid this, BinningSpec.bins() now always includes the NaN bin if one was specified, regardless of whether the bin edge type can represent NaN values.
The automatically-generated bin names in BinningSpec now quote strings when they are used as bin edges. For example, the bin generated by BinningSpec(["0", "1"]) is now ['0', '1'] where it was previously [0, 1]. Bins with edges of other types are not affected.
Fixed#
Creating a Session with multiple tables in an ID space used to fail if some of those tables’ ID columns allowed nulls and others did not. This no longer occurs, and in such cases all of the tables’ ID columns are made nullable.
0.7.1 - 2023-05-23#
This is a maintenance release that mainly contains documentation updates. It also fixes a bug where installing Tumult Analytics using pip 23 and above could fail due to a dependency mismatch.
0.7.0 - 2023-04-27#
This release adds support for privacy identifiers: Tumult Analytics can now protect input tables in which the differential privacy guarantee needs to hide the presence of arbitrarily many rows sharing the same value in a particular column. For example, this may be used to protect each user of a service when every row in a table is associated with a user ID.
Privacy identifiers are set up using the new AddRowsWithID protected change.
A number of features have been added to the API to support this, including alternative behaviors for various query transformations when working with IDs and the new concept of constraints.
To get started with these features, take a look at the new Working with privacy IDs and Doing more with privacy IDs tutorials.
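One common way to bound the influence of a single identifier before aggregating is to cap how many rows each ID may contribute. The pure-Python sketch below illustrates that idea only. truncate_rows_per_id is a hypothetical helper, and keeping the first rows in input order is a simplification; the strategy used by the library's own constraints may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Row = Tuple[Hashable, str]  # (user_id, value)


def truncate_rows_per_id(rows: Iterable[Row], max_rows: int) -> List[Row]:
    """Keep at most max_rows rows for each user ID, in input order."""
    kept_per_id: Dict[Hashable, int] = defaultdict(int)
    kept: List[Row] = []
    for row in rows:
        user_id = row[0]
        if kept_per_id[user_id] < max_rows:
            kept_per_id[user_id] += 1
            kept.append(row)
    return kept


visits = [("u1", "home"), ("u1", "cart"), ("u1", "pay"), ("u2", "home")]
bounded = truncate_rows_per_id(visits, max_rows=2)
```

After truncation, adding or removing any one user changes at most max_rows rows, which is what makes it possible to calibrate noise for a count over the table.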
Added#
A new AddRowsWithID protected change has been added, which protects the addition or removal of all rows with the same value in a specified column. See the documentation for AddRowsWithID and the Doing more with privacy IDs tutorial for more information.
When creating a Session with AddRowsWithID using a Session.Builder, you must use the new with_id_space() method to specify the identifier space(s) of tables using this protected change.
When creating a Session with Session.from_dataframe(), specifying an ID space is not necessary.
QueryBuilder has a new method, enforce(), for enforcing constraints on a table. Types for representing these constraints are located in the new tmlt.analytics.constraints module.
A new method, Session.describe(), has been added to provide a summary of the tables in a Session, or of a single table or the output of a query.
Changed#
QueryBuilder.join_private() now accepts the name of a private table as right_operand. For example, QueryBuilder("table").join_private("foo") is equivalent to QueryBuilder("table").join_private(QueryBuilder("foo")).
The max_num_rows parameter to QueryBuilder.flat_map() is now optional when applied to tables with an AddRowsWithID protected change.
Backwards-incompatible: The parameters to QueryBuilder.flat_map() have been reordered, moving max_num_rows to be the last parameter.
Backwards-incompatible: The lower and upper bounds for quantile, sum, average, variance, and standard deviation queries can no longer be equal to one another. The lower bound must now be strictly less than the upper bound.
Backwards-incompatible: Renamed QueryBuilder.filter() predicate argument to condition.
Backwards-incompatible: Renamed tmlt.analytics.query_expr.Filter query expression predicate property to condition.
Backwards-incompatible: Renamed KeySet.filter() expr argument to condition.
Deprecated#
The stability and grouping_column parameters to Session.from_dataframe() and Session.Builder.with_private_dataframe() are deprecated, and will be removed in a future release. The protected_change parameter should be used instead, and will become required.
Removed#
The attr_name parameter to Session.partition_and_create(), which was deprecated in version 0.5.0, has been removed.
Fixed#
Session.add_public_dataframe() used to allow creation of a public table with the same name as an existing public table, which was neither intended nor fully supported by some Session methods. It now raises a ValueError in this case.
Some query patterns on tables containing nulls could cause grouped aggregations to produce the wrong set of group keys in their output. This no longer happens.
In certain unusual cases, join transformations could erroneously drop rows containing nulls in columns that were not being joined on. These rows are no longer dropped.
0.6.1 - 2022-12-07#
This is a maintenance release which introduces a number of documentation improvements, but has no publicly-visible API changes.
0.6.0 - 2022-12-06#
This release introduces a new way to specify what unit of data is protected by the privacy guarantee of a Session.
A new protected_change parameter is available when creating a Session, taking an instance of the new ProtectedChange class which describes the largest unit of data in the resulting table on which the differential privacy guarantee will hold.
See the documentation for the protected_change module for more information about the available protected changes and how to use them.
The stability and grouping_column parameters which were used to specify this information are still accepted, and work as before, but they will be deprecated and eventually removed in future releases.
The default behavior of assuming stability=1 if no other information is given will also be deprecated and removed, on a similar timeline to stability and grouping_column; instead, explicitly specify protected_change=AddOneRow().
These changes should make the privacy guarantees provided by the Session interface easier to understand and harder to misuse, and allow for future support for other units of protection that were not representable with the existing API.
Added#
As described above, Session.Builder.with_private_dataframe and Session.from_dataframe now have a new parameter, protected_change. This parameter takes an instance of one of the classes defined in the new protected_change module, specifying the unit of data in the corresponding table to be protected.
0.5.1 - 2022-11-16#
Changed#
Updated to Tumult Core 0.6.0.
0.5.0 - 2022-10-17#
Added#
Added a diagram to the API reference page.
Analytics now does an additional Spark configuration check for users running Java 11+ at the time of Analytics Session initialization. If the user is running Java 11 or higher with an incorrect Spark configuration, Analytics raises an informative exception.
Added a method to check that basic Analytics functionality works (tmlt.analytics.utils.check_installation).
Changed#
Backwards-incompatible: Changed argument names for QueryBuilder.count_distinct and KeySet.__getitem__ from cols to columns, for consistency. The old argument has been deprecated, but is still available.
Backwards-incompatible: Changed the argument name for Session.partition_and_create from attr_name to column. The old argument has been deprecated, but is still available.
Improved the error message shown when a filter expression is invalid.
Updated to Tumult Core 0.5.0. As a result, python-flint is no longer a transitive dependency, simplifying the Analytics installation process.
Deprecated#
The contents of the cleanup module have been moved to the utils module. The cleanup module will be removed in a future version.
0.4.2 - 2022-09-06#
Fixed#
Switched to Core version 0.4.3 to avoid warnings when evaluating some queries.
0.4.1 - 2022-08-25#
Added#
Added QueryBuilder.histogram function, which provides a shorthand for generating binned data counts.
Analytics now checks to see if the user is running Java 11 or higher. If they are, Analytics either sets the appropriate Spark options (if Spark is not yet running) or raises an informative exception (if Spark is running and configured incorrectly).
Changed#
Improved documentation for QueryBuilder.map and QueryBuilder.flat_map.
Fixed#
Switched to Core version 0.4.2, which contains a fix for an issue that sometimes caused queries to fail to be compiled.
0.4.0 - 2022-07-22#
Added#
Session.from_dataframe and Session.Builder.with_private_dataframe now have a grouping_column option and support non-integer stabilities. This allows setting up grouping columns like those that result from grouping flatmaps when loading data. This is an advanced feature, and should be used carefully.
0.3.0 - 2022-06-23#
Added#
Added QueryBuilder.bin_column and an associated BinningSpec type.
Dates may now be used in KeySets.
Added support for DataFrames containing NaN and null values. Columns created by Map and FlatMap are now marked as potentially containing NaN and null values.
Added QueryBuilder.replace_null_and_nan function, which replaces null and NaN values with specified defaults.
Added QueryBuilder.replace_infinite function, which replaces positive and negative infinity values with specified defaults.
Added QueryBuilder.drop_null_and_nan function, which drops null and NaN values for specified columns.
Added QueryBuilder.drop_infinite function, which drops infinite values for specified columns.
Aggregations (sum, quantile, average, variance, and standard deviation) now silently drop null and NaN values before being performed.
Aggregations (sum, quantile, average, variance, and standard deviation) now silently clamp infinite values (+infinity and -infinity) to the query’s lower and upper bounds.
Added a cleanup module with two functions: a cleanup function to remove the current temporary table (which should be called before spark.stop()), and a remove_all_temp_tables function that removes all temporary tables ever created by Analytics.
Added a topic guide in the documentation for Tumult Analytics’ treatment of null, NaN, and infinite values.
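The null, NaN, and infinity handling described for aggregations can be sketched for a sum in plain Python. bounded_sum is a hypothetical helper written for this illustration; it computes an exact, non-private value and is not the library's implementation, which also adds noise.

```python
import math
from typing import Iterable, Optional


def bounded_sum(values: Iterable[Optional[float]], low: float, high: float) -> float:
    """Sum values after dropping nulls/NaNs and clamping everything to [low, high]."""
    total = 0.0
    for value in values:
        if value is None or math.isnan(value):
            continue  # nulls and NaNs are silently dropped
        # Clamping maps +infinity to high and -infinity to low.
        total += min(max(value, low), high)
    return total


column = [1.0, None, float("nan"), float("inf"), float("-inf"), 3.0]
result = bounded_sum(column, low=0.0, high=5.0)  # 1 + 5 + 0 + 3
```

In this sketch finite values outside the bounds are clamped as well, which is the usual way a bounded sum limits how much any single row can contribute.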
Changed#
Backwards-incompatible: Sessions no longer allow DataFrames to contain a column named "" (the empty string).
Backwards-incompatible: You can no longer call Session.Builder.with_privacy_budget multiple times on the same builder.
Backwards-incompatible: You can no longer call Session.add_private_data multiple times with the same source id.
Backwards-incompatible: Sessions now use the DataFrame’s schema to determine which columns are nullable.
Removed#
Backwards-incompatible: Removed groupby_public_source and groupby_domains from QueryBuilder.
Backwards-incompatible: Session.from_csv and CSV-related methods on Session.Builder have been removed. Instead, use spark.read.csv along with Session.from_dataframe and other dataframe-based methods.
Backwards-incompatible: Removed the validate option from Session.from_dataframe, Session.add_public_dataframe, Session.Builder.with_private_dataframe, and Session.Builder.with_public_dataframe.
Backwards-incompatible: Removed KeySet.contains_nan_or_null.
Fixed#
Backwards-incompatible: KeySets now explicitly check for and disallow the use of floats and timestamps as keys. This has always been the intended behavior, but it was previously not checked for and could work or cause non-obvious errors depending on the situation.
KeySet.dataframe() now always returns a dataframe where all rows are distinct.
Under certain circumstances, evaluating a GroupByCountDistinct query expression used to modify the input QueryExpr. This no longer occurs.
It is now possible to partition on a column created by a grouping flat map, which used to raise an exception from Core.
0.2.1 - 2022-04-14 (internal release)#
Added#
Added support for basic operations (filter, map, etc.) on Spark date and timestamp columns. ColumnType has two new variants, DATE and TIMESTAMP, to support these.
Future documentation will now include any exceptions defined in Analytics.
Changed#
Switch session to use Persist/Unpersist instead of Cache.
0.2.0 - 2022-03-28 (internal release)#
Removed#
Multi-query evaluate support is entirely removed.
Columns that are neither floats nor doubles will no longer be checked for NaN values.
The BIT variant of the ColumnType enum was removed, as it was not supported elsewhere in Analytics.
Changed#
Backwards-incompatible: Renamed query_exprs parameter in Session.evaluate to query_expr.
Backwards-incompatible: QueryBuilder.join_public and the JoinPublic query expression can now accept public tables specified as Spark dataframes. The existing behavior using public source IDs is still supported, but the public_id parameter/property is now called public_table.
Installation on Python 3.7.1 through 3.7.3 is now allowed.
KeySets now do type coercion on creation, matching the type coercion that Sessions do for private sources.
Sessions created by partition_and_create must be used in the order they were created, and using the parent session will forcibly close all child sessions. Sessions can be manually closed with session.stop().
Fixed#
Joining with a public table that contains no NaNs, but has a column where NaNs are allowed, previously caused an error when compiling queries. This is now handled correctly.
0.1.1 - 2022-02-28 (internal release)#
Added#
Added a KeySet class, which will eventually be used for all GroupBy queries.
Added QueryBuilder.groupby(), a new group-by based on KeySets.
Changed#
The Analytics library now uses KeySet and QueryBuilder.groupby() for all GroupBy queries.
The various Session methods for loading in data from CSV no longer support loading the data’s schema from a file.
Made Session return a more user-friendly error message when the user provides a privacy budget of 0.
Removed all instances of the old name of this library, and replaced them with “Analytics”.
Deprecated#
QueryBuilder.groupby_domains() and QueryBuilder.groupby_public_source() are now deprecated in favor of using QueryBuilder.groupby() with KeySets. They will be removed in a future version.
0.1.0 - 2022-02-15 (internal release)#
Added#
Initial release.