0.8.2 - 2023-11-29
This release addresses a serious security vulnerability in PyArrow: CVE-2023-47248. It is strongly recommended that all users update to this version of Analytics or apply one of the mitigations described in the GitHub Advisory.
Increased minimum supported version of Tumult Core to 0.11.5. As a result:
Increased the minimum supported version of PyArrow to 14.0.1 for Python 3.8 and above.
Added dependency on pyarrow-hotfix on Python 3.7. Note that if you are using Python 3.7, the hotfix must be imported before using PySpark in order to be effective. Analytics imports the hotfix, so importing Analytics before using Spark will also work.
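As a rough way to check an environment against the PyArrow requirement above, the following sketch compares a PyArrow version string with 14.0.1 using only the standard library (Python 3.8 and above). The helper names parse_version and pyarrow_is_patched are illustrative and not part of Analytics, and the parser assumes a plain dotted numeric version, ignoring any suffix.

```python
import re
from importlib import metadata

# Minimum PyArrow version stated in the release notes for Python 3.8+.
MIN_PYARROW = (14, 0, 1)


def parse_version(text):
    """Turn '14.0.1' into (14, 0, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def pyarrow_is_patched(installed=None):
    """Return True if the given (or installed) PyArrow version is >= 14.0.1."""
    if installed is None:
        try:
            installed = metadata.version("pyarrow")
        except metadata.PackageNotFoundError:
            return False  # PyArrow is not installed at all
    return parse_version(installed) >= MIN_PYARROW
```

Calling pyarrow_is_patched() with no argument inspects the installed distribution; passing a version string checks that string instead.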
0.8.1 - 2023-10-30
This release adds support for Python 3.11, as well as compatibility with newer versions of various dependencies, including PySpark. It also includes documentation improvements, but no API changes.
0.8.0 - 2023-08-15
This is a maintenance release that addresses a performance regression for complex queries and improves naming consistency in some areas of the Tumult Analytics API.
Added the QueryBuilder.get_groups function, for determining groupby keys for a table in a differentially private way.
Changed the name of an argument to max_rows. The old max_num_rows argument is deprecated and will be removed in a future release.
Upgrades to version 0.11 of Tumult Core. This addresses a performance issue introduced in Tumult Analytics 0.7.0 where some complex queries compiled much more slowly than they had previously.
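The max_num_rows to max_rows rename in this release follows a common deprecation pattern: the old keyword keeps working but emits a warning. The sketch below shows that pattern in isolation. The function resolve_max_rows is hypothetical and is not how Analytics implements the change.

```python
import warnings


def resolve_max_rows(max_rows=None, max_num_rows=None):
    """Return the row limit, accepting the deprecated max_num_rows spelling."""
    if max_num_rows is not None:
        if max_rows is not None:
            raise ValueError("Specify max_rows or max_num_rows, not both")
        # The old name still works, but callers are told to migrate.
        warnings.warn(
            "max_num_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return max_num_rows
    if max_rows is None:
        raise ValueError("max_rows is required")
    return max_rows
```

Code that already passes max_rows sees no warning, so migrating is a matter of renaming the keyword at each call site.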
0.7.3 - 2023-07-13
Fixed a crash in public and private joins.
0.7.2 - 2023-06-15
This release adds support for running Tumult Analytics on Python 3.10. It also enables adding continuous Gaussian noise to query results, and addresses a number of bugs and API inconsistencies.
Tumult Analytics now supports Python 3.10 in addition to the previously-supported versions.
Queries evaluated with zCDP budgets can now use continuous Gaussian noise, allowing the use of Gaussian noise for queries with non-integer results.
QueryBuilder.drop_null_and_nan() methods now accept empty column specifications on tables with an AddRowsWithID protected change. Replacing/dropping nulls on ID columns is still not allowed, but the ID column will now automatically be excluded in this case rather than raising an exception.
BinningSpec.bins() used to only include the NaN bin if the provided bin edges were floats. However, float-valued columns can be binned with integer bin edges, which resulted in a confusing situation where a BinningSpec could indicate that it would not use a NaN bin but still place values in the NaN bin. To avoid this, BinningSpec.bins() now always includes the NaN bin if one was specified, regardless of whether the bin edge type can represent NaN values.
The automatically-generated bin names in BinningSpec now quote strings when they are used as bin edges. For example, the bin generated by BinningSpec(["0", "1"]) is now ['0', '1'] where it was previously [0, 1]. Bins with edges of other types are not affected.
Creating a Session with multiple tables in an ID space used to fail if some of those tables’ ID columns allowed nulls and others did not. This no longer occurs, and in such cases all of the tables’ ID columns are made nullable.
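To illustrate the bin-name quoting change in this release, here is a minimal sketch of a labelling rule that quotes string edges and leaves other types alone. The function bin_name is hypothetical and only models the single two-edge bin from the example above. It is not the BinningSpec implementation and does not cover interval conventions for additional bins.

```python
def bin_name(lower, upper):
    """Format a bin label, quoting string edges and leaving other types as-is."""

    def show(edge):
        # repr() adds quotes around strings; other edge types use str().
        return repr(edge) if isinstance(edge, str) else str(edge)

    return f"[{show(lower)}, {show(upper)}]"


print(bin_name("0", "1"))  # ['0', '1']
print(bin_name(0, 1))      # [0, 1]
```

With quoting, a bin built from the strings "0" and "1" can no longer be confused with one built from the integers 0 and 1.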
0.7.1 - 2023-05-23
This is a maintenance release that mainly contains documentation updates. It also fixes a bug where installing Tumult Analytics using pip 23 and above could fail due to a dependency mismatch.
0.7.0 - 2023-04-27
This release adds support for privacy identifiers: Tumult Analytics can now protect input tables in which the differential privacy guarantee needs to hide the presence of arbitrarily many rows sharing the same value in a particular column. For example, this may be used to protect each user of a service when every row in a table is associated with a user ID.
Privacy identifiers are set up using the new AddRowsWithID protected change.
A number of features have been added to the API to support this, including alternative behaviors for various query transformations when working with IDs and the new concept of
To get started with these features, take a look at the new Working with privacy IDs and Doing more with privacy IDs tutorials.
The AddRowsWithID protected change has been added, which protects the addition or removal of all rows with the same value in a specified column. See the documentation for AddRowsWithID and the Doing more with privacy IDs tutorial for more information.
QueryBuilder.join_private() now accepts the name of a private table as right_operand. For example, QueryBuilder("table").join_private("foo") is equivalent to
Backwards-incompatible: The parameters to QueryBuilder.flat_map() have been reordered, moving max_num_rows to be the last parameter.
Backwards-incompatible: The lower and upper bounds for quantile, sum, average, variance, and standard deviation queries can no longer be equal to one another. The lower bound must now be strictly less than the upper bound.
Session.partition_and_create(), which was deprecated in version 0.5.0, has been removed.
Session.add_public_dataframe() used to allow creation of a public table with the same name as an existing public table, which was neither intended nor fully supported by some Session methods. It now raises a ValueError in this case.
Some query patterns on tables containing nulls could cause grouped aggregations to produce the wrong set of group keys in their output. This no longer happens.
In certain unusual cases, join transformations could erroneously drop rows containing nulls in columns that were not being joined on. These rows are no longer dropped.
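The stricter bound requirement introduced in this release (lower bound strictly less than upper bound) can be expressed as a one-line check. The sketch below is only an illustration of the rule. The function check_bounds is hypothetical, and the exception type is an assumption rather than something stated in these notes.

```python
def check_bounds(low, high):
    """Return (low, high) if low is strictly less than high, else raise."""
    if not low < high:
        raise ValueError(
            f"Lower bound ({low}) must be strictly less than upper bound ({high})"
        )
    return low, high
```

Under this rule a call such as check_bounds(5, 5) is rejected, whereas equal bounds were accepted before 0.7.0.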
0.6.1 - 2022-12-07
This is a maintenance release which introduces a number of documentation improvements, but has no publicly-visible API changes.
0.6.0 - 2022-12-06
This release introduces a new way to specify what unit of data is protected by the privacy guarantee. A protected_change parameter is now available when creating a Session, taking an instance of the new ProtectedChange class, which describes the largest unit of data in the resulting table on which the differential privacy guarantee will hold.
See the documentation for the protected_change module for more information about the available protected changes and how to use them.
The older parameters that were used to specify this information, including grouping_column, are still accepted and work as before, but they will be deprecated and eventually removed in future releases.
The default behavior of assuming stability=1 if no other information is given will also be deprecated and removed, on a similar timeline to grouping_column; instead, explicitly specify
These changes should make the privacy guarantees provided by the Session interface easier to understand and harder to misuse, and allow for future support for other units of protection that were not representable with the existing API.
As described above, Session.from_dataframe now have a new parameter, protected_change. This parameter takes an instance of one of the classes defined in the new protected_change module, specifying the unit of data in the corresponding table to be protected.
0.5.1 - 2022-11-16
Updated to Tumult Core 0.6.0.
0.5.0 - 2022-10-17
Added a diagram to the API reference page.
Analytics now does an additional Spark configuration check for users running Java 11+ at the time of Analytics Session initialization. If the user is running Java 11 or higher with an incorrect Spark configuration, Analytics raises an informative exception.
Added a method to check that basic Analytics functionality works (
Backwards-incompatible: Changed argument names to columns, for consistency. The old argument has been deprecated, but is still available.
Backwards-incompatible: Changed an argument name to column. The old argument has been deprecated, but is still available.
Improved the error message shown when a filter expression is invalid.
Updated to Tumult Core 0.5.0. As a result, python-flint is no longer a transitive dependency, simplifying the Analytics installation process.
The contents of the cleanup module have been moved, and the cleanup module will be removed in a future version.
0.4.2 - 2022-09-06
Switched to Core version 0.4.3 to avoid warnings when evaluating some queries.
0.4.1 - 2022-08-25
Added the QueryBuilder.histogram function, which provides a shorthand for generating binned data counts.
Analytics now checks to see if the user is running Java 11 or higher. If they are, Analytics either sets the appropriate Spark options (if Spark is not yet running) or raises an informative exception (if Spark is running and configured incorrectly).
Improved documentation for
Switched to Core version 0.4.2, which contains a fix for an issue that sometimes caused queries to fail to be compiled.
0.4.0 - 2022-07-22
Session.Builder.with_private_dataframe now have a grouping_column option and support non-integer stabilities. This allows setting up grouping columns like those that result from grouping flatmaps when loading data. This is an advanced feature, and should be used carefully.
0.3.0 - 2022-06-23
QueryBuilder.bin_column and an associated
Dates may now be used in
Added support for DataFrames containing NaN and null values. Columns created by Map and FlatMap are now marked as potentially containing NaN and null values.
Added the QueryBuilder.replace_null_and_nan function, which replaces null and NaN values with specified defaults.
Added the QueryBuilder.replace_infinite function, which replaces positive and negative infinity values with specified defaults.
Added the QueryBuilder.drop_null_and_nan function, which drops null and NaN values for specified columns.
Added the QueryBuilder.drop_infinite function, which drops infinite values for specified columns.
Aggregations (sum, quantile, average, variance, and standard deviation) now silently drop null and NaN values before being performed.
Aggregations (sum, quantile, average, variance, and standard deviation) now silently clamp infinite values (+infinity and -infinity) to the query’s lower and upper bounds.
Added a cleanup module with two functions: a cleanup function to remove the current temporary table (which should be called before spark.stop()), and a remove_all_temp_tables function that removes all temporary tables ever created by Analytics.
Added a topic guide in the documentation for Tumult Analytics’ treatment of null, NaN, and infinite values.
Backwards-incompatible: Sessions no longer allow DataFrames to contain a column named "" (the empty string).
Backwards-incompatible: You can no longer call Session.Builder.with_privacy_budget multiple times on the same builder.
Backwards-incompatible: You can no longer call Session.add_private_data multiple times with the same source id.
Backwards-incompatible: Sessions now use the DataFrame’s schema to determine which columns are nullable.
Session.from_csv and CSV-related methods on Session.Builder have been removed. Instead, use Session.from_dataframe and other dataframe-based methods.
KeySets now explicitly check for and disallow the use of floats and timestamps as keys. This has always been the intended behavior, but it was previously not checked for and could work or cause non-obvious errors depending on the situation.
KeySet.dataframe() now always returns a dataframe where all rows are distinct.
Under certain circumstances, evaluating a GroupByCountDistinct query expression used to modify the input QueryExpr. This no longer occurs.
It is now possible to partition on a column created by a grouping flat map, which used to raise an exception from Core.
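The aggregation behavior described in this release (nulls and NaNs dropped, infinities clamped to the query bounds) can be modeled on a plain Python list. The sketch below is an illustration of the stated rules only. The function prepare_for_aggregation is hypothetical, operates on lists rather than Spark DataFrames, and is not how Analytics implements the behavior.

```python
import math


def prepare_for_aggregation(values, low, high):
    """Drop None/NaN entries and clamp infinite values to [low, high]."""
    cleaned = []
    for value in values:
        if value is None or math.isnan(value):
            continue  # nulls and NaNs are silently dropped
        if math.isinf(value):
            # +infinity becomes the upper bound, -infinity the lower bound
            value = high if value > 0 else low
        cleaned.append(value)
    return cleaned


data = [1.0, None, float("nan"), float("inf"), float("-inf")]
print(prepare_for_aggregation(data, 0.0, 10.0))  # [1.0, 10.0, 0.0]
```

Only the two rules listed in the notes are modeled here; any further processing an aggregation applies to finite values is left out.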
0.2.1 - 2022-04-14 (internal release)
Added support for basic operations (filter, map, etc.) on Spark date and timestamp columns.
ColumnType has two new variants, including TIMESTAMP, to support these.
Future documentation will now include any exceptions defined in Analytics.
Switched Session to use Persist/Unpersist instead of Cache.
0.2.0 - 2022-03-28 (internal release)
Multi-query evaluate support is entirely removed.
Columns that are neither floats nor doubles will no longer be checked for NaN values.
The BIT variant of the ColumnType enum was removed, as it was not supported elsewhere in Analytics.
The JoinPublic query expression can now accept public tables specified as Spark dataframes. The existing behavior using public source IDs is still supported, but the public_id parameter/property is now called
Installation on Python 3.7.1 through 3.7.3 is now allowed.
KeySets now do type coercion on creation, matching the type coercion that Sessions do for private sources.
Sessions created by partition_and_create must be used in the order they were created, and using the parent session will forcibly close all child sessions. Sessions can be manually closed with
Joining with a public table that contains no NaNs, but has a column where NaNs are allowed, previously caused an error when compiling queries. This is now handled correctly.
0.1.1 - 2022-02-28 (internal release)
Added the KeySet class, which will eventually be used for all GroupBy queries.
QueryBuilder.groupby(), a new group-by based on
The Analytics library now uses QueryBuilder.groupby() for all GroupBy queries.
Session methods for loading in data from CSV no longer support loading the data’s schema from a file.
Made Session return a more user-friendly error message when the user provides a privacy budget of 0.
Removed all instances of the old name of this library, and replaced them with “Analytics”.
QueryBuilder.groupby_public_source() are now deprecated in favor of using KeySets. They will be removed in a future version.
0.1.0 - 2022-02-15 (internal release)