Troubleshooting#
This page lists common issues that can arise when using Tumult Analytics, and explains how to address them.
Handling large amounts of data#
When running Analytics locally on large amounts of data (10 million rows or more),
you might encounter Spark errors like
java.lang.OutOfMemoryError: GC overhead limit exceeded
or java.lang.OutOfMemoryError: Java heap space
.
It’s often possible to successfully run Analytics
locally anyway, by configuring Spark with enough RAM. See our
Spark guide for more information.
Receiving empty dataframes as outputs#
If you’re running Analytics queries and getting empty dataframes as outputs, this likely indicates that your Spark configuration is incorrect. If you run the installation checker, it should identify this problem.
Receiving empty dataframes is a sign that Spark is writing to an incorrect warehouse directory location. This is most likely to occur in a setting where multiple machines need to use a shared location as a datastore, but no external warehouse directory is specified.
The issue can be resolved by providing your desired warehouse directory location when building your Spark session.
For example, to configure the session to use to the S3 location s3://my-bucket/spark-warehouse
, you would use the following code:
from pyspark.sql import SparkSession
warehouse_location = "s3://my-bucket/spark-warehouse"
spark = SparkSession.builder.config(
"spark.sql.warehouse.dir", warehouse_location
).getOrCreate()
This assumes that you have configured Spark with the permissions to interact with the given bucket.
If you are using Hive tables to read and write data, you may instead want to consult the Hive section of the Spark topic guide. For more tips related to Spark, see the entirety of that guide.
PicklingError
on map queries#
Functions used in
Map
or FlatMap
queries cannot reference Spark objects, directly or indirectly. If they do,
you might get errors like this:
_pickle.PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers
or like this:
PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
For example, this code will raise an error:
from typing import Dict, List
from pyspark.sql import DataFrame, SparkSession
from tmlt.analytics.query_builder import ColumnType, QueryBuilder
class DataReader:
def __init__(self, filenames: List[str]):
spark = SparkSession.builder.getOrCreate()
self.data: Dict[str, DataFrame] = {}
for f in filenames:
self.data[f] = spark.read.csv(f)
reader = DataReader(["a.csv", "b.csv"])
qb = QueryBuilder("private").map(
f=lambda row: {"data_files": ",".join(reader.data.keys())},
new_column_types={"data_files": ColumnType.VARCHAR},
)
session.create_view(qb, source_id="my_view", cache=True)
If you re-write the map function so that no objects referenced inside the function have any references to Spark objects, the map function will succeed:
data_files = ",".join(reader.data.keys())
qb = QueryBuilder("private").map(
f=lambda row: {"data_files": data_files},
new_column_types={"data_files": ColumnType.VARCHAR},
)
session.create_view(qb, source_id="my_view", cache=True)
Having problems with something else?#
Ask for help on our Slack server in the #library-questions channel!