Troubleshooting#
This page lists common issues that can arise when using Tumult Analytics, and explains how to address them.
Handling large amounts of data#
When running Analytics locally on large amounts of data (10 million rows or more),
you might encounter Spark errors like
java.lang.OutOfMemoryError: GC overhead limit exceeded
or java.lang.OutOfMemoryError: Java heap space.
It's often possible to successfully run Analytics locally anyway,
by configuring Spark with enough RAM. See our
Spark guide for more information.
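For example, one way to give the driver more memory is to set it when the Spark session is first created. This is a minimal sketch: the 8g value is purely illustrative, and memory settings only take effect if they are applied before any Spark session exists.

from pyspark.sql import SparkSession

# Driver memory must be configured before the first Spark session is
# created; "8g" is an illustrative value, not a recommendation.
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)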
PicklingError on map queries#
Functions used in Map or FlatMap queries cannot reference Spark objects,
directly or indirectly. If they do, you might get errors like this:
_pickle.PicklingError: Could not serialize object: RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers
or like this:
PicklingError: Could not serialize object: TypeError: can't pickle _thread.RLock objects
For example, this code will raise an error:
from typing import Dict, List

from pyspark.sql import DataFrame, SparkSession

from tmlt.analytics.query_builder import ColumnType, QueryBuilder

class DataReader:
    def __init__(self, filenames: List[str]):
        spark = SparkSession.builder.getOrCreate()
        self.data: Dict[str, DataFrame] = {}
        for f in filenames:
            self.data[f] = spark.read.csv(f)

reader = DataReader(["a.csv", "b.csv"])

# The map function references `reader`, which holds Spark DataFrames,
# so the function cannot be pickled and sent to Spark workers.
qb = QueryBuilder("private").map(
    f=lambda row: {"data_files": ",".join(reader.data.keys())},
    new_column_types={"data_files": ColumnType.VARCHAR},
)
# `session` is a previously created Session with a "private" source.
session.create_view(qb, source_id="my_view", cache=True)
If you rewrite the map function so that no object it references holds a reference to a Spark object, the query will succeed:
# Compute the value outside the function, so the function only
# captures a plain string rather than the Spark-backed reader.
data_files = ",".join(reader.data.keys())

qb = QueryBuilder("private").map(
    f=lambda row: {"data_files": data_files},
    new_column_types={"data_files": ColumnType.VARCHAR},
)
session.create_view(qb, source_id="my_view", cache=True)
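In general, compute any value that depends on a Spark object ahead of time, outside the mapping function, so that the function only captures plain Python values that can be pickled and shipped to workers.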
Having problems with something else?#
Ask for help on our Slack server in the #library-questions channel!