Job fails with NoSuchElementException error

NoSuchElementException errors can occur when using Apache Arrow.

Written by ashish

Last published at: March 3rd, 2023

Problem

You are getting intermittent job failures with a NoSuchElementException error. 

Example stack trace

Py4JJavaError: An error occurred while calling o2843.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 868.0 failed 4 times, most recent failure: Lost task 17.3 in stage 868.0 (TID 3065) (10.249.38.86 executor 6): java.util.NoSuchElementException
at org.apache.spark.sql.vectorized.ColumnarBatch$1.next(ColumnarBatch.java:69)
at org.apache.spark.sql.vectorized.ColumnarBatch$1.next(ColumnarBatch.java:58)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$4.next(ArrowConverters.scala:401)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$4.next(ArrowConverters.scala:382)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
...

Cause

The NoSuchElementException error is the result of an issue in Apache Arrow optimization. Apache Arrow is an in-memory columnar data format that is used in spark to efficiently transfer data between the JVM and Python.

When Arrow optimization is enabled in the Py4J interface, there is a possibility it can call iterator.next() without checking iterator.hasNext(). This can result in a NoSuchElementException error. 

Solution

Set spark.databricks.pyspark.emptyArrowBatchCheck to true in the cluster's Spark config (AWS | Azure | GCP).

spark.databricks.pyspark.emptyArrowBatchCheck=true

Enabling spark.databricks.pyspark.emptyArrowBatchCheck prevents a NoSuchElementException error from occurring when the Arrow batch size is 0.

Alternatively, you can disable Arrow optimization by setting the following properties in your cluster's Spark config.

spark.sql.execution.arrow.pyspark.enabled=false
spark.sql.execution.arrow.enabled=false

Disabling Arrow optimization may have performance implications.