Problem
You are getting intermittent job failures with a NoSuchElementException error.
Example stack trace:
Py4JJavaError: An error occurred while calling o2843.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in stage 868.0 failed 4 times, most recent failure: Lost task 17.3 in stage 868.0 (TID 3065) (10.249.38.86 executor 6): java.util.NoSuchElementException
	at org.apache.spark.sql.vectorized.ColumnarBatch$1.next(ColumnarBatch.java:69)
	at org.apache.spark.sql.vectorized.ColumnarBatch$1.next(ColumnarBatch.java:58)
	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$4.next(ArrowConverters.scala:401)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$4.next(ArrowConverters.scala:382)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
	...
Cause
The NoSuchElementException error is the result of an issue in the Apache Arrow optimization. Apache Arrow is an in-memory columnar data format used in Spark to efficiently transfer data between the JVM and Python processes.
When Arrow optimization is enabled, the Py4J interface can call iterator.next() without first checking iterator.hasNext(). If the iterator is exhausted, for example because an Arrow batch is empty, this results in a NoSuchElementException error.
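For context, the Arrow transfer path is exercised by conversions between Spark and pandas, such as DataFrame.toPandas() and SparkSession.createDataFrame() called with a pandas DataFrame. The following sketch shows the kind of workload that goes through this path; the data and names are illustrative and not a guaranteed reproduction of the failure.

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()

# Arrow optimization is on by default on Databricks; shown explicitly here.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = pd.DataFrame({"id": range(1000), "value": range(1000)})

# createDataFrame() from pandas and toPandas() both move data through
# Arrow record batches between the JVM and Python. A partition that
# produces an empty (size 0) batch can surface the error above.
df = spark.createDataFrame(pdf)
out = df.filter("value > 100").toPandas()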
Solution
Set spark.databricks.pyspark.emptyArrowBatchCheck to true in the cluster's Spark config (AWS | Azure | GCP).
spark.databricks.pyspark.emptyArrowBatchCheck=true
Enabling spark.databricks.pyspark.emptyArrowBatchCheck prevents a NoSuchElementException error from occurring when the Arrow batch size is 0.
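If you manage configuration in code, a minimal sketch of setting the flag when building the session follows. On Databricks the SparkSession is created for you, so the cluster Spark config above remains the recommended route; this sketch assumes you control session creation yourself.

from pyspark.sql import SparkSession

# Prefer the cluster's Spark config on Databricks; the builder form is
# shown for environments where you create the session yourself.
spark = (
    SparkSession.builder
    .config("spark.databricks.pyspark.emptyArrowBatchCheck", "true")
    .getOrCreate()
)

# Confirm the value that is actually in effect.
print(spark.conf.get("spark.databricks.pyspark.emptyArrowBatchCheck"))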
Alternatively, you can disable Arrow optimization by setting the following properties in your cluster's Spark config.
spark.sql.execution.arrow.pyspark.enabled=false
spark.sql.execution.arrow.enabled=false
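Both properties are SQL configurations, so they can also be turned off at runtime for a single session; spark.sql.execution.arrow.enabled is the older name, kept here for compatibility with earlier Spark versions. A minimal sketch, assuming an active SparkSession named spark:

# Disable Arrow optimization for the current session only.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
spark.conf.set("spark.sql.execution.arrow.enabled", "false")  # legacy name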
Disabling Arrow optimization may have performance implications: conversions such as DataFrame.toPandas() and createDataFrame() from a pandas DataFrame fall back to slower, row-based serialization.