Job fails with IndexOutOfBoundsException and ArrowBuf errors

When groupBy is used with applyInPandas, it can result in Apache Arrow buffer size estimation errors.

Written by Ashish

Last published at: March 3rd, 2023

Problem

You are getting intermittent job failures with java.lang.IndexOutOfBoundsException and ArrowBuf errors. 

Example stack trace

Py4JJavaError: An error occurred while calling o617.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 2195, 10.207.235.228, executor 0): java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
    at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
    at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
    at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
    at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)


Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.lang.IndexOutOfBoundsException: index: 0, length: 1073741824 (expected: range(0, 0))
    at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:716)
    at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:954)
    at org.apache.arrow.vector.BaseVariableWidthVector.reallocDataBuffer(BaseVariableWidthVector.java:508)
    at org.apache.arrow.vector.BaseVariableWidthVector.handleSafe(BaseVariableWidthVector.java:1239)
    at org.apache.arrow.vector.BaseVariableWidthVector.setSafe(BaseVariableWidthVector.java:1066)
    at org.apache.spark.sql.execution.arrow.StringWriter.setValue(ArrowWriter.scala:287)
    at org.apache.spark.sql.execution.arrow.ArrowFieldWriter.write(ArrowWriter.scala:151)
    at org.apache.spark.sql.execution.arrow.ArrowWriter.write(ArrowWriter.scala:105)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:100)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:122)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:478)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2146)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:270)

Cause

This is due to an issue in Apache Arrow's buffer size estimation. Apache Arrow is an in-memory columnar data format used in Spark to efficiently transfer data between JVM and Python processes.

When groupBy is used with applyInPandas, each group is written into Arrow buffers before being handed to Python, and the estimation issue can surface intermittently during this write, as in the sketch below.
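The following is a minimal sketch of the pattern that can trigger the failure. The DataFrame, column names, and grouping function are illustrative only, not taken from the failing job:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a" * 100), (1, "b" * 100), (2, "c" * 100)],
    schema="id long, payload string",
)

def upper_payload(pdf: pd.DataFrame) -> pd.DataFrame:
    # Arbitrary per-group pandas transformation.
    pdf["payload"] = pdf["payload"].str.upper()
    return pdf

# Each group is converted to Arrow record batches before being passed to
# Python; the buffer size estimation issue can surface during this write,
# producing the IndexOutOfBoundsException shown above.
result = df.groupBy("id").applyInPandas(upper_payload, schema="id long, payload string")
result.count()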

For more information, review ARROW-15983 in the Apache Arrow issue tracker.

Solution

This is a sporadic failure and a retry usually succeeds.

If a retry doesn't work, you can work around the issue by adding the following to your cluster's Spark config (AWS | Azure | GCP):

spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled=true

Info

Enabling pandasZeroConfConversion.groupbyApply may reduce performance, so it should only be used when needed. It should not be a default setting on your cluster.
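If you want to try the setting in a notebook before editing the cluster's Spark config, a sketch using the session configuration follows. Whether this particular setting can be changed at runtime depends on your Databricks Runtime version, so treat this as an assumption to verify:

# Sketch: apply the workaround at the session level. The key name comes from
# the cluster Spark config above; runtime modifiability is an assumption.
spark.conf.set(
    "spark.databricks.execution.pandasZeroConfConversion.groupbyApply.enabled",
    "true",
)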