Error when trying to create a distributed Ray dataset using the from_spark() function

Set spark.databricks.pyspark.dataFrameChunk.enabled to true.

Written by Raghavan Vaidhyaraman

Last published at: January 30th, 2025

Problem

When you try to create a distributed Ray dataset from an Apache Spark DataFrame using the ray.data.from_spark() function, you encounter the following error. 

RuntimeError: In databricks runtime, if you want to use 'ray.data.from_spark' API, you need to set spark cluster config 'spark.databricks.pyspark.dataFrameChunk.enabled' to 'true'.
File <command-602145481410085>, line 3
      1 import ray.data
----> 3 ray_dataset = ray.data.from_spark(dataframe)

Cause

The spark.databricks.pyspark.dataFrameChunk.enabled configuration is set to false by default, which prevents ray.data.from_spark() from creating a Ray dataset from a Spark DataFrame.
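
You can confirm the effective value from a notebook before changing the cluster configuration. The following is a minimal sketch that assumes the standard Databricks notebook environment, where a spark session is already defined; the "false" argument is only a display fallback used when the key has not been set on the cluster.

# Check the current value of the chunking config (assumes the predefined
# `spark` session available in Databricks notebooks).
current = spark.conf.get("spark.databricks.pyspark.dataFrameChunk.enabled", "false")
print(f"spark.databricks.pyspark.dataFrameChunk.enabled = {current}")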

Solution

Set spark.databricks.pyspark.dataFrameChunk.enabled to true to ensure the from_spark() function works as expected.

  1. Navigate to your cluster’s configuration page.
  2. Click the Advanced Options accordion.
  3. Click the Spark tab.
  4. In the Spark Config textbox, enter spark.databricks.pyspark.dataFrameChunk.enabled true
  5. Click Confirm.
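
Once the cluster has restarted with the new configuration, the original call should succeed. The sketch below is illustrative only: it assumes a Ray cluster is already running on the Databricks cluster (for example, one started with ray.util.spark.setup_ray_cluster()) and that dataframe is an existing Spark DataFrame.

import ray.data

# With spark.databricks.pyspark.dataFrameChunk.enabled set to true, from_spark()
# builds a distributed Ray dataset from the Spark DataFrame instead of raising
# the RuntimeError shown above.
ray_dataset = ray.data.from_spark(dataframe)

# Basic sanity check that the Ray dataset materializes.
print(ray_dataset.count())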