DataFrame to Ray Dataset conversions taking a long time to execute

Set spark.task.resource.gpu.amount to 0, modify num_cpus_worker_node, or enable Spark cluster auto-scaling.

Written by Raghavan Vaidhyaraman

Last published at: April 1st, 2025

Problem

When converting an Apache Spark DataFrame to a Ray Dataset using Ray versions below 2.10, you notice long execution times, especially for small datasets.
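
For context, the slowdown typically appears during a conversion like the following. This is a minimal sketch, assuming a Databricks notebook where the built-in spark session is available and a Ray-on-Spark cluster has already been started.

  import ray.data

  # Small example DataFrame; even tiny inputs can take a long time to convert.
  df = spark.range(1000)

  # The conversion that runs slowly or appears to hang.
  ds = ray.data.from_spark(df)
  print(ds.count())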


Cause

On a GPU cluster, every Apache Spark task requests one GPU, but that GPU is already occupied by the Ray worker node. The Spark job hangs because the task resources can't be allocated.


Alternatively, the Ray cluster may be using all available vCPU resources on the workers, leaving no free vCPU cores for the Spark tasks.
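
To check whether this is happening, compare the resources Ray has reserved against what is still free. This is a diagnostic sketch, assuming it runs from a Databricks notebook after the Ray cluster has started.

  import ray

  # Connect to the running Ray-on-Spark cluster (no-op if already connected).
  ray.init(ignore_reinit_error=True)

  # If the CPU count in available_resources() is at or near zero, the Ray
  # workers have reserved every vCPU and Spark tasks have no cores left.
  print(ray.cluster_resources())
  print(ray.available_resources())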


Solution

Set spark.task.resource.gpu.amount to 0 in your cluster's Spark configuration. With this setting, no GPU resources are allocated to Spark tasks while the Ray cluster uses the GPU.

  1. In your workspace, navigate to Compute.
  2. Click your cluster to open the settings.
  3. Scroll down to Advanced options and click to expand.
  4. In the Spark tab, enter spark.task.resource.gpu.amount 0 in the Spark config box.
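
After the cluster restarts, you can confirm the setting is active from a notebook. This is a quick check, assuming the built-in spark session object.

  # Confirm that Spark tasks no longer request a GPU.
  print(spark.conf.get("spark.task.resource.gpu.amount"))  # expected output: 0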


Alternatively, update your Ray cluster configuration so that num_cpus_worker_node is less than the total number of CPUs available on each worker type. For example, if you have both num_worker_nodes and num_cpus_worker_node set to four, reduce num_cpus_worker_node to three so that one vCPU core on each worker stays free for Spark tasks.
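
For illustration, the following sketch reserves three CPUs per Ray worker on a worker type with four vCPUs, leaving one core free for Spark tasks. The values are placeholders; the parameter names follow the num_worker_nodes and num_cpus_worker_node settings referenced above for Ray versions below 2.10.

  from ray.util.spark import setup_ray_cluster

  # Placeholder values: four Ray worker nodes on workers with four vCPUs each.
  # Reserving three CPUs per Ray worker leaves one core free for Spark tasks.
  setup_ray_cluster(
      num_worker_nodes=4,
      num_cpus_worker_node=3,
  )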


Lastly, you can enable Spark cluster auto-scaling by setting autoscale=True in the setup_ray_cluster function. If you are configuring both the Databricks and Ray-on-Spark clusters to be auto-scalable for better resource management, refer to the Scale Ray clusters on Databricks (AWS | Azure | GCP) documentation.
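
As a minimal sketch of the auto-scaling variant, again with placeholder values, and assuming that num_worker_nodes acts as the upper bound on Ray worker nodes when autoscale=True:

  from ray.util.spark import setup_ray_cluster

  # Placeholder values: let the Ray-on-Spark cluster scale up to four worker
  # nodes as demand grows, still leaving one core per worker for Spark tasks.
  setup_ray_cluster(
      num_worker_nodes=4,
      num_cpus_worker_node=3,
      autoscale=True,
  )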