Problem
When converting a Spark DataFrame to a Ray Dataset using Ray versions below 2.10, you notice long execution times, especially for small datasets.
Cause
Every Apache Spark job requests one GPU per Spark task, but the GPU is already occupied by the Ray worker node. The Spark job therefore hangs, because its task resources can't be allocated.
Alternatively, the Ray cluster may be using all available vCPU resources on the workers, leaving no free vCPU cores for the Spark tasks.
Solution
Set the Spark configuration spark.task.resource.gpu.amount to 0 in your cluster’s configuration. With this setting, no GPU resources are allocated to Spark tasks while the Ray cluster uses the GPU.
- In your workspace, navigate to Compute.
- Click your cluster to open the settings.
- Scroll down to Advanced options and click to expand.
- In the Spark tab, enter spark.task.resource.gpu.amount 0 in the Spark config box. You can confirm the setting from a notebook, as shown in the sketch after this list.
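After the cluster restarts with the new configuration, you can do a quick sanity check from a notebook. The following is a minimal sketch; it assumes the spark SparkSession object that Databricks notebooks predefine.

```python
# Minimal sketch: confirm the cluster-level Spark config took effect.
# Assumes the `spark` SparkSession predefined in Databricks notebooks.
gpu_amount = spark.conf.get("spark.task.resource.gpu.amount", "not set")
print(f"spark.task.resource.gpu.amount = {gpu_amount}")  # expect "0"
```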
Alternatively, update your Ray cluster configuration so that num_cpus_worker_node is less than the total number of CPUs available on each worker type. For example, if both num_worker_nodes and num_cpus_worker_node are set to four, reduce num_cpus_worker_node to three.
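As a rough illustration, here is a hedged sketch of such a setup_ray_cluster call. The num_worker_nodes and num_cpus_worker_node names follow the parameters mentioned above; the exact signature depends on your Ray version, so adjust accordingly.

```python
# Minimal sketch: leave one vCPU per worker free for Spark tasks.
# Argument names follow the parameters mentioned above; verify them
# against the setup_ray_cluster signature of your Ray version.
from ray.util.spark import setup_ray_cluster

setup_ray_cluster(
    num_worker_nodes=4,      # four Ray worker nodes, as in the example above
    num_cpus_worker_node=3,  # reduced from four so one vCPU per worker stays free
)
```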
Lastly, you can enable cluster auto-scaling by setting autoscale=True in the setup_ray_cluster function. If you are configuring both the Databricks and Ray-on-Spark clusters to be auto-scalable for better resource management, refer to the Scale Ray clusters on Databricks (AWS | Azure | GCP) documentation.
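As a hedged sketch, the autoscale flag can be passed alongside the worker-node settings; again, confirm the exact arguments against the setup_ray_cluster documentation for your Ray version.

```python
# Minimal sketch: enable auto-scaling for the Ray-on-Spark cluster.
# The autoscale flag is the one referenced above; in versions that support it,
# num_worker_nodes then acts as the upper bound on Ray worker nodes.
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

setup_ray_cluster(
    num_worker_nodes=4,  # maximum number of Ray worker nodes
    autoscale=True,      # scale the Ray cluster up and down with demand
)

# ... run your Ray workload here ...

shutdown_ray_cluster()  # release resources back to Spark when finished
```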