Problem
When converting an Apache Spark DataFrame to a Ray Dataset using Ray versions below 2.10, you notice long execution times, especially for small datasets.
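For context, the slowdown typically appears on a conversion call like the following. This is a minimal sketch assuming a Databricks notebook where the spark session object is predefined and a Ray-on-Spark cluster is already running:

```python
import ray

# Even a small DataFrame can take a long time to convert, because the
# Spark job triggered by the conversion cannot get the task resources it requests.
spark_df = spark.range(1000)          # small example DataFrame
ds = ray.data.from_spark(spark_df)    # conversion that appears to hang or run slowly
```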
Cause
Every Apache Spark task requests one GPU, but the GPU is already occupied by the Ray worker node. The Spark job hangs because the Spark task resources can't be allocated.
Alternatively, the Ray cluster may be using all available vCPU resources on the workers, leaving no free vCPU cores for the Spark tasks.
Solution
Set the Spark configuration spark.task.resource.gpu.amount to 0 in your cluster’s configuration. This means no GPU resources will be allocated to the Spark tasks while the Ray cluster uses the GPU.
- In your workspace, navigate to Compute.
- Click your cluster to open the settings.
- Scroll down to Advanced options and click to expand.
- In the Spark tab, enter spark.task.resource.gpu.amount 0 in the Spark config box.
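After the cluster restarts, you can confirm the setting took effect. The following is a minimal sketch assuming a Databricks notebook where the spark session object is predefined:

```python
# Read back the task-level GPU request; it should print "0" once the
# cluster has restarted with the new Spark config.
print(spark.conf.get("spark.task.resource.gpu.amount"))
```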
Alternatively, update your Ray cluster configuration so that num_cpus_worker_node is less than the total number of CPUs available on each worker type. For example, if you have both num_worker_nodes and num_cpus_worker_node set to four, reduce num_cpus_worker_node to three.
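As a rough sketch, the change could look like the following, assuming the num_worker_nodes and num_cpus_worker_node parameter names referenced above match the setup_ray_cluster signature in your Ray version (the parameter names changed in later Ray releases):

```python
from ray.util.spark import setup_ray_cluster

# Leave at least one CPU core per worker free for Spark tasks.
# Here each worker VM is assumed to have four cores.
setup_ray_cluster(
    num_worker_nodes=4,
    num_cpus_worker_node=3,  # reduced from 4 so one core stays free for Spark
)
```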
Lastly, you can enable auto-scaling for the Ray-on-Spark cluster by setting autoscale=True in the setup_ray_cluster function. If you are configuring both the Databricks and Ray-on-Spark clusters to be auto-scalable for better resource management, refer to the Scale Ray clusters on Databricks (AWS | Azure | GCP) documentation.