Workflows are failing with a 'Could not reach driver of the cluster' error

Use a larger driver instance or increase the REPL timeout.

Written by kingshuk.das

Last published at: March 27th, 2025

Problem

You have jobs and/or workflows that are failing with the error message Could not reach driver of the cluster

 

Cause

The most common cause is frequent System.gc() pauses on the driver node, combined with high CPU and memory utilization. This leads to throttling and prevents the driver from responding within the allocated time. It is often triggered by running multiple jobs concurrently on a single cluster.

To verify high System.gc() pauses on the driver, review the logs.

  1. Click Compute.
  2. Click the name of your cluster.
  3. Click Driver logs.
  4. Click stdout to display the logs.

Review the garbage collection log lines to see the time taken. The following example log shows very high garbage collection times, confirming high System.gc() pauses as the cause.

[25229.348s][info][gc     ] GC(33) Pause Young (System.gc()) 20386M->896M(213022M) 11.509ms
[25229.535s][info][gc     ] GC(34) Pause Full (System.gc()) 896M->705M(213022M) 187.301ms
[27029.347s][info][gc     ] GC(35) Pause Young (System.gc()) 20334M->919M(213004M) 10.893ms
[27029.525s][info][gc     ] GC(36) Pause Full (System.gc()) 919M->707M(213004M) 177.894ms
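When the stdout log is long, scanning it by eye is tedious. The following is a minimal sketch (not part of the product) that parses lines in the unified JVM GC log format shown above and extracts every explicit System.gc() pause; the regex is written for exactly this layout and may need adjusting for other Databricks Runtime versions.

```python
import re

# Matches unified GC log lines such as:
# [25229.535s][info][gc     ] GC(34) Pause Full (System.gc()) 896M->705M(213022M) 187.301ms
GC_LINE = re.compile(
    r"\[(?P<ts>[\d.]+)s\].*GC\((?P<id>\d+)\) "
    r"Pause (?P<kind>Young|Full) \(System\.gc\(\)\).* "
    r"(?P<ms>[\d.]+)ms"
)

def system_gc_pauses(log_lines):
    """Return (gc_id, kind, pause_ms) for every System.gc() pause found."""
    pauses = []
    for line in log_lines:
        m = GC_LINE.search(line)
        if m:
            pauses.append((int(m.group("id")), m.group("kind"), float(m.group("ms"))))
    return pauses

sample = [
    "[25229.348s][info][gc     ] GC(33) Pause Young (System.gc()) 20386M->896M(213022M) 11.509ms",
    "[25229.535s][info][gc     ] GC(34) Pause Full (System.gc()) 896M->705M(213022M) 187.301ms",
]
for gc_id, kind, ms in system_gc_pauses(sample):
    print(f"GC({gc_id}) {kind}: {ms} ms")
```

Frequent Pause Full entries attributed to System.gc(), or pause times in the hundreds of milliseconds, point to the explicit-GC cause described above.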

You can verify whether high CPU and memory utilization is the cause by reviewing the utilization metrics in the compute metrics (AWS | Azure | GCP).

If you have high utilization, the graph shows CPU or memory usage above 85%.

Compute metrics CPU and memory utilization graph showing high usage.

Another common cause is a default REPL timeout that is too short for the specific workload, so the kernel fails to start within the allocated time.

 

Solution

If the root cause is high System.gc() pauses, high CPU utilization, or high memory utilization, use a larger driver instance to accommodate the increased resource requirements.

If the root cause is a too-short REPL timeout, you can increase it with a cluster-scoped init script (AWS | Azure | GCP).
 

Example init script

This sample code creates an init script that sets the REPL timeout to 150 seconds, providing more time for the kernel to start. It stores the init script as a workspace file.

Before running the sample code, replace <path-to-script> with the full path to the location in your workspace where you want to store the init script.

Info

Databricks Runtime 11.3 LTS and above is required to use init scripts stored as workspace files.

 
%python

# Write an init script (stored as a workspace file) that raises the REPL
# launch timeout to 150 seconds. The shebang must be on the first line of
# the generated script, so the string starts immediately after the quotes.
initScriptContent = """#!/bin/bash
cat > /databricks/common/conf/set_repl_timeout.conf << EOL
{
  databricks.daemon.driver.launchTimeout = 150
}
EOL
"""
dbutils.fs.put("/Workspace/<path-to-script>/set_repl_timeout.sh", initScriptContent, True)
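After the script is stored, attach it to the cluster, either in the cluster UI under Advanced options > Init scripts, or when creating or editing the cluster through the Clusters API. As a sketch (assuming the workspace-file init script support noted above), the relevant fragment of a cluster spec references the script by its workspace path; replace <path-to-script> with your actual location.

```python
import json

# Fragment of a cluster create/edit payload, not a complete cluster spec.
cluster_spec_fragment = {
    "init_scripts": [
        {
            "workspace": {
                "destination": "/Workspace/<path-to-script>/set_repl_timeout.sh"
            }
        }
    ]
}
print(json.dumps(cluster_spec_fragment, indent=2))
```

The cluster must be restarted for the init script, and therefore the new REPL timeout, to take effect.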

 

Best practices

  • Avoid running multiple jobs concurrently on a single cluster.
  • Regularly monitor CPU, memory, and disk usage metrics to ensure that your clusters have sufficient resources to handle the workload. Adjust cluster configurations or scale up as needed.
  • Choose driver and worker instance types that match your workload requirements. Consider using larger instances for resource-intensive workloads.