Problem
You have jobs or workflows that fail with the error message Could not reach driver of the cluster.
Cause
The most common cause is high System.gc() pauses on the driver node, combined with high CPU and memory utilization. This leads to throttling and prevents the driver from responding within the allocated time. Running multiple jobs concurrently on a single cluster can cause this.
To verify high System.gc() pauses on the driver, review the logs.
- Click Compute.
- Click the name of your cluster.
- Click Driver logs.
- Click stdout to display the logs.
Review the garbage collection log lines to see the time taken. The following example log shows very high garbage collection times, confirming high System.gc() pauses as the cause.
[25229.348s][info][gc ] GC(33) Pause Young (System.gc()) 20386M->896M(213022M) 11.509ms
[25229.535s][info][gc ] GC(34) Pause Full (System.gc()) 896M->705M(213022M) 187.301ms
[27029.347s][info][gc ] GC(35) Pause Young (System.gc()) 20334M->919M(213004M) 10.893ms
[27029.525s][info][gc ] GC(36) Pause Full (System.gc()) 919M->707M(213004M) 177.894ms
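If you have downloaded the driver log, you can also tally the System.gc() pause times with a short script instead of scanning the lines by eye. This is a minimal sketch, assuming the JDK unified GC log format shown above; the sample lines are the ones from this article.

```python
import re

# Sample driver log lines (from the stdout log shown above).
log_lines = [
    "[25229.348s][info][gc ] GC(33) Pause Young (System.gc()) 20386M->896M(213022M) 11.509ms",
    "[25229.535s][info][gc ] GC(34) Pause Full (System.gc()) 896M->705M(213022M) 187.301ms",
    "[27029.347s][info][gc ] GC(35) Pause Young (System.gc()) 20334M->919M(213004M) 10.893ms",
    "[27029.525s][info][gc ] GC(36) Pause Full (System.gc()) 919M->707M(213004M) 177.894ms",
]

# Match explicit System.gc() pauses and capture the pause type and duration in ms.
pattern = re.compile(r"Pause (Young|Full) \(System\.gc\(\)\).*?([\d.]+)ms")

pauses = []
for line in log_lines:
    m = pattern.search(line)
    if m:
        pauses.append((m.group(1), float(m.group(2))))

total_ms = sum(ms for _, ms in pauses)
full_ms = sum(ms for kind, ms in pauses if kind == "Full")
print(f"System.gc() pauses: {len(pauses)}, total {total_ms:.1f} ms (Full GC: {full_ms:.1f} ms)")
```

In practice you would read the lines from the downloaded stdout file rather than a hard-coded list; frequent or long Full GC pauses triggered by System.gc() point to the cause described above.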
You can verify whether high CPU and memory utilization is the cause by reviewing the utilization metrics in the compute metrics (AWS | Azure | GCP). If utilization is high, the graph shows CPU or memory usage above 85%.

Another common cause is when the default REPL timeout is too short for the specific workload, causing the kernel to fail to start within the allocated time.
Solution
If the root cause is high System.gc() pauses, high CPU utilization, or high memory utilization, use a larger driver instance to accommodate the increased resource requirements.
If the root cause is a too-short REPL timeout, you can increase it with a cluster-scoped init script (AWS | Azure | GCP).
Example init script
This sample code creates an init script that sets the REPL timeout to 150 seconds, providing more time for the kernel to start. It stores the init script as a workspace file.
Before running the sample code, replace <path-to-script> with the full path to the location in your workspace where you want to store the init script.
Info
Databricks Runtime 11.3 LTS and above is required to use init scripts stored as workspace files.
%python
# The shebang must be the first line of the script, so the string
# starts immediately after the opening quotes.
initScriptContent = """#!/bin/bash
cat > /databricks/common/conf/set_repl_timeout.conf << EOL
{
  databricks.daemon.driver.launchTimeout = 150
}
EOL
"""
# Store the init script as a workspace file, overwriting any existing copy.
dbutils.fs.put("/Workspace/<path-to-script>/set_repl_timeout.sh", initScriptContent, True)
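After storing the script, attach it to the cluster as a cluster-scoped init script, either in the cluster UI or via the Clusters API, and then restart the cluster. The following is a minimal sketch of the relevant API payload fragment; the cluster ID is a hypothetical placeholder, and the init_scripts/workspace/destination shape follows the Databricks Clusters API.

```python
import json

# Hypothetical cluster ID -- replace with your own.
cluster_id = "1234-567890-abcde123"
# Use the same workspace path you chose when storing the init script.
script_path = "/Workspace/<path-to-script>/set_repl_timeout.sh"

# Partial clusters/edit payload that attaches the workspace-file init script.
payload = {
    "cluster_id": cluster_id,
    "init_scripts": [
        {"workspace": {"destination": script_path}}
    ],
}
print(json.dumps(payload, indent=2))
# To apply, merge this with the cluster's existing configuration (clusters/edit
# expects the full cluster spec) and send it to POST /api/2.0/clusters/edit,
# then restart the cluster so the init script runs.
```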
Best practices
- Avoid running multiple jobs concurrently on a single cluster.
- Regularly monitor CPU, memory, and disk usage metrics to ensure that your clusters have sufficient resources to handle the workload. Adjust cluster configurations or scale up as needed.
- Choose driver and worker instance types that match your workload requirements. Consider using larger instances for resource-intensive workloads.