Problem
When you enable dynamic allocation by setting spark.dynamicAllocation.enabled to true, you experience unexpected NODES_LOST scenarios.
In your event logs you see the following message.
EVENT TYPE : NODES LOST
MESSAGE : Compute lost at least one node. Reason: Communication lost
In your backend cluster logs, you see the following error message.
TerminateInstances worker_env_id: "workerenv-XXXXXXXXXXX"
instance_ids: "i-XXXXXX"
instance_termination_reason_code: LOST_EXECUTOR_DETECTED
Cause
Dynamic allocation is an Apache Spark feature designed for YARN-managed clusters and is not supported in Databricks environments. Databricks clusters are instead managed by Databricks autoscaling. When Spark's dynamic allocation logic releases idle executors, Databricks detects the missing executors (LOST_EXECUTOR_DETECTED) and terminates the affected nodes, which surfaces as NODES_LOST events.
Solution
Enable Autoscaling when you create a Databricks cluster. For more information, review the “Enable autoscaling” section of the Compute configuration reference (AWS | Azure | GCP) documentation.
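As a sketch, autoscaling can be enabled by supplying an autoscale range in the cluster definition (for example, in a Clusters API request body) instead of setting spark.dynamicAllocation.enabled. The cluster name, Spark version, node type, and worker counts below are illustrative placeholders; adjust them for your workspace and workload:

```
{
  "cluster_name": "autoscaling-example",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "spark_conf": {}
}
```

With this configuration, Databricks adds and removes workers between min_workers and max_workers based on load. Make sure spark.dynamicAllocation.enabled is not set in spark_conf, or the NODES_LOST behavior can recur.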