Problem
You have jobs in Databricks that run longer than expected. When you check the event log, you see a Metastore_Down event. This happens when you use Hive or an external metastore such as AWS Glue.
When you analyze the thread dump, you find threads stuck at delta-catalog-update.
Sample thread
"delta-catalog-update-8" #518 daemon prio=5 os_prio=0 tid=xxx nid=xxx waiting on condition [xxx]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <xxx> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)
at org.spark_project.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:590)
at org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:432)
at org.spark_project.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:349)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool.super$borrowObject(LocalHiveClientImpl.scala:124)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool.$anonfun$borrowObject$1(LocalHiveClientImpl.scala:124)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool$$Lambda$5460/xxx.apply(Unknown Source)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:394)
at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
at org.apache.spark.sql.hive.client.LocalHiveClientsPool.borrowObject(LocalHiveClientImpl.scala:122)
at org.apache.spark.sql.hive.client.PoolingHiveClient.retain(PoolingHiveClient.scala:181)
at org.apache.spark.sql.hive.HiveExternalCatalog.maybeSynchronized(HiveExternalCatalog.scala:110)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$withClient$1(HiveExternalCatalog.scala:150)
at org.apache.spark.sql.hive.HiveExternalCatalog$$Lambda$5186/xxx.apply(Unknown Source)
at com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:394)
at com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:34)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:149)
at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:1027)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:154)
at org.apache.spark.sql.catalyst.catalog.SessionCatalogImpl.tableExists(SessionCatalog.scala:936)
at com.databricks.sql.managedcatalog.ManagedCatalogSessionCatalog.tableExists(ManagedCatalogSessionCatalog.scala:763)
at com.databricks.sql.transaction.tahoe.hooks.UpdateCatalog.tableStillExists$1(UpdateCatalog.scala:112)
Cause
This happens when catalog update operations saturate the Hive client thread pool. The Delta catalog update threads can exhaust all Hive client connections, which blocks other query operations and results in hanging jobs. This usually occurs when table metadata in the catalog is updated, for example through the ALTER TABLE command.
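As a hypothetical illustration (the schema and table names below are placeholders), a burst of metadata changes like the following schedules many asynchronous catalog update tasks; when the metastore responds slowly, those tasks queue up and exhaust the pool.

# Hypothetical sketch: each ALTER TABLE commits a metadata change to a Delta
# table, which schedules an asynchronous catalog update. Each update task
# must borrow a Hive client from the shared pool, so a slow metastore
# lets the tasks pile up.
for i in range(200):
    spark.sql(
        f"ALTER TABLE my_schema.events_{i} "
        "SET TBLPROPERTIES ('pipeline.updated' = 'true')"
    )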
Solution
There are three options to try, depending on your situation.
Run the VACUUM command
- Check whether the table has a large number of files.
- Periodically run the VACUUM command on Delta tables to remove stale and unreferenced files, which can help reduce the load on the metastore (see the example after this list).
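As a sketch, you can check the table's file count with DESCRIBE DETAIL and then run VACUUM. The three-level table name is a placeholder, and 168 hours (7 days) is the default Delta retention window.

# Inspect the numFiles column to see how many files back the table.
display(spark.sql("DESCRIBE DETAIL main.my_schema.my_table"))

# Remove stale files no longer referenced by the Delta log and older
# than the retention window.
spark.sql("VACUUM main.my_schema.my_table RETAIN 168 HOURS")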
Adjust catalog update thread pool size
In Databricks Runtime 14.3 LTS and above, you can control the size of the thread pool used to update the catalog. To set this configuration, adjust spark.databricks.delta.catalog.update.threadPoolSize to a value less than the default of 20.
spark.databricks.delta.catalog.update.threadPoolSize <value-less-than-20>
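For example, to halve the pool from a notebook (a sketch; this assumes the property is honored at session scope, and the safe option is to set it in the cluster's Spark config as shown above):

# Assumption: shown at session scope for illustration; if the setting is
# only read at cluster startup, put it in the cluster's Spark config instead.
spark.conf.set("spark.databricks.delta.catalog.update.threadPoolSize", "10")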
Disable Delta catalog update
If you’re using a read-only metastore database, Databricks recommends setting the following configuration on your clusters. This configuration controls whether Databricks syncs a Delta table’s most recent schema and table properties to the Hive metastore (or any external catalog) so that both stay the same.
spark.databricks.delta.catalog.update.enabled false
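A minimal sketch of the same setting from a notebook, assuming the flag is also respected at session scope; otherwise, set it in the cluster's Spark config as shown above.

# Stop Delta commits from pushing schema and table property changes
# to the Hive metastore or other external catalog.
spark.conf.set("spark.databricks.delta.catalog.update.enabled", "false")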
Important
If other systems access your external metastore for this table’s schema or table properties, do not use this option. Keep spark.databricks.delta.catalog.update.enabled set to true to ensure they stay in sync.