Problem
You have a job triggered through Apache Airflow using the DatabricksRunNowOperator that is canceled after running for X hours (where X is the timeout value set through Airflow), even though the job is not complete.
Cause
The Airflow DatabricksRunNowOperator uses the X-hour timeout configuration to determine how long the job is allowed to run and when to stop it, regardless of whether the job is complete.
If you have access to audit logs, you can see the cancellation request is sent by the Airflow operator, confirming the issue lies in the Airflow configuration rather than the Databricks job settings.
Audit logs snippet
{"version":"2.0","auditLevel":"WORKSPACE_LEVEL","timestamp":1746505827573,"orgId":"<org-id>","shardName":"<shard-name>","accountId":"xxxxxxxxxxxxxxxx","sourceIPAddress":"<source-ip-address>","userAgent":"databricks-airflow/6.7.0 _/0.0.0 python/3.11.11 os/linux airflow/2.9.3+astro.11 operator/DatabricksRunNowOperator","sessionId":null,"userIdentity":{"email":"<email>","subjectName":null},"principal":{"resourceName":"accounts/xxxxxxxxxxxxxxxx"/users/<user>","uniqueName":"<email>","contextId":"<context-id>","displayName":"Data Engineering"},"authorizeAs":{"resourceName":"accounts/xxxxxxxxxxxxxxxx"/users/<user>","uniqueName":"<email>","displayName":"Data Engineering","activatingResourceName":null},"serviceName":"jobs","actionName":"cancel","requestId":"<request-id>","requestParams":{"run_id":"<run-id>"},"response":{"statusCode":200,"errorMessage":null,"result":"{}"}}
Note
If you do not have audit logs configured for your workspace and you are on a premium plan or above, you can follow the instructions in the Audit log reference (AWS | Azure | GCP) documentation to configure them.
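For illustration, the following is a minimal sketch of a DAG that produces this behavior, assuming the timeout was set through Airflow's task-level execution_timeout parameter. The DAG name, connection ID, job ID, and four-hour value are placeholders.
Example DAG snippet
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="run_databricks_job",  # placeholder DAG name
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
):
    run_job = DatabricksRunNowOperator(
        task_id="run_job",
        databricks_conn_id="databricks_default",  # placeholder connection ID
        job_id=123456,                            # placeholder Databricks job ID
        # When the task exceeds this limit, Airflow kills it and the operator
        # cancels the Databricks run, producing the jobs/cancel action recorded
        # in the audit log snippet above.
        execution_timeout=timedelta(hours=4),
    )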
Solution
Increase the job timeout threshold on the Airflow side.
- Review the Airflow Directed Acyclic Graph (DAG) that triggers the Databricks job. Look for the DatabricksRunNowOperator task and check its configuration.
- Adjust the parameter in DatabricksRunNowOperator that controls the timeout to a value longer than the job's expected runtime (beyond four hours in this example), as shown in the sketch after this list.
- Update your Airflow DAG with the adjusted timeout parameter and deploy the changes.
- After updating the DAG, trigger a new run and monitor the job to ensure it runs beyond the previous four-hour limit without being terminated.
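For example, here is the same sketch with the timeout raised, again assuming execution_timeout is the parameter in use; the eight-hour value is a placeholder and should comfortably exceed your job's expected runtime. If your DAG also sets dagrun_timeout, make sure it is not shorter than the task-level limit.
Adjusted DAG snippet
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="run_databricks_job",  # placeholder DAG name
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
):
    run_job = DatabricksRunNowOperator(
        task_id="run_job",
        databricks_conn_id="databricks_default",  # placeholder connection ID
        job_id=123456,                            # placeholder Databricks job ID
        # Raised above the job's expected runtime so Airflow no longer cancels
        # the run prematurely; omit execution_timeout entirely if you do not
        # want Airflow to enforce any limit on this task.
        execution_timeout=timedelta(hours=8),
    )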
For more information, review the Airflow Tasks documentation.