Problem
When running an Apache Spark PySpark job on serverless compute with serverless environment version 1, you try to perform a Delta Lake MERGE operation using the withSchemaEvolution method and receive the following error.
Error message: AttributeError: 'DeltaMergeBuilder' object has no attribute 'withSchemaEvolution'
This issue occurs even though you are using a Databricks Runtime version that supports schema evolution (15.4 LTS or above).
Cause
Certain Apache Spark configurations, including the one required for schema evolution (spark.databricks.delta.schema.autoMerge.enabled), are not supported in serverless environment version 1. As a result, the withSchemaEvolution method, which relies on this configuration, is also not supported.
For more information, refer to the Serverless compute limitations (AWS | Azure | GCP) documentation.
To review the Spark configs supported in serverless, refer to the “Configure Spark properties for serverless notebooks and jobs” section of the Set Spark configuration properties on Databricks (AWS | Azure | GCP) documentation.
For more information about schema evolution, review the “Schema evolution syntax for merge” section of the Update Delta Lake table schema (AWS | Azure | GCP) documentation.
Solution
To resolve this issue, switch to job compute or all-purpose compute instead of serverless, use SQL to perform the MERGE operation with schema evolution, or use serverless environment version 2 or above.
Use job compute or all-purpose compute
Instead of using serverless compute, switch to a job cluster or an all-purpose cluster running Databricks Runtime 15.4 LTS or above, where the withSchemaEvolution method is supported.
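On such a cluster, the merge might look like the following sketch. This is an illustration, not a definitive implementation: the table name, source DataFrame, and join column (`id`) are assumptions you would replace with your own.

```python
# Sketch of a Delta merge with schema evolution using the builder API,
# for job or all-purpose compute on DBR 15.4 LTS or above.
# Table and column names here are hypothetical examples.
try:
    from delta.tables import DeltaTable
except ImportError:  # delta-spark is not installed outside Databricks
    DeltaTable = None

def merge_with_evolution(spark, target_table, source_df):
    """Upsert source_df into target_table, evolving the target schema."""
    if DeltaTable is None:
        raise RuntimeError("delta-spark is required for the builder API")
    target = DeltaTable.forName(spark, target_table)
    (
        target.alias("t")
        .merge(source_df.alias("s"), "s.id = t.id")
        .withSchemaEvolution()  # supported on classic compute, DBR 15.4 LTS+
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```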
This involves changing the compute configuration for your Databricks job. For more information, refer to the Compute configuration reference (AWS | Azure | GCP) documentation.
Use SQL to perform the MERGE operation with schema evolution
Alternatively, use SQL to perform the MERGE operation with schema evolution. You can execute the operation directly in a SQL cell or in PySpark using spark.sql(<query>).
Direct SQL example
%sql
MERGE WITH SCHEMA EVOLUTION INTO <target-table-name> t
USING source s
ON s.id = t.id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
SQL in PySpark example
spark.sql("""
MERGE WITH SCHEMA EVOLUTION INTO <target-table-name> t
USING source s
ON s.id = t.id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
""")
Use serverless environment version 2 or above
Update your job to use serverless environment version 2 or above, where this limitation does not apply. For more information, refer to the Serverless environment versions (AWS | Azure | GCP) documentation.