Duplicate columns in the metadata error

A Spark job fails while processing a Delta table with an org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the metadata error.

Written by vikas.yadav

Last published at: May 23rd, 2022

Problem

Your Apache Spark job fails with the following error message while processing a Delta table:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the metadata update: col1, col2...

Cause

The Delta table contains duplicate column names. Column names that differ only by case are considered duplicates.

Delta Lake is case preserving, but case insensitive, when storing a schema.

Parquet is case sensitive when storing and returning column information.

Spark can be case sensitive, but it is case insensitive by default.

In order to avoid potential data corruption or data loss, duplicate column names are not allowed.
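As an illustration, the following minimal PySpark sketch (the column names and table path are hypothetical) shows how the error arises: the DataFrame itself can hold two columns whose names differ only by case, but the Delta write fails because the names collide under case-insensitive comparison.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two column names that differ only by case. Parquet could store both,
# but Delta compares schema names case-insensitively, so the write below
# fails with:
# org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the metadata ...
df = spark.createDataFrame([(1, "a")], ["col1", "COL1"])

# Hypothetical table path used only for this demonstration.
df.write.format("delta").mode("overwrite").save("/tmp/duplicate-columns-demo")
```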

Solution

Delta tables must not contain duplicate column names.

Rename or drop columns as needed so that every column name is unique, keeping in mind that names differing only by case count as duplicates.
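A minimal sketch of this fix, continuing the hypothetical example above, renames the colliding column before writing so the schema no longer contains case-insensitive duplicates.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a")], ["col1", "COL1"])

# Rename the second column so the two names no longer collide when
# compared case-insensitively, then write to Delta as usual.
deduped = df.toDF("col1", "col1_alt")

deduped.write.format("delta").mode("overwrite").save("/tmp/duplicate-columns-demo")
```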