Apache Spark job fails with Parquet column cannot be converted error

A Parquet column cannot be converted error appears when you read decimal data in Parquet format and write it to a Delta table.

Written by shanmugavel.chandrakasu

Last published at: May 20th, 2022

Problem

You are reading data in Parquet format and writing to a Delta table when you get a Parquet column cannot be converted error message.

The cluster is running Databricks Runtime 7.3 LTS or above.

org.apache.spark.SparkException: Task failed while writing rows.
Caused by: com.databricks.sql.io.FileReadException: Error while reading file s3://bucket-name/landing/edw/xxx/part-xxxx-tid-c00.snappy.parquet. Parquet column cannot be converted. Column: [Col1], Expected: DecimalType(10,0), Found: FIXED_LEN_BYTE_ARRAY

Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException

Cause

The vectorized Parquet reader is decoding the decimal type column to a binary format.

The vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and above for reading datasets in Parquet files. The read schema uses atomic data types: binary, boolean, date, string, and timestamp.


Info

This error only occurs if you have decimal type columns in the source data.

Solution

If you have decimal type columns in your source data, you should disable the vectorized Parquet reader.
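
If you are not sure whether the source contains decimal columns, a quick check like the following lists them. This is a minimal sketch; the source path is a placeholder, so substitute your own location.

%scala

import org.apache.spark.sql.types.DecimalType

// Placeholder path; list any decimal columns in the source Parquet schema
val decimalCols = spark.read.parquet("s3://bucket-name/landing/edw/xxx/")
  .schema.fields
  .filter(_.dataType.isInstanceOf[DecimalType])
  .map(_.name)

println(decimalCols.mkString(", "))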

Set spark.sql.parquet.enableVectorizedReader to false in the cluster’s Spark configuration to disable the vectorized Parquet reader at the cluster level.
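
For example, you could add a line like the following to the Spark config field in the cluster settings (shown here as a sketch of the standard key-value format):

spark.sql.parquet.enableVectorizedReader false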

You can also disable the vectorized Parquet reader at the notebook level by running:

%scala

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
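
As a minimal sketch, assuming a hypothetical source path and target table name, the end-to-end pattern could look like this:

%scala

// Disable the vectorized Parquet reader for this session only
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Hypothetical source path and Delta table name
val df = spark.read.parquet("s3://bucket-name/landing/edw/xxx/")
df.write.format("delta").mode("append").saveAsTable("edw.target_table")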

Info

The vectorized Parquet reader enables native record-level filtering using push-down filters and improves memory locality and cache utilization. Disabling it may result in a minor performance impact. You should only disable it if you have decimal type columns in your source data.