Problem
When trying to read a Parquet file that contains a timestamp column stored as INT64 with nanosecond precision using Databricks Runtime 11.3 LTS or above, you encounter an illegal Parquet type exception.
The stack trace shows the following output.
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,false))
at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1328)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:178)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertPrimitiveField$2(ParquetSchemaConverter.scala:247)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:196)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:87)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:1040)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:1040)
Cause
Databricks Runtime 11.3 LTS and above, like open source Apache Spark, does not support the TIMESTAMP_NANOS Parquet type. If a Parquet file contains fields of this type, any attempt to read it fails with an Illegal Parquet type exception. Schema inference also fails, since Spark cannot interpret the unsupported timestamp type.
Solution
Explicitly provide a schema to the Spark reader in which each TIMESTAMP_NANOS column is declared as LongType.
1. Import the necessary Spark SQL types.
from pyspark.sql.types import StructType, StructField, LongType, StringType
2. Define the schema for the Parquet file.
schema = StructType([
    StructField("timestamp_nanos", LongType(), True),
    StructField("value", StringType(), True)
])
3. Read the Parquet file using the specified schema.
parquet_path = "/path/to/your/parquet/file.parquet"
try:
    df = spark.read.schema(schema).parquet(parquet_path)
    df.show()
except Exception as e:
    print(f"Error reading Parquet file: {e}")
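After the read succeeds, the timestamp_nanos column holds raw nanoseconds since the Unix epoch as long integers, which you will usually want to convert to real timestamps downstream (for example, by dividing by 1e9 and casting to a timestamp in Spark, which truncates to microsecond precision). The conversion semantics can be sketched in plain Python; the function name and the sample value below are illustrative, not part of the original article.

```python
from datetime import datetime, timedelta, timezone

def ns_to_datetime(ns: int) -> datetime:
    """Convert an epoch timestamp in nanoseconds to a UTC datetime.

    Python datetimes carry microsecond precision, so the last three
    digits of the nanosecond value are truncated.
    """
    seconds, remainder_ns = divmod(ns, 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(
        microseconds=remainder_ns // 1_000
    )

# 2021-01-01 00:00:00.123456789 UTC expressed as nanoseconds since the epoch
example_ns = 1_609_459_200_123_456_789
print(ns_to_datetime(example_ns))  # 2021-01-01 00:00:00.123456+00:00
```

Note that any precision finer than microseconds is lost in this conversion; if the trailing nanosecond digits matter, keep the original LongType column alongside the derived timestamp.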