Problem
You need to change the compression codec for writing data to improve storage efficiency. Rewriting the entire table is impractical, but you are concerned that switching may corrupt existing tables if you mix snappy
-compressed files and zstd
-compressed files in a Delta table.
Cause
Databricks allows different codecs, but you must ensure data integrity and readability aren't affected by the transition.
Solution
Follow the list of steps to test your compression type, generate and insert sample records using zstd
, then write the zstd
files to your Delta table. If you already have a Delta table, you can skip to step 4.
- Read 10 numbers into a DataFrame.
df = spark.range(0, 10).toDF("numbers")
display(df)
- Persist the DataFrame records as a Delta table.
df.write.format('delta').mode("append").saveAsTable("<catalog-name>.<schema-name>.<table-name>")
- Describe detail using
%sql describe detail catalog_name.schema_name.table_name
to get the location of your Delta table, and run the%fs ls
command on the table location to see the list of files. All files in the list are written by default in.snappy.parquet
compression codec file format.
%fs ls /<path-to-your-location-folder>/
- To switch the compression type from the default
snappy
tozstd
, generate sample records and insert them into the table using either incremental updates or append.
df_zstd = spark.range(11, 20).toDF("numbers")
display(df_zstd)
- Write this new data to the existing Delta table by changing the compression type to
zstd
using incremental updates or append.
df_zstd.write.format('delta').option("compression","zstd").mode("append").saveAsTable("<catalog-name>.<schema-name>.<table-name>")
- Run the
%fs ls
command on the table location again to see the list of.snappy.parquet
and newly changed.zstd.parquet
files, which co-exist in the table location.
%fs ls /<path-to-your-location-folder>/
- Review the table to verify the records by running a select statement. Ensure you can read all the data without corrupting the table.
%sql select * <catalog-name>.<schema-name>.<table-name>;