Sort failed after writing partitioned data to parquet using PySpark on Databricks Runtime 13.3 LTS

Set an Apache Spark configuration to preserve the sort order when writing partitioned data to Parquet.

Written by mounika.tarigopula

Last published at: September 9th, 2024

Problem 

In Databricks Runtime 13.3 LTS through 15.3, when you use sortWithinPartitions to order the rows within each partition by specific columns, the sorted DataFrame displays correctly, but the sort order is lost after you save the data and read it back.
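For illustration, the following minimal sketch reproduces the pattern; the sample data, column names (category, value), and output path are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with a partition column and an unsorted value column
df = spark.createDataFrame(
    [("a", 3), ("a", 1), ("b", 4), ("b", 2)],
    ["category", "value"],
)

# Order the rows within each partition; the displayed output looks correct
sorted_df = df.sortWithinPartitions("category", "value")
sorted_df.show()

# On affected runtimes, the within-partition order is not preserved
# after writing partitioned Parquet and reading it back
sorted_df.write.partitionBy("category").mode("overwrite").parquet("/tmp/sorted_output")
spark.read.parquet("/tmp/sorted_output").show()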

Cause 

There is a bug in which the planned write's local sort is placed after the sortWithinPartitions local sort, and the EliminateSorts optimizer rule then drops the first sort as unnecessary. The bug exists with or without Photon.
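You can check whether the planned write optimization is active on your cluster by reading its configuration value; this is a sketch, and conf.get returns the runtime default when the value has not been set explicitly:

# Defaults to "true" on the affected runtimes, enabling the planned write sort
print(spark.conf.get("spark.sql.optimizer.plannedWrite.enabled"))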


This issue is fixed in Databricks Runtime 15.4 LTS.  

Solution

As a workaround, set the following Apache Spark configuration.


spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", "false")
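In context, the workaround is applied before the write; this sketch reuses the hypothetical DataFrame and path from the example above:

# Disable the planned write optimization so EliminateSorts does not
# drop the sortWithinPartitions sort
spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", "false")

sorted_df.write.partitionBy("category").mode("overwrite").parquet("/tmp/sorted_output")

# Optionally restore the default for subsequent writes
spark.conf.set("spark.sql.optimizer.plannedWrite.enabled", "true")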

