ETL process fails to process a column and throws error Row group size has overflowed

Reduce the default row group size and increase the frequency of size checks.

Written by Raphael Freixo

Last published at: January 25th, 2025

Problem

After upgrading to Databricks Runtime 14.3 LTS, you try to process a column containing a large amount of unparsed data. The extract, transform, load (ETL) process that handles large data batches fails with the error: Row group size has overflowed.


Cause

The row group size exceeds the maximum allowable limit for Parquet files. Databricks Runtime 14.3 LTS and above perform stricter and more frequent row group size checks, which causes this error to surface.
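To gauge how large the row groups in your existing Parquet output actually are, you can inspect the file footer metadata. The following is a minimal Python sketch using pyarrow; the file path is a hypothetical placeholder for one Parquet part file reachable through the /dbfs mount.

import pyarrow.parquet as pq

# Read only the footer metadata; no row data is loaded.
meta = pq.ParquetFile("/dbfs/tmp/example/part-00000.parquet").metadata
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")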


Solution

  1. Navigate to your cluster.
  2. Click Advanced options.
  3. In the Spark config box under the Spark tab, add the following configuration settings to reduce the default row group size and increase the size check frequency. (If you prefer to set these from a notebook for a single session, see the sketch after this list.)
  • spark.hadoop.parquet.page.size.row.check.max 5
  • spark.hadoop.parquet.block.size.row.check.max 5
  • spark.hadoop.parquet.page.size.row.check.min 5
  • spark.hadoop.parquet.block.size.row.check.min 5
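If editing the cluster configuration isn't practical, the same keys can be applied for a single session from a notebook. The following is a minimal Python sketch, assuming the write runs in the same session; it reaches the Hadoop configuration through PySpark's internal _jsc handle, and df and the output path are hypothetical placeholders.

# Apply the Parquet row-count-check settings for this session only.
# Note: the "spark.hadoop." prefix belongs in the cluster Spark config box;
# when setting the Hadoop configuration directly, use the bare Parquet keys.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
for key in (
    "parquet.page.size.row.check.max",
    "parquet.block.size.row.check.max",
    "parquet.page.size.row.check.min",
    "parquet.block.size.row.check.min",
):
    hadoop_conf.set(key, "5")  # lower both check thresholds to 5 rows

# Subsequent Parquet writes in this session should pick up the lowered
# thresholds. df and the path below are hypothetical.
df.write.format("parquet").mode("overwrite").save("/tmp/example-output")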


These settings lower the row-count thresholds for the size checks from the default of 10 to 5, ensuring that large rows are detected and handled before they cause a row group overflow. If the issue persists, lower the values further.
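To confirm the values are in effect after the cluster restarts, you can read them back from the Hadoop configuration. A minimal sketch, again using PySpark's internal _jsc handle; keys set through the spark.hadoop. prefix in the cluster Spark config should appear here without the prefix.

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
for key in (
    "parquet.page.size.row.check.max",
    "parquet.block.size.row.check.max",
    "parquet.page.size.row.check.min",
    "parquet.block.size.row.check.min",
):
    # Prints None if the key was not applied.
    print(key, "=", hadoop_conf.get(key))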