Using percentile() on large datasets causes memory issues or OOM errors

Use approx_percentile() instead.

Written by Raphael Freixo

Last published at: January 25th, 2025

Problem

When you use the percentile() aggregate expression in PySpark or Photon on large datasets, or on datasets with many distinct values, you notice severe memory issues, including out-of-memory (OOM) errors.
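For illustration, a query of this shape can trigger the error. This is a hypothetical sketch: the table and column names (main.default.events, customer_id, latency_ms) are placeholders, and spark is the SparkSession provided in Databricks notebooks.

from pyspark.sql import functions as F

# Exact percentiles per group: each group's values are buffered and sorted
# in memory, which can exhaust task memory on large or high-cardinality data.
df = spark.table("main.default.events")  # hypothetical table
p95 = df.groupBy("customer_id").agg(
    F.expr("percentile(latency_ms, 0.95)").alias("p95_latency")
)
p95.show()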


Example stack trace

Photon failed to reserve 43.3 MiB for percentile-merge-expr memory pool, in task.
Memory usage:
Total task memory (including non-Photon): 6.6 GiB
task: allocated 6.6 GiB, tracked 6.6 GiB, untracked allocated 3.2 GiB, peak 6.6 GiB
BufferPool: allocated 128.0 MiB, tracked 128.0 MiB, untracked allocated 3.2 GiB, peak 128.0 MiB
DataWriter: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
ShuffleExchangeSourceNode(id=XXX, output_schema=[string, string, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, ... 105 more]): allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
GroupingAggNode(id=XXX, output_schema=[string, string, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, ... 105 more]): allocated 6.0 MiB, tracked 6.0 MiB, untracked allocated 0.0 B, peak 5.7 GiB
GroupingAggregation(recursion_depth=0): allocated 6.0 MiB, tracked 6.0 MiB, untracked allocated 0.0 B, peak 5.7 GiB
GC'd var-len aggregates: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 3.3 GiB


Cause

The percentile() function causes memory issues in three ways (illustrated in the sketch after this list).

  1. Both Apache Spark and Photon buffer the entire dataset into memory as a single, large array of values. This can quickly exhaust memory resources, especially with high-cardinality datasets.
  2. Spilling to disk is difficult with this approach because splitting and spilling a single large array of values is complex and inefficient.
  3. Sorting the buffered values adds further memory and computational overhead.
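To make these points concrete, here is a conceptual sketch in plain Python. It is not Spark or Photon source code; it only shows the buffer-then-sort pattern (nearest-rank, without the interpolation the real function performs).

# Conceptual sketch only, not the Spark/Photon implementation.
def exact_percentile(values, p):
    buffered = list(values)  # the whole group is buffered as one array: O(n) memory
    buffered.sort()          # sorting adds further CPU and memory overhead
    k = int(round(p * (len(buffered) - 1)))
    return buffered[k]       # nearest-rank result, for simplicity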


Solution

Use approx_percentile() instead. The approx_percentile() function is an approximate alternative to percentile() with a fixed and predictable memory footprint. It avoids buffering the entire dataset in memory and is suitable for large-scale datasets where exact percentile calculation may be unnecessary.
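The following is a minimal sketch of the replacement, reusing the hypothetical names from the Problem example. The optional third argument to approx_percentile() controls accuracy (default 10000); larger values improve precision at the cost of more, but still bounded, memory.

from pyspark.sql import functions as F

df = spark.table("main.default.events")  # hypothetical table
p95 = df.groupBy("customer_id").agg(
    F.expr("approx_percentile(latency_ms, 0.95)").alias("p95_latency")
)

# Tune the accuracy/memory trade-off with the optional third argument.
p95_tuned = df.groupBy("customer_id").agg(
    F.expr("approx_percentile(latency_ms, 0.95, 50000)").alias("p95_latency")
)

In the DataFrame API, pyspark.sql.functions.percentile_approx() (Spark 3.1+) and pyspark.sql.functions.approx_percentile() (Spark 3.5+) express the same aggregate.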


Note

Simply disabling Photon does not resolve the issue, because Apache Spark's own percentile() implementation uses a similar algorithm and has the same limitations.