Problem
When you use the percentile() aggregate expression in PySpark and Photon to work with large datasets or datasets with many distinct values, you notice severe memory issues, including out-of-memory (OOM) errors.
Example stack trace
Photon failed to reserve 43.3 MiB for percentile-merge-expr memory pool, in task.
Memory usage:
Total task memory (including non-Photon): 6.6 GiB
task: allocated 6.6 GiB, tracked 6.6 GiB, untracked allocated 3.2 GiB, peak 6.6 GiB
  BufferPool: allocated 128.0 MiB, tracked 128.0 MiB, untracked allocated 3.2 GiB, peak 128.0 MiB
  DataWriter: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
  ShuffleExchangeSourceNode(id=XXX, output_schema=[string, string, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, binary, ... 105 more]): allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
  GroupingAggNode(id=XXX, output_schema=[string, string, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, double, ... 105 more]): allocated 6.0 MiB, tracked 6.0 MiB, untracked allocated 0.0 B, peak 5.7 GiB
    GroupingAggregation(recursion_depth=0): allocated 6.0 MiB, tracked 6.0 MiB, untracked allocated 0.0 B, peak 5.7 GiB
      GC'd var-len aggregates: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 3.3 GiB
Cause
The percentile() function causes memory issues in three ways.
- Both Apache Spark and Photon buffer the entire dataset into memory as a single, large array of values. This can quickly exhaust memory resources, especially with high-cardinality datasets (see the sketch after this list).
- Spilling to disk is challenging with this approach because splitting and spilling a single large array of values is complex and inefficient.
- Sorting the buffered values adds further memory and computational overhead.
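For reference, here is a minimal PySpark sketch of the query pattern that triggers this buffering. The table and column names (events, service, latency_ms) are hypothetical, and percentile() is called through F.expr() so the example also runs on Spark versions older than 3.5, which added a dedicated pyspark.sql.functions.percentile() helper.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and columns, for illustration only.
df = spark.table("events")

# percentile() collects every latency_ms value for each service group into a
# single in-memory array and sorts it, so per-group memory grows with the
# number of buffered values.
exact = df.groupBy("service").agg(
    F.expr("percentile(latency_ms, 0.95)").alias("p95_latency")
)
exact.show()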
Solution
Use approx_percentile() instead. The approx_percentile() function is an approximate alternative to percentile() with a fixed and predictable memory footprint. It avoids buffering the entire dataset in memory and is suitable for large-scale datasets where exact percentile calculation may be unnecessary.
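As a minimal sketch of the replacement, reusing the same hypothetical events table: in PySpark the function is exposed as percentile_approx() (Spark 3.1 and later; Spark 3.5 also adds an approx_percentile() alias). The optional accuracy parameter controls the memory/precision trade-off; the relative error of the result is at most 1.0 / accuracy.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Same hypothetical table and columns as the earlier sketch.
df = spark.table("events")

# percentile_approx() maintains a bounded-size sketch per group instead of
# buffering all values. Higher accuracy means a larger (but still bounded)
# sketch and a smaller relative error; the default accuracy is 10000.
approx = df.groupBy("service").agg(
    F.percentile_approx("latency_ms", 0.95, accuracy=10000).alias("p95_latency")
)
approx.show()

Because the sketch size is bounded regardless of how many values a group contains, per-group memory stays fixed even for high-cardinality groups, which is what makes this a safe substitute for the failing percentile() aggregation.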
Note
Simply disabling Photon does not resolve the issue because Spark uses a similar algorithm for percentile(), which inherits the same limitations.