Problem
When performing join transformations in Apache Spark, you notice that the expected broadcast hash join is not used, even though you provide broadcast join hints.
The issue occurs even when one of the datasets qualifies for a broadcast exchange. Additionally, you have tried Adaptive Query Execution (AQE) join optimizations, but they do not resolve the issue.
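You can confirm which physical join Spark chose by inspecting the query plan. The following is a minimal sketch; the table names large_tbl and small_tbl and the id join column are hypothetical placeholders. When the hint is not honored, the plan shows SortMergeJoin instead of BroadcastHashJoin.
%sql
explain
select /*+ broadcast(s) */ *
from large_tbl l
join small_tbl s
on l.id = s.id;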
Cause
The table statistics are outdated, so Spark estimates a larger data size than the actual size.
Alternatively, you’re using a full outer join. Spark does not support broadcast joins with this join type.
Note
For right outer joins, only the left-side table can be broadcast; for left outer joins, only the right-side table can be broadcast.
If the transformation involves a left outer join and the right DataFrame is not small enough to fit in memory, Spark falls back to a sort merge join. AQE skew optimization also falls back to a sort merge join in this situation.
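For example, in a left outer join only the right-side table is a valid broadcast candidate. The sketch below uses hypothetical tables orders (the large, outer side) and customers (the small side); hinting the left-side table instead cannot produce a broadcast hash join for this join type.
%sql
select /*+ broadcast(c) */ *
from orders o
left outer join customers c
on o.customer_id = c.customer_id;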
Solution
Use one or more of the following options, alone or in combination, to resolve the issue.
- Refresh your table’s statistics using the following command. Up-to-date statistics help Spark make accurate size estimates during query optimization. You can verify the refreshed statistics as shown in the first example after this list.
%sql analyze table <table-name> compute statistics
- If you are using a full outer join, use an inner join, left outer join, or right outer join instead, depending on your use case. When possible, structure the query so that the smaller table is on the non-outer side to allow for broadcasting (see the rewrite sketch after this list).
- Increase the broadcast threshold if you have enough compute memory to support broadcasting the data (the default for spark.sql.autoBroadcastJoinThreshold is 10 MB). Use the following properties. You can check the current values as shown in the last example after this list.
%sql
set spark.sql.autoBroadcastJoinThreshold = <size-in-bytes>;
set spark.databricks.adaptive.autoBroadcastJoinThreshold = <size-in-bytes>;
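After running the analyze command, you can verify that the statistics were updated by describing the table; the Statistics entry in the output reports the size (and, if computed, row count) that Spark uses for planning. <table-name> is the same placeholder as above.
%sql
describe table extended <table-name>;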
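As an illustration of the join-type rewrite, the following sketch replaces a full outer join with a left outer join and keeps the smaller table on the non-outer (right) side so it remains a broadcast candidate. The table and column names are hypothetical, and whether this rewrite preserves your intended semantics depends on your data.
%sql
-- instead of: select * from large_tbl f full outer join small_tbl d on f.key = d.key
select /*+ broadcast(d) */ *
from large_tbl f
left outer join small_tbl d
on f.key = d.key;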
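To check the current threshold values before or after changing them, run SET with just the property name. As a concrete example of <size-in-bytes>, 104857600 corresponds to 100 MB.
%sql
set spark.sql.autoBroadcastJoinThreshold;
set spark.databricks.adaptive.autoBroadcastJoinThreshold;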