Problem
When attempting to sample a small percentage of data from a large production table using TABLESAMPLE, the query takes an unexpectedly long time to execute.
Sample query
select * from <table>
tablesample (10 percent) repeatable(1)
where date = "<YYYY-MM-DD>";
When you review the query execution plan (such as the following example, corresponding to the sample query), you notice a full table scan is being performed. This is evident from the PhotonScan section, which shows an empty PartitionFilters list. This means the query is not taking advantage of partition pruning.
*(1) ColumnarToRow
+- PhotonResultStage
+- PhotonFilter (isnotnull(date#12465) AND (date#12465 = <yyyy-mm-dd>))
+- PhotonSample 0.0, 0.1, false
+- PhotonProject [col1, ... 18 more fields]
+- PhotonScan parquet table1[col1,... 18 more fieldsDataFilters] : [], DictionaryFilters: [], Format: parquet, Location: PreparedDeltaFileIndex(1 paths)[<path>..., OptionalDataFilters: [], PartitionFilters: [], ReadSchema:.., RequiredDataFilters: []
Cause
TABLESAMPLE is applied before any filters (such as WHERE clauses) are evaluated. As a result, filter pushdown does not occur, and the engine must scan the entire dataset before applying the sample. This order prevents the query engine from pruning partitions early, leading to inefficient full scans.
Solution
Restructure your query so that filters are applied before sampling. Use a subquery or a Common Table Expression (CTE). By filtering the data first, the query engine can take advantage of partition pruning, significantly reducing the amount of data scanned.
Modified sample query
with filtered_data as(
SELECT *
FROM <table> where date = "<YYYY-MM-DD>"
)
SELECT *
FROM filtered_data tablesample (10 percent) repeatable(1)
You can review the query plan again and confirm PartitionFilters are being applied.
== Physical Plan ==
*(1) ColumnarToRow
+- PhotonResultStage
+- PhotonSample 0.0, 0.1, false
+- PhotonProject [col1, ... 18 more fields]
+- PhotonScan parquet table1 [col1,... 18 more fields] DataFilters: [], DictionaryFilters: [], Format: parquet, Location: PreparedDeltaFileIndex(1 paths)[<path>, OptionalDataFilters: [], PartitionFilters: [isnotnull(Date#12211), (Date#12211 = <yyyy-mm–dd)], ReadSchema:..., RequiredDataFilters: []