Problem
When working with Auto Loader in file notification mode to ingest files, you notice the cluster receives a significantly higher number of messages during certain intervals, leading to duplicate data in the resulting dataset.
This behavior is observed despite a set backfill interval and smooth incoming traffic.
Cause
Using BackfillInterval
, notification mode, and allowOverwrites
in combination is known to cause duplicates.
It’s also possible the same file is being repeatedly processed and overwriting existing entries in cloud_files_state
, which also leads to duplicates.
Solution
Remove the allowOverwrites
configuration or implement a deduplicate logic downstream if you want to reprocess overwritten files at the source.
Use cloud_files_state
to identify any files that have been processed more than once.
For more information on cloud_files_state
, review the cloud_files_state
table-valued function (AWS | Azure | GCP) documentation.