Duplicates appearing in Auto Loader with file notification feature despite set backfill interval

Remove the allowOverwrites configuration or implement a deduplicate logic.

Written by raul.goncalves

Last published at: December 12th, 2024

Problem

When working with Auto Loader in file notification mode to ingest files, you notice the cluster receives a significantly higher number of messages during certain intervals, leading to duplicate data in the resulting dataset. 

This behavior is observed despite a set backfill interval and smooth incoming traffic. 

 

Cause

Using BackfillInterval, notification mode, and allowOverwrites in combination is known to cause duplicates. 

It’s also possible the same file is being repeatedly processed and overwriting existing entries in cloud_files_state, which also leads to duplicates. 

 

Solution

Remove the allowOverwrites configuration or implement a deduplicate logic downstream if you want to reprocess overwritten files at the source.

Use cloud_files_state to identify any files that have been processed more than once.

 

For more information on cloud_files_state, review the cloud_files_state table-valued function (AWSAzureGCP) documentation.