Problem
When you use Lakeflow Declarative Pipelines (LDP) to ingest data, the pipeline fails with the following error.
Error:
SparkException: Job aborted due to stage failure: com.databricks.sql.io.FileReadException: Error while reading file dbfs:<path-to-file>.
..
Caused by: org.apache.spark.SparkRuntimeException: [CANNOT_READ_ARCHIVED_FILE] Cannot read file at path dbfs:<path-to-file> because it has been archived. Please adjust your query filters to exclude archived files. SQLSTATE: KD003
..
Caused by: java.io.IOException: java.lang.RuntimeException: java.io.IOException: Operation failed: "This operation is not permitted on an archived blob.", 409, GET, <url-with-path-to-file>?"
Cause
The LDP ingestion process uses Auto Loader to read the source files, some of which have been archived. Archived files are moved to a storage class that cannot be accessed directly for processing, so Auto Loader fails when it tries to read them.
Solution
If your S3 object storage class is Glacier, move the files to a bucket whose lifecycle policy does not transition objects to Glacier. For more information, refer to the Archival support in Databricks documentation.
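As an illustration, the following is a minimal boto3 sketch of one way to do this, assuming AWS credentials are already configured. The bucket names and object key are placeholders, and a Glacier object must finish restoring before it can be copied.

import boto3

s3 = boto3.client("s3")

# Request a temporary restore of the archived object (asynchronous;
# the Standard retrieval tier typically completes within hours).
s3.restore_object(
    Bucket="archived-bucket",
    Key="<path-to-file>",
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}},
)

# Once the restore completes, copy the object to a bucket whose
# lifecycle policy does not transition objects to Glacier.
s3.copy_object(
    Bucket="non-glacier-bucket",
    Key="<path-to-file>",
    CopySource={"Bucket": "archived-bucket", "Key": "<path-to-file>"},
)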
Alternatively, add a timestamp filter option, such as modifiedAfter, to your Auto Loader readStream call and set it to the timestamp from which you want to start reading, as in the sketch below.
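For example, the following is a minimal sketch of an LDP source table using the dlt Python API and the pipeline's predefined spark session. The table name, source path, file format, and cutoff timestamp are placeholders; adjust them for your pipeline.

import dlt

@dlt.table
def ingested_data():
    # Only list and read files modified after the cutoff timestamp,
    # so older, archived files are excluded from the stream.
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("modifiedAfter", "2025-01-01 00:00:00.000000 UTC+0")
        .load("<path-to-source>")
    )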
For more information, refer to the Auto Loader options documentation.