Auto Loader streaming job failure with schema inference error
Problem Your Apache Spark streaming job using Auto Loader encounters an error stating: Schema inference for the 'parquet' format from the existing files in the input path <Root Folder> has failed Cause One possible cause for this issue is having multiple types of files in the child directories. The input directory structure includes a ro...
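One common mitigation (a sketch, not necessarily the article's full solution) is to constrain Auto Loader to a single file type so schema inference only ever sees Parquet files. `pathGlobFilter` is a standard Spark file-source option that Auto Loader honors; the paths below are placeholders, not from the article:

```python
# Hypothetical Auto Loader option set restricting ingestion to Parquet files,
# so mixed file types in child directories do not break schema inference.
# The schemaLocation path is a placeholder.
autoloader_parquet_options = {
    "cloudFiles.format": "parquet",
    "cloudFiles.schemaLocation": "/tmp/_schemas/demo",  # placeholder path
    "pathGlobFilter": "*.parquet",  # ignore non-Parquet files in the tree
}

# Applying the options requires a Databricks/Spark session, e.g.:
# df = (spark.readStream
#         .format("cloudFiles")
#         .options(**autoloader_parquet_options)
#         .load("/mnt/raw/events"))   # placeholder input path
```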
Auto Loader job fails with a URISyntaxException error due to invalid characters in filenames
Problem You have an Auto Loader job configured in directory listing mode and it fails with a URISyntaxException error. java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: [masked_uri] Cause The error message indicates an issue with the URI (Uniform Resource Identifier) used in the Autoload...
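This class of failure can usually be avoided by percent-encoding filenames before they are assembled into URIs. A plain-Python sketch using the stdlib `urllib.parse` as a rough analogue of Java's URI parsing (the filenames here are made up for illustration):

```python
from urllib.parse import quote

# Hypothetical filenames; characters such as spaces and ':' are common
# triggers for java.net.URISyntaxException when a path is parsed as a URI.
raw_names = ["report 2023.csv", "sales:v2.csv", "clean_file.csv"]

def to_safe_uri(base, name):
    # Percent-encode everything except unreserved characters so the
    # resulting URI parses unambiguously.
    return base + quote(name, safe="")

uris = [to_safe_uri("s3://bucket/input/", n) for n in raw_names]
print(uris[0])  # s3://bucket/input/report%202023.csv
```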
Auto Loader streaming query failure with UnknownFieldException error
Problem Your Auto Loader streaming job fails with an UnknownFieldException error when a new column is added to the source file of the stream. Exception: org.apache.spark.sql.catalyst.util.UnknownFieldException: Encountered unknown field(s) during parsing: <column name> Cause An UnknownFieldException error occurs when Auto Loader detects the ad...
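A hedged sketch of the relevant Auto Loader options: with `cloudFiles.schemaEvolutionMode` set to `addNewColumns` (the default), the stream stops once on the new column, records it in the schema location, and then succeeds on restart. The paths here are placeholders:

```python
# Hypothetical Auto Loader configuration. "addNewColumns" records new
# columns in the schema location so a restarted stream picks them up;
# other modes (e.g. "rescue") route new fields elsewhere instead.
evolution_options = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "/tmp/_schemas/stream",  # placeholder path
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
}

# In a Databricks notebook this would be applied as:
# df = (spark.readStream
#         .format("cloudFiles")
#         .options(**evolution_options)
#         .load("/input/path"))   # placeholder input path
```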
Custom garbage collection prevents cluster launch
Problem You are trying to use a custom Apache Spark garbage collection algorithm (other than the default, parallel garbage collection) on clusters running Databricks Runtime 10.0 and above. When you try to start a cluster, it fails to start. If the configuration is set on an executor, the executor is immediately terminated. For example, if you s...
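For illustration only, this is the kind of cluster Spark configuration that selects a non-default collector. The keys are standard Spark properties, but treating them as the exact trigger in any given setup is an assumption:

```python
# Illustrative cluster Spark configuration (not taken from the article):
# overriding the executor/driver JVM collector with G1. On Databricks
# Runtime 10.0+ a custom-GC override like this can prevent cluster launch
# or cause immediate executor termination, per the symptom described above.
gc_spark_conf = {
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    "spark.driver.extraJavaOptions": "-XX:+UseG1GC",
}
```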
JDBC write fails with a PrimaryKeyViolation error
Problem You are using JDBC to write to a SQL table that has primary key constraints, and the job fails with a PrimaryKeyViolation error. Alternatively, you are using JDBC to write to a SQL table that does not have primary key constraints, and you see duplicate entries in recently written tables. Cause When Apache Spark performs a JDBC write, one par...
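The retry behavior behind the duplicates can be sketched without Spark or a database: each partition commits independently, so a retried task replays its inserts. A keyed (primary-key) upsert is idempotent under replay:

```python
# Pure-Python sketch (no Spark or JDBC involved): simulate a task that is
# retried after it already wrote its rows. Blind appends duplicate the rows;
# a primary-key upsert is idempotent, so a replayed task does no harm.
rows = [(1, "a"), (2, "b")]  # (primary_key, value)

append_table = []
for attempt in range(2):          # attempt 0 "fails" after writing; attempt 1 retries
    append_table.extend(rows)     # plain INSERT semantics -> duplicates

upsert_table = {}
for attempt in range(2):
    for pk, val in rows:
        upsert_table[pk] = val    # MERGE/upsert semantics -> replay is harmless

print(len(append_table), len(upsert_table))  # 4 2
```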
Stream-stream join failure
Problem You are encountering an error when attempting to display a streaming DataFrame that is derived by performing a stream-stream join. Cause When calling the display method on a structured streaming DataFrame, the default settings utilize complete output mode and a memory sink. However, it's important to note that for stream-stream joins, the c...
display() does not show microseconds correctly
Problem You want to display a timestamp value with microsecond precision, but when you use display() it does not show the value past milliseconds. For example, this Apache Spark SQL display() command: display(spark.sql("select cast('2021-08-10T09:08:56.740436' as timestamp) as test")) Returns a truncated value: 2021-08-10T09:08:56.740+0000 Caus...
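The underlying value is not truncated; only the rendering is. A plain-Python illustration of the same timestamp, with the usual Spark-side workaround (formatting the timestamp as a string before displaying) hedged as a comment:

```python
from datetime import datetime

# The timestamp value itself keeps microsecond precision; only the
# millisecond rendering drops digits. Formatting explicitly preserves
# all six fractional digits.
ts = datetime.fromisoformat("2021-08-10T09:08:56.740436")

millis_view = ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{ts.microsecond // 1000:03d}"
micros_view = ts.strftime("%Y-%m-%dT%H:%M:%S.%f")

print(millis_view)  # 2021-08-10T09:08:56.740
print(micros_view)  # 2021-08-10T09:08:56.740436

# A common Spark-side analogue (format as string before display), e.g.:
# SELECT date_format(test, 'yyyy-MM-dd HH:mm:ss.SSSSSS')
```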
Job fails, but Apache Spark tasks finish
Problem Your Databricks job reports a failed status, but all Spark jobs and tasks have successfully completed. Cause You have explicitly called spark.stop() or System.exit(0) in your code. If either of these is called, the Spark context is stopped, but the graceful shutdown and handshake with the Databricks job service does not happen. Solution Do ...
Offset reprocessing issues in streaming queries with a Kafka source
Problem You are using Apache Spark Structured Streaming to source data from a Kafka topic and write it to a Delta table sink, but you run into problems when attempting to reprocess data from the earliest offset in the topic. The stream is updated with the option "startingOffsets": "earliest" and restarted. However, the streaming query fails...
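A likely explanation, consistent with Structured Streaming semantics: `startingOffsets` is honored only for a brand-new query, because an existing checkpoint takes precedence on restart. Replaying from the earliest offset therefore typically requires a fresh checkpoint location. A sketch with placeholder values:

```python
# Hypothetical reader/writer options (broker, topic, and paths are
# placeholders). "startingOffsets" applies only when the checkpoint is
# empty; to replay from the earliest offset, the restarted query needs
# a NEW checkpoint path rather than the original one.
kafka_options = {
    "kafka.bootstrap.servers": "broker:9092",  # placeholder broker
    "subscribe": "events",                     # placeholder topic
    "startingOffsets": "earliest",
}
new_checkpoint = "/tmp/checkpoints/events_v2"  # fresh location, placeholder

# In a Spark session this would be wired up roughly as:
# (spark.readStream.format("kafka").options(**kafka_options).load()
#    .writeStream.format("delta")
#    .option("checkpointLocation", new_checkpoint)
#    .start("/delta/target"))   # placeholder sink path
```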