Generate unique increasing numeric values
This article shows you how to use Apache Spark functions to generate unique increasing numeric values in a column. We review three methods; select the one that best fits your use case. Use zipWithIndex() in a Resilient Distributed Dataset (RDD) The zipWithIndex() function is only available within RDDs. You cannot...
AnalysisException error due to a schema mismatch
Problem While writing to a Delta table, you get an AnalysisException error indicating a schema mismatch: 'AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: bc10as3e-e12va-4f325-av10e-4s38f17vr3dd3)'. input_df.write.format("delta").mode("overwrite").save(target_delta_table_path) Cause The schema mismatch ...
LEFT JOIN resulting in null values when joining timestamp column and date column
Problem When joining two DataFrames, matching a timestamp column against a date column results in null values. Example In this example, start_timestamp is of timestamp data type, and start_date is of date data type. select * from table1 left join table2 on table1.start_timestamp = table2.start_date Cause A join between a timestamp and a date colum...
Create tables on JSON datasets
In this article, we cover how to create a table on JSON datasets using SerDe. Download the JSON SerDe JAR Open the hive-json-serde 1.3.8 download page. Click json-serde-1.3.8-jar-with-dependencies.jar to download the file. Info You can review the Hive-JSON-Serde GitHub repo for more information on the JAR...
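A hedged sketch of the kind of DDL such a table typically uses, once the JAR is attached (the table name, columns, and location below are illustrative; the SerDe class is the one shipped by the Hive-JSON-Serde project):

```sql
-- Register the downloaded SerDe JAR, then create a table over JSON files.
ADD JAR /path/to/json-serde-1.3.8-jar-with-dependencies.jar;

CREATE TABLE json_table (json_col1 STRING, json_col2 INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mnt/json_data/';
```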
Create a DataFrame from a JSON string or Python dictionary
In this article, we review how to create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary. Create a Spark DataFrame from a JSON string Add the JSON content from the variable to a list.%scala import scala.collection.mutable.ListBuffer val json_content1 = "{'json_col1': 'hello', 'json_col2': 32...
How to specify the DBFS path
When working with Databricks, you will sometimes need to access the Databricks File System (DBFS). Accessing files on DBFS is done with standard filesystem commands; however, the syntax varies depending on the language or tool used. For example, take the following DBFS path: dbfs:/mnt/test_folder/test_folder1/ Apache Spark Under Spark, you should spec...
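To make the syntax difference concrete, here is a hypothetical helper (not from the article): Spark addresses DBFS with the dbfs:/ scheme, while local file APIs go through the /dbfs FUSE mount, so the same location has two spellings.

```python
# Hypothetical helper: convert a dbfs:/... URI into the /dbfs/... path
# that local file APIs (e.g. Python's open()) use on Databricks.
def to_fuse_path(dbfs_path: str) -> str:
    prefix = "dbfs:/"
    if not dbfs_path.startswith(prefix):
        raise ValueError("expected a dbfs:/ path")
    return "/dbfs/" + dbfs_path[len(prefix):]

print(to_fuse_path("dbfs:/mnt/test_folder/test_folder1/"))
# /dbfs/mnt/test_folder/test_folder1/
```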
Parquet table counts not being reflected based on concurrent updates
Problem You may notice that a Parquet table count within a notebook remains the same even after additional rows are added to the table from an external process. For instance, if a count is taken from a table (Table 1) in a notebook (Notebook A) and the count is 100, an outside process or another notebook updates Table 1 and adds 100 additional rows....