Updated May 23rd, 2022 by ram.sankarasubramanian

Generate unique increasing numeric values

This article shows you how to use Apache Spark functions to generate unique increasing numeric values in a column. We review three different methods; select the one that works best for your use case. Use zipWithIndex() in a Resilient Distributed Dataset (RDD) The zipWithIndex() function is only available within RDDs. You cannot...

1 min reading time
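The zipWithIndex() behavior teased above can be sketched in plain Python (an analogy, not the article's Spark code): each element is paired with a contiguous, zero-based index, much like Python's built-in enumerate.

```python
# Plain-Python analogy for RDD.zipWithIndex(): pair each element with a
# contiguous, 0-based index. In Spark the equivalent would be something like
# sc.parallelize(data).zipWithIndex() (assumption: a running SparkContext `sc`).
data = ["a", "b", "c", "d"]
indexed = [(value, index) for index, value in enumerate(data)]
print(indexed)  # [('a', 0), ('b', 1), ('c', 2), ('d', 3)]
```

Note that zipWithIndex() emits the element first and the index second, which the list comprehension above mirrors.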
Updated September 23rd, 2024 by ram.sankarasubramanian

AnalysisException error due to a schema mismatch

Problem You are writing to a Delta table when you get an AnalysisException error indicating a schema mismatch. 'AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: bc10as3e-e12va-4f325-av10e-4s38f17vr3dd3)'. input_df.write.format("delta").mode("overwrite").save(target_delta_table_path) Cause The schema mismatch ...

0 min reading time
Updated September 12th, 2024 by ram.sankarasubramanian

LEFT JOIN resulting in null values when joining timestamp column and date column

Problem When joining two DataFrames, joining a timestamp column with a date column results in null values. Example In this example, start_timestamp is of timestamp data type, and start_date is of date data type. select * from table1 left join table2 on table1.start_timestamp = table2.start_date Cause A join between a timestamp and a date colum...

0 min reading time
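The article's cause section is truncated above, but the general issue of comparing mismatched temporal types can be illustrated in plain Python (an analogy only, not Spark's actual comparison semantics). In Spark SQL, the usual remedy is to cast one side so both are the same type, e.g. CAST(start_timestamp AS DATE); whether that matches the article's exact resolution is an assumption.

```python
from datetime import datetime, date

ts = datetime(2024, 9, 12, 0, 0, 0)  # stand-in for a timestamp column value
d = date(2024, 9, 12)                # stand-in for a date column value

# Comparing mismatched types never matches, which mirrors the
# null results produced by the mismatched SQL join condition.
print(ts == d)         # False
# Casting the timestamp down to a date makes the comparison succeed.
print(ts.date() == d)  # True
```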
Updated May 31st, 2022 by ram.sankarasubramanian

Create tables on JSON datasets

In this article, we cover how to create tables on JSON datasets using a SerDe. Download the JSON SerDe JAR Open the hive-json-serde 1.3.8 download page. Click json-serde-1.3.8-jar-with-dependencies.jar to download the file. Info You can review the Hive-JSON-Serde GitHub repo for more information on the JAR...

0 min reading time
Updated July 1st, 2022 by ram.sankarasubramanian

Create a DataFrame from a JSON string or Python dictionary

In this article, we review how to create an Apache Spark DataFrame from a variable containing a JSON string or a Python dictionary. Create a Spark DataFrame from a JSON string Add the JSON content from the variable to a list. %scala import scala.collection.mutable.ListBuffer val json_content1 = "{'json_col1': 'hello', 'json_col2': 32...

2 min reading time
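The Python side of the workflow teased above can be sketched minimally: a JSON string is parsed into a dictionary first, and that dictionary could then be handed to Spark. The SparkSession step is an assumption noted in the comments; only the parsing runs here.

```python
import json

# A JSON string held in a variable, as in the article's scenario.
json_content = '{"json_col1": "hello", "json_col2": 32}'
record = json.loads(json_content)  # -> {'json_col1': 'hello', 'json_col2': 32}

# With Spark available, a list of such dicts could become a DataFrame via
# spark.createDataFrame([record]) (assumption: an active SparkSession named `spark`).
print(record["json_col1"], record["json_col2"])
```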
Updated December 9th, 2022 by ram.sankarasubramanian

How to specify the DBFS path

When working with Databricks, you will sometimes have to access the Databricks File System (DBFS). Accessing files on DBFS is done with standard filesystem commands; however, the syntax varies depending on the language or tool used. For example, take the following DBFS path: dbfs:/mnt/test_folder/test_folder1/ Apache Spark Under Spark, you should spec...

0 min reading time
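The example path above can be mapped across contexts roughly as follows. This is a sketch based on common Databricks conventions (Spark APIs take a dbfs:/ URI, dbutils.fs takes the bare path, and local file APIs see DBFS mounted under /dbfs), not the article's full breakdown; verify against your workspace.

```python
# The same DBFS location written for different contexts
# (standard Databricks conventions; confirm for your tool of choice).
dbfs_root = "/mnt/test_folder/test_folder1/"

paths = {
    "spark_api": "dbfs:" + dbfs_root,       # e.g. spark.read.load("dbfs:/mnt/...")
    "dbutils_fs": dbfs_root,                # e.g. dbutils.fs.ls("/mnt/...")
    "local_file_api": "/dbfs" + dbfs_root,  # e.g. open("/dbfs/mnt/...") or %sh
}
for context, path in paths.items():
    print(f"{context}: {path}")
```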
Updated September 12th, 2024 by ram.sankarasubramanian

Parquet table counts not being reflected based on concurrent updates

Problem You may notice that a Parquet table count within a notebook remains the same even after additional rows are added to the table from an external process. For instance, if a count is taken from a table (Table 1) in a notebook (Notebook A) and the count is 100, an outside process or another notebook updates Table 1 and adds 100 additional rows....

0 min reading time