Japanese yen symbol not rendering when reading CSV files in Databricks

Load the CSV file as plain text, replace literal backslashes (\) with the Unicode yen symbol (\u00A5), and then parse the lines as CSV in Databricks.

Written by Amruth Ashoka

Last published at: October 21st, 2025

Problem

When reading CSV files in Databricks, the Japanese yen symbol (¥) does not render correctly. The character appears missing or corrupted, leading to incorrect values or parsing errors when the data is loaded into DataFrames or tables.


Cause

Some Japanese text editors, such as Sakura, save the ¥ symbol as a backslash (\). This stems from legacy Japanese encodings such as Shift-JIS, in which the byte 0x5C displays as ¥ but is the ASCII backslash. When Spark reads such CSV files, it interprets the backslash as an escape character, causing the ¥ symbol to be lost or misread.
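As a minimal sketch of the symptom, the following snippet parses one hand-built CSV line in which the editor saved ¥100 as \100. The sample row, column names, and schema are illustrative assumptions, not taken from an actual file.

from pyspark.sql import functions as F

# One hand-built CSV line where the yen sign was saved as a backslash:
# the field "\100" should have been "¥100". (Illustrative data only.)
sample = spark.createDataFrame([('"Pen","\\100"',)], ["line"])

# Spark's CSV parser uses the backslash as its default escape character,
# so the "\" is treated as an escape rather than as data, and the yen
# character never reaches the Price column intact.
sample.select(
    F.from_csv(F.col("line"), "Name STRING, Price STRING").alias("s")
).select("s.*").show(truncate=False)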


Solution

  1. Load the file as plain text so that each line is a single string. 
  2. Perform a literal search-and-replace of \ with ¥ (\u00A5) using the built-in regexp_replace function.


Then, when you hand those lines to from_csv, the parser doesn't see a backslash to treat as an escape, so it doesn't drop or mangle the yen symbol. You can adapt the following example code to your own file path and schema.

from pyspark.sql import functions as F

input_path = "path/<your-file>.csv"
schema = "Name STRING, Number INT, Price STRING, Position STRING"

# Read the file into Spark as plain text, one row per line
raw = spark.read.text(input_path)

# Replace literal backslashes with the yen symbol (\u00A5)
replaced_df = raw.select(
    F.regexp_replace(F.col("value"), r"\\", "\u00A5").alias("line")
)

# Parse the lines as CSV. Setting escape to the NUL character (\u0000)
# effectively disables escape handling, so the parser cannot consume
# characters as escapes.
parsed = (
    replaced_df
      .select(
          F.from_csv(
              F.col("line"),
              schema,
              {"escape": "\u0000", "quote": "\"", "mode": "PERMISSIVE"}
          ).alias("s")
      )
      .select("s.*")
)

# Collapse runs of repeated ¥ (for example, from a doubled backslash
# in the source) into a single ¥
parsed = parsed.withColumn("Price", F.regexp_replace("Price", "¥{2,}", "¥"))
display(parsed)
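
To confirm the fix, you can spot-check that the parsed rows actually contain the yen symbol. The Price column name below assumes the example schema above.

# Spot-check: rows whose Price contains the yen symbol
parsed.filter(F.col("Price").contains("\u00A5")).show(truncate=False)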