Increase in processing time for CSV file with multiLine option set to true

Split the CSV file at the source.

Written by shubham.bhusate

Last published at: October 13th, 2025

Problem

When you read a large CSV file that is not divided into smaller parts and have the Apache Spark option multiLine set to true, you notice a significant increase in processing time.

Cause

Setting multiLine to true causes each CSV file to be read as a whole, in a single partition. Spark does this so that records spanning multiple lines are not cut apart at a split boundary, which could lead to data corruption.

As a result, reading the file cannot be parallelized, and a single task ends up handling the entire dataset. With large files, this lack of parallelization can significantly increase execution time.
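
One way to confirm this behavior is to check how many partitions the read produces. The following is a minimal sketch rather than an official diagnostic. The helper name multiline_partition_count and the header option are illustrative assumptions, and the sketch assumes you already have a SparkSession and that DataFrame.rdd is accessible in your environment.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed for type hints; the helper itself just uses the session passed in.
    from pyspark.sql import SparkSession


def multiline_partition_count(spark: "SparkSession", path: str) -> int:
    """Return the number of partitions Spark creates when reading `path`
    as CSV with multiLine enabled. (Hypothetical helper for illustration.)"""
    df = (
        spark.read
        .option("header", "true")      # assumption: the file has a header row
        .option("multiLine", "true")   # the option discussed in this article
        .csv(path)
    )
    # Each partition is read by one task, so this is the read parallelism.
    return df.rdd.getNumPartitions()
```

Calling multiline_partition_count(spark, "<path-to-your-file>") on one large file is expected to report a single partition, in line with the cause described above.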


Solution

Split multiline CSV files at the source (where they are generated). Spark can then process each of the smaller files independently as a separate task, which enables parallel processing and improves overall job performance.
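
If the generating system cannot write smaller files directly, a small script at the source can do the split. The following is a minimal sketch using only the Python standard library, not a definitive implementation. The function name split_csv_by_records, the part-file naming, and the UTF-8 encoding are assumptions, and it assumes the file has a header row and uses standard double-quote quoting. It splits on parsed records rather than on raw lines, so a record that spans multiple lines is never cut in half.

```python
import csv
from pathlib import Path
from typing import List


def split_csv_by_records(source: Path, out_dir: Path, records_per_file: int) -> List[Path]:
    """Split `source` into smaller CSV files of at most `records_per_file`
    records each, repeating the header in every part.
    (Hypothetical helper for illustration.)"""
    if records_per_file <= 0:
        raise ValueError("records_per_file must be positive")

    out_dir.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    out = None
    writer = None

    # newline="" lets the csv module handle newlines inside quoted fields.
    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return parts  # empty input: nothing to split

        try:
            for index, record in enumerate(reader):
                if index % records_per_file == 0:
                    # Start a new part file and repeat the header in it.
                    if out is not None:
                        out.close()
                    part = out_dir / f"part-{len(parts):05d}.csv"
                    out = part.open("w", newline="", encoding="utf-8")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    parts.append(part)
                writer.writerow(record)
        finally:
            if out is not None:
                out.close()

    return parts
```

You can then point the Spark read at the output directory, keeping multiLine set to true, so that each smaller file can be handled by its own task. Choose records_per_file so the resulting files are large enough to avoid creating many tiny files.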