Problem
When the Apache Spark option multiLine is set to true, you notice a significant increase in processing time while reading a large CSV file that is not divided into smaller parts.
Cause
Setting multiLine to true causes the entire file to be processed as a single partition. Spark does this to ensure that records spanning multiple lines are not split incorrectly, which could lead to data corruption.
As a result, file processing cannot be parallelized, so only a single task ends up handling the entire dataset. This lack of parallelization can significantly increase execution time when working with large files.
Solution
Split multiline CSV files at the source (where they are generated). Creating multiple smaller files allows Spark to process each file independently as a separate task, enabling parallel processing and improving overall job performance.
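If the generating system can run Python, the split can be sketched with the standard library's csv module, which parses quoted fields containing newlines, so every chunk file starts and ends on a complete record. The function name, output naming scheme, and chunk size below are illustrative assumptions, not part of any Spark API.

```python
import csv

def split_csv(src_path, rows_per_chunk=10_000):
    """Split src_path into chunk files, each containing the header
    plus at most rows_per_chunk complete records."""
    out_paths = []
    writer, out = None, None
    # newline="" is required so csv handles embedded newlines correctly
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:
                if out:
                    out.close()
                # Hypothetical naming scheme: orders.csv.part0000.csv, ...
                chunk_path = f"{src_path}.part{i // rows_per_chunk:04d}.csv"
                out_paths.append(chunk_path)
                out = open(chunk_path, "w", newline="")
                writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL)
                writer.writerow(header)
            writer.writerow(row)
    if out:
        out.close()
    return out_paths
```

Spark can then read the resulting files together (for example, by pointing the reader at the directory or a glob). With multiLine still set to true, each file is processed as its own partition, so the work is spread across multiple tasks instead of one.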