Problem
When reading a CSV file in DROPMALFORMED mode with the .schema option specified, functions such as df.count() or df.agg(count('*')).display() still include malformed rows in the returned result.
Cause
The functions df.count() and df.agg(count('*')).display() each count the number of line breaks in a file without fully parsing each row according to the schema. For example, if you pass a string in and the schema is expecting an integer, df.count still counts it.
Solution
Databricks recommends caching the DataFrame after reading it, and then calling the df.count() function. This forces Apache Spark to fully parse the data and apply the DROPMALFORMED mode correctly.