Problem
When using the .loc[] indexer and assignment (=) operations, you notice performance degradation such as increased runtimes.
The following code is an example of using .loc[] and assignment (=) inside def func(data1).
import numpy as np
import pandas as pd
from pyspark.sql.functions import lit

# Build a one-million-row sample DataFrame with an empty column e.
sql_query = """select cast(cast(rand(1) * 20 as int) as string) as a, cast(1 as long) as b, cast(id as timestamp) as c, 0 as d FROM range(1000000) order by c"""
df1 = spark.sql(sql_query)
df1 = df1.withColumn("e", lit(None))

# Copy column d into column e one row at a time with .loc[] assignment.
def func(data1):
    i = 0
    while i < data1.shape[0]:
        data1.loc[i, "e"] = data1.loc[i, "d"]
        i = i + 1
    return data1

df2 = df1.groupBy("a", "b").applyInPandas(func, schema="a string, b long, c timestamp, d integer, e float")
display(df2)
Cause
Using the .loc[] indexer with direct assignment (=) in a row-by-row loop is generally discouraged in pandas because it processes one value at a time in Python and bypasses the vectorized operations that NumPy performs under the hood.
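The impact is easy to reproduce in plain pandas, outside of Spark. The following sketch is a hypothetical, standalone illustration (not part of the original example) that times a row-by-row .loc[] loop against a single vectorized column assignment on the same data.
import time
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"d": np.zeros(100000, dtype="int64")})
pdf["e"] = None

# Row-by-row copy with .loc[] assignment.
start = time.time()
for i in range(pdf.shape[0]):
    pdf.loc[i, "e"] = pdf.loc[i, "d"]
loop_seconds = time.time() - start

# Vectorized copy of the whole column at once.
start = time.time()
pdf["e"] = pdf["d"]
vectorized_seconds = time.time() - start

print(f"loop: {loop_seconds:.2f}s, vectorized: {vectorized_seconds:.4f}s")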
Solution
Use vectorized operations instead. Vectorized operations are typically faster and more efficient, especially for large datasets.
The following code is the same example from the problem statement, with a vectorized assignment replacing the .loc[] loop in def func(data1).
import numpy as np
import pandas as pd
from pyspark.sql.functions import lit

sql_query = """select cast(cast(rand(1) * 20 as int) as string) as a, cast(1 as long) as b, cast(id as timestamp) as c, 0 as d FROM range(1000000) order by c"""
df1 = spark.sql(sql_query)
df1 = df1.withColumn("e", lit(None))

# Copy column d into column e with a single vectorized assignment.
def func(data1):
    data1["e"] = data1["d"]
    return data1

df2 = df1.groupBy("a", "b").applyInPandas(func, schema="a string, b long, c timestamp, d integer, e float")
display(df2)
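If your own func applies conditional per-row logic rather than a plain column copy, it can usually still be written without a loop. The following sketch is a hypothetical variant (func_conditional is not part of the original example) that uses numpy.where to evaluate the condition for the whole column at once, and it can be used with applyInPandas in the same way.
import numpy as np
import pandas as pd

def func_conditional(data1: pd.DataFrame) -> pd.DataFrame:
    # Set e to d where d is positive, otherwise to 0, without iterating over rows.
    data1["e"] = np.where(data1["d"] > 0, data1["d"], 0)
    return data1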