Problem
When using the .loc[] indexer and assignment (=) operations, you notice performance degradation such as increased runtimes.
The following code is an example of using .loc[] and assignment (=) inside def func(data1).
import numpy as np
import pandas as pd
from pyspark.sql.functions import lit

# Build a one-million-row sample DataFrame with an empty column e.
sql_query = """select cast(cast(rand(1) * 20 as int) as string) as a, cast(1 as long) as b, cast(id as timestamp) as c, 0 as d FROM range(1000000) order by c"""
df1 = spark.sql(sql_query)
df1 = df1.withColumn("e", lit(None))

# Copy column d into column e one row at a time with .loc[] assignment.
def func(data1):
    i = 0
    while i < data1.shape[0]:
        data1.loc[i, "e"] = data1.loc[i, "d"]
        i = i + 1
    return data1

df2 = df1.groupBy("a", "b").applyInPandas(func, schema="a string, b long, c timestamp, d integer, e float")
display(df2)
Cause
Using the .loc[] indexer with direct assignment (=) in a row-by-row loop is generally discouraged in pandas because it processes one value at a time in Python and bypasses the vectorized operations that NumPy performs under the hood.
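The impact is easy to reproduce in plain pandas, outside of Spark. The following sketch is a hypothetical, standalone illustration (not part of the original example) that times a row-by-row .loc[] loop against a single vectorized column assignment on the same data.
import time
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"d": np.zeros(100000, dtype="int64")})
pdf["e"] = None

# Row-by-row copy with .loc[] assignment.
start = time.time()
for i in range(pdf.shape[0]):
    pdf.loc[i, "e"] = pdf.loc[i, "d"]
loop_seconds = time.time() - start

# Vectorized copy of the whole column at once.
start = time.time()
pdf["e"] = pdf["d"]
vectorized_seconds = time.time() - start

print(f"loop: {loop_seconds:.2f}s, vectorized: {vectorized_seconds:.4f}s")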
Solution
Use vectorized operations instead. Vectorized operations are typically faster and more efficient, especially for large datasets.
The following code is the same example from the problem statement, with a vectorized assignment replacing the .loc[] loop in def func(data1).
import numpy as np
import pandas as pd
from pyspark.sql.functions import lit

sql_query = """select cast(cast(rand(1) * 20 as int) as string) as a, cast(1 as long) as b, cast(id as timestamp) as c, 0 as d FROM range(1000000) order by c"""
df1 = spark.sql(sql_query)
df1 = df1.withColumn("e", lit(None))

# Copy column d into column e with a single vectorized assignment.
def func(data1):
    data1["e"] = data1["d"]
    return data1

df2 = df1.groupBy("a", "b").applyInPandas(func, schema="a string, b long, c timestamp, d integer, e float")
display(df2)
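If your own func applies conditional per-row logic rather than a plain column copy, it can usually still be written without a loop. The following sketch is a hypothetical variant (func_conditional is not part of the original example) that uses numpy.where to evaluate the condition for the whole column at once, and it can be used with applyInPandas in the same way.
import numpy as np
import pandas as pd

def func_conditional(data1: pd.DataFrame) -> pd.DataFrame:
    # Set e to d where d is positive, otherwise to 0, without iterating over rows.
    data1["e"] = np.where(data1["d"] > 0, data1["d"], 0)
    return data1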