Problem
When you attempt to read data directly using pandas on a standard compute in Databricks, you receive the following error message.
PermissionError: Forbidden File
Cause
When you use a standard compute, pandas does not have permission to access cloud storage.
Standard compute does not pass the cluster's instance profile or credentials to pandas. Instead, pandas reads files from the local filesystem where the code runs. Session isolation, which prevents users from accessing each other's data, also blocks pandas reads from external storage.
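For example, a direct pandas read of an external path is the pattern that fails. The following is a minimal sketch of that pattern; the s3:// path is a hypothetical placeholder.
import pandas as pd

# Fails on standard compute: pandas attempts the request without the
# cluster's cloud credentials. The path is a hypothetical placeholder.
pandas_df = pd.read_csv("s3://<your-bucket>/<your-file>.csv")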
Solution
If you need to read external files directly with pandas, run your workload on a dedicated compute instead. Dedicated compute allows direct filesystem or cloud access with the user’s permissions, so pandas can read external files.
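For example, on dedicated compute pandas can read a file path directly using your own permissions. This is a minimal sketch; the Unity Catalog volume path is a hypothetical placeholder.
import pandas as pd

# On dedicated compute, this read runs with your permissions.
# The volume path is a hypothetical placeholder.
pandas_df = pd.read_csv("/Volumes/<catalog>/<schema>/<volume>/<your-file>.csv")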
Otherwise, use an Apache Spark DataFrame on a standard compute: read the file into Spark, then convert it to pandas. Your .csv file path should include s3:// on AWS, abfs:// on Azure, or gs:// on GCP.
# Read the .csv into a Spark DataFrame; Spark handles storage access.
spark_df = spark.read.csv("<your-csv-filepath>", header=True, inferSchema=True)

# Convert the Spark DataFrame to a pandas DataFrame on the driver.
pandas_df = spark_df.toPandas()
Spark reads files using the cluster's configuration and credentials, and it can distribute the work across cluster nodes.
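Because toPandas() collects the full result onto the driver, it can help to filter or aggregate in Spark before converting. A minimal sketch, assuming a hypothetical column named status:
# Reduce the data in Spark before collecting it to the driver.
# The column name "status" is a hypothetical placeholder.
filtered_df = spark_df.filter(spark_df.status == "active")
pandas_df = filtered_df.toPandas()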