Problem
When you deploy a job with Databricks Asset Bundles (DAB) that installs Python packages from a private PyPI-compliant repository, the job fails with the following error message.
Run failed with error message
Library installation failed for library due to user error for pypi {
  package: "<package-name>"
  repo: "<private-repository-url>"
}
Error messages:
Library installation attempted on the driver node of cluster <cluster-id> and failed. Pip could not find a version that satisfies the requirement for the library. Please check your library version and dependencies. Error code: ERROR_NO_MATCHING_DISTRIBUTION, error message: org.apache.spark.SparkException: Process List(/bin/su, libraries,
-c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install '<package-name>' --index-url <private-repository-url>…
*** WARNING: message truncated. Skipped *** bytes of output ***
Cause
Installing Python packages from a private PyPI-compliant repository with --index-url forces pip to resolve every dependency exclusively through that repository.
The installation fails when some dependencies are available only on the public PyPI index or on other indexes.
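A minimal sketch of the failure mode, using a hypothetical private index URL (the package name placeholder is kept from the error message above):

```shell
# pip resolves <package-name> AND all of its dependencies from the
# private index only; public PyPI is never consulted.
pip install '<package-name>' --index-url https://private.example.com/simple
# Any dependency hosted only on public PyPI then fails to resolve
# (pip reports "No matching distribution found"; Databricks surfaces
# this as ERROR_NO_MATCHING_DISTRIBUTION).
```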
Solution
Enable pip to install packages from multiple indexes by setting the PIP_EXTRA_INDEX_URL environment variable in the cluster specification in the databricks.yml file.
This environment variable mirrors pip’s --extra-index-url option, which lets pip search an additional package index, such as the public PyPI repository, alongside the private PyPI-compliant repository.
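Setting the environment variable is equivalent to passing the flag on the pip command line; a sketch with a hypothetical private index URL:

```shell
# Flag form: search public PyPI in addition to the private index.
pip install '<package-name>' \
  --index-url https://private.example.com/simple \
  --extra-index-url https://pypi.org/simple

# Environment-variable form: what the cluster configuration below sets.
export PIP_EXTRA_INDEX_URL=https://pypi.org/simple
pip install '<package-name>' --index-url https://private.example.com/simple
```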
Example configuration
targets:
  dev:
    mode: development
    default: true
    resources:
      jobs:
        my_job:
          job_clusters:
            - job_cluster_key: ${bundle.target}-${bundle.name}-job-cluster
              new_cluster:
                num_workers: 2
                spark_version: "14.3.x-cpu-ml-scala2.12"
                node_type_id: Standard_F4
                spark_env_vars:
                  PIP_EXTRA_INDEX_URL: "{{secrets/<your-scope>/<your-extra-index-url>}}"
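After updating databricks.yml, redeploy the bundle so the new cluster specification takes effect; a sketch using the Databricks CLI, with the target and job key taken from the example above:

```shell
databricks bundle validate           # check the bundle configuration
databricks bundle deploy -t dev      # deploy to the dev target
databricks bundle run my_job -t dev  # trigger the job with the new cluster spec
```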
For serverless notebooks and jobs, you can configure PIP_EXTRA_INDEX_URL through the UI and apply it across the entire workspace.
For more details, refer to the “Configure default Python package repositories” section of the Configure the serverless environment (AWS | Azure | GCP) documentation.