Problem
When using the Unstructured library in your workspace to extract content from PDF files, you encounter the following error.
PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
Cause
You are missing the system-level dependency, Poppler.
The Unstructured library internally uses the pdf2image Python package to process PDF files. The pdf2image package relies on a command-line tool called pdfinfo, which is part of the Poppler utility suite, to extract metadata like page count and layout.
However, Poppler is not a Python package and cannot be installed using %pip install. The required tool, pdfinfo, comes from the poppler-utils Linux package, which must be installed using a system package manager like apt-get.
If poppler-utils is not installed or not available in the system PATH, pdf2image raises a PDFInfoNotInstalledError.
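To confirm this cause before changing the cluster, you can check from a notebook cell whether the Poppler tools are on the PATH. The following is a minimal sketch that uses only the Python standard library. The helper name find_missing_tools is hypothetical, and checking for pdftoppm in addition to pdfinfo is an assumption that goes beyond the error message above. In a notebook, the check only reflects the driver node.

```python
import shutil


def find_missing_tools(tools=("pdfinfo", "pdftoppm")):
    """Return the names in `tools` that cannot be found on the PATH."""
    # shutil.which returns None when no executable with that name is found.
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Missing Poppler tools:", ", ".join(missing))
else:
    print("Poppler tools found on PATH.")
```

If pdfinfo is reported as missing, the init script described in the Solution section should resolve the error.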
Solution
Install pdf2image using %pip or the Libraries UI
First, ensure that the pdf2image Python library is installed in your notebook or cluster environment.
Using a notebook
%pip install pdf2image
Using the Libraries UI
- Go to Compute > Your Cluster > Libraries > Install New
- Select PyPI, and in the package field, enter “pdf2image”
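After either installation method, a quick check can confirm that the package is visible to the notebook's Python environment. This is a sketch using the standard library only. The helper name is_importable is hypothetical, and the check does not say anything about whether Poppler itself is present.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate a top-level module with this name."""
    # find_spec returns None when the module cannot be located.
    return importlib.util.find_spec(module_name) is not None


print("pdf2image importable:", is_importable("pdf2image"))
```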
Create an init script to install poppler-utils
Install system-level dependencies like Poppler using init scripts, which are executed automatically on cluster startup.
1. Use the workspace file browser to create a new file (AWS | Azure | GCP) in your home directory. Call it install_poppler.sh.
2. Copy the following sample script and paste it into the install_poppler.sh file you just created:
#!/bin/bash
sudo apt-get update && sudo apt-get install -y poppler-utils
# (Optional) Install OCR engine used in some PDF workflows
sudo apt-get install -y tesseract-ocr
3. Your init script is located at /Workspace/Users/<user-name>/install_poppler.sh. Note this path. You will need it when configuring your cluster.
Configure the init script on your cluster
- Go to Compute > Your Cluster > Advanced Options > Init Scripts
- Enter the file path /Workspace/Users/<user-name>/install_poppler.sh
- Click Add.
- Click Confirm and then restart the cluster to apply the script.
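Once the cluster has restarted, you may want to verify that the init script actually installed Poppler. The sketch below runs pdfinfo -v from a notebook cell and prints the first line of whatever the tool reports. The helper name tool_version is hypothetical, and because the code does not assume whether the version banner goes to standard output or standard error, it reads both.

```python
import subprocess


def tool_version(tool="pdfinfo"):
    """Return the first line printed by `tool -v`, or None if the tool
    cannot be run or prints nothing."""
    try:
        result = subprocess.run(
            [tool, "-v"], capture_output=True, text=True, timeout=10
        )
    except (FileNotFoundError, PermissionError, subprocess.TimeoutExpired):
        # The executable is missing, not runnable, or did not return in time.
        return None
    output = (result.stdout + result.stderr).strip()
    return output.splitlines()[0] if output else None


banner = tool_version()
print(banner if banner else "pdfinfo is still not available on PATH.")
```

If the fallback message is printed, check that the init script path was entered correctly and that the cluster was fully restarted.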
For more information, refer to the Cluster-scoped init scripts (AWS | Azure | GCP) documentation.