Error while using the Unstructured library

Use apt-get to install poppler-utils.

Written by priyanshi.david

Last published at: June 11th, 2025

Problem

When using the Unstructured library in your workspace to extract content from PDF files, you encounter the following error.

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

 

Cause

You are missing the system-level dependency, Poppler.

The Unstructured library uses the pdf2image Python package internally to process PDF files. In turn, pdf2image calls pdfinfo, a command-line tool from the Poppler utility suite, to extract metadata such as page count and layout.

However, Poppler is not a Python package and cannot be installed using %pip install. The required tool, pdfinfo, comes from the poppler-utils Linux package, which must be installed using a system package manager like apt-get.

If poppler-utils is not installed or not available in the system PATH, pdf2image will raise a PDFInfoNotInstalledError.
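
To confirm this cause from a notebook, you can check whether pdfinfo is discoverable on the PATH. The following is a minimal sketch that uses only the Python standard library. The helper name find_pdfinfo is hypothetical and is not part of pdf2image or Unstructured.

```python
import shutil


def find_pdfinfo():
    """Return the full path to the pdfinfo executable, or None if it is not on PATH."""
    return shutil.which("pdfinfo")


path = find_pdfinfo()
if path is None:
    # This is the condition under which pdf2image cannot get the page count.
    print("pdfinfo not found on PATH: poppler-utils is probably not installed")
else:
    print(f"pdfinfo found at {path}")
```

If the check reports that pdfinfo is missing, continue with the solution below.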

 

Solution

 

Install pdf2image using %pip or the Libraries UI

First, ensure the pdf2image Python library is installed in your notebook or cluster environment.

 

Using a notebook

%pip install pdf2image

 

Using the Libraries UI

  1. Go to Compute > Your Cluster > Libraries > Install New
  2. Select PyPI and enter pdf2image in the Package field.
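
Whichever method you use, you can check from a notebook that the package is visible to the Python environment. This is a minimal sketch based on the standard library. The helper name is_installed is an illustrative assumption, not part of any Databricks or pdf2image API.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be located for import."""
    return importlib.util.find_spec(package_name) is not None


print("pdf2image installed:", is_installed("pdf2image"))
```

A result of False suggests the install did not reach the environment the notebook is attached to.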

 

Create an init script to install poppler-utils

Install system-level dependencies such as Poppler with an init script, which runs automatically each time the cluster starts.

1. Use the workspace file browser to create a new file (AWS | Azure | GCP) in your home directory. Name it install_poppler.sh.

2. Copy the following sample script and paste it into the install_poppler.sh file you just created:

#!/bin/bash

# Refresh the package index, then install Poppler, which provides pdfinfo
sudo apt-get update && sudo apt-get install -y poppler-utils

# (Optional) Install OCR engine used in some PDF workflows
sudo apt-get install -y tesseract-ocr

 

3. Your init script is located at /Workspace/Users/<user-name>/install_poppler.sh. Note this path. You need it when you configure your cluster.

 

Configure the init script on your cluster 

  1. Go to Compute > Your Cluster > Advanced Options > Init Scripts
  2. Enter the file path /Workspace/Users/<user-name>/install_poppler.sh
  3. Click Add
  4. Click Confirm and then restart the cluster to apply the script.
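
After the cluster restarts, you can check from a notebook that the init script installed Poppler. The sketch below runs pdfinfo -v and returns the first line of its output. The helper name pdfinfo_version is hypothetical, and the sketch merges stdout and stderr on the assumption that the version banner may be written to either stream.

```python
import shutil
import subprocess


def pdfinfo_version():
    """Return the first line printed by `pdfinfo -v`, or None if pdfinfo is not on PATH."""
    if shutil.which("pdfinfo") is None:
        return None
    result = subprocess.run(
        ["pdfinfo", "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams so the banner is captured either way
        text=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else ""


version = pdfinfo_version()
if version is None:
    print("pdfinfo is still missing: check the init script path and the cluster logs")
else:
    print(f"Poppler is available: {version}")
```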

For more information, refer to the Cluster-scoped init scripts (AWS | Azure | GCP) documentation.