pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

https://github.com/belval/pdf2image

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary

Keywords

convert pdf pil pil-image poppler
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Belval
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 4.59 MB
Statistics
  • Stars: 1,855
  • Watchers: 18
  • Forks: 211
  • Open Issues: 82
  • Releases: 18
Topics
convert pdf pil pil-image poppler
Created over 8 years ago · Last pushed over 1 year ago
Metadata Files
Readme Funding License

README.md

pdf2image

A Python (3.7+) module that wraps pdftoppm and pdftocairo to convert a PDF to a list of PIL Image objects.

How to install

pip install pdf2image

Windows

Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
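As a minimal sketch of the second option, the snippet below passes poppler_path only when a folder is supplied. The helper names poppler_kwargs and convert_pdf are hypothetical and not part of pdf2image, and pdf2image is imported lazily so the helpers can be defined without it.

```python
from pathlib import Path


def poppler_kwargs(poppler_bin=None):
    """Build the extra keyword arguments for convert_from_path.

    Hypothetical helper: returns {} when poppler is expected on PATH,
    otherwise {"poppler_path": <bin folder>}.
    """
    if poppler_bin is None:
        return {}
    return {"poppler_path": Path(poppler_bin)}


def convert_pdf(pdf_path, poppler_bin=None):
    """Convert a PDF, optionally pointing pdf2image at a poppler bin/ folder."""
    from pdf2image import convert_from_path  # requires pdf2image and poppler

    return convert_from_path(pdf_path, **poppler_kwargs(poppler_bin))
```

On Windows this could be called as convert_pdf("example.pdf", r"C:\path\to\poppler-xx\bin"), with the placeholder folder replaced by the real installation path.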

Mac

Mac users will have to install poppler.

Installing using Brew:

brew install poppler

Linux

Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.

Platform-independent (Using conda)

  1. Install poppler: conda install -c conda-forge poppler
  2. Install pdf2image: pip install pdf2image
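Whichever route is used, it can help to confirm that the poppler command-line tools are actually reachable before converting anything. The sketch below uses only the standard library; the function name missing_poppler_tools and the tool list are illustrative assumptions, not part of pdf2image.

```python
import shutil

# Poppler command-line tools that this README mentions pdf2image wrapping.
REQUIRED_TOOLS = ("pdftoppm", "pdftocairo", "pdfinfo")


def missing_poppler_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All poppler tools were found on PATH.")
```

An empty result only means the executables are on PATH; a poppler_path argument can still be used when they are installed elsewhere.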

How does it work?

```py
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
```

Then simply do:

```py
images = convert_from_path('/home/belval/example.pdf')
```

OR

```py
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
```

OR better yet

```py
import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
```

images will be a list of PIL Image objects, one for each page of the PDF document.
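A rough sketch of what "do something" might look like: saving each page as a PNG and handling the exception classes imported earlier. The function names and the file naming scheme are hypothetical; only convert_from_path and the three exceptions come from pdf2image.

```python
from pathlib import Path


def page_filename(stem, page_number, ext="png"):
    """Name for one page image, e.g. 'example_page_003.png' (hypothetical scheme)."""
    return f"{stem}_page_{page_number:03d}.{ext}"


def save_pages(pdf_path, out_dir):
    """Convert every page of pdf_path and save it as a PNG in out_dir."""
    # Lazy imports: these need pdf2image and poppler to be installed.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        images = convert_from_path(pdf_path)
    except PDFInfoNotInstalledError:
        raise SystemExit("poppler does not seem to be installed or on PATH")
    except (PDFPageCountError, PDFSyntaxError) as exc:
        raise SystemExit(f"could not read {pdf_path}: {exc}")

    saved = []
    for number, image in enumerate(images, start=1):
        target = out_dir / page_filename(Path(pdf_path).stem, number)
        image.save(target)  # PIL infers the PNG format from the extension
        saved.append(target)
    return saved
```

Because every page is held in memory at once here, this pattern suits small documents; for large ones an output folder is the safer choice.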

Here are the definitions:

convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)

convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)

What's new?

  • Allow users to hide attributes when using pdftoppm with hide_attributes (Thank you @StaticRocket)
  • Fix console opening on Windows (Thank you @OhMyAgnes!)
  • Add timeout parameter which raises PDFPopplerTimeoutError after the given number of seconds.
  • Add use_pdftocairo parameter which forces pdf2image to use pdftocairo. Should improve performance.
  • Fixed a bug where using pdf2image with multiple threads (but not multiple processes) would cause an exception
  • jpegopt parameter allows for tuning of the output JPEG when using fmt="jpeg" (-jpegopt in pdftoppm CLI) (Thank you @abieler)
  • pdfinfo_from_path and pdfinfo_from_bytes which expose the output of the pdfinfo CLI
  • paths_only parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
  • size parameter allows you to define the shape of the resulting images (-scale-to in pdftoppm CLI)
    • size=400 will fit the image to a 400x400 box, preserving aspect ratio
    • size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
    • size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
  • grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
  • single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
  • Allow the user to specify poppler's installation path with poppler_path
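The three forms of the size parameter listed above can be sketched as plain arithmetic. This is only an illustration of the described behaviour, not pdf2image's or poppler's actual code: the function name expected_size, the rounding, and the handling of a (None, height) tuple are assumptions.

```python
def expected_size(page_width, page_height, size):
    """Predict the output (width, height) in pixels for a `size` argument.

    Illustration only: mirrors the behaviour described in the list above.
    """
    if isinstance(size, int):
        # Fit inside a size x size box: scale the longer side to `size`.
        longest = max(page_width, page_height)
        return (round(page_width * size / longest),
                round(page_height * size / longest))

    width, height = size
    if width is not None and height is not None:
        # Both dimensions given: the aspect ratio is not preserved.
        return (width, height)
    if width is not None:
        # Only the width given: derive the height from the aspect ratio.
        return (width, round(page_height * width / page_width))
    if height is not None:
        # Only the height given (assumed to mirror the width-only case).
        return (round(page_width * height / page_height), height)
    raise ValueError("size must set at least one dimension")
```

For a 612x792 page, size=400 gives (309, 400), size=(400, None) gives (400, 518), and size=(500, 500) gives (500, 500).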

Performance tips

  • Using an output folder is significantly faster if you are using an SSD. Otherwise i/o usually becomes the bottleneck.
  • Using multiple threads can give you some gains but avoid more than 4 as this will cause i/o bottleneck (even on my NVMe SSD!).
  • If i/o is your bottleneck, using the JPEG format can lead to significant gains.
  • The PNG format is pretty slow because of its compression.
  • If you want to know the best settings (most settings will be fine anyway) you can clone the project and run python tests.py to get timings.
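One way to combine these tips is sketched below: write JPEG files to an output folder with four threads and get paths back instead of loaded images. The names FAST_DEFAULTS, fast_options and convert_to_folder are hypothetical, and the values are a starting point rather than measured optima.

```python
# Defaults taken from the tips above; treat them as a starting point.
FAST_DEFAULTS = {
    "fmt": "jpeg",       # smaller files than the default PPM, so less i/o
    "thread_count": 4,   # the tips advise against using more than 4
    "paths_only": True,  # return file paths instead of loaded images
}


def fast_options(**overrides):
    """Return FAST_DEFAULTS with any caller overrides applied on top."""
    return {**FAST_DEFAULTS, **overrides}


def convert_to_folder(pdf_path, output_folder, **overrides):
    """Convert pdf_path into output_folder (returns paths by default)."""
    from pdf2image import convert_from_path  # requires pdf2image and poppler

    return convert_from_path(
        pdf_path, output_folder=output_folder, **fast_options(**overrides)
    )
```

Calling convert_to_folder("example.pdf", "out", fmt="png") would keep the other defaults and switch only the format, which makes it easy to time alternatives.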

Limitations / known issues

  • A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder)
  • Sometimes fails to read PDFs signed using DocuSign; see "Solution for DocuSign issue".
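Besides using an output folder, memory use can be bounded by converting a few pages at a time with first_page and last_page. The sketch below is one possible approach, not the project's recommended method: page_ranges and convert_in_batches are hypothetical names, and it assumes pdfinfo_from_path returns a dictionary whose "Pages" entry is the page count.

```python
def page_ranges(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(pdf_path, batch_size=10):
    """Yield PIL images a few pages at a time to bound memory use."""
    from pdf2image import convert_from_path, pdfinfo_from_path

    # Assumption: "Pages" holds the page count as an integer.
    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_ranges(total_pages, batch_size):
        yield from convert_from_path(pdf_path, first_page=first, last_page=last)
```

For a 10-page document and batch_size=4, page_ranges produces (1, 4), (5, 8) and (9, 10), so at most four pages are held in memory per conversion call.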

Owner

  • Name: Edouard Belval
  • Login: Belval
  • Kind: user
  • Location: Canada
  • Company: Amazon Web Services

Sr Research Engineer@Amazon

GitHub Events

Total
  • Issues event: 7
  • Watch event: 225
  • Issue comment event: 11
  • Pull request event: 3
  • Fork event: 12
Last Year
  • Issues event: 7
  • Watch event: 225
  • Issue comment event: 11
  • Pull request event: 3
  • Fork event: 12

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 192
  • Total Committers: 34
  • Avg Commits per committer: 5.647
  • Development Distribution Score (DDS): 0.391
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Edouard Belval e****d@b****g 117
Edouard Belval g****b@b****g 34
Martin Münch m****h@g****m 4
Plat p****1@g****m 3
Shipra Srivastava c****a@g****m 3
Edouard Belval e****l@e****m 2
josephernest n****n@g****m 2
Akash a****h@a****m 1
Andre Bieler a****r@u****m 1
Ankit Patel a****3@g****m 1
Ben Beasley c****e@m****t 1
Bob Du i@b****c 1
Bojan Mihelac b****c@m****g 1
Bruno Cabral b****m@g****m 1
Camila Pozas 8****s@u****m 1
Code0987 d****j@g****m 1
Daniel Angelov d****v@g****m 1
Fillipe Galiza 5****z@u****m 1
Florian Demmer f****r@g****m 1
Hugo van Kemenade h****k@u****m 1
John-Schreiber 5****r@u****m 1
Magnus m****s@g****m 1
Pedro Perpétua 4****a@u****m 1
Stanislav Pankevich s****h@g****m 1
StaticRocket 3****t@u****m 1
Surya U s****7@g****m 1
Tobias Happ t****p@g****e 1
Wes Lord w****d@g****m 1
Yu Zhenkun y****2@h****m 1
Zeth Weissman 4****n@u****m 1
and 4 more...

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 99
  • Total pull requests: 37
  • Average time to close issues: 4 months
  • Average time to close pull requests: 4 months
  • Total issue authors: 97
  • Total pull request authors: 31
  • Average comments per issue: 3.64
  • Average comments per pull request: 0.81
  • Merged pull requests: 11
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 8
  • Pull requests: 9
  • Average time to close issues: about 1 month
  • Average time to close pull requests: N/A
  • Issue authors: 8
  • Pull request authors: 7
  • Average comments per issue: 0.38
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • camipozas (2)
  • 2V3EvG4LMJFdRe (2)
  • vinhtq115 (1)
  • bvilmann (1)
  • darixsamani (1)
  • qazwsx74269 (1)
  • DarioBernardo (1)
  • scaffarelli (1)
  • deepanshuagarwal150 (1)
  • Ailibert (1)
  • arjun251 (1)
  • rookiexiao123 (1)
  • jonathan-s (1)
  • Crispisu (1)
  • rushabh-wadkar (1)
Pull Request Authors
  • Belval (4)
  • John-Schreiber (2)
  • musicinmybrain (2)
  • NYF-BRICK (2)
  • magnurud (2)
  • pepijnolivier (2)
  • ankitpt (2)
  • cedk (2)
  • qazwsx74269 (2)
  • danielerizzoarchivi (2)
  • zweissman (2)
  • bmihelac (2)
  • bartfeenstra (2)
  • StaticRocket (1)
  • mnewls (1)
Top Labels
Issue Labels
enhancement (3) help wanted (2) need repro (2) documentation (1) windows (1) bug (1) need more info (1) feature request (1)

Packages

  • Total packages: 18
  • Total downloads:
    • pypi 6,428,375 last-month
  • Total docker downloads: 11,267,823
  • Total dependent packages: 200
    (may contain duplicates)
  • Total dependent repositories: 2,769
    (may contain duplicates)
  • Total versions: 117
  • Total maintainers: 3
pypi.org: pdf2image

A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.

  • Versions: 46
  • Dependent Packages: 195
  • Dependent Repositories: 2,760
  • Downloads: 6,425,940 Last month
  • Docker Downloads: 11,267,591
Rankings
Dependent packages count: 0.2%
Dependent repos count: 0.2%
Downloads: 0.3%
Docker downloads count: 0.6%
Average: 1.2%
Stargazers count: 1.8%
Forks count: 3.8%
Maintainers (1)
Last synced: 6 months ago
alpine-v3.18: py3-pdf2image-pyc

Precompiled Python bytecode for py3-pdf2image

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 3.8%
Stargazers count: 7.1%
Forks count: 8.1%
Maintainers (1)
Last synced: 6 months ago
alpine-v3.18: py3-pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

  • Versions: 1
  • Dependent Packages: 1
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 3.8%
Stargazers count: 7.1%
Forks count: 8.1%
Maintainers (1)
Last synced: 6 months ago
pypi.org: pdf2img

A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 3
  • Downloads: 2,435 Last month
  • Docker Downloads: 232
Rankings
Stargazers count: 1.8%
Docker downloads count: 2.4%
Forks count: 3.8%
Average: 5.8%
Downloads: 7.5%
Dependent repos count: 9.0%
Dependent packages count: 10.1%
Maintainers (1)
Last synced: 6 months ago
proxy.golang.org: github.com/Belval/pdf2image
  • Versions: 18
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Stargazers count: 1.8%
Forks count: 2.1%
Average: 5.9%
Dependent packages count: 8.9%
Dependent repos count: 10.7%
Last synced: 6 months ago
proxy.golang.org: github.com/belval/pdf2image
  • Versions: 18
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Stargazers count: 1.8%
Forks count: 2.1%
Average: 5.9%
Dependent packages count: 8.9%
Dependent repos count: 10.7%
Last synced: 6 months ago
alpine-edge: py3-pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

  • Versions: 6
  • Dependent Packages: 1
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Average: 6.0%
Dependent packages count: 6.0%
Stargazers count: 8.6%
Forks count: 9.2%
Maintainers (1)
Last synced: 6 months ago
alpine-edge: py3-pdf2image-pyc

Precompiled Python bytecode for py3-pdf2image

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Average: 7.9%
Stargazers count: 8.8%
Forks count: 9.4%
Dependent packages count: 13.4%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: pdf2image
  • Versions: 11
  • Dependent Packages: 3
  • Dependent Repositories: 6
Rankings
Stargazers count: 11.6%
Average: 13.8%
Dependent repos count: 13.9%
Forks count: 14.0%
Dependent packages count: 15.6%
Last synced: 6 months ago
anaconda.org: pdf2image

A python module that wraps pdftoppm and pdftocairo to convert PDF to a PIL Image object

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 46.9%
Average: 49.2%
Dependent repos count: 51.5%
Last synced: 6 months ago
alpine-v3.19: py3-pdf2image-pyc

Precompiled Python bytecode for py3-pdf2image

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 100%
Maintainers (1)
Last synced: 6 months ago
alpine-v3.22: py3-pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 100%
Maintainers (1)
Last synced: 6 months ago
alpine-v3.20: py3-pdf2image-pyc

Precompiled Python bytecode for py3-pdf2image

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 100%
Maintainers (1)
Last synced: 6 months ago
alpine-v3.21: py3-pdf2image-pyc

Precompiled Python bytecode for py3-pdf2image

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 100%
Maintainers (1)
Last synced: 6 months ago
alpine-v3.22: py3-pdf2image-pyc

Precompiled Python bytecode for py3-pdf2image

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 100%
Maintainers (1)
Last synced: 6 months ago
alpine-v3.21: py3-pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 100%
Maintainers (1)
Last synced: 6 months ago
alpine-v3.20: py3-pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 100%
Maintainers (1)
Last synced: 6 months ago
alpine-v3.19: py3-pdf2image

A python module that wraps the pdftoppm utility to convert PDF to PIL Image object

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 100%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • pillow *
.github/workflows/documentation.yml actions
  • actions/checkout v3 composite
  • peaceiris/actions-gh-pages v3 composite
.github/workflows/release.yml actions
  • actions/checkout v3 composite
  • pypa/gh-action-pypi-publish release/v1 composite
docs/requirements.txt pypi
  • Sphinx ==5.1.
  • nbsphinx ==0.8.
  • recommonmark *
  • sphinx-argparse *
  • sphinx-rtd-theme ==1.0.0
  • sphinx-rtd-theme ==1.0.
  • sphinxcontrib-applehelp ==1.0.
  • sphinxcontrib-devhelp ==1.0.
  • sphinxcontrib-htmlhelp ==2.0.
  • sphinxcontrib-jsmath ==1.0.
  • sphinxcontrib-qthelp ==1.0.
  • sphinxcontrib-serializinghtml ==1.1.