pdf2image
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.0%) to scientific vocabulary
Keywords
Repository
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
Basic Info
Statistics
- Stars: 1,855
- Watchers: 18
- Forks: 211
- Open Issues: 82
- Releases: 18
Topics
Metadata Files
README.md
pdf2image
A Python (3.7+) module that wraps pdftoppm and pdftocairo to convert PDFs to PIL Image objects
How to install
pip install pdf2image
Windows
Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the `bin/` folder to PATH or pass `poppler_path=r"C:\path\to\poppler-xx\bin"` as an argument to `convert_from_path`.
Mac
Mac users will have to install poppler.
Installing using Brew:
brew install poppler
Linux
Most distros ship with `pdftoppm` and `pdftocairo`. If they are not installed, use your package manager to install `poppler-utils`.
Platform-independent (using conda)
- Install poppler: `conda install -c conda-forge poppler`
- Install pdf2image: `pip install pdf2image`
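Every install route above comes down to making `pdftoppm` and `pdftocairo` reachable, so a quick preflight check can save some debugging. The sketch below uses only the standard library. The helper name `find_poppler_tools` is hypothetical and is not part of pdf2image.

```python
import shutil


def find_poppler_tools(poppler_path=None):
    """Return a dict mapping each poppler tool name to its resolved path, or None.

    poppler_path: optional directory to search instead of the PATH
    environment variable (the same directory you would hand to pdf2image).
    """
    tools = ("pdftoppm", "pdftocairo")
    return {tool: shutil.which(tool, path=poppler_path) for tool in tools}


if __name__ == "__main__":
    # Report what is visible on the current PATH.
    for tool, location in find_poppler_tools().items():
        print(f"{tool}: {location or 'not found'}")
```

If either entry comes back as `None`, the binaries are probably not on PATH yet, or the directory passed in is not poppler's `bin/` folder.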
How does it work?
```py
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)
```
Then simply do:
```py
images = convert_from_path('/home/belval/example.pdf')
```
OR
```py
# Read the PDF bytes and close the file handle before converting
with open('/home/belval/example.pdf', 'rb') as f:
    images = convert_from_bytes(f.read())
```
OR better yet
```py
import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
```
`images` will be a list of PIL Image objects, one per page of the PDF document.
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600, hide_attributes=False)
What's new?
- Allow users to hide attributes when using pdftoppm with `hide_attributes` (Thank you @StaticRocket)
- Fix console opening on Windows (Thank you @OhMyAgnes!)
- Add `timeout` parameter which raises `PDFPopplerTimeoutError` after the given number of seconds.
- Add `use_pdftocairo` parameter which forces `pdf2image` to use `pdftocairo`. Should improve performance.
- Fixed a bug where using `pdf2image` with multiple threads (but not multiple processes) would cause an exception
- `jpegopt` parameter allows for tuning of the output JPEG when using `fmt="jpeg"` (`-jpegopt` in pdftoppm CLI) (Thank you @abieler)
- `pdfinfo_from_path` and `pdfinfo_from_bytes` which expose the output of the pdfinfo CLI
- `paths_only` parameter will return image paths instead of Image objects, to prevent OOM when converting a big PDF
- `size` parameter allows you to define the shape of the resulting images (`-scale-to` in pdftoppm CLI)
  - `size=400` will fit the image to a 400x400 box, preserving aspect ratio
  - `size=(400, None)` will make the image 400 pixels wide, preserving aspect ratio
  - `size=(500, 500)` will resize the image to 500x500 pixels, not preserving aspect ratio
- `grayscale` parameter allows you to convert images to grayscale (`-gray` in pdftoppm CLI)
- `single_file` parameter allows you to convert the first PDF page only, without adding digits at the end of the `output_file`
- Allow the user to specify poppler's installation path with `poppler_path`
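The three `size` forms listed above are easier to follow with numbers. The helper below is a hypothetical illustration (`expected_size` is not a pdf2image function) that estimates the output dimensions from the rules as the README describes them. Poppler's own rounding may differ by a pixel, and the `(None, height)` case is assumed to mirror `(width, None)`.

```python
def expected_size(page_width, page_height, size):
    """Estimate the (width, height) in pixels produced by a given size argument.

    page_width and page_height are the unscaled page dimensions in pixels.
    """
    if isinstance(size, int):
        # A single integer fits the page inside a size x size box,
        # preserving aspect ratio.
        scale = size / max(page_width, page_height)
        return round(page_width * scale), round(page_height * scale)

    width, height = size
    if width is not None and height is None:
        # Fixed width, height follows the aspect ratio.
        return width, round(page_height * width / page_width)
    if width is None and height is not None:
        # Assumed symmetric case: fixed height, width follows the aspect ratio.
        return round(page_width * height / page_height), height
    if width is None and height is None:
        # Nothing requested: keep the original dimensions.
        return page_width, page_height
    # Both given: exact dimensions, aspect ratio not preserved.
    return width, height


# A US Letter page rendered at 200 DPI is 1700 x 2200 pixels.
print(expected_size(1700, 2200, 400))
print(expected_size(1700, 2200, (400, None)))
print(expected_size(1700, 2200, (500, 500)))
```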
Performance tips
- Using an output folder is significantly faster if you are using an SSD. Otherwise i/o usually becomes the bottleneck.
- Using multiple threads can give you some gains but avoid more than 4 as this will cause i/o bottleneck (even on my NVMe SSD!).
- If i/o is your bottleneck, using the JPEG format can lead to significant gains.
- PNG format is pretty slow because of the compression.
- If you want to know the best settings (most settings will be fine anyway), you can clone the project and run `python tests.py` to get timings.
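As a rough sketch, the tips above can be bundled into one set of keyword arguments: an output folder, JPEG output, and at most four threads. The helper names `tuned_kwargs` and `convert_fast` are hypothetical. The parameters themselves (`output_folder`, `fmt`, `thread_count`, `paths_only`) come from the signatures shown earlier. Whether these settings actually help depends on your disk and your PDFs.

```python
import tempfile


def tuned_kwargs(output_folder, thread_count=4):
    """Build convert_from_path keyword arguments that follow the tips above."""
    return {
        "output_folder": output_folder,  # write pages to disk instead of RAM
        "fmt": "jpeg",  # smaller files than PPM, faster than PNG
        "thread_count": max(1, min(thread_count, 4)),  # cap at 4 threads
        "paths_only": True,  # return file paths, not Image objects
    }


def convert_fast(pdf_path):
    """Convert a PDF with the tuned settings and return the page count."""
    # Imported lazily so tuned_kwargs can be inspected without pdf2image installed.
    from pdf2image import convert_from_path

    with tempfile.TemporaryDirectory() as folder:
        paths = convert_from_path(pdf_path, **tuned_kwargs(folder))
        # The files disappear when the temporary directory is cleaned up,
        # so any processing of the pages has to happen inside this block.
        return len(paths)
```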
Limitations / known issues
- A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder)
- Sometimes fails to read PDFs signed using DocuSign; see "Solution for DocuSign issue".
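Besides an output folder, one way to work around the memory limitation is to convert a few pages at a time using `first_page` and `last_page`. The sketch below is an illustration, not the library's recommended approach. `page_ranges` and `convert_in_chunks` are hypothetical names, and the code assumes the dictionary returned by `pdfinfo_from_path` exposes the page count under the `"Pages"` key.

```python
def page_ranges(total_pages, chunk_size):
    """Yield inclusive, 1-based (first_page, last_page) pairs covering the document."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def convert_in_chunks(pdf_path, chunk_size=10, dpi=200):
    """Yield PIL images one at a time, holding at most chunk_size pages in memory."""
    # Imported lazily so page_ranges can be used without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    # Assumption: the pdfinfo output contains an integer "Pages" entry.
    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_ranges(total_pages, chunk_size):
        yield from convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
```

Each chunk starts a new poppler process, so a very small `chunk_size` trades speed for memory.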
Owner
- Name: Edouard Belval
- Login: Belval
- Kind: user
- Location: Canada
- Company: Amazon Web Services
- Repositories: 47
- Profile: https://github.com/Belval
Sr Research Engineer@Amazon
GitHub Events
Total
- Issues event: 7
- Watch event: 225
- Issue comment event: 11
- Pull request event: 3
- Fork event: 12
Last Year
- Issues event: 7
- Watch event: 225
- Issue comment event: 11
- Pull request event: 3
- Fork event: 12
Committers
Last synced: 6 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Edouard Belval | e****d@b****g | 117 |
| Edouard Belval | g****b@b****g | 34 |
| Martin Münch | m****h@g****m | 4 |
| Plat | p****1@g****m | 3 |
| Shipra Srivastava | c****a@g****m | 3 |
| Edouard Belval | e****l@e****m | 2 |
| josephernest | n****n@g****m | 2 |
| Akash | a****h@a****m | 1 |
| Andre Bieler | a****r@u****m | 1 |
| Ankit Patel | a****3@g****m | 1 |
| Ben Beasley | c****e@m****t | 1 |
| Bob Du | i@b****c | 1 |
| Bojan Mihelac | b****c@m****g | 1 |
| Bruno Cabral | b****m@g****m | 1 |
| Camila Pozas | 8****s@u****m | 1 |
| Code0987 | d****j@g****m | 1 |
| Daniel Angelov | d****v@g****m | 1 |
| Fillipe Galiza | 5****z@u****m | 1 |
| Florian Demmer | f****r@g****m | 1 |
| Hugo van Kemenade | h****k@u****m | 1 |
| John-Schreiber | 5****r@u****m | 1 |
| Magnus | m****s@g****m | 1 |
| Pedro Perpétua | 4****a@u****m | 1 |
| Stanislav Pankevich | s****h@g****m | 1 |
| StaticRocket | 3****t@u****m | 1 |
| Surya U | s****7@g****m | 1 |
| Tobias Happ | t****p@g****e | 1 |
| Wes Lord | w****d@g****m | 1 |
| Yu Zhenkun | y****2@h****m | 1 |
| Zeth Weissman | 4****n@u****m | 1 |
| and 4 more... | | |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 99
- Total pull requests: 37
- Average time to close issues: 4 months
- Average time to close pull requests: 4 months
- Total issue authors: 97
- Total pull request authors: 31
- Average comments per issue: 3.64
- Average comments per pull request: 0.81
- Merged pull requests: 11
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 8
- Pull requests: 9
- Average time to close issues: about 1 month
- Average time to close pull requests: N/A
- Issue authors: 8
- Pull request authors: 7
- Average comments per issue: 0.38
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- camipozas (2)
- 2V3EvG4LMJFdRe (2)
- vinhtq115 (1)
- bvilmann (1)
- darixsamani (1)
- qazwsx74269 (1)
- DarioBernardo (1)
- scaffarelli (1)
- deepanshuagarwal150 (1)
- Ailibert (1)
- arjun251 (1)
- rookiexiao123 (1)
- jonathan-s (1)
- Crispisu (1)
- rushabh-wadkar (1)
Pull Request Authors
- Belval (4)
- John-Schreiber (2)
- musicinmybrain (2)
- NYF-BRICK (2)
- magnurud (2)
- pepijnolivier (2)
- ankitpt (2)
- cedk (2)
- qazwsx74269 (2)
- danielerizzoarchivi (2)
- zweissman (2)
- bmihelac (2)
- bartfeenstra (2)
- StaticRocket (1)
- mnewls (1)
Packages
- Total packages: 18
- Total downloads: pypi 6,428,375 last-month
- Total docker downloads: 11,267,823
- Total dependent packages: 200 (may contain duplicates)
- Total dependent repositories: 2,769 (may contain duplicates)
- Total versions: 117
- Total maintainers: 3
pypi.org: pdf2image
A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.
- Homepage: https://github.com/Belval/pdf2image
- Documentation: https://pdf2image.readthedocs.io/
- License: MIT
- Latest release: 1.17.0 (published about 2 years ago)
Rankings
Maintainers (1)
alpine-v3.18: py3-pdf2image-pyc
Precompiled Python bytecode for py3-pdf2image
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.16.3-r1 (published almost 3 years ago)
Rankings
Maintainers (1)
alpine-v3.18: py3-pdf2image
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.16.3-r1 (published almost 3 years ago)
Rankings
Maintainers (1)
pypi.org: pdf2img
A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.
- Homepage: https://github.com/Belval/pdf2image
- Documentation: https://pdf2img.readthedocs.io/
- License: MIT License
- Latest release: 0.1.2 (published almost 5 years ago)
Rankings
Maintainers (1)
proxy.golang.org: github.com/Belval/pdf2image
- Documentation: https://pkg.go.dev/github.com/Belval/pdf2image#section-documentation
- License: mit
- Latest release: v1.17.0 (published about 2 years ago)
Rankings
proxy.golang.org: github.com/belval/pdf2image
- Documentation: https://pkg.go.dev/github.com/belval/pdf2image#section-documentation
- License: mit
- Latest release: v1.17.0 (published about 2 years ago)
Rankings
alpine-edge: py3-pdf2image
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.17.0-r1 (published almost 2 years ago)
Rankings
Maintainers (1)
alpine-edge: py3-pdf2image-pyc
Precompiled Python bytecode for py3-pdf2image
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.17.0-r1 (published almost 2 years ago)
Rankings
Maintainers (1)
conda-forge.org: pdf2image
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.16.0 (published over 4 years ago)
Rankings
anaconda.org: pdf2image
A python module that wraps pdftoppm and pdftocairo to convert PDF to a PIL Image object
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.17.0 (published about 1 year ago)
Rankings
alpine-v3.19: py3-pdf2image-pyc
Precompiled Python bytecode for py3-pdf2image
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.16.3-r2 (published over 2 years ago)
Rankings
Maintainers (1)
alpine-v3.22: py3-pdf2image
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.17.0-r1 (published almost 2 years ago)
Rankings
Maintainers (1)
alpine-v3.20: py3-pdf2image-pyc
Precompiled Python bytecode for py3-pdf2image
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.17.0-r1 (published almost 2 years ago)
Rankings
Maintainers (1)
alpine-v3.21: py3-pdf2image-pyc
Precompiled Python bytecode for py3-pdf2image
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.17.0-r1 (published almost 2 years ago)
Rankings
Maintainers (1)
alpine-v3.22: py3-pdf2image-pyc
Precompiled Python bytecode for py3-pdf2image
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.17.0-r1 (published almost 2 years ago)
Rankings
Maintainers (1)
alpine-v3.21: py3-pdf2image
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.17.0-r1 (published almost 2 years ago)
Rankings
Maintainers (1)
alpine-v3.20: py3-pdf2image
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.17.0-r1 (published almost 2 years ago)
Rankings
Maintainers (1)
alpine-v3.19: py3-pdf2image
A python module that wraps the pdftoppm utility to convert PDF to PIL Image object
- Homepage: https://github.com/Belval/pdf2image
- License: MIT
- Latest release: 1.16.3-r2 (published over 2 years ago)
Rankings
Maintainers (1)
Dependencies
- pillow *
- actions/checkout v3 composite
- peaceiris/actions-gh-pages v3 composite
- actions/checkout v3 composite
- pypa/gh-action-pypi-publish release/v1 composite
- Sphinx ==5.1.
- nbsphinx ==0.8.
- recommonmark *
- sphinx-argparse *
- sphinx-rtd-theme ==1.0.0
- sphinx-rtd-theme ==1.0.
- sphinxcontrib-applehelp ==1.0.
- sphinxcontrib-devhelp ==1.0.
- sphinxcontrib-htmlhelp ==2.0.
- sphinxcontrib-jsmath ==1.0.
- sphinxcontrib-qthelp ==1.0.
- sphinxcontrib-serializinghtml ==1.1.