amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.

https://github.com/aws-samples/amazon-textract-textractor

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.5%) to scientific vocabulary

Keywords

amazon-textract

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: aws-samples
License: apache-2.0
Language: Jupyter Notebook
Default Branch: master
Homepage:
Size: 192 MB

Statistics

Stars: 458
Watchers: 18
Forks: 159
Open Issues: 109
Releases: 50

Topics

amazon-textract

Created about 7 years ago · Last pushed about 1 year ago

Metadata Files

Readme Contributing License Code of conduct Citation

Textractor is a Python package created to work seamlessly with Amazon Textract, a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or a complex distributed document processing pipeline, Textractor makes it easy to use Textract.

If you are looking for the other amazon-textract-* packages, you can find them using the links below:

amazon-textract-caller (to simplify calling Amazon Textract without additional dependencies)
amazon-textract-response-parser (to parse the JSON response returned by Textract APIs)
amazon-textract-overlayer (to draw bounding boxes around the document entities on the document image)
amazon-textract-prettyprinter (to convert Amazon Textract responses to CSV, text, markdown, ...)
amazon-textract-geofinder (to extract specific information from a document with methods that help navigate the document using geometry and relations, e.g. hierarchical key/value pairs)

Installation

Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default, this installs the minimal version of Textractor, which is suitable for AWS Lambda execution. The following extras can be used to add features:

pandas (pip install "amazon-textract-textractor[pandas]") installs pandas, which is used to enable DataFrame and CSV exports.
pdfium (pip install "amazon-textract-textractor[pdfium]") includes pypdfium2 and is the recommended way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
pdf (pip install "amazon-textract-textractor[pdf]") includes pdf2image and is an alternative way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
torch (pip install "amazon-textract-textractor[torch]") includes sentence_transformers for better word search and matching. This works on CPU but is noticeably slower than approaches that do not use machine learning.
dev (pip install "amazon-textract-textractor[dev]") includes all the dependencies above and everything else needed to test the code.

You can pick several extras by separating the labels with commas, like this: pip install "amazon-textract-textractor[pdf,torch]".
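To see which optional extras are usable in the current environment, one option is to probe for the module that each extra is described as installing. The sketch below is not part of Textractor: the `EXTRA_MODULES` mapping and the `available_extras` helper are hypothetical names, and the module names are taken from the descriptions above.

```python
from importlib.util import find_spec

# Hypothetical mapping: extra name -> module the extra is described as installing.
EXTRA_MODULES = {
    "pandas": "pandas",
    "pdfium": "pypdfium2",
    "pdf": "pdf2image",
    "torch": "sentence_transformers",
}


def available_extras(modules=EXTRA_MODULES):
    """Return {extra: bool} telling whether each extra's module can be found."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"{extra}: {'installed' if present else 'missing'}")
```

The check only looks up the module and does not import it, so it stays cheap even when heavy packages such as sentence_transformers are installed.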

Documentation

Generated documentation for the latest released version can be accessed here: aws-samples.github.io/amazon-textract-textractor/

Examples

While a collection of simple examples is presented here, the documentation has a much larger collection of examples with specific case studies that will help you get started.

Setup

These two lines are all you need to use Textract. The Textractor instance can be reused across multiple requests, both synchronous and asynchronous.

```py
from textractor import Textractor

extractor = Textractor(profile_name="default")
```
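Because one instance can serve many requests, a batch job can create the extractor once and pass it to a helper. The following is a minimal sketch that relies only on the `detect_document_text(file_source=...)` call and the `lines` attribute of the returned document; `collect_lines` is a hypothetical helper name, not part of the library.

```python
def collect_lines(extractor, file_sources):
    """Run text detection on every source with one shared extractor.

    `extractor` is expected to behave like a Textractor instance, i.e. expose
    detect_document_text(file_source=...) returning an object with `.lines`.
    """
    results = {}
    for source in file_sources:
        document = extractor.detect_document_text(file_source=source)
        # Convert each line to str so the result is plain data.
        results[source] = [str(line) for line in document.lines]
    return results


# Usage with the real library (requires AWS credentials and incurs charges):
#     from textractor import Textractor
#     extractor = Textractor(profile_name="default")
#     collect_lines(extractor, ["tests/fixtures/single-page-1.png"])
```

Passing the extractor in as an argument also makes the helper easy to exercise with a stub object, without calling the Textract API.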

Text recognition

```py
# file_source can be an image, list of images, bytes or S3 path
document = extractor.detect_document_text(file_source="tests/fixtures/single-page-1.png")
print(document.lines)
# [Textractor Test, Document, Page (1), Key - Values, Name of package: Textractor, Date : 08/14/2022, Table 1, Cell 1, Cell 2, Cell 4, Cell 5, Cell 6, Cell 7, Cell 8, Cell 9, Cell 10, Cell 11, Cell 12, Cell 13, Cell 14, Cell 15, Selection Element, Selected Checkbox, Un-Selected Checkbox]
```

Table extraction

```py
from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
    file_source="tests/fixtures/form.png",
    features=[TextractFeatures.TABLES]
)
# Saves the table in an Excel document for further processing
document.tables[0].to_excel("output.xlsx")
```

Form extraction

```py
from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
    file_source="tests/fixtures/form.png",
    features=[TextractFeatures.FORMS]
)
# Use document.get() to search for a key with fuzzy matching
document.get("email")
# [E-mail Address : johndoe@gmail.com]
```
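To give an intuition for what fuzzy key lookup means, the sketch below ranks form keys by string similarity to a query using Python's standard `difflib`. It is an illustration only and not Textractor's implementation (the dependency list suggests the library relies on `editdistance`); the `fuzzy_get` helper, the sample "Phone Number" field, and the 0.6 threshold are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def _normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def fuzzy_get(fields, query, threshold=0.6):
    """Return (key, value) pairs whose key contains a word similar to `query`."""
    target = _normalize(query)
    matches = []
    for key, value in fields.items():
        words = [_normalize(word) for word in key.split()]
        # Score a key by its best-matching word; empty words are skipped.
        score = max(
            (SequenceMatcher(None, target, word).ratio() for word in words if word),
            default=0.0,
        )
        if score >= threshold:
            matches.append((score, key, value))
    # Best match first.
    matches.sort(key=lambda item: item[0], reverse=True)
    return [(key, value) for _, key, value in matches]


# Made-up form data for illustration.
form = {
    "E-mail Address :": "johndoe@gmail.com",
    "Phone Number :": "555-0100",
}
print(fuzzy_get(form, "email"))
```

Normalizing both sides before comparing is what lets the query "email" match the key "E-mail Address :" despite the hyphen, capitalization, and trailing colon.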

Analyze ID

```py
document = extractor.analyze_id(file_source="tests/fixtures/fake_id.png")
print(document.identity_documents[0].get("FIRST_NAME"))
# 'MARIA'
```

Receipt processing (Analyze Expense)

```py
document = extractor.analyze_expense(file_source="tests/fixtures/receipt.jpg")
print(document.expense_documents[0].summary_fields.get("TOTAL")[0].text)
# '$1810.46'
```

If your use case was not covered here or if you are looking for asynchronous usage examples, see our collection of examples.

CLI

Textractor also comes with the textractor script, which supports calling, printing and overlaying directly in the terminal.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay TABLES

See the documentation for more examples.
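When the CLI needs to be driven from a script, the same command can be assembled as an argument list and handed to `subprocess`. This sketch reuses only the subcommand and flags from the example above; `build_command` is a hypothetical helper, and the actual call is left commented out because it contacts the billed Textract API.

```python
import shlex


def build_command(input_path, output_path, features, overlay):
    """Assemble the `textractor analyze-document` call shown above as a list."""
    return [
        "textractor", "analyze-document", input_path, output_path,
        "--features", features,
        "--overlay", overlay,
    ]


command = build_command("tests/fixtures/amzn_q2.png", "output.json", "TABLES", "TABLES")
print(shlex.join(command))

# To actually run it (requires the CLI on PATH and AWS credentials):
#     import subprocess
#     subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.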

Tests

The package comes with tests that call the production Textract APIs. Running the tests will incur charges to your AWS account.

Acknowledgements

This library was made possible by the work of Srividhya Radhakrishna (@srividh-r).

Contributing

See CONTRIBUTING.md

Citing

Textractor can be cited using:

@software{amazontextractor,
  author = {Belval, Edouard and Delteil, Thomas and Schade, Martin and Radhakrishna, Srividhya},
  title = {{Amazon Textractor}},
  url = {https://github.com/aws-samples/amazon-textract-textractor},
  version = {1.9.2},
  year = {2025}
}

Or using the CITATION.cff file.

License

This library is licensed under the Apache 2.0 License.

Excavator image by macrovector on Freepik

Owner

Name: AWS Samples
Login: aws-samples
Kind: organization

Website: https://amazon.com/aws
Repositories: 6,789
Profile: https://github.com/aws-samples

GitHub Events

Total

Create event: 11
Release event: 4
Issues event: 24
Watch event: 62
Delete event: 5
Issue comment event: 63
Push event: 27
Pull request review event: 5
Pull request event: 28
Fork event: 18

Last Year

Create event: 11
Release event: 4
Issues event: 24
Watch event: 62
Delete event: 5
Issue comment event: 63
Push event: 27
Pull request review event: 5
Pull request event: 28
Fork event: 18

Committers

Last synced: over 3 years ago

All Time

Total Commits: 292
Total Committers: 22
Avg Commits per committer: 13.273
Development Distribution Score (DDS): 0.531

Top Committers

Name	Email	Commits
schadem	4**m@u**m	137
Edouard Belval	b**e@a**m	82
Tobias Bruckert	6**2@u**m	20
dependabot[bot]	4**]@u**m	9
James Siri	j**i@a**m	7
Thomas	t**l@a**m	6
darwaishx	k**n@o**m	6
RichardScottOZ	7**Z@u**m	5
Simran Singh	s**j@a**m	4
robot	r**t@e**m	3
Thomas Delteil	t**1@g**m	2
Konstantinos Kourmousis	3**s@u**m	1
Dhawalkumar Patel	d**p@a**m	1
Edouard Belval	e**d@b**g	1
Mike Biddlecombe	m**e@k**m	1
Rudolfs Berzins	r**e@g**m	1
Michael Hsieh	m**2@g**m	1
Roy wu	y**w@l**m	1
darwaishx	k**i@a**m	1
janahang	1**g@u**m	1
Lana Zhang	l**z@a**m	1
irbian	3**n@u**m	1
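The two derived figures under All Time can be reproduced from the table. The sketch below assumes that the Development Distribution Score is one minus the top committer's share of all commits; that definition is an assumption, though it is consistent with the numbers shown here.

```python
# Commit counts copied from the Top Committers table, in table order.
commits = [137, 82, 20, 9, 7, 6, 6, 5, 4, 3, 2] + [1] * 11

total_commits = sum(commits)        # matches "Total Commits"
total_committers = len(commits)     # matches "Total Committers"
average = total_commits / total_committers

# Assumed definition: 1 - (top committer's commits / total commits).
dds = 1 - max(commits) / total_commits

print(f"Avg commits per committer: {average:.3f}")
print(f"DDS: {dds:.3f}")
```

Under this reading, a score near 0 means one person wrote almost everything, while a score near 1 means commits are spread across many contributors.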

Committer Domains (Top 20 + Academic)

amazon.com: 7
genmab.com: 1
koananalytics.com: 1
belval.org: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 185
Total pull requests: 154
Average time to close issues: 5 months
Average time to close pull requests: 11 days
Total issue authors: 94
Total pull request authors: 39
Average comments per issue: 1.55
Average comments per pull request: 0.49
Merged pull requests: 127
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 20
Pull requests: 27
Average time to close issues: about 1 month
Average time to close pull requests: 18 days
Issue authors: 18
Pull request authors: 14
Average comments per issue: 0.95
Average comments per pull request: 0.33
Merged pull requests: 14
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

schadem (17)
Belval (16)
ThomasDelteil (16)
bvbg1 (8)
tb102122 (6)
athewsey (5)
arsher-b (5)
ttruong-gilead (4)
dannellyz (3)
oonisim (3)
red-sky17 (3)
rasrivid (3)
aka-rabbi-inv (2)
ccrosland (2)
rnschmidt (2)

Pull Request Authors

Belval (95)
schadem (18)
tb102122 (10)
anjanvb (8)
ThomasDelteil (6)
Chuukwudi (4)
grantrosse (4)
simonschmidt (2)
mdscruggs (2)
neil-sola (2)
dzmitry-kankalovich (2)
k-agau (2)
BPDanek (2)
athewsey (2)
akhilnarayanan1 (2)

Top Labels

Issue Labels

bug (18)
enhancement (16)
need repro (4)
documentation (3)
chore (1)
question (1)

Pull Request Labels

pretty-printer (4)

Packages

Total packages: 8
Total downloads:
- pypi 2,963,903 last-month

Total dependent packages: 29
(may contain duplicates)
Total dependent repositories: 71
(may contain duplicates)
Total versions: 171
Total maintainers: 4

pypi.org: amazon-textract-caller

Amazon Textract Caller tools

Homepage: https://github.com/aws-samples/amazon-textract-textractor/tree/master/caller
Documentation: https://amazon-textract-caller.readthedocs.io/
License: Apache License Version 2.0
Latest release: 0.2.4
published about 2 years ago

Versions: 29
Dependent Packages: 22
Dependent Repositories: 60
Downloads: 1,431,450 Last month

Rankings

Dependent packages count: 0.5%

Downloads: 1.5%

Dependent repos count: 1.9%

Average: 2.4%

Stargazers count: 3.6%

Forks count: 4.3%

Maintainers (4)

Belval rekognition-textract-demos schadem kmascar

Last synced: 10 months ago

pypi.org: amazon-textract-prettyprinter

Amazon Textract Helper tools for pretty printing

Homepage: https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter
Documentation: https://amazon-textract-prettyprinter.readthedocs.io/
License: Apache License Version 2.0
Latest release: 0.1.10
published about 2 years ago

Versions: 23
Dependent Packages: 2
Dependent Repositories: 5
Downloads: 48,403 Last month

Rankings

Downloads: 2.3%

Stargazers count: 3.6%

Average: 4.3%

Forks count: 4.3%

Dependent packages count: 4.8%

Dependent repos count: 6.6%

Maintainers (4)

Belval rekognition-textract-demos schadem kmascar

Last synced: 10 months ago

pypi.org: amazon-textract-pipeline-pagedimensions

Amazon Textract Pipeline Component to add page dimensions to page block types

Homepage: https://github.com/aws-samples/amazon-textract-textractor/tree/master/tpipelinepagedimensions
Documentation: https://amazon-textract-pipeline-pagedimensions.readthedocs.io/
License: Apache License Version 2.0
Latest release: 0.0.9
published over 2 years ago

Versions: 9
Dependent Packages: 1
Dependent Repositories: 2
Downloads: 2,186 Last month

Rankings

Downloads: 3.0%

Stargazers count: 3.6%

Forks count: 4.3%

Dependent packages count: 4.8%

Average: 5.5%

Dependent repos count: 11.5%

Maintainers (4)

Belval rekognition-textract-demos schadem kmascar

Last synced: 10 months ago

pypi.org: amazon-textract-textractor

A package to use AWS Textract services.

Homepage: https://github.com/aws-samples/amazon-textract-textractor
Documentation: https://amazon-textract-textractor.readthedocs.io/
License: Apache 2.0
Latest release: 1.9.2
published about 1 year ago

Versions: 69
Dependent Packages: 3
Dependent Repositories: 1
Downloads: 1,372,174 Last month

Rankings

Downloads: 2.0%

Dependent packages count: 2.4%

Stargazers count: 3.8%

Forks count: 4.4%

Average: 6.8%

Dependent repos count: 21.6%

Maintainers (4)

Belval rekognition-textract-demos schadem kmascar

Last synced: 10 months ago

pypi.org: amazon-textract-overlayer

Amazon Textract Overlay tools

Homepage: https://github.com/aws-samples/amazon-textract-textractor/tree/master/overlayer
Documentation: https://amazon-textract-overlayer.readthedocs.io/
License: Apache License Version 2.0
Latest release: 0.0.12
published over 2 years ago

Versions: 9
Dependent Packages: 1
Dependent Repositories: 1
Downloads: 6,227 Last month

Rankings

Downloads: 3.5%

Stargazers count: 3.6%

Forks count: 4.3%

Dependent packages count: 4.8%

Average: 7.6%

Dependent repos count: 21.5%

Maintainers (4)

Belval rekognition-textract-demos schadem kmascar

Last synced: 10 months ago

pypi.org: amazon-textract-helper

Amazon Textract Helper tools

Homepage: https://github.com/aws-samples/amazon-textract-textractor/tree/master/helper
Documentation: https://amazon-textract-helper.readthedocs.io/
License: Apache License Version 2.0
Latest release: 0.0.35
published over 2 years ago

Versions: 23
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 6,429 Last month

Rankings

Downloads: 3.6%

Stargazers count: 3.6%

Forks count: 4.3%

Average: 8.7%

Dependent packages count: 10.1%

Dependent repos count: 21.5%

Maintainers (4)

Belval rekognition-textract-demos schadem kmascar

Last synced: 10 months ago

pypi.org: amazon-textract-geofinder

Amazon Textract package to easier access data through geometric information

Homepage: https://github.com/aws-samples/amazon-textract-textractor/tpipelinegeofinder
Documentation: https://amazon-textract-geofinder.readthedocs.io/
License: Apache License Version 2.0
Latest release: 0.0.8
published over 2 years ago

Versions: 8
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 43,223 Last month

Rankings

Stargazers count: 3.6%

Forks count: 4.3%

Downloads: 6.9%

Average: 9.3%

Dependent packages count: 10.1%

Dependent repos count: 21.5%

Maintainers (4)

Belval rekognition-textract-demos schadem kmascar

Last synced: 10 months ago

pypi.org: amazon-textract-idp-cdk-manifest

Amazon Textract IDP CDK Manifest

Homepage: https://github.com/aws-samples/amazon-textract-textractor/tree/master/idp_cdk_manifest
Documentation: https://amazon-textract-idp-cdk-manifest.readthedocs.io/
License: Apache License Version 2.0
Latest release: 0.0.1
published over 3 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 53,811 Last month

Rankings

Downloads: 1.9%

Stargazers count: 4.0%

Forks count: 4.6%

Dependent packages count: 6.6%

Average: 9.6%

Dependent repos count: 30.6%

Maintainers (3)

Belval rekognition-textract-demos kmascar

Last synced: 10 months ago

Dependencies

.github/workflows/documentation.yml actions

actions/cache v2 composite
actions/checkout v3 composite
peaceiris/actions-gh-pages v3 composite

.github/workflows/lambda_layers.yml actions

actions/checkout v3 composite
actions/upload-artifact v3 composite

.github/workflows/release.yml actions

actions/cache v2 composite
actions/checkout v3 composite
pypa/gh-action-pypi-publish release/v1 composite

.github/workflows/test-pr-caller.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
aws-actions/configure-aws-credentials v1-node16 composite

.github/workflows/test-pr-geofinder.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/test-pr-prettyprinter.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/tests.yml actions

actions/cache v2 composite
actions/checkout v3 composite

requirements.txt pypi

Pillow *
XlsxWriter ==3.0.
amazon-textract-caller ==0.0.27
amazon-textract-response-parser ==0.1.37
editdistance ==0.6.2
jsonschema *
tabulate ==0.8.

caller/setup.py pypi

helper/setup.py pypi

idp_cdk_manifest/setup.py pypi

overlayer/setup.py pypi

prettyprinter/setup.py pypi

setup.py pypi

tpipelinegeofinder/setup.py pypi

tpipelinepagedimensions/setup.py pypi

.github/workflows/release-caller.yml actions

actions/cache v3 composite
actions/checkout v3 composite
pypa/gh-action-pypi-publish release/v1 composite
