amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.

https://github.com/aws-samples/amazon-textract-textractor

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.5%) to scientific vocabulary

Keywords

amazon-textract
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: aws-samples
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage:
  • Size: 192 MB
Statistics
  • Stars: 458
  • Watchers: 18
  • Forks: 159
  • Open Issues: 109
  • Releases: 50
Topics
amazon-textract
Created almost 7 years ago · Last pushed 10 months ago
Metadata Files
Readme Contributing License Code of conduct Citation

README.md

Textractor

Textractor is a Python package created to work seamlessly with Amazon Textract, a document intelligence service offering text recognition, table extraction, form processing, and much more. Whether you are writing a one-off script or building a complex distributed document processing pipeline, Textractor makes it easy to use Textract.

If you are looking for the other amazon-textract-* packages, see the Packages section below.

Installation

Textractor is available on PyPI and can be installed with pip install amazon-textract-textractor. By default, this installs the minimal version of Textractor, which is suitable for AWS Lambda execution. The following extras can be used to add features:

  • pandas (pip install "amazon-textract-textractor[pandas]") installs pandas, which is used to enable DataFrame and CSV exports.
  • pdfium (pip install "amazon-textract-textractor[pdfium]") includes pypdfium2 and is the recommended way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
  • pdf (pip install "amazon-textract-textractor[pdf]") includes pdf2image and is an alternative way to enable PDF rasterization in Textractor. Note that this is not necessary to call Textract with a PDF file.
  • torch (pip install "amazon-textract-textractor[torch]") includes sentence_transformers for better word search and matching. This works on CPU but is noticeably slower than the non-machine-learning approaches.
  • dev (pip install "amazon-textract-textractor[dev]") includes all the dependencies above and everything else needed to test the code.

You can pick several extras by separating the labels with commas, like this: pip install "amazon-textract-textractor[pdf,torch]".
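To see which of these extras are present in the current environment, one option is to check whether the module each extra provides can be imported. The snippet below is a minimal sketch using only the standard library; the mapping and the helper name `available_extras` are illustrative and not part of Textractor, and the module names are the ones mentioned in the list above.

```python
from importlib.util import find_spec

# Extra label -> top-level module that the extra is described as installing.
# This mapping is an assumption based on the list above, not a Textractor API.
EXTRA_MODULES = {
    "pandas": "pandas",
    "pdfium": "pypdfium2",
    "pdf": "pdf2image",
    "torch": "sentence_transformers",
}


def available_extras(extra_modules=EXTRA_MODULES):
    """Return {extra: bool}, True when the extra's module can be found."""
    return {
        extra: find_spec(module) is not None
        for extra, module in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"{extra}: {'installed' if present else 'missing'}")
```

Because `find_spec` only locates a module without importing it, the check stays cheap even for heavy dependencies such as sentence_transformers.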

Documentation

Generated documentation for the latest released version can be accessed here: aws-samples.github.io/amazon-textract-textractor/

Examples

A few simple examples are presented here; the documentation has a much larger collection, including specific case studies, to help you get started.

Setup

These two lines are all you need to use Textract. The Textractor instance can be reused across multiple requests for both synchronous and asynchronous requests.

```py
from textractor import Textractor

extractor = Textractor(profile_name="default")
```

Text recognition

```py
# file_source can be an image, list of images, bytes or S3 path
document = extractor.detect_document_text(file_source="tests/fixtures/single-page-1.png")
print(document.lines)
# [Textractor Test, Document, Page (1), Key - Values, Name of package: Textractor, Date : 08/14/2022, Table 1, Cell 1, Cell 2, Cell 4, Cell 5, Cell 6, Cell 7, Cell 8, Cell 9, Cell 10, Cell 11, Cell 12, Cell 13, Cell 14, Cell 15, Selection Element, Selected Checkbox, Un-Selected Checkbox]
```

Table extraction

```py
from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
    file_source="tests/fixtures/form.png",
    features=[TextractFeatures.TABLES]
)
# Saves the table in an Excel document for further processing
document.tables[0].to_excel("output.xlsx")
```

Form extraction

```py
from textractor.data.constants import TextractFeatures

document = extractor.analyze_document(
    file_source="tests/fixtures/form.png",
    features=[TextractFeatures.FORMS]
)
# Use document.get() to search for a key with fuzzy matching
document.get("email")
# [E-mail Address : johndoe@gmail.com]
```

Analyze ID

```py
document = extractor.analyze_id(file_source="tests/fixtures/fake_id.png")
print(document.identity_documents[0].get("FIRST_NAME"))
# 'MARIA'
```

Receipt processing (Analyze Expense)

```py
document = extractor.analyze_expense(file_source="tests/fixtures/receipt.jpg")
print(document.expense_documents[0].summary_fields.get("TOTAL")[0].text)
# '$1810.46'
```

If your use case was not covered here or if you are looking for asynchronous usage examples, see our collection of examples.

CLI

Textractor also comes with the textractor script, which supports calling, printing and overlaying directly in the terminal.

textractor analyze-document tests/fixtures/amzn_q2.png output.json --features TABLES --overlay TABLES

See the documentation for more examples.

Tests

The package comes with tests that call the production Textract APIs. Running the tests will incur charges to your AWS account.

Acknowledgements

This library was made possible by the work of Srividhya Radhakrishna (@srividh-r).

Contributing

See CONTRIBUTING.md

Citing

Textractor can be cited using:

@software{amazontextractor,
  author = {Belval, Edouard and Delteil, Thomas and Schade, Martin and Radhakrishna, Srividhya},
  title = {{Amazon Textractor}},
  url = {https://github.com/aws-samples/amazon-textract-textractor},
  version = {1.9.2},
  year = {2025}
}

Or using the CITATION.cff file.

License

This library is licensed under the Apache 2.0 License.

Excavator image by macrovector on Freepik

Owner

  • Name: AWS Samples
  • Login: aws-samples
  • Kind: organization

GitHub Events

Total
  • Create event: 11
  • Release event: 4
  • Issues event: 24
  • Watch event: 62
  • Delete event: 5
  • Issue comment event: 63
  • Push event: 27
  • Pull request review event: 5
  • Pull request event: 28
  • Fork event: 18
Last Year
  • Create event: 11
  • Release event: 4
  • Issues event: 24
  • Watch event: 62
  • Delete event: 5
  • Issue comment event: 63
  • Push event: 27
  • Pull request review event: 5
  • Pull request event: 28
  • Fork event: 18

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 292
  • Total Committers: 22
  • Avg Commits per committer: 13.273
  • Development Distribution Score (DDS): 0.531
Top Committers
Name Email Commits
schadem 4****m@u****m 137
Edouard Belval b****e@a****m 82
Tobias Bruckert 6****2@u****m 20
dependabot[bot] 4****]@u****m 9
James Siri j****i@a****m 7
Thomas t****l@a****m 6
darwaishx k****n@o****m 6
RichardScottOZ 7****Z@u****m 5
Simran Singh s****j@a****m 4
robot r****t@e****m 3
Thomas Delteil t****1@g****m 2
Konstantinos Kourmousis 3****s@u****m 1
Dhawalkumar Patel d****p@a****m 1
Edouard Belval e****d@b****g 1
Mike Biddlecombe m****e@k****m 1
Rudolfs Berzins r****e@g****m 1
Michael Hsieh m****2@g****m 1
Roy wu y****w@l****m 1
darwaishx k****i@a****m 1
janahang 1****g@u****m 1
Lana Zhang l****z@a****m 1
irbian 3****n@u****m 1
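The summary figures above can be reproduced from the per-committer counts in the table. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, but it agrees with the reported value.

```python
# Commit counts per committer, copied from the "Top Committers" table above
# (eleven committers with more than one commit, eleven with exactly one).
commits = [137, 82, 20, 9, 7, 6, 6, 5, 4, 3, 2] + [1] * 11

total_commits = sum(commits)
total_committers = len(commits)
avg_commits = total_commits / total_committers

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits) / total_commits

print(total_commits, total_committers, round(avg_commits, 3), round(dds, 3))
# prints: 292 22 13.273 0.531
```

A score near 0 would mean a single committer wrote almost everything; 0.531 reflects that the top committer accounts for a little under half of the commits.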

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 185
  • Total pull requests: 154
  • Average time to close issues: 5 months
  • Average time to close pull requests: 11 days
  • Total issue authors: 94
  • Total pull request authors: 39
  • Average comments per issue: 1.55
  • Average comments per pull request: 0.49
  • Merged pull requests: 127
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 20
  • Pull requests: 27
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 18 days
  • Issue authors: 18
  • Pull request authors: 14
  • Average comments per issue: 0.95
  • Average comments per pull request: 0.33
  • Merged pull requests: 14
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • schadem (17)
  • Belval (16)
  • ThomasDelteil (16)
  • bvbg1 (8)
  • tb102122 (6)
  • athewsey (5)
  • arsher-b (5)
  • ttruong-gilead (4)
  • dannellyz (3)
  • oonisim (3)
  • red-sky17 (3)
  • rasrivid (3)
  • aka-rabbi-inv (2)
  • ccrosland (2)
  • rnschmidt (2)
Pull Request Authors
  • Belval (95)
  • schadem (18)
  • tb102122 (10)
  • anjanvb (8)
  • ThomasDelteil (6)
  • Chuukwudi (4)
  • grantrosse (4)
  • simonschmidt (2)
  • mdscruggs (2)
  • neil-sola (2)
  • dzmitry-kankalovich (2)
  • k-agau (2)
  • BPDanek (2)
  • athewsey (2)
  • akhilnarayanan1 (2)
Top Labels
Issue Labels
bug (18), enhancement (16), need repro (4), documentation (3), chore (1), question (1)
Pull Request Labels
pretty-printer (4)

Packages

  • Total packages: 8
  • Total downloads:
    • pypi 2,963,903 last-month
  • Total dependent packages: 29
    (may contain duplicates)
  • Total dependent repositories: 71
    (may contain duplicates)
  • Total versions: 171
  • Total maintainers: 4
pypi.org: amazon-textract-caller

Amazon Textract Caller tools

  • Versions: 29
  • Dependent Packages: 22
  • Dependent Repositories: 60
  • Downloads: 1,431,450 Last month
Rankings
Dependent packages count: 0.5%
Downloads: 1.5%
Dependent repos count: 1.9%
Average: 2.4%
Stargazers count: 3.6%
Forks count: 4.3%
Last synced: 6 months ago
pypi.org: amazon-textract-prettyprinter

Amazon Textract Helper tools for pretty printing

  • Versions: 23
  • Dependent Packages: 2
  • Dependent Repositories: 5
  • Downloads: 48,403 Last month
Rankings
Downloads: 2.3%
Stargazers count: 3.6%
Average: 4.3%
Forks count: 4.3%
Dependent packages count: 4.8%
Dependent repos count: 6.6%
Last synced: 6 months ago
pypi.org: amazon-textract-pipeline-pagedimensions

Amazon Textract Pipeline Component to add page dimensions to page block types

  • Versions: 9
  • Dependent Packages: 1
  • Dependent Repositories: 2
  • Downloads: 2,186 Last month
Rankings
Downloads: 3.0%
Stargazers count: 3.6%
Forks count: 4.3%
Dependent packages count: 4.8%
Average: 5.5%
Dependent repos count: 11.5%
Last synced: 6 months ago
pypi.org: amazon-textract-textractor

A package to use AWS Textract services.

  • Versions: 69
  • Dependent Packages: 3
  • Dependent Repositories: 1
  • Downloads: 1,372,174 Last month
Rankings
Downloads: 2.0%
Dependent packages count: 2.4%
Stargazers count: 3.8%
Forks count: 4.4%
Average: 6.8%
Dependent repos count: 21.6%
Last synced: 6 months ago
pypi.org: amazon-textract-overlayer

Amazon Textract Overlay tools

  • Versions: 9
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 6,227 Last month
Rankings
Downloads: 3.5%
Stargazers count: 3.6%
Forks count: 4.3%
Dependent packages count: 4.8%
Average: 7.6%
Dependent repos count: 21.5%
Last synced: 6 months ago
pypi.org: amazon-textract-helper

Amazon Textract Helper tools

  • Versions: 23
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 6,429 Last month
Rankings
Downloads: 3.6%
Stargazers count: 3.6%
Forks count: 4.3%
Average: 8.7%
Dependent packages count: 10.1%
Dependent repos count: 21.5%
Last synced: 6 months ago
pypi.org: amazon-textract-geofinder

Amazon Textract package to easier access data through geometric information

  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 43,223 Last month
Rankings
Stargazers count: 3.6%
Forks count: 4.3%
Downloads: 6.9%
Average: 9.3%
Dependent packages count: 10.1%
Dependent repos count: 21.5%
Last synced: 6 months ago
pypi.org: amazon-textract-idp-cdk-manifest

Amazon Textract IDP CDK Manifest

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 53,811 Last month
Rankings
Downloads: 1.9%
Stargazers count: 4.0%
Forks count: 4.6%
Dependent packages count: 6.6%
Average: 9.6%
Dependent repos count: 30.6%
Last synced: 6 months ago

Dependencies

.github/workflows/documentation.yml actions
  • actions/cache v2 composite
  • actions/checkout v3 composite
  • peaceiris/actions-gh-pages v3 composite
.github/workflows/lambda_layers.yml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
.github/workflows/release.yml actions
  • actions/cache v2 composite
  • actions/checkout v3 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/test-pr-caller.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • aws-actions/configure-aws-credentials v1-node16 composite
.github/workflows/test-pr-geofinder.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/test-pr-prettyprinter.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/tests.yml actions
  • actions/cache v2 composite
  • actions/checkout v3 composite
requirements.txt pypi
  • Pillow *
  • XlsxWriter ==3.0.
  • amazon-textract-caller ==0.0.27
  • amazon-textract-response-parser ==0.1.37
  • editdistance ==0.6.2
  • jsonschema *
  • tabulate ==0.8.
caller/setup.py pypi
helper/setup.py pypi
idp_cdk_manifest/setup.py pypi
overlayer/setup.py pypi
prettyprinter/setup.py pypi
setup.py pypi
tpipelinegeofinder/setup.py pypi
tpipelinepagedimensions/setup.py pypi
.github/workflows/release-caller.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • pypa/gh-action-pypi-publish release/v1 composite