quipucamayoc

dev repo for article

https://github.com/sergiocorreia/quipucamayoc

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.6%) to scientific vocabulary

Keywords

ocr ocr-post-processing ocr-python poppler table-extraction table-ocr textract
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: sergiocorreia
  • License: agpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 30.3 MB
Statistics
  • Stars: 30
  • Watchers: 6
  • Forks: 5
  • Open Issues: 3
  • Releases: 0
Created about 4 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

Quipucamayoc: tools for digitizing historical data

quipucamayoc is a Python package that simplifies the extraction of historical data from scanned images and PDFs. It is designed to be modular, so it can be used together with other existing tools and extended easily by users.

For an overview of how to use quipucamayoc to digitize historical data, see this research article, which details the steps involved and the methods used, and provides practical examples. For a user guide, documentation, and installation instructions, see http://scorreia.com/software/quipucamayoc/ (TODO).

If you want to contribute by improving the code or extending its functionality (much welcome!), head here.

Installation

Pip

To manage quipucamayoc using pip, open the command line and run:

  • pip install quipucamayoc to install
    • pip install quipucamayoc[dev] to include extra dependencies used when developing the code
  • pip install -U quipucamayoc to upgrade
  • pip uninstall quipucamayoc to remove

Note that quipucamayoc has been tested against Python 3.10 and newer versions, but should also work with Python 3.9.
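
As a quick sanity check after installing, the following minimal sketch verifies the interpreter version against the Python 3.9 lower bound mentioned above and reports whether the package is installed. The helper names `python_is_supported` and `installed_version` are illustrative and not part of quipucamayoc; only the standard library is used.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)  # lowest version the README says should work


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


def installed_version(package="quipucamayoc"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("quipucamayoc version:", installed_version() or "not installed")
```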

Git Install

After cloning the repo to your computer and navigating to the quipucamayoc folder, run:

  • pip install . to install the package locally
  • pip install --no-cache-dir --editable .[dev] to install locally with a symlink so changes are automatically updated (recommended for developers)

After installation

AWS

quipucamayoc can use AWS's Textract to OCR text and tables. Configuring Textract by hand is quite cumbersome, so the setup has been automated for you. To use it, first install quipucamayoc and then follow these steps to specify your credentials:

  1. Ensure you have an Amazon AWS account.
    • AWS Textract prices are listed here. If you just created an account you should be able to use the free tier. Otherwise, you might need to set up a payment method.
  2. Now you need to create credentials so you can access AWS programmatically.
    • The simplest method is to go to the security credentials page (you can also go to it from the AWS console: click on your name on the top-right > click on security credentials). Then, ignore the security warning (see below), go to the next page, scroll to Access Keys and click create an access key. Copy the Access key and Secret access key strings (akin to a username and password).
    • (TODO) AWS now recommends an alternative: instead of creating access keys for your own user, create a new user with more restricted access and then create an access key for that user. That way, if your credentials are compromised (e.g. your computer is hacked), attackers are limited in what they can do. To create a user, go to the Identity and Access Management (IAM) console and add one: Access Management > Users > Add Users. Select a name, press Next, and then attach the required policies (TODO: find out which policies are needed).
  3. Download and install the aws command line interface (CLI). Update: quipucamayoc installs the awscli package so this step might not be necessary anymore.
  4. Go to the command line and type aws configure to enter your credentials. You need to enter your Access key and Secret access key strings. The Default region name can be the AWS region of your preference (most likely us-east-1) or left empty. The Default output format can be left empty.
  5. From the command line, you can now run the quipucamayoc command quipu aws install. This will set up your AWS account so you can use Textract programmatically (i.e., create an S3 bucket, an SNS topic, an SQS queue, and a user with the required credentials, and then configure the user, bucket, topic, and queue so they can talk to each other).

Notes:

  • You can avoid steps 3-4 by directly writing your credentials to the credentials file (see https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).
  • If you want to remove all quipucamayoc artifacts from your AWS account, you can run quipu aws uninstall from the command line.
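
The first note above (writing credentials directly to the credentials file) can be sketched as follows. This is an illustration, not part of quipucamayoc: the function name `write_aws_credentials` is hypothetical, and it assumes the standard AWS shared credentials layout (a `[default]` profile with `aws_access_key_id` and `aws_secret_access_key`, stored at `~/.aws/credentials` unless another path is given).

```python
import configparser
from pathlib import Path


def write_aws_credentials(access_key, secret_key, path=None, profile="default"):
    """Write (or update) one profile in an AWS shared credentials file."""
    path = Path(path) if path is not None else Path.home() / ".aws" / "credentials"
    # interpolation=None so that a '%' in a key is stored literally
    config = configparser.ConfigParser(interpolation=None)
    config.read(path)  # silently skipped if the file does not exist yet
    config[profile] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as handle:
        config.write(handle)
    path.chmod(0o600)  # restrict the file to the current user where supported
    return path
```

Other profiles already present in the file are preserved, because the existing file is read before the selected profile is replaced.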

Usage

  • From the command line, you can extract tables using AWS via quipu extract-tables --filename <myfile.pdf>
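
To process several files, the command above can be wrapped in a short script. The sketch below only assumes the `quipu extract-tables --filename` invocation shown above; the helper names and the `dry_run` flag are hypothetical, and running it for real requires that quipucamayoc is installed and the AWS setup has been completed.

```python
import subprocess
from pathlib import Path


def build_extract_command(pdf_path):
    """Return the argv list for the documented `quipu extract-tables` call."""
    return ["quipu", "extract-tables", "--filename", str(pdf_path)]


def extract_tables_from_folder(folder, dry_run=False):
    """Run the CLI once per PDF in `folder`; return the commands that were built."""
    pdf_files = sorted(Path(folder).glob("*.pdf"))
    commands = [build_extract_command(pdf) for pdf in pdf_files]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if quipu exits with an error
            subprocess.run(command, check=True)
    return commands
```

Calling `extract_tables_from_folder("scans", dry_run=True)` returns the commands without executing them, which is a cheap way to confirm which files would be sent to Textract before incurring any charges.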

TODO

  • [x] Automatically set up Textract pipeline
  • [ ] Expose key functions as command line tools
  • [ ] Allow parallel (async?) tasks. Useful for OpenCV (CPU-intensive) and Textract calls (IO-intensive). Consider also uvloop
  • [ ] Include Poppler by default on Windows
  • [ ] Add mypy/(flake8|black)

Interesting tools to explore:

  • https://github.com/qurator-spk/eynollah
  • https://github.com/qurator-spk/sbb_binarization
  • https://github.com/leonlulu/DeepLayout

Contributing

Feel free to submit pull requests. For consistency, code should comply with PEP 8 (as long as it's reasonable), and with the style guides by @kennethreitz and Google. Read more here.

Citation

(Download BibTex file here)

As text

  • Sergio Correia, Stephan Luck: “Digitizing Historical Balance Sheet Data: A Practitioner's Guide”, 2022; arXiv:2204.00052.

As BibTex

@misc{quipucamayoc,
  Author = {Correia, Sergio and Luck, Stephan},
  Title = {Digitizing Historical Balance Sheet Data: A Practitioner's Guide},
  Year = {2022},
  eprint = {arXiv:2204.00052},
  journal = {arXiv preprint arXiv:2204.00052}
}

Acknowledgments

Quipucamayoc is built upon the work and improvements of many users and developers, from whom it drew heavy inspiration, such as:

It also relies for most of its work on the following open source projects:

License

Quipucamayoc is developed under the GNU Affero GPL v3 license.

Why "quipucamayoc"?

The quipucamayocs were the Inca empire officials in charge of deciphering (among other things) accounting information stored in quipus. Our goal for this package is to act as a sort of quipucamayoc, helping researchers decipher and extract historical information, particularly balance sheets and numerical records.

Owner

  • Name: Sergio Correia
  • Login: sergiocorreia
  • Kind: user
  • Location: Washington, DC
  • Company: Federal Reserve Board

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Correia"
  given-names: "Sergio"
  orcid: "https://orcid.org/0000-0002-0914-8648"
- family-names: "Luck"
  given-names: "Stephan"
title: "quipucamayoc"
version: 1.0.0
date-released: 2022-09-09
url: "https://github.com/sergiocorreia/quipucamayoc"

GitHub Events

Total
  • Watch event: 5
Last Year
  • Watch event: 5

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 47
  • Total Committers: 3
  • Avg Commits per committer: 15.667
  • Development Distribution Score (DDS): 0.34
Top Committers
Name Email Commits
Sergio Correia s****a@g****m 31
Aaron Mahr a****r@g****m 15
a-mahr 7****r@u****m 1
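
The committer statistics above can be reproduced from the commit counts in the table. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption here, although it is consistent with the figures listed.

```python
# Commit counts copied from the Top Committers table
commits = {"Sergio Correia": 31, "Aaron Mahr": 15, "a-mahr": 1}

total = sum(commits.values())            # total commits
average = total / len(commits)           # average commits per committer
dds = 1 - max(commits.values()) / total  # assumed DDS definition

print(total, round(average, 3), round(dds, 2))  # 47 15.667 0.34
```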

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 7
  • Total pull requests: 1
  • Average time to close issues: 3 months
  • Average time to close pull requests: 6 days
  • Total issue authors: 6
  • Total pull request authors: 1
  • Average comments per issue: 2.71
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • aureliusnoble (2)
  • a-mahr (1)
  • maestro1315 (1)
  • puppy8472 (1)
  • psungho (1)
  • JamesMaxwellHarrison (1)
Pull Request Authors
  • a-mahr (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 18 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 2
  • Total maintainers: 1
pypi.org: quipucamayoc

Tools to extract information from digitized historical documents

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 18 Last month
Rankings
Dependent packages count: 10.1%
Stargazers count: 14.5%
Forks count: 16.8%
Dependent repos count: 21.6%
Average: 25.8%
Downloads: 66.2%
Maintainers (1)
Last synced: 6 months ago

Dependencies

pyproject.toml pypi