daiR

daiR: an R package for OCR with Google Document AI - Published in JOSS (2021)

https://github.com/hegghammer/dair

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

google-cloud ocr r
Last synced: 6 months ago

Repository

R package for Google Document AI

Basic Info
  • Host: GitHub
  • Owner: Hegghammer
  • License: other
  • Language: R
  • Default Branch: master
  • Homepage: https://dair.info/
  • Size: 8.68 MB
Statistics
  • Stars: 44
  • Watchers: 4
  • Forks: 4
  • Open Issues: 0
  • Releases: 2
Topics
google-cloud ocr r
Created almost 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog Contributing License Code of conduct

README.md

daiR: OCR with Google Document AI in R

daiR is an R package for Google Document AI, a server-based OCR service with support for over 60 languages. The package provides an interface to the Document AI API and comes with additional tools for parsing output files and reconstructing text. See the daiR website (https://dair.info/) and the accompanying journal article (https://doi.org/10.21105/joss.03538) for more details.

Use

Quickly OCR short documents:

```R
## NOT RUN
library(daiR)
get_text(dai_sync("file.pdf"))
```
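The one-liner above handles a single file. As a minimal sketch of how it might be applied to a folder of short PDFs, the helper below loops over the files and collects the results. `ocr_folder` and its `ocr_fun` argument are hypothetical names introduced here for illustration and are not part of daiR. The default `ocr_fun` simply wraps the README's call, so it needs daiR installed and Google Cloud credentials configured; it can be swapped for a stub to try the loop offline.

```R
# Hypothetical helper (not part of daiR): OCR every PDF in a folder.
# `ocr_fun` defaults to the README's one-liner. It is only evaluated when
# called, so defining this function does not require daiR or credentials.
ocr_folder <- function(dir,
                       ocr_fun = function(f) daiR::get_text(daiR::dai_sync(f))) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  texts <- lapply(files, function(f) {
    # Keep going if one document fails; record NA for that file.
    tryCatch(ocr_fun(f), error = function(e) {
      message("Failed on ", basename(f), ": ", conditionMessage(e))
      NA_character_
    })
  })
  names(texts) <- basename(files)
  texts
}
```

Passing a stand-in such as `ocr_fun = function(f) basename(f)` exercises the file handling without contacting the API. Since each call to the real service is billed, it may be worth testing on a small folder first.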

Turn images of tables into R dataframes:

```R
## NOT RUN
## Assumes a default processor of type "FORM_PARSER_PROCESSOR"
get_tables(dai_sync("file.pdf"))
```
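Once tables have been extracted, a common next step is to save them. The sketch below assumes that the table-extraction call returns a list of data frames, one per detected table; that return shape is an assumption here and should be checked against the package documentation. `save_tables` is a hypothetical helper, not a daiR function, and the demonstration uses stand-in data so that it runs without credentials.

```R
# Hypothetical helper (not part of daiR): write each data frame in a list
# to its own CSV file and return the file paths.
save_tables <- function(tables, dir = tempdir(), stem = "table") {
  # A bare data frame is also a list, so reject it explicitly.
  stopifnot(is.list(tables), !is.data.frame(tables))
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  paths <- file.path(dir, sprintf("%s_%02d.csv", stem, seq_along(tables)))
  for (i in seq_along(tables)) {
    utils::write.csv(tables[[i]], paths[i], row.names = FALSE)
  }
  paths
}

# Offline demonstration with stand-in data. With credentials configured,
# the input would instead come from the table-extraction call above.
demo <- list(data.frame(item = c("a", "b"), qty = c(1, 2)))
save_tables(demo)
```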

Draw bounding boxes on the source image:

```R
## NOT RUN
draw_blocks(dai_sync("file.pdf"))
```

Requirements

Google Document AI is a paid service that requires a Google Cloud account and a Google Storage bucket. I recommend using Mark Edmondson's googleCloudStorageR package in combination with daiR.
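Because the service depends on external credentials and a storage bucket, it can help to check the local configuration before sending any documents. The sketch below uses only base R. The variable names `GCS_AUTH_FILE` and `GCS_DEFAULT_BUCKET` are the ones googleCloudStorageR reads at load time; `DAI_PROCESSOR_ID` is, to the best of my understanding, the variable daiR uses for its default processor, so treat that name as an assumption. `missing_env_vars` is a hypothetical helper, not part of either package.

```R
# Environment variables the two packages are assumed to consult by default.
# Verify these names against the package documentation before relying on them.
required_vars <- c("GCS_AUTH_FILE", "GCS_DEFAULT_BUCKET", "DAI_PROCESSOR_ID")

# Hypothetical helper (not part of daiR): return the names of unset variables.
missing_env_vars <- function(vars = required_vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

missing <- missing_env_vars()
if (length(missing) > 0) {
  message("Unset environment variables: ", paste(missing, collapse = ", "))
} else {
  message("All expected environment variables are set.")
}
```

These variables are typically set in an `.Renviron` file so that they are available in every session.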

Installation

Install daiR from CRAN:

    install.packages("daiR")

Or install the latest development version from GitHub:

    devtools::install_github("hegghammer/daiR")

Citation

To cite daiR in publications, please use:

Hegghammer, T. (2021). daiR: an R package for OCR with Google Document AI. Journal of Open Source Software, 6(68), 3538, https://doi.org/10.21105/joss.03538

BibTeX:

    @article{Hegghammer2021,
      doi = {10.21105/joss.03538},
      url = {https://doi.org/10.21105/joss.03538},
      year = {2021},
      publisher = {The Open Journal},
      volume = {6},
      number = {68},
      pages = {3538},
      author = {Thomas Hegghammer},
      title = {daiR: an R package for OCR with Google Document AI},
      journal = {Journal of Open Source Software}
    }

Acknowledgments

Thanks to Mark Edmondson, Hallvar Gisnås, Will Hanley, Neil Ketchley, Trond Arne Sørby, Chris Barrie, and Geraint Palmer for contributions to the project.

Code of conduct

Please note that the daiR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.


Owner

  • Name: Thomas Hegghammer
  • Login: Hegghammer
  • Kind: user
  • Location: Oxford, UK

Historian and political scientist studying militant Islamism. Senior Research Fellow at All Souls College, Oxford University.

JOSS Publication

daiR: an R package for OCR with Google Document AI
Published
December 21, 2021
Volume 6, Issue 68, Page 3538
Authors
Thomas Hegghammer ORCID
Senior Research Fellow, Norwegian Defence Research Establishment (FFI)
Editor
Nikoleta Glynatsi ORCID
Tags
optical character recognition cloud computing text mining natural language processing

GitHub Events

Total
  • Issues event: 1
  • Watch event: 3
  • Issue comment event: 3
  • Push event: 9
Last Year
  • Issues event: 1
  • Watch event: 3
  • Issue comment event: 3
  • Push event: 9

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 204
  • Total Committers: 2
  • Avg Commits per committer: 102.0
  • Development Distribution Score (DDS): 0.015
Past Year
  • Commits: 9
  • Committers: 1
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
hegghammer h****r@g****m 201
Hegghammer T****r@f****o 3
Committer Domains (Top 20 + Academic)
ffi.no: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 10
  • Total pull requests: 0
  • Average time to close issues: 3 months
  • Average time to close pull requests: N/A
  • Total issue authors: 7
  • Total pull request authors: 0
  • Average comments per issue: 2.7
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • whanley (3)
  • giladwenig (2)
  • brancengregory (1)
  • tweedmann (1)
  • dcaud (1)
  • pauvallprat (1)
  • arunrajes (1)
  • bhanberry (1)
Pull Request Authors
Top Labels
Issue Labels
help wanted (4) enhancement (1) bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 596 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: daiR

Interface with Google Cloud Document AI API

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 596 Last month
Rankings
Stargazers count: 9.2%
Forks count: 14.9%
Average: 22.3%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.1.0 depends
  • base64enc * imports
  • beepr * imports
  • curl * imports
  • fs * imports
  • gargle * imports
  • glue * imports
  • googleCloudStorageR * imports
  • grDevices * imports
  • httr * imports
  • jsonlite * imports
  • magick * imports
  • pdftools * imports
  • purrr * imports
  • readtext * imports
  • stringr * imports
  • covr * suggests
  • dplyr * suggests
  • knitr * suggests
  • ngram * suggests
  • qpdf * suggests
  • rmarkdown * suggests
  • sodium * suggests
  • testthat >= 3.0.0 suggests
  • usethis * suggests
  • utils * suggests
.github/workflows/package-check.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/upload-artifact main composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite
.github/workflows/test-coverage.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite