daiR
daiR: an R package for OCR with Google Document AI - Published in JOSS (2021)
Science Score: 93.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 10 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: links to joss.theoj.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
R package for Google Document AI
Basic Info
- Host: GitHub
- Owner: Hegghammer
- License: other
- Language: R
- Default Branch: master
- Homepage: https://dair.info/
- Size: 8.68 MB
Statistics
- Stars: 44
- Watchers: 4
- Forks: 4
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
daiR: OCR with Google Document AI in R

daiR is an R package for Google Document AI, a server-based OCR service with support for over 60 languages. The package provides an interface to the Document AI API and comes with additional tools for parsing output files and reconstructing text. See the daiR website (https://dair.info/) and the JOSS article (https://doi.org/10.21105/joss.03538) for more details.
Use
Quickly OCR short documents:
```R
# NOT RUN
library(daiR)
get_text(dai_sync("file.pdf"))
```
Turn images of tables into R data frames:
```R
# NOT RUN:
# Assumes a default processor of type "FORM_PARSER_PROCESSOR"
get_tables(dai_sync("file.pdf"))
```
Draw bounding boxes on the source image:
```R
# NOT RUN:
draw_blocks(dai_sync("file.pdf"))
```
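The calls above share one pattern: send a file for processing, then pass the response to a helper. As a minimal sketch, the same pattern can be applied to several files. `ocr_many()` and its `ocr_fun` argument are hypothetical names introduced here, not part of daiR; the OCR step is passed in as a function so the loop can be tried with a stub before any paid API call is made.

```R
# Hypothetical helper (not part of daiR): apply an OCR function to several
# files and collect the results in a list named by file name.
# Files that do not exist are skipped with a warning.
ocr_many <- function(paths, ocr_fun) {
  results <- list()
  for (p in paths) {
    if (!file.exists(p)) {
      warning("Skipping missing file: ", p)
      next
    }
    results[[basename(p)]] <- ocr_fun(p)
  }
  results
}

# Dry run with a stub that only reports the file name
tmp <- tempfile(fileext = ".pdf")
file.create(tmp)
stub <- function(p) paste("text from", basename(p))
out <- ocr_many(tmp, stub)

# NOT RUN: real use, assuming daiR is authenticated and a default processor is set
# texts <- ocr_many(c("a.pdf", "b.pdf"),
#                   function(p) daiR::get_text(daiR::dai_sync(p)))
```

Keeping the daiR call inside the function argument means the file handling can be checked without credentials.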
Requirements
Google Document AI is a paid service that requires a Google Cloud account and a Google Storage bucket. I recommend using Mark Edmondson's googleCloudStorageR package in combination with daiR.
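Before the first request it can help to confirm that the credentials and bucket settings are visible to R. The sketch below uses base R only. The variable names `GCS_AUTH_FILE`, `GCS_DEFAULT_BUCKET`, and `DAI_PROCESSOR_ID` are assumptions based on the conventions of googleCloudStorageR and daiR, so check the documentation of your installed versions; `check_env()` is a hypothetical helper.

```R
# Hypothetical helper: report which environment variables are set.
# The default names are assumptions; adjust them to your own setup.
check_env <- function(vars = c("GCS_AUTH_FILE",
                               "GCS_DEFAULT_BUCKET",
                               "DAI_PROCESSOR_ID")) {
  vals <- Sys.getenv(vars, unset = "")
  is_set <- nzchar(vals)
  names(is_set) <- vars
  is_set
}

status <- check_env()
missing_vars <- names(status)[!status]
if (length(missing_vars) > 0) {
  message("Not set: ", paste(missing_vars, collapse = ", "))
}
```

Such variables are typically defined in `.Renviron` so that they are available at the start of every session.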
Installation
Install daiR from CRAN:
```R
install.packages("daiR")
```
Or install the latest development version from GitHub:
```R
devtools::install_github("hegghammer/daiR")
```
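A script that depends on daiR can check whether the package is available instead of reinstalling on every run. This is a sketch in base R; `ensure_installed()` and its `install` argument are hypothetical and introduced only for illustration.

```R
# Hypothetical helper: TRUE if the package can be loaded.
# With install = TRUE, a missing package is installed from CRAN first.
ensure_installed <- function(pkg, install = FALSE) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(TRUE)
  }
  if (install) {
    utils::install.packages(pkg)
    return(requireNamespace(pkg, quietly = TRUE))
  }
  FALSE
}

# Reports availability without installing anything
ensure_installed("daiR")
```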
Citation
To cite daiR in publications, please use
Hegghammer, T., (2021). daiR: an R package for OCR with Google Document AI. Journal of Open Source Software, 6(68), 3538, https://doi.org/10.21105/joss.03538
BibTeX:
@article{Hegghammer2021,
doi = {10.21105/joss.03538},
url = {https://doi.org/10.21105/joss.03538},
year = {2021},
publisher = {The Open Journal},
volume = {6},
number = {68},
pages = {3538},
author = {Thomas Hegghammer},
title = {daiR: an R package for OCR with Google Document AI},
journal = {Journal of Open Source Software}
}
Acknowledgments
Thanks to Mark Edmondson, Hallvar Gisnås, Will Hanley, Neil Ketchley, Trond Arne Sørby, Chris Barrie, and Geraint Palmer for contributions to the project.
Code of conduct
Please note that the daiR project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Owner
- Name: Thomas Hegghammer
- Login: Hegghammer
- Kind: user
- Location: Oxford, UK
- Website: hegghammer.net
- Repositories: 2
- Profile: https://github.com/Hegghammer
Historian and political scientist studying militant Islamism. Senior Research Fellow at All Souls College, Oxford University.
JOSS Publication
daiR: an R package for OCR with Google Document AI
Tags
optical character recognition, cloud computing, text mining, natural language processing
GitHub Events
Total
- Issues event: 1
- Watch event: 3
- Issue comment event: 3
- Push event: 9
Last Year
- Issues event: 1
- Watch event: 3
- Issue comment event: 3
- Push event: 9
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| hegghammer | h****r@g****m | 201 |
| Hegghammer | T****r@f****o | 3 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 10
- Total pull requests: 0
- Average time to close issues: 3 months
- Average time to close pull requests: N/A
- Total issue authors: 7
- Total pull request authors: 0
- Average comments per issue: 2.7
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- whanley (3)
- giladwenig (2)
- brancengregory (1)
- tweedmann (1)
- dcaud (1)
- pauvallprat (1)
- arunrajes (1)
- bhanberry (1)
Packages
- Total packages: 1
- Total downloads: 596 last month (CRAN)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 4
- Total maintainers: 1
cran.r-project.org: daiR
Interface with Google Cloud Document AI API
- Homepage: https://github.com/Hegghammer/daiR
- Documentation: http://cran.r-project.org/web/packages/daiR/daiR.pdf
- License: MIT + file LICENSE
- Latest release: 1.0.1 (published over 1 year ago)
Dependencies
- R >= 3.1.0 depends
- base64enc * imports
- beepr * imports
- curl * imports
- fs * imports
- gargle * imports
- glue * imports
- googleCloudStorageR * imports
- grDevices * imports
- httr * imports
- jsonlite * imports
- magick * imports
- pdftools * imports
- purrr * imports
- readtext * imports
- stringr * imports
- covr * suggests
- dplyr * suggests
- knitr * suggests
- ngram * suggests
- qpdf * suggests
- rmarkdown * suggests
- sodium * suggests
- testthat >= 3.0.0 suggests
- usethis * suggests
- utils * suggests
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/upload-artifact main composite
- r-lib/actions/setup-pandoc v1 composite
- r-lib/actions/setup-r v1 composite
- actions/cache v2 composite
- actions/checkout v2 composite
- r-lib/actions/setup-pandoc v1 composite
- r-lib/actions/setup-r v1 composite
