pdfsearch
pdfsearch: Search Tools for PDF Files - Published in JOSS (2018)
Science Score: 93.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to arxiv.org, joss.theoj.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Keywords
keyword
pdf
r
Scientific Fields
Engineering, Computer Science: 40% confidence
Repository
Search pdf files for keywords.
Basic Info
Statistics
- Stars: 41
- Watchers: 5
- Forks: 4
- Open Issues: 7
- Releases: 4
Created over 9 years ago · Last pushed 9 months ago
Metadata Files
Readme
Changelog
Contributing
License
Zenodo
README.Rmd
# pdfsearch
[R-CMD-check](https://github.com/lebebr01/pdfsearch/actions?workflow=R-CMD-check)
[AppVeyor build status](https://ci.appveyor.com/project/lebebr01/pdfsearch)
[Codecov test coverage](https://app.codecov.io/github/lebebr01/pdfsearch)
[CRAN package](https://cran.r-project.org/package=pdfsearch)
[DOI: 10.21105/joss.00668](https://doi.org/10.21105/joss.00668)
This package defines a few useful functions for keyword searching using the [pdftools](https://github.com/ropensci/pdftools) package developed by [rOpenSci](https://ropensci.org/).
The package can be installed from CRAN directly:
```{r install_cran, eval = FALSE}
install.packages("pdfsearch")
```
To install the development version, use devtools:
```{r install, eval = FALSE}
install.packages("devtools")
devtools::install_github('lebebr01/pdfsearch')
```
## Basic Usage
The package currently exposes two main functions. The first, `keyword_search`, takes a single PDF and searches it for keywords. The second, `keyword_directory`, runs the same search over a directory of PDFs.
## Example with `keyword_search`
The package comes with two pdf files from [arXiv](https://arxiv.org/) to use as test cases. Below is an example of using the `keyword_search` function.
```{r search1, message = FALSE}
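library(pdfsearch)

# Locate one of the example PDFs shipped with the package
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

# path = TRUE: `file` is a path to a PDF rather than text already read in
result <- keyword_search(file,
                         keyword = c('measurement', 'error'),
                         path = TRUE)

# Show the text of the first two matching lines
head(result$line_text, n = 2)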
```
The location of the keyword match, including page number and line number, the actual line of text, and a tokenized version of the text (raw text split by individual words) are returned by default.
In addition, by default a word hyphenated at the end of a line is joined with its continuation at the start of the next line. To disable this behavior, set the `remove_hyphen` argument to `FALSE`.
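To make the idea of hyphen removal concrete, here is a minimal base R sketch of joining a word split across two lines. The helper `join_hyphenated` is hypothetical and written only for illustration; it is not part of pdfsearch, and the package's own implementation may differ.

```r
# Hypothetical helper (not part of pdfsearch): joins a word hyphenated at the
# end of one line with its continuation at the start of the next line.
join_hyphenated <- function(lines) {
  if (length(lines) < 2) return(lines)
  for (i in seq_len(length(lines) - 1)) {
    if (grepl("-$", lines[i])) {
      # The first run of non-space characters on the next line is the continuation
      continuation <- sub("^(\\S+).*$", "\\1", lines[i + 1])
      # Drop the trailing hyphen and append the continuation
      lines[i] <- paste0(sub("-$", "", lines[i]), continuation)
      # Remove the continuation (and any following spaces) from the next line
      lines[i + 1] <- sub("^\\S+\\s*", "", lines[i + 1])
    }
  }
  lines
}

# Made-up example text, not taken from the bundled PDFs
page <- c("The estimate is sensitive to measure-",
          "ment error in the covariates.")
joined <- join_hyphenated(page)
print(joined)
```

Without this step, a search for `measurement` would miss the first line, because the keyword is split across the line break.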
### Surrounding lines of text
It may be useful to extract not just the line of text that the keyword is found in, but also surrounding text to have additional context when looking at the keyword results. This can be added by using the argument `surround_lines` as follows:
```{r surround, eval = FALSE}
file <- system.file('pdf', '1610.00147.pdf', package = 'pdfsearch')

# surround_lines = 1: also extract surrounding lines of text for context
result <- keyword_search(file,
                         keyword = c('measurement', 'error'),
                         path = TRUE, surround_lines = 1)

head(result)
head(result$line_text, n = 2)
```
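As a rough illustration of what extracting surrounding lines means, the base R sketch below collects each matching line together with its neighbours. The helper `with_context` and the sample text are hypothetical; this is not how pdfsearch is implemented, and the shape of its output may differ.

```r
# Hypothetical helper (not part of pdfsearch): for every line matching
# `keyword`, return that line plus `surround` lines before and after it.
with_context <- function(lines, keyword, surround = 1) {
  hits <- grep(keyword, lines, ignore.case = TRUE)
  lapply(hits, function(i) {
    # Clamp the window so it never runs past the first or last line
    idx <- seq(max(1, i - surround), min(length(lines), i + surround))
    lines[idx]
  })
}

# Made-up example text
lines <- c("We fit a mixed model.",
           "Measurement error was ignored.",
           "Results follow.",
           "See Table 2.")
context <- with_context(lines, "measurement", surround = 1)
print(context)
```

Each list element holds one match with its context, so a match on the first or last line simply gets a shorter window.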
## Example with `keyword_directory`
The `keyword_directory` function allows users to search for keywords in multiple PDF files in one function call. The same functionality from the `keyword_search` function can be invoked, specifically `remove_hyphen` and `surround_lines`. Below is an example of searching a single directory.
```{r directory, eval = FALSE}
# Directory holding the example PDFs shipped with the package
directory <- system.file('pdf', package = 'pdfsearch')

# do search over two files
directory_result <- keyword_directory(directory,
                                      keyword = c('repeated measures', 'measurement error'),
                                      surround_lines = 1)

head(directory_result, n = 2)
```
A few other arguments are useful when searching for keywords across multiple PDF files in a directory. One is `recursive` (default `FALSE`); when set to `TRUE`, the search also descends into subdirectories, which the default behavior does not do. Another is `max_search`, which helps when a directory holds many PDF files and you want to test the function on a handful first. It takes a positive integer giving the number of PDF files to search. For example, if `max_search = 2`, only the first two PDF files in the directory are searched.
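The effect of these two arguments on which files get searched can be sketched with base R alone. The snippet below builds a throwaway directory of empty placeholder files and uses `list.files` and `head` to mimic the selection; it is an illustration under those assumptions, not the package's internal code.

```r
# Build a temporary directory tree with placeholder files (empty, not real PDFs)
root <- tempfile("pdf_demo_")
dir.create(file.path(root, "sub"), recursive = TRUE)
invisible(file.create(file.path(root, c("a.pdf", "b.pdf", "c.pdf"))))
invisible(file.create(file.path(root, "sub", "d.pdf")))

# Comparable to recursive = FALSE: only files directly inside `root`
top_only <- list.files(root, pattern = "\\.pdf$", recursive = FALSE)

# Comparable to recursive = TRUE: files in subdirectories are included too
with_subdirs <- list.files(root, pattern = "\\.pdf$", recursive = TRUE)

# Comparable to max_search = 2: keep only the first two files found
first_two <- head(top_only, 2)

print(top_only)
print(with_subdirs)
print(first_two)

# Clean up the temporary directory
unlink(root, recursive = TRUE)
```

With pdfsearch itself, the corresponding call would look like `keyword_directory(directory, keyword = 'measurement error', recursive = TRUE, max_search = 2)`, using the arguments described above.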
### Shiny App
The package also includes a simple Shiny app, which can be launched with the following command:
```{r shiny, eval = FALSE}
run_shiny()
```
## Usage in Research
The pdfsearch package may be most useful to those conducting research syntheses or meta-analyses. Users can search for keywords related to a research question, so that instead of reading the entire text of each document, they can identify the specific portions of the text that are relevant. This could improve reproducibility and reduce the time needed to collect data for a research synthesis or meta-analysis.
As an example, the package is currently being used to explore the evolution of statistical software and quantitative methods used in published social science research (https://ww2.amstat.org/meetings/jsm/2018/onlineprogram/AbstractDetails.cfm?abstractid=330777). This process involves getting PDF files from published research articles and using pdfsearch to search for specific software and quantitative methods keywords within the research articles. The results of the keyword matches will be explored using research synthesis methods. A pre-print of the paper and slides from the presentation will be posted to the GitHub repo as part of the package later this summer.
Owner
- Name: Brandon LeBeau
- Login: lebebr01
- Kind: user
- Location: Iowa City
- Website: https://brandonlebeau.org
- Twitter: blebeau11
- Repositories: 14
- Profile: https://github.com/lebebr01
JOSS Publication
pdfsearch: Search Tools for PDF Files
Published
July 09, 2018
Volume 3, Issue 27, Page 668
Tags
Keyword Search PDF reproducible research
GitHub Events
Total
- Issues event: 2
- Watch event: 2
- Push event: 14
Last Year
- Issues event: 2
- Watch event: 2
- Push event: 14
Committers
Top Committers
| Name | Email | Commits |
|---|---|---|
| Brandon LeBeau | l****1@g****m | 184 |
Issues and Pull Requests
All Time
- Total issues: 28
- Total pull requests: 2
- Average time to close issues: 4 months
- Average time to close pull requests: 4 months
- Total issue authors: 12
- Total pull request authors: 1
- Average comments per issue: 1.07
- Average comments per pull request: 3.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- lebebr01 (12)
- lmullen (5)
- behrica (2)
- bes827 (1)
- mdietz3 (1)
- dcaud (1)
- eni1985 (1)
- ghost (1)
- machado-t (1)
- wrasitejmiago (1)
- alastairrushworth (1)
- jeroen (1)
Pull Request Authors
- behrica (2)
Top Labels
Issue Labels
enhancement (10)
bug (4)
help wanted (1)
Packages
- Total packages: 1
- Total downloads: 6,448 last month (CRAN)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 5
- Total maintainers: 1
cran.r-project.org: pdfsearch
Search Tools for PDF Files
- Homepage: https://github.com/lebebr01/pdfsearch
- Documentation: http://cran.r-project.org/web/packages/pdfsearch/pdfsearch.pdf
- License: MIT + file LICENSE
- Latest release: 0.4.3 (published 9 months ago)
Rankings
Stargazers count: 8.5%
Forks count: 11.3%
Average: 23.5%
Dependent packages count: 29.8%
Downloads: 32.5%
Dependent repos count: 35.5%
Dependencies
DESCRIPTION
cran
- R >= 3.3.0 depends
- pdftools * imports
- stringi * imports
- tibble * imports
- tokenizers * imports
- covr * suggests
- knitr * suggests
- rmarkdown * suggests
- shiny * suggests
- testthat * suggests
.github/workflows/covr.yml
actions
- actions/checkout v4 composite
- actions/upload-artifact v4 composite
- codecov/codecov-action v5 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/main.yml
actions
- actions/checkout v4 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yml
actions
- JamesIves/github-pages-deploy-action v4.5.0 composite
- actions/checkout v4 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
