Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (18.4%) to scientific vocabulary
Keywords
epub
epub-files
epub-format
peer-reviewed
r
r-package
rstats
Last synced: 9 months ago
Repository
Read EPUB files in R
Basic Info
- Host: GitHub
- Owner: ropensci
- License: other
- Language: R
- Default Branch: master
- Homepage: https://docs.ropensci.org/epubr
- Size: 963 KB
Statistics
- Stars: 24
- Watchers: 5
- Forks: 1
- Open Issues: 0
- Releases: 8
Created about 8 years ago · Last pushed over 1 year ago
Metadata Files
Readme
Changelog
License
Code of conduct
Codemeta
README.Rmd
---
output: github_document
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE, comment = "#>", fig.path = "man/figures/README-",
message = FALSE, warning = FALSE, error = FALSE
)
library(epubr)
```
# epubr
[Project status (repostatus.org)](https://www.repostatus.org/)
[rOpenSci software review](https://github.com/ropensci/software-review/issues/222)
[CRAN package page](https://cran.r-project.org/package=epubr)
[GitHub repository](https://github.com/ropensci/epubr)
## Read EPUB files in R
Read EPUB text and metadata.
The `epubr` package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame.
E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content from an arbitrary collection of e-books in a fully general way that yields a single, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata.
EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package.
There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with `epubr`.
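When reading a whole collection, it can help to keep unreadable files from stopping the run. The following is a minimal sketch, assuming that a problematic file causes the reader function to signal an R error (whether `epub()` does so for any particular file is an assumption). `read_all()` and `toy_reader()` are hypothetical names used for illustration and are not part of `epubr`.

```r
# Hypothetical helper, not part of epubr: apply a reader function to each
# file and separate successful results from files that raised an error.
read_all <- function(files, reader) {
  results <- lapply(files, function(f) {
    tryCatch(reader(f), error = function(e) e)  # keep the condition object
  })
  names(results) <- files
  failed <- vapply(results, inherits, logical(1), what = "error")
  list(ok = results[!failed], failed = results[failed])
}

# Toy reader standing in for a real one: fails on any path containing "bad".
toy_reader <- function(f) {
  if (grepl("bad", f)) stop("cannot read ", f)
  toupper(f)
}

out <- read_all(c("a.epub", "bad.epub"), toy_reader)
names(out$ok)      # files that were read
names(out$failed)  # files that raised an error
```

With the package installed, the same helper could be called with `reader = epubr::epub`, under the assumption stated above.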
Text is read 'as is' for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user's discretion, such as with functions from packages like `tm` or `qdap`.
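As a rough illustration of the kind of user-side cleaning described here, the sketch below uses only base R. `clean_text()` is a hypothetical helper, not an `epubr` function, and the sample string is made up for the example; packages such as `tm` or `qdap` offer far more complete tooling.

```r
# Hypothetical helper, not part of epubr: light cleanup of a character vector.
clean_text <- function(x) {
  x <- gsub("\u201C|\u201D", "\"", x)  # curly double quotes -> straight
  x <- gsub("\u2018|\u2019", "'", x)   # curly single quotes -> straight
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  trimws(x)                            # drop leading/trailing whitespace
}

raw <- "\u201CWelcome to my house!\u201D   said  the Count. "
clean_text(raw)
```

Steps such as lowercasing, stop-word removal, or stemming are deliberately left out, since whether they are appropriate depends on the analysis.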
## Installation
Install `epubr` from CRAN with:
``` r
install.packages("epubr")
```
Install the development version from GitHub with:
``` r
# install.packages("remotes")
remotes::install_github("ropensci/epubr")
```
## Example
Bram Stoker's novel Dracula, sourced from Project Gutenberg, is a good example of an EPUB file with unfortunate formatting.
The first thing that stands out is that the naming convention, `item` followed by some ordered digits, does not differentiate sections like the book preamble from the chapters.
The numbering also starts in a weird place. But it is actually worse than this. Notice that sections are not broken into chapters; they can begin and end in the middle of chapters!
These annoyances aside, the metadata and contents can still be read into a convenient table. Text mining analyses can still be performed on the overall book, if not so easily on individual chapters. See the [package vignette](https://docs.ropensci.org/epubr/articles/epubr.html) for examples on how to further improve the structure of an e-book with formatting like this.
```{r ex}
file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
x$data[[1]]
```
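To make the section-versus-chapter problem concrete, here is a small base R sketch of one way text could be regrouped by chapter heading. It is an illustration of the idea only, not the package's implementation: `split_by_chapter()` is a hypothetical helper, the `sections` data frame is a made-up stand-in for `x$data[[1]]`, and the chapter pattern is an assumption about how headings appear in the text. The package vignette describes the tools `epubr` itself provides for this.

```r
# Toy stand-in for x$data[[1]]: sections that cut across chapter boundaries.
sections <- data.frame(
  section = c("item6", "item7", "item8"),
  text = c(
    "Preamble text. CHAPTER I Jonathan Harker's journal begins",
    "and continues here. CHAPTER II The journal",
    "goes on."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not an epubr function: join all section text, then
# re-split it wherever a chapter heading matches `pattern`.
# Text before the first heading (the preamble) is dropped.
split_by_chapter <- function(text, pattern = "CHAPTER [IVXLC]+") {
  full <- paste(text, collapse = " ")
  starts <- gregexpr(pattern, full)[[1]]
  if (starts[1] == -1) stop("no chapter headings matched `pattern`")
  ends <- c(starts[-1] - 1, nchar(full))
  chapters <- trimws(substring(full, starts, ends))
  data.frame(
    chapter = seq_along(chapters),
    text = chapters,
    nword = lengths(strsplit(chapters, "[[:space:]]+")),
    stringsAsFactors = FALSE
  )
}

chapters <- split_by_chapter(sections$text)
chapters
```

A fixed regular expression like this is fragile: a table of contents that repeats the headings, or headings written in another style, would need a different pattern or extra filtering.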
## Related packages
- [tesseract](https://github.com/ropensci/tesseract) by @jeroen, for more direct control of the OCR process.
- [pdftools](https://github.com/ropensci/pdftools), for extracting metadata and text from PDF files (therefore more specific to PDF, and without a Java dependency).
- [tabulizer](https://github.com/ropensci/tabulapdf) by @leeper and @tpaskhalis: bindings for the Tabula PDF Table Extractor Library, for extracting tables (rather than text) from PDF files.
- [rtika](https://github.com/ropensci/rtika) by @goodmansasha, for more general text parsing.
- [gutenbergr](https://github.com/ropensci/gutenbergr) by @dgrtwo, for searching and downloading public domain texts from Project Gutenberg.
---
Please note that the `epubr` project is released with a [Contributor Code of Conduct](https://github.com/ropensci/epubr/blob/master/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.
[rOpenSci](https://ropensci.org)
Owner
- Name: rOpenSci
- Login: ropensci
- Kind: organization
- Email: info@ropensci.org
- Location: Berkeley, CA
- Website: https://ropensci.org/
- Twitter: rOpenSci
- Repositories: 307
- Profile: https://github.com/ropensci
CodeMeta (codemeta.json)
{
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"identifier": "epubr",
"description": "Provides functions supporting the reading and parsing of internal e-book content from EPUB files. The 'epubr' package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame. E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content across an arbitrary collection of e-books perfectly and in completely general form, to yield a singular, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata. EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with 'epubr'. Text is read 'as is' for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user's discretion, such as with functions from packages like 'tm' or 'qdap'.",
"name": "epubr: Read EPUB File Metadata and Text",
"relatedLink": "https://docs.ropensci.org/epubr/",
"codeRepository": "https://github.com/ropensci/epubr",
"issueTracker": "https://github.com/ropensci/epubr/issues",
"license": "https://spdx.org/licenses/MIT",
"version": "0.6.5",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
"url": "https://r-project.org"
},
"runtimePlatform": "R version 4.4.1 (2024-06-14 ucrt)",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"author": [
{
"@type": "Person",
"givenName": "Matthew",
"familyName": "Leonawicz",
"email": "rpkgs@pm.me",
"@id": "https://orcid.org/0000-0001-9452-2771"
}
],
"maintainer": [
{
"@type": "Person",
"givenName": "Matthew",
"familyName": "Leonawicz",
"email": "rpkgs@pm.me",
"@id": "https://orcid.org/0000-0001-9452-2771"
}
],
"softwareSuggestions": [
{
"@type": "SoftwareApplication",
"identifier": "testthat",
"name": "testthat",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=testthat"
},
{
"@type": "SoftwareApplication",
"identifier": "knitr",
"name": "knitr",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=knitr"
},
{
"@type": "SoftwareApplication",
"identifier": "rmarkdown",
"name": "rmarkdown",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=rmarkdown"
},
{
"@type": "SoftwareApplication",
"identifier": "readr",
"name": "readr",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=readr"
}
],
"softwareRequirements": {
"1": {
"@type": "SoftwareApplication",
"identifier": "xml2",
"name": "xml2",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=xml2"
},
"2": {
"@type": "SoftwareApplication",
"identifier": "xslt",
"name": "xslt",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=xslt"
},
"3": {
"@type": "SoftwareApplication",
"identifier": "magrittr",
"name": "magrittr",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=magrittr"
},
"4": {
"@type": "SoftwareApplication",
"identifier": "tibble",
"name": "tibble",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=tibble"
},
"5": {
"@type": "SoftwareApplication",
"identifier": "dplyr",
"name": "dplyr",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=dplyr"
},
"6": {
"@type": "SoftwareApplication",
"identifier": "tidyr",
"name": "tidyr",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=tidyr"
},
"SystemRequirements": null
},
"fileSize": "555.291KB"
}
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| leonawicz | m****t@g****m | 103 |
| leonawicz | m****z@e****m | 36 |
| leonawicz | m****z@g****m | 6 |
| Hugo Gruson | B****o | 1 |
Committer Domains (Top 20 + Academic)
esource.com: 1
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 2
- Total pull requests: 2
- Average time to close issues: 21 days
- Average time to close pull requests: about 4 hours
- Total issue authors: 2
- Total pull request authors: 2
- Average comments per issue: 2.0
- Average comments per pull request: 0.5
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: about 8 hours
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mjockers (1)
- sckott (1)
Pull Request Authors
- maelle (2)
- Bisaloo (1)
Packages
- Total packages: 1
- Total downloads:
  - cran: 368 last month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 9
- Total maintainers: 1
cran.r-project.org: epubr
Read EPUB File Metadata and Text
- Homepage: https://docs.ropensci.org/epubr/
- Documentation: http://cran.r-project.org/web/packages/epubr/epubr.pdf
- License: MIT + file LICENSE
- Latest release: 0.6.5 (published over 1 year ago)
Rankings
Dependent repos count: 23.9%
Average: 28.4%
Dependent packages count: 28.7%
Downloads: 32.5%
Maintainers (1)
Last synced: 10 months ago
Dependencies
DESCRIPTION
cran
- dplyr * imports
- magrittr * imports
- tibble * imports
- tidyr * imports
- xml2 * imports
- xslt * imports
- covr * suggests
- knitr * suggests
- readr * suggests
- rmarkdown * suggests
- testthat * suggests