epubr

Read EPUB files in R

https://github.com/ropensci/epubr

Keywords

epub epub-files epub-format peer-reviewed r r-package rstats

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: ropensci
License: other
Language: R
Default Branch: master
Homepage: https://docs.ropensci.org/epubr
Size: 963 KB

Statistics

Stars: 24
Watchers: 5
Forks: 1
Open Issues: 0
Releases: 8

Topics

epub epub-files epub-format peer-reviewed r r-package rstats

Created about 8 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Changelog License Code of conduct Codemeta

README.Rmd

---
output: github_document
---



```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>", fig.path = "man/figures/README-",
  message = FALSE, warning = FALSE, error = FALSE
)
library(epubr)
```

# epubr 


[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/)
[![](https://badges.ropensci.org/222_status.svg)](https://github.com/ropensci/software-review/issues/222)
[![CRAN status](https://www.r-pkg.org/badges/version/epubr)](https://cran.r-project.org/package=epubr)
[![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/epubr)](https://cran.r-project.org/package=epubr)
[![Github Stars](https://img.shields.io/github/stars/ropensci/epubr.svg?style=social&label=Github)](https://github.com/ropensci/epubr)


## Read EPUB files in R

Read EPUB text and metadata. 

The `epubr` package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame. 

E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content across an arbitrary collection of e-books perfectly and in completely general form, to yield a singular, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata.

EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package.
There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with `epubr`.

Text is read 'as is' for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user's discretion, such as with functions from packages like `tm` or `qdap`.
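
As a rough illustration of the kind of substitution and follow-up cleaning described above, here is a minimal base R sketch. It is not the package's internal implementation; the sample string and the variable names (`raw`, `straight`, `clean`) are hypothetical and exist only for this example.

```r
# Hypothetical sample text containing curly quotes (illustration only).
raw <- "\u201cListen to them,\u201d he said. \u2018What music they make!\u2019"

# Map each curly quote to its straight equivalent using literal matching.
curly <- c("\u201c", "\u201d", "\u2018", "\u2019")
plain <- c("\"", "\"", "'", "'")
straight <- raw
for (i in seq_along(curly)) {
  straight <- gsub(curly[i], plain[i], straight, fixed = TRUE)
}

# Further cleaning a user might choose to apply afterwards:
# lowercase, replace anything that is not a letter or space, tidy whitespace.
clean <- tolower(straight)
clean <- gsub("[^a-z ]+", " ", clean)
clean <- trimws(gsub(" +", " ", clean))

straight
clean
```

How aggressive the second step should be depends on the analysis; dropping all punctuation, as done here, also removes apostrophes inside contractions.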

## Installation

Install `epubr` from CRAN with:

``` r
install.packages("epubr")
```

Install the development version from GitHub with:

``` r
# install.packages("remotes")
remotes::install_github("ropensci/epubr")
```

## Example

Bram Stoker's novel Dracula, sourced from Project Gutenberg, is a good example of an EPUB file with unfortunate formatting.
The first thing that stands out is that the naming convention, `item` followed by some ordered digits, does not differentiate sections like the book preamble from the chapters.
The numbering also starts in an odd place. But it is actually worse than this: sections are not split at chapter boundaries; they can begin and end in the middle of chapters!

These annoyances aside, the metadata and contents can still be read into a convenient table. Text mining analyses can still be performed on the overall book, if not so easily on individual chapters. See the [package vignette](https://docs.ropensci.org/epubr/articles/epubr.html) for examples of how to further improve the structure of an e-book with formatting like this.

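```{r ex}
# Path to the example EPUB file bundled with the package
file <- system.file("dracula.epub", package = "epubr")

# Read metadata and text; the outer parentheses also print the result
(x <- epub(file))

# Inspect the nested data frame of section text for the first (and only) book
x$data[[1]]
```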
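
Since the section breaks in this file do not line up with chapters, one way to analyze the book as a whole is to ignore them. The sketch below assumes the nested data frame has a `text` column, as printed by the example; the tokenization rule and the names `book`, `full_text`, and `words` are illustrative choices, not part of `epubr`.

```r
library(epubr)

# Same bundled example file as in the README example.
file <- system.file("dracula.epub", package = "epubr")
x <- epub(file)

# The nested `data` column holds one row per section of the book.
book <- x$data[[1]]

# Collapse all sections into a single string, ignoring the section breaks.
full_text <- paste(book$text, collapse = " ")

# Crude tokenization: lowercase, then split on runs of non-letter characters.
words <- strsplit(tolower(full_text), "[^a-z']+")[[1]]
words <- words[nchar(words) > 0]

# Ten most frequent tokens across the whole book.
head(sort(table(words), decreasing = TRUE), 10)
```

This treats the book as one bag of words. For chapter-level analysis, the package vignette describes how to restructure the sections first.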

## Related packages

[tesseract](https://github.com/ropensci/tesseract) by @jeroen for more direct control of the OCR process.

[pdftools](https://github.com/ropensci/pdftools) for extracting metadata and text from PDF files (more specific to PDF, and without a Java dependency).

[tabulizer](https://github.com/ropensci/tabulapdf) by @leeper and @tpaskhalis, bindings for the Tabula PDF Table Extractor Library, for extracting tables, rather than text, from PDF files.

[rtika](https://github.com/ropensci/rtika) by @goodmansasha for more general text parsing.

[gutenbergr](https://github.com/ropensci/gutenbergr) by @dgrtwo for searching and downloading public domain texts from Project Gutenberg.

---

Please note that the `epubr` project is released with a [Contributor Code of Conduct](https://github.com/ropensci/epubr/blob/master/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.

[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)

Owner

Name: rOpenSci
Login: ropensci
Kind: organization
Email: info@ropensci.org
Location: Berkeley, CA

Website: https://ropensci.org/
Twitter: rOpenSci
Repositories: 307
Profile: https://github.com/ropensci

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "identifier": "epubr",
  "description": "Provides functions supporting the reading and parsing of internal e-book content from EPUB files. The 'epubr' package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame. E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content across an arbitrary collection of e-books perfectly and in completely general form, to yield a singular, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata. EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with 'epubr'. Text is read 'as is' for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user's discretion, such as with functions from packages like 'tm' or 'qdap'.",
  "name": "epubr: Read EPUB File Metadata and Text",
  "relatedLink": "https://docs.ropensci.org/epubr/",
  "codeRepository": "https://github.com/ropensci/epubr",
  "issueTracker": "https://github.com/ropensci/epubr/issues",
  "license": "https://spdx.org/licenses/MIT",
  "version": "0.6.5",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "url": "https://r-project.org"
  },
  "runtimePlatform": "R version 4.4.1 (2024-06-14 ucrt)",
  "provider": {
    "@id": "https://cran.r-project.org",
    "@type": "Organization",
    "name": "Comprehensive R Archive Network (CRAN)",
    "url": "https://cran.r-project.org"
  },
  "author": [
    {
      "@type": "Person",
      "givenName": "Matthew",
      "familyName": "Leonawicz",
      "email": "rpkgs@pm.me",
      "@id": "https://orcid.org/0000-0001-9452-2771"
    }
  ],
  "maintainer": [
    {
      "@type": "Person",
      "givenName": "Matthew",
      "familyName": "Leonawicz",
      "email": "rpkgs@pm.me",
      "@id": "https://orcid.org/0000-0001-9452-2771"
    }
  ],
  "softwareSuggestions": [
    {
      "@type": "SoftwareApplication",
      "identifier": "testthat",
      "name": "testthat",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=testthat"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "knitr",
      "name": "knitr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=knitr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "rmarkdown",
      "name": "rmarkdown",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=rmarkdown"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "readr",
      "name": "readr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=readr"
    }
  ],
  "softwareRequirements": {
    "1": {
      "@type": "SoftwareApplication",
      "identifier": "xml2",
      "name": "xml2",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=xml2"
    },
    "2": {
      "@type": "SoftwareApplication",
      "identifier": "xslt",
      "name": "xslt",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=xslt"
    },
    "3": {
      "@type": "SoftwareApplication",
      "identifier": "magrittr",
      "name": "magrittr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=magrittr"
    },
    "4": {
      "@type": "SoftwareApplication",
      "identifier": "tibble",
      "name": "tibble",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=tibble"
    },
    "5": {
      "@type": "SoftwareApplication",
      "identifier": "dplyr",
      "name": "dplyr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=dplyr"
    },
    "6": {
      "@type": "SoftwareApplication",
      "identifier": "tidyr",
      "name": "tidyr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=tidyr"
    },
    "SystemRequirements": null
  },
  "fileSize": "555.291KB"
}

Committers

Last synced: over 2 years ago

All Time

Total Commits: 146
Total Committers: 4
Avg Commits per committer: 36.5
Development Distribution Score (DDS): 0.295

Past Year

Commits: 3
Committers: 1
Avg Commits per committer: 3.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
leonawicz	m**t@g**m	103
leonawicz	m**z@e**m	36
leonawicz	m**z@g**m	6
Hugo Gruson	B****o	1

Committer Domains (Top 20 + Academic)

esource.com: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 2
Total pull requests: 2
Average time to close issues: 21 days
Average time to close pull requests: about 4 hours
Total issue authors: 2
Total pull request authors: 2
Average comments per issue: 2.0
Average comments per pull request: 0.5
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: about 8 hours
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

mjockers (1)
sckott (1)

Pull Request Authors

maelle (2)
Bisaloo (1)

Packages

Total packages: 1
Total downloads: 368 last month (CRAN)

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 9
Total maintainers: 1

cran.r-project.org: epubr

Read EPUB File Metadata and Text

Homepage: https://docs.ropensci.org/epubr/
Documentation: http://cran.r-project.org/web/packages/epubr/epubr.pdf
License: MIT + file LICENSE
Latest release: 0.6.5
published almost 2 years ago

Versions: 9
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 368 Last month

Rankings

Dependent repos count: 23.9%

Average: 28.4%

Dependent packages count: 28.7%

Downloads: 32.5%

Maintainers (1)

rpkgs@pm.me

Last synced: 11 months ago

epubr

Science Score: 26.0%
