pubchunks

:warning: ARCHIVED :warning: Get chunks of XML format scholarly articles

https://github.com/ropensci-archive/pubchunks

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 10 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

literature open-access r r-package rstats text-mining xml

Keywords from Contributors

genome geocode routes cycle http-mock mock itis phylogenetics iucn-red-list iucn
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ropensci-archive
  • License: other
  • Language: R
  • Default Branch: master
  • Homepage:
  • Size: 4.04 MB
Statistics
  • Stars: 8
  • Watchers: 4
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Archived
Created about 8 years ago · Last pushed over 3 years ago
Metadata Files
Readme Contributing License

README-not.md

pubchunks

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Get chunks of XML articles

Package API

  • pub_tabularize
  • pub_guess_publisher
  • pub_sections
  • pub_chunks
  • pub_providers

The main workhorse function is pub_chunks(). It lets you pull sections out of articles from many different publishers (see the list of supported publishers below) without having to know how to parse or navigate XML. XML has a steep learning curve, and it can take quite a bit of searching to work out how to reach different parts of an XML document.

The other main function is pub_tabularize(), which takes the output of pub_chunks() and coerces it into a data.frame for easier downstream processing.
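For contrast, here is a minimal sketch of the manual route that pub_chunks() is meant to spare you, using xml2 directly. The XML snippet is hypothetical and heavily simplified (it is not a real publisher schema), and the XPath expressions are assumptions that only fit this toy document.

```r
library(xml2)

# Hypothetical, simplified article XML; real publisher XML is far more varied.
doc <- read_xml('<article>
  <front>
    <article-meta>
      <title-group><article-title>An example title</article-title></title-group>
      <abstract><p>An example abstract.</p></abstract>
    </article-meta>
  </front>
</article>')

# Manual extraction: you need to know the path to every element you want.
title <- xml_text(xml_find_first(doc, "//article-meta//article-title"))
abstract <- xml_text(xml_find_first(doc, "//abstract/p"))

title
abstract
```

Each publisher tends to need its own set of paths, which is the bookkeeping that pub_chunks() hides behind section names such as "title" and "abstract".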

Supported publishers/sources

  • eLife
  • PLOS
  • Entrez/Pubmed
  • Elsevier
  • Hindawi
  • Pensoft
  • PeerJ
  • Copernicus
  • Frontiers
  • F1000 Research

If you know of other publishers or sources that provide XML let us know by opening an issue.

We'll continue adding publishers.
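The package API lists pub_providers() and pub_sections(). As a hedged sketch, assuming both can be called without arguments and that pubchunks is installed, they can be used to check what the installed version supports:

```r
# Guarded so the snippet does nothing harmful when pubchunks is unavailable.
if (requireNamespace("pubchunks", quietly = TRUE)) {
  # Publishers/sources known to the installed version (assumed no-argument call)
  print(pubchunks::pub_providers())
  # Section names that can be requested from pub_chunks() (assumed no-argument call)
  print(pubchunks::pub_sections())
} else {
  message("pubchunks is not installed; skipping")
}
```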

Installation

Stable version

```r
install.packages("pubchunks")
```

Development version from GitHub

```r
remotes::install_github("ropensci/pubchunks")
```

Load library

```r
library("pubchunks")
```

Working with files

```r
x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml", package = "pubchunks")
```

```r
pub_chunks(x, "abstract")
#>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: abstract
#> showing up to first 5:
#> abstract (n=1): Abstract
#>
#> This pa ...

pub_chunks(x, "title")
#>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...

pub_chunks(x, "authors")
#>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: authors
#> showing up to first 5:
#> authors (n=1): Chetaev, D.N

pub_chunks(x, c("title", "refs"))
#>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title, refs
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
#> refs (n=6): Watson G.N.. 1949. Teoriia besselevykh funktsii. N
```

The output of pub_chunks() is a list with an S3 class pub_chunks to make internal work in the package easier. You can easily see the list structure by using unclass().
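A short sketch of inspecting that structure, assuming pubchunks is installed and ships the example file used above. Only the class name comes from the text; the shape of the underlying list is not asserted here, so str() is used to look at it rather than indexing into specific elements.

```r
# Guarded so the snippet is a no-op when pubchunks is unavailable.
if (requireNamespace("pubchunks", quietly = TRUE)) {
  x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml",
                   package = "pubchunks")
  res <- pubchunks::pub_chunks(x, c("title", "authors"))

  # The S3 class drives the compact print method
  print(class(res))

  # Dropping the class reveals the plain list underneath
  str(unclass(res), max.level = 1)
} else {
  message("pubchunks is not installed; skipping")
}
```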

Working with the xml already in a string

```r
xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")
#>
#> from: character
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
```

Working with xml2 class object

```r
xml <- paste0(readLines(x), collapse = "")
# Parse the string into an xml2 document before passing it on
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")
#>
#> from: xml_document
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
```

Working with output of fulltext::ft_get()

```r
install.packages("fulltext")
```

```r
library("fulltext")
# Fetch the article by DOI, then collect the text before chunking
x <- fulltext::ft_get('10.1371/journal.pone.0086169')
pub_chunks(fulltext::ft_collect(x), sections="authors")
#> $plos
#> $plos$10.1371/journal.pone.0086169
#>
#> from: xml_document
#> publisher/journal: plos/PLoS ONE
#> sections: authors
#> showing up to first 5:
#> authors (n=4): nested list
#>
#>
#> attr(,"ft_data")
#> [1] TRUE
```

Coerce pub_chunks output into a data.frame

```r
x <- system.file("examples/elife_1.xml", package = "pubchunks")
res <- pub_chunks(x, c("doi", "title", "keywords"))
pub_tabularize(res)
#>                   doi                                          title
#> 1 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 2 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 3 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 4 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 5 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 6 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#>                       keywords .publisher
#> 1                     microRNA      elife
#> 2            nonsense mutation      elife
#> 3 nonsense-mediated mRNA decay      elife
#> 4                          APC      elife
#> 5             intron retention      elife
#> 6  premature termination codon      elife
```
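The tabular output repeats the article-level fields once per keyword. As a sketch of the "downstream processing" step, the data frame below is built by hand from the values printed above (so it runs without pubchunks) and is collapsed to one row per article with base R. The choice of aggregate() and the "; " separator are illustrative assumptions, not something the package prescribes.

```r
# Hand-built copy of the pub_tabularize() output shown above
df <- data.frame(
  doi = "10.7554/eLife.03032",
  title = "MicroRNA-mediated repression of nonsense mRNAs",
  keywords = c("microRNA", "nonsense mutation",
               "nonsense-mediated mRNA decay", "APC",
               "intron retention", "premature termination codon"),
  .publisher = "elife",
  stringsAsFactors = FALSE
)

# One row per article, with all keywords joined into a single string
collapsed <- aggregate(keywords ~ doi + title + .publisher, data = df,
                       FUN = paste, collapse = "; ")

nrow(collapsed)
collapsed$keywords
```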

Get a random XML article

```r
library(rcrossref)
library(dplyr)

# Search Crossref for CC-BY 4.0 works that advertise an XML full-text link
res <- cr_works(filter = list(
  full_text_type = "application/xml",
  license_url = "http://creativecommons.org/licenses/by/4.0/"))
links <- bind_rows(res$data$link) %>% filter(content.type == "application/xml")

download.file(links$URL[1], (i <- tempfile(fileext = ".xml")))
pub_chunks(i)
#>
#> from: file
#> publisher/journal: unknown/NA
#> sections: all
#> showing up to first 5:
#> front (n=0):
#> body (n=0):
#> back (n=0):
#> title (n=0):
#> doi (n=0):

download.file(links$URL[13], (j <- tempfile(fileext = ".xml")))
pub_chunks(j)
#>
#> from: file
#> publisher/journal: hindawi/BioMed Research International
#> sections: all
#> showing up to first 5:
#> front (n=2): nested list
#> body (n=49): Oxidative stress and Reactive Oxygen Species (ROS)
#> back (n=4): nested list
#> title (n=1): Selected Enzyme Inhibitory Effects of Euphorbia ch ...
#> doi (n=1): 10.1155/2018/1219367

download.file(links$URL[20], (k <- tempfile(fileext = ".xml")))
pub_chunks(k)
#>
#> from: file
#> publisher/journal: hindawi/Case Reports in Pathology
#> sections: all
#> showing up to first 5:
#> front (n=2): nested list
#> body (n=16): Bonnetti et al. first noted in 1992 an unusual cel
#> back (n=3): nested list
#> title (n=1): An Inguinal Perivascular Epithelioid Cell Tumor Me ...
#> doi (n=1): 10.1155/2018/5749421
```

Meta

  • Please report any issues or bugs.
  • License: MIT
  • Get citation information for pubchunks: citation(package = 'pubchunks')
  • Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Owner

  • Name: rOpenSci Archive
  • Login: ropensci-archive
  • Kind: organization
  • Email: info@ropensci.org

Abandoned rOpenSci projects -- email info@ropensci.org if you have questions!

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 89
  • Total Committers: 3
  • Avg Commits per committer: 29.667
  • Development Distribution Score (DDS): 0.034
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Scott Chamberlain m****s@g****m 86
Maëlle Salmon m****n@y****e 2
rOpenSci Bot m****t@g****m 1

Issues and Pull Requests

Last synced: about 2 years ago

All Time
  • Total issues: 11
  • Total pull requests: 1
  • Average time to close issues: 6 months
  • Average time to close pull requests: 1 minute
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 1.82
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sckott (9)
  • gwern (2)
Pull Request Authors
  • sckott (1)
Top Labels
Issue Labels
bug (1)
Pull Request Labels