pubchunks
:warning: ARCHIVED :warning: Get chunks of XML format scholarly articles
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 10 DOI reference(s) in README
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.9%) to scientific vocabulary
Repository
Statistics
- Stars: 8
- Watchers: 4
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README-not.md
pubchunks
Get chunks of XML articles
Package API
- pub_tabularize
- pub_guess_publisher
- pub_sections
- pub_chunks
- pub_providers
The main workhorse function is pub_chunks(). It allows you to pull out sections of articles from many different publishers (see next section below) WITHOUT having to know how to parse/navigate XML. XML has a steep learning curve, and can require quite a bit of Googling to sort out how to get to different parts of an XML document.
The other main function is pub_tabularize(), which takes the output of pub_chunks() and coerces it into a data.frame for easier downstream processing.
Supported publishers/sources
- eLife
- PLOS
- Entrez/Pubmed
- Elsevier
- Hindawi
- Pensoft
- PeerJ
- Copernicus
- Frontiers
- F1000 Research
If you know of other publishers or sources that provide XML let us know by opening an issue.
We'll continue adding additional publishers.
Installation
Stable version
```r
install.packages("pubchunks")
```
Development version from GitHub
```r
remotes::install_github("ropensci/pubchunks")
```
Load library
```r
library('pubchunks')
```
Working with files
```r
x <- system.file("examples/10_1016_0021_8928_59_90156_x.xml",
  package = "pubchunks")
```

```r
pub_chunks(x, "abstract")
#>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: abstract
#> showing up to first 5:
#> abstract (n=1): Abstract
#>
#> This pa ...
pub_chunks(x, "title")
#>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
pub_chunks(x, "authors")
#>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: authors
#> showing up to first 5:
#> authors (n=1): Chetaev, D.N
pub_chunks(x, c("title", "refs"))
#>
#> from: file
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title, refs
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
#> refs (n=6): Watson G.N.. 1949. Teoriia besselevykh funktsii. N
```
The output of pub_chunks() is a list with an S3 class pub_chunks to make
internal work in the package easier. You can easily see the list structure
by using unclass().
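The effect of unclass() can be illustrated without the package installed. The sketch below uses a made-up S3 class, toy_chunks, and a made-up print method; neither is part of pubchunks, and the real pub_chunks object has its own structure. It only shows the general mechanism: the custom print method gives a compact summary, and unclass() drops the class attribute so the plain list underneath becomes visible.

```r
# Hypothetical stand-in for a classed list; not the real pub_chunks structure.
toy <- structure(
  list(title = "A toy title", authors = "Chetaev, D.N"),
  class = "toy_chunks"
)

# A custom print method hides the list structure behind a short summary.
print.toy_chunks <- function(x, ...) {
  cat("<toy_chunks> sections:", paste(names(x), collapse = ", "), "\n")
  invisible(x)
}

print(toy)            # dispatches to print.toy_chunks()

# unclass() removes the class attribute, leaving an ordinary named list.
plain <- unclass(toy)
str(plain)            # shows the underlying list elements
```

With a real result, the equivalent call would be unclass() on the object returned by pub_chunks(), optionally wrapped in str() to keep the display short.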
Working with the xml already in a string
```r
xml <- paste0(readLines(x), collapse = "")
pub_chunks(xml, "title")
#>
#> from: character
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
```
Working with xml2 class object
```r
xml <- paste0(readLines(x), collapse = "")
xml <- xml2::read_xml(xml)
pub_chunks(xml, "title")
#>
#> from: xml_document
#> publisher/journal: elsevier/Journal of Applied Mathematics and Mechanics
#> sections: title
#> showing up to first 5:
#> title (n=1): On the driving of a piston with a rigid collar int ...
```
Working with output of fulltext::ft_get()
```r
install.packages("fulltext")
```

```r
library("fulltext")
x <- fulltext::ft_get('10.1371/journal.pone.0086169')
pub_chunks(fulltext::ft_collect(x), sections="authors")
#> $plos
#> $plos$10.1371/journal.pone.0086169
#>
#> from: xml_document
#> publisher/journal: plos/PLoS ONE
#> sections: authors
#> showing up to first 5:
#> authors (n=4): nested list
#>
#>
#> attr(,"ft_data")
#> [1] TRUE
```
Coerce pub_chunks output into data.frames
```r
x <- system.file("examples/elife1.xml", package = "pubchunks")
res <- pub_chunks(x, c("doi", "title", "keywords"))
pub_tabularize(res)
#>                   doi                                          title
#> 1 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 2 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 3 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 4 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 5 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#> 6 10.7554/eLife.03032 MicroRNA-mediated repression of nonsense mRNAs
#>                       keywords .publisher
#> 1                     microRNA      elife
#> 2            nonsense mutation      elife
#> 3 nonsense-mediated mRNA decay      elife
#> 4                          APC      elife
#> 5             intron retention      elife
#> 6  premature termination codon      elife
```
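As the output above shows, the tabular form repeats the doi and title once per keyword. For downstream processing it can be handy to collapse that back to one row per article. The following is a minimal sketch in base R; it rebuilds the printed table by hand (so it runs without pubchunks installed), and the variable names tab and per_article are only illustrative.

```r
# Rebuild the table printed above by hand; with the package installed this
# would simply be the result of pub_tabularize(res).
tab <- data.frame(
  doi = "10.7554/eLife.03032",
  title = "MicroRNA-mediated repression of nonsense mRNAs",
  keywords = c("microRNA", "nonsense mutation",
               "nonsense-mediated mRNA decay", "APC",
               "intron retention", "premature termination codon"),
  .publisher = "elife",
  stringsAsFactors = FALSE
)

# Collapse the keyword rows into a single "; "-separated string per article.
per_article <- aggregate(
  keywords ~ doi + title,
  data = tab,
  FUN = paste, collapse = "; "
)

nrow(per_article)        # one row per distinct doi/title pair
per_article$keywords     # all six keywords joined into one string
```

The same pattern should carry over to tables built from several articles, since aggregate() groups by every distinct doi/title combination.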
Get a random XML article
```r
library(rcrossref)
library(dplyr)

res <- cr_works(filter = list(
  full_text_type = "application/xml",
  license_url = "http://creativecommons.org/licenses/by/4.0/"))
links <- bind_rows(res$data$link) %>% filter(content.type == "application/xml")
download.file(links$URL[1], (i <- tempfile(fileext = ".xml")))
pub_chunks(i)
#>
#> from: file
#> publisher/journal: unknown/NA
#> sections: all
#> showing up to first 5:
#> front (n=0):
#> body (n=0):
#> back (n=0):
#> title (n=0):
#> doi (n=0):
download.file(links$URL[13], (j <- tempfile(fileext = ".xml")))
pub_chunks(j)
#>
#> from: file
#> publisher/journal: hindawi/BioMed Research International
#> sections: all
#> showing up to first 5:
#> front (n=2): nested list
#> body (n=49): Oxidative stress and Reactive Oxygen Species (ROS)
#> back (n=4): nested list
#> title (n=1): Selected Enzyme Inhibitory Effects of Euphorbia ch ...
#> doi (n=1): 10.1155/2018/1219367
download.file(links$URL[20], (k <- tempfile(fileext = ".xml")))
pub_chunks(k)
#>
#> from: file
#> publisher/journal: hindawi/Case Reports in Pathology
#> sections: all
#> showing up to first 5:
#> front (n=2): nested list
#> body (n=16): Bonnetti et al. first noted in 1992 an unusual cel
#> back (n=3): nested list
#> title (n=1): An Inguinal Perivascular Epithelioid Cell Tumor Me ...
#> doi (n=1): 10.1155/2018/5749421
```
Meta
- Please report any issues or bugs.
- License: MIT
- Get citation information for pubchunks: citation(package = 'pubchunks')
- Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Owner
- Name: rOpenSci Archive
- Login: ropensci-archive
- Kind: organization
- Email: info@ropensci.org
- Website: ropensci.org
- Repositories: 259
- Profile: https://github.com/ropensci-archive
Abandoned rOpenSci projects -- email info@ropensci.org if you have questions!
Committers
Last synced: about 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Scott Chamberlain | m****s@g****m | 86 |
| Maëlle Salmon | m****n@y****e | 2 |
| rOpenSci Bot | m****t@g****m | 1 |
Issues and Pull Requests
Last synced: about 2 years ago
All Time
- Total issues: 11
- Total pull requests: 1
- Average time to close issues: 6 months
- Average time to close pull requests: 1 minute
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 1.82
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- sckott (9)
- gwern (2)
Pull Request Authors
- sckott (1)