parsel

parallel execution of RSelenium

https://github.com/till-tietz/parsel

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file (found)
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity (low: 18.4%)

Keywords

cran parallel r rselenium web-scraping
Last synced: 6 months ago

Repository

parallel execution of RSelenium

Basic Info
  • Host: GitHub
  • Owner: till-tietz
  • License: other
  • Language: R
  • Default Branch: master
  • Homepage:
  • Size: 1.16 MB
Statistics
  • Stars: 14
  • Watchers: 3
  • Forks: 3
  • Open Issues: 0
  • Releases: 4
Topics
cran parallel r rselenium web-scraping
Created about 5 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog License

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# parsel


[![CRAN status](https://www.r-pkg.org/badges/version/parsel)](https://CRAN.R-project.org/package=parsel)
[![License: MIT](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/license/mit/)
![](https://cranlogs.r-pkg.org/badges/grand-total/parsel?color)


`parsel` is a framework for parallelized dynamic web scraping using `RSelenium`. Leveraging parallel processing, it lets you run any `RSelenium` web-scraping routine on multiple browser instances simultaneously, greatly increasing the efficiency of your scraping. `parsel` uses chunked input processing as well as error catching and logging to ensure seamless execution of your scraping routine and minimal data loss, even in the presence of unforeseen `RSelenium` errors.
`parsel` additionally provides convenient wrapper functions around `RSelenium` methods that let you quickly generate safe scraping code with minimal coding on your end.

## Installation

``` r
# Install parsel from CRAN
install.packages("parsel")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("till-tietz/parsel")
```
## Usage 

### Parallel Scraping 

The following example illustrates the functionality of `parsel` and the ideas behind how it operates.
We'll set up the following scraping job:

1. navigate to a random Wikipedia article 
2. retrieve its title 
3. navigate to the first linked page on the article 
4. retrieve the linked page's title and first section 

and parallelize it with `parsel`.

`parsel` requires two things: 

1. a scraping function defining the actions to be executed in each `RSelenium` instance. These actions should be written in conventional `RSelenium` syntax, with `remDr$` specifying the remote driver.
2. some input `x` to those actions (e.g. search terms to be entered in search boxes, or links to navigate to)

```{r, eval = FALSE}
library(RSelenium)
library(parsel)

#let's define our scraping function input 
#we want to run our function 4 times, starting on the Wikipedia main page each time
input <- rep("https://de.wikipedia.org",4)

#let's define our scraping function 

get_wiki_text <- function(x){
  input_i <- x
  
  #navigate to the input page (i.e. Wikipedia)
  remDr$navigate(input_i)
  
  #find and click random article 
  rand_art <- remDr$findElement(using = "id", "n-randompage")$clickElement()
  
  #get random article title 
  title <- remDr$findElement(using = "id", "firstHeading")$getElementText()[[1]]
  
  #check if there is a linked page
  link_exists <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]"))
  
  #if no linked page fill output with NA
  if(is(link_exists,"try-error")){
    first_link_title <- NA
    first_link_text <- NA
    
    #if there is a linked page
  } else {
    #click on link
    link <- remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]")$clickElement()
    
    #get link page title
    first_link_title <- try(remDr$findElement(using = "id", "firstHeading"))
    if(is(first_link_title,"try-error")){
      first_link_title <- NA
    }else{
      first_link_title <- first_link_title$getElementText()[[1]]
    }
    
    #get 1st section of link page
    first_link_text <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]"))
    if(is(first_link_text,"try-error")){
      first_link_text <- NA
    }else{
      first_link_text <- first_link_text$getElementText()[[1]]
    }
  }
  out <- data.frame("random_article" = title,
                    "first_link_title" = first_link_title,
                    "first_link_text" = first_link_text)
  return(out)
}
```

Now that we have our scrape function and input, we can parallelize the execution of the function.
For speed and efficiency, it is advisable to specify the headless browser option in the `extraCapabilities` argument.
`parscrape` will show a progress bar, as well as elapsed and estimated remaining time, so you can keep track of scraping progress.

```{r, results = 'hide', warning = FALSE, eval = FALSE}
wiki_text <- parsel::parscrape(scrape_fun = get_wiki_text,
                               scrape_input = input,
                               cores = 2,
                               packages = c("RSelenium","XML"),
                               browser = "firefox",
                               scrape_tries = 1,
                               extraCapabilities = list(
                                     "moz:firefoxOptions" = list(args = list('--headless'))
                                      ))
```

`parscrape` returns a list with two elements:

1. a list of your scrape function output 
2. a data.frame of inputs it was unable to scrape, and the associated error messages 
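
For convenience, the two elements can be unpacked along these lines (a minimal sketch; the elements are indexed by position here, so check `names()` of the return value on your own run):

```{r, eval = FALSE}
# bind the per-input data.frames returned by get_wiki_text() into one
scraped <- do.call(rbind, wiki_text[[1]])

# inputs parscrape was unable to scrape, with the associated error messages
failed <- wiki_text[[2]]
```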


### RSelenium Constructors 

`parsel` allows you to generate safe scraping code with minimal hassle by composing `constructor` functions, which act as wrappers around `RSelenium` methods, in a pipe. You can return a scraper function defined by the `constructors` to the environment by starting your pipe with `start_scraper()` and ending it with `build_scraper()`. Alternatively, you can dump the code generated by your `constructor` pipe to the console via `show()`.
We'll reproduce a slightly stripped-down version of the `RSelenium` code from the above Wikipedia scraping routine via the `parsel` `constructor` functions.

```{r, warning = FALSE, message = FALSE}
library(parsel)

# returning a scraper function 
start_scraper(args = "x", name = "get_wiki_text") %>>%
  go(url = "x") %>>% 
  click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
  click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
  get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
  build_scraper()

ls()  

# dumping generated code to console 
go(url = "x") %>>%
  click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
  click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
  get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
  show()
  

```
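
Since the first pipe started with `start_scraper(args = "x", name = "get_wiki_text")`, `build_scraper()` returns a `get_wiki_text()` function to the environment, which can then be handed to `parscrape()` like the hand-written version (a minimal sketch reusing the `input` vector from above):

```{r, eval = FALSE}
# the constructor-built get_wiki_text() plugs into parscrape()
# just like the hand-written function above
wiki_text <- parsel::parscrape(scrape_fun = get_wiki_text,
                               scrape_input = input,
                               cores = 2,
                               packages = c("RSelenium"),
                               browser = "firefox",
                               scrape_tries = 1)
```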

Owner

  • Name: Till Tietz
  • Login: till-tietz
  • Kind: user
  • Location: Berlin
  • Company: WZB-IPI, LSHTM, Humboldt Universität Berlin

Statistics consulting @ LSHTM; computational social science and methods research @ WZB-IPI & HU-Berlin

GitHub Events

Total
  • Watch event: 1
  • Push event: 1
Last Year
  • Watch event: 1
  • Push event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 105
  • Total Committers: 2
  • Avg Commits per committer: 52.5
  • Development Distribution Score (DDS): 0.057
Past Year
  • Commits: 13
  • Committers: 2
  • Avg Commits per committer: 6.5
  • Development Distribution Score (DDS): 0.077
Top Committers
  • till-tietz (t****4@g****m): 99 commits
  • Till Tietz (6****z): 6 commits

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 6
  • Total pull requests: 33
  • Average time to close issues: 3 months
  • Average time to close pull requests: 16 minutes
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 33
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • till-tietz (4)
  • julienOlivier3 (1)
  • KanKuno (1)
Pull Request Authors
  • till-tietz (33)
Top Labels
Issue Labels
  • enhancement (2)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran: 300 last month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: parsel

Parallel Dynamic Web-Scraping Using 'RSelenium'

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 300 last month
Rankings
  • Forks count: 14.9%
  • Stargazers count: 15.6%
  • Average: 27.3%
  • Dependent packages count: 29.8%
  • Dependent repos count: 35.5%
  • Downloads: 40.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION (cran)
  • RSelenium (*): imports
  • lubridate (>= 1.7.9): imports
  • methods (>= 3.3.1): imports
  • parallel (>= 3.6.2): imports
  • purrr (>= 0.3.4): imports
  • rlang (*): imports
  • utils (>= 2.10.1): imports
  • covr (>= 3.5.1): suggests
  • knitr (*): suggests
  • rmarkdown (*): suggests
  • testthat (>= 3.0.0): suggests
.github/workflows/R-CMD-check.yaml (actions)
  • actions/checkout (v2): composite
  • actions/upload-artifact (main): composite
  • r-lib/actions/check-r-package (v1): composite
  • r-lib/actions/setup-pandoc (v1): composite
  • r-lib/actions/setup-r (v1): composite
  • r-lib/actions/setup-r-dependencies (v1): composite