rvest

Simple web scraping for R

https://github.com/tidyverse/rvest

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 32 committers (3.1%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.5%) to scientific vocabulary

Keywords

html r web-scraping

Keywords from Contributors

grammar data-manipulation tidy-data rmarkdown curl pandoc parsing fwf csv package-creation
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 1,506
  • Watchers: 88
  • Forks: 348
  • Open Issues: 30
  • Releases: 14
Topics
html r web-scraping
Created over 11 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Code of conduct Codeowners Support

README.Rmd

---
output: github_document
---



```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE, 
  comment = "#>",
  fig.path = "README-"  
)
```

# rvest



[![CRAN status](https://www.r-pkg.org/badges/version/rvest)](https://cran.r-project.org/package=rvest)
[![R-CMD-check](https://github.com/tidyverse/rvest/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/rvest/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/tidyverse/rvest/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/rvest)


## Overview

rvest helps you scrape (or harvest) data from web pages.
It is designed to work with [magrittr](https://github.com/tidyverse/magrittr) to make it easy to express common web scraping tasks, inspired by libraries like [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) and [RoboBrowser](http://robobrowser.readthedocs.io/en/latest/readme.html).

If you're scraping multiple pages, I highly recommend using rvest in concert with [polite](https://dmi3kno.github.io/polite/).
The polite package ensures that you're respecting the [robots.txt](https://en.wikipedia.org/wiki/Robots_exclusion_standard) and not hammering the site with too many requests.
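
As a rough sketch of how the two packages can fit together, the helper below wraps polite's `bow()` and `scrape()` around an rvest selector. The function name `scrape_sections()` and the five second `delay` are assumptions made for illustration; they are not part of either package.

```r
library(rvest)

# Hypothetical helper (not part of rvest or polite): fetch one page politely
# and return its <section> elements.
scrape_sections <- function(url, delay = 5) {
  # bow() introduces the scraper to the host and checks its robots.txt
  session <- polite::bow(url, delay = delay)

  # scrape() fetches the page while rate-limiting requests to the host
  page <- polite::scrape(session)

  # Assumption: scrape() yields NULL when the host disallows scraping
  if (is.null(page)) {
    return(NULL)
  }

  html_elements(page, "section")
}

# Requires network access, so it is left commented out here:
# films <- scrape_sections("https://rvest.tidyverse.org/articles/starwars.html")
```

Keeping the `bow()` call inside the helper means every page goes through the same robots.txt check before any rvest code touches it.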

## Installation

```{r, eval = FALSE}
# The easiest way to get rvest is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just rvest:
install.packages("rvest")
```

## Usage

```{r, message = FALSE}
library(rvest)

# Start by reading an HTML page with read_html():
starwars <- read_html("https://rvest.tidyverse.org/articles/starwars.html")

# Then find elements that match a css selector or XPath expression
# using html_elements(). In this example, each <section> corresponds
# to a different film
films <- starwars |> html_elements("section")
films

# Then use html_element() to extract one element per film. Here
# the title is given by the text inside <h2>
title <- films |>
  html_element("h2") |>
  html_text2()
title

# Or use html_attr() to get data out of attributes. html_attr() always
# returns a string so we convert it to an integer using a readr function
episode <- films |>
  html_element("h2") |>
  html_attr("data-id") |>
  readr::parse_integer()
episode
```

If the page contains tabular data you can convert it directly to a data frame with `html_table()`:

```{r}
html <- read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")

html |>
  html_element(".tracklist") |>
  html_table()
```
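
The same functions can be tried without a network connection. The sketch below uses `minimal_html()` to build a tiny page from a string; the markup, including the `crew` class, is made up for illustration and only loosely mirrors the Star Wars page. It shows how `html_element()` and `html_elements()` differ when some films lack a matching node.

```r
library(rvest)

# Illustrative markup only: two films, the second without a <p>,
# plus a small table with a hypothetical "crew" class.
page <- minimal_html('
  <section><h2 data-id="1">The Phantom Menace</h2><p>Released: 1999-05-19</p></section>
  <section><h2 data-id="2">Attack of the Clones</h2></section>
  <table class="crew">
    <tr><th>Film</th><th>Director</th></tr>
    <tr><td>The Phantom Menace</td><td>George Lucas</td></tr>
  </table>
')

films <- page |> html_elements("section")

# html_element() gives one result per film, with NA where nothing matches
released <- films |>
  html_element("p") |>
  html_text2()

# html_elements() keeps only the matches that exist, so the result is shorter
paragraphs <- films |>
  html_elements("p") |>
  html_text2()

# html_attr() returns strings; base R's as.integer() converts them here
episode <- films |>
  html_element("h2") |>
  html_attr("data-id") |>
  as.integer()

# html_table() turns the <table> into a data frame, using the <th> row as names
crew <- page |>
  html_element(".crew") |>
  html_table()
```

Because `html_element()` preserves the length of its input, `released` lines up with `episode` and can sit beside it in a data frame, which is not true of `paragraphs`.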

Owner

  • Name: tidyverse
  • Login: tidyverse
  • Kind: organization

The tidyverse is a collection of R packages that share common principles and are designed to work together seamlessly

GitHub Events

Total
  • Issues event: 30
  • Watch event: 30
  • Delete event: 5
  • Issue comment event: 46
  • Push event: 16
  • Pull request review event: 1
  • Pull request event: 13
  • Fork event: 12
  • Create event: 2
Last Year
  • Issues event: 30
  • Watch event: 30
  • Delete event: 5
  • Issue comment event: 46
  • Push event: 16
  • Pull request review event: 1
  • Pull request event: 13
  • Fork event: 12
  • Create event: 2

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 427
  • Total Committers: 32
  • Avg Commits per committer: 13.344
  • Development Distribution Score (DDS): 0.126
Past Year
  • Commits: 2
  • Committers: 2
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.5
Top Committers
Name Email Commits
Hadley Wickham h****m@g****m 373
Mara Averick m****k@g****m 6
john collins j****s@g****m 5
Dmytro Perepolkin d****n@g****m 4
Benjamin Skov Kaas-Hansen e****n@h****m 3
Will May w****y@l****m 2
Hiroaki Yutani y****i@g****m 2
jrnold j****d@g****m 2
jjchern j****n@g****m 2
Kun Ren k****n@r****e 2
Jim Hester j****r@g****m 2
Jamie Lendrum j****m@g****m 2
Eduardo Ariño de la Rubia e****o@g****m 2
David Holstius d****s@g****m 2
vtroost 3****t 1
moody_mudskipper a****i@g****m 1
leledavid l****d@g****m 1
Z_Wael z****s@g****m 1
William Doane w****l@D****m 1
Sam s****e 1
Raymond 3****t 1
Michael Chirico m****4@g****m 1
Matt Cowgill m****l@g****m 1
Marcin Kosiński k****m@s****l 1
Luis Verde Arregoitia l****d@c****x 1
Brent Brewington b****n@g****m 1
Charlotte Wickham c****m@g****m 1
Craig Citro c****o@g****m 1
Daniel Possenriede p****e@g****m 1
Josh Duncan j****d@g****m 1
and 2 more...

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 134
  • Total pull requests: 62
  • Average time to close issues: 10 months
  • Average time to close pull requests: 5 months
  • Total issue authors: 90
  • Total pull request authors: 22
  • Average comments per issue: 1.5
  • Average comments per pull request: 1.24
  • Merged pull requests: 40
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 18
  • Pull requests: 13
  • Average time to close issues: about 2 hours
  • Average time to close pull requests: 1 day
  • Issue authors: 17
  • Pull request authors: 6
  • Average comments per issue: 0.17
  • Average comments per pull request: 0.85
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • hadley (36)
  • geotheory (3)
  • petrbouchal (3)
  • davidrsch (2)
  • kjschaudt (2)
  • alireza5969 (2)
  • epiben (2)
  • OlexiyPukhov (2)
  • qpmnguyen (1)
  • jeroenjanssens (1)
  • romainfrancois (1)
  • MattCowgill (1)
  • litao1105 (1)
  • jubilee2 (1)
  • cregouby (1)
Pull Request Authors
  • hadley (22)
  • jonthegeek (4)
  • epiben (4)
  • MichaelChirico (3)
  • jeroen (3)
  • luisDVA (3)
  • jrosell (2)
  • SermetPekin (2)
  • shikokuchuo (2)
  • david-jankoski (2)
  • MattCowgill (2)
  • VisruthSK (2)
  • ZWael (2)
  • vtroost (1)
  • HayesJohnD (1)
Top Labels
Issue Labels
  • feature (18)
  • table 🏓 (11)
  • bug (7)
  • documentation (6)
  • form 🧾 (5)
  • upkeep (4)
  • live 🐤 (3)
  • reprex (2)
  • help wanted ❤️ (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • cran 660,454 last-month
  • Total docker downloads: 45,956,794
  • Total dependent packages: 284
    (may contain duplicates)
  • Total dependent repositories: 1,350
    (may contain duplicates)
  • Total versions: 30
  • Total maintainers: 1
cran.r-project.org: rvest

Easily Harvest (Scrape) Web Pages

  • Versions: 15
  • Dependent Packages: 284
  • Dependent Repositories: 1,350
  • Downloads: 660,454 Last month
  • Docker Downloads: 45,956,794
Rankings
Forks count: 0.1%
Stargazers count: 0.1%
Dependent repos count: 0.3%
Downloads: 0.4%
Dependent packages count: 0.4%
Average: 3.1%
Docker downloads count: 17.3%
Maintainers (1)
Last synced: 6 months ago
proxy.golang.org: github.com/tidyverse/rvest
  • Versions: 15
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.5%
Average: 5.7%
Dependent repos count: 5.9%
Last synced: 6 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action 4.1.4 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pr-commands.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/pr-fetch v2 composite
  • r-lib/actions/pr-push v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 3.2 depends
  • cli * imports
  • glue * imports
  • httr >= 0.5 imports
  • lifecycle >= 1.0.3 imports
  • magrittr * imports
  • rlang >= 1.0.0 imports
  • selectr * imports
  • tibble * imports
  • withr * imports
  • xml2 >= 1.3 imports
  • covr * suggests
  • knitr * suggests
  • readr * suggests
  • repurrrsive * suggests
  • rmarkdown * suggests
  • spelling * suggests
  • stringi >= 0.3.1 suggests
  • testthat >= 3.0.2 suggests
  • webfakes * suggests