adaR

:computer: wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++

https://github.com/gesistsa/adar

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.1%) to scientific vocabulary

Keywords

r rstats rstats-package url-parser
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 26
  • Watchers: 4
  • Forks: 4
  • Open Issues: 6
  • Releases: 7
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Citation

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# adaR 


[![R-CMD-check](https://github.com/gesistsa/adaR/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/gesistsa/adaR/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/adaR)](https://CRAN.R-project.org/package=adaR)
[![CRAN Downloads](https://cranlogs.r-pkg.org/badges/adaR)](https://CRAN.R-project.org/package=adaR)
[![Codecov test coverage](https://codecov.io/gh/gesistsa/adaR/branch/main/graph/badge.svg)](https://app.codecov.io/gh/gesistsa/adaR?branch=main)
[![ada-url Version](https://img.shields.io/badge/ada_url-3.2.2-blue)](https://github.com/ada-url/ada)


adaR is a wrapper for [ada-url](https://github.com/ada-url/ada), a
[WHATWG](https://url.spec.whatwg.org/#url-parsing)-compliant and fast URL parser written in modern C++.

It also implements several auxiliary functions for working with URLs:

- public suffix extraction (top-level domain excluding private domains) like [psl](https://github.com/hrbrmstr/psl)
- fast C++ implementation of `utils::URLdecode` (~40x speedup)
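
As a rough illustration of the decoding helper, the sketch below compares base R's `utils::URLdecode` with adaR's decoder. It assumes that adaR is installed and that the exported decoder is named `url_decode2()`; the sample string is made up for this example.

```r
library(adaR)

# Made-up percent-encoded input for illustration
encoded <- "https://example.org/search?q=hello%20world%21"

# Base R decoder
base_decoded <- utils::URLdecode(encoded)

# adaR's C++ decoder (assumed name: url_decode2)
ada_decoded <- url_decode2(encoded)

# For simple ASCII input like this, both decoders should agree
identical(base_decoded, ada_decoded)
```

The speedup quoted above matters mostly for long vectors of URLs; for a single string the difference is unlikely to be noticeable.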

More general information on URL parsing can be found in the introductory vignette via `vignette("adaR")`.

`adaR` is part of a series of R packages to analyse webtracking data:

- [webtrackR](https://github.com/gesistsa/webtrackR): preprocess raw webtracking data
- [domainator](https://github.com/schochastics/domainator): classify domains
- [adaR](https://github.com/gesistsa/adaR): parse urls

## Installation

You can install the development version of adaR from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("gesistsa/adaR")
```

The version on CRAN can be installed with:
```r
install.packages("adaR")
```

## Example

This is a basic example which shows all the returned components of a URL.

```{r example}
library(adaR)
ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag")
```

```c++
  /*
   * https://user:pass@example.com:1234/foo/bar?baz#quux
   *       |     |    |          | ^^^^|       |   |
   *       |     |    |          | |   |       |   `----- hash_start
   *       |     |    |          | |   |       `--------- search_start
   *       |     |    |          | |   `----------------- pathname_start
   *       |     |    |          | `--------------------- port
   *       |     |    |          `----------------------- host_end
   *       |     |    `---------------------------------- host_start
   *       |     `--------------------------------------- username_end
   *       `--------------------------------------------- protocol_end
   */
```
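
The components marked in the diagram come back as columns of a data frame. The following sketch shows one way to access them; it assumes the column names `protocol`, `hostname`, `port`, `pathname`, `search` and `hash`, which mirror the WHATWG component names.

```r
library(adaR)

url <- "https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"
parsed <- ada_url_parse(url)

# One row per input URL; each component is a column
parsed$protocol
parsed$hostname
parsed$port
parsed$pathname  # the parser normalises "/dir/../api"
parsed$search
parsed$hash
```

Because the input can be a character vector, the same column access works unchanged for many URLs at once.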

adaR also solves some problems that urltools has with more complex URLs.
```{r better}
urltools::url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")

ada_url_parse("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
```

A "raw" URL parse using ada is extremely fast (see [ada-url.com](https://www.ada-url.com/)), but carrying that speed over to R is tricky.
The performance is still comparable to `urltools::url_parse`, with the noted advantage in accuracy in some
practical circumstances.

```{r faster}
bench::mark(
  ada = ada_url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag", decode = FALSE),
  urltools = urltools::url_parse("https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"),
  check = FALSE
)
```

For further benchmark results, see `benchmark.md` in `data_raw`.

Four more groups of functions are available for working with URL components:

- `ada_get_*()` get a specific component
- `ada_has_*()` check if a specific component is present
- `ada_set_*()` set a specific component of URLs
- `ada_clear_*()` remove a specific component from URLs
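
A short sketch of one function from each group follows. The specific functions chosen (`ada_get_hostname()`, `ada_has_port()`, `ada_set_hostname()`, `ada_clear_hash()`) and the replacement host `example.net` are picked for illustration only; the package reference lists the full set.

```r
library(adaR)

url <- "https://user_1:password_1@example.org:8080/dir/../api?q=1#frag"

# ada_get_*(): extract one component
host <- ada_get_hostname(url)

# ada_has_*(): test whether a component is present
has_port <- ada_has_port(url)

# ada_set_*(): return the URL with one component replaced
with_new_host <- ada_set_hostname(url, "example.net")

# ada_clear_*(): return the URL with one component removed
without_hash <- ada_clear_hash(url)
```

The setters and clearers return modified URLs rather than changing `url` in place, so the results can be passed straight back into the getters.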

## Public Suffix extraction

`public_suffix()` extracts the top-level domain of each URL using the [public suffix list](https://publicsuffix.org/), **excluding** private domains.

```{r public_suffix}
urls <- c(
  "https://subsub.sub.domain.co.uk",
  "https://domain.api.gov.uk",
  "https://thisisnotpart.butthisispartoftheps.kawasaki.jp"
)
public_suffix(urls)
```

If you are wondering about the last URL: the list also contains wildcard suffixes such as `*.kawasaki.jp`, which need to be matched.
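
To make the wildcard case concrete, here is a minimal base R sketch of longest-suffix matching with wildcard rules. It is not adaR's implementation: the rule set is a tiny hypothetical excerpt, `match_suffix()` is a made-up helper, and exception rules (entries starting with `!`) are ignored.

```r
# Hypothetical, heavily reduced rule set; the real list is far longer.
rules <- c("uk", "co.uk", "gov.uk", "jp", "*.kawasaki.jp")

match_suffix <- function(host, rules) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Try candidate suffixes from longest to shortest so the longest rule wins.
  for (i in seq_len(n)) {
    candidate <- paste(labels[i:n], collapse = ".")
    # The same candidate with its leftmost label replaced by a wildcard.
    wildcard <- paste(c("*", labels[-seq_len(i)]), collapse = ".")
    if (candidate %in% rules || wildcard %in% rules) {
      return(candidate)
    }
  }
  NA_character_
}

match_suffix("thisisnotpart.butthisispartoftheps.kawasaki.jp", rules)
match_suffix("subsub.sub.domain.co.uk", rules)
```

In this sketch the wildcard `*.kawasaki.jp` absorbs exactly one extra label, which is why the matched suffix for the first host has three labels rather than two.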


## Acknowledgement

The logo is created from [this portrait](https://commons.wikimedia.org/wiki/File:Ada_Lovelace_portrait.jpg) of [Ada Lovelace](https://de.wikipedia.org/wiki/Ada_Lovelace), a very early pioneer in Computer Science.

Owner

  • Name: Transparent Social Analytics
  • Login: gesistsa
  • Kind: organization
  • Location: Germany

Open Science Tools maintained by Transparent Social Analytics Team, GESIS

Citation (CITATION.cff)

# --------------------------------------------
# CITATION file created with {cffr} R package
# See also: https://docs.ropensci.org/cffr/
# --------------------------------------------
 
cff-version: 1.2.0
message: 'To cite package "adaR" in publications use:'
type: software
license: MIT
title: 'adaR: A Fast ''WHATWG'' Compliant URL Parser'
version: 0.3.2
abstract: A wrapper for 'ada-url', a 'WHATWG' compliant and fast URL parser written
  in modern 'C++'. Also contains auxiliary functions such as a public suffix extractor.
authors:
- family-names: Schoch
  given-names: David
  email: david@schochastics.net
  orcid: https://orcid.org/0000-0003-2952-4812
- family-names: Chan
  given-names: Chung-hong
  email: chainsawtiney@gmail.com
  orcid: https://orcid.org/0000-0002-6232-7530
repository: https://CRAN.R-project.org/package=adaR
repository-code: https://github.com/gesistsa/adaR
url: https://gesistsa.github.io/adaR/
contact:
- family-names: Schoch
  given-names: David
  email: david@schochastics.net
  orcid: https://orcid.org/0000-0003-2952-4812
keywords:
- r
- rstats
- rstats-package
- url-parser
references:
- type: software
  title: Rcpp
  abstract: 'Rcpp: Seamless R and C++ Integration'
  notes: LinkingTo
  url: https://www.rcpp.org
  repository: https://CRAN.R-project.org/package=Rcpp
  authors:
  - family-names: Eddelbuettel
    given-names: Dirk
  - family-names: Francois
    given-names: Romain
  - family-names: Allaire
    given-names: JJ
  - family-names: Ushey
    given-names: Kevin
  - family-names: Kou
    given-names: Qiang
  - family-names: Russell
    given-names: Nathan
  - family-names: Ucar
    given-names: Inaki
  - family-names: Bates
    given-names: Douglas
  - family-names: Chambers
    given-names: John
  year: '2024'
- type: software
  title: triebeard
  abstract: 'triebeard: ''Radix'' Trees in ''Rcpp'''
  notes: Imports
  url: https://github.com/Ironholds/triebeard/
  repository: https://CRAN.R-project.org/package=triebeard
  authors:
  - family-names: Keyes
    given-names: Os
  - family-names: Schmidt
    given-names: Drew
  - family-names: Takano
    given-names: Yuuki
  year: '2024'
- type: software
  title: knitr
  abstract: 'knitr: A General-Purpose Package for Dynamic Report Generation in R'
  notes: Suggests
  url: https://yihui.org/knitr/
  repository: https://CRAN.R-project.org/package=knitr
  authors:
  - family-names: Xie
    given-names: Yihui
    email: xie@yihui.name
    orcid: https://orcid.org/0000-0003-0645-5666
  year: '2024'
- type: software
  title: rmarkdown
  abstract: 'rmarkdown: Dynamic Documents for R'
  notes: Suggests
  url: https://pkgs.rstudio.com/rmarkdown/
  repository: https://CRAN.R-project.org/package=rmarkdown
  authors:
  - family-names: Allaire
    given-names: JJ
    email: jj@posit.co
  - family-names: Xie
    given-names: Yihui
    email: xie@yihui.name
    orcid: https://orcid.org/0000-0003-0645-5666
  - family-names: Dervieux
    given-names: Christophe
    email: cderv@posit.co
    orcid: https://orcid.org/0000-0003-4474-2498
  - family-names: McPherson
    given-names: Jonathan
    email: jonathan@posit.co
  - family-names: Luraschi
    given-names: Javier
  - family-names: Ushey
    given-names: Kevin
    email: kevin@posit.co
  - family-names: Atkins
    given-names: Aron
    email: aron@posit.co
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  - family-names: Cheng
    given-names: Joe
    email: joe@posit.co
  - family-names: Chang
    given-names: Winston
    email: winston@posit.co
  - family-names: Iannone
    given-names: Richard
    email: rich@posit.co
    orcid: https://orcid.org/0000-0003-3925-190X
  year: '2024'
- type: software
  title: testthat
  abstract: 'testthat: Unit Testing for R'
  notes: Suggests
  url: https://testthat.r-lib.org
  repository: https://CRAN.R-project.org/package=testthat
  authors:
  - family-names: Wickham
    given-names: Hadley
    email: hadley@posit.co
  year: '2024'
  version: '>= 3.0.0'
- type: software
  title: 'R: A Language and Environment for Statistical Computing'
  notes: Depends
  url: https://www.R-project.org/
  authors:
  - name: R Core Team
  institution:
    name: R Foundation for Statistical Computing
    address: Vienna, Austria
  year: '2024'
  version: '>= 4.2'

GitHub Events

Total
  • Create event: 2
  • Release event: 1
  • Issues event: 4
  • Watch event: 1
  • Issue comment event: 2
  • Push event: 15
  • Pull request review event: 2
  • Pull request event: 8
  • Fork event: 2
Last Year
  • Create event: 2
  • Release event: 1
  • Issues event: 4
  • Watch event: 1
  • Issue comment event: 2
  • Push event: 15
  • Pull request review event: 2
  • Pull request event: 8
  • Fork event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 40
  • Total pull requests: 42
  • Average time to close issues: 4 days
  • Average time to close pull requests: about 4 hours
  • Total issue authors: 7
  • Total pull request authors: 4
  • Average comments per issue: 2.5
  • Average comments per pull request: 1.29
  • Merged pull requests: 39
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 7
  • Average time to close issues: about 7 hours
  • Average time to close pull requests: about 4 hours
  • Issue authors: 3
  • Pull request authors: 3
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.0
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • schochastics (25)
  • chainsawriot (5)
  • DyfanJones (1)
  • ArthurMuehl (1)
  • Fluke95 (1)
  • cbpuschmann (1)
  • JBGruber (1)
Pull Request Authors
  • chainsawriot (22)
  • schochastics (19)
  • DyfanJones (2)
  • ArthurMuehl (1)
Top Labels
Issue Labels
0.2.0 (6) 0.3.0 (4) feature? (3) bug (3) 0.4.0 (1) 0.1.0 (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 352 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 1
  • Total versions: 6
  • Total maintainers: 1
cran.r-project.org: adaR

A Fast 'WHATWG' Compliant URL Parser

  • Versions: 6
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 352 Last month
Rankings
Stargazers count: 11.0%
Forks count: 17.1%
Average: 23.3%
Dependent repos count: 24.0%
Dependent packages count: 28.7%
Downloads: 35.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.4.1 composite
  • actions/checkout v3 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 4.2 depends
  • Rcpp * imports
  • triebeard * imports
  • knitr * suggests
  • rmarkdown * suggests
  • testthat >= 3.0.0 suggests