Rdataretriever

Rdataretriever: R Interface to the Data Retriever - Published in JOSS (2021)

https://github.com/ropensci/rdataretriever

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 4 DOI reference(s) in README and JOSS metadata
✓
Academic publication links
Links to: joss.theoj.org, zenodo.org
○
Committers with academic emails
○
Institutional organization owner
✓
JOSS paper metadata
Published in Journal of Open Source Software

Keywords

data data-science database datasets r r-package rstats science

Keywords from Contributors

data-retrieval hacktobefest ecology weather-data crypto-currency-exchanges osm-data overpass-api pm25 rti-micropem data60uk

Last synced: 6 months ago

Repository

R interface to the Data Retriever

Basic Info

Host: GitHub
Owner: ropensci
License: other
Language: R
Default Branch: main
Homepage: https://docs.ropensci.org/rdataretriever
Size: 388 KB

Statistics

Stars: 47
Watchers: 12
Forks: 21
Open Issues: 9
Releases: 8

Topics

data data-science database datasets r r-package rstats science

Created almost 12 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License Code of conduct

rdataretriever


R interface to the Data Retriever.

The rdataretriever provides access to cleaned versions of hundreds of commonly used public datasets with a single line of code.

These datasets come from many different sources, and most of them require some cleaning and restructuring prior to analysis. The rdataretriever uses a set of actively maintained recipes for downloading, cleaning, and restructuring these datasets using a combination of the Frictionless Data Specification and custom data cleaning scripts.

The rdataretriever also facilitates the automatic storage of these datasets in a choice of database management systems (PostgreSQL, SQLite, MySQL, MariaDB) or flat file formats (CSV, XML, JSON) for later use and integration with large data analysis pipelines.

The rdataretriever also facilitates reproducible science by providing tools to archive and rerun the precise version of a dataset and associated cleaning steps that was used for a specific analysis.

The rdataretriever handles the work of cleaning, storing, and archiving data so that you can focus on analysis, inference and visualization.

Installation
- Basic Installation (no Python experience needed)
- Advanced Installation for Python Users
Installing Tabular Datasets
Installing Spatial Datasets
Using Docker Containers
Provenance
Acknowledgements

Installation

The rdataretriever is an R wrapper for the Python package, Data Retriever. This means that Python and the retriever Python package need to be installed first.

Basic Installation

If you just want to use the Data Retriever from within R, run the following commands in R. This will create a local Python installation that will only be used by R and install the needed Python package for you.

```R
install.packages('reticulate') # Install R package for interacting with Python
reticulate::install_miniconda() # Install Python
reticulate::py_install('retriever') # Install the Python retriever package
install.packages('rdataretriever') # Install the R package for running the retriever
rdataretriever::get_updates() # Update the available datasets
```

After running these commands restart R.

Advanced Installation for Python Users

If you are using Python for other tasks you can use rdataretriever with your existing Python installation (though the basic installation above will also work in this case by creating a separate miniconda install and Python environment).

Install the `retriever` Python package

Install the retriever Python package into your preferred Python environment using either conda (64-bit conda is required):

```bash
conda install -c conda-forge retriever
```

or pip:

```bash
pip install retriever
```

Select the Python environment to use in R

rdataretriever will try to find a Python environment that has retriever installed (see the reticulate documentation on order of discovery for more details). Alternatively, you can select a Python environment to use when working with rdataretriever (and other packages using reticulate).

The most robust way to do this is to set the RETICULATE_PYTHON environment variable to point to the preferred Python executable:

```R
Sys.setenv(RETICULATE_PYTHON = "/path/to/python")
```

This command can be run interactively. To make the setting persistent, add the line RETICULATE_PYTHON="/path/to/python" to the .Renviron file in your home directory.
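A minimal sketch, assuming you want to confirm that the variable points at an existing file before reticulate starts Python. The helper name `check_reticulate_python` is hypothetical and uses only base R:

```R
# Hypothetical helper (not part of rdataretriever or reticulate):
# report whether RETICULATE_PYTHON is unset, points at a missing file, or looks fine.
check_reticulate_python <- function() {
  path <- Sys.getenv("RETICULATE_PYTHON", unset = "")
  if (!nzchar(path)) {
    return("unset")
  }
  if (!file.exists(path)) {
    return("missing")
  }
  "ok"
}

status <- check_reticulate_python()
message("RETICULATE_PYTHON status: ", status)
```

A result of "unset" means reticulate will fall back to its own discovery order; "missing" suggests a typo in the configured path.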

Alternatively, you can select the Python environment through the reticulate package for either conda:

```R
library(reticulate)
use_condaenv('name_of_conda_environment')
```

or virtualenv:

```R
library(reticulate)
use_virtualenv("path_to_virtualenv_environment")
```

You can check to see which Python environment is being used with:

```R
py_config()
```

Install the `rdataretriever` R package

```R
install.packages("rdataretriever") # latest release from CRAN
```

```R
remotes::install_github("ropensci/rdataretriever") # development version from GitHub
```

Installing Tabular Datasets

```R
library(rdataretriever)

# List the datasets available via the Retriever
rdataretriever::datasets()

# Install the portal into csv files in your working directory
rdataretriever::install_csv('portal')

# Download the raw portal dataset files without any processing to the
# subdirectory named data
rdataretriever::download('portal', './data/')

# Install and load a dataset as a list
portal = rdataretriever::fetch('portal')
names(portal)
head(portal$species)
```
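Once a dataset has been installed as CSV files, the tables can be read back with base R. The following is a minimal sketch rather than part of the package: the helper name `read_installed_csvs` is hypothetical, and it assumes the files are named `<dataset>_<table>.csv`, which may differ if a custom table name was used. The demo writes a stand-in file to a temporary directory so that it runs without downloading anything.

```R
# Hypothetical helper: read every CSV in `dir` whose name starts with "<dataset>_".
# Assumes the "<dataset>_<table>.csv" naming pattern (an assumption, not confirmed here).
read_installed_csvs <- function(dataset, dir = ".") {
  pattern <- paste0("^", dataset, "_.*\\.csv$")
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
  # Name each element after its table, e.g. "portal_species.csv" -> "species".
  table_names <- sub("\\.csv$", "", basename(files))
  names(tables) <- sub(paste0("^", dataset, "_"), "", table_names)
  tables
}

# Demo with a stand-in file instead of a real install.
demo_dir <- tempfile("portal_demo")
dir.create(demo_dir)
write.csv(
  data.frame(species_id = c("DM", "PP")),
  file.path(demo_dir, "portal_species.csv"),
  row.names = FALSE
)

portal_tables <- read_installed_csvs("portal", demo_dir)
names(portal_tables)
```

After a real `install_csv('portal')`, calling the helper with `dir = "."` would collect whatever matching files are in the working directory.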

Installing Spatial Datasets

Set-up and Requirements

Tools

PostgreSQL with PostGIS, psql (client), raster2pgsql, shp2pgsql, GDAL

The rdataretriever supports installing spatial data into a PostgreSQL database.

Install PostgreSQL and PostGIS

To install PostgreSQL with PostGIS for use with spatial data, please refer to the OSGeo Postgres installation instructions.

We recommend storing your PostgreSQL login information in a .pgpass file to avoid supplying the password every time. See the .pgpass documentation for more details.

After installation, make sure the paths to these tools are added to your system's PATH. Please consult your operating system's documentation for help on how to change or add PATH variables.

For example, this could be a sample of paths exported on Mac:

```shell
# ~/.bash_profile file, Postgres PATHS and tools.
export PATH="/Applications/Postgres.app/Contents/MacOS/bin:${PATH}"
export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/10/bin"
```
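To confirm from R that the command-line tools are visible on the PATH, a minimal sketch using base R's `Sys.which()` could look like this. The vector of tool names is taken from the requirements listed above; GDAL is left out because its executables vary by installation.

```R
# Command-line tools named in the requirements above.
required_tools <- c("psql", "raster2pgsql", "shp2pgsql")

# Sys.which() returns "" for any program it cannot find on the PATH.
tool_paths <- Sys.which(required_tools)
missing_tools <- names(tool_paths)[!nzchar(tool_paths)]

if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
} else {
  message("All required tools were found.")
}
```

Note that R started from a desktop launcher may not inherit the PATH set in a shell profile, so a tool can be reported as missing here even though it works in a terminal.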
Enable PostGIS extensions

If you have Postgres set up, enable the PostGIS extensions by running the following commands from either the psql CLI or a GUI such as pgAdmin.

For the psql CLI:

```shell
psql -d yourdatabase -c "CREATE EXTENSION postgis;"
psql -d yourdatabase -c "CREATE EXTENSION postgis_topology;"
```

For a GUI (pgAdmin):

```sql
CREATE EXTENSION postgis;
CREATE EXTENSION postgis_topology;
```

For more details refer to the PostGIS docs.

Sample commands

```R
rdataretriever::install_postgres('harvard-forest') # Vector data
rdataretriever::install_postgres('bioclim') # Raster data

# Install only the data of USGS elevation in the given extent
rdataretriever::install_postgres('usgs-elevation', bbox = list(-94.98704597353938, 39.027001800158615, -94.3599408119917, 40.69577051867074))
```

Provenance

To ensure reproducibility, the rdataretriever supports creating snapshots of the data and the script at a point in time.

Use the commit function to create and store a snapshot of the data. Provide a descriptive message for the created commit. This is comparable to a git commit; however, the function also bundles the data and the scripts used as a backup.

With provenance, you will be able to reproduce the same analysis in the future.

Commit a dataset

By default commits will be stored in the provenance directory .retriever_provenance, but this directory can be changed by setting the environment variable PROVENANCE_DIR.

```R
rdataretriever::commit('abalone-age', commit_message='A snapshot of Abalone Dataset as of 2020-02-26')
```
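As a minimal sketch of changing the provenance directory from within R: the directory location below is hypothetical, and the sketch assumes the variable needs to be set early in the session, before rdataretriever starts Python, for the setting to be picked up.

```R
# Hypothetical location for archived snapshots; any writable directory should work.
provenance_dir <- file.path(tempdir(), "retriever_provenance")
dir.create(provenance_dir, recursive = TRUE, showWarnings = FALSE)

# Set the variable before loading rdataretriever (assumption: it is read at startup).
Sys.setenv(PROVENANCE_DIR = provenance_dir)

message("Commits will be stored in: ", Sys.getenv("PROVENANCE_DIR"))
```

For a setting that persists across sessions, the same variable can be placed in the .Renviron file instead.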

You can also set the path for an individual commit:

```R
rdataretriever::commit('abalone-age', commit_message='Data and recipe archive for Abalone Data on 2020-02-26', path='.')
```

View a log of committed datasets in the provenance directory

```R
rdataretriever::commit_log('abalone-age')
```

Install a committed dataset

To reanalyze a committed dataset, rdataretriever obtains the data and script from the history and installs that particular version into the given back-end. For example, SQLite:

```R
rdataretriever::install_sqlite('abalone-age-a76e77.zip')
```

Datasets stored in the provenance directory can be installed directly using the hash value:

```R
rdataretriever::install_sqlite('abalone-age', hash_value='a76e77')
```
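To find which archives are available for a dataset, a small base R helper can list them. This is only a sketch: the name `list_commit_archives` is hypothetical, and it assumes archives follow the `<dataset>-<hash>.zip` pattern seen in the example above and sit directly in the directory being searched. The demo creates empty stand-in files in a temporary directory.

```R
# Hypothetical helper: list committed archives for one dataset in a directory.
# Assumes archives are named "<dataset>-<hash>.zip", as in the example above.
list_commit_archives <- function(dataset, dir = ".") {
  prefix <- paste0(dataset, "-")
  zips <- list.files(dir, pattern = "\\.zip$")
  zips[startsWith(zips, prefix)]
}

# Demo with empty stand-in files rather than real commits.
archive_dir <- tempfile("commits_demo")
dir.create(archive_dir)
file.create(file.path(archive_dir, c("abalone-age-a76e77.zip", "portal-1b2c3d.zip")))

abalone_archives <- list_commit_archives("abalone-age", archive_dir)
print(abalone_archives)
```

Any file name returned this way could then be passed, with its path, to an install function such as `install_sqlite()`.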

Using Docker Containers

To run the image interactively

```shell
docker-compose run --service-ports rdata /bin/bash
```

To run tests

```shell
docker-compose run rdata Rscript load_and_test.R
```

Release

Make sure you have tests passing on R-oldrelease, current R-release and R-devel

To check the package

```shell
R CMD build . # build the package (run from the package root)
R CMD check --as-cran --no-manual rdataretriever_[version].tar.gz
```

To Test

```R
setwd("./rdataretriever") # Set working directory

# install all deps
# install.packages("reticulate")

library(DBI)
library(RPostgreSQL)
library(RSQLite)
library(reticulate)
library(RMariaDB)
install.packages(".", repos = NULL, type="source")
roxygen2::roxygenise()
devtools::test()
```

To get citation information for the rdataretriever in R use citation(package = 'rdataretriever')

Acknowledgements

A big thanks to Ben Morris for helping to develop the Data Retriever. Thanks to the rOpenSci team with special thanks to Gavin Simpson, Scott Chamberlain, and Karthik Ram who gave helpful advice and fostered the development of this R package. Development of this software was funded by the National Science Foundation as part of a CAREER award to Ethan White.

Owner

Name: rOpenSci
Login: ropensci
Kind: organization
Email: info@ropensci.org
Location: Berkeley, CA

Website: https://ropensci.org/
Twitter: rOpenSci
Repositories: 307
Profile: https://github.com/ropensci

JOSS Publication

Rdataretriever: R Interface to the Data Retriever

Published

January 06, 2021

DOI

10.21105/joss.02800

Volume 6, Issue 57, Page 2800

Authors

Henry Senyondo

Department of Wildlife Ecology and Conservation, University of Florida

Daniel J. McGlinn

Department of Biology, College of Charleston

Pranita Sharma

North Carolina State University, Department of Computer Science

David J. Harris

Department of Wildlife Ecology and Conservation, University of Florida

Hao Ye

Health Science Center Libraries, University of Florida

Shawn D. Taylor

Department of Wildlife Ecology and Conservation, University of Florida, USDA-ARS Jornada Experimental Range

Jeroen Ooms

Berkeley Institute for Data Science, University of California, Berkeley

Francisco Rodríguez-Sánchez

Department of Agricultural Economics, Sociology, and Education, Penn State University

Karthik Ram

Berkeley Institute for Data Science, University of California, Berkeley

Apoorva Pandey

Department of Electronics and Communication, Indian Institute of Technology, Roorkee

Harshit Bansal

Ajay Kumar Garg Engineering College, Ghaziabad

Max Pohlman

Departamento de Biología Vegetal y Ecología, Universidad de Sevilla

Ethan P. White

Department of Wildlife Ecology and Conservation, University of Florida, Informatics Institute, University of Florida, Biodiversity Institute, University of Florida

Editor

Frederick Boehm

Papers & Mentions

Total mentions: 1

Forecasting biodiversity in breeding birds using best practices

DOI: 10.7717/peerj.4278
OpenAlex ID: https://openalex.org/W2756692527
Published: February 2018

Last synced: 4 months ago

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Committers

Last synced: 7 months ago

All Time

Total Commits: 319
Total Committers: 17
Avg Commits per committer: 18.765
Development Distribution Score (DDS): 0.624

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Daniel McGlinn	d**n@g**m	120
henrykironde	h**e@g**m	92
Ethan White	e**n@w**g	66
pranita-s	p**a@g**m	11
Hao Ye	l**d@g**m	5
Harshit Bansal	h**c@g**m	4
maxpohlman	m**n@g**m	3
Jeroen Ooms	j**s@g**m	3
Apoorva Pandey	a**5@g**m	3
David J. Harris	d****s	2
Shawn	s**r@w**g	2
Pakillo	f**c@g**m	2
Arfon Smith	a****n	2
Apoorva Pandey	a**a@l**n	1
Frederick Boehm	f**m@g**m	1
Karthik Ram	k**m@g**m	1
Maëlle Salmon	m**n@y**e	1

Committer Domains (Top 20 + Academic)

weecology.org: 2
localhost.localdomain: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 30
Total pull requests: 75
Average time to close issues: 10 months
Average time to close pull requests: 12 days
Total issue authors: 9
Total pull request authors: 7
Average comments per issue: 2.93
Average comments per pull request: 0.91
Merged pull requests: 62
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: about 1 hour
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

ethanwhite (11)
RMHogervorst (7)
henrykironde (4)
ha0ye (3)
maelle (1)
fboehm (1)
harshitbansal05 (1)
dmcglinn (1)
gdicecco (1)

Pull Request Authors

henrykironde (51)
ethanwhite (21)
arfon (2)
jeroen (2)
maelle (1)
fboehm (1)
ashishpriyadarshiCIC (1)

Top Labels

Issue Labels

feature_request (2)
good first issue (1)
enhancement (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 277 last-month
Total docker downloads: 88,618

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 5
Total maintainers: 1

cran.r-project.org: rdataretriever

R Interface to the Data Retriever

Homepage: https://docs.ropensci.org/rdataretriever/ (website)
Documentation: http://cran.r-project.org/web/packages/rdataretriever/rdataretriever.pdf
License: MIT + file LICENSE
Latest release: 3.1.1
published over 1 year ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 277 Last month
Docker Downloads: 88,618

Rankings

Docker downloads count: 0.0%

Forks count: 3.6%

Stargazers count: 7.4%

Average: 20.1%

Dependent repos count: 24.0%

Dependent packages count: 28.8%

Downloads: 56.7%

Maintainers (1)

henrykironde@gmail.com

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

R >= 3.4.0 depends
reticulate >= 1.16 imports
semver * imports
DBI * suggests
RPostgreSQL * suggests
RSQLite * suggests
dbplyr * suggests
devtools * suggests
ggplot2 * suggests
knitr * suggests
raster * suggests
rmarkdown * suggests
testthat >= 1.0.0 suggests

.github/workflows/main.yml actions

actions/checkout v2 composite

Dockerfile docker

rocker/tidyverse latest build

docker-compose.yml docker

mysql 5.7
postgres latest
rdata_image latest
