Rdataretriever

Rdataretriever: R Interface to the Data Retriever - Published in JOSS (2021)

https://github.com/ropensci/rdataretriever

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

data data-science database datasets r r-package rstats science

Keywords from Contributors

data-retrieval hacktobefest ecology weather-data crypto-currency-exchanges osm-data overpass-api pm25 rti-micropem data60uk
Last synced: 4 months ago

Repository

R interface to the Data Retriever

Basic Info
Statistics
  • Stars: 47
  • Watchers: 12
  • Forks: 21
  • Open Issues: 9
  • Releases: 8
Topics
data data-science database datasets r r-package rstats science
Created almost 12 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Code of conduct

README.md

rdataretriever

R interface to the Data Retriever.

The rdataretriever provides access to cleaned versions of hundreds of commonly used public datasets with a single line of code.

These datasets come from many different sources and most of them require some cleaning and restructuring prior to analysis. The rdataretriever uses a set of actively maintained recipes for downloading, cleaning, and restructuring these datasets using a combination of the Frictionless Data Specification and custom data cleaning scripts.

The rdataretriever also facilitates the automatic storage of these datasets in a choice of database management systems (PostgreSQL, SQLite, MySQL, MariaDB) or flat file formats (CSV, XML, JSON) for later use and integration with large data analysis pipelines.

The rdataretriever also facilitates reproducible science by providing tools to archive and rerun the precise version of a dataset and the associated cleaning steps that were used for a specific analysis.

The rdataretriever handles the work of cleaning, storing, and archiving data so that you can focus on analysis, inference and visualization.

Installation

The rdataretriever is an R wrapper for the Python package, Data Retriever. This means that Python and the retriever Python package need to be installed first.

Basic Installation

If you just want to use the Data Retriever from within R, run the following commands in R. This will create a local Python installation that will only be used by R and install the needed Python package for you.

```coffee
install.packages('reticulate') # Install R package for interacting with Python
reticulate::install_miniconda() # Install Python
reticulate::py_install('retriever') # Install the Python retriever package
install.packages('rdataretriever') # Install the R package for running the retriever
rdataretriever::get_updates() # Update the available datasets
```

After running these commands restart R.
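
Once R has restarted, it can be worth confirming that each piece of the installation is visible before installing datasets. The following is a minimal sketch, not part of the package: `check_retriever_setup()` is a hypothetical helper name, and it relies only on base R plus `reticulate::py_module_available()`.

```r
# Hypothetical helper (not part of rdataretriever): reports which pieces of
# the installation are visible from the current R session.
check_retriever_setup <- function() {
  has_reticulate <- requireNamespace("reticulate", quietly = TRUE)
  has_r_package <- requireNamespace("rdataretriever", quietly = TRUE)
  # Only ask reticulate about the Python module if reticulate itself loads;
  # any error while locating Python is treated as "not available".
  has_py_module <- has_reticulate && isTRUE(tryCatch(
    reticulate::py_module_available("retriever"),
    error = function(e) FALSE
  ))
  c(reticulate = has_reticulate,
    rdataretriever = has_r_package,
    python_retriever = has_py_module)
}

setup_status <- check_retriever_setup()
print(setup_status)
```

If any entry is `FALSE`, rerunning the corresponding installation command above is a reasonable first step.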

Advanced Installation for Python Users

If you are using Python for other tasks you can use rdataretriever with your existing Python installation (though the basic installation above will also work in this case by creating a separate miniconda install and Python environment).

Install the retriever Python package

Install the retriever Python package into your preferred Python environment using either conda (64-bit conda is required):

```bash
conda install -c conda-forge retriever
```

or pip:

```bash
pip install retriever
```

Select the Python environment to use in R

rdataretriever will try to find Python environments that have retriever installed (see the reticulate documentation on order of discovery for more details). Alternatively, you can select a Python environment to use when working with rdataretriever (and other packages using reticulate).

The most robust way to do this is to set the RETICULATE_PYTHON environment variable to point to the preferred Python executable:

```coffee
Sys.setenv(RETICULATE_PYTHON = "/path/to/python")
```

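This command can be run interactively or placed in .Rprofile in your home directory. To use .Renviron instead, add the line RETICULATE_PYTHON=/path/to/python (without the Sys.setenv() call) to .Renviron in your home directory.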
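
Because a mistyped path only surfaces later, when reticulate fails to start Python, a small guard can help. This is a sketch under stated assumptions: `use_python_for_retriever()` is a hypothetical helper built on base R only, and `/path/to/python` is the placeholder from the text rather than a real location.

```r
# Hypothetical helper: point reticulate at a Python executable, but only if
# the file actually exists. Run this before reticulate initializes Python.
use_python_for_retriever <- function(python_path) {
  if (!file.exists(python_path)) {
    warning("Python executable not found: ", python_path)
    return(invisible(FALSE))
  }
  Sys.setenv(RETICULATE_PYTHON = normalizePath(python_path))
  invisible(TRUE)
}

# "/path/to/python" is a placeholder; replace it with a real executable path.
ok <- suppressWarnings(use_python_for_retriever("/path/to/python"))
if (!ok) message("RETICULATE_PYTHON left unchanged")
```

The helper returns `FALSE` and leaves the environment variable untouched when the path does not exist, so a typo does not silently override a working configuration.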

Alternatively, you can select the Python environment through the reticulate package for either conda:

```coffee
library(reticulate)
use_condaenv('name_of_conda_environment')
```

or virtualenv:

```coffee
library(reticulate)
use_virtualenv("path_to_virtualenv_environment")
```

You can check to see which Python environment is being used with:

```coffee
py_config()
```

Install the rdataretriever R package

```coffee
install.packages("rdataretriever") # latest release from CRAN
```

```coffee
remotes::install_github("ropensci/rdataretriever") # development version from GitHub
```

Installing Tabular Datasets

```coffee
library(rdataretriever)

# List the datasets available via the Retriever
rdataretriever::datasets()

# Install the portal into csv files in your working directory
rdataretriever::install_csv('portal')

# Download the raw portal dataset files without any processing to the
# subdirectory named data
rdataretriever::download('portal', './data/')

# Install and load a dataset as a list
portal = rdataretriever::fetch('portal')
names(portal)
head(portal$species)
```
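
After `install_csv('portal')` the tables sit in the working directory as CSV files. The following is a minimal sketch for loading them back into a named list in a later session. It assumes the files share a `portal_` prefix, which depends on the table naming used at install time, so check the directory first. `read_installed_csvs()` is a hypothetical helper built on base R only.

```r
# Hypothetical helper: read every CSV in `dir` whose file name starts with
# `prefix` and return the tables as a list named after the files.
read_installed_csvs <- function(prefix, dir = ".") {
  pattern <- paste0("^", prefix, ".*\\.csv$")
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
  names(tables) <- tools::file_path_sans_ext(basename(files))
  tables
}

# ASSUMPTION: installed files look like "portal_<table>.csv".
portal_tables <- read_installed_csvs("portal_")
names(portal_tables)
```

If no matching files exist the result is an empty list, which makes the helper safe to call before anything has been installed.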

Installing Spatial Datasets

Set-up and Requirements

Tools

  • PostgreSQL with PostGIS, psql (client), raster2pgsql, shp2pgsql, gdal

The rdataretriever supports installation of spatial data into Postgres DBMS.

  1. Install PostgreSQL and PostGIS

    To install PostgreSQL with PostGIS for use with spatial data please refer to the OSGeo Postgres installation instructions.

    We recommend storing your PostgreSQL login information in a .pgpass file to avoid supplying the password every time. See the .pgpass documentation for more details.

    After installation, make sure the paths to these tools are added to your system's PATH. Please consult an operating system expert for help on how to change or add PATH variables.

    For example, this could be a sample of paths exported on Mac:

    ```shell
    # ~/.bash_profile file, Postgres PATHS and tools.
    export PATH="/Applications/Postgres.app/Contents/MacOS/bin:${PATH}"
    export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/10/bin"
    ```

  2. Enable PostGIS extensions

    If you have Postgres set up, enable the PostGIS extensions by running the following with either the Postgres CLI or a GUI (PgAdmin).

    For psql CLI

    ```shell
    psql -d yourdatabase -c "CREATE EXTENSION postgis;"
    psql -d yourdatabase -c "CREATE EXTENSION postgis_topology;"
    ```

    For GUI (PgAdmin)

    ```sql
    CREATE EXTENSION postgis;
    CREATE EXTENSION postgis_topology;
    ```

    For more details refer to the PostGIS docs.

Sample commands

```R
rdataretriever::install_postgres('harvard-forest') # Vector data
rdataretriever::install_postgres('bioclim') # Raster data

# Install only the data of USGS elevation in the given extent
rdataretriever::install_postgres('usgs-elevation', list(-94.98704597353938, 39.027001800158615, -94.3599408119917, 40.69577051867074))
```
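
The four numbers in the extent are easy to transpose, so it may help to build the list from named values and check it before starting a long download. This is a sketch under a loud assumption: the order (xmin, ymin, xmax, ymax) is inferred from the example values above and should be confirmed against the Data Retriever documentation. `make_extent()` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: build the extent list from named corners and check
# that it describes a non-empty box.
# ASSUMPTION: the order is (xmin, ymin, xmax, ymax), inferred from the
# example values in the text.
make_extent <- function(xmin, ymin, xmax, ymax) {
  stopifnot(is.numeric(c(xmin, ymin, xmax, ymax)))
  if (xmin >= xmax || ymin >= ymax) {
    stop("extent is empty: need xmin < xmax and ymin < ymax")
  }
  list(xmin, ymin, xmax, ymax)
}

extent <- make_extent(-94.98704597353938, 39.027001800158615,
                      -94.3599408119917, 40.69577051867074)
str(extent)

# The list can then be passed where the text passes a literal list(...):
# rdataretriever::install_postgres('usgs-elevation', extent)
```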

Provenance

To ensure reproducibility, the rdataretriever supports creating point-in-time snapshots of the data and the script.

Use the commit function to create and store a snapshot of the data. Provide a descriptive message for the created commit. This is comparable to a git commit; however, the function bundles the data and scripts used as a backup.

With provenance, you will be able to reproduce the same analysis in the future.

Commit a dataset

By default commits will be stored in the provenance directory .retriever_provenance, but this directory can be changed by setting the environment variable PROVENANCE_DIR.

```coffee
rdataretriever::commit('abalone-age', commit_message='A snapshot of Abalone Dataset as of 2020-02-26')
```
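
The PROVENANCE_DIR environment variable mentioned above can be set from R. A minimal sketch follows. The directory used here lives under `tempdir()` purely for illustration, so substitute a persistent project path in practice. Setting the variable before rdataretriever is loaded is the safer assumption, since the underlying Python package may read it only once at startup.

```r
# Illustrative location only: tempdir() is wiped when the R session ends,
# so use a persistent project directory for real archives.
provenance_dir <- file.path(tempdir(), "retriever_provenance")
dir.create(provenance_dir, recursive = TRUE, showWarnings = FALSE)

# Set the environment variable named in the text for this R session.
Sys.setenv(PROVENANCE_DIR = provenance_dir)
Sys.getenv("PROVENANCE_DIR")
```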

You can also set the path for an individual commit:

```coffee
rdataretriever::commit('abalone-age', commit_message='Data and recipe archive for Abalone Data on 2020-02-26', path='.')
```

View a log of committed datasets in the provenance directory

```coffee
rdataretriever::commit_log('abalone-age')
```

Install a committed dataset

To reanalyze a committed dataset, rdataretriever obtains the data and script from the history and installs that particular version into the given back-end. For example, SQLite:

```coffee
rdataretriever::install_sqlite('abalone-age-a76e77.zip')
```

Datasets stored in the provenance directory can be installed directly using the hash value:

```coffee
rdataretriever::install_sqlite('abalone-age', hash_value='a76e77')
```
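
When only the archive file name is at hand, the dataset name and hash can be recovered from it. This sketch assumes archives are named `<dataset>-<hash>.zip`, a pattern inferred solely from the `abalone-age-a76e77.zip` example and not confirmed as a guaranteed convention. `parse_commit_archive()` is a hypothetical helper using base R only.

```r
# Hypothetical helper: split "<dataset>-<hash>.zip" into its two parts.
# ASSUMPTION: the hash is the last hyphen-separated token before ".zip".
parse_commit_archive <- function(archive) {
  stem <- sub("\\.zip$", "", basename(archive))
  parts <- regmatches(stem, regexec("^(.*)-([^-]+)$", stem))[[1]]
  if (length(parts) != 3) stop("unexpected archive name: ", archive)
  list(dataset = parts[2], hash_value = parts[3])
}

commit <- parse_commit_archive("abalone-age-a76e77.zip")
str(commit)

# The parts map onto the hash-based call shown in the text:
# rdataretriever::install_sqlite(commit$dataset, hash_value = commit$hash_value)
```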

Using Docker Containers

To run the image interactively

docker-compose run --service-ports rdata /bin/bash

To run tests

docker-compose run rdata Rscript load_and_test.R

Release

Make sure you have tests passing on R-oldrelease, current R-release and R-devel

To check the package

```shell
R CMD build . # build the package (run from the package root)
R CMD check --as-cran --no-manual rdataretriever_[version].tar.gz
```

To Test

```R
setwd("./rdataretriever") # Set working directory

# install all deps
install.packages("reticulate")

library(DBI)
library(RPostgreSQL)
library(RSQLite)
library(reticulate)
library(RMariaDB)
install.packages(".", repos = NULL, type="source")
roxygen2::roxygenise()
devtools::test()
```

To get citation information for the rdataretriever in R use citation(package = 'rdataretriever')

Acknowledgements

A big thanks to Ben Morris for helping to develop the Data Retriever. Thanks to the rOpenSci team with special thanks to Gavin Simpson, Scott Chamberlain, and Karthik Ram who gave helpful advice and fostered the development of this R package. Development of this software was funded by the National Science Foundation as part of a CAREER award to Ethan White.


Owner

  • Name: rOpenSci
  • Login: ropensci
  • Kind: organization
  • Email: info@ropensci.org
  • Location: Berkeley, CA

JOSS Publication

Rdataretriever: R Interface to the Data Retriever
Published
January 06, 2021
Volume 6, Issue 57, Page 2800
Authors
Henry Senyondo ORCID
Department of Wildlife Ecology and Conservation, University of Florida
Daniel J. McGlinn ORCID
Department of Biology, College of Charleston
Pranita Sharma ORCID
North Carolina State University, Department of Computer Science
David J. Harris ORCID
Department of Wildlife Ecology and Conservation, University of Florida
Hao Ye ORCID
Health Science Center Libraries, University of Florida
Shawn D. Taylor ORCID
Department of Wildlife Ecology and Conservation, University of Florida, USDA-ARS Jornada Experimental Range
Jeroen Ooms ORCID
Berkeley Institute for Data Science, University of California, Berkeley
Francisco Rodríguez-Sánchez ORCID
Departamento de Biología Vegetal y Ecología, Universidad de Sevilla
Karthik Ram ORCID
Berkeley Institute for Data Science, University of California, Berkeley
Apoorva Pandey ORCID
Department of Electronics and Communication, Indian Institute of Technology, Roorkee
Harshit Bansal ORCID
Ajay Kumar Garg Engineering College, Ghaziabad
Max Pohlman
Department of Agricultural Economics, Sociology, and Education, Penn State University
Ethan P. White ORCID
Department of Wildlife Ecology and Conservation, University of Florida, Informatics Institute, University of Florida, Biodiversity Institute, University of Florida
Editor
Frederick Boehm ORCID
Tags
data retrieval data processing R data data science datasets

Papers & Mentions

Total mentions: 1

Forecasting biodiversity in breeding birds using best practices
Last synced: 2 months ago

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 319
  • Total Committers: 17
  • Avg Commits per committer: 18.765
  • Development Distribution Score (DDS): 0.624
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Daniel McGlinn d****n@g****m 120
henrykironde h****e@g****m 92
Ethan White e****n@w****g 66
pranita-s p****a@g****m 11
Hao Ye l****d@g****m 5
Harshit Bansal h****c@g****m 4
maxpohlman m****n@g****m 3
Jeroen Ooms j****s@g****m 3
Apoorva Pandey a****5@g****m 3
David J. Harris d****s 2
Shawn s****r@w****g 2
Pakillo f****c@g****m 2
Arfon Smith a****n 2
Apoorva Pandey a****a@l****n 1
Frederick Boehm f****m@g****m 1
Karthik Ram k****m@g****m 1
Maëlle Salmon m****n@y****e 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 30
  • Total pull requests: 75
  • Average time to close issues: 10 months
  • Average time to close pull requests: 12 days
  • Total issue authors: 9
  • Total pull request authors: 7
  • Average comments per issue: 2.93
  • Average comments per pull request: 0.91
  • Merged pull requests: 62
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: about 1 hour
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ethanwhite (11)
  • RMHogervorst (7)
  • henrykironde (4)
  • ha0ye (3)
  • maelle (1)
  • fboehm (1)
  • harshitbansal05 (1)
  • dmcglinn (1)
  • gdicecco (1)
Pull Request Authors
  • henrykironde (51)
  • ethanwhite (21)
  • arfon (2)
  • jeroen (2)
  • maelle (1)
  • fboehm (1)
  • ashishpriyadarshiCIC (1)
Top Labels
Issue Labels
feature_request (2) good first issue (1) enhancement (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 277 last-month
  • Total docker downloads: 88,618
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 5
  • Total maintainers: 1
cran.r-project.org: rdataretriever

R Interface to the Data Retriever

  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 277 Last month
  • Docker Downloads: 88,618
Rankings
Docker downloads count: 0.0%
Forks count: 3.6%
Stargazers count: 7.4%
Average: 20.1%
Dependent repos count: 24.0%
Dependent packages count: 28.8%
Downloads: 56.7%
Maintainers (1)
Last synced: 4 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.4.0 depends
  • reticulate >= 1.16 imports
  • semver * imports
  • DBI * suggests
  • RPostgreSQL * suggests
  • RSQLite * suggests
  • dbplyr * suggests
  • devtools * suggests
  • ggplot2 * suggests
  • knitr * suggests
  • raster * suggests
  • rmarkdown * suggests
  • testthat >= 1.0.0 suggests
.github/workflows/main.yml actions
  • actions/checkout v2 composite
Dockerfile docker
  • rocker/tidyverse latest build
docker-compose.yml docker
  • mysql 5.7
  • postgres latest
  • rdata_image latest