Rdataretriever
Rdataretriever: R Interface to the Data Retriever - Published in JOSS (2021)
Science Score: 93.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to joss.theoj.org, zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
R interface to the Data Retriever
Basic Info
- Host: GitHub
- Owner: ropensci
- License: other
- Language: R
- Default Branch: main
- Homepage: https://docs.ropensci.org/rdataretriever
- Size: 388 KB
Statistics
- Stars: 47
- Watchers: 12
- Forks: 21
- Open Issues: 9
- Releases: 8
Metadata Files
README.md
rdataretriever
R interface to the Data Retriever.
The rdataretriever provides access to cleaned versions of hundreds of commonly used public datasets with a single line of code.
These datasets come from many different sources and most of them require some cleaning and restructuring prior to analysis.
The rdataretriever uses a set of actively maintained recipes for downloading, cleaning, and restructuring these datasets using a combination of the Frictionless Data Specification and custom data cleaning scripts.
The rdataretriever also facilitates the automatic storage of these datasets in a choice of database management systems (PostgreSQL, SQLite, MySQL, MariaDB) or flat file formats (CSV, XML, JSON) for later use and integration with large data analysis pipelines.
The rdataretriever also facilitates reproducible science by providing tools to archive and rerun the precise version of a dataset and associated cleaning steps that was used for a specific analysis.
The rdataretriever handles the work of cleaning, storing, and archiving data so that you can focus on analysis, inference and visualization.
Table of Contents
- Installation
- Installing Tabular Datasets
- Installing Spatial Datasets
- Using Docker Containers
- Provenance
- Acknowledgements
Installation
The rdataretriever is an R wrapper for the Python package, Data Retriever. This means
that Python and the retriever Python package need to be installed first.
Basic Installation
If you just want to use the Data Retriever from within R, run the following commands in R. This will create a local Python installation that will only be used by R and install the needed Python package for you.
```r
install.packages('reticulate') # Install R package for interacting with Python
reticulate::install_miniconda() # Install Python
reticulate::py_install('retriever') # Install the Python retriever package
install.packages('rdataretriever') # Install the R package for running the retriever
rdataretriever::get_updates() # Update the available datasets
```
After running these commands restart R.
Advanced Installation for Python Users
If you are using Python for other tasks you can use rdataretriever with your
existing Python installation (though the basic installation
above will also work in this case by creating a separate miniconda install and
Python environment).
Install the retriever Python package
Install the retriever Python package into your preferred Python environment
using either conda (64-bit conda is required):
```bash
conda install -c conda-forge retriever
```
or pip:
```bash
pip install retriever
```
Select the Python environment to use in R
rdataretriever will try to find a Python environment that has retriever
installed (see the reticulate documentation on order of discovery for more
details). Alternatively, you can select a Python environment to use when
working with rdataretriever (and other packages using reticulate).
The most robust way to do this is to set the RETICULATE_PYTHON environment
variable to point to the preferred Python executable:
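```r
Sys.setenv(RETICULATE_PYTHON = "/path/to/python")
```
This command can be run interactively or placed in .Rprofile. To set the
variable without running R code, add the line
RETICULATE_PYTHON="/path/to/python" to .Renviron in your home directory.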
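Before loading rdataretriever, it can help to confirm that the variable is set and points at an existing file. The following is a minimal sketch in base R; check_reticulate_python is a hypothetical helper written for illustration and is not part of reticulate or rdataretriever.

```r
# Hypothetical helper (not part of reticulate or rdataretriever): reports
# whether RETICULATE_PYTHON is unset, points at a missing file, or looks usable.
check_reticulate_python <- function() {
  path <- Sys.getenv("RETICULATE_PYTHON", unset = NA)
  if (is.na(path) || !nzchar(path)) {
    return("unset")
  }
  if (!file.exists(path)) {
    return("missing")
  }
  "ok"
}

status <- check_reticulate_python()
message("RETICULATE_PYTHON status: ", status)
```

reticulate reads this variable when it initializes Python, so the check is most useful at the start of a session, before any call that uses Python.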
Alternatively, you can select the Python environment through the reticulate
package for either conda:
```r
library(reticulate)
use_condaenv('name_of_conda_environment')
```
or virtualenv:
```r
library(reticulate)
use_virtualenv("path_to_virtualenv_environment")
```
You can check to see which Python environment is being used with:
```r
py_config()
```
Install the rdataretriever R package
```r
install.packages("rdataretriever") # latest release from CRAN
```
```r
remotes::install_github("ropensci/rdataretriever") # development version from GitHub
```
Installing Tabular Datasets
```r
library(rdataretriever)

# List the datasets available via the Retriever
rdataretriever::datasets()

# Install the portal into csv files in your working directory
rdataretriever::install_csv('portal')

# Download the raw portal dataset files without any processing to the
# subdirectory named data
rdataretriever::download('portal', './data/')

# Install and load a dataset as a list
portal = rdataretriever::fetch('portal')
names(portal)
head(portal$species)
```
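Once a dataset has been installed as CSV files, the tables can be loaded back into R with base functions. The sketch below is one possible way to do this; read_installed_csvs is a hypothetical helper, and it assumes that the files written by install_csv have names beginning with the dataset name, which should be checked against the files actually produced.

```r
# Hypothetical helper (not part of rdataretriever): reads every CSV file in
# `dir` whose name starts with `prefix` into a named list of data frames.
# Assumption: install_csv() writes files whose names begin with the dataset name.
read_installed_csvs <- function(dir = ".", prefix = "portal") {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  files <- files[startsWith(basename(files), prefix)]
  tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
  names(tables) <- tools::file_path_sans_ext(basename(files))
  tables
}

# Example usage, after rdataretriever::install_csv('portal') has been run
# in the working directory:
# portal_tables <- read_installed_csvs(".", "portal")
# names(portal_tables)
```

When the data are only needed in memory, fetch (shown above) avoids the intermediate files altogether.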
Installing Spatial Datasets
Set-up and Requirements
Tools
- PostgreSQL with PostGIS, psql (client), raster2pgsql, shp2pgsql, gdal
The rdataretriever supports installation of spatial data into Postgres DBMS.
Install PostgreSQL and PostGis
To install PostgreSQL with PostGIS for use with spatial data please refer to the OSGeo Postgres installation instructions.

We recommend storing your PostgreSQL login information in a .pgpass file to avoid supplying the password every time. See the .pgpass documentation for more details.

After installation, make sure you have the paths to these tools added to your system's PATH. Please consult an operating system expert for help on how to change or add the PATH variables.

For example, this could be a sample of paths exported on Mac:
```shell
# ~/.bash_profile file, Postgres PATHS and tools.
export PATH="/Applications/Postgres.app/Contents/MacOS/bin:${PATH}"
export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/10/bin"
```
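Whether the PATH changes took effect can be checked from within R before attempting a spatial install. This is a minimal sketch using base R's Sys.which; find_missing_tools is a hypothetical helper, and the list of executables covers only the tools named above that have a fixed command name.

```r
# Command-line tools named in the requirements above. gdal is left out because
# the executable name to look for depends on the GDAL version installed.
required_tools <- c("psql", "raster2pgsql", "shp2pgsql")

# Hypothetical helper: returns the names of tools that Sys.which() cannot find.
find_missing_tools <- function(tools) {
  paths <- Sys.which(tools)
  names(paths)[!nzchar(paths)]
}

missing_tools <- find_missing_tools(required_tools)
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
} else {
  message("All required tools were found on PATH.")
}
```

Note that R started from a desktop launcher may not inherit the PATH set in a shell profile, so the result can differ from what a terminal reports.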
Enable PostGIS extensions
If you have Postgres set up, enable the PostGIS extensions. This is done using either the Postgres CLI or a GUI (PgAdmin) by running the following.

For psql CLI
```shell
psql -d yourdatabase -c "CREATE EXTENSION postgis;"
psql -d yourdatabase -c "CREATE EXTENSION postgis_topology;"
```
For GUI (PgAdmin)
```sql
CREATE EXTENSION postgis;
CREATE EXTENSION postgis_topology;
```
For more details refer to the PostGIS docs.
Sample commands
```r
rdataretriever::install_postgres('harvard-forest') # Vector data
rdataretriever::install_postgres('bioclim') # Raster data

# Install only the data of USGS elevation in the given extent
rdataretriever::install_postgres('usgs-elevation', list(-94.98704597353938, 39.027001800158615, -94.3599408119917, 40.69577051867074))
```
Provenance
To ensure reproducibility, the rdataretriever supports creating point-in-time snapshots of the data and the script used to process it.
Use the commit function to create and store a snapshot of the data. Provide a descriptive message for the created commit. This is comparable to a git commit; however, the function bundles the data and the scripts used as a backup.
With provenance, you will be able to reproduce the same analysis in the future.
Commit a dataset
By default commits will be stored in the provenance directory .retriever_provenance, but this directory can be changed by setting the environment variable PROVENANCE_DIR.
```r
rdataretriever::commit('abalone-age',
                       commit_message='A snapshot of Abalone Dataset as of 2020-02-26')
```
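The PROVENANCE_DIR environment variable mentioned above can be set from R before committing. The following is a sketch under assumptions: the directory name is hypothetical, and it is assumed (not confirmed by this README) that the variable needs to be set before the retriever Python module is first loaded in the session.

```r
# Hypothetical archive location. tempdir() is used here only so the sketch is
# safe to run; a real archive would live in a permanent directory.
provenance_dir <- file.path(tempdir(), "retriever_provenance")
dir.create(provenance_dir, recursive = TRUE, showWarnings = FALSE)

# Assumption: set this before rdataretriever first loads the Python module,
# so that the new location is picked up.
Sys.setenv(PROVENANCE_DIR = provenance_dir)

# Confirm the value that child processes and Python will see.
Sys.getenv("PROVENANCE_DIR")
```

To make the setting persistent across sessions, the same variable can be defined in .Renviron instead.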
You can also set the path for an individual commit:
```r
rdataretriever::commit('abalone-age',
                       commit_message='Data and recipe archive for Abalone Data on 2020-02-26',
                       path='.')
```
View a log of committed datasets in the provenance directory
```r
rdataretriever::commit_log('abalone-age')
```
Install a committed dataset
To reanalyze a committed dataset, rdataretriever obtains the data and script from the history and installs that particular version of the data into the given back-end. For example, SQLite:
```r
rdataretriever::install_sqlite('abalone-age-a76e77.zip')
```
Datasets stored in the provenance directory can be installed directly using the hash value:
```r
rdataretriever::install_sqlite('abalone-age', hash_value='a76e77')
```
Using Docker Containers
To run the image interactively:
```shell
docker-compose run --service-ports rdata /bin/bash
```
To run tests:
```shell
docker-compose run rdata Rscript load_and_test.R
```
Release
Make sure you have tests passing on R-oldrelease, current R-release and R-devel.
To check the package:
```shell
R CMD build rdataretriever # build the package (run from the directory that contains the package folder)
R CMD check --as-cran --no-manual rdataretriever_[version].tar.gz
```
To Test
```r
setwd("./rdataretriever") # Set working directory

# install all deps
install.packages("reticulate")
library(DBI)
library(RPostgreSQL)
library(RSQLite)
library(reticulate)
library(RMariaDB)
install.packages(".", repos = NULL, type="source")
roxygen2::roxygenise()
devtools::test()
```
To get citation information for the rdataretriever in R, use citation(package = 'rdataretriever').
Acknowledgements
A big thanks to Ben Morris for helping to develop the Data Retriever. Thanks to the rOpenSci team with special thanks to Gavin Simpson, Scott Chamberlain, and Karthik Ram who gave helpful advice and fostered the development of this R package. Development of this software was funded by the National Science Foundation as part of a CAREER award to Ethan White.
Owner
- Name: rOpenSci
- Login: ropensci
- Kind: organization
- Email: info@ropensci.org
- Location: Berkeley, CA
- Website: https://ropensci.org/
- Twitter: rOpenSci
- Repositories: 307
- Profile: https://github.com/ropensci
JOSS Publication
Rdataretriever: R Interface to the Data Retriever
Authors
Department of Wildlife Ecology and Conservation, University of Florida, USDA-ARS Jornada Experimental Range
Department of Agricultural Economics, Sociology, and Education, Penn State University
Departamento de Biología Vegetal y Ecología, Universidad de Sevilla.
Tags
data retrieval data processing R data data science datasets
Papers & Mentions
Total mentions: 1
Forecasting biodiversity in breeding birds using best practices
- DOI: 10.7717/peerj.4278
- OpenAlex ID: https://openalex.org/W2756692527
- Published: February 2018
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Daniel McGlinn | d****n@g****m | 120 |
| henrykironde | h****e@g****m | 92 |
| Ethan White | e****n@w****g | 66 |
| pranita-s | p****a@g****m | 11 |
| Hao Ye | l****d@g****m | 5 |
| Harshit Bansal | h****c@g****m | 4 |
| maxpohlman | m****n@g****m | 3 |
| Jeroen Ooms | j****s@g****m | 3 |
| Apoorva Pandey | a****5@g****m | 3 |
| David J. Harris | d****s | 2 |
| Shawn | s****r@w****g | 2 |
| Pakillo | f****c@g****m | 2 |
| Arfon Smith | a****n | 2 |
| Apoorva Pandey | a****a@l****n | 1 |
| Frederick Boehm | f****m@g****m | 1 |
| Karthik Ram | k****m@g****m | 1 |
| Maëlle Salmon | m****n@y****e | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 30
- Total pull requests: 75
- Average time to close issues: 10 months
- Average time to close pull requests: 12 days
- Total issue authors: 9
- Total pull request authors: 7
- Average comments per issue: 2.93
- Average comments per pull request: 0.91
- Merged pull requests: 62
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: about 1 hour
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ethanwhite (11)
- RMHogervorst (7)
- henrykironde (4)
- ha0ye (3)
- maelle (1)
- fboehm (1)
- harshitbansal05 (1)
- dmcglinn (1)
- gdicecco (1)
Pull Request Authors
- henrykironde (51)
- ethanwhite (21)
- arfon (2)
- jeroen (2)
- maelle (1)
- fboehm (1)
- ashishpriyadarshiCIC (1)
Packages
- Total packages: 1
- Total downloads: cran 277 last month
- Total docker downloads: 88,618
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 5
- Total maintainers: 1
cran.r-project.org: rdataretriever
R Interface to the Data Retriever
- Homepage: https://docs.ropensci.org/rdataretriever/ (website)
- Documentation: http://cran.r-project.org/web/packages/rdataretriever/rdataretriever.pdf
- License: MIT + file LICENSE
- Latest release: 3.1.1 (published over 1 year ago)
Maintainers (1)
Dependencies
- R >= 3.4.0 depends
- reticulate >= 1.16 imports
- semver * imports
- DBI * suggests
- RPostgreSQL * suggests
- RSQLite * suggests
- dbplyr * suggests
- devtools * suggests
- ggplot2 * suggests
- knitr * suggests
- raster * suggests
- rmarkdown * suggests
- testthat >= 1.0.0 suggests
- actions/checkout v2 composite
- rocker/tidyverse latest build
- mysql 5.7
- postgres latest
- rdata_image latest
