dataverse

R Client for Dataverse Repositories

https://github.com/iqss/dataverse-client-r

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 6 DOI reference(s) in README
○ Academic publication links
✓ Committers with academic emails: 1 of 8 committers (12.5%) from academic institutions
✓ Institutional organization owner: Organization iqss has institutional domain (iq.harvard.edu)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.5%) to scientific vocabulary

Keywords

cran data data-deposit dataverse dataverse-api r sword

Last synced: 6 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: IQSS
Language: R
Default Branch: main
Homepage: https://iqss.github.io/dataverse-client-r
Size: 1.25 MB

Statistics

Stars: 63
Watchers: 13
Forks: 26
Open Issues: 15
Releases: 6

Topics

cran data data-deposit dataverse dataverse-api r sword

Created over 10 years ago · Last pushed 9 months ago

Metadata Files

Readme Changelog

README.Rmd

---
title: "R Client for Dataverse Repositories"
output: github_document
---

```{r knitr_options, echo=FALSE, results="hide"}
options(width = 120)
knitr::opts_chunk$set(results = "hold")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
```

[![CRAN Version](https://www.r-pkg.org/badges/version/dataverse)](https://cran.r-project.org/package=dataverse)
![Downloads](https://cranlogs.r-pkg.org/badges/dataverse)

[![R-CMD-check-thorough](https://github.com/IQSS/dataverse-client-r/actions/workflows/R-CMD-check-thorough.yaml/badge.svg?branch=main)](https://github.com/IQSS/dataverse-client-r/actions/workflows/R-CMD-check-thorough.yaml)
[![R-CMD-check-daily](https://github.com/IQSS/dataverse-client-r/actions/workflows/R-CMD-check-daily.yaml/badge.svg?branch=main)](https://github.com/IQSS/dataverse-client-r/actions/workflows/R-CMD-check-daily.yaml)
[![R-CMD-check-dev](https://github.com/IQSS/dataverse-client-r/actions/workflows/R-CMD-check-dev.yaml/badge.svg)](https://github.com/IQSS/dataverse-client-r/actions/workflows/R-CMD-check-dev.yaml)
[![codecov.io](https://codecov.io/github/IQSS/dataverse-client-r/coverage.svg?branch=main)](https://app.codecov.io/github/IQSS/dataverse-client-r?branch=main)

[![Dataverse Project logo](https://dataverse.org/files/dataverseorg/files/dataverse_project_logo-hp.png)](https://dataverse.org)

The **dataverse** package provides access to [Dataverse](https://dataverse.org/) APIs (versions 4+), enabling data search, retrieval, and deposit, thus allowing R users to integrate public data sharing into the reproducible research workflow. 


### Getting Started

You can find a stable release on [CRAN](https://cran.r-project.org/package=dataverse), or install the latest development version from [GitHub](https://github.com/iqss/dataverse-client-r/):


```{r, eval = FALSE}
# Install from CRAN
install.packages("dataverse")

# Install from GitHub
# install.packages("remotes")
remotes::install_github("iqss/dataverse-client-r")
```

```{r, eval = TRUE, echo = FALSE}
library("dataverse")
```

#### API Access Keys

Many features of the Dataverse API are public and require no authentication. In many cases you can therefore search for and retrieve data without a Dataverse account or API key.

Other features require a Dataverse account on the specific server installation of the Dataverse software, and an API key linked to that account. Instructions for obtaining an account and setting up an API key are available in the [Dataverse User Guide](https://guides.dataverse.org/en/latest/user/account.html). (Note: if your key is compromised, it can be regenerated to preserve security.) Once you have an API key, store it as an environment variable called `DATAVERSE_KEY`. It can be set as a default by adding

``` r
DATAVERSE_KEY="examplekey12345"
```

in your `.Renviron` file, where `examplekey12345` should be replaced with your own key. The environment file can be opened with `usethis::edit_r_environ()`.
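Before calling functions that need authentication, it can help to confirm that the key is actually visible to the current session. The following is a minimal sketch using only base R; `has_dataverse_key()` is a hypothetical helper written for this illustration and is not part of the package.

```r
# Hypothetical helper (not part of the dataverse package):
# TRUE when a non-empty DATAVERSE_KEY is visible to this R session.
has_dataverse_key <- function() {
  nzchar(Sys.getenv("DATAVERSE_KEY"))
}

if (!has_dataverse_key()) {
  message("DATAVERSE_KEY is not set; only public, unauthenticated requests will work.")
}
```

Note that changes to `.Renviron` are only picked up when R starts, so a restart is usually needed after editing the file.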


#### Server

Because [there are many Dataverse installations](https://dataverse.org/), all functions in the R client require specifying which server installation you are interacting with. There are multiple ways to specify the server:

1. Set the `server` argument in each function. e.g., `server = "dataverse.harvard.edu"` in the `get_dataframe_by_name()` function.

2. Set the environment variable, `DATAVERSE_SERVER`, in the script to be used throughout the session.  e.g.,

``` r
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
```

3. Hard-code a default server in your own environment. Edit your `.Renviron` file directly or open it with `usethis::edit_r_environ()`, then enter `DATAVERSE_SERVER = "dataverse.harvard.edu"`. However, doing this may make your scripts harder to replicate for other people who do not have access to your environment.

In all cases, values should be the Dataverse server, without the "https" prefix or the "/api" URL path.
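If server values come from copied URLs, they can be cleaned up before use. The sketch below assumes a hypothetical helper, `normalize_server()`, that is not part of the package; it only strips the scheme and a trailing `/api` path with base R string functions.

```r
# Hypothetical helper (not part of the dataverse package):
# reduce a pasted URL to the bare server name.
normalize_server <- function(x) {
  x <- trimws(x)
  x <- sub("^https?://", "", x, ignore.case = TRUE) # drop the URL scheme
  x <- sub("/api(/.*)?$", "", x)                    # drop a trailing "/api" path
  sub("/+$", "", x)                                 # drop any trailing slashes
}

server <- normalize_server("https://dataverse.harvard.edu/api")
print(server)
```

The cleaned value can then be passed as the `server` argument or stored with `Sys.setenv("DATAVERSE_SERVER" = server)`.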

### Data Download

The dataverse package provides multiple interfaces for reading data into R. Users can supply a file DOI, a dataset DOI combined with a filename, or a dataverse object. The file can be read either as a raw binary or as a dataset parsed with the appropriate R function.

#### Reading data as R objects

Use the `get_dataframe_*()` functions, depending on the input you have. For example, we will read a survey dataset on Dataverse, [nlsw88.dta](https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/PPIAXE) (`doi:10.70122/FK2/PPIAXE/MHDB0O`), originally in Stata dta form.

With a file DOI, we can use the `get_dataframe_by_doi` function:

```{r get_dataframe_by_doi}
nlsw <-
  get_dataframe_by_doi(
    filedoi     = "10.70122/FK2/PPIAXE/MHDB0O",
    server      = "demo.dataverse.org"
  )
```
By default, this reads in the ingested file (not the original dta) with the [`readr::read_tsv`](https://readr.tidyverse.org/reference/read_delim.html) function.

Alternatively, we can download the same file by specifying the filename and the DOI of the "dataset" (in Dataverse, a collection of files is called a dataset).

```{r get_dataframe_by_name_tsv, message=FALSE}
nlsw_tsv <-
  get_dataframe_by_name(
    filename  = "nlsw88.tab",
    dataset   = "10.70122/FK2/PPIAXE",
    server    = "demo.dataverse.org"
  )
```

**The `original` argument:** Dataverse often translates rectangular data into an ingested, or "archival", version that is application-neutral and easily readable. The `get_dataframe_*()` functions default to this ingested version rather than the original, through the argument `original = FALSE`.
This default is safe because you may not have the proprietary software that was originally used.

On the other hand, the data may have lost information during ingestion.
To read the original version of the same file instead, specify `original = TRUE` and set an `.f` argument. In this case, we know that `nlsw88.tab` was originally a Stata `.dta` dataset, so we will use the `haven::read_dta` function.

```{r get_dataframe_by_name_original}
nlsw_original <-
  get_dataframe_by_name(
    filename    = "nlsw88.tab",
    dataset     = "10.70122/FK2/PPIAXE",
    .f          = haven::read_dta,
    original    = TRUE,
    server      = "demo.dataverse.org"
  )
```

Note that even though the file extension is ".tab", we use `haven::read_dta`.

Of course, when a file has not been ingested (such as an Rds file), users always need to specify an `.f` argument suited to that file.
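As an illustration of supplying your own reader, the sketch below defines a small function that takes a file path and returns a data frame, then checks it locally on a temporary file. `read_rds_as_df()` is a hypothetical name, and the sketch assumes that `.f` is called with the path of the downloaded file as its first argument; the file name and DOI in the commented call are placeholders, not real identifiers.

```r
# Hypothetical reader for a non-ingested Rds file: path in, data frame out.
read_rds_as_df <- function(path) {
  as.data.frame(readRDS(path))
}

# Local check that needs no network: round-trip a built-in data set.
tmp <- tempfile(fileext = ".rds")
saveRDS(head(iris), tmp)
str(read_rds_as_df(tmp))

# With a real non-ingested file, the call might look like this
# (placeholder file name and DOI):
# get_dataframe_by_name(
#   filename = "mydata.rds",
#   dataset  = "doi:10.xxxx/XXX/XXXXXX",
#   .f       = read_rds_as_df,
#   server   = "demo.dataverse.org"
# )
```

Testing the reader on a local copy first makes it easier to tell reader problems apart from download problems.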

Note the difference between `nlsw_tsv` and `nlsw_original`. `nlsw_original` preserves data attributes such as value labels, whereas `nlsw_tsv` has dropped them or left them in the file metadata.

```{r}
class(nlsw_tsv$race) # tab ingested version only has numeric data
```

```{r}
attr(nlsw_original$race, "labels") # original dta has value labels
```


**Caching**: Downloading a large dataset from the internet can be time consuming, and users typically want to run the download only once in a script they run multiple times. As of version 0.3.15, the package caches the downloaded data if the user specifies which version of the Dataverse dataset to download from. See the `version` argument in the help page.
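A pinned-version download might be wrapped as follows. This is a sketch under assumptions: `read_nlsw_pinned()` is a hypothetical wrapper, it assumes `get_dataframe_by_name()` accepts `version` as described in the help page, and the version string you pass must match a published version of the dataset, which is not verified here.

```r
# Hypothetical wrapper: read the example file from one fixed dataset version,
# so that repeated runs can reuse the cached download.
read_nlsw_pinned <- function(version) {
  dataverse::get_dataframe_by_name(
    filename = "nlsw88.tab",
    dataset  = "10.70122/FK2/PPIAXE",
    server   = "demo.dataverse.org",
    version  = version # e.g. a published version number given as a string
  )
}

# Not run here because it needs network access:
# nlsw_v <- read_nlsw_pinned("1")
```

Making `version` a required argument keeps the pinning explicit in every script that calls the wrapper.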

### Data Upload and Archiving

**Note**: _There are known issues with the dataverse creation and dataset addition functionalities listed here. `add_dataset_file()` appears stable again as of v0.3.11. One possible workaround is to mix the two workflows described below (see e.g. this [comment](https://github.com/IQSS/dataverse-client-r/issues/82#issuecomment-1094623268))._

Dataverse provides two basically unrelated workflows for managing (adding, documenting, and publishing) datasets. The first workflow is called the "native" API. It uses `create_dataset()` to make an empty dataset and `add_dataset_file()` to add files, taking a path to a file located on your local machine. Through the native API it is possible to update a dataset by modifying its metadata with `update_dataset()` or its file contents with `update_dataset_file()`, and then republish a new version using `publish_dataset()`.

``` r
# create the dataset, e.g.
ds <- create_dataset("mydataverse") # name of the dataverse that will hold the dataset

# add files
tmp <- tempfile() # In this example, we write to a temporary destination
write.csv(iris, file = tmp)
add_dataset_file(file = tmp, dataset = ds)

# publish dataset
publish_dataset(ds)

# dataset will now be published
get_dataverse("mydataverse")
```


The second workflow is built on [SWORD](https://sword.cottagelabs.com/) (v2.0). To create a new dataset listing, you first initialize a dataset entry with some metadata, then add one or more files to the dataset, and then publish it. This looks something like the following:

``` r
# After setting appropriate dataverse server and environment, obtain SWORD
# service doc
d <- service_document()

# create a list of metadata for the dataset
metadat <-
  list(
    title       = paste0("My-Study_", format(Sys.time(), '%Y-%m-%d_%H:%M')),
    creator     = "Doe, John",
    description = "An example study"
  )

# create the dataset, where "mydataverse" is to be replaced by the name 
# of the already-created dataverse as shown in the URL
ds <- initiate_sword_dataset("mydataverse", body = metadat)

# write the file to add to the dataset
readr::write_csv(iris, file = "iris.csv")

# Look up the initiated dataset and use its DOI and version as the identifier
mydoi <- "doi:10.70122/FK2/BMZPJZ&version=DRAFT"

# add the file to the dataset
add_dataset_file(file = "iris.csv", dataset = mydoi)

# publish new dataset
publish_sword_dataset(ds)

# dataset will now be published
list_datasets("mydataverse")
```



### Limitations

The R client is currently stable for data search and download. For more extensive features of _uploading_ and maintaining data, see the issues reported in the GitHub repository. You may need to use alternative methods, such as working in the Dataverse GUI directly or using [pyDataverse](https://pydataverse.readthedocs.io/en/latest/).

Functions related to user management and permissions are currently not exported in the package (but are drafted in the source code).




### Related Software

**dataverse** is the next-generation iteration of the now-removed **dvn** package, which worked with Dataverse 3 ("Dataverse Network") applications.

Dataverse clients in other programming languages include [pyDataverse](https://pydataverse.readthedocs.io/en/latest/) for Python and the [Java client](https://github.com/IQSS/dataverse-client-java). For more information, see [the Dataverse API page](https://guides.dataverse.org/en/5.5/api/client-libraries.html#r).

Users interested in downloading metadata from archives other than Dataverse may be interested in Kurt Hornik's [OAIHarvester](https://cran.r-project.org/package=OAIHarvester) and Scott Chamberlain's [oai](https://cran.r-project.org/package=oai), which offer metadata download from any web repository that is compliant with the [Open Archives Initiative](https://www.openarchives.org:443/) standards. Additionally, [rdryad](https://cran.r-project.org/package=rdryad) uses OAIHarvester to interface with [Dryad](https://datadryad.org/). The [rfigshare](https://cran.r-project.org/package=rfigshare) package works in a similar spirit to **dataverse** with Figshare.


### More Information

A 2021 talk demonstrating the Dataverse package is available at .

Owner

Name: Institute for Quantitative Social Science
Login: IQSS
Kind: organization
Location: Harvard University, Cambridge, MA, USA

Website: http://iq.harvard.edu
Repositories: 160
Profile: https://github.com/IQSS

GitHub Events

Total

Issues event: 3
Watch event: 2
Member event: 1
Issue comment event: 3
Push event: 6
Pull request event: 3
Pull request review event: 1
Fork event: 1

Last Year

Issues event: 3
Watch event: 2
Member event: 1
Issue comment event: 3
Push event: 6
Pull request event: 3
Pull request review event: 1
Fork event: 1

Committers

Last synced: over 2 years ago

All Time

Total Commits: 467
Total Committers: 8
Avg Commits per committer: 58.375
Development Distribution Score (DDS): 0.57

Past Year

Commits: 19
Committers: 1
Avg Commits per committer: 19.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Shiro Kuriwaki	s**i@g**m	201
Will Beasley	w**y@h**m	158
Thomas J. Leeper	t**r@g**m	94
adam3smith	k**r@u**u	8
Ed Jee	e**6@g**m	3
sindribaldur	s**b@g**m	1
Danny-dK	4****K	1
Jan Kanis	j**e@j**l	1

Committer Domains (Top 20 + Academic)

jankanis.nl: 1
u.northwestern.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 76
Total pull requests: 38
Average time to close issues: 9 months
Average time to close pull requests: about 1 month
Total issue authors: 19
Total pull request authors: 10
Average comments per issue: 4.11
Average comments per pull request: 1.84
Merged pull requests: 35
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 4
Average time to close issues: 2 months
Average time to close pull requests: 14 days
Issue authors: 1
Pull request authors: 3
Average comments per issue: 4.0
Average comments per pull request: 3.5
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

wibeasley (31)
kuriwaki (19)
adam3smith (4)
leeper (3)
Danny-dK (3)
pdurbin (2)
MarcosAnjos (1)
sjkiss (1)
beniaminogreen (1)
EdJeeOnGitHub (1)
billy34 (1)
thomascli19 (1)
eblondel (1)
christopherkenny (1)
paulgronke (1)

Pull Request Authors

kuriwaki (17)
wibeasley (10)
adam3smith (3)
Danny-dK (2)
JBGruber (2)
beniaminogreen (2)
mtmorgan (2)
sindribaldur (1)
billy34 (1)
EdJeeOnGitHub (1)

Top Labels

Issue Labels

testing (17) data-download (13) enhancement (9) bug (7) documentation (4) help wanted (2) question (1)

Pull Request Labels

testing (1)

Packages

Total packages: 1
Total downloads:
- cran 956 last-month
Total docker downloads: 43,390

Total dependent packages: 4
Total dependent repositories: 48
Total versions: 11
Total maintainers: 1

cran.r-project.org: dataverse

Client for Dataverse 4+ Repositories

Homepage: https://iqss.github.io/dataverse-client-r/
Documentation: http://cran.r-project.org/web/packages/dataverse/dataverse.pdf
License: GPL-2
Latest release: 0.3.15
published 9 months ago

Versions: 11
Dependent Packages: 4
Dependent Repositories: 48
Downloads: 956 Last month
Docker Downloads: 43,390

Rankings

Docker downloads count: 0.4%

Forks count: 3.3%

Dependent repos count: 3.6%

Stargazers count: 5.9%

Average: 7.0%

Dependent packages count: 9.3%

Downloads: 19.3%

Maintainers (1)

shirokuriwaki@gmail.com

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

checkmate * imports
httr * imports
jsonlite * imports
readr * imports
stats * imports
utils * imports
xml2 * imports
covr * suggests
haven * suggests
knitr * suggests
purrr * suggests
rmarkdown * suggests
testthat * suggests
tibble * suggests
yaml * suggests

.github/workflows/R-CMD-check-daily.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc master composite
r-lib/actions/setup-r v2 composite

.github/workflows/R-CMD-check-dev.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc master composite
r-lib/actions/setup-r v1 composite

.github/workflows/R-CMD-check-thorough.yaml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/upload-artifact main composite
r-lib/actions/setup-pandoc v1 composite
r-lib/actions/setup-r v1 composite

.github/workflows/pkgdown.yaml actions

actions/cache v2 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc v1 composite
r-lib/actions/setup-r v1 composite
