perccalc

perccalc: An R package for estimating percentiles from categorical variables - Published in JOSS (2019)

https://github.com/cimentadaj/perccalc

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    2 of 5 committers (40.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software
Last synced: 6 months ago

Repository

Estimate percentile differences from ordered categorical data

Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 3
  • Open Issues: 0
  • Releases: 2
Created over 8 years ago · Last pushed over 5 years ago
Metadata Files
Readme Changelog Contributing License

README.Rmd

---
output:
  github_document:
    html_preview: false

---
# perccalc 

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
options(tibble.print_min = 5, tibble.print_max = 5)
```

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/perccalc)](http://cran.r-project.org/package=perccalc)
[![Travis-CI Build Status](https://travis-ci.org/cimentadaj/perccalc.svg?branch=master)](https://travis-ci.org/cimentadaj/perccalc)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/cimentadaj/perccalc?branch=master&svg=true)](https://ci.appveyor.com/project/cimentadaj/perccalc)
[![Codecov test coverage](https://codecov.io/gh/cimentadaj/perccalc/branch/master/graph/badge.svg)](https://codecov.io/gh/cimentadaj/perccalc?branch=master)
[![DOI](https://joss.theoj.org/papers/10.21105/joss.01796/status.svg)](https://doi.org/10.21105/joss.01796)

## Overview

`perccalc` is a direct implementation of the theoretical work of [Reardon
(2011)](https://www.russellsage.org/publications/whither-opportunity), which makes it possible to
estimate the difference between two percentiles of an ordered categorical variable. More
concretely, given an ordered categorical variable and a continuous variable, the method estimates
differences in the continuous variable between percentiles of the ordered categorical variable.
This is useful in two situations. First, when an ordered categorical variable has a continuous
counterpart, results based on the categories can be compared with results based on percentiles of
the continuous measure. Second, it allows percentile distributions and percentile differences to
be calculated for ordered categorical variables that have no continuous counterpart, such as
occupational classifications.

The package has two main functions:

* `perc_diff`, for calculating percentile differences
* `perc_dist`, for calculating scores for all percentiles
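
To give a feel for the idea, here is a minimal base R sketch of one way to read the procedure: place each category at the midpoint of the percentile range it covers, fit a smooth curve through the category means, and compare the fitted values at two percentiles. All numbers and object names (`n_cat`, `mean_cat`, `midpoint`, `diff_90_10`) are made up for illustration. Using category sizes as regression weights is a simplifying assumption. This is not the package's internal code, and it does not produce standard errors.

```{r}
# Made-up summary data: five ordered categories with their sample sizes
# and the mean of a continuous outcome in each category (hypothetical values)
n_cat    <- c(200, 300, 250, 150, 100)
mean_cat <- c(4.8, 5.6, 6.1, 6.3, 6.2)

# Step 1: place each category on a 0-1 percentile scale, using the midpoint
# of the range it covers in the cumulative distribution
share    <- n_cat / sum(n_cat)
upper    <- cumsum(share)
midpoint <- upper - share / 2

# Step 2: fit a cubic polynomial through the category means
# (weighting by category size is an assumption of this sketch)
fit <- lm(mean_cat ~ poly(midpoint, 3, raw = TRUE), weights = n_cat)

# Step 3: compare the fitted values at the 90th and 10th percentiles
pred <- predict(fit, newdata = data.frame(midpoint = c(0.9, 0.1)))
diff_90_10 <- unname(pred[1] - pred[2])
diff_90_10
```

`perc_diff` and `perc_dist` wrap this kind of workflow and also report standard errors, so in practice you would call them rather than fit the curve by hand.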

## Installation

You can install the package with either of these commands:

```{r, eval = FALSE}
install.packages("perccalc") # for stable version
# or
devtools::install_github("cimentadaj/perccalc") # for development version
```

## Usage

To look at a real-world example, let's use data from the General Social Survey (GSS). The dataset contains subjects' scores on a vocabulary test together with their age expressed in age groups (such as `30-39`, `40-49`, etc.). We're interested in the difference in vocabulary test scores between older and younger respondents.

In many scenarios, we could calculate this difference by estimating the mean difference between two age groups (for example, ages `20-29` versus ages `60-69`). In many other settings, however, we're specifically interested in how vocabulary scores differ across percentiles of the age variable. This is of particular interest for studies that want to compare their results with studies that measure age as a continuous variable. In our example, age is a categorical variable, so we cannot calculate percentiles directly. The method implemented in this package provides a strategy for calculating percentiles from ordered categories.

Let's load our packages of interest and limit the GSS data to the year 1978.

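```{r, message = FALSE, warning = FALSE}
library(perccalc)
library(dplyr)
library(ggplot2)
library(carData)

# Fix the seed so the simulated weights below are reproducible
set.seed(213141)
data("GSSvocab")

gss <- 
  as_tibble(GSSvocab) %>% 
  filter(year == "1978") %>% 
  # Simulated weights for illustration only; they are not real survey weights
  mutate(weight = sample(1:3, size = nrow(.), replace = TRUE, prob = c(0.1, 0.5, 0.4)),
         # The grouping variable must be an ordered factor
         ageGroup = factor(ageGroup, ordered = TRUE)) %>%
  select(ageGroup, vocab, weight)
```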
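
Because the order of the factor levels determines how the categories are ranked, it can be worth checking the levels before estimating anything. The sketch below uses a made-up character vector (`income_chr` and its labels are hypothetical) to show that relying on the default ordering can give an unintended ranking.

```{r}
# Hypothetical categorical variable stored as plain text
income_chr <- c("low", "high", "medium", "low", "high")

# Without explicit levels, R sorts the categories alphabetically
default_levels <- levels(factor(income_chr, ordered = TRUE))
default_levels

# Stating the levels explicitly gives the intended ranking
income <- factor(income_chr, levels = c("low", "medium", "high"), ordered = TRUE)
is.ordered(income)
levels(income)
```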

Note that the categorical variable (`ageGroup`) has to be an ordered factor (this is a requirement of both functions). Moving to the example, `perc_diff` calculates the difference in the continuous variable between percentiles of the ordered categorical variable. In our example, the question is: what is the difference in vocabulary test scores between the 90th and 10th percentiles of the age groups?

```{r}
perc_diff(gss, ageGroup, vocab, percentiles = c(90, 10))
```

It's about 0.21 points with a standard error of 0.39 points. In addition, you can optionally supply weights with the `weights` argument.

```{r}
perc_diff(gss, ageGroup, vocab, percentiles = c(90, 10), weights = weight)
```
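
As a rough intuition for what weighting does, the sketch below compares unweighted and weighted group means on made-up data. The data frame `toy` and all of its values are hypothetical. It only illustrates the general idea of weighting and does not describe how `perc_diff` uses weights internally.

```{r}
# Hypothetical scores, group labels and weights (made-up values)
toy <- data.frame(
  group  = c("young", "young", "young", "old", "old", "old"),
  score  = c(4, 6, 8, 5, 7, 9),
  weight = c(1, 1, 2, 3, 1, 1)
)

# Unweighted mean per group: every observation counts equally
unweighted <- tapply(toy$score, toy$group, mean)

# Weighted mean per group: observations with larger weights count more
weighted <- sapply(split(toy, toy$group),
                   function(d) weighted.mean(d$score, d$weight))

unweighted
weighted
```

Here the weighted means move toward the observations that carry the largest weights, which is why weighted and unweighted estimates can differ.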

The second function, `perc_dist` (short for percentile distribution), estimates the score at every percentile rather than limiting the analysis to the difference between two percentiles.

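```{r}
# Store the result under its own name so it does not mask the perc_dist() function
vocab_dist <- perc_dist(gss, ageGroup, vocab)
vocab_dist
```

We could visualize this in a more intuitive representation:

```{r}
# Plot the estimated score at each percentile of the age distribution
vocab_dist %>%
  ggplot(aes(percentile, estimate)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  labs(x = "Age group percentiles",
       y = "Vocabulary test scores")
```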

`perc_dist` also accepts weights, as `perc_diff` does.

## Documentation and Support

Please visit https://cimentadaj.github.io/perccalc/ for documentation and
vignettes with real-world examples. In case you want to file an issue or
contribute in another way to the package, please follow this
[guide](https://github.com/cimentadaj/perccalc/blob/master/.github/CONTRIBUTING.md). For
questions about the functionality, feel free to file an issue on GitHub.

- Reardon, Sean F. (2011). "The widening academic achievement gap between the rich and the poor: New evidence and possible explanations." In *Whither Opportunity*, pp. 91-116.

Owner

  • Name: Jorge Cimentada
  • Login: cimentadaj
  • Kind: user
  • Location: Madrid
  • Company: Senior Data Scientist @ eDreams

JOSS Publication

perccalc: An R package for estimating percentiles from categorical variables
Published
December 08, 2019
Volume 4, Issue 44, Page 1796
Authors
Jorge Cimentada
Laboratory of Digital and Computational Demography, Max Planck Institute of Demographic Research (MPIDR)
Editor
Mark A. Jensen
Tags
categorical data analysis, achievement gaps

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 136
  • Total Committers: 5
  • Avg Commits per committer: 27.2
  • Development Distribution Score (DDS): 0.059
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name               Email            Commits
Jorge Cimentada    c****j@g****m    128
François Briatte   b****e           3
Mark A. Jensen     m****n@n****v    2
Daniel S. Katz     d****z@i****g    2
Jeroen             j****s@g****m    1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 7
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 3 months
  • Total issue authors: 1
  • Total pull request authors: 4
  • Average comments per issue: 21.0
  • Average comments per pull request: 0.29
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • amoeba (1)
Pull Request Authors
  • briatte (4)
  • jeroen (1)
  • majensen (1)
  • danielskatz (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 257 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 6
  • Total maintainers: 1
cran.r-project.org: perccalc

Estimate Percentiles from an Ordered Categorical Variable

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 257 Last month
Rankings
Forks count: 14.9%
Stargazers count: 24.2%
Dependent packages count: 29.8%
Average: 32.0%
Dependent repos count: 35.5%
Downloads: 55.8%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.4.0 depends
  • multcomp * imports
  • stats * imports
  • tibble * imports
  • MASS * suggests
  • carData * suggests
  • covr * suggests
  • dplyr * suggests
  • ggplot2 * suggests
  • knitr * suggests
  • magrittr * suggests
  • rmarkdown * suggests
  • spelling * suggests
  • testthat * suggests
  • tidyr >= 1.0.0 suggests