expss

expss: Tables and Labels in R

https://github.com/gdemin/expss

Keywords

excel labels labels-support msexcel pivot-tables r recode spss spss-statistics tables variable-labels vlookup

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: gdemin
Language: R
Default Branch: master
Homepage: https://cran.r-project.org/web/packages/expss/
Size: 16 MB

Statistics

Stars: 84
Watchers: 6
Forks: 16
Open Issues: 8
Releases: 21

Created almost 11 years ago · Last pushed almost 2 years ago

expss

Introduction

expss computes and displays tables with support for 'SPSS'-style labels, multiple / nested banners, weights, multiple-response variables and significance testing. There are facilities for nice output of tables in 'knitr', R notebooks, 'Shiny' and 'Jupyter' notebooks. Methods for labelled variables add value labels support to base R functions and to some functions from other packages. Additionally, the package offers useful functions for data processing in marketing research / social surveys: popular data transformation functions from 'SPSS' Statistics and 'Excel' ('RECODE', 'COUNT', 'COUNTIF', 'VLOOKUP', etc.). The package is intended to help people move data processing from 'Excel'/'SPSS' to R. See examples below. You can get help about any function by typing `?function_name` in the R console.

Installation

expss is on CRAN, so to install it type `install.packages("expss")` in the console.

Cross-tabulation examples

For demonstration we will use the well-known `mtcars` dataset. Let's start by adding labels to the dataset. Then we can continue with table creation.

```R
library(expss)
data(mtcars)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      cyl = "Number of cylinders",
                      disp = "Displacement (cu.in.)",
                      hp = "Gross horsepower",
                      drat = "Rear axle ratio",
                      wt = "Weight (1000 lbs)",
                      qsec = "1/4 mile time",
                      vs = "Engine",
                      vs = c("V-engine" = 0,
                             "Straight engine" = 1),
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual"=1),
                      gear = "Number of forward gears",
                      carb = "Number of carburetors"
)
```

For quick cross-tabulation there are the `fre` and `cross_*` families of functions. For simplicity we demonstrate mainly `cross_cpct`, which calculates column percentages. Documentation for the other functions, such as `cross_cases` for counts, `cross_rpct` for row percentages, `cross_tpct` for table percentages and `cross_fun` for custom summary functions, can be seen by typing `?cross_cpct` and `?cross_fun` in the console.
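As a point of reference, the column percentages that `cross_cpct` reports can be approximated in base R. The sketch below is not part of expss: it uses only `table` and `prop.table`, ignores labels, weights and total rows, and the object names `counts` and `col_pct` are illustrative.

```R
data(mtcars)

# counts of cylinders (rows) by transmission type (columns)
counts = table(cyl = mtcars$cyl, am = mtcars$am)

# margin = 2 divides each cell by its column total, giving column percentages
col_pct = prop.table(counts, margin = 2) * 100

round(col_pct, 1)
```

Each column of `col_pct` sums to 100, which is the defining property of a column percent table.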

```R
# 'cross_*' examples
# just simple crosstabulation, similar to base R 'table' function
cross_cases(mtcars, am, vs)

# Table column % with multiple banners
cross_cpct(mtcars, cyl, list(total(), am, vs))

# magrittr pipe usage and nested banners
mtcars %>%
    cross_cpct(cyl, list(total(), am %nest% vs))
```

We have a more sophisticated interface for table construction with `magrittr` piping. Table construction consists of at least three functions chained with the pipe operator `%>%`. First, we specify the variables for which statistics will be computed with `tab_cells`. Second, we calculate statistics with one of the `tab_stat_*` functions. Last, we finalize table creation with `tab_pivot`, e.g. `dataset %>% tab_cells(variable) %>% tab_stat_cases() %>% tab_pivot()`. After that we can optionally sort the table with `tab_sort_asc`, drop empty rows/columns with `drop_rc` and transpose with `tab_transpose`. The resulting table is just a `data.frame`, so we can use the usual R operations on it. Detailed documentation for table creation can be seen via `?tables`. For significance testing see `?significance`. Generally, tables are automatically translated to HTML for output in knitr or Jupyter notebooks. However, if we want HTML output in R notebooks or in the RStudio viewer we need to set options for that: `expss_output_rnotebook()` or `expss_output_viewer()`.

```R
# simple example
mtcars %>%
    tab_cells(cyl) %>%
    tab_cols(total(), am) %>%
    tab_stat_cpct() %>%
    tab_pivot()

# table with caption
mtcars %>%
    tab_cells(mpg, disp, hp, wt, qsec) %>%
    tab_cols(total(), am) %>%
    tab_stat_mean_sd_n() %>%
    tab_last_sig_means(subtable_marks = "both") %>%
    tab_pivot() %>%
    set_caption("Table with summary statistics and significance marks.")

# Table with the same summary statistics. Statistics labels in columns.
mtcars %>%
    tab_cells(mpg, disp, hp, wt, qsec) %>%
    tab_cols(total(label = "#Total| |"), am) %>%
    tab_stat_fun(Mean = w_mean, "Std. dev." = w_sd, "Valid N" = w_n, method = list) %>%
    tab_pivot()

# Different statistics for different variables.
mtcars %>%
    tab_cols(total(), vs) %>%
    tab_cells(mpg) %>%
    tab_stat_mean() %>%
    tab_stat_valid_n() %>%
    tab_cells(am) %>%
    tab_stat_cpct(total_row_position = "none", label = "col %") %>%
    tab_stat_rpct(total_row_position = "none", label = "row %") %>%
    tab_stat_tpct(total_row_position = "none", label = "table %") %>%
    tab_pivot(stat_position = "inside_rows")

# Table with split by rows and with custom totals.
mtcars %>%
    tab_cells(cyl) %>%
    tab_cols(total(), vs) %>%
    tab_rows(am) %>%
    tab_stat_cpct(total_row_position = "above",
                  total_label = c("number of cases", "row %"),
                  total_statistic = c("u_cases", "u_rpct")) %>%
    tab_pivot()

# Linear regression by groups.
mtcars %>%
    tab_cells(sheet(mpg, disp, hp, wt, qsec)) %>%
    tab_cols(total(label = "#Total| |"), am) %>%
    tab_stat_fun_df(
        function(x){
            # the first column (mpg) is the response, all others are predictors
            frm = reformulate(".", response = as.name(names(x)[1]))
            model = lm(frm, data = x)
            sheet('Coef.' = coef(model),
                  confint(model)
            )
        }
    ) %>%
    tab_pivot()
```

Example of data processing with multiple-response variables

Here we use a truncated dataset with data from a product test of two samples of chocolate sweets. 150 respondents tested two kinds of sweets (codenames: VSX123 and SDF456). The sample was divided into two groups (cells) of 75 respondents each. In cell 1 product VSX123 was presented first and then SDF456. In cell 2 the sweets were presented in reverse order. Questions about the respondent's impressions of the first tested product are in block A, and questions about the second tested product are in block B. At the end of the questionnaire there was a question about the preference between the sweets.

List of variables:

- `id` Respondent Id
- `cell` First tested product (cell number)
- `s2a` Age
- `a1_1-a1_6` What did you like in these sweets? Multiple response. First tested product
- `a22` Overall quality. First tested product
- `b1_1-b1_6` What did you like in these sweets? Multiple response. Second tested product
- `b22` Overall quality. Second tested product
- `c1` Preferences
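Because the presentation order differs between cells, answers recorded as "first product" and "second product" have to be reassigned to the actual products. A minimal base R sketch of that idea on a toy data frame follows; the values in `toy` are made up for illustration, and only the column names `cell`, `a22` and `b22` mirror the variable list.

```R
# hypothetical data: two respondents per cell
toy = data.frame(
    cell = c(1, 1, 2, 2),
    a22  = c(7, 5, 3, 6),  # overall quality, first tested product
    b22  = c(4, 6, 5, 2)   # overall quality, second tested product
)

# in cell 1 the first product is VSX123, in cell 2 it is SDF456
toy$h22 = ifelse(toy$cell == 1, toy$a22, toy$b22)  # VSX123
toy$p22 = ifelse(toy$cell == 1, toy$b22, toy$a22)  # SDF456

toy
```

For cell 1 the values are copied unchanged, while for cell 2 the A and B answers swap places.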

```R
data(product_test)

w = product_test # shorter name to save some keystrokes

# here we recode variables from first/second tested product to separate variables for each product according to their cells
# 'h' variables - VSX123 sample, 'p' variables - 'SDF456' sample
# also we recode preferences from first/second product to true names
# for first cell there are no changes, for second cell we should exchange 1 and 2.
w = w %>%
    let_if(cell == 1,
        h1_1 %to% h1_6 := recode(a1_1 %to% a1_6, other ~ copy),
        p1_1 %to% p1_6 := recode(b1_1 %to% b1_6, other ~ copy),
        h22 := recode(a22, other ~ copy),
        p22 := recode(b22, other ~ copy),
        c1r = c1
    ) %>%
    let_if(cell == 2,
        p1_1 %to% p1_6 := recode(a1_1 %to% a1_6, other ~ copy),
        h1_1 %to% h1_6 := recode(b1_1 %to% b1_6, other ~ copy),
        p22 := recode(a22, other ~ copy),
        h22 := recode(b22, other ~ copy),
        c1r := recode(c1, 1 ~ 2, 2 ~ 1, other ~ copy)
    ) %>%
    let(
        # recode age by groups
        age_cat = recode(s2a, lo %thru% 25 ~ 1, lo %thru% hi ~ 2),
        # count number of likes
        # codes 2 and 99 are ignored.
        h_likes = count_row_if(1 | 3 %thru% 98, h1_1 %to% h1_6),
        p_likes = count_row_if(1 | 3 %thru% 98, p1_1 %to% p1_6)
    )

# here we prepare labels for future usage
codeframe_likes = num_lab("
    1 Liked everything
    2 Disliked everything
    3 Chocolate
    4 Appearance
    5 Taste
    6 Stuffing
    7 Nuts
    8 Consistency
    98 Other
    99 Hard to answer
")

overall_liking_scale = num_lab("
    1 Extremely poor
    2 Very poor
    3 Quite poor
    4 Neither good, nor poor
    5 Quite good
    6 Very good
    7 Excellent
")

w = apply_labels(w,
    c1r = "Preferences",
    c1r = num_lab("
        1 VSX123
        2 SDF456
        3 Hard to say
    "),

    age_cat = "Age",
    age_cat = c("18 - 25" = 1, "26 - 35" = 2),

    h1_1 = "Likes. VSX123",
    p1_1 = "Likes. SDF456",
    h1_1 = codeframe_likes,
    p1_1 = codeframe_likes,

    h_likes = "Number of likes. VSX123",
    p_likes = "Number of likes. SDF456",

    h22 = "Overall quality. VSX123",
    p22 = "Overall quality. SDF456",
    h22 = overall_liking_scale,
    p22 = overall_liking_scale
)
```

Are there any significant differences between preferences? Yes, the difference is significant.

```R
# 'tab_mis_val(3)' removes 'hard to say' from the vector
w %>%
    tab_cols(total(), age_cat) %>%
    tab_cells(c1r) %>%
    tab_mis_val(3) %>%
    tab_stat_cases() %>%
    tab_last_sig_cases() %>%
    tab_pivot()
```

Next we calculate the distribution of answers to the survey questions.

```R
# let's specify repeated parts of table creation chains
banner = w %>% tab_cols(total(), age_cat, c1r)

# column percent with significance
tab_cpct_sig = . %>%
    tab_stat_cpct() %>%
    tab_last_sig_cpct(sig_labels = paste0("", LETTERS, ""))

# means with significance
tab_means_sig = . %>%
    tab_stat_mean_sd_n(labels = c("Mean", "sd", "N")) %>%
    tab_last_sig_means(
        sig_labels = paste0("", LETTERS, ""),
        keep = "means")

# Preferences
banner %>%
    tab_cells(c1r) %>%
    tab_cpct_sig() %>%
    tab_pivot()

# Overall liking
banner %>%
    tab_cells(h22) %>%
    tab_means_sig() %>%
    tab_cpct_sig() %>%
    tab_cells(p22) %>%
    tab_means_sig() %>%
    tab_cpct_sig() %>%
    tab_pivot()

# Likes
banner %>%
    tab_cells(h_likes) %>%
    tab_means_sig() %>%
    tab_cells(mrset(h1_1 %to% h1_6)) %>%
    tab_cpct_sig() %>%
    tab_cells(p_likes) %>%
    tab_means_sig() %>%
    tab_cells(mrset(p1_1 %to% p1_6)) %>%
    tab_cpct_sig() %>%
    tab_pivot()

# below is a more complicated table where we compare likes side by side
# Likes - side by side comparison
w %>%
    tab_cols(total(label = "#Total| |"), c1r) %>%
    tab_cells(list(unvr(mrset(h1_1 %to% h1_6)))) %>%
    tab_stat_cpct(label = var_lab(h1_1)) %>%
    tab_cells(list(unvr(mrset(p1_1 %to% p1_6)))) %>%
    tab_stat_cpct(label = var_lab(p1_1)) %>%
    tab_pivot(stat_position = "inside_columns")
```

We can save the labelled dataset as a `*.csv` file with accompanying R code for labelling.

```R
write_labelled_csv(w, filename = "product_test.csv")
```

Or we can save the dataset as a `*.csv` file with SPSS syntax to read the data and apply labels.

```R
write_labelled_spss(w, filename = "product_test.csv")
```

Export to Microsoft Excel

To export expss tables to `*.xlsx` you need to install the excellent openxlsx package. To install it, just type `install.packages("openxlsx")` in the console.

Examples

First we apply labels to the `mtcars` dataset and build a simple table with a caption.

```R
library(expss)
library(openxlsx)
data(mtcars)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      cyl = "Number of cylinders",
                      disp = "Displacement (cu.in.)",
                      hp = "Gross horsepower",
                      drat = "Rear axle ratio",
                      wt = "Weight (lb/1000)",
                      qsec = "1/4 mile time",
                      vs = "Engine",
                      vs = c("V-engine" = 0,
                             "Straight engine" = 1),
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual"=1),
                      gear = "Number of forward gears",
                      carb = "Number of carburetors"
)

mtcars_table = mtcars %>%
    cross_cpct(
        cell_vars = list(cyl, gear),
        col_vars = list(total(), am, vs)
    ) %>%
    set_caption("Table 1")

mtcars_table
```

Then we create a workbook and add a worksheet to it.

```R
wb = createWorkbook()
sh = addWorksheet(wb, "Tables")
```

Export: we should specify the workbook and the worksheet.

```R
xl_write(mtcars_table, wb, sh)
```

And, finally, we save the workbook with the table to the xlsx file.

```R
saveWorkbook(wb, "table1.xlsx", overwrite = TRUE)
```

Screenshot of the exported table:

Automation of the report generation

First of all, we create the banner which we will use for all our tables.

```R
banner = with(mtcars, list(total(), am, vs))
```

Then we generate a list with all tables. If a variable has a small number of discrete values, we create a column percent table. In other cases we calculate a table with means. For both types of tables we mark significant differences between groups.

```R
list_of_tables = lapply(mtcars, function(variable) {
    if(length(unique(variable))<7){
        cro_cpct(variable, banner) %>% significance_cpct()
    } else {
        # if the number of unique values is seven or more we calculate means
        cro_mean_sd_n(variable, banner) %>% significance_means()
    }
})
```

Create workbook:

```R
wb = createWorkbook()
sh = addWorksheet(wb, "Tables")
```

Here we export our list of tables with additional formatting. We remove the '#' sign from totals and mark the total column with bold. You can read about formatting options in the manual for `xl_write` (`?xl_write` in the console).

```R
xl_write(list_of_tables, wb, sh,
         # remove '#' sign from totals
         col_symbols_to_remove = "#",
         row_symbols_to_remove = "#",
         # format total column as bold
         other_col_labels_formats = list("#" = createStyle(textDecoration = "bold")),
         other_cols_formats = list("#" = createStyle(textDecoration = "bold"))
)
```

Save workbook:

```R
saveWorkbook(wb, "report.xlsx", overwrite = TRUE)
```

Screenshot of the generated report:
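The selection rule used for the report (fewer than seven unique values gives percentages, otherwise means) can be tried without expss or openxlsx. The following is only a base R sketch of that dispatch logic, with no labels, totals or significance marks; `summaries` is an illustrative name.

```R
data(mtcars)

summaries = lapply(mtcars, function(variable) {
    if (length(unique(variable)) < 7) {
        # few distinct values: column percentages by transmission type
        prop.table(table(variable, am = mtcars$am), margin = 2) * 100
    } else {
        # many distinct values: group means by transmission type
        tapply(variable, mtcars$am, mean)
    }
})

# names of the variables that were summarised as percentage tables
names(summaries)[sapply(summaries, is.table)]
```

Variables such as `cyl` end up as percentage tables, while continuous ones such as `mpg` are reduced to one mean per group.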

Labels support for base R

A variable label is a human-readable description of the variable. R supports rather long variable names, and these names can even contain spaces and punctuation, but short variable names make coding easier. A variable label can give a nice, long description of the variable. With this description it is easier to remember what those variable names refer to. Value labels are similar to variable labels, but value labels are descriptions of the values a variable can take. Labelling values means we don't have to remember whether 1=Extremely poor and 7=Excellent or vice versa. We can easily get a dataset description and variables summary with the `info` function.

The usual way to connect numeric data to labels in R is factor variables. However, factors lack important features which value labels provide. Factors only allow integers to be mapped to a text label, these integers have to be a count starting at 1, and every value needs to be labelled. Also, we can't calculate means or other numeric statistics on factors.
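These limitations are easy to see in base R. The short sketch below uses made-up codes (0, 5, 10) and illustrative object names; it shows that the original codes are replaced by 1, 2, 3, that an unlabelled value becomes `NA`, and that `mean` does not work on a factor.

```R
scores = c(0, 5, 10)
f = factor(scores, levels = c(0, 5, 10), labels = c("Low", "Medium", "High"))

# internal codes are 1, 2, 3: the original 0, 5, 10 are not kept
codes = as.integer(f)

# a value without a declared level/label is turned into NA
g = factor(c(0, 5, 7), levels = c(0, 5), labels = c("Low", "Medium"))

# numeric statistics are not defined for factors: the result is NA with a warning
mean_f = suppressWarnings(mean(f))
```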

With labels we can manipulate short variable names and codes when we analyze our data but in the resulting tables and graphs we will see human-readable text.

It is easy to store labels as variable attributes in R, but most R functions cannot use them or even drop them. The expss package integrates value labels support into base R functions and into functions from other packages. Every function which internally converts a variable to a factor will utilize labels. Labels will be preserved during variable subsetting and concatenation. Additionally, there is a function (`use_labels`) which greatly simplifies variable labels usage. See examples below.
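The problem with plain attributes can be reproduced in base R without expss loaded. In this sketch the attribute names `"label"` and `"labels"` are chosen for illustration only; the point is that ordinary subsetting and concatenation of a plain vector discard them.

```R
am = c(0, 1, 0, 1)
attr(am, "label") = "Transmission"
attr(am, "labels") = c(Automatic = 0, Manual = 1)

# subsetting and concatenation of a plain vector drop custom attributes
subset_attrs = attributes(am[1:2])       # NULL
combined_attrs = attributes(c(am, am))   # NULL

# the same happens to a column when rows of a data.frame are selected
df = data.frame(am = c(0, 1, 0, 1), id = 1:4)
attr(df$am, "label") = "Transmission"
row_subset_label = attr(df[1:2, ]$am, "label")  # NULL
```

This is the behaviour that the special subsetting and concatenation methods for labelled variables are meant to avoid.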

Getting and setting variable and value labels

First, apply value and variable labels to the dataset:

```R
library(expss)
data(mtcars)
mtcars = apply_labels(mtcars,
                      mpg = "Miles/(US) gallon",
                      cyl = "Number of cylinders",
                      disp = "Displacement (cu.in.)",
                      hp = "Gross horsepower",
                      drat = "Rear axle ratio",
                      wt = "Weight (1000 lbs)",
                      qsec = "1/4 mile time",
                      vs = "Engine",
                      vs = c("V-engine" = 0,
                             "Straight engine" = 1),
                      am = "Transmission",
                      am = c("Automatic" = 0,
                             "Manual"=1),
                      gear = "Number of forward gears",
                      carb = "Number of carburetors"
)
```

In addition to `apply_labels` we have SPSS-style `var_lab` and `val_lab` functions:

```R
nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
var_lab(nps) = "Net promoter score"
val_lab(nps) = num_lab("
            -1 Detractors
             0 Neutralists
             1 Promoters
")
```

We can read, add or remove existing labels:

```R
var_lab(nps) # get variable label
val_lab(nps) # get value labels

# add new labels
add_val_lab(nps) = num_lab("
            98 Other
            99 Hard to say
")

# remove label by value
# %d% - diff, %n_d% - names diff
val_lab(nps) = val_lab(nps) %d% 98

# or, remove value by name
val_lab(nps) = val_lab(nps) %n_d% "Other"
```

Additionally, there are some utility functions. They can be applied to a single variable as well as to an entire dataset.

```R
drop_val_labs(nps)
drop_var_labs(nps)
unlab(nps)
drop_unused_labels(nps)
prepend_values(nps)
```

There is also the `prepend_names` function, but it can be applied only to a data.frame.

Labels with base R and ggplot2 functions

Base `table` and plotting with value labels:

```R
with(mtcars, table(am, vs))

with(mtcars,
     barplot(
         table(am, vs),
         beside = TRUE,
         legend = TRUE)
)
```

There is a special function for variable labels support: `use_labels`. Currently, variable labels are supported only for expressions evaluated inside a data.frame.

```R
# table with dimension names
use_labels(mtcars, table(am, vs))

# linear regression
use_labels(mtcars, lm(mpg ~ wt + hp + qsec)) %>% summary

# boxplot with variable labels
use_labels(mtcars, boxplot(mpg ~ am))
```

And, finally, ggplot2 graphics with variable and value labels. Note that with ggplot2 version 3.2.0 and higher you need to explicitly convert labelled variables to factors in the `facet_grid` formula:

```R
library(ggplot2, warn.conflicts = FALSE)

use_labels(mtcars, {
    # '..data' is a shortcut for the whole 'mtcars' data.frame inside the expression
    ggplot(..data) +
        geom_point(aes(y = mpg, x = wt, color = qsec)) +
        facet_grid(factor(am) ~ factor(vs))
})
```

Extreme value labels support

We have an option for extreme value labels support: `expss_enable_value_labels_support_extreme()`. With this option `factor`/`as.factor` will take into account empty levels. However, `unique` will give a weird result for labelled variables: labels without values will be added to the unique values. That's why it is recommended to turn this option off immediately after usage. See examples.

We have the label 'Hard to say' for which there are no values in `nps`:

```R
nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
var_lab(nps) = "Net promoter score"
val_lab(nps) = num_lab("
            -1 Detractors
             0 Neutralists
             1 Promoters
            99 Hard to say
")
```

Here we disable labels support and get results without labels:

```R
expss_disable_value_labels_support()
table(nps) # there are no labels in the result
unique(nps)
```

Results with default value labels support - three labels are here but "Hard to say" is absent.

```R
expss_enable_value_labels_support()
# table with labels but there is no label "Hard to say"
table(nps)
unique(nps)
```

And now extreme value labels support - we see "Hard to say" with zero counts. Note the weird `unique` result.

```R
expss_enable_value_labels_support_extreme()
# now we see "Hard to say" with zero counts
table(nps)
# weird 'unique'! There is a value 99 which is absent in 'nps'
unique(nps)
```

Return immediately to defaults to avoid issues:

```R
expss_enable_value_labels_support()
```
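For comparison, base R draws a similar distinction for factors: declared levels with no observations appear in `table` with a zero count unless they are dropped. This is only a base R analogy, not the expss mechanism, and `nps_f`, `counts_all` and `counts_used` are illustrative names.

```R
nps = c(-1, 0, 1, 1, 0, 1, 1, -1)
nps_f = factor(nps,
               levels = c(-1, 0, 1, 99),
               labels = c("Detractors", "Neutralists", "Promoters", "Hard to say"))

# all declared levels are tabulated, including the empty "Hard to say"
counts_all = table(nps_f)

# after dropping unused levels only observed categories remain
counts_used = table(droplevels(nps_f))
```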

Labels are preserved during common operations on the data

There are special methods for subsetting and concatenating labelled variables. These methods preserve labels during common operations. We don't need to restore labels on a subsetted or sorted data.frame.

`mtcars` with labels:

```R
str(mtcars)
```

Make a subset of the data.frame:

```R
mtcars_subset = mtcars[1:10, ]
```

Labels are here, nothing is lost:

```R
str(mtcars_subset)
```

Interaction with 'haven'

To use expss with haven you need to load expss strictly after haven (or another package that implements the 'labelled' class) to avoid conflicts. It is also better to call `read_spss` with an explicit package specification: `haven::read_spss`. See the example below. The haven package doesn't set the 'labelled' class for variables which have a variable label but no value labels. This leads to labels being lost during subsetting and other operations. We have a special function to fix this: `add_labelled_class`. Apply it to a dataset loaded by haven.

```R
# we need to load packages strictly in this order to avoid conflicts
library(haven)
library(expss)
spss_data = haven::read_spss("spss_file.sav")
# add missing 'labelled' class
spss_data = add_labelled_class(spss_data)
```

Owner

Name: Gregory Demin
Login: gdemin
Kind: user

Website: https://ru.linkedin.com/pub/gregory-demin/16/85a/698
Repositories: 73
Profile: https://github.com/gdemin

GitHub Events

Total

Issues event: 3
Watch event: 1

Last Year

Issues event: 3
Watch event: 1

Committers

Last synced: 9 months ago

All Time

Total Commits: 1,377
Total Committers: 7
Avg Commits per committer: 196.714
Development Distribution Score (DDS): 0.109

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Gregory Demin	g**n@g**m	1,227
Gregory Demin	g**b@g**m	119
Sebastian Jeworutzki	s**i@r**e	19
Tom Elliott	t**z@g**m	7
John Williams	j**s@g**g	3
Michael Chirico	m**4@g**m	1
Dan Chaltiel	d**l@g**m	1

Committer Domains (Top 20 + Academic)

gnome.org: 1 ruhr-uni-bochum.de: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 101
Total pull requests: 6
Average time to close issues: about 1 month
Average time to close pull requests: about 15 hours
Total issue authors: 51
Total pull request authors: 5
Average comments per issue: 3.67
Average comments per pull request: 0.33
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 0
Average time to close issues: about 17 hours
Average time to close pull requests: N/A
Issue authors: 3
Pull request authors: 0
Average comments per issue: 0.67
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

robertogilsaura (25)
zelihay (13)
vinhdizzo (7)
sjewo (3)
DanChaltiel (2)
Waschoi (2)
momo3246 (2)
przemo (2)
abrunk (2)
shirdekel (2)
rgdicker (1)
dkunichoff (1)
aidusgs (1)
arunkshrestha (1)
tgravelle (1)

Pull Request Authors

sjewo (2)
MichaelChirico (2)
tmelliott (1)
DanChaltiel (1)
johnfrombluff (1)

Packages

Total packages: 2
Total downloads:
- cran 6,478 last-month
Total docker downloads: 342

Total dependent packages: 6
(may contain duplicates)
Total dependent repositories: 8
(may contain duplicates)
Total versions: 26
Total maintainers: 1

cran.r-project.org: expss

Tables, Labels and Some Useful Functions from Spreadsheets and 'SPSS' Statistics

Homepage: https://gdemin.github.io/expss/
Documentation: http://cran.r-project.org/web/packages/expss/expss.pdf
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
Latest release: 0.11.6
published over 2 years ago

Versions: 23
Dependent Packages: 6
Dependent Repositories: 7
Downloads: 6,478 Last month
Docker Downloads: 342

Rankings

Stargazers count: 4.6%

Downloads: 4.9%

Forks count: 5.5%

Dependent packages count: 7.3%

Average: 8.4%

Dependent repos count: 11.1%

Docker downloads count: 16.9%

Maintainers (1)

gdemin@gmail.com

Last synced: 6 months ago

conda-forge.org: r-expss

Homepage: https://gdemin.github.io/expss/
License: GPL-2.0-or-later
Latest release: 0.11.4
published over 3 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 1

Rankings

Dependent repos count: 24.3%

Stargazers count: 36.1%

Average: 38.5%

Forks count: 41.8%

Dependent packages count: 51.6%

Last synced: 6 months ago

expss

Science Score: 23.0%
