cat2cat

Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset

https://github.com/polkas/cat2cat

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.4%) to scientific vocabulary

Keywords

categories cran encoding encodings factor longitudinal mapping mappings panel r r-package transitions

Last synced: 4 months ago

Repository

Basic Info

Host: GitHub
Owner: Polkas
License: gpl-2.0
Language: R
Default Branch: main
Homepage: https://polkas.github.io/cat2cat
Size: 15 MB

Statistics

Stars: 5
Watchers: 2
Forks: 1
Open Issues: 4
Releases: 3

Topics

categories cran encoding encodings factor longitudinal mapping mappings panel r r-package transitions

Created almost 6 years ago · Last pushed almost 2 years ago

Metadata Files

Readme Changelog Contributing License Codemeta

cat2cat

Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset

cat2cat unifies an inconsistently coded categorical variable in a panel (longitudinal) dataset.
The novel cat2cat procedure maps a categorical variable according to a mapping (transition) table between two different time points. The mapping (transition) table should have a candidate for each category from the period targeted for an update. The main rule is to replicate an observation if it could be assigned to several categories, and then to use simple frequencies or modern statistical methods to approximate the probability of it being assigned to each of them.

This algorithm was invented and implemented in the paper by Nasinski, Majchrowska and Broniatowska (2020), doi:10.24425/cejeme.2020.134747.

For more details, please read the paper by Nasinski and Gajowniczek (2023).

Please visit the cat2cat webpage (https://polkas.github.io/cat2cat) for more information.
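The replication rule described above can be illustrated with a minimal base R sketch. It does not use the package itself, and every name in it (`trans_toy`, `old_period`, `new_period`, `weight`) is hypothetical and chosen only for illustration. It assumes the simplest variant, where weights come from the frequencies of the candidate categories in the period that already uses the new encoding.

```r
# Toy mapping (transition) table: old code "A" was split into "A1" and "A2".
# All object and column names here are illustrative, not the package's API.
trans_toy <- data.frame(
  old = c("A", "A", "B"),
  new = c("A1", "A2", "B1"),
  stringsAsFactors = FALSE
)

# The older period uses the old encoding, the newer period the new one.
old_period <- data.frame(
  id = 1:3,
  code = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
new_period <- data.frame(
  id = 4:8,
  code = c("A1", "A1", "A1", "A2", "B1"),
  stringsAsFactors = FALSE
)

# Step 1: replicate each old observation once per candidate new category.
replicated <- merge(old_period, trans_toy, by.x = "code", by.y = "old")

# Step 2: weight each replication by the share of its candidate category
# among all candidates of the same subject, using new-period frequencies.
freq_new <- table(new_period$code)
replicated$freq <- as.numeric(freq_new[replicated$new])
replicated$weight <- replicated$freq /
  ave(replicated$freq, replicated$id, FUN = sum)

replicated[order(replicated$id), c("id", "code", "new", "weight")]
```

In this sketch the weights of each subject sum to one, so the total weight equals the number of original observations in the older period.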

Python Version

Installation

```r
# development version from GitHub
install.packages("remotes")
remotes::install_github("polkas/cat2cat")

# or the released version from CRAN
install.packages("cat2cat")
```

Example

The occup dataset is an example of an unbalanced panel dataset. The data are simulated, although they reproduce real-world characteristics of a national statistical office survey. The original survey is anonymous and takes place every two years.

The trans dataset contains mappings (transitions) between the old (2008) and new (2010) occupational codes. This table can be used to map encodings in both directions.
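To make the "both directions" point concrete, here is a small base R sketch on a made-up mapping table. The table `trans_toy`, its codes, and its column names `old` and `new` are assumptions for illustration only; inspect the packaged trans dataset for its actual layout.

```r
# Hypothetical toy mapping table; codes and column names are made up.
trans_toy <- data.frame(
  old = c("1111", "1111", "1112", "1113"),
  new = c("111101", "111102", "111201", "111201"),
  stringsAsFactors = FALSE
)

# Read old -> new: candidates in the new encoding for each old code.
old_to_new <- split(trans_toy$new, trans_toy$old)

# Read new -> old: candidates in the old encoding for each new code.
new_to_old <- split(trans_toy$old, trans_toy$new)

# Codes with more than one candidate are the ones that need replication.
lengths(old_to_new)
lengths(new_to_old)
```

The same table therefore serves either direction; only the column treated as the key changes.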

A panel dataset without unique identifiers and with only two periods, mapped in the backward direction using simple frequencies:

```r
library("cat2cat")
data("occup", package = "cat2cat")
data("trans", package = "cat2cat")

# split the panel into the two periods to be unified
occup_old <- occup[occup$year == 2008, ]
occup_new <- occup[occup$year == 2010, ]

occup_simple <- cat2cat(
  data = list(
    old = occup_old,
    new = occup_new,
    cat_var_old = "code",
    cat_var_new = "code",
    time_var = "year"
  ),
  mappings = list(trans = trans, direction = "backward")
)
```

A panel dataset without unique identifiers and with four periods, mapped in the backward direction using ML models:

```r
library("cat2cat")
data("occup", package = "cat2cat")
data("trans", package = "cat2cat")

occup_2006 <- occup[occup$year == 2006, ]
occup_2008 <- occup[occup$year == 2008, ]
occup_2010 <- occup[occup$year == 2010, ]
occup_2012 <- occup[occup$year == 2012, ]

library("caret")

ml_setup <- list(
  data = occup_2010,
  cat_var = "code",
  method = c("knn"),
  features = c("age", "sex", "edu", "exp", "parttime", "salary"),
  args = list(k = 10, ntree = 50)
)

mappings <- list(trans = trans, direction = "backward")

# ml model performance check
print(cat2cat_ml_run(mappings, ml_setup))

# from 2010 to 2008
occup_back_2008_2010 <- cat2cat(
  data = list(
    old = occup_2008,
    new = occup_2010,
    cat_var_old = "code",
    cat_var_new = "code",
    time_var = "year"
  ),
  mappings = mappings,
  ml = ml_setup
)

# from 2008 to 2006
occup_back_2006_2008 <- cat2cat(
  data = list(
    old = occup_2006,
    new = occup_back_2008_2010$old,
    cat_var_new = "g_new_c2c",
    cat_var_old = "code",
    time_var = "year"
  ),
  mappings = mappings,
  ml = ml_setup
)

o_2006_new <- occup_back_2006_2008$old
o_2008_new <- occup_back_2008_2010$old # or occup_back_2006_2008$new
o_2010_new <- occup_back_2008_2010$new
o_2012_new <- dummy_c2c(occup_2012, cat_var = "code", ml = c("knn"))

final_data_back <- do.call(
  rbind,
  list(o_2006_new, o_2008_new, o_2010_new, o_2012_new)
)

# possible processing, leaving only one obs per subject and period
# still it is recommended to leave all replications and use the weights in the statistical models
library(magrittr)
ff <- final_data_back %>%
  split(.$year) %>%
  lapply(function(x) cross_c2c(x)) %>%
  lapply(function(x) prune_c2c(x, column = "wei_cross_c2c", method = "highest1")) %>%
  do.call(rbind, .)
all.equal(nrow(ff), sum(ff$wei_cross_c2c))
all.equal(nrow(ff), sum(final_data_back$wei_freq_c2c))
```
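The recommendation to keep all replications and pass the weights to the statistical model can be sketched in base R on a toy data frame. The data and the column names (`group`, `salary`, `weight`) are hypothetical; with real cat2cat output the corresponding weight column would be used instead.

```r
# Toy replicated data: subject 1 could belong to "A1" or "A2" and is
# therefore repeated; subjects 2 and 3 are unambiguous (weight 1).
# All values and column names are made up for illustration.
replicated <- data.frame(
  id     = c(1, 1, 2, 3),
  group  = c("A1", "A2", "A1", "A2"),
  salary = c(100, 100, 120, 80),
  weight = c(0.75, 0.25, 1, 1)
)

# Weighted category sizes: they sum to the number of distinct subjects.
sizes <- tapply(replicated$weight, replicated$group, sum)

# Weighted regression without an intercept: each coefficient is the
# weighted mean salary of its group.
fit <- lm(salary ~ 0 + group, data = replicated, weights = weight)
coef(fit)
```

Because every replication contributes only its weight, ambiguous subjects are not over-counted, which is the reason pruning to a single row per subject is optional rather than required.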

More complex examples are presented in the "Get Started" vignette.

Owner

Name: Maciej Nasinski
Login: Polkas
Kind: user
Location: Warsaw Poland
Company: @insightsengineering

Repositories: 5
Profile: https://github.com/Polkas

Maciej Nasinski - Data Scientist

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "identifier": "cat2cat",
  "description": " Unifying of an inconsistently coded categorical variable between two different time points in accordance with a mapping table. The main rule is to replicate the observation if it could be assign to a few categories. Then using simple frequencies or statistical methods to approximate probabilities of being assign to each of them. This novel procedure was invented and implemented in the paper by (Nasinski, Majchrowska and Broniatowska (2020) <doi:10.24425/cejeme.2020.134747>).",
  "name": "cat2cat: Handling an Inconsistently Coded Categorical Variable in a Panel Dataset",
  "relatedLink": [
    "https://polkas.github.io/cat2cat/",
    "https://CRAN.R-project.org/package=cat2cat"
  ],
  "codeRepository": "https://github.com/Polkas/cat2cat",
  "issueTracker": "https://github.com/Polkas/cat2cat/issues",
  "license": "https://spdx.org/licenses/GPL-2.0",
  "version": "0.4.5.9000",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "url": "https://r-project.org"
  },
  "runtimePlatform": "R version 4.2.2 (2022-10-31)",
  "provider": {
    "@id": "https://cran.r-project.org",
    "@type": "Organization",
    "name": "Comprehensive R Archive Network (CRAN)",
    "url": "https://cran.r-project.org"
  },
  "author": [
    {
      "@type": "Person",
      "givenName": "Maciej",
      "familyName": "Nasinski",
      "email": "nasinski.maciej@gmail.com"
    }
  ],
  "maintainer": [
    {
      "@type": "Person",
      "givenName": "Maciej",
      "familyName": "Nasinski",
      "email": "nasinski.maciej@gmail.com"
    }
  ],
  "softwareSuggestions": [
    {
      "@type": "SoftwareApplication",
      "identifier": "caret",
      "name": "caret",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=caret"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "randomForest",
      "name": "randomForest",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=randomForest"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "knitr",
      "name": "knitr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=knitr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "rmarkdown",
      "name": "rmarkdown",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=rmarkdown"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "pacman",
      "name": "pacman",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=pacman"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "testthat",
      "name": "testthat",
      "version": ">= 3.0.0",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=testthat"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "magrittr",
      "name": "magrittr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=magrittr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "dplyr",
      "name": "dplyr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=dplyr"
    }
  ],
  "softwareRequirements": {
    "1": {
      "@type": "SoftwareApplication",
      "identifier": "R",
      "name": "R",
      "version": ">= 3.6"
    },
    "2": {
      "@type": "SoftwareApplication",
      "identifier": "MASS",
      "name": "MASS",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=MASS"
    },
    "SystemRequirements": null
  },
  "fileSize": "4492.231KB",
  "releaseNotes": "https://github.com/Polkas/cat2cat/blob/master/NEWS.md",
  "readme": "https://github.com/Polkas/cat2cat/blob/master/README.md",
  "contIntegration": [
    "https://github.com/polkas/cat2cat/actions",
    "https://codecov.io/gh/Polkas/cat2cat"
  ],
  "keywords": [
    "factor",
    "categories",
    "panel",
    "encoding",
    "encodings",
    "transitions",
    "mappings",
    "mapping",
    "r",
    "cran",
    "r-package"
  ]
}

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Committers

Last synced: about 2 years ago

All Time

Total Commits: 113
Total Committers: 1
Avg Commits per committer: 113.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 20
Committers: 1
Avg Commits per committer: 20.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Maciej Nasinski	n**j@g**m	113

Issues and Pull Requests

Last synced: 5 months ago

All Time

Total issues: 26
Total pull requests: 6
Average time to close issues: 7 months
Average time to close pull requests: 16 days
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.12
Average comments per pull request: 0.67
Merged pull requests: 5
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Polkas (26)

Pull Request Authors

Polkas (7)

Top Labels

Issue Labels

feature (6) documentation (2) research (2) enhancement (1) bug (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 298 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 12
Total maintainers: 1

cran.r-project.org: cat2cat

Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset

Homepage: https://github.com/Polkas/cat2cat
Documentation: http://cran.r-project.org/web/packages/cat2cat/cat2cat.pdf
License: GPL-2 | GPL-3 | file LICENSE [expanded from: GPL (≥ 2) | file LICENSE]
Latest release: 0.4.7
published almost 2 years ago

Versions: 12
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 298 Last month

Rankings

Stargazers count: 24.2%

Forks count: 28.8%

Average: 29.7%

Dependent packages count: 29.8%

Downloads: 30.1%

Dependent repos count: 35.5%

Maintainers (1)

nasinski.maciej@gmail.com

Last synced: 5 months ago

Dependencies

DESCRIPTION cran

R >= 3.6 depends
MASS * imports
caret * suggests
dplyr * suggests
knitr * suggests
magrittr * suggests
pacman * suggests
randomForest * suggests
rmarkdown * suggests
testthat * suggests

.github/workflows/R-CMD-check.yaml actions

actions/checkout v2 composite
actions/upload-artifact main composite
r-lib/actions/setup-pandoc v1 composite
r-lib/actions/setup-r v1 composite

.github/workflows/pkgdown.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite

.github/workflows/test-coverage.yaml actions

actions/cache v1 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc v2 composite
r-lib/actions/setup-r v2 composite