densify

densify: An R package to reduce empty cells in data frames of typological linguistic data - Published in JOSS (2024)

https://github.com/annagraff/densify

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: annagraff
  • License: agpl-3.0
  • Language: R
  • Default Branch: main
  • Size: 39.4 MB
Statistics
  • Stars: 2
  • Watchers: 7
  • Forks: 1
  • Open Issues: 0
  • Releases: 2
Created over 2 years ago · Last pushed 7 months ago
Metadata Files
Readme, Contributing, License, Code of conduct

README.md

densify

densify is an R package for densifying (sparse) matrices.

Installation (for users)

To install densify, run the following code in R:

~~~~
install.packages("devtools")
library("devtools")
install_github('annagraff/densify', build_vignettes = TRUE)
library(densify)
~~~~

Performing matrix densification with densify

Preparing input

The data frame to be subsetted must have rows representing taxa or observations (with taxon names provided in a dedicated column) and columns representing variables (with variable names as column names). Any cells that are empty, not applicable, or coded as question marks must be recoded as NA. If taxonomic structure is relevant to the pruning process, a taxonomy must be provided as a phylo object or as an adjacency table (i.e. a data frame containing columns id and parent_id, with each row encoding one parent-child relationship). The glottolog_languoids data frame provided by the package can be used directly for this purpose.

~~~~
# prepare example data: WALS and Glottolog
data(WALS)
data(glottolog_languoids)

# any question marks, empty entries, "NA"s must be coded as NA
WALS[WALS=="?"] <- NA
WALS[WALS=="NA"] <- NA
head(WALS)

# all taxa must be present in the taxonomy used for pruning
WALS <- WALS[which(WALS$Glottocode %in% glottolog_languoids$id), ]
~~~~
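When Glottolog is not the relevant taxonomy, the adjacency-table format described above can be assembled by hand. The following is a minimal sketch in base R that only illustrates the expected shape of the inputs; it does not call any densify function. All taxon and variable names (`lang_a`, `family_1`, `var_1`, the `taxon` column) are hypothetical, and encoding root nodes with an NA parent is an assumption made here for illustration.

```r
# toy data: one row per taxon, one column per variable (all names hypothetical)
toy_data <- data.frame(
  taxon = c("lang_a", "lang_b", "lang_c", "lang_x"),
  var_1 = c("yes", "?", "no", "yes"),
  var_2 = c("NA", "SOV", "", "SVO"),
  stringsAsFactors = FALSE
)

# recode question marks, "NA" strings and empty strings as NA
toy_data[toy_data == "?" | toy_data == "NA" | toy_data == ""] <- NA

# toy taxonomy as an adjacency table: each row is one parent-child relationship
# (assumption: top-level nodes have NA as their parent)
toy_taxonomy <- data.frame(
  id        = c("lang_a", "lang_b", "lang_c", "family_1", "family_2"),
  parent_id = c("family_1", "family_1", "family_2", NA, NA),
  stringsAsFactors = FALSE
)

# keep only taxa that are present in the taxonomy (drops lang_x)
toy_data <- toy_data[toy_data$taxon %in% toy_taxonomy$id, ]
```

The same three steps (recode missing values, build or load a taxonomy, drop taxa absent from it) mirror the WALS and Glottolog preparation shown above.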

Densifying, visualizing, ranking and pruning the input

The densify() function iteratively prunes the input matrix. Its parameters control, among other things: how strongly the taxonomic diversity of the sample weighs in the pruning process; which method is used to calculate importance weights (e.g. arithmetic or logit-transformed means of row-wise coding density); and the minimum variation required for a feature to be retained (e.g. constant variables might be of no interest). For a detailed discussion of the parameters, refer to the function documentation or to the vignette hosted in the software repository.

~~~~
set.seed(2024)
example_result <- densify(data = WALS,
                          cols = !Glottocode,
                          taxonomy = glottolog_languoids,
                          taxon_id = "Glottocode",
                          density_mean = "log_odds",
                          min_variability = 3,
                          limits = list(min_coding_density = 1),
                          density_mean_weights = list(coding = 1, taxonomy = 1))
~~~~
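To make the difference between the averaging options more concrete, the sketch below computes row-wise coding density for a small matrix and then contrasts an arithmetic mean with a mean taken on the log-odds (logit) scale. It illustrates the general idea only: the helper `log_odds_mean()` and the clamping constant `eps` are assumptions introduced here, not densify's internal implementation.

```r
# row-wise coding density of a small matrix with missing values
m <- matrix(c( 1, NA,  3,
              NA, NA,  6,
               7,  8,  9), nrow = 3, byrow = TRUE)
row_density <- rowMeans(!is.na(m))   # proportion of coded cells per row

# hypothetical helper: average proportions on the log-odds scale
log_odds_mean <- function(p, eps = 1e-6) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clamp to avoid infinite logits at 0 and 1
  plogis(mean(qlogis(p)))            # logit-transform, average, back-transform
}

# two proportions to be combined into one weight (values invented)
p <- c(0.99, 0.5)
arithmetic <- mean(p)
log_odds   <- log_odds_mean(p)
```

Because the logit transformation stretches values close to 0 or 1, a proportion near either extreme pulls the log-odds mean more strongly than it pulls the arithmetic mean.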

The output of the function is a densify_result object, documenting several summary statistics of all resulting sub-matrices. These summary statistics can be used to define a scoring function, which is used by rank_results(), visualize() and prune() to identify the optimum: rank_results() returns the relative ranks of all generated sub-matrices given the scoring function, visualize() visualizes their relative ranking, and prune() extracts the optimal sub-matrix (ranked first).

~~~~
head(example_result)

# use rank_results() to obtain a vector indicating the rank of each sub-matrix
# with the default scoring function
example_ranks_1 <- rank_results(example_result, scoring_function = n_data_points*coding_density)

# with a scoring function that gives high weight to taxonomic diversity:
example_ranks_2 <- rank_results(example_result, scoring_function = n_data_points*coding_density*taxonomic_index^3)

# use visualize() to illustrate the quality scores and optimum given each scoring function
visualize(example_result, scoring_function = n_data_points*coding_density)
visualize(example_result, scoring_function = n_data_points*coding_density*taxonomic_index^3)

# use prune() to obtain the optimum sub-matrix given each scoring function
example_optimum_1 <- prune(example_result, scoring_function = n_data_points*coding_density)
example_optimum_2 <- prune(example_result, scoring_function = n_data_points*coding_density*taxonomic_index^3)
~~~~
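The effect of the scoring function on the chosen optimum can be sketched without the package. The table below is a toy stand-in for the summary statistics with invented values; it is not real densify() output, and an actual densify_result may contain other columns. Only the column names n_data_points, coding_density and taxonomic_index are taken from the examples above.

```r
# toy summary statistics for four candidate sub-matrices (values invented)
toy_stats <- data.frame(
  n_data_points   = c(1000, 800, 500, 200),
  coding_density  = c(0.40, 0.55, 0.90, 0.95),
  taxonomic_index = c(0.90, 0.85, 0.70, 0.40)
)

# score each candidate under two scoring functions
score_1 <- with(toy_stats, n_data_points * coding_density)
score_2 <- with(toy_stats, n_data_points * coding_density * taxonomic_index^3)

# rank so that the highest score receives rank 1
rank_1 <- rank(-score_1, ties.method = "min")
rank_2 <- rank(-score_2, ties.method = "min")

# row index of the optimum under each scoring function
best_1 <- which.max(score_1)
best_2 <- which.max(score_2)
```

In this toy table the two scoring functions select different candidates: cubing taxonomic_index penalizes the denser but taxonomically less diverse sub-matrices.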

For more details on each function, refer to the help pages (see below), the published paper, or the vignette.

~~~~
?densify
?rank_results
?visualize
?prune
~~~~

Contributing

To report bugs, seek support or suggest improvements (e.g. additional functionalities or changes to functionality or API arguments), please open an issue on GitHub. Similarly, to make direct package contributions, please open an issue on GitHub for discussion before submitting a merge request.

Code of Conduct

Please note that densify is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Owner

  • Name: Anna Graff
  • Login: annagraff
  • Kind: user

JOSS Publication

densify: An R package to reduce empty cells in data frames of typological linguistic data
Published
September 06, 2024
Volume 9, Issue 101, Page 7024
Authors
Anna Graff
University of Zurich, Department of Comparative Language Science; University of Zurich, Department of Evolutionary Biology and Environmental Studies; University of Zurich, Center for the Interdisciplinary Study of Language Evolution
Marc Lischka
University of Zurich, Department of Mathematical Modeling and Machine Learning
Taras Zakharko
University of Zurich, Department of Comparative Language Science; University of Zurich, Center for the Interdisciplinary Study of Language Evolution
Reinhard Furrer
University of Zurich, Center for the Interdisciplinary Study of Language Evolution; University of Zurich, Department of Mathematical Modeling and Machine Learning
Balthasar Bickel
University of Zurich, Department of Comparative Language Science; University of Zurich, Center for the Interdisciplinary Study of Language Evolution
Editor
Øystein Sørensen
Tags
sparse matrices, sub-sampling, linguistic data, diversity samples

GitHub Events

Total
  • Issues event: 8
  • Delete event: 3
  • Issue comment event: 36
  • Push event: 8
  • Pull request event: 8
  • Pull request review event: 1
  • Fork event: 1
  • Create event: 4
Last Year
  • Issues event: 8
  • Delete event: 3
  • Issue comment event: 36
  • Push event: 8
  • Pull request event: 8
  • Pull request review event: 1
  • Fork event: 1
  • Create event: 4

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 270
  • Total Committers: 16
  • Avg Commits per committer: 16.875
  • Development Distribution Score (DDS): 0.5
Past Year
  • Commits: 12
  • Committers: 5
  • Avg Commits per committer: 2.4
  • Development Distribution Score (DDS): 0.583
Top Committers
Name Email Commits
Anna Graff a****h@g****m 135
Marc Lischka m****c@w****h 43
Marc Lischka m****c@o****h 31
marclischka 1****a 21
Taras Zakharko t****o@u****h 9
Work w****k@m****e 9
reinhardfurrer r****r@m****h 5
Taras Zakharko t****o@g****m 4
Balthasar Bickel b****l@u****h 2
Hedvig Skirgård h****d@g****m 2
Simon J Greenhill S****l 2
Taras Zakharko 2
Taras Zakharko t****o@B****n 2
Marc Lischka m****c@n****h 1
Marc Lischka m****c@r****h 1
Marc Lischka m****a@m****h 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 12
  • Total pull requests: 13
  • Average time to close issues: 25 days
  • Average time to close pull requests: 10 days
  • Total issue authors: 5
  • Total pull request authors: 3
  • Average comments per issue: 1.92
  • Average comments per pull request: 1.69
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 8
  • Average time to close issues: 27 days
  • Average time to close pull requests: 9 days
  • Issue authors: 2
  • Pull request authors: 3
  • Average comments per issue: 1.0
  • Average comments per pull request: 1.13
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • tzakharko (3)
  • yjunechoe (3)
  • HedvigS (3)
  • G-You (2)
  • elenlefoll (1)
  • stefanocoretta (1)
Pull Request Authors
  • tzakharko (8)
  • HedvigS (4)
  • SimonGreenhill (4)