densify

densify: An R package to reduce empty cells in data frames of typological linguistic data - Published in JOSS (2024)

https://github.com/annagraff/densify

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: annagraff
License: agpl-3.0
Language: R
Default Branch: main
Size: 39.4 MB

Statistics

Stars: 2
Watchers: 7
Forks: 1
Open Issues: 0
Releases: 2

Created over 2 years ago · Last pushed 7 months ago

Metadata Files

Readme Contributing License Code of conduct

densify

densify is an R package for densifying (sparse) matrices.

Installation (for users)

To install densify, run the following code in R:

~~~~
install.packages("devtools")
library("devtools")
install_github("annagraff/densify", build_vignettes = TRUE)
library(densify)
~~~~

Performing matrix densification with densify

Preparing input

The data frame that requires subsetting must have rows representing taxa or observations (with taxon names provided in a dedicated column) and columns representing variables (with variable names as column names). Any cells containing empty entries, not-applicable values or question marks must be coded as NA. If taxonomic structure is relevant to the pruning process, a taxonomy must be provided as a phylo object or as an adjacency table (i.e. a data frame containing columns id and parent_id, with each row encoding one parent-child relationship). The glottolog_languoids data frame provided by the package can be used directly for this purpose.

~~~~
# prepare example data: WALS and Glottolog
data(WALS)
data(glottolog_languoids)

# any question marks, empty entries, "NA"s must be coded as NA
WALS[WALS == "?"] <- NA
WALS[WALS == "NA"] <- NA
head(WALS)

# all taxa must be present in the taxonomy used for pruning
WALS <- WALS[which(WALS$Glottocode %in% glottolog_languoids$id), ]
~~~~
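As a minimal sketch of these input requirements, the following base R snippet builds a toy data set and a toy adjacency-table taxonomy and applies the same preparation steps. All names and values here (toy_data, toy_taxonomy, the taxon codes, the variables) are invented for illustration and are not part of the package; only the column names id and parent_id follow the description above.

```r
# Toy input: one row per taxon, one column per variable (hypothetical values)
toy_data <- data.frame(
  Glottocode = c("lang1", "lang2", "lang3", "lang4"),
  var_A = c("1", "?", "2", "NA"),
  var_B = c("", "3", "3", "1"),
  stringsAsFactors = FALSE
)

# Toy taxonomy as an adjacency table: each row encodes one parent-child link
toy_taxonomy <- data.frame(
  id        = c("fam1", "lang1", "lang2", "lang3"),
  parent_id = c(NA, "fam1", "fam1", "fam1"),
  stringsAsFactors = FALSE
)

# Recode question marks, empty strings and literal "NA" strings as NA
toy_data[toy_data == "?"] <- NA
toy_data[toy_data == ""] <- NA
toy_data[toy_data == "NA"] <- NA

# Keep only taxa that are present in the taxonomy (drops lang4 here)
toy_data <- toy_data[toy_data$Glottocode %in% toy_taxonomy$id, ]

# Proportion of non-missing cells among the variable columns
toy_density <- mean(!is.na(toy_data[, c("var_A", "var_B")]))
```

The empty-string recoding is an extra step added for this toy example; whether it is needed depends on how the input data encode missing values.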

Densifying, visualizing, ranking and pruning the input

The densify() function iteratively prunes the input matrix. It can be modulated by several parameters to, among others: specify to what extent taxonomic diversity of the sample should weigh in the pruning process; choose among various methods to calculate importance weights (e.g. via arithmetic or logit-transformed means of row-wise coding density); and choose the minimum variation required for a feature to be retained (e.g. constant variables might be of no interest). For a detailed discussion of the parameters, refer to the function documentation or to the vignette hosted in the software repository.

~~~~
set.seed(2024)
example_result <- densify(data = WALS,
                          cols = !Glottocode,
                          taxonomy = glottolog_languoids,
                          taxon_id = "Glottocode",
                          density_mean = "log_odds",
                          min_variability = 3,
                          limits = list(min_coding_density = 1),
                          density_mean_weights = list(coding = 1, taxonomy = 1))
~~~~
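To make the notion of coding density concrete, here is a minimal base R sketch on a made-up matrix. It computes row-wise coding density and contrasts an arithmetic mean with a logit-transformed mean of those densities. It illustrates the general idea only and is not the package's internal computation; toy_matrix and all derived names are hypothetical.

```r
# Toy matrix: 4 taxa x 4 variables, NA marks an empty cell (made-up values)
toy_matrix <- matrix(
  c("a", "b", NA,  "c",
    "a", NA,  NA,  "c",
    NA,  NA,  NA,  "d",
    "b", "b", "x", NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("taxon", 1:4), paste0("var", 1:4))
)

# Row-wise coding density: proportion of non-NA cells per taxon
row_density <- rowMeans(!is.na(toy_matrix))

# Overall coding density: proportion of non-NA cells in the whole matrix
overall_density <- mean(!is.na(toy_matrix))

# Arithmetic mean of the row-wise densities
arithmetic_mean <- mean(row_density)

# Logit-transformed mean: average on the log-odds scale, then back-transform
logit_mean <- plogis(mean(qlogis(row_density)))
```

Note that qlogis() returns infinite values for densities of exactly 0 or 1, so a sketch like this would need extra handling for completely empty or completely filled rows.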

The output of the function is a densify_result object, documenting several summary statistics of all resulting sub-matrices. These summary statistics can be used to define a scoring function, which is used by rank_results(), visualize() and prune() to identify the optimum: rank_results() returns the relative ranks of all generated sub-matrices given the scoring function, visualize() visualizes their relative ranking, and prune() extracts the optimal sub-matrix (ranked first).

~~~~
head(example_result)

# use rank_results() to obtain a vector indicating the rank of each sub-matrix
# with the default scoring function
example_ranks_1 <- rank_results(example_result, scoring_function = n_data_points * coding_density)

# with a scoring function that gives high weight to taxonomic diversity:
example_ranks_2 <- rank_results(example_result, scoring_function = n_data_points * coding_density * taxonomic_index^3)

# use visualize() to illustrate the quality scores and optimum given each scoring function
visualize(example_result, scoring_function = n_data_points * coding_density)
visualize(example_result, scoring_function = n_data_points * coding_density * taxonomic_index^3)

# use prune() to obtain the optimum sub-matrix given each scoring function
example_optimum_1 <- prune(example_result, scoring_function = n_data_points * coding_density)
example_optimum_2 <- prune(example_result, scoring_function = n_data_points * coding_density * taxonomic_index^3)
~~~~
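The effect of the scoring function on the chosen optimum can be sketched in base R with a made-up table of summary statistics. The data frame toy_stats below only imitates the kind of per-iteration statistics described above; its values are invented, and the column names are taken from the scoring functions shown in the example. It does not reproduce how rank_results() or prune() work internally.

```r
# Hypothetical summary statistics for four sub-matrices (invented numbers)
toy_stats <- data.frame(
  iteration       = 1:4,
  n_data_points   = c(1000, 900, 700, 400),
  coding_density  = c(0.30, 0.45, 0.70, 0.95),
  taxonomic_index = c(0.90, 0.85, 0.60, 0.30)
)

# Score 1: rewards many data points and high coding density
score_1 <- with(toy_stats, n_data_points * coding_density)

# Score 2: additionally gives high weight to taxonomic diversity
score_2 <- with(toy_stats, n_data_points * coding_density * taxonomic_index^3)

# Rank sub-matrices so that the highest score receives rank 1
ranks_1 <- rank(-score_1, ties.method = "min")
ranks_2 <- rank(-score_2, ties.method = "min")

# Iteration that is ranked first under each scoring function
best_1 <- toy_stats$iteration[which.max(score_1)]
best_2 <- toy_stats$iteration[which.max(score_2)]
```

With these toy numbers, the two scoring functions select different iterations, which is why it can be worth comparing several scoring functions before pruning.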

For more details on each function, refer to the help pages (see below) and/or consult the published paper and the vignette.

~~~~
?densify
?rank_results
?visualize
?prune
~~~~

Contributing

To report bugs, seek support or suggest improvements (e.g. additional functionality or changes to existing functionality or API arguments), please open an issue on GitHub. Similarly, to contribute directly to the package, please open an issue on GitHub for discussion before submitting a pull request.

Code of Conduct

Please note that densify is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Owner

Name: Anna Graff
Login: annagraff
Kind: user

Repositories: 1
Profile: https://github.com/annagraff

JOSS Publication

densify: An R package to reduce empty cells in data frames of typological linguistic data

Published

September 06, 2024

DOI

10.21105/joss.07024

Volume 9, Issue 101, Page 7024

Authors

Anna Graff

University of Zurich, Department of Comparative Language Science; University of Zurich, Department of Evolutionary Biology and Environmental Studies; University of Zurich, Center for the Interdisciplinary Study of Language Evolution

Marc Lischka

University of Zurich, Department of Mathematical Modeling and Machine Learning

Taras Zakharko

University of Zurich, Department of Comparative Language Science; University of Zurich, Center for the Interdisciplinary Study of Language Evolution

Reinhard Furrer

University of Zurich, Center for the Interdisciplinary Study of Language Evolution; University of Zurich, Department of Mathematical Modeling and Machine Learning

Balthasar Bickel

University of Zurich, Department of Comparative Language Science; University of Zurich, Center for the Interdisciplinary Study of Language Evolution

Editor

Øystein Sørensen

GitHub Events

Total

Issues event: 8
Delete event: 3
Issue comment event: 36
Push event: 8
Pull request event: 8
Pull request review event: 1
Fork event: 1
Create event: 4

Last Year

Issues event: 8
Delete event: 3
Issue comment event: 36
Push event: 8
Pull request event: 8
Pull request review event: 1
Fork event: 1
Create event: 4

Committers

Last synced: 7 months ago

All Time

Total Commits: 270
Total Committers: 16
Avg Commits per committer: 16.875
Development Distribution Score (DDS): 0.5

Past Year

Commits: 12
Committers: 5
Avg Commits per committer: 2.4
Development Distribution Score (DDS): 0.583

Top Committers

Name	Email	Commits
Anna Graff	a**h@g**m	135
Marc Lischka	m**c@w**h	43
Marc Lischka	m**c@o**h	31
marclischka	1****a	21
Taras Zakharko	t**o@u**h	9
Work	w**k@m**e	9
reinhardfurrer	r**r@m**h	5
Taras Zakharko	t**o@g**m	4
Balthasar Bickel	b**l@u**h	2
Hedvig Skirgård	h**d@g**m	2
Simon J Greenhill	S****l	2
Taras Zakharko		2
Taras Zakharko	t**o@B**n	2
Marc Lischka	m**c@n**h	1
Marc Lischka	m**c@r**h	1
Marc Lischka	m**a@m**h	1

Committer Domains (Top 20 + Academic)

math.uzh.ch: 2 uzh.ch: 2 rstudio.math.uzh.ch: 1 ntl2.math.uzh.ch: 1 blackbox.localdomain: 1 marcs-mac-mini.home: 1 olive.math.uzh.ch: 1 walker.math.uzh.ch: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 12
Total pull requests: 13
Average time to close issues: 25 days
Average time to close pull requests: 10 days
Total issue authors: 5
Total pull request authors: 3
Average comments per issue: 1.92
Average comments per pull request: 1.69
Merged pull requests: 10
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 5
Pull requests: 8
Average time to close issues: 27 days
Average time to close pull requests: 9 days
Issue authors: 2
Pull request authors: 3
Average comments per issue: 1.0
Average comments per pull request: 1.13
Merged pull requests: 5
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

tzakharko (3)
yjunechoe (3)
HedvigS (3)
G-You (2)
elenlefoll (1)
stefanocoretta (1)
