densify
densify: An R package to reduce empty cells in data frames of typological linguistic data - Published in JOSS (2024)
Science Score: 93.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in JOSS metadata
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
Basic Info
- Host: GitHub
- Owner: annagraff
- License: agpl-3.0
- Language: R
- Default Branch: main
- Size: 39.4 MB
Statistics
- Stars: 2
- Watchers: 7
- Forks: 1
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
densify
densify is an R package for densifying (sparse) matrices.
Installation (for users)
To install densify, run the following code in R:
~~~~
install.packages("devtools")
library("devtools")
install_github("annagraff/densify", build_vignettes = TRUE)
library(densify)
~~~~
Performing matrix densification with densify
Preparing input
The data frame to be subsetted must have rows representing taxa or observations (with taxon names provided in a dedicated column) and columns representing variables (with variable names as column names). Any cells that are empty, not applicable, or marked with a question mark must be coded as NA. If taxonomic structure is relevant to the pruning process, a taxonomy must be provided as a phylo object or as an adjacency table (i.e. a data frame containing the columns id and parent_id, with each row encoding one parent-child relationship). The glottolog_languoids data frame provided by the package can be used directly for this purpose.
~~~~
# prepare example data: WALS and Glottolog
data(WALS)
data(glottolog_languoids)

# any question marks, empty entries, "NA"s must be coded as NA
WALS[WALS == "?"] <- NA
WALS[WALS == "NA"] <- NA
head(WALS)

# all taxa must be present in the taxonomy used for pruning
WALS <- WALS[which(WALS$Glottocode %in% glottolog_languoids$id), ]
~~~~
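For a self-contained picture of the expected input shape, the following base-R sketch builds a toy data frame and a toy adjacency-table taxonomy. The object names `toy` and `toy_taxonomy` and all values are invented for illustration and are not part of the package; only the column names `id` and `parent_id` come from the description above. Encoding the root with `parent_id = NA` is an assumption made here, not something the README specifies.

```r
# Toy input: one row per taxon, taxon names in a dedicated column,
# one column per variable. All values are invented for illustration.
toy <- data.frame(
  Taxon = c("lang_a", "lang_b", "lang_c", "lang_x"),
  Var1  = c("yes", "?", "no", "yes"),
  Var2  = c("", "SOV", "NA", "SVO"),
  stringsAsFactors = FALSE
)

# Recode question marks, empty strings and literal "NA" strings as NA
# in every variable column (the taxon column is left untouched).
vars <- setdiff(names(toy), "Taxon")
toy[vars] <- lapply(toy[vars], function(x) {
  x[x %in% c("?", "", "NA")] <- NA
  x
})

# Toy taxonomy as an adjacency table: each row encodes one
# parent-child relationship. Assumption: the root has parent_id = NA.
toy_taxonomy <- data.frame(
  id        = c("fam_1", "lang_a", "lang_b", "lang_c"),
  parent_id = c(NA, "fam_1", "fam_1", "fam_1"),
  stringsAsFactors = FALSE
)

# Keep only taxa that are present in the taxonomy ("lang_x" is dropped).
toy <- toy[toy$Taxon %in% toy_taxonomy$id, ]
```

The same three steps (recode missing values, supply a taxonomy, drop taxa absent from it) are what the WALS example above performs on real data.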
Densifying, visualizing, ranking and pruning the input
The densify() function iteratively prunes the input matrix. Its parameters control, among other things: how much the taxonomic diversity of the sample weighs in the pruning process; which method is used to calculate importance weights (e.g. arithmetic or logit-transformed means of row-wise coding density); and the minimum variation a feature must show to be retained (e.g. constant variables might be of no interest). For a detailed discussion of the parameters, refer to the function documentation or to the vignette hosted in the software repository.
~~~~
set.seed(2024)
example_result <- densify(data = WALS,
                          cols = !Glottocode,
                          taxonomy = glottolog_languoids,
                          taxon_id = "Glottocode",
                          density_mean = "log_odds",
                          min_variability = 3,
                          limits = list(min_coding_density = 1),
                          density_mean_weights = list(coding = 1, taxonomy = 1))
~~~~
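To make the notions of coding density, mean type and minimum variation more concrete, here is a minimal base-R sketch on an invented matrix. It is not the package's implementation: the object names, the use of `qlogis()` and `plogis()` for the logit transform and its inverse, and the "number of distinct observed states" criterion for variability are all assumptions chosen to illustrate the idea. The package's own criteria may differ.

```r
# Invented toy matrix: 4 taxa x 4 variables, NA = missing.
m <- matrix(
  c("a", NA,  "x", "p",
    "b", "u", NA,  "p",
    "a", NA,  NA,  "p",
    NA,  "v", "y", NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("taxon_", 1:4), paste0("var_", 1:4))
)

# Row-wise coding density: proportion of non-missing cells per taxon.
row_density <- rowMeans(!is.na(m))

# Arithmetic mean of the row-wise densities.
arithmetic_mean <- mean(row_density)

# Logit-transformed mean: average on the log-odds scale, then map back
# to a proportion. Densities of exactly 0 or 1 would give infinite
# log-odds; the toy data avoid that case.
logit_mean <- plogis(mean(qlogis(row_density)))

# A simple notion of variability: number of distinct observed states
# per variable. Variables with fewer than two states are constant.
n_states <- apply(m, 2, function(x) length(unique(x[!is.na(x)])))
constant_vars <- names(n_states)[n_states < 2]

print(row_density)
print(c(arithmetic = arithmetic_mean, logit = logit_mean))
print(constant_vars)
```

In this toy case the two means differ slightly because the logit transform is nonlinear, and `var_4` is flagged as constant because it has only one observed state.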
The output of the function is a densify_result object, documenting several summary statistics of all resulting sub-matrices. These summary statistics can be used to define a scoring function, which is used by rank_results(), visualize() and prune() to identify the optimum: rank_results() returns the relative ranks of all generated sub-matrices given the scoring function, visualize() visualizes their relative ranking, and prune() extracts the optimal sub-matrix (ranked first).
~~~~
head(example_result)

# use rank_results() to obtain a vector indicating the rank of each sub-matrix
# with the default scoring function
example_ranks1 <- rank_results(example_result, scoring_function = n_data_points * coding_density)

# with a scoring function that gives high weight to taxonomic diversity:
example_ranks2 <- rank_results(example_result, scoring_function = n_data_points * coding_density * taxonomic_index^3)

# use visualize() to illustrate the quality scores and optimum given each scoring function
visualize(example_result, scoring_function = n_data_points * coding_density)
visualize(example_result, scoring_function = n_data_points * coding_density * taxonomic_index^3)

# use prune() to obtain the optimum sub-matrix given each scoring function
example_optimum1 <- prune(example_result, scoring_function = n_data_points * coding_density)
example_optimum2 <- prune(example_result, scoring_function = n_data_points * coding_density * taxonomic_index^3)
~~~~
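The effect of the scoring function on the chosen optimum can be illustrated without the package. The sketch below applies the two scoring expressions used above to an invented table of summary statistics. The column names `n_data_points`, `coding_density` and `taxonomic_index` are taken from those expressions; the object `toy_stats`, its values, and the ranking convention (rank 1 = highest score, ties share the lowest rank) are assumptions for illustration and do not reproduce the internals of rank_results().

```r
# Invented summary statistics for four hypothetical sub-matrices.
toy_stats <- data.frame(
  n_data_points   = c(1000, 800, 600, 300),
  coding_density  = c(0.40, 0.55, 0.70, 0.90),
  taxonomic_index = c(0.90, 0.85, 0.60, 0.30)
)

# Score 1: number of data points times coding density.
score1 <- with(toy_stats, n_data_points * coding_density)

# Score 2: the same product, additionally weighted by the cubed
# taxonomic index so that taxonomic diversity counts heavily.
score2 <- with(toy_stats, n_data_points * coding_density * taxonomic_index^3)

# Rank the sub-matrices so that rank 1 is the highest score.
rank1 <- rank(-score1, ties.method = "min")
rank2 <- rank(-score2, ties.method = "min")

# Row index of the optimum under each scoring function.
best1 <- which(rank1 == 1)
best2 <- which(rank2 == 1)
print(data.frame(toy_stats, score1, rank1, score2, rank2))
```

With these invented numbers the optimum shifts from the second row under the first score to the first row under the second score, which is the kind of trade-off between density and taxonomic diversity that the choice of scoring function controls.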
For more details on each function, refer to the help pages (see below) or consult the published paper and the vignette.
~~~~
?densify
?rank_results
?visualize
?prune
~~~~
Contributing
To report bugs, seek support or suggest improvements (e.g. additional functionality or changes to existing functionality or API arguments), please open an issue on GitHub. Similarly, to contribute to the package directly, please open an issue on GitHub for discussion before submitting a pull request.
Code of Conduct
Please note that densify is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Owner
- Name: Anna Graff
- Login: annagraff
- Kind: user
- Repositories: 1
- Profile: https://github.com/annagraff
JOSS Publication
densify: An R package to reduce empty cells in data frames of typological linguistic data
Authors
University of Zurich, Department of Comparative Language Science, University of Zurich, Department of Evolutionary Biology and Environmental Studies, University of Zurich, Center for the Interdisciplinary Study of Language Evolution
University of Zurich, Department of Comparative Language Science, University of Zurich, Center for the Interdisciplinary Study of Language Evolution
Tags
sparse matrices sub-sampling linguistic data diversity samples
GitHub Events
Total
- Issues event: 8
- Delete event: 3
- Issue comment event: 36
- Push event: 8
- Pull request event: 8
- Pull request review event: 1
- Fork event: 1
- Create event: 4
Last Year
- Issues event: 8
- Delete event: 3
- Issue comment event: 36
- Push event: 8
- Pull request event: 8
- Pull request review event: 1
- Fork event: 1
- Create event: 4
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Anna Graff | a****h@g****m | 135 |
| Marc Lischka | m****c@w****h | 43 |
| Marc Lischka | m****c@o****h | 31 |
| marclischka | 1****a | 21 |
| Taras Zakharko | t****o@u****h | 9 |
| Work | w****k@m****e | 9 |
| reinhardfurrer | r****r@m****h | 5 |
| Taras Zakharko | t****o@g****m | 4 |
| Balthasar Bickel | b****l@u****h | 2 |
| Hedvig Skirgård | h****d@g****m | 2 |
| Simon J Greenhill | S****l | 2 |
| Taras Zakharko | | 2 |
| Taras Zakharko | t****o@B****n | 2 |
| Marc Lischka | m****c@n****h | 1 |
| Marc Lischka | m****c@r****h | 1 |
| Marc Lischka | m****a@m****h | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 12
- Total pull requests: 13
- Average time to close issues: 25 days
- Average time to close pull requests: 10 days
- Total issue authors: 5
- Total pull request authors: 3
- Average comments per issue: 1.92
- Average comments per pull request: 1.69
- Merged pull requests: 10
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 5
- Pull requests: 8
- Average time to close issues: 27 days
- Average time to close pull requests: 9 days
- Issue authors: 2
- Pull request authors: 3
- Average comments per issue: 1.0
- Average comments per pull request: 1.13
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- tzakharko (3)
- yjunechoe (3)
- HedvigS (3)
- G-You (2)
- elenlefoll (1)
- stefanocoretta (1)
Pull Request Authors
- tzakharko (8)
- HedvigS (4)
- SimonGreenhill (4)