lingdist

lingdist: A Fast Linguistic Distance and Alignment Computation R Package

https://github.com/fncokg/lingdist

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.8%) to scientific vocabulary

Keywords

language-distance r rcpp
Last synced: 6 months ago · JSON representation

Repository

lingdist: A Fast Linguistic Distance and Alignment Computation R Package

Basic Info
  • Host: GitHub
  • Owner: fncokg
  • Language: C++
  • Default Branch: master
  • Homepage:
  • Size: 5.78 MB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
language-distance r rcpp
Created over 2 years ago · Last pushed over 2 years ago
Metadata Files
Readme Changelog

README.md

lingdist: Fast Linguistic Distance and Alignment Computation

lingdist is an R package mainly for efficient computations of linguistic distances. In its current version, a generalized edit distance framework is implemented. More linguitic distance algorithms will be implemented in the later version. The package originated from requirements in linguistic researches but has the flexibility to be used in any suitable situations.

Functionality

In general: - Fast Generalized Edit Distance Computation: All alogrithms as well as the parallel computation features and related tool functions are implemented in native C++ leveraging the Rcpp package. - Parallel Computation for Large Datasets: When pairwise distance matrix is required for large datasets, the parallel computation feature can be enabled as easy as setting the parallel parameter to TRUE.

For the generalized edit distance algorithm: - Custom Cost Assignments: You can specify all cost values for any character pairs. This is useful when you want specify different cost values for a-i substitution and a-k substitution. - String as a Character: The package is able to treate a string as a single symbol when computing edit distance. This is useful when you want the algorithm to consider the IPA transcription [kʰʷ] as a single symbol, rather than 3 separate characters. - Alignment Results: In addition to distance result, the package also provides you a detailed algiment result report including which two symbols are algined by the algorithm and how many alignment scenarios are possible.

Installation

Just

install.packages("lingdist")

and then

library(lingdist)

You can also download the newest dev codes in Github and compile yourself (only for experienced users).

Usage

Edit distance and alignment details for two strings

With edit_dist_string (cost_mat is optional and can be an empty dataframe):

res <- edit_dist_string("pʰ_l_i_z̥", "p_l_i_s", cost_mat = cost.mat, delim = "_")

Inspect res$distance for distance result, and res$alignments for alignment details, which may look like this:

[[1]] Chars1 Chars2 Operation Cost Cumcost 1 pʰ p substitute 1 1 2 l l same 0 1 3 i i same 0 1 4 z̥ s substitute 1 2

Average pairwise edit distances for all rows in a dataframe

When you have a dataframe looks like this:

0 1 2 albanian pʰ_l_i_z̥ k_o_l s_t_ɛ_l_aː azerbaijani pʰ_l̥_i_z̥ k_ɑ_lˠ s_t̪_ɛ_l_ə bengali p_l_i_s k_o_l ə_s_t_e_l_ʌ

And you want all pairwise distances between rows (i.e. albanian vs azerbaijani, bengali vs azerbaijani etc.), call:

result <- edit_dist_df(df, cost_mat = cost.mat, delim = "_", parallel = TRUE, n_threads = 4)

In this call, parallel computation is enabled and 4 threads are used in the computation. For more details and other APIs, refer to the documentation.

How to cite

Forthcoming.

Owner

  • Name: Chao Kong
  • Login: fncokg
  • Kind: user

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 179 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
cran.r-project.org: lingdist

Fast Linguistic Distance and Alignment Computation

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 179 Last month
Rankings
Forks count: 27.8%
Stargazers count: 27.8%
Dependent packages count: 28.6%
Dependent repos count: 36.8%
Average: 39.1%
Downloads: 74.5%
Maintainers (1)
Last synced: 7 months ago

Dependencies

DESCRIPTION cran
  • Rcpp >= 1.0.10 imports