stringdist

String distance functions for R

https://github.com/markvanderloo/stringdist

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    1 of 8 committers (12.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: markvanderloo
  • Language: R
  • Default Branch: master
  • Size: 1.32 MB
Statistics
  • Stars: 330
  • Watchers: 15
  • Forks: 36
  • Open Issues: 22
  • Releases: 0
Created about 13 years ago · Last pushed about 1 year ago
Metadata Files
Readme

README.md

stringdist

  • Approximate matching, fuzzy text search, and string distance calculations for R.
  • All distance and matching operations are system- and encoding-independent.
  • Built for speed, using OpenMP for parallel computing.

Citing

Please cite the R-Journal article

@article{RJ-2014-011,
  author  = {Mark P.J. van der Loo},
  title   = {{The stringdist Package for Approximate String Matching}},
  year    = {2014},
  journal = {{The R Journal}},
  doi     = {10.32614/RJ-2014-011},
  url     = {https://doi.org/10.32614/RJ-2014-011},
  pages   = {111--122},
  volume  = {6},
  number  = {1}
}

Functionality

The package offers the following main functions:

  • stringdist computes pairwise distances between two input character vectors (shorter one is recycled)
  • stringdistmatrix computes the distance matrix for one or two vectors
  • stringsim computes a string similarity between 0 and 1, based on stringdist
  • amatch is a fuzzy matching equivalent of R's native match function
  • ain is a fuzzy matching equivalent of R's native %in% operator
  • afind finds the location of fuzzy matches of a short string in a long string.
  • seq_dist, seq_distmatrix, seq_amatch and seq_ain compute distances between, and perform matching of, integer sequences (see also the hashr package).
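
The main functions listed above can be tried out with a short session like the following. This is a minimal sketch, assuming the CRAN release of stringdist is installed; the example strings are made up for illustration and the default method ("osa") is used throughout.

```r
library(stringdist)

# Pairwise distances: the shorter argument ("kitten") is recycled.
d <- stringdist("kitten", c("sitting", "mitten"))
d  # 3 1

# Distance matrix between two character vectors (rows: first, columns: second).
m <- stringdistmatrix(c("foo", "bar"), c("fu", "baz", "boo"))
dim(m)  # 2 3

# Similarity between 0 and 1; for "osa" this is 1 - distance / longest length.
s <- stringsim("kitten", "sitting")
s  # 1 - 3/7

# Fuzzy versions of match() and %in%: accept matches up to maxDist edits away.
i  <- amatch("leia", c("uhura", "leela"), maxDist = 2)
ok <- ain("leia", c("uhura", "leela"), maxDist = 2)
i   # 2
ok  # TRUE
```

Note that amatch and ain only report a match when the distance does not exceed maxDist, so the result depends on both the chosen method and that threshold.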

These functions are built upon C-code that re-implements some common (weighted) string distance functions. Distance functions include:

  • Hamming distance;
  • Levenshtein distance (weighted);
  • Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment);
  • Full Damerau-Levenshtein distance (weighted);
  • Longest Common Substring distance;
  • Q-gram distance;
  • Cosine distance for q-gram count vectors (= 1-cosine similarity);
  • Jaccard distance for q-gram count vectors (= 1-Jaccard similarity);
  • Jaro, and Jaro-Winkler distance;
  • Soundex-based string distance.
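
To see how the distance functions in this list differ, the same pairs of strings can be compared under several values of the method argument. The sketch below assumes the method names documented for stringdist ("lv", "osa", "dl", "hamming", "qgram", "cosine", "jaccard", "jw", "soundex"); the input strings are illustrative only.

```r
library(stringdist)

# Edit-based distances differ in the operations they allow.
lv  <- stringdist("ca", "abc", method = "lv")   # 3: substitute, insert, delete only
osa <- stringdist("ca", "abc", method = "osa")  # 3: transpositions allowed, but no substring is edited twice
dl  <- stringdist("ca", "abc", method = "dl")   # 2: transpose "ca" to "ac", then insert "b"

# Hamming distance counts differing positions in strings of equal length.
hm <- stringdist("abc", "abd", method = "hamming")  # 1

# q-gram based distances compare the substrings of length q in each string.
qg <- stringdist("abc", "abd", method = "qgram", q = 2)  # 2: "bc" and "bd" are not shared
cs <- stringdist("abc", "abd", method = "cosine", q = 2)
jc <- stringdist("abc", "abd", method = "jaccard", q = 2)

# Jaro distance (p = 0) and Jaro-Winkler distance (p > 0).
jaro <- stringdist("martha", "marhta", method = "jw")
jw   <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Soundex distance is 0 when both strings share a phonetic code.
sx <- stringdist("Robert", "Rupert", method = "soundex")  # 0
```

The edit-based methods return counts of operations, while the cosine, Jaccard, Jaro and Jaro-Winkler distances are scaled between 0 and 1, so values from different methods are not directly comparable.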

Also, there are some utility functions:

  • qgrams() tabulates the qgrams in one or more character vectors.
  • seq_qgrams() tabulates the qgrams (sometimes called ngrams) in one or more integer vectors.
  • phonetic() computes phonetic codes of strings (currently only soundex)
  • printable_ascii() is a utility function that detects non-printable ascii or non-ascii characters.
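
A brief illustration of the utility functions, as a sketch with made-up inputs. It assumes that qgrams() returns a table whose columns are named after the q-grams found.

```r
library(stringdist)

# Tabulate the 2-grams of a string: "banana" contains "ba", "an", "na".
tab <- qgrams("banana", q = 2)
tab[1, "an"]  # 2

# Soundex code of a name.
code <- phonetic("Robert")
code  # "R163"

# Flag strings that contain only printable ASCII characters.
pa <- printable_ascii(c("abc", "caf\u00e9"))
pa  # TRUE FALSE
```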

C API

As of version 0.9.5.0 you can call a number of stringdist functions directly from the C code of your R package. The description of the API can be found

  • By typing ?stringdist_api in the R console
  • By browsing the package's help index to User guides, package vignettes and other documentation and clicking on doc/stringdist_api.pdf.
  • Or you can find the file's location as follows

system.file("doc/stringdist_api.pdf", package="stringdist")

Examples of packages that link to stringdist can be found here and here.

Installation

To install the latest release from CRAN, open an R terminal and type

install.packages('stringdist')

To obtain the package from the very latest source code open a bash terminal (or git bash if you work under Windows with msysgit) and type

git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz

Warning: the GitHub version can change at any time and may not even build properly. As most of the code is written in C, the development version may crash your R session.

Resources

  • A paper on stringdist has been published in the R-journal
  • Slides of the useR! 2014 conference.

Note to users: deprecated arguments removed as of version 0.9.5.0

The following arguments have been obsolete since 2015 and have been removed in the 0.9.5.0 release (spring 2018)

  • Argument cluster for function stringdistmatrix.
  • Argument maxDist for functions stringdist and stringdistmatrix (not amatch).
  • Argument ncores for function stringdistmatrix

Note to users: deprecated arguments as of >= 0.9.0, >= 0.9.2

Parallelization used to be based on R's parallel package, which works by spawning several R sessions in the background. As of version 0.9.0, stringdist uses the more efficient OpenMP protocol to parallelize everything under the hood.

The following arguments have become obsolete and will be removed sometime in 2016:

  • Argument cluster for function stringdistmatrix.
  • Argument maxDist for functions stringdist and stringdistmatrix (not amatch).
  • Argument ncores for function stringdistmatrix.
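
With OpenMP handling parallelization internally, the number of threads is set through the nthread argument instead of cluster or ncores. The following is a minimal sketch, assuming a release that provides nthread; the word list is made up, and the thread counts are arbitrary choices that should not affect the result.

```r
library(stringdist)

words <- c("stringdist", "stringsim", "amatch", "afind", "qgrams", "phonetic")

# Same computation with one and with two threads.
d1 <- stringdistmatrix(words, words, method = "lv", nthread = 1)
d2 <- stringdistmatrix(words, words, method = "lv", nthread = 2)

# The thread count changes how the work is split, not the distances.
same <- identical(d1, d2)
same
```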

Owner

  • Name: Mark van der Loo
  • Login: markvanderloo
  • Kind: user
  • Location: Netherlands
  • Company: Statistics Netherlands | Tridata

math, programming, data

GitHub Events

Total
  • Issues event: 3
  • Watch event: 12
  • Issue comment event: 3
  • Push event: 1
Last Year
  • Issues event: 3
  • Watch event: 12
  • Issue comment event: 3
  • Push event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 577
  • Total Committers: 8
  • Avg Commits per committer: 72.125
  • Development Distribution Score (DDS): 0.033
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
mark m****o@g****m 558
djvanderlaan d****n@u****l 10
ChrisMuir c****A@g****m 4
Johannes Gruber j****1@r****k 1
Pieter Schoonees s****s@g****m 1
Nirmal n****l@d****l 1
Behzad Kianian 2****i@u****m 1
Ricardo Saporta g****t@R****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 96
  • Total pull requests: 10
  • Average time to close issues: 6 months
  • Average time to close pull requests: 2 days
  • Total issue authors: 59
  • Total pull request authors: 8
  • Average comments per issue: 2.23
  • Average comments per pull request: 2.4
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: about 7 hours
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • markvanderloo (33)
  • soodoku (2)
  • brooksambrose (2)
  • lucazav (2)
  • laurikoobas (2)
  • pommedeterresautee (2)
  • markusdumke (1)
  • KyleHaynes (1)
  • dselivanov (1)
  • Joanmime (1)
  • tamas-ferenci (1)
  • Premsheth (1)
  • comckay (1)
  • zachmayer (1)
  • leonardosnr (1)
Pull Request Authors
  • ChrisMuir (3)
  • Moohan (2)
  • JBGruber (1)
  • rsaporta (1)
  • nirmalpatel (1)
  • bzki (1)
  • richierocks (1)
  • schoonees (1)
Top Labels
Issue Labels
  • enhancement (28)
  • bug (14)
  • question (10)
  • wontfix (7)
  • duplicate (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • cran 48,258 last-month
  • Total docker downloads: 2,186,468
  • Total dependent packages: 88
    (may contain duplicates)
  • Total dependent repositories: 206
    (may contain duplicates)
  • Total versions: 48
  • Total maintainers: 1
cran.r-project.org: stringdist

Approximate String Matching, Fuzzy Text Search, and String Distance Functions

  • Versions: 36
  • Dependent Packages: 78
  • Dependent Repositories: 197
  • Downloads: 48,258 Last month
  • Docker Downloads: 2,186,468
Rankings
Dependent packages count: 1.1%
Dependent repos count: 1.3%
Stargazers count: 1.3%
Downloads: 1.5%
Forks count: 2.1%
Average: 4.1%
Docker downloads count: 17.3%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-stringdist
  • Versions: 12
  • Dependent Packages: 10
  • Dependent Repositories: 9
Rankings
Dependent packages count: 5.9%
Dependent repos count: 11.6%
Average: 17.2%
Stargazers count: 22.7%
Forks count: 28.8%
Last synced: 6 months ago

Dependencies

pkg/DESCRIPTION cran
  • R >= 2.15.3 depends
  • parallel * imports
  • tinytest * suggests