text_edit_distance_similarity

The method compares two text samples for their similarity/dissimilarity as edits needed to convert source string to target string.

https://github.com/taimoorkhan-nlp/text_edit_distance_similarity

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: sciencedirect.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.8%) to scientific vocabulary

Keywords

edit-distance text-dissimilarity text-similarity

Repository

Basic Info
  • Host: GitHub
  • Owner: taimoorkhan-nlp
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 351 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 3
  • Open Issues: 1
  • Releases: 0
Topics
edit-distance text-dissimilarity text-similarity
Created about 2 years ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

Analyzing Text Similarity Using Edit Distance

Description

This method calculates the edit distance between two texts to estimate their similarity or dissimilarity. The edit distance is the minimum total cost of the operations (inserting, deleting, or substituting characters) needed to transform one text into another. For instance, the Simple edit distance between "cut" and "cat" is 1, as only one substitution is needed. Similarly, the Simple edit distance between "cat" and "at" is also 1, as one deletion suffices. In its simplest form, edit distance assigns an equal cost to all operations (insertions, deletions, and substitutions). Variants of the method allow for different cost structures, making it adaptable to various applications. For example, this method can be used to compare dialects of a language, definitions of similar concepts across disciplines, or versions of the same news article from different media sources.
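To make the two examples above concrete, here is a minimal sketch that recovers which operations a unit-cost edit distance counts. The function name `edit_script` and the operation labels are illustrative assumptions and are not taken from this repository's `utils.py`.

```python
from typing import List, Tuple


def edit_script(source: str, target: str) -> List[Tuple[str, str, str]]:
    """Return one cheapest list of (operation, source_char, target_char) edits.

    Hypothetical helper for illustration; every operation costs 1.
    """
    m, n = len(source), len(target)
    # dist[i][j] = fewest edits turning source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # match or substitution

    # Walk back from the bottom-right cell to list the edits that were used.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters match, no edit needed
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


print(edit_script("cut", "cat"))  # [('substitute', 'u', 'a')]
print(edit_script("cat", "at"))   # [('delete', 'c', '')]
```

The length of the returned list equals the Simple edit distance, so both examples give 1.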

Use Cases

  • Identifying different mentions of entities (e.g. names like "Donald Trump", "D. Trump", and "Trump")
  • Finding tweets/social media posts similar to a certain tweet, sentence, or claim.

Input Data

The method reads the text pairs to compare from the file data/input_text_pairs.csv. Example pairs:

| Text 1 | Text 2 |
|:-----:|:----:|
| The sun rises in the east | The sun sets in the west |
| He likes to play football | He loves to play soccer |
| She made a cup of tea | She prepared a cup of coffee |
| They visited the museum | They toured the art gallery |
| Reading helps improve vocabulary | Reading enhances language skills |

Output Data

The output data comprises the text pairs and their distance scores under the three edit distance cost variants, computed at the character level (word level is the alternative).

| Text 1 | Text 2 | Simple | Levenshtein | Damerau-Levenshtein |
|------|--------|--------|------------|--------------------|
| The sun rises in the east | The sun sets in the west | 5 | 5 | 5 |
| He likes to play football | He loves to play soccer | 9 | 16 | 9 |
| She made a cup of tea | She prepared a cup of coffee | 11 | 15 | 11 |
| They visited the museum | They toured the art gallery | 15 | 22 | 15 |
| Reading helps improve vocabulary | Reading enhances language skills | 22 | 34 | 22 |

Hardware

The method runs on a small virtual machine provided by a cloud computing company (2 x86 CPU cores, 4 GB RAM, 40 GB HDD).

Environment Setup

Execute the following command to install the required packages:

pip install -r requirements.txt

How to Use

  • Run Jupyter using the command jupyter lab or jupyter notebook
  • Open and execute all cells in text_edit_distance_similarity.ipynb, which uses the methods defined in utils.py.
  • It reads the text pairs from data/input_text_pairs.csv and writes the output to data/output_scores.csv, which contains the text pairs along with their edit distances.
  • Optional: specify the method (simple, levenshtein, damerau, or all; the default is all) and the level (c for character level, w for word level) when calling batch_edit_distance(csv_path='../data/input_text_pairs.csv', method='all', level='c')

Alternatively, to execute as a Python script:

  • cd src
  • jupyter nbconvert --to script text_edit_distance.ipynb to convert the notebook text_edit_distance.ipynb to text_edit_distance.py in the directory
  • python text_edit_distance.py
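The read, score, and write flow described in this section can be pictured with the rough sketch below. It is not the repository's batch_edit_distance: the name `score_pairs`, the `scorers` argument, and the assumption that the CSV has two columns with one header row are all hypothetical choices made for illustration.

```python
import csv
from typing import Callable, Dict, Sequence

# Any function that takes two sequences (strings or word lists) and returns a score.
Scorer = Callable[[Sequence, Sequence], int]


def score_pairs(in_path: str, out_path: str,
                scorers: Dict[str, Scorer], level: str = "c") -> int:
    """Score every text pair in in_path and write the results to out_path.

    Assumes a two-column CSV with one header row (an assumption, not the
    repository's documented format). Returns the number of pairs scored.
    """
    with open(in_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src))
    header = rows[0][:2] if rows else ["Text 1", "Text 2"]
    pairs = [row[:2] for row in rows[1:] if len(row) >= 2]  # skip short rows

    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header + list(scorers))
        for text1, text2 in pairs:
            # level 'w' compares lists of words, level 'c' compares characters
            if level == "w":
                a, b = text1.split(), text2.split()
            else:
                a, b = text1, text2
            writer.writerow([text1, text2] + [fn(a, b) for fn in scorers.values()])
    return len(pairs)
```

Passing the distance functions in as a dictionary keeps the file handling separate from the distance logic, so one variant or all three can be written as output columns.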

Technical Details

The method offers three edit distance variants (Simple, Levenshtein, and Damerau-Levenshtein) between two texts, at both character and word level. They differ in the operations allowed and their costs:

  • Simple edit distance: insertion, deletion, and substitution, each with cost 1.
  • Levenshtein edit distance: insertion and deletion with cost 1 and substitution with cost 2 (equivalent to disallowing substitution, since a deletion plus an insertion costs the same).
  • Damerau-Levenshtein edit distance: insertion, deletion, substitution, and transposition, each with cost 1.
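The three cost schemes above can be expressed with a single dynamic-programming routine whose costs are parameters. The sketch below is one possible reading, not the code in utils.py: the names `edit_distance`, `all_variants`, and `swap_cost` are hypothetical, and the transposition step uses the restricted adjacent-swap form (often called optimal string alignment), which is an assumption about how the Damerau-Levenshtein variant is meant.

```python
from typing import Dict, Optional, Sequence


def edit_distance(source: Sequence, target: Sequence,
                  ins_cost: int = 1, del_cost: int = 1, sub_cost: int = 1,
                  swap_cost: Optional[int] = None) -> int:
    """Edit distance with configurable costs.

    swap_cost=None disables transpositions; an integer enables swapping
    two adjacent items at that cost.
    """
    m, n = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * del_cost
    for j in range(1, n + 1):
        dist[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            best = min(dist[i - 1][j] + del_cost,
                       dist[i][j - 1] + ins_cost,
                       dist[i - 1][j - 1] + (0 if same else sub_cost))
            if (swap_cost is not None and i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                best = min(best, dist[i - 2][j - 2] + swap_cost)
            dist[i][j] = best
    return dist[m][n]


def all_variants(text1: str, text2: str, level: str = "c") -> Dict[str, int]:
    """Score one pair under the three cost schemes listed above."""
    # level 'w' compares lists of words, level 'c' compares characters
    a, b = (text1.split(), text2.split()) if level == "w" else (text1, text2)
    return {
        "simple": edit_distance(a, b),
        "levenshtein": edit_distance(a, b, sub_cost=2),
        "damerau": edit_distance(a, b, swap_cost=1),
    }


print(all_variants("cut", "cat"))    # {'simple': 1, 'levenshtein': 2, 'damerau': 1}
print(all_variants("form", "from"))  # {'simple': 2, 'levenshtein': 2, 'damerau': 1}
```

The second example shows where the variants diverge: swapping "or" to "ro" costs two edits without transposition and one edit with it.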

The method is reproducible: it is a vanilla implementation that uses only basic packages (string and random), which are usually already included, so no additional packages or resources need to be installed. It gives full control to update costs and scale as needed. Random seeds are fixed so that random numbers are predictable across runs.

Publications

  1. Hossain, E., Rana, R., Higgins, N., Soar, J., Barua, P. D., Pisani, A. R., & Turner, K. (2023). Natural language processing in electronic health records in relation to healthcare decision-making: a systematic review. Computers in biology and medicine, 155, 106649.
  2. Chaabi, Y., & Allah, F. A. (2022). Amazigh spell checker using the Damerau-Levenshtein algorithm and N-gram. Journal of King Saud University-Computer and Information Sciences, 34(8), 6116-6124.

Contact Details

Taimoor Khan (taimoor.khan@gesis.org)

Owner

  • Name: Taimoor Khan
  • Login: taimoorkhan-nlp
  • Kind: user
  • Location: Köln

Senior Researcher at GESIS

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Khan
    given-names: Taimoor
    orcid: 0000-0002-6542-9217
title: "text edit distance similarity"
version: 1.0
identifiers:
  - type: 
    value: 
date-released: 2024-11-28

GitHub Events

Total
  • Issues event: 1
  • Issue comment event: 4
  • Member event: 1
  • Push event: 76
  • Pull request review event: 1
  • Pull request event: 28
  • Fork event: 2
Last Year
  • Issues event: 1
  • Issue comment event: 4
  • Member event: 1
  • Push event: 76
  • Pull request review event: 1
  • Pull request event: 28
  • Fork event: 2