fuzzysearch

Find parts of long text or data, allowing for some changes/typos.

https://github.com/taleinat/fuzzysearch

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.5%) to scientific vocabulary

Keywords

fuzzy-matching fuzzy-search python string-search text-search

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: taleinat
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 1010 KB

Statistics

Stars: 328
Watchers: 7
Forks: 25
Open Issues: 10
Releases: 9

Created over 12 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog Contributing License Authors

README.rst

===========
fuzzysearch
===========

.. image:: https://img.shields.io/pypi/v/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Latest Version

.. image:: https://img.shields.io/coveralls/taleinat/fuzzysearch.svg?branch=master
    :target: https://coveralls.io/r/taleinat/fuzzysearch?branch=master
    :alt: Test Coverage

.. image:: https://img.shields.io/pypi/wheel/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Wheels

.. image:: https://img.shields.io/pypi/pyversions/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python versions

.. image:: https://img.shields.io/pypi/implementation/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch
    :alt: Supported Python implementations

.. image:: https://img.shields.io/pypi/l/fuzzysearch.svg?style=flat
    :target: https://pypi.python.org/pypi/fuzzysearch/
    :alt: License

Fuzzy search: Find parts of long text or data, allowing for some
changes/typos.

Highly optimized, simple to use, does one thing well.

.. code:: python

    >>> from fuzzysearch import find_near_matches
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

* Two simple functions to use: one for in-memory data and one for files

  * Fastest search algorithm is chosen automatically

* Levenshtein Distance metric with configurable parameters

  * Separately configure the max. allowed distance, substitutions, deletions
    and/or insertions

* Advanced algorithms with optional C and Cython optimizations

* Properly handles Unicode; special optimizations for binary data

* Simple installation:

  * ``pip install fuzzysearch`` just works
  * pure-Python fallbacks for compiled modules
  * only one dependency (``attrs``)

* Extensively tested

* Free software: MIT license

For more info, see the `documentation <https://fuzzysearch.readthedocs.io/>`_.


How is this different than FuzzyWuzzy or RapidFuzz?
---------------------------------------------------

The main difference is that fuzzysearch searches for fuzzy matches through
long texts or data. FuzzyWuzzy and RapidFuzz, on the other hand, are intended
for fuzzy comparison of pairs of strings, identifying how closely they match
according to some metric such as the Levenshtein distance.

These are very different use-cases, and the solutions are very different as
well.
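
The contrast can be illustrated with a small standard-library sketch. It is
not how fuzzysearch, FuzzyWuzzy or RapidFuzz are implemented, and the helper
names ``levenshtein`` and ``best_substring_distance`` are made up for this
illustration. The first function compares two whole strings; the second lets
a match start and end anywhere inside the longer text.

```python
def levenshtein(a, b):
    """Edit distance between two whole strings (pairwise comparison)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # skip a character of a
                cur[j - 1] + 1,            # skip a character of b
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def best_substring_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    # Same recurrence as above, but the first row is all zeros, so a match
    # may begin at any position in text without paying for the skipped prefix.
    prev = [0] * (len(text) + 1)
    for i, cp in enumerate(pattern, 1):
        cur = [i]
        for j, ct in enumerate(text, 1):
            cur.append(min(
                prev[j] + 1,
                cur[j - 1] + 1,
                prev[j - 1] + (cp != ct),
            ))
        prev = cur
    # Taking the minimum over the last row lets the match end anywhere too.
    return min(prev)


print(levenshtein('PATTERN', '---PATERN---'))              # 7
print(best_substring_distance('PATTERN', '---PATERN---'))  # 1
```

Comparing the whole strings gives a distance of 7 because the surrounding
dashes count as edits, while searching inside the text finds ``PATERN`` at a
distance of 1.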


How is this different than ElasticSearch and Lucene?
----------------------------------------------------

The main difference is that fuzzysearch does no indexing or other
preparations; it directly searches through the given text or data for a given
sub-string. Therefore, it is much simpler to use compared to systems based on
text indexing.


Installation
------------

``fuzzysearch`` supports Python versions 3.8+, as well as PyPy 3.9 and 3.10.

.. code::

    $ pip install fuzzysearch

This will work even if installing the C and Cython extensions fails, using
pure-Python fallbacks.
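
Fallbacks of this kind usually follow a common Python packaging pattern,
sketched below. This is a generic illustration and not fuzzysearch's actual
source: the module name ``_hypothetical_fast_search`` and the function
``count_exact`` are invented for the example.

```python
try:
    # Hypothetical compiled extension; the name exists only for illustration.
    from _hypothetical_fast_search import count_exact
except ImportError:
    # The extension is unavailable, so define an equivalent in pure Python.
    def count_exact(pattern, text):
        """Count occurrences of pattern in text, including overlapping ones."""
        count = 0
        pos = text.find(pattern)
        while pos != -1:
            count += 1
            pos = text.find(pattern, pos + 1)
        return count


print(count_exact('ACA', 'CACACACA'))  # 3
```

Callers import a single name and get the same behaviour either way; only the
speed differs.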


Usage
-----
Just call ``find_near_matches()`` with the sub-sequence you're looking for,
the sequence to search, and the matching parameters:

.. code:: python

    >>> from fuzzysearch import find_near_matches
    # search for 'PATTERN' with a maximum Levenshtein Distance of 1
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

To search in a file, use ``find_near_matches_in_file()``:

.. code:: python

    >>> from fuzzysearch import find_near_matches_in_file
    >>> with open('data_file', 'rb') as f:
    ...     find_near_matches_in_file(b'PATTERN', f, max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]


Examples
--------

*fuzzysearch* is great for ad-hoc searches of genetic data, such as DNA or
protein sequences, before reaching for more complex tools:

.. code:: python

    >>> sequence = '''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG'''
    >>> subsequence = 'TGCACTGTAGGGATAACAAT' # distance = 1
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]

BioPython sequences are also supported:

.. code:: python

    >>> from Bio.Seq import Seq
    # Bio.Alphabet was removed in Biopython 1.78; Seq takes no alphabet argument
    >>> sequence = Seq('''\
    GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACA
    TTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACA
    CAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAG
    GGGATAGG''')
    >>> subsequence = Seq('TGCACTGTAGGGATAACAAT')
    >>> find_near_matches(subsequence, sequence, max_l_dist=2)
    [Match(start=3, end=24, dist=1, matched="TAGCACTGTAGGGATAACAAT")]


Matching Criteria
-----------------
The search function supports four possible match criteria, which may be
supplied in any combination:

* maximum Levenshtein distance (``max_l_dist``)

* maximum # of substitutions (``max_substitutions``)

* maximum # of deletions (``max_deletions``; "delete" = skip a character in
  the sub-sequence)

* maximum # of insertions (``max_insertions``; "insert" = skip a character in
  the sequence)

Not supplying a criterion means that there is no limit for it. For this reason,
one must always supply either ``max_l_dist``, or all three of the other
criteria, or both.

.. code:: python

    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
    [Match(start=3, end=9, dist=1, matched="PATERN")]

    # this will not match since max-deletions is set to zero
    >>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1, max_deletions=0)
    []

    # note that a deletion + insertion may be combined to match a substitution
    >>> find_near_matches('PATTERN', '---PAT-ERN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=1, matched="PAT-ERN")] # the Levenshtein distance is still 1

    # ... but deletion + insertion may also match other, non-substitution differences
    >>> find_near_matches('PATTERN', '---PATERRN---', max_deletions=1, max_insertions=1, max_substitutions=0)
    [Match(start=3, end=10, dist=2, matched="PATERRN")]
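
To make the definitions of "delete" and "insert" above concrete, here is a
brute-force sketch of matching with separate limits. It is only an
illustration under those definitions, not the algorithm fuzzysearch uses.
``match_ends`` is a hypothetical helper, and it checks a single given start
position instead of scanning the whole sequence.

```python
def match_ends(pattern, text, start, max_subs=0, max_ins=0, max_dels=0):
    """Return the end indexes at which pattern matches text[start:end]."""
    ends = set()
    seen = set()
    # Each state: (pattern index, text index, substitutions, insertions, deletions)
    stack = [(0, start, 0, 0, 0)]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        p, t, subs, ins, dels = state
        if p == len(pattern):
            ends.add(t)  # the whole pattern has been consumed
            continue
        if t < len(text):
            if pattern[p] == text[t]:
                stack.append((p + 1, t + 1, subs, ins, dels))      # exact match
            elif subs < max_subs:
                stack.append((p + 1, t + 1, subs + 1, ins, dels))  # substitution
            if ins < max_ins:
                # insertion: skip a character in the sequence
                stack.append((p, t + 1, subs, ins + 1, dels))
        if dels < max_dels:
            # deletion: skip a character in the sub-sequence
            stack.append((p + 1, t, subs, ins, dels + 1))
    return sorted(ends)


# One deletion is enough to match 'PATERN'; with no deletions nothing matches.
print(match_ends('PATTERN', '---PATERN---', 3, max_dels=1))   # [9]
print(match_ends('PATTERN', '---PATERN---', 3))               # []
# A deletion plus an insertion stands in for a substitution.
print(match_ends('PATTERN', '---PAT-ERN---', 3, max_ins=1, max_dels=1))  # [10]
```

The same limits also accept ``'---PATERRN---'``, where the deletion and the
insertion occur at different positions.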

Owner

Name: Tal Einat
Login: taleinat
Kind: user

Repositories: 42
Profile: https://github.com/taleinat

GitHub Events

Total

Watch event: 21
Push event: 1
Create event: 1

Last Year

Watch event: 21
Push event: 1
Create event: 1

Committers

Last synced: over 2 years ago

All Time

Total Commits: 289
Total Committers: 3
Avg Commits per committer: 96.333
Development Distribution Score (DDS): 0.042

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Tal Einat	t**t@g**m	277
Tal Einat	t**t@s**m	10
Tal Einat	5****t	2

Committer Domains (Top 20 + Academic)

socialcodeinc.com: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 45
Total pull requests: 2
Average time to close issues: 4 months
Average time to close pull requests: about 4 hours
Total issue authors: 30
Total pull request authors: 2
Average comments per issue: 2.56
Average comments per pull request: 2.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

taleinat (5)
kevinrue (3)
DanielBiskup (3)
sasi143 (3)
georgh (3)
prabhatM (2)
spooknik (2)
levitation (2)
jtlz2 (1)
Ericxgao (1)
heshamwhite (1)
sanjeevpe (1)
yasinzaehringer-paradime (1)
Stonatus (1)
theo-allnutt-bioinformatics (1)

Pull Request Authors

Aman-Clement (2)
maximkir-fl (1)

Top Labels

Issue Labels

enhancement (5) bug (4) question (1) wontfix (1)

Pull Request Labels

Packages

Total packages: 2
Total downloads:
- pypi 277,596 last-month
Total docker downloads: 3,022

Total dependent packages: 14
(may contain duplicates)
Total dependent repositories: 274
(may contain duplicates)
Total versions: 18
Total maintainers: 1

pypi.org: fuzzysearch

fuzzysearch is useful for finding approximate subsequence matches

Homepage: https://github.com/taleinat/fuzzysearch
Documentation: https://fuzzysearch.readthedocs.io/
License: MIT
Latest release: 0.8.0
published about 1 year ago

Versions: 15
Dependent Packages: 14
Dependent Repositories: 274
Downloads: 277,596 Last month
Docker Downloads: 3,022

Rankings

Downloads: 0.9%

Dependent repos count: 0.9%

Dependent packages count: 1.1%

Docker downloads count: 1.6%

Average: 2.8%

Stargazers count: 3.9%

Forks count: 8.1%

Maintainers (1)

taleinat

Last synced: 10 months ago

conda-forge.org: fuzzysearch

Homepage: https://github.com/taleinat/fuzzysearch
License: MIT
Latest release: 0.7.3
published over 4 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Stargazers count: 22.8%

Forks count: 33.2%

Dependent repos count: 34.0%

Average: 35.3%

Dependent packages count: 51.2%

Last synced: 10 months ago

Dependencies

requirements_dev.txt pypi

bump2version * development
cython * development
sphinx * development
tox <3 development
virtualenv * development

setup.py pypi

attrs >=19.3