https://github.com/chris-santiago/stringcluster

A Scikit-Learn style deduper.

Keywords

dedupe deduplication scikit-learn text-processing text-similarity transformer

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: chris-santiago
License: mit
Language: Jupyter Notebook
Default Branch: master
Homepage: https://chris-santiago.github.io/stringcluster/
Size: 345 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 4 years ago · Last pushed over 3 years ago

https://github.com/chris-santiago/stringcluster/blob/master/

# string-cluster
[![Build Status](https://app.travis-ci.com/chris-santiago/stringcluster.svg?branch=master)](https://app.travis-ci.com/chris-santiago/stringcluster)
[![codecov](https://codecov.io/gh/chris-santiago/stringcluster/branch/master/graph/badge.svg?token=X2SqPEfCdZ)](https://codecov.io/gh/chris-santiago/stringcluster)
## Install

Create a virtual environment with Python 3.9 and install from git:

```bash
pip install git+https://github.com/chris-santiago/stringcluster.git
```

## Use

### Preliminaries

This example shows how to use `StringCluster` to deduplicate a list of public company names. The example dataset is a series of company names and their variations.

`StringCluster` uses Tf-Idf vectorization: it tokenizes each string in a series into n-grams and normalizes the count of each n-gram token. It then builds a cosine similarity matrix by computing the linear kernel of the resulting vectors (for L2-normalized vectors, the linear kernel equals cosine similarity). To de-duplicate the original series, `StringCluster` can compare each string against either the series itself or a master list of strings.
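
To make the idea concrete, here is a minimal from-scratch sketch of n-gram Tf-Idf vectors and their cosine similarity, using only the standard library. It is not `StringCluster`'s internal code: the function names (`char_ngrams`, `tfidf_vectors`, `cosine`), the use of character n-grams, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=2):
    """Lowercase, strip non-word characters, and split into character n-grams."""
    cleaned = re.sub(r'[\W_]+', '', text.lower())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def tfidf_vectors(strings, n=2):
    """Return one L2-normalized {ngram: weight} dict per input string."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: in how many strings does each n-gram appear?
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(strings)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption), so widely shared n-grams weigh less.
        weights = {g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
                   for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({g: w / norm for g, w in weights.items()})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors (the linear kernel)."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ['FACEBOOK INC', 'FACEBOOK, INC.', 'ORACLE CORP']
vecs = tfidf_vectors(names)
print(round(cosine(vecs[0], vecs[1]), 3))  # same n-grams once punctuation is stripped
print(round(cosine(vecs[0], vecs[2]), 3))  # only one bigram in common
```

Because every vector is scaled to unit length, the plain dot product already is the cosine similarity, which is why a linear kernel is sufficient.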


```python
import re

import pandas as pd

from stringcluster import StringCluster
```

### Data

As mentioned, the example dataset is a series of company names (strings). To illustrate, we'll pull out all samples that contain the string "FACEBOOK"; there are 11 unique versions of this single company's name.


```python
data = pd.read_csv('../data/companies.csv')
data.head(10)
```






  
    
      
| |company|
|---|-------|
|0|MICROSOFT CORP|
|1|APPLE INC|
|2|FACEBOOK INC|
|3|ISHARES TR|
|4|ORACLE CORP|
|5|ALPHABET INC - A|
|6|JOHNSON & JOHNSON|
|7|WESTERN DIGITAL CORP|
|8|AMAZON.COM INC|
|9|VISA INC|
    
  






```python
companies = data['company']
mask = data['company'].str.contains('FACEBOOK')
facebook = data['company'][mask]
print(f'Number of unique versions: {facebook.nunique()}')
facebook
```

    Number of unique versions: 11





    2                                           FACEBOOK INC
    408     FACEBOOK INC            CLASS                  A
    474                                    FACEBOOK INC CL A
    998                                           FACEBOOK-A
    1042                                FACEBOOK INC CLASS A
    1101                                      FACEBOOK INC A
    1448                                      FACEBOOK INC-A
    3020                                FACEBOOK INC COM NPV
    3626                                     FACEBOOK INC -A
    3638                                            FACEBOOK
    4340                                      FACEBOOK, INC.
    Name: company, dtype: object



### De-duplicating

As mentioned, `StringCluster` can be used with or without a "master" list of string representations, depending on the use case. A master list is provided as the `y` parameter in the `.fit_transform()` method. This is useful when users have a designated set of representations under which they want to group each sample.

#### Without a master list

Let's first look at use **without** a master list. The `StringCluster` transformer takes three parameters:

|Parameter|Type|Description|
|---------|----|-----------|
|`ngram_size`|int|Size of ngrams to be extracted; default 2.|
|`threshold`|float|Threshold to determine similarities; must be in the range [0, 1]; default 0.8.|
|`stop_tokens`|str|RegEx pattern to remove during tokenization; default `r'[\W_]+'`|

Although we're using Tf-Idf vectorization, and common tokens will therefore have less effect, we can improve match quality by providing a pattern of domain-specific stop tokens. In this case, we'll remove special characters, white space and any word that relates to "corporation", "incorporated", etc., prior to Tf-Idf vectorization; these variations within a company's name are meaningless.

After fitting the `StringCluster` object and transforming the data, we see that all 11 variations of "Facebook" have consolidated to "FACEBOOK INC". 

**Of note: When using `StringCluster` without a master list, the transformer defaults to replacing variations of a string representation with the first variation seen; in this case, "FACEBOOK INC".**


```python
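# Pattern of text to drop before vectorization: runs of punctuation/whitespace,
# plus corporate suffixes that appear at the very end of a name.
STOP_TOKENS = r'[\W_]+|(corporation$)|(corp.$)|(corp$)|(incorporated$)|(inc.$)|(inc$)|(company$)|(common$)|(com$)'

# A threshold of 0.7 is looser than the 0.8 default.
cluster = StringCluster(ngram_size=2, threshold=0.7, stop_tokens=STOP_TOKENS)
labels = cluster.fit_transform(data['company'])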
```
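
To see what the stop-token pattern above removes, it can be applied directly with `re.sub`. This is only a sketch: the helper `strip_stop_tokens` is hypothetical, and it assumes names are lowercased before the pattern is applied, which may not be how `StringCluster` handles case internally.

```python
import re

STOP_TOKENS = r'[\W_]+|(corporation$)|(corp.$)|(corp$)|(incorporated$)|(inc.$)|(inc$)|(company$)|(common$)|(com$)'


def strip_stop_tokens(name, pattern=STOP_TOKENS):
    """Hypothetical helper: lowercase the name, then delete every pattern match."""
    return re.sub(pattern, '', name.lower())


for raw in ['FACEBOOK INC', 'FACEBOOK, INC.', 'Intel Corp.', 'FACEBOOK INC COM NPV']:
    print(f'{raw!r} -> {strip_stop_tokens(raw)!r}')
```

Two details of the pattern are worth knowing. The suffix alternatives are anchored with `$`, so "INC" is removed only when it ends the name; in "FACEBOOK INC COM NPV" it stays. The `.` in `corp.$` and `inc.$` is unescaped, so it matches any single character, not only a literal period.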


```python
labels[facebook.index]
```




    2       FACEBOOK INC
    408     FACEBOOK INC
    474     FACEBOOK INC
    998     FACEBOOK INC
    1042    FACEBOOK INC
    1101    FACEBOOK INC
    1448    FACEBOOK INC
    3020    FACEBOOK INC
    3626    FACEBOOK INC
    3638    FACEBOOK INC
    4340    FACEBOOK INC
    Name: company, dtype: object
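
The "first variation seen" behaviour can be sketched as a greedy pass over the series. This is an illustrative assumption about how such labelling could work, not `StringCluster`'s actual algorithm: the helpers `bigram_counts`, `cosine` and `first_seen_labels` are hypothetical, and raw bigram counts stand in for Tf-Idf weights to keep the example short.

```python
import math
import re
from collections import Counter


def bigram_counts(text):
    """Count character bigrams after lowercasing and stripping non-word characters."""
    cleaned = re.sub(r'[\W_]+', '', text.lower())
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def first_seen_labels(strings, threshold=0.8):
    """Label each string with the earliest string it matches at or above threshold."""
    representatives = []  # (original string, vector) per cluster, in order seen
    labels = []
    for s in strings:
        vec = bigram_counts(s)
        for rep, rep_vec in representatives:
            if cosine(vec, rep_vec) >= threshold:
                labels.append(rep)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            representatives.append((s, vec))
            labels.append(s)
    return labels


names = ['FACEBOOK INC', 'FACEBOOK, INC.', 'ORACLE CORP', 'FACEBOOK INC-A']
print(first_seen_labels(names, threshold=0.8))
# ['FACEBOOK INC', 'FACEBOOK INC', 'ORACLE CORP', 'FACEBOOK INC']
```

Because the label is whichever variation appears first, the order of the input series determines the canonical name for each group.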



#### With a master list

Let's now look at use **with** a master list. As mentioned, the master list is passed as the `y` parameter in the `.fit()` and `.fit_transform()` methods. In this case, each string in the series is compared against the master list and replaced with the master-list representation with which it has the highest cosine similarity.


```python
TEST_SERIES = pd.Series([
    'Johnson & Johnson, Inc.', 'Johnson & Johnson Inc.', 'Johnson & Johnson Inc',
    'Johnson & Johnson', 'Intel Corp', 'Intel Corp.', 'Intel Corporation', 'Google',
    'Apple', 'Amazon', 'Amazon Inc', 'Comcast Inc.', 'Comcast Corp',
])
MASTER = ['Johnson & Johnson', 'Intel Corp', 'Google', 'Apple Inc', 'Amazon', 'Comcast']

STOP_TOKENS = r'[\W_]+|(corporation$)|(corp.$)|(corp$)|(incorporated$)|(inc.$)|(inc$)|(company$)|(common$)|(com$)'

cluster = StringCluster(ngram_size=2, stop_tokens=STOP_TOKENS)
labels = cluster.fit_transform(TEST_SERIES, MASTER)
```


```python
labels
```




    0     Johnson & Johnson
    1     Johnson & Johnson
    2     Johnson & Johnson
    3     Johnson & Johnson
    4            Intel Corp
    5            Intel Corp
    6            Intel Corp
    7                Google
    8             Apple Inc
    9                Amazon
    10               Amazon
    11              Comcast
    12              Comcast
    dtype: object
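
The "highest cosine similarity" rule amounts to an argmax over the master list. The sketch below illustrates that rule only; it is not the library's implementation. The helpers `bigram_counts`, `cosine` and `best_master_match` are hypothetical, raw bigram counts stand in for Tf-Idf weights, and no threshold is applied.

```python
import math
import re
from collections import Counter


def bigram_counts(text):
    """Count character bigrams after lowercasing and stripping non-word characters."""
    cleaned = re.sub(r'[\W_]+', '', text.lower())
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def best_master_match(strings, master):
    """Replace each string with the master entry it is most similar to."""
    master_vecs = [(m, bigram_counts(m)) for m in master]
    labels = []
    for s in strings:
        vec = bigram_counts(s)
        # Pick the master entry with the highest similarity to this string.
        best, _ = max(master_vecs, key=lambda pair: cosine(vec, pair[1]))
        labels.append(best)
    return labels


series = ['Intel Corporation', 'Amazon Inc', 'Apple']
master = ['Intel Corp', 'Amazon', 'Apple Inc']
print(best_master_match(series, master))
# ['Intel Corp', 'Amazon', 'Apple Inc']
```

Unlike the run without a master list, the output labels here come only from the master list, so input order no longer affects which name is chosen.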



### Trialing Different Threshold Values

The `StringCluster` transformer is sensitive to the `threshold` parameter (especially without a master list), because it sets the cosine similarity at which two strings are flagged as a match. Let's look at how different values of `threshold` affect results on our Facebook example.


```python
thresh = 0.7
while thresh < 1:
    cluster = StringCluster(ngram_size=2, threshold=thresh, stop_tokens=STOP_TOKENS)
    labels = cluster.fit_transform(data['company'])
    print(f'Threshold: {thresh}')
    print('----------------------------------------')
    print(labels[facebook.index])
    print('========================================')
    thresh += 0.05
```

    Threshold: 0.7
    ----------------------------------------
    2       FACEBOOK INC
    408     FACEBOOK INC
    474     FACEBOOK INC
    998     FACEBOOK INC
    1042    FACEBOOK INC
    1101    FACEBOOK INC
    1448    FACEBOOK INC
    3020    FACEBOOK INC
    3626    FACEBOOK INC
    3638    FACEBOOK INC
    4340    FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.75
    ----------------------------------------
    2               FACEBOOK INC
    408             FACEBOOK INC
    474             FACEBOOK INC
    998             FACEBOOK INC
    1042            FACEBOOK INC
    1101            FACEBOOK INC
    1448            FACEBOOK INC
    3020    FACEBOOK INC COM NPV
    3626            FACEBOOK INC
    3638            FACEBOOK INC
    4340            FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.8
    ----------------------------------------
    2                                           FACEBOOK INC
    408     FACEBOOK INC            CLASS                  A
    474                                         FACEBOOK INC
    998                                         FACEBOOK INC
    1042    FACEBOOK INC            CLASS                  A
    1101                                        FACEBOOK INC
    1448                                        FACEBOOK INC
    3020                                FACEBOOK INC COM NPV
    3626                                        FACEBOOK INC
    3638                                        FACEBOOK INC
    4340                                        FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.8500000000000001
    ----------------------------------------
    2                                           FACEBOOK INC
    408     FACEBOOK INC            CLASS                  A
    474     FACEBOOK INC            CLASS                  A
    998                                         FACEBOOK INC
    1042    FACEBOOK INC            CLASS                  A
    1101                                        FACEBOOK INC
    1448                                        FACEBOOK INC
    3020                                FACEBOOK INC COM NPV
    3626                                        FACEBOOK INC
    3638                                        FACEBOOK INC
    4340                                        FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.9000000000000001
    ----------------------------------------
    2                                           FACEBOOK INC
    408     FACEBOOK INC            CLASS                  A
    474     FACEBOOK INC            CLASS                  A
    998                                         FACEBOOK INC
    1042    FACEBOOK INC            CLASS                  A
    1101                                   FACEBOOK INC CL A
    1448                                   FACEBOOK INC CL A
    3020                                FACEBOOK INC COM NPV
    3626                                   FACEBOOK INC CL A
    3638                                        FACEBOOK INC
    4340                                        FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.9500000000000002
    ----------------------------------------
    2                                           FACEBOOK INC
    408     FACEBOOK INC            CLASS                  A
    474                                    FACEBOOK INC CL A
    998                                         FACEBOOK INC
    1042    FACEBOOK INC            CLASS                  A
    1101                                      FACEBOOK INC A
    1448                                      FACEBOOK INC A
    3020                                FACEBOOK INC COM NPV
    3626                                      FACEBOOK INC A
    3638                                        FACEBOOK INC
    4340                                        FACEBOOK INC
    Name: company, dtype: object
    ========================================
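
The threshold labels such as `0.8500000000000001` in the output above come from repeatedly adding `0.05` in floating point. If cleaner values are preferred, one option (a small sketch, not part of the library) is to derive each threshold from an integer instead of accumulating a float:

```python
# Accumulating 0.05 in floating point drifts away from the intended values.
drifting = []
thresh = 0.7
while thresh < 1:
    drifting.append(thresh)
    thresh += 0.05

# Deriving each value from an integer percentage avoids the accumulated error.
thresholds = [pct / 100 for pct in range(70, 100, 5)]

print(drifting)
print(thresholds)
```

The `thresholds` list can then be iterated with a plain `for` loop in place of the `while` loop.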

Owner

Name: Chris Santiago
Login: chris-santiago
Kind: user

Repositories: 64
Profile: https://github.com/chris-santiago


Committers

Last synced: over 1 year ago

All Time

Total Commits: 16
Total Committers: 2
Avg Commits per committer: 8.0
Development Distribution Score (DDS): 0.25

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
chris-santiago	c**o@g**u	12
Chris Santiago	4****o	4

Committer Domains (Top 20 + Academic)

gatech.edu: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Science Score: 10.0%
