https://github.com/chris-santiago/stringcluster
A Scikit-Learn style deduper.
Keywords
dedupe
deduplication
scikit-learn
text-processing
text-similarity
transformer
Repository
Basic Info
- Host: GitHub
- Owner: chris-santiago
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://chris-santiago.github.io/stringcluster/
- Size: 345 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Created over 4 years ago
· Last pushed over 3 years ago
https://github.com/chris-santiago/stringcluster/blob/master/
# string-cluster
## Install
Create a virtual environment with Python 3.9 and install from git:
```bash
pip install git+https://github.com/chris-santiago/stringcluster.git
```
## Use
### Preliminaries
This example shows how to use `StringCluster` to deduplicate a list of public company names. The example dataset is a series of company names and their variations.
`StringCluster` uses Tf-Idf vectorization to split each string in a series into n-gram tokens and normalize the count of each token. It then builds a cosine similarity matrix from these vectors by computing their linear kernel (for L2-normalized vectors, the linear kernel equals the cosine similarity). To de-duplicate the original series, `StringCluster` can compare each string against either the other strings in the series or a master list of strings.
```python
import re
import pandas as pd
from stringcluster import StringCluster
```
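The vectorization and similarity steps described above can be sketched with the standard library alone. This is an illustration of the general technique (character n-grams, smoothed Tf-Idf weights, cosine similarity on unit-length vectors), not the package's actual implementation; the function names `char_ngrams`, `tfidf_vectors`, and `cosine` are hypothetical, and the lowercasing and smoothed-IDF formula are assumptions.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=2):
    """Lowercase, strip non-word characters, and split into character n-grams."""
    cleaned = re.sub(r'[\W_]+', '', text.lower())
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def tfidf_vectors(strings, n=2):
    """Return one L2-normalized Tf-Idf vector (as a dict) per input string."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    # Document frequency: number of strings in which each n-gram appears.
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(strings)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption), so widely shared n-grams weigh less.
        vec = {g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ['FACEBOOK INC', 'FACEBOOK, INC.', 'APPLE INC']
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # the two strings are identical after cleaning
print(cosine(vecs[0], vecs[2]))  # only the "in" and "nc" n-grams are shared
```

Because every vector is scaled to unit length, the plain dot product is already the cosine similarity, which is why a linear kernel suffices.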
### Data
As mentioned, the example dataset is a series of company names (strings). To illustrate, we'll pull out all samples that contain the string "FACEBOOK"; we have 11 unique versions for this single company.
```python
data = pd.read_csv('../data/companies.csv')
data.head(10)
```
| |company|
|---|---|
|0|MICROSOFT CORP|
|1|APPLE INC|
|2|FACEBOOK INC|
|3|ISHARES TR|
|4|ORACLE CORP|
|5|ALPHABET INC - A|
|6|JOHNSON & JOHNSON|
|7|WESTERN DIGITAL CORP|
|8|AMAZON.COM INC|
|9|VISA INC|
```python
companies = data['company']
mask = data['company'].str.contains('FACEBOOK')
facebook = data['company'][mask]
print(f'Number of unique version: {facebook.nunique()}')
facebook
```
    Number of unique version: 11
    2       FACEBOOK INC
    408     FACEBOOK INC CLASS A
    474     FACEBOOK INC CL A
    998     FACEBOOK-A
    1042    FACEBOOK INC CLASS A
    1101    FACEBOOK INC A
    1448    FACEBOOK INC-A
    3020    FACEBOOK INC COM NPV
    3626    FACEBOOK INC -A
    3638    FACEBOOK
    4340    FACEBOOK, INC.
    Name: company, dtype: object
### De-duplicating
As mentioned, `StringCluster` can be used with or without a "master" list of string representations, depending on the use case. A master list is provided as the `y` parameter of the `.fit_transform()` method. This is useful when users have a designated set of representations under which they wish to group each sample.
#### Without a master list
Let's first take a look at use **without** a master list. The `StringCluster` transformer takes three parameters:
|Parameter|Type|Description|
|---------|----|-----------|
|`ngram_size`|int|Size of the n-grams to extract; default 2.|
|`threshold`|float|Minimum cosine similarity for two strings to count as a match; must be in the range [0, 1]; default 0.8.|
|`stop_tokens`|str|RegEx pattern to remove during tokenization; default `r'[\W_]+'`.|
Although we're using Tf-Idf vectorization, so common tokens already have less effect, we can improve results by providing a RegEx pattern of domain-specific stop tokens. In this case, we'll remove special characters, whitespace, and any trailing word that relates to "corporation", "incorporated", etc., prior to Tf-Idf vectorization, because these variations within a company's name are meaningless.
After fitting the `StringCluster` object and transforming the data, we see that all 11 variations of "Facebook" have been consolidated to "FACEBOOK INC".
**Of note: When using `StringCluster` without a master list, the transformer defaults to replacing variations of a string representation with the first variation seen; in this case, "FACEBOOK INC".**
```python
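# Remove non-word characters plus trailing corporate suffixes.
# Note: the unescaped '.' in 'corp.$' and 'inc.$' matches any character, not only a literal period.
STOP_TOKENS = r'[\W_]+|(corporation$)|(corp.$)|(corp$)|(incorporated$)|(inc.$)|(inc$)|(company$)|(common$)|(com$)'
# No master list is passed, so each variation maps to the first matching variation seen.
cluster = StringCluster(ngram_size=2, threshold=0.7, stop_tokens=STOP_TOKENS)
labels = cluster.fit_transform(data['company'])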
```
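The effect of the stop-token pattern can be checked in isolation with `re.sub`. This is only a sketch: the helper `strip_stop_tokens` is hypothetical, and lowercasing the name before applying the pattern is an assumption about how the transformer preprocesses its input.

```python
import re

STOP_TOKENS = r'[\W_]+|(corporation$)|(corp.$)|(corp$)|(incorporated$)|(inc.$)|(inc$)|(company$)|(common$)|(com$)'


def strip_stop_tokens(name, pattern=STOP_TOKENS):
    """Hypothetical helper: lowercase a name and delete every match of the pattern."""
    return re.sub(pattern, '', name.lower())


for name in ['FACEBOOK INC', 'FACEBOOK, INC.', 'FACEBOOK INC CLASS A', 'Intel Corporation']:
    print(f'{name!r} -> {strip_stop_tokens(name)!r}')
```

Suffixes are anchored with `$`, so "INC" is removed only at the end of a name; in "FACEBOOK INC CLASS A" it survives, which is one reason the share-class variants are harder to merge.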
```python
labels[facebook.index]
```
    2       FACEBOOK INC
    408     FACEBOOK INC
    474     FACEBOOK INC
    998     FACEBOOK INC
    1042    FACEBOOK INC
    1101    FACEBOOK INC
    1448    FACEBOOK INC
    3020    FACEBOOK INC
    3626    FACEBOOK INC
    3638    FACEBOOK INC
    4340    FACEBOOK INC
    Name: company, dtype: object
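The "first variation seen" behavior can be illustrated with a small greedy grouping routine. This is a sketch under stated assumptions, not the package's code: `group_to_first_seen` is a hypothetical function, and `bigram_jaccard` is a simple stand-in similarity used in place of Tf-Idf cosine similarity to keep the example short.

```python
def bigram_jaccard(a, b):
    """Stand-in similarity (an assumption): Jaccard overlap of character bigrams."""
    grams_a = {a.lower()[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b.lower()[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def group_to_first_seen(strings, similarity, threshold=0.8):
    """Replace each string with the label of the earliest string it matches."""
    labels = []
    for i, current in enumerate(strings):
        label = current  # no earlier match: the string represents itself
        for j in range(i):
            if similarity(strings[j], current) >= threshold:
                label = labels[j]  # reuse the earlier string's representative
                break
        labels.append(label)
    return labels


names = ['FACEBOOK INC', 'APPLE INC', 'FACEBOOK INC-A', 'FACEBOOK INC', 'APPLE INC.']
print(group_to_first_seen(names, bigram_jaccard, threshold=0.7))
```

The representative of each group depends on input order, so sorting or reordering the series can change which variation ends up as the label.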
#### With a master list
Now let's take a look at use **with** a master list. As mentioned, the master list is passed as the `y` parameter of the `.fit()` and `.fit_transform()` methods. In this case, each string in the series is compared against the master list and replaced with the master-list representation with which it has the highest cosine similarity.
```python
TEST_SERIES = pd.Series(
    ['Johnson & Johnson, Inc.', 'Johnson & Johnson Inc.', 'Johnson & Johnson Inc',
     'Johnson & Johnson', 'Intel Corp', 'Intel Corp.', 'Intel Corporation', 'Google',
     'Apple', 'Amazon', 'Amazon Inc', 'Comcast Inc.', 'Comcast Corp']
)
MASTER = ['Johnson & Johnson', 'Intel Corp', 'Google', 'Apple Inc', 'Amazon', 'Comcast']
STOP_TOKENS = r'[\W_]+|(corporation$)|(corp.$)|(corp$)|(incorporated$)|(inc.$)|(inc$)|(company$)|(common$)|(com$)'
cluster = StringCluster(ngram_size=2, stop_tokens=STOP_TOKENS)
labels = cluster.fit_transform(TEST_SERIES, MASTER)
```
```python
labels
```
    0     Johnson & Johnson
    1     Johnson & Johnson
    2     Johnson & Johnson
    3     Johnson & Johnson
    4     Intel Corp
    5     Intel Corp
    6     Intel Corp
    7     Google
    8     Apple Inc
    9     Amazon
    10    Amazon
    11    Comcast
    12    Comcast
    dtype: object
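Matching against a master list amounts to picking, for each string, the master entry with the highest similarity. The sketch below shows that selection step only; `match_to_master` is a hypothetical function, `bigram_jaccard` is a stand-in for Tf-Idf cosine similarity, and ignoring the `threshold` parameter here is an assumption rather than documented behavior.

```python
def bigram_jaccard(a, b):
    """Stand-in similarity (an assumption): Jaccard overlap of character bigrams."""
    grams_a = {a.lower()[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b.lower()[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def match_to_master(strings, master, similarity):
    """Replace each string with the master entry it is most similar to."""
    return [max(master, key=lambda candidate: similarity(s, candidate)) for s in strings]


MASTER = ['Johnson & Johnson', 'Intel Corp', 'Google', 'Apple Inc', 'Amazon', 'Comcast']
samples = ['Intel Corporation', 'Apple', 'Amazon Inc', 'Comcast Corp', 'Johnson & Johnson, Inc.']
print(match_to_master(samples, MASTER, bigram_jaccard))
```

Unlike the first-seen approach, the output here does not depend on the order of the input series, because every label comes from the fixed master list.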
### Trialing Different Threshold Values
The `StringCluster` transformer is sensitive to the `threshold` parameter (especially without a master list), as this controls how matches are flagged, based on their cosine similarity. Let's take a look at how varying levels of the `threshold` parameter affect results on our Facebook example.
```python
thresh = 0.7
# Refit at each threshold from 0.7 up to (but not including) 1, in steps of 0.05.
while thresh < 1:
    cluster = StringCluster(ngram_size=2, threshold=thresh, stop_tokens=STOP_TOKENS)
    labels = cluster.fit_transform(data['company'])
    print(f'Threshold: {thresh}')
    print('----------------------------------------')
    print(labels[facebook.index])
    print('========================================')
    thresh += 0.05
```
    Threshold: 0.7
    ----------------------------------------
    2       FACEBOOK INC
    408     FACEBOOK INC
    474     FACEBOOK INC
    998     FACEBOOK INC
    1042    FACEBOOK INC
    1101    FACEBOOK INC
    1448    FACEBOOK INC
    3020    FACEBOOK INC
    3626    FACEBOOK INC
    3638    FACEBOOK INC
    4340    FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.75
    ----------------------------------------
    2       FACEBOOK INC
    408     FACEBOOK INC
    474     FACEBOOK INC
    998     FACEBOOK INC
    1042    FACEBOOK INC
    1101    FACEBOOK INC
    1448    FACEBOOK INC
    3020    FACEBOOK INC COM NPV
    3626    FACEBOOK INC
    3638    FACEBOOK INC
    4340    FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.8
    ----------------------------------------
    2       FACEBOOK INC
    408     FACEBOOK INC CLASS A
    474     FACEBOOK INC
    998     FACEBOOK INC
    1042    FACEBOOK INC CLASS A
    1101    FACEBOOK INC
    1448    FACEBOOK INC
    3020    FACEBOOK INC COM NPV
    3626    FACEBOOK INC
    3638    FACEBOOK INC
    4340    FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.8500000000000001
    ----------------------------------------
    2       FACEBOOK INC
    408     FACEBOOK INC CLASS A
    474     FACEBOOK INC CLASS A
    998     FACEBOOK INC
    1042    FACEBOOK INC CLASS A
    1101    FACEBOOK INC
    1448    FACEBOOK INC
    3020    FACEBOOK INC COM NPV
    3626    FACEBOOK INC
    3638    FACEBOOK INC
    4340    FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.9000000000000001
    ----------------------------------------
    2       FACEBOOK INC
    408     FACEBOOK INC CLASS A
    474     FACEBOOK INC CLASS A
    998     FACEBOOK INC
    1042    FACEBOOK INC CLASS A
    1101    FACEBOOK INC CL A
    1448    FACEBOOK INC CL A
    3020    FACEBOOK INC COM NPV
    3626    FACEBOOK INC CL A
    3638    FACEBOOK INC
    4340    FACEBOOK INC
    Name: company, dtype: object
    ========================================
    Threshold: 0.9500000000000002
    ----------------------------------------
    2       FACEBOOK INC
    408     FACEBOOK INC CLASS A
    474     FACEBOOK INC CL A
    998     FACEBOOK INC
    1042    FACEBOOK INC CLASS A
    1101    FACEBOOK INC A
    1448    FACEBOOK INC A
    3020    FACEBOOK INC COM NPV
    3626    FACEBOOK INC A
    3638    FACEBOOK INC
    4340    FACEBOOK INC
    Name: company, dtype: object
    ========================================
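The thresholds printed above (for example `0.8500000000000001`) show floating-point error accumulating as `0.05` is added repeatedly. A small sketch of one way to avoid this when sweeping thresholds is to derive each value from an integer step; the variable names here are illustrative only.

```python
# Repeated addition accumulates rounding error in binary floating point.
drifting = []
thresh = 0.7
while thresh < 1:
    drifting.append(thresh)
    thresh += 0.05

# Dividing integer steps yields the nearest float to each intended value.
thresholds = [step / 100 for step in range(70, 100, 5)]

for accumulated, exact in zip(drifting, thresholds):
    print(f'accumulated: {accumulated!r:<22} from integer step: {exact!r}')
```

Each value in `thresholds` can then be passed as `threshold=` in the loop above, which keeps the printed labels clean and avoids surprises when a similarity sits right at the cutoff.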
Owner
- Name: Chris Santiago
- Login: chris-santiago
- Kind: user
- Repositories: 64
- Profile: https://github.com/chris-santiago
Committers
Last synced: over 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| chris-santiago | c****o@g****u | 12 |
| Chris Santiago | 4****o | 4 |
Committer Domains (Top 20 + Academic)
gatech.edu: 1
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0