namematch

Tool for probabilistically linking the records of individual entities (e.g. people) within and across datasets

https://github.com/urban-labs/namematch

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 1 of 3 committers (33.3%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: urban-labs
License: agpl-3.0
Language: Python
Default Branch: main
Size: 10.2 MB

Statistics

Stars: 117
Watchers: 4
Forks: 4
Open Issues: 4
Releases: 1

Created over 5 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog Contributing License Citation

Name Match

About the Project

Tool for probabilistically linking the records of individual entities (e.g. people) within and across datasets.

The code was originally developed for linking records in criminal justice-related datasets (arrests, victimizations, city programs, school records, etc.) using at least first name, last name, date of birth, and age (some missingness in DOB and age is tolerated). If available, other data fields like middle initial, race, gender, address, and zipcode can be included to strengthen the quality of the match.

Project Link: https://urban-labs.github.io/namematch/

Getting Started

Installation

pip install namematch

Name Match has been tested with Python 3.7 and 3.8 on both Linux and Windows. Note that Name Match does not currently work with Python 3.9 on Windows because of its dependency on NMSLIB.

Reference

Requirements of the input data

Name Match links records by training a supervised machine learning model that is then used to predict the likelihood that two records "match" (refer to the same person or entity). To build this model, the algorithm needs training data with ground-truth "match" or "non-match" labels. In other words, it needs a way of generating a set of record pairs where it knows whether or not the records should be linked. Fortunately, if a subset of the records being input into Name Match already has a unique identifier like Social Security Number (SSN) or Fingerprint ID, Name Match is able to generate the training data it needs.

To see an example of this, say you are linking two datasets: dataset A and dataset B. People in dataset A can show up multiple times and can be uniquely identified via SSN. People in dataset B cannot be uniquely identified by any existing data field (hence the reason for using Name Match). If John (SSN 123) has two records in dataset A, we have found an example of two records that we know are a match. If Jane (SSN 456) also has a record in dataset A, we have found an example of two records that we know are NOT a match (Jane's record and either of John's records). Already we are on our way to building a training dataset for the Name Match model to learn from.
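The John and Jane example can be sketched in a few lines of Python. This is only an illustration of how labeled pairs follow from a known identifier, not Name Match's actual implementation; the record layout and the `labeled_pairs` helper are hypothetical.

```python
from itertools import combinations

# Hypothetical records from dataset A, where SSN is a known unique identifier.
records = [
    {"record_id": "A1", "first_name": "John", "ssn": "123"},
    {"record_id": "A2", "first_name": "Jonathan", "ssn": "123"},
    {"record_id": "A3", "first_name": "Jane", "ssn": "456"},
]


def labeled_pairs(records, id_key="ssn"):
    """Label every pair of records as a match (1) or non-match (0) by comparing IDs."""
    pairs = []
    for left, right in combinations(records, 2):
        label = int(left[id_key] == right[id_key])
        pairs.append((left["record_id"], right["record_id"], label))
    return pairs


training_pairs = labeled_pairs(records)
for pair in training_pairs:
    print(pair)
```

The two records sharing SSN 123 produce one known match, and each of them paired with the SSN 456 record produces a known non-match.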

To facilitate the above process and make using Name Match possible, a portion of the input data must meet the following criteria:

* Already have a unique person or entity identifier that can be used to link records (e.g. SSN or Fingerprint ID)
* Be granular enough that some people or entities appear multiple times (e.g. the same person being arrested two or three times)
* Contain inconsistencies in identifying fields like name and date of birth (e.g. arrested once as John Browne and once as Jonathan Brown)
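Before running a match, it can help to check the criteria above against your own data. The following is a minimal sketch under stated assumptions: `check_training_subset` is a hypothetical helper (not part of Name Match), and the field names and sample records are made up for illustration.

```python
from collections import defaultdict


def check_training_subset(records, id_key, name_keys):
    """Summarize whether records with a known ID could yield training data."""
    by_id = defaultdict(list)
    for rec in records:
        if rec.get(id_key):  # skip records without an identifier
            by_id[rec[id_key]].append(rec)

    # IDs that appear on more than one record (criterion 2)
    repeated = {k: v for k, v in by_id.items() if len(v) > 1}
    # Repeated IDs whose identifying fields differ across records (criterion 3)
    inconsistent = [
        k for k, recs in repeated.items()
        if len({tuple(r[n] for n in name_keys) for r in recs}) > 1
    ]
    return {
        "records_with_id": sum(len(v) for v in by_id.values()),
        "ids_seen_more_than_once": len(repeated),
        "ids_with_inconsistent_fields": len(inconsistent),
    }


# Hypothetical arrest records; the last one has no fingerprint ID.
arrests = [
    {"first_name": "John", "last_name": "Browne", "fingerprint_id": "F1"},
    {"first_name": "Jonathan", "last_name": "Brown", "fingerprint_id": "F1"},
    {"first_name": "Jane", "last_name": "Doe", "fingerprint_id": "F2"},
    {"first_name": "Sam", "last_name": "Lee", "fingerprint_id": ""},
]
summary = check_training_subset(arrests, "fingerprint_id", ["first_name", "last_name"])
print(summary)
```

If all three counts are greater than zero, the subset at least has the shape described above; how much such data Name Match needs in practice is not something this sketch can tell you.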

Usage

Package usage

```python
# Import path assumed; see examples/python_usage/link_data.py to confirm.
from namematch.namematch import NameMatcher

config = {
    # Input files and the column that uniquely identifies each record
    'data_files': {
        'datasetA': {
            'filepath': '../preprocessed_data/datasetA.csv',
            'record_id_col': 'record_id'
        },
        'datasetB': {
            'filepath': '../preprocessed_data/datasetB.csv',
            'record_id_col': 'record_num'
        }
    },

    # Fields to compare, mapped to the column name used in each dataset
    'variables': [
        {
            'name': 'first_name',
            'compare_type': 'String',
            'datasetA': 'first_name',
            'datasetB': 'fname',
        }, {
            'name': 'last_name',
            'compare_type': 'String',
            'datasetA': 'last_name',
            'datasetB': 'lname',
        }, {
            'name': 'dob',
            'compare_type': 'Date',
            'datasetA': 'date_of_birth',
            'datasetB': 'dob',
        }, {
            'name': 'social_security_number',
            'compare_type': 'UniqueID',
            'datasetA': 'ssn',
            'datasetB': ''  # dataset B has no SSN column
        }
    ]
}

nm = NameMatcher(config=config)
nm.run()
```

See examples/end_to_end_tutorial.ipynb or examples/python_usage/link_data.py for a full runnable example.
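In the config shown above, every entry in `variables` carries one key per dataset named in `data_files`, using an empty string when a dataset lacks that column. Assuming that pattern is expected (an inference from the example, not a documented rule), a small check like the one below can catch a forgotten mapping before a long run. `missing_dataset_keys` is a hypothetical helper and is not part of Name Match.

```python
def missing_dataset_keys(config):
    """Return (variable name, dataset name) pairs where the dataset key is absent."""
    missing = []
    for variable in config["variables"]:
        for dataset_name in config["data_files"]:
            # An empty string counts as present: it marks "no such column here".
            if dataset_name not in variable:
                missing.append((variable["name"], dataset_name))
    return missing


# Trimmed, hypothetical config with one mapping deliberately left out.
example_config = {
    "data_files": {
        "datasetA": {"filepath": "datasetA.csv", "record_id_col": "record_id"},
        "datasetB": {"filepath": "datasetB.csv", "record_id_col": "record_num"},
    },
    "variables": [
        {"name": "first_name", "compare_type": "String",
         "datasetA": "first_name", "datasetB": "fname"},
        {"name": "dob", "compare_type": "Date",
         "datasetA": "date_of_birth"},  # datasetB mapping is missing
        {"name": "social_security_number", "compare_type": "UniqueID",
         "datasetA": "ssn", "datasetB": ""},
    ],
}

missing = missing_dataset_keys(example_config)
print(missing)
```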

Command line tool usage

cd examples/command_line_usage/
namematch --config-file=config.yaml --output-dir=nm_output --cluster-constraints-file=constraints.py run

For more details, see examples/command_line_usage/README.md.

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

All contributions -- to code, documentation, tests, examples, etc. -- are greatly appreciated. For more detailed information, see CONTRIBUTING.md.

1. Fork the project
2. Create your feature branch (git checkout -b some-feature)
3. Commit your changes (git commit -m 'Add some amazing feature')
4. Push to the branch (git push origin some-feature)
5. Open a pull request

License

Distributed under the GNU Affero General Public License v3.0. See LICENSE for more information.

Team

Melissa McNeill, UChicago Crime and Education Labs

Eddie Tzu-Yun Lin, UChicago Crime and Education Labs

Zubin Jelveh, University of Maryland

Citation

If you use Name Match in an academic work, please give this citation:

Zubin Jelveh, Melissa McNeill, and Tzu-Yun Lin. 2022. Name Match. https://github.com/urban-labs/namematch.

Citation (citation.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Name Match
message: >-
  If you use Name Match in an academic work, please
  give this citation:
type: software
authors:
  - given-names: Zubin
    family-names: Jelveh
  - given-names: Melissa
    family-names: McNeill
  - given-names: Tzu-Yun
    family-names: Lin
identifiers:
  - type: url
    value: 'https://github.com/urban-labs/namematch'
    description: Github Project URL
repository-code: 'https://github.com/urban-labs/namematch'
url: 'https://github.com/urban-labs/namematch'
repository-artifact: 'https://pypi.org/project/namematch/'
abstract: >-
  Tool for probabilistically linking the records of
  individual entities (e.g. people) within and across
  datasets.


  The code was originally developed for linking
  records in criminal justice-related datasets
  (arrests, victimizations, city programs, school
  records, etc.) using at least first name, last
  name, date of birth, and age (some missingness in
  DOB and age is tolerated). If available, other data
  fields like middle initial, race, gender, address,
  and zipcode can be included to strengthen the
  quality of the match.


  Project Link:
  https://urban-labs.github.io/namematch/
license: AGPL-3.0-only

GitHub Events

Total

Issues event: 2
Watch event: 2
Issue comment event: 1
Push event: 2
Pull request event: 1
Pull request review comment event: 1
Pull request review event: 2
Fork event: 1

Last Year

Issues event: 2
Watch event: 2
Issue comment event: 1
Push event: 2
Pull request event: 1
Pull request review comment event: 1
Pull request review event: 2
Fork event: 1

Committers

Last synced: over 3 years ago

All Time

Total Commits: 33
Total Committers: 3
Avg Commits per committer: 11.0
Development Distribution Score (DDS): 0.303

Top Committers

Name	Email	Commits
Eddie Lin	t**n@g**m	23
Melissa McNeill	m**3@g**m	6
zjelveh	z**h@u**u	4

Committer Domains (Top 20 + Academic)

uchicago.edu: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 5
Total pull requests: 15
Average time to close issues: 8 months
Average time to close pull requests: about 1 month
Total issue authors: 3
Total pull request authors: 4
Average comments per issue: 0.0
Average comments per pull request: 0.4
Merged pull requests: 14
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

tweddielin (3)
mmcneill (1)
teddythepooh (1)

Pull Request Authors

tweddielin (12)
mmcneill (2)
jameshowison (1)
zjelveh (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 22 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2
Total maintainers: 2

pypi.org: namematch

Homepage: https://github.com/urban-labs/namematch
Documentation: https://urban-labs.github.io/namematch/
License: AGPL-3.0
Latest release: 1.2.1
published over 3 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 22 Last month

Rankings

Dependent packages count: 6.6%

Stargazers count: 7.4%

Average: 18.8%

Forks count: 19.6%

Downloads: 29.7%

Dependent repos count: 30.6%

Maintainers (2)

mmcneill tweddielin

Last synced: 10 months ago

Dependencies

requirement/dev.txt pypi

karma_sphinx_theme * development
pytest * development
pytest-cov * development
sphinx * development
sphinx-autobuild * development

requirement/main.txt pypi

Dickens >=1.0.1
Fuzzy ==1.2.2
NameProbability 03de54f8d964e3d74accb39e7089bcac345beffb
argcmdr >=0.7.0
coloredlogs ==14.0
editdistance ==0.6.0
ipykernel ==6.16.0
ipywidgets *
jellyfish ==0.8.9
line_profiler ==3.3.1
memory_profiler fdf4488ffe42c588bfa632537e9a959e4b36bf83
nbconvert ==6.5.2
networkx ==2.6.3
nmslib >=2.1.1,<2.2
numpy >=1.20.1
pandas ==1.3.4
papermill ==2.4.0
pyarrow ==7.0.0
pyjarowinkler ==1.8
python-levenshtein ==0.12.2
pyyaml ==5.1
ruamel.yaml ==0.17.17
scikit-learn ==1.0.1
street-address ==0.4.0

.github/workflows/static.yml actions

actions/checkout v3 composite
actions/configure-pages v2 composite
actions/deploy-pages v1 composite
actions/upload-pages-artifact v1 composite

setup.py pypi