https://github.com/datamade/probablepeople

A Python library for parsing unstructured western names into name components.

Keywords

names parse python

Keywords from Contributors

datamade de-duplicating dedupe dedupe-library entity-resolution record-linkage chicago councilmatic

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: datamade
License: mit
Language: Python
Default Branch: main
Homepage: http://parserator.datamade.us/probablepeople
Size: 16.8 MB

Statistics

Stars: 609
Watchers: 28
Forks: 74
Open Issues: 66
Releases: 0

Topics

names parse python

Created over 11 years ago · Last pushed 9 months ago

Metadata Files

Readme License

probablepeople

probablepeople is a Python library for parsing unstructured romanized name or company strings into components, using advanced NLP methods. It is based on usaddress, a Python library for parsing addresses.

Try it out on our web interface! For those who aren't python developers, we also have an API.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying name or corporation components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify components with perfect accuracy, nor can it verify that a given name/company is correct/valid.

probablepeople learns how to parse names/companies through a body of training data. If you have examples of names/companies that stump this parser, please send them over! By adding more examples to the training data, probablepeople can continue to learn and improve.

How to use the probablepeople python library

Install probablepeople with pip, a tool for installing and managing python packages (beginner's guide here)

In the terminal,

```
pip install probablepeople  
```

Parse some names/companies!

Note that parse and tag are different methods:

```python
import probablepeople as pp

namestr = 'Mr George "Gob" Bluth II'
corpstr = 'Sitwell Housing Inc'

# The parse method will split your string into components, and label each component.
pp.parse(namestr)
# expected output: [('Mr', 'PrefixMarital'), ('George', 'GivenName'), ('"Gob"', 'Nickname'), ('Bluth', 'Surname'), ('II', 'SuffixGenerational')]
pp.parse(corpstr)
# expected output: [('Sitwell', 'CorporationName'), ('Housing', 'CorporationName'), ('Inc', 'CorporationLegalType')]

# The tag method will try to be a little smarter:
# it will merge consecutive components, strip commas, & return a string type
pp.tag(namestr)
# expected output: (OrderedDict([('PrefixMarital', 'Mr'), ('GivenName', 'George'), ('Nickname', '"Gob"'), ('Surname', 'Bluth'), ('SuffixGenerational', 'II')]), 'Person')
pp.tag(corpstr)
# expected output: (OrderedDict([('CorporationName', 'Sitwell Housing'), ('CorporationLegalType', 'Inc')]), 'Corporation')
```
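To make the difference between the two return shapes concrete, here is a minimal sketch of the kind of merging that tag is described as doing, applied to parse-style output. It is not probablepeople's own code: merge_labeled_tokens is a hypothetical helper, and raising ValueError when a label reappears after a gap is an assumption made for this illustration.

```python
from collections import OrderedDict


def merge_labeled_tokens(pairs):
    """Collapse (token, label) pairs into one string per label, keeping label order."""
    grouped = OrderedDict()
    previous_label = None
    for token, label in pairs:
        if label == previous_label:
            # Same label as the previous token: extend the current component.
            grouped[label].append(token)
        elif label in grouped:
            # Assumption for this sketch: a label split across two places is an error.
            raise ValueError(f"label {label!r} appears in two separate places")
        else:
            grouped[label] = [token]
        previous_label = label
    # Join each component's tokens and strip stray spaces and commas from the ends.
    return OrderedDict(
        (label, " ".join(tokens).strip(" ,")) for label, tokens in grouped.items()
    )


# Sample input copied from the parse output documented above.
parsed_corp = [
    ("Sitwell", "CorporationName"),
    ("Housing", "CorporationName"),
    ("Inc", "CorporationLegalType"),
]
print(merge_labeled_tokens(parsed_corp))
```

Keeping the labels in an ordered mapping preserves the order in which components appeared in the original string, which is why the documented tag output reads left to right.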

Links:

Documentation: https://probablepeople.readthedocs.io/
Web Interface: http://parserator.datamade.us/probablepeople
Distribution: https://pypi.python.org/pypi/probablepeople
Repository: https://github.com/datamade/probablepeople
Issues: https://github.com/datamade/probablepeople/issues
Blog post: https://datamade.us/blog/parse-name-or-parse-anything-really

For the nerds:

Probablepeople uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train probablepeople's model (a .crfsuite settings file) on labeled training data, and provides tools for easily adding new labeled training data.

Building & testing development code

```console
git clone https://github.com/datamade/probablepeople.git
cd probablepeople
pip install -e .
pytest
```

Creating/adding labeled training data (.xml outfile) from unlabeled raw data (.csv infile)

If there are name/company formats that the parser isn't performing well on, you can add them to training data. As probablepeople continually learns about new cases, it will continually become smarter and more robust.

NOTE: The model doesn't need many examples to learn about new patterns - if you are trying to get probablepeople to perform better on a specific type of name, start with a few (<5) examples, check performance, and then add more examples as necessary.

For this parser, we are keeping person names and organization names separate in the training data. The two training files used to produce the model are:

- name_data/labeled/person_labeled.xml for people
- name_data/labeled/company_labeled.xml for organizations
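Before adding examples, it can help to look at what a labeled file contains. The sketch below reads labeled examples back into (token, label) pairs. It assumes a layout in which a NameCollection root holds one Name element per example, with one child element per token named after its label; that layout and the read_labeled_examples helper are assumptions for illustration, so compare against the actual files in name_data/labeled/ before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed layout; real training files may differ.
SAMPLE_XML = """
<NameCollection>
  <Name><PrefixMarital>Mr</PrefixMarital> <GivenName>George</GivenName> <Surname>Bluth</Surname></Name>
  <Name><CorporationName>Sitwell</CorporationName> <CorporationName>Housing</CorporationName> <CorporationLegalType>Inc</CorporationLegalType></Name>
</NameCollection>
"""


def read_labeled_examples(xml_text):
    """Return one list of (token, label) pairs per labeled example."""
    root = ET.fromstring(xml_text)
    examples = []
    for example in root.iter("Name"):
        # Each child element's tag is the label and its text is the token.
        examples.append([(child.text, child.tag) for child in example])
    return examples


for example in read_labeled_examples(SAMPLE_XML):
    print(example)
```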

To add your own training examples, first put your unlabeled raw data in a csv. Then:

parserator label [infile] [outfile] probablepeople

[infile] is your raw csv and [outfile] is the appropriate training file to write to. For example, if you put raw strings in my_companies.csv, you'd use parserator label my_companies.csv name_data/labeled/company_labeled.xml probablepeople

The parserator label command will start a console labeling task, where you will be prompted to label raw strings via the command line. For more info on using parserator, see the parserator documentation.
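If your raw strings live in Python rather than in a spreadsheet, a short script can produce the input csv. This is a sketch under an assumption: it writes one unlabeled string per row in a single column with no header, which you should check against the parserator documentation. The write_raw_strings helper and the second sample string are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_raw_strings(strings, csv_path):
    """Write one unlabeled string per row, single column, no header."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for raw in strings:
            # The csv module quotes strings that contain commas.
            writer.writerow([raw])


examples = ["Sitwell Housing Inc", "Bluth Company, LLC"]
csv_path = Path(tempfile.mkdtemp()) / "my_companies.csv"
write_raw_strings(examples, csv_path)
print(csv_path.read_text(encoding="utf-8"))
```

Using the csv module instead of joining strings by hand matters here because company names often contain commas, which would otherwise split one name across columns.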

Re-training the model

If you've added new training data, you will need to re-train the model. To set multiple files as traindata, separate them with commas.

parserator train [traindata] probablepeople

probablepeople allows for multiple model files: person for person names only, company for company names only, or generic for both. Here are examples of commands for training models:

parserator train name_data/labeled/person_labeled.xml,name_data/labeled/company_labeled.xml probablepeople --modelfile=generic
parserator train name_data/labeled/person_labeled.xml probablepeople --modelfile=person
parserator train name_data/labeled/company_labeled.xml probablepeople --modelfile=company
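When re-training is scripted, the comma-separated traindata argument is easy to get wrong. The sketch below only assembles the command line from a list of files and does not run anything. build_train_command is a hypothetical helper; the file paths and the --modelfile values are the ones shown above.

```python
import shlex


def build_train_command(training_files, modelfile=None, module="probablepeople"):
    """Assemble the argument list for a `parserator train` invocation."""
    # Multiple training files are passed as one comma-separated argument.
    argv = ["parserator", "train", ",".join(training_files), module]
    if modelfile is not None:
        argv.append(f"--modelfile={modelfile}")
    return argv


generic = build_train_command(
    [
        "name_data/labeled/person_labeled.xml",
        "name_data/labeled/company_labeled.xml",
    ],
    modelfile="generic",
)
print(shlex.join(generic))
```

The resulting list can be handed to subprocess.run(generic, check=True) from the repository root once parserator is installed.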

Errors and Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it here by creating an issue: https://github.com/datamade/probablepeople/issues

Help us fix the problem as quickly as possible by following Mozilla's guidelines for reporting bugs.

Patches and Pull Requests

Your patches are welcome. Here's our suggested workflow:

Fork the project.
Add your labeled examples.
Send us a pull request with a description of your work.

Copyright

Owner

Name: datamade
Login: datamade
Kind: organization
Email: info@datamade.us
Location: Chicago, IL

Website: http://datamade.us
Twitter: datamadeco
Repositories: 123
Profile: https://github.com/datamade

We build open source technology using open data to empower journalists, researchers, governments and advocacy organizations.

GitHub Events

Total

Watch event: 21
Delete event: 1
Issue comment event: 2
Push event: 11
Pull request event: 4
Fork event: 3
Create event: 2
Commit comment event: 1

Last Year

Watch event: 21
Delete event: 1
Issue comment event: 2
Push event: 11
Pull request event: 4
Fork event: 3
Create event: 2
Commit comment event: 1

Committers

Last synced: 10 months ago

All Time

Total Commits: 447
Total Committers: 17
Avg Commits per committer: 26.294
Development Distribution Score (DDS): 0.427

Past Year

Commits: 5
Committers: 3
Avg Commits per committer: 1.667
Development Distribution Score (DDS): 0.4

Top Committers

Name	Email	Commits
Cathy Deng	c**5@g**m	256
Forest Gregg	f**g@u**u	138
Derek Eder	d**r@g**m	14
Miroslav Batchkarov	m**v@g**m	13
Andrew Ziem	a**m@u**g	9
Xavier Medrano	x**o@X**l	3
Nicholas Chammas	n**s@g**m	3
Jean Cochrane	j**n@j**m	2
Adam Johnson	me@a****u	1
Francis T. O'Donovan	f**n@g**m	1
Hannah Cushman	h****h	1
Joe Germuska	j**e@g**m	1
Linh Nguyen	t**m@g**m	1
Xavier Medrano	x**o@x**n	1
Thom Neale	t**e@m**e	1
Derek Willis	d**s@n**m	1
Richard West	r**t@p**m	1
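The all-time averages above can be reproduced from the commit counts. The sketch below assumes the Development Distribution Score is one minus the top committer's share of all commits, a definition that matches the reported figure; the function names are illustrative.

```python
def average_commits(total_commits, total_committers):
    """Mean number of commits per committer."""
    return total_commits / total_committers


def development_distribution_score(top_committer_commits, total_commits):
    """Assumed definition: 1 minus the top committer's share of commits."""
    return 1 - top_committer_commits / total_commits


# All-time figures from the tables above: 447 commits, 17 committers,
# and 256 commits by the top committer.
print(round(average_commits(447, 17), 3))
print(round(development_distribution_score(256, 447), 3))
```

A score near 0 would mean one person wrote almost everything, while values closer to 1 indicate work spread across many committers.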

Committer Domains (Top 20 + Academic)

peachtreedata.com: 1
nytimes.com: 1
mm20698pc.home: 1
xaviers-mbp.lan: 1
germuska.com: 1
adamj.eu: 1
jeancochrane.com: 1
us.ci.org: 1
uchicago.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 90
Total pull requests: 19
Average time to close issues: about 2 months
Average time to close pull requests: 4 months
Total issue authors: 69
Total pull request authors: 14
Average comments per issue: 1.72
Average comments per pull request: 2.21
Merged pull requests: 9
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 3
Average time to close issues: N/A
Average time to close pull requests: 1 minute
Issue authors: 3
Pull request authors: 3
Average comments per issue: 0.33
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

az0 (15)
fgregg (3)
rkiddy (2)
dda-mark-burlock (2)
derekeder (2)
mattkiefer (2)
agnathan (2)
hancush (1)
merrymccarron (1)
russian-developer (1)
bwbensonjr (1)
gfranco008 (1)
lawli3t (1)
rlad (1)
vasantloka (1)

Pull Request Authors

az0 (4)
fgregg (3)
mparent61 (2)
atk81-candor (2)
pombredanne (2)
Richard-West (2)
tuleism (1)
Helw150 (1)
adamchainz (1)
proinsias (1)
twneale (1)
nchammas (1)
jernsthausen (1)
evz (1)

Top Labels

Issue Labels

bad parse (6)
documentation (2)
bug (1)

Pull Request Labels

Packages

Total packages: 3
Total downloads:
- pypi 72,650 last-month

Total dependent packages: 2
(may contain duplicates)
Total dependent repositories: 27
(may contain duplicates)
Total versions: 27
Total maintainers: 2

pypi.org: probablepeople

Parse romanized names & companies using advanced NLP methods

Homepage: https://github.com/datamade/probablepeople
Documentation: https://probablepeople.readthedocs.io/
License: MIT License
Latest release: 0.5.6
published over 1 year ago

Versions: 21
Dependent Packages: 2
Dependent Repositories: 27
Downloads: 72,650 Last month

Rankings

Downloads: 2.0%

Stargazers count: 2.6%

Dependent repos count: 2.8%

Dependent packages count: 3.1%

Average: 3.1%

Forks count: 5.2%

Maintainers (2)

Derek.Eder
hancush

Last synced: 6 months ago

proxy.golang.org: github.com/datamade/probablepeople

Documentation: https://pkg.go.dev/github.com/datamade/probablepeople#section-documentation
License: mit
Latest release: v0.5.6
published over 1 year ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Stargazers count: 2.8%

Forks count: 3.3%

Average: 4.3%

Dependent packages count: 5.4%

Dependent repos count: 5.7%

Last synced: 6 months ago

conda-forge.org: probablepeople

Homepage: https://parserator.datamade.us/probablepeople
License: MIT
Latest release: 0.5.4
published over 6 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Stargazers count: 15.4%

Forks count: 20.5%

Average: 30.3%

Dependent repos count: 34.0%

Dependent packages count: 51.2%

Last synced: 6 months ago

Science Score: 23.0%
