https://github.com/datamade/usaddress

:us: a python library for parsing unstructured United States address strings into address components

Keywords

address address-parser conditional-random-fields crf machine-learning natural-language-processing nlp parserator python python-library

Keywords from Contributors

names datamade de-duplicating dedupe dedupe-library entity-resolution record-linkage chicago councilmatic checklist

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: datamade
License: mit
Language: Python
Default Branch: main
Homepage: https://parserator.datamade.us/usaddress
Size: 6.8 MB

Statistics

Stars: 1,583
Watchers: 38
Forks: 304
Open Issues: 163
Releases: 0

Created over 11 years ago · Last pushed 8 months ago

Metadata Files

Readme License

usaddress

usaddress is a Python library for parsing unstructured United States address strings into address components, using advanced NLP methods.

What this can do: Using a probabilistic model, it makes (very educated) guesses in identifying address components, even in tricky cases where rule-based parsers typically break down.

What this cannot do: It cannot identify address components with perfect accuracy, nor can it verify that a given address is correct/valid.

It also does not normalize the address. However, usaddress-scourgify, a library built on top of usaddress, does.

Tools built with usaddress

Parserator API

A RESTful API built on top of usaddress for programmers who don't use Python. An API key is required; the first 1,000 parses are free.

Parserator Google Sheets App

Parserator: Parse and Split Addresses allows you to easily split addresses into separate columns by street, city, state, zipcode and more right in Google Sheets.

How to use the usaddress python library

1. Install usaddress with pip, a tool for installing and managing Python packages (beginner's guide here).

In the terminal,

pip install usaddress

2. Parse some addresses!

Note that parse and tag are different methods:

```python
import usaddress

addr = '123 Main St. Suite 100 Chicago, IL'

# The parse method will split your address string into components, and label each component.
# expected output: [(u'123', 'AddressNumber'), (u'Main', 'StreetName'), (u'St.', 'StreetNamePostType'), (u'Suite', 'OccupancyType'), (u'100', 'OccupancyIdentifier'), (u'Chicago,', 'PlaceName'), (u'IL', 'StateName')]
usaddress.parse(addr)

# The tag method will try to be a little smarter
# it will merge consecutive components, strip commas, & return an address type
# expected output: (OrderedDict([('AddressNumber', u'123'), ('StreetName', u'Main'), ('StreetNamePostType', u'St.'), ('OccupancyType', u'Suite'), ('OccupancyIdentifier', u'100'), ('PlaceName', u'Chicago'), ('StateName', u'IL')]), 'Street Address')
usaddress.tag(addr)
```
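To make the difference between the two methods concrete, here is a minimal sketch of the "merge consecutive components and strip commas" step that tag is described as performing. It runs on the (token, label) pairs from the expected parse output above and does not import usaddress. The helper name `merge_components` and the exact punctuation it strips are assumptions for illustration, not the library's implementation.

```python
from collections import OrderedDict

# (token, label) pairs copied from the expected parse output above.
parsed = [
    ('123', 'AddressNumber'), ('Main', 'StreetName'), ('St.', 'StreetNamePostType'),
    ('Suite', 'OccupancyType'), ('100', 'OccupancyIdentifier'),
    ('Chicago,', 'PlaceName'), ('IL', 'StateName'),
]


def merge_components(pairs):
    """Hypothetical helper: join consecutive tokens that share a label,
    then strip surrounding spaces, commas, and semicolons from each value."""
    merged = OrderedDict()
    previous_label = None
    for token, label in pairs:
        if label == previous_label:
            # Same label as the previous token: extend the current component.
            merged[label] += ' ' + token
        elif label in merged:
            # The label reappears after a different label; a flat mapping
            # cannot represent that, so fail loudly instead of guessing.
            raise ValueError(f'label {label!r} appears in two separate places')
        else:
            merged[label] = token
        previous_label = label
    return OrderedDict((label, text.strip(' ,;')) for label, text in merged.items())


result = merge_components(parsed)
print(result['PlaceName'])  # Chicago
```

A multi-token component such as `[('New', 'PlaceName'), ('York,', 'PlaceName'), ('NY', 'StateName')]` collapses to a single `'New York'` entry, which is why a mapping is often more convenient than the raw token list.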

How to use this development code (for the nerds)

usaddress uses parserator, a library for making and improving probabilistic parsers - specifically, parsers that use python-crfsuite's implementation of conditional random fields. Parserator allows you to train the usaddress parser's model (a .crfsuite settings file) on labeled training data, and provides tools for adding new labeled training data.
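For readers new to conditional random fields: a CRF labels each token using features of that token and of its neighbors. The sketch below shows what such per-token features could look like for an address string. The feature names and the functions `token_features` and `sequence_features` are hypothetical and chosen only for illustration; they are not the feature set that usaddress or parserator actually uses.

```python
import re


def token_features(token):
    """Illustrative per-token features (an assumption, not usaddress's own)."""
    stripped = token.strip(',.;')
    return {
        'lower': stripped.lower(),
        'is_digits': stripped.isdigit(),
        'has_digit': any(ch.isdigit() for ch in stripped),
        'length': len(stripped),
        'ends_with_comma': token.endswith(','),
        'is_upper_abbrev': bool(re.fullmatch(r'[A-Z]{2}', stripped)),
    }


def sequence_features(tokens):
    """Add neighbor context so each token also 'sees' what precedes and follows it."""
    features = [token_features(t) for t in tokens]
    for i, feats in enumerate(features):
        feats['prev_lower'] = features[i - 1]['lower'] if i > 0 else '<start>'
        feats['next_lower'] = features[i + 1]['lower'] if i < len(features) - 1 else '<end>'
    return features


tokens = '123 Main St. Suite 100 Chicago, IL'.split()
feats = sequence_features(tokens)
print(feats[5]['lower'], feats[5]['ends_with_comma'])  # chicago True
```

A trained model weighs features like these jointly across the whole sequence, which is how a token such as "St." can be labeled differently depending on what surrounds it.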

Building & testing the code in this repo

To build a development version of usaddress on your machine, run the following code in your command line:

git clone https://github.com/datamade/usaddress.git
cd usaddress
pip install -e ."[dev]"

Then run the testing suite to confirm that everything is working properly:

pytest

Having trouble building the code? Open an issue and we'd be glad to help you troubleshoot.

Adding new training data

If usaddress is consistently failing on particular address patterns, you can adjust the parser's behavior by adding new training data to the model. Follow our guide in the training directory, and be sure to make a pull request so that we can incorporate your contribution into our next release!

Important links

Web Interface: https://parserator.datamade.us/usaddress
Python Package Distribution: https://pypi.python.org/pypi/usaddress
Python Package Documentation: https://usaddress.readthedocs.io/
API Documentation: https://parserator.datamade.us/api-docs
Repository: https://github.com/datamade/usaddress
Issues: https://github.com/datamade/usaddress/issues
Blog post: http://datamade.us/blog/parsing-addresses-with-usaddress

Team

Forest Gregg, DataMade
Cathy Deng, DataMade
Miroslav Batchkarov, University of Sussex
Jean Cochrane, DataMade

Bad Parses / Bugs

Report issues in the issue tracker

If an address was parsed incorrectly, please let us know! You can either open an issue or (if you're adventurous) add new training data to improve the parser's model. When possible, please send over a few real-world examples of similar address patterns, along with some info about the source of the data - this will help us train the parser and improve its performance.

If something in the library is not behaving intuitively, it is a bug, and should be reported.

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Send us a pull request. Bonus points for topic branches!

Copyright

Owner

Name: datamade
Login: datamade
Kind: organization
Email: info@datamade.us
Location: Chicago, IL

Website: http://datamade.us
Twitter: datamadeco
Repositories: 123
Profile: https://github.com/datamade

We build open source technology using open data to empower journalists, researchers, governments and advocacy organizations.

GitHub Events

Total

Issues event: 12
Watch event: 65
Delete event: 1
Issue comment event: 28
Push event: 12
Pull request review comment event: 6
Pull request review event: 8
Pull request event: 17
Fork event: 9
Create event: 11

Last Year

Issues event: 12
Watch event: 65
Delete event: 1
Issue comment event: 28
Push event: 12
Pull request review comment event: 6
Pull request review event: 8
Pull request event: 17
Fork event: 9
Create event: 11

Committers

Last synced: 9 months ago

All Time

Total Commits: 425
Total Committers: 19
Avg Commits per committer: 22.368
Development Distribution Score (DDS): 0.565

Past Year

Commits: 19
Committers: 5
Avg Commits per committer: 3.8
Development Distribution Score (DDS): 0.579

Top Committers

Name	Email	Commits
Cathy Deng	c**5@g**m	185
Forest Gregg	f**g@u**u	159
Jean Cochrane	j**n@j**m	16
Derek Eder	d**r@g**m	16
Miroslav Batchkarov	m**v@g**m	13
Xavier Medrano	x**o@X**l	8
Michael Lissner	m**r@m**m	6
Brent Payne	b**e@g**m	4
Travis Brown	t**s@b**m	4
Xavier Medrano	x**o@x**n	3
Ben Shulman	s**n@g**m	2
Jacob R. Stevens	s**9@p**u	2
Dave Guarino	d**e@c**g	1
Mark Baas	m**s@g**m	1
Tanya Schlusser	t**a@t**t	1
xmedr	1****r	1
Shahin Saneinejad	s**d@c**m	1
Adam Chainz	a**m@a**u	1
RJ	r**g@c**g	1
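
The all-time summary figures can be cross-checked against the commit counts in this table. The sketch below assumes the Development Distribution Score is one minus the top committer's share of all commits. That definition is not stated on this page, but it reproduces the reported 0.565.

```python
# Commit counts copied from the Top Committers table, in table order.
commits = [185, 159, 16, 16, 13, 8, 6, 4, 4, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1]

total = sum(commits)                # total commits across all listed committers
average = total / len(commits)      # average commits per committer
dds = 1 - max(commits) / total      # assumed DDS definition (see note above)

print(total, round(average, 3), round(dds, 3))  # 425 22.368 0.565
```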

Committer Domains (Top 20 + Academic)

casecommons.org: 1
adamj.eu: 1
castlighthealth.com: 1
tickel.net: 1
codeforamerica.org: 1
purdue.edu: 1
xaviers-mbp.lan: 1
bryx.com: 1
michaeljaylissner.com: 1
jeancochrane.com: 1
uchicago.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 123
Total pull requests: 28
Average time to close issues: 8 months
Average time to close pull requests: about 3 years
Total issue authors: 114
Total pull request authors: 11
Average comments per issue: 0.56
Average comments per pull request: 0.71
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 10
Pull requests: 12
Average time to close issues: about 2 months
Average time to close pull requests: 10 days
Issue authors: 9
Pull request authors: 4
Average comments per issue: 0.7
Average comments per pull request: 0.17
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

stevetb777 (2)
dimaryaz (2)
rbrtmrtn (2)
PoRich (2)
fablet (2)
Aj-232425 (2)
boompig (2)
stdavis (2)
webauthor (2)
AmericanY (2)
Baugus (1)
nonprofittechy (1)
evanrzai (1)
ezheidtmann (1)
bschollnick (1)

Pull Request Authors

xmedr (9)
bbharathrao (5)
vasil166 (2)
ecatkins (2)
rptetzloff (2)
adriennefranke (2)
mlissner (2)
tirkarthi (2)
IlyaSukhanov (2)
stdavis (1)
fgregg (1)

Top Labels

Issue Labels

bad parse (4)

Pull Request Labels

Packages

Total packages: 3
Total downloads:
- pypi 2,026,352 last-month
Total docker downloads: 7,830

Total dependent packages: 11
(may contain duplicates)
Total dependent repositories: 149
(may contain duplicates)
Total versions: 50
Total maintainers: 2

pypi.org: usaddress

Parse US addresses using conditional random fields

Homepage: https://github.com/datamade/usaddress
Documentation: https://usaddress.readthedocs.io/
License: MIT License
Latest release: 0.5.16
published 7 months ago

Versions: 39
Dependent Packages: 11
Dependent Repositories: 149
Downloads: 2,026,352 Last month
Docker Downloads: 7,830

Rankings

Downloads: 0.4%

Dependent packages count: 0.9%

Dependent repos count: 1.2%

Average: 1.5%

Stargazers count: 1.8%

Docker downloads count: 1.8%

Forks count: 3.0%

Maintainers (2)

Derek.Eder
fgregg

Last synced: 7 months ago

proxy.golang.org: github.com/datamade/usaddress

Documentation: https://pkg.go.dev/github.com/datamade/usaddress#section-documentation
License: mit
Latest release: v0.5.16
published 7 months ago

Versions: 9
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Forks count: 1.7%

Stargazers count: 1.9%

Average: 3.7%

Dependent packages count: 5.4%

Dependent repos count: 5.7%

Last synced: 6 months ago

conda-forge.org: usaddress

Homepage: https://github.com/datamade/usaddress
License: MIT
Latest release: 0.5.10
published over 3 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Forks count: 9.4%

Stargazers count: 9.8%

Average: 26.1%

Dependent repos count: 34.0%

Dependent packages count: 51.2%

Last synced: 5 months ago

Science Score: 36.0%
