https://github.com/cldf/segments

Unicode Standard tokenization routines and orthography profile segmentation

Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 6 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
✓ Committers with academic emails: 2 of 8 committers (25.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (6.7%) to scientific vocabulary

Keywords from Contributors

linguistics concepts cross-linguistic-data glottolog

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: cldf
License: apache-2.0
Language: Python
Default Branch: master
Homepage:
Size: 134 KB

Statistics

Stars: 37
Watchers: 8
Forks: 12
Open Issues: 3
Releases: 0

Created almost 10 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog Contributing License

segments

The segments package provides Unicode Standard tokenization routines and orthography segmentation, implementing the linear algorithm described in the orthography profile specification from The Unicode Cookbook (Moran and Cysouw 2018).

Command line usage

Create a text file:

```
$ echo "aäaaöaaüaa" > text.txt
```

Now look at the profile:

```
$ cat text.txt | segments profile
Grapheme	frequency	mapping
a	7	a
ä	1	ä
ü	1	ü
ö	1	ö
```

Write the profile to a file:

```
$ cat text.txt | segments profile > profile.prf
```

Edit the profile:

```
$ more profile.prf
Grapheme	frequency	mapping
aa	0	x
a	7	a
ä	1	ä
ü	1	ü
ö	1	ö
```

Now tokenize the text without profile:

```
$ cat text.txt | segments tokenize
a ä a a ö a a ü a a
```

And with profile:

```
$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
```
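To make the profile step more concrete, here is a minimal Python sketch of how a draft profile like the one above could be derived: count each grapheme and map it to itself. It is an illustration only, not the code behind `segments profile`. The helper names `graphemes` and `draft_profile` are hypothetical, and grapheme clusters are approximated as a base character plus any following combining marks rather than full Unicode text segmentation.

```python
import unicodedata
from collections import Counter


def graphemes(text):
    """Split text into approximate grapheme clusters.

    Assumption: a cluster is one base character followed by zero or more
    combining marks. Whitespace is skipped. This is a simplification of
    Unicode grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        if ch.isspace():
            continue
        if clusters and unicodedata.combining(ch):
            # Attach the combining mark to the preceding base character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def draft_profile(text):
    """Return (grapheme, frequency, mapping) rows, most frequent first.

    Each grapheme initially maps to itself, as in the draft profile above.
    """
    counts = Counter(graphemes(text))
    return [(g, n, g) for g, n in counts.most_common()]
```

Called on the sample text `aäaaöaaüaa`, `draft_profile` reports a frequency of 7 for `a` and 1 each for `ä`, `ö` and `ü`, the same counts as in the profile listing. The order of rows with equal frequency is not guaranteed to match the command-line tool.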

API

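```python
from segments import Profile, Tokenizer

# Without a profile, the text is split into single graphemes.
t = Tokenizer()
t('abcd')  # 'a b c d'

# A profile with two multi-character graphemes and their mappings.
prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
print(prf)
# Grapheme	mapping
# ab	x
# cd	y

# With the profile, segmentation follows the listed graphemes.
t = Tokenizer(profile=prf)
t('abcd')  # 'ab cd'
t('abcd', column='mapping')  # 'x y'
```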
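The behaviour shown in the API example can be approximated in a few lines of plain Python. The sketch below assumes greedy, left-to-right, longest-match segmentation against the graphemes of a profile. It is not the package's implementation. The function `tokenize` and the dict-based profile layout are hypothetical stand-ins for `Tokenizer` and `Profile`, and passing unmatched characters through unchanged is a simplification that may differ from how the library treats them.

```python
def tokenize(text, profile, column="Grapheme"):
    """Segment text by greedy longest match against profile graphemes.

    profile maps each grapheme to a dict of extra columns, for example
    {"ab": {"mapping": "x"}}. With column="Grapheme" the matched grapheme
    itself is emitted; otherwise the value from the named column is used.
    """
    # Try longer graphemes first so that 'ab' wins over 'a'.
    candidates = sorted((g for g in profile if g), key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        for grapheme in candidates:
            if text.startswith(grapheme, pos):
                if column == "Grapheme":
                    tokens.append(grapheme)
                else:
                    tokens.append(profile[grapheme][column])
                pos += len(grapheme)
                break
        else:
            # Simplification: keep characters the profile does not cover.
            tokens.append(text[pos])
            pos += 1
    return " ".join(tokens)


# Same graphemes and mappings as in the API example.
profile = {"ab": {"mapping": "x"}, "cd": {"mapping": "y"}}
```

With this profile, `tokenize("abcd", profile)` gives `'ab cd'` and `tokenize("abcd", profile, column="mapping")` gives `'x y'`, mirroring the calls in the API example. With an empty profile the text falls back to one token per character.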

Owner

Name: Cross-Linguistic Data Formats
Login: cldf
Kind: organization

Website: https://cldf.clld.org
Repositories: 15
Profile: https://github.com/cldf

GitHub Events

Total

Issues event: 3
Watch event: 3
Delete event: 1
Push event: 5
Create event: 1

Last Year

Issues event: 3
Watch event: 3
Delete event: 1
Push event: 5
Create event: 1

Committers

Last synced: over 3 years ago

All Time

Total Commits: 72
Total Committers: 8
Avg Commits per committer: 9.0
Development Distribution Score (DDS): 0.139

Top Committers

Name	Email	Commits
xrotwang	x**g@g**m	62
Steven Moran	b**t@g**m	3
Lucas Ashby	l**y@g**m	2
LinguList	m**t@u**e	1
Gereon Kaiping	g**g@h**l	1
Kyle Gorman	k**n@g**m	1
Simon J Greenhill	S**l@u**m	1
Johann-Mattis List	L**t@u**m	1

Committer Domains (Top 20 + Academic)

hum.leidenuniv.nl: 1
uni-marburg.de: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 29
Total pull requests: 22
Average time to close issues: about 1 year
Average time to close pull requests: 16 days
Total issue authors: 9
Total pull request authors: 7
Average comments per issue: 5.72
Average comments per pull request: 3.23
Merged pull requests: 19
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

xrotwang (13)
LinguList (6)
Anaphory (3)
HughP (2)
bambooforest (1)
kylebgorman (1)
lfashby (1)
tresoldi (1)
SimonGreenhill (1)

Pull Request Authors

xrotwang (11)
LinguList (3)
Anaphory (2)
bambooforest (2)
lfashby (2)
kylebgorman (1)
tresoldi (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 2
Total downloads:
- pypi 371,592 last-month
Total docker downloads: 427,687

Total dependent packages: 5
(may contain duplicates)
Total dependent repositories: 696
(may contain duplicates)
Total versions: 32
Total maintainers: 4

pypi.org: segments

Segmentation with orthography profiles

Homepage: https://github.com/cldf/segments
Documentation: https://segments.readthedocs.io/
License: Apache 2.0
Latest release: 2.3.0
published over 1 year ago

Versions: 18
Dependent Packages: 5
Dependent Repositories: 696
Downloads: 371,592 Last month
Docker Downloads: 427,687

Rankings

Dependent repos count: 0.5%

Docker downloads count: 1.0%

Downloads: 1.2%

Dependent packages count: 1.6%

Average: 4.4%

Forks count: 10.2%

Stargazers count: 12.1%

Maintainers (4)

LinguList xrotwang bambooforest chrzyki

Last synced: 11 months ago

proxy.golang.org: github.com/cldf/segments

Documentation: https://pkg.go.dev/github.com/cldf/segments#section-documentation
License: apache-2.0
Latest release: v2.3.0+incompatible
published over 1 year ago

Versions: 14
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.5%

Average: 5.7%

Dependent repos count: 5.9%

Last synced: 11 months ago

Dependencies

setup.py pypi

clldutils >=1.7.3
csvw >=1.5.6
regex *
