https://github.com/cldf/segments
Unicode Standard tokenization routines and orthography profile segmentation
Science Score: 33.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 6 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ✓ Committers with academic emails: 2 of 8 committers (25.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.7%) to scientific vocabulary
Repository
Statistics
- Stars: 37
- Watchers: 8
- Forks: 12
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
segments
The segments package provides Unicode Standard tokenization routines and orthography segmentation,
implementing the linear algorithm described in the orthography profile specification from
The Unicode Cookbook (Moran and Cysouw 2018).
Command line usage
Create a text file:
$ echo "aäaaöaaüaa" > text.txt
Now look at the profile:
$ cat text.txt | segments profile
Grapheme frequency mapping
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
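The frequency column above can be approximated in a few lines of plain Python. This is a minimal sketch, not the package's implementation: the helper names `simple_graphemes` and `frequency_profile` are hypothetical, and attaching combining marks to the preceding base character is a simplification of full Unicode grapheme clustering.

```python
import unicodedata
from collections import Counter


def simple_graphemes(text):
    """Split text into base characters with following combining marks attached.

    Simplified stand-in for Unicode grapheme clusters (assumption of this sketch).
    """
    clusters = []
    for char in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(char):
            # Combining mark: attach it to the previous base character.
            clusters[-1] += char
        else:
            clusters.append(char)
    # Recompose so that "a" + U+0308 is reported as a single "ä".
    return [unicodedata.normalize("NFC", cluster) for cluster in clusters]


def frequency_profile(text):
    """Return (grapheme, frequency) pairs, most frequent first."""
    return Counter(simple_graphemes(text)).most_common()


rows = frequency_profile("aäaaöaaüaa")
for grapheme, frequency in rows:
    # Mirror the three profile columns: Grapheme, frequency, mapping.
    print(grapheme, frequency, grapheme, sep="\t")
```

The order of rows with equal frequency may differ from what the command line tool prints.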
Write the profile to a file:
$ cat text.txt | segments profile > profile.prf
Edit the profile:
$ more profile.prf
Grapheme frequency mapping
aa 0 x
a 7 a
ä 1 ä
ü 1 ü
ö 1 ö
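An edited profile like this one can also be inspected with the standard library. The sketch below assumes the profile columns are tab-separated; `read_profile` and `PROFILE_TEXT` are hypothetical names used only for illustration, and the package's own `Profile` class should be preferred in real code.

```python
import csv
import io

# The edited profile from above, assuming tab-separated columns.
PROFILE_TEXT = (
    "Grapheme\tfrequency\tmapping\n"
    "aa\t0\tx\n"
    "a\t7\ta\n"
    "ä\t1\tä\n"
    "ü\t1\tü\n"
    "ö\t1\tö\n"
)


def read_profile(stream):
    """Read a tab-separated profile into a dict keyed by grapheme."""
    reader = csv.DictReader(stream, delimiter="\t")
    return {row["Grapheme"]: row for row in reader}


profile = read_profile(io.StringIO(PROFILE_TEXT))
print(profile["aa"]["mapping"])
```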
Now tokenize the text without a profile:
$ cat text.txt | segments tokenize
a ä a a ö a a ü a a
And with the profile:
```
$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
```
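The effect of the profile can be illustrated with a greedy longest-match pass over the text. This is a rough sketch under stated assumptions, not the package's code: `tokenize` is a hypothetical function, the profile is a plain dict, the input uses precomposed characters, and characters missing from the profile are passed through unchanged (the real tool may treat them differently).

```python
def tokenize(text, profile=None, column="Grapheme"):
    """Greedy longest-match segmentation against a profile.

    `profile` maps each grapheme to a dict of column values.
    """
    profile = profile or {}
    longest = max((len(grapheme) for grapheme in profile), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in profile:
                tokens.append(profile[chunk][column])
                i += size
                break
        else:
            # No profile entry matched: emit the character as-is (assumption).
            tokens.append(text[i])
            i += 1
    return " ".join(tokens)


profile = {
    "aa": {"Grapheme": "aa", "mapping": "x"},
    "a": {"Grapheme": "a", "mapping": "a"},
    "ä": {"Grapheme": "ä", "mapping": "ä"},
    "ü": {"Grapheme": "ü", "mapping": "ü"},
    "ö": {"Grapheme": "ö", "mapping": "ö"},
}

text = "aäaaöaaüaa"
print(tokenize(text))                              # a ä a a ö a a ü a a
print(tokenize(text, profile))                     # a ä aa ö aa ü aa
print(tokenize(text, profile, column="mapping"))   # a ä x ö x ü x
```

Because the longest entry is tried first, `aa` wins over `a` wherever two `a` characters are adjacent, which is why the first `a` (followed by `ä`) stays a single segment.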
API
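```python
from segments import Profile, Tokenizer

# Without a profile, the tokenizer separates the individual graphemes.
t = Tokenizer()
t('abcd')  # 'a b c d'

# A profile lists the graphemes to match and, optionally, mapping columns.
prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
print(prf)
# Grapheme mapping
# ab x
# cd y

# With the profile, segmentation follows its graphemes.
t = Tokenizer(profile=prf)
t('abcd')  # 'ab cd'
t('abcd', column='mapping')  # 'x y'
```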
Owner
- Name: Cross-Linguistic Data Formats
- Login: cldf
- Kind: organization
- Website: https://cldf.clld.org
- Repositories: 15
- Profile: https://github.com/cldf
GitHub Events
Total
- Issues event: 3
- Watch event: 3
- Delete event: 1
- Push event: 5
- Create event: 1
Last Year
- Issues event: 3
- Watch event: 3
- Delete event: 1
- Push event: 5
- Create event: 1
Committers
Last synced: over 3 years ago
All Time
- Total Commits: 72
- Total Committers: 8
- Avg Commits per committer: 9.0
- Development Distribution Score (DDS): 0.139
Top Committers
| Name | Email | Commits |
|---|---|---|
| xrotwang | x****g@g****m | 62 |
| Steven Moran | b****t@g****m | 3 |
| Lucas Ashby | l****y@g****m | 2 |
| LinguList | m****t@u****e | 1 |
| Gereon Kaiping | g****g@h****l | 1 |
| Kyle Gorman | k****n@g****m | 1 |
| Simon J Greenhill | S****l@u****m | 1 |
| Johann-Mattis List | L****t@u****m | 1 |
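The summary figures under "All Time" can be reproduced from the per-committer counts in the table. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, but it matches the reported value.

```python
# Commit counts per committer, copied from the table above.
commits = [62, 3, 2, 1, 1, 1, 1, 1]

total = sum(commits)
average = total / len(commits)

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits) / total

print(total, average, round(dds, 3))
```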
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 29
- Total pull requests: 22
- Average time to close issues: about 1 year
- Average time to close pull requests: 16 days
- Total issue authors: 9
- Total pull request authors: 7
- Average comments per issue: 5.72
- Average comments per pull request: 3.23
- Merged pull requests: 19
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- xrotwang (13)
- LinguList (6)
- Anaphory (3)
- HughP (2)
- bambooforest (1)
- kylebgorman (1)
- lfashby (1)
- tresoldi (1)
- SimonGreenhill (1)
Pull Request Authors
- xrotwang (11)
- LinguList (3)
- Anaphory (2)
- bambooforest (2)
- lfashby (2)
- kylebgorman (1)
- tresoldi (1)
Packages
- Total packages: 2
- Total downloads (PyPI): 371,592 last month
- Total docker downloads: 427,687
- Total dependent packages: 5 (may contain duplicates)
- Total dependent repositories: 696 (may contain duplicates)
- Total versions: 32
- Total maintainers: 4
pypi.org: segments
Segmentation with orthography profiles
- Homepage: https://github.com/cldf/segments
- Documentation: https://segments.readthedocs.io/
- License: Apache 2.0
- Latest release: 2.3.0 (published over 1 year ago)
proxy.golang.org: github.com/cldf/segments
- Documentation: https://pkg.go.dev/github.com/cldf/segments#section-documentation
- License: apache-2.0
- Latest release: v2.3.0+incompatible (published over 1 year ago)
Dependencies
- clldutils >=1.7.3
- csvw >=1.5.6
- regex *