https://github.com/cldf/segments

Unicode Standard tokenization routines and orthography profile segmentation


Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    2 of 8 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.7%) to scientific vocabulary

Keywords from Contributors

linguistics concepts cross-linguistic-data glottolog
Last synced: 10 months ago

Repository

Unicode Standard tokenization routines and orthography profile segmentation

Basic Info
  • Host: GitHub
  • Owner: cldf
  • License: apache-2.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 134 KB
Statistics
  • Stars: 37
  • Watchers: 8
  • Forks: 12
  • Open Issues: 3
  • Releases: 0
Created almost 10 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog Contributing License

README.md

segments


The segments package provides Unicode Standard tokenization routines and orthography segmentation. It implements the linear algorithm described in the orthography profile specification from The Unicode Cookbook (Moran and Cysouw 2018).

Command line usage

Create a text file:
```
$ echo "aäaaöaaüaa" > text.txt
```

Now look at the profile:
```
$ cat text.txt | segments profile
Grapheme    frequency    mapping
a           7            a
ä           1            ä
ü           1            ü
ö           1            ö
```

Write the profile to a file:
```
$ cat text.txt | segments profile > profile.prf
```

Edit the profile:

```
$ more profile.prf
Grapheme    frequency    mapping
aa          0            x
a           7            a
ä           1            ä
ü           1            ü
ö           1            ö
```

Now tokenize the text without profile:
```
$ cat text.txt | segments tokenize
a ä a a ö a a ü a a
```

And with profile:
```
$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
```
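
The frequency table that `segments profile` prints can be approximated in a few lines of plain Python. The following is only a rough sketch using the standard library, not the package's implementation: the helper names `rough_graphemes` and `frequency_profile` are hypothetical, and grouping combining marks onto the preceding base character (after NFD normalization, an assumption made here) is a simplification of full Unicode grapheme clusters.

```python
import unicodedata
from collections import Counter


def rough_graphemes(text):
    """Split text into base characters plus any trailing combining marks.

    Hypothetical helper: a simplification of Unicode grapheme clusters.
    """
    clusters = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the combining mark to its base
        else:
            clusters.append(ch)
    return clusters


def frequency_profile(text):
    """Return (grapheme, frequency, mapping) rows, most frequent first."""
    counts = Counter(g for g in rough_graphemes(text) if not g.isspace())
    # Each grapheme initially maps to itself, as in the profile shown above.
    return [(g, n, g) for g, n in counts.most_common()]


rows = frequency_profile("aäaaöaaüaa")
print("Grapheme\tfrequency\tmapping")
for grapheme, freq, mapping in rows:
    print(f"{grapheme}\t{freq}\t{mapping}")
```

The order of rows with equal frequency in this sketch may differ from the order the command line tool prints.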

API

```python
>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme    mapping
ab          x
cd          y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'
```
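
To make the profile-based behaviour above more concrete, here is a minimal, self-contained sketch of greedy left-to-right longest-match segmentation. It is an illustration under stated assumptions, not the package's code: the function `segment` and the dict-of-dicts profile layout are hypothetical, and unmatched characters are simply passed through, which may differ from how the package handles them.

```python
def segment(text, profile, column=None):
    """Greedy left-to-right longest-match segmentation (illustrative sketch).

    `profile` maps each grapheme to a dict of column values; this layout is
    an assumption for the example, not the structure used by the package.
    """
    longest = max(map(len, profile), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in profile:
                out.append(profile[chunk][column] if column else chunk)
                i += size
                break
        else:
            # No profile entry covers this position: pass the character through.
            out.append(text[i])
            i += 1
    return " ".join(out)


profile = {"ab": {"mapping": "x"}, "cd": {"mapping": "y"}}
print(segment("abcd", profile))                    # ab cd
print(segment("abcd", profile, column="mapping"))  # x y
```

Trying longer candidates before shorter ones is what lets a multi-character entry such as `ab` win over any single-character entry starting at the same position.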

Owner

  • Name: Cross-Linguistic Data Formats
  • Login: cldf
  • Kind: organization

GitHub Events

Total
  • Issues event: 3
  • Watch event: 3
  • Delete event: 1
  • Push event: 5
  • Create event: 1
Last Year
  • Issues event: 3
  • Watch event: 3
  • Delete event: 1
  • Push event: 5
  • Create event: 1

Committers

Last synced: over 3 years ago

All Time
  • Total Commits: 72
  • Total Committers: 8
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.139
Top Committers
Name Email Commits
xrotwang x****g@g****m 62
Steven Moran b****t@g****m 3
Lucas Ashby l****y@g****m 2
LinguList m****t@u****e 1
Gereon Kaiping g****g@h****l 1
Kyle Gorman k****n@g****m 1
Simon J Greenhill S****l@u****m 1
Johann-Mattis List L****t@u****m 1

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 29
  • Total pull requests: 22
  • Average time to close issues: about 1 year
  • Average time to close pull requests: 16 days
  • Total issue authors: 9
  • Total pull request authors: 7
  • Average comments per issue: 5.72
  • Average comments per pull request: 3.23
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • xrotwang (13)
  • LinguList (6)
  • Anaphory (3)
  • HughP (2)
  • bambooforest (1)
  • kylebgorman (1)
  • lfashby (1)
  • tresoldi (1)
  • SimonGreenhill (1)
Pull Request Authors
  • xrotwang (11)
  • LinguList (3)
  • Anaphory (2)
  • bambooforest (2)
  • lfashby (2)
  • kylebgorman (1)
  • tresoldi (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 371,592 last-month
  • Total docker downloads: 427,687
  • Total dependent packages: 5
    (may contain duplicates)
  • Total dependent repositories: 696
    (may contain duplicates)
  • Total versions: 32
  • Total maintainers: 4
pypi.org: segments

Segmentation with orthography profiles

  • Versions: 18
  • Dependent Packages: 5
  • Dependent Repositories: 696
  • Downloads: 371,592 Last month
  • Docker Downloads: 427,687
Rankings
Dependent repos count: 0.5%
Docker downloads count: 1.0%
Downloads: 1.2%
Dependent packages count: 1.6%
Average: 4.4%
Forks count: 10.2%
Stargazers count: 12.1%
Last synced: 11 months ago
proxy.golang.org: github.com/cldf/segments
  • Versions: 14
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.5%
Average: 5.7%
Dependent repos count: 5.9%
Last synced: 11 months ago

Dependencies

setup.py pypi
  • clldutils >=1.7.3
  • csvw >=1.5.6
  • regex *