https://github.com/bernard-ng/drc-native-tokenizer
Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.6%) to scientific vocabulary
Repository
Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact
Tokenization is a foundational step in natural language processing (NLP), yet most existing approaches are developed and optimized for high-resource languages, often overlooking African linguistic contexts. This work explores the use of Byte Pair Encoding (BPE) applied exclusively to Congolese native names as training data to investigate whether meaningful subword units can emerge that generalize across the four major national languages of the Democratic Republic of Congo (Lingala, Swahili, Kikongo, and Tshiluba). By constructing a tokenizer solely from personal names—ubiquitous, linguistically rich, and culturally grounded—we aim to examine whether name-derived subword patterns capture phonological and morphological regularities shared across languages.
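To make the idea concrete, here is a minimal pure-Python sketch of how BPE learns subword merges from a list of names. It is not the project's implementation: the function names (`learn_bpe`, `merge_pair`, `encode`), the `</w>` end-of-word marker, and the five sample names are assumptions chosen for illustration, and the sample is far too small to support any linguistic conclusion.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from `words` (character-level toy version)."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w.lower()) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word.lower()) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical sample names, for illustration only (not the project's dataset).
    names = ["Ngandu", "Mukendi", "Kasongo", "Ilunga", "Kabongo"]
    merges = learn_bpe(names, num_merges=10)
    print(merges)
    print(encode("Ngongo", merges))
```

A tokenizer trained this way can then be compared across languages by counting how many tokens it produces per word: fewer tokens per word suggests that the learned subwords match that language's recurring character sequences.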
Getting Started
Installation & Setup
The instructions and command-line snippets below help you set up the project environment quickly. They assume you have Python 3.11 and Git installed and are working on a Unix-like system (Linux, macOS, etc.).
Using Makefile (Recommended)
```bash
git clone https://github.com/bernard-ng/drc-native-tokenizer.git
cd drc-native-tokenizer

# Setup environment
make setup
make activate
```
Contributors
Owner
- Name: Bernard Ngandu
- Login: bernard-ng
- Kind: user
- Location: Lubumbashi RDC
- Company: @devscast
- Website: https://devscast.tech
- Twitter: BernardNgandu
- Repositories: 7
- Profile: https://github.com/bernard-ng
Building a community of skilled developers: @devscast
GitHub Events
Total
- Push event: 1
Last Year
- Push event: 1
Dependencies
- GitPython ==3.1.45
- Jinja2 ==3.1.6
- MarkupSafe ==3.0.2
- PyYAML ==6.0.2
- Pygments ==2.19.2
- altair ==5.5.0
- attrs ==25.3.0
- blinker ==1.9.0
- cachetools ==6.1.0
- certifi ==2025.8.3
- charset-normalizer ==3.4.3
- click ==8.2.1
- filelock ==3.19.1
- fsspec ==2025.7.0
- gitdb ==4.0.12
- hf-xet ==1.1.7
- huggingface-hub ==0.34.4
- idna ==3.10
- jsonschema ==4.25.1
- jsonschema-specifications ==2025.4.1
- markdown-it-py ==4.0.0
- mdurl ==0.1.2
- narwhals ==2.1.2
- numpy ==2.3.2
- packaging ==25.0
- pandas ==2.3.1
- pillow ==11.3.0
- protobuf ==6.32.0
- pyarrow ==21.0.0
- pydeck ==0.9.1
- python-dateutil ==2.9.0.post0
- pytz ==2025.2
- referencing ==0.36.2
- regex ==2025.7.34
- requests ==2.32.4
- rich ==14.1.0
- rpds-py ==0.27.0
- safetensors ==0.6.2
- shellingham ==1.5.4
- six ==1.17.0
- smmap ==5.0.2
- streamlit ==1.48.1
- tenacity ==9.1.2
- tiktoken ==0.11.0
- tokenizers ==0.21.4
- toml ==0.10.2
- tornado ==6.5.2
- tqdm ==4.67.1
- transformers ==4.55.2
- typer ==0.16.0
- typing_extensions ==4.14.1
- tzdata ==2025.2
- urllib3 ==2.5.0
- altair ==5.3.0
- numpy ==1.26.4
- pandas ==2.2.2
- pyyaml ==6.0.1
- rich ==13.7.1
- scikit-learn ==1.4.2
- streamlit ==1.36.0
- tiktoken ==0.7.0
- tokenizers ==0.15.2
- tqdm ==4.66.4
- transformers ==4.41.2
- typer ==0.12.3