https://github.com/bernard-ng/drc-native-tokenizer

Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file (found)
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity (low: 9.6%)

Keywords

llm nlp tokenizer
Last synced: 6 months ago

Repository

Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact

Basic Info
  • Host: GitHub
  • Owner: bernard-ng
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 20.5 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
llm nlp tokenizer
Created 6 months ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact

Tokenization is a foundational step in natural language processing (NLP), yet most existing approaches are developed and optimized for high-resource languages, often overlooking African linguistic contexts. This work explores the use of Byte Pair Encoding (BPE) applied exclusively to Congolese native names as training data to investigate whether meaningful subword units can emerge that generalize across the four major national languages of the Democratic Republic of Congo (Lingala, Swahili, Kikongo, and Tshiluba). By constructing a tokenizer solely from personal names—ubiquitous, linguistically rich, and culturally grounded—we aim to examine whether name-derived subword patterns capture phonological and morphological regularities shared across languages.
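To make the approach concrete, a minimal character-level BPE trainer can be sketched in pure Python. This is an illustration only: the sample names below are hypothetical placeholders, not the project's dataset, and the repository itself depends on the Hugging Face `tokenizers` library rather than this hand-rolled version.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged, out, i = "".join(pair), [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn BPE merge rules from a word list, starting from characters."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

# Hypothetical sample of Congolese-style names (illustration only).
names = ["ngandu", "ngalula", "ngoy", "kalala", "kabila", "mwamba"]
print(train_bpe(names, num_merges=4))  # e.g. [('l', 'a'), ('n', 'g'), ...]
```

Even on this tiny sample, the first merges picked up (`la`, `ng`) are recurrent phonological units across the names, which is the kind of cross-language subword regularity the project investigates at scale.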

Getting Started

Installation & Setup

The instructions and command-line snippets below will help you set up the project environment quickly and efficiently. They assume you have Python 3.11 and Git installed and working on a Unix-like system (Linux, macOS, etc.).

Using Makefile (Recommended)

```bash
git clone https://github.com/bernard-ng/drc-native-tokenizer.git
cd drc-native-tokenizer

# Setup environment
make setup
make activate
```

Contributors

Owner

  • Name: Bernard Ngandu
  • Login: bernard-ng
  • Kind: user
  • Location: Lubumbashi, DRC
  • Company: @devscast

Building a community of skilled developers: @devscast

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Dependencies

requirements.lock.txt pypi
  • GitPython ==3.1.45
  • Jinja2 ==3.1.6
  • MarkupSafe ==3.0.2
  • PyYAML ==6.0.2
  • Pygments ==2.19.2
  • altair ==5.5.0
  • attrs ==25.3.0
  • blinker ==1.9.0
  • cachetools ==6.1.0
  • certifi ==2025.8.3
  • charset-normalizer ==3.4.3
  • click ==8.2.1
  • filelock ==3.19.1
  • fsspec ==2025.7.0
  • gitdb ==4.0.12
  • hf-xet ==1.1.7
  • huggingface-hub ==0.34.4
  • idna ==3.10
  • jsonschema ==4.25.1
  • jsonschema-specifications ==2025.4.1
  • markdown-it-py ==4.0.0
  • mdurl ==0.1.2
  • narwhals ==2.1.2
  • numpy ==2.3.2
  • packaging ==25.0
  • pandas ==2.3.1
  • pillow ==11.3.0
  • protobuf ==6.32.0
  • pyarrow ==21.0.0
  • pydeck ==0.9.1
  • python-dateutil ==2.9.0.post0
  • pytz ==2025.2
  • referencing ==0.36.2
  • regex ==2025.7.34
  • requests ==2.32.4
  • rich ==14.1.0
  • rpds-py ==0.27.0
  • safetensors ==0.6.2
  • shellingham ==1.5.4
  • six ==1.17.0
  • smmap ==5.0.2
  • streamlit ==1.48.1
  • tenacity ==9.1.2
  • tiktoken ==0.11.0
  • tokenizers ==0.21.4
  • toml ==0.10.2
  • tornado ==6.5.2
  • tqdm ==4.67.1
  • transformers ==4.55.2
  • typer ==0.16.0
  • typing_extensions ==4.14.1
  • tzdata ==2025.2
  • urllib3 ==2.5.0
requirements.txt pypi
  • altair ==5.3.0
  • numpy ==1.26.4
  • pandas ==2.2.2
  • pyyaml ==6.0.1
  • rich ==13.7.1
  • scikit-learn ==1.4.2
  • streamlit ==1.36.0
  • tiktoken ==0.7.0
  • tokenizers ==0.15.2
  • tqdm ==4.66.4
  • transformers ==4.41.2
  • typer ==0.12.3