https://github.com/bernard-ng/drc-native-tokenizer

Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file (found)
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity (low: 9.6%)

Keywords

llm nlp tokenizer
Last synced: 6 months ago

Repository

Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact

Basic Info
  • Host: GitHub
  • Owner: bernard-ng
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 20.5 KB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
llm nlp tokenizer
Created 6 months ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

Tokenization for Low-Resource Congolese Languages: Efficiency, Coverage, and Downstream Impact

Tokenization is a foundational step in natural language processing (NLP), yet most existing approaches are developed and optimized for high-resource languages, often overlooking African linguistic contexts. This work explores the use of Byte Pair Encoding (BPE) applied exclusively to Congolese native names as training data to investigate whether meaningful subword units can emerge that generalize across the four major national languages of the Democratic Republic of Congo (Lingala, Swahili, Kikongo, and Tshiluba). By constructing a tokenizer solely from personal names—ubiquitous, linguistically rich, and culturally grounded—we aim to examine whether name-derived subword patterns capture phonological and morphological regularities shared across languages.
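To make the approach concrete, a minimal character-level BPE trainer can be sketched in pure Python. This is an illustration only: the sample names below are hypothetical placeholders, not the project's dataset, and the repository itself depends on the Hugging Face `tokenizers` library rather than this hand-rolled version.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged, out, i = "".join(pair), [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn BPE merge rules from a word list, starting from characters."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

# Hypothetical sample of Congolese-style names (illustration only).
names = ["ngandu", "ngalula", "ngoy", "kalala", "kabila", "mwamba"]
print(train_bpe(names, num_merges=4))  # e.g. [('l', 'a'), ('n', 'g'), ...]
```

Even on this tiny sample, the first merges picked up (`la`, `ng`) are recurrent phonological units across the names, which is the kind of cross-language subword regularity the project investigates at scale.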

Getting Started

Installation & Setup

The instructions and command-line snippets below will help you set up the project environment quickly and efficiently. They assume you have Python 3.11 and Git installed and working on a Unix-like system (Linux, macOS, etc.).

Using Makefile (Recommended)

```bash
git clone https://github.com/bernard-ng/drc-native-tokenizer.git
cd drc-native-tokenizer

# Setup environment
make setup
make activate
```

Contributors

Owner

  • Name: Bernard Ngandu
  • Login: bernard-ng
  • Kind: user
  • Location: Lubumbashi, DRC
  • Company: @devscast

Building a community of skilled developers: @devscast

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Dependencies

requirements.lock.txt pypi
  • GitPython ==3.1.45
  • Jinja2 ==3.1.6
  • MarkupSafe ==3.0.2
  • PyYAML ==6.0.2
  • Pygments ==2.19.2
  • altair ==5.5.0
  • attrs ==25.3.0
  • blinker ==1.9.0
  • cachetools ==6.1.0
  • certifi ==2025.8.3
  • charset-normalizer ==3.4.3
  • click ==8.2.1
  • filelock ==3.19.1
  • fsspec ==2025.7.0
  • gitdb ==4.0.12
  • hf-xet ==1.1.7
  • huggingface-hub ==0.34.4
  • idna ==3.10
  • jsonschema ==4.25.1
  • jsonschema-specifications ==2025.4.1
  • markdown-it-py ==4.0.0
  • mdurl ==0.1.2
  • narwhals ==2.1.2
  • numpy ==2.3.2
  • packaging ==25.0
  • pandas ==2.3.1
  • pillow ==11.3.0
  • protobuf ==6.32.0
  • pyarrow ==21.0.0
  • pydeck ==0.9.1
  • python-dateutil ==2.9.0.post0
  • pytz ==2025.2
  • referencing ==0.36.2
  • regex ==2025.7.34
  • requests ==2.32.4
  • rich ==14.1.0
  • rpds-py ==0.27.0
  • safetensors ==0.6.2
  • shellingham ==1.5.4
  • six ==1.17.0
  • smmap ==5.0.2
  • streamlit ==1.48.1
  • tenacity ==9.1.2
  • tiktoken ==0.11.0
  • tokenizers ==0.21.4
  • toml ==0.10.2
  • tornado ==6.5.2
  • tqdm ==4.67.1
  • transformers ==4.55.2
  • typer ==0.16.0
  • typing_extensions ==4.14.1
  • tzdata ==2025.2
  • urllib3 ==2.5.0
requirements.txt pypi
  • altair ==5.3.0
  • numpy ==1.26.4
  • pandas ==2.2.2
  • pyyaml ==6.0.1
  • rich ==13.7.1
  • scikit-learn ==1.4.2
  • streamlit ==1.36.0
  • tiktoken ==0.7.0
  • tokenizers ==0.15.2
  • tqdm ==4.66.4
  • transformers ==4.41.2
  • typer ==0.12.3