https://github.com/bernard-ng/drc-ners-nlp

A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found)
✓ .zenodo.json file (found)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (14.1%) to scientific vocabulary

Keywords

cultural-sociology gender-detection ner nlp

Last synced: 5 months ago

Repository

A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference

Basic Info

Host: GitHub
Owner: bernard-ng
Language: Python
Default Branch: main
Homepage:
Size: 4.79 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 0

Topics

cultural-sociology gender-detection ner nlp

Created 11 months ago · Last pushed 6 months ago

Metadata Files

Readme

A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference

Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training data. This project introduces a comprehensive pipeline for Congolese name analysis with a large-scale dataset of over 5 million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.

Getting Started

Installation & Setup

The instructions and command-line snippets below help you set up the project environment quickly. They assume that Python 3.11 and Git are installed and that you are working on a Unix-like system (Linux, macOS, etc.).

Using Makefile (Recommended)

```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
make setup
make activate
```

Manual Setup

```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
python -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt

pip install --upgrade pip
pip install -r requirements.txt
pip install jupyter notebook ipykernel pytest black flake8 mypy

source .venv/bin/activate
```

Data Processing

This project includes a data processing pipeline designed to handle large datasets efficiently, with batching, checkpointing, and parallel processing. Steps are defined in the drc-ners-nlp/processing/steps directory, and the configuration that enables them is managed through the drc-ners-nlp/config/pipeline.yaml file.

Pipeline Configuration

```yaml
stages:
  - "data_cleaning"
  - "feature_extraction"
  - "data_splitting"
```
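As a rough illustration of how ordered stages, batching, and checkpointing could fit together, the sketch below runs a list of stage functions over fixed-size batches and records each finished batch in a JSON file so that an interrupted run can resume. It is not the project's implementation: the `STAGES` registry, `run_pipeline`, the checkpoint layout, the placeholder stage bodies, and the sample names are all hypothetical, and parallel processing is left out.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]


def data_cleaning(batch: List[Record]) -> List[Record]:
    # Placeholder stage: collapse whitespace and lowercase each name.
    return [{**r, "name": " ".join(r["name"].split()).lower()} for r in batch]


def feature_extraction(batch: List[Record]) -> List[Record]:
    # Placeholder stage: store the number of words in the name.
    return [{**r, "words": str(len(r["name"].split()))} for r in batch]


# Hypothetical registry mapping stage names from the YAML file to functions.
STAGES: Dict[str, Callable[[List[Record]], List[Record]]] = {
    "data_cleaning": data_cleaning,
    "feature_extraction": feature_extraction,
}


def batches(rows: List[Record], size: int) -> Iterator[List[Record]]:
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def run_pipeline(rows: List[Record], stage_names: List[str],
                 checkpoint: Path, batch_size: int = 2) -> List[Record]:
    # Load the outputs of batches finished by a previous run, if any.
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for index, batch in enumerate(batches(rows, batch_size)):
        key = str(index)
        if key in done:
            continue  # already processed; skipped on resume
        for name in stage_names:
            batch = STAGES[name](batch)
        done[key] = batch
        checkpoint.write_text(json.dumps(done))  # persist after every batch
    # Flatten the batches back into their original order.
    return [row for key in sorted(done, key=int) for row in done[key]]


# Made-up sample rows, used only to exercise the sketch.
rows = [{"name": "  Ilunga  Kabongo "}, {"name": "Mwamba Ngoy Marie"}, {"name": "Kasongo"}]
with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline(rows, ["data_cleaning", "feature_extraction"],
                          Path(tmp) / "checkpoint.json")
```

Writing the checkpoint after every batch trades some I/O for the ability to restart close to where a failure happened; a larger batch size would reduce that overhead.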

Running the Pipeline

```bash
python main.py --env development
```

NER Processing

This project implements a custom named entity recognition (NER) pipeline tailored for Congolese names. Its main objective is to accurately identify and tag the different components of a Congolese name, specifically distinguishing between the native part and the surname.

```bash
python ner.py --env development
```
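To make the tagging task concrete, the sketch below builds character-offset annotations for a full name whose native part and surname are already known, using the `(text, {"entities": [(start, end, label)]})` shape commonly used for spaCy training data. The label names `NATIVE` and `SURNAME`, the `annotate` helper, the assumption that the native part precedes the surname, and the sample name are all illustrative; the project's own labels and data format may differ.

```python
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str]


def annotate(full_name: str, native: str, surname: str) -> Tuple[str, Dict[str, List[Entity]]]:
    """Return a spaCy-style training example with character spans per name part."""
    entities: List[Entity] = []
    cursor = 0
    # Assumption: the native part comes first, followed by the surname.
    for part, label in ((native, "NATIVE"), (surname, "SURNAME")):
        if not part:
            continue  # a name may lack one of the components
        start = full_name.find(part, cursor)
        if start == -1:
            raise ValueError(f"{part!r} not found in {full_name!r}")
        end = start + len(part)
        entities.append((start, end, label))
        cursor = end  # look for the next part after this one
    return full_name, {"entities": entities}


# Made-up sample name, used only to exercise the sketch.
example = annotate("Ilunga Kabongo Marie", native="Ilunga Kabongo", surname="Marie")
```

Here `example` holds the original text together with one span per component, which is the kind of supervision a span-based NER model needs.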

Once you've built and trained the NER model, you can use it to annotate COMPOSE names in the original dataset.

Running the Pipeline with NER Annotation

```yaml
stages:
  - "data_cleaning"
  - "feature_extraction"
  - "ner_annotation"
  - "data_splitting"
```

Running the Pipeline with LLM Annotation

```yaml
stages:
  - "data_cleaning"
  - "feature_extraction"
  - "llm_annotation"
  - "data_splitting"
```

Experiments

This project provides a modular experiment framework (model training and evaluation) for systematic model comparison and research iteration. Models are defined in the drc-ners-nlp/research/models directory. You can define model features, training parameters, and evaluation metrics in the research_templates.yaml file.

Running Experiments

```bash
python train.py --name="bigru" --type="baseline" --env="development"
python train.py --name="cnn" --type="baseline" --env="development"
python train.py --name="lightgbm" --type="baseline" --env="development"

python train.py --name="logisticregressionfullname" --type="baseline" --env="development"
python train.py --name="logisticregressionnative" --type="baseline" --env="development"
python train.py --name="logisticregressionsurname" --type="baseline" --env="development"

python train.py --name="lstm" --type="baseline" --env="development"
python train.py --name="randomforest" --type="baseline" --env="development"
python train.py --name="svm" --type="baseline" --env="development"
python train.py --name="naivebayes" --type="baseline" --env="development"
python train.py --name="transformer" --type="baseline" --env="development"
python train.py --name="xgboost" --type="baseline" --env="development"
```
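Since the commands above differ only in the model name, a small helper script could loop over them. The sketch below is one possible way to do that and is not part of the repository: `MODEL_NAMES`, `build_command`, and `run_all` are hypothetical helpers, and only the `--name`, `--type`, and `--env` flags shown above are used. It defaults to a dry run that prints each command instead of launching training.

```python
import subprocess
import sys
from typing import List

# Model names copied from the commands listed above.
MODEL_NAMES = [
    "bigru", "cnn", "lightgbm",
    "logisticregressionfullname", "logisticregressionnative", "logisticregressionsurname",
    "lstm", "randomforest", "svm", "naivebayes", "transformer", "xgboost",
]


def build_command(name: str, env: str = "development") -> List[str]:
    # sys.executable keeps the run inside the currently active interpreter.
    return [sys.executable, "train.py", f"--name={name}", "--type=baseline", f"--env={env}"]


def run_all(dry_run: bool = True) -> List[List[str]]:
    commands = [build_command(name) for name in MODEL_NAMES]
    for command in commands:
        if dry_run:
            print(" ".join(command))  # show what would be executed
        else:
            subprocess.run(command, check=True)  # stop at the first failing run
    return commands


commands = run_all(dry_run=True)
```

Calling `run_all(dry_run=False)` from the repository root would execute the baselines one after another and stop at the first failure.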

Web Interface

This project includes a user-friendly web interface built with Streamlit, allowing non-technical users to run experiments and make predictions without needing to understand the underlying code.

Running the Web Interface

```bash
streamlit run web/app.py
```

Contributors

Owner

Name: Bernard Ngandu
Login: bernard-ng
Kind: user
Location: Lubumbashi RDC
Company: @devscast

Website: https://devscast.tech
Twitter: BernardNgandu
Repositories: 7
Profile: https://github.com/bernard-ng

Building a community of skilled developers : @devscast

GitHub Events

Total

Push event: 9
Public event: 1
Pull request review event: 3
Pull request review comment event: 4
Pull request event: 1

Last Year

Push event: 9
Public event: 1
Pull request review event: 3
Pull request review comment event: 4
Pull request event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 0
Total pull requests: 9
Average time to close issues: N/A
Average time to close pull requests: 1 day
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.67
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 9
Average time to close issues: N/A
Average time to close pull requests: 1 day
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.67
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Pull Request Authors

1Cansa (9)

Top Labels

Issue Labels

Pull Request Labels

enhancement (1)
good first issue (1)

Dependencies

requirements.txt pypi

GitPython ==3.1.45
Jinja2 ==3.1.6
Markdown ==3.8.2
MarkupSafe ==3.0.2
PyYAML *
PyYAML ==6.0.2
Pygments ==2.19.1
Send2Trash ==1.8.3
Werkzeug ==3.1.3
absl-py ==2.3.0
altair ==5.1.2
annotated-types ==0.7.0
anyio ==4.9.0
appnope ==0.1.4
argon2-cffi ==25.1.0
argon2-cffi-bindings ==21.2.0
arrow ==1.3.0
asttokens ==3.0.0
astunparse ==1.6.3
async-lru ==2.0.5
attrs ==25.3.0
babel ==2.17.0
beautifulsoup4 ==4.13.4
black ==25.1.0
bleach ==6.2.0
blinker ==1.9.0
cachetools ==6.1.0
certifi ==2025.6.15
cffi ==1.17.1
charset-normalizer ==3.4.2
click ==8.2.1
comm ==0.2.2
contourpy ==1.3.2
cycler ==0.12.1
debugpy ==1.8.14
decorator ==5.2.1
defusedxml ==0.7.1
executing ==2.2.0
fastjsonschema ==2.21.1
flake8 ==7.3.0
flatbuffers ==25.2.10
fonttools ==4.58.4
fqdn ==1.5.1
gast ==0.6.0
gitdb ==4.0.12
google-pasta ==0.2.0
grpcio ==1.73.0
h11 ==0.16.0
h5py ==3.14.0
httpcore ==1.0.9
httpx ==0.28.1
idna ==3.10
imbalanced-learn ==0.13.0
ipykernel ==6.29.5
ipython ==9.4.0
ipython_pygments_lexers ==1.1.1
isoduration ==20.11.0
jedi ==0.19.2
joblib ==1.5.1
json5 ==0.12.0
jsonpointer ==3.0.0
jsonschema ==4.24.0
jsonschema-specifications ==2025.4.1
jupyter-events ==0.12.0
jupyter-lsp ==2.2.5
jupyter_client ==8.6.3
jupyter_core ==5.8.1
jupyter_server ==2.16.0
jupyter_server_terminals ==0.5.3
jupyterlab ==4.4.4
jupyterlab_pygments ==0.3.0
jupyterlab_server ==2.27.3
keras ==3.10.0
kiwisolver ==1.4.8
libclang ==18.1.1
lightgbm *
lightgbm ==4.6.0
markdown-it-py ==3.0.0
matplotlib ==3.10.3
matplotlib-inline ==0.1.7
mccabe ==0.7.0
mdurl ==0.1.2
mistune ==3.1.3
ml-dtypes ==0.3.2
mypy ==1.17.0
mypy_extensions ==1.1.0
namex ==0.1.0
narwhals ==2.0.1
nbclient ==0.10.2
nbconvert ==7.16.6
nbformat ==5.10.4
nest-asyncio ==1.6.0
nltk ==3.9.1
notebook ==7.4.4
notebook_shim ==0.2.4
numpy ==1.26.4
ollama ==0.5.1
ollama *
opt_einsum ==3.4.0
optree ==0.16.0
overrides ==7.7.0
packaging ==25.0
pandas ==2.3.0
pandocfilters ==1.5.1
parso ==0.8.4
pathspec ==0.12.1
pexpect ==4.9.0
pillow ==11.2.1
platformdirs ==4.3.8
plotly *
plotly ==6.2.0
prometheus_client ==0.22.1
prompt_toolkit ==3.0.51
protobuf ==4.25.8
psutil ==7.0.0
ptyprocess ==0.7.0
pure_eval ==0.2.3
pyarrow ==21.0.0
pycodestyle ==2.14.0
pycparser ==2.22
pydantic *
pydantic ==2.11.7
pydantic_core ==2.33.2
pydeck ==0.9.1
pyflakes ==3.4.0
pyparsing ==3.2.3
python-dateutil ==2.9.0.post0
python-json-logger ==3.3.0
pytz ==2025.2
pyzmq ==27.0.0
referencing ==0.36.2
regex ==2024.11.6
requests ==2.32.4
rfc3339-validator ==0.1.4
rfc3986-validator ==0.1.1
rich ==14.0.0
rpds-py ==0.26.0
scikit-learn ==1.6.1
scikit-learn *
scipy ==1.15.3
seaborn ==0.13.2
six ==1.17.0
sklearn-compat ==0.1.3
smmap ==5.0.2
sniffio ==1.3.1
soupsieve ==2.7
spacy *
stack-data ==0.6.3
streamlit *
streamlit ==1.47.1
tenacity ==9.1.2
tensorboard ==2.16.2
tensorboard-data-server ==0.7.2
tensorflow ==2.16.2
tensorflow-io-gcs-filesystem ==0.37.1
termcolor ==3.1.0
terminado ==0.18.1
threadpoolctl ==3.6.0
tinycss2 ==1.4.0
toml ==0.10.2
toolz ==1.0.0
tornado ==6.5.1
tqdm ==4.67.1
traitlets ==5.14.3
types-PyYAML ==6.0.12.20250516
types-python-dateutil ==2.9.0.20250516
typing-inspection ==0.4.1
typing_extensions ==4.14.0
tzdata ==2025.2
uri-template ==1.3.0
urllib3 ==2.5.0
wcwidth ==0.2.13
webcolors ==24.11.1
webencodings ==0.5.1
websocket-client ==1.8.0
wrapt ==1.17.2
xgboost ==3.0.3
xgboost *
