https://github.com/bernard-ng/drc-ners-nlp
A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.1%) to scientific vocabulary
Keywords
Repository
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 0
Topics
Metadata Files
README.md
A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference
Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often underperform when applied to culturally diverse African contexts due to the lack of culturally representative training data. This project introduces a comprehensive pipeline for Congolese name analysis with a large-scale dataset of over 5 million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.
Getting Started
Installation & Setup
The instructions and command-line snippets below help you set up the project environment quickly. They assume you have Python 3.11 and Git installed and are working on a Unix-like system (Linux, macOS, etc.).
Using Makefile (Recommended)
```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
make setup
make activate
```
Manual Setup
```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
python -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt

pip install --upgrade pip
pip install -r requirements.txt
pip install jupyter notebook ipykernel pytest black flake8 mypy

source .venv/bin/activate
```
Data Processing
This project includes a data processing pipeline designed to handle large datasets efficiently with batching,
checkpointing, and parallel processing.
Steps are defined in the drc-ners-nlp/processing/steps directory, and the configuration that enables them is managed through
the drc-ners-nlp/config/pipeline.yaml file.
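To make the batching and checkpointing idea concrete, here is a minimal standard-library sketch. It is not the project's implementation: the function names, the JSON checkpoint file, and the `clean_name` step are all hypothetical and only illustrate how a run can resume after an interruption.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List


def iter_batches(rows: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def run_with_checkpoints(
    rows: List[str],
    step: Callable[[str], str],
    checkpoint_path: Path,
    batch_size: int = 2,
) -> List[str]:
    """Apply step to every row, saving progress after each batch."""
    # Resume from a previous run if a checkpoint file exists.
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text(encoding="utf-8"))
    else:
        done = []

    # Rows already covered by the checkpoint are skipped.
    remaining = rows[len(done):]
    for batch in iter_batches(remaining, batch_size):
        done.extend(step(row) for row in batch)
        # Persist results so an interrupted run can pick up here.
        checkpoint_path.write_text(json.dumps(done), encoding="utf-8")
    return done


def clean_name(raw: str) -> str:
    """Hypothetical cleaning step: collapse whitespace and lowercase."""
    return " ".join(raw.split()).lower()


if __name__ == "__main__":
    names = ["  Ilunga  Mukendi", "KABONGO albert ", "Ngandu  Marie"]
    with tempfile.TemporaryDirectory() as tmp:
        print(run_with_checkpoints(names, clean_name, Path(tmp) / "ckpt.json"))
```

Writing the checkpoint once per batch rather than once per row keeps disk writes low while bounding the work lost on a crash to a single batch.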
Pipeline Configuration
```yaml
stages:
- "data_cleaning"
- "feature_extraction"
- "data_splitting"
```
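One way to picture how a `stages` list like this can drive execution is a small registry that maps stage names to functions. The sketch below is an assumption, not the project's code: `STAGES`, `stage`, and `run_pipeline` are hypothetical names, the two stage bodies are placeholders, and the list of names is passed in directly rather than read from pipeline.yaml.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical registry mapping stage names from the config to functions.
STAGES: Dict[str, Callable[[List[Record]], List[Record]]] = {}


def stage(name: str):
    """Register a function under the stage name used in the config."""
    def decorator(func):
        STAGES[name] = func
        return func
    return decorator


@stage("data_cleaning")
def data_cleaning(records: List[Record]) -> List[Record]:
    # Placeholder logic: trim whitespace and drop records with an empty name.
    cleaned = [{**r, "name": r["name"].strip()} for r in records]
    return [r for r in cleaned if r["name"]]


@stage("feature_extraction")
def feature_extraction(records: List[Record]) -> List[Record]:
    # Placeholder logic: add the number of words in the name as a feature.
    return [{**r, "words": str(len(r["name"].split()))} for r in records]


def run_pipeline(stage_names: List[str], records: List[Record]) -> List[Record]:
    """Run the enabled stages in the order they are listed."""
    for name in stage_names:
        if name not in STAGES:
            raise KeyError(f"Unknown stage: {name}")
        records = STAGES[name](records)
    return records


if __name__ == "__main__":
    enabled = ["data_cleaning", "feature_extraction"]
    print(run_pipeline(enabled, [{"name": " ilunga mukendi "}, {"name": "  "}]))
```

With this shape, enabling or disabling a step only requires editing the list of names, which matches the role the configuration file plays here.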
Running the Pipeline
```bash
python main.py --env development
```
NER Processing
This project implements a custom named entity recognition (NER) pipeline tailored for Congolese names. Its main objective is to accurately identify and tag the different components of a Congolese name, specifically distinguishing between the native part and the surname.
```bash
python ner.py --env development
```
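To illustrate what tagging the components of a name can look like, the sketch below produces character-offset spans, a common shape for NER training data. It is only an illustration: the `NATIVE` and `SURNAME` labels, the `tag_components` helper, and the assumption that the surname is already known are hypothetical and not taken from the project.

```python
from typing import List, Tuple

# (start offset, end offset, label), with end exclusive.
Span = Tuple[int, int, str]


def tag_components(full_name: str, surname: str) -> List[Span]:
    """Label each token as SURNAME or NATIVE with character offsets."""
    spans: List[Span] = []
    cursor = 0
    for token in full_name.split():
        # Search from the cursor so repeated tokens get distinct offsets.
        start = full_name.index(token, cursor)
        end = start + len(token)
        label = "SURNAME" if token.lower() == surname.lower() else "NATIVE"
        spans.append((start, end, label))
        cursor = end
    return spans


if __name__ == "__main__":
    print(tag_components("ilunga mukendi albert", "albert"))
    # → [(0, 6, 'NATIVE'), (7, 14, 'NATIVE'), (15, 21, 'SURNAME')]
```

A trained model would predict these spans instead of receiving the surname as input; the helper only shows the target annotation format.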
Once you have built and trained the NER model, you can use it to annotate CoMPOSE names in the original dataset.
Running the Pipeline with NER Annotation
```yaml
stages:
- "data_cleaning"
- "feature_extraction"
- "ner_annotation"
- "data_splitting"
```
Running the Pipeline with LLM Annotation
```yaml
stages:
- "data_cleaning"
- "feature_extraction"
- "llm_annotation"
- "data_splitting"
```
Experiments
This project provides a modular experiment framework (model training and evaluation) for systematic model comparison and
research iteration. Models are defined in the drc-ners-nlp/research/models directory.
You can define model features, training parameters, and evaluation metrics in the research_templates.yaml file.
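As an example of what a "model feature" for name-based classification might be, the sketch below counts character n-grams in a name. This is an assumption for illustration: the source does not say which features the project uses, and `char_ngrams` and its boundary markers are hypothetical.

```python
from collections import Counter


def char_ngrams(name: str, n_min: int = 2, n_max: int = 3) -> Counter:
    """Count character n-grams of a name, with ^ and $ marking its edges."""
    padded = f"^{name.strip().lower()}$"
    grams: Counter = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the padded name.
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


if __name__ == "__main__":
    print(char_ngrams("Mukendi", n_min=2, n_max=2))
```

The boundary markers let a classifier tell a prefix or suffix apart from the same letters in the middle of a name, which is often informative for names.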
Running Experiments
```bash
python train.py --name="bigru" --type="baseline" --env="development"
python train.py --name="cnn" --type="baseline" --env="development"
python train.py --name="lightgbm" --type="baseline" --env="development"

python train.py --name="logisticregressionfullname" --type="baseline" --env="development"
python train.py --name="logisticregressionnative" --type="baseline" --env="development"
python train.py --name="logisticregressionsurname" --type="baseline" --env="development"

python train.py --name="lstm" --type="baseline" --env="development"
python train.py --name="randomforest" --type="baseline" --env="development"
python train.py --name="svm" --type="baseline" --env="development"
python train.py --name="naivebayes" --type="baseline" --env="development"
python train.py --name="transformer" --type="baseline" --env="development"
python train.py --name="xgboost" --type="baseline" --env="development"
```
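Because the commands above differ only in the model name, they can be generated in a loop. The helper below is a hypothetical convenience, not part of the repository: it builds the same argument lists from the names shown above and prints them without running anything.

```python
import shlex
from typing import List

# Model names copied from the commands listed above.
MODEL_NAMES = [
    "bigru", "cnn", "lightgbm",
    "logisticregressionfullname", "logisticregressionnative",
    "logisticregressionsurname",
    "lstm", "randomforest", "svm", "naivebayes", "transformer", "xgboost",
]


def build_command(
    name: str, model_type: str = "baseline", env: str = "development"
) -> List[str]:
    """Return the argument list for one train.py invocation."""
    return [
        "python", "train.py",
        f"--name={name}", f"--type={model_type}", f"--env={env}",
    ]


if __name__ == "__main__":
    for model in MODEL_NAMES:
        # Print only; pass the list to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(build_command(model)))
```

Keeping each command as a list of arguments rather than a single string avoids shell quoting problems if a name ever contains spaces.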
Web Interface
This project includes a user-friendly web interface built with Streamlit, allowing non-technical users to run experiments and make predictions without needing to understand the underlying code.
Running the Web Interface
```bash
streamlit run web/app.py
```
Contributors
Owner
- Name: Bernard Ngandu
- Login: bernard-ng
- Kind: user
- Location: Lubumbashi RDC
- Company: @devscast
- Website: https://devscast.tech
- Twitter: BernardNgandu
- Repositories: 7
- Profile: https://github.com/bernard-ng
Building a community of skilled developers: @devscast
GitHub Events
Total
- Push event: 9
- Public event: 1
- Pull request review event: 3
- Pull request review comment event: 4
- Pull request event: 1
Last Year
- Push event: 9
- Public event: 1
- Pull request review event: 3
- Pull request review comment event: 4
- Pull request event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 9
- Average time to close issues: N/A
- Average time to close pull requests: 1 day
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.67
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 9
- Average time to close issues: N/A
- Average time to close pull requests: 1 day
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.67
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- 1Cansa (9)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- GitPython ==3.1.45
- Jinja2 ==3.1.6
- Markdown ==3.8.2
- MarkupSafe ==3.0.2
- PyYAML *
- PyYAML ==6.0.2
- Pygments ==2.19.1
- Send2Trash ==1.8.3
- Werkzeug ==3.1.3
- absl-py ==2.3.0
- altair ==5.1.2
- annotated-types ==0.7.0
- anyio ==4.9.0
- appnope ==0.1.4
- argon2-cffi ==25.1.0
- argon2-cffi-bindings ==21.2.0
- arrow ==1.3.0
- asttokens ==3.0.0
- astunparse ==1.6.3
- async-lru ==2.0.5
- attrs ==25.3.0
- babel ==2.17.0
- beautifulsoup4 ==4.13.4
- black ==25.1.0
- bleach ==6.2.0
- blinker ==1.9.0
- cachetools ==6.1.0
- certifi ==2025.6.15
- cffi ==1.17.1
- charset-normalizer ==3.4.2
- click ==8.2.1
- comm ==0.2.2
- contourpy ==1.3.2
- cycler ==0.12.1
- debugpy ==1.8.14
- decorator ==5.2.1
- defusedxml ==0.7.1
- executing ==2.2.0
- fastjsonschema ==2.21.1
- flake8 ==7.3.0
- flatbuffers ==25.2.10
- fonttools ==4.58.4
- fqdn ==1.5.1
- gast ==0.6.0
- gitdb ==4.0.12
- google-pasta ==0.2.0
- grpcio ==1.73.0
- h11 ==0.16.0
- h5py ==3.14.0
- httpcore ==1.0.9
- httpx ==0.28.1
- idna ==3.10
- imbalanced-learn ==0.13.0
- ipykernel ==6.29.5
- ipython ==9.4.0
- ipython_pygments_lexers ==1.1.1
- isoduration ==20.11.0
- jedi ==0.19.2
- joblib ==1.5.1
- json5 ==0.12.0
- jsonpointer ==3.0.0
- jsonschema ==4.24.0
- jsonschema-specifications ==2025.4.1
- jupyter-events ==0.12.0
- jupyter-lsp ==2.2.5
- jupyter_client ==8.6.3
- jupyter_core ==5.8.1
- jupyter_server ==2.16.0
- jupyter_server_terminals ==0.5.3
- jupyterlab ==4.4.4
- jupyterlab_pygments ==0.3.0
- jupyterlab_server ==2.27.3
- keras ==3.10.0
- kiwisolver ==1.4.8
- libclang ==18.1.1
- lightgbm *
- lightgbm ==4.6.0
- markdown-it-py ==3.0.0
- matplotlib ==3.10.3
- matplotlib-inline ==0.1.7
- mccabe ==0.7.0
- mdurl ==0.1.2
- mistune ==3.1.3
- ml-dtypes ==0.3.2
- mypy ==1.17.0
- mypy_extensions ==1.1.0
- namex ==0.1.0
- narwhals ==2.0.1
- nbclient ==0.10.2
- nbconvert ==7.16.6
- nbformat ==5.10.4
- nest-asyncio ==1.6.0
- nltk ==3.9.1
- notebook ==7.4.4
- notebook_shim ==0.2.4
- numpy ==1.26.4
- ollama ==0.5.1
- ollama *
- opt_einsum ==3.4.0
- optree ==0.16.0
- overrides ==7.7.0
- packaging ==25.0
- pandas ==2.3.0
- pandocfilters ==1.5.1
- parso ==0.8.4
- pathspec ==0.12.1
- pexpect ==4.9.0
- pillow ==11.2.1
- platformdirs ==4.3.8
- plotly *
- plotly ==6.2.0
- prometheus_client ==0.22.1
- prompt_toolkit ==3.0.51
- protobuf ==4.25.8
- psutil ==7.0.0
- ptyprocess ==0.7.0
- pure_eval ==0.2.3
- pyarrow ==21.0.0
- pycodestyle ==2.14.0
- pycparser ==2.22
- pydantic *
- pydantic ==2.11.7
- pydantic_core ==2.33.2
- pydeck ==0.9.1
- pyflakes ==3.4.0
- pyparsing ==3.2.3
- python-dateutil ==2.9.0.post0
- python-json-logger ==3.3.0
- pytz ==2025.2
- pyzmq ==27.0.0
- referencing ==0.36.2
- regex ==2024.11.6
- requests ==2.32.4
- rfc3339-validator ==0.1.4
- rfc3986-validator ==0.1.1
- rich ==14.0.0
- rpds-py ==0.26.0
- scikit-learn ==1.6.1
- scikit-learn *
- scipy ==1.15.3
- seaborn ==0.13.2
- six ==1.17.0
- sklearn-compat ==0.1.3
- smmap ==5.0.2
- sniffio ==1.3.1
- soupsieve ==2.7
- spacy *
- stack-data ==0.6.3
- streamlit *
- streamlit ==1.47.1
- tenacity ==9.1.2
- tensorboard ==2.16.2
- tensorboard-data-server ==0.7.2
- tensorflow ==2.16.2
- tensorflow-io-gcs-filesystem ==0.37.1
- termcolor ==3.1.0
- terminado ==0.18.1
- threadpoolctl ==3.6.0
- tinycss2 ==1.4.0
- toml ==0.10.2
- toolz ==1.0.0
- tornado ==6.5.1
- tqdm ==4.67.1
- traitlets ==5.14.3
- types-PyYAML ==6.0.12.20250516
- types-python-dateutil ==2.9.0.20250516
- typing-inspection ==0.4.1
- typing_extensions ==4.14.0
- tzdata ==2025.2
- uri-template ==1.3.0
- urllib3 ==2.5.0
- wcwidth ==0.2.13
- webcolors ==24.11.1
- webencodings ==0.5.1
- websocket-client ==1.8.0
- wrapt ==1.17.2
- xgboost ==3.0.3
- xgboost *