https://github.com/bernard-ng/drc-ners-nlp

A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary

Keywords

cultural-sociology gender-detection ner nlp
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bernard-ng
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 4.79 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Topics
cultural-sociology gender-detection ner nlp
Created 11 months ago · Last pushed 6 months ago
Metadata Files
Readme

README.md

A Culturally-Aware NLP System for Congolese Name Analysis and Gender Inference

Despite the growing success of gender inference models in Natural Language Processing (NLP), these tools often underperform when applied to culturally diverse African contexts due to the lack of culturally-representative training data. This project introduces a comprehensive pipeline for Congolese name analysis with a large-scale dataset of over 5 million names from the Democratic Republic of Congo (DRC) annotated with gender and demographic metadata.

Getting Started

Installation & Setup

The instructions and command-line snippets below help you set up the project environment quickly. They assume that Python 3.11 and Git are installed and that you are working on a Unix-like system (Linux, macOS, etc.).

Using Makefile (Recommended)

```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment
make setup
make activate
```

Manual Setup

```bash
git clone https://github.com/bernard-ng/drc-ners-nlp.git
cd drc-ners-nlp

# Setup environment: create the virtual environment and activate it first,
# so that every pip command below installs into .venv
python -m venv .venv
source .venv/bin/activate

# Install the project dependencies
pip install --upgrade pip
pip install -r requirements.txt

# Install notebook, testing, and linting tools
pip install jupyter notebook ipykernel pytest black flake8 mypy
```

Data Processing

This project includes a data processing pipeline designed to handle large datasets efficiently, with batching, checkpointing, and parallel processing. Steps are defined in the drc-ners-nlp/processing/steps directory, and the configuration that enables them is managed through the drc-ners-nlp/config/pipeline.yaml file.
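To make the batching and checkpointing idea concrete, here is a minimal standard-library sketch. It is not the project's implementation: the names `batched`, `run_step`, and `clean_names`, the JSON checkpoint layout, and the cleaning rule are all assumptions made for illustration, and parallel processing is left out.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def batched(rows: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `size` rows."""
    batch: List = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_step(rows: Iterable, step: Callable[[List], List],
             batch_size: int, checkpoint: Path) -> List:
    """Apply `step` batch by batch, skipping batches recorded in `checkpoint`."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    else:
        state = {"done": 0, "output": []}
    for index, batch in enumerate(batched(rows, batch_size)):
        if index < state["done"]:
            continue  # this batch was finished in a previous run
        state["output"].extend(step(batch))
        state["done"] = index + 1
        # Persist progress after every batch so an interrupted run can resume.
        checkpoint.write_text(json.dumps(state))
    return state["output"]


def clean_names(batch: List[str]) -> List[str]:
    # Hypothetical cleaning rule: collapse whitespace and uppercase.
    return [" ".join(name.split()).upper() for name in batch]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "data_cleaning.json"
        names = ["  ilunga  mbuyi jean ", "kabongo mutombo", "ngoy  kasongo marie"]
        print(run_step(names, clean_names, batch_size=2, checkpoint=ckpt))
```

Keeping the step output inside the checkpoint file only works for toy inputs. With millions of rows, a real pipeline would more plausibly write each batch to disk and record only the batch index.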

Pipeline Configuration

```yaml
stages:
  - "data_cleaning"
  - "feature_extraction"
  - "data_splitting"
```
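One way a `stages` list like this could drive execution is a registry that maps each stage name to a function. The sketch below assumes the YAML has already been parsed into a plain list of names. The registry, the `stage` decorator, and the three toy stage bodies are hypothetical and are not taken from the repository.

```python
from typing import Callable, Dict, List

Record = dict
Stage = Callable[[List[Record]], List[Record]]

# Hypothetical registry: stage name -> function implementing that stage.
STAGES: Dict[str, Stage] = {}


def stage(name: str) -> Callable[[Stage], Stage]:
    """Register a function under the name used in the configuration file."""
    def register(func: Stage) -> Stage:
        STAGES[name] = func
        return func
    return register


@stage("data_cleaning")
def data_cleaning(records: List[Record]) -> List[Record]:
    # Drop blank names, collapse whitespace, lowercase.
    return [{**r, "name": " ".join(r["name"].split()).lower()}
            for r in records if r["name"].strip()]


@stage("feature_extraction")
def feature_extraction(records: List[Record]) -> List[Record]:
    # Toy feature: number of words in the name.
    return [{**r, "words": len(r["name"].split())} for r in records]


@stage("data_splitting")
def data_splitting(records: List[Record]) -> List[Record]:
    # Toy split: every fifth record goes to the evaluation set.
    return [{**r, "split": "eval" if i % 5 == 4 else "train"}
            for i, r in enumerate(records)]


def run_pipeline(stage_names: List[str], records: List[Record]) -> List[Record]:
    """Run the enabled stages in the order in which they are listed."""
    unknown = [name for name in stage_names if name not in STAGES]
    if unknown:
        raise ValueError(f"Unknown stages: {unknown}")
    for name in stage_names:
        records = STAGES[name](records)
    return records


if __name__ == "__main__":
    enabled = ["data_cleaning", "feature_extraction", "data_splitting"]
    print(run_pipeline(enabled, [{"name": "  Ilunga   Mbuyi Jean "}, {"name": "   "}]))
```

Validating the names before running anything means a typo in the configuration fails immediately, not partway through a long job.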

Running the Pipeline

```bash
python main.py --env development
```

NER Processing

This project implements a custom named entity recognition (NER) pipeline tailored for Congolese names. Its main objective is to accurately identify and tag the different components of a Congolese name, specifically distinguishing between the native part and the surname.

```bash
python ner.py --env development
```
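To illustrate what tagging the components of a name can look like as data, the sketch below builds character-offset annotations in the common `(start, end, label)` form that many NER toolkits accept. It rests on a strong simplifying assumption that is not taken from the project: the last token is treated as the surname and everything before it as the native part. The labels `NATIVE` and `SURNAME` and the function `annotate` are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]


def annotate(full_name: str) -> Tuple[str, Dict[str, List[Span]]]:
    """Return the normalized name and its character-offset entity spans.

    Assumption for illustration only: the final token is the surname and
    all preceding tokens form the native part.
    """
    text = " ".join(full_name.split())  # collapse repeated whitespace
    tokens = text.split(" ")
    if len(tokens) < 2:
        # Empty or single-token names: no surname to separate out.
        spans: List[Span] = [(0, len(text), "NATIVE")] if text else []
        return text, {"entities": spans}
    surname_start = len(text) - len(tokens[-1])
    native_end = surname_start - 1  # exclude the separating space
    return text, {
        "entities": [
            (0, native_end, "NATIVE"),
            (surname_start, len(text), "SURNAME"),
        ]
    }


if __name__ == "__main__":
    text, annotations = annotate("ilunga  mbuyi jean")
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

A positional rule like this is exactly what a trained model is meant to improve on, since real names do not always place the surname last.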

Once you have built and trained the NER model, you can use it to annotate CoMPOSE names in the original dataset.

Running the Pipeline with NER Annotation

```yaml
stages:
  - "data_cleaning"
  - "feature_extraction"
  - "ner_annotation"
  - "data_splitting"
```

Running the Pipeline with LLM Annotation

```yaml
stages:
  - "data_cleaning"
  - "feature_extraction"
  - "llm_annotation"
  - "data_splitting"
```

Experiments

This project provides a modular experiment framework (model training and evaluation) for systematic model comparison and research iteration. Models are defined in the drc-ners-nlp/research/models directory. You can define model features, training parameters, and evaluation metrics in the research_templates.yaml file.

Running Experiments

```bash
python train.py --name="bigru" --type="baseline" --env="development"
python train.py --name="cnn" --type="baseline" --env="development"
python train.py --name="lightgbm" --type="baseline" --env="development"

python train.py --name="logisticregressionfullname" --type="baseline" --env="development"
python train.py --name="logisticregressionnative" --type="baseline" --env="development"
python train.py --name="logisticregressionsurname" --type="baseline" --env="development"

python train.py --name="lstm" --type="baseline" --env="development"
python train.py --name="randomforest" --type="baseline" --env="development"
python train.py --name="svm" --type="baseline" --env="development"
python train.py --name="naivebayes" --type="baseline" --env="development"
python train.py --name="transformer" --type="baseline" --env="development"
python train.py --name="xgboost" --type="baseline" --env="development"
```
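Since the commands above differ only in the model name, they can be generated from a list. The following is a small convenience sketch, not part of the repository: `BASELINES` and `build_command` are hypothetical names, and the script only prints the commands unless you uncomment the `subprocess.run` line.

```python
import shlex
import sys
from typing import List

# Model names copied from the train.py commands listed above.
BASELINES = [
    "bigru", "cnn", "lightgbm",
    "logisticregressionfullname", "logisticregressionnative",
    "logisticregressionsurname",
    "lstm", "randomforest", "svm", "naivebayes", "transformer", "xgboost",
]


def build_command(name: str, experiment_type: str = "baseline",
                  env: str = "development",
                  python: str = sys.executable) -> List[str]:
    """Build the argument list for one train.py invocation."""
    return [python, "train.py", f"--name={name}",
            f"--type={experiment_type}", f"--env={env}"]


if __name__ == "__main__":
    for model in BASELINES:
        print(shlex.join(build_command(model, python="python")))
        # To actually launch the run from the repository root:
        # import subprocess; subprocess.run(build_command(model), check=True)
```

Passing an argument list to `subprocess.run`, as opposed to a single shell string, avoids quoting problems if a model name ever contains special characters.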

Web Interface

This project includes a user-friendly web interface built with Streamlit, allowing non-technical users to run experiments and make predictions without needing to understand the underlying code.

Running the Web Interface

```bash
streamlit run web/app.py
```

Owner

  • Name: Bernard Ngandu
  • Login: bernard-ng
  • Kind: user
  • Location: Lubumbashi RDC
  • Company: @devscast

Building a community of skilled developers : @devscast

GitHub Events

Total
  • Push event: 9
  • Public event: 1
  • Pull request review event: 3
  • Pull request review comment event: 4
  • Pull request event: 1
Last Year
  • Push event: 9
  • Public event: 1
  • Pull request review event: 3
  • Pull request review comment event: 4
  • Pull request event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 9
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.67
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 9
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.67
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • 1Cansa (9)
Top Labels
Issue Labels
Pull Request Labels
enhancement (1) good first issue (1)

Dependencies

requirements.txt pypi
  • GitPython ==3.1.45
  • Jinja2 ==3.1.6
  • Markdown ==3.8.2
  • MarkupSafe ==3.0.2
  • PyYAML *
  • PyYAML ==6.0.2
  • Pygments ==2.19.1
  • Send2Trash ==1.8.3
  • Werkzeug ==3.1.3
  • absl-py ==2.3.0
  • altair ==5.1.2
  • annotated-types ==0.7.0
  • anyio ==4.9.0
  • appnope ==0.1.4
  • argon2-cffi ==25.1.0
  • argon2-cffi-bindings ==21.2.0
  • arrow ==1.3.0
  • asttokens ==3.0.0
  • astunparse ==1.6.3
  • async-lru ==2.0.5
  • attrs ==25.3.0
  • babel ==2.17.0
  • beautifulsoup4 ==4.13.4
  • black ==25.1.0
  • bleach ==6.2.0
  • blinker ==1.9.0
  • cachetools ==6.1.0
  • certifi ==2025.6.15
  • cffi ==1.17.1
  • charset-normalizer ==3.4.2
  • click ==8.2.1
  • comm ==0.2.2
  • contourpy ==1.3.2
  • cycler ==0.12.1
  • debugpy ==1.8.14
  • decorator ==5.2.1
  • defusedxml ==0.7.1
  • executing ==2.2.0
  • fastjsonschema ==2.21.1
  • flake8 ==7.3.0
  • flatbuffers ==25.2.10
  • fonttools ==4.58.4
  • fqdn ==1.5.1
  • gast ==0.6.0
  • gitdb ==4.0.12
  • google-pasta ==0.2.0
  • grpcio ==1.73.0
  • h11 ==0.16.0
  • h5py ==3.14.0
  • httpcore ==1.0.9
  • httpx ==0.28.1
  • idna ==3.10
  • imbalanced-learn ==0.13.0
  • ipykernel ==6.29.5
  • ipython ==9.4.0
  • ipython_pygments_lexers ==1.1.1
  • isoduration ==20.11.0
  • jedi ==0.19.2
  • joblib ==1.5.1
  • json5 ==0.12.0
  • jsonpointer ==3.0.0
  • jsonschema ==4.24.0
  • jsonschema-specifications ==2025.4.1
  • jupyter-events ==0.12.0
  • jupyter-lsp ==2.2.5
  • jupyter_client ==8.6.3
  • jupyter_core ==5.8.1
  • jupyter_server ==2.16.0
  • jupyter_server_terminals ==0.5.3
  • jupyterlab ==4.4.4
  • jupyterlab_pygments ==0.3.0
  • jupyterlab_server ==2.27.3
  • keras ==3.10.0
  • kiwisolver ==1.4.8
  • libclang ==18.1.1
  • lightgbm *
  • lightgbm ==4.6.0
  • markdown-it-py ==3.0.0
  • matplotlib ==3.10.3
  • matplotlib-inline ==0.1.7
  • mccabe ==0.7.0
  • mdurl ==0.1.2
  • mistune ==3.1.3
  • ml-dtypes ==0.3.2
  • mypy ==1.17.0
  • mypy_extensions ==1.1.0
  • namex ==0.1.0
  • narwhals ==2.0.1
  • nbclient ==0.10.2
  • nbconvert ==7.16.6
  • nbformat ==5.10.4
  • nest-asyncio ==1.6.0
  • nltk ==3.9.1
  • notebook ==7.4.4
  • notebook_shim ==0.2.4
  • numpy ==1.26.4
  • ollama ==0.5.1
  • ollama *
  • opt_einsum ==3.4.0
  • optree ==0.16.0
  • overrides ==7.7.0
  • packaging ==25.0
  • pandas ==2.3.0
  • pandocfilters ==1.5.1
  • parso ==0.8.4
  • pathspec ==0.12.1
  • pexpect ==4.9.0
  • pillow ==11.2.1
  • platformdirs ==4.3.8
  • plotly *
  • plotly ==6.2.0
  • prometheus_client ==0.22.1
  • prompt_toolkit ==3.0.51
  • protobuf ==4.25.8
  • psutil ==7.0.0
  • ptyprocess ==0.7.0
  • pure_eval ==0.2.3
  • pyarrow ==21.0.0
  • pycodestyle ==2.14.0
  • pycparser ==2.22
  • pydantic *
  • pydantic ==2.11.7
  • pydantic_core ==2.33.2
  • pydeck ==0.9.1
  • pyflakes ==3.4.0
  • pyparsing ==3.2.3
  • python-dateutil ==2.9.0.post0
  • python-json-logger ==3.3.0
  • pytz ==2025.2
  • pyzmq ==27.0.0
  • referencing ==0.36.2
  • regex ==2024.11.6
  • requests ==2.32.4
  • rfc3339-validator ==0.1.4
  • rfc3986-validator ==0.1.1
  • rich ==14.0.0
  • rpds-py ==0.26.0
  • scikit-learn ==1.6.1
  • scikit-learn *
  • scipy ==1.15.3
  • seaborn ==0.13.2
  • six ==1.17.0
  • sklearn-compat ==0.1.3
  • smmap ==5.0.2
  • sniffio ==1.3.1
  • soupsieve ==2.7
  • spacy *
  • stack-data ==0.6.3
  • streamlit *
  • streamlit ==1.47.1
  • tenacity ==9.1.2
  • tensorboard ==2.16.2
  • tensorboard-data-server ==0.7.2
  • tensorflow ==2.16.2
  • tensorflow-io-gcs-filesystem ==0.37.1
  • termcolor ==3.1.0
  • terminado ==0.18.1
  • threadpoolctl ==3.6.0
  • tinycss2 ==1.4.0
  • toml ==0.10.2
  • toolz ==1.0.0
  • tornado ==6.5.1
  • tqdm ==4.67.1
  • traitlets ==5.14.3
  • types-PyYAML ==6.0.12.20250516
  • types-python-dateutil ==2.9.0.20250516
  • typing-inspection ==0.4.1
  • typing_extensions ==4.14.0
  • tzdata ==2025.2
  • uri-template ==1.3.0
  • urllib3 ==2.5.0
  • wcwidth ==0.2.13
  • webcolors ==24.11.1
  • webencodings ==0.5.1
  • websocket-client ==1.8.0
  • wrapt ==1.17.2
  • xgboost ==3.0.3
  • xgboost *