https://github.com/bernard-ng/drc-legal-ner

Towards a Congolese Legal Knowledge Graph: LLM-Enhanced NER for Citation Detection

https://github.com/bernard-ng/drc-legal-ner

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary

Keywords

information-retrieval legal-intelligence ner nlp
Last synced: 5 months ago · JSON representation

Repository

Towards a Congolese Legal Knowledge Graph: LLM-Enhanced NER for Citation Detection

Basic Info
  • Host: GitHub
  • Owner: bernard-ng
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 8.59 MB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
information-retrieval legal-intelligence ner nlp
Created about 1 year ago · Last pushed 11 months ago
Metadata Files
Readme

README.md

Automated Citation Detection in Congolese Legal Texts: Leveraging LLM-Based NER for Knowledge Graph Construction

This paper builds upon our previous work on Juro, an AI-powered chatbot designed to improve legal information access in the Democratic Republic of Congo (DRC), by ad- dressing the specific challenge of automated citation detection in unstructured legal texts. We propose an end-to-end approach that combines Large Language Model (LLM)-based annotation and Named Entity Recognition (NER) for extracting key entities critical to constructing a legal knowledge graph. Over 8,400 Congolese legal document titles were scraped and annotated via the GPT-4o-mini model, with subsequent training implemented in spaCy under two distinct configurations emphasizing accuracy and efficiency. We evaluated the system using both a split dataset and a human-annotated benchmark, demonstrating robust per- formance in identifying document types, reference numbers, and publication dates. An initial mapping algorithm connected documents based on annotated entities, revealing a preliminary citation graph of over 1,400 relationships. While the current methodology shows promise in automating entity extraction and preliminary graph construction, future developments will explore deeper relationship modeling, improved type coverage, and integration into the Juro framework to provide enhanced legal support.

Usage

```bash git clone https://github.com/bernard-ng/drc-legal-ner.git cd drc-legal-ner

python3 -m venv .venv source .venv/bin/activate pip install -r requirements.txt docker compose up ```

  1. Annotation

Will generate a dataset of Congolese legal texts and annotate it using OpenAI's GPT-4o-mini you can do it synchronously or asynchronously (with batch API).

```bash python -m processing.batch.requests --build python -m processing.batch.requests --upload python -m processing.batch.requests --create python -m processing.batch.response # 24h later

python -m process.annotate --method=async

python -m processing.format --label-studio # for Human feedback and validation python -m processing.format --spacy-binary # Spacy compatible format for training ```

  1. Tasks

bash make train_efficiency # Train the model with efficiency make train_accuracy # Train the model with accuracy make evaluate # Evaluate the model make benchmark # Benchmark the model make visualize # Visualize NER make clean # Clean the model and results

Owner

  • Name: Bernard Ngandu
  • Login: bernard-ng
  • Kind: user
  • Location: Lubumbashi RDC
  • Company: @devscast

Building a community of skilled developers : @devscast

GitHub Events

Total
  • Watch event: 3
  • Push event: 10
  • Create event: 2
Last Year
  • Watch event: 3
  • Push event: 10
  • Create event: 2

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • Deprecated ==1.2.18
  • Django ==4.2.19
  • Faker ==36.1.1
  • GitPython ==3.1.44
  • Jinja2 ==3.1.5
  • MarkupSafe ==3.0.2
  • PyYAML ==6.0.2
  • Pygments ==2.19.1
  • altair ==5.5.0
  • annotated-types ==0.7.0
  • anyio ==4.8.0
  • appdirs ==1.4.4
  • argcomplete ==3.5.3
  • asgiref ==3.8.1
  • attr ==0.3.1
  • attrs ==25.1.0
  • azure-core ==1.32.0
  • azure-storage-blob ==12.24.1
  • black ==25.1.0
  • bleach ==5.0.1
  • blinker ==1.9.0
  • blis ==1.2.0
  • boto ==2.49.0
  • boto3 ==1.36.21
  • botocore ==1.36.21
  • cachetools ==5.5.1
  • catalogue ==2.0.10
  • certifi ==2025.1.31
  • cffi ==1.17.1
  • charset-normalizer ==3.4.1
  • click ==8.1.8
  • cloudpathlib ==0.20.0
  • colorama ==0.4.6
  • confection ==0.1.5
  • cryptography ==44.0.1
  • cymem ==2.0.11
  • datamodel-code-generator ==0.26.1
  • defusedxml ==0.7.1
  • distro ==1.9.0
  • django-annoying ==0.10.6
  • django-cors-headers ==3.6.0
  • django-csp ==3.7
  • django-debug-toolbar ==3.2.1
  • django-environ ==0.10.0
  • django-extensions ==3.2.3
  • django-filter ==2.4.0
  • django-migration-linter ==5.1.0
  • django-model-utils ==4.1.1
  • django-ranged-fileresponse ==0.1.2
  • django-rq ==2.5.1
  • django-storages ==1.12.3
  • django-user-agents ==0.4.0
  • djangorestframework ==3.15.2
  • dnspython ==2.7.0
  • drf-dynamic-fields ==0.3.0
  • drf-flex-fields ==0.9.5
  • drf-generators ==0.3.0
  • email_validator ==2.2.0
  • exceptiongroup ==1.2.2
  • expiringdict ==1.2.2
  • genson ==1.3.0
  • gitdb ==4.0.12
  • google-api-core ==2.24.1
  • google-auth ==2.38.0
  • google-cloud-appengine-logging ==1.6.0
  • google-cloud-audit-log ==0.3.0
  • google-cloud-core ==2.4.1
  • google-cloud-logging ==3.11.4
  • google-cloud-storage ==2.19.0
  • google-crc32c ==1.6.0
  • google-resumable-media ==2.7.2
  • googleapis-common-protos ==1.67.0
  • grpc-google-iam-v1 ==0.14.0
  • grpcio ==1.70.0
  • grpcio-status ==1.70.0
  • h11 ==0.14.0
  • httpcore ==1.0.7
  • httpx ==0.28.1
  • humansignal-drf-yasg ==1.21.10.post1
  • idna ==3.10
  • ijson ==3.3.0
  • importlib_metadata ==8.5.0
  • inflect ==5.6.2
  • inflection ==0.5.1
  • isodate ==0.7.2
  • isort ==5.13.2
  • jiter ==0.8.2
  • jmespath ==1.0.1
  • joblib ==1.4.2
  • jsf ==0.11.2
  • jsonschema ==4.23.0
  • jsonschema-specifications ==2024.10.1
  • label-studio ==1.15.0
  • label-studio-sdk ==1.0.8
  • langcodes ==3.5.0
  • language_data ==1.3.0
  • launchdarkly-server-sdk ==8.2.1
  • lockfile ==0.12.2
  • lxml ==5.3.1
  • lxml_html_clean ==0.4.1
  • marisa-trie ==1.2.1
  • markdown-it-py ==3.0.0
  • mdurl ==0.1.2
  • murmurhash ==1.0.12
  • mypy-extensions ==1.0.0
  • narwhals ==1.27.1
  • nltk ==3.9.1
  • numpy ==1.26.4
  • ollama ==0.4.7
  • openai ==1.61.1
  • opentelemetry-api ==1.30.0
  • ordered-set ==4.0.2
  • packaging ==24.2
  • pandas ==2.2.3
  • pathspec ==0.12.1
  • pillow ==10.4.0
  • platformdirs ==4.3.6
  • preshed ==3.0.9
  • proto-plus ==1.26.0
  • protobuf ==5.29.3
  • psycopg2-binary ==2.9.10
  • pyRFC3339 ==2.0.1
  • pyarrow ==19.0.0
  • pyasn1 ==0.6.1
  • pyasn1_modules ==0.4.1
  • pyboxen ==1.3.0
  • pycparser ==2.22
  • pydantic ==2.10.6
  • pydantic_core ==2.27.2
  • pydeck ==0.9.1
  • python-dateutil ==2.9.0.post0
  • python-dotenv ==1.0.1
  • python-json-logger ==2.0.4
  • pytz ==2022.7.1
  • redis ==3.5.3
  • referencing ==0.36.2
  • regex ==2024.11.6
  • requests ==2.32.3
  • requests-mock ==1.12.1
  • rich ==13.9.4
  • rpds-py ==0.22.3
  • rq ==1.10.1
  • rsa ==4.9
  • rstr ==3.2.2
  • rules ==3.4
  • s3transfer ==0.11.2
  • semver ==3.0.4
  • sentry-sdk ==2.21.0
  • shellingham ==1.5.4
  • six ==1.17.0
  • smart-open ==7.1.0
  • smmap ==5.0.2
  • sniffio ==1.3.1
  • spacy ==3.8.3
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.5
  • spacy-streamlit ==1.0.6
  • sqlparse ==0.5.3
  • srsly ==2.5.1
  • streamlit ==1.36.0
  • tenacity ==8.5.0
  • thinc ==8.3.4
  • toml ==0.10.2
  • tomli ==2.2.1
  • tornado ==6.4.2
  • tqdm ==4.67.1
  • typer ==0.15.1
  • typing_extensions ==4.12.2
  • tzdata ==2025.1
  • ua-parser ==1.0.1
  • ua-parser-builtins ==0.18.0.post1
  • ujson ==5.10.0
  • uritemplate ==4.1.1
  • urllib3 ==1.26.20
  • user-agents ==2.2.0
  • wasabi ==1.1.3
  • weasel ==0.4.1
  • webencodings ==0.5.1
  • wrapt ==1.17.2
  • xmljson ==0.2.1
  • zipp ==3.21.0