drc-news-corpus

DRC News Corpus : Towards a scalable and intelligent system for Congolese News curation

https://github.com/bernard-ng/drc-news-corpus

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.6%) to scientific vocabulary

Keywords

aggregator data news nlp politics
Last synced: 9 months ago

Repository


Basic Info
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Topics
aggregator data news nlp politics
Created over 2 years ago · Last pushed 12 months ago
Metadata Files
Readme Changelog License Citation

README.md

Core and Backend


| Scope             | Link                                                       |
|-------------------|------------------------------------------------------------|
| core and backend  | https://github.com/bernard-ng/drc-news-corpus              |
| ML models         | https://github.com/bernard-ng/drc-news-ml                  |
| Mobile App        | https://github.com/bernard-ng/drc-news-app                 |
| Dataset (partial) | https://huggingface.co/datasets/bernard-ng/drc-news-corpus |


DRC News Corpus : Towards a scalable and intelligent system for Congolese News curation

Introduction

The "DRC News Corpus" is a structured and scalable dataset of news articles sourced from major media outlets covering diverse aspects of the Democratic Republic of Congo (DRC). Designed for efficiency, this system enables the automated collection, processing, and organization of news stories spanning politics, economy, society, culture, environment, and international affairs.

Scalability and Use Cases:

This dataset is built to support large-scale text analysis, making it a valuable resource for researchers, journalists, policymakers, and data scientists. It facilitates tasks such as sentiment analysis, trend detection, entity recognition, and language modeling, providing deep insights into the evolving socio-political and economic landscape of the DRC.

To ensure quality and reliability, the dataset prioritizes reputable news sources while maintaining an adaptable framework for continuous expansion. However, users are encouraged to critically assess the content, as journalistic standards and perspectives may vary.

Sources

| Source               | Articles | Link                                 |
|----------------------|----------|--------------------------------------|
| radiookapi.net       | +100k    | https://www.radiookapi.net/actualite |
| mediacongo.cd        | +100k    | https://www.mediacongo.net/          |
| beto.cd              | +30k     | https://www.beto.cd/                 |
| actualite.cd         | +57k     | https://actualite.cd/                |
| 7sur7.cd             | +50k     | https://7sur7.cd                     |
| newscd.net           | +5k      | https://newscd.net                   |
| congoindependant.com | +10k     | https://www.congoindependant.com/    |
| congoactu.net        | +10k     | https://www.congoactu.net/           |

Build the dataset

To rebuild the dataset, follow the steps below:

Installation

```bash
git clone https://github.com/bernard-ng/drc-news-corpus.git && cd drc-news-corpus
make build
make start
```

Usage

See the supported sources above. You can also add your own source by extending the `App/Aggregator/Infrastructure/Crawler/Source/Source` abstract class. For example, to crawl radiookapi.net, run the following command:

1. Crawling

```bash
php bin/console app:crawl radiookapi.net

# You can specify a date range to crawl articles.
php bin/console app:crawl beto.cd --date="2022-01-01:2022-12-31"

# You can specify a page range to crawl articles.
php bin/console app:crawl mediacongo.net --page="0:6"

# You can specify both a date range and a page range.
php bin/console app:crawl actualite.cd --date="2022-01-01:2022-12-31" --page="0:6"

# Some sources require a category to crawl articles.
php bin/console app:crawl 7sur7.cd --category=politique

# You can crawl multiple pages in parallel (WIP - not stable).
php bin/console app:crawl radiookapi.net --parallel=20
```
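To crawl several years one after another, the `--date` option can be driven from a small loop. The following is a minimal sketch and not part of the repository: the helper names `year_range` and `crawl_years` and the `DRY_RUN` variable are hypothetical, and it assumes the `start:end` range with `YYYY-MM-DD` dates used in the examples above.

```bash
#!/bin/sh
set -eu

# Hypothetical helper: print a full-year range in the start:end form
# used by the --date examples above.
year_range() {
  printf '%s-01-01:%s-12-31\n' "$1" "$1"
}

# Crawl one source year by year, from the first to the last year inclusive.
# With DRY_RUN=1 the commands are printed instead of executed.
# The body runs in a subshell so its variables stay local.
crawl_years() (
  source_name="$1"
  year="$2"
  last="$3"
  while [ "$year" -le "$last" ]; do
    range="$(year_range "$year")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "php bin/console app:crawl $source_name --date=$range"
    else
      php bin/console app:crawl "$source_name" --date="$range"
    fi
    year=$((year + 1))
  done
)

# Example: DRY_RUN=1 crawl_years beto.cd 2021 2022
```

Running the example with `DRY_RUN=1` first is a cheap way to check the generated ranges before starting a long crawl.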

2. Updating

```bash
# Update the database with the latest articles.
php bin/console app:update radiookapi.net
```

Note that crawling can take a while, depending on the number of articles to crawl, and that the articles are stored in the database. Running the command in the background is recommended. By default no output is generated; add the `-v` option to see the progress.

```bash
nohup php bin/console app:crawl radiookapi.net -v > crawling.log 2>&1 &
```
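The same background pattern can be applied to several sources at once, with one log file per source. This is only a sketch under assumptions: the function names `log_name` and `crawl_in_background` and the `crawl-<source>.log` naming are hypothetical, and whether the application copes well with several crawlers writing to the database at the same time is not stated in this README.

```bash
#!/bin/sh
set -eu

# Hypothetical helper: derive a log file name from a source name,
# for example radiookapi.net -> crawl-radiookapi.net.log
log_name() {
  printf 'crawl-%s.log\n' "$1"
}

# Start one detached, verbose crawl per source passed as an argument.
# stdout and stderr of each crawl go to that source's own log file.
crawl_in_background() {
  for source_name in "$@"; do
    nohup php bin/console app:crawl "$source_name" -v \
      > "$(log_name "$source_name")" 2>&1 &
    echo "started $source_name (pid $!), log: $(log_name "$source_name")"
  done
}

# Example: crawl_in_background radiookapi.net beto.cd
```

Progress for a given source can then be followed with `tail -f crawl-radiookapi.net.log`.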

3. Statistics

```bash
# Get the number of articles in the database.
php bin/console app:stats
```

Export the dataset

You can export the dataset to a CSV file by running the following command:

```bash
php bin/console app:export radiookapi.net
```

A CSV file will be generated in the `data` directory.

Acknowledgment:

The compilation and curation of the "DRC News Corpus" were conducted by Tshabu Ngandu Bernard with the primary objective of facilitating research and analysis related to the Democratic Republic of Congo. I do not own the content of the articles, and all rights belong to the respective publishers. The dataset is intended for non-commercial research purposes only.

Owner

  • Name: Bernard Ngandu
  • Login: bernard-ng
  • Kind: user
  • Location: Lubumbashi RDC
  • Company: @devscast

Building a community of skilled developers : @devscast

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: DRC News Corpus
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Bernard
    name-particle: Tshabu
    family-names: Ngandu
    email: bernard@devscast.tech
    affiliation: Devscast Community
    orcid: 'https://orcid.org/0009-0003-9777-6349'
repository-code: 'https://github.com/bernard-ng/drc-news-corpus'
repository: >-
  https://www.huggingface.co/datasets/bernard-ng/drc-news-corpus
abstract: >-
  The "DRC News Corpus" is a curated collection of news
  articles sourced from major media outlets covering a wide
  spectrum of topics related to the Democratic Republic of
  Congo (DRC). This dataset encompasses a diverse range of
  news stories, including but not limited to politics,
  economy, social issues, culture, environment, and
  international relations, providing comprehensive coverage
  of events and developments within the country.
keywords:
  - news
  - datasets
  - DRC
  - politics
  - NLP
license: CC-BY-NC-SA-4.0
commit: b1d386986b196ae0ab637ec1a50fd992cf829d34
version: 1.2.1
date-released: '2024-03-31'

GitHub Events

Total
  • Release event: 1
  • Watch event: 4
  • Delete event: 3
  • Issue comment event: 4
  • Push event: 69
  • Pull request event: 6
  • Pull request review event: 3
  • Pull request review comment event: 3
  • Create event: 3
Last Year
  • Release event: 1
  • Watch event: 4
  • Delete event: 3
  • Issue comment event: 4
  • Push event: 69
  • Pull request event: 6
  • Pull request review event: 3
  • Pull request review comment event: 3
  • Create event: 3

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 1
  • Total pull requests: 7
  • Average time to close issues: 21 days
  • Average time to close pull requests: 6 days
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.57
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: 10 days
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 1.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • rooneyi (1)
  • bernard-ng (1)
Pull Request Authors
  • bernard-ng (10)
  • rooneyi (1)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels
enhancement (4)

Dependencies

.github/workflows/quality.yaml actions
  • actions/cache v2 composite
  • actions/checkout v4 composite
  • shivammathur/setup-php v2 composite
.github/workflows/test.yaml actions
  • actions/cache v2 composite
  • actions/checkout v4 composite
  • shivammathur/setup-php v2 composite
docker/php/Dockerfile docker
  • php 8.3-cli-alpine build
composer.json packagist
  • behat/behat ^3.18.1 development
  • phpstan/phpstan ^1.12.16 development
  • phpstan/phpstan-doctrine ^1.5.7 development
  • phpunit/phpunit ^10.5.44 development
  • qossmic/deptrac ^2.0.4 development
  • rector/rector ^1.2.10 development
  • shipmonk/composer-dependency-analyser ^1.8.2 development
  • symfony/maker-bundle ^1.62.1 development
  • symplify/easy-coding-standard ^12.5.8 development
  • tomasvotruba/class-leak ^1.2.7 development
  • cweagans/composer-patches ^1.7.3
  • doctrine/dbal ^3.9.4
  • doctrine/doctrine-bundle ^2.13.2
  • doctrine/doctrine-migrations-bundle ^3.4.1
  • doctrine/orm ^3.3.1
  • ext-ctype *
  • ext-iconv *
  • league/csv ^9.21
  • php >=8.3
  • symfony/console 7.2.*
  • symfony/css-selector 7.2.*
  • symfony/dom-crawler 7.2.*
  • symfony/dotenv 7.2.*
  • symfony/flex ^2.4.7
  • symfony/framework-bundle 7.2.*
  • symfony/http-client 7.2.*
  • symfony/mailer 7.2.*
  • symfony/messenger 7.2.*
  • symfony/monolog-bundle ^3.10
  • symfony/runtime 7.2.*
  • symfony/stopwatch 7.2.*
  • symfony/twig-bundle 7.2.*
  • symfony/uid 7.2.*
  • symfony/yaml 7.2.*
  • twig/extra-bundle ^2.12|^3.19
  • twig/twig ^2.12|^3.19
  • webmozart/assert ^1.11
composer.lock packagist
  • 113 dependencies