za-isizulu-siswati-news-2022

IsiZulu News (articles and headlines) and Siswati News (headlines) Corpora - za-isizulu-siswati-news-2022

https://github.com/dsfsi/za-isizulu-siswati-news-2022

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary

Keywords

african-nlp corpora dsfsi-datasets low-resource-languages natural-language-processing news-categorizer south-africa

Repository

Basic Info
  • Host: GitHub
  • Owner: dsfsi
  • License: cc-by-sa-4.0
  • Default Branch: main
  • Homepage:
  • Size: 292 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 3
Created over 3 years ago · Last pushed over 2 years ago

README.md


About Dataset

This dataset contains isiZulu news (articles and headlines) and Siswati news headlines. The isiZulu data was scraped from the Isolezwe news website (http://www.isolezwe.co.za), and the Siswati data was collected from public posts on the SABC news LigwalagwalaFM Facebook page (https://www.facebook.com/ligwalagwalafm/).

The obtained datasets are isiZulu news articles, isiZulu news headlines, and Siswati news headlines.

After collection, the datasets were sent to annotators and returned once annotation was complete. The datasets contain special characters, some English words, and non-ASCII characters, which must be removed before model training. The aim of these three datasets is to create a baseline news categorisation model for two low-resource South African languages, isiZulu and Siswati.
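The cleaning step mentioned above could be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `clean_text` is hypothetical, and the choice to transliterate accented letters before dropping non-ASCII characters is an assumption. It does not attempt to detect or remove English words.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Remove non-ASCII and special characters from a headline or article.

    Hypothetical helper for illustration only.
    """
    # Decompose accented letters (e.g. "é" -> "e" + combining accent) so the
    # base letter survives the ASCII filter below.
    normalized = unicodedata.normalize("NFKD", text)
    # Drop every character that cannot be encoded as ASCII.
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Replace anything that is not a letter, digit, or whitespace with a space.
    no_special = re.sub(r"[^A-Za-z0-9\s]", " ", ascii_only)
    # Collapse runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", no_special).strip()
```

Replacing special characters with a space rather than deleting them keeps words on either side of a dash or slash from being fused together.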

For categorisation, we use the high-level IPTC NewsCodes as categories. The news categories are listed in data/news-categories-iptc-newscodes.csv.

Some class categories had very few observations, so categories with fewer than 35 observations were removed for isiZulu, and categories with fewer than 6 observations were removed for Siswati.

The dataset has both full category data as well as reduced category data.
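The reduction from full to reduced category data can be illustrated with a short sketch. It assumes each record is a dictionary with a label stored under a `category` key; that key name and the function `drop_rare_categories` are hypothetical and are not taken from the repository.

```python
from collections import Counter


def drop_rare_categories(rows, min_count, label_key="category"):
    """Keep only rows whose category appears at least `min_count` times.

    `label_key` is an assumed field name; adjust it to the real column.
    """
    # Count how many observations each category has.
    counts = Counter(row[label_key] for row in rows)
    # Retain rows belonging to sufficiently frequent categories.
    return [row for row in rows if counts[row[label_key]] >= min_count]
```

Under the thresholds described above, this would be called with `min_count=35` for the isiZulu data and `min_count=6` for the Siswati data.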

Please see the data-statement.md for full dataset information.


See also the list of contributors who participated in this project.

Citation

@article{MadodongaMarivateAdendorff_2023,
  title={Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati},
  volume={4},
  url={https://upjournals.up.ac.za/index.php/dhasa/article/view/4449},
  DOI={10.55492/dhasa.v4i01.4449},
  author={Madodonga, Andani and Marivate, Vukosi and Adendorff, Matthew},
  year={2023},
  month={Jan.}
}

License

Data is licensed under CC BY-SA 4.0. Code is licensed under the MIT License.

Owner

  • Name: Data Science for Social Impact Research Group @ University of Pretoria
  • Login: dsfsi
  • Kind: organization
  • Email: vukosi.marivate@cs.up.ac.za
  • Location: University of Pretoria, South Africa

We are the Data Science for Social Impact research group at the Computer Science Department, University of Pretoria.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "Dataset for both isiZulu news (articles and headlines) and Siswati news headlines. Process included scraping the data from internet, from Isolezwe news website http://www.isolezwe.co.za and public posts from the SABC news LigwalagwalaFM Facebook page https://www.facebook.com/ligwalagwalafm/ respectively."
authors:
- family-names: "Madodonga"
  given-names: "Andani"
  affiliation: "Department of Computer Science, University of Pretoria"
- family-names: "Marivate"
  given-names: "Vukosi"
  orcid: "https://orcid.org/0000-0002-6731-6267"
  affiliation: "Department of Computer Science, University of Pretoria"
- family-names: "Adendorff"
  given-names: "Matthew"
  affiliation: "Open Cities Lab"
title: "IsiZulu News (articles and headlines) and Siswati News (headlines) Corpora - za-isizulu-siswati-news-2022"
version: 0.9.5
doi: 10.5281/zenodo.7193346
date-released: 2022-10-13
url: "https://github.com/dsfsi/za-isizulu-siswati-news-2022"
type: data
license: cc-by-sa-4.0

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0