amharic-qa

AmQA - The first Amharic Open Domain Question Answering Dataset

https://github.com/semantic-systems/amharic-qa

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
    Organization semantic-systems has institutional domain (www.inf.uni-hamburg.de)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.1%) to scientific vocabulary

Keywords

amharic amharic-nlp qa question-answering reading-comprehension

Repository

Basic Info
  • Host: GitHub
  • Owner: semantic-systems
  • License: mit
  • Default Branch: main
  • Homepage:
  • Size: 1.25 MB
Statistics
  • Stars: 12
  • Watchers: 4
  • Forks: 6
  • Open Issues: 0
  • Releases: 0
Topics
amharic amharic-nlp qa question-answering reading-comprehension
Created about 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Low Resource Question Answering: An Amharic Benchmarking Dataset (Taffa et al., 2024)

Question Answering (QA) systems return concise answers or answer lists to natural-language questions, drawing on a given context document. Curating QA datasets to advance the development of robust QA models takes many resources. QA datasets for languages such as English have surged; the same is not true for low-resource languages like Amharic. Indeed, there is no published or publicly available Amharic QA dataset. Hence, to foster further research in low-resource QA, we present the first publicly available benchmarking Amharic Question Answering Dataset (Amh-QuAD). We crowdsource 2,628 question-answer pairs from over 378 Amharic Wikipedia articles. Using the training set, we fine-tune an XLM-R-based language model and introduce a new reader model. Leveraging our newly fine-tuned reader, we run a baseline model to spark open-domain Amharic QA research interest. The best-performing baseline QA achieves F-scores of 80.3 and 81.34 in the retriever-reader and reading comprehension settings, respectively.

Dataset

In Amharic, interrogative sentences can be formulated using information-seeking pronouns like “ምን” (what), “መቼ” (when), “ማን” (who), “የት” (where), “የትኛው” (which), etc., and prepositional interrogative phrases like “ለምን” (ለ-ምን), “በምን” (በ-ምን), etc. A verb phrase can also be used to pose a question (Getahun 2013; Baye 2009). As shown below, the AmQA dataset contains context, question, and answer triplets. The contexts are articles collected from the Amharic Wikipedia dump file. The question-answer pairs were created by crowdsourcing and annotated using the Haystack QA annotation tool. In total, 2,628 question-answer pairs were created from 378 documents. The whole AmQA dataset can be found here. We also split the dataset into train, dev, and test sets of 1,728, 600, and 300 pairs, respectively.

```json
{
  "paragraphs": [
    {
      "qas": [
        {
          "question": "በላሊበላ ስንት ውቅር አብያተ ክርስቲያናት አሉ?",
          "id": 272819,
          "answers": [
            {
              "answer_id": 270463,
              "document_id": 266719,
              "question_id": 272819,
              "text": "11",
              "answer_start": 290,
              "answer_end": 292,
              "answer_category": null
            }
          ],
          "is_impossible": false
        },
        {
          "question": "ከላሊበላ አስራ አንዱ ውቅር አብያተ ክርስቲያናት ግዙፉ የትኛው ነው?",
          "id": 272822,
          "answers": [
            {
              "answer_id": 270466,
              "document_id": 266719,
              "question_id": 272822,
              "text": "ቤተ መድሃኔ ዓለም",
              "answer_start": 372,
              "answer_end": 383,
              "answer_category": null
            }
          ],
          "is_impossible": false
        },
        {
          "question": "በላሊበላ የጌታ ልደት ቀን በቤተ ማርያም የሚቀርበው ልዩ ዝማሬ ምን ይባላል?",
          "id": 272836,
          "answers": [
            {
              "answer_id": 270480,
              "document_id": 266719,
              "question_id": 272836,
              "text": "ቤዛ ኩሉ",
              "answer_start": 465,
              "answer_end": 470,
              "answer_category": null
            }
          ],
          "is_impossible": false
        }
      ],
      "context": "ንጉሡ ላሊበላ የሚለውን ስም ያገኘው፣ ሲወለድ በንቦች ስለተከበበ ነው። ላል ማለት ማር ማለት ሲሆን፤ ላሊበላ ማለትም -ላል ይበላል (ማር ይበላል) ማለት አንደሆነ ይነግራል። ውቅር ቤተክርስቲያናቱን ንጉሡ ጠርቦ የስራቸው ከመላእክት እገዛ ጋር እንደሆነ በኢትዮጵያ ኦርቶዶክስ እምነት ተከታዮች ይነግራል። በ16ኛው ከፍለ ዘመን አውሮፓዊ ተጓዥ ላሊበላን ተመልክቶ «ያየሁትን ብናግር ማንም እንደኔ ካላየ በፍጹም አያምነኝም» ሲል ተናግሮ ነበር። በላሊበላ 11 ውቅር ዐብያተ ክርስቲያናት ያሉ ሲሆን ከነዚህም ውስጥ ቤተ ጊዮርጊስ (ባለ መስቀል ቅርፁ) ሲታይ ውሃልኩን የጠበቀ ይመስላል። ቤተ መድሃኔ ዓለም የተባለው ደግሞ ከሁሉም ትልቁ ነው። ላሊበላ (ዳግማዊ ኢየሩሳሌም) የገና በዓል ታህሳስ 29 በልዩ ሁኔታ ና ድምቀት ይከበራል፣ \"ቤዛ ኩሉ\" ተብሎ የሚጠራው በነግህ የሚደረገው ዝማሬ በዚሁ በዓል የሚታይ ልዩ ና ታላቅ ትዕይንት ነው።የሚደረገውም ከቅዳሴ በኋላ በቤተ ማርያም ሲሆን ከታች ባለ ነጭ ካባ ካህናት ከላይ ደግሞ ባለጥቁር ካብ ካህናት በቅዱስ ያሬድ ዜማ ቤዛ ኩሉ እያሉ ይዘምራሉ። 11ዱ የቅዱስ ላሊበላ ፍልፍል አብያተ ክርስቲያናት ቤተ መድሃኔ ዓለም፣ ቤተ ማርያም፣ ቤተ ደናግል፣ ቤተ መስቀል፣ ቤተ ደብረሲና፣ ቤተ ጎለጎታ፣ ቤተ አማኑኤል፣ ቤተ አባ ሊባኖስ፣ ቤተ መርቆሬዎስ፣ ቤተ ገብርኤል ወሩፋኤል፣ ቤተ ጊዮርጊስ ናቸው።",
      "document_id": 266719
    }
  ]
}
```
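The record layout in the sample above can be walked with a few lines of Python. This is a minimal sketch, not the authors' loader: the names `QAExample`, `iter_examples`, and `span_matches` are hypothetical, the embedded record is a synthetic English stand-in rather than dataset content, and the check assumes `answer_start` and `answer_end` are character offsets with an exclusive end (in the sample above, `answer_end - answer_start` equals the length of each answer text, which is consistent with that reading).

```python
import json
from typing import Iterator, NamedTuple


class QAExample(NamedTuple):
    """One (question, answer, context) triplet; hypothetical helper type."""
    question: str
    answer_text: str
    answer_start: int
    answer_end: int
    context: str


def iter_examples(record: dict) -> Iterator[QAExample]:
    """Yield one QAExample per answer in a {"paragraphs": [...]} record."""
    for paragraph in record["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                yield QAExample(
                    question=qa["question"],
                    answer_text=answer["text"],
                    answer_start=answer["answer_start"],
                    answer_end=answer["answer_end"],
                    context=context,
                )


def span_matches(example: QAExample) -> bool:
    """Return True if the offsets slice the answer text out of the context.

    Assumption: offsets are character indices and the end is exclusive.
    """
    sliced = example.context[example.answer_start:example.answer_end]
    return sliced == example.answer_text


# Synthetic stand-in record with the same keys as the sample (not real data).
sample = json.loads("""
{"paragraphs": [{
  "context": "Lalibela has 11 rock-hewn churches.",
  "document_id": 1,
  "qas": [{
    "question": "How many rock-hewn churches are in Lalibela?",
    "id": 10,
    "is_impossible": false,
    "answers": [{"text": "11", "answer_start": 13, "answer_end": 15}]
  }]
}]}
""")

examples = list(iter_examples(sample))
mismatches = [ex for ex in examples if not span_matches(ex)]
```

Running `span_matches` over a downloaded split is a cheap way to confirm the offset convention before training a reader; any entries collected in `mismatches` would indicate that the assumption above does not hold for that file.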

Baseline Model

amh_qa_basline

Evaluation

Since the AmQA dataset contains a set of contexts and question-answer pairs, it can be treated as a reading comprehension (RC) task: given a question Q and a context, the goal of the model is to identify a word or group of consecutive words that answers Q. Retriever-reader QA models, on the other hand, first retrieve relevant passages, then read the top-ranked passages and try to predict the start and end positions of the answer. We therefore also implemented a retriever-reader (RR) QA model using the Farm Haystack open-source framework, with BM25 as the retriever and a fine-tuned XLM-R Large as the reader. The experimental results are shown in the figure below.
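To make the retriever stage concrete, here is a small self-contained sketch of Okapi BM25 ranking in plain Python. It is not the Haystack implementation used for the baseline: the `BM25` class, the whitespace `tokenize` function, the smoothed IDF variant, and the parameter values `k1=1.5` and `b=0.75` are all assumptions chosen for illustration, and the documents are synthetic English stand-ins. The reader stage (the fine-tuned XLM-R model) is not reproduced here; it would receive the top-ranked passages and predict an answer span.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: plain whitespace tokenization; the real pipeline may differ.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over an in-memory, non-empty document list."""

    def __init__(self, documents: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.doc_freqs = [Counter(tokenize(doc)) for doc in documents]
        self.doc_lens = [sum(freqs.values()) for freqs in self.doc_freqs]
        # Fall back to 1.0 so that all-empty documents do not divide by zero.
        self.avg_len = (sum(self.doc_lens) / len(self.doc_lens)) or 1.0
        n_docs = len(documents)
        # Document frequency: number of documents containing each term.
        df = Counter()
        for freqs in self.doc_freqs:
            df.update(freqs.keys())
        # Smoothed, non-negative IDF (the "+ 1" inside the log).
        self.idf = {
            term: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in df.items()
        }

    def score(self, query: str, index: int) -> float:
        freqs = self.doc_freqs[index]
        # Length normalisation term shared by all query terms for this document.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[index] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def top_k(self, query: str, k: int = 3) -> list[tuple[int, float]]:
        """Return up to k (document index, score) pairs, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_freqs))]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:k]


# Synthetic passages, not taken from the dataset.
docs = [
    "Lalibela is known for its rock-hewn churches",
    "Addis Ababa is the capital of Ethiopia",
    "The Blue Nile flows from Lake Tana",
]
retriever = BM25(docs)
top = retriever.top_k("capital of Ethiopia", k=2)
```

In a retriever-reader setup, the indices in `top` would select the passages handed to the reader, so retrieval errors cap the end-to-end score: the reader cannot find an answer in a passage that was never retrieved.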

amh_qa_evaluation
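The README does not spell out how the F-score is computed. A common choice for extractive QA is the SQuAD-style token-overlap F1 between the predicted and gold answer strings, and the sketch below assumes that definition. The function name `token_f1` and the whitespace tokenization are illustrative assumptions; the authors' evaluation script may normalise text differently.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string.

    Assumption: tokens are whitespace-separated and compared exactly.
    """
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


exact = token_f1("ቤተ መድሃኔ ዓለም", "ቤተ መድሃኔ ዓለም")
partial = token_f1("ቤተ መድሃኔ", "ቤተ መድሃኔ ዓለም")
```

Here `partial` has precision 1 and recall 2/3, so a prediction that drops one of three gold tokens still earns partial credit, unlike an exact-match metric. A corpus-level score would typically average `token_f1` over all test questions.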

The full paper can be found here.

The dataset is also available on Hugging Face here.

N.B. This work was previously released on arXiv under the title AmQA: Amharic Question Answering Dataset.

Cite Us

```bibtex
@inproceedings{taffa-etal-2024-low,
    title = "Low Resource Question Answering: An {A}mharic Benchmarking Dataset",
    author = "Taffa, Tilahun Abedissa and Usbeck, Ricardo and Assabie, Yaregal",
    booktitle = "Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.rail-1.14",
    pages = "124--132"
}
```

Owner

  • Name: Semantic Systems research group
  • Login: semantic-systems
  • Kind: organization
  • Email: sems@informatik.uni-hamburg.de
  • Location: Germany

The Semantic Systems (SEMS) research group wants machines to understand humans.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: 'AmQA: Amharic Question Answering Dataset'
message: >-
  If you use this dataset, please cite it using the
  metadata from this file.
type: dataset
authors:
  - given-names: Tilahun
    family-names: Abedissa
    email: tilahun.abedissa@aau.edu.et
    affiliation: Addis Ababa University
  - given-names: Ricardo
    family-names: Usbeck
    email: ricardo.usbeck@uni-hamburg.de
    affiliation: Universität Hamburg
    orcid: 'https://orcid.org/0000-0002-0191-7211'
  - given-names: Yaregal
    family-names: Assabie
    email: yaregal.assabie@aau.edu.et
    affiliation: Addis Ababa University

GitHub Events

Total
  • Watch event: 3
  • Fork event: 1
Last Year
  • Watch event: 3
  • Fork event: 1