syntran-fa

Syntactic Transformer in Farsi

https://github.com/agp-internship/syntran-fa

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, ieee.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.5%) to scientific vocabulary
Last synced: 10 months ago

Repository

Syntactic Transformer in Farsi

Basic Info
  • Host: GitHub
  • Owner: agp-internship
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 1.87 MB
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created about 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

syntran-fa

A syntactically transformed version of Farsi QA datasets that builds fluent responses from questions and short answers. You can load the syntran-fa dataset with the Hugging Face datasets library using the code below:

```python
import datasets

data = datasets.load_dataset('SLPL/syntran-fa', split="train")
```

Dataset Description

Dataset Summary

Generating fluent responses has always been challenging for question answering, especially in low-resource languages such as Farsi. In recent years there have been several efforts to enlarge Farsi datasets. Syntran-fa is a question-answering dataset that collects the short answers of earlier Farsi QA datasets and proposes a complete, fluent answer for each (question, short_answer) pair.

This dataset contains nearly 50,000 question-answer entries. The datasets used as sources are listed in the Source Data section.

The main idea for this dataset comes from Fluent Response Generation for Conversational Question Answering, where the authors used a "parser + syntactic rules" module to build different fluent answers from a question and a short answer. In this project, we used stanza as our parser to parse the question and generate a response from it using the short (1-2 word) answer. This project could be extended by generating different permutations of the sentence's parts (thus providing more than one sentence per answer) or by training a seq2seq model that does what our rule-based system does (by defining a new text-to-text task).

Supported Tasks and Leaderboards

This dataset can be used for the question-answering task, especially when the goal is to generate fluent responses. You can train a seq2seq model on this dataset to generate fluent responses, as done in Fluent Response Generation for Conversational Question Answering.
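As a minimal sketch of how rows might be turned into text-to-text training pairs for such a model: the function name `to_text_pair` and the ` [SEP] ` separator are assumptions made for illustration, not something defined by the dataset or the referenced paper. The sample row is the one shown in the Dataset Structure section.

```python
def to_text_pair(row, sep=" [SEP] "):
    """Build a (source, target) pair for a seq2seq model from one dataset row.

    The separator and the layout of the source string are assumptions;
    any consistent encoding of (question, short_answer) would work.
    """
    source = row["question"] + sep + row["short_answer"]
    target = row["fluent_answer"]
    return source, target


# Sample row taken from the Dataset Structure section of this README.
row = {
    "id": 0,
    "question": "باشگاه هاکی ساوتهمپتون چه نام دارد؟",
    "short_answer": "باشگاه هاکی ساوتهمپتون",
    "fluent_answer": "باشگاه هاکی ساوتهمپتون باشگاه هاکی ساوتهمپتون نام دارد.",
    "bert_loss": 1.110097069682014,
}
source, target = to_text_pair(row)
```

The model would then be trained to map `source` to `target`; the exact input format should follow whatever the chosen seq2seq model expects.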

Languages

  • Persian (fa)

Dataset Structure

Each row of the dataset looks like the following:

```python
{
    'id': 0,
    'question': 'باشگاه هاکی ساوتهمپتون چه نام دارد؟',
    'short_answer': 'باشگاه هاکی ساوتهمپتون',
    'fluent_answer': 'باشگاه هاکی ساوتهمپتون باشگاه هاکی ساوتهمپتون نام دارد.',
    'bert_loss': 1.110097069682014
}
```

+ id: the entry id in the dataset
+ question: the question
+ short_answer: the short answer corresponding to the question (the primary answer)
+ fluent_answer: the fluent (long) answer generated from both the question and the short_answer (the secondary answer)
+ bert_loss: the loss that pars-bert gives when the fluent_answer is fed to it. The higher the loss, the more likely the sentence is not fluent.

Note: the dataset is sorted in ascending order of bert_loss, so earlier entries are more likely to be fluent.
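Because bert_loss acts as a rough fluency signal, one possible use is to keep only rows under a cutoff. The sketch below works on plain dictionaries; the helper name `keep_fluent`, the cutoff of 3.0, and the sample values are all made up for illustration, and the README does not recommend any particular threshold.

```python
def keep_fluent(rows, max_loss):
    """Return rows whose bert_loss is at or below max_loss.

    max_loss is a hypothetical cutoff chosen by the user.
    """
    return [row for row in rows if row["bert_loss"] <= max_loss]


# Tiny made-up rows carrying only the fields needed here.
rows = [
    {"id": 0, "bert_loss": 1.11},
    {"id": 1, "bert_loss": 2.50},
    {"id": 2, "bert_loss": 7.90},
]
fluent_rows = keep_fluent(rows, max_loss=3.0)
print([row["id"] for row in fluent_rows])  # prints [0, 1]
```

With a dataset loaded through the datasets library, the same selection can be expressed with `Dataset.filter` and an equivalent predicate.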

Data Splits

Currently, the dataset provides only the train split. A test split is planned.
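Until an official test split exists, users who need held-out data could carve one out themselves. The following is only a sketch under that assumption: `split_rows`, the 10% fraction, and the seed are illustrative choices, not an official split. Rows are shuffled first because the dataset is sorted by bert_loss, so an unshuffled slice would be biased toward the most or least fluent answers.

```python
import random


def split_rows(rows, test_fraction=0.1, seed=0):
    """Shuffle rows and split them into (train, test) lists.

    A fixed seed keeps the split reproducible across runs.
    """
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up rows standing in for dataset entries.
rows = [{"id": i} for i in range(100)]
train_rows, test_rows = split_rows(rows, test_fraction=0.1, seed=0)
```

For a dataset loaded through the datasets library, `Dataset.train_test_split` offers a comparable shuffled split.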

Dataset Creation

We extracted all short-answer entries (answers of 1-2 words) from all open-source Farsi QA datasets and applied rules based on the question's parse tree to build long (fluent) answers.

Source Data

The source datasets that we used are as follows:

Personal and Sensitive Information

The dataset is entirely a subset of known open-source datasets, so all information in it is already publicly available on the internet. We do not take responsibility for any of it.

Dataset Curators

The dataset was gathered entirely during the Asr Gooyesh Pardaz company's summer internship, under the supervision of Soroush Gooran and Prof. Hossein Sameti, and the mentorship of Sadra Sabouri. This project was Farhan Farsi's first internship project.

Contributions

Thanks to @farhaaaaa for adding this dataset.

Cite Us

Please cite our technical preprint if you're using this dataset:

```bibtex
@article{farsi2024syntran,
  title={SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation},
  author={Farsi, Farhan and Sabouri, Sadra and Kashfipour, Kian and Gooran, Soroush and Sameti, Hossein and Asgari, Ehsaneddin},
  year={2024},
  doi={10.20944/preprints202410.1684.v1},
  publisher={Preprints}
}
```

References

  1. Fluent Response Generation for Conversational Question Answering (Baheti et al., ACL 2020)
  2. Good Question! Statistical Ranking for Question Generation (Heilman & Smith, NAACL 2010)
  3. Accurate Unlexicalized Parsing (Klein & Manning, ACL 2003)

Owner

  • Name: AGP Internship
  • Login: agp-internship
  • Kind: organization

ASR Gooyesh Pardaz Internship Projects Repository

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work, please cite it as below."
title: "SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation"
authors:
  - family-names: "Farsi"
    given-names: "Farhan"
  - family-names: "Sabouri"
    given-names: "Sadra"
  - family-names: "Kashfipour"
    given-names: "Kian"
  - family-names: "Gooran"
    given-names: "Soroush"
  - family-names: "Sameti"
    given-names: "Hossein"
  - family-names: "Asgari"
    given-names: "Ehsaneddin"
date-released: 2024-10-22
repository-code: "https://github.com/agp-internship/syntran-fa"
preferred-citation:
  type: article
  authors:
  - family-names: "Farsi"
    given-names: "Farhan"
  - family-names: "Sabouri"
    given-names: "Sadra"
  - family-names: "Kashfipour"
    given-names: "Kian"
  - family-names: "Gooran"
    given-names: "Soroush"
  - family-names: "Sameti"
    given-names: "Hossein"
  - family-names: "Asgari"
    given-names: "Ehsaneddin"
  year: 2024
  doi: "10.20944/preprints202410.1684.v1"
  publisher: "Preprints"
  title: "SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation"

GitHub Events

Total
  • Push event: 6
Last Year
  • Push event: 6