syntran-fa

Syntactic Transformer in Farsi

https://github.com/agp-internship/syntran-fa

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to arxiv.org, ieee.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.5%) to scientific vocabulary


Repository

Syntactic Transformer in Farsi

Basic Info

Host: GitHub
Owner: agp-internship
License: mit
Language: Python
Default Branch: main
Size: 1.87 MB

Statistics

Stars: 3
Watchers: 2
Forks: 0
Open Issues: 1
Releases: 0

Created about 4 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

syntran-fa

A syntactically transformed version of Farsi QA datasets that turns questions and short answers into fluent responses. You can load the syntran-fa dataset with 🤗 Datasets using the code below:

```python
import datasets

data = datasets.load_dataset('SLPL/syntran-fa', split="train")
```

Table of Contents
Dataset Description
Dataset Structure
Dataset Creation
Considerations for Using the Data
- Social Impact of Dataset
Additional Information

Dataset Description

Homepage: Sharif-SLPL
Repository: SynTran-fa
Point of Contact: Sadra Sabouri
Size of dataset files: 6.68MB

Dataset Summary

Generating fluent responses has always been challenging for the question-answering task, especially in low-resource languages like Farsi. In recent years there have been several efforts to enlarge Farsi datasets. Syntran-fa is a question-answering dataset that gathers the short answers of earlier Farsi QA datasets and proposes a complete, fluent answer for each (question, short_answer) pair.

This dataset contains nearly 50,000 question-answer entries. The datasets used as sources are listed in the Source Data section.

The main idea for this dataset comes from Fluent Response Generation for Conversational Question Answering, whose authors used a "parser + syntactic rules" module to produce different fluent answers from a question and a short answer. In this project, we used stanza to parse the question and generated a response from that parse and the short (1-2 word) answer. The project could be extended by generating different permutations of the sentence's parts (and thus providing more than one sentence per answer), or by training a seq2seq model that reproduces what our rule-based system does (by defining a new text-to-text task).
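To make the idea of turning a (question, short_answer) pair into a fluent answer concrete, here is a deliberately simplified sketch. It is not the project's rule set: the real system works on a stanza parse tree, whereas this toy version only substitutes the short answer for the first question word it finds. The function name `to_fluent_answer` and the tiny `QUESTION_WORDS` list are assumptions made for illustration.

```python
# Toy stand-in for the "parser + syntactic rules" step (hypothetical helper,
# NOT the rules used to build the dataset, which rely on a stanza parse).
QUESTION_WORDS = ("چه", "کجا", "کدام")  # assumed, incomplete sample of Farsi question words


def to_fluent_answer(question: str, short_answer: str) -> str:
    """Replace the first question word with the short answer and end with a period."""
    # Drop surrounding whitespace and the trailing Persian or Latin question mark.
    body = question.strip().rstrip("؟?").strip()
    out = []
    replaced = False
    for token in body.split():
        if not replaced and token in QUESTION_WORDS:
            out.append(short_answer)
            replaced = True
        else:
            out.append(token)
    if not replaced:
        # No known question word: fall back to the short answer alone.
        return short_answer + "."
    return " ".join(out) + "."


question = "باشگاه هاکی ساوتهمپتون چه نام دارد؟"
short_answer = "باشگاه هاکی ساوتهمپتون"
fluent = to_fluent_answer(question, short_answer)
```

A parse-based rule can reorder constituents and handle multi-word question phrases, which plain token substitution cannot; the sketch only shows the input and output shape of the transformation.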

Supported Tasks and Leaderboards

This dataset can be used for the question-answering task, especially when the goal is to generate fluent responses. You can train a seq2seq model on this dataset to generate fluent responses, as done in Fluent Response Generation for Conversational Question Answering.
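As a rough illustration of how rows could be framed as a text-to-text task, the sketch below builds (source, target) string pairs from the dataset's fields. The separator `SEP` and the helper name `to_text_pair` are assumptions; the source does not prescribe an input format, and a real training setup would add tokenization and a model of your choice.

```python
from typing import Dict, List, Tuple

SEP = " | "  # assumed separator between question and short answer


def to_text_pair(row: Dict[str, object], sep: str = SEP) -> Tuple[str, str]:
    """Map one dataset row to a (source, target) pair for seq2seq training."""
    source = f"{row['question']}{sep}{row['short_answer']}"
    target = str(row["fluent_answer"])
    return source, target


# One row in the shape documented under Dataset Structure.
rows: List[Dict[str, object]] = [
    {
        "id": 0,
        "question": "باشگاه هاکی ساوتهمپتون چه نام دارد؟",
        "short_answer": "باشگاه هاکی ساوتهمپتون",
        "fluent_answer": "باشگاه هاکی ساوتهمپتون باشگاه هاکی ساوتهمپتون نام دارد.",
        "bert_loss": 1.110097069682014,
    }
]
pairs = [to_text_pair(r) for r in rows]
```

The same function can be applied to rows obtained from `datasets.load_dataset`, since each row is exposed as a dictionary with these keys.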

Languages

Persian (fa)

Dataset Structure

Each row of the dataset looks like the following:

```python
{
    'id': 0,
    'question': 'باشگاه هاکی ساوتهمپتون چه نام دارد؟',
    'short_answer': 'باشگاه هاکی ساوتهمپتون',
    'fluent_answer': 'باشگاه هاکی ساوتهمپتون باشگاه هاکی ساوتهمپتون نام دارد.',
    'bert_loss': 1.110097069682014
}
```

+ id : the entry id in the dataset
+ question : the question
+ short_answer : the short answer corresponding to the question (the primary answer)
+ fluent_answer : the fluent (long) answer generated from both the question and the short_answer (the secondary answer)
+ bert_loss : the loss that pars-bert gives when the fluent_answer is fed to it. The higher the loss, the less likely the sentence is to be fluent.

Note: the dataset is sorted in ascending order of bert_loss, so earlier entries are more likely to be fluent.
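Because lower bert_loss suggests a more fluent sentence, one plausible use of the field is to keep only rows under a loss threshold. The sketch below works on plain dictionaries; the helper names, the threshold of 3.0, and the sample rows (other than the first loss value, taken from the example row) are illustrative assumptions, not values recommended by the authors.

```python
from typing import Dict, List, Sequence


def most_fluent(rows: Sequence[Dict[str, float]], max_loss: float) -> List[Dict[str, float]]:
    """Keep rows whose bert_loss does not exceed max_loss."""
    return [r for r in rows if float(r["bert_loss"]) <= max_loss]


def is_sorted_by_loss(rows: Sequence[Dict[str, float]]) -> bool:
    """Check that rows are in ascending bert_loss order."""
    losses = [float(r["bert_loss"]) for r in rows]
    return all(a <= b for a, b in zip(losses, losses[1:]))


# Made-up rows for illustration; only the first loss comes from the example row.
sample = [
    {"id": 0, "bert_loss": 1.110097069682014},
    {"id": 1, "bert_loss": 2.5},
    {"id": 2, "bert_loss": 4.0},
]
kept = most_fluent(sample, max_loss=3.0)  # 3.0 is an arbitrary threshold
```

Since the data is already sorted by loss, taking a prefix of the dataset achieves the same effect as thresholding, and `is_sorted_by_loss` offers a quick sanity check of that ordering.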

Data Splits

Currently, the dataset provides only the train split. A test split is planned.
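Until an official test split exists, users who need held-out data could carve one out of the train split themselves. The following is a minimal sketch under that assumption; `holdout_split`, the 10% fraction, and the seed are hypothetical choices, and the resulting split is not an official benchmark split.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    rows: Sequence[T], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly hold out a fraction of rows, preserving original order within each part."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    # Seeded shuffle so the split is reproducible.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


# Integers stand in for dataset rows here.
train_rows, test_rows = holdout_split(list(range(100)), test_fraction=0.1, seed=0)
```

Sampling at random matters here: the dataset is sorted by bert_loss, so taking the last rows as a test set would select the least fluent sentences.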

Dataset Creation

We extracted all short-answer entries (answers of 1-2 words) from all open-source QA datasets in Farsi and applied rules based on the question's parse tree to produce long (fluent) answers.

Source Data

The source datasets that we used are as follows:

Personal and Sensitive Information

The dataset is entirely a subset of well-known open-source datasets, so all the information in it is already available on the internet as open-source data. We do not take responsibility for any of that information.

Dataset Curators

The dataset was compiled entirely during the Asr Gooyesh Pardaz company's summer internship, under the supervision of Soroush Gooran and Prof. Hossein Sameti and the mentorship of Sadra Sabouri. This project was Farhan Farsi's first internship project.

Contributions

Thanks to @farhaaaaa for adding this dataset.

Cite Us

Please cite our technical preprint if you're using this dataset:

```bibtex
@article{farsi2024syntran,
  title={SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation},
  author={Farsi, Farhan and Sabouri, Sadra and Kashfipour, Kian and Gooran, Soroush and Sameti, Hossein and Asgari, Ehsaneddin},
  year={2024},
  doi={10.20944/preprints202410.1684.v1},
  publisher={Preprints}
}
```

References

Fluent Response Generation for Conversational Question Answering (Baheti et al., ACL 2020)
Good Question! Statistical Ranking for Question Generation (Heilman & Smith, NAACL 2010)
Accurate Unlexicalized Parsing (Klein & Manning, ACL 2003)

Owner

Name: AGP Internship
Login: agp-internship
Kind: organization

Repositories: 3
Profile: https://github.com/agp-internship

ASR Gooyesh Pardaz Internship Projects Repository

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work, please cite it as below."
title: "SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation"
authors:
  - family-names: "Farsi"
    given-names: "Farhan"
  - family-names: "Sabouri"
    given-names: "Sadra"
  - family-names: "Kashfipour"
    given-names: "Kian"
  - family-names: "Gooran"
    given-names: "Soroush"
  - family-names: "Sameti"
    given-names: "Hossein"
  - family-names: "Asgari"
    given-names: "Ehsaneddin"
date-released: 2024-10-22
repository-code: "https://github.com/agp-internship/syntran-fa"
preferred-citation:
  type: article
  authors:
  - family-names: "Farsi"
    given-names: "Farhan"
  - family-names: "Sabouri"
    given-names: "Sadra"
  - family-names: "Kashfipour"
    given-names: "Kian"
  - family-names: "Gooran"
    given-names: "Soroush"
  - family-names: "Sameti"
    given-names: "Hossein"
  - family-names: "Asgari"
    given-names: "Ehsaneddin"
  year: 2024
  doi: "10.20944/preprints202410.1684.v1"
  publisher: "Preprints"
  title: "SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation"
