Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to arxiv.org, ieee.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.5%) to scientific vocabulary
Repository
Syntactic Transformer in Farsi
Basic Info
- Host: GitHub
- Owner: agp-internship
- License: mit
- Language: Python
- Default Branch: main
- Size: 1.87 MB
Statistics
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
syntran-fa
A syntactically transformed version of Farsi QA datasets that builds fluent responses from questions and short answers. You can load the syntran-fa dataset with :hugs:/datasets using the code below:
```python
import datasets

data = datasets.load_dataset('SLPL/syntran-fa', split="train")
```
Table of Contents
- Table of Contents
- Dataset Description
- Dataset Structure
- Dataset Creation
- Considerations for Using the Data
- Additional Information
Dataset Description
- Homepage: Sharif-SLPL
- Repository: SynTran-fa
- Point of Contact: Sadra Sabouri
- Size of dataset files: 6.68MB
Dataset Summary
Generating fluent responses has always been challenging for the question-answering task, especially in low-resource languages like Farsi. In recent years there have been some efforts to increase the size of Farsi datasets. Syntran-fa is a question-answering dataset that collects the short answers of earlier Farsi QA datasets and proposes a complete, fluent answer for each (question, short_answer) pair.
This dataset contains nearly 50,000 question-answer entries. The datasets used as sources are listed in the Source Data section.
The main idea for this dataset comes from Fluent Response Generation for Conversational Question Answering, which used a "parser + syntactic rules" module to build different fluent answers from a question and a short answer. In this project, we used stanza to parse the question and generate a response from it using the short (1-2 word) answer. One can extend this project by generating different permutations of the sentence's parts (and thus providing more than one sentence per answer) or by training a seq2seq model that does what our rule-based system does (by defining a new text-to-text task).
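As a rough illustration of the idea (not the project's actual stanza-based rules), the simplest possible rule swaps the question word for the short answer and turns the question mark into a full stop. The function name `naive_fluent_answer` and the small `QUESTION_WORDS` set are assumptions made for this sketch; a parser-driven system would locate the interrogative through the parse tree rather than by token lookup.

```python
# Toy sketch only: a string-level stand-in for a parser-driven rule.
# QUESTION_WORDS and naive_fluent_answer are hypothetical names, not part of the project.
QUESTION_WORDS = {"چه", "کجا", "کی", "چند", "کدام"}


def naive_fluent_answer(question: str, short_answer: str) -> str:
    """Replace the first known question word with the short answer."""
    # Drop the trailing Persian (or Latin) question mark and surrounding spaces.
    body = question.strip().rstrip("؟?").strip()
    tokens = body.split()
    for i, token in enumerate(tokens):
        if token in QUESTION_WORDS:
            tokens[i] = short_answer
            return " ".join(tokens) + "."
    # No known question word found: fall back to the short answer alone.
    return short_answer + "."


question = "باشگاه هاکی ساوتهمپتون چه نام دارد؟"
short_answer = "باشگاه هاکی ساوتهمپتون"
print(naive_fluent_answer(question, short_answer))
```

A token lookup like this breaks as soon as the question word is inflected, appears more than once, or requires reordering the sentence, which is where a parse tree becomes necessary.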
Supported Tasks and Leaderboards
This dataset can be used for the question-answering task, especially when the goal is to generate fluent responses. You can train a seq2seq model on this dataset to generate fluent responses, as done in Fluent Response Generation for Conversational Question Answering.
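To frame the data as a text-to-text task, each row can be turned into a (source, target) string pair. The sketch below shows one way to do that; the `question: ... answer: ...` template and the helper name `to_text_pair` are assumptions for illustration, not a format the dataset prescribes.

```python
# Sketch: build (source, target) strings for a seq2seq model from one row.
# The prompt template and the helper name are hypothetical choices.
from typing import Dict, Tuple


def to_text_pair(row: Dict[str, object]) -> Tuple[str, str]:
    """Source combines question and short answer; target is the fluent answer."""
    source = f"question: {row['question']} answer: {row['short_answer']}"
    target = str(row["fluent_answer"])
    return source, target


row = {
    "id": 0,
    "question": "باشگاه هاکی ساوتهمپتون چه نام دارد؟",
    "short_answer": "باشگاه هاکی ساوتهمپتون",
    "fluent_answer": "باشگاه هاکی ساوتهمپتون باشگاه هاکی ساوتهمپتون نام دارد.",
    "bert_loss": 1.110097069682014,
}

source, target = to_text_pair(row)
print(source)
print(target)
```

The resulting pairs can then be tokenized and fed to whichever encoder-decoder model is being fine-tuned.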
Languages
- Persian (fa)
Dataset Structure
Each row of the dataset looks like the following:
```python
{
    'id': 0,
    'question': 'باشگاه هاکی ساوتهمپتون چه نام دارد؟',
    'short_answer': 'باشگاه هاکی ساوتهمپتون',
    'fluent_answer': 'باشگاه هاکی ساوتهمپتون باشگاه هاکی ساوتهمپتون نام دارد.',
    'bert_loss': 1.110097069682014
}
```
+ id : the entry's id in the dataset
+ question : the question
+ short_answer : the short answer corresponding to the question (the primary answer)
+ fluent_answer : the fluent (long) answer generated from both the question and the short_answer (the secondary answer)
+ bert_loss : the loss that pars-bert produces when the fluent_answer is given to it as input. The higher the loss, the more likely the sentence is disfluent.
Note: the dataset is sorted in ascending order of bert_loss, so the earliest entries are more likely to be fluent.
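Because `bert_loss` acts as a rough fluency signal, one possible use is to keep only rows below a chosen cut-off. The sketch below works on plain dictionaries; the threshold of 2.0, the helper name `keep_fluent`, and every row except the first `bert_loss` value are made-up placeholders, not values taken from the dataset.

```python
# Sketch: keep rows whose bert_loss does not exceed a chosen threshold.
# The threshold and the placeholder rows are hypothetical.
from typing import Dict, Iterable, List


def keep_fluent(rows: Iterable[Dict[str, object]], max_loss: float) -> List[Dict[str, object]]:
    """Return the rows with bert_loss <= max_loss, preserving their order."""
    return [row for row in rows if float(row["bert_loss"]) <= max_loss]


sample_rows = [
    {"id": 0, "bert_loss": 1.110097069682014},  # loss value from the example row
    {"id": 1, "bert_loss": 1.8},                # placeholder
    {"id": 2, "bert_loss": 4.0},                # placeholder
]

kept = keep_fluent(sample_rows, max_loss=2.0)
print([row["id"] for row in kept])
```

With a dataset loaded through :hugs:/datasets, the same condition can be passed as a function to `Dataset.filter`. A suitable threshold has to be chosen by inspecting the data.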
Data Splits
Currently, the dataset provides only the train split. A test split is planned.
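Until a test split is available, one option is to hold out part of the train split yourself. Since the rows are ordered by `bert_loss`, slicing off the tail would give a held-out set skewed toward less fluent sentences, so the sketch below shuffles with a fixed seed first. The helper name `shuffled_split`, the 10% fraction, and the seed are assumptions for illustration.

```python
# Sketch: seeded shuffle-then-split, so the held-out set is not biased by row order.
# Function name, fraction, and seed are hypothetical choices.
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffled_split(
    rows: Sequence[T], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Split rows into (train, test) using a reproducible random selection."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    # Both parts keep the original relative order of their rows.
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(100))  # stand-in for dataset rows
train, test = shuffled_split(rows, test_fraction=0.1, seed=42)
print(len(train), len(test))
```

When working with a `datasets.Dataset` object directly, its `train_test_split` method (with `test_size` and `seed`) serves the same purpose.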
Dataset Creation
We extracted all short-answer entries (answers of 1-2 words) from the open-source QA datasets in Farsi and applied rules based on the question's parse tree to build long (fluent) answers.
Source Data
The source datasets that we used are as follows:
Personal and Sensitive Information
The dataset is entirely a subset of well-known open-source datasets, so all the information in it is already publicly available on the internet. We do not take responsibility for any of that content.
Dataset Curators
The dataset was compiled entirely during the Asr Gooyesh Pardaz company's summer internship, under the supervision of Soroush Gooran and Prof. Hossein Sameti and the mentorship of Sadra Sabouri. This project was Farhan Farsi's first internship project.
Contributions
Thanks to @farhaaaaa for adding this dataset.
Cite Us
Please cite our technical preprint if you're using this dataset:
```bibtex
@article{farsi2024syntran,
  title={SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation},
  author={Farsi, Farhan and Sabouri, Sadra and Kashfipour, Kian and Gooran, Soroush and Sameti, Hossein and Asgari, Ehsaneddin},
  year={2024},
  doi={10.20944/preprints202410.1684.v1},
  publisher={Preprints}
}
```
References
- Fluent Response Generation for Conversational Question Answering (Baheti et al., ACL 2020)
- Good Question! Statistical Ranking for Question Generation (Heilman & Smith, NAACL 2010)
- Accurate Unlexicalized Parsing (Klein & Manning, ACL 2003)
Owner
- Name: AGP Internship
- Login: agp-internship
- Kind: organization
- Repositories: 3
- Profile: https://github.com/agp-internship
ASR Gooyesh Pardaz Internship Projects Repository
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this work, please cite it as below."
title: "SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation"
authors:
  - family-names: "Farsi"
    given-names: "Farhan"
  - family-names: "Sabouri"
    given-names: "Sadra"
  - family-names: "Kashfipour"
    given-names: "Kian"
  - family-names: "Gooran"
    given-names: "Soroush"
  - family-names: "Sameti"
    given-names: "Hossein"
  - family-names: "Asgari"
    given-names: "Ehsaneddin"
date-released: 2024-10-22
repository-code: "https://github.com/agp-internship/syntran-fa"
preferred-citation:
  type: article
  authors:
    - family-names: "Farsi"
      given-names: "Farhan"
    - family-names: "Sabouri"
      given-names: "Sadra"
    - family-names: "Kashfipour"
      given-names: "Kian"
    - family-names: "Gooran"
      given-names: "Soroush"
    - family-names: "Sameti"
      given-names: "Hossein"
    - family-names: "Asgari"
      given-names: "Ehsaneddin"
  year: 2024
  doi: "10.20944/preprints202410.1684.v1"
  publisher: "Preprints"
  title: "SynTran-fa: Generating Comprehensive Answers for Farsi QA Pairs via Syntactic Transformation"
GitHub Events
Total
- Push event: 6
Last Year
- Push event: 6