https://github.com/amazon-science/resource-constrained-naturalized-semantic-parsing

This repository is made public for reproducibility of our recent work, Training Naturalized Semantic Parsers with Very Little Data.


Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary

Keywords

dataset few-shot natural-language-processing natural-language-understanding nlp nlu overnight pizza semantic-parsing
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: other
  • Default Branch: main
  • Homepage:
  • Size: 61.5 MB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
dataset few-shot natural-language-processing natural-language-understanding nlp nlu overnight pizza semantic-parsing
Created almost 4 years ago · Last pushed about 3 years ago
Metadata Files
Readme · Contributing · License · Code of conduct

README.md

Resource Constrained Naturalized Semantic Parsing

This repository is released as part of our paper Training Naturalized Semantic Parsers with Very Little Data.

The repository contains a data subfolder with all the different splits of data we used in our joint training experiments, and the appendix of our publication, appendix.pdf.

Getting Started

For the Pizza dataset and each of the eight domains of the Overnight dataset, there are six subfolders containing files for different tasks:

annotated: Contains the annotated data for each split (n=16, 32, 48, or 200). The .src files contain utterances and .tgt files contain the canonical forms.

unannotated: Contains the remaining data from the original dataset that didn't go into the respective annotated split. We assume this data has no annotations (ignore the .tgt files) and use it to create the mask prediction task.

maskpred: Contains the mask prediction data we create from unannotated utterances (the .src file in the unannotated folder). We supersample 10x and mask spans of size equal to roughly 25% of the total sequence length. The .src files contain the masked utterances and the .tgt files contain the original utterances.
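To make the mask prediction setup concrete, here is a minimal sketch of how such .src/.tgt pairs could be built. The masking procedure, span sampling, and `<mask>` token below are illustrative assumptions, not the authors' actual preprocessing code:

```python
import random

def make_mask_pred_example(utterance, mask_ratio=0.25, mask_token="<mask>"):
    """Mask one contiguous span covering ~25% of the tokens.

    Returns (masked_utterance, original_utterance), matching the
    .src/.tgt convention described above. Span placement is random;
    the single-span choice is an illustrative assumption.
    """
    tokens = utterance.split()
    span_len = max(1, round(len(tokens) * mask_ratio))
    start = random.randrange(0, len(tokens) - span_len + 1)
    masked = tokens[:start] + [mask_token] + tokens[start + span_len:]
    return " ".join(masked), utterance

def supersample(utterances, factor=10):
    """Create `factor` masked variants per unannotated utterance (10x supersampling)."""
    return [make_mask_pred_example(u) for u in utterances for _ in range(factor)]
```

Because the span is sampled anew each time, 10x supersampling yields ten different masked views of the same utterance, all sharing one .tgt target.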

denoising: Contains the denoising data we created by randomly generating lots of target canonical forms. The .tgt files contain the canonical forms and .src files contain noised versions of those examples (25% noise operations on non-content tokens). For Pizza, the target canonical forms are sampled from the synthetic training dataset and for Overnight, they are generated using SEMPRE.
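A minimal sketch of what "25% noise operations on non-content tokens" could look like. Which tokens count as non-content and the exact noise operations (drop vs. duplicate) are assumptions here, not the repository's actual procedure:

```python
import random

# Hypothetical set of structural (non-content) tokens in canonical forms.
NON_CONTENT = {"(", ")", "and", "with", "the", "of"}

def noise_canonical_form(tokens, noise_ratio=0.25, rng=None):
    """Apply a noise operation (drop or duplicate) to ~25% of non-content tokens.

    Content tokens are always passed through unchanged, so the noised
    .src example still contains every semantic token of the .tgt target.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if tok in NON_CONTENT and rng.random() < noise_ratio:
            if rng.random() < 0.5:
                continue            # drop the token
            out.extend([tok, tok])  # duplicate the token
        else:
            out.append(tok)
    return out
```

Restricting noise to non-content tokens keeps the denoising task learnable: the model recovers structure (brackets, function words) without having to hallucinate content.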

jt: Contains the combined joint training data for each data split. The .src files are created by concatenating the respective .src files from the annotated, maskpred, and denoising folders; the .tgt files are created similarly by concatenating the corresponding .tgt files.
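The concatenation step can be sketched as below. The file names (train_16.src etc.) and directory layout are assumptions based on the folder descriptions above, not the exact names used in the repository:

```python
from pathlib import Path

def build_jt_files(domain_dir, split="16"):
    """Concatenate annotated, maskpred, and denoising files into jt files.

    Reads train_<split>.src / .tgt from each of the three task folders,
    in a fixed order, and writes the combined files under jt/.
    """
    parts = ["annotated", "maskpred", "denoising"]
    for ext in ("src", "tgt"):
        lines = []
        for part in parts:
            path = Path(domain_dir) / part / f"train_{split}.{ext}"
            lines.extend(path.read_text().splitlines())
        out = Path(domain_dir) / "jt" / f"train_{split}.{ext}"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text("\n".join(lines) + "\n")
```

Concatenating .src and .tgt in the same folder order keeps each source line aligned with its target line, which is what makes the combined files usable for joint training.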

original: Contains the original full training data files, which include the utterances, canonical forms, and exrs (executable representations/semantic parses).

NOTE: We haven't attached the self-training and paraphrase-augmentation data, since those are produced after the first round of joint training. The model checkpoints created by training the JT models are used to tag the unannotated utterances and any new paraphrases produced from existing utterances.

Cite

If you use this dataset, please cite the following paper:

[Rongali et al. 2022]

```bibtex
@article{rongali2022training,
  title={Training Naturalized Semantic Parsers with Very Little Data},
  author={Rongali, Subendhu and Arkoudas, Konstantine and Rubino, Melanie and Hamza, Wael},
  journal={arXiv preprint arXiv:2204.14243},
  year={2022}
}
```

as well as the original PIZZA dataset this work builds upon (see https://github.com/amazon-research/pizza-semantic-parsing-dataset):

```bibtex
@article{arkoudas2022pizza,
  title={PIZZA: A new benchmark for complex end-to-end task-oriented parsing},
  author={Arkoudas, Konstantine and Mesnards, Nicolas Guenon des and Rubino, Melanie and Swamy, Sandesh and Khanna, Saarthak and Sun, Weiqi and Khan, Haidar},
  journal={arXiv preprint arXiv:2212.00265},
  year={2022}
}
```

Security

See CONTRIBUTING for more information.

License

This library is licensed under two licenses. See LICENSE-SUMMARY for more details.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization


Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0