https://github.com/amazon-science/resource-constrained-naturalized-semantic-parsing

This repository is made public for reproducibility of our recent work, Training Naturalized Semantic Parsers with Very Little Data.


Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary

Keywords

dataset few-shot natural-language-processing natural-language-understanding nlp nlu overnight pizza semantic-parsing
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: other
  • Default Branch: main
  • Homepage:
  • Size: 61.5 MB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
dataset few-shot natural-language-processing natural-language-understanding nlp nlu overnight pizza semantic-parsing
Created almost 4 years ago · Last pushed about 3 years ago
Metadata Files
Readme · Contributing · License · Code of conduct

README.md

Resource Constrained Naturalized Semantic Parsing

This repository is released as part of our paper Training Naturalized Semantic Parsers with Very Little Data.

The repository contains a data subfolder with all the different splits of data we used in our joint training experiments, and the appendix of our publication, appendix.pdf.

Getting Started

For the Pizza dataset and each of the eight domains of the Overnight dataset, there are six subfolders containing files for different tasks:

annotated: Contains the annotated data for each split (n=16, 32, 48, or 200). The .src files contain utterances and .tgt files contain the canonical forms.

unannotated: Contains the remaining data from the original dataset that didn't go into the respective annotated split. We assume this data has no annotations (ignore the .tgt files) and use it to create the mask prediction task.

maskpred: Contains the mask prediction data we create from unannotated utterances (the .src file in the unannotated folder). We supersample 10x and mask spans of size equal to roughly 25% of the total sequence length. The .src files contain the masked utterances and the .tgt files contain the original utterances.
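To make the mask prediction setup concrete, here is a minimal sketch of how such .src/.tgt pairs could be built. The masking procedure, span sampling, and `<mask>` token below are illustrative assumptions, not the authors' actual preprocessing code:

```python
import random

def make_mask_pred_example(utterance, mask_ratio=0.25, mask_token="<mask>"):
    """Mask one contiguous span covering ~25% of the tokens.

    Returns (masked_utterance, original_utterance), matching the
    .src/.tgt convention described above. Span placement is random;
    the single-span choice is an illustrative assumption.
    """
    tokens = utterance.split()
    span_len = max(1, round(len(tokens) * mask_ratio))
    start = random.randrange(0, len(tokens) - span_len + 1)
    masked = tokens[:start] + [mask_token] + tokens[start + span_len:]
    return " ".join(masked), utterance

def supersample(utterances, factor=10):
    """Create `factor` masked variants per unannotated utterance (10x supersampling)."""
    return [make_mask_pred_example(u) for u in utterances for _ in range(factor)]
```

Because the span is sampled anew each time, 10x supersampling yields ten different masked views of the same utterance, all sharing one .tgt target.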

denoising: Contains the denoising data we created by randomly generating lots of target canonical forms. The .tgt files contain the canonical forms and .src files contain noised versions of those examples (25% noise operations on non-content tokens). For Pizza, the target canonical forms are sampled from the synthetic training dataset and for Overnight, they are generated using SEMPRE.
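A minimal sketch of what "25% noise operations on non-content tokens" could look like. Which tokens count as non-content and the exact noise operations (drop vs. duplicate) are assumptions here, not the repository's actual procedure:

```python
import random

# Hypothetical set of structural (non-content) tokens in canonical forms.
NON_CONTENT = {"(", ")", "and", "with", "the", "of"}

def noise_canonical_form(tokens, noise_ratio=0.25, rng=None):
    """Apply a noise operation (drop or duplicate) to ~25% of non-content tokens.

    Content tokens are always passed through unchanged, so the noised
    .src example still contains every semantic token of the .tgt target.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if tok in NON_CONTENT and rng.random() < noise_ratio:
            if rng.random() < 0.5:
                continue            # drop the token
            out.extend([tok, tok])  # duplicate the token
        else:
            out.append(tok)
    return out
```

Restricting noise to non-content tokens keeps the denoising task learnable: the model recovers structure (brackets, function words) without having to hallucinate content.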

jt: Contains the combined joint training data for each data split. The .src files are created by concatenating the respective .src files from the annotated, maskpred, and denoising folders; the .tgt files are created similarly by concatenating the corresponding .tgt files.
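The concatenation step can be sketched as below. The file names (train_16.src etc.) and directory layout are assumptions based on the folder descriptions above, not the exact names used in the repository:

```python
from pathlib import Path

def build_jt_files(domain_dir, split="16"):
    """Concatenate annotated, maskpred, and denoising files into jt files.

    Reads train_<split>.src / .tgt from each of the three task folders,
    in a fixed order, and writes the combined files under jt/.
    """
    parts = ["annotated", "maskpred", "denoising"]
    for ext in ("src", "tgt"):
        lines = []
        for part in parts:
            path = Path(domain_dir) / part / f"train_{split}.{ext}"
            lines.extend(path.read_text().splitlines())
        out = Path(domain_dir) / "jt" / f"train_{split}.{ext}"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text("\n".join(lines) + "\n")
```

Concatenating .src and .tgt in the same folder order keeps each source line aligned with its target line, which is what makes the combined files usable for joint training.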

original: Contains the original full training data files, which include the utterances, canonical forms, and exrs (executable representations/semantic parses).

NOTE: We haven't attached the self-training and paraphrase-augmentation data, since those are produced after the first round of joint training. The model checkpoints created by training the JT models are used to tag the unannotated utterances and any new paraphrases produced from existing utterances.

Cite

If you use this dataset, please cite the following paper:

[Rongali et al. 2022]

```bibtex
@article{rongali2022training,
  title={Training Naturalized Semantic Parsers with Very Little Data},
  author={Rongali, Subendhu and Arkoudas, Konstantine and Rubino, Melanie and Hamza, Wael},
  journal={arXiv preprint arXiv:2204.14243},
  year={2022}
}
```

as well as the original PIZZA dataset this work builds upon (see https://github.com/amazon-research/pizza-semantic-parsing-dataset):

```bibtex
@article{arkoudas2022pizza,
  title={PIZZA: A new benchmark for complex end-to-end task-oriented parsing},
  author={Arkoudas, Konstantine and Mesnards, Nicolas Guenon des and Rubino, Melanie and Swamy, Sandesh and Khanna, Saarthak and Sun, Weiqi and Khan, Haidar},
  journal={arXiv preprint arXiv:2212.00265},
  year={2022}
}
```

Security

See CONTRIBUTING for more information.

License

This library is licensed under two licenses. See LICENSE-SUMMARY for more details.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization


Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0