vtechagp-dataset
A structured dataset of Virginia Tech ETD abstracts with academic-to-general-audience paraphrases, useful for NLP and text simplification.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: waingram
- License: other
- Default Branch: main
- Size: 5.35 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset
Variable Descriptions
This dataset contains the following columns:
- identifier_url: Persistent identifier (CNRI handle) for the ETD.
- title: Title of the ETD.
- abstract: Regular abstract for the ETD.
- abstract_general: General audience abstract for the ETD.
- subject_terms: List of subject terms for the ETD.
- discipline: Field of study for the degree awarded.
- department: Name of the academic department.
- degree: Degree awarded.
- degree_level: Level of the degree (e.g., 'doctoral' or 'masters').
- type: Type of ETD (e.g., 'thesis' or 'dissertation').
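As a minimal sketch of how these columns might be used for text simplification, the snippet below reads rows and yields (academic, general-audience) abstract pairs, optionally filtered by `degree_level`. It assumes the data is available as a CSV file with a header row matching the column names above; the file format, the `paraphrase_pairs` helper, and the sample row are illustrative assumptions, not part of the dataset.

```python
import csv
import io

# The ten column names listed above.
COLUMNS = [
    "identifier_url", "title", "abstract", "abstract_general",
    "subject_terms", "discipline", "department", "degree",
    "degree_level", "type",
]

# Placeholder row for illustration only; it is NOT taken from the dataset.
SAMPLE_CSV = (
    ",".join(COLUMNS) + "\n"
    "https://example.org/handle/0000,Example title,Technical abstract text.,"
    "Plain-language abstract text.,term one; term two,Example Discipline,"
    "Example Department,PhD,doctoral,dissertation\n"
)


def paraphrase_pairs(handle, degree_level=None):
    """Yield (abstract, abstract_general) pairs, optionally filtered by degree_level."""
    for row in csv.DictReader(handle):
        if degree_level is not None and row["degree_level"] != degree_level:
            continue
        # Skip rows where either side of the pair is empty.
        if row["abstract"] and row["abstract_general"]:
            yield row["abstract"], row["abstract_general"]


# In practice, `handle` would be an open file object for the dataset file.
pairs = list(paraphrase_pairs(io.StringIO(SAMPLE_CSV), degree_level="doctoral"))
print(pairs)
```

Passing `degree_level="masters"` instead would select only master's theses, given the example values listed for that column.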
Methodology
This dataset was collected from Virginia Tech's institutional repository,
VTechWorks, using the Open Archives
Initiative Protocol for Metadata Harvesting (OAI-PMH) on September 22, 2023.
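For readers unfamiliar with OAI-PMH, the following sketch shows the general shape of a harvest: build a `ListRecords` request, parse one page of the response, and follow the `resumptionToken` to the next page. It is not the script used to build this dataset. The base URL, the choice of the `oai_dc` metadata format, the mapping of abstracts to `dc:description`, and the sample response are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard OAI-PMH 2.0 and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written response for illustration only; not real repository output.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0000/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example title</dc:title>
          <dc:description>Technical abstract text.</dc:description>
          <dc:description>Plain-language abstract text.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords URL; a resumption token replaces all other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from one ListRecords response page."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "descriptions": [
                d.text or "" for d in record.iterfind(".//dc:description", NS)
            ],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An absent or empty token means this was the last page.
    return records, (token or None)


# Hypothetical endpoint; substitute the repository's actual OAI-PMH base URL.
first_url = list_records_url("https://example.org/oai/request")
records, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("https://example.org/oai/request", resumption_token=token)
```

A real harvester would fetch each URL over HTTP and repeat until no token is returned; the network step is left out here so the sketch runs offline.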
Citing This Dataset
If you use this dataset in your research, please cite the following paper:
```bibtex
@article{ming2024vtechagp,
  author     = {Ming Cheng and
                Jiaying Gong and
                Chenhan Yuan and
                William A. Ingram and
                Edward A. Fox and
                Hoda Eldardiry},
  title      = {VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models},
  journal    = {CoRR},
  volume     = {abs/2411.04825},
  year       = {2024},
  doi        = {10.48550/arXiv.2411.04825},
  eprinttype = {arXiv},
  eprint     = {2411.04825}
}
```
Ethical & Usage Considerations
- Publicly Available Data: All ETD metadata in this dataset was collected from publicly available sources in compliance with institutional repository policies.
- Attribution: Users of this dataset should respect citation norms
and acknowledge the original authors of the ETDs when appropriate.
Acknowledgment
This project was made possible in part by the Institute of Museum and Library Services, grant LG-256638-OLS-24.
License
This dataset is released under the Open Data Commons Attribution License (ODC-By).
Owner
- Name: Bill Ingram
- Login: waingram
- Kind: user
- Location: Blacksburg, VA
- Company: @VirginiaTech @VTUL
- Website: http://orcid.org/0000-0002-8307-8844
- Twitter: sudobear
- Repositories: 28
- Profile: https://github.com/waingram
Assistant Professor, Assistant Dean and Director of IT University Libraries @VTUL, @VirginiaTech
Citation (CITATION.cff)
cff-version: 1.2.0
title: "VTechAGP: Academic-to-General-Audience ETD Abstracts"
version: "1.0.0"
authors:
  - family-names: "Ingram"
    given-names: "William A."
    affiliation: "Virginia Tech"
    orcid: "0000-0002-8307-8844"
  - family-names: "Cheng"
    given-names: "Ming"
    affiliation: "Virginia Tech"
    orcid: "0009-0006-8475-2331"
  - family-names: "Gong"
    given-names: "Jiaying"
    affiliation: "Virginia Tech"
  - family-names: "Yuan"
    given-names: "Chenhan"
    affiliation: "Virginia Tech"
  - family-names: "Fox"
    given-names: "Edward A."
    affiliation: "Virginia Tech"
    orcid: "0000-0003-1447-6870"
  - family-names: "Eldardiry"
    given-names: "Hoda"
    affiliation: "Virginia Tech"
    orcid: "0000-0002-9712-6667"
doi: "10.5281/zenodo.14833933"
license: "ODC-By-1.0"
date-released: "2025-02-06"
repository-code: "https://github.com/waingram/VTechAGP-Dataset"
url: "https://doi.org/10.5281/zenodo.14833933"
keywords:
  - ETDs
  - text simplification
  - academic abstracts
  - paraphrase dataset
preferred-citation:
  type: article
  authors:
    - family-names: "Cheng"
      given-names: "Ming"
      orcid: "0009-0006-8475-2331"
    - family-names: "Gong"
      given-names: "Jiaying"
    - family-names: "Yuan"
      given-names: "Chenhan"
    - family-names: "Ingram"
      given-names: "William A."
      orcid: "0000-0002-8307-8844"
    - family-names: "Fox"
      given-names: "Edward A."
      orcid: "0000-0003-1447-6870"
    - family-names: "Eldardiry"
      given-names: "Hoda"
      orcid: "0000-0002-9712-6667"
  title: "VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models"
  journal: "CoRR"
  year: 2024
  doi: "10.48550/arXiv.2411.04825"
funding:
  - name: "Institute of Museum and Library Services"
    award_number: "LG-256638-OLS-24"
    award_uri: "https://www.imls.gov/grants/awarded/lg-256638-ols-24"
GitHub Events
Total
- Release event: 2
- Watch event: 2
- Push event: 5
- Create event: 4
Last Year
- Release event: 2
- Watch event: 2
- Push event: 5
- Create event: 4