vtechagp-dataset

A structured dataset of Virginia Tech ETD abstracts with academic-to-general-audience paraphrases, useful for NLP and text simplification.

https://github.com/waingram/vtechagp-dataset

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.8%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: waingram
  • License: other
  • Default Branch: main
  • Size: 5.35 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog License Citation

README.md

VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset

Variable Descriptions

This dataset contains the following columns:

  • identifier_url: Persistent identifier (CNRI handle) for the ETD.
  • title: Title of the ETD.
  • abstract: Regular abstract for the ETD.
  • abstract_general: General audience abstract for the ETD.
  • subject_terms: List of subject terms for the ETD.
  • discipline: Field of study for the degree awarded.
  • department: Name of the academic department.
  • degree: Degree awarded.
  • degree_level: Level of the degree (e.g., 'doctoral' or 'masters').
  • type: Type of ETD (e.g., 'thesis' or 'dissertation').
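
As a rough illustration of how these columns might be used for text simplification, the sketch below reads rows and yields (abstract, abstract_general) pairs. It assumes the data is distributed as a CSV file with a header row matching the column names above; the README does not state the file format or file name, so the in-memory sample row here is made up and only stands in for the real file.

```python
import csv
import io

# Column names as listed in the variable descriptions above.
COLUMNS = [
    "identifier_url", "title", "abstract", "abstract_general",
    "subject_terms", "discipline", "department", "degree",
    "degree_level", "type",
]


def read_pairs(csv_file):
    """Yield (academic abstract, general-audience abstract) pairs.

    `csv_file` is any open text file object containing CSV with a header
    row. The CSV layout is an assumption, not something the README states.
    """
    reader = csv.DictReader(csv_file)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        source = row["abstract"].strip()
        target = row["abstract_general"].strip()
        # Skip rows where either side of the pair is empty.
        if source and target:
            yield source, target


def make_sample():
    """Build a one-row, made-up CSV in memory (hypothetical values)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerow({
        "identifier_url": "http://hdl.handle.net/0000/0",
        "title": "A made-up thesis title",
        "abstract": "We characterize the hydrodynamic response of the system.",
        "abstract_general": "We study how water moves through the system.",
        "subject_terms": "['hydrodynamics']",
        "discipline": "Engineering",
        "department": "Example Department",
        "degree": "Master of Science",
        "degree_level": "masters",
        "type": "thesis",
    })
    buf.seek(0)
    return buf


pairs = list(read_pairs(make_sample()))
```

With the real data, `make_sample()` would be replaced by `open(path, newline="", encoding="utf-8")`, where `path` points at whatever file the release actually ships.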

Methodology

This dataset was collected from Virginia Tech's institutional repository,
VTechWorks, using the Open Archives
Initiative Protocol for Metadata Harvesting (OAI-PMH) on September 22, 2023.
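
For readers unfamiliar with OAI-PMH, the sketch below shows the general shape of a harvest: build a `ListRecords` request URL, parse the XML response, and follow the `resumptionToken` to the next page. It is not the authors' harvesting code. The base URL `https://example.org/oai/request` is a placeholder, the `oai_dc` metadata prefix is an assumption (the README does not say which metadata format was requested), and the sample response is made up so the parsing step runs without network access.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent alongside it.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) from a ListRecords response body."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "descriptions": [
                d.text or "" for d in record.iterfind(".//dc:description", NS)
            ],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An empty or absent token means there are no further pages.
    return records, (token.strip() or None)


# Made-up response used only to exercise the parser.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A made-up thesis title</dc:title>
          <dc:description>Made-up abstract.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

first_url = list_records_url("https://example.org/oai/request")
records, next_token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("https://example.org/oai/request", next_token)
```

A real harvest would fetch each URL (for example with `urllib.request.urlopen`) and loop until `next_token` is `None`.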

Citing This Dataset

If you use this dataset in your research, please cite the following paper:

@article{ming2024vtechagp,
  author     = {Ming Cheng and Jiaying Gong and Chenhan Yuan and William A. Ingram and Edward A. Fox and Hoda Eldardiry},
  title      = {VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models},
  journal    = {CoRR},
  volume     = {abs/2411.04825},
  year       = {2024},
  doi        = {10.48550/arXiv.2411.04825},
  eprinttype = {arXiv},
  eprint     = {2411.04825}
}

Ethical & Usage Considerations

  • Publicly Available Data: All ETD metadata in this dataset was collected from publicly available sources in compliance with institutional repository policies.
  • Attribution: Users of this dataset should respect citation norms and acknowledge the original authors of the ETDs when appropriate.

Acknowledgment

This project was made possible in part by the Institute of Museum and Library Services
(grant LG-256638-OLS-24).

License

This dataset is released under the Open Data Commons Attribution License (ODC-By).

Owner

  • Name: Bill Ingram
  • Login: waingram
  • Kind: user
  • Location: Blacksburg, VA
  • Company: @VirginiaTech @VTUL

Assistant Professor, Assistant Dean and Director of IT, University Libraries @VTUL, @VirginiaTech

Citation (CITATION.cff)

cff-version: 1.2.0
title: "VTechAGP: Academic-to-General-Audience ETD Abstracts"
version: "1.0.0"
authors:
  - family-names: "Ingram"
    given-names: "William A."
    affiliation: "Virginia Tech"
    orcid: "0000-0002-8307-8844"
  - family-names: "Cheng"
    given-names: "Ming"
    affiliation: "Virginia Tech"
    orcid: "0009-0006-8475-2331"
  - family-names: "Gong"
    given-names: "Jiaying"
    affiliation: "Virginia Tech"
  - family-names: "Yuan"
    given-names: "Chenhan"
    affiliation: "Virginia Tech"
  - family-names: "Fox"
    given-names: "Edward A."
    affiliation: "Virginia Tech"
    orcid: "0000-0003-1447-6870"
  - family-names: "Eldardiry"
    given-names: "Hoda"
    affiliation: "Virginia Tech"
    orcid: "0000-0002-9712-6667"
doi: "10.5281/zenodo.14833933"
license: "ODC-By-1.0"
date-released: "2025-02-06"
repository-code: "https://github.com/waingram/VTechAGP-Dataset"
url: "https://doi.org/10.5281/zenodo.14833933"
keywords:
  - ETDs
  - text simplification
  - academic abstracts
  - paraphrase dataset
preferred-citation:
  type: article
  authors:
    - family-names: "Cheng"
      given-names: "Ming"
      orcid: "0009-0006-8475-2331"
    - family-names: "Gong"
      given-names: "Jiaying"
    - family-names: "Yuan"
      given-names: "Chenhan"
    - family-names: "Ingram"
      given-names: "William A."
      orcid: "0000-0002-8307-8844"
    - family-names: "Fox"
      given-names: "Edward A."
      orcid: "0000-0003-1447-6870"
    - family-names: "Eldardiry"
      given-names: "Hoda"
      orcid: "0000-0002-9712-6667"
  title: "VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models"
  journal: "CoRR"
  year: 2024
  doi: "10.48550/arXiv.2411.04825"
funding:
  - name: "Institute of Museum and Library Services"
    award_number: "LG-256638-OLS-24"
    award_uri: "https://www.imls.gov/grants/awarded/lg-256638-ols-24"

GitHub Events

Total
  • Release event: 2
  • Watch event: 2
  • Push event: 5
  • Create event: 4
Last Year
  • Release event: 2
  • Watch event: 2
  • Push event: 5
  • Create event: 4