redacted-contextual-question-answering
Data for a new question answering task, along with associated code
https://github.com/isi-vista/redacted-contextual-question-answering
Science Score: 52.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ✓ Institutional organization owner: organization isi-vista has institutional domain (www.isi.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.1%) to scientific vocabulary
Repository
Data for a new question answering task, along with associated code
Basic Info
- Host: GitHub
- Owner: isi-vista
- License: mit
- Language: TeX
- Default Branch: main
- Size: 6.86 MB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Redacted contextual question answering
Publications
For more information about this project, see the related paper:
TODO: Add citation once paper is published
Installation
Use the provided Makefile to install this project by running the following command from the project root directory (the same directory as this README). Ensure the `python` on your `PATH` is Python 3.11 before running it:

```shell
make install
```
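Because the Makefile expects Python 3.11, it can help to verify the interpreter first; a minimal check (not part of the project's tooling) might look like:

```python
import sys

# The install instructions expect Python 3.11 on PATH.
major, minor = sys.version_info[:2]
if (major, minor) == (3, 11):
    print("Python 3.11 detected; safe to run `make install`.")
else:
    print(f"Found Python {major}.{minor}; the Makefile expects 3.11.")
```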
DeepSpeed must be installed manually. See Installation Details - DeepSpeed for instructions on how to do so.
Note that the installation command will attempt to download all used models from the Hugging Face Hub. To do this, you will need to create a Hugging Face account and request access on the pages for the following models:
Once your request has been approved, authenticate on your local machine with a user access token, following the official User access tokens documentation as a guide.
If the installation process fails, is interrupted, or for any reason needs to be restarted, run git clean -xdf to reset the repository's state.
Dataset
We have collected a dataset of 10 openly licensed summaries of movies and television episodes. These abstracts were found on Wikipedia, primarily via the "List of American films of 2023" and "Category:2023 works" pages. We only used works published in July 2023 or later to avoid materials that might have been used to train state-of-the-art (SOTA) LLMs.
Once we collected the summaries, we wrote 5 questions for each one. For each question, we then wrote 4 example answers: one for each of the 3 different constraints and one without any constraints. This resulted in 20 unique (question, constraint, answer) tuples for each summary.
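As a quick sanity check of these counts (a sketch only; dataset loading itself is not shown):

```python
summaries = 10              # openly licensed summaries
questions_per_summary = 5   # questions written per summary
answers_per_question = 4    # 3 constrained answers + 1 unconstrained

tuples_per_summary = questions_per_summary * answers_per_question
total_tuples = summaries * tuples_per_summary
print(tuples_per_summary, total_tuples)  # 20 tuples per summary, 200 overall
```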
The dataset is stored in the following directories/files:
- `rcqa_data/`: Directory of data files used in experiments. Most of these files are also in the paper's supplemental materials.
  - `datasets/`: Directory of JSON Lines files containing the output of `convert_json_to_prompts.py` using the files in `summaries/` as input.
  - `prompts/`: Directory of JSON Lines files used as input data for `run_paper_experiments.sh`.
  - `prompts.md`: File containing the prompts in an easier-to-read Markdown format.
  - `RedactedContextualQuestionAnsweringAnnotation.xlsx`: Model output with annotations of correctness, along with various relevant calculations and visualizations.
  - `summaries/`: Directory of individual JSON files for each summary. Each file contains a single object with the following fields:
    - `title`: Title of the television episode or movie.
    - `source`: Permalink to the Wikipedia page version the summary was copied from.
    - `summary`: Markdown-formatted summary of the episode or movie, copied from Wikipedia.
    - `questions`: Array of questions about the summary, with each question being an object with the following fields:
      - `question`: Question about the episode or movie that can be answered using the provided summary.
      - `answers`: Array of answers given specific constraints, with each answer being an object with the following fields:
        - `constraints`: Array of constraints to follow when answering the question.
        - `answer`: Example complete sentence that correctly answers the question and follows the constraints. If no answer is possible, the value is `null` instead.
The summaries (and therefore the dataset) are licensed under the Creative Commons 4.0 BY-SA (Attribution-ShareAlike) license.
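The summary schema can be illustrated with a small Python sketch; the object below is a hypothetical example for illustration, not an actual entry from `summaries/`:

```python
import json

# Hypothetical record following the documented schema (not real dataset content).
record = {
    "title": "Example Episode",
    "source": "https://en.wikipedia.org/w/index.php?title=Example&oldid=0",
    "summary": "An *example* Markdown summary.",
    "questions": [
        {
            "question": "Who is the main character?",
            "answers": [
                {"constraints": [], "answer": "The main character is Alice."},
                # `answer` is null (None in Python) when no answer can satisfy
                # the constraints.
                {"constraints": ["Answer in exactly one word."], "answer": None},
            ],
        }
    ],
}

# Each file in summaries/ holds one such object, so parsing a file with
# json.load() yields this structure.
for q in record["questions"]:
    for a in q["answers"]:
        print(q["question"], a["constraints"], a["answer"])
```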
Redacted contextual question answering experiments
Data
All data for the experiments can be found in rcqa_data/. See the "Dataset" section above for a complete description.
Running
Run the following command to perform training, inference, and evaluation for the paper:

```shell
bash scripts/run_paper_experiments.sh
```
You will likely need to make changes to the codebase to run in your specific environment.
Contributing
This project uses various code quality tooling, all of which is automatically installed with the rest of the development requirements.
All checks can be run with make check, and some additional automatic changes can be run with make fix.
To test GitHub Actions workflows locally, install act and run it with act.
License
- The dataset is under the Creative Commons 4.0 BY-SA (Attribution-ShareAlike) license, the same license used by the source material (Wikipedia) it is derived from.
- The code is under the MIT license.
- The paper is under the Creative Commons 4.0 BY (Attribution) license, which is used for all publications in the ACL Anthology.
- `src/run_clm.py` is originally under the Apache 2.0 license, with all changes from the original file being under the MIT license.
Owner
- Name: ISI Center for Vision, Image, Speech, and Text Analytics
- Login: isi-vista
- Kind: organization
- Location: Waltham, MA; Los Angeles, CA; Arlington, VA
- Website: https://www.isi.edu/centers/vista/home
- Repositories: 17
- Profile: https://github.com/isi-vista
Citation (CITATION.cff)
cff-version: 1.2.0
title: Redacted Contextual Question Answering
abstract: Data for a new question answering task, along with associated code
type: dataset
repository-code: https://github.com/isi-vista/redacted-contextual-question-answering
# Semantics are "AND" instead of "OR" but the `CITATION.cff` specification
# does not support the full SPDX license grammar.
# See the "License" section of the README for more detailed information.
license:
- Apache-2.0
- CC-BY-4.0
- CC-BY-SA-4.0
- MIT
license-url: https://github.com/isi-vista/redacted-contextual-question-answering/blob/main/README.md#license
message: If you use this dataset or software, please cite the paper from `preferred-citation` (TBD).
authors:
- given-names: Jacob
family-names: Lichtefeld
affiliation: USC Information Sciences Institute
- given-names: Joe A.
family-names: Cecil
affiliation: USC Information Sciences Institute
- given-names: Alex
family-names: Hedges
affiliation: USC Information Sciences Institute
- given-names: Jeremy
family-names: Abramson
affiliation: USC Information Sciences Institute
- given-names: Marjorie
family-names: Freedman
affiliation: USC Information Sciences Institute
# TODO: Add citation once paper is published
Dependencies
- actions/cache v4 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- mypy * development
- pre-commit * development
- pylint * development
- Jinja2 ==3.1.4
- Markdown ==3.6
- MarkupSafe ==2.1.5
- PyYAML ==6.0.1
- Werkzeug ==3.0.3
- absl-py ==2.1.0
- accelerate ==0.30.1
- aiohttp ==3.9.5
- aiosignal ==1.3.1
- annotated-types ==0.6.0
- anyio ==4.2.0
- astroid ==3.2.3
- attrs ==23.2.0
- certifi ==2024.2.2
- cfgv ==3.4.0
- charset-normalizer ==3.3.2
- datasets ==2.19.1
- deepspeed ==0.14.0
- dill ==0.3.8
- distlib ==0.3.8
- distro ==1.9.0
- evaluate ==0.4.2
- filelock ==3.15.4
- frozenlist ==1.4.1
- fsspec ==2024.3.1
- grpcio ==1.63.0
- h11 ==0.14.0
- hjson ==3.1.0
- httpcore ==1.0.2
- httpx ==0.26.0
- huggingface-hub ==0.23.0
- identify ==2.6.0
- idna ==3.7
- isort ==5.13.2
- joblib ==1.4.2
- mccabe ==0.7.0
- mpmath ==1.3.0
- multidict ==6.0.5
- multiprocess ==0.70.16
- mypy ==1.10.1
- mypy-extensions ==1.0.0
- networkx ==3.3
- ninja ==1.11.1.1
- nodeenv ==1.9.1
- numpy ==1.26.4
- openai ==1.7.1
- packaging ==24.0
- pandas ==2.2.2
- platformdirs ==4.2.2
- pre-commit ==3.7.1
- protobuf ==5.26.1
- psutil ==5.9.8
- py-cpuinfo ==9.0.0
- pyarrow ==16.1.0
- pyarrow-hotfix ==0.6
- pydantic ==2.7.1
- pydantic-settings ==2.1.0
- pydantic_core ==2.18.2
- pylint ==3.2.5
- pynvml ==11.5.0
- python-dateutil ==2.9.0.post0
- python-dotenv ==1.0.0
- pytz ==2024.1
- regex ==2024.5.15
- requests ==2.31.0
- responses ==0.18.0
- safetensors ==0.4.3
- scikit-learn ==1.4.2
- scipy ==1.13.0
- sentencepiece ==0.2.0
- setuptools ==69.5.1
- six ==1.16.0
- sniffio ==1.3.0
- sympy ==1.12
- tensorboard ==2.16.2
- tensorboard-data-server ==0.7.2
- threadpoolctl ==3.5.0
- tiktoken ==0.5.2
- tokenizers ==0.19.1
- tomlkit ==0.13.0
- torch ==2.2.2
- tqdm ==4.66.4
- transformers ==4.41.0
- typing_extensions ==4.12.2
- tzdata ==2024.1
- urllib3 ==2.2.1
- virtualenv ==20.26.3
- wheel ==0.43.0
- xxhash ==3.4.1
- yarl ==1.9.4
- accelerate *
- datasets >=2.14.0
- deepspeed *
- evaluate *
- openai *
- protobuf *
- pydantic-settings *
- scikit-learn *
- sentencepiece *
- tensorboard *
- tiktoken *
- torch >=1.3
- tqdm *
- transformers >=4.41.0
- wheel *