redacted-contextual-question-answering
Data for a new question answering task, along with associated code
https://github.com/isi-vista/redacted-contextual-question-answering
Science Score: 52.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ✓ Institutional organization owner: organization isi-vista has institutional domain (www.isi.edu)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.1%) to scientific vocabulary
Repository
Data for a new question answering task, along with associated code
Basic Info
- Host: GitHub
- Owner: isi-vista
- License: mit
- Language: TeX
- Default Branch: main
- Size: 6.86 MB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Redacted contextual question answering
Publications
For more information about this project, see the related paper:
TODO: Add citation once paper is published
Installation
Use the provided Makefile to install this project by running the following command from the project root directory (the same directory as this README). Ensure the `python` on your `PATH` is Python 3.11 before running it:

```shell
make install
```
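Because the Makefile expects Python 3.11, it can help to verify the interpreter first; a minimal check (not part of the project's tooling) might look like:

```python
import sys

# The install instructions expect Python 3.11 on PATH.
major, minor = sys.version_info[:2]
if (major, minor) == (3, 11):
    print("Python 3.11 detected; safe to run `make install`.")
else:
    print(f"Found Python {major}.{minor}; the Makefile expects 3.11.")
```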
DeepSpeed must be installed manually. See Installation Details - DeepSpeed for instructions on how to do so.
Note that the installation command will attempt to download all used models from the Hugging Face Hub. To do this, you will need to create a Hugging Face account and request access on the pages for the following models:
Once your request has been approved, authenticate on your local machine with a user access token, following the official User access tokens documentation as a guide.
If the installation process fails, is interrupted, or for any reason needs to be restarted, run git clean -xdf to reset the repository's state.
Dataset
We have collected a dataset of 10 openly licensed summaries of movies and television episodes. These abstracts were found on Wikipedia, primarily via the "List of American films of 2023" and "Category:2023 works" pages. We only used works published in July 2023 or later to avoid materials that might have been used to train state-of-the-art (SOTA) LLMs.
Once we collected the summaries, we wrote 5 questions for each one. For each question, we then wrote 4 example answers: one for each of the 3 different constraints and one without any constraints. This resulted in 20 unique (question, constraint, answer) tuples for each summary.
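As a quick sanity check of these counts (a sketch only; dataset loading itself is not shown):

```python
summaries = 10              # openly licensed summaries
questions_per_summary = 5   # questions written per summary
answers_per_question = 4    # 3 constrained answers + 1 unconstrained

tuples_per_summary = questions_per_summary * answers_per_question
total_tuples = summaries * tuples_per_summary
print(tuples_per_summary, total_tuples)  # 20 tuples per summary, 200 overall
```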
The dataset is stored in the following directories/files:
- `rcqa_data/`: Directory of data files used in experiments. Most of these files are also in the paper's supplemental materials.
  - `datasets/`: Directory of JSON Lines files containing the output of `convert_json_to_prompts.py` using the files in `summaries/` as input.
  - `prompts/`: Directory of JSON Lines files used as input data for `run_paper_experiments.sh`.
  - `prompts.md`: File containing the prompts in an easier-to-read Markdown format.
  - `RedactedContextualQuestionAnsweringAnnotation.xlsx`: Model output with annotations of correctness, along with various relevant calculations and visualizations.
  - `summaries/`: Directory of individual JSON files for each summary. Each file contains a single object with the following fields:
    - `title`: Title of the television episode or movie.
    - `source`: Permalink to the Wikipedia page version the summary was copied from.
    - `summary`: Markdown-formatted summary of the episode or movie, copied from Wikipedia.
    - `questions`: Array of questions about the summary, with each question being an object with the following fields:
      - `question`: Question about the episode or movie that can be answered using the provided summary.
      - `answers`: Array of answers given specific constraints, with each answer being an object with the following fields:
        - `constraints`: Array of constraints to follow when answering the question.
        - `answer`: Example complete sentence that correctly answers the question and follows the constraints. If no answer is possible, the value is `null` instead.
The summaries (and therefore the dataset) are licensed under the Creative Commons 4.0 BY-SA (Attribution-ShareAlike) license.
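The summary schema can be illustrated with a small Python sketch; the object below is a hypothetical example for illustration, not an actual entry from `summaries/`:

```python
import json

# Hypothetical record following the documented schema (not real dataset content).
record = {
    "title": "Example Episode",
    "source": "https://en.wikipedia.org/w/index.php?title=Example&oldid=0",
    "summary": "An *example* Markdown summary.",
    "questions": [
        {
            "question": "Who is the main character?",
            "answers": [
                {"constraints": [], "answer": "The main character is Alice."},
                # `answer` is null (None in Python) when no answer can satisfy
                # the constraints.
                {"constraints": ["Answer in exactly one word."], "answer": None},
            ],
        }
    ],
}

# Each file in summaries/ holds one such object, so parsing a file with
# json.load() yields this structure.
for q in record["questions"]:
    for a in q["answers"]:
        print(q["question"], a["constraints"], a["answer"])
```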
Redacted contextual question answering experiments
Data
All data for the experiments can be found in rcqa_data/. See the "Dataset" section above for a complete description.
Running
Run the following command to perform training, inference, and evaluation for the paper:

```shell
bash scripts/run_paper_experiments.sh
```
You will likely need to make changes to the codebase to run in your specific environment.
Contributing
This project uses various code quality tooling, all of which is automatically installed with the rest of the development requirements.
All checks can be run with make check, and some additional automatic changes can be run with make fix.
To test GitHub Actions workflows locally, install act and run it with act.
License
- The dataset is under the Creative Commons 4.0 BY-SA (Attribution-ShareAlike) license, the same license used by the source material (Wikipedia) it is derived from.
- The code is under the MIT license.
- The paper is under the Creative Commons 4.0 BY (Attribution) license, which is used for all publications in the ACL Anthology.
- `src/run_clm.py` is originally under the Apache 2.0 license, with all changes from the original file being under the MIT license.
Owner
- Name: ISI Center for Vision, Image, Speech, and Text Analytics
- Login: isi-vista
- Kind: organization
- Location: Waltham, MA; Los Angeles, CA; Arlington, VA
- Website: https://www.isi.edu/centers/vista/home
- Repositories: 17
- Profile: https://github.com/isi-vista
Citation (CITATION.cff)
cff-version: 1.2.0
title: Redacted Contextual Question Answering
abstract: Data for a new question answering task, along with associated code
type: dataset
repository-code: https://github.com/isi-vista/redacted-contextual-question-answering
# Semantics are "AND" instead of "OR" but the `CITATION.cff` specification
# does not support the full SPDX license grammar.
# See the "License" section of the README for more detailed information.
license:
- Apache-2.0
- CC-BY-4.0
- CC-BY-SA-4.0
- MIT
license-url: https://github.com/isi-vista/redacted-contextual-question-answering/blob/main/README.md#license
message: If you use this dataset or software, please cite the paper from `preferred-citation` (TBD).
authors:
- given-names: Jacob
family-names: Lichtefeld
affiliation: USC Information Sciences Institute
- given-names: Joe A.
family-names: Cecil
affiliation: USC Information Sciences Institute
- given-names: Alex
family-names: Hedges
affiliation: USC Information Sciences Institute
- given-names: Jeremy
family-names: Abramson
affiliation: USC Information Sciences Institute
- given-names: Marjorie
family-names: Freedman
affiliation: USC Information Sciences Institute
# TODO: Add citation once paper is published
Dependencies
- actions/cache v4 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- mypy * development
- pre-commit * development
- pylint * development
- Jinja2 ==3.1.4
- Markdown ==3.6
- MarkupSafe ==2.1.5
- PyYAML ==6.0.1
- Werkzeug ==3.0.3
- absl-py ==2.1.0
- accelerate ==0.30.1
- aiohttp ==3.9.5
- aiosignal ==1.3.1
- annotated-types ==0.6.0
- anyio ==4.2.0
- astroid ==3.2.3
- attrs ==23.2.0
- certifi ==2024.2.2
- cfgv ==3.4.0
- charset-normalizer ==3.3.2
- datasets ==2.19.1
- deepspeed ==0.14.0
- dill ==0.3.8
- distlib ==0.3.8
- distro ==1.9.0
- evaluate ==0.4.2
- filelock ==3.15.4
- frozenlist ==1.4.1
- fsspec ==2024.3.1
- grpcio ==1.63.0
- h11 ==0.14.0
- hjson ==3.1.0
- httpcore ==1.0.2
- httpx ==0.26.0
- huggingface-hub ==0.23.0
- identify ==2.6.0
- idna ==3.7
- isort ==5.13.2
- joblib ==1.4.2
- mccabe ==0.7.0
- mpmath ==1.3.0
- multidict ==6.0.5
- multiprocess ==0.70.16
- mypy ==1.10.1
- mypy-extensions ==1.0.0
- networkx ==3.3
- ninja ==1.11.1.1
- nodeenv ==1.9.1
- numpy ==1.26.4
- openai ==1.7.1
- packaging ==24.0
- pandas ==2.2.2
- platformdirs ==4.2.2
- pre-commit ==3.7.1
- protobuf ==5.26.1
- psutil ==5.9.8
- py-cpuinfo ==9.0.0
- pyarrow ==16.1.0
- pyarrow-hotfix ==0.6
- pydantic ==2.7.1
- pydantic-settings ==2.1.0
- pydantic_core ==2.18.2
- pylint ==3.2.5
- pynvml ==11.5.0
- python-dateutil ==2.9.0.post0
- python-dotenv ==1.0.0
- pytz ==2024.1
- regex ==2024.5.15
- requests ==2.31.0
- responses ==0.18.0
- safetensors ==0.4.3
- scikit-learn ==1.4.2
- scipy ==1.13.0
- sentencepiece ==0.2.0
- setuptools ==69.5.1
- six ==1.16.0
- sniffio ==1.3.0
- sympy ==1.12
- tensorboard ==2.16.2
- tensorboard-data-server ==0.7.2
- threadpoolctl ==3.5.0
- tiktoken ==0.5.2
- tokenizers ==0.19.1
- tomlkit ==0.13.0
- torch ==2.2.2
- tqdm ==4.66.4
- transformers ==4.41.0
- typing_extensions ==4.12.2
- tzdata ==2024.1
- urllib3 ==2.2.1
- virtualenv ==20.26.3
- wheel ==0.43.0
- xxhash ==3.4.1
- yarl ==1.9.4
- accelerate *
- datasets >=2.14.0
- deepspeed *
- evaluate *
- openai *
- protobuf *
- pydantic-settings *
- scikit-learn *
- sentencepiece *
- tensorboard *
- tiktoken *
- torch >=1.3
- tqdm *
- transformers >=4.41.0
- wheel *