https://github.com/google-deepmind/streamingqa

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: google-deepmind
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 17.6 KB
Statistics
  • Stars: 48
  • Watchers: 3
  • Forks: 1
  • Open Issues: 3
  • Releases: 0
Created almost 4 years ago · Last pushed over 2 years ago
Metadata Files
  • Readme
  • Contributing
  • License

README.md

StreamingQA

This repository contains the StreamingQA question-answering datasets, a list of deduplicated WMT document IDs, and a script to process and filter the WMT documents, to be used in conjunction with the paper: StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models (Liška, Kočiský, Gribovskaya, Terzi et al., 2022).

If you use this dataset in your research, please cite:

```bibtex
@article{streamingqa2022,
  title={StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models},
  author={Adam Li{\v{s}}ka and Tom{\'a}{\v{s}} Ko{\v{c}}isk{\'y} and Elena Gribovskaya and Tayfun Terzi and Eren Sezener and Devang Agrawal and Cyprien de Masson d'Autume and Tim Scholtes and Manzil Zaheer and Susannah Young and Ellen Gilsenan-McMahon and Sophia Austin and Phil Blunsom and Angeliki Lazaridou},
  journal={arXiv preprint arXiv:2205.11388},
  year={2022}
}
```

Data

The paper-specific data can be downloaded using the links provided below. The files are stored in Google Cloud Storage in gzipped form.

WMT

We downloaded document-split versions of the English WMT News Crawl dataset. As the dataset does not provide document IDs, we used SHA256 hashes of the Base64-encoded, unsplit article texts as part of the "sorting key IDs" (see below).
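As an illustration of this hashing scheme (a sketch only; the authoritative implementation is the repository's extraction.py, and the full sorting keys contain more than this hash):

```python
import base64
import hashlib

def doc_content_hash(unsplit_text: str) -> str:
    """SHA256 hex digest of the Base64-encoded article text.

    Illustrative sketch only: the actual sorting key IDs produced by
    extraction.py combine this kind of hash with other components.
    """
    encoded = base64.b64encode(unsplit_text.encode('utf-8'))
    return hashlib.sha256(encoded).hexdigest()

print(doc_content_hash('Example article text.'))  # 64-character hex string
```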

Deduplicated subset

For the paper, we use a deduplicated subset of the WMT data. To reproduce this subset, use the provided list of WMT sorting key IDs, which, in conjunction with the extraction script, can be used to filter out duplicate documents. The list is stored as newline-delimited sorting keys.

StreamingQA

The StreamingQA questions and answers (including metadata) are stored in JSONL files. We provide subsets for train, valid, and eval separately.

Each QA entry has attributes:

Field | Type | Description
:--------------------------- | :---------- | :----------
qa_id | str | Question identifier: "eval-X", "valid-X", or "train-X", where X is an integer index starting from zero.
question | str | The question text.
answers | List[str] | A list of answers; len=1 for questions in the 'train' and 'valid' subsets, and len=3 for questions in the 'eval' subset.
answers_additional | List[str] | Additional answers only available for the 'eval' subset (empty string for the 'train' and 'valid' subsets). This is the 4th additional reference collected to compute the human benchmark. It is not used for evaluation but may be useful for other purposes.
question_ts | int | Timestamp (UTC seconds) of the date when the question was asked.
evidence_ts | int | Timestamp (UTC seconds) of the date when the corresponding WMT news article was published.
evidence_id | str | The WMT sorting key ID of the document text that was used as evidence for the question.
recent_or_past | str | Which subset the question belongs to ("recent" vs. "past").
written_or_generated | str | Whether the question is based on human annotations ("written") or was "generated".
toxicity_identity_attack | float | Toxicity score of Perspective API classifier "IDENTITY_ATTACK".
toxicity_insult | float | Toxicity score of Perspective API classifier "INSULT".
toxicity_profanity | float | Toxicity score of Perspective API classifier "PROFANITY".
toxicity_severe_toxicity | float | Toxicity score of Perspective API classifier "SEVERE_TOXICITY".
toxicity_sexually_explicit | float | Toxicity score of Perspective API classifier "SEXUALLY_EXPLICIT".
toxicity_threat | float | Toxicity score of Perspective API classifier "THREAT".

For detailed definitions of the toxicity classifiers, please refer to the Perspective API website.
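As a sketch of how these fields can be consumed, the snippet below parses one JSONL line and converts the question timestamp to a date (the entry is synthetic; all values are made up for illustration):

```python
import datetime
import json

# A synthetic JSONL line illustrating a subset of the schema above.
raw_line = json.dumps({
    'qa_id': 'train-0',
    'question': 'Who won the match?',
    'answers': ['Team A'],
    'question_ts': 1577836800,  # 2020-01-01 00:00:00 UTC
    'evidence_ts': 1577750400,
    'recent_or_past': 'recent',
})

entry = json.loads(raw_line)
asked = datetime.datetime.fromtimestamp(
    entry['question_ts'], tz=datetime.timezone.utc)
print(entry['qa_id'], asked.date())  # train-0 2020-01-01
```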

Download

Name | File | Size (bytes) | Entries (lines) | MD5 | Download
:------------------------------- | :--------------------------- | ------------: | --------------: | :--------------------------------- | :-------
Deduplicated WMT sorting key IDs | wmt_sorting_key_ids.txt.gz | 439,101,648 | 11,393,471 | 3356d7e38e43b7bf4338e2003ab92f36 | Link
StreamingQA train subset | streaminqa_train.jsonl.gz | 17,466,691 | 99,402 | 32b3bc32b39f81bc2f0e9ab6fb4201b3 | Link
StreamingQA valid subset | streaminqa_valid.jsonl.gz | 1,749,221 | 9,939 | 3570fbba6e2630e0c2bff03b150f9230 | Link
StreamingQA eval subset | streaminqa_eval.jsonl.gz | 7,455,358 | 36,378 | a54db9a7e6fb1adfea7d4022f5fc49bd | Link
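Downloaded files can be checked against the MD5 digests above; for example, with a standard-library sketch:

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# e.g. after downloading the train subset:
# assert md5_of_file('streaminqa_train.jsonl.gz') == '32b3bc32b39f81bc2f0e9ab6fb4201b3'
```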

Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property | value
:------- | :----
name | StreamingQA
url | 
sameAs | https://github.com/deepmind/streamingqa
description | Data accompanying StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models (Liška, Kočiský, Gribovskaya, Terzi et al., 2022).
provider | DeepMind (sameAs: https://en.wikipedia.org/wiki/DeepMind)
citation | https://identifiers.org/arxiv:2205.11388

Disclaimer

This dataset is based on news articles from various sources and contains a small number of questions and answers, both human-written and automatically generated, that are toxic and may be triggering or worrisome for researchers, or may cause models to generate such content. We aimed for a balanced process that identifies most of the toxic content while limiting the risk of removing false positives. We estimate that 0.5% of items in the dataset remain toxic after our toxicity filtering. Secondly, questions and answers reflect information from the news articles and, in particular, may not always be factually correct. Furthermore, this dataset is intended to evaluate the adaptation of models to new information in news over time, and it may therefore not be applicable to settings where our assumptions do not hold. We provide further discussion of toxicity and details of our filtering in the paper.

Code

Installation

To install, run ./run.sh to set up a Python environment and install the necessary Python packages (listed in requirements.txt). The script completes by printing the output of the tests.

Example usage in Python

After installation, with the Python virtual environment activated (here streamingqa_env), you can start an interactive Python session and use the following code (assuming you have downloaded the files linked above into the same directory as the code):

WMT Docs

We provide a Python script extraction.py that extracts the downloaded WMT data, pre-processes the text, assigns the sorting key IDs, and finally filters out the duplicate documents. The main entry point get_deduplicated_wmt_docs yields WMTDoc objects with attributes being the assigned sorting key ID (sorting_key), the document publication date as UTC timestamp in seconds (publication_ts), and the pre-processed document text (text).

```py
import extraction

archive_file_names = [
    'news-docs.2007.en.filtered.gz',
    'news-docs.2008.en.filtered.gz',
    'news-docs.2009.en.filtered.gz',
    'news-docs.2010.en.filtered.gz',
    'news-docs.2011.en.filtered.gz',
    'news-docs.2012.en.filtered.gz',
    'news-docs.2013.en.filtered.gz',
    'news-docs.2014.en.filtered.gz',
    'news-docs.2015.en.filtered.gz',
    'news-docs.2016.en.filtered.gz',
    'news-docs.2017.en.filtered.gz',
    'news-docs.2018.en.filtered.gz',
    'news-docs.2019.en.filtered.gz',
    'news-docs.2020.en.filtered.gz',
    'news-docs.2021.en.filtered.gz',
]

wmt_docs = extraction.get_deduplicated_wmt_docs(
    wmt_archive_file_paths_or_objects=archive_file_names,
    deduplicated_sorting_keys_file_path_or_object='wmt_sorting_key_ids.txt.gz',
)
```
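The yielded docs can then be consumed as plain objects. The sketch below uses a stand-in NamedTuple that mirrors the WMTDoc attributes named above (the real class is defined in extraction.py):

```python
import datetime
from typing import NamedTuple

# Stand-in mirroring the WMTDoc attributes described above;
# the real definition lives in extraction.py.
class WMTDoc(NamedTuple):
    sorting_key: str
    publication_ts: int  # publication date, UTC seconds
    text: str            # pre-processed document text

doc = WMTDoc('abc123', 1262304000, 'Some pre-processed article text.')
published = datetime.datetime.fromtimestamp(
    doc.publication_ts, tz=datetime.timezone.utc)
print(doc.sorting_key, published.date())  # abc123 2010-01-01
```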

WMT Passages

Furthermore, we also provide a function to reproduce our splits of articles into sentence chunks. These passages can be used as the search space for the retrieval architecture as is discussed in more detail in the paper mentioned above.

```py
wmt_passages = extraction.get_wmt_passages_from_docs(
    wmt_docs=wmt_docs,
    preprend_date=True,
)
```

StreamingQA

```py
import gzip
import json

file_name_by_streamingqa_subset = {
    'train': 'streaminqa_train.jsonl.gz',
    'valid': 'streaminqa_valid.jsonl.gz',
    'eval': 'streaminqa_eval.jsonl.gz',
}

streamingqa = {}
for subset_name, file_name in file_name_by_streamingqa_subset.items():
  with open(file_name, 'rb') as input_file:
    with gzip.open(input_file) as ungzipped_file:
      streamingqa[subset_name] = [
          json.loads(line.decode()) for line in ungzipped_file
      ]
```

License and disclaimer

Copyright 2022 DeepMind Technologies Limited

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.

Owner

  • Name: Google DeepMind
  • Login: google-deepmind
  • Kind: organization

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 4
  • Total Committers: 1
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Tayfun Terzi t****i@g****m 4
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 4
  • Total pull requests: 0
  • Average time to close issues: 3 months
  • Average time to close pull requests: N/A
  • Total issue authors: 4
  • Total pull request authors: 0
  • Average comments per issue: 0.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • csbobby (1)
  • dcdsf321 (1)
  • jon-chuang (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.in pypi
  • absl-py ==1.0.0
  • fs ==2.4.15
  • pytz ==2021.3
requirements.txt pypi
  • absl-py ==1.0.0
  • appdirs ==1.4.4
  • fs ==2.4.15
  • pytz ==2021.3
  • six ==1.16.0