https://github.com/google-deepmind/loft

LOFT: A 1 Million+ Token Long-Context Benchmark


Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.1%) to scientific vocabulary
Last synced: 6 months ago

Repository

LOFT: A 1 Million+ Token Long-Context Benchmark

Basic Info
  • Host: GitHub
  • Owner: google-deepmind
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 336 KB
Statistics
  • Stars: 198
  • Watchers: 11
  • Forks: 15
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed 10 months ago
Metadata Files
  • Readme
  • Contributing
  • License

README.md

LOFT: A 1 Million+ Token Long-Context Benchmark

This repository houses the resources for LOFT, the Long Context Frontiers benchmark, introduced in the research paper Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?. LOFT consists of 6 long-context task categories spanning retrieval, multi-hop compositional reasoning, and more, totaling 35 datasets and 4 modalities.

Installation

```bash
$ git clone git@github.com:google-deepmind/loft.git
$ cd loft/
$ pip install -r requirements.txt
```

Download Datasets and Prompts

The script below downloads all the LOFT datasets under BASE_DIR.

```bash
$ BASE_DIR=your-choice-of-directory
$ sh download.sh $BASE_DIR
```

Each dataset is also available from the links in the Datasets table. For a small subset of datasets, download.sh additionally runs preprocess.py, which infills the missing fields in the queries and corpus files. Once the download completes, you will see the following file structure:

```
$BASE_DIR
└── data
    ├── retrieval
    │   ├── arguana
    │   │   ├── 128k
    │   │   │   ├── corpus.jsonl
    │   │   │   ├── dev_queries.jsonl
    │   │   │   ├── few_shot_queries.jsonl
    │   │   │   └── test_queries.jsonl
    │   │   ├── 1m
    │   │   └── 32k
    │   ├── fever
    │   │   ├── ...
    │   ├── ...
    ├── rag
    ├── sql
    ├── icl
    └── mm
```
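The queries and corpus files are JSONL (one JSON object per line). As a minimal loading sketch (the field names in the usage comment are assumptions, not guaranteed by the repository's schema):

```python
import json

def load_jsonl(path):
    """Read a JSONL file: one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical usage against the layout above:
# corpus = load_jsonl(f"{base_dir}/data/retrieval/arguana/128k/corpus.jsonl")
# queries = load_jsonl(f"{base_dir}/data/retrieval/arguana/128k/dev_queries.jsonl")
```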

We also provide an example prompt in PROMPT_EXAMPLE.txt showing how Corpus-in-Context (CiC) prompting can be done for the text retrieval task.
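PROMPT_EXAMPLE.txt is the authoritative template; as a rough illustration of the CiC idea only (the ID format and instructions below are invented, not the repo's actual template), a CiC prompt inlines every corpus passage with an identifier and asks the model to return the identifiers relevant to the query:

```python
def build_cic_prompt(corpus, query):
    """Corpus-in-Context sketch: place the whole corpus in the prompt,
    each passage tagged with an ID, then ask for relevant IDs."""
    lines = ["You are given a corpus of passages."]
    for doc_id, text in corpus:
        lines.append(f"[{doc_id}] {text}")
    lines.append(f"Query: {query}")
    lines.append("Answer with the ID(s) of the relevant passage(s).")
    return "\n".join(lines)
```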

Inference and Evaluation

We currently support using Gemini (e.g., gemini-1.5-flash-002) from Vertex AI for inference. Please prepare your PROJECT_ID from Google Cloud. To run inference with gemini-1.5-flash-002 and evaluate the predictions:

```bash
BASE_DIR=$1
DATASET=$2
LENGTH="128k"
TASK_TYPE="retrieval"
SPLIT="dev"
PROMPT_TYPE="few_shot_with_cot"
PROMPT="${TASK_TYPE}_${DATASET}_${LENGTH}_${SPLIT}:${PROMPT_TYPE}"
echo "Prompt: ${PROMPT}"

mkdir -p ${BASE_DIR}/outputs/${TASK_TYPE}/${DATASET}/${LENGTH}
answer_file_extension="jsonl"

python run_inference.py \
    --prompt_name ${PROMPT} \
    --task_type ${TASK_TYPE} \
    --base_dir ${BASE_DIR} \
    --data_dir ${TASK_TYPE}/${DATASET}/${LENGTH} \
    --split ${SPLIT} \
    --context_length ${LENGTH} \
    --output_path ${BASE_DIR}/outputs/${TASK_TYPE}/${DATASET}/${LENGTH}/${SPLIT}_predictions.jsonl \
    --project_id ${PROJECT_ID} \
    --overwrite

python run_evaluation.py \
    --answer_file_path ${BASE_DIR}/data/${TASK_TYPE}/${DATASET}/${LENGTH}/dev_queries.${answer_file_extension} \
    --pred_file_path ${BASE_DIR}/outputs/${TASK_TYPE}/${DATASET}/${LENGTH}/${SPLIT}_predictions.jsonl \
    --task_type ${TASK_TYPE}
```

The same script is available as infer_eval.sh. We provide example queries and predictions files in evaluation/example_predictions/. Each task_type outputs several metric scores; to see which task_type to use for each dataset, and the primary evaluation metric reported in the paper, consult the Datasets table.
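For intuition, recall@1 (the primary metric for most retrieval datasets) reduces to checking whether the top predicted ID is among the gold IDs for each query. A sketch with assumed data shapes, not the exact schema run_evaluation.py consumes:

```python
def recall_at_1(gold, preds):
    """gold: {query_id: set of relevant doc ids};
    preds: {query_id: ranked list of predicted doc ids}.
    Returns the fraction of queries whose top-1 prediction is relevant."""
    hits = sum(1 for qid, ranked in preds.items()
               if ranked and ranked[0] in gold.get(qid, set()))
    return hits / len(preds) if preds else 0.0
```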

Get Prompts for 3P Evaluation

You can use the following command to get prompts for specific datasets. For instance, the prompts for the LOFT-hard subset are obtained as follows:

```bash
TASK="retrieval"
DATASET="qampari"
LENGTH="128k"
SPLIT="test"
PROMPT_NAME="${TASK}_${DATASET}_${LENGTH}_${SPLIT}:few_shot_with_cot"
python3 dump_prompts.py \
    --prompt_name="${PROMPT_NAME}" \
    --base_dir="${HOME}" \
    --output_format=text \
    --output_dir="${HOME}/prompts/${PROMPT_NAME}"
```

Datasets

| Task | Dataset | Description | Task Type | Primary Metric | Infilling Needed? | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Text Retrieval | ArguAna | Argument Retrieval | retrieval | recall@1 | - | Link |
| Text Retrieval | FEVER | Fact Checking | retrieval | recall@1 | - | Link |
| Text Retrieval | FIQA | Question Answering | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | MS MARCO | Web Search | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | NQ | Question Answering | retrieval | recall@1 | - | Link |
| Text Retrieval | Quora | Duplication Detection | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | SciFact | Citation Prediction | retrieval | recall@1 | - | Link |
| Text Retrieval | Touché-2020 | Argument Retrieval | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | TopiOCQA | Multi-turn QA | retrieval | recall@1 | - | Link |
| Text Retrieval | HotPotQA | Multi-hop QA | retrieval | mrecall@2 | - | Link |
| Text Retrieval | MuSiQue | Multi-hop QA | retrieval | mrecall@5 | - | Link |
| Text Retrieval | QAMPARI | Multi-target QA | retrieval | mrecall@5 | - | Link |
| Text Retrieval | QUEST | Multi-target QA | retrieval | mrecall@3 | - | Link |
| Visual Retrieval | Flickr30k | Image Retrieval | retrieval | recall@1 | - | Link |
| Visual Retrieval | MS COCO | Image Retrieval | retrieval | recall@1 | - | Link |
| Visual Retrieval | OVEN | Image-text Retrieval | retrieval | recall@1 | - | Link |
| Visual Retrieval | MSR-VTT | Video Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-en | Audio Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-es | Audio Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-fr | Audio Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-hi | Audio Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-zh | Audio Retrieval | retrieval | recall@1 | - | Link |
| RAG | NQ | Question Answering | rag | subspan_em | - | Link |
| RAG | TopiOCQA | Multi-turn QA | rag | subspan_em | - | Link |
| RAG | HotPotQA | Multi-hop QA | rag | subspan_em | - | Link |
| RAG | MuSiQue | Multi-hop QA | rag | subspan_em | - | Link |
| RAG | QAMPARI | Multi-target QA | multi_value_rag | subspan_em | - | Link |
| RAG | QUEST | Multi-target QA | multi_value_rag | subspan_em | - | Link |
| SQL | Spider | Single-turn SQL | sql | exec_acc | - | Link |
| SQL | SParC | Multi-turn SQL | sql | exec_acc | - | Link |
| Many-Shot ICL | BBH-date | Multiple-choice QA | icl | em | - | Link |
| Many-Shot ICL | BBH-salient | Multiple-choice QA | icl | em | - | Link |
| Many-Shot ICL | BBH-tracking7 | Multiple-choice QA | icl | em | - | Link |
| Many-Shot ICL | BBH-web | Multiple-choice QA | icl | em | - | Link |
| Many-Shot ICL | LIB-dialogue | Classification | - | - | - | ❌ |
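The metric names above can be read roughly as follows (sketches under our own definitions; run_evaluation.py is authoritative): mrecall@k asks whether *all* gold items appear in the top-k predictions, and subspan_em asks whether a gold answer appears as a normalized subspan of the prediction. The normalization rules here (lowercasing, dropping punctuation and articles) are assumptions:

```python
import re
import string

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def subspan_em(prediction, gold_answers):
    """1.0 if any normalized gold answer occurs inside the prediction."""
    pred = normalize(prediction)
    return float(any(normalize(g) in pred for g in gold_answers))

def mrecall_at_k(gold_ids, ranked_ids, k):
    """1.0 only if every gold id is retrieved within the top k."""
    return float(set(gold_ids) <= set(ranked_ids[:k]))
```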

LOFT-Hard Subset

The experiments in our paper showed that Gemini 1.5 already performed well on many LOFT datasets but still had headroom on others. Hence, we recommend iterating on the following three datasets:

Full datasets and inference are supported in the current open-source release.

LOFT Multimodal Datasets

For three of the LOFT multimodal datasets (Flickr30k, MS COCO, and MSR-VTT), please download the data from their respective websites:

  • Flickr30k: https://www.kaggle.com/datasets/hsankesara/flickr-image-dataset
  • MS COCO: https://cocodataset.org/
  • MSR-VTT: https://cove.thecvf.com/datasets/839

Past & Upcoming Releases

  • [x] Remaining multi-modal data and inference.
  • [x] Prompt conversion code (data => prompt).
  • [x] Inference code and prompts for retrieval (10/25/24).
  • [x] Evaluation code for ICL and some visual retrieval datasets (8/30/24).
  • [x] Evaluation code for text tasks and code to regenerate some of the LOFT datasets (6/29/24).
  • [x] Initial release with links to download many of the LOFT text datasets (6/20/24).

Citing this work

```
@article{Lee2024LongContext,
  title={Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?},
  author={Jinhyuk Lee and Anthony Chen and Zhuyun Dai and Dheeru Dua and Devendra Singh Sachan and Michael Boratko and Yi Luan and Sébastien M. R. Arnold and Vincent Perot and Siddharth Dalmia and Hexiang Hu and Xudong Lin and Panupong Pasupat and Aida Amini and Jeremy R. Cole and Sebastian Riedel and Iftekhar Naim and Ming-Wei Chang and Kelvin Guu},
  journal={ArXiv},
  year={2024},
  volume={abs/2406.13121},
  url={https://arxiv.org/abs/2406.13121}
}
```

License and disclaimer

Copyright 2024 DeepMind Technologies Limited

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Individual tasks may be subject to copyright and licensing from their respective owners - please see individual download files for details.

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an official Google product.

Owner

  • Name: Google DeepMind
  • Login: google-deepmind
  • Kind: organization

GitHub Events

Total
  • Issues event: 4
  • Watch event: 64
  • Delete event: 1
  • Issue comment event: 9
  • Push event: 12
  • Pull request event: 5
  • Fork event: 7
Last Year
  • Issues event: 4
  • Watch event: 64
  • Delete event: 1
  • Issue comment event: 9
  • Push event: 12
  • Pull request event: 5
  • Fork event: 7

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 38
  • Total Committers: 5
  • Avg Commits per committer: 7.6
  • Development Distribution Score (DDS): 0.5
Past Year
  • Commits: 36
  • Committers: 4
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.472
Top Committers
| Name | Email | Commits |
| :--- | :--- | ---: |
| Jinhyuk Lee | j****e@g****m | 19 |
| Anthony Chen | a****n@g****m | 14 |
| DeepMind | n****y@g****m | 2 |
| Hexiang Hu | h****g@g****m | 2 |
| DeepMind | n****y@d****m | 1 |
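The Development Distribution Score reported above appears to follow the common definition, one minus the top committer's share of all commits (an assumption about how this page computes it). Checking the all-time value against the committer table:

```python
def dds(commit_counts):
    """Development Distribution Score: 1 minus the top committer's
    share of total commits; 0.0 for an empty history."""
    total = sum(commit_counts)
    return 1 - max(commit_counts) / total if total else 0.0

# Per-committer all-time commit counts from the table above.
all_time = [19, 14, 2, 2, 1]  # totals 38, top committer has 19
```

With 19 of 38 commits from the top committer, dds(all_time) comes out to 0.5, matching the reported all-time figure.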
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 5
  • Total pull requests: 2
  • Average time to close issues: 3 months
  • Average time to close pull requests: 5 months
  • Total issue authors: 4
  • Total pull request authors: 2
  • Average comments per issue: 2.0
  • Average comments per pull request: 1.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 5
  • Pull requests: 2
  • Average time to close issues: 3 months
  • Average time to close pull requests: 5 months
  • Issue authors: 4
  • Pull request authors: 2
  • Average comments per issue: 2.0
  • Average comments per pull request: 1.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • EmGarr (2)
  • vkaul11 (1)
  • Tau-J (1)
  • iofu728 (1)
Pull Request Authors
  • dependabot[bot] (2)
  • Wangmerlyn (2)
Top Labels
Issue Labels
Pull Request Labels
  • dependencies (2)