muld

The Multitask Long Document Benchmark

https://github.com/ghomashudson/muld

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.0%) to scientific vocabulary

Keywords

benchmark long-texts nlp
Last synced: 6 months ago

Repository

The Multitask Long Document Benchmark

Basic Info
  • Host: GitHub
  • Owner: ghomasHudson
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 94.7 KB
Statistics
  • Stars: 41
  • Watchers: 1
  • Forks: 1
  • Open Issues: 2
  • Releases: 0
Topics
benchmark long-texts nlp
Created about 4 years ago · Last pushed over 3 years ago
Metadata Files
  • Readme
  • Citation

README.md

MuLD: The Multitask Long Document Benchmark

MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks whose inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types, including translation, summarization, question answering, and classification. Additionally, output lengths range from a single-word classification label all the way up to outputs longer than the input text.

[Image: table summarizing the MuLD tasks]

This repo contains official code for the paper MuLD: The Multitask Long Document Benchmark.

Quickstart

The easiest method is to use the Huggingface Datasets library:

```python
import datasets

ds = datasets.load_dataset("ghomasHudson/muld", "NarrativeQA")
ds = datasets.load_dataset("ghomasHudson/muld", "HotpotQA")
ds = datasets.load_dataset("ghomasHudson/muld", "Character Archetype Classification")
ds = datasets.load_dataset("ghomasHudson/muld", "OpenSubtitles")
ds = datasets.load_dataset("ghomasHudson/muld", "AO3 Style Change Detection")
ds = datasets.load_dataset("ghomasHudson/muld", "VLSP")
```

Or by cloning this repo:

```python
import datasets

ds = datasets.load_dataset("./muld.py", "NarrativeQA")
...
```
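Once a split is loaded, MuLD's defining property (every input is at least 10,000 words long) can be sanity-checked with a small filter. This is a minimal sketch only; the `input` field name is an assumption about the dataset schema, so verify it against the loaded features before relying on it:

```python
# Sketch: checking MuLD's minimum-length criterion on loaded examples.
# NOTE: the "input" field name is an assumption; adjust to the real schema.
def is_long_document(example: dict, min_words: int = 10_000) -> bool:
    """True if the example's input meets MuLD's minimum word count
    (approximated here as whitespace-separated tokens)."""
    return len(example["input"].split()) >= min_words

# Toy stand-ins for real examples (real MuLD inputs are far longer):
toy = [
    {"input": "word " * 12_000},
    {"input": "short document"},
]
print([is_long_document(ex) for ex in toy])  # [True, False]
```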

Manual Download

If you prefer to download the data files yourself:
  • NarrativeQA: Train, Val, Test (Mirror: Train, Val, Test)
  • HotpotQA: Train, Val (Mirror: Train, Val)
  • Character Archetype Classification: Train, Val, Test (Mirror: Train, Val, Test)
  • OpenSubtitles: Train, Test (Mirror: Train, Test)
  • Style Change: Train, Val, Test (Mirror: Train, Val, Test)
  • VLSP: Test (Mirror: Test)

Citation

If you use our benchmark please cite the paper:

```bibtex
@InProceedings{hudson-almoubayed:2022:LREC,
  author    = {Hudson, George and Al Moubayed, Noura},
  title     = {MuLD: The Multitask Long Document Benchmark},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {3675--3685},
  url       = {https://aclanthology.org/2022.lrec-1.392}
}
```

Additionally, please cite the datasets we used (particularly NarrativeQA, HotpotQA, and OpenSubtitles, where we directly use their data with limited filtering).

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

| property | value |
|---|---|
| name | MuLD |
| alternateName | Multitask Long Document Benchmark |
| url | |
| description | MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks whose inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types, including translation, summarization, question answering, and classification. Additionally, output lengths range from a single-word classification label all the way up to outputs longer than the input text. |
| citation | https://arxiv.org/abs/2202.07362 |
| creator | see nested table below |

| property | value |
|---|---|
| name | Thomas Hudson |
| sameAs | https://orcid.org/0000-0003-3562-3593 |
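The table above corresponds to schema.org `Dataset` markup of the kind Google Dataset Search indexes. A hedged sketch of the equivalent JSON-LD, built in Python (this is an illustration of the structure, not the README's exact markup):

```python
import json

# Sketch of the schema.org JSON-LD equivalent of the metadata table above.
metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "MuLD",
    "alternateName": "Multitask Long Document Benchmark",
    "citation": "https://arxiv.org/abs/2202.07362",
    "creator": {
        "@type": "Person",
        "name": "Thomas Hudson",
        "sameAs": "https://orcid.org/0000-0003-3562-3593",
    },
}
print(json.dumps(metadata, indent=2))
```

In a web page, this object would typically be embedded in a `<script type="application/ld+json">` tag so crawlers can pick it up.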

Owner

  • Name: Thomas Hudson
  • Login: ghomasHudson
  • Kind: user
  • Company: Durham University

Research Associate at Durham University, researching NLP for veterinary medicine.

Citation (CITATION.cff)

cff-version: 1.2.0
title: 'MuLD: The Multitask Long Document Benchmark'
message: >-
  If you use this dataset, please cite it using the
  metadata from this file.
type: dataset
authors:
  - given-names: G Thomas
    family-names: Hudson
    email: g.t.hudson@durham.ac.uk
    affiliation: Durham University
    orcid: 'https://orcid.org/0000-0003-3562-3593'
  - given-names: Noura
    name-particle: Al
    family-names: Moubayed
    orcid: 'https://orcid.org/0000-0001-8942-355X'
    affiliation: Durham University
identifiers:
  - type: url
    value: 'https://aclanthology.org/2022.lrec-1.392'
abstract: >-
  The impressive progress in NLP techniques has been
  driven by the development of multi-task benchmarks
  such as GLUE and SuperGLUE. While these benchmarks
  focus on tasks for one or two input sentences,
  there has been exciting work in designing efficient
  techniques for processing much longer inputs. In
  this paper, we present MuLD: a new long document
  benchmark consisting of only documents over 10,000
  tokens. By modifying existing NLP tasks, we create
  a diverse benchmark which requires models to
  successfully model long-term dependencies in the
  text. We evaluate how existing models perform, and
  find that our benchmark is much more challenging
  than their ‘short document’ equivalents.
  Furthermore, by evaluating both regular and
  efficient transformers, we show that models with
  increased context length are better able to solve
  the tasks presented, suggesting that future
  improvements in these models are vital for solving
  similar long document problems. We release the data
  and code for baselines to encourage further
  research on efficient NLP models.
keywords:
  - Long Documents
  - Benchmark
  - Multitask learning
  - NLP
license: CC-BY-NC-4.0

preferred-citation:
  authors:
    - given-names: G Thomas
      family-names: Hudson
      email: g.t.hudson@durham.ac.uk
      affiliation: Durham University
      orcid: 'https://orcid.org/0000-0003-3562-3593'
    - given-names: Noura
      name-particle: Al
      family-names: Moubayed
      orcid: 'https://orcid.org/0000-0001-8942-355X'
      affiliation: Durham University
  title: "MuLD: The Multitask Long Document Benchmark"
  type: conference-paper
  collection-title: Proceedings of the Language Resources and Evaluation Conference
  conference:
    name: Language Resources and Evaluation Conference
    date-start: 2022-06-21
    date-end: 2022-06-23
    address: Marseille, France
  location: 
    name: Marseille, France
  start: 3675
  end: 3685
  publisher:
    name: European Language Resources Association
  url: https://aclanthology.org/2022.lrec-1.392
  abstract: >-
    The impressive progress in NLP techniques has been
    driven by the development of multi-task benchmarks
    such as GLUE and SuperGLUE. While these benchmarks
    focus on tasks for one or two input sentences,
    there has been exciting work in designing efficient
    techniques for processing much longer inputs. In
    this paper, we present MuLD: a new long document
    benchmark consisting of only documents over 10,000
    tokens. By modifying existing NLP tasks, we create
    a diverse benchmark which requires models to
    successfully model long-term dependencies in the
    text. We evaluate how existing models perform, and
    find that our benchmark is much more challenging
    than their ‘short document’ equivalents.
    Furthermore, by evaluating both regular and
    efficient transformers, we show that models with
    increased context length are better able to solve
    the tasks presented, suggesting that future
    improvements in these models are vital for solving
    similar long document problems. We release the data
    and code for baselines to encourage further
    research on efficient NLP models.

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 38
  • Total Committers: 1
  • Avg Commits per committer: 38.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Thomas Hudson (0****n@g****m): 38 commits

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 3
  • Total pull requests: 0
  • Average time to close issues: 22 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 1.33
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • floschne (2)
  • yulonglin (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels