https://github.com/argilla-io/awesome-llm-datasets

👩🤝🤖 A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated)

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ○ CITATION.cff file
  • ○ codemeta.json file
  • ○ .zenodo.json file
  • ○ DOI references
  • ✓ Academic publication links
    Links to: arxiv.org
  • ○ Academic email domains
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary
Last synced: 10 months ago

Repository

👩🤝🤖 A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated)

Basic Info
  • Host: GitHub
  • Owner: argilla-io
  • License: apache-2.0
  • Default Branch: main
  • Homepage:
  • Size: 18.6 KB
Statistics
  • Stars: 23
  • Watchers: 5
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created about 3 years ago · Last pushed about 3 years ago
Metadata Files
Readme License

README.md

👩🤝🤖 awesome-llm-datasets

This repository is a collection of useful links related to datasets for large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF).

It includes a variety of open datasets, as well as tools, pre-trained models, and research papers that can help researchers and developers work with LLMs and RLHF from a data perspective.

Follow and star for the latest and greatest links related to datasets for LLMs and RLHF.

Table of Contents

  1. 📦 Datasets
    1. 📚 For pre-training
      1. 2023
      2. Before 2023
    2. 🗣️ For instruction-tuning
    3. 👩🤝🤖 For RLHF
    4. ⚖️ For evaluation
    5. 👽 For other purposes
  2. 🦾 Models and their datasets
  3. 🧰 Tools and methods
  4. 📔 Papers

Datasets

For pre-training

2023

RedPajama Data:

A 1.2-trillion-token dataset in English:

| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |
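As a quick sanity check on the figures above, the per-source counts can be summed and turned into shares of the whole. This is only an illustrative sketch: the numbers are copied from the table, while the names `token_counts_billions` and `summarize` are made up for this example and are not part of the RedPajama code.

```python
# Token counts in billions, copied from the table above.
# The variable and function names are illustrative, not from RedPajama.
token_counts_billions = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}


def summarize(counts):
    """Return (total, {source: share}) for a mapping of source -> token count."""
    total = sum(counts.values())
    shares = {name: count / total for name, count in counts.items()}
    return total, shares


total_billions, shares = summarize(token_counts_billions)
print(f"Total: {total_billions} billion tokens (~{total_billions / 1000:.1f} trillion)")

# List the sources from largest to smallest share of the dataset.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14}{share:6.1%}")
```

The rows add up to 1,210 billion tokens, which matches the rounded 1.2 trillion total, and Commoncrawl alone accounts for roughly 73% of it.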

Also includes code for data preparation, deduplication, tokenization, and visualization.

Created by Ontocord.ai, MILA Québec AI Institute, ETH DS3Lab, Université de Montréal, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research research group and LAION.

Before 2023

For instruction-tuning

For RLHF & Alignment

For evaluation

For other purposes

Models and their datasets

LLaMA

Overview: A collection of open source foundation models ranging in size from 7B to 65B parameters released by Meta AI.

License: Non-commercial bespoke (model), GPL-3.0 (code)

📝 Release blog post

📄 arXiv publication

🃏 Model card

Vicuna

Overview: A 13B-parameter open source chatbot model fine-tuned from LLaMA on ~70k ChatGPT conversations; it is reported to retain 92% of ChatGPT's performance and to outperform LLaMA and Alpaca.

License: Non-commercial bespoke license (model), Apache 2.0 (code).

📦 Repo

📝 Release blog post

🔗 ShareGPT dataset

🤗 Models

🤖 Gradio demo

Dolly 2.0

Overview: A fully open source 12B parameter instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

License: CC BY-SA 3.0 (model), CC BY-SA 3.0 (dataset), Apache 2.0 (code).

📦 Repo

📝 Release blog post

🤗 Models

LLaVA

Overview: A multi-modal LLM that combines a vision encoder and Vicuna for general-purpose visual and language understanding, with capabilities similar to GPT-4.

License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).

📦 Repo

📝 Project homepage

📄 arXiv publication

🤗 Dataset & models

🤖 Gradio demo

StableLM

Overview: A suite of low-parameter (3B, 7B) LLMs trained on a new dataset built on The Pile, with 1.5 trillion tokens of content.

License: CC BY-SA 4.0 (models).

📦 Repo

📝 Release blog post

🤗 Models

🤖 Gradio demo

Alpaca

Overview: A partially open source instruction-following model fine-tuned from LLaMA; it is smaller and cheaper than GPT-3.5 while performing similarly.

License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).

📝 Release blog post

🤗 Dataset

Tools and methods

Papers

Owner

  • Name: Argilla
  • Login: argilla-io
  • Kind: organization
  • Email: contact@argilla.io

Building the open-source tool for data-centric NLP

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 4 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • m-newhauser (1)
Top Labels
Issue Labels
Pull Request Labels