https://github.com/argilla-io/awesome-llm-datasets

👩🤝🤖 A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated)

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links (links to: arxiv.org)
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 11.2%, to scientific vocabulary)

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: argilla-io
License: apache-2.0
Default Branch: main
Homepage:
Size: 18.6 KB

Statistics

Stars: 23
Watchers: 5
Forks: 1
Open Issues: 0
Releases: 0

Created about 3 years ago · Last pushed about 3 years ago

Metadata Files

Readme License

👩🤝🤖 awesome-llm-datasets

This repository is a collection of useful links related to datasets for large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF).

It includes a variety of open datasets, as well as tools, pre-trained models, and research papers that can help researchers and developers work with LLMs and RLHF from a data perspective.

Follow and star for the latest and greatest links related to datasets for LLMs and RLHF.

📦 Datasets
🦾 Models and their datasets
🧰 Tools and methods
📔 Papers

Datasets

For pre-training

2023

RedPajama Data:

A 1.2-trillion-token dataset in English:

| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |
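
As a quick sanity check on the table above, the per-source counts can be summed and converted to shares of the whole. This is a minimal sketch: the dictionary simply copies the table's figures (in billions), and the names `token_counts_billions` and `summarize` are illustrative and not part of the RedPajama code.

```python
# Token counts in billions, copied from the table above.
token_counts_billions = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}


def summarize(counts):
    """Return the total count and each source's share of it, in percent."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, shares


total, shares = summarize(token_counts_billions)
print(f"Total: {total} billion tokens (~{total / 1000:.1f} trillion)")
for name, pct in shares.items():
    print(f"{name:<14}{pct:>5}%")
```

The rows add up to 1,210 billion tokens, which matches the rounded 1.2 trillion total, and Commoncrawl alone accounts for roughly 73% of the corpus.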

Also includes code for data preparation, deduplication, tokenization, and visualization.

Created by Ontocord.ai, MILA Québec AI Institute, ETH DS3Lab, Université de Montréal, Stanford Center for Research on Foundation Models (CRFM), the Stanford Hazy Research group and LAION.

Before 2023

For instruction-tuning

For RLHF & Alignment

For evaluation

For other purposes

Models and their datasets

LLaMA

Overview: A collection of open source foundation models ranging in size from 7B to 65B parameters released by Meta AI.

License: Non-commercial bespoke (model), GPL-3.0 (code)

📝 Release blog post

📄 arXiv publication

🃏 Model card

Vicuna

Overview: A 13B parameter open source chatbot model, fine-tuned from LLaMA on ~70k ChatGPT conversations, that reportedly retains 92% of ChatGPT’s performance and outperforms LLaMA and Alpaca.

License: Non-commercial bespoke license (model), Apache 2.0 (code).

📦 Repo

📝 Release blog post

🔗 ShareGPT dataset

🤗 Models

🤖 Gradio demo

Dolly 2.0

Overview: A fully open source 12B parameter instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

License: CC BY-SA 3.0 (model), CC BY-SA 3.0 (dataset), Apache 2.0 (code).

📦 Repo

📝 Release blog post

🤗 Models

LLaVA

Overview: A multimodal LLM that combines a vision encoder with Vicuna for general-purpose visual and language understanding, with capabilities reported to be similar to those of GPT-4.

License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).

📦 Repo

📝 Project homepage

📄 arXiv publication

🤗 Dataset & models

🤖 Gradio demo

StableLM

Overview: A suite of small (3B and 7B parameter) LLMs trained on a new 1.5-trillion-token dataset built on The Pile.

License: CC BY-SA 4.0 (models).

📦 Repo

📝 Release blog post

🤗 Models

🤖 Gradio demo

Alpaca

Overview: A partially open source instruction-following model, fine-tuned from LLaMA, that is small and cheap to reproduce while performing similarly to GPT-3.5.

License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).

📝 Release blog post

🤗 Dataset

Tools and methods

Papers

Owner

Name: Argilla
Login: argilla-io
Kind: organization
Email: contact@argilla.io

Website: https://argilla.io
Twitter: argilla_io
Repositories: 12
Profile: https://github.com/argilla-io

Building the open-source tool for data-centric NLP

GitHub Events

Total

Watch event: 2

Last Year

Watch event: 2

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 4 days
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
