https://github.com/argilla-io/awesome-llm-datasets
A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated)
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- CITATION.cff file
- codemeta.json file
- .zenodo.json file
- DOI references
- Academic publication links (links to: arxiv.org)
- Academic email domains
- Institutional organization owner
- JOSS paper metadata
- Scientific vocabulary similarity: low similarity (11.2%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 23
- Watchers: 5
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
awesome-llm-datasets
This repository is a collection of useful links related to datasets for Large Language Models (LLMs) and Reinforcement Learning from Human Feedback (RLHF).
It includes a variety of open datasets, as well as tools, pre-trained models, and research papers that can help researchers and developers work with LLMs and RLHF from a data perspective.
Follow and star for the latest and greatest links related to datasets for LLMs and RLHF.
Table of Contents
Datasets
For pre-training
2023
1.2 Trillion tokens Dataset in English:
| Dataset       | Token Count  |
|---------------|--------------|
| Commoncrawl   | 878 Billion  |
| C4            | 175 Billion  |
| GitHub        | 59 Billion   |
| Books         | 26 Billion   |
| ArXiv         | 28 Billion   |
| Wikipedia     | 24 Billion   |
| StackExchange | 20 Billion   |
| Total         | 1.2 Trillion |
Also includes code for data preparation, deduplication, tokenization, and visualization.
Created by Ontocord.ai, MILA Québec AI Institute, ETH DS3Lab, Université de Montréal, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research research group and LAION.
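As a quick sanity check on the token counts listed for this dataset, the per-source figures can be summed and compared against the stated 1.2 trillion total. The snippet below is only a minimal sketch: the dictionary and variable names are illustrative, and the numbers are copied by hand from the table rather than read from the dataset itself.

```python
# Token counts in billions, copied by hand from the table above
# (not read from the dataset itself).
token_counts_billions = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}

# Sum the per-source counts and convert billions to trillions.
total_billions = sum(token_counts_billions.values())
total_trillions = total_billions / 1000

# Fraction of the corpus contributed by each source.
shares = {
    name: count / total_billions
    for name, count in token_counts_billions.items()
}

print(f"Total: {total_billions} billion tokens (~{total_trillions:.1f} trillion)")
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{share:6.1%}")
```

The per-source counts add up to 1,210 billion tokens, which rounds to the 1.2 trillion shown in the Total row, and Commoncrawl accounts for the largest share by a wide margin.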
Before 2023
For instruction-tuning
For RLHF & Alignment
For evaluation
For other purposes
Models and their datasets
LLaMA
Overview: A collection of open source foundation models ranging in size from 7B to 65B parameters released by Meta AI.
License: Non-commercial bespoke (model), GPL-3.0 (code)
Release blog post
arXiv publication
Model card
Vicuna
Overview: A 13B-parameter open source chatbot model, fine-tuned from LLaMA on ~70k ChatGPT conversations, that reportedly maintains 92% of ChatGPT's performance and outperforms LLaMA and Alpaca.
License: Non-commercial bespoke license (model), Apache 2.0 (code).
Repo
Release blog post
ShareGPT dataset
Models
Gradio demo
Dolly 2.0
Overview: A fully open source 12B parameter instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
License: CC BY-SA 3.0 (model), CC BY-SA 3.0 (dataset), Apache 2.0 (code).
Repo
Release blog post
Models
LLaVA
Overview: A multi-modal LLM that combines a vision encoder and Vicuna for general-purpose visual and language understanding, with capabilities similar to GPT-4.
License: Non-commercial bespoke (model), CC BY NC 4.0 (dataset), Apache 2.0 (code).
Repo
Project homepage
arXiv publication
Dataset & models
Gradio demo
StableLM
Overview: A suite of low-parameter (3B, 7B) LLMs trained on a new dataset built on The Pile, with 1.5 trillion tokens of content.
License: CC BY-SA-4.0 (models).
Repo
Release blog post
Models
Gradio demo
Alpaca
Overview: A partially open source instruction-following model, fine-tuned from LLaMA, that is smaller and cheaper than GPT-3.5 and performs similarly to it.
License: Non-commercial bespoke (model), CC BY-NC 4.0 (dataset), Apache 2.0 (code).
Release blog post
Dataset
Tools and methods
Papers
Owner
- Name: Argilla
- Login: argilla-io
- Kind: organization
- Email: contact@argilla.io
- Website: https://argilla.io
- Twitter: argilla_io
- Repositories: 12
- Profile: https://github.com/argilla-io
Building the open-source tool for data-centric NLP
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 4 days
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- m-newhauser (1)