https://github.com/andstor/lm-output-dataset
Dataset of various language model outputs from different datasets
Basic Info
- Host: GitHub
- Owner: andstor
- License: mit
- Language: Python
- Default Branch: main
- Size: 96.7 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
lm-output-dataset
Dataset of various language model outputs from different datasets
Description
This repository contains a dataset of various language model outputs from different datasets. This includes both open-source and proprietary models. The dataset is available at 🤗 Hugging Face.
Tags
Table of available tags and their meaning
| Tag name | Description | temperature | top-p | top-k |
|----------|-------------|:-----------:|:-----:|:-----:|
| greedy | Greedy decoding | 0 | 0 | 0 |
| random | Random sampling | 1 | 1 | 0 |
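As a rough illustration of how the tags in the table could be turned into decoding settings, the sketch below maps each tag to keyword arguments in the style of the `transformers` `generate()` method. The names `TAG_SETTINGS` and `generation_kwargs` are hypothetical and not part of this repository, and treating temperature 0 as "sampling disabled" is an assumption about how the greedy row is meant to be read.

```python
# Hypothetical mapping from tag name to the sampling settings listed in the table.
TAG_SETTINGS = {
    "greedy": {"temperature": 0.0, "top_p": 0.0, "top_k": 0},
    "random": {"temperature": 1.0, "top_p": 1.0, "top_k": 0},
}


def generation_kwargs(tag: str) -> dict:
    """Translate a tag into generate()-style keyword arguments (illustrative only)."""
    settings = TAG_SETTINGS[tag]
    if settings["temperature"] == 0:
        # Assumption: a temperature of 0 means deterministic (greedy) decoding,
        # so sampling is switched off and the other values are not passed on.
        return {"do_sample": False}
    return {"do_sample": True, **settings}


print(generation_kwargs("greedy"))
print(generation_kwargs("random"))
```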
Requirements
Dependencies
Install the Python dependencies defined in the requirements.txt.
```bash
pip install -r requirements.txt
```
Accelerate
Set up accelerate:
```bash
accelerate config
```
OpenAI API access
To use the OpenAI API, you need to set the OPENAI_API_KEY environment variable to your API key.
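A small check like the following can fail early when the variable is missing. This is only a sketch: the helper name `require_openai_api_key` is hypothetical and does not exist in this repository.

```python
import os


def require_openai_api_key() -> str:
    """Return the API key from the environment, or raise a clear error if it is unset."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before calling the OpenAI API."
        )
    return key
```

Calling `require_openai_api_key()` at the top of a notebook or script surfaces a missing key before any generation work starts.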
Generation with Hugging Face models
The generate.py script will generate samples from a specified dataset with a Hugging Face model.
Usage
```bash
usage: generate.py [-h] [--dataset_name DATASET_NAME]
                   [--dataset_config_name DATASET_CONFIG_NAME]
                   [--dataset_split DATASET_SPLIT]
                   [--text_column_name TEXT_COLUMN_NAME]
                   [--reference_column_name REFERENCE_COLUMN_NAME]
                   [--model_name_or_path MODEL_NAME_OR_PATH]
                   [--config_name CONFIG_NAME]
                   [--generation_config_file GENERATION_CONFIG_FILE]
                   [--tokenizer_name TOKENIZER_NAME]
                   [--use_slow_tokenizer]
                   [--per_device_batch_size PER_DEVICE_BATCH_SIZE]
                   [--output_dir OUTPUT_DIR] [--seed SEED]
                   [--preprocessing_num_workers PREPROCESSING_NUM_WORKERS]
                   [--overwrite_cache] [--tag TAG]
                   [--max_new_tokens MAX_NEW_TOKENS]
                   [--max_window_size MAX_WINDOW_SIZE]
                   [--subsamples SUBSAMPLES]
                   [--id_column_name ID_COLUMN_NAME]
                   [--keep_columns KEEP_COLUMNS]

Do inference with a transformer model on a causal language modeling task

optional arguments:
  -h, --help            show this help message and exit
  --dataset_name DATASET_NAME
                        The name of the dataset to use (via the datasets
                        library).
  --dataset_config_name DATASET_CONFIG_NAME
                        The configuration name of the dataset to use (via the
                        datasets library).
  --dataset_split DATASET_SPLIT
                        The name of the split to use.
  --text_column_name TEXT_COLUMN_NAME
                        The column name of the dataset to use.
  --reference_column_name REFERENCE_COLUMN_NAME
                        The column name of the dataset to use as reference. If
                        not provided, the cutoff text_column_name will be used.
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to pretrained model or model identifier from
                        huggingface.co/models.
  --config_name CONFIG_NAME
                        Pretrained config name or path if not the same as
                        model_name
  --generation_config_file GENERATION_CONFIG_FILE
                        Generation config path if not the same as model_name
  --tokenizer_name TOKENIZER_NAME
                        Pretrained tokenizer name or path if not the same as
                        model_name
  --use_slow_tokenizer  If passed, will use a slow tokenizer (not backed by
                        the 🤗 Tokenizers library).
  --per_device_batch_size PER_DEVICE_BATCH_SIZE
                        Batch size (per device) for the dataloader.
  --output_dir OUTPUT_DIR
                        Where to store the final model.
  --seed SEED           A seed for reproducible training.
  --preprocessing_num_workers PREPROCESSING_NUM_WORKERS
                        The number of processes to use for the preprocessing.
  --overwrite_cache     Overwrite the cached dataset
  --tag TAG             The tag to use for this generation run.
  --max_new_tokens MAX_NEW_TOKENS
                        The maximum number of new tokens to generate.
  --max_window_size MAX_WINDOW_SIZE
                        The maximum number of tokens in the input.
  --subsamples SUBSAMPLES
                        The number of subsamples to use from each data
                        example. Randomly selected. None means use all.
  --id_column_name ID_COLUMN_NAME
                        The column name of the dataset to use as id. If not
                        provided, the index will be used.
  --keep_columns KEEP_COLUMNS
                        The column names of the dataset to keep separate by
                        commas. If not provided, all columns will be removed.
```
Complete generation
Complete generation is done by providing both an input data column and a reference data column. This will make the model use the whole prompt as input. By setting --max_new_tokens to auto, all the unused embedding space is used to generate as much as possible. The input can be truncated (from the left) to at most --max_window_size tokens, which leaves room for a longer output (--max_new_tokens).
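The arithmetic behind `--max_new_tokens auto` and left truncation can be sketched as follows. This is not the code from generate.py: the function name `plan_complete_generation` is hypothetical, and it assumes that the "unused embedding space" is the model's context length minus the length of the (possibly truncated) prompt.

```python
from typing import List, Optional, Tuple


def plan_complete_generation(
    prompt_ids: List[int],
    context_length: int,
    max_window_size: Optional[int] = None,
) -> Tuple[List[int], int]:
    """Return the (possibly left-truncated) prompt and the token budget left for output."""
    if max_window_size is not None and len(prompt_ids) > max_window_size:
        # Truncate from the left: keep only the last max_window_size tokens.
        prompt_ids = prompt_ids[-max_window_size:]
    # Assumption: "auto" means every position not used by the prompt may be generated.
    budget = max(context_length - len(prompt_ids), 0)
    return prompt_ids, budget


# Illustrative numbers: a 3000-token prompt, a 2048-token context, a 512-token window.
kept, budget = plan_complete_generation(list(range(3000)), 2048, max_window_size=512)
print(len(kept), budget)
```

A smaller window leaves a larger budget for the output, which is the trade-off the paragraph above describes.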
Example
The following example will generate samples from the test split of the HumanEval-X dataset (Python configuration) using the greedy decoding strategy. The output will be saved to the output directory.
```bash
accelerate launch generate.py \
--dataset_name THUDM/humaneval-x \
--dataset_config_name python \
--dataset_split test \
--text_column_name prompt \
--reference_column_name canonical_solution \
--model_name_or_path EleutherAI/gpt-j-6B \
--generation_config_file generation_config.json \
--per_device_batch_size 1 \
--output_dir output \
--seed 42 \
--preprocessing_num_workers 10 \
--tag greedy \
--max_new_tokens auto
```
Strided generation
Strided generation is done by providing only an input data column. The input is split into parts, and each part is generated using a "sliding window" approach. Each window serves as the reference for the preceding window. The window stride (read size) is determined by the --max_window_size argument. The number of tokens to be generated is controlled by --max_new_tokens. The number of subsamples to use from each data example can be controlled by --subsamples. If --subsamples is set to None, all subsamples will be used.
Filtering
The dataset is filtered by the following criteria:
- The input is at least max_new_tokens + max_window_size long
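The length criterion can be expressed as a one-line predicate. The sketch below is illustrative only; the function name `is_long_enough` is hypothetical, and lengths are assumed to be counted in tokens.

```python
def is_long_enough(num_tokens: int, max_new_tokens: int, max_window_size: int) -> bool:
    """Keep an example only if it can fill one full window plus the generated part."""
    return num_tokens >= max_new_tokens + max_window_size


# With max_new_tokens=512 and max_window_size=512, the threshold is 1024 tokens.
print(is_long_enough(1024, 512, 512), is_long_enough(1023, 512, 512))
```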
Window splitting
Given an input, it is first truncated by max_new_tokens. The result is then split into windows of up to max_window_size tokens. The minimum size of each window is l / math.ceil(l / max_window_size), where l is len(input) - max_new_tokens. Each window is generated independently. Windows are indexed from 0: the first window has an index of 0, the second has an index of 1, etc.
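The splitting rule can be sketched in a few lines of Python. This is an interpretation, not the repository's implementation: the name `split_into_windows` is hypothetical, the truncation is assumed to drop tokens from the end of the input, and any remainder is spread over the first windows so that sizes differ by at most one token.

```python
import math
from typing import List


def split_into_windows(
    token_ids: List[int], max_new_tokens: int, max_window_size: int
) -> List[List[int]]:
    """Split the truncated input into near-equal windows of at most max_window_size tokens."""
    l = len(token_ids) - max_new_tokens
    if l <= 0:
        return []
    usable = token_ids[:l]  # assumption: the last max_new_tokens tokens are set aside
    num_windows = math.ceil(l / max_window_size)
    base, extra = divmod(l, num_windows)  # base is the minimum window size
    windows, start = [], 0
    for index in range(num_windows):
        size = base + (1 if index < extra else 0)
        windows.append(usable[start:start + size])
        start += size
    return windows


# 1712 input tokens, 512 reserved for generation: l = 1200, split into 3 windows.
windows = split_into_windows(list(range(1712)), max_new_tokens=512, max_window_size=512)
print([len(w) for w in windows])
```

The position of a window in the returned list corresponds to the window index described above.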
Example
The following example will generate samples from the test split of The Pile dataset using the greedy decoding strategy. The input will be truncated to 512 tokens and the maximum number of new tokens will be 512. The output will be saved to the output directory.
```bash
accelerate launch generate.py \
--dataset_name andstor/the_pile_github \
--dataset_config_name java \
--dataset_split test \
--text_column_name text \
--model_name_or_path EleutherAI/gpt-j-6B \
--generation_config_file generation_config.json \
--per_device_batch_size 1 \
--output_dir output \
--seed 42 \
--preprocessing_num_workers 20 \
--tag greedy \
--max_window_size 512 \
--max_new_tokens 512
```
The generation_config.json file contains the following configuration:
```json
{
  "do_sample": false,
  "max_new_tokens": 256,
  "bos_token_id": 50256,
  "eos_token_id": 50256
}
```
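To inspect such a configuration before launching a run, it can be read with Python's standard `json` module. The sketch below embeds the configuration as a string so that it is self-contained; reading it from generation_config.json instead would work the same way. Interpreting `"do_sample": false` as greedy decoding follows the usual `transformers` convention and is stated here as an assumption about this setup.

```python
import json

# The configuration shown above, embedded as text to keep the example self-contained.
CONFIG_TEXT = """
{
  "do_sample": false,
  "max_new_tokens": 256,
  "bos_token_id": 50256,
  "eos_token_id": 50256
}
"""

config = json.loads(CONFIG_TEXT)

# Assumption: sampling disabled corresponds to the "greedy" tag.
decoding = "sampling" if config.get("do_sample", False) else "greedy"
print(f"decoding={decoding}, max_new_tokens={config['max_new_tokens']}")
```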
Generation with third-party models (API)
Several third-party models such as ChatGPT have been used for generating data. Scripts for generating data with these models can be found in the notebooks directory. Note that access to most of these requires a paid subscription. Furthermore, most are closed-source models and might not be reproducible.
License
Copyright © André Storhaug
This repository is licensed under the MIT License.
Owner
- Name: André Storhaug
- Login: andstor
- Kind: user
- Location: Trondheim 🇳🇴
- Company: NTNU
- Website: https://andre.storhaug.no
- Repositories: 87
- Profile: https://github.com/andstor
🎓 CS PhD student @ Norwegian University of Science and Technology (NTNU)
Dependencies
- accelerate *
- datasets *
- numpy *
- torch *
- tqdm *
- transformers *