https://github.com/andstor/text-generation

Generation script(s) for generating text with LLMs

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: andstor
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 121 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme

README.md

text-generation

Generation script(s) for generating text with LLMs

Description

This repository contains scripts for generating text with language models. The scripts are designed to be used with the Hugging Face transformers library and the datasets library. Accelerate is used to speed up the generation process.

Requirements

Dependencies

Install the Python dependencies defined in requirements.txt:

    pip install -r requirements.txt

Accelerate

Set up Accelerate:

    accelerate config

OpenAI API access

To use the OpenAI API, you need to set the OPENAI_API_KEY environment variable to your API key.
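As a minimal sketch of how a script or notebook might pick up this variable, the snippet below reads it with the standard library only. The helper name `get_openai_api_key` is hypothetical and not part of this repository.

```python
import os


def get_openai_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```

Failing early with an explicit message is usually easier to debug than an authentication error returned by the API later on.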

Generation with Hugging Face models

The run_gen.py script will generate samples from a specified dataset with a Hugging Face model. It supports a large set of options for controlling the generation process.

Usage

```bash
usage: run_gen.py [-h] [--model_name_or_path MODEL_NAME_OR_PATH]
                  [--model_type MODEL_TYPE] [--config_name CONFIG_NAME]
                  [--tokenizer_name TOKENIZER_NAME]
                  [--use_fast_tokenizer [USE_FAST_TOKENIZER]]
                  [--no_use_fast_tokenizer]
                  [--model_revision MODEL_REVISION] [--token TOKEN]
                  [--use_auth_token [USE_AUTH_TOKEN]]
                  [--dataset_name DATASET_NAME]
                  [--dataset_config_name DATASET_CONFIG_NAME]
                  [--dataset_split DATASET_SPLIT]
                  [--text_column_name TEXT_COLUMN_NAME]
                  [--reference_column_name REFERENCE_COLUMN_NAME]
                  [--dataset_revision DATASET_REVISION]
                  [--streaming [STREAMING]]
                  [--overwrite_cache [OVERWRITE_CACHE]]
                  [--validation_split_percentage VALIDATION_SPLIT_PERCENTAGE]
                  [--preprocessing_num_workers PREPROCESSING_NUM_WORKERS]
                  [--generation_config_file GENERATION_CONFIG_FILE]
                  [--per_device_batch_size PER_DEVICE_BATCH_SIZE]
                  [--output_dir OUTPUT_DIR]
                  [--overwrite_output_dir [OVERWRITE_OUTPUT_DIR]]
                  [--id_column_name ID_COLUMN_NAME]
                  [--keep_columns KEEP_COLUMNS [KEEP_COLUMNS ...]]
                  [--seed SEED] [--max_new_tokens MAX_NEW_TOKENS]
                  [--max_window_size MAX_WINDOW_SIZE]
                  [--subsamples SUBSAMPLES] [--block_size BLOCK_SIZE]
                  [--use_brace_matching [USE_BRACE_MATCHING]]
                  [--brace_matching_start_level BRACE_MATCHING_START_LEVEL]

optional arguments:
  -h, --help            show this help message and exit
  --model_name_or_path MODEL_NAME_OR_PATH
                        The model checkpoint for weights initialization. Do
                        not set if you want to train a model from scratch.
                        (default: None)
  --model_type MODEL_TYPE
                        If training from scratch, pass a model type from the
                        list: bart, bert, bert-generation, big_bird,
                        bigbird_pegasus, biogpt, blenderbot,
                        blenderbot-small, bloom, camembert, llama, codegen,
                        cpmant, ctrl, data2vec-text, electra, ernie, falcon,
                        fuyu, git, gpt2, gpt2, gpt_bigcode, gpt_neo,
                        gpt_neox, gpt_neox_japanese, gptj, llama, marian,
                        mbart, mega, megatron-bert, mistral, mixtral, mpt,
                        musicgen, mvp, open-llama, openai-gpt, opt, pegasus,
                        persimmon, phi, plbart, prophetnet, qdqbert,
                        reformer, rembert, roberta, roberta-prelayernorm,
                        rocbert, roformer, rwkv, speech_to_text_2,
                        transfo-xl, trocr, whisper, xglm, xlm,
                        xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet,
                        xmod (default: None)
  --config_name CONFIG_NAME
                        Pretrained config name or path if not the same as
                        model_name (default: None)
  --tokenizer_name TOKENIZER_NAME
                        Pretrained tokenizer name or path if not the same as
                        model_name (default: None)
  --use_fast_tokenizer [USE_FAST_TOKENIZER]
                        Whether to use one of the fast tokenizer (backed by
                        the tokenizers library) or not. (default: True)
  --no_use_fast_tokenizer
                        Whether to use one of the fast tokenizer (backed by
                        the tokenizers library) or not. (default: False)
  --model_revision MODEL_REVISION
                        The specific model version to use (can be a branch
                        name, tag name or commit id). (default: main)
  --token TOKEN         The token to use as HTTP bearer authorization for
                        remote files. If not specified, will use the token
                        generated when running huggingface-cli login (stored
                        in ~/.huggingface). (default: None)
  --use_auth_token [USE_AUTH_TOKEN]
                        The use_auth_token argument is deprecated and will
                        be removed in v4.34. Please use token instead.
                        (default: None)
  --dataset_name DATASET_NAME
                        The name of the dataset to use (via the datasets
                        library). (default: None)
  --dataset_config_name DATASET_CONFIG_NAME
                        The configuration name of the dataset to use (via
                        the datasets library). (default: None)
  --dataset_split DATASET_SPLIT
                        The dataset split to use. (default: None)
  --text_column_name TEXT_COLUMN_NAME
                        The dataset column name to use. (default: None)
  --reference_column_name REFERENCE_COLUMN_NAME
                        The dataset column name to use as reference for the
                        target sequence. (default: None)
  --dataset_revision DATASET_REVISION
                        The specific dataset version to use (can be a branch
                        name, tag name or commit id). (default: main)
  --streaming [STREAMING]
                        Enable streaming mode (default: False)
  --overwrite_cache [OVERWRITE_CACHE]
                        Overwrite the cached training and evaluation sets
                        (default: False)
  --validation_split_percentage VALIDATION_SPLIT_PERCENTAGE
                        The percentage of the train set used as validation
                        set in case there is no validation split (default: 5)
  --preprocessing_num_workers PREPROCESSING_NUM_WORKERS
                        The number of processes to use for the
                        preprocessing. (default: None)
  --generation_config_file GENERATION_CONFIG_FILE
                        Generation config path if not the same as
                        model_name. (default: None)
  --per_device_batch_size PER_DEVICE_BATCH_SIZE
                        Batch size (per device) for generation. (default: 8)
  --output_dir OUTPUT_DIR
                        The output directory where the model predictions and
                        checkpoints will be written. (default: None)
  --overwrite_output_dir [OVERWRITE_OUTPUT_DIR]
                        Overwrite the content of the output directory. Use
                        this to continue training if output_dir points to a
                        checkpoint directory. (default: False)
  --id_column_name ID_COLUMN_NAME
                        The column name of the dataset to use as id. If not
                        provided, the index will be used. (default: None)
  --keep_columns KEEP_COLUMNS [KEEP_COLUMNS ...]
                        The column names of the dataset to keep separate by
                        commas. If not provided, all columns will be
                        removed. (default: None)
  --seed SEED           Seed for random number generation. (default: None)
  --max_new_tokens MAX_NEW_TOKENS
                        The maximum number of new tokens to generate.
                        (default: None)
  --max_window_size MAX_WINDOW_SIZE
                        The maximum number of tokens in the input.
                        (default: None)
  --subsamples SUBSAMPLES
                        The number of subsamples to use from each data
                        example. Randomly selected. None means use all.
                        (default: None)
  --block_size BLOCK_SIZE
                        Optional limit the model's max position embeddings.
                        (default: None)
  --use_brace_matching [USE_BRACE_MATCHING]
                        Whether to use brace matching as a stopping
                        criteria. (default: False)
  --brace_matching_start_level BRACE_MATCHING_START_LEVEL
                        The level of brace matching to start from.
                        (default: 0)
```

Complete generation

Complete generation is done by providing both an input data column and a reference data column. This will make the model use the whole prompt as input. By setting --max_new_tokens to auto, all the unused embedding space is used to generate as much as possible. The maximum number of position embeddings allowed can also be manually specified by setting the --block_size. The maximum number of tokens in the input can be truncated (from the left) by setting --max_window_size, thus allowing for a longer output (--max_new_tokens).
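The token budget described above can be sketched as follows. This is an illustration under stated assumptions, not the logic in run_gen.py: the function names `truncate_left` and `plan_complete_generation` are hypothetical, and the rule that `auto` means "block size minus the kept prompt length" is inferred from the description.

```python
def truncate_left(tokens, max_window_size=None):
    """Keep only the last `max_window_size` tokens (truncation from the left)."""
    if max_window_size is None or len(tokens) <= max_window_size:
        return list(tokens)
    return list(tokens[-max_window_size:])


def plan_complete_generation(prompt_len, block_size,
                             max_window_size=None, max_new_tokens="auto"):
    """Return (kept_prompt_tokens, new_tokens) for one example.

    Assumption: with "auto", every position not used by the prompt is
    available for generation; otherwise the request is capped by that space.
    """
    kept = prompt_len
    if max_window_size is not None:
        kept = min(kept, max_window_size)
    kept = min(kept, block_size)  # the prompt can never exceed the block size
    free = block_size - kept
    if max_new_tokens == "auto":
        return kept, free
    return kept, min(int(max_new_tokens), free)
```

For instance, with a block size of 2048 and a window size of 1024, a 1500-token prompt would be cut to its last 1024 tokens, leaving 1024 positions for the output.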

Example

The following example will generate samples from the test split of the methods2test_small dataset using the greedy decoding strategy. The output will be saved to the output directory.

    accelerate launch run_gen.py \
        --dataset_name andstor/methods2test_small \
        --dataset_config_name fm+fc+c+m+f+t+tc \
        --dataset_split test \
        --text_column_name source \
        --reference_column_name target \
        --model_name_or_path facebook/opt-125m \
        --per_device_batch_size 4 \
        --output_dir output \
        --overwrite_output_dir \
        --seed 42 \
        --preprocessing_num_workers 10 \
        --max_new_tokens auto

Early stopping

Early stopping can be done by using brace matching. This stops the generation when the number of open braces equals the number of closed braces. The level of brace matching to start from can be controlled by the --brace_matching_start_level argument. The following flags, added to a run_gen.py command, use brace matching as a stopping criterion and start from level 1.

    --use_brace_matching \
    --brace_matching_start_level 1
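The idea behind brace matching can be sketched on plain text as follows. This is a simplified illustration, not the stopping criterion implemented in run_gen.py: `find_brace_stop` is a hypothetical helper, it works on decoded text rather than token ids, and it does not skip braces inside string literals or comments.

```python
def find_brace_stop(text, start_level=0):
    """Return the index just past the brace that balances the text, or None.

    `start_level` is the brace depth already open when generation starts,
    e.g. 1 if the prompt ends inside a method body.
    """
    depth = start_level
    for i, ch in enumerate(text):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth <= 0:
                # Open and closed braces are now equal: stop here.
                return i + 1
    return None  # never balanced; generation would continue


generated = " return x; }\n}"
stop = find_brace_stop(generated, start_level=1)
print(repr(generated[:stop]))
```

With a start level of 1, the first closing brace already balances the block opened in the prompt, so everything after it is discarded.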

Strided generation

Strided generation is done by providing only an input data column. The input is split into parts, and each part is generated in a "sliding window" approach. Each window serves as the reference for the preceding window. The window stride (read size) is determined by the --max_window_size argument. The number of tokens to generate is controlled by --max_new_tokens. The number of subsamples to use from each data example can be controlled by --subsamples; if it is left unset (None), all subsamples are used. The maximum number of position embeddings allowed can also be specified manually with --block_size.
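The subsampling step can be sketched as a seeded random choice among window indices. This is an assumption about how the selection might work, not code from run_gen.py; `select_subsamples` is a hypothetical name.

```python
import random


def select_subsamples(num_windows, subsamples=None, seed=None):
    """Pick which window indices to generate for one data example.

    None (or a value at least as large as the number of windows) keeps
    all windows; otherwise `subsamples` indices are drawn at random.
    """
    indices = list(range(num_windows))
    if subsamples is None or subsamples >= num_windows:
        return indices
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return sorted(rng.sample(indices, subsamples))
```

Using a local `random.Random(seed)` keeps the selection reproducible for a given seed without affecting other random draws in the program.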

Filtering

The data is filtered by the following criteria:
- The input is at least max_new_tokens + max_window_size long

Window splitting

Given an input, it is first truncated by max_new_tokens. The result is then split into windows of up to max_window_size tokens. The minimum size of each window is l / math.ceil(l / max_size), where l is len(input) - max_new_tokens (max_size here corresponds to max_window_size). Each window is generated independently. The first window has an index of 0, the second has an index of 1, etc.
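The filtering and splitting rules can be sketched as below. This is an illustrative reading of the description, not the code in run_gen.py: `split_into_windows` is a hypothetical helper, and it assumes that truncation removes the last max_new_tokens tokens and that the windows are made as even as possible.

```python
import math


def split_into_windows(tokens, max_new_tokens, max_window_size):
    """Split one tokenized input into near-equal windows.

    Returns [] when the input is shorter than max_new_tokens + max_window_size
    (the filtering criterion).
    """
    if len(tokens) < max_new_tokens + max_window_size:
        return []
    usable = tokens[: len(tokens) - max_new_tokens]  # assumed: cut from the end
    l = len(usable)
    num_windows = math.ceil(l / max_window_size)
    base, extra = divmod(l, num_windows)  # the first `extra` windows get one more token
    windows, start = [], 0
    for index in range(num_windows):
        size = base + (1 if index < extra else 0)
        windows.append(usable[start:start + size])
        start += size
    return windows


windows = split_into_windows(list(range(1300)), max_new_tokens=100, max_window_size=512)
print([len(w) for w in windows])  # [400, 400, 400]
```

Here l is 1200, so three windows are needed and each holds 400 tokens, which is below the 512-token limit. The position of a window in the returned list is its index.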

Example

The following example will generate samples from the test split of The Pile dataset using the greedy decoding strategy. The input will be truncated to 512 tokens and the maximum number of new tokens will be 512. The output will be saved to the output directory.

    accelerate launch run_gen.py \
        --dataset_name andstor/the_pile_github \
        --dataset_config_name java \
        --dataset_split test \
        --text_column_name text \
        --model_name_or_path EleutherAI/gpt-j-6B \
        --generation_config_file generation_config.json \
        --per_device_batch_size 1 \
        --output_dir output \
        --seed 42 \
        --preprocessing_num_workers 10 \
        --max_window_size 512 \
        --max_new_tokens 512

The generation_config.json file contains the following configuration:

    {
        "do_sample": false,
        "max_new_tokens": 256,
        "bos_token_id": 50256,
        "eos_token_id": 50256
    }
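If the file is produced programmatically, a sketch using only the standard library could look like this. The helper names are hypothetical, and the `is_greedy` check reflects the general transformers convention that sampling disabled with a single beam means greedy decoding.

```python
import json

GENERATION_CONFIG = {
    "do_sample": False,
    "max_new_tokens": 256,
    "bos_token_id": 50256,
    "eos_token_id": 50256,
}


def write_generation_config(path, config):
    """Write the generation settings as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)


def read_generation_config(path):
    """Load generation settings back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def is_greedy(config):
    """Greedy decoding: no sampling and a single beam (num_beams defaults to 1)."""
    return not config.get("do_sample", False) and config.get("num_beams", 1) == 1
```

Calling `write_generation_config("generation_config.json", GENERATION_CONFIG)` would produce a file equivalent to the one shown above.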

Generation with third-party models (API)

Several third-party models such as ChatGPT have been used for generating data. Scripts for generating data with these models can be found in the notebooks directory. Note that access to most of these requires a paid subscription. Furthermore, most are closed-source models and might not be reproducible.

License

Copyright © André Storhaug

This repository is licensed under the MIT License.

Owner

  • Name: André Storhaug
  • Login: andstor
  • Kind: user
  • Location: Trondheim 🇳🇴
  • Company: NTNU

🎓 CS PhD student @ Norwegian University of Science and Technology (NTNU)

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

requirements.txt pypi
  • accelerate *
  • datasets *
  • deepspeed *
  • torch *
  • transformers *