textgen

TextGen: Implementation of Text Generation models, include LLaMA, BLOOM, GPT2, BART, T5, SongNet and so on. 文本生成模型，实现了包括LLaMA，ChatGLM，BLOOM，GPT2，Seq2Seq，BART，T5，UDA等模型的训练和预测，开箱即用。

https://github.com/shibing624/textgen

Keywords

bart bert chatglm chatgpt gpt2 llama seq2seq t5 text-generation textgen xlnet

Keywords from Contributors

transformer

Last synced: 6 months ago · JSON representation ·

Repository

TextGen: Implementation of Text Generation models, include LLaMA, BLOOM, GPT2, BART, T5, SongNet and so on. 文本生成模型，实现了包括LLaMA，ChatGLM，BLOOM，GPT2，Seq2Seq，BART，T5，UDA等模型的训练和预测，开箱即用。

Basic Info

Host: GitHub
Owner: shibing624
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 10.2 MB

Statistics

Stars: 967
Watchers: 9
Forks: 108
Open Issues: 21
Releases: 12

Topics

bart bert chatglm chatgpt gpt2 llama seq2seq t5 text-generation textgen xlnet

Created almost 5 years ago · Last pushed over 1 year ago

Metadata Files

Readme Contributing License Citation

README.md

🇨🇳中文 | 🌐English | 📖文档/Docs | 🤖模型/Models

TextGen: Implementation of Text Generation models

📖 Introduction

TextGen实现了多种文本生成模型，包括：LLaMA、ChatGLM、UDA、GPT2、Seq2Seq、BART、T5、SongNet等模型，开箱即用。

🔥 News

[2023/11/02] v1.1.2版本: GPT模型支持了NEFTune给embedding加噪SFT训练方法，SFT中使用 --neft_alpha 参数启用 NEFTune，例如 --neft_alpha 5。详见Release-v1.1.2

[2023/09/05] v1.1.1版本: 支持多卡推理，推理速度加倍，调库textgen做batch推理，多卡推理更方便、快速。详见Release-v1.1.1

[2023/08/23] v1.1.0版本: 发布基于ShareGPT4数据集微调的中英文Vicuna-13B模型shibing624/vicuna-baichuan-13b-chat，和对应的LoRA模型shibing624/vicuna-baichuan-13b-chat-lora，支持多轮对话，评测效果有提升，详见Release-v1.1.0

[2023/08/02] v1.0.2版本: 新增支持ChatGLM2和LLaMA2模型的SFT微调训练，详见Release-v1.0.2

[2023/06/15] v1.0.0版本: 新增ChatGLM/LLaMA/Bloom模型的多轮对话微调训练，并发布医疗问诊LoRA模型shibing624/ziya-llama-13b-medical-lora。详见Release-v1.0.0

[2023/06/02] v0.2.7版本: 新增ChatGLM/LLaMA/Bloom模型的SFT微调训练，并发布适用于通用对话和中文纠错的LoRA模型。详见Release-v0.2.7

😊 Feature

GPT：本项目基于PyTorch实现了 ChatGLM-6B 1,2,3 / Baichuan 1,2 / LLaMA 1,2 / BLOOM / Mistral / QWen 等GPT模型LoRA微调训练和预测，可以用于对话生成任务和领域微调训练
UDA/EDA：本项目实现了UDA(非核心词替换)、EDA和Back Translation(回译)算法，基于TF-IDF将句子中部分不重要词替换为同义词，随机词插入、删除、替换等方法，产生新的文本，实现了文本扩增
Seq2Seq：本项目基于PyTorch实现了Seq2Seq、ConvSeq2Seq、BART模型的训练和预测，可以用于文本翻译、对话生成、摘要生成等文本生成任务
T5：本项目基于PyTorch实现了T5和CopyT5模型训练和预测，可以用于文本翻译、对话生成、对联生成、文案撰写等文本生成任务
GPT2：本项目基于PyTorch实现了GTP2模型训练和预测，可以用于文章生成、对联生成等文本生成任务
SongNet：本项目基于PyTorch实现了SongNet模型训练和预测，可以用于规范格式的诗词、歌词等文本生成任务
TGLS：本项目实现了TGLS无监督相似文本生成模型，是一种“先搜索后学习”的文本生成方法，通过反复迭代学习候选集，最终模型能生成类似候选集的高质量相似文本

Release Models

release基于textgen训练的中文模型，模型已经release到HuggingFace models，指定模型名称textgen会自动下载模型，可直接使用。

| Model | Arch | Introduction | Train Script | Predict Script | |:----------------------------------------------------------------------------------------------------------|:-------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------------------------------------------------| | shibing624/t5-chinese-couplet | T5 | fine-tuned中文对联后的模型 | 对联生成模型调研 | predict script | | shibing624/songnet-base-chinese-songci | SongNet | fine-tuned宋词后的模型 | training script | predict script | | shibing624/songnet-base-chinese-couplet | SongNet | fine-tuned对联后的模型 | training script | predict script | | shibing624/chatglm-6b-csc-zh-lora | ChatGLM-6B | 在27万中文拼写纠错数据shibing624/CSC上微调了一版ChatGLM-6B，纠错效果有提升，发布微调后的LoRA权重 | training script | predict script | | shibing624/chatglm-6b-belle-zh-lora | ChatGLM-6B | 在100万条中文ChatGPT指令Belle数据集BelleGroup/train1MCN上微调了一版ChatGLM-6B，问答效果有提升，发布微调后的LoRA权重 | training script | predict script | | shibing624/llama-13b-belle-zh-lora | LLaMA-13B | 在100万条中文ChatGPT指令Belle数据集BelleGroup/train1MCN上微调了一版Llama-13B，问答效果有提升，发布微调后的LoRA权重 | training script | predict script | | shibing624/chinese-alpaca-plus-7b-hf | LLaMA-7B | 中文LLaMA-Plus, Alpaca-Plus 7B版本，在LLaMA-7B上扩充了中文词表并继续预训练120G文本（通用领域），在4M指令数据集上微调后得到的中文Alpaca-plus模型 | training script | predict script | | shibing624/chinese-alpaca-plus-13b-hf | LLaMA-13B | 中文LLaMA-Plus, Alpaca-Plus 13B版本，在LLaMA-13B上扩充了中文词表并继续预训练120G文本（通用领域），在4.3M指令数据集上微调后得到的中文Alpaca-plus模型 | training script | predict script | | shibing624/ziya-llama-13b-medical-lora | LLaMA-13B | 在240万条中英文医疗数据集shibing624/medical上微调了一版Ziya-LLaMA-13B模型，医疗问答效果有提升，发布微调后的LoRA权重 | training script | predict script | | shibing624/vicuna-baichuan-13b-chat | Baichuan-13B-Chat | 在10万条多语言ShareGPT GPT4多轮对话数据集shibing624/sharegpt_gpt4上SFT微调了一版baichuan-13b-chat多轮问答模型，日常问答和医疗问答效果有提升，发布微调后的完整模型权重 | training script | predict script |

Evaluation

| Model | Arch | Introduction | Score | |:--------------------------------------------------------------------------------------------------------------------------------------------|:-----------|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------| | LLaMA-7B-Chinese-Alpaca | LLaMA-7B | 复用ymcui/Chinese-LLaMA-Alpaca的评估case和得分 | 4.92 | | LLaMA-13B-Chinese-Alpaca | LLaMA-13B | 复用ymcui/Chinese-LLaMA-Alpaca的评估case和得分 | 7.05 | | ChatGLM-6B | ChatGLM-6B | 基于原生THUDM/chatglm-6b评估测试集得分 | 7.16 | | ChatGLM-6B-v1.1 | ChatGLM-6B | 基于原生THUDM/chatglm-6bv1.1英文优化版模型评估测试集得分 | 7.18 | | shibing624/chatglm-6b-belle-zh-lora | ChatGLM-6B | 基于THUDM/chatglm-6b加载shibing624/chatglm-6b-belle-zh-loraLoRA模型后评估测试集得分 | 7.03 | | facat/alpaca-lora-cn-13b | LLaMA-13B | 基于decapoda-research/llama-13b-hf加载facat/alpaca-lora-cn-13bLoRA模型后评估测试集并标注得分 | 4.13 |
| Chinese-Vicuna/Chinese-Vicuna-lora-13b-belle-and-guanaco | LLaMA-13B | 基于decapoda-research/llama-13b-hf加载Chinese-Vicuna/Chinese-Vicuna-lora-13b-belle-and-guanacoLoRA模型后评估测试集并标注得分 | 3.98 | | shibing624/chinese-alpaca-plus-7b-hf | LLaMA-7B | 使用ymcui/Chinese-LLaMA-Alpaca 合并模型方法合并HF权重后，评估测试集并标注得分 | 6.93 | | shibing624/chinese-alpaca-plus-13b-hf | LLaMA-13B | 使用ymcui/Chinese-LLaMA-Alpaca 合并模型方法合并HF权重后，评估测试集并标注得分 | 7.07 | | TheBloke/vicuna-13B-1.1-HF | LLaMA-13B | 使用原生vicuna-13B-1.1合并后的模型，评估测试集并标注得分 | 5.13 | | IDEA-CCNL/Ziya-LLaMA-13B-v1 | LLaMA-13B | 使用姜子牙通用大模型V1，评估测试集并标注得分 | 6.63 |

说明： - 评估case，详见在线文档：中文LLM-benchmark多任务评估集(腾讯文档) https://docs.qq.com/sheet/DUUpsREtWbFBsUVJE?tab=r7io7g 感谢韩俊明、杨家铭等同学的标注 - 评估任务类型包括：知识问答，开放式问答，数值计算，诗词、音乐、体育，娱乐，写文章，文本翻译，代码编程，伦理、拒答类，多轮问答，Score 评分是前100条（10分制）的平均分数，人工打分，越高越好 - 评估数量少，任务类型不够全面，评分之间的大小关系有一些参考价值，分数的绝对值没太大参考价值 - 评估脚本：tests/test_benchmark.py ，使用fp16预测，无int量化处理，运行脚本可复现评估结果，但生成结果具有随机性，受解码超参、随机种子等因素影响。评测并非绝对严谨，测试结果仅供晾晒参考 - 结论：ChatGLM-6B、LLaMA-13B的中文衍生模型（包括alpaca-plus, vicuna, ziya）的表现属于第一梯队，原版LLaMA-7B的表现整体稍差些 - LLaMA-13B-Chinese-Alpaca是在原版LLaMA上扩充了中文词表，并融入了约20G的通用中文语料后的指令微调模型，表明了LLaMA的底座优秀，具有强大的语言迁移能力 - ChatGLM这种原生的中文预训练模型更理解中文语义，且在中文知识问答、开放式问答得分高 - LLaMA系列模型数值计算、中英翻译、代码编程类得分高 - 经过中文预训练和SFT微调后的Chinese-LLaMA模型在中文诗词、娱乐、伦理类得分相较原版LLaMA有提升

🚀 Demo

HuggingFace Demo: https://huggingface.co/spaces/shibing624/chinese-couplet-generate

run example: examples/T5/gradio_demo.py to see the demo:

shell python examples/T5/gradio_demo.py

model trained by examples/t5/T5FinetuneChinese_Couplet.ipynb

💾 Install

shell pip install -U textgen

or

install develop version: shell pip install torch # conda install pytorch git clone https://github.com/shibing624/textgen.git cd textgen python setup.py install

▶️ Usage

ChatGLM-6B 模型

使用 ChatGLM-6B 微调后的模型

example: examples/gpt/inference_demo.py

```python from textgen import GptModel

model = GptModel("chatglm", "THUDM/chatglm-6b", peft_name="shibing624/chatglm-6b-csc-zh-lora") r = model.predict(["介绍下北京"]) print(r) # ['北京是中国的首都...'] ```

训练 ChatGLM-6B 微调模型

支持自定义训练数据集和训练参数，数据集格式参考examples/data/sharegptzh100_format.jsonl
支持QLoRA、AdaLoRA、LoRA、PTuning、PrefixTuning等部分参数微调方法，也支持全参微调
支持多卡训练，支持混合精度训练
支持多卡推理

example: examples/gpt/trainingchatglmdemo.py

单卡训练： shell cd examples/gpt CUDA_VISIBLE_DEVICES=0 python training_chatglm_demo.py --do_train --do_predict --num_epochs 1 --output_dir outputs_chatglm_v1

多卡训练： shell cd examples/gpt CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 training_chatglm_demo.py --do_train --do_predict --num_epochs 20 --output_dir outputs_chatglm_v1

多卡推理： shell cd examples/gpt CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 inference_multigpu_demo.py --model_type chatglm --base_model THUDM/chatglm-6b

LLaMA 模型

使用 LLaMA 微调后的模型

example: examples/gpt/inference_demo.py

show code example and result

```python import sys sys.path.append('../..') from textgen import GptModel model = GptModel("llama", "decapoda-research/llama-7b-hf", peft_name="ziqingyang/chinese-alpaca-lora-7b") r = model.predict(["用一句话描述地球为什么是独一无二的。"]) print(r) # ['地球是唯一一颗拥有生命的行星。'] ```

训练 LLaMA 微调模型

支持自定义训练数据集和训练参数，数据集格式参考examples/data/sharegptzh100_format.jsonl
支持QLoRA、AdaLoRA、LoRA、PTuning、PrefixTuning等部分参数微调方法，也支持全参微调
支持多卡训练，支持混合精度训练，使用方法同上（ChatGLM多卡训练）
支持多卡推理

example: examples/gpt/trainingllamademo.py

基于微调(LoRA)模型继续训练

如果需要基于Lora模型继续训练，可以使用下面的脚本合并模型为新的base model，再微调训练即可。

执行以下命令： shell python -m textgen/gpt/merge_peft_adapter \ --model_type llama \ --base_model_name_or_path path/to/llama/model \ --tokenizer_path path/to/llama/tokenizer \ --peft_model_path path/to/lora/model \ --output_dir merged 参数说明： --model_type：模型类型，目前支持bloom,llama,baichuan和chatglm --base_model_name_or_path：存放HF格式的底座模型权重和配置文件的目录 --tokenizer_path：存放HF格式的底座模型tokenizer文件的目录 --peft_model_path：中文LLaMA/Alpaca LoRA解压后文件所在目录，也可使用HF上的Lora模型名称，如`ziqingyang/chinese-alpaca-lora-7b`会自动下载对应模型 --output_dir：指定保存全量模型权重的目录，默认为./merged

训练领域模型

Note: 为了全面的介绍训练医疗大模型的过程，把4阶段训练方法(Pretraining, Supervised Finetuning, Reward Modeling and Reinforcement Learning)单独新建了一个repo：shibing624/MedicalGPT，请移步该repo查看训练方法。

ConvSeq2Seq 模型

训练并预测ConvSeq2Seq模型：

example: examples/seq2sesq/trainingconvseq2seqmodel_demo.py

show code example and result

```python import argparse from loguru import logger import sys sys.path.append('../..') from textgen.seq2seq.conv_seq2seq_model import ConvSeq2SeqModel def main(): parser = argparse.ArgumentParser() parser.add_argument('--train_file', default='../data/zh_dialog.tsv', type=str, help='Training data file') parser.add_argument('--do_train', action='store_true', help='Whether to run training.') parser.add_argument('--do_predict', action='store_true', help='Whether to run predict.') parser.add_argument('--output_dir', default='./outputs/convseq2seq_zh/', type=str, help='Model output directory') parser.add_argument('--max_seq_length', default=50, type=int, help='Max sequence length') parser.add_argument('--num_epochs', default=200, type=int, help='Number of training epochs') parser.add_argument('--batch_size', default=32, type=int, help='Batch size') args = parser.parse_args() logger.info(args) if args.do_train: logger.info('Loading data...') model = ConvSeq2SeqModel(epochs=args.num_epochs, batch_size=args.batch_size, model_dir=args.output_dir, max_length=args.max_seq_length) model.train_model(args.train_file) print(model.eval_model(args.train_file)) if args.do_predict: model = ConvSeq2SeqModel(epochs=args.num_epochs, batch_size=args.batch_size, model_dir=args.output_dir, max_length=args.max_seq_length) sentences = ["什么是ai", "你是什么类型的计算机", "你知道热力学吗"] print("inputs:", sentences) print('outputs:', model.predict(sentences)) if __name__ == '__main__': main() ``` output: ```bash inputs: ["什么是ai", "你是什么类型的计算机", "你知道热力学吗"] outputs: ['人工智能是工程和科学的分支,致力于构建思维的机器。', '我的程序运行在python,所以我在任何运脑上工作！', '我不能错热是一个疯狂的人工智能"200年。'] ```

BART 模型

训练并预测BART模型：

example: examples/seq2sesq/trainingbartseq2seqzh_demo.py

output:

shell inputs: ['什么是ai', '你是什么类型的计算机', '你知道热力学吗'] outputs: ['人工智能是工程和科学的分支,致力于构', '我的程序运行在python,所以我在任何电脑上', '什么是热力学吗？']

T5 模型

example: examples/t5/trainingzht5modeldemo.py

show code example and result

```python import argparse from loguru import logger import pandas as pd import sys sys.path.append('../..') from textgen.t5 import T5Model def load_data(file_path): data = [] with open(file_path, 'r', encoding='utf-8') as f: for line in f: line = line.strip('\n') terms = line.split('\t') if len(terms) == 2: data.append(['QA', terms[0], terms[1]]) else: logger.warning(f'line error: {line}') return data def main(): parser = argparse.ArgumentParser() parser.add_argument('--train_file', default='../data/zh_dialog.tsv', type=str, help='Training data file') parser.add_argument('--model_type', default='t5', type=str, help='Transformers model type') parser.add_argument('--model_name', default='Langboat/mengzi-t5-base', type=str, help='Transformers model or path') parser.add_argument('--do_train', action='store_true', help='Whether to run training.') parser.add_argument('--do_predict', action='store_true', help='Whether to run predict.') parser.add_argument('--output_dir', default='./outputs/mengzi_t5_zh/', type=str, help='Model output directory') parser.add_argument('--max_seq_length', default=50, type=int, help='Max sequence length') parser.add_argument('--num_epochs', default=10, type=int, help='Number of training epochs') parser.add_argument('--batch_size', default=32, type=int, help='Batch size') args = parser.parse_args() logger.info(args) if args.do_train: logger.info('Loading data...') # train_data: Pandas DataFrame containing the 3 columns - `prefix`, `input_text`, `target_text`. # - `prefix`: A string indicating the task to perform. (E.g. `"question"`, `"stsb"`) # - `input_text`: The input text. `prefix` is prepended to form the full input. (: ) # - `target_text`: The target sequence train_data = load_data(args.train_file) logger.debug('train_data: {}'.format(train_data[:10])) train_df = pd.DataFrame(train_data, columns=["prefix", "input_text", "target_text"]) eval_data = load_data(args.train_file)[:10] eval_df = pd.DataFrame(eval_data, columns=["prefix", "input_text", "target_text"]) model_args = { "reprocess_input_data": True, "overwrite_output_dir": True, "max_seq_length": args.max_seq_length, "train_batch_size": args.batch_size, "num_train_epochs": args.num_epochs, "save_eval_checkpoints": False, "save_model_every_epoch": False, "evaluate_generated_text": True, "evaluate_during_training": True, "evaluate_during_training_verbose": True, "use_multiprocessing": True, "save_best_model": True, "output_dir": args.output_dir, "use_early_stopping": True, } # model_type: t5 model_name: Langboat/mengzi-t5-base model = T5Model(args.model_type, args.model_name, args=model_args) def count_matches(labels, preds): logger.debug(f"labels: {labels[:10]}") logger.debug(f"preds: {preds[:10]}") match = sum([1 if label == pred else 0 for label, pred in zip(labels, preds)]) logger.debug(f"match: {match}") return match model.train_model(train_df, eval_data=eval_df, matches=count_matches) print(model.eval_model(eval_df, matches=count_matches)) if args.do_predict: model = T5Model(args.model_type, args.output_dir) sentences = ["什么是ai", "你是什么类型的计算机", "你知道热力学吗"] print("inputs:", sentences) print("outputs:", model.predict(sentences)) if __name__ == '__main__': main() ``` output: ```shell inputs: ['什么是ai', '你是什么类型的计算机', '你知道热力学吗'] outputs: ['人工智能有两个广义的定义,任何拟人的机械,如在卡雷尔capeks', '我的程序运行在Python,所以我在任何电脑上工作!', '什么是热力学'] ```

GPT2 模型

中文GPT2 - 文章生成

使用中文数据集（段落格式，\n间隔），训练GPT2模型，可以用于诗歌生成、文章生成等任务。

example: examples/gpt2/trainingzhgpt2_demo.py

中文GPT2 - 对联生成

使用中文对联数据集（tsv格式，\t间隔），自定义数据集读取Dataset，训练GPT2模型，可以用于对联生成、对话生成等任务。

example: examples/gpt2/trainingcoupletgpt2_demo.py

GPT2 vs T5：

都是从Transformer改进来的，T5同时有编码器和解码器，GPT2只有解码器
T5的模型优势是处理给定输入，产出对应输出的任务，如翻译、对话、问答等
GPT2的模型优势是自由创作，如写一篇短文
T5的对联生成效果好于GPT2、GPT2的诗词生成效果好于T5

SongNet 模型

格式控制的文本生成模型，paper见SongNet: Rigid Formats Controlled Text Generation，适用于强韵律格式要求的诗歌、对联、歌词生成等任务。

example: examples/songnet/trainingzhsongnet_demo.py

Keyword Text Augmentation(EDA/UDA)

example: examples/textaugmentation/textaugmentation_demo.py

show code example and result

```python import sys sys.path.append('..') from textgen.augment import TextAugment if __name__ == '__main__': docs = ['主要研究机器学习、深度学习、计算机视觉、智能对话系统相关内容', '晚上肚子好难受', '你会武功吗，我不会', '组装标题质量受限于广告主自提物料的片段质量，且表达丰富度有限', ] m = TextAugment(sentence_list=docs) a = docs[0] print(a) b = m.augment(a, aug_ops='random-0.2') print('random-0.2:', b) b = m.augment(a, aug_ops='insert-0.2') print('insert-0.2:', b) b = m.augment(a, aug_ops='delete-0.2') print('delete-0.2:', b) b = m.augment(a, aug_ops='tfidf-0.2') print('tfidf-0.2:', b) b = m.augment(a, aug_ops='mix-0.2') print('mix-0.2:', b) ``` output: ```bash 主要研究机器学习、深度学习、计算机视觉、智能对话系统相关内容 random-0.2: ('主要陪陪机器学习、深度学习主要计算机视觉、智能对话系统受限于内容', [('研究', '陪陪', 2, 4), ('、', '主要', 13, 15), ('相关', '受限于', 27, 30)]) insert-0.2: ('主要研究机器机器学习学习、深度深度学习、计算机视觉、智能对话系统相关内容', [('机器', '机器机器', 4, 8), ('学习', '学习学习', 8, 12), ('深度', '深度深度', 13, 17)]) delete-0.2: ('主要研究机器学习、深度学习、计算机视觉、对话系统相关内容', [('智能', '', 20, 20)]) tfidf-0.2: ('一是研究机器学习、深度学习、计算机听觉、智能交谈系统密切相关内容', [('主要', '一是', 0, 2), ('视觉', '听觉', 17, 19), ('对话', '交谈', 22, 24), ('相关', '密切相关', 26, 30)]) mix-0.2: ('主要研究机器学习、深度学、计算机听觉、智能对话软件系统相关内容', [('学习', '学', 11, 12), ('视觉', '听觉', 16, 18), ('系统', '软件系统', 23, 27)]) ```

TGLS 模型（无监督相似文本生成模型）

无监督的中文电商评论生成：从电商评论中提取用户表达观点的短句并进行组合来生成仿真评论。

example: examples/unsupgeneration/unsupgeneration_demo.py

show code example and result

```python import os import sys sys.path.append('..') from textgen.unsup_generation import TglsModel, load_list pwd_path = os.path.abspath(os.path.dirname(__file__)) samples = load_list(os.path.join(pwd_path, './data/ecommerce_comments.txt')) docs_text = [ ["挺好的，速度很快，也很实惠，不知效果如何", "产品没得说，买了以后就降价，心情不美丽。", "刚收到，包装很完整，不错", "发货速度很快，物流也不错，同一时间买的两个东东，一个先到一个还在路上。这个水水很喜欢，不过盖子真的开了。盖不牢了现在。", "包装的很好，是正品", "被种草兰蔻粉水三百元一大瓶囤货，希望是正品好用，收到的时候用保鲜膜包裹得严严实实，只敢买考拉自营的护肤品", ], ['很温和，清洗的也很干净，不油腻，很不错，会考虑回购，第一次考拉买护肤品，满意', '这款卸妆油我会无限回购的。即使我是油痘皮，也不会闷痘，同时在脸部按摩时，还能解决白头的脂肪粒的问题。用清水洗完脸后，非常的清爽。', '自从用了fancl之后就不用其他卸妆了，卸的舒服又干净', '买贵了，大润发才卖79。9。', ], samples ] m = TglsModel(docs_text) r = m.generate(samples[:500]) print('size:', len(r)) for review in r: print('\t' + review) ``` output: [美迪惠尔 N.M.F针剂水库保湿面膜](https://goods.kaola.com/product/2227311.html)有如下的20句评论，其中有10句是真实用户评论，10句是生成的评论，能看出来么?😂 ``` 还不错还不错还不错还不错。东西到了，不知道好不好用。试用过后再来评价。到时看网评都还可以。哺乳期唯一使用的护肤品，每天都是素颜，脸面全靠面膜吊着😄补水💦不粘腻一如既往的支持，喜欢💕 搞活动时买的面膜，不知道这个面膜是真是假敷在脸上面膜纸都有小水泡鼓起来。很不错，非常补水，用过的都知道，性价比之王，好用又不贵，正品，用着放心，物流也很快。面膜非常好用哦。面膜薄薄的。好像是蚕丝面膜啊。精华很多呢。敷在脸上很舒服。感觉挺保湿的，味道也挺好闻的。就是里面只有单纯的面膜直接敷脸上有点不好弄，哈哈哈还可以保湿效果不错水润润的每天贴一片脸也不干了用完了在买点，不错还会继续回购的。快递很快，东西很赞！想要得点考拉豆不容易，还要三十个字。时间宝贵，废话不说！用过了就知道了挺好用的，朋友推荐来的挺好用的，淡淡的，虽然不是很浓精华的感觉，但是效果也蛮好的。划算不得不说美迪惠尔的面膜是我用过的最好的面膜之一😎补水效果非常好，没想到这么便宜的价格竟真的能买到真品。保湿效果挺好的，面膜很好用。期待好的产品。一打开包装里面的精华刚刚好，用了补水补水效果不错，物流非常快。皮肤很光滑😇比上去速度快三天就到了。前两天皮肤干燥连续敷了两个晚上感觉还不错😂补水效果明显！可想而知精华液又多充足😍敷上以后凉凉的很舒服。补水效果一般吧～但是我用的韩国背回来的面膜纸不算薄，希望好用会回购的，敷上脸感觉比较清爽～价格还不便宜。希望好用，面膜用过了很好用，皮肤水嫩光滑白皙，补水不错，价格也合适。就是精华液太少了，保湿效果不错。面膜的补水效果非常好，保湿效果确实很赞，这个面膜相对于胶原蛋白和美白的那两款的面膜纸要厚一些，看着价格合适。 ``` 前10句是真实用户评论，后10句是生成的。

📚 Dataset

SFT datasets

50万条中文ChatGPT指令Belle数据集：BelleGroup/train0.5MCN
100万条中文ChatGPT指令Belle数据集：BelleGroup/train1MCN
5万条英文ChatGPT指令Alpaca数据集：50k English Stanford Alpaca dataset
2万条中文ChatGPT指令Alpaca数据集：shibing624/alpaca-zh
69万条中文指令Guanaco数据集(Belle50万条+Guanaco19万条)：Chinese-Vicuna/guanacobellemerge_v1.0
240万条中文医疗数据集(包括预训练数据和指令微调数据集)：shibing624/medical
5万条英文ChatGPT多轮对话数据集：RyokoAI/ShareGPT52K
80万条中文ChatGPT多轮对话数据集：BelleGroup/multiturnchat0.8M
116万条中文ChatGPT多轮对话数据集：fnlp/moss-002-sft-data

Reward Model datasets

原版的oasst1数据集：OpenAssistant/oasst1
2万条多语言oasst1的reward数据集：tasksource/oasst1pairwiserlhf_reward
11万条英文hh-rlhf的reward数据集：Dahoas/full-hh-rlhf
9万条英文reward数据集(来自Anthropic's Helpful Harmless dataset)：Dahoas/static-hh
7万条英文reward数据集（来源同上）：Dahoas/rm-static
7万条繁体中文的reward数据集（翻译自rm-static）liswei/rm-static-m2m100-zh
7万条英文Reward数据集：yitingxie/rlhf-reward-datasets
3千条中文知乎问答偏好数据集：liyucheng/zhihurlhf3k

✅ Todo

[x] add multiple rounds of dialogue data fine-tuning method
[x] add reward model finetuning, go to shibing624/MeidcalGPT
[x] add rl finetuning, go to shibing624/MeidcalGPT
[x] add medical reward dataset
[x] add llama in4 training, go to shibing624/MeidcalGPT
[ ] add all training and predict demo in colab

☎️ Contact

Issue(建议) ：
邮件我：xuming: xuming624@qq.com
微信我：加我微信号：xuming624, 备注：姓名-公司名-NLP 进NLP交流群。

😇 Citation

如果你在研究中使用了textgen，请按如下格式引用：

latex @misc{textgen, title={textgen: Text Generation Tool}, author={Ming Xu}, year={2021}, howpublished={\url{https://github.com/shibing624/textgen}}, }

🤗 License

This repository is licensed under The Apache License 2.0.

Please follow the Model Card to use the LLaMA model.

Please follow the RAIL License to use the BLOOM & BLOOMZ model.

😍 Contribute

项目代码还很粗糙，如果大家对代码有所改进，欢迎提交回本项目，在提交之前，注意以下两点：

在tests添加相应的单元测试
使用python -m pytest来运行所有单元测试，确保所有单测都是通过的

之后即可提交PR。

💕 Acknowledgements

Thanks for their great work!

Owner

Name: xuming
Login: shibing624
Kind: user
Location: Beijing, China
Company: @tencent

Website: https://blog.csdn.net/mingzai624
Repositories: 32
Profile: https://github.com/shibing624

Senior Researcher, Machine Learning Developer, Advertising Risk Control.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Xu"
  given-names: "Ming"
title: "Textgen: Text generation toolkit"
url: "https://github.com/shibing624/textgen"
data-released: 2022-02-22
version: 0.0.4

GitHub Events

Total

Watch event: 38
Issue comment event: 2
Fork event: 5

Last Year

Watch event: 38
Issue comment event: 2
Fork event: 5

Committers

Last synced: 6 months ago

All Time

Total Commits: 563
Total Committers: 5
Avg Commits per committer: 112.6
Development Distribution Score (DDS): 0.023

Past Year

Commits: 2
Committers: 1
Avg Commits per committer: 2.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
shibing624	s**4@1**m	550
flemingxu	f**u@t**m	10
Willard Sheen	w**n@b**n	1
xinranwwang	x**g@t**m	1
hy.li	h**i@h**n	1

Committer Domains (Top 20 + Academic)

tencent.com: 2 hcr.com.cn: 1 buaa.edu.cn: 1 126.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 54
Total pull requests: 5
Average time to close issues: about 2 months
Average time to close pull requests: about 5 hours
Total issue authors: 40
Total pull request authors: 4
Average comments per issue: 4.35
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: 4 days
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 0
Average comments per issue: 4.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

MonkeyTB (4)
hangzeli08 (4)
PolarisRisingWar (3)
alexhmyang (2)
gg22mm (2)
huyi1989 (2)
bash99 (2)
hongyix (2)
Ocyss (1)
ImXunan (1)
LMXKO (1)
YunweiDai (1)
feng-1985 (1)
xiaojunjun65 (1)
yeyuan0620 (1)

Pull Request Authors

manutd12 (2)
xingener (1)
wiserxin (1)
alitrack (1)

Top Labels

Issue Labels

question (29) wontfix (21) bug (10) enhancement (8)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 72 last-month

Total dependent packages: 0
Total dependent repositories: 3
Total versions: 23
Total maintainers: 1

pypi.org: textgen

Text Generation Model

Homepage: https://github.com/shibing624/textgen
Documentation: https://textgen.readthedocs.io/
License: Apache 2.0
Latest release: 1.1.1
published over 2 years ago

Versions: 23
Dependent Packages: 0
Dependent Repositories: 3
Downloads: 72 Last month

Rankings

Stargazers count: 2.3%

Forks count: 4.7%

Average: 7.9%

Dependent repos count: 9.0%

Dependent packages count: 10.1%

Downloads: 13.6%

Maintainers (1)

shibing624

Last synced: 6 months ago

textgen

Science Score: 64.0%

Keywords

Keywords from Contributors

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

TextGen: Implementation of Text Generation models

📖 Introduction

🔥 News

😊 Feature

Release Models

Evaluation

🚀 Demo

💾 Install

▶️ Usage

ChatGLM-6B 模型

使用 ChatGLM-6B 微调后的模型

训练 ChatGLM-6B 微调模型

LLaMA 模型

使用 LLaMA 微调后的模型

训练 LLaMA 微调模型

基于微调(LoRA)模型继续训练

训练领域模型

ConvSeq2Seq 模型

BART 模型

T5 模型

GPT2 模型

中文GPT2 - 文章生成

中文GPT2 - 对联生成

SongNet 模型

Keyword Text Augmentation(EDA/UDA)

TGLS 模型（无监督相似文本生成模型）

📚 Dataset

SFT datasets

Reward Model datasets

✅ Todo

☎️ Contact

😇 Citation

🤗 License

😍 Contribute

💕 Acknowledgements

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

pypi.org: textgen

Rankings

Maintainers (1)

Dependencies