pp-inscaptagger
Instance Capability Tagger (InsCapTagger) is a multimodal data capability tagging model for analyzing and processing image-text data (e.g., information-density-based data filtering and model-capability-based data mixing). 🔥 🔥 🔥
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.3%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 8
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
PP-InsCapTagger
Update
- [2024.10.10] Code and model are available in this repository.
TODO
- [x] Release code and model
- [ ] English version README
- [ ] Upload arXiv paper
Citation
If you use PP-InsCapTagger in your work, please consider citing the following BibTeX entry:
```bibtex
@misc{ppinscaptagger2024,
    title={Instance Capability Tagger: Enhancing Multimodal Data Efficiency for Model Training},
    author={Lv Wenyu and Huang Kui and Zhao Yian},
    howpublished={\url{https://github.com/lyuwenyu/PP-InsCapTagger}},
    year={2024}
}
```
Overview
PP-InsCapTagger (Instance Capability Tagger) is a data capability tagging model built by DataCopilot on top of PaddleMIX. It tags the instance capabilities of multimodal data samples; optimizing a dataset according to its instance-capability distribution can improve model training efficiency, and the tags offer an efficient approach to dataset analysis and evaluation. Using the model's tagging results to optimize the LLaVA SFT dataset improves LLaVA's SFT-stage training efficiency by 50%.
Instance capability tags: in multimodal tasks, each data sample can be abstracted into one or more capabilities; during training, the model learns from these samples and strengthens the corresponding capabilities, as shown in the figure below. To evaluate and optimize a dataset, we can use a model to tag the instance capabilities each sample contributes during training, then optimize the dataset according to the resulting capability distribution, thereby improving training efficiency.
PP-InsCapTagger is trained with PaddleMIX, using llava-v1.6-7b as the base model. The training data uses a subset of images and multi-turn conversations from the multimodal dataset LLaVA-Instruct-150K. GPT-4o tags the instance capabilities of each sample, the tagging result is stored as the sample's tags attribute, and DataCopilot then performs efficient preprocessing, rebuilding each sample's question and answer from the original multi-turn conversation and the tags.
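The dataset-reconstruction step described above can be sketched as follows. The record layout (`image` and `conversations` fields, plus a `tags` attribute) follows the JSON format shown later in this README, while `tag_with_gpt4o`, the prompt wording, and the rebuilt question/answer format are illustrative assumptions, not the project's actual code.

```python
import json


def tag_with_gpt4o(record):
    """Placeholder for the GPT-4o tagging call; returns capability tags.

    The real pipeline queries GPT-4o; the tags below are illustrative only.
    """
    return ["OCR", "attribute recognition"]


def rebuild_record(record):
    """Attach tags to a LLaVA-style record and rebuild its question/answer."""
    record = dict(record)
    record["tags"] = tag_with_gpt4o(record)
    # Flatten the multi-turn conversation into the tagger's question,
    # and use the tag list as its answer.
    dialogue = "\n".join(f"Q: {q} A: {a}" for q, a in record["conversations"])
    question = f"{dialogue}\nWhat capabilities does this sample exercise?"
    answer = ", ".join(record["tags"])
    return {"image": record["image"], "question": question, "answer": answer}


sample = {
    "image": "demo.jpg",
    "conversations": [["What animal is in the image?",
                       "The image features a dog."]],
}
print(json.dumps(rebuild_record(sample), indent=2))
```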
Some training and inference details of PP-InsCapTagger are covered in the AI Studio project: Training and Inference of a Dataset Behavior Tag Classifier Based on PaddleMIX.
Model Usage Example
This project provides the PP-InsCapTagger inference script inference.py. Its two inference modes, single_data and json_data, support single-sample inference on an image-text pair and batch inference on a JSON file, respectively.
Single-sample inference:
Input image:
Input multi-turn conversation:
Q: What animal is in the image? A: The image features a dog.
Q: What color are the dog's eyes? A: The dog has blue eyes.
Q: Where is the dog situated in the image? A: The dog is situated inside a vehicle, on a front passenger seat.
```bash
# Run from the PaddleMIX root directory
python paddlemix/datacopilot/example/ppinscaptagger/inference.py \
    single_data \
    -m paddlemix/PP-InsCapTagger \
    -image https://paddlenlp.bj.bcebos.com/models/community/paddlemix/PP-InsCapTagger/demo.jpg \
    -qa "What animal is in the image?" "The image features a dog." \
        "What color are the dog's eyes?" "The dog has blue eyes." \
        "Where is the dog situated in the image?" "The dog is situated inside a vehicle, on a front passenger seat."
```
Here, `-m` is the path to the model weights; when set to `paddlemix/PP-InsCapTagger`, the PP-InsCapTagger model is downloaded automatically. `-image` is the input image address (a local path or an HTTP link), and `-qa` is the multi-turn conversation content, separated by spaces.
Batch inference:
```bash
# Run from the PaddleMIX root directory
python paddlemix/datacopilot/example/ppinscaptagger/inference.py \
json_data \
-m paddlemix/PP-InsCapTagger \
-d path/to/your/data.json \
-k 0 \
-o path/to/your/output-dir
```
Here, `path/to/your/data.json` is the path to the input batch data file, in the following format:
```json
[
    {
        "image": "http://ecx.images-amazon.com/images/I/51ntbts0gmL.jpg",
        "conversations": [
            [
                "...",
                "..."
            ]
        ]
    },
    {
        "image": "http://ecx.images-amazon.com/images/I/51cc3XrLevL.jpg",
        "conversations": [
            [
                "<image>\nWhat is the title of this book?",
                "Beyond Bigger Leaner Stronger: The Advanced Guide to Building Muscle, Staying Lean, and Getting Strong (The Build Muscle, Get Lean, and Stay Healthy Series)"
            ]
        ]
    }
]
```
`-k` sets the chunk index at which batch processing starts (default 0); if processing is interrupted, change it to resume from a later position. `path/to/your/output-dir` is where the result JSON files are saved; each chunk's results are written to a separate file named `tagger_{i:05}.json`.
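A minimal sketch of the chunked batch processing and `-k` resume behavior described above; the chunk size, the `tag_fn` hook, and the internal structure are assumptions, and only the per-chunk output naming `tagger_{i:05}.json` comes from the README.

```python
import json
import os

CHUNK_SIZE = 1000  # assumed; the actual script's chunk size may differ


def run_batch(data_path, output_dir, start_chunk=0, tag_fn=None):
    """Tag a JSON dataset chunk by chunk, writing one result file per chunk."""
    with open(data_path) as f:
        data = json.load(f)
    os.makedirs(output_dir, exist_ok=True)
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for i, chunk in enumerate(chunks):
        if i < start_chunk:  # -k: skip chunks already processed
            continue
        results = [dict(item, tags=tag_fn(item)) for item in chunk]
        # One output file per chunk, named tagger_{i:05}.json
        with open(os.path.join(output_dir, f"tagger_{i:05}.json"), "w") as f:
            json.dump(results, f, ensure_ascii=False, indent=2)
```

Writing a file per chunk means an interrupted run can resume by passing the index of the first missing `tagger_*.json` as `-k`.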
Tag Use Case
During SFT-stage training of LLaVA v1.5, the instruction-tuning dataset is llava_v1_5_mix665k from LLaVA-Instruct-150K, a mixture of several datasets. Compared with the pretraining data it is much larger, and its instance-capability distribution also varies widely. To optimize that distribution and thereby improve training efficiency, we tag the dataset with PP-InsCapTagger and analyze the tag statistics.
Tagging llava_v1_5_mix665k with PP-InsCapTagger yields 7913 distinct tags. Visualizing the 100 most frequent tags shows large differences in their frequencies, as in the figure below:
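Counting the tag distribution from the per-chunk result files could look like this; it assumes each tagged record carries a `tags` list as described above, and `tag_distribution` is a hypothetical helper, not part of the project.

```python
import glob
import json
from collections import Counter


def tag_distribution(output_dir):
    """Aggregate tag frequencies across all tagger_*.json chunk files."""
    counts = Counter()
    for path in sorted(glob.glob(f"{output_dir}/tagger_*.json")):
        with open(path) as f:
            for record in json.load(f):
                counts.update(record.get("tags", []))
    return counts

# counts.most_common(100) would give the top-100 tags for visualization
```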
To optimize llava_v1_5_mix665k, we filter it using the PP-InsCapTagger tag results: **first determine N, the per-sample tag count that covers 80% of the data; then take the most frequent 0.7% of tags in the dataset's tag set as a filter set R; for each sample in llava_v1_5_mix665k, if it has fewer than N tags and all of its tags fall in R, drop it; otherwise keep it**. This strategy retains roughly 50% of the original dataset.
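A sketch of the bolded filtering strategy above, assuming each record carries a `tags` list; the percentile computation for N and the tie-breaking when selecting R are my reading of the description, not the project's exact code.

```python
from collections import Counter


def filter_dataset(records, coverage=0.8, top_frac=0.007):
    """Drop records with fewer than N tags whose tags all fall in R.

    N: per-record tag count covering `coverage` of the data (a percentile).
    R: the most frequent `top_frac` fraction of all distinct tags.
    """
    tag_counts = sorted(len(r["tags"]) for r in records)
    # N as the coverage-percentile of per-record tag counts
    n = tag_counts[int(coverage * len(records)) - 1]
    freq = Counter(t for r in records for t in r["tags"])
    top_k = max(1, int(top_frac * len(freq)))
    r_set = {t for t, _ in freq.most_common(top_k)}
    return [r for r in records
            if not (len(r["tags"]) < n and set(r["tags"]) <= r_set)]
```

The intuition: samples with few tags, all of which are already abundant in the dataset, contribute little new capability signal and can be removed.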
We run llava-1.5-7b SFT-stage training separately on llava_v1_5_mix665k and on the filtered dataset; the results are compared in the table below:
| Model | ScienceQA | TextVQA | VQAv2 | GQA | MMMU | MME |
|:----------------------:|:-----------:|:---------:|:-------:|:-------:|:-------:|:----------------:|
| llava-1.5-7b (origin) | 66.8 | 58.2 | 78.5 | 62.0 | - | - |
| llava-1.5-7b (rerun) | 69.01 | 57.6 | 79.0 | 62.95 | 36.89 | 1521 / 323 |
| llava-1.5-7b (random 50%) | 67.31 | 55.6 | 76.89 | 61.01 | 34.67 | 1421 / 286 |
| llava-1.5-7b (our 50%) | 70.24 (+2.93) | 57.12 (+1.52) | 78.32 (+1.43) | 62.14 (+1.13) | 37.11 (+2.44) | 1476 (+55) / 338 (+52) |
With PP-InsCapTagger tagging and filtering, the 50% subset we select outperforms a random 50% sample and is essentially on par with training on the full original dataset, substantially improving model training efficiency.
Owner
- Name: Wenyu
- Login: lyuwenyu
- Kind: user
- Location: Beijing, China
- Company: Harbin Institute of Technology
- Website: https://www.linkedin.com/in/lyuwenyu/
- Repositories: 4
- Profile: https://github.com/lyuwenyu
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: 'Instance Capability Tagger: Enhancing Multimodal Data Efficiency for Model Training'
message: >-
If you use this model, please cite it using the metadata from this file.
type: software
authors:
- given-names: Wenyu
family-names: Lv
email: lyuwenyu@foxmail.com
- given-names: Kui
family-names: Huang
- given-names: Yian
family-names: Zhao
repository-code: 'https://github.com/lyuwenyu/PP-InsCapTagger'
repository: 'https://github.com/lyuwenyu/PP-InsCapTagger'
keywords:
- mllm
- InsCapTagger
license: Apache-2.0
version: 1.0
date-released: 2024-10-10
GitHub Events
Total
- Watch event: 4
- Push event: 2
Last Year
- Watch event: 4
- Push event: 2
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0