https://github.com/bytedance/valley

Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.


Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary

Keywords

research
Last synced: 7 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: bytedance
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 4.58 MB
Statistics
  • Stars: 215
  • Watchers: 3
  • Forks: 13
  • Open Issues: 3
  • Releases: 0
Topics
research
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License

README.md

Valley2

🤗 Hugging Face   |   🤖 ModelScope    |    📑 Home Page    |    📙 Paper

News

Introduction

Valley is a cutting-edge multimodal large model developed by ByteDance, designed to handle a variety of tasks involving text, image, and video data. Our model

  • Achieved the best results on in-house e-commerce and short-video benchmarks, far surpassing other SOTA open-source models, and
  • Demonstrated comparatively outstanding performance on OpenCompass (average score >= 67.40, top 2 among models under 10B parameters)

when evaluated against models of the same scale.

(Figure: OpenCompass benchmark comparison)


Valley-Eagle

The foundational version of Valley is a multimodal large model aligned with Siglip and Qwen2.5, incorporating LargeMLP and ConvAdapter to construct the projector.

  • In the final version, we also referenced Eagle, introducing an additional VisionEncoder that can flexibly adjust the number of tokens and is parallelized with the original visual tokens.
  • This enhancement supplements the model’s performance in extreme scenarios, and we chose the Qwen2vl VisionEncoder for this purpose.

The model structure is shown as follows:

(Figure: Valley-Eagle model structure)
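The token-merging idea described above can be sketched in a few lines. This is a hypothetical numpy illustration of the shapes involved, not the actual Valley implementation; the token counts, hidden sizes, and projector layout here are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 64, 128  # hypothetical hidden sizes

# Two vision branches produce token sequences of different lengths:
# a fixed-length branch (SigLIP-style) and a flexible-length branch
# (Eagle-style, token count can vary with the input).
siglip_tokens = rng.standard_normal((196, d_vision))
eagle_tokens = rng.standard_normal((64, d_vision))

# The parallel token streams are concatenated along the sequence axis.
visual_tokens = np.concatenate([siglip_tokens, eagle_tokens], axis=0)

# A two-layer MLP projector maps vision tokens into the LLM embedding space.
w1 = rng.standard_normal((d_vision, 256))
w2 = rng.standard_normal((256, d_llm))
projected = np.maximum(visual_tokens @ w1, 0) @ w2  # ReLU MLP

print(projected.shape)  # (260, 128): 196 + 64 tokens, now in LLM space
```

The point of the sketch is only that the extra encoder adds tokens in parallel rather than replacing the original visual tokens, and a single projector maps the combined sequence into the language model's space.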

Environment Setup

```bash
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```

Inference Demo

# Single image

```python
# Method-1
import torch
import urllib.request
from io import BytesIO
from PIL import Image
from transformers import AutoProcessor, AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("bytedance-research/Valley-Eagle-7B", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("bytedance-research/Valley-Eagle-7B", trust_remote_code=True)

url = "https://images.unsplash.com/photo-1734640113825-24dd7c056052"
img = urllib.request.urlopen(url=url, timeout=5).read()
img = Image.open(BytesIO(img)).convert("RGB")
res = processor(
    {
        "conversations": [
            {"role": "system", "content": "You are Valley, developed by ByteDance. You are a helpful assistant."},
            {"role": "user", "content": "Describe the given image."},
        ],
        "images": [img],
    },
    inference=True,
)

with torch.inference_mode():
    model.to(dtype=torch.float16, device=device)
    output_ids = model.generate(
        input_ids=res["input_ids"].to(device),
        images=[[item.to(dtype=torch.float16, device=device) for item in img] for img in res["images"]],
        image_sizes=res["image_sizes"],
        pixel_values=res["pixel_values"].to(dtype=torch.float16, device=device),
        image_grid_thw=res["image_grid_thw"].to(device),
        do_sample=False,
        max_new_tokens=1024,
        repetition_penalty=1.0,
        return_dict_in_generate=True,
        output_scores=True,
    )
    input_token_len = res["input_ids"].shape[1]
    generation_text = processor.batch_decode(output_ids.sequences[:, input_token_len:])[0]
    generation_text = generation_text.replace("<|im_end|>", "")
    print(generation_text)

# Method-2
from valley_eagle_chat import ValleyEagleChat

model = ValleyEagleChat(
    model_path="bytedance-research/Valley-Eagle-7B",
    padding_side="left",
)

url = "https://images.unsplash.com/photo-1734640113825-24dd7c056052"
img = urllib.request.urlopen(url=url, timeout=5).read()
img = Image.open(BytesIO(img)).convert("RGB")

request = {
    "chat_history": [
        {"role": "system", "content": "You are Valley, developed by ByteDance. You are a helpful assistant."},
        {"role": "user", "content": "Describe the given image."},
    ],
    "images": [img],
}
result = model(request)
print("\n>>> Assistant:\n")
print(result)
```
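The demo above passes `do_sample=False`, which makes decoding greedy: at each step the highest-probability token is chosen deterministically. A framework-agnostic numpy sketch of the distinction (illustrative only, not Valley's actual decoding loop):

```python
import numpy as np

def pick_token(logits: np.ndarray, do_sample: bool, rng=None) -> int:
    # do_sample=False: deterministic argmax, as in the demo above.
    if not do_sample:
        return int(np.argmax(logits))
    # do_sample=True: sample from the softmax distribution instead.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([0.1, 2.5, 0.3, 1.2])
print(pick_token(logits, do_sample=False))  # → 1, the argmax index
```

Greedy decoding makes the demo's output reproducible across runs, which is usually what you want when comparing model behavior.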

# Multi-images

```python
from valley_eagle_chat import ValleyEagleChat
import urllib.request
from io import BytesIO
from PIL import Image

model = ValleyEagleChat(
    model_path="bytedance-research/Valley-Eagle-7B",
    padding_side="left",
)

urls = [
    "https://plus.unsplash.com/premium_photo-1661632559307-902ac3f6174c",
    "https://plus.unsplash.com/premium_photo-1661632559713-a478160cd72e",
    "https://plus.unsplash.com/premium_photo-1661607772173-54f7b8263c27",
    "https://plus.unsplash.com/premium_photo-1661607115685-36b2a7276389",
    "https://plus.unsplash.com/premium_photo-1661607103369-e799ee7ef954",
    "https://plus.unsplash.com/premium_photo-1661628841460-1c9d7e6669ec",
    "https://plus.unsplash.com/premium_photo-1661602273588-f213a4155caf",
    "https://plus.unsplash.com/premium_photo-1661602247160-d42d7aba6798",
]

url2img = lambda url: Image.open(
    BytesIO(urllib.request.urlopen(url=url, timeout=5).read())
).convert("RGB")
imgs = [url2img(url) for url in urls]

request = {
    "chat_history": [
        {"role": "system", "content": "You are Valley, developed by ByteDance. You are a helpful assistant."},
        {"role": "user", "content": "Describe the given images."},
    ],
    "images": imgs,
}
result = model(request)
print("\n>>> Assistant:\n")
print(result)
```

# Video

```python
from valley_eagle_chat import ValleyEagleChat
import decord
import requests
import numpy as np
from torchvision import transforms

model = ValleyEagleChat(
    model_path="bytedance-research/Valley-Eagle-7B",
    padding_side="left",
)

url = "https://videos.pexels.com/video-files/29641276/12753127_1920_1080_25fps.mp4"
video_file = "./video.mp4"
response = requests.get(url)
if response.status_code == 200:
    with open(video_file, "wb") as f:
        f.write(response.content)
else:
    print("download error!")
    exit(1)

video_reader = decord.VideoReader(video_file)
decord.bridge.set_bridge("torch")
# Sample 8 evenly spaced frames across the whole clip.
video = video_reader.get_batch(
    np.linspace(0, len(video_reader) - 1, 8).astype(np.int_)
).byte()

request = {
    "chat_history": [
        {"role": "system", "content": "You are Valley, developed by ByteDance. You are a helpful assistant."},
        {"role": "user", "content": "Describe the given video."},
    ],
    "images": [transforms.ToPILImage()(image.permute(2, 0, 1)).convert("RGB") for image in video],
}
result = model(request)
print("\n>>> Assistant:\n")
print(result)
```
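The video demo selects 8 evenly spaced frames with `np.linspace`. The sampling pattern can be checked in isolation; the helper name below is ours, not part of Valley or decord, and the 100-frame clip is hypothetical:

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> np.ndarray:
    # Same expression as the demo: evenly spaced indices from the first
    # frame (0) to the last (num_frames - 1), truncated to integers.
    return np.linspace(0, num_frames - 1, num_samples).astype(np.int_)

print(uniform_frame_indices(100).tolist())  # → [0, 14, 28, 42, 56, 70, 84, 99]
```

Including both endpoints guarantees the first and last frames are always sampled, so short and long clips alike are summarized by the same number of frames.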

Related Projects

We list related projects:

  • Valley: Video Assistant with Large Language model Enhanced abilitY
  • LLaVA: Large Language and Vision Assistant
  • Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
  • LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
  • Qwen2.5

License Agreement

All of our open-source models are licensed under the Apache-2.0 license.

We are Hiring 🔥🔥🔥

The TikTok E-commerce Team focuses on the research and development of multimodal large-model algorithms and foundational algorithms. We welcome inquiries and look forward to working on challenging projects with talented individuals like you!

Location: Beijing / Shanghai / Hangzhou / Singapore

Contact & Resume Submission: wuheng.2024@bytedance.com

The TikTok E-commerce team focuses on the research and development of multimodal large-model algorithms and foundational algorithms. Inquiries (intern/full-time) are welcome; we look forward to tackling challenging work together with outstanding people like you!

Locations: Beijing / Shanghai / Hangzhou / Singapore

Inquiries & resume submission: wuheng.2024@bytedance.com

Citation

```bibtex
@article{wu2025valley2,
  title={Valley2: Exploring Multimodal Models with Scalable Vision-Language Design},
  author={Wu, Ziheng and Chen, Zhenghao and Luo, Ruipu and Zhang, Can and Gao, Yuan and He, Zhentao and Wang, Xian and Lin, Haoran and Qiu, Minghui},
  journal={arXiv preprint arXiv:2501.05901},
  year={2025}
}
```

Owner

  • Name: Bytedance Inc.
  • Login: bytedance
  • Kind: organization
  • Location: Singapore

GitHub Events

Total
  • Issues event: 20
  • Watch event: 224
  • Member event: 1
  • Issue comment event: 31
  • Push event: 24
  • Public event: 1
  • Pull request event: 8
  • Fork event: 16
  • Create event: 2
Last Year
  • Issues event: 20
  • Watch event: 224
  • Member event: 1
  • Issue comment event: 31
  • Push event: 24
  • Public event: 1
  • Pull request event: 8
  • Fork event: 16
  • Create event: 2

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 8
  • Total pull requests: 2
  • Average time to close issues: 6 days
  • Average time to close pull requests: less than a minute
  • Total issue authors: 7
  • Total pull request authors: 1
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 8
  • Pull requests: 2
  • Average time to close issues: 6 days
  • Average time to close pull requests: less than a minute
  • Issue authors: 7
  • Pull request authors: 1
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Zxr1314 (2)
  • Biao-K (1)
  • babyta (1)
  • RockyF (1)
  • GYxiaOH (1)
  • chuxiliyixiaosa (1)
  • effort-yq (1)
  • xtuyzb (1)
Pull Request Authors
  • Hyggge (2)
  • wuziheng (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • accelerate ==0.34.2
  • bert-score ==0.3.13
  • byted-matxscript ==1.8.2
  • byted-wandb ==0.13.72
  • datasets ==2.21.0
  • decord ==0.6.0
  • deepspeed ==0.9.5
  • einops ==0.8.0
  • evaluate ==0.4.3
  • fastapi ==0.115.0
  • flash_attn ==2.7.2.post1
  • ftfy ==6.2.3
  • markdown2 ==2.5.0
  • ninja ==1.11.1.1
  • nltk ==3.9.1
  • numpy ==1.26.4
  • omegaconf ==2.3.0
  • openai ==0.28
  • opencv-python-headless ==4.10.0.84
  • packaging ==24.1
  • pandas ==2.2.2
  • peft ==0.5.0
  • prettytable ==3.11.0
  • protobuf ==3.20.3
  • pyarrow ==15.0.0
  • pydantic ==1.10.14
  • qwen_vl_utils *
  • requests ==2.32.3
  • rouge-score ==0.1.2
  • scikit-image ==0.24.0
  • scikit-learn ==1.5.2
  • sentencepiece ==0.1.97
  • timm ==0.6.7
  • tokenizers >=0.13.3
  • torchmetrics *
  • transformers ==4.45.2
  • uvicorn ==0.30.6