https://github.com/bytedance/valley
Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 10.5%, to scientific vocabulary)
Keywords
Repository
Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.
Basic Info
Statistics
- Stars: 215
- Watchers: 3
- Forks: 13
- Open Issues: 3
- Releases: 0
Topics
Metadata Files
README.md
Valley2
🤗 Hugging Face   |   🤖 ModelScope    |    📑 Home Page    |    📙 Paper
News
- [2025/06/06] 🔥 We have submitted Valley2-DPO to the closed-source OpenCompass Multi-modal Leaderboard, achieving a score of 38.62, which ranks top-3 among multi-modal models with fewer than 10 billion (10B) parameters.
- [2025/04/14] 🔥 We have released the weights of Valley-Eagle-7B-DPO (Valley2-DPO)!
- [2025/02/09] 🔥 We have developed Valley-Eagle-7B-DPO (Valley2-DPO), which scored 69.6 on the OpenCompass leaderboard; the weights will be released soon.
- [2025/01/10] 🔥 Our paper has been released! Valley2: Exploring Multimodal Models with Scalable Vision-Language Design
- [2024/12/23] 🔥 Announcing Valley-Eagle-7B (Valley2)!
Introduction
Valley is a cutting-edge multimodal large model developed by ByteDance, designed to handle a variety of tasks involving text, images, and video data. Our model
- Achieved the best results on in-house e-commerce and short-video benchmarks, substantially outperforming other SOTA open-source models.
- Demonstrated comparatively outstanding performance on OpenCompass (average score >= 67.40, top 2 among models under 10B parameters) when evaluated against models of the same scale.
Valley-Eagle
The foundational version of Valley is a multimodal large model aligned with SigLIP and Qwen2.5, incorporating LargeMLP and ConvAdapter to construct the projector.
- In the final version, we also drew on Eagle, introducing an additional VisionEncoder that can flexibly adjust the number of tokens and runs in parallel with the original visual tokens.
- This enhancement improves the model's performance in extreme scenarios; we chose the Qwen2-VL VisionEncoder for this purpose.
The model structure is shown as follows:
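The dual-encoder token layout described above can be sketched in a few lines. Everything below is illustrative: the token counts, dimensions, and the two-layer MLP stand in for the real LargeMLP/ConvAdapter projector, and are our assumptions rather than Valley's actual implementation.

```python
import numpy as np

def mlp_project(tokens, w1, w2):
    # a two-layer MLP projector (the "LargeMLP" idea): vision dim -> LLM dim
    h = np.maximum(tokens @ w1, 0.0)  # ReLU
    return h @ w2

rng = np.random.default_rng(0)
d_vis, d_llm = 32, 64

# hypothetical token counts: 196 SigLIP patch tokens, plus 64 tokens from
# the parallel second VisionEncoder (flexible count in the real model)
siglip_tokens = rng.standard_normal((196, d_vis))
eagle_tokens = rng.standard_normal((64, d_vis))

w1 = rng.standard_normal((d_vis, 128)) * 0.02
w2 = rng.standard_normal((128, d_llm)) * 0.02

# project each stream, then concatenate along the sequence axis so the LLM
# sees both sets of visual tokens side by side
proj = np.concatenate(
    [mlp_project(siglip_tokens, w1, w2), mlp_project(eagle_tokens, w1, w2)],
    axis=0,
)
print(proj.shape)  # (260, 64)
```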
Environment Setup
```bash
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
Inference Demo
- Single image

  ```python
  # Method-1
  import urllib.request
  from io import BytesIO

  import torch
  from PIL import Image
  from transformers import AutoModel, AutoProcessor

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model = AutoModel.from_pretrained("bytedance-research/Valley-Eagle-7B", trust_remote_code=True)
  processor = AutoProcessor.from_pretrained("bytedance-research/Valley-Eagle-7B", trust_remote_code=True)

  url = "https://images.unsplash.com/photo-1734640113825-24dd7c056052"
  img = urllib.request.urlopen(url=url, timeout=5).read()
  img = Image.open(BytesIO(img)).convert("RGB")
  res = processor(
      {
          "conversations": [
              {"role": "system", "content": "You are Valley, developed by ByteDance. Your are a helpfull Assistant."},
              {"role": "user", "content": "Describe the given image."},
          ],
          "images": [img],
      },
      inference=True,
  )

  with torch.inference_mode():
      model.to(dtype=torch.float16, device=device)
      output_ids = model.generate(
          input_ids=res["input_ids"].to(device),
          images=[[item.to(dtype=torch.float16, device=device) for item in img] for img in res["images"]],
          image_sizes=res["image_sizes"],
          pixel_values=res["pixel_values"].to(dtype=torch.float16, device=device),
          image_grid_thw=res["image_grid_thw"].to(device),
          do_sample=False,
          max_new_tokens=1024,
          repetition_penalty=1.0,
          return_dict_in_generate=True,
          output_scores=True,
      )
  input_token_len = res["input_ids"].shape[1]
  generation_text = processor.batch_decode(output_ids.sequences[:, input_token_len:])[0]
  generation_text = generation_text.replace("<|im_end|>", "")
  print(generation_text)

  # Method-2
  from valley_eagle_chat import ValleyEagleChat

  model = ValleyEagleChat(
      model_path="bytedance-research/Valley-Eagle-7B",
      padding_side="left",
  )

  url = "https://images.unsplash.com/photo-1734640113825-24dd7c056052"
  img = urllib.request.urlopen(url=url, timeout=5).read()
  img = Image.open(BytesIO(img)).convert("RGB")

  request = {
      "chat_history": [
          {"role": "system", "content": "You are Valley, developed by ByteDance. Your are a helpfull Assistant."},
          {"role": "user", "content": "Describe the given image."},
      ],
      "images": [img],
  }
  result = model(request)
  print("\n>>> Assistant:\n")
  print(result)
  ```
- Multi-images

  ```python
  import urllib.request
  from io import BytesIO

  from PIL import Image
  from valley_eagle_chat import ValleyEagleChat

  model = ValleyEagleChat(
      model_path="bytedance-research/Valley-Eagle-7B",
      padding_side="left",
  )

  urls = [
      "https://plus.unsplash.com/premium_photo-1661632559307-902ac3f6174c",
      "https://plus.unsplash.com/premium_photo-1661632559713-a478160cd72e",
      "https://plus.unsplash.com/premium_photo-1661607772173-54f7b8263c27",
      "https://plus.unsplash.com/premium_photo-1661607115685-36b2a7276389",
      "https://plus.unsplash.com/premium_photo-1661607103369-e799ee7ef954",
      "https://plus.unsplash.com/premium_photo-1661628841460-1c9d7e6669ec",
      "https://plus.unsplash.com/premium_photo-1661602273588-f213a4155caf",
      "https://plus.unsplash.com/premium_photo-1661602247160-d42d7aba6798",
  ]

  url2img = lambda url: Image.open(
      BytesIO(urllib.request.urlopen(url=url, timeout=5).read())
  ).convert("RGB")
  imgs = [url2img(url) for url in urls]

  request = {
      "chat_history": [
          {"role": "system", "content": "You are Valley, developed by ByteDance. Your are a helpfull Assistant."},
          {"role": "user", "content": "Describe the given images."},
      ],
      "images": imgs,
  }
  result = model(request)
  print("\n>>> Assistant:\n")
  print(result)
  ```
- Video

  ```python
  import decord
  import numpy as np
  import requests
  from torchvision import transforms
  from valley_eagle_chat import ValleyEagleChat

  model = ValleyEagleChat(
      model_path="bytedance-research/Valley-Eagle-7B",
      padding_side="left",
  )

  url = "https://videos.pexels.com/video-files/29641276/12753127_1920_1080_25fps.mp4"
  video_file = "./video.mp4"
  response = requests.get(url)
  if response.status_code == 200:
      with open(video_file, "wb") as f:
          f.write(response.content)
  else:
      print("download error!")
      exit(0)

  # sample 8 evenly spaced frames from the video
  video_reader = decord.VideoReader(video_file)
  decord.bridge.set_bridge("torch")
  video = video_reader.get_batch(
      np.linspace(0, len(video_reader) - 1, 8).astype(np.int_)
  ).byte()

  request = {
      "chat_history": [
          {"role": "system", "content": "You are Valley, developed by ByteDance. Your are a helpfull Assistant."},
          {"role": "user", "content": "Describe the given video."},
      ],
      "images": [transforms.ToPILImage()(image.permute(2, 0, 1)).convert("RGB") for image in video],
  }
  result = model(request)
  print("\n>>> Assistant:\n")
  print(result)
  ```
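The video demo above picks 8 evenly spaced frame indices with `np.linspace`. The index math can be isolated into a small helper (the helper name is ours, not part of the repo):

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> list:
    # evenly spaced indices from the first to the last frame, inclusive;
    # astype(int) truncates the intermediate points toward zero
    return np.linspace(0, num_frames - 1, num_samples).astype(int).tolist()

print(uniform_frame_indices(100))  # [0, 14, 28, 42, 56, 70, 84, 99]
```

When the video has fewer frames than `num_samples`, `np.linspace` repeats indices rather than failing, which is usually the desired behavior for short clips.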
Related Project
We list related projects:
- Valley: Video Assistant with Large Language model Enhanced abilitY
- LLaVA: Large Language and Vision Assistant
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
- Qwen2.5
License Agreement
All of our open-source models are licensed under the Apache-2.0 license.
We are Hiring 🔥🔥🔥
The TikTok-Ecommerce team focuses on the research and development of multimodal large model algorithms and foundational algorithms. We welcome inquiries and look forward to working on challenging projects with talented individuals like you!
Location: Beijing / Shanghai / Hangzhou / Singapore
Contact & Resume Submission: wuheng.2024@bytedance.com
The TikTok-Ecommerce team focuses on the research and development of multimodal large model algorithms and foundational algorithms. Inquiries are welcome (internship / full-time); we look forward to tackling challenging work together with talented people like you!
Locations: Beijing / Shanghai / Hangzhou / Singapore
Contact & resume submission: wuheng.2024@bytedance.com
Citation
```
@article{wu2025valley2,
  title={Valley2: Exploring Multimodal Models with Scalable Vision-Language Design},
  author={Wu, Ziheng and Chen, Zhenghao and Luo, Ruipu and Zhang, Can and Gao, Yuan and He, Zhentao and Wang, Xian and Lin, Haoran and Qiu, Minghui},
  journal={arXiv preprint arXiv:2501.05901},
  year={2025}
}
```
Owner
- Name: Bytedance Inc.
- Login: bytedance
- Kind: organization
- Location: Singapore
- Website: https://opensource.bytedance.com
- Twitter: ByteDanceOSS
- Repositories: 255
- Profile: https://github.com/bytedance
GitHub Events
Total
- Issues event: 20
- Watch event: 224
- Member event: 1
- Issue comment event: 31
- Push event: 24
- Public event: 1
- Pull request event: 8
- Fork event: 16
- Create event: 2
Last Year
- Issues event: 20
- Watch event: 224
- Member event: 1
- Issue comment event: 31
- Push event: 24
- Public event: 1
- Pull request event: 8
- Fork event: 16
- Create event: 2
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 8
- Total pull requests: 2
- Average time to close issues: 6 days
- Average time to close pull requests: less than a minute
- Total issue authors: 7
- Total pull request authors: 1
- Average comments per issue: 1.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 8
- Pull requests: 2
- Average time to close issues: 6 days
- Average time to close pull requests: less than a minute
- Issue authors: 7
- Pull request authors: 1
- Average comments per issue: 1.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Zxr1314 (2)
- Biao-K (1)
- babyta (1)
- RockyF (1)
- GYxiaOH (1)
- chuxiliyixiaosa (1)
- effort-yq (1)
- xtuyzb (1)
Pull Request Authors
- Hyggge (2)
- wuziheng (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- accelerate ==0.34.2
- bert-score ==0.3.13
- byted-matxscript ==1.8.2
- byted-wandb ==0.13.72
- datasets ==2.21.0
- decord ==0.6.0
- deepspeed ==0.9.5
- einops ==0.8.0
- evaluate ==0.4.3
- fastapi ==0.115.0
- flash_attn ==2.7.2.post1
- ftfy ==6.2.3
- markdown2 ==2.5.0
- ninja ==1.11.1.1
- nltk ==3.9.1
- numpy ==1.26.4
- omegaconf ==2.3.0
- openai ==0.28
- opencv-python-headless ==4.10.0.84
- packaging ==24.1
- pandas ==2.2.2
- peft ==0.5.0
- prettytable ==3.11.0
- protobuf ==3.20.3
- pyarrow ==15.0.0
- pydantic ==1.10.14
- qwen_vl_utils *
- requests ==2.32.3
- rouge-score ==0.1.2
- scikit-image ==0.24.0
- scikit-learn ==1.5.2
- sentencepiece ==0.1.97
- timm ==0.6.7
- tokenizers >=0.13.3
- torchmetrics *
- transformers ==4.45.2
- uvicorn ==0.30.6