https://github.com/bytedance/valley
Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 10.5%, to scientific vocabulary)
Keywords
Repository
Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.
Basic Info
Statistics
- Stars: 215
- Watchers: 3
- Forks: 13
- Open Issues: 3
- Releases: 0
Topics
Metadata Files
README.md
Valley2
🤗 Hugging Face   |   🤖 ModelScope    |    📑 Home Page    |    📙 Paper
News
- [2025/06/06] 🔥 We have submitted Valley2-DPO to the closed-source OpenCompass Multi-modal Leaderboard, achieving a score of 38.62, which ranks top-3 among multi-modal models with fewer than 10 billion (10B) parameters.
- [2025/04/14] 🔥 We have released the weights of Valley-Eagle-7B-DPO (Valley2-DPO)!
- [2025/02/09] 🔥 We have developed Valley-Eagle-7B-DPO (Valley2-DPO), which scored 69.6 on the OpenCompass leaderboard; the weights will be released soon.
- [2025/01/10] 🔥 Our paper has been released! Valley2: Exploring Multimodal Models with Scalable Vision-Language Design
- [2024/12/23] 🔥 Announcing Valley-Eagle-7B (Valley2)!
Introduction
Valley is a cutting-edge multimodal large model developed by ByteDance, designed to handle a variety of tasks involving text, images, and video data. Our model
- Achieved the best results on in-house e-commerce and short-video benchmarks, substantially outperforming other SOTA open-source models.
- Demonstrated comparatively outstanding performance on OpenCompass (average score >= 67.40, top 2 among models under 10B parameters) when evaluated against models of the same scale.
Valley-Eagle
The foundational version of Valley is a multimodal large model aligned with SigLIP and Qwen2.5, incorporating LargeMLP and ConvAdapter to construct the projector.
- In the final version, we also drew on Eagle, introducing an additional VisionEncoder that can flexibly adjust the number of tokens and runs in parallel with the original visual tokens.
- This enhancement improves the model's performance in extreme scenarios; we chose the Qwen2-VL VisionEncoder for this purpose.
The model structure is shown as follows:
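The dual-encoder token layout described above can be sketched in a few lines. Everything below is illustrative: the token counts, dimensions, and the two-layer MLP stand in for the real LargeMLP/ConvAdapter projector, and are our assumptions rather than Valley's actual implementation.

```python
import numpy as np

def mlp_project(tokens, w1, w2):
    # a two-layer MLP projector (the "LargeMLP" idea): vision dim -> LLM dim
    h = np.maximum(tokens @ w1, 0.0)  # ReLU
    return h @ w2

rng = np.random.default_rng(0)
d_vis, d_llm = 32, 64

# hypothetical token counts: 196 SigLIP patch tokens, plus 64 tokens from
# the parallel second VisionEncoder (flexible count in the real model)
siglip_tokens = rng.standard_normal((196, d_vis))
eagle_tokens = rng.standard_normal((64, d_vis))

w1 = rng.standard_normal((d_vis, 128)) * 0.02
w2 = rng.standard_normal((128, d_llm)) * 0.02

# project each stream, then concatenate along the sequence axis so the LLM
# sees both sets of visual tokens side by side
proj = np.concatenate(
    [mlp_project(siglip_tokens, w1, w2), mlp_project(eagle_tokens, w1, w2)],
    axis=0,
)
print(proj.shape)  # (260, 64)
```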
Environment Setup
```bash
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
Inference Demo
- Single image

  ```python
  # Method-1
  import urllib.request
  from io import BytesIO

  import torch
  from PIL import Image
  from transformers import AutoModel, AutoProcessor

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model = AutoModel.from_pretrained("bytedance-research/Valley-Eagle-7B", trust_remote_code=True)
  processor = AutoProcessor.from_pretrained("bytedance-research/Valley-Eagle-7B", trust_remote_code=True)

  url = "https://images.unsplash.com/photo-1734640113825-24dd7c056052"
  img = urllib.request.urlopen(url=url, timeout=5).read()
  img = Image.open(BytesIO(img)).convert("RGB")
  res = processor(
      {
          "conversations": [
              {"role": "system", "content": "You are Valley, developed by ByteDance. Your are a helpfull Assistant."},
              {"role": "user", "content": "Describe the given image."},
          ],
          "images": [img],
      },
      inference=True,
  )

  with torch.inference_mode():
      model.to(dtype=torch.float16, device=device)
      output_ids = model.generate(
          input_ids=res["input_ids"].to(device),
          images=[[item.to(dtype=torch.float16, device=device) for item in img] for img in res["images"]],
          image_sizes=res["image_sizes"],
          pixel_values=res["pixel_values"].to(dtype=torch.float16, device=device),
          image_grid_thw=res["image_grid_thw"].to(device),
          do_sample=False,
          max_new_tokens=1024,
          repetition_penalty=1.0,
          return_dict_in_generate=True,
          output_scores=True,
      )
  input_token_len = res["input_ids"].shape[1]
  generation_text = processor.batch_decode(output_ids.sequences[:, input_token_len:])[0]
  generation_text = generation_text.replace("<|im_end|>", "")
  print(generation_text)

  # Method-2
  from valley_eagle_chat import ValleyEagleChat

  model = ValleyEagleChat(
      model_path="bytedance-research/Valley-Eagle-7B",
      padding_side="left",
  )

  url = "https://images.unsplash.com/photo-1734640113825-24dd7c056052"
  img = urllib.request.urlopen(url=url, timeout=5).read()
  img = Image.open(BytesIO(img)).convert("RGB")

  request = {
      "chat_history": [
          {"role": "system", "content": "You are Valley, developed by ByteDance. Your are a helpfull Assistant."},
          {"role": "user", "content": "Describe the given image."},
      ],
      "images": [img],
  }
  result = model(request)
  print("\n>>> Assistant:\n")
  print(result)
  ```
- Multi-images

  ```python
  import urllib.request
  from io import BytesIO

  from PIL import Image
  from valley_eagle_chat import ValleyEagleChat

  model = ValleyEagleChat(
      model_path="bytedance-research/Valley-Eagle-7B",
      padding_side="left",
  )

  urls = [
      "https://plus.unsplash.com/premium_photo-1661632559307-902ac3f6174c",
      "https://plus.unsplash.com/premium_photo-1661632559713-a478160cd72e",
      "https://plus.unsplash.com/premium_photo-1661607772173-54f7b8263c27",
      "https://plus.unsplash.com/premium_photo-1661607115685-36b2a7276389",
      "https://plus.unsplash.com/premium_photo-1661607103369-e799ee7ef954",
      "https://plus.unsplash.com/premium_photo-1661628841460-1c9d7e6669ec",
      "https://plus.unsplash.com/premium_photo-1661602273588-f213a4155caf",
      "https://plus.unsplash.com/premium_photo-1661602247160-d42d7aba6798",
  ]

  url2img = lambda url: Image.open(
      BytesIO(urllib.request.urlopen(url=url, timeout=5).read())
  ).convert("RGB")
  imgs = [url2img(url) for url in urls]

  request = {
      "chat_history": [
          {"role": "system", "content": "You are Valley, developed by ByteDance. Your are a helpfull Assistant."},
          {"role": "user", "content": "Describe the given images."},
      ],
      "images": imgs,
  }
  result = model(request)
  print("\n>>> Assistant:\n")
  print(result)
  ```
- Video

  ```python
  import decord
  import numpy as np
  import requests
  from torchvision import transforms
  from valley_eagle_chat import ValleyEagleChat

  model = ValleyEagleChat(
      model_path="bytedance-research/Valley-Eagle-7B",
      padding_side="left",
  )

  url = "https://videos.pexels.com/video-files/29641276/12753127_1920_1080_25fps.mp4"
  video_file = "./video.mp4"
  response = requests.get(url)
  if response.status_code == 200:
      with open(video_file, "wb") as f:
          f.write(response.content)
  else:
      print("download error!")
      exit(0)

  # sample 8 evenly spaced frames from the video
  video_reader = decord.VideoReader(video_file)
  decord.bridge.set_bridge("torch")
  video = video_reader.get_batch(
      np.linspace(0, len(video_reader) - 1, 8).astype(np.int_)
  ).byte()

  request = {
      "chat_history": [
          {"role": "system", "content": "You are Valley, developed by ByteDance. Your are a helpfull Assistant."},
          {"role": "user", "content": "Describe the given video."},
      ],
      "images": [transforms.ToPILImage()(image.permute(2, 0, 1)).convert("RGB") for image in video],
  }
  result = model(request)
  print("\n>>> Assistant:\n")
  print(result)
  ```
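The video demo above picks 8 evenly spaced frame indices with `np.linspace`. The index math can be isolated into a small helper (the helper name is ours, not part of the repo):

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> list:
    # evenly spaced indices from the first to the last frame, inclusive;
    # astype(int) truncates the intermediate points toward zero
    return np.linspace(0, num_frames - 1, num_samples).astype(int).tolist()

print(uniform_frame_indices(100))  # [0, 14, 28, 42, 56, 70, 84, 99]
```

When the video has fewer frames than `num_samples`, `np.linspace` repeats indices rather than failing, which is usually the desired behavior for short clips.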
Related Project
We list related projects:
- Valley: Video Assistant with Large Language model Enhanced abilitY
- LLaVA: Large Language and Vision Assistant
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
- Qwen2.5
License Agreement
All of our open-source models are licensed under the Apache-2.0 license.
We are Hiring 🔥🔥🔥
The TikTok-Ecommerce team focuses on the research and development of multimodal large model algorithms and foundational algorithms. We welcome inquiries and look forward to working on challenging projects with talented individuals like you!
Location: Beijing / Shanghai / Hangzhou / Singapore
Contact & Resume Submission: wuheng.2024@bytedance.com
The TikTok-Ecommerce team focuses on the research and development of multimodal large model algorithms and foundational algorithms. Inquiries are welcome (internship / full-time); we look forward to tackling challenging work together with talented people like you!
Locations: Beijing / Shanghai / Hangzhou / Singapore
Contact & resume submission: wuheng.2024@bytedance.com
Citation
```
@article{wu2025valley2,
  title={Valley2: Exploring Multimodal Models with Scalable Vision-Language Design},
  author={Wu, Ziheng and Chen, Zhenghao and Luo, Ruipu and Zhang, Can and Gao, Yuan and He, Zhentao and Wang, Xian and Lin, Haoran and Qiu, Minghui},
  journal={arXiv preprint arXiv:2501.05901},
  year={2025}
}
```
Owner
- Name: Bytedance Inc.
- Login: bytedance
- Kind: organization
- Location: Singapore
- Website: https://opensource.bytedance.com
- Twitter: ByteDanceOSS
- Repositories: 255
- Profile: https://github.com/bytedance
GitHub Events
Total
- Issues event: 20
- Watch event: 224
- Member event: 1
- Issue comment event: 31
- Push event: 24
- Public event: 1
- Pull request event: 8
- Fork event: 16
- Create event: 2
Last Year
- Issues event: 20
- Watch event: 224
- Member event: 1
- Issue comment event: 31
- Push event: 24
- Public event: 1
- Pull request event: 8
- Fork event: 16
- Create event: 2
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 8
- Total pull requests: 2
- Average time to close issues: 6 days
- Average time to close pull requests: less than a minute
- Total issue authors: 7
- Total pull request authors: 1
- Average comments per issue: 1.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 8
- Pull requests: 2
- Average time to close issues: 6 days
- Average time to close pull requests: less than a minute
- Issue authors: 7
- Pull request authors: 1
- Average comments per issue: 1.5
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Zxr1314 (2)
- Biao-K (1)
- babyta (1)
- RockyF (1)
- GYxiaOH (1)
- chuxiliyixiaosa (1)
- effort-yq (1)
- xtuyzb (1)
Pull Request Authors
- Hyggge (2)
- wuziheng (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- accelerate ==0.34.2
- bert-score ==0.3.13
- byted-matxscript ==1.8.2
- byted-wandb ==0.13.72
- datasets ==2.21.0
- decord ==0.6.0
- deepspeed ==0.9.5
- einops ==0.8.0
- evaluate ==0.4.3
- fastapi ==0.115.0
- flash_attn ==2.7.2.post1
- ftfy ==6.2.3
- markdown2 ==2.5.0
- ninja ==1.11.1.1
- nltk ==3.9.1
- numpy ==1.26.4
- omegaconf ==2.3.0
- openai ==0.28
- opencv-python-headless ==4.10.0.84
- packaging ==24.1
- pandas ==2.2.2
- peft ==0.5.0
- prettytable ==3.11.0
- protobuf ==3.20.3
- pyarrow ==15.0.0
- pydantic ==1.10.14
- qwen_vl_utils *
- requests ==2.32.3
- rouge-score ==0.1.2
- scikit-image ==0.24.0
- scikit-learn ==1.5.2
- sentencepiece ==0.1.97
- timm ==0.6.7
- tokenizers >=0.13.3
- torchmetrics *
- transformers ==4.45.2
- uvicorn ==0.30.6