https://github.com/christian-byrne/img2txt-comfyui-nodes

Implements popular img2txt captioning models into ComfyUI nodes

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (16.1%) to scientific vocabulary

Last synced: 9 months ago · JSON representation

Repository


Basic Info

Host: GitHub
Owner: christian-byrne
Language: Python
Default Branch: master
Homepage:
Size: 3.97 MB

Statistics

Stars: 94
Watchers: 2
Forks: 12
Open Issues: 7
Releases: 0

Created about 2 years ago · Last pushed about 1 year ago

Metadata Files

Readme

README.md

Auto-generate a caption (BLIP): [screenshot]

Automate an img2img process (BLIP and Llava): [screenshot]

Requirements/Dependencies

For Llava

bitsandbytes>=0.43.0
accelerate>=0.3.0

For MiniCPM

transformers<=4.41.2
timm>=1.0.7
sentencepiece
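A minimal sketch of how the version constraints listed above could be checked before enabling a model. The helper names (`parse_version`, `satisfies`, `check_requirements`) are hypothetical and are not part of this repository, and the comparison is a simplified numeric one rather than a full PEP 440 implementation.

```python
import re
from importlib import metadata

# Constraints copied from the lists above; None means "any version".
REQUIREMENTS = {
    "Llava": {"bitsandbytes": (">=", "0.43.0"), "accelerate": (">=", "0.3.0")},
    "MiniCPM": {
        "transformers": ("<=", "4.41.2"),
        "timm": (">=", "1.0.7"),
        "sentencepiece": None,
    },
}


def parse_version(text):
    """Turn '4.41.2' into (4, 41, 2), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed, constraint):
    """Check an installed version string against an (operator, version) pair."""
    if constraint is None:
        return True
    op, required = constraint
    have, need = parse_version(installed), parse_version(required)
    # Pad the shorter tuple with zeros so '0.42' compares like '0.42.0'.
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    if op == ">=":
        return have >= need
    if op == "<=":
        return have <= need
    raise ValueError(f"unsupported operator: {op}")


def check_requirements(model, get_version=metadata.version):
    """Return a list of human-readable problems for the given model's packages."""
    problems = []
    for package, constraint in REQUIREMENTS[model].items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        if not satisfies(installed, constraint):
            op, required = constraint
            problems.append(f"{package}: found {installed}, need {op}{required}")
    return problems
```

Calling `check_requirements("MiniCPM")` in the ComfyUI Python environment returns an empty list when every constraint is met. The `get_version` parameter exists only so the lookup can be replaced in tests.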

Installation

cd into ComfyUI/custom_nodes directory
git clone this repo
cd img2txt-comfyui-nodes
pip install -r requirements.txt
Models are downloaded automatically the first time they are used. If you never toggle a model on in the UI, it is never downloaded.
To ask a list of specific questions about the image, use the Llava or MiniCPM models. Put each question on its own line in the multiline text input box.
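The one-question-per-line convention can be illustrated with a short sketch. How the nodes parse the input internally is not shown here, so the function name `split_questions` and the choice to ignore blank lines are assumptions made for illustration.

```python
def split_questions(text):
    """Return one question per non-empty line of a multiline input."""
    return [line.strip() for line in text.splitlines() if line.strip()]


questions = split_questions(
    "Describe the image in one sentence.\n"
    "\n"
    "What artistic style does it use?\n"
)
# Each entry would be sent to the model as a separate question.
for question in questions:
    print(question)
```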

Support for Chinese

The MiniCPM model works with Chinese text input without any additional configuration. The output will also be in Chinese.
- "MiniCPM-V 2.0 supports strong bilingual multimodal capabilities in both English and Chinese. This is enabled by generalizing multimodal capabilities across languages, a technique from VisCPM"
Please support the creators of MiniCPM here

Tips

The multi-line input accepts any type of question, including very specific or complex questions about the image.
For a caption that will be fed back into a txt2img or img2img prompt, it is usually best to ask only one or two questions: a general description of the image, plus its most salient features and styles.

Model Locations/Paths

Models are downloaded automatically through the Hugging Face cache system and the transformers from_pretrained method, so no manual installation of models is necessary.
If you really want to download the models manually, refer to Hugging Face's documentation on the cache system. Here is the relevant excerpt:
- > Pretrained models are downloaded and locally cached at ~/.cache/huggingface/hub. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is given by C:\Users\username\.cache\huggingface\hub. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory:
  > - Shell environment variable (default): HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE.
  > - Shell environment variable: HF_HOME.
  > - Shell environment variable: XDG_CACHE_HOME + /huggingface.
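The priority order in the excerpt can be sketched as a small lookup, which may help when checking where models end up on disk. This is an illustration, not the library's own code: the function name `resolve_hf_cache_dir` is hypothetical, the variable names use the underscored spellings from the Hugging Face libraries, and appending `hub` under `HF_HOME` and `XDG_CACHE_HOME` is an assumption inferred from the default path `~/.cache/huggingface/hub`.

```python
import os
from pathlib import Path


def resolve_hf_cache_dir(env=None):
    """Resolve the model cache directory following the priority order above."""
    env = os.environ if env is None else env

    # 1. Explicit cache variables take precedence.
    for name in ("HUGGINGFACE_HUB_CACHE", "TRANSFORMERS_CACHE"):
        if env.get(name):
            return Path(env[name])

    # 2. HF_HOME (assumption: models live in its "hub" subdirectory).
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"

    # 3. XDG_CACHE_HOME + /huggingface (same "hub" assumption).
    if env.get("XDG_CACHE_HOME"):
        return Path(env["XDG_CACHE_HOME"]) / "huggingface" / "hub"

    # 4. Default location from the excerpt.
    return Path.home() / ".cache" / "huggingface" / "hub"


print(resolve_hf_cache_dir())
```

Passing a plain dictionary as `env` makes it easy to try the order without changing the real environment, for example `resolve_hf_cache_dir({"HF_HOME": "/data/hf"})`.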

Models

MiniCPM (Chinese & English)
- Title: MiniCPM-V-2 - Strong multimodal large language model for efficient end-side deployment
- Datasets: HuggingFaceM4VQAv2, RLHF-V-Dataset, LLaVA-Instruct-150K
- Size: ~ 6.8GB
Salesforce - blip-image-captioning-base
- Title: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- Size: ~ 2GB
- Dataset: COCO (The MS COCO dataset is a large-scale object detection, image segmentation, and captioning dataset published by Microsoft)
llava - llava-1.5-7b-hf
- Title: LLava: Large Language Models for Vision and Language Tasks
- Size: ~ 15GB
- Dataset: 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP, 158K GPT-generated multimodal instruction-following data, 450K academic-task-oriented VQA data mixture, 40K ShareGPT data.

Prompts

This is a guide to the format of an "ideal" txt2img prompt (using BLIP). Use it as the basis for the questions you ask the img2txt models.

Subject - you can specify a region; write the most about the subject.
Medium - the material used to make the artwork. Some examples are illustration, oil painting, 3D rendering, and photography. Medium has a strong effect because one keyword alone can dramatically change the style.
Style - the artistic style of the image. Examples include impressionist, surrealist, pop art, etc.
Artists - artist names are strong modifiers. They allow you to dial in the exact style using a particular artist as a reference. It is also common to use multiple artist names to blend their styles. Examples: Stanley Artgerm Lau, a superhero comic artist, and Alphonse Mucha, a portrait painter in the 19th century.
Website - niche graphic websites such as Artstation and Deviant Art aggregate many images of distinct genres. Using them in a prompt is a sure way to steer the image toward these styles.
Resolution - how sharp and detailed the image is. Example keywords: highly detailed, sharp focus.
Environment
Additional details and objects - sweeteners added to modify an image. Example keywords that add some vibe: sci-fi, stunningly beautiful, dystopian.
Composition - camera type, detail, cinematography, blur, depth-of-field.
Color/Warmth - you can control the overall color of the image by adding color keywords. The colors you specify may appear as a tone or in objects.
Lighting - any photographer would tell you lighting is a key factor in creating successful images. Lighting keywords can have a huge effect on how the image looks. Example keywords: cinematic lighting, dark.
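One way to turn the categories above into input for the Llava or MiniCPM nodes is to keep one question per category and join the selected ones line by line. The sketch below is only an illustration: the question wording, the `PROMPT_QUESTIONS` mapping, and the `build_question_block` helper are all hypothetical and do not come from this repository.

```python
# Hypothetical question wording, one entry per prompt category.
PROMPT_QUESTIONS = {
    "subject": "What is the main subject of the image?",
    "medium": "What medium was used to make the image?",
    "style": "What artistic style does the image have?",
    "environment": "What environment or setting is shown?",
    "composition": "How is the image composed (camera, blur, depth of field)?",
    "color": "What are the dominant colors and overall warmth?",
    "lighting": "How is the scene lit?",
}


def build_question_block(categories):
    """Join the questions for the chosen categories, one per line."""
    unknown = [name for name in categories if name not in PROMPT_QUESTIONS]
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return "\n".join(PROMPT_QUESTIONS[name] for name in categories)


# Keeping the selection short tends to work best for prompts that are
# fed back into txt2img or img2img.
print(build_question_block(["subject", "style"]))
```

The resulting text can be pasted into the multiline input box, where each line is treated as a separate question.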

Owner

Name: Christian Byrne
Login: christian-byrne
Kind: user
Location: San Francisco
Company: Comfy-Org

Twitter: c__byrne
Repositories: 100
Profile: https://github.com/christian-byrne

GitHub Events

Total

Issues event: 4
Watch event: 21
Push event: 1
Pull request event: 2
Fork event: 3

Last Year

Issues event: 4
Watch event: 21
Push event: 1
Pull request event: 2
Fork event: 3

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 11
Total pull requests: 3
Average time to close issues: 4 days
Average time to close pull requests: 2 days
Total issue authors: 10
Total pull request authors: 2
Average comments per issue: 1.45
Average comments per pull request: 2.33
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 9
Pull requests: 3
Average time to close issues: 6 days
Average time to close pull requests: 2 days
Issue authors: 8
Pull request authors: 2
Average comments per issue: 0.89
Average comments per pull request: 2.33
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

dxdpxl (2)
KennyChan3389 (1)
radlinsky (1)
michel-io (1)
thefirstLeonliao (1)
r-e-grant (1)
plhys (1)
Fox-pix (1)
flybirdxx (1)
sxserjio (1)
yurayko (1)

Pull Request Authors

haohaocreates (3)
robinjhuang (2)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

requirements.txt pypi

transformers >=4.35.3

.github/workflows/publish.yml actions

Comfy-Org/publish-node-action main composite
actions/checkout v4 composite

pyproject.toml pypi

accelerate >=0.3.0
bitsandbytes >=0.43.0
sentencepiece ==0.1.99
timm ==0.9.10
transformers >=4.36.0
