https://github.com/christian-byrne/img2txt-comfyui-nodes
Implements popular img2txt captioning models into ComfyUI nodes
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (16.1%) to scientific vocabulary
Last synced: 9 months ago
·
JSON representation
Repository
Implements popular img2txt captioning models into ComfyUI nodes
Basic Info
Statistics
- Stars: 94
- Watchers: 2
- Forks: 12
- Open Issues: 7
- Releases: 0
Created about 2 years ago
· Last pushed about 1 year ago
Metadata Files
Readme
README.md
Auto-generate caption (BLIP):

Using to automate img2img process (BLIP and Llava)

Requirements/Dependencies
For Llava
bitsandbytes>=0.43.0
accelerate>=0.3.0
For MiniCPM
transformers<=4.41.2
timm>=1.0.7
sentencepiece
Installation
cdintoComfyUI/custom_nodesdirectorygit clonethis repocd img2txt-comfyui-nodespip install -r requirements.txt- Models will be automatically downloaded per-use. If you never toggle a model on in the UI, it will never be downloaded.
- To ask a list of specific questions about the image, use the Llava or MiniPCM models. The questions are separated by line in the multiline text input box.
Support for Chinese
- The
MiniCPMmodel works with Chinese text input without any additional configuration. The output will also be in Chinese.- "MiniCPM-V 2.0 supports strong bilingual multimodal capabilities in both English and Chinese. This is enabled by generalizing multimodal capabilities across languages, a technique from VisCPM"
- Please support the creators of MiniCPM here
Tips
- The multi-line input can be used to ask any type of questions. You can even ask very specific or complex questions about images.
- To get best results for a prompt that will be fed back into a txt2img or img2img prompt, usually it's best to only ask one or two questions, asking for a general description of the image and the most salient features and styles.
Model Locations/Paths
- Models are downloaded automatically using the Huggingface cache system and the transformers
from_pretrainedmethod so no manual installation of models is necessary. - If you really want to manually download the models, please refer to Huggingface's documentation concerning the cache system. Here is the relevant except:
- > Pretrained models are downloaded and locally cached at
~/.cache/huggingface/hub. This is the default directory given by the shell environment variable TRANSFORMERSCACHE. On Windows, the default directory is given byC:\Users\username\.cache\huggingface\hub. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory: > - Shell environment variable (default): HUGGINGFACEHUBCACHE or TRANSFORMERSCACHE. > - Shell environment variable: HFHOME. > - Shell environment variable: XDGCACHE_HOME + /huggingface.
- > Pretrained models are downloaded and locally cached at
Models
- MiniCPM (Chinese & English)
- Title: MiniCPM-V-2 - Strong multimodal large language model for efficient end-side deployment
- Datasets: HuggingFaceM4VQAv2, RLHF-V-Dataset, LLaVA-Instruct-150K
- Size: ~ 6.8GB
- Salesforce - blip-image-captioning-base
- Title: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
- Size: ~ 2GB
- Dataset: COCO (The MS COCO dataset is a large-scale object detection, image segmentation, and captioning dataset published by Microsoft)
- llava - llava-1.5-7b-hf
- Title: LLava: Large Language Models for Vision and Language Tasks
- Size: ~ 15GB
- Dataset: 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP, 158K GPT-generated multimodal instruction-following data, 450K academic-task-oriented VQA data mixture, 40K ShareGPT data. <!-- -(https://huggingface.co/models?pipeline_tag=image-to-text&sort=trending) -->
Prompts
This is the guide for the format of an "ideal" txt2img prompt (using BLIP). Use as the basis for the questions to ask the img2txt models.
- Subject - you can specify region, write the most about the subject
- Medium - material used to make artwork. Some examples are illustration, oil painting, 3D rendering, and photography. Medium has a strong effect because one keyword alone can dramatically change the style.
- Style - artistic style of the image. Examples include impressionist, surrealist, pop art, etc.
- Artists - Artist names are strong modifiers. They allow you to dial in the exact style using a particular artist as a reference. It is also common to use multiple artist names to blend their styles. Now let’s add Stanley Artgerm Lau, a superhero comic artist, and Alphonse Mucha, a portrait painter in the 19th century.
- Website - Niche graphic websites such as Artstation and Deviant Art aggregate many images of distinct genres. Using them in a prompt is a sure way to steer the image toward these styles.
- Resolution - Resolution represents how sharp and detailed the image is. Let’s add keywords highly detailed and sharp focus
- Enviornment
- Additional Details and objects - Additional details are sweeteners added to modify an image. We will add sci-fi, stunningly beautiful and dystopian to add some vibe to the image.
- Composition - camera type, detail, cinematography, blur, depth-of-field
- Color/Warmth - You can control the overall color of the image by adding color keywords. The colors you specified may appear as a tone or in objects.
- Lighting - Any photographer would tell you lighting is a key factor in creating successful images. Lighting keywords can have a huge effect on how the image looks. Let’s add cinematic lighting and dark to the prompt.
Owner
- Name: Christian Byrne
- Login: christian-byrne
- Kind: user
- Location: San Francisco
- Company: Comfy-Org
- Twitter: c__byrne
- Repositories: 100
- Profile: https://github.com/christian-byrne
GitHub Events
Total
- Issues event: 4
- Watch event: 21
- Push event: 1
- Pull request event: 2
- Fork event: 3
Last Year
- Issues event: 4
- Watch event: 21
- Push event: 1
- Pull request event: 2
- Fork event: 3
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 11
- Total pull requests: 3
- Average time to close issues: 4 days
- Average time to close pull requests: 2 days
- Total issue authors: 10
- Total pull request authors: 2
- Average comments per issue: 1.45
- Average comments per pull request: 2.33
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 9
- Pull requests: 3
- Average time to close issues: 6 days
- Average time to close pull requests: 2 days
- Issue authors: 8
- Pull request authors: 2
- Average comments per issue: 0.89
- Average comments per pull request: 2.33
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- dxdpxl (2)
- KennyChan3389 (1)
- radlinsky (1)
- michel-io (1)
- thefirstLeonliao (1)
- r-e-grant (1)
- plhys (1)
- Fox-pix (1)
- flybirdxx (1)
- sxserjio (1)
- yurayko (1)
Pull Request Authors
- haohaocreates (3)
- robinjhuang (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
requirements.txt
pypi
- transformers >=4.35.3
.github/workflows/publish.yml
actions
- Comfy-Org/publish-node-action main composite
- actions/checkout v4 composite
pyproject.toml
pypi
- accelerate >=0.3.0
- bitsandbytes >=0.43.0
- sentencepiece ==0.1.99
- timm ==0.9.10
- transformers >=4.36.0