https://github.com/bjodah/llm-multi-backend-container

Docker/podman container for llama.cpp/vllm/exllamav2 orchestrated using llama-swap

https://github.com/bjodah/llm-multi-backend-container

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.3%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

Docker/podman container for llama.cpp/vllm/exllamav2 orchestrated using llama-swap

Basic Info
  • Host: GitHub
  • Owner: bjodah
  • License: bsd-2-clause
  • Language: Python
  • Default Branch: main
  • Size: 268 KB
Statistics
  • Stars: 16
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed 10 months ago
Metadata Files
Readme License

README.md

llm-multi-backend-container

Use llama-swap inside a container with vllm, llama.cpp, and exllamav2+tabbyAPI.

Adapting for other machines

This repo is my working config, it's mainly used on a 16 core Ryzen machine with 64 GiB RAM and a single RTX 3090. For customization, you might want to grep for a few keywords: console $ git grep -E '\b868[6-8]\b' # port numbers $ git grep sk-empty # API token/key $ git grep -iE '(logging|loglevel|verbos)' $ git grep -E "\b(8\.6|86)\b" # CUDA compute arch, 8.6 == ampere (RTX 3090)

Usage

console $ head ./bin/host-llm-multi-backend-container.sh $ ./bin/host-llm-multi-backend-container.sh --build --force-recreate See what model/backend combinations are available: console $ curl -s -X GET -H "Authorization: Bearer sk-empty" http://localhost:8686/v1/models | jq -r '.data[].id' | grep -i 'qwen2.5-coder-7b' vllm-Qwen2.5-Coder-7B llamacpp-Qwen2.5-Coder-7B exllamav2-Qwen2.5-Coder-7B

Testing

console $ bash -x scripts/test-chat-completions.sh + curl -s -X POST http://localhost:8688/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: Bearer sk-empty' -d '{"model": "llamacpp-glm-4.5-air", "messages": [{"role": "user", "content": "Answer only with the missing word: The capital of Sweden is"}]}' + jq '.choices[0].message.content' "\n<think>We are to answer with only the missing word. The question is: \"The capital of Sweden is\"\n The capital of Sweden is Stockholm. Therefore, the missing word is \"Stockholm\".</think>Stockholm" + retcode=0 + '[' 0 -ne 0 ']' + return 0 + exit

Monitoring

console $ ./scripts/enter-container-llama-swap.sh watch "ps aux | grep -E '(vllm|llama-|tabbyAPI)' | grep -v emacs | grep -v 'grep -E'" $ while true; do clear; date; echo -n "currently loaded model: "; curl -s localhost:8686/running | jq -r '.running[0].model'; echo '...sleeping for 60 seconds'; sleep 60; done $ curl -s localhost:8686/logs/stream/upstream $ curl -s localhost:8686/logs/stream/proxy $ ./scripts/enter-container-llama-swap.sh tail -F /tmp/llama-server-stdout-stderr.log

Working directly with underlying end-point:

console $ curl -H "Authorization: Bearer sk-empty" http://localhost:8686/upstream/llamacpp-Qwen3-30B-A3B/health $ curl -H "Authorization: Bearer sk-empty" http://localhost:8686/upstream/llamacpp-Qwen3-30B-A3B/slots | jq

Downloading models

Downloading Unsloth's Maverick: console $ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF --exclude "*.gguf" $ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF --include "UD-Q2_K_XL/*.gguf" Downloading Unsloth's Q2KXL quants (248 GB) of DeepSeek V3 0324: console $ HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/DeepSeek-V3-0324-GGUF --exclude "*.gguf" \ && HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download unsloth/DeepSeek-V3-0324-GGUF --include "UD-Q2_K_XL/*.gguf"

Unused configurations

deepseek-v3 (I only have 64GB of RAM, which is not enough) ``` # notes: # 1. maybe use: # - https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF # - https://github.com/ikawrakow/ik_llama.cpp/discussions/258 llamacpp-deepseek-v3-0324: cmd: | /opt/llama.cpp/build/bin/llama-server --port ${PORT} --ctx-size 16384 --seed "-1" --prio 2 --temp 0.3 --min-p 0.01 --model /root/.cache/huggingface/hub/models--unsloth--DeepSeek-V3-0324-GGUF/snapshots/b3e19c41e42074be413d73f1d0e1b7f2be9e60c3/UD-IQ2_XXS/DeepSeek-V3-0324-UD-IQ2_XXS-00001-of-00005.gguf # ~219GB for 1..5 --n-gpu-layers 1 --ubatch-size 1 --jinja #--model /root/.cache/huggingface/hub/models--unsloth--DeepSeek-V3-0324-GGUF/snapshots/b3e19c41e42074be413d73f1d0e1b7f2be9e60c3/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf # zombie process after reading 231G (of 248G) proxy: http://127.0.0.1:${PORT} ttl: 3600 ```
24GB of VRAM is not enough for Qwen2.5-VL-32B it seems ``` llamacpp-Qwen2.5-VL-32B: cmd: | /opt/llama.cpp/build/bin/llama-server --port ${PORT} --ctx-size 4096 --cache-type-k q8_0 --cache-type-v q4_0 --flash-attn --n-gpu-layers 64 --hf-repo mradermacher/Qwen2.5-VL-32B-Instruct-i1-GGUF:i1-IQ3_S --temp 0.15 proxy: http://127.0.0.1:${PORT} ttl: 3600 ```
Instead of using vLLM, we could probably use phildougherty python app (see submodule), currently not yet working though... ``` phildougherty-Qwen2.5-VL-7B: cmd: | python3 /phildougherty-qwen-vl-api/app.py --model Qwen2.5-VL-7B-Instruct --port ${PORT} --quant int8 # --quant int4 proxy: http://127.0.0.1:${PORT} ttl: 3600 ```
draft model for QwQ-32B (I need an additional GPU for it to make sense) ``` #--hf-repo-draft mradermacher/Qwen2.5-Coder-0.5B-QwQ-draft-i1-GGUF:Q4_K_M # <-- token 151665 content differs - target '', draft '' --hf-repo-draft bartowski/InfiniAILab_QwQ-0.5B-GGUF:Q8_0 --n-gpu-layers-draft 99 --override-kv tokenizer.ggml.bos_token_id=int:151643 # --draft-max 16 # --draft-min 5 # --draft-p-min 0.5 ```

Tidbits

testing qwen2.5-coder-7b on port 11902 ```console $ ./scripts/host-qwen2.5-coder-7b_localhost_port11902.sh $ env OPENAI_API_BASE=localhost:11902/v1 OPENAI_API_KEY=sk-empty \ ./scripts/test-chat-completions.sh modelnameplaceholder "In python, how do I defer deletion of a specific path to end of program?" \ | jq -r | batcat -pp -l md ```

Owner

  • Name: Bjorn
  • Login: bjodah
  • Kind: user

GitHub Events

Total
  • Watch event: 15
  • Push event: 64
Last Year
  • Watch event: 15
  • Push event: 64

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels