LlamaCppOutlines

Llama Cpp and Outlines wrapper in one tight package.

https://github.com/krishnaveti/llamacppoutlines.jl

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.9%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: krishnaveti
  • License: mit
  • Language: Julia
  • Default Branch: master
  • Homepage:
  • Size: 43.3 MB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created 12 months ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

LlamaCppOutlines.jl

A Julia package for LLaMA inference with structured output generation using Outlines constraints.

Features

  • LLaMA Model Inference: Basic text generation with sampling
  • Multimodal Support: Text + image generation with multimodal models
  • Constrained Generation: Structured output using JSON schema constraints
  • Enhanced Sampling: Multiple sampling strategies (greedy, top-k, top-p, temperature)
  • LoRA Support: Dynamic adapter loading and switching
  • GPU Support: Automatic CUDA detection

Installation

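From the Julia REPL, press ] to enter Pkg mode and add the registered package:

    julia> ] add LlamaCppOutlines

or install it directly from GitHub:

    julia> using Pkg; Pkg.add(url="https://github.com/krishnaveti/LlamaCppOutlines.jl")

Warning: Current support is for Windows only.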

Quick Start

```julia
using LlamaCppOutlines

# Initialize the APIs (Windows; use init_apis_not_windows!() on Linux/macOS)
init_apis!()

# Load a model
model, model_context, vocab = load_and_initialize("path/to/model.gguf")

# Basic text generation
result = generate_with_sampling(
    "What is the capital of France?";
    model=model, model_context=model_context, vocab=vocab, max_new_tokens=50,
)

# Constrained generation with a JSON schema
schema = Dict(
    "type" => "object",
    "properties" => Dict(
        "city" => Dict("type" => "string"),
        "country" => Dict("type" => "string"),
    ),
    "required" => ["city", "country"],
)

# The third argument is the HuggingFace tokenizer name; it should correspond
# to the GGUF model loaded above
result = greedy_constrained_generation(
    "What is the capital of France?", schema, "google/gemma-2-2b-it";
    model_context=model_context, vocab=vocab,
)
```
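Constrained decoding is meant to produce output that matches the schema, but it can still be worth checking a result once it has been parsed. The sketch below is a hypothetical helper, not part of LlamaCppOutlines. It assumes the generated JSON text has already been parsed into a Dict, for example with a JSON package such as JSON3.jl. It handles only the subset of JSON Schema used above: required keys and "string"-typed properties.

```julia
# Hypothetical helper (not part of LlamaCppOutlines): check that a parsed
# result Dict satisfies a simple object schema like the Quick Start one.
# Only "required" keys and "string"-typed "properties" are checked.
function conforms(obj::AbstractDict, schema::AbstractDict)
    # Every required key must be present
    for key in get(schema, "required", String[])
        haskey(obj, key) || return false
    end
    # Every present property declared as "string" must hold a string
    for (key, spec) in get(schema, "properties", Dict())
        haskey(obj, key) || continue
        if get(spec, "type", nothing) == "string" && !(obj[key] isa AbstractString)
            return false
        end
    end
    return true
end
```

For the Quick Start schema, conforms(Dict("city" => "Paris", "country" => "France"), schema) returns true. A Dict missing "country" returns false.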

Multimodal Usage

```julia
# Load a multimodal model together with its projection file
model, model_context, mtmd_context, vocab = load_and_initialize_mtmd(
    "path/to/model.gguf", "path/to/projection.gguf",
)

# Generate text from an image; the <media> marker marks where the image goes
result = generate_mtmd_with_sampling(
    "Describe this image: <media>", ["path/to/image.jpg"];
    model=model, model_context=model_context, vocab=vocab, mtmd_context=mtmd_context,
)
```
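The example prompt uses a <media> marker to show where the image belongs. If each image path needs its own marker (our understanding is that llama.cpp's multimodal tokenizer rejects a mismatch), a quick pre-flight check can catch mistakes before the model runs. count_markers and check_media_prompt are hypothetical helpers. The marker string comes from the example above, not from the package's source. The code needs Julia 1.7 or later for count on substrings.

```julia
# Hypothetical helpers (not part of LlamaCppOutlines): ensure the prompt has
# exactly one media marker per image path before calling the model.
count_markers(prompt::AbstractString; marker="<media>") = count(marker, prompt)

function check_media_prompt(prompt::AbstractString,
                            img_paths::AbstractVector{<:AbstractString};
                            marker="<media>")
    n = count_markers(prompt; marker=marker)
    n == length(img_paths) || throw(ArgumentError(
        "prompt has $n \"$marker\" markers but $(length(img_paths)) image paths"))
    return true
end
```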

Enhanced Sampling

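```julia
# Use a preset sampling strategy (prompt, schema and tokenizer defined as in Quick Start)
creative_result = enhanced_constrained_generation(
    prompt, schema, tokenizer;
    model_context=model_context, vocab=vocab, sampling_params=creative_params(),
)

# Or create custom sampling parameters and pass them the same way
custom_params = SamplingParams(
    temperature=0.9f0, top_k=50, top_p=0.95f0, repeat_penalty=1.1f0,
)
custom_result = enhanced_constrained_generation(
    prompt, schema, tokenizer;
    model_context=model_context, vocab=vocab, sampling_params=custom_params,
)
```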
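To make the SamplingParams fields concrete, here is a minimal pure-Julia sketch of what a repeat penalty, top-k, top-p and temperature do to a vector of logits. It is not the package's code, since llama.cpp applies its own sampler chain, and the function names are hypothetical. The order used here is top-k, then top-p, then temperature, then a random draw. This is one reasonable choice and, to our understanding, resembles llama.cpp's default chain.

```julia
using Random

# Hypothetical sketch of common sampling steps over a vector of logits.
# Not LlamaCppOutlines' implementation: llama.cpp runs its own sampler chain.

# llama.cpp-style repeat penalty: positive logits of recently seen tokens are
# divided by `penalty`, non-positive ones multiplied, so repeats become less likely.
function apply_repeat_penalty!(logits::Vector{Float32}, recent::Vector{Int}, penalty::Float32)
    for t in unique(recent)
        logits[t] = logits[t] > 0 ? logits[t] / penalty : logits[t] * penalty
    end
    return logits
end

# Numerically stable softmax.
function softmax(x::Vector{Float32})
    e = exp.(x .- maximum(x))
    return e ./ sum(e)
end

function sample_token(logits::Vector{Float32}; temperature=0.8f0, top_k=40,
                      top_p=0.95f0, rng=Random.default_rng())
    # Temperature 0 (or below) degenerates to greedy decoding: pick the argmax.
    temperature <= 0 && return argmax(logits)

    # Top-k: keep the k highest-scoring token ids (k <= 0 keeps all of them).
    order = sortperm(logits; rev=true)
    k = top_k > 0 ? min(top_k, length(order)) : length(order)
    kept = order[1:k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability
    # reaches top_p; fall back to all candidates if rounding prevents that.
    cdf = cumsum(softmax(logits[kept]))
    kept = kept[1:something(findfirst(>=(top_p), cdf), length(kept))]

    # Temperature scaling of the survivors, then one random draw.
    probs = softmax(logits[kept] ./ Float32(temperature))
    r = rand(rng, Float32)
    return kept[something(findfirst(>=(r), cumsum(probs)), length(kept))]
end
```

A lower temperature sharpens the distribution toward the top candidate. Smaller top_k or top_p values shrink the candidate set. A temperature of 0 reduces to the deterministic behaviour that greedy_params() advertises.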

LoRA Training and Adaptation

```julia
# Train a LoRA adapter from scratch and convert it to GGUF
gguf_path = train_lora_to_gguf(
    "google/gemma-2-2b-it", "your_hf_token_here";
    output_dir="my_lora_models",
)

# Load the base model (the model the adapter was trained on) and the adapter
model, model_context, vocab = load_and_initialize("path/to/model.gguf")
adapter = load_lora_adapter(model, gguf_path)

# Apply the adapter to the context at full strength (a return value of 0 means success)
result = set_adapter_lora(model_context, adapter, 1.0f0)
if result == 0
    println("LoRA adapter applied successfully")
end

# Generate with LoRA adaptation
response = generate_with_sampling(
    "Solve this economics problem: ...";
    model=model, model_context=model_context, vocab=vocab, max_new_tokens=100,
)

# Remove the adapter from the context, then free its memory
rm_adapter_lora(model_context, adapter)
free_lora_adapter(adapter)
```
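Conceptually, the scale passed to set_adapter_lora weights a low-rank update that is added to the base weights. The sketch below shows that arithmetic for a single weight matrix under the standard LoRA formulation, W' = W + scale * (B * A). It is an illustration only. llama.cpp applies adapters inside its compute graph, and real adapters usually also fold in an alpha/r factor, which is omitted here.

```julia
using LinearAlgebra

# Illustrative LoRA arithmetic (not the package's or llama.cpp's code).
# W: d×k base weight; B: d×r and A: r×k are the low-rank adapter factors.
# scale = 1 applies the full update, 0 leaves the base weight unchanged.
lora_weight(W, A, B, scale) = W .+ scale .* (B * A)

# The same effect on an input vector, without materializing B*A:
# W*x + scale * B*(A*x), which is cheaper when r is small.
lora_forward(W, A, B, scale, x) = W * x .+ scale .* (B * (A * x))
```

Removing the adapter, as rm_adapter_lora does, corresponds to going back to W alone. A fractional scale such as 0.8f0 blends in only part of the update.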

LoRA Manager for Multiple Adapters

```julia
# Create a LoRA manager for handling multiple adapters
manager = LoRAManager(model)

# Load multiple adapters
load_adapter!(manager, "economics", "path/to/econ_adapter.gguf")
load_adapter!(manager, "creative", "path/to/creative_adapter.gguf")

# List available adapters
adapters = list_adapters(manager)
println("Available adapters: ", adapters)

# Switch between adapters
switch_adapter!(manager, model_context, "economics", scale=1.0f0)
# ... generate economics content

switch_adapter!(manager, model_context, "creative", scale=0.8f0)
# ... generate creative content

# Clear the active adapter
clear_active_adapter!(manager, model_context)

# Clean up all adapters
free_all_adapters!(manager)
```
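The manager's job is bookkeeping: a name-to-adapter table plus, as the API names suggest, at most one active adapter at a time. The following toy model captures only that bookkeeping. ToyLoRAManager, toy_load!, toy_switch!, toy_clear! and toy_names are hypothetical names. The toy stores file paths instead of real adapter handles and never touches a llama.cpp context.

```julia
# Hypothetical toy model of adapter bookkeeping (not the package's LoRAManager).
mutable struct ToyLoRAManager
    adapters::Dict{String,String}    # adapter name => adapter file path
    active::Union{Nothing,String}    # name of the active adapter, if any
    scale::Float32                   # scale of the active adapter
end
ToyLoRAManager() = ToyLoRAManager(Dict{String,String}(), nothing, 0f0)

# Register an adapter under a short name.
toy_load!(m::ToyLoRAManager, name::String, path::String) = (m.adapters[name] = path; m)

# Make `name` the single active adapter; the previous one is implicitly replaced.
function toy_switch!(m::ToyLoRAManager, name::String; scale=1.0f0)
    haskey(m.adapters, name) || throw(KeyError(name))
    m.active = name
    m.scale = scale
    return m
end

# Deactivate without forgetting the loaded adapters.
toy_clear!(m::ToyLoRAManager) = (m.active = nothing; m.scale = 0f0; m)

# Sorted list of registered adapter names.
toy_names(m::ToyLoRAManager) = sort!(collect(keys(m.adapters)))
```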

API Reference

Core Functions

  • init_apis!(): Initialize all API libraries (Windows)
  • init_apis_not_windows!(): Initialize all API libraries (Linux/macOS)
  • load_and_initialize(model_path; ...): Load LLaMA model for text generation
  • load_and_initialize_mtmd(model_path, proj_path; ...): Load multimodal model
  • generate_with_sampling(prompt; ...): Basic text generation
  • generate_mtmd_with_sampling(prompt, img_paths; ...): Multimodal generation

Constrained Generation

  • greedy_constrained_generation(prompt, schema, tokenizer; ...): Greedy constrained output
  • enhanced_constrained_generation(prompt, schema, tokenizer; ...): Enhanced sampling with constraints
  • greedy_mtmd_constrained_generation(prompt, img_paths, schema, tokenizer; ...): Multimodal constrained output
  • enhanced_mtmd_constrained_generation(prompt, img_paths, schema, tokenizer; ...): Enhanced multimodal sampling
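All four constrained functions rest on the same idea. At each decoding step, tokens that would make the output violate the constraint are masked out before a token is chosen. Outlines compiles the JSON schema into an automaton over the tokenizer's vocabulary to compute that mask. The toy sketch below replaces the automaton with a hypothetical rule (the output must be one of a few target strings) and uses a fake logits function. The mask-then-argmax loop itself is the greedy variant of the technique.

```julia
# Toy greedy constrained decoding by logit masking (not the package's code).
# `logits_fn(prefix)` is a hypothetical model returning one logit per vocab entry;
# the constraint: the final output must equal one of `targets`.

# true for tokens that keep `prefix * token` a prefix of some target
function allowed_mask(prefix::String, vocab::Vector{String}, targets::Vector{String})
    return [any(t -> startswith(t, prefix * tok), targets) for tok in vocab]
end

function greedy_constrained(logits_fn, vocab::Vector{String}, targets::Vector{String};
                            max_steps=16)
    out = ""
    for _ in 1:max_steps
        out in targets && break                # a complete valid output exists
        mask = allowed_mask(out, vocab, targets)
        any(mask) || error("no valid continuation for \"$out\"")
        logits = copy(logits_fn(out))
        logits[.!mask] .= -Inf32               # forbid constraint-breaking tokens
        out *= vocab[argmax(logits)]           # greedy pick among allowed tokens
    end
    return out
end
```

Without the mask, the fake model in the test below would emit "!" first. With the mask it can only produce "Paris" or "London".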

Sampling Parameters

  • SamplingParams(; ...): Create custom sampling configuration
  • creative_params(): High creativity settings
  • balanced_params(): Balanced settings (default)
  • focused_params(): Low creativity, focused output
  • greedy_params(): Deterministic output

LoRA Functions

  • train_lora_to_gguf(model_name, hf_token; ...): Train LoRA adapter and convert to GGUF
  • load_lora_adapter(model, adapter_path): Load LoRA adapter from GGUF file
  • free_lora_adapter(adapter): Free LoRA adapter from memory
  • set_adapter_lora(ctx, adapter, scale): Set LoRA adapter on context
  • rm_adapter_lora(ctx, adapter): Remove LoRA adapter from context
  • clear_adapter_lora(ctx): Clear all LoRA adapters from context

LoRA Manager

  • LoRAManager(model): Create LoRA manager for multiple adapters
  • load_adapter!(manager, name, path): Load adapter into manager
  • switch_adapter!(manager, ctx, name; scale=1.0f0): Switch to specific adapter
  • clear_active_adapter!(manager, ctx): Clear active adapter
  • free_all_adapters!(manager): Free all managed adapters
  • list_adapters(manager): List all loaded adapters
  • get_active_adapter(manager): Get name of active adapter
  • list_lora_gguf_files(directory="lora_gguf"): List GGUF files in directory
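list_lora_gguf_files scans a directory for adapter files. A hypothetical stand-in that does not need the package loaded might look like this. This section does not say whether the real function returns bare file names or full paths, or whether it recurses. This version simply returns sorted full paths from the top level only.

```julia
# Hypothetical stand-in for list_lora_gguf_files (not the package's code):
# full paths of regular *.gguf files directly inside `directory`, sorted by name.
function find_gguf_files(directory::AbstractString="lora_gguf")
    isdir(directory) || return String[]   # a missing directory yields no files
    names = filter(readdir(directory)) do f
        endswith(lowercase(f), ".gguf") && isfile(joinpath(directory, f))
    end
    return [joinpath(directory, f) for f in sort(names)]
end
```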

Requirements

System Dependencies

  • llama.cpp: prebuilt binaries delivered as a Julia artifact
  • outlines-core: prebuilt binaries delivered as a Julia artifact

⚠️ Linux/macOS users: this package currently ships Windows binaries only. Support for other platforms is planned.

Authentication Requirements

  • HuggingFace Token: Required for full API functionality
    • Needed for constrained generation with HuggingFace tokenizers
    • Required for LoRA training with gated models (e.g., Gemma, Llama)
    • Required for vocabulary creation from HuggingFace models
    • Get your token at: https://huggingface.co/settings/tokens
    • Pass it as the hf_token argument of the relevant functions
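Instead of hard-coding the token, as the placeholder in the LoRA training example does, you can read it from an environment variable. This is a hypothetical helper. The variable name HF_TOKEN is an assumption, chosen because the huggingface_hub library conventionally reads that name.

```julia
# Hypothetical helper: fetch a HuggingFace token from the environment
# (variable name HF_TOKEN assumed), failing loudly if it is unset or blank.
function hf_token_from_env(var::AbstractString="HF_TOKEN")
    tok = strip(get(ENV, var, ""))
    isempty(tok) && error("Set the $var environment variable to your HuggingFace token")
    return String(tok)
end
```

You could then pass it wherever a token is expected, e.g. train_lora_to_gguf("google/gemma-2-2b-it", hf_token_from_env()).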

Optional Dependencies

For image processing (multimodal functionality):

    julia> ] add Images FileIO ImageMagick

For audio processing:

    julia> ] add LibSndFile

For LoRA training (train_lora_to_gguf installs these Python packages automatically):

    pip install torch transformers peft datasets huggingface_hub python-dotenv accelerate

On Linux/macOS, initialize the libraries with init_apis_not_windows!() instead of init_apis!(). Once initialized, all functionality works identically to Windows.

Directory Structure

    LlamaCppOutlines/
    ├── src/
    │   ├── LlamaCppOutlines.jl          # Main module
    │   └── constrained_generation.jl    # Constrained generation functions
    ├── lib/
    │   ├── llama_api.jl                 # LLaMA API bindings
    │   ├── mtmd_api.jl                  # Multimodal API bindings
    │   ├── outlines_api.jl              # Outlines API bindings
    │   └── lora_to_gguf.py              # LoRA training script
    ├── test/
    │   ├── runtests.jl                  # Test runner
    │   ├── test_basic_api.jl            # Basic API tests
    │   ├── test_multimodal_api.jl       # Multimodal tests
    │   ├── test_outlines_api.jl         # Outlines tests
    │   └── test_integration.jl          # Integration tests
    ├── Project.toml
    └── Artifacts.toml

Testing

    julia> ] test LlamaCppOutlines

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite
  6. Submit a pull request

License

This project is licensed under the MIT License.

Owner

  • Login: krishnaveti
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: >
  If you use LlamaCppOutlines.jl in your work and find it helpful, please cite it using the following metadata.
title: LlamaCppOutlines.jl
version: 0.1.0
doi: 10.5281/zenodo.15857210
date-released: 2025-07-10
authors:
  - family-names: K S
    given-names: Vikram
    orcid: https://orcid.org/0009-0008-3118-411X
    affiliation: University of Cincinnati
repository-code: https://github.com/krishnaveti/llamacpp-outlines
license: MIT
keywords:
  - Julia
  - LLaMA
  - GGUF
  - multimodal
  - Outlines
  - LoRA
  - inference
  - structured generation
  - constrained decoding
  - large language models

GitHub Events

Total
  • Create event: 3
  • Commit comment event: 15
  • Release event: 3
  • Watch event: 1
  • Delete event: 5
  • Push event: 16
Last Year
  • Create event: 3
  • Commit comment event: 15
  • Release event: 3
  • Watch event: 1
  • Delete event: 5
  • Push event: 16

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
juliahub.com: LlamaCppOutlines

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 0 Total
Rankings
Dependent repos count: 8.3%
Average: 21.9%
Dependent packages count: 35.5%
Last synced: 10 months ago