bizcon

LLM benchmark for business conversations

https://github.com/olib-ai/bizcon

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary

Repository

LLM benchmark for business conversations

Basic Info
  • Host: GitHub
  • Owner: Olib-AI
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 209 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 9 months ago · Last pushed 9 months ago
Metadata Files
Readme Contributing License Citation

README.md

bizCon: Business Conversation Evaluation Framework for LLMs

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![Tests](https://img.shields.io/badge/tests-passing-brightgreen.svg)](https://github.com/Olib-AI/bizcon/actions) [![GitHub Issues](https://img.shields.io/github/issues/Olib-AI/bizcon)](https://github.com/Olib-AI/bizcon/issues) [![GitHub Stars](https://img.shields.io/github/stars/Olib-AI/bizcon)](https://github.com/Olib-AI/bizcon/stargazers)

**A comprehensive open-source framework for benchmarking Large Language Models on business conversation capabilities**

[🚀 Quick Start](#-quick-start) • [📖 Documentation](#-documentation) • [📊 Sample Results](#-sample-results) • [🤝 Contributing](#-contributing) • [💬 Community](#-community)

📋 Table of Contents

  • [🎯 Overview](#-overview)
  • [✨ Key Features](#-key-features)
  • [🚀 Quick Start](#-quick-start)
  • [📖 Documentation](#-documentation)
  • [📊 Sample Results](#-sample-results)
  • [🏗️ Advanced Usage](#️-advanced-usage)
  • [🤝 Contributing](#-contributing)
  • [🧪 Testing & Validation](#-testing--validation)
  • [💬 Community](#-community)
  • [📈 Roadmap](#-roadmap)

🎯 Overview

bizCon is a specialized evaluation framework designed to benchmark Large Language Models (LLMs) on realistic business conversation scenarios. Unlike generic benchmarks, bizCon focuses on practical business use cases involving professional communication, tool integration, and domain-specific knowledge.

Why bizCon?

  • Business-Focused: Evaluates models on real-world business scenarios
  • Multi-Dimensional: Assesses 5 key aspects of business communication
  • Tool Integration: Tests models' ability to use business tools effectively
  • Comparative Analysis: Benchmark multiple models side-by-side
  • Enterprise-Ready: Professional reporting and analysis capabilities

✨ Key Features

🎭 Diverse Business Scenarios

  • Product Inquiries: Enterprise software consultations
  • Technical Support: Complex troubleshooting and API integration
  • Contract Negotiation: SaaS agreements and enterprise deals
  • Appointment Scheduling: Multi-stakeholder coordination
  • Compliance Inquiries: Regulatory and data privacy questions
  • Implementation Planning: Software deployment strategies
  • Service Complaints: Customer service and dispute resolution
  • Multi-Department: Cross-functional project coordination

📊 Comprehensive Evaluation Metrics

  1. Response Quality (25%) - Factual accuracy and completeness
  2. Business Value (25%) - Strategic insight and actionable recommendations
  3. Communication Style (20%) - Professionalism and tone appropriateness
  4. Tool Usage (20%) - Effective integration with business tools
  5. Performance (10%) - Response time and efficiency
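The weighting scheme above can be sketched in a few lines. The `overall_score` helper below is illustrative rather than bizCon's actual API, but the evaluator names, weights, and 0-10 scale follow this README.

```python
# Sketch of the weighted scoring scheme described above.
# The evaluator names, weights, and 0-10 scale come from the README;
# the helper function itself is illustrative, not bizCon's API.

WEIGHTS = {
    "response_quality": 0.25,
    "business_value": 0.25,
    "communication_style": 0.20,
    "tool_usage": 0.20,
    "performance": 0.10,
}

def overall_score(scores):
    """Combine per-evaluator scores (each 0-10) into a weighted overall score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Example: hypothetical per-evaluator scores for one model run
scores = {
    "response_quality": 8.5,
    "business_value": 8.1,
    "communication_style": 9.0,
    "tool_usage": 7.8,
    "performance": 8.0,
}
print(round(overall_score(scores), 2))  # weighted mean on the 0-10 scale
```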

🛠️ Business Tool Ecosystem

  • Knowledge Base Search
  • Product Catalog Lookup
  • Pricing Calculator
  • Appointment Scheduler
  • Customer History Access
  • Document Retrieval
  • Order Management
  • Support Ticket System
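To make the tool ecosystem concrete, here is a toy stand-in for one of the tools above. The class and method names are hypothetical, not bizCon's actual tool interface; they only show the request/response shape such a tool might have.

```python
# Hypothetical sketch of a business tool; class and method names are
# illustrative only and do NOT reflect bizCon's actual tool API.
from dataclasses import dataclass, field

@dataclass
class MockKnowledgeBase:
    """Toy stand-in for the knowledge-base search tool."""
    articles: dict = field(default_factory=lambda: {
        "sso setup": "Configure SAML under Settings > Security.",
        "api limits": "Default rate limit is 100 requests/minute.",
    })

    def call(self, parameters: dict) -> dict:
        """Return articles whose title contains the query string."""
        query = parameters.get("query", "").lower()
        hits = {k: v for k, v in self.articles.items() if query in k}
        return {"tool_id": "knowledge_base", "results": hits}

kb = MockKnowledgeBase()
print(kb.call({"query": "sso"}))
```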

🤖 Multi-Model Support

| 🤖 OpenAI | 🧠 Anthropic | 🌟 Mistral AI |
|-----------|--------------|----------------|
| GPT-4 | Claude-3-opus | Mistral-large |
| GPT-3.5-turbo | Claude-3-sonnet | Mistral-medium |
| GPT-4-turbo | Claude-3-haiku | Mistral-small |

🚀 Quick Start

Installation

```bash
# Clone the repository
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon

# Basic installation
pip install -e .

# Install with advanced visualization features (use quotes for zsh)
pip install -e ".[advanced]"

# Install all optional features
pip install -e ".[all]"
```

Basic Usage

  1. Set up your API keys:

     ```bash
     export OPENAI_API_KEY="your-openai-key"
     export ANTHROPIC_API_KEY="your-anthropic-key"
     export MISTRAL_API_KEY="your-mistral-key"
     ```

  2. Run a quick test:

     ```bash
     # 🚀 Test without API keys (uses mock models)
     python test_framework.py

     # 🧪 Run unit and integration tests
     python -m pytest tests/

     # 🤖 Test with real models (requires API keys)
     python test_with_real_models.py
     ```

  3. Run a benchmark:

     ```bash
     # 📊 Compare models on specific scenarios
     python run.py --scenarios product_inquiry_001 support_001 --verbose

     # 🏃 Run full benchmark with custom config
     python run.py --config config/models.yaml --output results/

     # 💻 Using CLI interface directly
     bizcon run --config config/models.yaml --output results/
     ```

  4. Explore available options:

     ```bash
     # 📋 List all available scenarios
     python run.py --list-scenarios    # or: bizcon list-scenarios

     # 🤖 List supported models
     python run.py --list-models       # or: bizcon list-models
     ```

Configuration

Customize your evaluation in config/models.yaml:

```yaml
models:
  - provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 2048
  - provider: anthropic
    name: claude-3-sonnet
    temperature: 0.7
    max_tokens: 2048
```

Adjust evaluation settings in config/evaluation.yaml:

```yaml
evaluation:
  parallel: true
  num_runs: 3
  evaluator_weights:
    response_quality: 0.25
    business_value: 0.25
    communication_style: 0.20
    tool_usage: 0.20
    performance: 0.10
```

📖 Documentation

Project Structure

```
bizcon/
├── config/                    # Configuration files
│   ├── models.yaml            # Model configurations
│   └── evaluation.yaml        # Evaluation settings
├── core/                      # Core evaluation pipeline
│   ├── pipeline.py            # Main evaluation orchestrator
│   └── runner.py              # Scenario execution engine
├── models/                    # LLM provider integrations
│   ├── openai.py              # OpenAI client
│   ├── anthropic.py           # Anthropic client
│   └── mistral.py             # Mistral AI client
├── scenarios/                 # Business conversation scenarios
│   ├── product_inquiry.py
│   ├── technical_support.py
│   └── contract_negotiation.py
├── evaluators/                # Evaluation metrics
│   ├── response_quality.py
│   ├── business_value.py
│   └── communication_style.py
├── tools/                     # Business tool implementations
│   ├── knowledge_base.py
│   ├── scheduler.py
│   └── product_catalog.py
├── visualization/             # Advanced visualization and reporting
│   ├── charts.py              # Static matplotlib charts
│   ├── interactive_charts.py  # Interactive Plotly charts
│   ├── dashboard.py           # Basic Flask dashboard
│   ├── advanced_dashboard.py  # Advanced dashboard with filtering
│   ├── analysis_utils.py      # Statistical analysis tools
│   └── report.py              # Report generation
└── data/                      # Sample business data
    ├── knowledge_base/
    ├── products/
    └── pricing/
```

Creating Custom Scenarios

```python
from scenarios.base import BusinessScenario

class CustomBusinessScenario(BusinessScenario):
    def __init__(self, scenario_id=None):
        super().__init__(
            scenario_id=scenario_id or "custom_001",
            name="Custom Business Scenario",
            description="Your custom scenario description",
            industry="technology",
            complexity="medium",
            tools_required=["knowledge_base", "scheduler"]
        )

    def _initialize_conversation(self):
        return [{
            "user_message": "Your initial customer message",
            "expected_tool_calls": [
                {"tool_id": "knowledge_base", "parameters": {"query": "example"}}
            ]
        }]

    def _initialize_ground_truth(self):
        return {
            "expected_facts": ["Key fact 1", "Key fact 2"],
            "business_objective": "Help customer achieve X",
            "expected_tone": "professional"
        }
```

Adding Custom Evaluators

```python
from evaluators.base import BaseEvaluator

class CustomEvaluator(BaseEvaluator):
    def __init__(self, weight=1.0):
        super().__init__(name="Custom Evaluator", weight=weight)

    def evaluate(self, response, scenario, turn_index, conversation_history, tool_calls):
        # Your evaluation logic here
        score = self.calculate_score(response)
        return {
            "score": score,
            "explanation": "Detailed explanation of the score",
            "max_possible": 10.0
        }
```

📊 Sample Results

Overall Model Performance

```
┌─────────────────┬─────────┬──────────┬──────────┬───────────────┬────────────┬─────────────┐
│ Model           │ Overall │ Response │ Business │ Communication │ Tool Usage │ Performance │
│                 │ Score   │ Quality  │ Value    │ Style         │            │             │
├─────────────────┼─────────┼──────────┼──────────┼───────────────┼────────────┼─────────────┤
│ gpt-4           │ 8.2/10  │ 8.5/10   │ 8.1/10   │ 9.0/10        │ 7.8/10     │ 8.0/10      │
│ claude-3-sonnet │ 7.9/10  │ 8.2/10   │ 7.8/10   │ 8.8/10        │ 7.5/10     │ 7.2/10      │
│ claude-3-haiku  │ 7.1/10  │ 7.3/10   │ 6.9/10   │ 8.0/10        │ 6.8/10     │ 8.5/10      │
│ gpt-3.5-turbo   │ 6.8/10  │ 6.5/10   │ 6.2/10   │ 7.5/10        │ 6.0/10     │ 7.8/10      │
└─────────────────┴─────────┴──────────┴──────────┴───────────────┴────────────┴─────────────┘
```

Success Rates by Category

  • **GPT-4**: Response Quality (89%), Tool Usage (78%), Communication Style (90%)
  • **Claude-3-Sonnet**: Response Quality (86%), Tool Usage (75%), Communication Style (88%)
  • **Claude-3-Haiku**: Response Quality (73%), Tool Usage (68%), Communication Style (80%)

Report Outputs

  • 📊 **Interactive HTML Report**: Charts, breakdowns, and detailed analysis
  • 📈 **CSV Data Export**: Raw scores for custom analysis and visualization
  • 📝 **Markdown Summary**: Professional reports for sharing and documentation
  • 🎯 **Success Rate Analysis**: Model performance across business scenarios
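The CSV score export can be loaded with the standard library for custom analysis. The column names in this sketch are assumptions, so check the headers of an actual export first.

```python
# Sketch of loading the CSV score export for custom analysis.
# Column names ("model", "scenario", "overall_score") are assumptions;
# verify them against a real bizCon export before relying on this.
import csv
import io
import statistics

csv_text = """model,scenario,overall_score
gpt-4,product_inquiry_001,8.2
gpt-4,support_001,8.4
claude-3-sonnet,product_inquiry_001,7.9
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Group per-scenario scores by model
by_model = {}
for row in rows:
    by_model.setdefault(row["model"], []).append(float(row["overall_score"]))

# Print each model's mean overall score
for model, model_scores in by_model.items():
    print(model, round(statistics.mean(model_scores), 2))
```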

🏗️ Advanced Usage

Parallel Evaluation

```bash
# Run multiple scenarios in parallel
python run.py --scenarios product_inquiry_001 support_001 contract_001 --parallel

# Or using CLI directly
bizcon run --scenarios product_inquiry_001 support_001 --parallel
```

Custom Model Parameters

```yaml
models:
  - provider: openai
    name: gpt-4
    temperature: 0.3
    max_tokens: 1024
    parameters:
      seed: 42
      top_p: 0.9
```

Advanced Visualization Dashboard

```bash
# Install advanced features first (use quotes for zsh)
pip install -e ".[advanced]"

# Launch interactive dashboard with advanced features
python examples/advanced_dashboard_demo.py --results-dir output/

# Launch on custom host/port with auto-refresh
python examples/advanced_dashboard_demo.py --host 0.0.0.0 --port 8080

# Disable auto-refresh for static analysis
python examples/advanced_dashboard_demo.py --no-auto-refresh
```

Note: Advanced visualization features require additional dependencies (Plotly, Flask, SciPy). Install with pip install "bizcon[advanced]" (quotes required for zsh) to enable these features.

Scenario Categories

```bash
# Run all product inquiry scenarios
python run.py --scenarios "product_inquiry_*"

# Run scenarios by complexity
python run.py --scenarios "complex_*"
```
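Pattern-based selection like the commands above can be resolved with glob-style matching. This sketch uses Python's `fnmatch` and example scenario IDs from this README; it is an assumption about how the CLI could resolve patterns, not its actual implementation.

```python
# Sketch of glob-style scenario selection (assumed behavior, not
# bizCon's actual CLI implementation). Scenario IDs are README examples.
from fnmatch import fnmatch

available = [
    "product_inquiry_001",
    "product_inquiry_002",
    "support_001",
    "contract_001",
]

def select(patterns):
    """Return the available scenario IDs matching any glob pattern."""
    return [s for s in available if any(fnmatch(s, p) for p in patterns)]

print(select(["product_inquiry_*"]))  # → ['product_inquiry_001', 'product_inquiry_002']
```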

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

Ways to Contribute

  • 🐛 Report Bugs: Open an issue with detailed reproduction steps
  • Suggest Features: Propose new scenarios, evaluators, or tools
  • 📝 Improve Documentation: Help make our docs clearer
  • 🔧 Submit Code: Fix bugs or add new features
  • 🧪 Add Test Cases: Improve our test coverage

Development Setup

```bash
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon
pip install -e .

# Run framework validation (no API keys needed)
python test_framework.py

# Run full test suite
python -m pytest tests/
```

Contribution Guidelines

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite (pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

🧪 Testing & Validation

🎯 Framework Validation Status

| Component | Status | Coverage |
|-----------|--------|----------|
| **Unit Tests** | ✅ PASSED (12/12) | Evaluators, Scenarios, Tools |
| **Integration Tests** | ✅ PASSED | End-to-end Pipeline |
| **Framework Tests** | ✅ PASSED | Mock Model Validation |
| **Report Generation** | ✅ WORKING | HTML, Markdown, CSV |
| **CLI Functionality** | ✅ OPERATIONAL | All Commands Available |
| **Data Integrity** | ✅ VERIFIED | JSON Files Valid |

Running Tests

```bash
# 🚀 Quick framework validation (no API keys required)
python test_framework.py

# 📊 Full test suite with detailed output
python -m pytest tests/ -v

# 🔍 Test specific components
python -m pytest tests/unit/test_evaluators.py::TestResponseQualityEvaluator
python -m pytest tests/integration/test_pipeline.py

# 🎯 Test with coverage report
python -m pytest tests/ --cov=./ --cov-report=html
```

**No API keys needed** for framework validation - uses MockModelClient for comprehensive testing.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

💬 Community

👥 Authors

Akram Hasan Sharkar - Author & Lead Developer
Maya Msahal - Co-Author & Research Contributor

Developed at Olib AI

📖 Research Paper

A detailed research paper describing the methodology, evaluation framework, and empirical results of bizCon will be published on arXiv.org. The paper link will be available here upon publication.

Citation format will be provided once the paper is published.

🙏 Acknowledgments

  • Built with ❤️ by Akram Hasan Sharkar and Maya Msahal at Olib AI
  • Inspired by the need for better business-focused LLM evaluation
  • Thanks to all contributors who help make this project better

📈 Roadmap

✅ Recent Additions (May 2025)

| Feature | Priority | Status | Completed |
|---------|----------|--------|-----------|
| 📊 **Advanced Visualization Dashboards** | High | ✅ Complete | May 2025 |
| 🎯 **Interactive Plotly Charts** | High | ✅ Complete | May 2025 |
| 🔄 **Real-time Dashboard Filtering** | Medium | ✅ Complete | May 2025 |
| 📈 **Statistical Analysis Tools** | Medium | ✅ Complete | May 2025 |
| 🔍 **Model Comparison Engine** | Medium | ✅ Complete | May 2025 |

🔮 Upcoming Features

| Feature | Priority | Status | ETA |
|---------|----------|--------|-----|
| 🌐 **More LLM Providers** (Cohere, Together AI) | High | Planning | Q3 2025 |
| 🏭 **Industry-Specific Scenario Packs** | Medium | Planning | Q4 2025 |
| ⚡ **Real-time Evaluation APIs** | Medium | Researching | Q4 2025 |
| 🔗 **Custom Webhook Integrations** | Low | Backlog | Q1 2026 |
| 🌍 **Multi-language Support** | Low | Backlog | Q1 2026 |
| 🤖 **AI-Powered Insights** | Medium | Planning | Q3 2025 |

📋 Version History

  • **v0.4.0** *(Current)*: Advanced visualization dashboards, interactive Plotly charts, real-time filtering, statistical analysis
  • **v0.3.0**: Multi-provider support, tool integration, success rate differentiation
  • **v0.2.0**: Added visualization and reporting capabilities
  • **v0.1.0**: Initial release with core evaluation framework

**Made with ❤️ by [Akram Hasan Sharkar](https://github.com/ibnbd) & [Maya Msahal](https://github.com/Mayamsah) at [Olib AI](https://www.olib.ai)** [⭐ Star us on GitHub](https://github.com/Olib-AI/bizcon) • [📖 Read the Docs](https://github.com/Olib-AI/bizcon/wiki) • [🐛 Report Issues](https://github.com/Olib-AI/bizcon/issues)

Owner

  • Name: Olib AI
  • Login: Olib-AI
  • Kind: organization
  • Email: github@olib.ai
  • Location: United States of America

Empowering businesses with cutting-edge AI solutions.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: "bizCon: Business Conversation Evaluation Framework for LLMs"
authors:
  - family-names: "Sharkar"
    given-names: "Akram Hasan"
    orcid: "https://orcid.org/0000-0000-0000-0000"  # Replace with actual ORCID if available
    affiliation: "Olib AI"
  - family-names: "Msahal"
    given-names: "Maya"
    orcid: "https://orcid.org/0000-0000-0000-0000"  # Replace with actual ORCID if available
    affiliation: "Olib AI"
repository-code: "https://github.com/Olib-AI/bizcon"
url: "https://www.olib.ai"
abstract: >-
  bizCon is a comprehensive evaluation framework for benchmarking 
  Large Language Models on business conversation capabilities. It 
  evaluates models across multiple dimensions including response 
  quality, business value, communication style, tool usage, and 
  performance using realistic business scenarios.
keywords:
  - "large language models"
  - "LLM evaluation"
  - "business conversations"
  - "benchmark"
  - "natural language processing"
  - "artificial intelligence"
license: MIT
version: "0.1.0"
date-released: "2024-12-19"
preferred-citation:
  type: article
  title: "bizCon: A Comprehensive Evaluation Framework for Business Conversation Capabilities of Large Language Models"
  authors:
    - family-names: "Sharkar"
      given-names: "Akram Hasan"
      affiliation: "Olib AI"
    - family-names: "Msahal"
      given-names: "Maya"
      affiliation: "Olib AI"
  journal: "arXiv preprint"
  year: 2024
  notes: "Paper to be published on arXiv.org"

GitHub Events

Total
  • Issues event: 3
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 5
  • Fork event: 1
  • Create event: 2
Last Year
  • Issues event: 3
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 5
  • Fork event: 1
  • Create event: 2