bizcon

LLM benchmark for business conversations

https://github.com/olib-ai/bizcon

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.2%) to scientific vocabulary

Repository

LLM benchmark for business conversations

Basic Info
  • Host: GitHub
  • Owner: Olib-AI
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 209 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 9 months ago · Last pushed 9 months ago
Metadata Files
Readme Contributing License Citation

README.md

bizCon: Business Conversation Evaluation Framework for LLMs

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![Tests](https://img.shields.io/badge/tests-passing-brightgreen.svg)](https://github.com/Olib-AI/bizcon/actions) [![GitHub Issues](https://img.shields.io/github/issues/Olib-AI/bizcon)](https://github.com/Olib-AI/bizcon/issues) [![GitHub Stars](https://img.shields.io/github/stars/Olib-AI/bizcon)](https://github.com/Olib-AI/bizcon/stargazers)

**A comprehensive open-source framework for benchmarking Large Language Models on business conversation capabilities**

[🚀 Quick Start](#-quick-start) • [📖 Documentation](#-documentation) • [📊 Sample Results](#-sample-results) • [🤝 Contributing](#-contributing) • [💬 Community](#-community)

📋 Table of Contents

  • [🎯 Overview](#-overview)
  • [✨ Key Features](#-key-features)
  • [🚀 Quick Start](#-quick-start)
  • [📖 Documentation](#-documentation)
  • [📊 Sample Results](#-sample-results)
  • [🏗️ Advanced Usage](#️-advanced-usage)
  • [🤝 Contributing](#-contributing)
  • [🧪 Testing & Validation](#-testing--validation)
  • [💬 Community](#-community)
  • [📈 Roadmap](#-roadmap)

🎯 Overview

bizCon is a specialized evaluation framework designed to benchmark Large Language Models (LLMs) on realistic business conversation scenarios. Unlike generic benchmarks, bizCon focuses on practical business use cases involving professional communication, tool integration, and domain-specific knowledge.

Why bizCon?

  • Business-Focused: Evaluates models on real-world business scenarios
  • Multi-Dimensional: Assesses 5 key aspects of business communication
  • Tool Integration: Tests models' ability to use business tools effectively
  • Comparative Analysis: Benchmark multiple models side-by-side
  • Enterprise-Ready: Professional reporting and analysis capabilities

✨ Key Features

🎭 Diverse Business Scenarios

  • Product Inquiries: Enterprise software consultations
  • Technical Support: Complex troubleshooting and API integration
  • Contract Negotiation: SaaS agreements and enterprise deals
  • Appointment Scheduling: Multi-stakeholder coordination
  • Compliance Inquiries: Regulatory and data privacy questions
  • Implementation Planning: Software deployment strategies
  • Service Complaints: Customer service and dispute resolution
  • Multi-Department: Cross-functional project coordination

📊 Comprehensive Evaluation Metrics

  1. Response Quality (25%) - Factual accuracy and completeness
  2. Business Value (25%) - Strategic insight and actionable recommendations
  3. Communication Style (20%) - Professionalism and tone appropriateness
  4. Tool Usage (20%) - Effective integration with business tools
  5. Performance (10%) - Response time and efficiency
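The weighting scheme above can be sketched in a few lines. The `overall_score` helper below is illustrative rather than bizCon's actual API, but the evaluator names, weights, and 0-10 scale follow this README.

```python
# Sketch of the weighted scoring scheme described above.
# The evaluator names, weights, and 0-10 scale come from the README;
# the helper function itself is illustrative, not bizCon's API.

WEIGHTS = {
    "response_quality": 0.25,
    "business_value": 0.25,
    "communication_style": 0.20,
    "tool_usage": 0.20,
    "performance": 0.10,
}

def overall_score(scores):
    """Combine per-evaluator scores (each 0-10) into a weighted overall score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Example: hypothetical per-evaluator scores for one model run
scores = {
    "response_quality": 8.5,
    "business_value": 8.1,
    "communication_style": 9.0,
    "tool_usage": 7.8,
    "performance": 8.0,
}
print(round(overall_score(scores), 2))  # weighted mean on the 0-10 scale
```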

🛠️ Business Tool Ecosystem

  • Knowledge Base Search
  • Product Catalog Lookup
  • Pricing Calculator
  • Appointment Scheduler
  • Customer History Access
  • Document Retrieval
  • Order Management
  • Support Ticket System
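To make the tool ecosystem concrete, here is a toy stand-in for one of the tools above. The class and method names are hypothetical, not bizCon's actual tool interface; they only show the request/response shape such a tool might have.

```python
# Hypothetical sketch of a business tool; class and method names are
# illustrative only and do NOT reflect bizCon's actual tool API.
from dataclasses import dataclass, field

@dataclass
class MockKnowledgeBase:
    """Toy stand-in for the knowledge-base search tool."""
    articles: dict = field(default_factory=lambda: {
        "sso setup": "Configure SAML under Settings > Security.",
        "api limits": "Default rate limit is 100 requests/minute.",
    })

    def call(self, parameters: dict) -> dict:
        """Return articles whose title contains the query string."""
        query = parameters.get("query", "").lower()
        hits = {k: v for k, v in self.articles.items() if query in k}
        return {"tool_id": "knowledge_base", "results": hits}

kb = MockKnowledgeBase()
print(kb.call({"query": "sso"}))
```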

🤖 Multi-Model Support

| 🤖 OpenAI | 🧠 Anthropic | 🌟 Mistral AI |
|-----------|--------------|----------------|
| GPT-4 | Claude-3-opus | Mistral-large |
| GPT-3.5-turbo | Claude-3-sonnet | Mistral-medium |
| GPT-4-turbo | Claude-3-haiku | Mistral-small |

🚀 Quick Start

Installation

```bash
# Clone the repository
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon

# Basic installation
pip install -e .

# Install with advanced visualization features (use quotes for zsh)
pip install -e ".[advanced]"

# Install all optional features
pip install -e ".[all]"
```

Basic Usage

  1. Set up your API keys:

     ```bash
     export OPENAI_API_KEY="your-openai-key"
     export ANTHROPIC_API_KEY="your-anthropic-key"
     export MISTRAL_API_KEY="your-mistral-key"
     ```

  2. Run a quick test:

     ```bash
     # 🚀 Test without API keys (uses mock models)
     python test_framework.py

     # 🧪 Run unit and integration tests
     python -m pytest tests/

     # 🤖 Test with real models (requires API keys)
     python test_with_real_models.py
     ```

  3. Run a benchmark:

     ```bash
     # 📊 Compare models on specific scenarios
     python run.py --scenarios product_inquiry_001 support_001 --verbose

     # 🏃 Run full benchmark with custom config
     python run.py --config config/models.yaml --output results/

     # 💻 Using CLI interface directly
     bizcon run --config config/models.yaml --output results/
     ```

  4. Explore available options:

     ```bash
     # 📋 List all available scenarios
     python run.py --list-scenarios    # or: bizcon list-scenarios

     # 🤖 List supported models
     python run.py --list-models       # or: bizcon list-models
     ```

Configuration

Customize your evaluation in config/models.yaml:

```yaml
models:
  - provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 2048
  - provider: anthropic
    name: claude-3-sonnet
    temperature: 0.7
    max_tokens: 2048
```

Adjust evaluation settings in config/evaluation.yaml:

```yaml
evaluation:
  parallel: true
  num_runs: 3
  evaluator_weights:
    response_quality: 0.25
    business_value: 0.25
    communication_style: 0.20
    tool_usage: 0.20
    performance: 0.10
```

📖 Documentation

Project Structure

```
bizcon/
├── config/                    # Configuration files
│   ├── models.yaml            # Model configurations
│   └── evaluation.yaml        # Evaluation settings
├── core/                      # Core evaluation pipeline
│   ├── pipeline.py            # Main evaluation orchestrator
│   └── runner.py              # Scenario execution engine
├── models/                    # LLM provider integrations
│   ├── openai.py              # OpenAI client
│   ├── anthropic.py           # Anthropic client
│   └── mistral.py             # Mistral AI client
├── scenarios/                 # Business conversation scenarios
│   ├── product_inquiry.py
│   ├── technical_support.py
│   └── contract_negotiation.py
├── evaluators/                # Evaluation metrics
│   ├── response_quality.py
│   ├── business_value.py
│   └── communication_style.py
├── tools/                     # Business tool implementations
│   ├── knowledge_base.py
│   ├── scheduler.py
│   └── product_catalog.py
├── visualization/             # Advanced visualization and reporting
│   ├── charts.py              # Static matplotlib charts
│   ├── interactive_charts.py  # Interactive Plotly charts
│   ├── dashboard.py           # Basic Flask dashboard
│   ├── advanced_dashboard.py  # Advanced dashboard with filtering
│   ├── analysis_utils.py      # Statistical analysis tools
│   └── report.py              # Report generation
└── data/                      # Sample business data
    ├── knowledge_base/
    ├── products/
    └── pricing/
```

Creating Custom Scenarios

```python
from scenarios.base import BusinessScenario

class CustomBusinessScenario(BusinessScenario):
    def __init__(self, scenario_id=None):
        super().__init__(
            scenario_id=scenario_id or "custom_001",
            name="Custom Business Scenario",
            description="Your custom scenario description",
            industry="technology",
            complexity="medium",
            tools_required=["knowledge_base", "scheduler"]
        )

    def _initialize_conversation(self):
        return [{
            "user_message": "Your initial customer message",
            "expected_tool_calls": [
                {"tool_id": "knowledge_base", "parameters": {"query": "example"}}
            ]
        }]

    def _initialize_ground_truth(self):
        return {
            "expected_facts": ["Key fact 1", "Key fact 2"],
            "business_objective": "Help customer achieve X",
            "expected_tone": "professional"
        }
```

Adding Custom Evaluators

```python
from evaluators.base import BaseEvaluator

class CustomEvaluator(BaseEvaluator):
    def __init__(self, weight=1.0):
        super().__init__(name="Custom Evaluator", weight=weight)

    def evaluate(self, response, scenario, turn_index, conversation_history, tool_calls):
        # Your evaluation logic here
        score = self.calculate_score(response)
        return {
            "score": score,
            "explanation": "Detailed explanation of the score",
            "max_possible": 10.0
        }
```

📊 Sample Results

Overall Model Performance

```
┌─────────────────┬─────────┬──────────┬──────────┬───────────────┬────────────┬─────────────┐
│ Model           │ Overall │ Response │ Business │ Communication │ Tool Usage │ Performance │
│                 │ Score   │ Quality  │ Value    │ Style         │            │             │
├─────────────────┼─────────┼──────────┼──────────┼───────────────┼────────────┼─────────────┤
│ gpt-4           │ 8.2/10  │ 8.5/10   │ 8.1/10   │ 9.0/10        │ 7.8/10     │ 8.0/10      │
│ claude-3-sonnet │ 7.9/10  │ 8.2/10   │ 7.8/10   │ 8.8/10        │ 7.5/10     │ 7.2/10      │
│ claude-3-haiku  │ 7.1/10  │ 7.3/10   │ 6.9/10   │ 8.0/10        │ 6.8/10     │ 8.5/10      │
│ gpt-3.5-turbo   │ 6.8/10  │ 6.5/10   │ 6.2/10   │ 7.5/10        │ 6.0/10     │ 7.8/10      │
└─────────────────┴─────────┴──────────┴──────────┴───────────────┴────────────┴─────────────┘
```

Success Rates by Category

  • **GPT-4**: Response Quality (89%), Tool Usage (78%), Communication Style (90%)
  • **Claude-3-Sonnet**: Response Quality (86%), Tool Usage (75%), Communication Style (88%)
  • **Claude-3-Haiku**: Response Quality (73%), Tool Usage (68%), Communication Style (80%)

Report Outputs

  • 📊 **Interactive HTML Report**: Charts, breakdowns, and detailed analysis
  • 📈 **CSV Data Export**: Raw scores for custom analysis and visualization
  • 📝 **Markdown Summary**: Professional reports for sharing and documentation
  • 🎯 **Success Rate Analysis**: Model performance across business scenarios
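The CSV score export can be loaded with the standard library for custom analysis. The column names in this sketch are assumptions, so check the headers of an actual export first.

```python
# Sketch of loading the CSV score export for custom analysis.
# Column names ("model", "scenario", "overall_score") are assumptions;
# verify them against a real bizCon export before relying on this.
import csv
import io
import statistics

csv_text = """model,scenario,overall_score
gpt-4,product_inquiry_001,8.2
gpt-4,support_001,8.4
claude-3-sonnet,product_inquiry_001,7.9
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Group per-scenario scores by model
by_model = {}
for row in rows:
    by_model.setdefault(row["model"], []).append(float(row["overall_score"]))

# Print each model's mean overall score
for model, model_scores in by_model.items():
    print(model, round(statistics.mean(model_scores), 2))
```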

🏗️ Advanced Usage

Parallel Evaluation

```bash
# Run multiple scenarios in parallel
python run.py --scenarios product_inquiry_001 support_001 contract_001 --parallel

# Or using CLI directly
bizcon run --scenarios product_inquiry_001 support_001 --parallel
```

Custom Model Parameters

```yaml
models:
  - provider: openai
    name: gpt-4
    temperature: 0.3
    max_tokens: 1024
    parameters:
      seed: 42
      top_p: 0.9
```

Advanced Visualization Dashboard

```bash
# Install advanced features first (use quotes for zsh)
pip install -e ".[advanced]"

# Launch interactive dashboard with advanced features
python examples/advanced_dashboard_demo.py --results-dir output/

# Launch on custom host/port with auto-refresh
python examples/advanced_dashboard_demo.py --host 0.0.0.0 --port 8080

# Disable auto-refresh for static analysis
python examples/advanced_dashboard_demo.py --no-auto-refresh
```

Note: Advanced visualization features require additional dependencies (Plotly, Flask, SciPy). Install with pip install "bizcon[advanced]" (quotes required for zsh) to enable these features.

Scenario Categories

```bash
# Run all product inquiry scenarios
python run.py --scenarios "product_inquiry_*"

# Run scenarios by complexity
python run.py --scenarios "complex_*"
```
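Pattern-based selection like the commands above can be resolved with glob-style matching. This sketch uses Python's `fnmatch` and example scenario IDs from this README; it is an assumption about how the CLI could resolve patterns, not its actual implementation.

```python
# Sketch of glob-style scenario selection (assumed behavior, not
# bizCon's actual CLI implementation). Scenario IDs are README examples.
from fnmatch import fnmatch

available = [
    "product_inquiry_001",
    "product_inquiry_002",
    "support_001",
    "contract_001",
]

def select(patterns):
    """Return the available scenario IDs matching any glob pattern."""
    return [s for s in available if any(fnmatch(s, p) for p in patterns)]

print(select(["product_inquiry_*"]))  # → ['product_inquiry_001', 'product_inquiry_002']
```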

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

Ways to Contribute

  • 🐛 Report Bugs: Open an issue with detailed reproduction steps
  • Suggest Features: Propose new scenarios, evaluators, or tools
  • 📝 Improve Documentation: Help make our docs clearer
  • 🔧 Submit Code: Fix bugs or add new features
  • 🧪 Add Test Cases: Improve our test coverage

Development Setup

```bash
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon
pip install -e .

# Run framework validation (no API keys needed)
python test_framework.py

# Run full test suite
python -m pytest tests/
```

Contribution Guidelines

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite (pytest)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

🧪 Testing & Validation

🎯 Framework Validation Status

| Component | Status | Coverage |
|-----------|--------|----------|
| **Unit Tests** | ✅ PASSED (12/12) | Evaluators, Scenarios, Tools |
| **Integration Tests** | ✅ PASSED | End-to-end Pipeline |
| **Framework Tests** | ✅ PASSED | Mock Model Validation |
| **Report Generation** | ✅ WORKING | HTML, Markdown, CSV |
| **CLI Functionality** | ✅ OPERATIONAL | All Commands Available |
| **Data Integrity** | ✅ VERIFIED | JSON Files Valid |

Running Tests

```bash
# 🚀 Quick framework validation (no API keys required)
python test_framework.py

# 📊 Full test suite with detailed output
python -m pytest tests/ -v

# 🔍 Test specific components
python -m pytest tests/unit/test_evaluators.py::TestResponseQualityEvaluator
python -m pytest tests/integration/test_pipeline.py

# 🎯 Test with coverage report
python -m pytest tests/ --cov=./ --cov-report=html
```

**No API keys needed** for framework validation - uses MockModelClient for comprehensive testing.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

💬 Community

👥 Authors

Akram Hasan Sharkar - Author & Lead Developer
Maya Msahal - Co-Author & Research Contributor

Developed at Olib AI

📖 Research Paper

A detailed research paper describing the methodology, evaluation framework, and empirical results of bizCon will be published on arXiv.org. The paper link will be available here upon publication.

Citation format will be provided once the paper is published.

🙏 Acknowledgments

  • Built with ❤️ by Akram Hasan Sharkar and Maya Msahal at Olib AI
  • Inspired by the need for better business-focused LLM evaluation
  • Thanks to all contributors who help make this project better

📈 Roadmap

✅ Recent Additions (May 2025)

| Feature | Priority | Status | Completed |
|---------|----------|--------|-----------|
| 📊 **Advanced Visualization Dashboards** | High | ✅ Complete | May 2025 |
| 🎯 **Interactive Plotly Charts** | High | ✅ Complete | May 2025 |
| 🔄 **Real-time Dashboard Filtering** | Medium | ✅ Complete | May 2025 |
| 📈 **Statistical Analysis Tools** | Medium | ✅ Complete | May 2025 |
| 🔍 **Model Comparison Engine** | Medium | ✅ Complete | May 2025 |

🔮 Upcoming Features

| Feature | Priority | Status | ETA |
|---------|----------|--------|-----|
| 🌐 **More LLM Providers** (Cohere, Together AI) | High | Planning | Q3 2025 |
| 🏭 **Industry-Specific Scenario Packs** | Medium | Planning | Q4 2025 |
| ⚡ **Real-time Evaluation APIs** | Medium | Researching | Q4 2025 |
| 🔗 **Custom Webhook Integrations** | Low | Backlog | Q1 2026 |
| 🌍 **Multi-language Support** | Low | Backlog | Q1 2026 |
| 🤖 **AI-Powered Insights** | Medium | Planning | Q3 2025 |

📋 Version History

  • **v0.4.0** *(Current)*: Advanced visualization dashboards, interactive Plotly charts, real-time filtering, statistical analysis
  • **v0.3.0**: Multi-provider support, tool integration, success rate differentiation
  • **v0.2.0**: Added visualization and reporting capabilities
  • **v0.1.0**: Initial release with core evaluation framework

**Made with ❤️ by [Akram Hasan Sharkar](https://github.com/ibnbd) & [Maya Msahal](https://github.com/Mayamsah) at [Olib AI](https://www.olib.ai)** [⭐ Star us on GitHub](https://github.com/Olib-AI/bizcon) • [📖 Read the Docs](https://github.com/Olib-AI/bizcon/wiki) • [🐛 Report Issues](https://github.com/Olib-AI/bizcon/issues)

Owner

  • Name: Olib AI
  • Login: Olib-AI
  • Kind: organization
  • Email: github@olib.ai
  • Location: United States of America

Empowering businesses with cutting-edge AI solutions.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: "bizCon: Business Conversation Evaluation Framework for LLMs"
authors:
  - family-names: "Sharkar"
    given-names: "Akram Hasan"
    orcid: "https://orcid.org/0000-0000-0000-0000"  # Replace with actual ORCID if available
    affiliation: "Olib AI"
  - family-names: "Msahal"
    given-names: "Maya"
    orcid: "https://orcid.org/0000-0000-0000-0000"  # Replace with actual ORCID if available
    affiliation: "Olib AI"
repository-code: "https://github.com/Olib-AI/bizcon"
url: "https://www.olib.ai"
abstract: >-
  bizCon is a comprehensive evaluation framework for benchmarking 
  Large Language Models on business conversation capabilities. It 
  evaluates models across multiple dimensions including response 
  quality, business value, communication style, tool usage, and 
  performance using realistic business scenarios.
keywords:
  - "large language models"
  - "LLM evaluation"
  - "business conversations"
  - "benchmark"
  - "natural language processing"
  - "artificial intelligence"
license: MIT
version: "0.1.0"
date-released: "2024-12-19"
preferred-citation:
  type: article
  title: "bizCon: A Comprehensive Evaluation Framework for Business Conversation Capabilities of Large Language Models"
  authors:
    - family-names: "Sharkar"
      given-names: "Akram Hasan"
      affiliation: "Olib AI"
    - family-names: "Msahal"
      given-names: "Maya"
      affiliation: "Olib AI"
  journal: "arXiv preprint"
  year: 2024
  notes: "Paper to be published on arXiv.org"

GitHub Events

Total
  • Issues event: 3
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 5
  • Fork event: 1
  • Create event: 2
Last Year
  • Issues event: 3
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 5
  • Fork event: 1
  • Create event: 2