Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.2%) to scientific vocabulary
Repository
LLM benchmark for business conversations
Basic Info
- Host: GitHub
- Owner: Olib-AI
- License: mit
- Language: Python
- Default Branch: main
- Size: 209 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
bizCon: Business Conversation Evaluation Framework for LLMs
📋 Table of Contents
- [🎯 Overview](#-overview)
- [✨ Key Features](#-key-features)
- [🚀 Quick Start](#-quick-start)
- [📖 Documentation](#-documentation)
- [📊 Sample Results](#-sample-results)
- [🏗️ Advanced Usage](#️-advanced-usage)
- [🤝 Contributing](#-contributing)
- [🧪 Testing & Validation](#-testing--validation)
- [💬 Community](#-community)
- [📈 Roadmap](#-roadmap)

🎯 Overview
bizCon is a specialized evaluation framework designed to benchmark Large Language Models (LLMs) on realistic business conversation scenarios. Unlike generic benchmarks, bizCon focuses on practical business use cases involving professional communication, tool integration, and domain-specific knowledge.
Why bizCon?
- Business-Focused: Evaluates models on real-world business scenarios
- Multi-Dimensional: Assesses 5 key aspects of business communication
- Tool Integration: Tests models' ability to use business tools effectively
- Comparative Analysis: Benchmark multiple models side-by-side
- Enterprise-Ready: Professional reporting and analysis capabilities
✨ Key Features
🎭 Diverse Business Scenarios
- Product Inquiries: Enterprise software consultations
- Technical Support: Complex troubleshooting and API integration
- Contract Negotiation: SaaS agreements and enterprise deals
- Appointment Scheduling: Multi-stakeholder coordination
- Compliance Inquiries: Regulatory and data privacy questions
- Implementation Planning: Software deployment strategies
- Service Complaints: Customer service and dispute resolution
- Multi-Department: Cross-functional project coordination
📊 Comprehensive Evaluation Metrics
- Response Quality (25%) - Factual accuracy and completeness
- Business Value (25%) - Strategic insight and actionable recommendations
- Communication Style (20%) - Professionalism and tone appropriateness
- Tool Usage (20%) - Effective integration with business tools
- Performance (10%) - Response time and efficiency
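As a rough illustration of how the weights above could combine into one overall score, here is a minimal sketch. It assumes every metric is scored on the same 0-10 scale and that the overall score is a plain weighted average; bizCon's actual aggregation may differ. The `overall_score` function and the example scores are hypothetical.

```python
# Weights as listed above (they sum to 1.0).
WEIGHTS = {
    "response_quality": 0.25,
    "business_value": 0.25,
    "communication_style": 0.20,
    "tool_usage": 0.20,
    "performance": 0.10,
}


def overall_score(metric_scores: dict[str, float]) -> float:
    """Combine per-metric scores (0-10 scale) into a weighted average."""
    missing = set(WEIGHTS) - set(metric_scores)
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * metric_scores[name] for name in WEIGHTS)


# Hypothetical per-metric scores, for illustration only.
example = {
    "response_quality": 8.0,
    "business_value": 7.0,
    "communication_style": 9.0,
    "tool_usage": 6.0,
    "performance": 10.0,
}
print(round(overall_score(example), 2))  # 7.75
```

Because the weights sum to 1.0, the result stays on the same 0-10 scale as the individual metrics.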
🛠️ Business Tool Ecosystem
- Knowledge Base Search
- Product Catalog Lookup
- Pricing Calculator
- Appointment Scheduler
- Customer History Access
- Document Retrieval
- Order Management
- Support Ticket System
🤖 Multi-Model Support
| 🤖 OpenAI | 🧠 Anthropic | 🌟 Mistral AI |
|-----------|--------------|---------------|
| GPT-4 | Claude-3-opus | Mistral-large |
| GPT-3.5-turbo | Claude-3-sonnet | Mistral-medium |
| GPT-4-turbo | Claude-3-haiku | Mistral-small |
🚀 Quick Start
Installation
```bash
# Clone the repository
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon

# Basic installation
pip install -e .

# Install with advanced visualization features (use quotes for zsh)
pip install -e ".[advanced]"

# Install all optional features
pip install -e ".[all]"
```
Basic Usage
- Set up your API keys:
```bash
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
export MISTRAL_API_KEY="your-mistral-key"
```
- Run a quick test:
```bash
# 🚀 Test without API keys (uses mock models)
python test_framework.py

# 🧪 Run unit and integration tests
python -m pytest tests/

# 🤖 Test with real models (requires API keys)
python test_with_real_models.py
```
- Run a benchmark:
```bash
# 📊 Compare models on specific scenarios
python run.py --scenarios product_inquiry_001 support_001 --verbose

# 🏃 Run full benchmark with custom config
python run.py --config config/models.yaml --output results/

# 💻 Using CLI interface directly
bizcon run --config config/models.yaml --output results/
```
- Explore available options:
```bash
# 📋 List all available scenarios
python run.py --list-scenarios
# or: bizcon list-scenarios

# 🤖 List supported models
python run.py --list-models
# or: bizcon list-models
```
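Before running against real models, it can help to confirm which of the API keys from the first step are actually set. The following is a minimal sketch using only the standard library; the `configured_providers` helper is hypothetical and not part of bizCon, and the provider names are simply labels for the three environment variables shown above.

```python
import os

# Environment variable names taken from the export commands above.
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def configured_providers(env=None) -> list[str]:
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [p for p, var in REQUIRED_KEYS.items() if env.get(var, "").strip()]


if __name__ == "__main__":
    ready = configured_providers()
    print("Providers with keys set:", ", ".join(ready) or "none")
```

If no provider is listed, the mock-model test (`python test_framework.py`) still works, since it needs no keys.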
Configuration
Customize your evaluation in config/models.yaml:
```yaml
models:
  - provider: openai
    name: gpt-4
    temperature: 0.7
    max_tokens: 2048
  - provider: anthropic
    name: claude-3-sonnet
    temperature: 0.7
    max_tokens: 2048
```
Adjust evaluation settings in config/evaluation.yaml:
```yaml
evaluation:
  parallel: true
  num_runs: 3
  evaluator_weights:
    response_quality: 0.25
    business_value: 0.25
    communication_style: 0.20
    tool_usage: 0.20
    performance: 0.10
```
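With `num_runs: 3`, each model presumably produces several scores per scenario. The README does not say how repeated runs are combined, so the sketch below is only one plausible approach: report the mean and the sample standard deviation. The `summarize_runs` helper and the example scores are hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(run_scores: list[float]) -> dict:
    """Summarize repeated runs of one model on one scenario."""
    if not run_scores:
        raise ValueError("need at least one run")
    # The sample standard deviation is undefined for a single run.
    spread = stdev(run_scores) if len(run_scores) > 1 else 0.0
    return {"mean": mean(run_scores), "stdev": spread, "runs": len(run_scores)}


# Hypothetical overall scores from three runs.
summary = summarize_runs([7.0, 8.0, 9.0])
print(summary)
```

A large spread relative to the gap between two models suggests that more runs are needed before ranking them.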
📖 Documentation
Project Structure
bizcon/
├── config/ # Configuration files
│ ├── models.yaml # Model configurations
│ └── evaluation.yaml # Evaluation settings
├── core/ # Core evaluation pipeline
│ ├── pipeline.py # Main evaluation orchestrator
│ └── runner.py # Scenario execution engine
├── models/ # LLM provider integrations
│ ├── openai.py # OpenAI client
│ ├── anthropic.py # Anthropic client
│ └── mistral.py # Mistral AI client
├── scenarios/ # Business conversation scenarios
│ ├── product_inquiry.py
│ ├── technical_support.py
│ └── contract_negotiation.py
├── evaluators/ # Evaluation metrics
│ ├── response_quality.py
│ ├── business_value.py
│ └── communication_style.py
├── tools/ # Business tool implementations
│ ├── knowledge_base.py
│ ├── scheduler.py
│ └── product_catalog.py
├── visualization/ # Advanced visualization and reporting
│ ├── charts.py # Static matplotlib charts
│ ├── interactive_charts.py # Interactive Plotly charts
│ ├── dashboard.py # Basic Flask dashboard
│ ├── advanced_dashboard.py # Advanced dashboard with filtering
│ ├── analysis_utils.py # Statistical analysis tools
│ └── report.py # Report generation
└── data/ # Sample business data
├── knowledge_base/
├── products/
└── pricing/
Creating Custom Scenarios
```python
from scenarios.base import BusinessScenario


class CustomBusinessScenario(BusinessScenario):
    def __init__(self, scenario_id=None):
        super().__init__(
            scenario_id=scenario_id or "custom_001",
            name="Custom Business Scenario",
            description="Your custom scenario description",
            industry="technology",
            complexity="medium",
            tools_required=["knowledge_base", "scheduler"],
        )

    def _initialize_conversation(self):
        # One entry per conversation turn
        return [{
            "user_message": "Your initial customer message",
            "expected_tool_calls": [
                {"tool_id": "knowledge_base", "parameters": {"query": "example"}}
            ]
        }]

    def _initialize_ground_truth(self):
        # Reference data the evaluators compare responses against
        return {
            "expected_facts": ["Key fact 1", "Key fact 2"],
            "business_objective": "Help customer achieve X",
            "expected_tone": "professional"
        }
```
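When writing a custom scenario, one easy mistake is to reference a tool in `expected_tool_calls` that the scenario never declares as required. The sketch below checks for that on plain data. It assumes the tool IDs passed to the constructor are meant to match the `tool_id` values in each turn, which the README implies but does not state; `undeclared_tools` is a hypothetical helper, not part of bizCon.

```python
def undeclared_tools(tools_required, conversation):
    """Return tool IDs used in expected_tool_calls but not declared as required."""
    declared = set(tools_required)
    used = {
        call["tool_id"]
        for turn in conversation
        for call in turn.get("expected_tool_calls", [])
    }
    return sorted(used - declared)


# Same turn structure as the scenario example above.
conversation = [{
    "user_message": "Your initial customer message",
    "expected_tool_calls": [
        {"tool_id": "knowledge_base", "parameters": {"query": "example"}},
    ],
}]

print(undeclared_tools(["knowledge_base", "scheduler"], conversation))  # []
print(undeclared_tools(["scheduler"], conversation))  # ['knowledge_base']
```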
Adding Custom Evaluators
```python
from evaluators.base import BaseEvaluator


class CustomEvaluator(BaseEvaluator):
    def __init__(self, weight=1.0):
        super().__init__(name="Custom Evaluator", weight=weight)

    def evaluate(self, response, scenario, turn_index, conversation_history, tool_calls):
        # Your evaluation logic here; calculate_score is a placeholder
        # for a scoring helper you define on this class
        score = self.calculate_score(response)
        return {
            "score": score,
            "explanation": "Detailed explanation of the score",
            "max_possible": 10.0
        }
```
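The scoring helper in the evaluator above is left to the implementer. As one possible starting point, here is a standalone sketch that scores a response by how many of the scenario's `expected_facts` it mentions, returning the same dictionary shape as `evaluate`. The `fact_coverage_score` function is hypothetical, and plain substring matching is a deliberate simplification rather than how bizCon's built-in evaluators work.

```python
def fact_coverage_score(response: str, expected_facts: list[str],
                        max_possible: float = 10.0) -> dict:
    """Score a response by the fraction of expected facts it mentions."""
    text = response.lower()
    # Case-insensitive substring match; a real evaluator would need fuzzier matching.
    found = [fact for fact in expected_facts if fact.lower() in text]
    fraction = len(found) / len(expected_facts) if expected_facts else 0.0
    return {
        "score": round(fraction * max_possible, 2),
        "explanation": f"Mentioned {len(found)} of {len(expected_facts)} expected facts",
        "max_possible": max_possible,
    }


result = fact_coverage_score(
    "Our plan includes SSO and a 99.9% uptime SLA.",
    ["SSO", "99.9% uptime", "dedicated support"],
)
print(result["score"], "-", result["explanation"])
# 6.67 - Mentioned 2 of 3 expected facts
```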
📊 Sample Results
Overall Model Performance

| Model | Overall Score | Response Quality | Business Value | Communication Style | Tool Usage | Performance |
|-------|---------------|------------------|----------------|---------------------|------------|-------------|
| gpt-4 | 8.2/10 | 8.5/10 | 8.1/10 | 9.0/10 | 7.8/10 | 8.0/10 |
| claude-3-sonnet | 7.9/10 | 8.2/10 | 7.8/10 | 8.8/10 | 7.5/10 | 7.2/10 |
| claude-3-haiku | 7.1/10 | 7.3/10 | 6.9/10 | 8.0/10 | 6.8/10 | 8.5/10 |
| gpt-3.5-turbo | 6.8/10 | 6.5/10 | 6.2/10 | 7.5/10 | 6.0/10 | 7.8/10 |

Success Rates by Category
- GPT-4: Response Quality (89%), Tool Usage (78%), Communication Style (90%)
- Claude-3-Sonnet: Response Quality (86%), Tool Usage (75%), Communication Style (88%)
- Claude-3-Haiku: Response Quality (73%), Tool Usage (68%), Communication Style (80%)

Report Outputs
- 📊 Interactive HTML Report: Charts, breakdowns, and detailed analysis
- 📈 CSV Data Export: Raw scores for custom analysis and visualization
- 📝 Markdown Summary: Professional reports for sharing and documentation
- 🎯 Success Rate Analysis: Model performance across business scenarios

🏗️ Advanced Usage
Parallel Evaluation
```bash
# Run multiple scenarios in parallel
python run.py --scenarios product_inquiry_001 support_001 contract_001 --parallel

# Or using CLI directly
bizcon run --scenarios product_inquiry_001 support_001 --parallel
```
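To give a feel for what parallel scenario execution involves, here is a minimal sketch built on the standard library's thread pool. It is not bizCon's runner: `run_scenario` is a hypothetical stand-in that returns a placeholder result, and the choice of threads assumes the real work is dominated by waiting on API responses.

```python
from concurrent.futures import ThreadPoolExecutor


def run_scenario(scenario_id: str) -> dict:
    """Stand-in for a real scenario run; returns a placeholder result."""
    return {"scenario": scenario_id, "status": "completed"}


def run_parallel(scenario_ids: list[str], max_workers: int = 4) -> list[dict]:
    """Run scenarios concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(run_scenario, scenario_ids))


results = run_parallel(["product_inquiry_001", "support_001", "contract_001"])
for r in results:
    print(r["scenario"], r["status"])
```

Keeping results in input order makes side-by-side model comparisons easier to assemble afterwards.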
Custom Model Parameters
```yaml
models:
  - provider: openai
    name: gpt-4
    temperature: 0.3
    max_tokens: 1024
    parameters:
      seed: 42
      top_p: 0.9
```
Advanced Visualization Dashboard
```bash
# Install advanced features first (use quotes for zsh)
pip install -e ".[advanced]"

# Launch interactive dashboard with advanced features
python examples/advanced_dashboard_demo.py --results-dir output/

# Launch on custom host/port with auto-refresh
python examples/advanced_dashboard_demo.py --host 0.0.0.0 --port 8080

# Disable auto-refresh for static analysis
python examples/advanced_dashboard_demo.py --no-auto-refresh
```
Note: Advanced visualization features require additional dependencies (Plotly, Flask, SciPy). Install with pip install "bizcon[advanced]" (quotes required for zsh) to enable these features.
Scenario Categories
```bash
# Run all product inquiry scenarios (quotes keep the shell from expanding the wildcard)
python run.py --scenarios "product_inquiry_*"

# Run scenarios by complexity
python run.py --scenarios "complex_*"
```
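The wildcard patterns above follow familiar shell-style globbing. How bizCon resolves them internally is not documented here, but the idea can be sketched with the standard library's `fnmatch` module. The scenario IDs and the `select_scenarios` helper below are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical scenario IDs, for illustration only.
AVAILABLE = ["product_inquiry_001", "product_inquiry_002", "support_001", "contract_001"]


def select_scenarios(patterns: list[str], available: list[str]) -> list[str]:
    """Return available IDs matching any pattern, in their original order."""
    return [sid for sid in available if any(fnmatchcase(sid, p) for p in patterns)]


print(select_scenarios(["product_inquiry_*"], AVAILABLE))
# ['product_inquiry_001', 'product_inquiry_002']
```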
🤝 Contributing
We welcome contributions from the community! Here's how you can help:
Ways to Contribute
- 🐛 Report Bugs: Open an issue with detailed reproduction steps
- ✨ Suggest Features: Propose new scenarios, evaluators, or tools
- 📝 Improve Documentation: Help make our docs clearer
- 🔧 Submit Code: Fix bugs or add new features
- 🧪 Add Test Cases: Improve our test coverage
Development Setup
```bash
git clone https://github.com/Olib-AI/bizcon.git
cd bizcon
pip install -e .

# Run framework validation (no API keys needed)
python test_framework.py

# Run full test suite
python -m pytest tests/
```
Contribution Guidelines
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes
- Add tests for new functionality
- Run the test suite (pytest)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
🧪 Testing & Validation
🎯 Framework Validation Status
Running Tests
```bash
# 🚀 Quick framework validation (no API keys required)
python test_framework.py

# 📊 Full test suite with detailed output
python -m pytest tests/ -v

# 🔍 Test specific components
python -m pytest tests/unit/test_evaluators.py::TestResponseQualityEvaluator
python -m pytest tests/integration/test_pipeline.py

# 🎯 Test with coverage report
python -m pytest tests/ --cov=./ --cov-report=html
```
No API keys are needed for framework validation; it uses MockModelClient for comprehensive testing.

📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
💬 Community
- Website: www.olib.ai
- GitHub: github.com/Olib-AI
- Issues: Report bugs or request features
- Discussions: Join the conversation
👥 Authors
Akram Hasan Sharkar - Author & Lead Developer
Maya Msahal - Co-Author & Research Contributor
Developed at Olib AI
📖 Research Paper
A detailed research paper describing the methodology, evaluation framework, and empirical results of bizCon will be published on arXiv.org. The paper link will be available here upon publication.
Citation format will be provided once the paper is published.
🙏 Acknowledgments
- Built with ❤️ by Akram Hasan Sharkar and Maya Msahal at Olib AI
- Inspired by the need for better business-focused LLM evaluation
- Thanks to all contributors who help make this project better
📈 Roadmap
✅ Recent Additions (May 2025)

| Feature | Priority | Status | Completed |
|---------|----------|--------|-----------|
| 📊 Advanced Visualization Dashboards | High | ✅ Complete | May 2025 |
| 🎯 Interactive Plotly Charts | High | ✅ Complete | May 2025 |
| 🔄 Real-time Dashboard Filtering | Medium | ✅ Complete | May 2025 |
| 📈 Statistical Analysis Tools | Medium | ✅ Complete | May 2025 |
| 🔍 Model Comparison Engine | Medium | ✅ Complete | May 2025 |

🔮 Upcoming Features

| Feature | Priority | Status | ETA |
|---------|----------|--------|-----|
| 🌐 More LLM Providers (Cohere, Together AI) | High | Planning | Q3 2025 |
| 🏭 Industry-Specific Scenario Packs | Medium | Planning | Q4 2025 |
| ⚡ Real-time Evaluation APIs | Medium | Researching | Q4 2025 |
| 🔗 Custom Webhook Integrations | Low | Backlog | Q1 2026 |
| 🌍 Multi-language Support | Low | Backlog | Q1 2026 |
| 🤖 AI-Powered Insights | Medium | Planning | Q3 2025 |

📋 Version History
- v0.4.0 (Current): Advanced visualization dashboards, interactive Plotly charts, real-time filtering, statistical analysis
- v0.3.0: Multi-provider support, tool integration, success rate differentiation
- v0.2.0: Added visualization and reporting capabilities
- v0.1.0: Initial release with core evaluation framework

Owner
- Name: Olib AI
- Login: Olib-AI
- Kind: organization
- Email: github@olib.ai
- Location: United States of America
- Website: www.olib.ai
- Repositories: 1
- Profile: https://github.com/Olib-AI
Empowering businesses with cutting-edge AI solutions.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
type: software
title: "bizCon: Business Conversation Evaluation Framework for LLMs"
authors:
  - family-names: "Sharkar"
    given-names: "Akram Hasan"
    orcid: "https://orcid.org/0000-0000-0000-0000" # Replace with actual ORCID if available
    affiliation: "Olib AI"
  - family-names: "Msahal"
    given-names: "Maya"
    orcid: "https://orcid.org/0000-0000-0000-0000" # Replace with actual ORCID if available
    affiliation: "Olib AI"
repository-code: "https://github.com/Olib-AI/bizcon"
url: "https://www.olib.ai"
abstract: >-
  bizCon is a comprehensive evaluation framework for benchmarking
  Large Language Models on business conversation capabilities. It
  evaluates models across multiple dimensions including response
  quality, business value, communication style, tool usage, and
  performance using realistic business scenarios.
keywords:
  - "large language models"
  - "LLM evaluation"
  - "business conversations"
  - "benchmark"
  - "natural language processing"
  - "artificial intelligence"
license: MIT
version: "0.1.0"
date-released: "2024-12-19"
preferred-citation:
  type: article
  title: "bizCon: A Comprehensive Evaluation Framework for Business Conversation Capabilities of Large Language Models"
  authors:
    - family-names: "Sharkar"
      given-names: "Akram Hasan"
      affiliation: "Olib AI"
    - family-names: "Msahal"
      given-names: "Maya"
      affiliation: "Olib AI"
  journal: "arXiv preprint"
  year: 2024
  notes: "Paper to be published on arXiv.org"
GitHub Events
Total
- Issues event: 3
- Watch event: 1
- Issue comment event: 1
- Push event: 5
- Fork event: 1
- Create event: 2
Last Year
- Issues event: 3
- Watch event: 1
- Issue comment event: 1
- Push event: 5
- Fork event: 1
- Create event: 2