https://github.com/all-hands-ai/critic-rubrics

Type-safe LLM-as-judge evaluation framework for structured prediction and analysis.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.8%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: All-Hands-AI
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 379 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created 10 months ago · Last pushed 10 months ago
Metadata Files
Readme

README.md

Critic Rubrics

Type-safe function-calling-based LLM-as-judge evaluation framework for structured prediction and analysis.

> [!WARNING]
> This repository is an active research project. APIs and implementations are subject to major changes. Use with caution.

To Install

```bash
pip install git+https://github.com/All-Hands-AI/critic-rubrics
```

Core Data Structures

Prediction Types

All predictions inherit from BasePrediction and define how data is flattened into OpenAI tool schemas:

```python
from typing import Literal

from critic_rubrics import BinaryPrediction, TextPrediction, ClassificationPrediction


# Boolean detection with evidence
class BinaryPrediction(BasePrediction):
    detected: bool   # Flattened as: <feature_name>_detected
    rationale: str   # Flattened as: <feature_name>_rationale


# Free text output
class TextPrediction(BasePrediction):
    text: str        # Flattened as: <feature_name>_text


# Single-label classification with evidence
class ClassificationPrediction[L](BasePrediction):
    label: L         # Flattened as: <feature_name> (with enum constraint)
    rationale: str   # Flattened as: <feature_name>_rationale
```
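To make the flattening concrete, here is a minimal standalone sketch of how a binary and a classification prediction might expand into JSON-schema properties keyed by feature name. It does not call the library: the helper names `binary_properties` and `classification_properties`, the rationale descriptions, and the exact key layout are assumptions for illustration, not the library's implementation.

```python
from typing import Any, Literal, get_args


def binary_properties(feature_name: str, description: str) -> dict[str, Any]:
    """Hypothetical helper: flatten a boolean detection plus its rationale."""
    return {
        f"{feature_name}_detected": {"type": "boolean", "description": description},
        f"{feature_name}_rationale": {
            "type": "string",
            "description": f"Evidence for {feature_name}",
        },
    }


def classification_properties(
    feature_name: str, description: str, label_type: Any
) -> dict[str, Any]:
    """Hypothetical helper: flatten a Literal-typed label plus its rationale."""
    labels = list(get_args(label_type))  # Literal["a", "b"] -> ["a", "b"]
    return {
        feature_name: {"type": "string", "enum": labels, "description": description},
        f"{feature_name}_rationale": {
            "type": "string",
            "description": f"Evidence for {feature_name}",
        },
    }


Complexity = Literal["simple", "moderate", "complex"]

# Merge the per-feature properties into one flat mapping.
props = {
    **binary_properties("requires_clarification", "Task needs clarification"),
    **classification_properties("complexity_level", "Overall complexity", Complexity),
}
```

Keeping every field at the top level, prefixed by its feature name, means one tool call can carry all features without nested objects.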

Feature Definition

```python
from typing import Literal

from critic_rubrics import ClassificationPrediction, Feature

feature = Feature(
    name="task_complexity",
    description="Assess the complexity level of the given task",
    prediction_type=ClassificationPrediction[Literal["simple", "moderate", "complex"]],
)
```

Feature Data

```python
from critic_rubrics import ClassificationPrediction, FeatureData

# Combines feature definition with actual prediction data.
# `feature` is the Feature instance from the Feature Definition section.
feature_data = FeatureData(
    feature=feature,
    prediction=ClassificationPrediction(label="complex", rationale="Multiple dependencies"),
)
```

Core APIs

Rubric Definition

```python
from typing import Any, Literal

from litellm import ChatCompletionRequest

from critic_rubrics import BaseRubrics, BinaryPrediction, ClassificationPrediction, Feature


class TaskAnalysisRubric(BaseRubrics):
    def __init__(self):
        super().__init__(
            tool_name="analyze_task",
            tool_description="Analyze task characteristics and complexity",
            features=[
                Feature(
                    name="requires_clarification",
                    description="Task requires additional clarification from user",
                    prediction_type=BinaryPrediction,
                ),
                Feature(
                    name="complexity_level",
                    description="Overall complexity assessment",
                    prediction_type=ClassificationPrediction[Literal["simple", "moderate", "complex"]],
                ),
            ],
            system_message="You are an expert task analyzer.",
            user_message="Analyze the following task:",
        )

    def create_annotation_request(
        self, inputs: dict[str, Any], model: str = "openai/o3-2025-04-16"
    ) -> ChatCompletionRequest:
        # Build a chat request that forces the model to call the rubric's tool
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": f"{self.user_message}\n\n{inputs['task_description']}"},
            ],
            "tools": self.tools,
            "tool_choice": self.tool_choice,
        }
```

Tool Schema Generation

```python
rubric = TaskAnalysisRubric()

# OpenAI-compatible tool schema
print(rubric.tools)
# [{"type": "function", "function": {"name": "analyze_task", "parameters": {...}}}]

print(rubric.tool_choice)
# {"type": "function", "function": {"name": "analyze_task"}}
```
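For readers unfamiliar with the OpenAI tool format, the following sketch shows the general shape of the two values printed above, built by hand from a flat property mapping. The helpers `build_tool` and `build_tool_choice` are hypothetical and are not part of critic-rubrics; marking every property as required is also an assumption.

```python
from typing import Any


def build_tool(name: str, description: str, properties: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: wrap flattened properties in a function-tool schema."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                # Assumption: every flattened field must be filled in by the model.
                "required": sorted(properties),
            },
        },
    }


def build_tool_choice(name: str) -> dict[str, Any]:
    """Force the model to call the named function instead of replying in text."""
    return {"type": "function", "function": {"name": name}}


example_properties = {
    "requires_clarification_detected": {"type": "boolean"},
    "requires_clarification_rationale": {"type": "string"},
}
tool = build_tool("analyze_task", "Analyze task characteristics", example_properties)
tool_choice = build_tool_choice("analyze_task")
```

Pinning `tool_choice` to a single function is what makes the response reliably parseable: the model has to answer through the schema.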

LLM Integration

```python
from critic_rubrics import Annotator
from litellm import completion

# Single request
rubric = TaskAnalysisRubric()
request = rubric.create_annotation_request({"task_description": "Build a web scraper"})
response = Annotator.annotate(request, model="openai/gpt-4")

# Extract structured data
tool_call = response.choices[0].message.tool_calls[0]
feature_data_list = rubric.tool_call_to_feature_data(tool_call)

for feature_data in feature_data_list:
    print(f"{feature_data.feature.name}: {feature_data.prediction.to_dict()}")
```
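The parsing step can be pictured as the inverse of flattening. In OpenAI-style responses the tool call's arguments arrive as a JSON string, so a parser has to decode it and regroup the flat keys by feature. The sketch below shows one way that could work; `group_tool_args` is a hypothetical helper, and the prefix-based key convention is an assumption rather than the library's confirmed behavior.

```python
import json
from collections import defaultdict
from typing import Any


def group_tool_args(arguments: str, feature_names: list[str]) -> dict[str, dict[str, Any]]:
    """Hypothetical helper: regroup flat tool-call arguments by feature name."""
    args = json.loads(arguments)
    grouped: dict[str, dict[str, Any]] = defaultdict(dict)
    # Try longer names first so a short name cannot claim a longer name's keys.
    ordered = sorted(feature_names, key=len, reverse=True)
    for key, value in args.items():
        for name in ordered:
            if key == name:
                # A bare feature name carries the classification label.
                grouped[name]["label"] = value
                break
            if key.startswith(name + "_"):
                # Strip the "<feature_name>_" prefix to recover the field name.
                grouped[name][key[len(name) + 1:]] = value
                break
    return dict(grouped)


raw_arguments = json.dumps(
    {
        "requires_clarification_detected": True,
        "requires_clarification_rationale": "Target site is not specified",
        "complexity_level": "moderate",
        "complexity_level_rationale": "Needs pagination handling",
    }
)
parsed = group_tool_args(raw_arguments, ["requires_clarification", "complexity_level"])
```

Keys that match no known feature are silently dropped here; a stricter parser might raise instead.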

Batch Processing

```python
# Send batch requests (`tasks` is a list of task description strings)
requests = [rubric.create_annotation_request({"task_description": task}) for task in tasks]
batch_ids = Annotator.batch_annotate(
    requests,
    output_dir="./batch_results",
    custom_llm_provider="openai",
    model="openai/gpt-4",
)

# Retrieve results
for batch_id in batch_ids:
    status, results = Annotator.get_batch_results(batch_id, custom_llm_provider="openai")
    if status["status"] == "completed":
        for result in results:
            # Process batch result
            pass
```

Data Flow

1. Define Rubric → 2. Generate Tool Schema → 3. Send to LLM → 4. Parse Response → 5. Typed FeatureData
        ↓                     ↓                     ↓                 ↓                     ↓
BaseRubrics.tools    ChatCompletionRequest    ModelResponse    tool_call_to_feature_data()    List[FeatureData]

Key Methods

Prediction Methods

  • to_tool_properties(field_name, field_description, rationale_description) → dict[str, Any]
  • from_tool_args(feature_name, tool_args) → BasePrediction
  • to_dict() → dict[str, Any]

Rubric Methods

  • tools → list[ChatCompletionToolParam]
  • tool_choice → ChatCompletionToolChoiceObjectParam
  • create_annotation_request(inputs, model) → ChatCompletionRequest
  • tool_call_to_feature_data(tool_call) → list[FeatureData]

Annotator Methods

  • annotate(request, **kwargs) → ModelResponse
  • batch_annotate(requests, output_dir, custom_llm_provider, **kwargs) → list[str]
  • get_batch_results(batch_id, custom_llm_provider, **kwargs) → tuple[dict, list[dict]]

Installation

```bash
# Runtime dependencies
pip install -e .

# Development setup
uv sync --group dev
```

Requirements

  • Python 3.12+
  • pydantic >= 2.11.7
  • litellm >= 1.76.0

Example: Complete Workflow

```python
from typing import Any

from litellm import ChatCompletionRequest

from critic_rubrics import Annotator, BaseRubrics, BinaryPrediction, Feature


# 1. Define rubric
class CodeReviewRubric(BaseRubrics):
    def __init__(self):
        super().__init__(
            tool_name="review_code",
            tool_description="Review code for potential issues",
            features=[
                Feature("has_bugs", "Code contains potential bugs", BinaryPrediction),
                Feature("needs_refactor", "Code needs refactoring", BinaryPrediction),
            ],
            system_message="You are a senior code reviewer.",
        )

    def create_annotation_request(
        self, inputs: dict[str, Any], model: str = "openai/gpt-4"
    ) -> ChatCompletionRequest:
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": f"Review this code:\n\n{inputs['code']}"},
            ],
            "tools": self.tools,
            "tool_choice": self.tool_choice,
        }


# 2. Use rubric
rubric = CodeReviewRubric()
request = rubric.create_annotation_request({"code": "def add(a, b): return a + b"})
response = Annotator.annotate(request)

# 3. Extract results
tool_call = response.choices[0].message.tool_calls[0]
features = rubric.tool_call_to_feature_data(tool_call)

for feature_data in features:
    pred = feature_data.prediction
    print(f"{feature_data.feature.name}: {pred.detected} - {pred.rationale}")
```

Owner

  • Name: All Hands AI
  • Login: All-Hands-AI
  • Kind: organization
  • Email: contact@all-hands.dev

We build AI software development agents for everyone, in the open.

GitHub Events

Total
  • Release event: 1
  • Issue comment event: 4
  • Public event: 1
  • Push event: 9
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 10
  • Create event: 5
Last Year
  • Release event: 1
  • Issue comment event: 4
  • Public event: 1
  • Push event: 9
  • Pull request review event: 1
  • Pull request review comment event: 1
  • Pull request event: 10
  • Create event: 5

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 19
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.32
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 19
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 1.32
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • xingyaoww (19)
Top Labels
Issue Labels
Pull Request Labels
documentation (1)

Dependencies

.github/workflows/precommit.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • astral-sh/setup-uv v3 composite
  • pre-commit/action v3.0.1 composite
.github/workflows/tests.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • astral-sh/setup-uv v3 composite
pyproject.toml pypi
  • litellm >=1.76.0
  • pydantic >=2.11.7
  • rich >=14.1.0
  • tenacity >=8.5.0
uv.lock pypi
  • aiohappyeyeballs 2.6.1
  • aiohttp 3.12.15
  • aiosignal 1.4.0
  • annotated-types 0.7.0
  • anyio 4.10.0
  • attrs 25.3.0
  • certifi 2025.8.3
  • cfgv 3.4.0
  • charset-normalizer 3.4.3
  • click 8.2.1
  • colorama 0.4.6
  • critic-rubrics 0.1.1
  • distlib 0.4.0
  • distro 1.9.0
  • filelock 3.19.1
  • frozenlist 1.7.0
  • fsspec 2025.7.0
  • h11 0.16.0
  • hf-xet 1.1.8
  • httpcore 1.0.9
  • httpx 0.28.1
  • huggingface-hub 0.34.4
  • identify 2.6.13
  • idna 3.10
  • importlib-metadata 8.7.0
  • iniconfig 2.1.0
  • jinja2 3.1.6
  • jiter 0.10.0
  • jsonschema 4.25.1
  • jsonschema-specifications 2025.4.1
  • litellm 1.76.0
  • markdown-it-py 4.0.0
  • markupsafe 3.0.2
  • mdurl 0.1.2
  • multidict 6.6.4
  • nodeenv 1.9.1
  • openai 1.101.0
  • packaging 25.0
  • platformdirs 4.3.8
  • pluggy 1.6.0
  • pre-commit 4.3.0
  • propcache 0.3.2
  • psutil 7.0.0
  • pydantic 2.11.7
  • pydantic-core 2.33.2
  • pygments 2.19.2
  • pyright 1.1.404
  • pytest 8.4.1
  • python-dotenv 1.1.1
  • pyyaml 6.0.2
  • referencing 0.36.2
  • regex 2025.7.34
  • requests 2.32.5
  • rich 14.1.0
  • rpds-py 0.27.0
  • ruff 0.12.10
  • sniffio 1.3.1
  • tenacity 9.1.2
  • tiktoken 0.11.0
  • tokenizers 0.21.4
  • tqdm 4.67.1
  • typing-extensions 4.14.1
  • typing-inspection 0.4.1
  • urllib3 2.5.0
  • virtualenv 20.34.0
  • yarl 1.20.1
  • zipp 3.23.0