https://github.com/all-hands-ai/critic-rubrics

Type-safe LLM-as-judge evaluation framework for structured prediction and analysis.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.8%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository


Basic Info

Host: GitHub
Owner: All-Hands-AI
Language: Python
Default Branch: main
Homepage:
Size: 379 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created 10 months ago · Last pushed 10 months ago

Metadata Files

Readme

Critic Rubrics

Type-safe function-calling-based LLM-as-judge evaluation framework for structured prediction and analysis.

> [!WARNING]
> This repository is an active research project. APIs and implementations are subject to major changes. Use with caution.

To Install

pip install git+https://github.com/All-Hands-AI/critic-rubrics

Core Data Structures

Prediction Types

All predictions inherit from BasePrediction and define how data is flattened into OpenAI tool schemas:

```python
from critic_rubrics import BinaryPrediction, TextPrediction, ClassificationPrediction
from typing import Literal

# Boolean detection with evidence
class BinaryPrediction(BasePrediction):
    detected: bool   # Flattened as: _detected
    rationale: str   # Flattened as: _rationale

# Free text output
class TextPrediction(BasePrediction):
    text: str        # Flattened as: _text

# Single-label classification with evidence
class ClassificationPrediction[L](BasePrediction):
    label: L         # Flattened as: (with enum constraint)
    rationale: str   # Flattened as: _rationale
```
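To make the flattening idea concrete, here is a minimal standard-library sketch of how a binary prediction's fields might be turned into flat tool-schema properties and read back. It does not use critic_rubrics: the names `BinaryResult`, `binary_tool_properties`, and `binary_from_tool_args`, and the `<feature name>_<field>` key convention, are assumptions for illustration rather than the library's confirmed behavior.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class BinaryResult:
    """Stand-in for a binary prediction (hypothetical, not the library class)."""

    detected: bool
    rationale: str


def binary_tool_properties(feature_name: str, description: str) -> dict[str, Any]:
    """Build flat JSON-schema properties for one binary feature.

    Assumption: keys are formed as "<feature_name>_<field>".
    """
    return {
        f"{feature_name}_detected": {"type": "boolean", "description": description},
        f"{feature_name}_rationale": {
            "type": "string",
            "description": f"Evidence for: {description}",
        },
    }


def binary_from_tool_args(feature_name: str, tool_args: dict[str, Any]) -> BinaryResult:
    """Rebuild the typed result from the flat arguments a model returned."""
    return BinaryResult(
        detected=bool(tool_args[f"{feature_name}_detected"]),
        rationale=str(tool_args[f"{feature_name}_rationale"]),
    )


props = binary_tool_properties("has_bugs", "Code contains potential bugs")
result = binary_from_tool_args(
    "has_bugs",
    {"has_bugs_detected": True, "has_bugs_rationale": "Off-by-one in loop"},
)
print(sorted(props))
print(result)
```

Keeping every feature's fields under distinct flat keys lets several features share one function-call schema without nested objects.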

Feature Definition

```python
from typing import Literal

from critic_rubrics import ClassificationPrediction, Feature

feature = Feature(
    name="task_complexity",
    description="Assess the complexity level of the given task",
    prediction_type=ClassificationPrediction[Literal["simple", "moderate", "complex"]],
)
```
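The `Literal[...]` parameter is what allows a classification feature to carry an enum constraint. The sketch below shows one way such a constraint could be derived with the standard library's `typing.get_args`; the helper `literal_to_enum_property` is hypothetical and is not part of critic_rubrics.

```python
from typing import Any, Literal, get_args

Complexity = Literal["simple", "moderate", "complex"]


def literal_to_enum_property(label_type: Any, description: str) -> dict[str, Any]:
    """Turn a Literal[...] of strings into a JSON-schema string property with an enum."""
    choices = list(get_args(label_type))
    if not choices or not all(isinstance(choice, str) for choice in choices):
        raise TypeError("expected a Literal of string labels")
    return {"type": "string", "enum": choices, "description": description}


prop = literal_to_enum_property(
    Complexity, "Assess the complexity level of the given task"
)
print(prop["enum"])  # ['simple', 'moderate', 'complex']
```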

Feature Data

```python
from critic_rubrics import FeatureData

# Combines feature definition with actual prediction data
feature_data = FeatureData(
    feature=feature,
    prediction=ClassificationPrediction(label="complex", rationale="Multiple dependencies"),
)
```

Core APIs

Rubric Definition

```python
from critic_rubrics import BaseRubrics, Feature, BinaryPrediction, ClassificationPrediction
from typing import Literal, Any
from litellm import ChatCompletionRequest


class TaskAnalysisRubric(BaseRubrics):
    def __init__(self):
        super().__init__(
            tool_name="analyze_task",
            tool_description="Analyze task characteristics and complexity",
            features=[
                Feature(
                    name="requires_clarification",
                    description="Task requires additional clarification from user",
                    prediction_type=BinaryPrediction,
                ),
                Feature(
                    name="complexity_level",
                    description="Overall complexity assessment",
                    prediction_type=ClassificationPrediction[Literal["simple", "moderate", "complex"]],
                ),
            ],
            system_message="You are an expert task analyzer.",
            user_message="Analyze the following task:",
        )

    def create_annotation_request(self, inputs: dict[str, Any], model: str = "openai/o3-2025-04-16") -> ChatCompletionRequest:
        # Build a chat request that forces the model to call the rubric's tool
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": f"{self.user_message}\n\n{inputs['task_description']}"},
            ],
            "tools": self.tools,
            "tool_choice": self.tool_choice,
        }
```

Tool Schema Generation

```python
rubric = TaskAnalysisRubric()

# OpenAI-compatible tool schema
print(rubric.tools)
# [{"type": "function", "function": {"name": "analyze_task", "parameters": {...}}}]

print(rubric.tool_choice)
# {"type": "function", "function": {"name": "analyze_task"}}
```
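For readers unfamiliar with the shape printed above, the following sketch assembles an OpenAI-style function tool and a matching forced `tool_choice` from a flat property map using plain dictionaries. The helpers `build_tool` and `build_tool_choice`, the example property keys, and the choice to mark every property as required are assumptions for illustration; the schema critic_rubrics actually generates may differ.

```python
import json
from typing import Any


def build_tool(name: str, description: str, properties: dict[str, Any]) -> dict[str, Any]:
    """Wrap flat properties in an OpenAI-style function tool definition."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),  # assumption: every field is required
            },
        },
    }


def build_tool_choice(name: str) -> dict[str, Any]:
    """Force the model to call the named function."""
    return {"type": "function", "function": {"name": name}}


# Hypothetical flat properties for the two features of TaskAnalysisRubric.
properties = {
    "requires_clarification_detected": {"type": "boolean"},
    "requires_clarification_rationale": {"type": "string"},
    "complexity_level": {"type": "string", "enum": ["simple", "moderate", "complex"]},
}
tools = [build_tool("analyze_task", "Analyze task characteristics and complexity", properties)]
tool_choice = build_tool_choice("analyze_task")
print(json.dumps(tool_choice))
```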

LLM Integration

```python
from critic_rubrics import Annotator
from litellm import completion

# Single request
rubric = TaskAnalysisRubric()
request = rubric.create_annotation_request({"task_description": "Build a web scraper"})
response = Annotator.annotate(request, model="openai/gpt-4")

# Extract structured data
tool_call = response.choices[0].message.tool_calls[0]
feature_data_list = rubric.tool_call_to_feature_data(tool_call)

for feature_data in feature_data_list:
    print(f"{feature_data.feature.name}: {feature_data.prediction.to_dict()}")
```
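In OpenAI-style responses, a function tool call carries its arguments as a JSON string, so some decoding step has to happen before typed predictions can be built. The sketch below shows that step in isolation with a hand-built stand-in for the tool call; `parse_tool_arguments`, the fake object, and its argument keys are hypothetical and do not reflect how `tool_call_to_feature_data` is implemented.

```python
import json
from types import SimpleNamespace
from typing import Any


def parse_tool_arguments(tool_call: Any, expected_name: str) -> dict[str, Any]:
    """Decode the JSON arguments of a function tool call after checking its name."""
    if tool_call.function.name != expected_name:
        raise ValueError(f"unexpected tool: {tool_call.function.name!r}")
    args = json.loads(tool_call.function.arguments)
    if not isinstance(args, dict):
        raise ValueError("tool arguments must decode to a JSON object")
    return args


# Hand-built stand-in for response.choices[0].message.tool_calls[0].
fake_tool_call = SimpleNamespace(
    function=SimpleNamespace(
        name="analyze_task",
        arguments=json.dumps(
            {
                "complexity_level": "complex",
                "complexity_level_rationale": "Multiple dependencies",
            }
        ),
    )
)
args = parse_tool_arguments(fake_tool_call, "analyze_task")
print(args["complexity_level"])  # complex
```

Checking the tool name first guards against a response that calls a different function than the rubric expects.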

Batch Processing

```python
# Send batch requests (`tasks` is a list of task description strings)
requests = [rubric.create_annotation_request({"task_description": task}) for task in tasks]
batch_ids = Annotator.batch_annotate(
    requests,
    output_dir="./batch_results",
    custom_llm_provider="openai",
    model="openai/gpt-4",
)

# Retrieve results
for batch_id in batch_ids:
    status, results = Annotator.get_batch_results(batch_id, custom_llm_provider="openai")
    if status["status"] == "completed":
        for result in results:
            # Process batch result
            pass
```

Data Flow

1. Define Rubric → 2. Generate Tool Schema → 3. Send to LLM → 4. Parse Response → 5. Typed FeatureData
↓ ↓ ↓ ↓ ↓
BaseRubrics.tools   ChatCompletionRequest   ModelResponse   tool_call_to_feature_data()   List[FeatureData]

Key Methods

Prediction Methods

to_tool_properties(field_name, field_description, rationale_description) → dict[str, Any]
from_tool_args(feature_name, tool_args) → BasePrediction
to_dict() → dict[str, Any]

Rubric Methods

tools → list[ChatCompletionToolParam]
tool_choice → ChatCompletionToolChoiceObjectParam
create_annotation_request(inputs, model) → ChatCompletionRequest
tool_call_to_feature_data(tool_call) → list[FeatureData]

Annotator Methods

annotate(request, **kwargs) → ModelResponse
batch_annotate(requests, output_dir, custom_llm_provider, **kwargs) → list[str]
get_batch_results(batch_id, custom_llm_provider, **kwargs) → tuple[dict, list[dict]]

Installation

```bash
# Runtime dependencies
pip install -e .

# Development setup
uv sync --group dev
```

Requirements

Python 3.12+
pydantic >= 2.11.7
litellm >= 1.76.0

Example: Complete Workflow

```python
from critic_rubrics import BaseRubrics, Feature, BinaryPrediction, Annotator
from typing import Any
from litellm import ChatCompletionRequest


# 1. Define rubric
class CodeReviewRubric(BaseRubrics):
    def __init__(self):
        super().__init__(
            tool_name="review_code",
            tool_description="Review code for potential issues",
            features=[
                Feature("has_bugs", "Code contains potential bugs", BinaryPrediction),
                Feature("needs_refactor", "Code needs refactoring", BinaryPrediction),
            ],
            system_message="You are a senior code reviewer.",
        )

    def create_annotation_request(self, inputs: dict[str, Any], model: str = "openai/gpt-4") -> ChatCompletionRequest:
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": f"Review this code:\n\n{inputs['code']}"},
            ],
            "tools": self.tools,
            "tool_choice": self.tool_choice,
        }


# 2. Use rubric
rubric = CodeReviewRubric()
request = rubric.create_annotation_request({"code": "def add(a, b): return a + b"})
response = Annotator.annotate(request)

# 3. Extract results
tool_call = response.choices[0].message.tool_calls[0]
features = rubric.tool_call_to_feature_data(tool_call)

for feature_data in features:
    pred = feature_data.prediction
    print(f"{feature_data.feature.name}: {pred.detected} - {pred.rationale}")
```

Owner

Name: All Hands AI
Login: All-Hands-AI
Kind: organization
Email: contact@all-hands.dev

Website: https://all-hands.dev/
Twitter: allhands_ai
Repositories: 1
Profile: https://github.com/All-Hands-AI

We build AI software development agents for everyone, in the open.

GitHub Events

Total

Release event: 1
Issue comment event: 4
Public event: 1
Push event: 9
Pull request review event: 1
Pull request review comment event: 1
Pull request event: 10
Create event: 5

Last Year

Release event: 1
Issue comment event: 4
Public event: 1
Push event: 9
Pull request review event: 1
Pull request review comment event: 1
Pull request event: 10
Create event: 5

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 19
Average time to close issues: N/A
Average time to close pull requests: 1 day
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.32
Merged pull requests: 19
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 19
Average time to close issues: N/A
Average time to close pull requests: 1 day
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 1.32
Merged pull requests: 19
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

Pull Request Authors

xingyaoww (19)

Top Labels

Issue Labels

Pull Request Labels

documentation (1)

Dependencies

.github/workflows/precommit.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite
astral-sh/setup-uv v3 composite
pre-commit/action v3.0.1 composite

.github/workflows/tests.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite
astral-sh/setup-uv v3 composite

pyproject.toml pypi

litellm >=1.76.0
pydantic >=2.11.7
rich >=14.1.0
tenacity >=8.5.0

uv.lock pypi

aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aiosignal 1.4.0
annotated-types 0.7.0
anyio 4.10.0
attrs 25.3.0
certifi 2025.8.3
cfgv 3.4.0
charset-normalizer 3.4.3
click 8.2.1
colorama 0.4.6
critic-rubrics 0.1.1
distlib 0.4.0
distro 1.9.0
filelock 3.19.1
frozenlist 1.7.0
fsspec 2025.7.0
h11 0.16.0
hf-xet 1.1.8
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 0.34.4
identify 2.6.13
idna 3.10
importlib-metadata 8.7.0
iniconfig 2.1.0
jinja2 3.1.6
jiter 0.10.0
jsonschema 4.25.1
jsonschema-specifications 2025.4.1
litellm 1.76.0
markdown-it-py 4.0.0
markupsafe 3.0.2
mdurl 0.1.2
multidict 6.6.4
nodeenv 1.9.1
openai 1.101.0
packaging 25.0
platformdirs 4.3.8
pluggy 1.6.0
pre-commit 4.3.0
propcache 0.3.2
psutil 7.0.0
pydantic 2.11.7
pydantic-core 2.33.2
pygments 2.19.2
pyright 1.1.404
pytest 8.4.1
python-dotenv 1.1.1
pyyaml 6.0.2
referencing 0.36.2
regex 2025.7.34
requests 2.32.5
rich 14.1.0
rpds-py 0.27.0
ruff 0.12.10
sniffio 1.3.1
tenacity 9.1.2
tiktoken 0.11.0
tokenizers 0.21.4
tqdm 4.67.1
typing-extensions 4.14.1
typing-inspection 0.4.1
urllib3 2.5.0
virtualenv 20.34.0
yarl 1.20.1
zipp 3.23.0
