https://github.com/all-hands-ai/critic-rubrics
Type-safe LLM-as-judge evaluation framework for structured prediction and analysis.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.8%) to scientific vocabulary
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Critic Rubrics
Type-safe function-calling-based LLM-as-judge evaluation framework for structured prediction and analysis.
> [!WARNING]
> This repository is an active research project. APIs and implementations are subject to major changes. Use with caution.
To Install
```bash
pip install git+https://github.com/All-Hands-AI/critic-rubrics
```
Core Data Structures
Prediction Types
All predictions inherit from BasePrediction and define how data is flattened into OpenAI tool schemas:
```python
from critic_rubrics import BinaryPrediction, TextPrediction, ClassificationPrediction
from typing import Literal

# Boolean detection with evidence
class BinaryPrediction(BasePrediction):
    detected: bool  # Flattened as:

# Free text output
class TextPrediction(BasePrediction):
    text: str  # Flattened as:

# Single-label classification with evidence
class ClassificationPrediction[L](BasePrediction):
    label: L  # Flattened as:
```
Feature Definition
```python
from typing import Literal

from critic_rubrics import ClassificationPrediction, Feature

feature = Feature(
    name="task_complexity",
    description="Assess the complexity level of the given task",
    prediction_type=ClassificationPrediction[Literal["simple", "moderate", "complex"]],
)
```
Feature Data
```python
from critic_rubrics import FeatureData

# Combines feature definition with actual prediction data
feature_data = FeatureData(
    feature=feature,
    prediction=ClassificationPrediction(label="complex", rationale="Multiple dependencies"),
)
```
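The idea of flattening a prediction into tool-schema properties can be sketched with the standard library alone. This is only an illustration: the `flatten_classification` helper, the `_rationale` suffix, and the exact property layout are assumptions, not the library's confirmed output.

```python
from typing import Any


def flatten_classification(
    feature_name: str, description: str, labels: list[str]
) -> dict[str, Any]:
    """Hypothetical flattening: one enum field plus one rationale field."""
    return {
        feature_name: {
            "type": "string",
            "enum": labels,
            "description": description,
        },
        # The "_rationale" suffix is an assumed naming convention.
        f"{feature_name}_rationale": {
            "type": "string",
            "description": f"Evidence supporting the {feature_name} label",
        },
    }


properties = flatten_classification(
    "task_complexity",
    "Assess the complexity level of the given task",
    ["simple", "moderate", "complex"],
)
print(list(properties))
```

Restricting the label to an `enum` is what lets the schema itself constrain the model to the `Literal` choices declared on the feature.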
Core APIs
Rubric Definition
```python
from critic_rubrics import BaseRubrics, Feature, BinaryPrediction, ClassificationPrediction
from typing import Literal, Any
from litellm import ChatCompletionRequest


class TaskAnalysisRubric(BaseRubrics):
    def __init__(self):
        super().__init__(
            tool_name="analyze_task",
            tool_description="Analyze task characteristics and complexity",
            features=[
                Feature(
                    name="requires_clarification",
                    description="Task requires additional clarification from user",
                    prediction_type=BinaryPrediction,
                ),
                Feature(
                    name="complexity_level",
                    description="Overall complexity assessment",
                    prediction_type=ClassificationPrediction[Literal["simple", "moderate", "complex"]],
                ),
            ],
            system_message="You are an expert task analyzer.",
            user_message="Analyze the following task:",
        )

    def create_annotation_request(self, inputs: dict[str, Any], model: str = "openai/o3-2025-04-16") -> ChatCompletionRequest:
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": f"{self.user_message}\n\n{inputs['task_description']}"},
            ],
            "tools": self.tools,
            "tool_choice": self.tool_choice,
        }
```
Tool Schema Generation
```python
rubric = TaskAnalysisRubric()

# OpenAI-compatible tool schema
print(rubric.tools)
# [{"type": "function", "function": {"name": "analyze_task", "parameters": {...}}}]

print(rubric.tool_choice)
# {"type": "function", "function": {"name": "analyze_task"}}
```
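For readers unfamiliar with the OpenAI tool format, the shapes printed above can be assembled by hand roughly as follows. The `build_tool` and `build_tool_choice` helpers, the example property, and the choice to mark every property as required are assumptions made for this sketch; the library's actual schema may differ.

```python
from typing import Any


def build_tool(name: str, description: str, properties: dict[str, Any]) -> dict[str, Any]:
    """Wrap flattened properties in an OpenAI-style function tool (illustrative)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                # Assumption: every flattened field must be filled in.
                "required": list(properties),
            },
        },
    }


def build_tool_choice(name: str) -> dict[str, Any]:
    """Force the model to call the named function."""
    return {"type": "function", "function": {"name": name}}


# Hypothetical flattened property for a boolean feature.
example_properties = {"requires_clarification_detected": {"type": "boolean"}}

tools = [build_tool("analyze_task", "Analyze task characteristics and complexity", example_properties)]
tool_choice = build_tool_choice("analyze_task")
print(tools[0]["function"]["name"], tool_choice["function"]["name"])
```

Forcing `tool_choice` to the single rubric function means the response always arrives as a structured tool call rather than free text.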
LLM Integration
```python
from critic_rubrics import Annotator
from litellm import completion

# Single request
rubric = TaskAnalysisRubric()
request = rubric.create_annotation_request({"task_description": "Build a web scraper"})
response = Annotator.annotate(request, model="openai/gpt-4")

# Extract structured data
tool_call = response.choices[0].message.tool_calls[0]
feature_data_list = rubric.tool_call_to_feature_data(tool_call)

for feature_data in feature_data_list:
    print(f"{feature_data.feature.name}: {feature_data.prediction.to_dict()}")
```
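In OpenAI-style responses, a tool call carries its arguments as a JSON-encoded string, so the parsing step amounts to decoding that string and regrouping the flattened fields per feature. The sketch below shows that idea in isolation; the `BinaryResult` dataclass, the `parse_binary` helper, and the `_detected` / `_rationale` field names are hypothetical and stand in for whatever `tool_call_to_feature_data` does internally.

```python
import json
from dataclasses import dataclass


@dataclass
class BinaryResult:
    feature: str
    detected: bool
    rationale: str


def parse_binary(feature: str, arguments: str) -> BinaryResult:
    """Decode tool-call arguments and pick out one feature's fields (assumed names)."""
    args = json.loads(arguments)
    return BinaryResult(
        feature=feature,
        detected=bool(args[f"{feature}_detected"]),
        rationale=args.get(f"{feature}_rationale", ""),
    )


# Example arguments string, shaped like tool_call.function.arguments.
raw_arguments = json.dumps(
    {
        "requires_clarification_detected": True,
        "requires_clarification_rationale": "Target site is not specified",
    }
)
result = parse_binary("requires_clarification", raw_arguments)
print(result)
```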
Batch Processing
```python
# Send batch requests
requests = [rubric.create_annotation_request({"task_description": task}) for task in tasks]
batch_ids = Annotator.batch_annotate(
    requests,
    output_dir="./batch_results",
    custom_llm_provider="openai",
    model="openai/gpt-4",
)

# Retrieve results
for batch_id in batch_ids:
    status, results = Annotator.get_batch_results(batch_id, custom_llm_provider="openai")
    if status["status"] == "completed":
        for result in results:
            # Process batch result
            pass
```
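Batch jobs rarely finish on the first check, so the retrieval loop above usually needs to be repeated. A minimal polling sketch follows; `collect_completed` and `fake_fetch` are hypothetical names, the stub stands in for a real status call, and only the `"completed"` check comes from the example above (failed or expired batches are not handled here).

```python
import time
from typing import Any, Callable

FetchFn = Callable[[str], tuple[dict[str, Any], list[dict[str, Any]]]]


def collect_completed(
    batch_ids: list[str],
    fetch: FetchFn,
    poll_seconds: float = 0.0,
    max_rounds: int = 10,
) -> dict[str, list[dict[str, Any]]]:
    """Poll each batch until it reports "completed" or the round limit is hit."""
    pending = list(batch_ids)
    done: dict[str, list[dict[str, Any]]] = {}
    for _ in range(max_rounds):
        still_pending = []
        for batch_id in pending:
            status, results = fetch(batch_id)
            if status["status"] == "completed":
                done[batch_id] = results
            else:
                still_pending.append(batch_id)
        pending = still_pending
        if not pending:
            break
        time.sleep(poll_seconds)
    return done


# Stub that reports completion on the second call for each batch.
calls: dict[str, int] = {}


def fake_fetch(batch_id: str) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    calls[batch_id] = calls.get(batch_id, 0) + 1
    if calls[batch_id] < 2:
        return {"status": "in_progress"}, []
    return {"status": "completed"}, [{"id": batch_id}]


done = collect_completed(["b1", "b2"], fake_fetch)
print(sorted(done))
```

In real use, `fetch` could be a small wrapper such as `lambda b: Annotator.get_batch_results(b, custom_llm_provider="openai")`, with a `poll_seconds` value suited to the provider's batch turnaround.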
Data Flow
1. Define Rubric → 2. Generate Tool Schema → 3. Send to LLM → 4. Parse Response → 5. Typed FeatureData
↓ ↓ ↓ ↓ ↓
BaseRubrics.tools ChatCompletionRequest ModelResponse tool_call_to_feature_data() List[FeatureData]
Key Methods
Prediction Methods
- to_tool_properties(field_name, field_description, rationale_description) → dict[str, Any]
- from_tool_args(feature_name, tool_args) → BasePrediction
- to_dict() → dict[str, Any]
Rubric Methods
- tools → list[ChatCompletionToolParam]
- tool_choice → ChatCompletionToolChoiceObjectParam
- create_annotation_request(inputs, model) → ChatCompletionRequest
- tool_call_to_feature_data(tool_call) → list[FeatureData]
Annotator Methods
- annotate(request, **kwargs) → ModelResponse
- batch_annotate(requests, output_dir, custom_llm_provider, **kwargs) → list[str]
- get_batch_results(batch_id, custom_llm_provider, **kwargs) → tuple[dict, list[dict]]
Installation
```bash
# Runtime dependencies
pip install -e .

# Development setup
uv sync --group dev
```
Requirements
- Python 3.12+
- pydantic >= 2.11.7
- litellm >= 1.76.0
Example: Complete Workflow
```python
from critic_rubrics import BaseRubrics, Feature, BinaryPrediction, Annotator
from typing import Any
from litellm import ChatCompletionRequest


# 1. Define rubric
class CodeReviewRubric(BaseRubrics):
    def __init__(self):
        super().__init__(
            tool_name="review_code",
            tool_description="Review code for potential issues",
            features=[
                Feature("has_bugs", "Code contains potential bugs", BinaryPrediction),
                Feature("needs_refactor", "Code needs refactoring", BinaryPrediction),
            ],
            system_message="You are a senior code reviewer.",
        )

    def create_annotation_request(self, inputs: dict[str, Any], model: str = "openai/gpt-4") -> ChatCompletionRequest:
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": self.system_message},
                {"role": "user", "content": f"Review this code:\n\n{inputs['code']}"},
            ],
            "tools": self.tools,
            "tool_choice": self.tool_choice,
        }


# 2. Use rubric
rubric = CodeReviewRubric()
request = rubric.create_annotation_request({"code": "def add(a, b): return a + b"})
response = Annotator.annotate(request)

# 3. Extract results
tool_call = response.choices[0].message.tool_calls[0]
features = rubric.tool_call_to_feature_data(tool_call)

for feature_data in features:
    pred = feature_data.prediction
    print(f"{feature_data.feature.name}: {pred.detected} - {pred.rationale}")
```
Owner
- Name: All Hands AI
- Login: All-Hands-AI
- Kind: organization
- Email: contact@all-hands.dev
- Website: https://all-hands.dev/
- Twitter: allhands_ai
- Repositories: 1
- Profile: https://github.com/All-Hands-AI
We build AI software development agents for everyone, in the open.
GitHub Events
Total
- Release event: 1
- Issue comment event: 4
- Public event: 1
- Push event: 9
- Pull request review event: 1
- Pull request review comment event: 1
- Pull request event: 10
- Create event: 5
Last Year
- Release event: 1
- Issue comment event: 4
- Public event: 1
- Push event: 9
- Pull request review event: 1
- Pull request review comment event: 1
- Pull request event: 10
- Create event: 5
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 19
- Average time to close issues: N/A
- Average time to close pull requests: 1 day
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.32
- Merged pull requests: 19
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 19
- Average time to close issues: N/A
- Average time to close pull requests: 1 day
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.32
- Merged pull requests: 19
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- xingyaoww (19)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- astral-sh/setup-uv v3 composite
- pre-commit/action v3.0.1 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- astral-sh/setup-uv v3 composite
- litellm >=1.76.0
- pydantic >=2.11.7
- rich >=14.1.0
- tenacity >=8.5.0
- aiohappyeyeballs 2.6.1
- aiohttp 3.12.15
- aiosignal 1.4.0
- annotated-types 0.7.0
- anyio 4.10.0
- attrs 25.3.0
- certifi 2025.8.3
- cfgv 3.4.0
- charset-normalizer 3.4.3
- click 8.2.1
- colorama 0.4.6
- critic-rubrics 0.1.1
- distlib 0.4.0
- distro 1.9.0
- filelock 3.19.1
- frozenlist 1.7.0
- fsspec 2025.7.0
- h11 0.16.0
- hf-xet 1.1.8
- httpcore 1.0.9
- httpx 0.28.1
- huggingface-hub 0.34.4
- identify 2.6.13
- idna 3.10
- importlib-metadata 8.7.0
- iniconfig 2.1.0
- jinja2 3.1.6
- jiter 0.10.0
- jsonschema 4.25.1
- jsonschema-specifications 2025.4.1
- litellm 1.76.0
- markdown-it-py 4.0.0
- markupsafe 3.0.2
- mdurl 0.1.2
- multidict 6.6.4
- nodeenv 1.9.1
- openai 1.101.0
- packaging 25.0
- platformdirs 4.3.8
- pluggy 1.6.0
- pre-commit 4.3.0
- propcache 0.3.2
- psutil 7.0.0
- pydantic 2.11.7
- pydantic-core 2.33.2
- pygments 2.19.2
- pyright 1.1.404
- pytest 8.4.1
- python-dotenv 1.1.1
- pyyaml 6.0.2
- referencing 0.36.2
- regex 2025.7.34
- requests 2.32.5
- rich 14.1.0
- rpds-py 0.27.0
- ruff 0.12.10
- sniffio 1.3.1
- tenacity 9.1.2
- tiktoken 0.11.0
- tokenizers 0.21.4
- tqdm 4.67.1
- typing-extensions 4.14.1
- typing-inspection 0.4.1
- urllib3 2.5.0
- virtualenv 20.34.0
- yarl 1.20.1
- zipp 3.23.0