https://github.com/ai-forever/pollux

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
Statistics
  • Stars: 11
  • Watchers: 4
  • Forks: 0
  • Open Issues: 2
  • Releases: 0
Created about 1 year ago · Last pushed 12 months ago
Metadata Files
Readme License

README.md

POLLUX

Evaluating the Generative Capabilities of LLMs in Russian.
Benchmark and a family of LM-as-a-Judge models.

Demo code | Models & Dataset | Benchmark demo | Publications

Welcome to POLLUX – an open-source project dedicated to evaluating the generative capabilities of modern large language models (LLMs) in Russian.

Our comprehensive evaluation framework is built on three foundational pillars. First, we provide carefully developed 📊 taxonomies that systematically categorize both generative tasks and evaluation criteria. Second, our meticulously crafted 🌟 benchmark comprises 2,100 unique, manually created instructions paired with 471,515 detailed point-wise criterion assessments. Finally, POLLUX features a specialized ⚖️ family of LLM-based judges that automate the evaluation process, enabling scalable and systematic assessment of model outputs across all task categories.

↗ 🧭 Explore the benchmark on the project page.

↗ 🤗 See Hugging Face collection for the dataset and the models.

POLLUX features

  • 📚 152 diverse tasks: Covering open-ended generation, text-to-text transformation, information-seeking, and code-related prompts. The task taxonomy is grounded in analysis of real-world user queries.

  • 🌡️ 66 unique evaluation criteria: A rich set of non-overlapping fine-grained metrics — ranging from surface-level quality (e.g. absence of artifacts) to higher-level abilities like reasoning and creativity. Each criterion comes with a clearly defined evaluation scale.

  • 📊 Three difficulty levels: Tasks are organized into easy, medium, and hard tiers to support targeted model diagnostics.

  • 👩🏼‍🎓 Expert-curated tasks: All tasks and criteria are designed from scratch by domain experts to ensure quality and relevance. All instructions and criteria annotations are likewise developed and reviewed by expert panels to maintain consistent standards throughout the evaluation process.

  • 🤖 LLM-based evaluators: A suite of judge models (7B and 32B) trained to assess responses against specific criteria and generate score justifications. Supports custom criteria and evaluation scales via flexible input formatting (beta).

🚀 Quickstart

Score model outputs with POLLUX judges: demo.ipynb

To reproduce the evaluation results, please refer to the src/inference.py file:

```commandline
git clone https://github.com/ai-forever/POLLUX.git
cd POLLUX
pip install -r requirements.txt

python ./src/inference.py --test_path ai-forever/POLLUX --template_path src/data_utils/test_prompt_template_ru.yaml --num_proc 1 inference_offline_vllm --model_path ai-forever/pollux-judge-7b --tokenizer_path ai-forever/pollux-judge-7b --tensor_parallel_size 1 --answer_path pollux_judge_7b.json

python ./src/inference.py --test_path ai-forever/POLLUX --template_path src/data_utils/test_prompt_template_ru.yaml --num_proc 1 compute_metrics --answer_path logs/pollux_judge_7b.json
```
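For scripted runs, the two commands above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name `build_judge_commands` is hypothetical, and it uses only the flags and values shown in the README. The two `--answer_path` values differ in the README (with and without the `logs/` prefix), so the sketch keeps both exactly as given rather than guessing how the script resolves them.

```python
import shlex


def build_judge_commands(
    model_id="ai-forever/pollux-judge-7b",
    answer_path="pollux_judge_7b.json",
    metrics_answer_path="logs/pollux_judge_7b.json",
):
    """Return (inference_cmd, metrics_cmd) as argument lists.

    Hypothetical helper: it only mirrors the command lines from the README.
    """
    # Arguments shared by both invocations of src/inference.py.
    common = [
        "python", "./src/inference.py",
        "--test_path", "ai-forever/POLLUX",
        "--template_path", "src/data_utils/test_prompt_template_ru.yaml",
        "--num_proc", "1",
    ]
    # Step 1: offline vLLM inference with the judge model.
    inference_cmd = common + [
        "inference_offline_vllm",
        "--model_path", model_id,
        "--tokenizer_path", model_id,
        "--tensor_parallel_size", "1",
        "--answer_path", answer_path,
    ]
    # Step 2: metric computation over the saved judge answers.
    metrics_cmd = common + [
        "compute_metrics",
        "--answer_path", metrics_answer_path,
    ]
    return inference_cmd, metrics_cmd


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run(cmd, check=True)
    # from the repository root to execute them.
    for cmd in build_judge_commands():
        print(shlex.join(cmd))
```

Returning argument lists instead of shell strings avoids quoting problems when the paths are replaced with values that contain spaces.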

📂 Repository Structure

```
pollux/
├── images/                 # project logo
├── metainfo/               # benchmark metadata
├── clustering_demo.ipynb   # user logs analysis
├── src/                    # inference tools
├── src/inference.py        # reproduce evaluation
├── LICENSE                 # license
└── demo.ipynb              # inference demo
```

🌟 Benchmark

The POLLUX benchmark is built upon comprehensive taxonomies of generative tasks and evaluation criteria. Our taxonomy of generative tasks encompasses 35 general task groups organized across two hierarchical levels (functional styles/substyles and genres), covering a total of 152 distinct tasks. 📊

Our taxonomy of evaluation criteria features five comprehensive categories that assess:

  • 🔍 General & Critical: Core syntactic, lexical, and semantic text properties
  • 🎯 Domain-specific: Properties tied to specialized functional styles
  • ✅ Task-specific: Task-oriented markers and requirements
  • 💭 Subjective: Human preferences and subjective opinions

▎📈 Benchmark Scale & Coverage

The benchmark contains 2,100 unique instructions evenly distributed across all 35 task groups, with three complexity levels per group. Each instruction includes responses from 7 top-tier LLMs:

  • 🤖 OpenAI o1 & GPT-4o
  • 🧠 Claude 3.5 Sonnet
  • 🦙 Llama 405B
  • ⚡️ T-pro-it-1.0
  • 🔍 YandexGPT 4 Pro
  • 💎 GigaChat Max

This results in 11,500 total responses across the benchmark! 🚀
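The instruction counts above imply a simple per-group split. The sketch below works through that arithmetic only; it assumes the even distribution also holds across the three complexity levels within each group, which the README implies but does not state outright. It covers instruction counts, not the response total.

```python
# Figures taken from the README.
INSTRUCTIONS = 2_100
TASK_GROUPS = 35
LEVELS = 3


def per_group_counts(total=INSTRUCTIONS, groups=TASK_GROUPS, levels=LEVELS):
    """Return (instructions per group, instructions per level within a group).

    Assumes a perfectly even split; raises if the totals do not divide evenly.
    """
    per_group, remainder = divmod(total, groups)
    if remainder:
        raise ValueError("instructions do not split evenly across task groups")
    per_level, remainder = divmod(per_group, levels)
    if remainder:
        raise ValueError("group size does not split evenly across levels")
    return per_group, per_level


print(per_group_counts())  # (60, 20)
```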

▎🔬 Expert Evaluation Process

Every response is scrupulously evaluated using a tailored criteria set combining:

  • Critical, Subjective, and General criteria
  • Relevant Domain- and Task-specific criteria

With at least two expert evaluators per criterion, we've collected:

  • 471,000+ individual criteria estimates with textual rationales ✍️
  • 161,076 numerical scores aggregated over the annotator overlap 📊
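To illustrate how overlapping expert estimates can collapse into a single aggregate score, here is a minimal sketch. It rests on two assumptions not confirmed by the README: the aggregation function is a plain mean (the actual rule is not specified), and the record fields `instruction_id`, `model`, `criterion`, and `score` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(annotations):
    """Average overlapping expert scores per (instruction, model, criterion).

    Assumption: a simple mean; the benchmark's real aggregation may differ.
    """
    grouped = defaultdict(list)
    for record in annotations:
        key = (record["instruction_id"], record["model"], record["criterion"])
        grouped[key].append(record["score"])
    return {key: mean(scores) for key, scores in grouped.items()}


# Toy data: two experts rate the same response on two criteria.
annotations = [
    {"instruction_id": 1, "model": "model-a", "criterion": "creativity", "score": 2},
    {"instruction_id": 1, "model": "model-a", "criterion": "creativity", "score": 1},
    {"instruction_id": 1, "model": "model-a", "criterion": "absence of artifacts", "score": 2},
    {"instruction_id": 1, "model": "model-a", "criterion": "absence of artifacts", "score": 2},
]

aggregated = aggregate_scores(annotations)
print(aggregated[(1, "model-a", "creativity")])  # 1.5
```

Four individual estimates become two aggregate scores here, which mirrors why the aggregate count in the README is smaller than the count of individual estimates.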

▎🌐 Access & Exploration

Ready to dive in? Access the benchmark on its home page and explore the data through our interactive demo! 🎮

⚖️ Judges

POLLUX includes a family of LLM-based judges, trained to evaluate model outputs against scale-based criteria. The judges are designed to be flexible and can be adapted to different evaluation scales and criteria.

We provide two versions of the judges:

  • 7B (T-lite-based): A smaller model that is faster and more efficient, suitable for quick evaluations and lower resource environments.
  • 32B (T-pro-based): A larger model that provides more accurate evaluations, suitable for high-performance environments.

There are two architecture types in both sizes:

  • seq2seq: A sequence-to-sequence model that generates a score and its justification in a decoder-only manner as a joint text output.
  • regression (-r in HF model identifiers): A regression model that outputs a numeric score from an added regression head and generates the score justification in a decoder-only manner.
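The two sizes and two architecture types give four judge variants. The sketch below builds a Hugging Face identifier for each; the helper `judge_model_id` is hypothetical. Only `ai-forever/pollux-judge-7b` appears verbatim in the Quickstart, so the other three identifiers are extrapolated from that name and the `-r` suffix convention and should be checked against the Hugging Face collection before use.

```python
def judge_model_id(size="7b", regression=False, org="ai-forever"):
    """Build a judge identifier from its size and architecture type.

    Assumption: identifiers follow "<org>/pollux-judge-<size>[-r]", inferred
    from the 7B identifier in the Quickstart and the "-r" note above.
    """
    if size not in {"7b", "32b"}:
        raise ValueError(f"unknown judge size: {size!r}")
    suffix = "-r" if regression else ""
    return f"{org}/pollux-judge-{size}{suffix}"


# Enumerate all four variants (sizes x architecture types).
for size in ("7b", "32b"):
    for regression in (False, True):
        print(judge_model_id(size, regression))
```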

🔒 License

This project is licensed under the MIT License. See LICENSE for details.

Citation

If you use POLLUX in your research, please cite the following paper:

```bibtex
@misc{martynov2025eyejudgementdissectingevaluation,
  title={Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX},
  author={Nikita Martynov and Anastasia Mordasheva and Dmitriy Gorbetskiy and Danil Astafurov and Ulyana Isaeva and Elina Basyrova and Sergey Skachkov and Victoria Berestova and Nikolay Ivanov and Valeriia Zanina and Alena Fenogenova},
  year={2025},
  eprint={2505.24616},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.24616}
}
```


Made with ❤️ by the POLLUX team

Owner

  • Name: AI Forever
  • Login: ai-forever
  • Kind: organization
  • Location: Armenia

Creating ML for the future. AI projects you already know. We are a non-profit organization with members from all over the world.

GitHub Events

Total
  • Issues event: 1
  • Watch event: 10
  • Push event: 3
  • Pull request event: 1
Last Year
  • Issues event: 1
  • Watch event: 10
  • Push event: 3
  • Pull request event: 1

Issues and Pull Requests

Last synced: 10 months ago


Dependencies

requirements.txt pypi
  • datasets ==3.6.0
  • fire ==0.7.0
  • torch ==2.6.0
  • transformers ==4.53.0
  • vllm ==0.8.3