tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold dataset + on-demand multilingual synthetic packs), a generator CLI, and a CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.
Science Score: 77.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org
- ✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
- ○ Institutional organization owner: not found
- ○ JOSS paper metadata: not found
- ○ Scientific vocabulary similarity: low similarity (12.7%) to scientific vocabulary
Basic Info
Statistics
- Stars: 8
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
Tiny QA Benchmark++ (TQB++)
An ultra-lightweight evaluation dataset and synthetic generator
to expose critical LLM failures in seconds, ideal for CI/CD and LLMOps.
GitHub • Hugging Face Dataset • Paper • PyPI
Tiny QA Benchmark++ (TQB++) is an ultra-lightweight evaluation suite and Python package designed to expose critical failures in Large Language Model (LLM) systems within seconds. It serves as the unit-test layer for LLM software: ideal for rapid CI/CD checks, prompt engineering, and continuous quality assurance in modern LLMOps, used alongside existing LLM evaluation tooling such as Opik.
This repository contains the official implementation and synthetic datasets for the paper Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation.
- Paper: arXiv:2505.12058
- Hugging Face Hub: datasets/vincentkoc/tiny_qa_benchmark_pp
- GitHub Repository: vincentkoc/tiny_qa_benchmark_pp
Key Features
- Immutable Gold Standard Core: A 52-item hand-crafted English Question-Answering (QA) dataset (core_en) for deterministic regression testing, carried over from the earlier datasets/vincentkoc/tiny_qa_benchmark.
- Synthetic Customisation Toolkit: A Python script (tools/generator) using LiteLLM to generate bespoke micro-benchmarks on demand for any language, topic, or difficulty.
- Standardised Metadata: Artefacts packaged in Croissant JSON-LD format (metadata/) for discoverability and auto-loading by tools and search engines.
- Open Science: All code (generator, evaluation scripts) and the core English dataset are released under the Apache-2.0 license. Synthetically generated data packs carry a custom evaluation-only license.
- LLMOps Alignment: Designed for easy integration into CI/CD pipelines, prompt-engineering workflows, cross-lingual drift detection, and observability dashboards.
- Multilingual Packs: Pre-built packs for languages including English, French, Spanish, Portuguese, German, Chinese, Japanese, Turkish, Arabic, and Russian.
Using the tinyqabenchmarkpp Python Package
The core synthetic generation capabilities of TQB++ are available as a Python package, tinyqabenchmarkpp, which can be installed from PyPI.
Installation
```bash
pip install tinyqabenchmarkpp
```
(Note: Ensure you have Python 3.8+ and pip installed. If this command doesn't work, please check the official PyPI project page for the correct package name.)
Generating Synthetic Datasets via CLI
Once installed, you can use the tinyqabenchmarkpp command (or python -m tinyqabenchmarkpp.generate) to create custom QA datasets.
Example:
```bash
tinyqabenchmarkpp --num 10 --languages "en,es" --categories "science" --output-file "./science_pack.jsonl"
```
This will generate a small pack of 10 English and Spanish science questions.
For detailed instructions on all available parameters (like --model, --context, --difficulty, etc.), advanced usage, and examples for different LLM providers (OpenAI, OpenRouter, Ollama), please refer to the Generator Toolkit README or run tinyqabenchmarkpp --help.
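For instance, since generation is routed through LiteLLM, models are usually addressed with provider-prefixed model strings. The lines below are a sketch rather than verified defaults: they assume the --model flag accepts LiteLLM-style identifiers and that the relevant API key (e.g., OPENAI_API_KEY) is set in the environment; confirm flag names with tinyqabenchmarkpp --help.

```bash
# OpenAI-hosted model (assumes OPENAI_API_KEY is exported):
tinyqabenchmarkpp --num 10 --languages "fr" --categories "history" \
  --model "gpt-4o-mini" --output-file "./history_fr.jsonl"

# Local model via Ollama (assumes a running Ollama server; LiteLLM-style prefix):
tinyqabenchmarkpp --num 10 --languages "fr" --categories "history" \
  --model "ollama/llama3" --output-file "./history_fr.jsonl"
```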
While the tinyqabenchmarkpp package focuses on dataset generation, the TQB++ project also provides pre-generated datasets and evaluation tools, as described below.
Loading Datasets with Hugging Face datasets
The TQB++ datasets are available on the Hugging Face Hub and can be easily loaded using the datasets library. This is the recommended way to access the data.
```python
from datasets import load_dataset, get_dataset_config_names

# Discover available dataset configurations (e.g., core_en, pack_fr_40, etc.)
configs = get_dataset_config_names("vincentkoc/tiny_qa_benchmark_pp")
print(f"Available configurations: {configs}")

# Load the core English dataset (assuming 'core_en' is a configuration)
if "core_en" in configs:
    core_dataset = load_dataset("vincentkoc/tiny_qa_benchmark_pp", name="core_en", split="train")
    print(f"\nLoaded {len(core_dataset)} examples from core_en:")
    # print(core_dataset[0])  # Print the first example
else:
    print("\n'core_en' configuration not found.")

# Load a specific synthetic pack (e.g., a French pack)
# Replace 'pack_fr_40' with an actual configuration name from the configs list
example_pack_name = "pack_fr_40"  # or another valid config name
if example_pack_name in configs:
    synthetic_pack = load_dataset("vincentkoc/tiny_qa_benchmark_pp", name=example_pack_name, split="train")
    print(f"\nLoaded {len(synthetic_pack)} examples from {example_pack_name}:")
    # print(synthetic_pack[0])  # Print the first example
else:
    print(f"\n'{example_pack_name}' configuration not found. Please choose from available configurations.")
```
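The same library can also load packs you generate locally with the CLI. A short sketch, assuming the ./science_pack.jsonl file produced in the CLI example above:

```python
from datasets import load_dataset

# Load a locally generated JSONL pack (path from the CLI example above).
local_pack = load_dataset("json", data_files="./science_pack.jsonl", split="train")
print(f"Loaded {len(local_pack)} locally generated examples")
print(local_pack[0])  # Inspect the schema of a generated item
```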
For more detailed information on the datasets, including their structure and specific licenses, please see the README files within the data/ directory (i.e., data/README.md, data/core_en/README.md, and data/packs/README.md).
Repository Structure
- data/: Contains the QA datasets.
  - core_en/: The original 52-item human-curated English core dataset.
  - packs/: Synthetically generated multilingual and topical dataset packs.
- tools/: Contains scripts for dataset generation and evaluation.
  - generator/: The synthetic QA dataset generator.
  - eval/: Scripts and utilities for evaluating models against TQB++ datasets.
- paper/: The LaTeX source and associated files for the research paper.
- metadata/: Croissant JSON-LD metadata files for the datasets.
- LICENSE: Main license for the codebase (Apache-2.0).
- LICENCE.data_packs.md: Custom license for synthetically generated data packs.
- LICENCE.paper.md: License for the paper content.
Usage Scenarios
TQB++ is designed for various LLMOps and evaluation workflows:
- CI/CD Pipeline Testing: Use as a unit test with LLM testing tooling to catch regressions in LLM services (see the sketch after this list).
- Prompt Engineering & Agent Development: Get rapid feedback when iterating on prompts or agent designs.
- Evaluation Harness Integration: Designed for seamless use with evaluation harnesses. Encode as an OpenAI Evals YAML (see intergrations/openai-evals/README.md) or as an Opik dataset for dashboard tracking and robust LLM assessment. The intergrations/ folder details the available out-of-the-box support.
- Cross-Lingual Drift Detection: Monitor localization regressions using multilingual TQB++ packs.
- Adaptive Testing: Synthesize new micro-benchmarks on the fly, tailored to specific features or data drifts.
- Monitoring Fine-tuning Dynamics: Track knowledge erosion or unintended capability changes during fine-tuning.
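As a concrete illustration of the CI/CD scenario above, here is a minimal pytest-style sketch. It is a sketch only: my_llm_client is a hypothetical stand-in for your model call, the text/label field names are assumptions to verify against the dataset card, and the 0.80 threshold is a placeholder rather than a project default.

```python
# test_llm_smoke.py -- illustrative CI smoke test (names are assumptions).
import pytest
from datasets import load_dataset


def my_llm_client(question: str) -> str:
    """Placeholder for a real model call (e.g., your provider's SDK)."""
    raise NotImplementedError("wire this up to your LLM service")


@pytest.fixture(scope="session")
def core_dataset():
    # 'core_en' is the 52-item gold config shown in the loading example.
    return load_dataset("vincentkoc/tiny_qa_benchmark_pp", name="core_en", split="train")


def test_core_en_exact_match(core_dataset):
    # 'text' and 'label' are assumed field names; adjust to the actual schema.
    hits = sum(
        my_llm_client(row["text"]).strip().lower() == row["label"].strip().lower()
        for row in core_dataset
    )
    accuracy = hits / len(core_dataset)
    # 52 items run in seconds, so this can gate every deploy.
    assert accuracy >= 0.80, f"Regression: exact-match accuracy {accuracy:.2%}"
```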
Citation
If you use TQB++ in your research or work, please cite the original TQB and the TQB++ paper:
```bibtex
% This synthetic dataset and generator
@misc{koctinyqabenchmarkpp,
  author       = {Vincent Koc},
  title        = {Tiny QA Benchmark++ (TQB++) Datasets and Toolkit},
  year         = {2025},
  publisher    = {Hugging Face \& GitHub},
  doi          = {10.57967/hf/5531},
  howpublished = {\url{https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark_pp}},
  note         = {See also: \url{https://github.com/vincentkoc/tiny_qa_benchmark_pp}}
}

% This is the paper
@misc{koc2025tinyqabenchmarkultralightweight,
  title         = {Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation \& Smoke-Tests for Continuous LLM Evaluation},
  author        = {Vincent Koc},
  year          = {2025},
  eprint        = {2505.12058},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2505.12058}
}
```
License
The code in this repository (including the generator and evaluation scripts), the data/core_en dataset, and anything else without an explicit licence notice are licensed under the Apache License 2.0. See the LICENSE file for details.
The synthetically generated dataset packs in data/packs/ are distributed under a custom "Eval-Only, Non-Commercial, No-Derivatives" license. See LICENCE.data_packs.md for details.
The Croissant JSON-LD metadata files in metadata/ are available under CC0-1.0.
The paper content in paper/ is subject to its own license terms, detailed in LICENCE.paper.md.
Owner
- Name: Vincent Koc
- Login: vincentkoc
- Kind: user
- Location: San Francisco, CA
- Company: @comet-ml
- Website: https://linktr.ee/vincentkoc
- Twitter: vincent_koc
- Repositories: 101
- Profile: https://github.com/vincentkoc
R&D Community Eng @comet-ml | Ex @qantasairways @airbytehq @microsoft | AI and Data Specialist | Lecturer & Keynote Speaker | Ethical Hacker | First Responder
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software or dataset, please cite it as below."
authors:
  - family-names: "Koc"
    given-names: "Vincent"
    orcid: "https://orcid.org/0000-0002-7830-0017"
    affiliation: "Comet ML, Inc." # Optional: Add if desired
title: "Tiny QA Benchmark++ (TQB++) Datasets and Toolkit"
version: 0.1.0 # Corresponds to the planned PyPI package version, update with releases
doi: 10.57967/hf/5531 # DOI for the TQB++ dataset collection on Hugging Face
date-released: 2025 # Year of planned release or first major public version
# Optional: Information about the software repository itself
repository-code: "https://github.com/vincentkoc/tiny_qa_benchmark_pp"
url: "https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark_pp"
license: Apache-2.0 # License for the toolkit/code aspects
keywords:
  - "LLM Evaluation"
  - "QA Benchmarks"
  - "Synthetic Data"
  - "LLMOps"
  - "Continuous Integration"
  - "Smoke Testing"
  - "Question Answering"
  - "benchmark"
# Preferred citation is now the arXiv paper
preferred-citation:
  type: article
  authors:
    - family-names: "Koc"
      given-names: "Vincent"
      orcid: "https://orcid.org/0000-0002-7830-0017"
  title: "Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation"
  year: 2025
  status: "preprint"
  url: "https://arxiv.org/abs/2505.12058"
  identifiers:
    - type: arXiv
      value: "2505.12058"
    - type: doi
      value: "10.48550/arXiv.2505.12058"
      description: "Pending registration"
  # Keywords from the paper can be included if desired, e.g.:
  # keywords:
  #   - "Artificial Intelligence"
  #   - "Computation and Language"
  #   - "cs.AI"
  #   - "cs.CL"
# References section includes the dataset collection and other relevant works
references:
  # Citation for the TQB++ dataset collection and toolkit (no longer preferred)
  - type: generic
    authors:
      - family-names: "Koc"
        given-names: "Vincent"
        orcid: "https://orcid.org/0000-0002-7830-0017"
    title: "Tiny QA Benchmark++ (TQB++) Datasets and Toolkit"
    year: 2025
    institution:
      name: "Hugging Face & GitHub"
      url: "https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark_pp"
    doi: 10.57967/hf/5531
    notes: "See also: https://github.com/vincentkoc/tiny_qa_benchmark_pp"
  # Reference to the original Tiny QA Benchmark
  - type: dataset
    title: "tiny_qa_benchmark"
    authors:
      - family-names: "Koc"
        given-names: "Vincent"
        orcid: "https://orcid.org/0000-0002-7830-0017"
    year: 2025
    institution:
      name: "Hugging Face"
    doi: "10.57967/hf/5417"
    url: "https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark"
  # Reference to the preprint of the TQB++ paper (now also the preferred citation)
  - type: article
    title: "Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation"
    authors:
      - family-names: "Koc"
        given-names: "Vincent"
        orcid: "https://orcid.org/0000-0002-7830-0017"
    year: 2025
    status: "preprint"
    url: "https://arxiv.org/abs/2505.12058"
    identifiers:
      - type: arXiv
        value: "2505.12058"
      - type: doi
        value: "10.48550/arXiv.2505.12058"
        description: "Pending registration"
    keywords:
      - "Artificial Intelligence"
      - "Computation and Language"
      - "cs.AI"
      - "cs.CL"
```
GitHub Events
Total
- Release event: 11
- Watch event: 43
- Delete event: 5
- Push event: 31
- Create event: 10
Last Year
- Release event: 11
- Watch event: 43
- Delete event: 5
- Push event: 31
- Create event: 10
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Vincent Koc | v****c@i****g | 45 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- pypa/gh-action-pypi-publish release/v1 composite
- litellm *
- matplotlib *
- numpy *
- opik *
- pandas *
- scipy *
- seaborn *
- sklearn *
- statsmodels *
- tinyqabenchmarkpp *
- tqdm *
- litellm >=1.38.3
- opik *