datadecider

`transformers` based implementation of "Data Decide"

https://github.com/gtfintechlab/datadecider

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization gtfintechlab has institutional domain (fintech.gatech.edu)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

`transformers` based implementation of "Data Decide"

Basic Info
  • Host: GitHub
  • Owner: gtfintechlab
  • License: apache-2.0
  • Language: HTML
  • Default Branch: main
  • Size: 260 MB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created 9 months ago · Last pushed 8 months ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

DataDecider

A framework for training and evaluating Open Language Models (OLMo) using the DataDecide methodology for efficient data curation and model development.

Overview

DataDecider implements the DataDecide approach for training language models, which uses small-scale proxy experiments to predict which data mixtures will perform best at scale. This package provides:

  • Complete OLMo model implementation with 14 size variants (4M to 1B parameters)
  • DataDecide data curation pipeline with proxy metrics
  • Training infrastructure with distributed support
  • Evaluation suite for model assessment
  • Integration with Weights & Biases for experiment tracking

Installation

From GitHub (for use in other projects)

```bash
# Using pip
pip install git+https://github.com/yourusername/DataDecider.git

# Using uv
uv pip install git+https://github.com/yourusername/DataDecider.git

# For local development from another project
pip install -e /path/to/DataDecider
```

For Development

```bash
# Clone the repository
git clone https://github.com/yourusername/DataDecider.git
cd DataDecider

# Install in development mode with uv (recommended)
uv pip install -e ".[dev]"

# Or using pip
pip install -e ".[dev]"
```

Quick Start

1. Prepare Your Dataset

The framework expects tokenized datasets in HuggingFace format. You can use the provided scripts to prepare your data:

```bash
# Build a dataset from raw files
python -m data_decide.scripts.prepare_training_data \
    --input-dir ./raw_data \
    --output-dir ./processed_data \
    --tokenizer EleutherAI/gpt-neox-20b \
    --max-length 2048
```

2. Configure Your Model

Configuration files are in YAML format. Example for 4M model:

```yaml
# configs/model_configs/olmo_4m.yaml
model_size: "4M"
model_params:
  num_layers: 8
  hidden_size: 64
  num_attention_heads: 8
  vocab_size: 50254
```

3. Train Your Model

```bash
# Using the main training script
data-decide-train \
    --config configs/training_configs/olmo_4m_training.yaml

# Or use the enhanced version with rich UI
python -m data_decide.scripts.train_enhanced \
    --config configs/training_configs/olmo_4m_training.yaml
```

4. Monitor Training

```bash
# Real-time monitoring with rich terminal UI
data-decide-monitor --run-name my_training_run

# Or analyze completed runs
data-decide-analyze --wandb-run-path username/project/run_id
```

Key Features

DataDecide Methodology

The DataDecide approach involves:

  1. Proxy Dataset Creation: Generate multiple small datasets with different data mixtures
  2. Proxy Metrics: Compute perplexity, diversity, and quality scores without full training
  3. Mixture Selection: Choose the best data mixture based on proxy results
  4. Full Training: Train the model on the selected data mixture
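The selection loop above can be sketched in a few lines of Python. Everything here is illustrative: `score_mixture` stands in for whatever combination of proxy metrics (perplexity, diversity, quality) the pipeline actually computes, and the function and field names are hypothetical, not the package's API.

```python
# Illustrative sketch of the DataDecide selection loop (steps 2-3).
# Function names, field names, and weights are hypothetical stand-ins.

def score_mixture(mixture, weights=(0.5, 0.3, 0.2)):
    """Combine proxy metrics into a single score (higher is better).

    Perplexity is inverted so that lower perplexity raises the score.
    """
    w_ppl, w_div, w_qual = weights
    return (w_ppl * (1.0 / mixture["perplexity"])
            + w_div * mixture["diversity"]
            + w_qual * mixture["quality"])

def select_best_mixture(proxy_results):
    """Step 3: pick the candidate whose proxy score is highest."""
    return max(proxy_results, key=score_mixture)

# Step 1-2 would produce proxy metrics like these for each candidate:
candidates = [
    {"name": "web_only",      "perplexity": 28.0, "diversity": 0.55, "quality": 0.60},
    {"name": "web_plus_code", "perplexity": 24.0, "diversity": 0.70, "quality": 0.72},
]
best = select_best_mixture(candidates)
# Step 4 (full training) would then run on `best`.
```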

Model Sizes

Supported OLMo configurations:

| Model Size | Parameters | Hidden Size | Layers | Heads |
|------------|------------|-------------|--------|-------|
| 4M         | 3.7M       | 64          | 8      | 8     |
| 20M        | 18.6M      | 128         | 16     | 16    |
| 38M        | 36.9M      | 192         | 16     | 16    |
| 70M        | 66.8M      | 256         | 18     | 16    |
| 160M       | 152.2M     | 384         | 20     | 16    |
| 410M       | 390.2M     | 640         | 24     | 16    |
| 1B         | 982.3M     | 1024        | 28     | 16    |
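As a rough sanity check on the smaller entries, a vanilla decoder-only transformer's parameter count can be approximated from the config values. This back-of-the-envelope formula ignores OLMo-specific details (exact FFN width, norms, tied embeddings), so it only loosely tracks the table, especially for the larger models.

```python
# Rough parameter count for a vanilla decoder-only transformer:
# token embeddings plus about 12*h^2 weights per layer
# (~4*h^2 for attention, ~8*h^2 for the feed-forward block).
# A sanity check only, not an exact reproduction of the table.

def approx_params(hidden_size, num_layers, vocab_size):
    embedding = vocab_size * hidden_size
    per_layer = 12 * hidden_size ** 2
    return embedding + num_layers * per_layer

# The 4M config from the table: hidden size 64, 8 layers, vocab 50254.
# This gives ~3.6M, close to the listed 3.7M.
print(approx_params(64, 8, 50254))
```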

Training Features

  • Distributed Training: Full support for multi-GPU training via Accelerate
  • Mixed Precision: FP16/BF16 training for efficiency
  • Gradient Checkpointing: Memory-efficient training for larger models
  • Learning Rate Scheduling: Cosine decay with warmup
  • Comprehensive Monitoring: WANDB integration with system metrics and rich terminal UI
  • Pre-tokenized Data Pipeline: Efficient training with separated tokenization

Monitoring & Visualization

DataDecider includes a comprehensive monitoring system that provides both local and cloud-based tracking:

Rich Terminal UI

  • Real-time progress bars for epochs, steps, and evaluation
  • Live metrics display (loss, learning rate, GPU usage)
  • Beautiful colored output with system information
  • Time estimates and performance metrics

WANDB Integration

  • Automatic experiment tracking to Weights & Biases
  • System monitoring (GPU utilization, memory, temperature)
  • Model metrics (gradients, learning rates, predictions)
  • Checkpoint artifact management
  • Hyperparameter tracking and visualization

Quick Setup

```bash
# 1. Add to .env file
WANDB_API_KEY=your_api_key
WANDB_PROJECT=finpile_datadecide
WANDB_ENTITY=your_username

# 2. Run training (monitoring enabled by default)
uv run python examples/train_olmo_pretokenized.py --dataset tiny_100k
```

See docs/monitoring.md for complete documentation and docs/wandb-quickstart.md for a quick start guide.

Project Structure

```
DataDecider/
├── configs/                 # Configuration files
│   ├── model_configs/       # Model architecture configs
│   ├── training_configs/    # Training hyperparameters
│   └── data_configs/        # Data processing configs
├── data_decide/             # Main package
│   ├── olmo/                # OLMo implementation
│   │   ├── models/          # Model architecture
│   │   ├── data/            # Data processing
│   │   ├── training/        # Training logic
│   │   ├── evaluation/      # Evaluation metrics
│   │   └── utils/           # Utilities
│   └── scripts/             # Executable scripts
├── tests/                   # Unit tests
└── data/                    # Data directory (gitignored)
```

Data Management

This repository does not include the large training datasets. To obtain the data:

  1. Sample Data: A small sample dataset is included in tests/test_data/ for testing
  2. Full Datasets: See data/README.md for instructions on downloading the full arXiv datasets
  3. Custom Data: Use the data preparation scripts to process your own datasets

Configuration

Environment Variables

Create a .env file in the project root:

```bash
# Weights & Biases
WANDB_API_KEY=your_api_key_here
WANDB_PROJECT=olmo-datadecide
WANDB_ENTITY=your_entity

# Training
CUDA_VISIBLE_DEVICES=0,1,2,3
TOKENIZERS_PARALLELISM=false
```
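A `.env` file like the one above is typically loaded with a helper such as python-dotenv's `load_dotenv()`; the minimal reader below shows what that amounts to. It is a hypothetical illustration, not part of DataDecider.

```python
# Minimal .env reader: parse KEY=VALUE lines into os.environ.
# Hypothetical helper for illustration; real projects usually call
# python-dotenv's load_dotenv() instead.
import os

def load_env(path=".env"):
    loaded = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, comments, and malformed entries
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip()
            # Don't clobber variables already set in the environment
            os.environ.setdefault(key.strip(), value.strip())
    return loaded
```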

Training Configuration

Example training configuration:

```yaml
# Training parameters
model_size: "4M"
data_path: "./data/processed/olmo_4m_400M_tokens"
output_dir: "./checkpoints/olmo_4m_datadecide"
num_train_epochs: 1
per_device_train_batch_size: 8
gradient_accumulation_steps: 4
learning_rate: 1.4e-2
warmup_steps: 572
save_steps: 1000
eval_steps: 500
logging_steps: 10

# W&B configuration
report_to: ["wandb"]
wandb_project: "olmo-4m-datadecide"
wandb_name: "olmo-4m-arxiv-400M"
```

Development

Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=data_decide

# Run specific test
pytest tests/test_data_curation.py
```

Code Quality

```bash
# Format code
ruff format .

# Check style
ruff check .
```

Using DataDecider in Your Project

To use DataDecider in another project (like FinPileCode):

```python
from data_decide.olmo.models import OLMoForCausalLM, OLMoConfig
from data_decide.olmo.data import DataDecideCurator

# Initialize model
config = OLMoConfig.from_pretrained("olmo-4m")
model = OLMoForCausalLM(config)

# Use DataDecide for data curation
curator = DataDecideCurator()
proxy_datasets = curator.create_proxy_datasets(your_data)
best_mixture = curator.select_best_mixture(proxy_datasets)
```

Citation

If you use this framework in your research, please cite:

```bibtex
@software{datadecider,
  title  = {DataDecider: OLMo Training with DataDecide Methodology},
  author = {FinPile Team},
  year   = {2024},
  url    = {https://github.com/yourusername/DataDecider}
}
```

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Acknowledgments

  • OLMo architecture based on the paper "OLMo: Accelerating the Science of Language Models"
  • DataDecide methodology for efficient data curation
  • Built with HuggingFace Transformers and Accelerate

Owner

  • Name: Financial Services Innovation Lab, Georgia Tech
  • Login: gtfintechlab
  • Kind: organization
  • Location: United States of America

A hub for Finance education, research and industry in the Southeast.

Citation (citations/DataDecide.htm)

<!DOCTYPE html>
<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<title>How to Predict Best Pretraining Data with Small Experiments</title>
<!--Generated on Tue Apr 15 16:57:41 2025 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.-->
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/>
<link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/>
<link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/>
<link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/>
<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script>
<script src="/static/browse/0.3.4/js/addons_new.js"></script>
<script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script>
<base href="/html/2504.11393v1/"/></head>
<body>
<nav class="ltx_page_navbar">
<nav class="ltx_TOC">
<ol class="ltx_toclist">
<li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S1" title="In How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li>
<li class="ltx_tocentry ltx_tocentry_section">
<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2" title="In How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Methods</span></a>
<ol class="ltx_toclist ltx_toclist_section">
<li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS1" title="In 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.1 </span>The <span class="ltx_text ltx_font_smallcaps">DataDecide</span> Suite</span></a></li>
<li class="ltx_tocentry ltx_tocentry_subsection">
<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS2" title="In 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.2 </span>Prediction Methods</span></a>
<ol class="ltx_toclist ltx_toclist_subsection">
<li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS2.SSS0.Px1" title="In 2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title">Ranking Single Scale Experiments (Single Scale)</span></a></li>
<li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS2.SSS0.Px2" title="In 2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title">Extrapolating Scaling Laws (Multi Scale)</span></a></li>
</ol>
</li>
<li class="ltx_tocentry ltx_tocentry_subsection">
<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS3" title="In 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.3 </span>Prediction Metrics</span></a>
<ol class="ltx_toclist ltx_toclist_subsection">
<li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS3.SSS0.Px1" title="In 2.3 Prediction Metrics ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title">Prediction Error</span></a></li>
<li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS3.SSS0.Px2" title="In 2.3 Prediction Metrics ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title">Decision Accuracy</span></a></li>
<li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS3.SSS0.Px3" title="In 2.3 Prediction Metrics ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title">Percent of Target Compute Budget (<math alttext="\%C" class="ltx_math_unparsed" display="inline"><semantics><mrow><mo>%</mo><mi>C</mi></mrow><annotation encoding="application/x-tex">\%C</annotation><annotation encoding="application/x-llamapun">% italic_C</annotation></semantics></math>)</span></a></li>
</ol>
</li>
<li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS4" title="In 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.4 </span>Performance Evaluation with OLMES</span></a></li>
<li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS5" title="In 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2.5 </span>Proxy Metrics for Performance Evaluation</span></a></li>
</ol>
</li>
<li class="ltx_tocentry ltx_tocentry_section">
<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3" title="In How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>Results</span></a>
<ol class="ltx_toclist ltx_toclist_section">
<li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.SS1" title="In 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1 </span>What is the best way to spend compute for data decisions?</span></a></li>
<li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.SS2" title="In 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2 </span>How does extrapolating scaling laws compare to ranking single scale experiments?</span></a></li>
<li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.SS3" title="In 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.3 </span>What proxy metrics give better signal for predictions at small scale?</span></a></li>
<li class="ltx_tocentry ltx_tocentry_subsection"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.SS4" title="In 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.4 </span>How can we make evaluation benchmarks more predictable?</span></a></li>
</ol>
</li>
<li class="ltx_tocentry ltx_tocentry_section">
<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S4" title="In How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>Related Work</span></a>
<ol class="ltx_toclist ltx_toclist_section">
<li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S4.SS0.SSS0.Px1" title="In 4 Related Work ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title">Prediction</span></a></li>
<li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S4.SS0.SSS0.Px2" title="In 4 Related Work ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title">Suites over Data Differences</span></a></li>
</ol>
</li>
<li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S5" title="In How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Limitations</span></a></li>
<li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A1" title="In How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">A </span>Hyperparameters</span></a></li>
<li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A2" title="In How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">B </span>Proxy Metric Definitions</span></a></li>
<li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A3" title="In How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">C </span>Scaling Law Variants</span></a></li>
</ol></nav>
</nav>
<div class="ltx_page_main">
<div class="ltx_page_content">
<article class="ltx_document ltx_authors_1line">
<h1 class="ltx_title ltx_title_document">
<img alt="[Uncaptioned image]" class="ltx_graphics ltx_img_landscape" height="48" id="id1.g1" src="x1.png" width="287"/>
<br class="ltx_break"/>How to Predict Best Pretraining Data with Small Experiments</h1>
<div class="ltx_authors">
<span class="ltx_creator ltx_role_author">
<span class="ltx_personname">
<span class="ltx_tabular ltx_align_middle" id="id18.17.17">
<span class="ltx_tr" id="id6.5.5.5">
<span class="ltx_td ltx_align_left" id="id6.5.5.5.5">Ian Magnusson<sup class="ltx_sup" id="id6.5.5.5.5.1"><span class="ltx_text ltx_font_italic" id="id6.5.5.5.5.1.1">∗§‡</span></sup>,
Nguyen Tai<sup class="ltx_sup" id="id6.5.5.5.5.2"><span class="ltx_text ltx_font_italic" id="id6.5.5.5.5.2.1">∗∥</span></sup>,
Ben Bogin<sup class="ltx_sup" id="id6.5.5.5.5.3"><span class="ltx_text ltx_font_italic" id="id6.5.5.5.5.3.1">∗§</span></sup>,
David Heineman<sup class="ltx_sup" id="id6.5.5.5.5.4"><span class="ltx_text ltx_font_italic" id="id6.5.5.5.5.4.1">§</span></sup>,
Jena Hwang<sup class="ltx_sup" id="id6.5.5.5.5.5"><span class="ltx_text ltx_font_italic" id="id6.5.5.5.5.5.1">§</span></sup>,</span></span>
<span class="ltx_tr" id="id10.9.9.9">
<span class="ltx_td ltx_align_left" id="id10.9.9.9.4">Luca Soldaini<sup class="ltx_sup" id="id10.9.9.9.4.1"><span class="ltx_text ltx_font_italic" id="id10.9.9.9.4.1.1">§</span></sup>,
Akshita Bhagia<sup class="ltx_sup" id="id10.9.9.9.4.2"><span class="ltx_text ltx_font_italic" id="id10.9.9.9.4.2.1">§</span></sup>,
Jiacheng Liu<sup class="ltx_sup" id="id10.9.9.9.4.3"><span class="ltx_text ltx_font_italic" id="id10.9.9.9.4.3.1">§‡</span></sup>,
Dirk Groeneveld<sup class="ltx_sup" id="id10.9.9.9.4.4"><span class="ltx_text ltx_font_italic" id="id10.9.9.9.4.4.1">§</span></sup>,</span></span>
<span class="ltx_tr" id="id14.13.13.13">
<span class="ltx_td ltx_align_left" id="id14.13.13.13.4">Oyvind Tafjord<sup class="ltx_sup" id="id14.13.13.13.4.1"><span class="ltx_text ltx_font_italic" id="id14.13.13.13.4.1.1">§</span></sup>,
Noah A. Smith<sup class="ltx_sup" id="id14.13.13.13.4.2"><span class="ltx_text ltx_font_italic" id="id14.13.13.13.4.2.1">§‡</span></sup>,
Pang Wei Koh<sup class="ltx_sup" id="id14.13.13.13.4.3"><span class="ltx_text ltx_font_italic" id="id14.13.13.13.4.3.1">§‡</span></sup>,
Jesse Dodge<sup class="ltx_sup" id="id14.13.13.13.4.4"><span class="ltx_text ltx_font_italic" id="id14.13.13.13.4.4.1">§</span></sup></span></span>
<span class="ltx_tr" id="id18.17.17.17">
<span class="ltx_td ltx_align_left" id="id18.17.17.17.4"><sup class="ltx_sup" id="id18.17.17.17.4.1"><span class="ltx_text ltx_font_italic" id="id18.17.17.17.4.1.1">§</span></sup> Allen Institute for AI  <sup class="ltx_sup" id="id18.17.17.17.4.2"><span class="ltx_text ltx_font_italic" id="id18.17.17.17.4.2.1">‡</span></sup> University of Washington  <sup class="ltx_sup" id="id18.17.17.17.4.3"><span class="ltx_text ltx_font_italic" id="id18.17.17.17.4.3.1">∥</span></sup> University of Pennsylvania  <sup class="ltx_sup" id="id18.17.17.17.4.4"><span class="ltx_text ltx_font_italic" id="id18.17.17.17.4.4.1">∗</span></sup> equal contribution</span></span>
</span>
</span></span>
</div>
<div class="ltx_abstract">
<h6 class="ltx_title ltx_title_abstract">Abstract</h6>
<p class="ltx_p" id="id20.2">Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in <span class="ltx_text ltx_font_smallcaps" id="id20.2.1">DataDecide</span>—the most extensive open suite of models over differences in data and scale.
We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds.
We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (<math alttext="\sim 80" class="ltx_Math" display="inline" id="id19.1.m1.1"><semantics id="id19.1.m1.1a"><mrow id="id19.1.m1.1.1" xref="id19.1.m1.1.1.cmml"><mi id="id19.1.m1.1.1.2" xref="id19.1.m1.1.1.2.cmml"></mi><mo id="id19.1.m1.1.1.1" xref="id19.1.m1.1.1.1.cmml">∼</mo><mn id="id19.1.m1.1.1.3" xref="id19.1.m1.1.1.3.cmml">80</mn></mrow><annotation-xml encoding="MathML-Content" id="id19.1.m1.1b"><apply id="id19.1.m1.1.1.cmml" xref="id19.1.m1.1.1"><csymbol cd="latexml" id="id19.1.m1.1.1.1.cmml" xref="id19.1.m1.1.1.1">similar-to</csymbol><csymbol cd="latexml" id="id19.1.m1.1.1.2.cmml" xref="id19.1.m1.1.1.2">absent</csymbol><cn id="id19.1.m1.1.1.3.cmml" type="integer" xref="id19.1.m1.1.1.3">80</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="id19.1.m1.1c">\sim 80</annotation><annotation encoding="application/x-llamapun" id="id19.1.m1.1d">∼ 80</annotation></semantics></math>% of comparisons correct).
No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but <span class="ltx_text ltx_font_smallcaps" id="id20.2.2">DataDecide</span> can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval <math alttext="&gt;80" class="ltx_Math" display="inline" id="id20.2.m2.1"><semantics id="id20.2.m2.1a"><mrow id="id20.2.m2.1.1" xref="id20.2.m2.1.1.cmml"><mi id="id20.2.m2.1.1.2" xref="id20.2.m2.1.1.2.cmml"></mi><mo id="id20.2.m2.1.1.1" xref="id20.2.m2.1.1.1.cmml">&gt;</mo><mn id="id20.2.m2.1.1.3" xref="id20.2.m2.1.1.3.cmml">80</mn></mrow><annotation-xml encoding="MathML-Content" id="id20.2.m2.1b"><apply id="id20.2.m2.1.1.cmml" xref="id20.2.m2.1.1"><gt id="id20.2.m2.1.1.1.cmml" xref="id20.2.m2.1.1.1"></gt><csymbol cd="latexml" id="id20.2.m2.1.1.2.cmml" xref="id20.2.m2.1.1.2">absent</csymbol><cn id="id20.2.m2.1.1.3.cmml" type="integer" xref="id20.2.m2.1.1.3">80</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="id20.2.m2.1c">&gt;80</annotation><annotation encoding="application/x-llamapun" id="id20.2.m2.1d">&gt; 80</annotation></semantics></math>% predictable at the target 1B scale with just 0.01% of the compute.</p>
</div>
<figure class="ltx_figure" id="S0.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="416" id="S0.F1.g1" src="x2.png" width="830"/>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span>Which pretraining data to use? Ideally, compare performance of large models with fixed configurations averaged over random seeds (left). In practice, cheaper, smaller-scale experiments are used (center).
Here <span class="ltx_text ltx_font_smallcaps" id="S0.F1.2.1">DataDecide</span> measures accuracy of pairwise decisions between 25 pretraining corpora to find efficient prediction methods (right).
</figcaption>
</figure>
<section class="ltx_section" id="S1">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div class="ltx_para ltx_noindent" id="S1.p1">
<p class="ltx_p" id="S1.p1.1">The cost of training large language models (LMs) necessitates methods of trying out options at small scale, but it also makes it expensive to validate the accuracy of development decisions made with such methods. We focus on the question of choosing between pretraining datasets to use—one of the most impactful development decisions.
Common practice (e.g., <cite class="ltx_cite ltx_citemacro_citep">Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib24" title="">2024</a></cite>) uses a single, small scale of experiments to cheaply test pretraining data intended for larger-scale models, where scale is determined by number of model parameters and training tokens. The other predominant approach is to fit scaling laws <cite class="ltx_cite ltx_citemacro_citep">(Kaplan et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib21" title="">2020</a>; Hoffmann et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib19" title="">2022</a>; Choshen et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib8" title="">2024</a>)</cite> to the trend in performance observed over multiple small scales, with recent work extending this to the prediction of downstream performance instead of language modeling loss <cite class="ltx_cite ltx_citemacro_citep">(Gadre et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib14" title="">2024</a>; Dubey et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib13" title="">2024</a>; Bhagia et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>)</cite>.</p>
</div>
<div class="ltx_para ltx_noindent" id="S1.p2">
<p class="ltx_p" id="S1.p2.1">So far decision-making approaches have only been validated without observing the counterfactual outcome, either by producing a single large model on the chosen decision with impressive performance or by low error in predicting the magnitude of observed performance of a small number of large models. Knowing what amount of error in predicting performance over scale is low enough to actually make a correct decision among datasets requires a suite of comparable models trained on many datasets. Although a wide variety of open-source pretraining corpora are available, the scaling behavior of data is difficult to assess from off-the-shelf models that vary simultaneously in data, optimizer, and modeling decisions.</p>
</div>
<div class="ltx_para ltx_noindent" id="S1.p3">
<p class="ltx_p" id="S1.p3.1">To make it possible to empirically study what methods make the best decisions over data, we build <span class="ltx_text ltx_font_smallcaps" id="S1.p3.1.1">DataDecide<span class="ltx_note ltx_role_footnote" id="footnote1"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_tag ltx_tag_note"><span class="ltx_text ltx_font_upright" id="footnote1.1.1.1">1</span></span><a class="ltx_ref ltx_href ltx_font_upright" href="https://huggingface.co/collections/allenai/datadecide-67edb1d2bacba40b5d3ed633" title="">DataDecide collection on HuggingFace</a></span></span></span></span>—a suite of models we pretrain on 25 corpora up to 100B tokens, over 14 different model sizes ranging from 4M parameters up to 1B parameters (more than 30K model checkpoints in total). We evaluate all models across a suite of 10 downstream tasks and calculate how accurately small models predict which pretraining corpora lead to better performance at our largest scale. Our conclusions provide practical recommendations for the best benchmarks, prediction methods, and metrics to use to make decisions.</p>
</div>
<div class="ltx_para ltx_noindent" id="S1.p4">
<p class="ltx_p" id="S1.p4.1">We call the 25 corpora we train on <span class="ltx_text ltx_font_italic" id="S1.p4.1.1">data recipes</span> as they range across popular corpora including Dolma <cite class="ltx_cite ltx_citemacro_citep">(Soldaini et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib38" title="">2024</a>)</cite>, DCLM <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib24" title="">2024</a>)</cite>, RefinedWeb <cite class="ltx_cite ltx_citemacro_citep">(Penedo et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib29" title="">2023</a>)</cite>, C4 <cite class="ltx_cite ltx_citemacro_citep">(Raffel et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib32" title="">2019</a>)</cite>, and FineWeb <cite class="ltx_cite ltx_citemacro_citep">(Penedo et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib30" title="">2024</a>)</cite> as well as combinations of interventions on these datasets such as source mixing, deduplication, and filtering. Previous work has considered only 2 <cite class="ltx_cite ltx_citemacro_citep">(Biderman et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib4" title="">2023</a>)</cite> or 6 recipes <cite class="ltx_cite ltx_citemacro_citep">(Magnusson et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib25" title="">2024</a>; Brandfonbrener et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib6" title="">2024</a>)</cite>.
We also offer a novel affordance by including 3 random seed reruns for even our largest runs, to help quantify whether variation occurs due to random initialization and data order or differences in the distribution of data.</p>
</div>
<div class="ltx_para ltx_noindent" id="S1.p5">
<p class="ltx_p" id="S1.p5.1">Concretely, <span class="ltx_text ltx_font_smallcaps" id="S1.p5.1.1">DataDecide</span> allows analyses such as Figure <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S0.F1" title="Figure 1 ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">1</span></a> (right), which shows the relationship between compute used to predict a ranking of datasets and how accurately that ranking reflects mean performance over 3 seed runs (quantified here by OLMES; <cite class="ltx_cite ltx_citemacro_citep">Gu et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib17" title="">2024</a></cite>) for models fully trained on those datasets at the target (1B) scale. We measure the accuracy of decisions as the percent of compared pairs of datasets where the prediction identifies the correct winner. Each point represents the average decision accuracy of a given method over 3 prediction attempts using small models with different random seeds, and shading shows standard deviation.</p>
</div>
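The pairwise decision-accuracy metric described above is easy to compute directly. The sketch below is illustrative (function, variable, and recipe names are ours, not from the released code): for every pair of recipes, it checks whether the recipe that wins at the small scale also wins at the target scale.

```python
from itertools import combinations

def decision_accuracy(small_scores, target_scores):
    """Percent of recipe pairs where the small-scale ranking picks the same
    winner as the target-scale (e.g. fully trained 1B) ranking.
    Both arguments map recipe name -> benchmark score."""
    pairs = list(combinations(small_scores, 2))
    correct = 0
    for a, b in pairs:
        small_winner = a if small_scores[a] >= small_scores[b] else b
        target_winner = a if target_scores[a] >= target_scores[b] else b
        correct += small_winner == target_winner
    return 100.0 * correct / len(pairs)

# Toy example with three recipes: the small scale swaps one pair,
# so 2 of 3 pairwise comparisons are decided correctly.
small = {"dolma": 0.40, "c4": 0.38, "dclm": 0.45}
target = {"dolma": 0.52, "c4": 0.55, "dclm": 0.60}
print(decision_accuracy(small, target))
```

In the paper this quantity is additionally averaged over 3 prediction attempts with different small-model seeds; that outer loop is omitted here.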
<div class="ltx_para ltx_noindent" id="S1.p6">
<p class="ltx_p" id="S1.p6.1">Measuring the tradeoff of compute cost to better decisions lets us make the following recommendations about small experiments for making data decisions:</p>
<ul class="ltx_itemize" id="S1.I1">
<li class="ltx_item" id="S1.I1.i1" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div class="ltx_para" id="S1.I1.i1.p1">
<p class="ltx_p" id="S1.I1.i1.p1.1">§<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.SS1" title="3.1 What is the best way to spend compute for data decisions? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">3.1</span></a> – The amount of compute you need to allocate for a given decision accuracy depends heavily on task. MMLU and ARC are much cheaper to predict than HellaSwag and some tasks such as SocialIQA are difficult to predict at all scales.</p>
</div>
</li>
<li class="ltx_item" id="S1.I1.i2" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div class="ltx_para" id="S1.I1.i2.p1">
<p class="ltx_p" id="S1.I1.i2.p1.1">§<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.SS2" title="3.2 How does extrapolating scaling laws compare to ranking single scale experiments? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">3.2</span></a> – 8 baseline scaling law methods do not exceed the compute to decision accuracy frontier set by ranking single scale experiments.</p>
</div>
</li>
<li class="ltx_item" id="S1.I1.i3" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div class="ltx_para" id="S1.I1.i3.p1">
<p class="ltx_p" id="S1.I1.i3.p1.1">§<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.SS3" title="3.3 What proxy metrics give better signal for predictions at small scale? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">3.3</span></a> – At small scales, continuous metrics using answer likelihood are better or equivalent predictors of decisions than using the same discrete accuracy target metric.</p>
</div>
</li>
<li class="ltx_item" id="S1.I1.i4" style="list-style-type:none;">
<span class="ltx_tag ltx_tag_item">•</span>
<div class="ltx_para ltx_noindent" id="S1.I1.i4.p1">
<p class="ltx_p" id="S1.I1.i4.p1.1">§<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.SS4" title="3.4 How can we make evaluation benchmarks more predictable? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">3.4</span></a> – Better decisions can be explained in part by low run-to-run variance and a wide spread of benchmark performance values for different data, traits which can be improved by proxy metrics.</p>
</div>
</li>
</ul>
</div>
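The contrast in §3.3 between discrete and continuous metrics can be made concrete with a minimal sketch for multiple-choice evaluation: the usual discrete metric checks whether the gold choice has the highest likelihood, while a continuous proxy records the probability mass the model puts on the gold choice. Function names and toy numbers here are ours, not the paper's exact definitions.

```python
import math

def discrete_accuracy(logprobs, gold):
    """Fraction of questions where the gold choice has the highest
    log-likelihood -- the usual discrete accuracy metric."""
    return sum(max(range(len(lp)), key=lp.__getitem__) == g
               for lp, g in zip(logprobs, gold)) / len(gold)

def continuous_correct_prob(logprobs, gold):
    """Mean probability mass on the gold choice after a softmax over the
    choices -- a continuous proxy that still varies when argmax does not."""
    total = 0.0
    for lp, g in zip(logprobs, gold):
        z = sum(math.exp(x) for x in lp)
        total += math.exp(lp[g]) / z
    return total / len(gold)

# Two questions, three choices each; gold is the correct choice index.
logprobs = [[-1.0, -2.0, -3.0], [-2.5, -0.5, -4.0]]
gold = [0, 1]
print(discrete_accuracy(logprobs, gold))        # both argmax correct -> 1.0
print(continuous_correct_prob(logprobs, gold))  # < 1.0, carries extra signal
```

The continuous value keeps moving as a small model improves even while its argmax predictions, and hence discrete accuracy, stay flat, which is one intuition for why such proxies give better signal at small scale.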
<div class="ltx_para ltx_noindent" id="S1.p7">
<p class="ltx_p" id="S1.p7.1">Future research can extend <span class="ltx_text ltx_font_smallcaps" id="S1.p7.1.1">DataDecide</span> with little extra compute by running new evaluations on our checkpoints, pretraining additional small models to compare against the large target models we provide, or trying new prediction methods with lightweight manipulations such as smoothing and curve fitting on top of our released evaluation results.</p>
</div>
</section>
<section class="ltx_section" id="S2">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">2 </span>Methods</h2>
<div class="ltx_para ltx_noindent" id="S2.p1">
<p class="ltx_p" id="S2.p1.1">Our aim is to empirically test the predictability of downstream performance at a larger, target scale using small experiments. We describe <span class="ltx_text ltx_font_smallcaps" id="S2.p1.1.1">DataDecide</span> §<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS1" title="2.1 The DataDecide Suite ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2.1</span></a>, the prediction methods we examine §<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS2" title="2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2.2</span></a>, the metrics we use to assess predictions §<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS3" title="2.3 Prediction Metrics ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2.3</span></a>, how we measure downstream performance §<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS4" title="2.4 Performance Evaluation with OLMES ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2.4</span></a>, and proxy metrics for our performance evaluations §<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS5" title="2.5 Proxy Metrics for Performance Evaluation ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2.5</span></a>.
We will release all models, checkpoints, pretraining corpora, and evaluations.</p>
</div>
<figure class="ltx_table" id="S2.T1">
<table class="ltx_tabular ltx_centering ltx_align_middle" id="S2.T1.3">
<tr class="ltx_tr" id="S2.T1.3.4">
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="S2.T1.3.4.1" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.4.1.1">
<span class="ltx_p" id="S2.T1.3.4.1.1.1" style="width:91.4pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.4.1.1.1.1">Source / Recipe</span></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="S2.T1.3.4.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.4.2.1">
<span class="ltx_p" id="S2.T1.3.4.2.1.1" style="width:286.2pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.4.2.1.1.1">Description</span></span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="S2.T1.3.5">
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="S2.T1.3.5.1" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.5.1.1">
<span class="ltx_p" id="S2.T1.3.5.1.1.1" style="width:91.4pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.5.1.1.1.1">Dolma1.7</span> <em class="ltx_emph ltx_font_italic" id="S2.T1.3.5.1.1.1.2">Original, No code, No math/code, No Reddit, No Flan</em></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="S2.T1.3.5.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.5.2.1">
<span class="ltx_p" id="S2.T1.3.5.2.1.1" style="width:286.2pt;">A 2.3T-token corpus (Dolma 1.7 <cite class="ltx_cite ltx_citemacro_citep">Soldaini et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib38" title="">2024</a></cite>) sampling common LM sources for open research. We ablate code, math/code, Reddit, or Flan subsets.</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="S2.T1.3.6">
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.6.1" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.6.1.1">
<span class="ltx_p" id="S2.T1.3.6.1.1.1" style="width:91.4pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.6.1.1.1.1">Dolma1.6++</span> <em class="ltx_emph ltx_font_italic" id="S2.T1.3.6.1.1.1.2">Original</em></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.6.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.6.2.1">
<span class="ltx_p" id="S2.T1.3.6.2.1.1" style="width:286.2pt;">Dolma 1.6 plus additional sources from Dolma 1.7: RedPajama’s arxiv subset, openwebmath, algebraic stack, flan, starcoder, falcon.</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="S2.T1.3.7">
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.7.1" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.7.1.1">
<span class="ltx_p" id="S2.T1.3.7.1.1.1" style="width:91.4pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.7.1.1.1.1">C4</span> <em class="ltx_emph ltx_font_italic" id="S2.T1.3.7.1.1.1.2">Original</em></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.7.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.7.2.1">
<span class="ltx_p" id="S2.T1.3.7.2.1.1" style="width:286.2pt;">The C4 dataset <cite class="ltx_cite ltx_citemacro_citep">(Raffel et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib32" title="">2019</a>)</cite> as prepared in Dolma 1.7, heuristically filtered from the April 2019 Common Crawl.</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="S2.T1.3.8">
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.8.1" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.8.1.1">
<span class="ltx_p" id="S2.T1.3.8.1.1.1" style="width:91.4pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.8.1.1.1.1">FineWeb-Pro</span> <em class="ltx_emph ltx_font_italic" id="S2.T1.3.8.1.1.1.2">Original</em></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.8.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.8.2.1">
<span class="ltx_p" id="S2.T1.3.8.2.1.1" style="width:286.2pt;">The FineWeb Pro corpus <cite class="ltx_cite ltx_citemacro_citep">(Zhou et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib42" title="">2024</a>)</cite>, featuring model-driven data cleaning on FineWeb.</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="S2.T1.3.9">
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.9.1" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.9.1.1">
<span class="ltx_p" id="S2.T1.3.9.1.1.1" style="width:91.4pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.9.1.1.1.1">FineWeb-Edu</span> <em class="ltx_emph ltx_font_italic" id="S2.T1.3.9.1.1.1.2">Original</em></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.9.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.9.2.1">
<span class="ltx_p" id="S2.T1.3.9.2.1.1" style="width:286.2pt;">The deduplicated FineWeb-Edu subset of SmolLM-Corpus <cite class="ltx_cite ltx_citemacro_citep">(Ben Allal et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib2" title="">2024</a>)</cite>, focused on educational web pages.</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="S2.T1.3.10">
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.10.1" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.10.1.1">
<span class="ltx_p" id="S2.T1.3.10.1.1.1" style="width:91.4pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.10.1.1.1.1">Falcon</span> <em class="ltx_emph ltx_font_italic" id="S2.T1.3.10.1.1.1.2">Original</em></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.10.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.10.2.1">
<span class="ltx_p" id="S2.T1.3.10.2.1.1" style="width:286.2pt;">The Falcon RefinedWeb corpus <cite class="ltx_cite ltx_citemacro_citep">(Penedo et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib29" title="">2023</a>)</cite> in Dolma 1.7, derived from Common Crawl through June 2023 and more aggressively filtered/deduplicated than C4.</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="S2.T1.3.11">
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.11.1" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.11.1.1">
<span class="ltx_p" id="S2.T1.3.11.1.1.1" style="width:91.4pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.11.1.1.1.1">Falcon+CC</span> <em class="ltx_emph ltx_font_italic" id="S2.T1.3.11.1.1.1.2">Original, QC 10%, QC 20%, QC Orig 10%, QC Tulu 10%</em></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.11.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.11.2.1">
<span class="ltx_p" id="S2.T1.3.11.2.1.1" style="width:286.2pt;">Falcon and Dolma 1.7’s Common Crawl. We quality filter to top 10% or 20% documents with reproduced or original <cite class="ltx_cite ltx_citemacro_cite">Li et al. (<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib24" title="">2024</a>)</cite> filter or retrain filter on pre-release version of Tulu-v3 <cite class="ltx_cite ltx_citemacro_citep">(Lambert et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib22" title="">2024</a>)</cite>.</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="S2.T1.3.12">
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.12.1" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.12.1.1">
<span class="ltx_p" id="S2.T1.3.12.1.1.1" style="width:91.4pt;"><span class="ltx_text ltx_font_bold" id="S2.T1.3.12.1.1.1.1">DCLM-Baseline</span> <em class="ltx_emph ltx_font_italic" id="S2.T1.3.12.1.1.1.2">Original, QC 7% FW2, QC 7% FW3, QC FW 3%, QC FW 10%, QC 10%, QC 20%</em></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="S2.T1.3.12.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.12.2.1">
<span class="ltx_p" id="S2.T1.3.12.2.1.1" style="width:286.2pt;">A SOTA Common Crawl corpus using best ablated deduplication, cleaning heuristics, and quality filter. We quality filter to top 7% of DCLM classified documents and further take 2+ or 3+ scores with FineWeb-edu classifier; or filter to top 3% or 10% with FineWeb-edu classifier; or take top
10% or 20% with reproduced DCLM classifier.</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="S2.T1.3.3">
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="S2.T1.2.2.2" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.2.2.2.2">
<span class="ltx_p" id="S2.T1.2.2.2.2.2" style="width:91.4pt;"><em class="ltx_emph ltx_font_italic" id="S2.T1.1.1.1.1.1.1"><math alttext="\lambda" class="ltx_Math" display="inline" id="S2.T1.1.1.1.1.1.1.m1.1"><semantics id="S2.T1.1.1.1.1.1.1.m1.1a"><mi id="S2.T1.1.1.1.1.1.1.m1.1.1" xref="S2.T1.1.1.1.1.1.1.m1.1.1.cmml">λ</mi><annotation-xml encoding="MathML-Content" id="S2.T1.1.1.1.1.1.1.m1.1b"><ci id="S2.T1.1.1.1.1.1.1.m1.1.1.cmml" xref="S2.T1.1.1.1.1.1.1.m1.1.1">𝜆</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.T1.1.1.1.1.1.1.m1.1c">\lambda</annotation><annotation encoding="application/x-llamapun" id="S2.T1.1.1.1.1.1.1.m1.1d">italic_λ</annotation></semantics></math>%</em> <span class="ltx_text ltx_font_bold" id="S2.T1.2.2.2.2.2.3">DCLM-Baseline</span> + <em class="ltx_emph ltx_font_italic" id="S2.T1.2.2.2.2.2.2"><math alttext="1-\lambda" class="ltx_Math" display="inline" id="S2.T1.2.2.2.2.2.2.m1.1"><semantics id="S2.T1.2.2.2.2.2.2.m1.1a"><mrow id="S2.T1.2.2.2.2.2.2.m1.1.1" xref="S2.T1.2.2.2.2.2.2.m1.1.1.cmml"><mn id="S2.T1.2.2.2.2.2.2.m1.1.1.2" xref="S2.T1.2.2.2.2.2.2.m1.1.1.2.cmml">1</mn><mo id="S2.T1.2.2.2.2.2.2.m1.1.1.1" xref="S2.T1.2.2.2.2.2.2.m1.1.1.1.cmml">−</mo><mi id="S2.T1.2.2.2.2.2.2.m1.1.1.3" xref="S2.T1.2.2.2.2.2.2.m1.1.1.3.cmml">λ</mi></mrow><annotation-xml encoding="MathML-Content" id="S2.T1.2.2.2.2.2.2.m1.1b"><apply id="S2.T1.2.2.2.2.2.2.m1.1.1.cmml" xref="S2.T1.2.2.2.2.2.2.m1.1.1"><minus id="S2.T1.2.2.2.2.2.2.m1.1.1.1.cmml" xref="S2.T1.2.2.2.2.2.2.m1.1.1.1"></minus><cn id="S2.T1.2.2.2.2.2.2.m1.1.1.2.cmml" type="integer" xref="S2.T1.2.2.2.2.2.2.m1.1.1.2">1</cn><ci id="S2.T1.2.2.2.2.2.2.m1.1.1.3.cmml" xref="S2.T1.2.2.2.2.2.2.m1.1.1.3">𝜆</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.T1.2.2.2.2.2.2.m1.1c">1-\lambda</annotation><annotation encoding="application/x-llamapun" id="S2.T1.2.2.2.2.2.2.m1.1d">1 - italic_λ</annotation></semantics></math>%</em> <span class="ltx_text ltx_font_bold" 
id="S2.T1.2.2.2.2.2.4">Dolma1.7</span></span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="S2.T1.3.3.3" style="padding-top:2.5pt;padding-bottom:2.5pt;">
<span class="ltx_inline-block ltx_align_top" id="S2.T1.3.3.3.1">
<span class="ltx_p" id="S2.T1.3.3.3.1.1" style="width:286.2pt;">Fractional combinations of Dolma1.7 and DCLM-Baseline
mixing different proportions of the two datasets for <math alttext="\lambda\in\{25\%,50\%,75\%\}" class="ltx_Math" display="inline" id="S2.T1.3.3.3.1.1.m1.3"><semantics id="S2.T1.3.3.3.1.1.m1.3a"><mrow id="S2.T1.3.3.3.1.1.m1.3.3" xref="S2.T1.3.3.3.1.1.m1.3.3.cmml"><mi id="S2.T1.3.3.3.1.1.m1.3.3.5" xref="S2.T1.3.3.3.1.1.m1.3.3.5.cmml">λ</mi><mo id="S2.T1.3.3.3.1.1.m1.3.3.4" xref="S2.T1.3.3.3.1.1.m1.3.3.4.cmml">∈</mo><mrow id="S2.T1.3.3.3.1.1.m1.3.3.3.3" xref="S2.T1.3.3.3.1.1.m1.3.3.3.4.cmml"><mo id="S2.T1.3.3.3.1.1.m1.3.3.3.3.4" stretchy="false" xref="S2.T1.3.3.3.1.1.m1.3.3.3.4.cmml">{</mo><mrow id="S2.T1.3.3.3.1.1.m1.1.1.1.1.1" xref="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.cmml"><mn id="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.2" xref="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.2.cmml">25</mn><mo id="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.1" xref="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.1.cmml">%</mo></mrow><mo id="S2.T1.3.3.3.1.1.m1.3.3.3.3.5" xref="S2.T1.3.3.3.1.1.m1.3.3.3.4.cmml">,</mo><mrow id="S2.T1.3.3.3.1.1.m1.2.2.2.2.2" xref="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.cmml"><mn id="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.2" xref="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.2.cmml">50</mn><mo id="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.1" xref="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.1.cmml">%</mo></mrow><mo id="S2.T1.3.3.3.1.1.m1.3.3.3.3.6" xref="S2.T1.3.3.3.1.1.m1.3.3.3.4.cmml">,</mo><mrow id="S2.T1.3.3.3.1.1.m1.3.3.3.3.3" xref="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.cmml"><mn id="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.2" xref="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.2.cmml">75</mn><mo id="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.1" xref="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.1.cmml">%</mo></mrow><mo id="S2.T1.3.3.3.1.1.m1.3.3.3.3.7" stretchy="false" xref="S2.T1.3.3.3.1.1.m1.3.3.3.4.cmml">}</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.T1.3.3.3.1.1.m1.3b"><apply id="S2.T1.3.3.3.1.1.m1.3.3.cmml" xref="S2.T1.3.3.3.1.1.m1.3.3"><in id="S2.T1.3.3.3.1.1.m1.3.3.4.cmml" xref="S2.T1.3.3.3.1.1.m1.3.3.4"></in><ci id="S2.T1.3.3.3.1.1.m1.3.3.5.cmml" xref="S2.T1.3.3.3.1.1.m1.3.3.5">𝜆</ci><set 
id="S2.T1.3.3.3.1.1.m1.3.3.3.4.cmml" xref="S2.T1.3.3.3.1.1.m1.3.3.3.3"><apply id="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.cmml" xref="S2.T1.3.3.3.1.1.m1.1.1.1.1.1"><csymbol cd="latexml" id="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.1.cmml" xref="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.1">percent</csymbol><cn id="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.2.cmml" type="integer" xref="S2.T1.3.3.3.1.1.m1.1.1.1.1.1.2">25</cn></apply><apply id="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.cmml" xref="S2.T1.3.3.3.1.1.m1.2.2.2.2.2"><csymbol cd="latexml" id="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.1.cmml" xref="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.1">percent</csymbol><cn id="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.2.cmml" type="integer" xref="S2.T1.3.3.3.1.1.m1.2.2.2.2.2.2">50</cn></apply><apply id="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.cmml" xref="S2.T1.3.3.3.1.1.m1.3.3.3.3.3"><csymbol cd="latexml" id="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.1.cmml" xref="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.1">percent</csymbol><cn id="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.2.cmml" type="integer" xref="S2.T1.3.3.3.1.1.m1.3.3.3.3.3.2">75</cn></apply></set></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.T1.3.3.3.1.1.m1.3c">\lambda\in\{25\%,50\%,75\%\}</annotation><annotation encoding="application/x-llamapun" id="S2.T1.3.3.3.1.1.m1.3d">italic_λ ∈ { 25 % , 50 % , 75 % }</annotation></semantics></math>.</span>
</span>
</td>
</tr>
</table>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 1: </span><span class="ltx_text ltx_font_smallcaps" id="S2.T1.5.1">DataDecide</span> enables the study of data differences over scales through controlled pretraining experiments on 25 data recipes. These take different source datasets and apply interventions from ablating domains, deduplication, mixing, to quality filtering with different classifiers and thresholds. We release all pretraining corpora, as well as models trained on each recipe and each of the 14 model configurations in Table <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A1.T2" title="Table 2 ‣ Appendix A Hyperparameters ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2</span></a> with 3 random seeds.</figcaption>
</figure>
<section class="ltx_subsection" id="S2.SS1">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.1 </span>The <span class="ltx_text ltx_font_smallcaps" id="S2.SS1.1.1">DataDecide</span> Suite</h3>
<div class="ltx_para ltx_noindent" id="S2.SS1.p1">
<p class="ltx_p" id="S2.SS1.p1.2">We pretrain a suite of 1,050 models using 25 data recipes <math alttext="\times" class="ltx_Math" display="inline" id="S2.SS1.p1.1.m1.1"><semantics id="S2.SS1.p1.1.m1.1a"><mo id="S2.SS1.p1.1.m1.1.1" xref="S2.SS1.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.1.m1.1b"><times id="S2.SS1.p1.1.m1.1.1.cmml" xref="S2.SS1.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.1.m1.1d">×</annotation></semantics></math> 14 model scales <math alttext="\times" class="ltx_Math" display="inline" id="S2.SS1.p1.2.m2.1"><semantics id="S2.SS1.p1.2.m2.1a"><mo id="S2.SS1.p1.2.m2.1.1" xref="S2.SS1.p1.2.m2.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S2.SS1.p1.2.m2.1b"><times id="S2.SS1.p1.2.m2.1.1.cmml" xref="S2.SS1.p1.2.m2.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p1.2.m2.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p1.2.m2.1d">×</annotation></semantics></math> 3 random seeds for initialization and data order. Table <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.T1" title="Table 1 ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">1</span></a> describes the 25 data recipes included in <span class="ltx_text ltx_font_smallcaps" id="S2.SS1.p1.2.1">DataDecide</span> that aim to provide coverage of common data preparation choices such as deduplication, ablating domains, mixes of existing datasets, as well as quality filters with different implementations, training data, and thresholds for quality classifiers.</p>
</div>
<div class="ltx_para ltx_noindent" id="S2.SS1.p2">
<p class="ltx_p" id="S2.SS1.p2.2">We select a token to parameter ratio of 100, which at 5<math alttext="\times" class="ltx_Math" display="inline" id="S2.SS1.p2.1.m1.1"><semantics id="S2.SS1.p2.1.m1.1a"><mo id="S2.SS1.p2.1.m1.1.1" xref="S2.SS1.p2.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.1.m1.1b"><times id="S2.SS1.p2.1.m1.1.1.cmml" xref="S2.SS1.p2.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.1.m1.1d">×</annotation></semantics></math> “Chinchilla” (5 <math alttext="\times~{}C" class="ltx_Math" display="inline" id="S2.SS1.p2.2.m2.1"><semantics id="S2.SS1.p2.2.m2.1a"><mrow id="S2.SS1.p2.2.m2.1.1" xref="S2.SS1.p2.2.m2.1.1.cmml"><mi id="S2.SS1.p2.2.m2.1.1.2" xref="S2.SS1.p2.2.m2.1.1.2.cmml"></mi><mo id="S2.SS1.p2.2.m2.1.1.1" lspace="0.222em" rspace="0.552em" xref="S2.SS1.p2.2.m2.1.1.1.cmml">×</mo><mi id="S2.SS1.p2.2.m2.1.1.3" xref="S2.SS1.p2.2.m2.1.1.3.cmml">C</mi></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p2.2.m2.1b"><apply id="S2.SS1.p2.2.m2.1.1.cmml" xref="S2.SS1.p2.2.m2.1.1"><times id="S2.SS1.p2.2.m2.1.1.1.cmml" xref="S2.SS1.p2.2.m2.1.1.1"></times><csymbol cd="latexml" id="S2.SS1.p2.2.m2.1.1.2.cmml" xref="S2.SS1.p2.2.m2.1.1.2">absent</csymbol><ci id="S2.SS1.p2.2.m2.1.1.3.cmml" xref="S2.SS1.p2.2.m2.1.1.3">𝐶</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p2.2.m2.1c">\times~{}C</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p2.2.m2.1d">× italic_C</annotation></semantics></math>) optimal ratio <cite class="ltx_cite ltx_citemacro_citep">(Hoffmann et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib19" title="">2022</a>)</cite> captures the typical overtraining favored for inference savings.</p>
</div>
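The training budgets implied by a fixed token-to-parameter ratio are easy to work out. A minimal sketch, where the model sizes are examples from the stated 4M–1B range (the exact ladder sizes are in the paper's Table 2) and 6ND is the standard approximation of training compute, not a figure from the paper:

```python
def token_budget(params, ratio=100):
    """Training tokens at a fixed token-to-parameter ratio; the paper uses
    100, i.e. 5x the ~20 tokens/param 'Chinchilla'-optimal ratio."""
    return ratio * params

def train_flops(params, tokens):
    """Standard ~6*N*D estimate of pretraining compute in FLOPs."""
    return 6 * params * tokens

# Example sizes spanning the suite's 4M-1B range.
for n in (4_000_000, 150_000_000, 1_000_000_000):
    d = token_budget(n)
    print(f"{n / 1e6:>6.0f}M params -> {d / 1e9:.1f}B tokens, "
          f"{train_flops(n, d):.2e} FLOPs")
```

At the 1B target scale this works out to 100B training tokens and roughly 6e20 FLOPs per run under the 6ND approximation.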
<div class="ltx_para ltx_noindent" id="S2.SS1.p3">
<p class="ltx_p" id="S2.SS1.p3.3">All 1B (target size) models have 3 full reruns with different seeds, while other model sizes have second and third seed runs that are terminated early after <math alttext="25\%" class="ltx_Math" display="inline" id="S2.SS1.p3.1.m1.1"><semantics id="S2.SS1.p3.1.m1.1a"><mrow id="S2.SS1.p3.1.m1.1.1" xref="S2.SS1.p3.1.m1.1.1.cmml"><mn id="S2.SS1.p3.1.m1.1.1.2" xref="S2.SS1.p3.1.m1.1.1.2.cmml">25</mn><mo id="S2.SS1.p3.1.m1.1.1.1" xref="S2.SS1.p3.1.m1.1.1.1.cmml">%</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p3.1.m1.1b"><apply id="S2.SS1.p3.1.m1.1.1.cmml" xref="S2.SS1.p3.1.m1.1.1"><csymbol cd="latexml" id="S2.SS1.p3.1.m1.1.1.1.cmml" xref="S2.SS1.p3.1.m1.1.1.1">percent</csymbol><cn id="S2.SS1.p3.1.m1.1.1.2.cmml" type="integer" xref="S2.SS1.p3.1.m1.1.1.2">25</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p3.1.m1.1c">25\%</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p3.1.m1.1d">25 %</annotation></semantics></math> of the target compute budget. We train the 1B reruns all the way to completion to allow our target “gold” predictions to account for run-to-run variance in evaluations due to weight initialization and data order. 
For instance, we find that the standard deviation between runs at the 1B 5<math alttext="\times C" class="ltx_Math" display="inline" id="S2.SS1.p3.2.m2.1"><semantics id="S2.SS1.p3.2.m2.1a"><mrow id="S2.SS1.p3.2.m2.1.1" xref="S2.SS1.p3.2.m2.1.1.cmml"><mi id="S2.SS1.p3.2.m2.1.1.2" xref="S2.SS1.p3.2.m2.1.1.2.cmml"></mi><mo id="S2.SS1.p3.2.m2.1.1.1" lspace="0.222em" rspace="0.222em" xref="S2.SS1.p3.2.m2.1.1.1.cmml">×</mo><mi id="S2.SS1.p3.2.m2.1.1.3" xref="S2.SS1.p3.2.m2.1.1.3.cmml">C</mi></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p3.2.m2.1b"><apply id="S2.SS1.p3.2.m2.1.1.cmml" xref="S2.SS1.p3.2.m2.1.1"><times id="S2.SS1.p3.2.m2.1.1.1.cmml" xref="S2.SS1.p3.2.m2.1.1.1"></times><csymbol cd="latexml" id="S2.SS1.p3.2.m2.1.1.2.cmml" xref="S2.SS1.p3.2.m2.1.1.2">absent</csymbol><ci id="S2.SS1.p3.2.m2.1.1.3.cmml" xref="S2.SS1.p3.2.m2.1.1.3">𝐶</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p3.2.m2.1c">\times C</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p3.2.m2.1d">× italic_C</annotation></semantics></math> scale can be as high as <math alttext="2\%" class="ltx_Math" display="inline" id="S2.SS1.p3.3.m3.1"><semantics id="S2.SS1.p3.3.m3.1a"><mrow id="S2.SS1.p3.3.m3.1.1" xref="S2.SS1.p3.3.m3.1.1.cmml"><mn id="S2.SS1.p3.3.m3.1.1.2" xref="S2.SS1.p3.3.m3.1.1.2.cmml">2</mn><mo id="S2.SS1.p3.3.m3.1.1.1" xref="S2.SS1.p3.3.m3.1.1.1.cmml">%</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS1.p3.3.m3.1b"><apply id="S2.SS1.p3.3.m3.1.1.cmml" xref="S2.SS1.p3.3.m3.1.1"><csymbol cd="latexml" id="S2.SS1.p3.3.m3.1.1.1.cmml" xref="S2.SS1.p3.3.m3.1.1.1">percent</csymbol><cn id="S2.SS1.p3.3.m3.1.1.2.cmml" type="integer" xref="S2.SS1.p3.3.m3.1.1.2">2</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS1.p3.3.m3.1c">2\%</annotation><annotation encoding="application/x-llamapun" id="S2.SS1.p3.3.m3.1d">2 %</annotation></semantics></math> points of accuracy for some recipes on most tasks. 
Meanwhile, at the non-target scales we wish to make predictions with a small fraction of the target compute, so we avoid reruns that would use an impractically large prediction budget.</p>
</div>
<div class="ltx_para ltx_noindent" id="S2.SS1.p4">
<p class="ltx_p" id="S2.SS1.p4.1">Whether for extrapolating scaling laws or ranking single scale experiments, it is important to select reasonable hyperparameters for each scale to avoid confounding in performance differences that are simply due to suboptimal hyperparameters. We use OLMo’s <span class="ltx_text ltx_font_italic" id="S2.SS1.p4.1.1">model ladder</span> <cite class="ltx_cite ltx_citemacro_citep">(Groeneveld et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib16" title="">2024</a>; OLMo et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib28" title="">2025</a>; Bhagia et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>)</cite> to programmatically create LM pretraining configurations for a specified parameter size and token-parameter ratio to enable running a grid of model scaling experiments.
The model ladder uses heuristics from the literature <cite class="ltx_cite ltx_citemacro_citep">(Porian et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib31" title="">2024</a>)</cite> to set global batch size and learning rate based on scaling factors. The hyperparameters that determine parameter count (layers, hidden dimension, number of heads, MLP dimension) were handpicked by OLMo developers for each scale to achieve the desired number of parameters. Appendix Table <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A1.T2" title="Table 2 ‣ Appendix A Hyperparameters ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2</span></a> details the configurations of all our models.</p>
</div>
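As a rough illustration of the kind of heuristic such a ladder applies, the sketch below scales peak learning rate down and global batch size up with parameter count via power laws. The reference constants and exponents here are placeholders for illustration only, not OLMo's actual heuristics; the real values come from Porian et al. (2024) and the OLMo codebase.

```python
def ladder_hparams(n_params, *, lr_ref=1e-3, bs_ref=128, n_ref=1e8):
    """Illustrative model-ladder heuristic: set optimization hyperparameters
    from parameter count alone via power laws. All constants and exponents
    are placeholders, not OLMo's actual values."""
    scale = n_params / n_ref
    return {
        "peak_lr": lr_ref * scale ** -0.25,               # bigger model, smaller LR
        "global_batch_size": int(bs_ref * scale ** 0.5),  # bigger model, bigger batch
    }

# Hypothetical configs for a 150M and a 1B model.
for n in (150e6, 1e9):
    print(int(n), ladder_hparams(n))
```

The architectural hyperparameters (layers, hidden dimension, heads, MLP dimension) are not derived from a formula like this; as the text notes, they were handpicked per scale to hit the target parameter counts.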
</section>
<section class="ltx_subsection" id="S2.SS2">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.2 </span>Prediction Methods</h3>
<div class="ltx_para ltx_noindent" id="S2.SS2.p1">
<p class="ltx_p" id="S2.SS2.p1.1">Broadly, there are two approaches in the literature to predicting large-scale performance based on small-scale experiments. We use straightforward implementations of each to assess where they succeed and fail at making decisions about which data recipes to use.</p>
</div>
<section class="ltx_paragraph" id="S2.SS2.SSS0.Px1">
<h4 class="ltx_title ltx_title_paragraph">Ranking Single Scale Experiments (Single Scale)</h4>
<div class="ltx_para ltx_noindent" id="S2.SS2.SSS0.Px1.p1">
<p class="ltx_p" id="S2.SS2.SSS0.Px1.p1.1">This simple approach is employed by work such as <cite class="ltx_cite ltx_citemacro_citet">Li et al. (<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib24" title="">2024</a>)</cite> and consists of running a set of ablations or experiments over data recipe options while holding constant all other modeling variables including scale. The winning data recipe by downstream accuracy (or proxies) at the small experimental scale is assumed to extrapolate to the target scale.</p>
</div>
</section>
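This single-scale approach reduces to a one-line ranking; a minimal sketch, with recipe names and scores that are illustrative only:

```python
def rank_single_scale(scores_at_small_scale):
    """Single-scale method: rank recipes by benchmark score (or a proxy)
    at one small scale and assume the order holds at the target scale."""
    return sorted(scores_at_small_scale,
                  key=scores_at_small_scale.get, reverse=True)

# Hypothetical scores from one small-scale run per recipe.
small_scale_scores = {"dclm-baseline": 0.44, "dolma1.7": 0.41, "c4": 0.39}
ranking = rank_single_scale(small_scale_scores)
print(ranking[0])  # recipe predicted to win at the target scale
```

All modeling variables other than the data recipe, including scale, are held fixed, so the only cost is one small run per recipe.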
<section class="ltx_paragraph" id="S2.SS2.SSS0.Px2">
<h4 class="ltx_title ltx_title_paragraph">Extrapolating Scaling Laws (Multi Scale)</h4>
<div class="ltx_para ltx_noindent" id="S2.SS2.SSS0.Px2.p1">
<p class="ltx_p" id="S2.SS2.SSS0.Px2.p1.9">Another approach to making decisions with predictions across scales used in works such as <cite class="ltx_cite ltx_citemacro_citet">Dubey et al. (<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib13" title="">2024</a>)</cite> is to fit scaling laws to <span class="ltx_text ltx_font_italic" id="S2.SS2.SSS0.Px2.p1.9.1">multiple</span> small experiments across a range of scales for each of the data recipes. The winning recipe is decided as the one whose scaling law shows the highest <span class="ltx_text ltx_font_italic" id="S2.SS2.SSS0.Px2.p1.9.2">extrapolated</span> performance at the target scale. Although scaling laws were first observed for language modeling loss <cite class="ltx_cite ltx_citemacro_citep">(Kaplan et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib21" title="">2020</a>; Hoffmann et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib19" title="">2022</a>)</cite>, they have been extended to predict downstream performance through a two-step approach that also fits a function from loss to downstream performance <cite class="ltx_cite ltx_citemacro_citep">(Gadre et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib14" title="">2024</a>; Bhagia et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>)</cite>. We follow a method from <cite class="ltx_cite ltx_citemacro_citet">Bhagia et al. (<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>)</cite>. Their proposed approach incorporates separate parameters for number of model parameters and number of tokens trained to account for over or undertrained models. 
But as our suite only includes one token-parameter ratio, we use the simplified 3-parameter baseline, <math alttext="L(C)" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.1.m1.1"><semantics id="S2.SS2.SSS0.Px2.p1.1.m1.1a"><mrow id="S2.SS2.SSS0.Px2.p1.1.m1.1.2" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.2.cmml"><mi id="S2.SS2.SSS0.Px2.p1.1.m1.1.2.2" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.2.2.cmml">L</mi><mo id="S2.SS2.SSS0.Px2.p1.1.m1.1.2.1" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.2.1.cmml">⁢</mo><mrow id="S2.SS2.SSS0.Px2.p1.1.m1.1.2.3.2" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.2.cmml"><mo id="S2.SS2.SSS0.Px2.p1.1.m1.1.2.3.2.1" stretchy="false" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.2.cmml">(</mo><mi id="S2.SS2.SSS0.Px2.p1.1.m1.1.1" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.1.cmml">C</mi><mo id="S2.SS2.SSS0.Px2.p1.1.m1.1.2.3.2.2" stretchy="false" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.1.m1.1b"><apply id="S2.SS2.SSS0.Px2.p1.1.m1.1.2.cmml" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.2"><times id="S2.SS2.SSS0.Px2.p1.1.m1.1.2.1.cmml" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.2.1"></times><ci id="S2.SS2.SSS0.Px2.p1.1.m1.1.2.2.cmml" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.2.2">𝐿</ci><ci id="S2.SS2.SSS0.Px2.p1.1.m1.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.1.m1.1.1">𝐶</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.1.m1.1c">L(C)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS0.Px2.p1.1.m1.1d">italic_L ( italic_C )</annotation></semantics></math>, as a first step, which we chain with a second step, <math alttext="Acc(L)" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.2.m2.1"><semantics id="S2.SS2.SSS0.Px2.p1.2.m2.1a"><mrow id="S2.SS2.SSS0.Px2.p1.2.m2.1.2" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.cmml"><mi id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.2" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.2.cmml">A</mi><mo id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.1" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.1.cmml">⁢</mo><mi 
id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.3" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.3.cmml">c</mi><mo id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.1a" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.1.cmml">⁢</mo><mi id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.4" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.4.cmml">c</mi><mo id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.1b" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.1.cmml">⁢</mo><mrow id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.5.2" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.cmml"><mo id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.5.2.1" stretchy="false" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.cmml">(</mo><mi id="S2.SS2.SSS0.Px2.p1.2.m2.1.1" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.1.cmml">L</mi><mo id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.5.2.2" stretchy="false" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.2.m2.1b"><apply id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.cmml" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2"><times id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.1.cmml" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.1"></times><ci id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.2.cmml" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.2">𝐴</ci><ci id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.3.cmml" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.3">𝑐</ci><ci id="S2.SS2.SSS0.Px2.p1.2.m2.1.2.4.cmml" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.2.4">𝑐</ci><ci id="S2.SS2.SSS0.Px2.p1.2.m2.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.2.m2.1.1">𝐿</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.2.m2.1c">Acc(L)</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS0.Px2.p1.2.m2.1d">italic_A italic_c italic_c ( italic_L )</annotation></semantics></math>, defined as follows where <math alttext="A" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.3.m3.1"><semantics id="S2.SS2.SSS0.Px2.p1.3.m3.1a"><mi id="S2.SS2.SSS0.Px2.p1.3.m3.1.1" xref="S2.SS2.SSS0.Px2.p1.3.m3.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.3.m3.1b"><ci id="S2.SS2.SSS0.Px2.p1.3.m3.1.1.cmml" 
xref="S2.SS2.SSS0.Px2.p1.3.m3.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.3.m3.1c">A</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS0.Px2.p1.3.m3.1d">italic_A</annotation></semantics></math>, <math alttext="\alpha" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.4.m4.1"><semantics id="S2.SS2.SSS0.Px2.p1.4.m4.1a"><mi id="S2.SS2.SSS0.Px2.p1.4.m4.1.1" xref="S2.SS2.SSS0.Px2.p1.4.m4.1.1.cmml">α</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.4.m4.1b"><ci id="S2.SS2.SSS0.Px2.p1.4.m4.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.4.m4.1.1">𝛼</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.4.m4.1c">\alpha</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS0.Px2.p1.4.m4.1d">italic_α</annotation></semantics></math>, <math alttext="E" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.5.m5.1"><semantics id="S2.SS2.SSS0.Px2.p1.5.m5.1a"><mi id="S2.SS2.SSS0.Px2.p1.5.m5.1.1" xref="S2.SS2.SSS0.Px2.p1.5.m5.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.5.m5.1b"><ci id="S2.SS2.SSS0.Px2.p1.5.m5.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.5.m5.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.5.m5.1c">E</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS0.Px2.p1.5.m5.1d">italic_E</annotation></semantics></math>, <math alttext="a" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.6.m6.1"><semantics id="S2.SS2.SSS0.Px2.p1.6.m6.1a"><mi id="S2.SS2.SSS0.Px2.p1.6.m6.1.1" xref="S2.SS2.SSS0.Px2.p1.6.m6.1.1.cmml">a</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.6.m6.1b"><ci id="S2.SS2.SSS0.Px2.p1.6.m6.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.6.m6.1.1">𝑎</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.6.m6.1c">a</annotation><annotation encoding="application/x-llamapun" 
id="S2.SS2.SSS0.Px2.p1.6.m6.1d">italic_a</annotation></semantics></math>, <math alttext="b" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.7.m7.1"><semantics id="S2.SS2.SSS0.Px2.p1.7.m7.1a"><mi id="S2.SS2.SSS0.Px2.p1.7.m7.1.1" xref="S2.SS2.SSS0.Px2.p1.7.m7.1.1.cmml">b</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.7.m7.1b"><ci id="S2.SS2.SSS0.Px2.p1.7.m7.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.7.m7.1.1">𝑏</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.7.m7.1c">b</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS0.Px2.p1.7.m7.1d">italic_b</annotation></semantics></math>, <math alttext="k" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.8.m8.1"><semantics id="S2.SS2.SSS0.Px2.p1.8.m8.1a"><mi id="S2.SS2.SSS0.Px2.p1.8.m8.1.1" xref="S2.SS2.SSS0.Px2.p1.8.m8.1.1.cmml">k</mi><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.8.m8.1b"><ci id="S2.SS2.SSS0.Px2.p1.8.m8.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.8.m8.1.1">𝑘</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.8.m8.1c">k</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS0.Px2.p1.8.m8.1d">italic_k</annotation></semantics></math>, <math alttext="L_{0}" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.9.m9.1"><semantics id="S2.SS2.SSS0.Px2.p1.9.m9.1a"><msub id="S2.SS2.SSS0.Px2.p1.9.m9.1.1" xref="S2.SS2.SSS0.Px2.p1.9.m9.1.1.cmml"><mi id="S2.SS2.SSS0.Px2.p1.9.m9.1.1.2" xref="S2.SS2.SSS0.Px2.p1.9.m9.1.1.2.cmml">L</mi><mn id="S2.SS2.SSS0.Px2.p1.9.m9.1.1.3" xref="S2.SS2.SSS0.Px2.p1.9.m9.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.9.m9.1b"><apply id="S2.SS2.SSS0.Px2.p1.9.m9.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.9.m9.1.1"><csymbol cd="ambiguous" id="S2.SS2.SSS0.Px2.p1.9.m9.1.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.9.m9.1.1">subscript</csymbol><ci id="S2.SS2.SSS0.Px2.p1.9.m9.1.1.2.cmml" 
xref="S2.SS2.SSS0.Px2.p1.9.m9.1.1.2">𝐿</ci><cn id="S2.SS2.SSS0.Px2.p1.9.m9.1.1.3.cmml" type="integer" xref="S2.SS2.SSS0.Px2.p1.9.m9.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.9.m9.1c">L_{0}</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS0.Px2.p1.9.m9.1d">italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math> are optimized parameters:</p>
<table class="ltx_equationgroup ltx_eqn_align ltx_eqn_table" id="A3.EGx1">
<tbody id="S2.E1"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle L(C)" class="ltx_Math" display="inline" id="S2.E1.m1.1"><semantics id="S2.E1.m1.1a"><mrow id="S2.E1.m1.1.2" xref="S2.E1.m1.1.2.cmml"><mi id="S2.E1.m1.1.2.2" xref="S2.E1.m1.1.2.2.cmml">L</mi><mo id="S2.E1.m1.1.2.1" xref="S2.E1.m1.1.2.1.cmml">⁢</mo><mrow id="S2.E1.m1.1.2.3.2" xref="S2.E1.m1.1.2.cmml"><mo id="S2.E1.m1.1.2.3.2.1" stretchy="false" xref="S2.E1.m1.1.2.cmml">(</mo><mi id="S2.E1.m1.1.1" xref="S2.E1.m1.1.1.cmml">C</mi><mo id="S2.E1.m1.1.2.3.2.2" stretchy="false" xref="S2.E1.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E1.m1.1b"><apply id="S2.E1.m1.1.2.cmml" xref="S2.E1.m1.1.2"><times id="S2.E1.m1.1.2.1.cmml" xref="S2.E1.m1.1.2.1"></times><ci id="S2.E1.m1.1.2.2.cmml" xref="S2.E1.m1.1.2.2">𝐿</ci><ci id="S2.E1.m1.1.1.cmml" xref="S2.E1.m1.1.1">𝐶</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E1.m1.1c">\displaystyle L(C)</annotation><annotation encoding="application/x-llamapun" id="S2.E1.m1.1d">italic_L ( italic_C )</annotation></semantics></math></td>
<td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\frac{A}{C^{\alpha}}+E" class="ltx_Math" display="inline" id="S2.E1.m2.1"><semantics id="S2.E1.m2.1a"><mrow id="S2.E1.m2.1.1" xref="S2.E1.m2.1.1.cmml"><mi id="S2.E1.m2.1.1.2" xref="S2.E1.m2.1.1.2.cmml"></mi><mo id="S2.E1.m2.1.1.1" xref="S2.E1.m2.1.1.1.cmml">=</mo><mrow id="S2.E1.m2.1.1.3" xref="S2.E1.m2.1.1.3.cmml"><mstyle displaystyle="true" id="S2.E1.m2.1.1.3.2" xref="S2.E1.m2.1.1.3.2.cmml"><mfrac id="S2.E1.m2.1.1.3.2a" xref="S2.E1.m2.1.1.3.2.cmml"><mi id="S2.E1.m2.1.1.3.2.2" xref="S2.E1.m2.1.1.3.2.2.cmml">A</mi><msup id="S2.E1.m2.1.1.3.2.3" xref="S2.E1.m2.1.1.3.2.3.cmml"><mi id="S2.E1.m2.1.1.3.2.3.2" xref="S2.E1.m2.1.1.3.2.3.2.cmml">C</mi><mi id="S2.E1.m2.1.1.3.2.3.3" xref="S2.E1.m2.1.1.3.2.3.3.cmml">α</mi></msup></mfrac></mstyle><mo id="S2.E1.m2.1.1.3.1" xref="S2.E1.m2.1.1.3.1.cmml">+</mo><mi id="S2.E1.m2.1.1.3.3" xref="S2.E1.m2.1.1.3.3.cmml">E</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E1.m2.1b"><apply id="S2.E1.m2.1.1.cmml" xref="S2.E1.m2.1.1"><eq id="S2.E1.m2.1.1.1.cmml" xref="S2.E1.m2.1.1.1"></eq><csymbol cd="latexml" id="S2.E1.m2.1.1.2.cmml" xref="S2.E1.m2.1.1.2">absent</csymbol><apply id="S2.E1.m2.1.1.3.cmml" xref="S2.E1.m2.1.1.3"><plus id="S2.E1.m2.1.1.3.1.cmml" xref="S2.E1.m2.1.1.3.1"></plus><apply id="S2.E1.m2.1.1.3.2.cmml" xref="S2.E1.m2.1.1.3.2"><divide id="S2.E1.m2.1.1.3.2.1.cmml" xref="S2.E1.m2.1.1.3.2"></divide><ci id="S2.E1.m2.1.1.3.2.2.cmml" xref="S2.E1.m2.1.1.3.2.2">𝐴</ci><apply id="S2.E1.m2.1.1.3.2.3.cmml" xref="S2.E1.m2.1.1.3.2.3"><csymbol cd="ambiguous" id="S2.E1.m2.1.1.3.2.3.1.cmml" xref="S2.E1.m2.1.1.3.2.3">superscript</csymbol><ci id="S2.E1.m2.1.1.3.2.3.2.cmml" xref="S2.E1.m2.1.1.3.2.3.2">𝐶</ci><ci id="S2.E1.m2.1.1.3.2.3.3.cmml" xref="S2.E1.m2.1.1.3.2.3.3">𝛼</ci></apply></apply><ci id="S2.E1.m2.1.1.3.3.cmml" xref="S2.E1.m2.1.1.3.3">𝐸</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S2.E1.m2.1c">\displaystyle=\frac{A}{C^{\alpha}}+E</annotation><annotation encoding="application/x-llamapun" id="S2.E1.m2.1d">= divide start_ARG italic_A end_ARG start_ARG italic_C start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT end_ARG + italic_E</annotation></semantics></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(1)</span></td>
</tr></tbody>
<tbody id="S2.E2"><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_td ltx_align_right ltx_eqn_cell"><math alttext="\displaystyle Acc(L)" class="ltx_Math" display="inline" id="S2.E2.m1.1"><semantics id="S2.E2.m1.1a"><mrow id="S2.E2.m1.1.2" xref="S2.E2.m1.1.2.cmml"><mi id="S2.E2.m1.1.2.2" xref="S2.E2.m1.1.2.2.cmml">A</mi><mo id="S2.E2.m1.1.2.1" xref="S2.E2.m1.1.2.1.cmml">⁢</mo><mi id="S2.E2.m1.1.2.3" xref="S2.E2.m1.1.2.3.cmml">c</mi><mo id="S2.E2.m1.1.2.1a" xref="S2.E2.m1.1.2.1.cmml">⁢</mo><mi id="S2.E2.m1.1.2.4" xref="S2.E2.m1.1.2.4.cmml">c</mi><mo id="S2.E2.m1.1.2.1b" xref="S2.E2.m1.1.2.1.cmml">⁢</mo><mrow id="S2.E2.m1.1.2.5.2" xref="S2.E2.m1.1.2.cmml"><mo id="S2.E2.m1.1.2.5.2.1" stretchy="false" xref="S2.E2.m1.1.2.cmml">(</mo><mi id="S2.E2.m1.1.1" xref="S2.E2.m1.1.1.cmml">L</mi><mo id="S2.E2.m1.1.2.5.2.2" stretchy="false" xref="S2.E2.m1.1.2.cmml">)</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E2.m1.1b"><apply id="S2.E2.m1.1.2.cmml" xref="S2.E2.m1.1.2"><times id="S2.E2.m1.1.2.1.cmml" xref="S2.E2.m1.1.2.1"></times><ci id="S2.E2.m1.1.2.2.cmml" xref="S2.E2.m1.1.2.2">𝐴</ci><ci id="S2.E2.m1.1.2.3.cmml" xref="S2.E2.m1.1.2.3">𝑐</ci><ci id="S2.E2.m1.1.2.4.cmml" xref="S2.E2.m1.1.2.4">𝑐</ci><ci id="S2.E2.m1.1.1.cmml" xref="S2.E2.m1.1.1">𝐿</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E2.m1.1c">\displaystyle Acc(L)</annotation><annotation encoding="application/x-llamapun" id="S2.E2.m1.1d">italic_A italic_c italic_c ( italic_L )</annotation></semantics></math></td>
<td class="ltx_td ltx_align_left ltx_eqn_cell"><math alttext="\displaystyle=\frac{a}{1+e^{-k(L-L_{0})}}+b" class="ltx_Math" display="inline" id="S2.E2.m2.1"><semantics id="S2.E2.m2.1a"><mrow id="S2.E2.m2.1.2" xref="S2.E2.m2.1.2.cmml"><mi id="S2.E2.m2.1.2.2" xref="S2.E2.m2.1.2.2.cmml"></mi><mo id="S2.E2.m2.1.2.1" xref="S2.E2.m2.1.2.1.cmml">=</mo><mrow id="S2.E2.m2.1.2.3" xref="S2.E2.m2.1.2.3.cmml"><mstyle displaystyle="true" id="S2.E2.m2.1.1" xref="S2.E2.m2.1.1.cmml"><mfrac id="S2.E2.m2.1.1a" xref="S2.E2.m2.1.1.cmml"><mi id="S2.E2.m2.1.1.3" xref="S2.E2.m2.1.1.3.cmml">a</mi><mrow id="S2.E2.m2.1.1.1" xref="S2.E2.m2.1.1.1.cmml"><mn id="S2.E2.m2.1.1.1.3" xref="S2.E2.m2.1.1.1.3.cmml">1</mn><mo id="S2.E2.m2.1.1.1.2" xref="S2.E2.m2.1.1.1.2.cmml">+</mo><msup id="S2.E2.m2.1.1.1.4" xref="S2.E2.m2.1.1.1.4.cmml"><mi id="S2.E2.m2.1.1.1.4.2" xref="S2.E2.m2.1.1.1.4.2.cmml">e</mi><mrow id="S2.E2.m2.1.1.1.1.1" xref="S2.E2.m2.1.1.1.1.1.cmml"><mo id="S2.E2.m2.1.1.1.1.1a" xref="S2.E2.m2.1.1.1.1.1.cmml">−</mo><mrow id="S2.E2.m2.1.1.1.1.1.1" xref="S2.E2.m2.1.1.1.1.1.1.cmml"><mi id="S2.E2.m2.1.1.1.1.1.1.3" xref="S2.E2.m2.1.1.1.1.1.1.3.cmml">k</mi><mo id="S2.E2.m2.1.1.1.1.1.1.2" xref="S2.E2.m2.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E2.m2.1.1.1.1.1.1.1.1" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.E2.m2.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.E2.m2.1.1.1.1.1.1.1.1.1" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.cmml"><mi id="S2.E2.m2.1.1.1.1.1.1.1.1.1.2" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.2.cmml">L</mi><mo id="S2.E2.m2.1.1.1.1.1.1.1.1.1.1" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.1.cmml">−</mo><msub id="S2.E2.m2.1.1.1.1.1.1.1.1.1.3" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.cmml"><mi id="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.2" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.2.cmml">L</mi><mn id="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.3" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.3.cmml">0</mn></msub></mrow><mo id="S2.E2.m2.1.1.1.1.1.1.1.1.3" stretchy="false" 
xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></msup></mrow></mfrac></mstyle><mo id="S2.E2.m2.1.2.3.1" xref="S2.E2.m2.1.2.3.1.cmml">+</mo><mi id="S2.E2.m2.1.2.3.2" xref="S2.E2.m2.1.2.3.2.cmml">b</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E2.m2.1b"><apply id="S2.E2.m2.1.2.cmml" xref="S2.E2.m2.1.2"><eq id="S2.E2.m2.1.2.1.cmml" xref="S2.E2.m2.1.2.1"></eq><csymbol cd="latexml" id="S2.E2.m2.1.2.2.cmml" xref="S2.E2.m2.1.2.2">absent</csymbol><apply id="S2.E2.m2.1.2.3.cmml" xref="S2.E2.m2.1.2.3"><plus id="S2.E2.m2.1.2.3.1.cmml" xref="S2.E2.m2.1.2.3.1"></plus><apply id="S2.E2.m2.1.1.cmml" xref="S2.E2.m2.1.1"><divide id="S2.E2.m2.1.1.2.cmml" xref="S2.E2.m2.1.1"></divide><ci id="S2.E2.m2.1.1.3.cmml" xref="S2.E2.m2.1.1.3">𝑎</ci><apply id="S2.E2.m2.1.1.1.cmml" xref="S2.E2.m2.1.1.1"><plus id="S2.E2.m2.1.1.1.2.cmml" xref="S2.E2.m2.1.1.1.2"></plus><cn id="S2.E2.m2.1.1.1.3.cmml" type="integer" xref="S2.E2.m2.1.1.1.3">1</cn><apply id="S2.E2.m2.1.1.1.4.cmml" xref="S2.E2.m2.1.1.1.4"><csymbol cd="ambiguous" id="S2.E2.m2.1.1.1.4.1.cmml" xref="S2.E2.m2.1.1.1.4">superscript</csymbol><ci id="S2.E2.m2.1.1.1.4.2.cmml" xref="S2.E2.m2.1.1.1.4.2">𝑒</ci><apply id="S2.E2.m2.1.1.1.1.1.cmml" xref="S2.E2.m2.1.1.1.1.1"><minus id="S2.E2.m2.1.1.1.1.1.2.cmml" xref="S2.E2.m2.1.1.1.1.1"></minus><apply id="S2.E2.m2.1.1.1.1.1.1.cmml" xref="S2.E2.m2.1.1.1.1.1.1"><times id="S2.E2.m2.1.1.1.1.1.1.2.cmml" xref="S2.E2.m2.1.1.1.1.1.1.2"></times><ci id="S2.E2.m2.1.1.1.1.1.1.3.cmml" xref="S2.E2.m2.1.1.1.1.1.1.3">𝑘</ci><apply id="S2.E2.m2.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E2.m2.1.1.1.1.1.1.1.1"><minus id="S2.E2.m2.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.1"></minus><ci id="S2.E2.m2.1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.2">𝐿</ci><apply id="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.cmml" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci 
id="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.2">𝐿</ci><cn id="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.3.cmml" type="integer" xref="S2.E2.m2.1.1.1.1.1.1.1.1.1.3.3">0</cn></apply></apply></apply></apply></apply></apply></apply><ci id="S2.E2.m2.1.2.3.2.cmml" xref="S2.E2.m2.1.2.3.2">𝑏</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E2.m2.1c">\displaystyle=\frac{a}{1+e^{-k(L-L_{0})}}+b</annotation><annotation encoding="application/x-llamapun" id="S2.E2.m2.1d">= divide start_ARG italic_a end_ARG start_ARG 1 + italic_e start_POSTSUPERSCRIPT - italic_k ( italic_L - italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_POSTSUPERSCRIPT end_ARG + italic_b</annotation></semantics></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td>
</tr></tbody>
</table>
<p class="ltx_p" id="S2.SS2.SSS0.Px2.p1.10">Following <cite class="ltx_cite ltx_citemacro_citet">Bhagia et al. (<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>)</cite>, we fit Equation <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.E1" title="In Extrapolating Scaling Laws (Multi Scale) ‣ 2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">1</span></a> only on observations of final, fully trained checkpoints, as accounting for the learning rate schedule’s impact on intermediate checkpoints would require further parameters in the equation, increasing the required number of observations and the cost. To account for step-to-step noise in evaluation, we average the last <math alttext="10\%" class="ltx_Math" display="inline" id="S2.SS2.SSS0.Px2.p1.10.m1.1"><semantics id="S2.SS2.SSS0.Px2.p1.10.m1.1a"><mrow id="S2.SS2.SSS0.Px2.p1.10.m1.1.1" xref="S2.SS2.SSS0.Px2.p1.10.m1.1.1.cmml"><mn id="S2.SS2.SSS0.Px2.p1.10.m1.1.1.2" xref="S2.SS2.SSS0.Px2.p1.10.m1.1.1.2.cmml">10</mn><mo id="S2.SS2.SSS0.Px2.p1.10.m1.1.1.1" xref="S2.SS2.SSS0.Px2.p1.10.m1.1.1.1.cmml">%</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS2.SSS0.Px2.p1.10.m1.1b"><apply id="S2.SS2.SSS0.Px2.p1.10.m1.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.10.m1.1.1"><csymbol cd="latexml" id="S2.SS2.SSS0.Px2.p1.10.m1.1.1.1.cmml" xref="S2.SS2.SSS0.Px2.p1.10.m1.1.1.1">percent</csymbol><cn id="S2.SS2.SSS0.Px2.p1.10.m1.1.1.2.cmml" type="integer" xref="S2.SS2.SSS0.Px2.p1.10.m1.1.1.2">10</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS2.SSS0.Px2.p1.10.m1.1c">10\%</annotation><annotation encoding="application/x-llamapun" id="S2.SS2.SSS0.Px2.p1.10.m1.1d">10 %</annotation></semantics></math> of checkpoints as the final observed loss. 
Equation <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.E2" title="In Extrapolating Scaling Laws (Multi Scale) ‣ 2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2</span></a>, however, is fit on all observations, including intermediate checkpoints.
We explore variations for a total of 8 multi scale approaches, defined in Appendix <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A3" title="Appendix C Scaling Law Variants ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">C</span></a>; none of these yields substantially better decisions than the method defined in this section.</p>
</div>
</section>
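As a concrete illustration of the two-step fit, the sketch below fits Equations (1) and (2) to synthetic observations with scipy.optimize.curve_fit and chains the two fits to extrapolate accuracy at a larger compute budget. All constants and compute values are invented for illustration, and the sketch uses one observation per run rather than the paper's final-checkpoint / all-checkpoint split; it is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

# Two-step scaling-law fit: step 1 maps compute C to loss L, step 2 maps loss L
# to downstream accuracy, following Equations (1) and (2).
def loss_from_compute(C, A, alpha, E):
    # L(C) = A / C^alpha + E
    return A / C**alpha + E

def acc_from_loss(L, a, k, L0, b):
    # Acc(L) = a / (1 + e^{-k (L - L0)}) + b
    return a / (1.0 + np.exp(-k * (L - L0))) + b

# Synthetic stand-ins for small-scale runs of one data recipe (illustrative only).
C = np.array([1e17, 2e17, 5e17, 1e18, 2e18, 5e18, 1e19])
L_obs = loss_from_compute(C, A=2.0e5, alpha=0.3, E=1.8)
acc_obs = acc_from_loss(L_obs, a=-0.5, k=4.0, L0=2.5, b=0.75)

# Step 1: fit the compute-to-loss power law.
(A, alpha, E), _ = curve_fit(loss_from_compute, C, L_obs,
                             p0=[1e5, 0.3, 1.0], maxfev=20000)

# Step 2: fit the loss-to-accuracy sigmoid.
(a, k, L0, b), _ = curve_fit(acc_from_loss, L_obs, acc_obs,
                             p0=[-0.5, 1.0, float(np.median(L_obs)), 0.5],
                             maxfev=20000)

# Chain the two fits to extrapolate accuracy at a larger target compute budget;
# the recipe with the highest such prediction would be chosen.
pred_acc = acc_from_loss(loss_from_compute(1e20, A, alpha, E), a, k, L0, b)
print(alpha, pred_acc)
```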
</section>
<section class="ltx_subsection" id="S2.SS3">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.3 </span>Prediction Metrics</h3>
<div class="ltx_para ltx_noindent" id="S2.SS3.p1">
<p class="ltx_p" id="S2.SS3.p1.1">Our predictive task is to forecast which of a pair of data recipes will perform better at some target scale based on small-scale experiments. We use the following metrics to measure the quality of these predictions.</p>
</div>
<section class="ltx_paragraph" id="S2.SS3.SSS0.Px1">
<h4 class="ltx_title ltx_title_paragraph">Prediction Error</h4>
<div class="ltx_para ltx_noindent" id="S2.SS3.SSS0.Px1.p1">
<p class="ltx_p" id="S2.SS3.SSS0.Px1.p1.2">The scaling laws literature <cite class="ltx_cite ltx_citemacro_citep">(Bhagia et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>; Gadre et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib14" title="">2024</a>)</cite> typically evaluates success by comparing predicted and actual downstream performance, using relative error (<math alttext="\frac{\lvert\text{predicted}-\text{actual}\rvert}{\text{actual}}\times 100\%" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px1.p1.1.m1.1"><semantics id="S2.SS3.SSS0.Px1.p1.1.m1.1a"><mrow id="S2.SS3.SSS0.Px1.p1.1.m1.1.2" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2.cmml"><mfrac id="S2.SS3.SSS0.Px1.p1.1.m1.1.1" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.cmml"><mrow id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.2.cmml"><mo id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.2" stretchy="false" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.2.1.cmml">|</mo><mrow id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.cmml"><mtext id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.2" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.2a.cmml">predicted</mtext><mo id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.1" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.1.cmml">−</mo><mtext id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.3" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.3a.cmml">actual</mtext></mrow><mo id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.3" stretchy="false" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.2.1.cmml">|</mo></mrow><mtext id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.3" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.3a.cmml">actual</mtext></mfrac><mo id="S2.SS3.SSS0.Px1.p1.1.m1.1.2.1" lspace="0.222em" rspace="0.222em" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2.1.cmml">×</mo><mrow id="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.cmml"><mn id="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.2" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.2.cmml">100</mn><mo id="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.1" 
xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.1.cmml">%</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px1.p1.1.m1.1b"><apply id="S2.SS3.SSS0.Px1.p1.1.m1.1.2.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2"><times id="S2.SS3.SSS0.Px1.p1.1.m1.1.2.1.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2.1"></times><apply id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1"><divide id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.2.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1"></divide><apply id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.2.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1"><abs id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.2.1.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.2"></abs><apply id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1"><minus id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.1.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.1"></minus><ci id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.2a.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.2"><mtext id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.2.cmml" mathsize="70%" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.2">predicted</mtext></ci><ci id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.3a.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.3"><mtext id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.3.cmml" mathsize="70%" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.1.1.1.3">actual</mtext></ci></apply></apply><ci id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.3a.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.3"><mtext id="S2.SS3.SSS0.Px1.p1.1.m1.1.1.3.cmml" mathsize="70%" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.1.3">actual</mtext></ci></apply><apply id="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2"><csymbol cd="latexml" id="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.1.cmml" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.1">percent</csymbol><cn id="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.2.cmml" type="integer" xref="S2.SS3.SSS0.Px1.p1.1.m1.1.2.2.2">100</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S2.SS3.SSS0.Px1.p1.1.m1.1c">\frac{\lvert\text{predicted}-\text{actual}\rvert}{\text{actual}}\times 100\%</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px1.p1.1.m1.1d">divide start_ARG | predicted - actual | end_ARG start_ARG actual end_ARG × 100 %</annotation></semantics></math>) or absolute error (<math alttext="\lvert\text{predicted}-\text{actual}\rvert\times 100\%" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px1.p1.2.m2.1"><semantics id="S2.SS3.SSS0.Px1.p1.2.m2.1a"><mrow id="S2.SS3.SSS0.Px1.p1.2.m2.1.1" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.cmml"><mrow id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.2.cmml"><mo id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.2" stretchy="false" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.2.1.cmml">|</mo><mrow id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.cmml"><mtext id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.2" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.2a.cmml">predicted</mtext><mo id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.1" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.1.cmml">−</mo><mtext id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.3" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.3a.cmml">actual</mtext></mrow><mo id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.3" rspace="0.055em" stretchy="false" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.2.1.cmml">|</mo></mrow><mo id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.2" rspace="0.222em" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.2.cmml">×</mo><mrow id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.cmml"><mn id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.2" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.2.cmml">100</mn><mo id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.1" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.1.cmml">%</mo></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px1.p1.2.m2.1b"><apply id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1"><times id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.2.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.2"></times><apply 
id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.2.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1"><abs id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.2.1.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.2"></abs><apply id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1"><minus id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.1.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.1"></minus><ci id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.2a.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.2"><mtext id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.2.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.2">predicted</mtext></ci><ci id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.3a.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.3"><mtext id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.3.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.1.1.1.3">actual</mtext></ci></apply></apply><apply id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3"><csymbol cd="latexml" id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.1.cmml" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.1">percent</csymbol><cn id="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.2.cmml" type="integer" xref="S2.SS3.SSS0.Px1.p1.2.m2.1.1.3.2">100</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px1.p1.2.m2.1c">\lvert\text{predicted}-\text{actual}\rvert\times 100\%</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px1.p1.2.m2.1d">| predicted - actual | × 100 %</annotation></semantics></math>). We call these absolute or relative “prediction error” to distinguish from the following metric.</p>
</div>
</section>
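Both error measures are simple to compute; a minimal sketch with illustrative values follows (the predicted/actual accuracies below are made up, not results from the paper).

```python
# Prediction error as defined in the text, expressed as percentages.
def relative_error(predicted: float, actual: float) -> float:
    # |predicted - actual| / actual * 100%
    return abs(predicted - actual) / actual * 100.0

def absolute_error(predicted: float, actual: float) -> float:
    # |predicted - actual| * 100%
    return abs(predicted - actual) * 100.0

# e.g. a predicted accuracy of 0.52 against an actual accuracy of 0.50:
print(relative_error(0.52, 0.50))  # percent of the actual value
print(absolute_error(0.52, 0.50))  # percentage points
```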
<section class="ltx_paragraph" id="S2.SS3.SSS0.Px2">
<h4 class="ltx_title ltx_title_paragraph">Decision Accuracy</h4>
<div class="ltx_para ltx_noindent" id="S2.SS3.SSS0.Px2.p1">
<p class="ltx_p" id="S2.SS3.SSS0.Px2.p1.9">Unlike previous work, we also measure the impact of predictions on <span class="ltx_text ltx_font_italic" id="S2.SS3.SSS0.Px2.p1.9.1">decisions</span> about which data recipe is better than another. The metric we use to capture this is decision accuracy, an accuracy over all pairs of data recipes <math alttext="A" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px2.p1.1.m1.1"><semantics id="S2.SS3.SSS0.Px2.p1.1.m1.1a"><mi id="S2.SS3.SSS0.Px2.p1.1.m1.1.1" xref="S2.SS3.SSS0.Px2.p1.1.m1.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px2.p1.1.m1.1b"><ci id="S2.SS3.SSS0.Px2.p1.1.m1.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.1.m1.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px2.p1.1.m1.1c">A</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px2.p1.1.m1.1d">italic_A</annotation></semantics></math> and <math alttext="B" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px2.p1.2.m2.1"><semantics id="S2.SS3.SSS0.Px2.p1.2.m2.1a"><mi id="S2.SS3.SSS0.Px2.p1.2.m2.1.1" xref="S2.SS3.SSS0.Px2.p1.2.m2.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px2.p1.2.m2.1b"><ci id="S2.SS3.SSS0.Px2.p1.2.m2.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.2.m2.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px2.p1.2.m2.1c">B</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px2.p1.2.m2.1d">italic_B</annotation></semantics></math> where either <math alttext="A" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px2.p1.3.m3.1"><semantics id="S2.SS3.SSS0.Px2.p1.3.m3.1a"><mi id="S2.SS3.SSS0.Px2.p1.3.m3.1.1" xref="S2.SS3.SSS0.Px2.p1.3.m3.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px2.p1.3.m3.1b"><ci id="S2.SS3.SSS0.Px2.p1.3.m3.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.3.m3.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" 
id="S2.SS3.SSS0.Px2.p1.3.m3.1c">A</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px2.p1.3.m3.1d">italic_A</annotation></semantics></math> or <math alttext="B" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px2.p1.4.m4.1"><semantics id="S2.SS3.SSS0.Px2.p1.4.m4.1a"><mi id="S2.SS3.SSS0.Px2.p1.4.m4.1.1" xref="S2.SS3.SSS0.Px2.p1.4.m4.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px2.p1.4.m4.1b"><ci id="S2.SS3.SSS0.Px2.p1.4.m4.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.4.m4.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px2.p1.4.m4.1c">B</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px2.p1.4.m4.1d">italic_B</annotation></semantics></math> is defined as the correct winner based on which achieves higher performance at the target scale. This is nearly equivalent to Kendall’s <math alttext="\tau" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px2.p1.5.m5.1"><semantics id="S2.SS3.SSS0.Px2.p1.5.m5.1a"><mi id="S2.SS3.SSS0.Px2.p1.5.m5.1.1" xref="S2.SS3.SSS0.Px2.p1.5.m5.1.1.cmml">τ</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px2.p1.5.m5.1b"><ci id="S2.SS3.SSS0.Px2.p1.5.m5.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.5.m5.1.1">𝜏</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px2.p1.5.m5.1c">\tau</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px2.p1.5.m5.1d">italic_τ</annotation></semantics></math>, but ranges from 0 to 1. We define the target-scale winner based on mean downstream performance over 3 random seeds.
Thus decision accuracy can be formalized as follows. Let <math alttext="\mathcal{P}" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px2.p1.6.m6.1"><semantics id="S2.SS3.SSS0.Px2.p1.6.m6.1a"><mi class="ltx_font_mathcaligraphic" id="S2.SS3.SSS0.Px2.p1.6.m6.1.1" xref="S2.SS3.SSS0.Px2.p1.6.m6.1.1.cmml">𝒫</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px2.p1.6.m6.1b"><ci id="S2.SS3.SSS0.Px2.p1.6.m6.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.6.m6.1.1">𝒫</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px2.p1.6.m6.1c">\mathcal{P}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px2.p1.6.m6.1d">caligraphic_P</annotation></semantics></math> be the set of all data recipe pairs <math alttext="(A,B)" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px2.p1.7.m7.2"><semantics id="S2.SS3.SSS0.Px2.p1.7.m7.2a"><mrow id="S2.SS3.SSS0.Px2.p1.7.m7.2.3.2" xref="S2.SS3.SSS0.Px2.p1.7.m7.2.3.1.cmml"><mo id="S2.SS3.SSS0.Px2.p1.7.m7.2.3.2.1" stretchy="false" xref="S2.SS3.SSS0.Px2.p1.7.m7.2.3.1.cmml">(</mo><mi id="S2.SS3.SSS0.Px2.p1.7.m7.1.1" xref="S2.SS3.SSS0.Px2.p1.7.m7.1.1.cmml">A</mi><mo id="S2.SS3.SSS0.Px2.p1.7.m7.2.3.2.2" xref="S2.SS3.SSS0.Px2.p1.7.m7.2.3.1.cmml">,</mo><mi id="S2.SS3.SSS0.Px2.p1.7.m7.2.2" xref="S2.SS3.SSS0.Px2.p1.7.m7.2.2.cmml">B</mi><mo id="S2.SS3.SSS0.Px2.p1.7.m7.2.3.2.3" stretchy="false" xref="S2.SS3.SSS0.Px2.p1.7.m7.2.3.1.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px2.p1.7.m7.2b"><interval closure="open" id="S2.SS3.SSS0.Px2.p1.7.m7.2.3.1.cmml" xref="S2.SS3.SSS0.Px2.p1.7.m7.2.3.2"><ci id="S2.SS3.SSS0.Px2.p1.7.m7.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.7.m7.1.1">𝐴</ci><ci id="S2.SS3.SSS0.Px2.p1.7.m7.2.2.cmml" xref="S2.SS3.SSS0.Px2.p1.7.m7.2.2">𝐵</ci></interval></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px2.p1.7.m7.2c">(A,B)</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px2.p1.7.m7.2d">( italic_A , italic_B 
)</annotation></semantics></math> with observed mean performance <math alttext="y_{A},y_{B}" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px2.p1.8.m8.2"><semantics id="S2.SS3.SSS0.Px2.p1.8.m8.2a"><mrow id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.3.cmml"><msub id="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1" xref="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.cmml"><mi id="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.2" xref="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.2.cmml">y</mi><mi id="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.3" xref="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.3.cmml">A</mi></msub><mo id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.3" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.3.cmml">,</mo><msub id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.cmml"><mi id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.2" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.2.cmml">y</mi><mi id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.3" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.3.cmml">B</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px2.p1.8.m8.2b"><list id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.3.cmml" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2"><apply id="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1">subscript</csymbol><ci id="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.2.cmml" xref="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.2">𝑦</ci><ci id="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.3.cmml" xref="S2.SS3.SSS0.Px2.p1.8.m8.1.1.1.1.3">𝐴</ci></apply><apply id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.cmml" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.1.cmml" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2">subscript</csymbol><ci id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.2.cmml" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.2">𝑦</ci><ci id="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.3.cmml" xref="S2.SS3.SSS0.Px2.p1.8.m8.2.2.2.2.3">𝐵</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" 
id="S2.SS3.SSS0.Px2.p1.8.m8.2c">y_{A},y_{B}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px2.p1.8.m8.2d">italic_y start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT</annotation></semantics></math> and predicted performance <math alttext="\hat{y}_{A},\hat{y}_{B}" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px2.p1.9.m9.2"><semantics id="S2.SS3.SSS0.Px2.p1.9.m9.2a"><mrow id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.3.cmml"><msub id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.cmml"><mover accent="true" id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.cmml"><mi id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.2" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.2.cmml">y</mi><mo id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.1" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.1.cmml">^</mo></mover><mi id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.3" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.3.cmml">A</mi></msub><mo id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.3" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.3.cmml">,</mo><msub id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.cmml"><mover accent="true" id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.cmml"><mi id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.2" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.2.cmml">y</mi><mo id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.1" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.1.cmml">^</mo></mover><mi id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.3" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.3.cmml">B</mi></msub></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px2.p1.9.m9.2b"><list id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.3.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2"><apply id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1"><csymbol cd="ambiguous" id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.1.cmml" 
xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1">subscript</csymbol><apply id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2"><ci id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.1.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.1">^</ci><ci id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.2.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.2.2">𝑦</ci></apply><ci id="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.3.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.1.1.1.1.3">𝐴</ci></apply><apply id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2"><csymbol cd="ambiguous" id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.1.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2">subscript</csymbol><apply id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2"><ci id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.1.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.1">^</ci><ci id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.2.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.2.2">𝑦</ci></apply><ci id="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.3.cmml" xref="S2.SS3.SSS0.Px2.p1.9.m9.2.2.2.2.3">𝐵</ci></apply></list></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px2.p1.9.m9.2c">\hat{y}_{A},\hat{y}_{B}</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px2.p1.9.m9.2d">over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT , over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT</annotation></semantics></math>, respectively, then decision accuracy is:</p>
<table class="ltx_equation ltx_eqn_table" id="S2.E3">
<tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math alttext="\textstyle\frac{1}{\lvert\mathcal{P}\rvert}\sum_{(A,B)\in\mathcal{P}}\mathbb{I%
}\big{(}\text{sign}(\hat{y}_{A}-\hat{y}_{B})=\text{sign}(y_{A}-y_{B})\big{)}" class="ltx_Math" display="block" id="S2.E3.m1.4"><semantics id="S2.E3.m1.4a"><mrow id="S2.E3.m1.4.4" xref="S2.E3.m1.4.4.cmml"><mstyle displaystyle="false" id="S2.E3.m1.1.1" xref="S2.E3.m1.1.1.cmml"><mfrac id="S2.E3.m1.1.1a" xref="S2.E3.m1.1.1.cmml"><mn id="S2.E3.m1.1.1.3" xref="S2.E3.m1.1.1.3.cmml">1</mn><mrow id="S2.E3.m1.1.1.1.3" xref="S2.E3.m1.1.1.1.2.cmml"><mo id="S2.E3.m1.1.1.1.3.1" stretchy="false" xref="S2.E3.m1.1.1.1.2.1.cmml">|</mo><mi class="ltx_font_mathcaligraphic" id="S2.E3.m1.1.1.1.1" xref="S2.E3.m1.1.1.1.1.cmml">𝒫</mi><mo id="S2.E3.m1.1.1.1.3.2" stretchy="false" xref="S2.E3.m1.1.1.1.2.1.cmml">|</mo></mrow></mfrac></mstyle><mo id="S2.E3.m1.4.4.2" xref="S2.E3.m1.4.4.2.cmml">⁢</mo><mrow id="S2.E3.m1.4.4.1" xref="S2.E3.m1.4.4.1.cmml"><mstyle displaystyle="false" id="S2.E3.m1.4.4.1.2" xref="S2.E3.m1.4.4.1.2.cmml"><msub id="S2.E3.m1.4.4.1.2a" xref="S2.E3.m1.4.4.1.2.cmml"><mo id="S2.E3.m1.4.4.1.2.2" xref="S2.E3.m1.4.4.1.2.2.cmml">∑</mo><mrow id="S2.E3.m1.3.3.2" xref="S2.E3.m1.3.3.2.cmml"><mrow id="S2.E3.m1.3.3.2.4.2" xref="S2.E3.m1.3.3.2.4.1.cmml"><mo id="S2.E3.m1.3.3.2.4.2.1" stretchy="false" xref="S2.E3.m1.3.3.2.4.1.cmml">(</mo><mi id="S2.E3.m1.2.2.1.1" xref="S2.E3.m1.2.2.1.1.cmml">A</mi><mo id="S2.E3.m1.3.3.2.4.2.2" xref="S2.E3.m1.3.3.2.4.1.cmml">,</mo><mi id="S2.E3.m1.3.3.2.2" xref="S2.E3.m1.3.3.2.2.cmml">B</mi><mo id="S2.E3.m1.3.3.2.4.2.3" stretchy="false" xref="S2.E3.m1.3.3.2.4.1.cmml">)</mo></mrow><mo id="S2.E3.m1.3.3.2.3" xref="S2.E3.m1.3.3.2.3.cmml">∈</mo><mi class="ltx_font_mathcaligraphic" id="S2.E3.m1.3.3.2.5" xref="S2.E3.m1.3.3.2.5.cmml">𝒫</mi></mrow></msub></mstyle><mrow id="S2.E3.m1.4.4.1.1" xref="S2.E3.m1.4.4.1.1.cmml"><mi id="S2.E3.m1.4.4.1.1.3" xref="S2.E3.m1.4.4.1.1.3.cmml">𝕀</mi><mo id="S2.E3.m1.4.4.1.1.2" xref="S2.E3.m1.4.4.1.1.2.cmml">⁢</mo><mrow id="S2.E3.m1.4.4.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.cmml"><mo id="S2.E3.m1.4.4.1.1.1.1.2" maxsize="120%" 
minsize="120%" xref="S2.E3.m1.4.4.1.1.1.1.1.cmml">(</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.cmml"><mrow id="S2.E3.m1.4.4.1.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.cmml"><mtext id="S2.E3.m1.4.4.1.1.1.1.1.1.3" xref="S2.E3.m1.4.4.1.1.1.1.1.1.3a.cmml">sign</mtext><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.2" xref="S2.E3.m1.4.4.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.cmml"><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.2" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.cmml"><msub id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.cmml"><mover accent="true" id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.cmml"><mi id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.2" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.2.cmml">y</mi><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.1.cmml">^</mo></mover><mi id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.3" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.3.cmml">A</mi></msub><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.cmml">−</mo><msub id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.cmml"><mover accent="true" id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.cmml"><mi id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.2" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.2.cmml">y</mi><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.1" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.1.cmml">^</mo></mover><mi id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.3" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.3.cmml">B</mi></msub></mrow><mo id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.3" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="S2.E3.m1.4.4.1.1.1.1.1.3" xref="S2.E3.m1.4.4.1.1.1.1.1.3.cmml">=</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1.2" 
xref="S2.E3.m1.4.4.1.1.1.1.1.2.cmml"><mtext id="S2.E3.m1.4.4.1.1.1.1.1.2.3" xref="S2.E3.m1.4.4.1.1.1.1.1.2.3a.cmml">sign</mtext><mo id="S2.E3.m1.4.4.1.1.1.1.1.2.2" xref="S2.E3.m1.4.4.1.1.1.1.1.2.2.cmml">⁢</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.cmml"><mo id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.2" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.cmml">(</mo><mrow id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.cmml"><msub id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.cmml"><mi id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.2" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.2.cmml">y</mi><mi id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.3" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.3.cmml">A</mi></msub><mo id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.1" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.1.cmml">−</mo><msub id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.cmml"><mi id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.2" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.2.cmml">y</mi><mi id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.3" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.3.cmml">B</mi></msub></mrow><mo id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.3" stretchy="false" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="S2.E3.m1.4.4.1.1.1.1.3" maxsize="120%" minsize="120%" xref="S2.E3.m1.4.4.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.E3.m1.4b"><apply id="S2.E3.m1.4.4.cmml" xref="S2.E3.m1.4.4"><times id="S2.E3.m1.4.4.2.cmml" xref="S2.E3.m1.4.4.2"></times><apply id="S2.E3.m1.1.1.cmml" xref="S2.E3.m1.1.1"><divide id="S2.E3.m1.1.1.2.cmml" xref="S2.E3.m1.1.1"></divide><cn id="S2.E3.m1.1.1.3.cmml" type="integer" xref="S2.E3.m1.1.1.3">1</cn><apply id="S2.E3.m1.1.1.1.2.cmml" xref="S2.E3.m1.1.1.1.3"><abs id="S2.E3.m1.1.1.1.2.1.cmml" xref="S2.E3.m1.1.1.1.3.1"></abs><ci id="S2.E3.m1.1.1.1.1.cmml" xref="S2.E3.m1.1.1.1.1">𝒫</ci></apply></apply><apply 
id="S2.E3.m1.4.4.1.cmml" xref="S2.E3.m1.4.4.1"><apply id="S2.E3.m1.4.4.1.2.cmml" xref="S2.E3.m1.4.4.1.2"><csymbol cd="ambiguous" id="S2.E3.m1.4.4.1.2.1.cmml" xref="S2.E3.m1.4.4.1.2">subscript</csymbol><sum id="S2.E3.m1.4.4.1.2.2.cmml" xref="S2.E3.m1.4.4.1.2.2"></sum><apply id="S2.E3.m1.3.3.2.cmml" xref="S2.E3.m1.3.3.2"><in id="S2.E3.m1.3.3.2.3.cmml" xref="S2.E3.m1.3.3.2.3"></in><interval closure="open" id="S2.E3.m1.3.3.2.4.1.cmml" xref="S2.E3.m1.3.3.2.4.2"><ci id="S2.E3.m1.2.2.1.1.cmml" xref="S2.E3.m1.2.2.1.1">𝐴</ci><ci id="S2.E3.m1.3.3.2.2.cmml" xref="S2.E3.m1.3.3.2.2">𝐵</ci></interval><ci id="S2.E3.m1.3.3.2.5.cmml" xref="S2.E3.m1.3.3.2.5">𝒫</ci></apply></apply><apply id="S2.E3.m1.4.4.1.1.cmml" xref="S2.E3.m1.4.4.1.1"><times id="S2.E3.m1.4.4.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.2"></times><ci id="S2.E3.m1.4.4.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.3">𝕀</ci><apply id="S2.E3.m1.4.4.1.1.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1"><eq id="S2.E3.m1.4.4.1.1.1.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.3"></eq><apply id="S2.E3.m1.4.4.1.1.1.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1"><times id="S2.E3.m1.4.4.1.1.1.1.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.2"></times><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.3a.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.3"><mtext id="S2.E3.m1.4.4.1.1.1.1.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.3">sign</mtext></ci><apply id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1"><minus id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.1"></minus><apply id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2">subscript</csymbol><apply id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2"><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.1">^</ci><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.2.cmml" 
xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.2.2">𝑦</ci></apply><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.2.3">𝐴</ci></apply><apply id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3">subscript</csymbol><apply id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2"><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.1">^</ci><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.2.2">𝑦</ci></apply><ci id="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.1.1.1.1.3.3">𝐵</ci></apply></apply></apply><apply id="S2.E3.m1.4.4.1.1.1.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2"><times id="S2.E3.m1.4.4.1.1.1.1.1.2.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.2"></times><ci id="S2.E3.m1.4.4.1.1.1.1.1.2.3a.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.3"><mtext id="S2.E3.m1.4.4.1.1.1.1.1.2.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.3">sign</mtext></ci><apply id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1"><minus id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.1"></minus><apply id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2"><csymbol cd="ambiguous" id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2">subscript</csymbol><ci id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.2">𝑦</ci><ci id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.2.3">𝐴</ci></apply><apply id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3"><csymbol cd="ambiguous" id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.1.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3">subscript</csymbol><ci 
id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.2.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.2">𝑦</ci><ci id="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.3.cmml" xref="S2.E3.m1.4.4.1.1.1.1.1.2.1.1.1.3.3">𝐵</ci></apply></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.E3.m1.4c">\textstyle\frac{1}{\lvert\mathcal{P}\rvert}\sum_{(A,B)\in\mathcal{P}}\mathbb{I%
}\big{(}\text{sign}(\hat{y}_{A}-\hat{y}_{B})=\text{sign}(y_{A}-y_{B})\big{)}</annotation><annotation encoding="application/x-llamapun" id="S2.E3.m1.4d">divide start_ARG 1 end_ARG start_ARG | caligraphic_P | end_ARG ∑ start_POSTSUBSCRIPT ( italic_A , italic_B ) ∈ caligraphic_P end_POSTSUBSCRIPT blackboard_I ( sign ( over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT - over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) = sign ( italic_y start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT italic_B end_POSTSUBSCRIPT ) )</annotation></semantics></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(3)</span></td>
</tr></tbody>
</table>
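<p class="ltx_p">Concretely, decision accuracy is straightforward to compute from paired observed and predicted performances. The sketch below is illustrative (the function name, recipe names, and scores are hypothetical, not from the paper's released code):</p>

```python
import itertools
import math

def decision_accuracy(observed, predicted):
    """Fraction of recipe pairs (A, B) where the predicted ordering
    sign(y_hat_A - y_hat_B) matches the observed ordering
    sign(y_A - y_B). Ties count as predicting A >= B in this sketch.

    observed, predicted: dicts mapping recipe name -> mean performance.
    """
    pairs = list(itertools.combinations(observed, 2))
    correct = sum(
        math.copysign(1, predicted[a] - predicted[b])
        == math.copysign(1, observed[a] - observed[b])
        for a, b in pairs
    )
    return correct / len(pairs)

# Toy example: three data recipes with observed target-scale scores (y)
# and small-scale proxy predictions (y_hat); all numbers are made up.
y = {"recipe_A": 0.62, "recipe_B": 0.58, "recipe_C": 0.60}
y_hat = {"recipe_A": 0.41, "recipe_B": 0.44, "recipe_C": 0.39}
print(decision_accuracy(y, y_hat))  # → 0.3333333333333333
```

<p class="ltx_p">Unlike Kendall’s τ, which rescales concordant minus discordant pairs into [−1, 1], this quantity is a plain accuracy over pairs, which is why it ranges from 0 to 1.</p>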
</div>
</section>
<section class="ltx_paragraph" id="S2.SS3.SSS0.Px3">
<h4 class="ltx_title ltx_title_paragraph">Percent of Target Compute Budget (<math alttext="\%C" class="ltx_math_unparsed" display="inline" id="S2.SS3.SSS0.Px3.1.m1.1"><semantics id="S2.SS3.SSS0.Px3.1.m1.1b"><mrow id="S2.SS3.SSS0.Px3.1.m1.1c"><mo id="S2.SS3.SSS0.Px3.1.m1.1.1">%</mo><mi id="S2.SS3.SSS0.Px3.1.m1.1.2">C</mi></mrow><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px3.1.m1.1d">\%C</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px3.1.m1.1e">% italic_C</annotation></semantics></math>)</h4>
<div class="ltx_para ltx_noindent" id="S2.SS3.SSS0.Px3.p1">
<p class="ltx_p" id="S2.SS3.SSS0.Px3.p1.4">We measure compute in terms of theoretical FLOPs following the simplifying assumption made in most scaling literature that the costs associated with training a model are captured well enough by <math alttext="\text{FLOPs}=6ND" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px3.p1.1.m1.1"><semantics id="S2.SS3.SSS0.Px3.p1.1.m1.1a"><mrow id="S2.SS3.SSS0.Px3.p1.1.m1.1.1" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.cmml"><mtext id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.2" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.2a.cmml">FLOPs</mtext><mo id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.1" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.1.cmml">=</mo><mrow id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.cmml"><mn id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.2" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.2.cmml">6</mn><mo id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.1" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.1.cmml">⁢</mo><mi id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.3" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.3.cmml">N</mi><mo id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.1a" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.1.cmml">⁢</mo><mi id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.4" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.4.cmml">D</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px3.p1.1.m1.1b"><apply id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.cmml" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1"><eq id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.1.cmml" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.1"></eq><ci id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.2a.cmml" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.2"><mtext id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.2.cmml" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.2">FLOPs</mtext></ci><apply id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.cmml" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3"><times id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.1.cmml" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.1"></times><cn id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.2.cmml" type="integer" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.2">6</cn><ci id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.3.cmml" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.3">𝑁</ci><ci 
id="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.4.cmml" xref="S2.SS3.SSS0.Px3.p1.1.m1.1.1.3.4">𝐷</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px3.p1.1.m1.1c">\text{FLOPs}=6ND</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px3.p1.1.m1.1d">FLOPs = 6 italic_N italic_D</annotation></semantics></math>, based solely on the number of parameters (<math alttext="N" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px3.p1.2.m2.1"><semantics id="S2.SS3.SSS0.Px3.p1.2.m2.1a"><mi id="S2.SS3.SSS0.Px3.p1.2.m2.1.1" xref="S2.SS3.SSS0.Px3.p1.2.m2.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px3.p1.2.m2.1b"><ci id="S2.SS3.SSS0.Px3.p1.2.m2.1.1.cmml" xref="S2.SS3.SSS0.Px3.p1.2.m2.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px3.p1.2.m2.1c">N</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px3.p1.2.m2.1d">italic_N</annotation></semantics></math>) and tokens trained (<math alttext="D" class="ltx_Math" display="inline" id="S2.SS3.SSS0.Px3.p1.3.m3.1"><semantics id="S2.SS3.SSS0.Px3.p1.3.m3.1a"><mi id="S2.SS3.SSS0.Px3.p1.3.m3.1.1" xref="S2.SS3.SSS0.Px3.p1.3.m3.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S2.SS3.SSS0.Px3.p1.3.m3.1b"><ci id="S2.SS3.SSS0.Px3.p1.3.m3.1.1.cmml" xref="S2.SS3.SSS0.Px3.p1.3.m3.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px3.p1.3.m3.1c">D</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px3.p1.3.m3.1d">italic_D</annotation></semantics></math>) <cite class="ltx_cite ltx_citemacro_citep">(Kaplan et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib21" title="">2020</a>)</cite>. 
We consider the efficiency of a prediction based on the ratio of the experimental budget and the target budget in FLOPs, <math alttext="\text{$\%C${}}=\frac{c}{C}\times 100\%" class="ltx_math_unparsed" display="inline" id="S2.SS3.SSS0.Px3.p1.4.m4.1"><semantics id="S2.SS3.SSS0.Px3.p1.4.m4.1a"><mrow id="S2.SS3.SSS0.Px3.p1.4.m4.1.2"><mrow id="S2.SS3.SSS0.Px3.p1.4.m4.1.1.1"><mo id="S2.SS3.SSS0.Px3.p1.4.m4.1.1.1.2">%</mo><mi id="S2.SS3.SSS0.Px3.p1.4.m4.1.1.1.3">C</mi></mrow><mo id="S2.SS3.SSS0.Px3.p1.4.m4.1.2.1">=</mo><mrow id="S2.SS3.SSS0.Px3.p1.4.m4.1.2.2"><mfrac id="S2.SS3.SSS0.Px3.p1.4.m4.1.2.2.2"><mi id="S2.SS3.SSS0.Px3.p1.4.m4.1.2.2.2.2">c</mi><mi id="S2.SS3.SSS0.Px3.p1.4.m4.1.2.2.2.3">C</mi></mfrac><mo id="S2.SS3.SSS0.Px3.p1.4.m4.1.2.2.1" lspace="0.222em" rspace="0.222em">×</mo><mrow id="S2.SS3.SSS0.Px3.p1.4.m4.1.2.2.3"><mn id="S2.SS3.SSS0.Px3.p1.4.m4.1.2.2.3.2">100</mn><mo id="S2.SS3.SSS0.Px3.p1.4.m4.1.2.2.3.1">%</mo></mrow></mrow></mrow><annotation encoding="application/x-tex" id="S2.SS3.SSS0.Px3.p1.4.m4.1b">\text{$\%C${}}=\frac{c}{C}\times 100\%</annotation><annotation encoding="application/x-llamapun" id="S2.SS3.SSS0.Px3.p1.4.m4.1c">% italic_C = divide start_ARG italic_c end_ARG start_ARG italic_C end_ARG × 100 %</annotation></semantics></math>.</p>
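<p class="ltx_p">As a worked illustration of this accounting (the model and token sizes below are hypothetical, chosen only to show the arithmetic):</p>

```python
def flops(n_params, n_tokens):
    # Standard scaling-law approximation: training cost ≈ 6 * N * D FLOPs.
    return 6 * n_params * n_tokens

# Experimental budget c: a 150M-parameter proxy trained on 3B tokens.
c = flops(150e6, 3e9)
# Target budget C: a 1B-parameter model trained on 100B tokens.
C = flops(1e9, 100e9)

percent_c = c / C * 100  # %C, the share of the target compute consumed
print(f"{percent_c:.3f}%")  # → 0.450%
```

<p class="ltx_p">Here the proxy experiment costs well under 1% of the target run, which is the regime where cheap predictions are most valuable.</p>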
</div>
</section>
</section>
<section class="ltx_subsection" id="S2.SS4">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.4 </span>Performance Evaluation with OLMES</h3>
<div class="ltx_para ltx_noindent" id="S2.SS4.p1">
<p class="ltx_p" id="S2.SS4.p1.1">We use the OLMES suite of 10 multiple choice question answering benchmarks <cite class="ltx_cite ltx_citemacro_citep">(Gu et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib17" title="">2024</a>)</cite>: MMLU <cite class="ltx_cite ltx_citemacro_citep">(Hendrycks et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib18" title="">2021</a>)</cite>, HellaSwag <cite class="ltx_cite ltx_citemacro_citep">(Zellers et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib41" title="">2019</a>)</cite>, ARC Challenge <cite class="ltx_cite ltx_citemacro_citep">(Clark et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib10" title="">2018</a>)</cite>, ARC Easy <cite class="ltx_cite ltx_citemacro_citep">(Clark et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib10" title="">2018</a>)</cite>, PIQA <cite class="ltx_cite ltx_citemacro_citep">(Bisk et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib5" title="">2020</a>)</cite>, CommonsenseQA <cite class="ltx_cite ltx_citemacro_citep">(Talmor et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib39" title="">2019</a>)</cite>, SocialIQA <cite class="ltx_cite ltx_citemacro_citep">(Sap et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib35" title="">2019</a>)</cite>, OpenBookQA <cite class="ltx_cite ltx_citemacro_citep">(Mihaylov et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib26" title="">2018</a>)</cite>, BoolQ <cite class="ltx_cite ltx_citemacro_citep">(Clark et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib9" title="">2019</a>)</cite>, and WinoGrande <cite class="ltx_cite ltx_citemacro_citep">(Sakaguchi et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib34" title="">2020</a>)</cite>. 
These tasks are well suited to the model scales we examine, with all but BoolQ achieving non-trivial performance. Unless otherwise noted, we consider the macro average of these ten tasks. The underlying metric for each task is accuracy, for which OLMES specifies a different length-normalization scheme per task. Our target “gold” rankings, which we aim to predict, are always based on the “cloze” formulation (CF) accuracy with curated normalization per task, which we refer to as <span class="ltx_text ltx_font_smallcaps" id="S2.SS4.p1.1.1">Accuracy</span>. We diverge from OLMES only in that we use all available items in the specified split of each benchmark rather than subsampling them, to reduce variance over the task distribution.</p>
</div>
<div class="ltx_para ltx_noindent" id="S2.SS4.p2">
<p class="ltx_p" id="S2.SS4.p2.1">Note that while we focus just on OLMES multiple choice evaluations in this work, our method of validating decisions made through predictions can be applied to other benchmarks. We chose these tasks based on their appropriateness to our range of model scales, and one would have to select different tasks when targeting a larger scale. Moreover, <span class="ltx_text ltx_font_smallcaps" id="S2.SS4.p2.1.1">DataDecide</span> could be used to identify new evaluations that are sensitive within our range of scales.</p>
</div>
</section>
<section class="ltx_subsection" id="S2.SS5">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">2.5 </span>Proxy Metrics for Performance Evaluation</h3>
<div class="ltx_para ltx_noindent" id="S2.SS5.p1">
<p class="ltx_p" id="S2.SS5.p1.1">Previous work has noted how discrete metrics such as accuracy can cause jumps in performance across scale that otherwise see more predictable improvements with scale for continuous metrics <cite class="ltx_cite ltx_citemacro_citep">(Schaeffer et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib36" title="">2023</a>)</cite>.
We experiment with using continuous metrics at small scale as proxies of the accuracies selected by OLMES for each task (<span class="ltx_text ltx_font_smallcaps" id="S2.SS5.p1.1.1">Accuracy</span>) at the target scale to improve decision accuracy. We use the following metrics: <span class="ltx_text ltx_font_smallcaps" id="S2.SS5.p1.1.2">Correct Prob</span> is the average probability of the correct continuation. <span class="ltx_text ltx_font_smallcaps" id="S2.SS5.p1.1.3">Margin</span> is the average difference between the probability of the correct continuation and the most likely incorrect continuation. <span class="ltx_text ltx_font_smallcaps" id="S2.SS5.p1.1.4">Norm Correct Prob</span> is the average probability of the correct continuation conditioned on the response being in the set of correct or incorrect continuations. <span class="ltx_text ltx_font_smallcaps" id="S2.SS5.p1.1.5">Total Prob</span> is the average of the summed probabilities of all correct and incorrect continuations. <span class="ltx_text ltx_font_smallcaps" id="S2.SS5.p1.1.6">Accuracy</span> is the fraction of instances where the correct continuation has the highest probability. Each of these can be computed with likelihoods normalized by the number of tokens or characters; unless otherwise specified we use character-length normalization. Appendix Table <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A2.T3" title="Table 3 ‣ Appendix B Proxy Metric Definitions ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">3</span></a> shows formal definitions.</p>
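<p class="ltx_p">These metric definitions can be sketched over per-instance, length-normalized continuation probabilities. The code below is a minimal illustration under the assumption of a single correct continuation per item (as in the OLMES multiple choice tasks); the function and variable names are ours, not the paper's:</p>

```python
def proxy_metrics(items):
    """Compute small-scale proxy metrics from per-instance
    length-normalized continuation probabilities.

    items: list of (p_correct, incorrect_probs) tuples, where
    p_correct is the probability of the correct continuation and
    incorrect_probs holds the probabilities of the incorrect ones.
    """
    n = len(items)
    return {
        # Correct Prob: mean probability of the correct continuation.
        "correct_prob": sum(pc for pc, _ in items) / n,
        # Margin: correct prob minus best incorrect prob, averaged.
        "margin": sum(pc - max(pi) for pc, pi in items) / n,
        # Norm Correct Prob: correct prob renormalized over all options.
        "norm_correct_prob": sum(pc / (pc + sum(pi)) for pc, pi in items) / n,
        # Total Prob: total mass assigned to any listed continuation.
        "total_prob": sum(pc + sum(pi) for pc, pi in items) / n,
        # Accuracy: fraction of items where the correct option wins.
        "accuracy": sum(pc > max(pi) for pc, pi in items) / n,
    }

# Toy two-item benchmark (probabilities are made up).
metrics = proxy_metrics([(0.5, [0.2, 0.1]), (0.3, [0.4, 0.1])])
```

<p class="ltx_p">On this toy input, Accuracy is 0.5 (the second item is answered incorrectly) while the continuous metrics vary smoothly, which is precisely why they can give a less noisy small-scale signal.</p>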
</div>
</section>
</section>
<section class="ltx_section" id="S3">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">3 </span>Results</h2>
<figure class="ltx_figure" id="S3.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="342" id="S3.F2.g1" src="x3.png" width="830"/>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span>Accuracy in pairwise decisions on best data when evaluating on the 10 OLMES tasks with <span class="ltx_text ltx_font_smallcaps" id="S3.F2.2.1">Accuracy</span> (shown aggregated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S0.F1" title="Figure 1 ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">1</span></a>). Specific tasks have very distinct ranges of sensitivity, with some like ARC Easy being predictable at small scales and others like HellaSwag requiring substantially more compute to predict. </figcaption>
</figure>
<section class="ltx_subsection" id="S3.SS1">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.1 </span>What is the best way to spend compute for data decisions?</h3>
<div class="ltx_para ltx_noindent" id="S3.SS1.p1">
<svg class="ltx_picture" height="99.28" id="S3.SS1.p1.pic1" overflow="visible" version="1.1" width="600"><g fill="#000000" stroke="#000000" stroke-width="0.4pt" transform="translate(0,99.28) matrix(1 0 0 -1 0 0)"><g fill="#FFFF80" fill-opacity="1.0"><path d="M 0 0 L 0 99.28 L 600 99.28 L 600 0 Z" style="stroke:none"></path></g><g fill="#FFFFE6" fill-opacity="1.0"><path d="M 0 0 L 0 99.28 L 600 99.28 L 600 0 Z" style="stroke:none"></path></g><g fill-opacity="1.0" transform="matrix(1.0 0.0 0.0 1.0 19.68 11.81)"><foreignobject color="#000000" height="75.66" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="560.63">
<span class="ltx_inline-block ltx_minipage ltx_align_bottom" id="S3.SS1.p1.pic1.1.1.1.1.1" style="width:405.2pt;">
<span class="ltx_p" id="S3.SS1.p1.pic1.1.1.1.1.1.1"><span class="ltx_text ltx_font_italic" id="S3.SS1.p1.pic1.1.1.1.1.1.1.1">More compute makes better decisions. Decisions from intermediate checkpoints are as good as compute equivalent final checkpoints. The amount of compute needed to make good predictions varies between tasks. ARC and MMLU are predictable with much less compute than HellaSwag. The rest of OLMES tasks give markedly less reliable predictions across the scales we examine.</span></span>
</span></foreignobject></g></g></svg>
</div>
<div class="ltx_para ltx_noindent" id="S3.SS1.p2">
<p class="ltx_p" id="S3.SS1.p2.1">First looking at the aggregation of all 10 OLMES tasks (Figure <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S0.F1" title="Figure 1 ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">1</span></a> right), we see a positive and roughly log-linear relationship between experimental compute and decision accuracy. Specifically, this figure illustrates the relationship between the compute used for predicting best data recipes and the decision accuracy those predictions achieve against targets ranked by OLMES performance at the 1B scale. Each point represents the average decision accuracy over three runs with different random seeds, with shading indicating standard deviation. Points with the same color show all intermediate checkpoints from a given parameter size; each color thus corresponds to predictions made by ranking single scale experiments at that model size. The stars show predictions from extrapolating scaling laws using our default 3-parameter approach, the details of which are discussed further in §<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.SS2" title="3.2 How does extrapolating scaling laws compare to ranking single scale experiments? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">3.2</span></a>.</p>
</div>
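Decision accuracy here can be sketched as pairwise agreement between proxy and target rankings. The helper below is a hypothetical illustration (the name and tie handling are our own choices, not the paper's code):

```python
from itertools import combinations

def decision_accuracy(proxy_scores, target_scores):
    """Fraction of data-recipe pairs for which the small-scale proxy picks
    the same winner as the target-scale benchmark.
    Both arguments map recipe name -> score."""
    pairs = list(combinations(proxy_scores, 2))
    agree = sum(
        (proxy_scores[a] > proxy_scores[b]) == (target_scores[a] > target_scores[b])
        for a, b in pairs
    )
    return agree / len(pairs)
```

A random proxy would score around 0.5 on this measure, which is why near-0.5 values are described as "trivial" decision accuracy.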
<div class="ltx_para ltx_noindent" id="S3.SS1.p3">
<p class="ltx_p" id="S3.SS1.p3.1">The ease of prediction is greatly influenced by which evaluation benchmark we use. In Figure <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.F2" title="Figure 2 ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2</span></a>, we show the relationship of compute and decision accuracy for each of the tasks in OLMES individually. The predictive sensitivity of tasks at a given compute varies significantly: ARC Easy is consistently predictable with 5 orders of magnitude less compute than the target scale, while BoolQ only reaches beyond trivial decision accuracy for intermediate checkpoints of the target runs. HellaSwag, SocialIQA, and WinoGrande show distinct periods of insensitivity followed by a roughly log-linear increase after hitting some compute threshold.</p>
</div>
<figure class="ltx_figure" id="S3.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="340" id="S3.F3.g1" src="x4.png" width="622"/>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span>Decision accuracy over 8 baseline scaling law variants. At best, these approaches reach only the same compute to decision accuracy frontier as ranking single scale experiments. <span class="ltx_text ltx_font_smallcaps" id="S3.F3.2.1">DataDecide</span> can be used to iterate on future scaling law prediction methods.</figcaption>
</figure>
</section>
<section class="ltx_subsection" id="S3.SS2">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.2 </span>How does extrapolating scaling laws compare to ranking single scale experiments?</h3>
<div class="ltx_para ltx_noindent" id="S3.SS2.p1">
<svg class="ltx_picture" height="52.53" id="S3.SS2.p1.pic1" overflow="visible" version="1.1" width="600"><g fill="#000000" stroke="#000000" stroke-width="0.4pt" transform="translate(0,52.53) matrix(1 0 0 -1 0 0)"><g fill="#FFFF80" fill-opacity="1.0"><path d="M 0 0 L 0 52.53 L 600 52.53 L 600 0 Z" style="stroke:none"></path></g><g fill="#FFFFE6" fill-opacity="1.0"><path d="M 0 0 L 0 52.53 L 600 52.53 L 600 0 Z" style="stroke:none"></path></g><g fill-opacity="1.0" transform="matrix(1.0 0.0 0.0 1.0 19.68 11.81)"><foreignobject color="#000000" height="28.9" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="560.63">
<span class="ltx_inline-block ltx_minipage ltx_align_bottom" id="S3.SS2.p1.pic1.1.1.1.1.1" style="width:405.2pt;">
<span class="ltx_p" id="S3.SS2.p1.pic1.1.1.1.1.1.1"><span class="ltx_text ltx_font_italic" id="S3.SS2.p1.pic1.1.1.1.1.1.1.1">A selection of 8 baseline scaling law methods are no more efficient than ranking single scale experiments. Future scaling law methods can be assessed on <span class="ltx_text ltx_font_smallcaps" id="S3.SS2.p1.pic1.1.1.1.1.1.1.1.1">DataDecide</span>.</span></span>
</span></foreignobject></g></g></svg>
</div>
<div class="ltx_para ltx_noindent" id="S3.SS2.p2">
<p class="ltx_p" id="S3.SS2.p2.3">Figure <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.F3" title="Figure 3 ‣ 3.1 What is the best way to spend compute for data decisions? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">3</span></a> contrasts different approaches to fitting scaling laws over multiple scales of small experiments. Each of the 8 approaches is shown in a different color. Multi-scale predictions have a compute budget equal to the training cost of the model sizes used to make the prediction. We try the following combinations of models sizes: We use <math alttext="\left\{\{s_{1},\dots,s_{k}\}\mid 3\leq k\leq 14{}\right\}" class="ltx_Math" display="inline" id="S3.SS2.p2.1.m1.3"><semantics id="S3.SS2.p2.1.m1.3a"><mrow id="S3.SS2.p2.1.m1.3.3.2" xref="S3.SS2.p2.1.m1.3.3.3.cmml"><mo id="S3.SS2.p2.1.m1.3.3.2.3" xref="S3.SS2.p2.1.m1.3.3.3.1.cmml">{</mo><mrow id="S3.SS2.p2.1.m1.2.2.1.1.2" xref="S3.SS2.p2.1.m1.2.2.1.1.3.cmml"><mo id="S3.SS2.p2.1.m1.2.2.1.1.2.3" stretchy="false" xref="S3.SS2.p2.1.m1.2.2.1.1.3.cmml">{</mo><msub id="S3.SS2.p2.1.m1.2.2.1.1.1.1" xref="S3.SS2.p2.1.m1.2.2.1.1.1.1.cmml"><mi id="S3.SS2.p2.1.m1.2.2.1.1.1.1.2" xref="S3.SS2.p2.1.m1.2.2.1.1.1.1.2.cmml">s</mi><mn id="S3.SS2.p2.1.m1.2.2.1.1.1.1.3" xref="S3.SS2.p2.1.m1.2.2.1.1.1.1.3.cmml">1</mn></msub><mo id="S3.SS2.p2.1.m1.2.2.1.1.2.4" xref="S3.SS2.p2.1.m1.2.2.1.1.3.cmml">,</mo><mi id="S3.SS2.p2.1.m1.1.1" mathvariant="normal" xref="S3.SS2.p2.1.m1.1.1.cmml">…</mi><mo id="S3.SS2.p2.1.m1.2.2.1.1.2.5" xref="S3.SS2.p2.1.m1.2.2.1.1.3.cmml">,</mo><msub id="S3.SS2.p2.1.m1.2.2.1.1.2.2" xref="S3.SS2.p2.1.m1.2.2.1.1.2.2.cmml"><mi id="S3.SS2.p2.1.m1.2.2.1.1.2.2.2" xref="S3.SS2.p2.1.m1.2.2.1.1.2.2.2.cmml">s</mi><mi id="S3.SS2.p2.1.m1.2.2.1.1.2.2.3" xref="S3.SS2.p2.1.m1.2.2.1.1.2.2.3.cmml">k</mi></msub><mo id="S3.SS2.p2.1.m1.2.2.1.1.2.6" stretchy="false" xref="S3.SS2.p2.1.m1.2.2.1.1.3.cmml">}</mo></mrow><mo fence="true" 
id="S3.SS2.p2.1.m1.3.3.2.4" lspace="0em" rspace="0em" xref="S3.SS2.p2.1.m1.3.3.3.1.cmml">∣</mo><mrow id="S3.SS2.p2.1.m1.3.3.2.2" xref="S3.SS2.p2.1.m1.3.3.2.2.cmml"><mn id="S3.SS2.p2.1.m1.3.3.2.2.2" xref="S3.SS2.p2.1.m1.3.3.2.2.2.cmml">3</mn><mo id="S3.SS2.p2.1.m1.3.3.2.2.3" xref="S3.SS2.p2.1.m1.3.3.2.2.3.cmml">≤</mo><mi id="S3.SS2.p2.1.m1.3.3.2.2.4" xref="S3.SS2.p2.1.m1.3.3.2.2.4.cmml">k</mi><mo id="S3.SS2.p2.1.m1.3.3.2.2.5" xref="S3.SS2.p2.1.m1.3.3.2.2.5.cmml">≤</mo><mn id="S3.SS2.p2.1.m1.3.3.2.2.6" xref="S3.SS2.p2.1.m1.3.3.2.2.6.cmml">14</mn></mrow><mo id="S3.SS2.p2.1.m1.3.3.2.5" xref="S3.SS2.p2.1.m1.3.3.3.1.cmml">}</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.1.m1.3b"><apply id="S3.SS2.p2.1.m1.3.3.3.cmml" xref="S3.SS2.p2.1.m1.3.3.2"><csymbol cd="latexml" id="S3.SS2.p2.1.m1.3.3.3.1.cmml" xref="S3.SS2.p2.1.m1.3.3.2.3">conditional-set</csymbol><set id="S3.SS2.p2.1.m1.2.2.1.1.3.cmml" xref="S3.SS2.p2.1.m1.2.2.1.1.2"><apply id="S3.SS2.p2.1.m1.2.2.1.1.1.1.cmml" xref="S3.SS2.p2.1.m1.2.2.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.1.m1.2.2.1.1.1.1.1.cmml" xref="S3.SS2.p2.1.m1.2.2.1.1.1.1">subscript</csymbol><ci id="S3.SS2.p2.1.m1.2.2.1.1.1.1.2.cmml" xref="S3.SS2.p2.1.m1.2.2.1.1.1.1.2">𝑠</ci><cn id="S3.SS2.p2.1.m1.2.2.1.1.1.1.3.cmml" type="integer" xref="S3.SS2.p2.1.m1.2.2.1.1.1.1.3">1</cn></apply><ci id="S3.SS2.p2.1.m1.1.1.cmml" xref="S3.SS2.p2.1.m1.1.1">…</ci><apply id="S3.SS2.p2.1.m1.2.2.1.1.2.2.cmml" xref="S3.SS2.p2.1.m1.2.2.1.1.2.2"><csymbol cd="ambiguous" id="S3.SS2.p2.1.m1.2.2.1.1.2.2.1.cmml" xref="S3.SS2.p2.1.m1.2.2.1.1.2.2">subscript</csymbol><ci id="S3.SS2.p2.1.m1.2.2.1.1.2.2.2.cmml" xref="S3.SS2.p2.1.m1.2.2.1.1.2.2.2">𝑠</ci><ci id="S3.SS2.p2.1.m1.2.2.1.1.2.2.3.cmml" xref="S3.SS2.p2.1.m1.2.2.1.1.2.2.3">𝑘</ci></apply></set><apply id="S3.SS2.p2.1.m1.3.3.2.2.cmml" xref="S3.SS2.p2.1.m1.3.3.2.2"><and id="S3.SS2.p2.1.m1.3.3.2.2a.cmml" xref="S3.SS2.p2.1.m1.3.3.2.2"></and><apply id="S3.SS2.p2.1.m1.3.3.2.2b.cmml" 
xref="S3.SS2.p2.1.m1.3.3.2.2"><leq id="S3.SS2.p2.1.m1.3.3.2.2.3.cmml" xref="S3.SS2.p2.1.m1.3.3.2.2.3"></leq><cn id="S3.SS2.p2.1.m1.3.3.2.2.2.cmml" type="integer" xref="S3.SS2.p2.1.m1.3.3.2.2.2">3</cn><ci id="S3.SS2.p2.1.m1.3.3.2.2.4.cmml" xref="S3.SS2.p2.1.m1.3.3.2.2.4">𝑘</ci></apply><apply id="S3.SS2.p2.1.m1.3.3.2.2c.cmml" xref="S3.SS2.p2.1.m1.3.3.2.2"><leq id="S3.SS2.p2.1.m1.3.3.2.2.5.cmml" xref="S3.SS2.p2.1.m1.3.3.2.2.5"></leq><share href="https://arxiv.org/html/2504.11393v1#S3.SS2.p2.1.m1.3.3.2.2.4.cmml" id="S3.SS2.p2.1.m1.3.3.2.2d.cmml" xref="S3.SS2.p2.1.m1.3.3.2.2"></share><cn id="S3.SS2.p2.1.m1.3.3.2.2.6.cmml" type="integer" xref="S3.SS2.p2.1.m1.3.3.2.2.6">14</cn></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.1.m1.3c">\left\{\{s_{1},\dots,s_{k}\}\mid 3\leq k\leq 14{}\right\}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.1.m1.3d">{ { italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } ∣ 3 ≤ italic_k ≤ 14 }</annotation></semantics></math>, where <math alttext="\mathbf{s}" class="ltx_Math" display="inline" id="S3.SS2.p2.2.m2.1"><semantics id="S3.SS2.p2.2.m2.1a"><mi id="S3.SS2.p2.2.m2.1.1" xref="S3.SS2.p2.2.m2.1.1.cmml">𝐬</mi><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.2.m2.1b"><ci id="S3.SS2.p2.2.m2.1.1.cmml" xref="S3.SS2.p2.2.m2.1.1">𝐬</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.2.m2.1c">\mathbf{s}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.2.m2.1d">bold_s</annotation></semantics></math> is the ordered set of sizes, to explore the improvements of progressively adding larger model sizes beyond the minimum 3 required for fitting. 
We also use <math alttext="\left\{\{s_{k},\dots,s_{14}\}\mid 2\leq k\leq 11\right\}" class="ltx_Math" display="inline" id="S3.SS2.p2.3.m3.3"><semantics id="S3.SS2.p2.3.m3.3a"><mrow id="S3.SS2.p2.3.m3.3.3.2" xref="S3.SS2.p2.3.m3.3.3.3.cmml"><mo id="S3.SS2.p2.3.m3.3.3.2.3" xref="S3.SS2.p2.3.m3.3.3.3.1.cmml">{</mo><mrow id="S3.SS2.p2.3.m3.2.2.1.1.2" xref="S3.SS2.p2.3.m3.2.2.1.1.3.cmml"><mo id="S3.SS2.p2.3.m3.2.2.1.1.2.3" stretchy="false" xref="S3.SS2.p2.3.m3.2.2.1.1.3.cmml">{</mo><msub id="S3.SS2.p2.3.m3.2.2.1.1.1.1" xref="S3.SS2.p2.3.m3.2.2.1.1.1.1.cmml"><mi id="S3.SS2.p2.3.m3.2.2.1.1.1.1.2" xref="S3.SS2.p2.3.m3.2.2.1.1.1.1.2.cmml">s</mi><mi id="S3.SS2.p2.3.m3.2.2.1.1.1.1.3" xref="S3.SS2.p2.3.m3.2.2.1.1.1.1.3.cmml">k</mi></msub><mo id="S3.SS2.p2.3.m3.2.2.1.1.2.4" xref="S3.SS2.p2.3.m3.2.2.1.1.3.cmml">,</mo><mi id="S3.SS2.p2.3.m3.1.1" mathvariant="normal" xref="S3.SS2.p2.3.m3.1.1.cmml">…</mi><mo id="S3.SS2.p2.3.m3.2.2.1.1.2.5" xref="S3.SS2.p2.3.m3.2.2.1.1.3.cmml">,</mo><msub id="S3.SS2.p2.3.m3.2.2.1.1.2.2" xref="S3.SS2.p2.3.m3.2.2.1.1.2.2.cmml"><mi id="S3.SS2.p2.3.m3.2.2.1.1.2.2.2" xref="S3.SS2.p2.3.m3.2.2.1.1.2.2.2.cmml">s</mi><mn id="S3.SS2.p2.3.m3.2.2.1.1.2.2.3" xref="S3.SS2.p2.3.m3.2.2.1.1.2.2.3.cmml">14</mn></msub><mo id="S3.SS2.p2.3.m3.2.2.1.1.2.6" stretchy="false" xref="S3.SS2.p2.3.m3.2.2.1.1.3.cmml">}</mo></mrow><mo fence="true" id="S3.SS2.p2.3.m3.3.3.2.4" lspace="0em" rspace="0em" xref="S3.SS2.p2.3.m3.3.3.3.1.cmml">∣</mo><mrow id="S3.SS2.p2.3.m3.3.3.2.2" xref="S3.SS2.p2.3.m3.3.3.2.2.cmml"><mn id="S3.SS2.p2.3.m3.3.3.2.2.2" xref="S3.SS2.p2.3.m3.3.3.2.2.2.cmml">2</mn><mo id="S3.SS2.p2.3.m3.3.3.2.2.3" xref="S3.SS2.p2.3.m3.3.3.2.2.3.cmml">≤</mo><mi id="S3.SS2.p2.3.m3.3.3.2.2.4" xref="S3.SS2.p2.3.m3.3.3.2.2.4.cmml">k</mi><mo id="S3.SS2.p2.3.m3.3.3.2.2.5" xref="S3.SS2.p2.3.m3.3.3.2.2.5.cmml">≤</mo><mn id="S3.SS2.p2.3.m3.3.3.2.2.6" xref="S3.SS2.p2.3.m3.3.3.2.2.6.cmml">11</mn></mrow><mo id="S3.SS2.p2.3.m3.3.3.2.5" 
xref="S3.SS2.p2.3.m3.3.3.3.1.cmml">}</mo></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.p2.3.m3.3b"><apply id="S3.SS2.p2.3.m3.3.3.3.cmml" xref="S3.SS2.p2.3.m3.3.3.2"><csymbol cd="latexml" id="S3.SS2.p2.3.m3.3.3.3.1.cmml" xref="S3.SS2.p2.3.m3.3.3.2.3">conditional-set</csymbol><set id="S3.SS2.p2.3.m3.2.2.1.1.3.cmml" xref="S3.SS2.p2.3.m3.2.2.1.1.2"><apply id="S3.SS2.p2.3.m3.2.2.1.1.1.1.cmml" xref="S3.SS2.p2.3.m3.2.2.1.1.1.1"><csymbol cd="ambiguous" id="S3.SS2.p2.3.m3.2.2.1.1.1.1.1.cmml" xref="S3.SS2.p2.3.m3.2.2.1.1.1.1">subscript</csymbol><ci id="S3.SS2.p2.3.m3.2.2.1.1.1.1.2.cmml" xref="S3.SS2.p2.3.m3.2.2.1.1.1.1.2">𝑠</ci><ci id="S3.SS2.p2.3.m3.2.2.1.1.1.1.3.cmml" xref="S3.SS2.p2.3.m3.2.2.1.1.1.1.3">𝑘</ci></apply><ci id="S3.SS2.p2.3.m3.1.1.cmml" xref="S3.SS2.p2.3.m3.1.1">…</ci><apply id="S3.SS2.p2.3.m3.2.2.1.1.2.2.cmml" xref="S3.SS2.p2.3.m3.2.2.1.1.2.2"><csymbol cd="ambiguous" id="S3.SS2.p2.3.m3.2.2.1.1.2.2.1.cmml" xref="S3.SS2.p2.3.m3.2.2.1.1.2.2">subscript</csymbol><ci id="S3.SS2.p2.3.m3.2.2.1.1.2.2.2.cmml" xref="S3.SS2.p2.3.m3.2.2.1.1.2.2.2">𝑠</ci><cn id="S3.SS2.p2.3.m3.2.2.1.1.2.2.3.cmml" type="integer" xref="S3.SS2.p2.3.m3.2.2.1.1.2.2.3">14</cn></apply></set><apply id="S3.SS2.p2.3.m3.3.3.2.2.cmml" xref="S3.SS2.p2.3.m3.3.3.2.2"><and id="S3.SS2.p2.3.m3.3.3.2.2a.cmml" xref="S3.SS2.p2.3.m3.3.3.2.2"></and><apply id="S3.SS2.p2.3.m3.3.3.2.2b.cmml" xref="S3.SS2.p2.3.m3.3.3.2.2"><leq id="S3.SS2.p2.3.m3.3.3.2.2.3.cmml" xref="S3.SS2.p2.3.m3.3.3.2.2.3"></leq><cn id="S3.SS2.p2.3.m3.3.3.2.2.2.cmml" type="integer" xref="S3.SS2.p2.3.m3.3.3.2.2.2">2</cn><ci id="S3.SS2.p2.3.m3.3.3.2.2.4.cmml" xref="S3.SS2.p2.3.m3.3.3.2.2.4">𝑘</ci></apply><apply id="S3.SS2.p2.3.m3.3.3.2.2c.cmml" xref="S3.SS2.p2.3.m3.3.3.2.2"><leq id="S3.SS2.p2.3.m3.3.3.2.2.5.cmml" xref="S3.SS2.p2.3.m3.3.3.2.2.5"></leq><share href="https://arxiv.org/html/2504.11393v1#S3.SS2.p2.3.m3.3.3.2.2.4.cmml" id="S3.SS2.p2.3.m3.3.3.2.2d.cmml" xref="S3.SS2.p2.3.m3.3.3.2.2"></share><cn 
id="S3.SS2.p2.3.m3.3.3.2.2.6.cmml" type="integer" xref="S3.SS2.p2.3.m3.3.3.2.2.6">11</cn></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.p2.3.m3.3c">\left\{\{s_{k},\dots,s_{14}\}\mid 2\leq k\leq 11\right\}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.p2.3.m3.3d">{ { italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , … , italic_s start_POSTSUBSCRIPT 14 end_POSTSUBSCRIPT } ∣ 2 ≤ italic_k ≤ 11 }</annotation></semantics></math> to try removing potentially noisy information from small models. Unlike single scale results, we make only one prediction attempt with the default fully trained random seed, as final checkpoints are required for fitting the first step of these scaling law variants but are not available for all seeds.</p>
</div>
<div class="ltx_para ltx_noindent" id="S3.SS2.p3">
<p class="ltx_p" id="S3.SS2.p3.1">Our scaling law approaches vary in the number of parameters fit; whether hard-coded points define the minimum and maximum performance; whether only the second half of intermediate checkpoints is used for fitting the second step; and whether a function is fit directly from compute to accuracy in a single step. Each of the scaling law variants is defined formally in Appendix <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A3" title="Appendix C Scaling Law Variants ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">C</span></a>. The 2- and 3-parameter variants all achieve decision accuracy among the best.</p>
</div>
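As a rough sketch of the single-step compute-to-accuracy style of variant (under our own parameterization, not the paper's fitting code), a 3-parameter power law metric ≈ E − A·C^(−α) can be fit with a grid over the exponent plus linear least squares, then extrapolated to the target compute:

```python
import numpy as np

def fit_power_law(compute, metric, alphas=np.linspace(0.05, 1.0, 96)):
    """Fit metric ~ E - A * compute^(-alpha) by scanning alpha and solving
    the remaining linear least-squares problem for (E, A)."""
    best = None
    for alpha in alphas:
        x = compute ** (-alpha)
        X = np.stack([np.ones_like(x), -x], axis=1)  # columns for E and A
        coef, *_ = np.linalg.lstsq(X, metric, rcond=None)
        sse = np.sum((X @ coef - metric) ** 2)
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], alpha)
    _, E, A, alpha = best
    return E, A, alpha

def predict(E, A, alpha, target_compute):
    # Extrapolate the fitted curve to the (larger) target compute budget.
    return E - A * target_compute ** (-alpha)
```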
<div class="ltx_para ltx_noindent" id="S3.SS2.p4">
<p class="ltx_p" id="S3.SS2.p4.1">A priori, we know that ranking single scale experiments cannot correctly predict when the scaling trend of one data recipe overtakes another at scales between our small experiments and the target scale. Such crossovers bound the decision accuracy of this constant approximation of performance. Nevertheless, ranking single scale experiments sets a high baseline decision accuracy, implying that relatively little crossover occurs. It is difficult to distinguish evaluation variance from true crossovers, but the scaling trends we empirically observe cross over frequently. Improved future scaling laws may be able to advance the Pareto frontier on <span class="ltx_text ltx_font_smallcaps" id="S3.SS2.p4.1.1">DataDecide</span>, as they are not bound by crossovers.</p>
</div>
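To make the crossover notion concrete: two fitted scaling curves cross where their difference changes sign, and that compute can be located numerically. The helper below is hypothetical and assumes at most one crossover in the search range:

```python
import math

def crossover_compute(curve_a, curve_b, lo=1.0, hi=1e22, iters=80):
    """Bisection (on a log-compute axis) for the compute at which two
    fitted scaling curves intersect. Returns None if their ranking is
    the same at both ends of the range (no detected crossover)."""
    f = lambda c: curve_a(c) - curve_b(c)
    if f(lo) * f(hi) > 0:
        return None  # same winner throughout: small-scale ranking persists
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint for log-scale search
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)
```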
<figure class="ltx_figure" id="S3.F4"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="375" id="S3.F4.g1" src="x5.png" width="830"/>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>
Per-task decision accuracy using character normalized proxy metrics for <span class="ltx_text ltx_font_smallcaps" id="S3.F4.7.1" style="color:#0000FF;">Accuracy</span> targets. 5 tasks benefit at smaller scales from using raw likelihood of answers (<span class="ltx_text ltx_font_smallcaps" id="S3.F4.8.2" style="color:#FF8000;">Correct Prob</span> and <span class="ltx_text ltx_font_smallcaps" id="S3.F4.9.3" style="color:#FF0000;">Total Prob</span>), as opposed to discrete <span class="ltx_text ltx_font_smallcaps" id="S3.F4.10.4">Accuracy</span> or continuous metrics that penalize probability on incorrect answers (<span class="ltx_text ltx_font_smallcaps" id="S3.F4.11.5" style="color:#BF0040;">Norm Correct Prob</span>, <span class="ltx_text ltx_font_smallcaps" id="S3.F4.12.6" style="color:#00FF00;">Margin</span>).
</figcaption>
</figure>
</section>
<section class="ltx_subsection" id="S3.SS3">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.3 </span>What proxy metrics give better signal for predictions at small scale?</h3>
<div class="ltx_para ltx_noindent" id="S3.SS3.p1">
<svg class="ltx_picture" height="69.13" id="S3.SS3.p1.pic1" overflow="visible" version="1.1" width="600"><g fill="#000000" stroke="#000000" stroke-width="0.4pt" transform="translate(0,69.13) matrix(1 0 0 -1 0 0)"><g fill="#FFFF80" fill-opacity="1.0"><path d="M 0 0 L 0 69.13 L 600 69.13 L 600 0 Z" style="stroke:none"></path></g><g fill="#FFFFE6" fill-opacity="1.0"><path d="M 0 0 L 0 69.13 L 600 69.13 L 600 0 Z" style="stroke:none"></path></g><g fill-opacity="1.0" transform="matrix(1.0 0.0 0.0 1.0 19.68 11.81)"><foreignobject color="#000000" height="45.51" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="560.63">
<span class="ltx_inline-block ltx_minipage ltx_align_bottom" id="S3.SS3.p1.pic1.1.1.1.1.1" style="width:405.2pt;">
<span class="ltx_p" id="S3.SS3.p1.pic1.1.1.1.1.1.1"><span class="ltx_text ltx_font_italic" id="S3.SS3.p1.pic1.1.1.1.1.1.1.1">At small scales, continuous metrics using the character normalized likelihood of correct or all answer options serve as better or equivalent predictors of decisions than using the same <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p1.pic1.1.1.1.1.1.1.1.1">Accuracy</span> as used at the target scale.</span></span>
</span></foreignobject></g></g></svg>
</div>
<div class="ltx_para ltx_noindent" id="S3.SS3.p2">
<p class="ltx_p" id="S3.SS3.p2.1">Figure <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.F4" title="Figure 4 ‣ 3.2 How does extrapolating scaling laws compare to ranking single scale experiments? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">4</span></a> shows the decision accuracy over different proxy metrics. Here we chose a single length normalization, <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p2.1.1">*_per_char</span>. Metrics follow similar trends regardless of length normalization and this one is empirically optimal for most of the tasks that we observe.</p>
</div>
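A minimal sketch of character-length normalization, assuming the per-character geometric-mean-probability convention (the suite's exact implementation may differ):

```python
import math

def prob_per_char(token_logprobs, continuation_text):
    """Character-length-normalized likelihood of a continuation:
    exp(total log-probability / number of characters)."""
    total = sum(token_logprobs)
    return math.exp(total / max(len(continuation_text), 1))
```

For example, a 4-character continuation with total probability 0.25 gets a per-character probability of 0.25 ** 0.25, which keeps likelihoods comparable across answer options of different lengths.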
<div class="ltx_para ltx_noindent" id="S3.SS3.p3">
<p class="ltx_p" id="S3.SS3.p3.1">Using <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p3.1.1">Correct Prob</span> or <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p3.1.2">Total Prob</span> leads to decision accuracy at least as good as any other metric for most small scales. These continuous metrics are simple likelihoods over answer strings. In particular, <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p3.1.3">Total Prob</span> may be interpretable as a signal that a model has exposure to the domain of a given task, in the form of higher likelihoods on incorrect but presumably relevant additional answers.</p>
</div>
<div class="ltx_para ltx_noindent" id="S3.SS3.p4">
<p class="ltx_p" id="S3.SS3.p4.1">We notice two very distinct types of trends over the different tasks. Either the different proxy metrics are nearly indistinguishable and increase in decision accuracy with compute, or <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p4.1.1">Correct Prob</span> and <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p4.1.2">Total Prob</span> are flat with respect to scale and the other metrics only rise up to that level of decision accuracy towards the full target compute budget. In the last order of magnitude below the target compute, <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p4.1.3">Accuracy</span> and the other metrics tend to overtake <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p4.1.4">Correct Prob</span> and <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p4.1.5">Total Prob</span>, while these two metrics sometimes even decrease in decision accuracy. Notably, these other metrics that trend with <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p4.1.6">Accuracy</span> include continuous metrics that penalize probability assigned to incorrect answers, <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p4.1.7">Norm Correct Prob</span> and <span class="ltx_text ltx_font_smallcaps" id="S3.SS3.p4.1.8">Margin</span>.</p>
</div>
<figure class="ltx_figure" id="S3.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="430" id="S3.F5.g1" src="x6.png" width="830"/>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5: </span>Why do some tasks or metrics get better or worse decision accuracy? At 150M with <span class="ltx_text ltx_font_smallcaps" id="S3.F5.2.1">Correct Prob</span>, tasks like HellaSwag succeed with low run-to-run variance, and tasks like SocialIQA widely spread the performance assigned to different pretraining data.</figcaption>
</figure>
</section>
<section class="ltx_subsection" id="S3.SS4">
<h3 class="ltx_title ltx_title_subsection">
<span class="ltx_tag ltx_tag_subsection">3.4 </span>How can we make evaluation benchmarks more predictable?</h3>
<div class="ltx_para ltx_noindent" id="S3.SS4.p1">
<svg class="ltx_picture" height="85.73" id="S3.SS4.p1.pic1" overflow="visible" version="1.1" width="600"><g fill="#000000" stroke="#000000" stroke-width="0.4pt" transform="translate(0,85.73) matrix(1 0 0 -1 0 0)"><g fill="#FFFF80" fill-opacity="1.0"><path d="M 0 0 L 0 85.73 L 600 85.73 L 600 0 Z" style="stroke:none"></path></g><g fill="#FFFFE6" fill-opacity="1.0"><path d="M 0 0 L 0 85.73 L 600 85.73 L 600 0 Z" style="stroke:none"></path></g><g fill-opacity="1.0" transform="matrix(1.0 0.0 0.0 1.0 19.68 11.81)"><foreignobject color="#000000" height="62.11" overflow="visible" transform="matrix(1 0 0 -1 0 16.6)" width="560.63">
<span class="ltx_inline-block ltx_minipage ltx_align_bottom" id="S3.SS4.p1.pic1.1.1.1.1.1" style="width:405.2pt;">
<span class="ltx_p" id="S3.SS4.p1.pic1.1.1.1.1.1.1"><span class="ltx_text ltx_font_italic" id="S3.SS4.p1.pic1.1.1.1.1.1.1.1">The decision accuracy on a task is driven in part by low run-to-run variance and a wide spread of performance values for different data recipes. Using <span class="ltx_text ltx_font_smallcaps" id="S3.SS4.p1.pic1.1.1.1.1.1.1.1.1">Correct Prob</span> sees wider spreads or reduced noise for many tasks. Using this metric enables predicting rankings for code tasks that are too hard for accuracy metrics at small scales.</span></span>
</span></foreignobject></g></g></svg>
</div>
<div class="ltx_para ltx_noindent" id="S3.SS4.p2">
<p class="ltx_p" id="S3.SS4.p2.1">What underlies differences in decision accuracy when benchmarks and metrics change? The evaluation must separate pairs of data recipes by an amount greater than combined noise from run-to-run variance of each of the pair’s runs. In Figure <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.F5" title="Figure 5 ‣ 3.3 What proxy metrics give better signal for predictions at small scale? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">5</span></a>, we plot tasks with a given metric using fully trained 150M models over these two characteristics: 1) noise—the standard deviation over 3 random seed runs averaged over all recipes, and 2) spread—the standard deviation among the mean performance of the different data recipes. Each point also shows the decision accuracy. We see that some highly predictable tasks (e.g., MMLU) are characterized by having low run-to-run noise, while others (e.g., ARC Easy) widely spread the different data recipes. We also see that improvements from using <span class="ltx_text ltx_font_smallcaps" id="S3.SS4.p2.1.1">Correct Prob</span> often align with improvements in one of these two characteristics.</p>
</div>
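The two characteristics, noise and spread, can be computed as follows (a minimal sketch; the sample-variance `ddof` choice is our own assumption):

```python
import numpy as np

def noise_and_spread(scores):
    """scores: shape (n_recipes, n_seeds) array of benchmark results.
    noise  = std over seeds for each recipe, averaged across recipes
    spread = std across recipes of each recipe's mean performance"""
    scores = np.asarray(scores, dtype=float)
    noise = scores.std(axis=1, ddof=1).mean()
    spread = scores.mean(axis=1).std(ddof=1)
    return noise, spread
```

A task is easy to predict when spread is large relative to noise, since recipe pairs are then separated by more than their combined run-to-run variation.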
<figure class="ltx_figure" id="S3.F6"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="256" id="S3.F6.g1" src="x7.png" width="830"/>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 6: </span>Code tasks such as HumanEval and MBPP go from trivial decision accuracy to largely predictable when using continuous <span class="ltx_text ltx_font_smallcaps" id="S3.F6.3.1">Correct Prob</span> instead of discrete <span class="ltx_text ltx_font_smallcaps" id="S3.F6.4.2">Accuracy</span>. Meanwhile, common math tasks remain near trivial decision accuracy regardless of metric.</figcaption>
</figure>
<div class="ltx_para ltx_noindent" id="S3.SS4.p3">
<p class="ltx_p" id="S3.SS4.p3.1">As a practical application of these insights, we demonstrate that a change of proxy metric makes predictable two code tasks <cite class="ltx_cite ltx_citemacro_citep">(Austin et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib1" title="">2021</a>; Chen et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib7" title="">2021</a>)</cite> that are otherwise too challenging for our small models. Figure <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.F6" title="Figure 6 ‣ 3.4 How can we make evaluation benchmarks more predictable? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">6</span></a> shows how decision accuracy goes from trivial to ∼80% when using <span class="ltx_text ltx_font_smallcaps" id="S3.SS4.p3.1.1">Correct Prob</span>. The switch of metric allows small models to get above the noise floor for these tasks, while still predicting large-scale accuracy metrics. Notably, two math benchmarks <cite class="ltx_cite ltx_citemacro_citep">(Lewkowycz et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib23" title="">2022</a>; Cobbe et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib11" title="">2021</a>)</cite> do not see such a benefit. They do, however, give decision accuracy above 80% if we switch the <span class="ltx_text ltx_font_italic" id="S3.SS4.p3.1.2">target metric</span> to <span class="ltx_text ltx_font_smallcaps" id="S3.SS4.p3.1.3">Correct Prob</span>, raising a question for future work to explore whether changing the target metric can be justified.</p>
</div>
</section>
</section>
<section class="ltx_section" id="S4">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">4 </span>Related Work</h2>
<section class="ltx_paragraph" id="S4.SS0.SSS0.Px1">
<h4 class="ltx_title ltx_title_paragraph">Prediction</h4>
<div class="ltx_para ltx_noindent" id="S4.SS0.SSS0.Px1.p1">
<p class="ltx_p" id="S4.SS0.SSS0.Px1.p1.1">Much work studies scaling behavior in language models. Initially this focused on predicting LM loss from scale as determined by parameter count and tokens trained <cite class="ltx_cite ltx_citemacro_citep">(Kaplan et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib21" title="">2020</a>; Hoffmann et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib19" title="">2022</a>)</cite>. Special consideration is also given to the case of data constrained scaling <cite class="ltx_cite ltx_citemacro_citep">(Muennighoff et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib27" title="">2023</a>; Goyal et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib15" title="">2024</a>)</cite>.
Unlike predicting loss, predicting downstream performance from scale is generally harder <cite class="ltx_cite ltx_citemacro_citep">(Schaeffer et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib37" title="">2024</a>)</cite>. However, recent work has demonstrated that it can be done with a two-step prediction that chains together predictions from scale to loss and from loss to downstream performance
<cite class="ltx_cite ltx_citemacro_citep">(Gadre et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib14" title="">2024</a>; Bhagia et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>; Dubey et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib13" title="">2024</a>)</cite>, sometimes using training loss <cite class="ltx_cite ltx_citemacro_citep">(Du et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib12" title="">2024</a>)</cite> or transferring losses from different data recipes <cite class="ltx_cite ltx_citemacro_citep">(Brandfonbrener et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib6" title="">2024</a>; Ruan et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib33" title="">2024</a>)</cite>. The one line of work targeting pretraining data considers the special case of deciding mixing proportions of several data sources optimized through scaling laws <cite class="ltx_cite ltx_citemacro_citep">(Kang et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib20" title="">2024</a>; Ye et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib40" title="">2024</a>)</cite>. Most relevant to our work, <cite class="ltx_cite ltx_citemacro_citet">Choshen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib8" title="">2024</a>)</cite> consider practical methods for reducing scaling law prediction error, such as how much compute to use or whether to include intermediate checkpoints. Orthogonally to these findings, we propose a way to assess the accuracy of decisions made with such predictions.</p>
</div>
</section>
<section class="ltx_paragraph" id="S4.SS0.SSS0.Px2">
<h4 class="ltx_title ltx_title_paragraph">Suites over Data Differences</h4>
<div class="ltx_para ltx_noindent" id="S4.SS0.SSS0.Px2.p1">
<p class="ltx_p" id="S4.SS0.SSS0.Px2.p1.4"><span class="ltx_text ltx_font_smallcaps" id="S4.SS0.SSS0.Px2.p1.4.1">DataDecide</span> follows in the footsteps of the Pythia Suite <cite class="ltx_cite ltx_citemacro_citep">(Biderman et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib4" title="">2023</a>)</cite> which was the first to offer a controlled comparison of 2 data recipes, using compute scales up to <math alttext="2\times 10^{22}" class="ltx_Math" display="inline" id="S4.SS0.SSS0.Px2.p1.1.m1.1"><semantics id="S4.SS0.SSS0.Px2.p1.1.m1.1a"><mrow id="S4.SS0.SSS0.Px2.p1.1.m1.1.1" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.cmml"><mn id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.2" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.2.cmml">2</mn><mo id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.1" lspace="0.222em" rspace="0.222em" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.1.cmml">×</mo><msup id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.cmml"><mn id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.2" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.2.cmml">10</mn><mn id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.3" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.3.cmml">22</mn></msup></mrow><annotation-xml encoding="MathML-Content" id="S4.SS0.SSS0.Px2.p1.1.m1.1b"><apply id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.cmml" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1"><times id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.1.cmml" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.1"></times><cn id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.2.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.2">2</cn><apply id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.cmml" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.1.cmml" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3">superscript</csymbol><cn id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.2.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.2">10</cn><cn id="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.3.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.1.m1.1.1.3.3">22</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S4.SS0.SSS0.Px2.p1.1.m1.1c">2\times 10^{22}</annotation><annotation encoding="application/x-llamapun" id="S4.SS0.SSS0.Px2.p1.1.m1.1d">2 × 10 start_POSTSUPERSCRIPT 22 end_POSTSUPERSCRIPT</annotation></semantics></math> FLOPs.
Subsequent suites have offered 6 data recipes at <math alttext="9\times 10^{20}" class="ltx_Math" display="inline" id="S4.SS0.SSS0.Px2.p1.2.m2.1"><semantics id="S4.SS0.SSS0.Px2.p1.2.m2.1a"><mrow id="S4.SS0.SSS0.Px2.p1.2.m2.1.1" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.cmml"><mn id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.2" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.2.cmml">9</mn><mo id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.1" lspace="0.222em" rspace="0.222em" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.1.cmml">×</mo><msup id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.cmml"><mn id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.2" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.2.cmml">10</mn><mn id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.3" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.3.cmml">20</mn></msup></mrow><annotation-xml encoding="MathML-Content" id="S4.SS0.SSS0.Px2.p1.2.m2.1b"><apply id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.cmml" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1"><times id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.1.cmml" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.1"></times><cn id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.2.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.2">9</cn><apply id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.cmml" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3"><csymbol cd="ambiguous" id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.1.cmml" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3">superscript</csymbol><cn id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.2.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.2">10</cn><cn id="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.3.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.2.m2.1.1.3.3">20</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS0.SSS0.Px2.p1.2.m2.1c">9\times 10^{20}</annotation><annotation encoding="application/x-llamapun" id="S4.SS0.SSS0.Px2.p1.2.m2.1d">9 × 10 start_POSTSUPERSCRIPT 20 end_POSTSUPERSCRIPT</annotation></semantics></math> scale <cite class="ltx_cite ltx_citemacro_citep">(Magnusson et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib25" title="">2024</a>)</cite> and 6 data 
recipes over a range of scales up to <math alttext="10^{21}" class="ltx_Math" display="inline" id="S4.SS0.SSS0.Px2.p1.3.m3.1"><semantics id="S4.SS0.SSS0.Px2.p1.3.m3.1a"><msup id="S4.SS0.SSS0.Px2.p1.3.m3.1.1" xref="S4.SS0.SSS0.Px2.p1.3.m3.1.1.cmml"><mn id="S4.SS0.SSS0.Px2.p1.3.m3.1.1.2" xref="S4.SS0.SSS0.Px2.p1.3.m3.1.1.2.cmml">10</mn><mn id="S4.SS0.SSS0.Px2.p1.3.m3.1.1.3" xref="S4.SS0.SSS0.Px2.p1.3.m3.1.1.3.cmml">21</mn></msup><annotation-xml encoding="MathML-Content" id="S4.SS0.SSS0.Px2.p1.3.m3.1b"><apply id="S4.SS0.SSS0.Px2.p1.3.m3.1.1.cmml" xref="S4.SS0.SSS0.Px2.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S4.SS0.SSS0.Px2.p1.3.m3.1.1.1.cmml" xref="S4.SS0.SSS0.Px2.p1.3.m3.1.1">superscript</csymbol><cn id="S4.SS0.SSS0.Px2.p1.3.m3.1.1.2.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.3.m3.1.1.2">10</cn><cn id="S4.SS0.SSS0.Px2.p1.3.m3.1.1.3.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.3.m3.1.1.3">21</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS0.SSS0.Px2.p1.3.m3.1c">10^{21}</annotation><annotation encoding="application/x-llamapun" id="S4.SS0.SSS0.Px2.p1.3.m3.1d">10 start_POSTSUPERSCRIPT 21 end_POSTSUPERSCRIPT</annotation></semantics></math> <cite class="ltx_cite ltx_citemacro_citep">(Brandfonbrener et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib6" title="">2024</a>)</cite>. 
Our <span class="ltx_text ltx_font_smallcaps" id="S4.SS0.SSS0.Px2.p1.4.2">DataDecide</span> offers a range of 14 scales up to <math alttext="7\times 10^{20}" class="ltx_Math" display="inline" id="S4.SS0.SSS0.Px2.p1.4.m4.1"><semantics id="S4.SS0.SSS0.Px2.p1.4.m4.1a"><mrow id="S4.SS0.SSS0.Px2.p1.4.m4.1.1" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.cmml"><mn id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.2" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.2.cmml">7</mn><mo id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.1" lspace="0.222em" rspace="0.222em" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.1.cmml">×</mo><msup id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.cmml"><mn id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.2" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.2.cmml">10</mn><mn id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.3" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.3.cmml">20</mn></msup></mrow><annotation-xml encoding="MathML-Content" id="S4.SS0.SSS0.Px2.p1.4.m4.1b"><apply id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.cmml" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1"><times id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.1.cmml" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.1"></times><cn id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.2.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.2">7</cn><apply id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.cmml" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3"><csymbol cd="ambiguous" id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.1.cmml" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3">superscript</csymbol><cn id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.2.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.2">10</cn><cn id="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.3.cmml" type="integer" xref="S4.SS0.SSS0.Px2.p1.4.m4.1.1.3.3">20</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS0.SSS0.Px2.p1.4.m4.1c">7\times 10^{20}</annotation><annotation encoding="application/x-llamapun" id="S4.SS0.SSS0.Px2.p1.4.m4.1d">7 × 10 start_POSTSUPERSCRIPT 20 end_POSTSUPERSCRIPT</annotation></semantics></math> FLOPs, while including an order of magnitude more fine-grained data differences.
Meanwhile, DCLM also makes extensive use of ranking single-scale experiments to drive improvement in data recipes <cite class="ltx_cite ltx_citemacro_citep">(Li et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib24" title="">2024</a>)</cite>. They release their best data and a model trained on it, but do not release the models from their decision-making experiments and do not search over multiple recipes at their largest scale. Whereas their goal is to propose a single best recipe, our <span class="ltx_text ltx_font_smallcaps" id="S4.SS0.SSS0.Px2.p1.4.3">DataDecide</span> enables assessing whether a decision-making method truly finds the best among proposed recipes.</p>
</div>
</section>
</section>
<section class="ltx_section" id="S5">
<h2 class="ltx_title ltx_title_section">
<span class="ltx_tag ltx_tag_section">5 </span>Limitations</h2>
<div class="ltx_para ltx_noindent" id="S5.p1">
<p class="ltx_p" id="S5.p1.1">The scope of our work is limited to a single ratio of tokens to parameters: 100, i.e., 5<math alttext="\times" class="ltx_Math" display="inline" id="S5.p1.1.m1.1"><semantics id="S5.p1.1.m1.1a"><mo id="S5.p1.1.m1.1.1" xref="S5.p1.1.m1.1.1.cmml">×</mo><annotation-xml encoding="MathML-Content" id="S5.p1.1.m1.1b"><times id="S5.p1.1.m1.1.1.cmml" xref="S5.p1.1.m1.1.1"></times></annotation-xml><annotation encoding="application/x-tex" id="S5.p1.1.m1.1c">\times</annotation><annotation encoding="application/x-llamapun" id="S5.p1.1.m1.1d">×</annotation></semantics></math> the “Chinchilla” optimal ratio <cite class="ltx_cite ltx_citemacro_citep">(Hoffmann et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib19" title="">2022</a>)</cite>. We believe this captures the typical case, as most models now favor overtraining to save compute at inference time.
Due to compute limitations and the need for a standardized set of model configurations over a long period during which compute became available for pretraining, we opt for 14 specific configurations at the 4M–1B parameter scale. While observations across more configurations would always be better, this must be traded off against exploring the other dimensions of data recipes and random-seed reruns. Likewise, while our 25 data recipes are an order of magnitude more than previous suites offer, it is always possible that findings across these will not be representative of future data recipes.
In our evaluations we focus on multiple choice tasks with a “cloze” formulation as we find these to be a good fit for our range of scales. Using <span class="ltx_text ltx_font_smallcaps" id="S5.p1.1.1">DataDecide</span>, new evaluations can be assessed easily by others without any additional pretraining.</p>
</div>
</section>
<section class="ltx_section" id="Sx1">
<h2 class="ltx_title ltx_title_section">Acknowledgments</h2>
<div class="ltx_para ltx_noindent" id="Sx1.p1">
<p class="ltx_p" id="Sx1.p1.1">We would like to thank Dave Wadden, Kyle Lo, Valentin Hofmann, and Hannaneh Hajishirzi for fruitful conversations. This material is based upon work supported by the U.S. National Science Foundation under Grant No. 2313998. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the U.S. National Science Foundation. IM is supported by the NSF CSGrad4US Fellowship. PWK is supported by the Singapore National Research Foundation and the National AI Group in the Singapore Ministry of Digital Development and Information under the AI Visiting Professorship Programme (award number AIVP-2024-001) and by the AI2050 program at Schmidt Sciences.</p>
</div>
</section>
<section class="ltx_section" id="Sx2">
<h2 class="ltx_title ltx_title_section">Ethics Statement</h2>
<div class="ltx_para ltx_noindent" id="Sx2.p1">
<p class="ltx_p" id="Sx2.p1.1">Training large language models is computationally expensive, especially when investigating thoroughly over dimensions of pretraining data composition, model scale, random initialization, and data order. The pretraining experiments in our <span class="ltx_text ltx_font_smallcaps" id="Sx2.p1.1.1">DataDecide</span> required approximately 820K H100 GPU hours. We share the benefit of this cost through releasing all of our models, data, and evaluations so that others will not have to repeat this expenditure. Moreover, our findings can guide efficient and cost-effective model development through the application of decision making with small-scale experiments. While <span class="ltx_text ltx_font_smallcaps" id="Sx2.p1.1.2">DataDecide</span> does not present direct ethical concerns beyond opportunity cost, we acknowledge that decisions about pretraining data heavily impact downstream model behavior. We encourage future research to explore potential biases in data selection methods and their implications for models deployed in the real world.</p>
</div>
</section>
<section class="ltx_bibliography" id="bib">
<h2 class="ltx_title ltx_title_bibliography">References</h2>
<ul class="ltx_biblist">
<li class="ltx_bibitem" id="bib.bib1">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Austin et al. (2021)</span>
<span class="ltx_bibblock">
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al.

</span>
<span class="ltx_bibblock">Program synthesis with large language models.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib1.1.1">arXiv preprint arXiv:2108.07732</em>, 2021.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib2">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ben Allal et al. (2024)</span>
<span class="ltx_bibblock">
Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra.

</span>
<span class="ltx_bibblock">SmolLM-Corpus, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus" title="">https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib3">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bhagia et al. (2024)</span>
<span class="ltx_bibblock">
Akshita Bhagia, Jiacheng Liu, Alexander Wettig, David Heineman, Oyvind Tafjord, Ananya Harsh Jha, Luca Soldaini, Noah A. Smith, Dirk Groeneveld, Pang Wei Koh, Jesse Dodge, and Hannaneh Hajishirzi.

</span>
<span class="ltx_bibblock">Establishing task scaling laws via compute-efficient model ladders, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2412.04403" title="">https://arxiv.org/abs/2412.04403</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib4">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Biderman et al. (2023)</span>
<span class="ltx_bibblock">
Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal.

</span>
<span class="ltx_bibblock">Pythia: A suite for analyzing large language models across training and scaling, 2023.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2304.01373" title="">https://arxiv.org/abs/2304.01373</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib5">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bisk et al. (2020)</span>
<span class="ltx_bibblock">
Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi.

</span>
<span class="ltx_bibblock">PIQA: Reasoning about physical commonsense in natural language.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib5.1.1">Proceedings of the AAAI Conference on Artificial Intelligence</em>, 34(05):7432–7439, Apr. 2020.

</span>
<span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.1609/aaai.v34i05.6239</span>.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://ojs.aaai.org/index.php/AAAI/article/view/6239" title="">https://ojs.aaai.org/index.php/AAAI/article/view/6239</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib6">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Brandfonbrener et al. (2024)</span>
<span class="ltx_bibblock">
David Brandfonbrener, Nikhil Anand, Nikhil Vyas, Eran Malach, and Sham Kakade.

</span>
<span class="ltx_bibblock">Loss-to-loss prediction: Scaling laws for all datasets, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2411.12925" title="">https://arxiv.org/abs/2411.12925</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib7">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2021)</span>
<span class="ltx_bibblock">
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba.

</span>
<span class="ltx_bibblock">Evaluating large language models trained on code, 2021.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2107.03374" title="">https://arxiv.org/abs/2107.03374</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib8">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Choshen et al. (2024)</span>
<span class="ltx_bibblock">
Leshem Choshen, Yang Zhang, and Jacob Andreas.

</span>
<span class="ltx_bibblock">A hitchhiker’s guide to scaling law estimation, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2410.11840" title="">https://arxiv.org/abs/2410.11840</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib9">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Clark et al. (2019)</span>
<span class="ltx_bibblock">
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova.

</span>
<span class="ltx_bibblock">BoolQ: Exploring the surprising difficulty of natural yes/no questions.

</span>
<span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">Proceedings of NAACL-HLT</em>, pp. 2924–2936, Minneapolis, Minnesota, June 2019.

</span>
<span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.18653/v1/N19-1300</span>.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/N19-1300" title="">https://aclanthology.org/N19-1300</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib10">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Clark et al. (2018)</span>
<span class="ltx_bibblock">
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord.

</span>
<span class="ltx_bibblock">Think you have solved question answering? Try ARC, the AI2 reasoning challenge.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib10.1.1">arXiv preprint arXiv:1803.05457</em>, 2018.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="http://arxiv.org/abs/1803.05457" title="">http://arxiv.org/abs/1803.05457</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib11">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cobbe et al. (2021)</span>
<span class="ltx_bibblock">
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman.

</span>
<span class="ltx_bibblock">Training verifiers to solve math word problems.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib11.1.1">arXiv preprint arXiv:2110.14168</em>, 2021.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib12">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Du et al. (2024)</span>
<span class="ltx_bibblock">
Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang.

</span>
<span class="ltx_bibblock">Understanding emergent abilities of language models from the loss perspective.

</span>
<span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib12.1.1">The Thirty-eighth Annual Conference on Neural Information Processing Systems</em>, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://openreview.net/forum?id=35DAviqMFo" title="">https://openreview.net/forum?id=35DAviqMFo</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib13">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Dubey et al. (2024)</span>
<span class="ltx_bibblock">
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony S. Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Cantón Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab A. AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriele Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guanglong Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu,
Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Laurens Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Ju-Qing Jia, Kalyan Vasuden Alwala, K. Upasani, Kate Plawiak, Keqian Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Babu Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melissa Hall, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay
Bashlykov, Nikolay Bogoychev, Niladri S. Chatterji, Olivier Duchenne, Onur cCelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasić, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Chandra Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez,
Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yiqian Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zhengxu Yan, Zhengxing Chen, Zoe Papakipos, Aaditya K. Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adi Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Ben Leonhardi, Po-Yao (Bernie) Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram
Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Shang-Wen Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzm’an, Frank J. Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory G. Sizov, Guangyi Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Han Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Igor Molybog, Igor
Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kaixing(Kai) Wu, U KamHou, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, A Lavender, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang,
Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollár, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve
Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sung-Bae Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Andrei Poenaru, Vlad T. Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xia Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao.

</span>
<span class="ltx_bibblock">The llama 3 herd of models.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib13.1.1">arXiv preprint arXiv:2407.21783</em>, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://api.semanticscholar.org/CorpusID:271571434" title="">https://api.semanticscholar.org/CorpusID:271571434</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib14">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gadre et al. (2024)</span>
<span class="ltx_bibblock">
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, and Ludwig Schmidt.

</span>
<span class="ltx_bibblock">Language models scale reliably with over-training and on downstream tasks, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2403.08540" title="">https://arxiv.org/abs/2403.08540</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib15">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Goyal et al. (2024)</span>
<span class="ltx_bibblock">
Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, and J. Zico Kolter.

</span>
<span class="ltx_bibblock">Scaling laws for data filtering - data curation cannot be compute agnostic.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib15.1.1">CoRR</em>, abs/2404.07177, 2024.

</span>
<span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.48550/ARXIV.2404.07177</span>.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://doi.org/10.48550/arXiv.2404.07177" title="">https://doi.org/10.48550/arXiv.2404.07177</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib16">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Groeneveld et al. (2024)</span>
<span class="ltx_bibblock">
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi.

</span>
<span class="ltx_bibblock">OLMo: Accelerating the science of language models, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2402.00838" title="">https://arxiv.org/abs/2402.00838</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib17">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gu et al. (2024)</span>
<span class="ltx_bibblock">
Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, and Hannaneh Hajishirzi.

</span>
<span class="ltx_bibblock">OLMES: A standard for language model evaluations, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2406.08446" title="">https://arxiv.org/abs/2406.08446</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib18">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hendrycks et al. (2021)</span>
<span class="ltx_bibblock">
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.

</span>
<span class="ltx_bibblock">Measuring massive multitask language understanding.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib18.1.1">Proceedings of the International Conference on Learning Representations (ICLR)</em>, 2021.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib19">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hoffmann et al. (2022)</span>
<span class="ltx_bibblock">
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre.

</span>
<span class="ltx_bibblock">Training compute-optimal large language models, 2022.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2203.15556" title="">https://arxiv.org/abs/2203.15556</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib20">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kang et al. (2024)</span>
<span class="ltx_bibblock">
Feiyang Kang, Yifan Sun, Bingbing Wen, Si Chen, Dawn Song, Rafid Mahmood, and Ruoxi Jia.

</span>
<span class="ltx_bibblock">AutoScale: Automatic prediction of compute-optimal data composition for training LLMs.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib20.1.1">ArXiv</em>, abs/2407.20177, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://api.semanticscholar.org/CorpusID:271533897" title="">https://api.semanticscholar.org/CorpusID:271533897</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib21">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kaplan et al. (2020)</span>
<span class="ltx_bibblock">
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei.

</span>
<span class="ltx_bibblock">Scaling laws for neural language models, 2020.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2001.08361" title="">https://arxiv.org/abs/2001.08361</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib22">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lambert et al. (2024)</span>
<span class="ltx_bibblock">
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi.

</span>
<span class="ltx_bibblock">Tülu 3: Pushing frontiers in open language model post-training.

</span>
<span class="ltx_bibblock">2024.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib23">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lewkowycz et al. (2022)</span>
<span class="ltx_bibblock">
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra.

</span>
<span class="ltx_bibblock">Solving quantitative reasoning problems with language models, 2022.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2206.14858" title="">https://arxiv.org/abs/2206.14858</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib24">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. (2024)</span>
<span class="ltx_bibblock">
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar.

</span>
<span class="ltx_bibblock">DataComp-LM: In search of the next generation of training sets for language models, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2406.11794" title="">https://arxiv.org/abs/2406.11794</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib25">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Magnusson et al. (2024)</span>
<span class="ltx_bibblock">
Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, and Jesse Dodge.

</span>
<span class="ltx_bibblock">Paloma: A benchmark for evaluating language model fit, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2312.10523" title="">https://arxiv.org/abs/2312.10523</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib26">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Mihaylov et al. (2018)</span>
<span class="ltx_bibblock">
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal.

</span>
<span class="ltx_bibblock">Can a suit of armor conduct electricity? A new dataset for open book question answering.

</span>
<span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</em>, pp.  2381–2391, Brussels, Belgium, October–November 2018.

</span>
<span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.18653/v1/D18-1260</span>.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/D18-1260" title="">https://aclanthology.org/D18-1260</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib27">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Muennighoff et al. (2023)</span>
<span class="ltx_bibblock">
Niklas Muennighoff, Alexander Rush, Boaz Barak, Teven Le Scao, Nouamane Tazi, Aleksandra Piktus, Sampo Pyysalo, Thomas Wolf, and Colin A Raffel.

</span>
<span class="ltx_bibblock">Scaling data-constrained language models.

</span>
<span class="ltx_bibblock">In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), <em class="ltx_emph ltx_font_italic" id="bib.bib27.1.1">Advances in Neural Information Processing Systems</em>, volume 36, pp.  50358–50376. Curran Associates, Inc., 2023.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://proceedings.neurips.cc/paper_files/paper/2023/file/9d89448b63ce1e2e8dc7af72c984c196-Paper-Conference.pdf" title="">https://proceedings.neurips.cc/paper_files/paper/2023/file/9d89448b63ce1e2e8dc7af72c984c196-Paper-Conference.pdf</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib28">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">OLMo et al. (2025)</span>
<span class="ltx_bibblock">
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, and Hannaneh Hajishirzi.

</span>
<span class="ltx_bibblock">2 OLMo 2 Furious, 2025.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2501.00656" title="">https://arxiv.org/abs/2501.00656</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib29">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Penedo et al. (2023)</span>
<span class="ltx_bibblock">
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra-Aimée Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay.

</span>
<span class="ltx_bibblock">The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib29.1.1">ArXiv</em>, abs/2306.01116, 2023.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://api.semanticscholar.org/CorpusID:259063761" title="">https://api.semanticscholar.org/CorpusID:259063761</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib30">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Penedo et al. (2024)</span>
<span class="ltx_bibblock">
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf.

</span>
<span class="ltx_bibblock">The FineWeb datasets: Decanting the web for the finest text data at scale, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2406.17557" title="">https://arxiv.org/abs/2406.17557</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib31">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Porian et al. (2024)</span>
<span class="ltx_bibblock">
Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, and Yair Carmon.

</span>
<span class="ltx_bibblock">Resolving discrepancies in compute-optimal scaling of language models.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib31.1.1">ArXiv</em>, abs/2406.19146, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://api.semanticscholar.org/CorpusID:270764838" title="">https://api.semanticscholar.org/CorpusID:270764838</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib32">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Raffel et al. (2019)</span>
<span class="ltx_bibblock">
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu.

</span>
<span class="ltx_bibblock">Exploring the limits of transfer learning with a unified text-to-text transformer.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib32.1.1">arXiv e-prints</em>, 2019.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib33">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ruan et al. (2024)</span>
<span class="ltx_bibblock">
Yangjun Ruan, Chris J. Maddison, and Tatsunori Hashimoto.

</span>
<span class="ltx_bibblock">Observational scaling laws and the predictability of language model performance.

</span>
<span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib33.1.1">The Thirty-eighth Annual Conference on Neural Information Processing Systems</em>, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://openreview.net/forum?id=On5WIN7xyD" title="">https://openreview.net/forum?id=On5WIN7xyD</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib34">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sakaguchi et al. (2020)</span>
<span class="ltx_bibblock">
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi.

</span>
<span class="ltx_bibblock">WinoGrande: An adversarial winograd schema challenge at scale.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib34.1.1">Proceedings of the AAAI Conference on Artificial Intelligence</em>, 34(05):8732–8740, Apr. 2020.

</span>
<span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.1609/aaai.v34i05.6399</span>.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://ojs.aaai.org/index.php/AAAI/article/view/6399" title="">https://ojs.aaai.org/index.php/AAAI/article/view/6399</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib35">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sap et al. (2019)</span>
<span class="ltx_bibblock">
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi.

</span>
<span class="ltx_bibblock">Social IQa: Commonsense reasoning about social interactions.

</span>
<span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</em>, pp.  4463–4473, Hong Kong, China, November 2019.

</span>
<span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.18653/v1/D19-1454</span>.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/D19-1454" title="">https://aclanthology.org/D19-1454</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib36">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Schaeffer et al. (2023)</span>
<span class="ltx_bibblock">
Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo.

</span>
<span class="ltx_bibblock">Are emergent abilities of large language models a mirage?, 2023.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://arxiv.org/abs/2304.15004" title="">https://arxiv.org/abs/2304.15004</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib37">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Schaeffer et al. (2024)</span>
<span class="ltx_bibblock">
Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, and Sanmi Koyejo.

</span>
<span class="ltx_bibblock">Why has predicting downstream capabilities of frontier AI models with scale remained elusive?

</span>
<span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib37.1.1">Trustworthy Multi-modal Foundation Models and AI Agents (TiFA)</em>, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://openreview.net/forum?id=AbHHrj9afB" title="">https://openreview.net/forum?id=AbHHrj9afB</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib38">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Soldaini et al. (2024)</span>
<span class="ltx_bibblock">
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo.

</span>
<span class="ltx_bibblock">Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">arXiv preprint</em>, 2024.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib39">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Talmor et al. (2019)</span>
<span class="ltx_bibblock">
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant.

</span>
<span class="ltx_bibblock">CommonsenseQA: A question answering challenge targeting commonsense knowledge.

</span>
<span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</em>, pp.  4149–4158, Minneapolis, Minnesota, June 2019.

</span>
<span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.18653/v1/N19-1421</span>.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/N19-1421" title="">https://aclanthology.org/N19-1421</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib40">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ye et al. (2024)</span>
<span class="ltx_bibblock">
Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, and Xipeng Qiu.

</span>
<span class="ltx_bibblock">Data mixing laws: Optimizing data mixtures by predicting language modeling performance.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib40.1.1">ArXiv</em>, abs/2403.16952, 2024.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://api.semanticscholar.org/CorpusID:268681464" title="">https://api.semanticscholar.org/CorpusID:268681464</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib41">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zellers et al. (2019)</span>
<span class="ltx_bibblock">
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi.

</span>
<span class="ltx_bibblock">HellaSwag: Can a machine really finish your sentence?

</span>
<span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</em>, pp.  4791–4800, Florence, Italy, July 2019.

</span>
<span class="ltx_bibblock">doi: <span class="ltx_ref ltx_nolink ltx_Url ltx_ref_self">10.18653/v1/P19-1472</span>.

</span>
<span class="ltx_bibblock">URL <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://aclanthology.org/P19-1472" title="">https://aclanthology.org/P19-1472</a>.

</span>
</li>
<li class="ltx_bibitem" id="bib.bib42">
<span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhou et al. (2024)</span>
<span class="ltx_bibblock">
Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, and Pengfei Liu.

</span>
<span class="ltx_bibblock">Programming every example: Lifting pre-training data quality like experts at scale.

</span>
<span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib42.1.1">arXiv preprint arXiv:2409.17115</em>, 2024.

</span>
</li>
</ul>
</section>
<section class="ltx_appendix" id="A1">
<h2 class="ltx_title ltx_title_appendix">
<span class="ltx_tag ltx_tag_appendix">Appendix A </span>Hyperparameters</h2>
<figure class="ltx_table" id="A1.T2">
<table class="ltx_tabular ltx_centering ltx_align_middle" id="A1.T2.3">
<tr class="ltx_tr" id="A1.T2.3.1">
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="A1.T2.3.1.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.1.1.1">
<span class="ltx_p" id="A1.T2.3.1.1.1.1" style="width:28.5pt;">Model name</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="A1.T2.3.1.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.1.2.1">
<span class="ltx_p" id="A1.T2.3.1.2.1.1" style="width:28.5pt;">Batch size</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="A1.T2.3.1.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.1.3.1">
<span class="ltx_p" id="A1.T2.3.1.3.1.1" style="width:28.5pt;">Hidden dim.</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="A1.T2.3.1.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.1.4.1">
<span class="ltx_p" id="A1.T2.3.1.4.1.1" style="width:35.6pt;">LR</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="A1.T2.3.1.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.1.5.1">
<span class="ltx_p" id="A1.T2.3.1.5.1.1" style="width:28.5pt;">Model size</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="A1.T2.3.1.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.1.6.1">
<span class="ltx_p" id="A1.T2.3.1.6.1.1" style="width:28.5pt;">Heads</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="A1.T2.3.1.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.1.7.1">
<span class="ltx_p" id="A1.T2.3.1.7.1.1" style="width:28.5pt;">Layers</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="A1.T2.3.1.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.1.8.1">
<span class="ltx_p" id="A1.T2.3.1.8.1.1" style="width:28.5pt;">Training steps</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_tt" id="A1.T2.3.1.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.1.9.1">
<span class="ltx_p" id="A1.T2.3.1.9.1.1" style="width:28.5pt;">Tokens trained</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.2">
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T2.3.2.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.2.1.1">
<span class="ltx_p" id="A1.T2.3.2.1.1.1" style="width:28.5pt;">4M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T2.3.2.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.2.2.1">
<span class="ltx_p" id="A1.T2.3.2.2.1.1" style="width:28.5pt;">32</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T2.3.2.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.2.3.1">
<span class="ltx_p" id="A1.T2.3.2.3.1.1" style="width:28.5pt;">64</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T2.3.2.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.2.4.1">
<span class="ltx_p" id="A1.T2.3.2.4.1.1" style="width:35.6pt;">1.4e-02</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T2.3.2.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.2.5.1">
<span class="ltx_p" id="A1.T2.3.2.5.1.1" style="width:28.5pt;">3.7M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T2.3.2.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.2.6.1">
<span class="ltx_p" id="A1.T2.3.2.6.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T2.3.2.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.2.7.1">
<span class="ltx_p" id="A1.T2.3.2.7.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T2.3.2.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.2.8.1">
<span class="ltx_p" id="A1.T2.3.2.8.1.1" style="width:28.5pt;">5,725</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_t" id="A1.T2.3.2.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.2.9.1">
<span class="ltx_p" id="A1.T2.3.2.9.1.1" style="width:28.5pt;">0.4B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.3">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.3.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.3.1.1">
<span class="ltx_p" id="A1.T2.3.3.1.1.1" style="width:28.5pt;">6M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.3.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.3.2.1">
<span class="ltx_p" id="A1.T2.3.3.2.1.1" style="width:28.5pt;">32</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.3.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.3.3.1">
<span class="ltx_p" id="A1.T2.3.3.3.1.1" style="width:28.5pt;">96</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.3.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.3.4.1">
<span class="ltx_p" id="A1.T2.3.3.4.1.1" style="width:35.6pt;">1.2e-02</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.3.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.3.5.1">
<span class="ltx_p" id="A1.T2.3.3.5.1.1" style="width:28.5pt;">6.0M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.3.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.3.6.1">
<span class="ltx_p" id="A1.T2.3.3.6.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.3.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.3.7.1">
<span class="ltx_p" id="A1.T2.3.3.7.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.3.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.3.8.1">
<span class="ltx_p" id="A1.T2.3.3.8.1.1" style="width:28.5pt;">9,182</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.3.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.3.9.1">
<span class="ltx_p" id="A1.T2.3.3.9.1.1" style="width:28.5pt;">0.6B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.4">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.4.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.4.1.1">
<span class="ltx_p" id="A1.T2.3.4.1.1.1" style="width:28.5pt;">8M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.4.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.4.2.1">
<span class="ltx_p" id="A1.T2.3.4.2.1.1" style="width:28.5pt;">32</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.4.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.4.3.1">
<span class="ltx_p" id="A1.T2.3.4.3.1.1" style="width:28.5pt;">128</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.4.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.4.4.1">
<span class="ltx_p" id="A1.T2.3.4.4.1.1" style="width:35.6pt;">1.1e-02</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.4.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.4.5.1">
<span class="ltx_p" id="A1.T2.3.4.5.1.1" style="width:28.5pt;">8.5M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.4.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.4.6.1">
<span class="ltx_p" id="A1.T2.3.4.6.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.4.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.4.7.1">
<span class="ltx_p" id="A1.T2.3.4.7.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.4.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.4.8.1">
<span class="ltx_p" id="A1.T2.3.4.8.1.1" style="width:28.5pt;">13,039</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.4.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.4.9.1">
<span class="ltx_p" id="A1.T2.3.4.9.1.1" style="width:28.5pt;">0.9B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.5">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.5.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.5.1.1">
<span class="ltx_p" id="A1.T2.3.5.1.1.1" style="width:28.5pt;">10M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.5.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.5.2.1">
<span class="ltx_p" id="A1.T2.3.5.2.1.1" style="width:28.5pt;">32</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.5.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.5.3.1">
<span class="ltx_p" id="A1.T2.3.5.3.1.1" style="width:28.5pt;">144</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.5.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.5.4.1">
<span class="ltx_p" id="A1.T2.3.5.4.1.1" style="width:35.6pt;">1.0e-02</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.5.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.5.5.1">
<span class="ltx_p" id="A1.T2.3.5.5.1.1" style="width:28.5pt;">9.9M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.5.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.5.6.1">
<span class="ltx_p" id="A1.T2.3.5.6.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.5.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.5.7.1">
<span class="ltx_p" id="A1.T2.3.5.7.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.5.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.5.8.1">
<span class="ltx_p" id="A1.T2.3.5.8.1.1" style="width:28.5pt;">15,117</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.5.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.5.9.1">
<span class="ltx_p" id="A1.T2.3.5.9.1.1" style="width:28.5pt;">1.0B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.6">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.6.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.6.1.1">
<span class="ltx_p" id="A1.T2.3.6.1.1.1" style="width:28.5pt;">14M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.6.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.6.2.1">
<span class="ltx_p" id="A1.T2.3.6.2.1.1" style="width:28.5pt;">32</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.6.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.6.3.1">
<span class="ltx_p" id="A1.T2.3.6.3.1.1" style="width:28.5pt;">192</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.6.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.6.4.1">
<span class="ltx_p" id="A1.T2.3.6.4.1.1" style="width:35.6pt;">9.2e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.6.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.6.5.1">
<span class="ltx_p" id="A1.T2.3.6.5.1.1" style="width:28.5pt;">14.4M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.6.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.6.6.1">
<span class="ltx_p" id="A1.T2.3.6.6.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.6.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.6.7.1">
<span class="ltx_p" id="A1.T2.3.6.7.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.6.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.6.8.1">
<span class="ltx_p" id="A1.T2.3.6.8.1.1" style="width:28.5pt;">21,953</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.6.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.6.9.1">
<span class="ltx_p" id="A1.T2.3.6.9.1.1" style="width:28.5pt;">1.4B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.7">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.7.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.7.1.1">
<span class="ltx_p" id="A1.T2.3.7.1.1.1" style="width:28.5pt;">16M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.7.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.7.2.1">
<span class="ltx_p" id="A1.T2.3.7.2.1.1" style="width:28.5pt;">32</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.7.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.7.3.1">
<span class="ltx_p" id="A1.T2.3.7.3.1.1" style="width:28.5pt;">208</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.7.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.7.4.1">
<span class="ltx_p" id="A1.T2.3.7.4.1.1" style="width:35.6pt;">8.9e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.7.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.7.5.1">
<span class="ltx_p" id="A1.T2.3.7.5.1.1" style="width:28.5pt;">16.0M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.7.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.7.6.1">
<span class="ltx_p" id="A1.T2.3.7.6.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.7.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.7.7.1">
<span class="ltx_p" id="A1.T2.3.7.7.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.7.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.7.8.1">
<span class="ltx_p" id="A1.T2.3.7.8.1.1" style="width:28.5pt;">24,432</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.7.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.7.9.1">
<span class="ltx_p" id="A1.T2.3.7.9.1.1" style="width:28.5pt;">1.6B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.8">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.8.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.8.1.1">
<span class="ltx_p" id="A1.T2.3.8.1.1.1" style="width:28.5pt;">20M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.8.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.8.2.1">
<span class="ltx_p" id="A1.T2.3.8.2.1.1" style="width:28.5pt;">64</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.8.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.8.3.1">
<span class="ltx_p" id="A1.T2.3.8.3.1.1" style="width:28.5pt;">192</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.8.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.8.4.1">
<span class="ltx_p" id="A1.T2.3.8.4.1.1" style="width:35.6pt;">8.4e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.8.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.8.5.1">
<span class="ltx_p" id="A1.T2.3.8.5.1.1" style="width:28.5pt;">19.1M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.8.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.8.6.1">
<span class="ltx_p" id="A1.T2.3.8.6.1.1" style="width:28.5pt;">8</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.8.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.8.7.1">
<span class="ltx_p" id="A1.T2.3.8.7.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.8.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.8.8.1">
<span class="ltx_p" id="A1.T2.3.8.8.1.1" style="width:28.5pt;">14,584</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.8.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.8.9.1">
<span class="ltx_p" id="A1.T2.3.8.9.1.1" style="width:28.5pt;">1.9B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.9">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.9.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.9.1.1">
<span class="ltx_p" id="A1.T2.3.9.1.1.1" style="width:28.5pt;">60M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.9.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.9.2.1">
<span class="ltx_p" id="A1.T2.3.9.2.1.1" style="width:28.5pt;">96</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.9.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.9.3.1">
<span class="ltx_p" id="A1.T2.3.9.3.1.1" style="width:28.5pt;">384</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.9.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.9.4.1">
<span class="ltx_p" id="A1.T2.3.9.4.1.1" style="width:35.6pt;">5.8e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.9.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.9.5.1">
<span class="ltx_p" id="A1.T2.3.9.5.1.1" style="width:28.5pt;">57.1M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.9.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.9.6.1">
<span class="ltx_p" id="A1.T2.3.9.6.1.1" style="width:28.5pt;">12</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.9.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.9.7.1">
<span class="ltx_p" id="A1.T2.3.9.7.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.9.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.9.8.1">
<span class="ltx_p" id="A1.T2.3.9.8.1.1" style="width:28.5pt;">29,042</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.9.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.9.9.1">
<span class="ltx_p" id="A1.T2.3.9.9.1.1" style="width:28.5pt;">5.7B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.10">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.10.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.10.1.1">
<span class="ltx_p" id="A1.T2.3.10.1.1.1" style="width:28.5pt;">90M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.10.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.10.2.1">
<span class="ltx_p" id="A1.T2.3.10.2.1.1" style="width:28.5pt;">160</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.10.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.10.3.1">
<span class="ltx_p" id="A1.T2.3.10.3.1.1" style="width:28.5pt;">528</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.10.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.10.4.1">
<span class="ltx_p" id="A1.T2.3.10.4.1.1" style="width:35.6pt;">4.9e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.10.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.10.5.1">
<span class="ltx_p" id="A1.T2.3.10.5.1.1" style="width:28.5pt;">97.9M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.10.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.10.6.1">
<span class="ltx_p" id="A1.T2.3.10.6.1.1" style="width:28.5pt;">12</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.10.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.10.7.1">
<span class="ltx_p" id="A1.T2.3.10.7.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.10.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.10.8.1">
<span class="ltx_p" id="A1.T2.3.10.8.1.1" style="width:28.5pt;">29,901</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.10.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.10.9.1">
<span class="ltx_p" id="A1.T2.3.10.9.1.1" style="width:28.5pt;">9.8B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.11">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.11.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.11.1.1">
<span class="ltx_p" id="A1.T2.3.11.1.1.1" style="width:28.5pt;">150M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.11.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.11.2.1">
<span class="ltx_p" id="A1.T2.3.11.2.1.1" style="width:28.5pt;">192</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.11.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.11.3.1">
<span class="ltx_p" id="A1.T2.3.11.3.1.1" style="width:28.5pt;">768</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.11.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.11.4.1">
<span class="ltx_p" id="A1.T2.3.11.4.1.1" style="width:35.6pt;">4.2e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.11.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.11.5.1">
<span class="ltx_p" id="A1.T2.3.11.5.1.1" style="width:28.5pt;">151.9M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.11.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.11.6.1">
<span class="ltx_p" id="A1.T2.3.11.6.1.1" style="width:28.5pt;">12</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.11.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.11.7.1">
<span class="ltx_p" id="A1.T2.3.11.7.1.1" style="width:28.5pt;">12</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.11.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.11.8.1">
<span class="ltx_p" id="A1.T2.3.11.8.1.1" style="width:28.5pt;">38,157</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.11.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.11.9.1">
<span class="ltx_p" id="A1.T2.3.11.9.1.1" style="width:28.5pt;">15.0B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.12">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.12.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.12.1.1">
<span class="ltx_p" id="A1.T2.3.12.1.1.1" style="width:28.5pt;">300M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.12.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.12.2.1">
<span class="ltx_p" id="A1.T2.3.12.2.1.1" style="width:28.5pt;">320</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.12.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.12.3.1">
<span class="ltx_p" id="A1.T2.3.12.3.1.1" style="width:28.5pt;">1,024</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.12.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.12.4.1">
<span class="ltx_p" id="A1.T2.3.12.4.1.1" style="width:35.6pt;">3.3e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.12.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.12.5.1">
<span class="ltx_p" id="A1.T2.3.12.5.1.1" style="width:28.5pt;">320.0M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.12.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.12.6.1">
<span class="ltx_p" id="A1.T2.3.12.6.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.12.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.12.7.1">
<span class="ltx_p" id="A1.T2.3.12.7.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.12.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.12.8.1">
<span class="ltx_p" id="A1.T2.3.12.8.1.1" style="width:28.5pt;">45,787</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.12.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.12.9.1">
<span class="ltx_p" id="A1.T2.3.12.9.1.1" style="width:28.5pt;">30.0B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.13">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.13.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.13.1.1">
<span class="ltx_p" id="A1.T2.3.13.1.1.1" style="width:28.5pt;">530M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.13.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.13.2.1">
<span class="ltx_p" id="A1.T2.3.13.2.1.1" style="width:28.5pt;">448</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.13.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.13.3.1">
<span class="ltx_p" id="A1.T2.3.13.3.1.1" style="width:28.5pt;">1,344</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.13.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.13.4.1">
<span class="ltx_p" id="A1.T2.3.13.4.1.1" style="width:35.6pt;">2.8e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.13.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.13.5.1">
<span class="ltx_p" id="A1.T2.3.13.5.1.1" style="width:28.5pt;">530.1M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.13.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.13.6.1">
<span class="ltx_p" id="A1.T2.3.13.6.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.13.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.13.7.1">
<span class="ltx_p" id="A1.T2.3.13.7.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.13.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.13.8.1">
<span class="ltx_p" id="A1.T2.3.13.8.1.1" style="width:28.5pt;">57,786</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.13.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.13.9.1">
<span class="ltx_p" id="A1.T2.3.13.9.1.1" style="width:28.5pt;">53.0B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.14">
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.14.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.14.1.1">
<span class="ltx_p" id="A1.T2.3.14.1.1.1" style="width:28.5pt;">750M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.14.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.14.2.1">
<span class="ltx_p" id="A1.T2.3.14.2.1.1" style="width:28.5pt;">576</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.14.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.14.3.1">
<span class="ltx_p" id="A1.T2.3.14.3.1.1" style="width:28.5pt;">1,536</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.14.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.14.4.1">
<span class="ltx_p" id="A1.T2.3.14.4.1.1" style="width:35.6pt;">2.5e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.14.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.14.5.1">
<span class="ltx_p" id="A1.T2.3.14.5.1.1" style="width:28.5pt;">681.3M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.14.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.14.6.1">
<span class="ltx_p" id="A1.T2.3.14.6.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.14.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.14.7.1">
<span class="ltx_p" id="A1.T2.3.14.7.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.14.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.14.8.1">
<span class="ltx_p" id="A1.T2.3.14.8.1.1" style="width:28.5pt;">63,589</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top" id="A1.T2.3.14.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.14.9.1">
<span class="ltx_p" id="A1.T2.3.14.9.1.1" style="width:28.5pt;">75.0B</span>
</span>
</td>
</tr>
<tr class="ltx_tr" id="A1.T2.3.15">
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="A1.T2.3.15.1">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.15.1.1">
<span class="ltx_p" id="A1.T2.3.15.1.1.1" style="width:28.5pt;">1B</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="A1.T2.3.15.2">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.15.2.1">
<span class="ltx_p" id="A1.T2.3.15.2.1.1" style="width:28.5pt;">704</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="A1.T2.3.15.3">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.15.3.1">
<span class="ltx_p" id="A1.T2.3.15.3.1.1" style="width:28.5pt;">2,048</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="A1.T2.3.15.4">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.15.4.1">
<span class="ltx_p" id="A1.T2.3.15.4.1.1" style="width:35.6pt;">2.1e-03</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="A1.T2.3.15.5">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.15.5.1">
<span class="ltx_p" id="A1.T2.3.15.5.1.1" style="width:28.5pt;">1176.8M</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="A1.T2.3.15.6">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.15.6.1">
<span class="ltx_p" id="A1.T2.3.15.6.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="A1.T2.3.15.7">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.15.7.1">
<span class="ltx_p" id="A1.T2.3.15.7.1.1" style="width:28.5pt;">16</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="A1.T2.3.15.8">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.15.8.1">
<span class="ltx_p" id="A1.T2.3.15.8.1.1" style="width:28.5pt;">69,369</span>
</span>
</td>
<td class="ltx_td ltx_align_justify ltx_align_top ltx_border_bb" id="A1.T2.3.15.9">
<span class="ltx_inline-block ltx_align_top" id="A1.T2.3.15.9.1">
<span class="ltx_p" id="A1.T2.3.15.9.1.1" style="width:28.5pt;">100.0B</span>
</span>
</td>
</tr>
</table>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 2: </span><span class="ltx_text ltx_font_smallcaps" id="A1.T2.6.1">DataDecide</span> uses OLMo’s <span class="ltx_text ltx_font_italic" id="A1.T2.7.2">model ladder</span> <cite class="ltx_cite ltx_citemacro_citep">(Groeneveld et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib16" title="">2024</a>; OLMo et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib28" title="">2025</a>; Bhagia et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>)</cite> to programmatically create configurations for 14 model sizes with hyperparameters determined by heuristics in <cite class="ltx_cite ltx_citemacro_citet">Porian et al. (<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib31" title="">2024</a>)</cite>. All models have a sequence length of 2048 and an MLP ratio of 8. Each configuration is pretrained over 25 data recipes (Table <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.T1" title="Table 1 ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">1</span></a>).
Each recipe and configuration is also trained with 3 random seeds; for model sizes <math alttext="&lt;1" class="ltx_Math" display="inline" id="A1.T2.2.m1.1"><semantics id="A1.T2.2.m1.1b"><mrow id="A1.T2.2.m1.1.1" xref="A1.T2.2.m1.1.1.cmml"><mi id="A1.T2.2.m1.1.1.2" xref="A1.T2.2.m1.1.1.2.cmml"></mi><mo id="A1.T2.2.m1.1.1.1" xref="A1.T2.2.m1.1.1.1.cmml">&lt;</mo><mn id="A1.T2.2.m1.1.1.3" xref="A1.T2.2.m1.1.1.3.cmml">1</mn></mrow><annotation-xml encoding="MathML-Content" id="A1.T2.2.m1.1c"><apply id="A1.T2.2.m1.1.1.cmml" xref="A1.T2.2.m1.1.1"><lt id="A1.T2.2.m1.1.1.1.cmml" xref="A1.T2.2.m1.1.1.1"></lt><csymbol cd="latexml" id="A1.T2.2.m1.1.1.2.cmml" xref="A1.T2.2.m1.1.1.2">absent</csymbol><cn id="A1.T2.2.m1.1.1.3.cmml" type="integer" xref="A1.T2.2.m1.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A1.T2.2.m1.1d">&lt;1</annotation><annotation encoding="application/x-llamapun" id="A1.T2.2.m1.1e">&lt; 1</annotation></semantics></math>B, runs with all but the default seed are stopped early, at 25% of the compute used to train the 1B model. Model size is the number of non-embedding parameters. Batch size is the number of sequences per batch.
</figcaption>
</figure>
<div class="ltx_para ltx_noindent" id="A1.p1">
<p class="ltx_p" id="A1.p1.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A1.T2" title="Table 2 ‣ Appendix A Hyperparameters ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2</span></a> provides OLMo model ladder configurations for all models in <span class="ltx_text ltx_font_smallcaps" id="A1.p1.1.1">DataDecide</span>.</p>
</div>
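<div class="ltx_para" id="A1.p2">
<p class="ltx_p" id="A1.p2.1">As a quick sanity check of Table <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A1.T2" title="Table 2 ‣ Appendix A Hyperparameters ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2</span></a>, the token budget of each configuration should equal batch size (sequences per batch) times sequence length times training steps. The sketch below (not from the paper's code; a 2048-token sequence length is assumed, and the table's token counts are rounded, hence the tolerance) checks a few rows:</p>
</div>

```python
# Sanity check for Table 2: tokens trained = batch size (sequences per
# batch) * sequence length * training steps. Reported token counts in
# the table are rounded, so a 5% tolerance is allowed.
SEQ_LEN = 2048  # assumed sequence length

rows = [  # (model, batch_size, steps, reported_tokens)
    ("14M", 32, 21_953, 1.4e9),
    ("150M", 192, 38_157, 15.0e9),
    ("1B", 704, 69_369, 100.0e9),
]

for model, batch, steps, reported in rows:
    tokens = batch * SEQ_LEN * steps
    assert abs(tokens - reported) / reported < 0.05, model
    print(f"{model}: {tokens / 1e9:.2f}B tokens (reported {reported / 1e9:.1f}B)")
```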
</section>
<section class="ltx_appendix" id="A2">
<h2 class="ltx_title ltx_title_appendix">
<span class="ltx_tag ltx_tag_appendix">Appendix B </span>Proxy Metric Definitions</h2>
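<div class="ltx_para" id="A2.p0">
<p class="ltx_p" id="A2.p0.1">The first two proxy metrics defined in Table 3 can be sketched in a few lines. The sketch below is illustrative only (function names and toy values are not from the paper's code); it assumes the per-choice probabilities <span class="ltx_text ltx_font_italic">P</span>(choice | context) for each of the <span class="ltx_text ltx_font_italic">N</span> multiple-choice examples have already been computed by the model:</p>
</div>

```python
# Illustrative sketch (not the paper's code) of two proxy metrics,
# given pre-computed answer probabilities P(choice | context) for
# each of N multiple-choice evaluation examples.

def correct_prob(p_correct):
    """Correct Prob: (1/N) * sum_i P(c_correct^(i) | context_i)."""
    return sum(p_correct) / len(p_correct)

def margin(p_correct, p_incorrect):
    """Margin: mean over examples of P(correct | context) minus the
    highest probability among the incorrect choices,
    max_{c' != c_correct, c' in C} P(c' | context)."""
    return sum(pc - max(pw) for pc, pw in zip(p_correct, p_incorrect)) / len(p_correct)

# Toy values for two examples (hypothetical probabilities).
print(correct_prob([0.6, 0.5]))
print(margin([0.6, 0.5], [[0.2, 0.1], [0.3, 0.4]]))
```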
<figure class="ltx_table" id="A2.T3">
<table class="ltx_tabular ltx_centering ltx_align_middle" id="A2.T3.7">
<tr class="ltx_tr" id="A2.T3.7.8">
<td class="ltx_td ltx_align_left ltx_border_tt" id="A2.T3.7.8.1"><span class="ltx_text ltx_font_bold" id="A2.T3.7.8.1.1">Metric Name</span></td>
<td class="ltx_td ltx_align_left ltx_border_tt" id="A2.T3.7.8.2"><span class="ltx_text ltx_font_bold" id="A2.T3.7.8.2.1">Equation</span></td>
</tr>
<tr class="ltx_tr" id="A2.T3.1.1">
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.1.1.2"><span class="ltx_text ltx_font_smallcaps" id="A2.T3.1.1.2.1">Correct Prob</span></td>
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.1.1.1"><math alttext="\frac{1}{N}\sum_{i=1}^{N}P(c^{(i)}_{\text{correct}}\mid\text{context}_{i})" class="ltx_Math" display="inline" id="A2.T3.1.1.1.m1.2"><semantics id="A2.T3.1.1.1.m1.2a"><mrow id="A2.T3.1.1.1.m1.2.2" xref="A2.T3.1.1.1.m1.2.2.cmml"><mfrac id="A2.T3.1.1.1.m1.2.2.3" xref="A2.T3.1.1.1.m1.2.2.3.cmml"><mn id="A2.T3.1.1.1.m1.2.2.3.2" xref="A2.T3.1.1.1.m1.2.2.3.2.cmml">1</mn><mi id="A2.T3.1.1.1.m1.2.2.3.3" xref="A2.T3.1.1.1.m1.2.2.3.3.cmml">N</mi></mfrac><mo id="A2.T3.1.1.1.m1.2.2.2" xref="A2.T3.1.1.1.m1.2.2.2.cmml">⁢</mo><mrow id="A2.T3.1.1.1.m1.2.2.1" xref="A2.T3.1.1.1.m1.2.2.1.cmml"><msubsup id="A2.T3.1.1.1.m1.2.2.1.2" xref="A2.T3.1.1.1.m1.2.2.1.2.cmml"><mo id="A2.T3.1.1.1.m1.2.2.1.2.2.2" xref="A2.T3.1.1.1.m1.2.2.1.2.2.2.cmml">∑</mo><mrow id="A2.T3.1.1.1.m1.2.2.1.2.2.3" xref="A2.T3.1.1.1.m1.2.2.1.2.2.3.cmml"><mi id="A2.T3.1.1.1.m1.2.2.1.2.2.3.2" xref="A2.T3.1.1.1.m1.2.2.1.2.2.3.2.cmml">i</mi><mo id="A2.T3.1.1.1.m1.2.2.1.2.2.3.1" xref="A2.T3.1.1.1.m1.2.2.1.2.2.3.1.cmml">=</mo><mn id="A2.T3.1.1.1.m1.2.2.1.2.2.3.3" xref="A2.T3.1.1.1.m1.2.2.1.2.2.3.3.cmml">1</mn></mrow><mi id="A2.T3.1.1.1.m1.2.2.1.2.3" xref="A2.T3.1.1.1.m1.2.2.1.2.3.cmml">N</mi></msubsup><mrow id="A2.T3.1.1.1.m1.2.2.1.1" xref="A2.T3.1.1.1.m1.2.2.1.1.cmml"><mi id="A2.T3.1.1.1.m1.2.2.1.1.3" xref="A2.T3.1.1.1.m1.2.2.1.1.3.cmml">P</mi><mo id="A2.T3.1.1.1.m1.2.2.1.1.2" xref="A2.T3.1.1.1.m1.2.2.1.1.2.cmml">⁢</mo><mrow id="A2.T3.1.1.1.m1.2.2.1.1.1.1" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.cmml"><mo id="A2.T3.1.1.1.m1.2.2.1.1.1.1.2" stretchy="false" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.cmml">(</mo><mrow id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.cmml"><msubsup id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.cmml"><mi id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.2.2" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.2.2.cmml">c</mi><mtext id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.3" 
xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.3a.cmml">correct</mtext><mrow id="A2.T3.1.1.1.m1.1.1.1.3" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.cmml"><mo id="A2.T3.1.1.1.m1.1.1.1.3.1" stretchy="false" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.cmml">(</mo><mi id="A2.T3.1.1.1.m1.1.1.1.1" xref="A2.T3.1.1.1.m1.1.1.1.1.cmml">i</mi><mo id="A2.T3.1.1.1.m1.1.1.1.3.2" stretchy="false" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.cmml">)</mo></mrow></msubsup><mo id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.1" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.1.cmml">∣</mo><msub id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.cmml"><mtext id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.2" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.2a.cmml">context</mtext><mi id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.3" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.3.cmml">i</mi></msub></mrow><mo id="A2.T3.1.1.1.m1.2.2.1.1.1.1.3" stretchy="false" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.1.1.1.m1.2b"><apply id="A2.T3.1.1.1.m1.2.2.cmml" xref="A2.T3.1.1.1.m1.2.2"><times id="A2.T3.1.1.1.m1.2.2.2.cmml" xref="A2.T3.1.1.1.m1.2.2.2"></times><apply id="A2.T3.1.1.1.m1.2.2.3.cmml" xref="A2.T3.1.1.1.m1.2.2.3"><divide id="A2.T3.1.1.1.m1.2.2.3.1.cmml" xref="A2.T3.1.1.1.m1.2.2.3"></divide><cn id="A2.T3.1.1.1.m1.2.2.3.2.cmml" type="integer" xref="A2.T3.1.1.1.m1.2.2.3.2">1</cn><ci id="A2.T3.1.1.1.m1.2.2.3.3.cmml" xref="A2.T3.1.1.1.m1.2.2.3.3">𝑁</ci></apply><apply id="A2.T3.1.1.1.m1.2.2.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1"><apply id="A2.T3.1.1.1.m1.2.2.1.2.cmml" xref="A2.T3.1.1.1.m1.2.2.1.2"><csymbol cd="ambiguous" id="A2.T3.1.1.1.m1.2.2.1.2.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1.2">superscript</csymbol><apply id="A2.T3.1.1.1.m1.2.2.1.2.2.cmml" xref="A2.T3.1.1.1.m1.2.2.1.2"><csymbol cd="ambiguous" id="A2.T3.1.1.1.m1.2.2.1.2.2.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1.2">subscript</csymbol><sum id="A2.T3.1.1.1.m1.2.2.1.2.2.2.cmml" xref="A2.T3.1.1.1.m1.2.2.1.2.2.2"></sum><apply 
id="A2.T3.1.1.1.m1.2.2.1.2.2.3.cmml" xref="A2.T3.1.1.1.m1.2.2.1.2.2.3"><eq id="A2.T3.1.1.1.m1.2.2.1.2.2.3.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1.2.2.3.1"></eq><ci id="A2.T3.1.1.1.m1.2.2.1.2.2.3.2.cmml" xref="A2.T3.1.1.1.m1.2.2.1.2.2.3.2">𝑖</ci><cn id="A2.T3.1.1.1.m1.2.2.1.2.2.3.3.cmml" type="integer" xref="A2.T3.1.1.1.m1.2.2.1.2.2.3.3">1</cn></apply></apply><ci id="A2.T3.1.1.1.m1.2.2.1.2.3.cmml" xref="A2.T3.1.1.1.m1.2.2.1.2.3">𝑁</ci></apply><apply id="A2.T3.1.1.1.m1.2.2.1.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1"><times id="A2.T3.1.1.1.m1.2.2.1.1.2.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.2"></times><ci id="A2.T3.1.1.1.m1.2.2.1.1.3.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.3">𝑃</ci><apply id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1"><csymbol cd="latexml" id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.1">conditional</csymbol><apply id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2"><csymbol cd="ambiguous" id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2">subscript</csymbol><apply id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.2.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2"><csymbol cd="ambiguous" id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.2.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2">superscript</csymbol><ci id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.2.2.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.2.2">𝑐</ci><ci id="A2.T3.1.1.1.m1.1.1.1.1.cmml" xref="A2.T3.1.1.1.m1.1.1.1.1">𝑖</ci></apply><ci id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.3a.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.3"><mtext id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.3.cmml" mathsize="70%" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.2.3">correct</mtext></ci></apply><apply id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.1.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3">subscript</csymbol><ci id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.2a.cmml" 
xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.2"><mtext id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.2.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.2">context</mtext></ci><ci id="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.3.cmml" xref="A2.T3.1.1.1.m1.2.2.1.1.1.1.1.3.3">𝑖</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.1.1.1.m1.2c">\frac{1}{N}\sum_{i=1}^{N}P(c^{(i)}_{\text{correct}}\mid\text{context}_{i})</annotation><annotation encoding="application/x-llamapun" id="A2.T3.1.1.1.m1.2d">divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_P ( italic_c start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT correct end_POSTSUBSCRIPT ∣ context start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )</annotation></semantics></math></td>
</tr>
<tr class="ltx_tr" id="A2.T3.2.2">
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.2.2.2"><span class="ltx_text ltx_font_smallcaps" id="A2.T3.2.2.2.1">Margin</span></td>
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.2.2.1"><math alttext="\frac{1}{N}\sum_{i=1}^{N}\big{(}P(c_{\text{correct}}^{(i)}\mid\text{context}_{%
i})-\max_{c^{\prime}\neq c_{\text{correct}}^{(i)}\in C^{(i)}}P(c^{\prime}\mid%
\text{context}_{i})\big{)}" class="ltx_Math" display="inline" id="A2.T3.2.2.1.m1.4"><semantics id="A2.T3.2.2.1.m1.4a"><mrow id="A2.T3.2.2.1.m1.4.4" xref="A2.T3.2.2.1.m1.4.4.cmml"><mfrac id="A2.T3.2.2.1.m1.4.4.3" xref="A2.T3.2.2.1.m1.4.4.3.cmml"><mn id="A2.T3.2.2.1.m1.4.4.3.2" xref="A2.T3.2.2.1.m1.4.4.3.2.cmml">1</mn><mi id="A2.T3.2.2.1.m1.4.4.3.3" xref="A2.T3.2.2.1.m1.4.4.3.3.cmml">N</mi></mfrac><mo id="A2.T3.2.2.1.m1.4.4.2" xref="A2.T3.2.2.1.m1.4.4.2.cmml">⁢</mo><mrow id="A2.T3.2.2.1.m1.4.4.1" xref="A2.T3.2.2.1.m1.4.4.1.cmml"><msubsup id="A2.T3.2.2.1.m1.4.4.1.2" xref="A2.T3.2.2.1.m1.4.4.1.2.cmml"><mo id="A2.T3.2.2.1.m1.4.4.1.2.2.2" rspace="0em" xref="A2.T3.2.2.1.m1.4.4.1.2.2.2.cmml">∑</mo><mrow id="A2.T3.2.2.1.m1.4.4.1.2.2.3" xref="A2.T3.2.2.1.m1.4.4.1.2.2.3.cmml"><mi id="A2.T3.2.2.1.m1.4.4.1.2.2.3.2" xref="A2.T3.2.2.1.m1.4.4.1.2.2.3.2.cmml">i</mi><mo id="A2.T3.2.2.1.m1.4.4.1.2.2.3.1" xref="A2.T3.2.2.1.m1.4.4.1.2.2.3.1.cmml">=</mo><mn id="A2.T3.2.2.1.m1.4.4.1.2.2.3.3" xref="A2.T3.2.2.1.m1.4.4.1.2.2.3.3.cmml">1</mn></mrow><mi id="A2.T3.2.2.1.m1.4.4.1.2.3" xref="A2.T3.2.2.1.m1.4.4.1.2.3.cmml">N</mi></msubsup><mrow id="A2.T3.2.2.1.m1.4.4.1.1.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.cmml"><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.2" maxsize="120%" minsize="120%" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.cmml">(</mo><mrow id="A2.T3.2.2.1.m1.4.4.1.1.1.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.cmml"><mrow id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.cmml"><mi id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.3.cmml">P</mi><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.2" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.2.cmml">⁢</mo><mrow id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.cmml"><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.2" stretchy="false" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.cmml"><msubsup id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2" 
xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.cmml"><mi id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.2" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.2.cmml">c</mi><mtext id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.3a.cmml">correct</mtext><mrow id="A2.T3.2.2.1.m1.1.1.1.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.cmml"><mo id="A2.T3.2.2.1.m1.1.1.1.3.1" stretchy="false" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.cmml">(</mo><mi id="A2.T3.2.2.1.m1.1.1.1.1" xref="A2.T3.2.2.1.m1.1.1.1.1.cmml">i</mi><mo id="A2.T3.2.2.1.m1.1.1.1.3.2" stretchy="false" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.cmml">)</mo></mrow></msubsup><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.1.cmml">∣</mo><msub id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.cmml"><mtext id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.2" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.2a.cmml">context</mtext><mi id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.3.cmml">i</mi></msub></mrow><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.3" stretchy="false" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.3.cmml">−</mo><mrow id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.cmml"><mrow id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.cmml"><msub id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1.cmml"><mi id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1.2" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1.2.cmml">max</mi><mrow id="A2.T3.2.2.1.m1.3.3.2" xref="A2.T3.2.2.1.m1.3.3.2.cmml"><msup id="A2.T3.2.2.1.m1.3.3.2.4" xref="A2.T3.2.2.1.m1.3.3.2.4.cmml"><mi id="A2.T3.2.2.1.m1.3.3.2.4.2" xref="A2.T3.2.2.1.m1.3.3.2.4.2.cmml">c</mi><mo id="A2.T3.2.2.1.m1.3.3.2.4.3" xref="A2.T3.2.2.1.m1.3.3.2.4.3.cmml">′</mo></msup><mo id="A2.T3.2.2.1.m1.3.3.2.5" 
xref="A2.T3.2.2.1.m1.3.3.2.5.cmml">≠</mo><msubsup id="A2.T3.2.2.1.m1.3.3.2.6" xref="A2.T3.2.2.1.m1.3.3.2.6.cmml"><mi id="A2.T3.2.2.1.m1.3.3.2.6.2.2" xref="A2.T3.2.2.1.m1.3.3.2.6.2.2.cmml">c</mi><mtext id="A2.T3.2.2.1.m1.3.3.2.6.2.3" xref="A2.T3.2.2.1.m1.3.3.2.6.2.3a.cmml">correct</mtext><mrow id="A2.T3.2.2.1.m1.2.2.1.1.1.3" xref="A2.T3.2.2.1.m1.3.3.2.6.cmml"><mo id="A2.T3.2.2.1.m1.2.2.1.1.1.3.1" stretchy="false" xref="A2.T3.2.2.1.m1.3.3.2.6.cmml">(</mo><mi id="A2.T3.2.2.1.m1.2.2.1.1.1.1" xref="A2.T3.2.2.1.m1.2.2.1.1.1.1.cmml">i</mi><mo id="A2.T3.2.2.1.m1.2.2.1.1.1.3.2" stretchy="false" xref="A2.T3.2.2.1.m1.3.3.2.6.cmml">)</mo></mrow></msubsup><mo id="A2.T3.2.2.1.m1.3.3.2.7" xref="A2.T3.2.2.1.m1.3.3.2.7.cmml">∈</mo><msup id="A2.T3.2.2.1.m1.3.3.2.8" xref="A2.T3.2.2.1.m1.3.3.2.8.cmml"><mi id="A2.T3.2.2.1.m1.3.3.2.8.2" xref="A2.T3.2.2.1.m1.3.3.2.8.2.cmml">C</mi><mrow id="A2.T3.2.2.1.m1.3.3.2.2.1.3" xref="A2.T3.2.2.1.m1.3.3.2.8.cmml"><mo id="A2.T3.2.2.1.m1.3.3.2.2.1.3.1" stretchy="false" xref="A2.T3.2.2.1.m1.3.3.2.8.cmml">(</mo><mi id="A2.T3.2.2.1.m1.3.3.2.2.1.1" xref="A2.T3.2.2.1.m1.3.3.2.2.1.1.cmml">i</mi><mo id="A2.T3.2.2.1.m1.3.3.2.2.1.3.2" stretchy="false" xref="A2.T3.2.2.1.m1.3.3.2.8.cmml">)</mo></mrow></msup></mrow></msub><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3a" lspace="0.167em" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.cmml">⁡</mo><mi id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.2" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.2.cmml">P</mi></mrow><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.2" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.2.cmml">⁢</mo><mrow id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.cmml"><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.2" stretchy="false" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.cmml">(</mo><mrow id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.cmml"><msup id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.cmml"><mi id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.2" 
xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.2.cmml">c</mi><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.3.cmml">′</mo></msup><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.1" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.1.cmml">∣</mo><msub id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.cmml"><mtext id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.2" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.2a.cmml">context</mtext><mi id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.3" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.3.cmml">i</mi></msub></mrow><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.3" stretchy="false" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="A2.T3.2.2.1.m1.4.4.1.1.1.3" maxsize="120%" minsize="120%" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.2.2.1.m1.4b"><apply id="A2.T3.2.2.1.m1.4.4.cmml" xref="A2.T3.2.2.1.m1.4.4"><times id="A2.T3.2.2.1.m1.4.4.2.cmml" xref="A2.T3.2.2.1.m1.4.4.2"></times><apply id="A2.T3.2.2.1.m1.4.4.3.cmml" xref="A2.T3.2.2.1.m1.4.4.3"><divide id="A2.T3.2.2.1.m1.4.4.3.1.cmml" xref="A2.T3.2.2.1.m1.4.4.3"></divide><cn id="A2.T3.2.2.1.m1.4.4.3.2.cmml" type="integer" xref="A2.T3.2.2.1.m1.4.4.3.2">1</cn><ci id="A2.T3.2.2.1.m1.4.4.3.3.cmml" xref="A2.T3.2.2.1.m1.4.4.3.3">𝑁</ci></apply><apply id="A2.T3.2.2.1.m1.4.4.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1"><apply id="A2.T3.2.2.1.m1.4.4.1.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.2"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.4.4.1.2.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.2">superscript</csymbol><apply id="A2.T3.2.2.1.m1.4.4.1.2.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.2"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.4.4.1.2.2.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.2">subscript</csymbol><sum id="A2.T3.2.2.1.m1.4.4.1.2.2.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.2.2.2"></sum><apply id="A2.T3.2.2.1.m1.4.4.1.2.2.3.cmml" xref="A2.T3.2.2.1.m1.4.4.1.2.2.3"><eq 
id="A2.T3.2.2.1.m1.4.4.1.2.2.3.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.2.2.3.1"></eq><ci id="A2.T3.2.2.1.m1.4.4.1.2.2.3.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.2.2.3.2">𝑖</ci><cn id="A2.T3.2.2.1.m1.4.4.1.2.2.3.3.cmml" type="integer" xref="A2.T3.2.2.1.m1.4.4.1.2.2.3.3">1</cn></apply></apply><ci id="A2.T3.2.2.1.m1.4.4.1.2.3.cmml" xref="A2.T3.2.2.1.m1.4.4.1.2.3">𝑁</ci></apply><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1"><minus id="A2.T3.2.2.1.m1.4.4.1.1.1.1.3.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.3"></minus><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1"><times id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.2"></times><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.3.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.3">𝑃</ci><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1"><csymbol cd="latexml" id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.1">conditional</csymbol><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2">superscript</csymbol><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2">subscript</csymbol><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.2">𝑐</ci><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.3a.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.3"><mtext id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.3.cmml" mathsize="70%" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.2.2.3">correct</mtext></ci></apply><ci id="A2.T3.2.2.1.m1.1.1.1.1.cmml" xref="A2.T3.2.2.1.m1.1.1.1.1">𝑖</ci></apply><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.cmml" 
xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.2a.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.2"><mtext id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.2">context</mtext></ci><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.3.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.1.1.1.1.3.3">𝑖</ci></apply></apply></apply><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2"><times id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.2"></times><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3"><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1">subscript</csymbol><max id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.1.2"></max><apply id="A2.T3.2.2.1.m1.3.3.2.cmml" xref="A2.T3.2.2.1.m1.3.3.2"><and id="A2.T3.2.2.1.m1.3.3.2a.cmml" xref="A2.T3.2.2.1.m1.3.3.2"></and><apply id="A2.T3.2.2.1.m1.3.3.2b.cmml" xref="A2.T3.2.2.1.m1.3.3.2"><neq id="A2.T3.2.2.1.m1.3.3.2.5.cmml" xref="A2.T3.2.2.1.m1.3.3.2.5"></neq><apply id="A2.T3.2.2.1.m1.3.3.2.4.cmml" xref="A2.T3.2.2.1.m1.3.3.2.4"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.3.3.2.4.1.cmml" xref="A2.T3.2.2.1.m1.3.3.2.4">superscript</csymbol><ci id="A2.T3.2.2.1.m1.3.3.2.4.2.cmml" xref="A2.T3.2.2.1.m1.3.3.2.4.2">𝑐</ci><ci id="A2.T3.2.2.1.m1.3.3.2.4.3.cmml" xref="A2.T3.2.2.1.m1.3.3.2.4.3">′</ci></apply><apply id="A2.T3.2.2.1.m1.3.3.2.6.cmml" xref="A2.T3.2.2.1.m1.3.3.2.6"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.3.3.2.6.1.cmml" xref="A2.T3.2.2.1.m1.3.3.2.6">superscript</csymbol><apply id="A2.T3.2.2.1.m1.3.3.2.6.2.cmml" xref="A2.T3.2.2.1.m1.3.3.2.6"><csymbol 
cd="ambiguous" id="A2.T3.2.2.1.m1.3.3.2.6.2.1.cmml" xref="A2.T3.2.2.1.m1.3.3.2.6">subscript</csymbol><ci id="A2.T3.2.2.1.m1.3.3.2.6.2.2.cmml" xref="A2.T3.2.2.1.m1.3.3.2.6.2.2">𝑐</ci><ci id="A2.T3.2.2.1.m1.3.3.2.6.2.3a.cmml" xref="A2.T3.2.2.1.m1.3.3.2.6.2.3"><mtext id="A2.T3.2.2.1.m1.3.3.2.6.2.3.cmml" mathsize="50%" xref="A2.T3.2.2.1.m1.3.3.2.6.2.3">correct</mtext></ci></apply><ci id="A2.T3.2.2.1.m1.2.2.1.1.1.1.cmml" xref="A2.T3.2.2.1.m1.2.2.1.1.1.1">𝑖</ci></apply></apply><apply id="A2.T3.2.2.1.m1.3.3.2c.cmml" xref="A2.T3.2.2.1.m1.3.3.2"><in id="A2.T3.2.2.1.m1.3.3.2.7.cmml" xref="A2.T3.2.2.1.m1.3.3.2.7"></in><share href="https://arxiv.org/html/2504.11393v1#A2.T3.2.2.1.m1.3.3.2.6.cmml" id="A2.T3.2.2.1.m1.3.3.2d.cmml" xref="A2.T3.2.2.1.m1.3.3.2"></share><apply id="A2.T3.2.2.1.m1.3.3.2.8.cmml" xref="A2.T3.2.2.1.m1.3.3.2.8"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.3.3.2.8.1.cmml" xref="A2.T3.2.2.1.m1.3.3.2.8">superscript</csymbol><ci id="A2.T3.2.2.1.m1.3.3.2.8.2.cmml" xref="A2.T3.2.2.1.m1.3.3.2.8.2">𝐶</ci><ci id="A2.T3.2.2.1.m1.3.3.2.2.1.1.cmml" xref="A2.T3.2.2.1.m1.3.3.2.2.1.1">𝑖</ci></apply></apply></apply></apply><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.3.2">𝑃</ci></apply><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1"><csymbol cd="latexml" id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.1">conditional</csymbol><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2">superscript</csymbol><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.2">𝑐</ci><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.3.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.2.3">′</ci></apply><apply id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.cmml" 
xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.1.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3">subscript</csymbol><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.2a.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.2"><mtext id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.2.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.2">context</mtext></ci><ci id="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.3.cmml" xref="A2.T3.2.2.1.m1.4.4.1.1.1.1.2.1.1.1.3.3">𝑖</ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.2.2.1.m1.4c">\frac{1}{N}\sum_{i=1}^{N}\big{(}P(c_{\text{correct}}^{(i)}\mid\text{context}_{%
i})-\max_{c^{\prime}\neq c_{\text{correct}}^{(i)}\in C^{(i)}}P(c^{\prime}\mid%
\text{context}_{i})\big{)}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.2.2.1.m1.4d">divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_P ( italic_c start_POSTSUBSCRIPT correct end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∣ context start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - roman_max start_POSTSUBSCRIPT italic_c start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ≠ italic_c start_POSTSUBSCRIPT correct end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∈ italic_C start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_P ( italic_c start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ context start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) )</annotation></semantics></math></td>
</tr>
<tr class="ltx_tr" id="A2.T3.3.3">
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.3.3.2"><span class="ltx_text ltx_font_smallcaps" id="A2.T3.3.3.2.1">Norm Correct Prob</span></td>
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.3.3.1"><math alttext="\frac{1}{N}\sum_{i=1}^{N}\frac{P(c^{(i)}_{\text{correct}}\mid\text{context}_{i%
})}{\sum_{c\in C^{(i)}}P(c\mid\text{context}_{i})}" class="ltx_Math" display="inline" id="A2.T3.3.3.1.m1.4"><semantics id="A2.T3.3.3.1.m1.4a"><mrow id="A2.T3.3.3.1.m1.4.5" xref="A2.T3.3.3.1.m1.4.5.cmml"><mfrac id="A2.T3.3.3.1.m1.4.5.2" xref="A2.T3.3.3.1.m1.4.5.2.cmml"><mn id="A2.T3.3.3.1.m1.4.5.2.2" xref="A2.T3.3.3.1.m1.4.5.2.2.cmml">1</mn><mi id="A2.T3.3.3.1.m1.4.5.2.3" xref="A2.T3.3.3.1.m1.4.5.2.3.cmml">N</mi></mfrac><mo id="A2.T3.3.3.1.m1.4.5.1" xref="A2.T3.3.3.1.m1.4.5.1.cmml">⁢</mo><mrow id="A2.T3.3.3.1.m1.4.5.3" xref="A2.T3.3.3.1.m1.4.5.3.cmml"><msubsup id="A2.T3.3.3.1.m1.4.5.3.1" xref="A2.T3.3.3.1.m1.4.5.3.1.cmml"><mo id="A2.T3.3.3.1.m1.4.5.3.1.2.2" xref="A2.T3.3.3.1.m1.4.5.3.1.2.2.cmml">∑</mo><mrow id="A2.T3.3.3.1.m1.4.5.3.1.2.3" xref="A2.T3.3.3.1.m1.4.5.3.1.2.3.cmml"><mi id="A2.T3.3.3.1.m1.4.5.3.1.2.3.2" xref="A2.T3.3.3.1.m1.4.5.3.1.2.3.2.cmml">i</mi><mo id="A2.T3.3.3.1.m1.4.5.3.1.2.3.1" xref="A2.T3.3.3.1.m1.4.5.3.1.2.3.1.cmml">=</mo><mn id="A2.T3.3.3.1.m1.4.5.3.1.2.3.3" xref="A2.T3.3.3.1.m1.4.5.3.1.2.3.3.cmml">1</mn></mrow><mi id="A2.T3.3.3.1.m1.4.5.3.1.3" xref="A2.T3.3.3.1.m1.4.5.3.1.3.cmml">N</mi></msubsup><mfrac id="A2.T3.3.3.1.m1.4.4" xref="A2.T3.3.3.1.m1.4.4.cmml"><mrow id="A2.T3.3.3.1.m1.2.2.2" xref="A2.T3.3.3.1.m1.2.2.2.cmml"><mi id="A2.T3.3.3.1.m1.2.2.2.4" xref="A2.T3.3.3.1.m1.2.2.2.4.cmml">P</mi><mo id="A2.T3.3.3.1.m1.2.2.2.3" xref="A2.T3.3.3.1.m1.2.2.2.3.cmml">⁢</mo><mrow id="A2.T3.3.3.1.m1.2.2.2.2.1" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.cmml"><mo id="A2.T3.3.3.1.m1.2.2.2.2.1.2" stretchy="false" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.cmml">(</mo><mrow id="A2.T3.3.3.1.m1.2.2.2.2.1.1" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.cmml"><msubsup id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.cmml"><mi id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.2.2" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.2.2.cmml">c</mi><mtext id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.3" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.3a.cmml">correct</mtext><mrow id="A2.T3.3.3.1.m1.1.1.1.1.1.3" 
xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.cmml"><mo id="A2.T3.3.3.1.m1.1.1.1.1.1.3.1" stretchy="false" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.cmml">(</mo><mi id="A2.T3.3.3.1.m1.1.1.1.1.1.1" xref="A2.T3.3.3.1.m1.1.1.1.1.1.1.cmml">i</mi><mo id="A2.T3.3.3.1.m1.1.1.1.1.1.3.2" stretchy="false" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.cmml">)</mo></mrow></msubsup><mo id="A2.T3.3.3.1.m1.2.2.2.2.1.1.1" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.1.cmml">∣</mo><msub id="A2.T3.3.3.1.m1.2.2.2.2.1.1.3" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.cmml"><mtext id="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.2" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.2a.cmml">context</mtext><mi id="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.3" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.3.cmml">i</mi></msub></mrow><mo id="A2.T3.3.3.1.m1.2.2.2.2.1.3" stretchy="false" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.cmml">)</mo></mrow></mrow><mrow id="A2.T3.3.3.1.m1.4.4.4" xref="A2.T3.3.3.1.m1.4.4.4.cmml"><mstyle displaystyle="false" id="A2.T3.3.3.1.m1.4.4.4.3" xref="A2.T3.3.3.1.m1.4.4.4.3.cmml"><msub id="A2.T3.3.3.1.m1.4.4.4.3a" xref="A2.T3.3.3.1.m1.4.4.4.3.cmml"><mo id="A2.T3.3.3.1.m1.4.4.4.3.2" xref="A2.T3.3.3.1.m1.4.4.4.3.2.cmml">∑</mo><mrow id="A2.T3.3.3.1.m1.3.3.3.1.1" xref="A2.T3.3.3.1.m1.3.3.3.1.1.cmml"><mi id="A2.T3.3.3.1.m1.3.3.3.1.1.3" xref="A2.T3.3.3.1.m1.3.3.3.1.1.3.cmml">c</mi><mo id="A2.T3.3.3.1.m1.3.3.3.1.1.2" xref="A2.T3.3.3.1.m1.3.3.3.1.1.2.cmml">∈</mo><msup id="A2.T3.3.3.1.m1.3.3.3.1.1.4" xref="A2.T3.3.3.1.m1.3.3.3.1.1.4.cmml"><mi id="A2.T3.3.3.1.m1.3.3.3.1.1.4.2" xref="A2.T3.3.3.1.m1.3.3.3.1.1.4.2.cmml">C</mi><mrow id="A2.T3.3.3.1.m1.3.3.3.1.1.1.1.3" xref="A2.T3.3.3.1.m1.3.3.3.1.1.4.cmml"><mo id="A2.T3.3.3.1.m1.3.3.3.1.1.1.1.3.1" stretchy="false" xref="A2.T3.3.3.1.m1.3.3.3.1.1.4.cmml">(</mo><mi id="A2.T3.3.3.1.m1.3.3.3.1.1.1.1.1" xref="A2.T3.3.3.1.m1.3.3.3.1.1.1.1.1.cmml">i</mi><mo id="A2.T3.3.3.1.m1.3.3.3.1.1.1.1.3.2" stretchy="false" xref="A2.T3.3.3.1.m1.3.3.3.1.1.4.cmml">)</mo></mrow></msup></mrow></msub></mstyle><mrow id="A2.T3.3.3.1.m1.4.4.4.2" 
xref="A2.T3.3.3.1.m1.4.4.4.2.cmml"><mi id="A2.T3.3.3.1.m1.4.4.4.2.3" xref="A2.T3.3.3.1.m1.4.4.4.2.3.cmml">P</mi><mo id="A2.T3.3.3.1.m1.4.4.4.2.2" xref="A2.T3.3.3.1.m1.4.4.4.2.2.cmml">⁢</mo><mrow id="A2.T3.3.3.1.m1.4.4.4.2.1.1" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.cmml"><mo id="A2.T3.3.3.1.m1.4.4.4.2.1.1.2" stretchy="false" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.cmml">(</mo><mrow id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.cmml"><mi id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.2" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.2.cmml">c</mi><mo id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.1" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.1.cmml">∣</mo><msub id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.cmml"><mtext id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.2" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.2a.cmml">context</mtext><mi id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.3" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.3.cmml">i</mi></msub></mrow><mo id="A2.T3.3.3.1.m1.4.4.4.2.1.1.3" stretchy="false" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.cmml">)</mo></mrow></mrow></mrow></mfrac></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.3.3.1.m1.4b"><apply id="A2.T3.3.3.1.m1.4.5.cmml" xref="A2.T3.3.3.1.m1.4.5"><times id="A2.T3.3.3.1.m1.4.5.1.cmml" xref="A2.T3.3.3.1.m1.4.5.1"></times><apply id="A2.T3.3.3.1.m1.4.5.2.cmml" xref="A2.T3.3.3.1.m1.4.5.2"><divide id="A2.T3.3.3.1.m1.4.5.2.1.cmml" xref="A2.T3.3.3.1.m1.4.5.2"></divide><cn id="A2.T3.3.3.1.m1.4.5.2.2.cmml" type="integer" xref="A2.T3.3.3.1.m1.4.5.2.2">1</cn><ci id="A2.T3.3.3.1.m1.4.5.2.3.cmml" xref="A2.T3.3.3.1.m1.4.5.2.3">𝑁</ci></apply><apply id="A2.T3.3.3.1.m1.4.5.3.cmml" xref="A2.T3.3.3.1.m1.4.5.3"><apply id="A2.T3.3.3.1.m1.4.5.3.1.cmml" xref="A2.T3.3.3.1.m1.4.5.3.1"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.4.5.3.1.1.cmml" xref="A2.T3.3.3.1.m1.4.5.3.1">superscript</csymbol><apply id="A2.T3.3.3.1.m1.4.5.3.1.2.cmml" xref="A2.T3.3.3.1.m1.4.5.3.1"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.4.5.3.1.2.1.cmml" 
xref="A2.T3.3.3.1.m1.4.5.3.1">subscript</csymbol><sum id="A2.T3.3.3.1.m1.4.5.3.1.2.2.cmml" xref="A2.T3.3.3.1.m1.4.5.3.1.2.2"></sum><apply id="A2.T3.3.3.1.m1.4.5.3.1.2.3.cmml" xref="A2.T3.3.3.1.m1.4.5.3.1.2.3"><eq id="A2.T3.3.3.1.m1.4.5.3.1.2.3.1.cmml" xref="A2.T3.3.3.1.m1.4.5.3.1.2.3.1"></eq><ci id="A2.T3.3.3.1.m1.4.5.3.1.2.3.2.cmml" xref="A2.T3.3.3.1.m1.4.5.3.1.2.3.2">𝑖</ci><cn id="A2.T3.3.3.1.m1.4.5.3.1.2.3.3.cmml" type="integer" xref="A2.T3.3.3.1.m1.4.5.3.1.2.3.3">1</cn></apply></apply><ci id="A2.T3.3.3.1.m1.4.5.3.1.3.cmml" xref="A2.T3.3.3.1.m1.4.5.3.1.3">𝑁</ci></apply><apply id="A2.T3.3.3.1.m1.4.4.cmml" xref="A2.T3.3.3.1.m1.4.4"><divide id="A2.T3.3.3.1.m1.4.4.5.cmml" xref="A2.T3.3.3.1.m1.4.4"></divide><apply id="A2.T3.3.3.1.m1.2.2.2.cmml" xref="A2.T3.3.3.1.m1.2.2.2"><times id="A2.T3.3.3.1.m1.2.2.2.3.cmml" xref="A2.T3.3.3.1.m1.2.2.2.3"></times><ci id="A2.T3.3.3.1.m1.2.2.2.4.cmml" xref="A2.T3.3.3.1.m1.2.2.2.4">𝑃</ci><apply id="A2.T3.3.3.1.m1.2.2.2.2.1.1.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1"><csymbol cd="latexml" id="A2.T3.3.3.1.m1.2.2.2.2.1.1.1.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.1">conditional</csymbol><apply id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.1.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2">subscript</csymbol><apply id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.2.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.2.1.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2">superscript</csymbol><ci id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.2.2.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.2.2">𝑐</ci><ci id="A2.T3.3.3.1.m1.1.1.1.1.1.1.cmml" xref="A2.T3.3.3.1.m1.1.1.1.1.1.1">𝑖</ci></apply><ci id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.3a.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.3"><mtext id="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.3.cmml" mathsize="50%" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.2.3">correct</mtext></ci></apply><apply id="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.cmml" 
xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.3"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.1.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.3">subscript</csymbol><ci id="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.2a.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.2"><mtext id="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.2.cmml" mathsize="70%" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.2">context</mtext></ci><ci id="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.3.cmml" xref="A2.T3.3.3.1.m1.2.2.2.2.1.1.3.3">𝑖</ci></apply></apply></apply><apply id="A2.T3.3.3.1.m1.4.4.4.cmml" xref="A2.T3.3.3.1.m1.4.4.4"><apply id="A2.T3.3.3.1.m1.4.4.4.3.cmml" xref="A2.T3.3.3.1.m1.4.4.4.3"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.4.4.4.3.1.cmml" xref="A2.T3.3.3.1.m1.4.4.4.3">subscript</csymbol><sum id="A2.T3.3.3.1.m1.4.4.4.3.2.cmml" xref="A2.T3.3.3.1.m1.4.4.4.3.2"></sum><apply id="A2.T3.3.3.1.m1.3.3.3.1.1.cmml" xref="A2.T3.3.3.1.m1.3.3.3.1.1"><in id="A2.T3.3.3.1.m1.3.3.3.1.1.2.cmml" xref="A2.T3.3.3.1.m1.3.3.3.1.1.2"></in><ci id="A2.T3.3.3.1.m1.3.3.3.1.1.3.cmml" xref="A2.T3.3.3.1.m1.3.3.3.1.1.3">𝑐</ci><apply id="A2.T3.3.3.1.m1.3.3.3.1.1.4.cmml" xref="A2.T3.3.3.1.m1.3.3.3.1.1.4"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.3.3.3.1.1.4.1.cmml" xref="A2.T3.3.3.1.m1.3.3.3.1.1.4">superscript</csymbol><ci id="A2.T3.3.3.1.m1.3.3.3.1.1.4.2.cmml" xref="A2.T3.3.3.1.m1.3.3.3.1.1.4.2">𝐶</ci><ci id="A2.T3.3.3.1.m1.3.3.3.1.1.1.1.1.cmml" xref="A2.T3.3.3.1.m1.3.3.3.1.1.1.1.1">𝑖</ci></apply></apply></apply><apply id="A2.T3.3.3.1.m1.4.4.4.2.cmml" xref="A2.T3.3.3.1.m1.4.4.4.2"><times id="A2.T3.3.3.1.m1.4.4.4.2.2.cmml" xref="A2.T3.3.3.1.m1.4.4.4.2.2"></times><ci id="A2.T3.3.3.1.m1.4.4.4.2.3.cmml" xref="A2.T3.3.3.1.m1.4.4.4.2.3">𝑃</ci><apply id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.cmml" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1"><csymbol cd="latexml" id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.1.cmml" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.1">conditional</csymbol><ci id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.2.cmml" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.2">𝑐</ci><apply id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.cmml" 
xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.1.cmml" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3">subscript</csymbol><ci id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.2a.cmml" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.2"><mtext id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.2.cmml" mathsize="70%" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.2">context</mtext></ci><ci id="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.3.cmml" xref="A2.T3.3.3.1.m1.4.4.4.2.1.1.1.3.3">𝑖</ci></apply></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.3.3.1.m1.4c">\frac{1}{N}\sum_{i=1}^{N}\frac{P(c^{(i)}_{\text{correct}}\mid\text{context}_{i%
})}{\sum_{c\in C^{(i)}}P(c\mid\text{context}_{i})}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.3.3.1.m1.4d">divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT divide start_ARG italic_P ( italic_c start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT correct end_POSTSUBSCRIPT ∣ context start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_c ∈ italic_C start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_P ( italic_c ∣ context start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG</annotation></semantics></math></td>
</tr>
<tr class="ltx_tr" id="A2.T3.4.4">
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.4.4.2"><span class="ltx_text ltx_font_smallcaps" id="A2.T3.4.4.2.1">Total Prob</span></td>
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.4.4.1"><math alttext="\frac{1}{N}\sum_{i=1}^{N}\sum_{c\in C^{(i)}}P(c\mid\text{context}_{i})" class="ltx_Math" display="inline" id="A2.T3.4.4.1.m1.2"><semantics id="A2.T3.4.4.1.m1.2a"><mrow id="A2.T3.4.4.1.m1.2.2" xref="A2.T3.4.4.1.m1.2.2.cmml"><mfrac id="A2.T3.4.4.1.m1.2.2.3" xref="A2.T3.4.4.1.m1.2.2.3.cmml"><mn id="A2.T3.4.4.1.m1.2.2.3.2" xref="A2.T3.4.4.1.m1.2.2.3.2.cmml">1</mn><mi id="A2.T3.4.4.1.m1.2.2.3.3" xref="A2.T3.4.4.1.m1.2.2.3.3.cmml">N</mi></mfrac><mo id="A2.T3.4.4.1.m1.2.2.2" xref="A2.T3.4.4.1.m1.2.2.2.cmml">⁢</mo><mrow id="A2.T3.4.4.1.m1.2.2.1" xref="A2.T3.4.4.1.m1.2.2.1.cmml"><msubsup id="A2.T3.4.4.1.m1.2.2.1.2" xref="A2.T3.4.4.1.m1.2.2.1.2.cmml"><mo id="A2.T3.4.4.1.m1.2.2.1.2.2.2" rspace="0em" xref="A2.T3.4.4.1.m1.2.2.1.2.2.2.cmml">∑</mo><mrow id="A2.T3.4.4.1.m1.2.2.1.2.2.3" xref="A2.T3.4.4.1.m1.2.2.1.2.2.3.cmml"><mi id="A2.T3.4.4.1.m1.2.2.1.2.2.3.2" xref="A2.T3.4.4.1.m1.2.2.1.2.2.3.2.cmml">i</mi><mo id="A2.T3.4.4.1.m1.2.2.1.2.2.3.1" xref="A2.T3.4.4.1.m1.2.2.1.2.2.3.1.cmml">=</mo><mn id="A2.T3.4.4.1.m1.2.2.1.2.2.3.3" xref="A2.T3.4.4.1.m1.2.2.1.2.2.3.3.cmml">1</mn></mrow><mi id="A2.T3.4.4.1.m1.2.2.1.2.3" xref="A2.T3.4.4.1.m1.2.2.1.2.3.cmml">N</mi></msubsup><mrow id="A2.T3.4.4.1.m1.2.2.1.1" xref="A2.T3.4.4.1.m1.2.2.1.1.cmml"><msub id="A2.T3.4.4.1.m1.2.2.1.1.2" xref="A2.T3.4.4.1.m1.2.2.1.1.2.cmml"><mo id="A2.T3.4.4.1.m1.2.2.1.1.2.2" xref="A2.T3.4.4.1.m1.2.2.1.1.2.2.cmml">∑</mo><mrow id="A2.T3.4.4.1.m1.1.1.1" xref="A2.T3.4.4.1.m1.1.1.1.cmml"><mi id="A2.T3.4.4.1.m1.1.1.1.3" xref="A2.T3.4.4.1.m1.1.1.1.3.cmml">c</mi><mo id="A2.T3.4.4.1.m1.1.1.1.2" xref="A2.T3.4.4.1.m1.1.1.1.2.cmml">∈</mo><msup id="A2.T3.4.4.1.m1.1.1.1.4" xref="A2.T3.4.4.1.m1.1.1.1.4.cmml"><mi id="A2.T3.4.4.1.m1.1.1.1.4.2" xref="A2.T3.4.4.1.m1.1.1.1.4.2.cmml">C</mi><mrow id="A2.T3.4.4.1.m1.1.1.1.1.1.3" xref="A2.T3.4.4.1.m1.1.1.1.4.cmml"><mo id="A2.T3.4.4.1.m1.1.1.1.1.1.3.1" stretchy="false" 
xref="A2.T3.4.4.1.m1.1.1.1.4.cmml">(</mo><mi id="A2.T3.4.4.1.m1.1.1.1.1.1.1" xref="A2.T3.4.4.1.m1.1.1.1.1.1.1.cmml">i</mi><mo id="A2.T3.4.4.1.m1.1.1.1.1.1.3.2" stretchy="false" xref="A2.T3.4.4.1.m1.1.1.1.4.cmml">)</mo></mrow></msup></mrow></msub><mrow id="A2.T3.4.4.1.m1.2.2.1.1.1" xref="A2.T3.4.4.1.m1.2.2.1.1.1.cmml"><mi id="A2.T3.4.4.1.m1.2.2.1.1.1.3" xref="A2.T3.4.4.1.m1.2.2.1.1.1.3.cmml">P</mi><mo id="A2.T3.4.4.1.m1.2.2.1.1.1.2" xref="A2.T3.4.4.1.m1.2.2.1.1.1.2.cmml">⁢</mo><mrow id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.cmml"><mo id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.2" stretchy="false" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.cmml">(</mo><mrow id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.cmml"><mi id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.2" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.2.cmml">c</mi><mo id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.1" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.1.cmml">∣</mo><msub id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.cmml"><mtext id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.2" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.2a.cmml">context</mtext><mi id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.3" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.3.cmml">i</mi></msub></mrow><mo id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.3" stretchy="false" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.4.4.1.m1.2b"><apply id="A2.T3.4.4.1.m1.2.2.cmml" xref="A2.T3.4.4.1.m1.2.2"><times id="A2.T3.4.4.1.m1.2.2.2.cmml" xref="A2.T3.4.4.1.m1.2.2.2"></times><apply id="A2.T3.4.4.1.m1.2.2.3.cmml" xref="A2.T3.4.4.1.m1.2.2.3"><divide id="A2.T3.4.4.1.m1.2.2.3.1.cmml" xref="A2.T3.4.4.1.m1.2.2.3"></divide><cn id="A2.T3.4.4.1.m1.2.2.3.2.cmml" type="integer" xref="A2.T3.4.4.1.m1.2.2.3.2">1</cn><ci id="A2.T3.4.4.1.m1.2.2.3.3.cmml" xref="A2.T3.4.4.1.m1.2.2.3.3">𝑁</ci></apply><apply id="A2.T3.4.4.1.m1.2.2.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1"><apply 
id="A2.T3.4.4.1.m1.2.2.1.2.cmml" xref="A2.T3.4.4.1.m1.2.2.1.2"><csymbol cd="ambiguous" id="A2.T3.4.4.1.m1.2.2.1.2.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1.2">superscript</csymbol><apply id="A2.T3.4.4.1.m1.2.2.1.2.2.cmml" xref="A2.T3.4.4.1.m1.2.2.1.2"><csymbol cd="ambiguous" id="A2.T3.4.4.1.m1.2.2.1.2.2.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1.2">subscript</csymbol><sum id="A2.T3.4.4.1.m1.2.2.1.2.2.2.cmml" xref="A2.T3.4.4.1.m1.2.2.1.2.2.2"></sum><apply id="A2.T3.4.4.1.m1.2.2.1.2.2.3.cmml" xref="A2.T3.4.4.1.m1.2.2.1.2.2.3"><eq id="A2.T3.4.4.1.m1.2.2.1.2.2.3.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1.2.2.3.1"></eq><ci id="A2.T3.4.4.1.m1.2.2.1.2.2.3.2.cmml" xref="A2.T3.4.4.1.m1.2.2.1.2.2.3.2">𝑖</ci><cn id="A2.T3.4.4.1.m1.2.2.1.2.2.3.3.cmml" type="integer" xref="A2.T3.4.4.1.m1.2.2.1.2.2.3.3">1</cn></apply></apply><ci id="A2.T3.4.4.1.m1.2.2.1.2.3.cmml" xref="A2.T3.4.4.1.m1.2.2.1.2.3">𝑁</ci></apply><apply id="A2.T3.4.4.1.m1.2.2.1.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1"><apply id="A2.T3.4.4.1.m1.2.2.1.1.2.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.2"><csymbol cd="ambiguous" id="A2.T3.4.4.1.m1.2.2.1.1.2.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.2">subscript</csymbol><sum id="A2.T3.4.4.1.m1.2.2.1.1.2.2.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.2.2"></sum><apply id="A2.T3.4.4.1.m1.1.1.1.cmml" xref="A2.T3.4.4.1.m1.1.1.1"><in id="A2.T3.4.4.1.m1.1.1.1.2.cmml" xref="A2.T3.4.4.1.m1.1.1.1.2"></in><ci id="A2.T3.4.4.1.m1.1.1.1.3.cmml" xref="A2.T3.4.4.1.m1.1.1.1.3">𝑐</ci><apply id="A2.T3.4.4.1.m1.1.1.1.4.cmml" xref="A2.T3.4.4.1.m1.1.1.1.4"><csymbol cd="ambiguous" id="A2.T3.4.4.1.m1.1.1.1.4.1.cmml" xref="A2.T3.4.4.1.m1.1.1.1.4">superscript</csymbol><ci id="A2.T3.4.4.1.m1.1.1.1.4.2.cmml" xref="A2.T3.4.4.1.m1.1.1.1.4.2">𝐶</ci><ci id="A2.T3.4.4.1.m1.1.1.1.1.1.1.cmml" xref="A2.T3.4.4.1.m1.1.1.1.1.1.1">𝑖</ci></apply></apply></apply><apply id="A2.T3.4.4.1.m1.2.2.1.1.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1"><times id="A2.T3.4.4.1.m1.2.2.1.1.1.2.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1.2"></times><ci id="A2.T3.4.4.1.m1.2.2.1.1.1.3.cmml" 
xref="A2.T3.4.4.1.m1.2.2.1.1.1.3">𝑃</ci><apply id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1"><csymbol cd="latexml" id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.1">conditional</csymbol><ci id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.2.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.2">𝑐</ci><apply id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.1.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3">subscript</csymbol><ci id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.2a.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.2"><mtext id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.2.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.2">context</mtext></ci><ci id="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.3.cmml" xref="A2.T3.4.4.1.m1.2.2.1.1.1.1.1.1.3.3">𝑖</ci></apply></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.4.4.1.m1.2c">\frac{1}{N}\sum_{i=1}^{N}\sum_{c\in C^{(i)}}P(c\mid\text{context}_{i})</annotation><annotation encoding="application/x-llamapun" id="A2.T3.4.4.1.m1.2d">divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_c ∈ italic_C start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_P ( italic_c ∣ context start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )</annotation></semantics></math></td>
</tr>
<tr class="ltx_tr" id="A2.T3.5.5">
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.5.5.2"><span class="ltx_text ltx_font_smallcaps" id="A2.T3.5.5.2.1">Accuracy</span></td>
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.5.5.1"><math alttext="\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\big{(}\arg\max_{c\in C^{(i)}}P(c\mid\text{%
context}_{i})=c_{\text{correct}}^{(i)}\big{)}" class="ltx_Math" display="inline" id="A2.T3.5.5.1.m1.3"><semantics id="A2.T3.5.5.1.m1.3a"><mrow id="A2.T3.5.5.1.m1.3.3" xref="A2.T3.5.5.1.m1.3.3.cmml"><mfrac id="A2.T3.5.5.1.m1.3.3.3" xref="A2.T3.5.5.1.m1.3.3.3.cmml"><mn id="A2.T3.5.5.1.m1.3.3.3.2" xref="A2.T3.5.5.1.m1.3.3.3.2.cmml">1</mn><mi id="A2.T3.5.5.1.m1.3.3.3.3" xref="A2.T3.5.5.1.m1.3.3.3.3.cmml">N</mi></mfrac><mo id="A2.T3.5.5.1.m1.3.3.2" xref="A2.T3.5.5.1.m1.3.3.2.cmml">⁢</mo><mrow id="A2.T3.5.5.1.m1.3.3.1" xref="A2.T3.5.5.1.m1.3.3.1.cmml"><msubsup id="A2.T3.5.5.1.m1.3.3.1.2" xref="A2.T3.5.5.1.m1.3.3.1.2.cmml"><mo id="A2.T3.5.5.1.m1.3.3.1.2.2.2" xref="A2.T3.5.5.1.m1.3.3.1.2.2.2.cmml">∑</mo><mrow id="A2.T3.5.5.1.m1.3.3.1.2.2.3" xref="A2.T3.5.5.1.m1.3.3.1.2.2.3.cmml"><mi id="A2.T3.5.5.1.m1.3.3.1.2.2.3.2" xref="A2.T3.5.5.1.m1.3.3.1.2.2.3.2.cmml">i</mi><mo id="A2.T3.5.5.1.m1.3.3.1.2.2.3.1" xref="A2.T3.5.5.1.m1.3.3.1.2.2.3.1.cmml">=</mo><mn id="A2.T3.5.5.1.m1.3.3.1.2.2.3.3" xref="A2.T3.5.5.1.m1.3.3.1.2.2.3.3.cmml">1</mn></mrow><mi id="A2.T3.5.5.1.m1.3.3.1.2.3" xref="A2.T3.5.5.1.m1.3.3.1.2.3.cmml">N</mi></msubsup><mrow id="A2.T3.5.5.1.m1.3.3.1.1" xref="A2.T3.5.5.1.m1.3.3.1.1.cmml"><mi id="A2.T3.5.5.1.m1.3.3.1.1.3" xref="A2.T3.5.5.1.m1.3.3.1.1.3.cmml">𝕀</mi><mo id="A2.T3.5.5.1.m1.3.3.1.1.2" xref="A2.T3.5.5.1.m1.3.3.1.1.2.cmml">⁢</mo><mrow id="A2.T3.5.5.1.m1.3.3.1.1.1.1" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.cmml"><mo id="A2.T3.5.5.1.m1.3.3.1.1.1.1.2" maxsize="120%" minsize="120%" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.cmml">(</mo><mrow id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.cmml"><mrow id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.cmml"><mrow id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.cmml"><mi id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.1" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.1.cmml">arg</mi><mo id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3a" lspace="0.167em" 
xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.cmml">⁡</mo><mrow id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.cmml"><msub id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1.cmml"><mi id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1.2" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1.2.cmml">max</mi><mrow id="A2.T3.5.5.1.m1.1.1.1" xref="A2.T3.5.5.1.m1.1.1.1.cmml"><mi id="A2.T3.5.5.1.m1.1.1.1.3" xref="A2.T3.5.5.1.m1.1.1.1.3.cmml">c</mi><mo id="A2.T3.5.5.1.m1.1.1.1.2" xref="A2.T3.5.5.1.m1.1.1.1.2.cmml">∈</mo><msup id="A2.T3.5.5.1.m1.1.1.1.4" xref="A2.T3.5.5.1.m1.1.1.1.4.cmml"><mi id="A2.T3.5.5.1.m1.1.1.1.4.2" xref="A2.T3.5.5.1.m1.1.1.1.4.2.cmml">C</mi><mrow id="A2.T3.5.5.1.m1.1.1.1.1.1.3" xref="A2.T3.5.5.1.m1.1.1.1.4.cmml"><mo id="A2.T3.5.5.1.m1.1.1.1.1.1.3.1" stretchy="false" xref="A2.T3.5.5.1.m1.1.1.1.4.cmml">(</mo><mi id="A2.T3.5.5.1.m1.1.1.1.1.1.1" xref="A2.T3.5.5.1.m1.1.1.1.1.1.1.cmml">i</mi><mo id="A2.T3.5.5.1.m1.1.1.1.1.1.3.2" stretchy="false" xref="A2.T3.5.5.1.m1.1.1.1.4.cmml">)</mo></mrow></msup></mrow></msub><mo id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2a" lspace="0.167em" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.cmml">⁡</mo><mi id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.2" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.2.cmml">P</mi></mrow></mrow><mo id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.2" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.2.cmml">⁢</mo><mrow id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml"><mo id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.2" stretchy="false" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml">(</mo><mrow id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml"><mi id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.2" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.2.cmml">c</mi><mo id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.1" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.1.cmml">∣</mo><msub id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3" 
xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.cmml"><mtext id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.2" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.2a.cmml">context</mtext><mi id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.3" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.3.cmml">i</mi></msub></mrow><mo id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.3" stretchy="false" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.2" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.2.cmml">=</mo><msubsup id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.cmml"><mi id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.2" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.2.cmml">c</mi><mtext id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.3" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.3a.cmml">correct</mtext><mrow id="A2.T3.5.5.1.m1.2.2.1.3" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.cmml"><mo id="A2.T3.5.5.1.m1.2.2.1.3.1" stretchy="false" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.cmml">(</mo><mi id="A2.T3.5.5.1.m1.2.2.1.1" xref="A2.T3.5.5.1.m1.2.2.1.1.cmml">i</mi><mo id="A2.T3.5.5.1.m1.2.2.1.3.2" stretchy="false" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.cmml">)</mo></mrow></msubsup></mrow><mo id="A2.T3.5.5.1.m1.3.3.1.1.1.1.3" maxsize="120%" minsize="120%" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.5.5.1.m1.3b"><apply id="A2.T3.5.5.1.m1.3.3.cmml" xref="A2.T3.5.5.1.m1.3.3"><times id="A2.T3.5.5.1.m1.3.3.2.cmml" xref="A2.T3.5.5.1.m1.3.3.2"></times><apply id="A2.T3.5.5.1.m1.3.3.3.cmml" xref="A2.T3.5.5.1.m1.3.3.3"><divide id="A2.T3.5.5.1.m1.3.3.3.1.cmml" xref="A2.T3.5.5.1.m1.3.3.3"></divide><cn id="A2.T3.5.5.1.m1.3.3.3.2.cmml" type="integer" xref="A2.T3.5.5.1.m1.3.3.3.2">1</cn><ci id="A2.T3.5.5.1.m1.3.3.3.3.cmml" xref="A2.T3.5.5.1.m1.3.3.3.3">𝑁</ci></apply><apply id="A2.T3.5.5.1.m1.3.3.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1"><apply id="A2.T3.5.5.1.m1.3.3.1.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.2"><csymbol 
cd="ambiguous" id="A2.T3.5.5.1.m1.3.3.1.2.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.2">superscript</csymbol><apply id="A2.T3.5.5.1.m1.3.3.1.2.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.2"><csymbol cd="ambiguous" id="A2.T3.5.5.1.m1.3.3.1.2.2.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.2">subscript</csymbol><sum id="A2.T3.5.5.1.m1.3.3.1.2.2.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.2.2.2"></sum><apply id="A2.T3.5.5.1.m1.3.3.1.2.2.3.cmml" xref="A2.T3.5.5.1.m1.3.3.1.2.2.3"><eq id="A2.T3.5.5.1.m1.3.3.1.2.2.3.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.2.2.3.1"></eq><ci id="A2.T3.5.5.1.m1.3.3.1.2.2.3.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.2.2.3.2">𝑖</ci><cn id="A2.T3.5.5.1.m1.3.3.1.2.2.3.3.cmml" type="integer" xref="A2.T3.5.5.1.m1.3.3.1.2.2.3.3">1</cn></apply></apply><ci id="A2.T3.5.5.1.m1.3.3.1.2.3.cmml" xref="A2.T3.5.5.1.m1.3.3.1.2.3">𝑁</ci></apply><apply id="A2.T3.5.5.1.m1.3.3.1.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1"><times id="A2.T3.5.5.1.m1.3.3.1.1.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.2"></times><ci id="A2.T3.5.5.1.m1.3.3.1.1.3.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.3">𝕀</ci><apply id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1"><eq id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.2"></eq><apply id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1"><times id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.2"></times><apply id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3"><arg id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.1"></arg><apply id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2"><apply id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1"><csymbol cd="ambiguous" id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1">subscript</csymbol><max id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1.2.cmml" 
xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.1.2"></max><apply id="A2.T3.5.5.1.m1.1.1.1.cmml" xref="A2.T3.5.5.1.m1.1.1.1"><in id="A2.T3.5.5.1.m1.1.1.1.2.cmml" xref="A2.T3.5.5.1.m1.1.1.1.2"></in><ci id="A2.T3.5.5.1.m1.1.1.1.3.cmml" xref="A2.T3.5.5.1.m1.1.1.1.3">𝑐</ci><apply id="A2.T3.5.5.1.m1.1.1.1.4.cmml" xref="A2.T3.5.5.1.m1.1.1.1.4"><csymbol cd="ambiguous" id="A2.T3.5.5.1.m1.1.1.1.4.1.cmml" xref="A2.T3.5.5.1.m1.1.1.1.4">superscript</csymbol><ci id="A2.T3.5.5.1.m1.1.1.1.4.2.cmml" xref="A2.T3.5.5.1.m1.1.1.1.4.2">𝐶</ci><ci id="A2.T3.5.5.1.m1.1.1.1.1.1.1.cmml" xref="A2.T3.5.5.1.m1.1.1.1.1.1.1">𝑖</ci></apply></apply></apply><ci id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.3.2.2">𝑃</ci></apply></apply><apply id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1"><csymbol cd="latexml" id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.1">conditional</csymbol><ci id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.2">𝑐</ci><apply id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.2a.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.2"><mtext id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.2">context</mtext></ci><ci id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.3.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.1.1.1.1.3.3">𝑖</ci></apply></apply></apply><apply id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3">superscript</csymbol><apply id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3"><csymbol 
cd="ambiguous" id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.1.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3">subscript</csymbol><ci id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.2.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.2">𝑐</ci><ci id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.3a.cmml" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.3"><mtext id="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.3.cmml" mathsize="70%" xref="A2.T3.5.5.1.m1.3.3.1.1.1.1.1.3.2.3">correct</mtext></ci></apply><ci id="A2.T3.5.5.1.m1.2.2.1.1.cmml" xref="A2.T3.5.5.1.m1.2.2.1.1">𝑖</ci></apply></apply></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.5.5.1.m1.3c">\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\big{(}\arg\max_{c\in C^{(i)}}P(c\mid\text{%
context}_{i})=c_{\text{correct}}^{(i)}\big{)}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.5.5.1.m1.3d">divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT blackboard_I ( roman_arg roman_max start_POSTSUBSCRIPT italic_c ∈ italic_C start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_P ( italic_c ∣ context start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_c start_POSTSUBSCRIPT correct end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT )</annotation></semantics></math></td>
</tr>
<tr class="ltx_tr" id="A2.T3.6.6">
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.6.6.2"><span class="ltx_text ltx_font_sansserif" id="A2.T3.6.6.2.1">*_per_token</span></td>
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.6.6.1"><math alttext="\nicefrac{{P(c\mid\text{context})}}{{\text{tokens}(c)}}" class="ltx_Math" display="inline" id="A2.T3.6.6.1.m1.2"><semantics id="A2.T3.6.6.1.m1.2a"><mrow id="A2.T3.6.6.1.m1.2.2" xref="A2.T3.6.6.1.m1.2.2.cmml"><mpadded id="A2.T3.6.6.1.m1.1.1.1" voffset="0.3em" xref="A2.T3.6.6.1.m1.1.1.1.cmml"><mi id="A2.T3.6.6.1.m1.1.1.1.3" mathsize="70%" xref="A2.T3.6.6.1.m1.1.1.1.3.cmml">P</mi><mo id="A2.T3.6.6.1.m1.1.1.1.2" xref="A2.T3.6.6.1.m1.1.1.1.2.cmml">⁢</mo><mrow id="A2.T3.6.6.1.m1.1.1.1.1.1" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.cmml"><mo id="A2.T3.6.6.1.m1.1.1.1.1.1.2" maxsize="70%" minsize="70%" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="A2.T3.6.6.1.m1.1.1.1.1.1.1" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.cmml"><mi id="A2.T3.6.6.1.m1.1.1.1.1.1.1.2" mathsize="70%" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.2.cmml">c</mi><mo id="A2.T3.6.6.1.m1.1.1.1.1.1.1.1" mathsize="70%" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.1.cmml">∣</mo><mtext id="A2.T3.6.6.1.m1.1.1.1.1.1.1.3" mathsize="70%" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.3a.cmml">context</mtext></mrow><mo id="A2.T3.6.6.1.m1.1.1.1.1.1.3" maxsize="70%" minsize="70%" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mpadded><mpadded id="A2.T3.6.6.1.m1.2.2.3" lspace="-0.1em" width="-0.15em" xref="A2.T3.6.6.1.m1.2.2.3.cmml"><mo id="A2.T3.6.6.1.m1.2.2.3a" stretchy="true" symmetric="true" xref="A2.T3.6.6.1.m1.2.2.3.cmml">/</mo></mpadded><mrow id="A2.T3.6.6.1.m1.2.2.2" xref="A2.T3.6.6.1.m1.2.2.2.cmml"><mtext id="A2.T3.6.6.1.m1.2.2.2.3" mathsize="70%" xref="A2.T3.6.6.1.m1.2.2.2.3a.cmml">tokens</mtext><mo id="A2.T3.6.6.1.m1.2.2.2.2" xref="A2.T3.6.6.1.m1.2.2.2.2.cmml">⁢</mo><mrow id="A2.T3.6.6.1.m1.2.2.2.4.2" xref="A2.T3.6.6.1.m1.2.2.2.cmml"><mo id="A2.T3.6.6.1.m1.2.2.2.4.2.1" maxsize="70%" minsize="70%" xref="A2.T3.6.6.1.m1.2.2.2.cmml">(</mo><mi id="A2.T3.6.6.1.m1.2.2.2.1" mathsize="70%" xref="A2.T3.6.6.1.m1.2.2.2.1.cmml">c</mi><mo id="A2.T3.6.6.1.m1.2.2.2.4.2.2" maxsize="70%" 
minsize="70%" xref="A2.T3.6.6.1.m1.2.2.2.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.6.6.1.m1.2b"><apply id="A2.T3.6.6.1.m1.2.2.cmml" xref="A2.T3.6.6.1.m1.2.2"><divide id="A2.T3.6.6.1.m1.2.2.3.cmml" xref="A2.T3.6.6.1.m1.2.2.3"></divide><apply id="A2.T3.6.6.1.m1.1.1.1.cmml" xref="A2.T3.6.6.1.m1.1.1.1"><times id="A2.T3.6.6.1.m1.1.1.1.2.cmml" xref="A2.T3.6.6.1.m1.1.1.1.2"></times><ci id="A2.T3.6.6.1.m1.1.1.1.3.cmml" xref="A2.T3.6.6.1.m1.1.1.1.3">𝑃</ci><apply id="A2.T3.6.6.1.m1.1.1.1.1.1.1.cmml" xref="A2.T3.6.6.1.m1.1.1.1.1.1"><csymbol cd="latexml" id="A2.T3.6.6.1.m1.1.1.1.1.1.1.1.cmml" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.1">conditional</csymbol><ci id="A2.T3.6.6.1.m1.1.1.1.1.1.1.2.cmml" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.2">𝑐</ci><ci id="A2.T3.6.6.1.m1.1.1.1.1.1.1.3a.cmml" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.3"><mtext id="A2.T3.6.6.1.m1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="A2.T3.6.6.1.m1.1.1.1.1.1.1.3">context</mtext></ci></apply></apply><apply id="A2.T3.6.6.1.m1.2.2.2.cmml" xref="A2.T3.6.6.1.m1.2.2.2"><times id="A2.T3.6.6.1.m1.2.2.2.2.cmml" xref="A2.T3.6.6.1.m1.2.2.2.2"></times><ci id="A2.T3.6.6.1.m1.2.2.2.3a.cmml" xref="A2.T3.6.6.1.m1.2.2.2.3"><mtext id="A2.T3.6.6.1.m1.2.2.2.3.cmml" mathsize="70%" xref="A2.T3.6.6.1.m1.2.2.2.3">tokens</mtext></ci><ci id="A2.T3.6.6.1.m1.2.2.2.1.cmml" xref="A2.T3.6.6.1.m1.2.2.2.1">𝑐</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.6.6.1.m1.2c">\nicefrac{{P(c\mid\text{context})}}{{\text{tokens}(c)}}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.6.6.1.m1.2d">/ start_ARG italic_P ( italic_c ∣ context ) end_ARG start_ARG tokens ( italic_c ) end_ARG</annotation></semantics></math></td>
</tr>
<tr class="ltx_tr" id="A2.T3.7.7">
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.7.7.2"><span class="ltx_text ltx_font_sansserif" id="A2.T3.7.7.2.1">*_per_char</span></td>
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T3.7.7.1"><math alttext="\nicefrac{{P(c\mid\text{context})}}{{\text{chars}(c)}}" class="ltx_Math" display="inline" id="A2.T3.7.7.1.m1.2"><semantics id="A2.T3.7.7.1.m1.2a"><mrow id="A2.T3.7.7.1.m1.2.2" xref="A2.T3.7.7.1.m1.2.2.cmml"><mpadded id="A2.T3.7.7.1.m1.1.1.1" voffset="0.3em" xref="A2.T3.7.7.1.m1.1.1.1.cmml"><mi id="A2.T3.7.7.1.m1.1.1.1.3" mathsize="70%" xref="A2.T3.7.7.1.m1.1.1.1.3.cmml">P</mi><mo id="A2.T3.7.7.1.m1.1.1.1.2" xref="A2.T3.7.7.1.m1.1.1.1.2.cmml">⁢</mo><mrow id="A2.T3.7.7.1.m1.1.1.1.1.1" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.cmml"><mo id="A2.T3.7.7.1.m1.1.1.1.1.1.2" maxsize="70%" minsize="70%" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.cmml">(</mo><mrow id="A2.T3.7.7.1.m1.1.1.1.1.1.1" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.cmml"><mi id="A2.T3.7.7.1.m1.1.1.1.1.1.1.2" mathsize="70%" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.2.cmml">c</mi><mo id="A2.T3.7.7.1.m1.1.1.1.1.1.1.1" mathsize="70%" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.1.cmml">∣</mo><mtext id="A2.T3.7.7.1.m1.1.1.1.1.1.1.3" mathsize="70%" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.3a.cmml">context</mtext></mrow><mo id="A2.T3.7.7.1.m1.1.1.1.1.1.3" maxsize="70%" minsize="70%" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.cmml">)</mo></mrow></mpadded><mpadded id="A2.T3.7.7.1.m1.2.2.3" lspace="-0.1em" width="-0.15em" xref="A2.T3.7.7.1.m1.2.2.3.cmml"><mo id="A2.T3.7.7.1.m1.2.2.3a" stretchy="true" symmetric="true" xref="A2.T3.7.7.1.m1.2.2.3.cmml">/</mo></mpadded><mrow id="A2.T3.7.7.1.m1.2.2.2" xref="A2.T3.7.7.1.m1.2.2.2.cmml"><mtext id="A2.T3.7.7.1.m1.2.2.2.3" mathsize="70%" xref="A2.T3.7.7.1.m1.2.2.2.3a.cmml">chars</mtext><mo id="A2.T3.7.7.1.m1.2.2.2.2" xref="A2.T3.7.7.1.m1.2.2.2.2.cmml">⁢</mo><mrow id="A2.T3.7.7.1.m1.2.2.2.4.2" xref="A2.T3.7.7.1.m1.2.2.2.cmml"><mo id="A2.T3.7.7.1.m1.2.2.2.4.2.1" maxsize="70%" minsize="70%" xref="A2.T3.7.7.1.m1.2.2.2.cmml">(</mo><mi id="A2.T3.7.7.1.m1.2.2.2.1" mathsize="70%" xref="A2.T3.7.7.1.m1.2.2.2.1.cmml">c</mi><mo id="A2.T3.7.7.1.m1.2.2.2.4.2.2" maxsize="70%" 
minsize="70%" xref="A2.T3.7.7.1.m1.2.2.2.cmml">)</mo></mrow></mrow></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.7.7.1.m1.2b"><apply id="A2.T3.7.7.1.m1.2.2.cmml" xref="A2.T3.7.7.1.m1.2.2"><divide id="A2.T3.7.7.1.m1.2.2.3.cmml" xref="A2.T3.7.7.1.m1.2.2.3"></divide><apply id="A2.T3.7.7.1.m1.1.1.1.cmml" xref="A2.T3.7.7.1.m1.1.1.1"><times id="A2.T3.7.7.1.m1.1.1.1.2.cmml" xref="A2.T3.7.7.1.m1.1.1.1.2"></times><ci id="A2.T3.7.7.1.m1.1.1.1.3.cmml" xref="A2.T3.7.7.1.m1.1.1.1.3">𝑃</ci><apply id="A2.T3.7.7.1.m1.1.1.1.1.1.1.cmml" xref="A2.T3.7.7.1.m1.1.1.1.1.1"><csymbol cd="latexml" id="A2.T3.7.7.1.m1.1.1.1.1.1.1.1.cmml" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.1">conditional</csymbol><ci id="A2.T3.7.7.1.m1.1.1.1.1.1.1.2.cmml" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.2">𝑐</ci><ci id="A2.T3.7.7.1.m1.1.1.1.1.1.1.3a.cmml" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.3"><mtext id="A2.T3.7.7.1.m1.1.1.1.1.1.1.3.cmml" mathsize="70%" xref="A2.T3.7.7.1.m1.1.1.1.1.1.1.3">context</mtext></ci></apply></apply><apply id="A2.T3.7.7.1.m1.2.2.2.cmml" xref="A2.T3.7.7.1.m1.2.2.2"><times id="A2.T3.7.7.1.m1.2.2.2.2.cmml" xref="A2.T3.7.7.1.m1.2.2.2.2"></times><ci id="A2.T3.7.7.1.m1.2.2.2.3a.cmml" xref="A2.T3.7.7.1.m1.2.2.2.3"><mtext id="A2.T3.7.7.1.m1.2.2.2.3.cmml" mathsize="70%" xref="A2.T3.7.7.1.m1.2.2.2.3">chars</mtext></ci><ci id="A2.T3.7.7.1.m1.2.2.2.1.cmml" xref="A2.T3.7.7.1.m1.2.2.2.1">𝑐</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.7.7.1.m1.2c">\nicefrac{{P(c\mid\text{context})}}{{\text{chars}(c)}}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.7.7.1.m1.2d">/ start_ARG italic_P ( italic_c ∣ context ) end_ARG start_ARG chars ( italic_c ) end_ARG</annotation></semantics></math></td>
</tr>
</table>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 3: </span>Proxy metrics used as alternative inputs to our prediction methods; <math alttext="C^{(i)}" class="ltx_Math" display="inline" id="A2.T3.11.m1.1"><semantics id="A2.T3.11.m1.1b"><msup id="A2.T3.11.m1.1.2" xref="A2.T3.11.m1.1.2.cmml"><mi id="A2.T3.11.m1.1.2.2" xref="A2.T3.11.m1.1.2.2.cmml">C</mi><mrow id="A2.T3.11.m1.1.1.1.3" xref="A2.T3.11.m1.1.2.cmml"><mo id="A2.T3.11.m1.1.1.1.3.1" stretchy="false" xref="A2.T3.11.m1.1.2.cmml">(</mo><mi id="A2.T3.11.m1.1.1.1.1" xref="A2.T3.11.m1.1.1.1.1.cmml">i</mi><mo id="A2.T3.11.m1.1.1.1.3.2" stretchy="false" xref="A2.T3.11.m1.1.2.cmml">)</mo></mrow></msup><annotation-xml encoding="MathML-Content" id="A2.T3.11.m1.1c"><apply id="A2.T3.11.m1.1.2.cmml" xref="A2.T3.11.m1.1.2"><csymbol cd="ambiguous" id="A2.T3.11.m1.1.2.1.cmml" xref="A2.T3.11.m1.1.2">superscript</csymbol><ci id="A2.T3.11.m1.1.2.2.cmml" xref="A2.T3.11.m1.1.2.2">𝐶</ci><ci id="A2.T3.11.m1.1.1.1.1.cmml" xref="A2.T3.11.m1.1.1.1.1">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.11.m1.1d">C^{(i)}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.11.m1.1e">italic_C start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT</annotation></semantics></math> is the set of possible continuations for item <math alttext="i" class="ltx_Math" display="inline" id="A2.T3.12.m2.1"><semantics id="A2.T3.12.m2.1b"><mi id="A2.T3.12.m2.1.1" xref="A2.T3.12.m2.1.1.cmml">i</mi><annotation-xml encoding="MathML-Content" id="A2.T3.12.m2.1c"><ci id="A2.T3.12.m2.1.1.cmml" xref="A2.T3.12.m2.1.1">𝑖</ci></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.12.m2.1d">i</annotation><annotation encoding="application/x-llamapun" id="A2.T3.12.m2.1e">italic_i</annotation></semantics></math> and <math alttext="N" class="ltx_Math" display="inline" id="A2.T3.13.m3.1"><semantics id="A2.T3.13.m3.1b"><mi id="A2.T3.13.m3.1.1" 
xref="A2.T3.13.m3.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="A2.T3.13.m3.1c"><ci id="A2.T3.13.m3.1.1.cmml" xref="A2.T3.13.m3.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.13.m3.1d">N</annotation><annotation encoding="application/x-llamapun" id="A2.T3.13.m3.1e">italic_N</annotation></semantics></math> is the number of items in a benchmark. Each of the first 5 metrics has <span class="ltx_text" id="A2.T3.16.1">*_per_token</span> and <span class="ltx_text" id="A2.T3.17.2">*_per_char</span> variants in which likelihoods are normalized as defined in the bottom two rows.
</figcaption>
</figure>
<div class="ltx_para ltx_noindent" id="A2.p1">
<p class="ltx_p" id="A2.p1.1">Table <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A2.T3" title="Table 3 ‣ Appendix B Proxy Metric Definitions ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">3</span></a> provides formal definitions for our proxy metrics (§<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS5" title="2.5 Proxy Metrics for Performance Evaluation ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2.5</span></a>).</p>
</div>
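As a minimal illustration of the <span class="ltx_text ltx_font_smallcaps">Accuracy</span> definition in Table 3 and its <span class="ltx_text">*_per_token</span> / <span class="ltx_text">*_per_char</span> normalizations, the sketch below scores hypothetical candidate continuations from per-candidate log-likelihoods; the items, log-probabilities, and token/character counts are made up for illustration, not taken from the paper.

```python
import math

# Hypothetical per-item data: each candidate continuation carries its total
# log-likelihood under the model, its token count, and its character count.
items = [
    {"cands": [("Paris", -1.2, 1, 5), ("London", -3.4, 2, 6)], "correct": 0},
    {"cands": [("blue", -2.0, 1, 4), ("red", -1.5, 1, 3)], "correct": 0},
]

def accuracy(items, normalize=None):
    """Fraction of items whose highest-scoring candidate is the correct one.

    normalize=None    -> raw likelihood P(c | context)
    normalize="token" -> *_per_token variant, P(c | context) / tokens(c)
    normalize="char"  -> *_per_char variant,  P(c | context) / chars(c)
    """
    hits = 0
    for item in items:
        scores = []
        for _text, logp, n_tok, n_char in item["cands"]:
            p = math.exp(logp)  # log-likelihood -> likelihood
            if normalize == "token":
                p /= n_tok
            elif normalize == "char":
                p /= n_char
            scores.append(p)
        # argmax over the candidate set C^(i)
        if max(range(len(scores)), key=scores.__getitem__) == item["correct"]:
            hits += 1
    return hits / len(items)  # (1/N) sum of indicator values

print(accuracy(items))
print(accuracy(items, normalize="char"))
```

The length normalizations counteract the bias toward short continuations, which otherwise receive higher raw likelihoods.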
<figure class="ltx_table" id="A2.T4">
<table class="ltx_tabular ltx_centering ltx_align_middle" id="A2.T4.2">
<tr class="ltx_tr" id="A2.T4.2.3">
<td class="ltx_td ltx_border_tt" id="A2.T4.2.3.1"></td>
<td class="ltx_td ltx_align_right ltx_border_tt" id="A2.T4.2.3.2">Relative Error</td>
<td class="ltx_td ltx_align_right ltx_border_tt" id="A2.T4.2.3.3">Absolute Error</td>
</tr>
<tr class="ltx_tr" id="A2.T4.2.4">
<td class="ltx_td ltx_align_left" id="A2.T4.2.4.1">Scaling Law Variant</td>
<td class="ltx_td" id="A2.T4.2.4.2"></td>
<td class="ltx_td" id="A2.T4.2.4.3"></td>
</tr>
<tr class="ltx_tr" id="A2.T4.1.1">
<td class="ltx_td ltx_align_left ltx_border_t" id="A2.T4.1.1.1">3-parameter with helpers and <math alttext="&gt;" class="ltx_Math" display="inline" id="A2.T4.1.1.1.m1.1"><semantics id="A2.T4.1.1.1.m1.1a"><mo id="A2.T4.1.1.1.m1.1.1" xref="A2.T4.1.1.1.m1.1.1.cmml">&gt;</mo><annotation-xml encoding="MathML-Content" id="A2.T4.1.1.1.m1.1b"><gt id="A2.T4.1.1.1.m1.1.1.cmml" xref="A2.T4.1.1.1.m1.1.1"></gt></annotation-xml><annotation encoding="application/x-tex" id="A2.T4.1.1.1.m1.1c">&gt;</annotation><annotation encoding="application/x-llamapun" id="A2.T4.1.1.1.m1.1d">&gt;</annotation></semantics></math>50% checkpoints</td>
<td class="ltx_td ltx_align_right ltx_border_t" id="A2.T4.1.1.2">5.6</td>
<td class="ltx_td ltx_align_right ltx_border_t" id="A2.T4.1.1.3">2.6</td>
</tr>
<tr class="ltx_tr" id="A2.T4.2.5">
<td class="ltx_td ltx_align_left" id="A2.T4.2.5.1">3-parameter with helper points</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.5.2">6.0</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.5.3">2.8</td>
</tr>
<tr class="ltx_tr" id="A2.T4.2.2">
<td class="ltx_td ltx_align_left" id="A2.T4.2.2.1">3-parameter step 2 fit with <math alttext="&gt;" class="ltx_Math" display="inline" id="A2.T4.2.2.1.m1.1"><semantics id="A2.T4.2.2.1.m1.1a"><mo id="A2.T4.2.2.1.m1.1.1" xref="A2.T4.2.2.1.m1.1.1.cmml">&gt;</mo><annotation-xml encoding="MathML-Content" id="A2.T4.2.2.1.m1.1b"><gt id="A2.T4.2.2.1.m1.1.1.cmml" xref="A2.T4.2.2.1.m1.1.1"></gt></annotation-xml><annotation encoding="application/x-tex" id="A2.T4.2.2.1.m1.1c">&gt;</annotation><annotation encoding="application/x-llamapun" id="A2.T4.2.2.1.m1.1d">&gt;</annotation></semantics></math>50% checkpoints</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.2.2">5.9</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.2.3">2.9</td>
</tr>
<tr class="ltx_tr" id="A2.T4.2.6">
<td class="ltx_td ltx_align_left" id="A2.T4.2.6.1">3-parameter</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.6.2">6.5</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.6.3">3.1</td>
</tr>
<tr class="ltx_tr" id="A2.T4.2.7">
<td class="ltx_td ltx_align_left" id="A2.T4.2.7.1">2-parameter</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.7.2">6.5</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.7.3">3.2</td>
</tr>
<tr class="ltx_tr" id="A2.T4.2.8">
<td class="ltx_td ltx_align_left" id="A2.T4.2.8.1">5-parameter, single step</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.8.2">42.8</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.8.3">17.4</td>
</tr>
<tr class="ltx_tr" id="A2.T4.2.9">
<td class="ltx_td ltx_align_left" id="A2.T4.2.9.1">3-parameter, single step</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.9.2">42.9</td>
<td class="ltx_td ltx_align_right" id="A2.T4.2.9.3">42.3</td>
</tr>
<tr class="ltx_tr" id="A2.T4.2.10">
<td class="ltx_td ltx_align_left ltx_border_bb" id="A2.T4.2.10.1">5-parameter</td>
<td class="ltx_td ltx_align_right ltx_border_bb" id="A2.T4.2.10.2">230.8</td>
<td class="ltx_td ltx_align_right ltx_border_bb" id="A2.T4.2.10.3">65.4</td>
</tr>
</table>
<figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 4: </span>Average prediction error for 1B targets for the different scaling law setups across tasks and recipes on <span class="ltx_text ltx_font_smallcaps" id="A2.T4.4.1">Accuracy</span> fit to all models but 1B. We see that, aside from the single-step and 5-parameter variants, errors are comparable, and these comparable variants also roughly follow the compute-decision frontier in Figure <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S3.F3" title="Figure 3 ‣ 3.1 What is the best way to spend compute for data decisions? ‣ 3 Results ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">3</span></a>. </figcaption>
</figure>
</section>
<section class="ltx_appendix" id="A3">
<h2 class="ltx_title ltx_title_appendix">
<span class="ltx_tag ltx_tag_appendix">Appendix C </span>Scaling Law Variants</h2>
<div class="ltx_para ltx_noindent" id="A3.p1">
<p class="ltx_p" id="A3.p1.5"><span class="ltx_text ltx_font_bold" id="A3.p1.5.1">Baseline 3-parameter fit.</span>
Our default setup (described in §<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.SS2" title="2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2.2</span></a>) follows the two-step fit from <cite class="ltx_cite ltx_citemacro_citep">(Bhagia et al., <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>)</cite> and uses Equation <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.E1" title="In Extrapolating Scaling Laws (Multi Scale) ‣ 2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">1</span></a> to map compute <math alttext="C" class="ltx_Math" display="inline" id="A3.p1.1.m1.1"><semantics id="A3.p1.1.m1.1a"><mi id="A3.p1.1.m1.1.1" xref="A3.p1.1.m1.1.1.cmml">C</mi><annotation-xml encoding="MathML-Content" id="A3.p1.1.m1.1b"><ci id="A3.p1.1.m1.1.1.cmml" xref="A3.p1.1.m1.1.1">𝐶</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p1.1.m1.1c">C</annotation><annotation encoding="application/x-llamapun" id="A3.p1.1.m1.1d">italic_C</annotation></semantics></math> to task loss <math alttext="L" class="ltx_Math" display="inline" id="A3.p1.2.m2.1"><semantics id="A3.p1.2.m2.1a"><mi id="A3.p1.2.m2.1.1" xref="A3.p1.2.m2.1.1.cmml">L</mi><annotation-xml encoding="MathML-Content" id="A3.p1.2.m2.1b"><ci id="A3.p1.2.m2.1.1.cmml" xref="A3.p1.2.m2.1.1">𝐿</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p1.2.m2.1c">L</annotation><annotation encoding="application/x-llamapun" id="A3.p1.2.m2.1d">italic_L</annotation></semantics></math>, and Equation <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.E2" title="In Extrapolating Scaling Laws (Multi Scale) ‣ 2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2</span></a> to map task loss 
to metric score. This variant fits three parameters (<math alttext="A" class="ltx_Math" display="inline" id="A3.p1.3.m3.1"><semantics id="A3.p1.3.m3.1a"><mi id="A3.p1.3.m3.1.1" xref="A3.p1.3.m3.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="A3.p1.3.m3.1b"><ci id="A3.p1.3.m3.1.1.cmml" xref="A3.p1.3.m3.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p1.3.m3.1c">A</annotation><annotation encoding="application/x-llamapun" id="A3.p1.3.m3.1d">italic_A</annotation></semantics></math>, <math alttext="\alpha" class="ltx_Math" display="inline" id="A3.p1.4.m4.1"><semantics id="A3.p1.4.m4.1a"><mi id="A3.p1.4.m4.1.1" xref="A3.p1.4.m4.1.1.cmml">α</mi><annotation-xml encoding="MathML-Content" id="A3.p1.4.m4.1b"><ci id="A3.p1.4.m4.1.1.cmml" xref="A3.p1.4.m4.1.1">𝛼</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p1.4.m4.1c">\alpha</annotation><annotation encoding="application/x-llamapun" id="A3.p1.4.m4.1d">italic_α</annotation></semantics></math>, <math alttext="E" class="ltx_Math" display="inline" id="A3.p1.5.m5.1"><semantics id="A3.p1.5.m5.1a"><mi id="A3.p1.5.m5.1.1" xref="A3.p1.5.m5.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="A3.p1.5.m5.1b"><ci id="A3.p1.5.m5.1.1.cmml" xref="A3.p1.5.m5.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p1.5.m5.1c">E</annotation><annotation encoding="application/x-llamapun" id="A3.p1.5.m5.1d">italic_E</annotation></semantics></math>) in the first step.</p>
</div>
<div class="ltx_para ltx_noindent" id="A3.p2">
<p class="ltx_p" id="A3.p2.1"><span class="ltx_text ltx_font_bold" id="A3.p2.1.1">2-parameter fit.</span>
This is a restricted version of the baseline where the irreducible loss term <math alttext="E" class="ltx_Math" display="inline" id="A3.p2.1.m1.1"><semantics id="A3.p2.1.m1.1a"><mi id="A3.p2.1.m1.1.1" xref="A3.p2.1.m1.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="A3.p2.1.m1.1b"><ci id="A3.p2.1.m1.1.1.cmml" xref="A3.p2.1.m1.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p2.1.m1.1c">E</annotation><annotation encoding="application/x-llamapun" id="A3.p2.1.m1.1d">italic_E</annotation></semantics></math> is removed from Equation <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.E1" title="In Extrapolating Scaling Laws (Multi Scale) ‣ 2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">1</span></a>, leaving only two parameters:</p>
<table class="ltx_equation ltx_eqn_table" id="A3.E4">
<tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math alttext="L(C)=\frac{A}{C^{\alpha}}" class="ltx_Math" display="block" id="A3.E4.m1.1"><semantics id="A3.E4.m1.1a"><mrow id="A3.E4.m1.1.2" xref="A3.E4.m1.1.2.cmml"><mrow id="A3.E4.m1.1.2.2" xref="A3.E4.m1.1.2.2.cmml"><mi id="A3.E4.m1.1.2.2.2" xref="A3.E4.m1.1.2.2.2.cmml">L</mi><mo id="A3.E4.m1.1.2.2.1" xref="A3.E4.m1.1.2.2.1.cmml">⁢</mo><mrow id="A3.E4.m1.1.2.2.3.2" xref="A3.E4.m1.1.2.2.cmml"><mo id="A3.E4.m1.1.2.2.3.2.1" stretchy="false" xref="A3.E4.m1.1.2.2.cmml">(</mo><mi id="A3.E4.m1.1.1" xref="A3.E4.m1.1.1.cmml">C</mi><mo id="A3.E4.m1.1.2.2.3.2.2" stretchy="false" xref="A3.E4.m1.1.2.2.cmml">)</mo></mrow></mrow><mo id="A3.E4.m1.1.2.1" xref="A3.E4.m1.1.2.1.cmml">=</mo><mfrac id="A3.E4.m1.1.2.3" xref="A3.E4.m1.1.2.3.cmml"><mi id="A3.E4.m1.1.2.3.2" xref="A3.E4.m1.1.2.3.2.cmml">A</mi><msup id="A3.E4.m1.1.2.3.3" xref="A3.E4.m1.1.2.3.3.cmml"><mi id="A3.E4.m1.1.2.3.3.2" xref="A3.E4.m1.1.2.3.3.2.cmml">C</mi><mi id="A3.E4.m1.1.2.3.3.3" xref="A3.E4.m1.1.2.3.3.3.cmml">α</mi></msup></mfrac></mrow><annotation-xml encoding="MathML-Content" id="A3.E4.m1.1b"><apply id="A3.E4.m1.1.2.cmml" xref="A3.E4.m1.1.2"><eq id="A3.E4.m1.1.2.1.cmml" xref="A3.E4.m1.1.2.1"></eq><apply id="A3.E4.m1.1.2.2.cmml" xref="A3.E4.m1.1.2.2"><times id="A3.E4.m1.1.2.2.1.cmml" xref="A3.E4.m1.1.2.2.1"></times><ci id="A3.E4.m1.1.2.2.2.cmml" xref="A3.E4.m1.1.2.2.2">𝐿</ci><ci id="A3.E4.m1.1.1.cmml" xref="A3.E4.m1.1.1">𝐶</ci></apply><apply id="A3.E4.m1.1.2.3.cmml" xref="A3.E4.m1.1.2.3"><divide id="A3.E4.m1.1.2.3.1.cmml" xref="A3.E4.m1.1.2.3"></divide><ci id="A3.E4.m1.1.2.3.2.cmml" xref="A3.E4.m1.1.2.3.2">𝐴</ci><apply id="A3.E4.m1.1.2.3.3.cmml" xref="A3.E4.m1.1.2.3.3"><csymbol cd="ambiguous" id="A3.E4.m1.1.2.3.3.1.cmml" xref="A3.E4.m1.1.2.3.3">superscript</csymbol><ci id="A3.E4.m1.1.2.3.3.2.cmml" xref="A3.E4.m1.1.2.3.3.2">𝐶</ci><ci id="A3.E4.m1.1.2.3.3.3.cmml" xref="A3.E4.m1.1.2.3.3.3">𝛼</ci></apply></apply></apply></annotation-xml><annotation 
encoding="application/x-tex" id="A3.E4.m1.1c">L(C)=\frac{A}{C^{\alpha}}</annotation><annotation encoding="application/x-llamapun" id="A3.E4.m1.1d">italic_L ( italic_C ) = divide start_ARG italic_A end_ARG start_ARG italic_C start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT end_ARG</annotation></semantics></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(4)</span></td>
</tr></tbody>
</table>
</div>
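As a concrete illustration, the compute-only power law of Equation 4 can be fit with `scipy.optimize.curve_fit`. This is a minimal sketch, not the paper's fitting code; the compute/loss values are synthetic placeholders chosen only to exercise the functional form.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(C, A, alpha):
    # Equation (4): loss as a pure power law in training compute C.
    return A / C**alpha

# Synthetic (compute, loss) points -- illustrative values only.
C = np.array([1e15, 1e16, 1e17, 1e18, 1e19])
L = power_law(C, 400.0, 0.05)

# Recover A and alpha from the observations.
(A_hat, alpha_hat), _ = curve_fit(power_law, C, L, p0=[300.0, 0.06], maxfev=10000)
```

With noiseless data the fit recovers the generating parameters; in practice the observed losses at several compute budgets stand in for `L`.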
<div class="ltx_para ltx_noindent" id="A3.p3">
<p class="ltx_p" id="A3.p3.4"><span class="ltx_text ltx_font_bold" id="A3.p3.1.1">5-parameter <math alttext="(N,D)" class="ltx_Math" display="inline" id="A3.p3.1.1.m1.2"><semantics id="A3.p3.1.1.m1.2a"><mrow id="A3.p3.1.1.m1.2.3.2" xref="A3.p3.1.1.m1.2.3.1.cmml"><mo id="A3.p3.1.1.m1.2.3.2.1" stretchy="false" xref="A3.p3.1.1.m1.2.3.1.cmml">(</mo><mi id="A3.p3.1.1.m1.1.1" xref="A3.p3.1.1.m1.1.1.cmml">N</mi><mo id="A3.p3.1.1.m1.2.3.2.2" xref="A3.p3.1.1.m1.2.3.1.cmml">,</mo><mi id="A3.p3.1.1.m1.2.2" xref="A3.p3.1.1.m1.2.2.cmml">D</mi><mo id="A3.p3.1.1.m1.2.3.2.3" stretchy="false" xref="A3.p3.1.1.m1.2.3.1.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="A3.p3.1.1.m1.2b"><interval closure="open" id="A3.p3.1.1.m1.2.3.1.cmml" xref="A3.p3.1.1.m1.2.3.2"><ci id="A3.p3.1.1.m1.1.1.cmml" xref="A3.p3.1.1.m1.1.1">𝑁</ci><ci id="A3.p3.1.1.m1.2.2.cmml" xref="A3.p3.1.1.m1.2.2">𝐷</ci></interval></annotation-xml><annotation encoding="application/x-tex" id="A3.p3.1.1.m1.2c">(N,D)</annotation><annotation encoding="application/x-llamapun" id="A3.p3.1.1.m1.2d">( italic_N , italic_D )</annotation></semantics></math> fit.</span>
Instead of modeling loss as a function of compute <math alttext="C" class="ltx_Math" display="inline" id="A3.p3.2.m1.1"><semantics id="A3.p3.2.m1.1a"><mi id="A3.p3.2.m1.1.1" xref="A3.p3.2.m1.1.1.cmml">C</mi><annotation-xml encoding="MathML-Content" id="A3.p3.2.m1.1b"><ci id="A3.p3.2.m1.1.1.cmml" xref="A3.p3.2.m1.1.1">𝐶</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p3.2.m1.1c">C</annotation><annotation encoding="application/x-llamapun" id="A3.p3.2.m1.1d">italic_C</annotation></semantics></math>, this variant uses both the number of parameters <math alttext="N" class="ltx_Math" display="inline" id="A3.p3.3.m2.1"><semantics id="A3.p3.3.m2.1a"><mi id="A3.p3.3.m2.1.1" xref="A3.p3.3.m2.1.1.cmml">N</mi><annotation-xml encoding="MathML-Content" id="A3.p3.3.m2.1b"><ci id="A3.p3.3.m2.1.1.cmml" xref="A3.p3.3.m2.1.1">𝑁</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p3.3.m2.1c">N</annotation><annotation encoding="application/x-llamapun" id="A3.p3.3.m2.1d">italic_N</annotation></semantics></math> and the number of tokens <math alttext="D" class="ltx_Math" display="inline" id="A3.p3.4.m3.1"><semantics id="A3.p3.4.m3.1a"><mi id="A3.p3.4.m3.1.1" xref="A3.p3.4.m3.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="A3.p3.4.m3.1b"><ci id="A3.p3.4.m3.1.1.cmml" xref="A3.p3.4.m3.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p3.4.m3.1c">D</annotation><annotation encoding="application/x-llamapun" id="A3.p3.4.m3.1d">italic_D</annotation></semantics></math> directly in the loss function:</p>
<table class="ltx_equation ltx_eqn_table" id="A3.E5">
<tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math alttext="L(N,D)=\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}+E" class="ltx_Math" display="block" id="A3.E5.m1.2"><semantics id="A3.E5.m1.2a"><mrow id="A3.E5.m1.2.3" xref="A3.E5.m1.2.3.cmml"><mrow id="A3.E5.m1.2.3.2" xref="A3.E5.m1.2.3.2.cmml"><mi id="A3.E5.m1.2.3.2.2" xref="A3.E5.m1.2.3.2.2.cmml">L</mi><mo id="A3.E5.m1.2.3.2.1" xref="A3.E5.m1.2.3.2.1.cmml">⁢</mo><mrow id="A3.E5.m1.2.3.2.3.2" xref="A3.E5.m1.2.3.2.3.1.cmml"><mo id="A3.E5.m1.2.3.2.3.2.1" stretchy="false" xref="A3.E5.m1.2.3.2.3.1.cmml">(</mo><mi id="A3.E5.m1.1.1" xref="A3.E5.m1.1.1.cmml">N</mi><mo id="A3.E5.m1.2.3.2.3.2.2" xref="A3.E5.m1.2.3.2.3.1.cmml">,</mo><mi id="A3.E5.m1.2.2" xref="A3.E5.m1.2.2.cmml">D</mi><mo id="A3.E5.m1.2.3.2.3.2.3" stretchy="false" xref="A3.E5.m1.2.3.2.3.1.cmml">)</mo></mrow></mrow><mo id="A3.E5.m1.2.3.1" xref="A3.E5.m1.2.3.1.cmml">=</mo><mrow id="A3.E5.m1.2.3.3" xref="A3.E5.m1.2.3.3.cmml"><mfrac id="A3.E5.m1.2.3.3.2" xref="A3.E5.m1.2.3.3.2.cmml"><mi id="A3.E5.m1.2.3.3.2.2" xref="A3.E5.m1.2.3.3.2.2.cmml">A</mi><msup id="A3.E5.m1.2.3.3.2.3" xref="A3.E5.m1.2.3.3.2.3.cmml"><mi id="A3.E5.m1.2.3.3.2.3.2" xref="A3.E5.m1.2.3.3.2.3.2.cmml">N</mi><mi id="A3.E5.m1.2.3.3.2.3.3" xref="A3.E5.m1.2.3.3.2.3.3.cmml">α</mi></msup></mfrac><mo id="A3.E5.m1.2.3.3.1" xref="A3.E5.m1.2.3.3.1.cmml">+</mo><mfrac id="A3.E5.m1.2.3.3.3" xref="A3.E5.m1.2.3.3.3.cmml"><mi id="A3.E5.m1.2.3.3.3.2" xref="A3.E5.m1.2.3.3.3.2.cmml">B</mi><msup id="A3.E5.m1.2.3.3.3.3" xref="A3.E5.m1.2.3.3.3.3.cmml"><mi id="A3.E5.m1.2.3.3.3.3.2" xref="A3.E5.m1.2.3.3.3.3.2.cmml">D</mi><mi id="A3.E5.m1.2.3.3.3.3.3" xref="A3.E5.m1.2.3.3.3.3.3.cmml">β</mi></msup></mfrac><mo id="A3.E5.m1.2.3.3.1a" xref="A3.E5.m1.2.3.3.1.cmml">+</mo><mi id="A3.E5.m1.2.3.3.4" xref="A3.E5.m1.2.3.3.4.cmml">E</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="A3.E5.m1.2b"><apply id="A3.E5.m1.2.3.cmml" xref="A3.E5.m1.2.3"><eq id="A3.E5.m1.2.3.1.cmml" xref="A3.E5.m1.2.3.1"></eq><apply 
id="A3.E5.m1.2.3.2.cmml" xref="A3.E5.m1.2.3.2"><times id="A3.E5.m1.2.3.2.1.cmml" xref="A3.E5.m1.2.3.2.1"></times><ci id="A3.E5.m1.2.3.2.2.cmml" xref="A3.E5.m1.2.3.2.2">𝐿</ci><interval closure="open" id="A3.E5.m1.2.3.2.3.1.cmml" xref="A3.E5.m1.2.3.2.3.2"><ci id="A3.E5.m1.1.1.cmml" xref="A3.E5.m1.1.1">𝑁</ci><ci id="A3.E5.m1.2.2.cmml" xref="A3.E5.m1.2.2">𝐷</ci></interval></apply><apply id="A3.E5.m1.2.3.3.cmml" xref="A3.E5.m1.2.3.3"><plus id="A3.E5.m1.2.3.3.1.cmml" xref="A3.E5.m1.2.3.3.1"></plus><apply id="A3.E5.m1.2.3.3.2.cmml" xref="A3.E5.m1.2.3.3.2"><divide id="A3.E5.m1.2.3.3.2.1.cmml" xref="A3.E5.m1.2.3.3.2"></divide><ci id="A3.E5.m1.2.3.3.2.2.cmml" xref="A3.E5.m1.2.3.3.2.2">𝐴</ci><apply id="A3.E5.m1.2.3.3.2.3.cmml" xref="A3.E5.m1.2.3.3.2.3"><csymbol cd="ambiguous" id="A3.E5.m1.2.3.3.2.3.1.cmml" xref="A3.E5.m1.2.3.3.2.3">superscript</csymbol><ci id="A3.E5.m1.2.3.3.2.3.2.cmml" xref="A3.E5.m1.2.3.3.2.3.2">𝑁</ci><ci id="A3.E5.m1.2.3.3.2.3.3.cmml" xref="A3.E5.m1.2.3.3.2.3.3">𝛼</ci></apply></apply><apply id="A3.E5.m1.2.3.3.3.cmml" xref="A3.E5.m1.2.3.3.3"><divide id="A3.E5.m1.2.3.3.3.1.cmml" xref="A3.E5.m1.2.3.3.3"></divide><ci id="A3.E5.m1.2.3.3.3.2.cmml" xref="A3.E5.m1.2.3.3.3.2">𝐵</ci><apply id="A3.E5.m1.2.3.3.3.3.cmml" xref="A3.E5.m1.2.3.3.3.3"><csymbol cd="ambiguous" id="A3.E5.m1.2.3.3.3.3.1.cmml" xref="A3.E5.m1.2.3.3.3.3">superscript</csymbol><ci id="A3.E5.m1.2.3.3.3.3.2.cmml" xref="A3.E5.m1.2.3.3.3.3.2">𝐷</ci><ci id="A3.E5.m1.2.3.3.3.3.3.cmml" xref="A3.E5.m1.2.3.3.3.3.3">𝛽</ci></apply></apply><ci id="A3.E5.m1.2.3.3.4.cmml" xref="A3.E5.m1.2.3.3.4">𝐸</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A3.E5.m1.2c">L(N,D)=\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}+E</annotation><annotation encoding="application/x-llamapun" id="A3.E5.m1.2d">italic_L ( italic_N , italic_D ) = divide start_ARG italic_A end_ARG start_ARG italic_N start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT end_ARG + divide start_ARG italic_B end_ARG start_ARG italic_D 
start_POSTSUPERSCRIPT italic_β end_POSTSUPERSCRIPT end_ARG + italic_E</annotation></semantics></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(5)</span></td>
</tr></tbody>
</table>
<p class="ltx_p" id="A3.p3.9">This introduces five parameters: <math alttext="A" class="ltx_Math" display="inline" id="A3.p3.5.m1.1"><semantics id="A3.p3.5.m1.1a"><mi id="A3.p3.5.m1.1.1" xref="A3.p3.5.m1.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="A3.p3.5.m1.1b"><ci id="A3.p3.5.m1.1.1.cmml" xref="A3.p3.5.m1.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p3.5.m1.1c">A</annotation><annotation encoding="application/x-llamapun" id="A3.p3.5.m1.1d">italic_A</annotation></semantics></math>, <math alttext="\alpha" class="ltx_Math" display="inline" id="A3.p3.6.m2.1"><semantics id="A3.p3.6.m2.1a"><mi id="A3.p3.6.m2.1.1" xref="A3.p3.6.m2.1.1.cmml">α</mi><annotation-xml encoding="MathML-Content" id="A3.p3.6.m2.1b"><ci id="A3.p3.6.m2.1.1.cmml" xref="A3.p3.6.m2.1.1">𝛼</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p3.6.m2.1c">\alpha</annotation><annotation encoding="application/x-llamapun" id="A3.p3.6.m2.1d">italic_α</annotation></semantics></math>, <math alttext="B" class="ltx_Math" display="inline" id="A3.p3.7.m3.1"><semantics id="A3.p3.7.m3.1a"><mi id="A3.p3.7.m3.1.1" xref="A3.p3.7.m3.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="A3.p3.7.m3.1b"><ci id="A3.p3.7.m3.1.1.cmml" xref="A3.p3.7.m3.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p3.7.m3.1c">B</annotation><annotation encoding="application/x-llamapun" id="A3.p3.7.m3.1d">italic_B</annotation></semantics></math>, <math alttext="\beta" class="ltx_Math" display="inline" id="A3.p3.8.m4.1"><semantics id="A3.p3.8.m4.1a"><mi id="A3.p3.8.m4.1.1" xref="A3.p3.8.m4.1.1.cmml">β</mi><annotation-xml encoding="MathML-Content" id="A3.p3.8.m4.1b"><ci id="A3.p3.8.m4.1.1.cmml" xref="A3.p3.8.m4.1.1">𝛽</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p3.8.m4.1c">\beta</annotation><annotation encoding="application/x-llamapun" id="A3.p3.8.m4.1d">italic_β</annotation></semantics></math>, and <math alttext="E" 
class="ltx_Math" display="inline" id="A3.p3.9.m5.1"><semantics id="A3.p3.9.m5.1a"><mi id="A3.p3.9.m5.1.1" xref="A3.p3.9.m5.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="A3.p3.9.m5.1b"><ci id="A3.p3.9.m5.1.1.cmml" xref="A3.p3.9.m5.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p3.9.m5.1c">E</annotation><annotation encoding="application/x-llamapun" id="A3.p3.9.m5.1d">italic_E</annotation></semantics></math>.</p>
</div>
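The five-parameter fit of Equation 5 can likewise be sketched with `scipy.optimize.curve_fit`, following the usual convention that N counts parameters and D counts tokens. The grid of model/token sizes and the generating parameters below are hypothetical, chosen only to show the mechanics of a joint fit.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_nd(ND, A, alpha, B, beta, E):
    # Equation (5): A/N^alpha + B/D^beta + E, with N parameters and D tokens.
    N, D = ND
    return A / N**alpha + B / D**beta + E

# Hypothetical grid: six model sizes crossed with three token budgets.
N = np.tile([4e6, 2e7, 6e7, 1.5e8, 3e8, 1e9], 3)
D = np.repeat([1e9, 1e10, 1e11], 6)
true = (400.0, 0.34, 410.0, 0.28, 1.7)
L = loss_nd((N, D), *true)

# Recover all five parameters from the synthetic losses.
popt, _ = curve_fit(loss_nd, (N, D), L, p0=[300.0, 0.3, 300.0, 0.3, 1.5],
                    maxfev=20000)
A_hat, alpha_hat, B_hat, beta_hat, E_hat = popt
```

Varying N and D independently in the grid is what makes the two power-law terms separately identifiable.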
<div class="ltx_para ltx_noindent" id="A3.p4">
<p class="ltx_p" id="A3.p4.1"><span class="ltx_text ltx_font_bold" id="A3.p4.1.1">Single-step prediction.</span>
In this variant, the two-stage fitting procedure is replaced with a single step that directly maps compute <math alttext="C" class="ltx_Math" display="inline" id="A3.p4.1.m1.1"><semantics id="A3.p4.1.m1.1a"><mi id="A3.p4.1.m1.1.1" xref="A3.p4.1.m1.1.1.cmml">C</mi><annotation-xml encoding="MathML-Content" id="A3.p4.1.m1.1b"><ci id="A3.p4.1.m1.1.1.cmml" xref="A3.p4.1.m1.1.1">𝐶</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p4.1.m1.1c">C</annotation><annotation encoding="application/x-llamapun" id="A3.p4.1.m1.1d">italic_C</annotation></semantics></math> to accuracy:</p>
<table class="ltx_equation ltx_eqn_table" id="A3.E6">
<tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math alttext="Acc(C)=\frac{a}{1+\exp\left(-k\left(\frac{A}{C^{\alpha}}+E-L_{0}\right)\right)%
}+b" class="ltx_Math" display="block" id="A3.E6.m1.3"><semantics id="A3.E6.m1.3a"><mrow id="A3.E6.m1.3.4" xref="A3.E6.m1.3.4.cmml"><mrow id="A3.E6.m1.3.4.2" xref="A3.E6.m1.3.4.2.cmml"><mi id="A3.E6.m1.3.4.2.2" xref="A3.E6.m1.3.4.2.2.cmml">A</mi><mo id="A3.E6.m1.3.4.2.1" xref="A3.E6.m1.3.4.2.1.cmml">⁢</mo><mi id="A3.E6.m1.3.4.2.3" xref="A3.E6.m1.3.4.2.3.cmml">c</mi><mo id="A3.E6.m1.3.4.2.1a" xref="A3.E6.m1.3.4.2.1.cmml">⁢</mo><mi id="A3.E6.m1.3.4.2.4" xref="A3.E6.m1.3.4.2.4.cmml">c</mi><mo id="A3.E6.m1.3.4.2.1b" xref="A3.E6.m1.3.4.2.1.cmml">⁢</mo><mrow id="A3.E6.m1.3.4.2.5.2" xref="A3.E6.m1.3.4.2.cmml"><mo id="A3.E6.m1.3.4.2.5.2.1" stretchy="false" xref="A3.E6.m1.3.4.2.cmml">(</mo><mi id="A3.E6.m1.3.3" xref="A3.E6.m1.3.3.cmml">C</mi><mo id="A3.E6.m1.3.4.2.5.2.2" stretchy="false" xref="A3.E6.m1.3.4.2.cmml">)</mo></mrow></mrow><mo id="A3.E6.m1.3.4.1" xref="A3.E6.m1.3.4.1.cmml">=</mo><mrow id="A3.E6.m1.3.4.3" xref="A3.E6.m1.3.4.3.cmml"><mfrac id="A3.E6.m1.2.2" xref="A3.E6.m1.2.2.cmml"><mi id="A3.E6.m1.2.2.4" xref="A3.E6.m1.2.2.4.cmml">a</mi><mrow id="A3.E6.m1.2.2.2" xref="A3.E6.m1.2.2.2.cmml"><mn id="A3.E6.m1.2.2.2.4" xref="A3.E6.m1.2.2.2.4.cmml">1</mn><mo id="A3.E6.m1.2.2.2.3" xref="A3.E6.m1.2.2.2.3.cmml">+</mo><mrow id="A3.E6.m1.2.2.2.2.1" xref="A3.E6.m1.2.2.2.2.2.cmml"><mi id="A3.E6.m1.1.1.1.1" xref="A3.E6.m1.1.1.1.1.cmml">exp</mi><mo id="A3.E6.m1.2.2.2.2.1a" xref="A3.E6.m1.2.2.2.2.2.cmml">⁡</mo><mrow id="A3.E6.m1.2.2.2.2.1.1" xref="A3.E6.m1.2.2.2.2.2.cmml"><mo id="A3.E6.m1.2.2.2.2.1.1.2" xref="A3.E6.m1.2.2.2.2.2.cmml">(</mo><mrow id="A3.E6.m1.2.2.2.2.1.1.1" xref="A3.E6.m1.2.2.2.2.1.1.1.cmml"><mo id="A3.E6.m1.2.2.2.2.1.1.1a" xref="A3.E6.m1.2.2.2.2.1.1.1.cmml">−</mo><mrow id="A3.E6.m1.2.2.2.2.1.1.1.1" xref="A3.E6.m1.2.2.2.2.1.1.1.1.cmml"><mi id="A3.E6.m1.2.2.2.2.1.1.1.1.3" xref="A3.E6.m1.2.2.2.2.1.1.1.1.3.cmml">k</mi><mo id="A3.E6.m1.2.2.2.2.1.1.1.1.2" xref="A3.E6.m1.2.2.2.2.1.1.1.1.2.cmml">⁢</mo><mrow id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1" 
xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.cmml"><mo id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.2" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.cmml">(</mo><mrow id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.cmml"><mrow id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.cmml"><mfrac id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.cmml"><mi id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.2" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.2.cmml">A</mi><msup id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.cmml"><mi id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.2" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.2.cmml">C</mi><mi id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.3" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.3.cmml">α</mi></msup></mfrac><mo id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.1" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.1.cmml">+</mo><mi id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.3" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.3.cmml">E</mi></mrow><mo id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.1" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.1.cmml">−</mo><msub id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.cmml"><mi id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.2" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.2.cmml">L</mi><mn id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.3" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.3.cmml">0</mn></msub></mrow><mo id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.3" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.cmml">)</mo></mrow></mrow></mrow><mo id="A3.E6.m1.2.2.2.2.1.1.3" xref="A3.E6.m1.2.2.2.2.2.cmml">)</mo></mrow></mrow></mrow></mfrac><mo id="A3.E6.m1.3.4.3.1" xref="A3.E6.m1.3.4.3.1.cmml">+</mo><mi id="A3.E6.m1.3.4.3.2" xref="A3.E6.m1.3.4.3.2.cmml">b</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="A3.E6.m1.3b"><apply id="A3.E6.m1.3.4.cmml" xref="A3.E6.m1.3.4"><eq id="A3.E6.m1.3.4.1.cmml" xref="A3.E6.m1.3.4.1"></eq><apply id="A3.E6.m1.3.4.2.cmml" xref="A3.E6.m1.3.4.2"><times 
id="A3.E6.m1.3.4.2.1.cmml" xref="A3.E6.m1.3.4.2.1"></times><ci id="A3.E6.m1.3.4.2.2.cmml" xref="A3.E6.m1.3.4.2.2">𝐴</ci><ci id="A3.E6.m1.3.4.2.3.cmml" xref="A3.E6.m1.3.4.2.3">𝑐</ci><ci id="A3.E6.m1.3.4.2.4.cmml" xref="A3.E6.m1.3.4.2.4">𝑐</ci><ci id="A3.E6.m1.3.3.cmml" xref="A3.E6.m1.3.3">𝐶</ci></apply><apply id="A3.E6.m1.3.4.3.cmml" xref="A3.E6.m1.3.4.3"><plus id="A3.E6.m1.3.4.3.1.cmml" xref="A3.E6.m1.3.4.3.1"></plus><apply id="A3.E6.m1.2.2.cmml" xref="A3.E6.m1.2.2"><divide id="A3.E6.m1.2.2.3.cmml" xref="A3.E6.m1.2.2"></divide><ci id="A3.E6.m1.2.2.4.cmml" xref="A3.E6.m1.2.2.4">𝑎</ci><apply id="A3.E6.m1.2.2.2.cmml" xref="A3.E6.m1.2.2.2"><plus id="A3.E6.m1.2.2.2.3.cmml" xref="A3.E6.m1.2.2.2.3"></plus><cn id="A3.E6.m1.2.2.2.4.cmml" type="integer" xref="A3.E6.m1.2.2.2.4">1</cn><apply id="A3.E6.m1.2.2.2.2.2.cmml" xref="A3.E6.m1.2.2.2.2.1"><exp id="A3.E6.m1.1.1.1.1.cmml" xref="A3.E6.m1.1.1.1.1"></exp><apply id="A3.E6.m1.2.2.2.2.1.1.1.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1"><minus id="A3.E6.m1.2.2.2.2.1.1.1.2.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1"></minus><apply id="A3.E6.m1.2.2.2.2.1.1.1.1.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1"><times id="A3.E6.m1.2.2.2.2.1.1.1.1.2.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.2"></times><ci id="A3.E6.m1.2.2.2.2.1.1.1.1.3.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.3">𝑘</ci><apply id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1"><minus id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.1.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.1"></minus><apply id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2"><plus id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.1.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.1"></plus><apply id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2"><divide id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.1.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2"></divide><ci id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.2.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.2">𝐴</ci><apply 
id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3"><csymbol cd="ambiguous" id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.1.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3">superscript</csymbol><ci id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.2.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.2">𝐶</ci><ci id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.3.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.2.3.3">𝛼</ci></apply></apply><ci id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.3.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.2.3">𝐸</ci></apply><apply id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3"><csymbol cd="ambiguous" id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.1.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3">subscript</csymbol><ci id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.2.cmml" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.2">𝐿</ci><cn id="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.3.cmml" type="integer" xref="A3.E6.m1.2.2.2.2.1.1.1.1.1.1.1.3.3">0</cn></apply></apply></apply></apply></apply></apply></apply><ci id="A3.E6.m1.3.4.3.2.cmml" xref="A3.E6.m1.3.4.3.2">𝑏</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A3.E6.m1.3c">Acc(C)=\frac{a}{1+\exp\left(-k\left(\frac{A}{C^{\alpha}}+E-L_{0}\right)\right)%
}+b</annotation><annotation encoding="application/x-llamapun" id="A3.E6.m1.3d">italic_A italic_c italic_c ( italic_C ) = divide start_ARG italic_a end_ARG start_ARG 1 + roman_exp ( - italic_k ( divide start_ARG italic_A end_ARG start_ARG italic_C start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT end_ARG + italic_E - italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) end_ARG + italic_b</annotation></semantics></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(6)</span></td>
</tr></tbody>
</table>
<p class="ltx_p" id="A3.p4.2">This combines the loss and accuracy mapping into one function.</p>
</div>
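Equation 6 composes the power-law loss with a sigmoid. The sketch below just evaluates the composed function; all parameter values are hypothetical (with k negative so that accuracy rises as the predicted loss falls), and b plays the role of a chance-level floor.

```python
import numpy as np

def acc_from_compute(C, a, b, k, L0, A, alpha, E):
    # Equation (6): sigmoid of the predicted loss A/C^alpha + E,
    # shifted by L0 and scaled by k, mapped into [b, a + b].
    loss = A / C**alpha + E
    return a / (1.0 + np.exp(-k * (loss - L0))) + b

# Hypothetical parameters; accuracy climbs from near the floor b toward a + b.
C = np.array([1e15, 1e16, 1e17, 1e18, 1e19])
acc = acc_from_compute(C, a=0.6, b=0.25, k=-3.0, L0=4.0,
                       A=1000.0, alpha=0.15, E=1.7)
```

In the actual procedure these seven parameters would be fit jointly to observed accuracies rather than chosen by hand.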
<div class="ltx_para ltx_noindent" id="A3.p5">
<p class="ltx_p" id="A3.p5.1"><span class="ltx_text ltx_font_bold" id="A3.p5.1.1">5-parameter, single step.</span>
We also test a single-step variant that directly maps from <math alttext="(N,D)" class="ltx_Math" display="inline" id="A3.p5.1.m1.2"><semantics id="A3.p5.1.m1.2a"><mrow id="A3.p5.1.m1.2.3.2" xref="A3.p5.1.m1.2.3.1.cmml"><mo id="A3.p5.1.m1.2.3.2.1" stretchy="false" xref="A3.p5.1.m1.2.3.1.cmml">(</mo><mi id="A3.p5.1.m1.1.1" xref="A3.p5.1.m1.1.1.cmml">N</mi><mo id="A3.p5.1.m1.2.3.2.2" xref="A3.p5.1.m1.2.3.1.cmml">,</mo><mi id="A3.p5.1.m1.2.2" xref="A3.p5.1.m1.2.2.cmml">D</mi><mo id="A3.p5.1.m1.2.3.2.3" stretchy="false" xref="A3.p5.1.m1.2.3.1.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="A3.p5.1.m1.2b"><interval closure="open" id="A3.p5.1.m1.2.3.1.cmml" xref="A3.p5.1.m1.2.3.2"><ci id="A3.p5.1.m1.1.1.cmml" xref="A3.p5.1.m1.1.1">𝑁</ci><ci id="A3.p5.1.m1.2.2.cmml" xref="A3.p5.1.m1.2.2">𝐷</ci></interval></annotation-xml><annotation encoding="application/x-tex" id="A3.p5.1.m1.2c">(N,D)</annotation><annotation encoding="application/x-llamapun" id="A3.p5.1.m1.2d">( italic_N , italic_D )</annotation></semantics></math> to accuracy using a logistic function over the predicted loss. This merges Equations <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A3.E5" title="In Appendix C Scaling Law Variants ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">5</span></a> and <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#S2.E2" title="In Extrapolating Scaling Laws (Multi Scale) ‣ 2.2 Prediction Methods ‣ 2 Methods ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">2</span></a> into:</p>
</div>
<div class="ltx_para ltx_noindent" id="A3.p6">
<table class="ltx_equation ltx_eqn_table" id="A3.E7">
<tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center"><math alttext="Acc(N,D)=\frac{a}{1+\exp\left(-\left(\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}+%
E\right)\right)}+b" class="ltx_Math" display="block" id="A3.E7.m1.4"><semantics id="A3.E7.m1.4a"><mrow id="A3.E7.m1.4.5" xref="A3.E7.m1.4.5.cmml"><mrow id="A3.E7.m1.4.5.2" xref="A3.E7.m1.4.5.2.cmml"><mi id="A3.E7.m1.4.5.2.2" xref="A3.E7.m1.4.5.2.2.cmml">A</mi><mo id="A3.E7.m1.4.5.2.1" xref="A3.E7.m1.4.5.2.1.cmml">⁢</mo><mi id="A3.E7.m1.4.5.2.3" xref="A3.E7.m1.4.5.2.3.cmml">c</mi><mo id="A3.E7.m1.4.5.2.1a" xref="A3.E7.m1.4.5.2.1.cmml">⁢</mo><mi id="A3.E7.m1.4.5.2.4" xref="A3.E7.m1.4.5.2.4.cmml">c</mi><mo id="A3.E7.m1.4.5.2.1b" xref="A3.E7.m1.4.5.2.1.cmml">⁢</mo><mrow id="A3.E7.m1.4.5.2.5.2" xref="A3.E7.m1.4.5.2.5.1.cmml"><mo id="A3.E7.m1.4.5.2.5.2.1" stretchy="false" xref="A3.E7.m1.4.5.2.5.1.cmml">(</mo><mi id="A3.E7.m1.3.3" xref="A3.E7.m1.3.3.cmml">N</mi><mo id="A3.E7.m1.4.5.2.5.2.2" xref="A3.E7.m1.4.5.2.5.1.cmml">,</mo><mi id="A3.E7.m1.4.4" xref="A3.E7.m1.4.4.cmml">D</mi><mo id="A3.E7.m1.4.5.2.5.2.3" stretchy="false" xref="A3.E7.m1.4.5.2.5.1.cmml">)</mo></mrow></mrow><mo id="A3.E7.m1.4.5.1" xref="A3.E7.m1.4.5.1.cmml">=</mo><mrow id="A3.E7.m1.4.5.3" xref="A3.E7.m1.4.5.3.cmml"><mfrac id="A3.E7.m1.2.2" xref="A3.E7.m1.2.2.cmml"><mi id="A3.E7.m1.2.2.4" xref="A3.E7.m1.2.2.4.cmml">a</mi><mrow id="A3.E7.m1.2.2.2" xref="A3.E7.m1.2.2.2.cmml"><mn id="A3.E7.m1.2.2.2.4" xref="A3.E7.m1.2.2.2.4.cmml">1</mn><mo id="A3.E7.m1.2.2.2.3" xref="A3.E7.m1.2.2.2.3.cmml">+</mo><mrow id="A3.E7.m1.2.2.2.2.1" xref="A3.E7.m1.2.2.2.2.2.cmml"><mi id="A3.E7.m1.1.1.1.1" xref="A3.E7.m1.1.1.1.1.cmml">exp</mi><mo id="A3.E7.m1.2.2.2.2.1a" xref="A3.E7.m1.2.2.2.2.2.cmml">⁡</mo><mrow id="A3.E7.m1.2.2.2.2.1.1" xref="A3.E7.m1.2.2.2.2.2.cmml"><mo id="A3.E7.m1.2.2.2.2.1.1.2" xref="A3.E7.m1.2.2.2.2.2.cmml">(</mo><mrow id="A3.E7.m1.2.2.2.2.1.1.1" xref="A3.E7.m1.2.2.2.2.1.1.1.cmml"><mo id="A3.E7.m1.2.2.2.2.1.1.1a" xref="A3.E7.m1.2.2.2.2.1.1.1.cmml">−</mo><mrow id="A3.E7.m1.2.2.2.2.1.1.1.1.1" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.cmml"><mo id="A3.E7.m1.2.2.2.2.1.1.1.1.1.2" 
xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.cmml">(</mo><mrow id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.cmml"><mfrac id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.cmml"><mi id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.2" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.2.cmml">A</mi><msup id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.cmml"><mi id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.2" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.2.cmml">N</mi><mi id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.3" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.3.cmml">α</mi></msup></mfrac><mo id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.1" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.1.cmml">+</mo><mfrac id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.cmml"><mi id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.2" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.2.cmml">B</mi><msup id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.cmml"><mi id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.2" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.2.cmml">D</mi><mi id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.3" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.3.cmml">β</mi></msup></mfrac><mo id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.1a" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.1.cmml">+</mo><mi id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.4" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.4.cmml">E</mi></mrow><mo id="A3.E7.m1.2.2.2.2.1.1.1.1.1.3" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.cmml">)</mo></mrow></mrow><mo id="A3.E7.m1.2.2.2.2.1.1.3" xref="A3.E7.m1.2.2.2.2.2.cmml">)</mo></mrow></mrow></mrow></mfrac><mo id="A3.E7.m1.4.5.3.1" xref="A3.E7.m1.4.5.3.1.cmml">+</mo><mi id="A3.E7.m1.4.5.3.2" xref="A3.E7.m1.4.5.3.2.cmml">b</mi></mrow></mrow><annotation-xml encoding="MathML-Content" id="A3.E7.m1.4b"><apply id="A3.E7.m1.4.5.cmml" xref="A3.E7.m1.4.5"><eq id="A3.E7.m1.4.5.1.cmml" xref="A3.E7.m1.4.5.1"></eq><apply id="A3.E7.m1.4.5.2.cmml" xref="A3.E7.m1.4.5.2"><times id="A3.E7.m1.4.5.2.1.cmml" xref="A3.E7.m1.4.5.2.1"></times><ci 
id="A3.E7.m1.4.5.2.2.cmml" xref="A3.E7.m1.4.5.2.2">𝐴</ci><ci id="A3.E7.m1.4.5.2.3.cmml" xref="A3.E7.m1.4.5.2.3">𝑐</ci><ci id="A3.E7.m1.4.5.2.4.cmml" xref="A3.E7.m1.4.5.2.4">𝑐</ci><interval closure="open" id="A3.E7.m1.4.5.2.5.1.cmml" xref="A3.E7.m1.4.5.2.5.2"><ci id="A3.E7.m1.3.3.cmml" xref="A3.E7.m1.3.3">𝑁</ci><ci id="A3.E7.m1.4.4.cmml" xref="A3.E7.m1.4.4">𝐷</ci></interval></apply><apply id="A3.E7.m1.4.5.3.cmml" xref="A3.E7.m1.4.5.3"><plus id="A3.E7.m1.4.5.3.1.cmml" xref="A3.E7.m1.4.5.3.1"></plus><apply id="A3.E7.m1.2.2.cmml" xref="A3.E7.m1.2.2"><divide id="A3.E7.m1.2.2.3.cmml" xref="A3.E7.m1.2.2"></divide><ci id="A3.E7.m1.2.2.4.cmml" xref="A3.E7.m1.2.2.4">𝑎</ci><apply id="A3.E7.m1.2.2.2.cmml" xref="A3.E7.m1.2.2.2"><plus id="A3.E7.m1.2.2.2.3.cmml" xref="A3.E7.m1.2.2.2.3"></plus><cn id="A3.E7.m1.2.2.2.4.cmml" type="integer" xref="A3.E7.m1.2.2.2.4">1</cn><apply id="A3.E7.m1.2.2.2.2.2.cmml" xref="A3.E7.m1.2.2.2.2.1"><exp id="A3.E7.m1.1.1.1.1.cmml" xref="A3.E7.m1.1.1.1.1"></exp><apply id="A3.E7.m1.2.2.2.2.1.1.1.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1"><minus id="A3.E7.m1.2.2.2.2.1.1.1.2.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1"></minus><apply id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1"><plus id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.1.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.1"></plus><apply id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2"><divide id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.1.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2"></divide><ci id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.2.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.2">𝐴</ci><apply id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3"><csymbol cd="ambiguous" id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.1.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3">superscript</csymbol><ci id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.2.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.2">𝑁</ci><ci id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.3.cmml" 
xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.2.3.3">𝛼</ci></apply></apply><apply id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3"><divide id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.1.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3"></divide><ci id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.2.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.2">𝐵</ci><apply id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3"><csymbol cd="ambiguous" id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.1.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3">superscript</csymbol><ci id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.2.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.2">𝐷</ci><ci id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.3.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.3.3.3">𝛽</ci></apply></apply><ci id="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.4.cmml" xref="A3.E7.m1.2.2.2.2.1.1.1.1.1.1.4">𝐸</ci></apply></apply></apply></apply></apply><ci id="A3.E7.m1.4.5.3.2.cmml" xref="A3.E7.m1.4.5.3.2">𝑏</ci></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A3.E7.m1.4c">Acc(N,D)=\frac{a}{1+\exp\left(-\left(\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}+%
E\right)\right)}+b</annotation><annotation encoding="application/x-llamapun" id="A3.E7.m1.4d">italic_A italic_c italic_c ( italic_N , italic_D ) = divide start_ARG italic_a end_ARG start_ARG 1 + roman_exp ( - ( divide start_ARG italic_A end_ARG start_ARG italic_N start_POSTSUPERSCRIPT italic_α end_POSTSUPERSCRIPT end_ARG + divide start_ARG italic_B end_ARG start_ARG italic_D start_POSTSUPERSCRIPT italic_β end_POSTSUPERSCRIPT end_ARG + italic_E ) ) end_ARG + italic_b</annotation></semantics></math></td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
<td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(7)</span></td>
</tr></tbody>
</table>
</div>
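Equation 7 can be sketched as a single function of N and D. Because the sigmoid's k and L0 are absorbed into the loss-side parameters, the sign conventions change: here a is taken negative (a hypothetical choice, as are all values below) so that accuracy increases as the loss term shrinks with scale.

```python
import numpy as np

def acc_nd(N, D, a, b, A, alpha, B, beta, E):
    # Equation (7): logistic over the (N, D) loss form; k and L0 are absorbed
    # into A, B, E, leaving {A, alpha, B, beta, E, a, b} as free parameters.
    z = A / N**alpha + B / D**beta + E
    return a / (1.0 + np.exp(-z)) + b

# Hypothetical parameters: a small model/token budget vs. a large one.
params = dict(a=-0.6, b=0.85, A=400.0, alpha=0.34, B=410.0, beta=0.28, E=1.7)
acc_small = acc_nd(4e6, 1e9, **params)
acc_large = acc_nd(1e9, 1e11, **params)
```

As N and D grow, z falls toward E and the predicted accuracy rises, which is the qualitative behavior the single-step fit is meant to capture.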
<div class="ltx_para ltx_noindent" id="A3.p7">
<p class="ltx_p" id="A3.p7.7">This formulation retains the same five parameters from the two-step <math alttext="(N,D)" class="ltx_Math" display="inline" id="A3.p7.1.m1.2"><semantics id="A3.p7.1.m1.2a"><mrow id="A3.p7.1.m1.2.3.2" xref="A3.p7.1.m1.2.3.1.cmml"><mo id="A3.p7.1.m1.2.3.2.1" stretchy="false" xref="A3.p7.1.m1.2.3.1.cmml">(</mo><mi id="A3.p7.1.m1.1.1" xref="A3.p7.1.m1.1.1.cmml">N</mi><mo id="A3.p7.1.m1.2.3.2.2" xref="A3.p7.1.m1.2.3.1.cmml">,</mo><mi id="A3.p7.1.m1.2.2" xref="A3.p7.1.m1.2.2.cmml">D</mi><mo id="A3.p7.1.m1.2.3.2.3" stretchy="false" xref="A3.p7.1.m1.2.3.1.cmml">)</mo></mrow><annotation-xml encoding="MathML-Content" id="A3.p7.1.m1.2b"><interval closure="open" id="A3.p7.1.m1.2.3.1.cmml" xref="A3.p7.1.m1.2.3.2"><ci id="A3.p7.1.m1.1.1.cmml" xref="A3.p7.1.m1.1.1">𝑁</ci><ci id="A3.p7.1.m1.2.2.cmml" xref="A3.p7.1.m1.2.2">𝐷</ci></interval></annotation-xml><annotation encoding="application/x-tex" id="A3.p7.1.m1.2c">(N,D)</annotation><annotation encoding="application/x-llamapun" id="A3.p7.1.m1.2d">( italic_N , italic_D )</annotation></semantics></math> loss function. Following <cite class="ltx_cite ltx_citemacro_citet">Bhagia et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>)</cite>, we merge the parameters <math alttext="k" class="ltx_Math" display="inline" id="A3.p7.2.m2.1"><semantics id="A3.p7.2.m2.1a"><mi id="A3.p7.2.m2.1.1" xref="A3.p7.2.m2.1.1.cmml">k</mi><annotation-xml encoding="MathML-Content" id="A3.p7.2.m2.1b"><ci id="A3.p7.2.m2.1.1.cmml" xref="A3.p7.2.m2.1.1">𝑘</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p7.2.m2.1c">k</annotation><annotation encoding="application/x-llamapun" id="A3.p7.2.m2.1d">italic_k</annotation></semantics></math> and <math alttext="L_{0}" class="ltx_Math" display="inline" id="A3.p7.3.m3.1"><semantics id="A3.p7.3.m3.1a"><msub id="A3.p7.3.m3.1.1" xref="A3.p7.3.m3.1.1.cmml"><mi id="A3.p7.3.m3.1.1.2" xref="A3.p7.3.m3.1.1.2.cmml">L</mi><mn id="A3.p7.3.m3.1.1.3" xref="A3.p7.3.m3.1.1.3.cmml">0</mn></msub><annotation-xml encoding="MathML-Content" id="A3.p7.3.m3.1b"><apply id="A3.p7.3.m3.1.1.cmml" xref="A3.p7.3.m3.1.1"><csymbol cd="ambiguous" id="A3.p7.3.m3.1.1.1.cmml" xref="A3.p7.3.m3.1.1">subscript</csymbol><ci id="A3.p7.3.m3.1.1.2.cmml" xref="A3.p7.3.m3.1.1.2">𝐿</ci><cn id="A3.p7.3.m3.1.1.3.cmml" type="integer" xref="A3.p7.3.m3.1.1.3">0</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A3.p7.3.m3.1c">L_{0}</annotation><annotation encoding="application/x-llamapun" id="A3.p7.3.m3.1d">italic_L start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT</annotation></semantics></math> from the second-stage sigmoid into the loss-side parameters (<math alttext="A" class="ltx_Math" display="inline" id="A3.p7.4.m4.1"><semantics id="A3.p7.4.m4.1a"><mi id="A3.p7.4.m4.1.1" xref="A3.p7.4.m4.1.1.cmml">A</mi><annotation-xml encoding="MathML-Content" id="A3.p7.4.m4.1b"><ci id="A3.p7.4.m4.1.1.cmml" xref="A3.p7.4.m4.1.1">𝐴</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p7.4.m4.1c">A</annotation><annotation encoding="application/x-llamapun" 
id="A3.p7.4.m4.1d">italic_A</annotation></semantics></math>, <math alttext="B" class="ltx_Math" display="inline" id="A3.p7.5.m5.1"><semantics id="A3.p7.5.m5.1a"><mi id="A3.p7.5.m5.1.1" xref="A3.p7.5.m5.1.1.cmml">B</mi><annotation-xml encoding="MathML-Content" id="A3.p7.5.m5.1b"><ci id="A3.p7.5.m5.1.1.cmml" xref="A3.p7.5.m5.1.1">𝐵</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p7.5.m5.1c">B</annotation><annotation encoding="application/x-llamapun" id="A3.p7.5.m5.1d">italic_B</annotation></semantics></math>, <math alttext="E" class="ltx_Math" display="inline" id="A3.p7.6.m6.1"><semantics id="A3.p7.6.m6.1a"><mi id="A3.p7.6.m6.1.1" xref="A3.p7.6.m6.1.1.cmml">E</mi><annotation-xml encoding="MathML-Content" id="A3.p7.6.m6.1b"><ci id="A3.p7.6.m6.1.1.cmml" xref="A3.p7.6.m6.1.1">𝐸</ci></annotation-xml><annotation encoding="application/x-tex" id="A3.p7.6.m6.1c">E</annotation><annotation encoding="application/x-llamapun" id="A3.p7.6.m6.1d">italic_E</annotation></semantics></math>), yielding a simplified single-stage fit with 7 total free parameters: <math alttext="\{A,\alpha,B,\beta,E,a,b\}" class="ltx_Math" display="inline" id="A3.p7.7.m7.7"><semantics id="A3.p7.7.m7.7a"><mrow id="A3.p7.7.m7.7.8.2" xref="A3.p7.7.m7.7.8.1.cmml"><mo id="A3.p7.7.m7.7.8.2.1" stretchy="false" xref="A3.p7.7.m7.7.8.1.cmml">{</mo><mi id="A3.p7.7.m7.1.1" xref="A3.p7.7.m7.1.1.cmml">A</mi><mo id="A3.p7.7.m7.7.8.2.2" xref="A3.p7.7.m7.7.8.1.cmml">,</mo><mi id="A3.p7.7.m7.2.2" xref="A3.p7.7.m7.2.2.cmml">α</mi><mo id="A3.p7.7.m7.7.8.2.3" xref="A3.p7.7.m7.7.8.1.cmml">,</mo><mi id="A3.p7.7.m7.3.3" xref="A3.p7.7.m7.3.3.cmml">B</mi><mo id="A3.p7.7.m7.7.8.2.4" xref="A3.p7.7.m7.7.8.1.cmml">,</mo><mi id="A3.p7.7.m7.4.4" xref="A3.p7.7.m7.4.4.cmml">β</mi><mo id="A3.p7.7.m7.7.8.2.5" xref="A3.p7.7.m7.7.8.1.cmml">,</mo><mi id="A3.p7.7.m7.5.5" xref="A3.p7.7.m7.5.5.cmml">E</mi><mo id="A3.p7.7.m7.7.8.2.6" xref="A3.p7.7.m7.7.8.1.cmml">,</mo><mi id="A3.p7.7.m7.6.6" xref="A3.p7.7.m7.6.6.cmml">a</mi><mo 
id="A3.p7.7.m7.7.8.2.7" xref="A3.p7.7.m7.7.8.1.cmml">,</mo><mi id="A3.p7.7.m7.7.7" xref="A3.p7.7.m7.7.7.cmml">b</mi><mo id="A3.p7.7.m7.7.8.2.8" stretchy="false" xref="A3.p7.7.m7.7.8.1.cmml">}</mo></mrow><annotation-xml encoding="MathML-Content" id="A3.p7.7.m7.7b"><set id="A3.p7.7.m7.7.8.1.cmml" xref="A3.p7.7.m7.7.8.2"><ci id="A3.p7.7.m7.1.1.cmml" xref="A3.p7.7.m7.1.1">𝐴</ci><ci id="A3.p7.7.m7.2.2.cmml" xref="A3.p7.7.m7.2.2">𝛼</ci><ci id="A3.p7.7.m7.3.3.cmml" xref="A3.p7.7.m7.3.3">𝐵</ci><ci id="A3.p7.7.m7.4.4.cmml" xref="A3.p7.7.m7.4.4">𝛽</ci><ci id="A3.p7.7.m7.5.5.cmml" xref="A3.p7.7.m7.5.5">𝐸</ci><ci id="A3.p7.7.m7.6.6.cmml" xref="A3.p7.7.m7.6.6">𝑎</ci><ci id="A3.p7.7.m7.7.7.cmml" xref="A3.p7.7.m7.7.7">𝑏</ci></set></annotation-xml><annotation encoding="application/x-tex" id="A3.p7.7.m7.7c">\{A,\alpha,B,\beta,E,a,b\}</annotation><annotation encoding="application/x-llamapun" id="A3.p7.7.m7.7d">{ italic_A , italic_α , italic_B , italic_β , italic_E , italic_a , italic_b }</annotation></semantics></math>.</p>
</div>
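As an illustrative sketch (not the authors' implementation), the merged single-stage form can be written as one function of (N, D) carrying the 7 free parameters {A, α, B, β, E, a, b} and fitted jointly in a single pass; all numeric values below are invented:

```python
import numpy as np
from scipy.optimize import curve_fit

def single_stage_acc(X, A, alpha, B, beta, E, a, b):
    """Predict accuracy directly from (N, D): the power-law loss
    L(N, D) = E + A/N**alpha + B/D**beta is pushed through a sigmoid
    whose scale k and midpoint L0 have been absorbed into the
    loss-side parameters, leaving 7 free parameters in total."""
    N, D = X
    L = E + A / N**alpha + B / D**beta
    return a / (1.0 + np.exp(L)) + b

# Synthetic (N, D, Acc) observations from made-up "true" parameters.
rng = np.random.default_rng(0)
true = [400.0, 0.3, 900.0, 0.3, -1.0, 0.7, 0.25]  # A, alpha, B, beta, E, a, b
N = rng.uniform(4e6, 1e9, size=64)                 # model sizes (parameters)
D = 20.0 * N                                       # training tokens
acc = single_stage_acc((N, D), *true)

# One joint fit over all 7 parameters, instead of two chained fits.
popt, _ = curve_fit(single_stage_acc, (N, D), acc, p0=true, maxfev=20000)
pred = single_stage_acc((N, D), *popt)
```

On clean synthetic data the joint fit recovers the generating curve; on real checkpoints the advantage is simply that no intermediate sigmoid parameters need to be estimated separately.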
<div class="ltx_para ltx_noindent" id="A3.p8">
<p class="ltx_p" id="A3.p8.1"><span class="ltx_text ltx_font_bold" id="A3.p8.1.1">Use of helper points.</span>
Following <cite class="ltx_cite ltx_citemacro_citet">Bhagia et al. (<a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#bib.bib3" title="">2024</a>)</cite>, we optionally include an extra point <math alttext="(L=0.0,Acc=1.0)" class="ltx_Math" display="inline" id="A3.p8.1.m1.1"><semantics id="A3.p8.1.m1.1a"><mrow id="A3.p8.1.m1.1.1.1"><mo id="A3.p8.1.m1.1.1.1.2" stretchy="false">(</mo><mrow id="A3.p8.1.m1.1.1.1.1.2" xref="A3.p8.1.m1.1.1.1.1.3.cmml"><mrow id="A3.p8.1.m1.1.1.1.1.1.1" xref="A3.p8.1.m1.1.1.1.1.1.1.cmml"><mi id="A3.p8.1.m1.1.1.1.1.1.1.2" xref="A3.p8.1.m1.1.1.1.1.1.1.2.cmml">L</mi><mo id="A3.p8.1.m1.1.1.1.1.1.1.1" xref="A3.p8.1.m1.1.1.1.1.1.1.1.cmml">=</mo><mn id="A3.p8.1.m1.1.1.1.1.1.1.3" xref="A3.p8.1.m1.1.1.1.1.1.1.3.cmml">0.0</mn></mrow><mo id="A3.p8.1.m1.1.1.1.1.2.3" xref="A3.p8.1.m1.1.1.1.1.3a.cmml">,</mo><mrow id="A3.p8.1.m1.1.1.1.1.2.2" xref="A3.p8.1.m1.1.1.1.1.2.2.cmml"><mrow id="A3.p8.1.m1.1.1.1.1.2.2.2" xref="A3.p8.1.m1.1.1.1.1.2.2.2.cmml"><mi id="A3.p8.1.m1.1.1.1.1.2.2.2.2" xref="A3.p8.1.m1.1.1.1.1.2.2.2.2.cmml">A</mi><mo id="A3.p8.1.m1.1.1.1.1.2.2.2.1" xref="A3.p8.1.m1.1.1.1.1.2.2.2.1.cmml">⁢</mo><mi id="A3.p8.1.m1.1.1.1.1.2.2.2.3" xref="A3.p8.1.m1.1.1.1.1.2.2.2.3.cmml">c</mi><mo id="A3.p8.1.m1.1.1.1.1.2.2.2.1a" xref="A3.p8.1.m1.1.1.1.1.2.2.2.1.cmml">⁢</mo><mi id="A3.p8.1.m1.1.1.1.1.2.2.2.4" xref="A3.p8.1.m1.1.1.1.1.2.2.2.4.cmml">c</mi></mrow><mo id="A3.p8.1.m1.1.1.1.1.2.2.1" xref="A3.p8.1.m1.1.1.1.1.2.2.1.cmml">=</mo><mn id="A3.p8.1.m1.1.1.1.1.2.2.3" xref="A3.p8.1.m1.1.1.1.1.2.2.3.cmml">1.0</mn></mrow></mrow><mo id="A3.p8.1.m1.1.1.1.3" stretchy="false">)</mo></mrow><annotation-xml encoding="MathML-Content" id="A3.p8.1.m1.1b"><apply id="A3.p8.1.m1.1.1.1.1.3.cmml" xref="A3.p8.1.m1.1.1.1.1.2"><csymbol cd="ambiguous" id="A3.p8.1.m1.1.1.1.1.3a.cmml" xref="A3.p8.1.m1.1.1.1.1.2.3">formulae-sequence</csymbol><apply id="A3.p8.1.m1.1.1.1.1.1.1.cmml" xref="A3.p8.1.m1.1.1.1.1.1.1"><eq id="A3.p8.1.m1.1.1.1.1.1.1.1.cmml" 
xref="A3.p8.1.m1.1.1.1.1.1.1.1"></eq><ci id="A3.p8.1.m1.1.1.1.1.1.1.2.cmml" xref="A3.p8.1.m1.1.1.1.1.1.1.2">𝐿</ci><cn id="A3.p8.1.m1.1.1.1.1.1.1.3.cmml" type="float" xref="A3.p8.1.m1.1.1.1.1.1.1.3">0.0</cn></apply><apply id="A3.p8.1.m1.1.1.1.1.2.2.cmml" xref="A3.p8.1.m1.1.1.1.1.2.2"><eq id="A3.p8.1.m1.1.1.1.1.2.2.1.cmml" xref="A3.p8.1.m1.1.1.1.1.2.2.1"></eq><apply id="A3.p8.1.m1.1.1.1.1.2.2.2.cmml" xref="A3.p8.1.m1.1.1.1.1.2.2.2"><times id="A3.p8.1.m1.1.1.1.1.2.2.2.1.cmml" xref="A3.p8.1.m1.1.1.1.1.2.2.2.1"></times><ci id="A3.p8.1.m1.1.1.1.1.2.2.2.2.cmml" xref="A3.p8.1.m1.1.1.1.1.2.2.2.2">𝐴</ci><ci id="A3.p8.1.m1.1.1.1.1.2.2.2.3.cmml" xref="A3.p8.1.m1.1.1.1.1.2.2.2.3">𝑐</ci><ci id="A3.p8.1.m1.1.1.1.1.2.2.2.4.cmml" xref="A3.p8.1.m1.1.1.1.1.2.2.2.4">𝑐</ci></apply><cn id="A3.p8.1.m1.1.1.1.1.2.2.3.cmml" type="float" xref="A3.p8.1.m1.1.1.1.1.2.2.3">1.0</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A3.p8.1.m1.1c">(L=0.0,Acc=1.0)</annotation><annotation encoding="application/x-llamapun" id="A3.p8.1.m1.1d">( italic_L = 0.0 , italic_A italic_c italic_c = 1.0 )</annotation></semantics></math> in the second-stage fit. This “helper” point anchors the upper asymptote of the accuracy prediction.</p>
</div>
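A minimal sketch of this anchoring (with invented checkpoint values, not data from the paper): append the point (L = 0.0, Acc = 1.0) to the observed (loss, accuracy) pairs before fitting the second-stage sigmoid.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_acc(L, a, k, L0, b):
    # Second-stage map from task loss to accuracy.
    return a / (1.0 + np.exp(k * (L - L0))) + b

# Made-up checkpoint losses, and the accuracies they would produce
# under hypothetical "true" sigmoid parameters.
true = [0.75, 1.8, 2.4, 0.25]                     # a, k, L0, b
L_obs = np.array([3.5, 3.0, 2.6, 2.3, 2.1, 2.0])
acc_obs = sigmoid_acc(L_obs, *true)

# Append the helper point (L = 0.0, Acc = 1.0) to anchor the upper
# asymptote, which the observed checkpoints alone never approach.
L_fit = np.append(L_obs, 0.0)
acc_fit = np.append(acc_obs, 1.0)

popt, _ = curve_fit(sigmoid_acc, L_fit, acc_fit, p0=true, maxfev=20000)
```

Without the helper point, the fit only sees the low-accuracy tail of the sigmoid, so the upper asymptote (and hence the extrapolated accuracy at low loss) is poorly constrained.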
<div class="ltx_para ltx_noindent" id="A3.p9">
<p class="ltx_p" id="A3.p9.1"><span class="ltx_text ltx_font_bold" id="A3.p9.1.1">Filtering early checkpoints.</span>
We experiment with excluding the first 50% of intermediate checkpoints when fitting the second-stage sigmoid. This reduces noise from high-loss early training points and often improves the fit for extrapolation.</p>
</div>
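For instance (illustrative values only), dropping the first half of a step-ordered checkpoint list before fitting could look like:

```python
import numpy as np

# Checkpoints ordered by training step (values invented for illustration).
steps  = np.arange(1000, 11000, 1000)        # 10 checkpoints
losses = np.linspace(4.0, 2.0, steps.size)   # early points are high-loss
accs   = np.linspace(0.25, 0.55, steps.size)

# Exclude the first 50% of checkpoints from the second-stage sigmoid fit.
half = steps.size // 2
losses_fit, accs_fit = losses[half:], accs[half:]
```

The retained points then all lie in the lower-loss regime where the loss-to-accuracy relationship is better behaved for extrapolation.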
<div class="ltx_para ltx_noindent" id="A3.p10">
<p class="ltx_p" id="A3.p10.1"><span class="ltx_text ltx_font_bold" id="A3.p10.1.1">Helpers and <math alttext="&gt;50" class="ltx_Math" display="inline" id="A3.p10.1.1.m1.1"><semantics id="A3.p10.1.1.m1.1a"><mrow id="A3.p10.1.1.m1.1.1" xref="A3.p10.1.1.m1.1.1.cmml"><mi id="A3.p10.1.1.m1.1.1.2" xref="A3.p10.1.1.m1.1.1.2.cmml"></mi><mo id="A3.p10.1.1.m1.1.1.1" xref="A3.p10.1.1.m1.1.1.1.cmml">&gt;</mo><mn id="A3.p10.1.1.m1.1.1.3" xref="A3.p10.1.1.m1.1.1.3.cmml">50</mn></mrow><annotation-xml encoding="MathML-Content" id="A3.p10.1.1.m1.1b"><apply id="A3.p10.1.1.m1.1.1.cmml" xref="A3.p10.1.1.m1.1.1"><gt id="A3.p10.1.1.m1.1.1.1.cmml" xref="A3.p10.1.1.m1.1.1.1"></gt><csymbol cd="latexml" id="A3.p10.1.1.m1.1.1.2.cmml" xref="A3.p10.1.1.m1.1.1.2">absent</csymbol><cn id="A3.p10.1.1.m1.1.1.3.cmml" type="integer" xref="A3.p10.1.1.m1.1.1.3">50</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="A3.p10.1.1.m1.1c">&gt;50</annotation><annotation encoding="application/x-llamapun" id="A3.p10.1.1.m1.1d">&gt; 50</annotation></semantics></math>% checkpoints.</span>
Lastly, we experiment with combining the two previous techniques on top of the baseline 3-parameter fit.</p>
</div>
<div class="ltx_para ltx_noindent" id="A3.p11">
<p class="ltx_p" id="A3.p11.1"><span class="ltx_text ltx_font_bold" id="A3.p11.1.1">Prediction Error.</span> We report prediction errors for each setup in Table <a class="ltx_ref" href="https://arxiv.org/html/2504.11393v1#A2.T4" title="Table 4 ‣ Appendix B Proxy Metric Definitions ‣ How to Predict Best Pretraining Data with Small Experiments"><span class="ltx_text ltx_ref_tag">4</span></a>. As the best scaling-law variants all perform roughly comparably to the simple 3-parameter setup, we use it as our baseline.</p>
<div class="ltx_pagination ltx_role_newpage"></div>
</div>
</section>
</article>
</div>
<footer class="ltx_page_footer">
<div class="ltx_page_logo">Generated on Tue Apr 15 16:57:41 2025 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/">LaTeXML</a>
</div></footer>
</div>
</body>
</html>

GitHub Events

Total
  • Push event: 3
  • Pull request event: 1
Last Year
  • Push event: 3
  • Pull request event: 1

Dependencies

pyproject.toml pypi
  • accelerate >=0.20.0
  • datasets >=2.14.0
  • numpy >=1.24.0
  • python-dotenv >=1.0.0
  • pyyaml >=6.0
  • rich >=13.0.0
  • scikit-learn >=1.3.0
  • scipy >=1.10.0
  • tensorboard >=2.13.0
  • tokenizers >=0.13.0
  • torch >=2.0.0
  • tqdm >=4.65.0
  • transformers >=4.30.0
  • wandb >=0.15.0
requirements.txt pypi
  • accelerate >=0.20.0
  • datasets >=2.14.0
  • numpy >=1.24.0
  • pytest >=7.0.0
  • pytest-cov >=4.0.0
  • pyyaml >=6.0
  • rich >=13.0.0
  • scikit-learn >=1.3.0
  • scipy >=1.10.0
  • tensorboard >=2.13.0
  • tokenizers >=0.13.0
  • torch >=2.0.0
  • tqdm >=4.65.0
  • transformers >=4.30.0
  • wandb >=0.15.0
uv.lock pypi
  • 112 dependencies