Skip to content

Evaluation

Plato supports an optional [evaluation] section for structured server-side evaluation. This runs after the trainer's regular test metric (for example accuracy or perplexity) and records named benchmark metrics under the evaluation_ prefix in the runtime CSV.

Use this section when you want benchmark-style outputs such as IFEval, ARC, HellaSwag, PIQA, or Nanochat CORE instead of only a single scalar test metric.

When evaluation runs

Structured evaluation is triggered from the trainer's test flow, so it depends on server-side testing being enabled:

[server]
do_test = true

If [evaluation] is omitted, Plato only records the trainer's normal scalar metric.

Common options

type

The evaluator backend to run.

Built-in values include:

  • lighteval for Hugging Face's Lighteval benchmark runner.
  • nanochat_core for Nanochat's CORE benchmark. Requires trainer.type = "nanochat". This evaluator is not registered in the general evaluator registry; it is wired internally by the nanochat trainer. Using it with any other trainer type produces no evaluation output and no error.

fail_on_error

Whether evaluator failures should abort the run.

Default value: false

When false, Plato logs the evaluator exception and continues without structured evaluation metrics. Set this to true when the evaluation itself is a required part of the experiment.

Built-in evaluators

Evaluator Install path Primary output style Typical use
lighteval uv sync --extra llm_eval Named benchmark metrics such as ifeval_avg and arc_avg Server-side LLM evaluation
nanochat_core uv sync --extra nanochat core_metric Nanochat benchmark runs — requires trainer.type = "nanochat" and a trained Nanochat tokenizer under ~/.cache/nanochat/tokenizer/

Lighteval

Plato's Lighteval adapter wraps the lighteval package and normalizes its task outputs into CSV-friendly metrics.

Supported options

preset

Name of the built-in task preset.

Current built-in value:

  • smollm_round_fast

This preset runs:

  • ifeval
  • hellaswag
  • arc_easy
  • arc_challenge
  • piqa

primary_metric

The summary metric to treat as the evaluator's primary output.

For smollm_round_fast, the default is ifeval_avg.

backend

Lighteval execution backend.

Supported values in Plato's current integration include:

  • transformers
  • accelerate

transformers and accelerate currently resolve to the same safe server-side launcher path in Plato.

batch_size

Evaluation batch size passed to Lighteval.

Default value: 1

Plato intentionally defaults to 1 to avoid aggressive auto-probing on multi-GPU systems.

max_length

Optional maximum sequence length passed to the Lighteval transformers backend.

max_samples

Optional per-task sample cap.

Example: max_samples = 32 runs up to 32 examples for each configured task. Lighteval shuffles deterministically before truncating, so the subset is stable across runs.

Partial benchmark

When max_samples is set, benchmark numbers are partial and should not be compared directly with full-dataset leaderboard runs.

model_parallel

Whether Lighteval should shard the evaluated model across multiple GPUs.

Default value: false

dtype

Optional evaluation dtype override.

If omitted, Plato infers a sensible default from the trainer configuration:

  • trainer.bf16 = truebfloat16
  • trainer.fp16 = truefloat16

device

Device string for evaluation, such as cuda:0, cuda:1, or cpu.

If omitted, Plato uses Config.device().

show_progress

Whether to show the coarse-grained server-side Lighteval progress bar.

Default value: true

Reference example

The configuration configs/HuggingFace/fedavg_smol_smoltalk_smollm2_135m.toml uses Lighteval like this:

[server]
do_test = true

[evaluation]
type = "lighteval"
preset = "smollm_round_fast"
primary_metric = "ifeval_avg"
backend = "transformers"
batch_size = 1
model_parallel = false
device = "cuda:0"
show_progress = true
max_samples = 32

Metrics exported to the CSV

Lighteval summary metrics are written as:

  • evaluation_ifeval_avg
  • evaluation_hellaswag
  • evaluation_arc_easy
  • evaluation_arc_challenge
  • evaluation_arc_avg
  • evaluation_piqa

Plato also exports detailed Lighteval task metrics as additional CSV columns when they are present, for example:

  • evaluation_ifeval_prompt_level_strict_acc
  • evaluation_ifeval_inst_level_loose_acc
  • evaluation_arc_easy_acc
  • evaluation_arc_challenge_acc_stderr
  • evaluation_hellaswag_em
  • evaluation_piqa_em

These columns are added to the CSV automatically the first time they appear.

Nanochat CORE

Nanochat's CORE benchmark is also available through [evaluation].

Tokenizer required

nanochat_core does not just need the nanochat Python dependencies. It also requires a trained Nanochat tokenizer under ~/.cache/nanochat/tokenizer/ (notably tokenizer.pkl and token_bytes.pt).

Plato can download the CORE evaluation bundle automatically, but it does not create the tokenizer automatically.

See Nanochat in Plato for the full setup sequence, including tokenizer training.

Supported options

bundle_dir

Optional directory containing the downloaded CORE evaluation bundle.

If omitted, Plato resolves the Nanochat base directory automatically and downloads the bundle when needed.

max_per_task

Optional cap on the number of examples per CORE task.

Default value: -1, which means use all available examples.

Example

Requires the nanochat trainer

nanochat_core is only wired up when trainer.type = "nanochat". The nanochat trainer creates the evaluator internally rather than looking it up in the registry. Setting [evaluation] type = "nanochat_core" with any other trainer type silently produces no evaluation output.

[trainer]
type = "nanochat"

[evaluation]
type = "nanochat_core"
max_per_task = 16

This evaluator exports core_metric, which can be listed in [results].types.

Results logging

Structured evaluator metrics are written directly into the runtime CSV in result_path. The CSV is the authoritative log for evaluator outputs.

See Results for how evaluator columns are named and expanded at runtime.