Plato's evaluator subsystem adds structured benchmark outputs on top of the trainer's normal scalar test metric.
A trainer's TestingStrategy still returns a single value such as accuracy or perplexity; when an [evaluation] section is configured, Plato additionally runs an evaluator and stores richer benchmark results in the trainer context for server-side logging.
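Based on the configuration keys mentioned on this page, a minimal [evaluation] section might look like the following sketch (key placement is illustrative):

```toml
[evaluation]
# Which evaluator to run; "nanochat_core" requires the nanochat trainer.
type = "nanochat_core"
# Optional: stop the run on evaluator failure (defaults to non-fatal).
fail_on_error = false
```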
Runtime flow
The evaluation path begins when TestingStrategy.test_model(...) computes the trainer's scalar test metric; run_configured_evaluation(...) then runs the configured evaluator.
nanochat_core is not registered in the evaluator registry. Instead, Plato's nanochat
trainer (plato/trainers/nanochat.py) creates a NanochatCoreEvaluator directly
and supplies it as an override when the configuration sets [evaluation] type = "nanochat_core".
Using this evaluator type with any other trainer (e.g., the HuggingFace, basic,
or composable trainers) produces no evaluation output and no error: the runner
silently skips it.
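The override behaviour described above can be sketched as follows. This is an illustrative simplification, not Plato's actual internals: the registry contents, function signature, and the stand-in NanochatCoreEvaluator class are all placeholders.

```python
# Illustrative sketch of the override behaviour; the registry contents and
# function signature are simplified, not Plato's actual internals.
EVALUATOR_REGISTRY = {}  # "nanochat_core" is deliberately absent


def resolve_evaluator(eval_type, override=None):
    """Return an evaluator for eval_type, preferring a trainer-supplied
    override; return None (silent skip) if neither is available."""
    if override is not None:
        return override
    return EVALUATOR_REGISTRY.get(eval_type)  # None -> runner skips silently


class NanochatCoreEvaluator:  # stand-in for the real evaluator class
    pass


# The nanochat trainer constructs the evaluator directly and passes it in:
evaluator = resolve_evaluator("nanochat_core", override=NanochatCoreEvaluator())

# Any other trainer supplies no override, so evaluation is silently skipped:
skipped = resolve_evaluator("nanochat_core")  # -> None
```

The key point is the asymmetry: only the nanochat trainer knows how to build the evaluator, so every other trainer falls through to the empty registry lookup.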
nanochat_core tokenizer prerequisite
In addition to uv sync --extra nanochat, this evaluator requires a trained
Nanochat tokenizer under ~/.cache/nanochat/tokenizer/. Plato can auto-download
the CORE bundle, but it does not auto-create the tokenizer.
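Since Plato does not auto-create the tokenizer, a pre-flight check can fail fast before a long run. The helper below is hypothetical (not a Plato API); only the cache path comes from this page.

```python
from pathlib import Path

# Hypothetical pre-flight check (not a Plato API): confirm a trained
# Nanochat tokenizer is present before enabling the nanochat_core evaluator.
TOKENIZER_DIR = Path.home() / ".cache" / "nanochat" / "tokenizer"


def tokenizer_ready(path: Path = TOKENIZER_DIR) -> bool:
    """True if the tokenizer directory exists and contains files."""
    return path.is_dir() and any(path.iterdir())
```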
run_configured_evaluation(...) treats evaluator failures as non-fatal by default:
- If evaluation.fail_on_error is false or omitted, Plato logs the exception and continues without structured evaluation metrics.
- If evaluation.fail_on_error is true, the exception is re-raised and the run stops.
The non-fatal default keeps long training runs alive when the evaluator stack is optional.
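The fail_on_error semantics can be sketched as a small guard; the function name and signature here are illustrative, not run_configured_evaluation's real implementation.

```python
import logging

# Sketch of the non-fatal default described above; names are illustrative,
# not Plato's actual run_configured_evaluation implementation.
def run_evaluation_guarded(evaluator_fn, fail_on_error=False):
    """Run evaluator_fn; swallow and log failures unless fail_on_error."""
    try:
        return evaluator_fn()
    except Exception:
        if fail_on_error:
            raise  # evaluation.fail_on_error = true stops the run
        logging.exception("Evaluator failed; continuing without structured metrics.")
        return None  # training proceeds with only the scalar test metric
```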
Lighteval-specific behaviour
Plato's Lighteval adapter adds several integration details on top of upstream Lighteval:
- task aliases are mapped to the concrete upstream ids:
  - arc_easy → arc:easy
  - arc_challenge → arc:challenge
  - piqa → custom piqa_hf
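The alias table above can be expressed as a simple lookup; the mapping itself comes from this page, but the helper function is illustrative.

```python
# Alias table from this page; the resolver function is illustrative.
LIGHTEVAL_TASK_ALIASES = {
    "arc_easy": "arc:easy",
    "arc_challenge": "arc:challenge",
    "piqa": "piqa_hf",  # custom task, not the upstream piqa id
}


def resolve_task_id(alias):
    """Map a Plato task alias to its upstream Lighteval id (pass through
    anything without an alias)."""
    return LIGHTEVAL_TASK_ALIASES.get(alias, alias)
```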
- summary metrics are normalized into stable Plato names:
  - ifeval_avg
  - hellaswag
  - arc_easy
  - arc_challenge
  - arc_avg
  - piqa
- detailed per-task metrics are also exported to the CSV, for example:
  - evaluation_ifeval_prompt_level_strict_acc
  - evaluation_arc_easy_acc
  - evaluation_arc_challenge_acc_stderr
- safe runtime defaults are used for server-side evaluation:
  - batch_size = 1
  - model_parallel = false
  - device = Config.device()
  - dtype inferred from trainer.bf16 / trainer.fp16
- when trainer.max_concurrency spawns a subprocess for testing, Plato persists evaluator state so the parent server process still logs the structured metrics.
The adapter also exposes the preset smollm_round_fast, used by the SmolLM2 server-side evaluation example.
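The dtype inference mentioned among the runtime defaults can be sketched as below; the adapter's real logic may differ in details, and the function name is illustrative.

```python
# Illustrative dtype inference from the trainer flags mentioned above;
# the real adapter's logic may differ in details.
def infer_dtype(bf16=False, fp16=False):
    """Pick an evaluation dtype string from trainer.bf16 / trainer.fp16."""
    if bf16:
        return "bfloat16"
    if fp16:
        return "float16"
    return "float32"  # safe default when neither flag is set
```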
Server logging contract
- Evaluator metrics appear in the runtime CSV under the evaluation_ prefix: every key in EvaluationResult.metrics becomes evaluation_<metric>.
- Lighteval additionally exports detailed task-level metrics from the evaluator metadata.
- The CSV schema expands automatically when new evaluation columns appear.
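The prefixing and the expanding schema can be sketched together; the helper names are illustrative, and only the evaluation_ prefix rule comes from the contract above.

```python
import csv
import io

# Sketch of the logging contract: every metric key gains the evaluation_
# prefix, and the CSV header simply grows when new columns appear.
def prefixed(metrics):
    """Turn EvaluationResult-style metrics into CSV column names."""
    return {f"evaluation_{k}": v for k, v in metrics.items()}


def write_rounds(rows):
    """Write rows whose union of keys defines the (expanding) CSV schema;
    rounds missing a column get an empty cell."""
    fields = sorted({key for row in rows for key in row})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```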
See Evaluation for configuration details and Results for logging behaviour.
Extending Plato with new evaluators
A good custom evaluator should:
- Accept a lightweight config object.
- Use request.context.state for temporary coordination instead of mutating the trainer directly.
- Return a normalized EvaluationResult with small summary metrics in metrics.
- Store heavier nested details inside metadata if they are still useful for downstream inspection.
- Choose a stable primary_metric so CSV dashboards and automated comparisons have a clear headline number.
For larger integrations, pair the evaluator with a dedicated documentation example under docs/docs/examples/ and a smoke test under tests/evaluators/.
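Putting the checklist together, a minimal evaluator might look like the sketch below. Treat the names as illustrative: EvaluationResult's real fields and the request object's shape in Plato may differ.

```python
from dataclasses import dataclass, field

# Hypothetical skeleton following the checklist above; EvaluationResult's
# real fields in Plato may differ from this sketch.
@dataclass
class EvaluationResult:
    metrics: dict = field(default_factory=dict)   # small summary numbers
    metadata: dict = field(default_factory=dict)  # heavier nested details
    primary_metric: str = ""                      # stable headline key


class MyEvaluator:
    def __init__(self, config):
        self.config = config  # lightweight config object

    def evaluate(self, request):
        # Coordinate through request.context.state rather than mutating
        # the trainer directly.
        request.context.state["my_evaluator_ran"] = True
        return EvaluationResult(
            metrics={"my_score": 0.0},
            metadata={"per_task": {}},
            primary_metric="my_score",
        )
```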