Server-side Lighteval for SmolLM2

Plato ships a reference configuration for running federated SmolLM2 fine-tuning while evaluating the global model on the server with Lighteval.

Reference config:

  • configs/HuggingFace/fedavg_smol_smoltalk_smollm2_135m.toml

Install the evaluator stack

From the repository root:

uv sync --extra llm_eval

This installs the optional Lighteval runtime used by evaluation.type = "lighteval".

Run the reference configuration

uv run python plato.py --config configs/HuggingFace/fedavg_smol_smoltalk_smollm2_135m.toml

The run trains SmolLM2 on HuggingFaceTB/smol-smoltalk and evaluates the aggregated global model on the server after each round.

Key configuration snippet

[server]
do_test = true

[trainer]
type = "HuggingFace"
model_type = "huggingface"
model_name = "HuggingFaceTB/SmolLM2-135M"
tokenizer_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
max_concurrency = 1
gradient_accumulation_steps = 4
gradient_checkpointing = true
target_perplexity = 50

[evaluation]
type = "lighteval"
preset = "smollm_round_fast"
primary_metric = "ifeval_avg"
backend = "transformers"
batch_size = 1
model_parallel = false
device = "cuda:0"
show_progress = true
max_samples = 32
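In the [trainer] section, gradient_accumulation_steps = 4 means gradients from four micro-batches are summed before each optimizer step, giving a larger effective batch size at the same memory cost. A minimal pure-Python sketch of the pattern (illustrative only, not Plato's trainer code):

```python
def run_accumulated_updates(grads, accumulation_steps=4):
    """Simulate gradient accumulation: sum `accumulation_steps` micro-batch
    gradients, then apply one averaged optimizer step.
    `grads` is a list of scalar gradients, one per micro-batch."""
    updates = []
    buf = 0.0
    for i, g in enumerate(grads, start=1):
        buf += g  # accumulate instead of stepping immediately
        if i % accumulation_steps == 0:
            updates.append(buf / accumulation_steps)  # one averaged step
            buf = 0.0
    return updates

# Eight micro-batch gradients produce two optimizer steps
print(run_accumulated_updates([1, 2, 3, 4, 5, 6, 7, 8]))  # [2.5, 6.5]
```

With gradient_checkpointing = true alongside it, the snippet above trades compute for memory in the same spirit: fewer, larger-effective steps on constrained hardware.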

What the preset runs

smollm_round_fast evaluates five tasks:

  • IFEval
  • HellaSwag
  • ARC Easy
  • ARC Challenge
  • PIQA

Plato normalizes their outputs into the following summary metrics:

  • ifeval_avg
  • hellaswag
  • arc_easy
  • arc_challenge
  • arc_avg
  • piqa
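The averaged entries combine related task scores: arc_avg pools the two ARC splits, and ifeval_avg pools IFEval's sub-accuracies. A hypothetical sketch of that collapse (Plato's actual normalization code may differ in the keys and weighting it uses):

```python
def summarize(raw):
    """Collapse per-task scores into summary metrics.
    Illustrative sketch; not Plato's internal implementation."""
    out = {
        "hellaswag": raw["hellaswag"],
        "arc_easy": raw["arc_easy"],
        "arc_challenge": raw["arc_challenge"],
        "piqa": raw["piqa"],
    }
    # arc_avg: mean of the two ARC splits
    out["arc_avg"] = (raw["arc_easy"] + raw["arc_challenge"]) / 2
    # ifeval_avg: mean over IFEval's prompt/instruction-level accuracies
    ifeval_keys = [k for k in raw if k.startswith("ifeval_")]
    out["ifeval_avg"] = sum(raw[k] for k in ifeval_keys) / len(ifeval_keys)
    return out

scores = summarize({
    "hellaswag": 0.42, "arc_easy": 0.60, "arc_challenge": 0.30, "piqa": 0.65,
    "ifeval_prompt_level_strict_acc": 0.20,
    "ifeval_prompt_level_loose_acc": 0.30,
})
print(round(scores["arc_avg"], 2), round(scores["ifeval_avg"], 2))  # 0.45 0.25
```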

Sampling behaviour

max_samples = 32 is a per-task cap, not a global cap.

That means the run evaluates up to 32 examples from each task, drawn as a deterministically shuffled subset inside Lighteval; with the five tasks above, each round therefore scores at most 160 examples in total. This is useful for fast round-by-round feedback, but it produces partial benchmark numbers rather than full leaderboard-style scores.
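One way to picture the per-task cap, assuming a seeded shuffle followed by truncation (the helper below is illustrative and does not correspond to Lighteval's API):

```python
import random

def per_task_subset(examples, max_samples=32, seed=0):
    """Sketch of a per-task cap: shuffle with a fixed seed, then truncate.
    The key property is that each task contributes at most `max_samples`
    examples, and the same seed reproduces the same subset every round."""
    rng = random.Random(seed)
    shuffled = examples[:]  # never mutate the caller's list
    rng.shuffle(shuffled)
    return shuffled[:max_samples]

docs = [f"ex-{i}" for i in range(500)]
subset = per_task_subset(docs)
assert per_task_subset(docs) == subset  # same seed -> same subset
print(len(subset))  # 32
```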

Result logging

The runtime CSV is the canonical result log. Summary columns can be declared up front:

[results]
types = "round, elapsed_time, accuracy, train_loss, evaluation_ifeval_avg, evaluation_hellaswag, evaluation_arc_avg, evaluation_piqa"

Plato also appends detailed Lighteval metrics automatically when they appear, for example:

  • evaluation_ifeval_prompt_level_strict_acc
  • evaluation_ifeval_prompt_level_loose_acc
  • evaluation_ifeval_inst_level_loose_acc
  • evaluation_arc_easy_acc
  • evaluation_arc_challenge_acc_stderr
  • evaluation_hellaswag_em
  • evaluation_piqa_em

So you can start with the summary columns above and still get the detailed task metrics in the same CSV file.
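Because both summary and detailed metrics land in one CSV, downstream analysis can stay simple. A small sketch with the standard csv module, using made-up file contents (the real column set depends on your run):

```python
import csv
import io

# Hypothetical CSV content standing in for the run's results file.
raw = """round,accuracy,evaluation_ifeval_avg,evaluation_arc_easy_acc
1,0.31,0.18,0.52
2,0.35,0.22,0.55
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Summary and detailed metrics share the file; pick whichever columns exist.
ifeval = [float(r["evaluation_ifeval_avg"]) for r in rows]
print(ifeval)  # [0.18, 0.22]
```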

Runtime notes

  • batch_size = 1 and model_parallel = false are conservative defaults intended for reliable server-side evaluation on multi-GPU hosts.
  • device = "cuda:0" keeps the evaluator on one explicit GPU instead of relying on broad multi-GPU auto-detection.
  • If you want evaluator failures to abort the run instead of being tolerated, add:

[evaluation]
fail_on_error = true

Otherwise, Plato logs the evaluator exception and continues the training run; the affected round simply has no structured benchmark outputs.
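The fail_on_error behaviour described above amounts to a catch-and-continue default with an opt-in re-raise. A sketch of that control flow (names are illustrative and do not correspond to Plato internals):

```python
import logging

def evaluate_round(run_eval, fail_on_error=False):
    """Swallow evaluator exceptions by default; re-raise when requested.
    Illustrative only; not Plato's implementation."""
    try:
        return run_eval()
    except Exception:
        if fail_on_error:
            raise  # stop the federated run
        logging.exception("Evaluator failed; continuing without metrics")
        return None  # round proceeds with no benchmark outputs

def broken_eval():
    raise RuntimeError("evaluator crashed")

print(evaluate_round(broken_eval))  # None: failure logged, run continues
```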