# Server-side Lighteval for SmolLM2
Plato ships a reference configuration for running federated SmolLM2 fine-tuning while evaluating the global model on the server with Lighteval.
Reference config: `configs/HuggingFace/fedavg_smol_smoltalk_smollm2_135m.toml`
## Install the evaluator stack
From the repository root:

```shell
uv sync --extra llm_eval
```

This installs the optional Lighteval runtime used by `evaluation.type = "lighteval"`.
## Run the reference configuration

```shell
uv run python plato.py --config configs/HuggingFace/fedavg_smol_smoltalk_smollm2_135m.toml
```
The run trains SmolLM2 on HuggingFaceTB/smol-smoltalk and evaluates the aggregated global model on the server after each round.
## Key configuration snippet

```toml
[server]
do_test = true

[trainer]
type = "HuggingFace"
model_type = "huggingface"
model_name = "HuggingFaceTB/SmolLM2-135M"
tokenizer_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
max_concurrency = 1
gradient_accumulation_steps = 4
gradient_checkpointing = true
target_perplexity = 50

[evaluation]
type = "lighteval"
preset = "smollm_round_fast"
primary_metric = "ifeval_avg"
backend = "transformers"
batch_size = 1
model_parallel = false
device = "cuda:0"
show_progress = true
max_samples = 32
```
## What the preset runs

`smollm_round_fast` evaluates five tasks:
- IFEval
- HellaSwag
- ARC Easy
- ARC Challenge
- PIQA
Plato normalizes their outputs into the following summary metrics:

- `ifeval_avg`
- `hellaswag`
- `arc_easy`
- `arc_challenge`
- `arc_avg`
- `piqa`
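The folding of per-task scores into summary metrics can be sketched as follows. The raw score names, values, and averaging rules here are illustrative assumptions, not Plato's actual aggregation code:

```python
from statistics import mean

# Hypothetical raw per-task scores as Lighteval might report them.
raw = {
    "ifeval_prompt_level_strict_acc": 0.30,
    "ifeval_inst_level_strict_acc": 0.40,
    "hellaswag_acc": 0.45,
    "arc_easy_acc": 0.60,
    "arc_challenge_acc": 0.35,
    "piqa_acc": 0.70,
}

# Fold raw scores into the summary metrics listed above; the averages
# (ifeval_avg, arc_avg) combine their constituent task scores.
summary = {
    "ifeval_avg": mean([raw["ifeval_prompt_level_strict_acc"],
                        raw["ifeval_inst_level_strict_acc"]]),
    "hellaswag": raw["hellaswag_acc"],
    "arc_easy": raw["arc_easy_acc"],
    "arc_challenge": raw["arc_challenge_acc"],
    "arc_avg": mean([raw["arc_easy_acc"], raw["arc_challenge_acc"]]),
    "piqa": raw["piqa_acc"],
}
print(summary["arc_avg"])  # 0.475
```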
## Sampling behaviour

`max_samples = 32` is a per-task cap, not a global cap.
That means the run evaluates up to 32 examples from each task, using a deterministic shuffled subset inside Lighteval. This is useful for fast round-by-round feedback, but it produces partial benchmark numbers rather than full leaderboard-style scores.
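The per-task semantics can be sketched like this, assuming a fixed-seed shuffle followed by a slice. The seed and task sizes are made up for illustration; this is not Lighteval's internal sampling code:

```python
import random

def per_task_subset(examples, max_samples=32, seed=1234):
    """Deterministically shuffle one task's examples and cap at max_samples.

    The cap applies per task, so every task contributes up to max_samples
    examples regardless of how many the other tasks contribute.
    """
    rng = random.Random(seed)  # fixed seed -> same subset every round
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:max_samples]

# Made-up task sizes just to show the per-task cap in action.
tasks = {"ifeval": list(range(500)), "piqa": list(range(2000))}
subsets = {name: per_task_subset(ex) for name, ex in tasks.items()}
print({name: len(s) for name, s in subsets.items()})  # {'ifeval': 32, 'piqa': 32}
```

Because the shuffle is seeded, repeated rounds evaluate the same subset, which keeps round-over-round numbers comparable even though they are partial.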
## Result logging
The runtime CSV is the canonical result log. Summary columns can be declared up front:
```toml
[results]
types = "round, elapsed_time, accuracy, train_loss, evaluation_ifeval_avg, evaluation_hellaswag, evaluation_arc_avg, evaluation_piqa"
```
Plato also appends detailed Lighteval metrics automatically when they appear, for example:

- `evaluation_ifeval_prompt_level_strict_acc`
- `evaluation_ifeval_prompt_level_loose_acc`
- `evaluation_ifeval_inst_level_loose_acc`
- `evaluation_arc_easy_acc`
- `evaluation_arc_challenge_acc_stderr`
- `evaluation_hellaswag_em`
- `evaluation_piqa_em`
So you can start with the summary columns above and still get the detailed task metrics in the same CSV file.
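The "declared columns plus dynamically appended metrics" behavior can be sketched with the standard `csv` module. The column handling and row values below are illustrative, not Plato's actual results writer:

```python
import csv
import io

# Columns declared up front (a subset of results.types, for brevity).
declared = ["round", "evaluation_ifeval_avg", "evaluation_arc_avg"]

# Hypothetical per-round results; a detailed metric first appears in round 2.
rows = [
    {"round": 1, "evaluation_ifeval_avg": 0.31, "evaluation_arc_avg": 0.44},
    {"round": 2, "evaluation_ifeval_avg": 0.35, "evaluation_arc_avg": 0.47,
     "evaluation_arc_easy_acc": 0.58},
]

# Start from the declared columns and append any new metric keys as they show up.
columns = list(declared)
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)  # rows missing a column get an empty cell via restval
header = buffer.getvalue().splitlines()[0]
print(header)
# round,evaluation_ifeval_avg,evaluation_arc_avg,evaluation_arc_easy_acc
```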
## Runtime notes

- `batch_size = 1` and `model_parallel = false` are conservative defaults intended for reliable server-side evaluation on multi-GPU hosts.
- `device = "cuda:0"` keeps the evaluator on one explicit GPU instead of relying on broad multi-GPU auto-detection.
- If you want evaluator failures to stop the run, add:

```toml
[evaluation]
fail_on_error = true
```

Otherwise Plato logs the evaluator exception and continues the training run without structured benchmark outputs for that round.
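The two failure modes can be sketched with a simple try/except wrapper. The function name and structure are illustrative, not Plato's implementation:

```python
import logging

def run_evaluation(evaluate, fail_on_error=False):
    """Run an evaluator callable, honoring a fail_on_error flag.

    With fail_on_error=True the evaluator exception propagates and stops the
    run; otherwise the exception is logged and the round continues with no
    structured benchmark outputs (None).
    """
    try:
        return evaluate()
    except Exception:
        if fail_on_error:
            raise
        logging.exception("Lighteval evaluation failed; continuing the round.")
        return None

# Hypothetical evaluator that always fails, to exercise both paths.
def broken_evaluator():
    raise RuntimeError("evaluator crashed")

print(run_evaluation(broken_evaluator, fail_on_error=False))  # None
```

With `fail_on_error=True` the same call would raise `RuntimeError` and abort the run instead.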