FedAtt is a server aggregation algorithm in which client updates are aggregated using a layer-wise, attention-based mechanism that considers the similarity between the server and client models. The objective is to improve the accuracy or perplexity of the trained model within the same number of communication rounds.
The strategy’s attention scores (fedatt_server_strategy.py:63-78) line up with Eq. (2) of Ji et al. (2019) by softmaxing the layer-wise parameter distances; the update step (fedatt_server_strategy.py:80-101) implements Eq. (4) with the optional Gaussian randomization from Eq. (6), using epsilon and magnitude as the paper’s ϵ and β.
The author's own reference implementation, shaoxiongji/fed-att/src/agg/aggregate.py, follows the same pipeline (norm-based attention → softmax → weighted sum of server–client deltas → stepsize + noise), and every operation has a direct counterpart in the current Plato strategy implementation in fedatt_server_strategy.py.
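The pipeline described above (norm-based attention, softmax, weighted sum of server-client differences, step size plus noise) can be sketched for a single flattened layer as follows. This is a minimal illustration in plain Python, not the code in fedatt_server_strategy.py; the function name fedatt_layer_update and the use of Python lists in place of tensors are assumptions made for readability.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fedatt_layer_update(server_layer, client_layers, epsilon=1.0,
                        magnitude=0.0, rng=None):
    """Return (new_layer, attention) for one flattened layer.

    Hypothetical helper: `epsilon` is the step size and `magnitude`
    scales the optional Gaussian noise.
    """
    rng = rng or random.Random(0)
    # Attention score: softmax of the L2 distance between the server
    # layer and each client's copy of that layer.
    distances = [
        math.sqrt(sum((s - c) ** 2 for s, c in zip(server_layer, layer)))
        for layer in client_layers
    ]
    attention = softmax(distances)

    new_layer = []
    for i, s in enumerate(server_layer):
        # Attention-weighted sum of server-client differences.
        step = sum(a * (s - layer[i])
                   for a, layer in zip(attention, client_layers))
        # Optional Gaussian randomization, scaled by `magnitude`.
        noise = magnitude * rng.gauss(0.0, 1.0)
        new_layer.append(s - epsilon * (step + noise))
    return new_layer, attention
```

With magnitude left at 0.0 the update is deterministic, which makes it easy to check that clients farther from the server receive larger attention scores.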
FedAdp
FedAdp is another server aggregation algorithm. It exploits the implicit connection between the data distribution on a client and that client's contribution to the global model, which the server measures by inferring gradient information from participating clients.
The original paper did not come with source code. This example implementation follows Algorithm 1 in Wu & Wang (TCCN 2021) in the following ways: sample-weighted aggregation of local gradients (Alg. 1 line 9 / Eq. 7) happens in aggregate_deltas; the smoothed angle update (Eq. 8) is implemented via the running average in calc_contribution; node contribution uses the Gompertz mapping (Eq. 9) at fedadp_server_strategy.py:139-147; and final weights apply the Softmax-style normalization with dataset sizes (Eq. 10) in calc_adaptive_weighting. The guardrails for zero norms/weights simply prevent numerical issues and do not change the prescribed behavior.
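The four steps listed above can be sketched in plain Python as shown below. The helper names are hypothetical and do not correspond one-to-one to the Plato methods. The Gompertz form, the default alpha = 5, and the zero-norm fallback are assumptions of this sketch based on a reading of the paper, so check them against Eq. 8-10 before relying on them.

```python
import math


def angle_between(global_grad, local_grad):
    """Instantaneous angle between the global and a local gradient."""
    dot = sum(g * l for g, l in zip(global_grad, local_grad))
    norm_g = math.sqrt(sum(g * g for g in global_grad))
    norm_l = math.sqrt(sum(l * l for l in local_grad))
    if norm_g == 0.0 or norm_l == 0.0:
        # Guard chosen for this sketch only: treat as orthogonal.
        return math.pi / 2
    cosine = max(-1.0, min(1.0, dot / (norm_g * norm_l)))
    return math.acos(cosine)


def smoothed_angle(previous, instantaneous, round_number):
    """Running average of the angle over rounds (1-indexed)."""
    if round_number <= 1:
        return instantaneous
    return ((round_number - 1) * previous + instantaneous) / round_number


def gompertz(theta, alpha=5.0):
    """Map an angle to a contribution score; smaller angles score higher."""
    return alpha * (1.0 - math.exp(-math.exp(-alpha * (theta - 1.0))))


def adaptive_weights(angles, num_samples, alpha=5.0):
    """Softmax-style weights that also account for dataset sizes."""
    scores = [n * math.exp(gompertz(t, alpha))
              for t, n in zip(angles, num_samples)]
    total = sum(scores)
    if total == 0.0:
        # Guard for this sketch: fall back to uniform weights.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]
```

When all clients share the same smoothed angle, the weights reduce to the usual sample-proportional FedAvg weights.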
FedNova
FedNova addresses the objective inconsistency problem in heterogeneous federated optimization, where clients may perform different numbers of local training epochs. The algorithm normalizes local updates based on the number of local steps taken, enabling effective aggregation across heterogeneous clients. On the client side, each client randomly selects the number of local epochs (between 2 and max_local_epochs) and trains accordingly. The number of epochs is included in the report sent to the server. The server then computes the effective number of steps (tau_eff) across all clients and normalizes each client's update by their individual number of local epochs, ensuring fair aggregation.
The strategy implements Eq. (12) from Wang et al. (NeurIPS 2020): it first accumulates $\tau_{\mathrm{eff}} = \sum_i p_i \tau_i$, then averages the normalized updates with weight $p_i \tau_{\mathrm{eff}} / \tau_i$. This matches the description in §5/fn.11-12 of the paper.
The official FedNova code — JYWa/FedNova, distoptim/FedNova.py — uses the same scaling factor tau_eff / local_normalizing_vec, confirming that this example implementation mirrors the reference source.
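As a rough illustration of that scaling, the sketch below aggregates flattened client deltas in plain Python. The function name fednova_aggregate and its arguments are hypothetical; here p_i is taken to be each client's share of the total sample count, and local_steps plays the role of the per-client number of local epochs reported to the server.

```python
def fednova_aggregate(deltas, num_samples, local_steps):
    """Return (aggregated_delta, tau_eff) for flattened client deltas.

    Hypothetical helper: each delta is a list of floats of equal length.
    """
    total_samples = sum(num_samples)
    # p_i: relative data share of each client.
    shares = [n / total_samples for n in num_samples]
    # tau_eff: sample-weighted average number of local steps.
    tau_eff = sum(p * tau for p, tau in zip(shares, local_steps))

    aggregated = [0.0] * len(deltas[0])
    for delta, p, tau in zip(deltas, shares, local_steps):
        # Normalize by the client's own step count, rescale by tau_eff.
        scale = p * tau_eff / tau
        for j, value in enumerate(delta):
            aggregated[j] += scale * value
    return aggregated, tau_eff
```

If every client runs the same number of local steps, tau_eff equals that number and the result coincides with sample-weighted averaging.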
FedDF
FedDF (Federated Distillation and Fusion) replaces direct parameter averaging with server-side distillation on a proxy set. The server selects a deterministic unlabeled proxy subset, ships those proxy inputs alongside the current global weights, and each client returns teacher logits on that shared payload instead of ordinary weight deltas. The server then aggregates those logits and distills their ensemble into the next global model using temperature-scaled soft targets.
The module split follows the FedDF workflow directly: feddf.py stays as a thin launcher, feddf_server.py packages the current global weights together with the shared proxy inputs, feddf_client.py performs the standard local update and then emits logits on that server-supplied proxy payload, feddf_server_strategy.py resolves the deterministic proxy subset and routes the logits payload through direct weight aggregation, and feddf_algorithm.py encapsulates the weighted-logit ensemble plus the temperature-scaled KL distillation step.
The configuration surface above mirrors the paper’s core knobs. proxy_set_size, proxy_batch_size, and proxy_seed control the deterministic unlabeled proxy data used for ensemble distillation, temperature shapes the softened teacher distribution, and distillation_epochs, distillation_batch_size, and learning_rate control the server-side student optimization that replaces direct averaging.
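To make the roles of the ensemble and of temperature concrete, the sketch below computes a weighted logit ensemble and a temperature-scaled KL distillation loss for a single proxy sample. It is a plain-Python illustration rather than the code in feddf_algorithm.py; the function names and the per-client weights argument (for example, sample counts) are assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]


def ensemble_logits(client_logits, weights):
    """Weighted average of client logits for one proxy sample."""
    total = sum(weights)
    num_classes = len(client_logits[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, client_logits)) / total
        for c in range(num_classes)
    ]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    Some implementations additionally multiply by temperature ** 2;
    that factor is omitted in this sketch.
    """
    log_teacher = log_softmax(teacher_logits, temperature)
    log_student = log_softmax(student_logits, temperature)
    return sum(
        math.exp(lt) * (lt - ls) for lt, ls in zip(log_teacher, log_student)
    )
```

A higher temperature flattens the teacher distribution, so the student also receives signal from the non-argmax classes.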
MOON (client-training customization)
MOON is included in Plato as a client-training customization rather than as a server-aggregation
rule. The server still performs sample-weighted FedAvg; the distinguishing mechanism is the
contrastive local objective together with the historical-model buffer maintained by each client.
See 5. Algorithms with Customized Client Training Loops for the runnable example, key
configuration parameters, and the implementation alignment notes for MOON.
Attack-Adaptive Aggregation
Attack-Adaptive Aggregation is a robust server aggregation algorithm designed to defend against malicious clients in federated learning. The algorithm works by calculating the cosine similarity between the baseline weights and each client's weight deltas, then applying a temperature-scaled softmax to compute attention weights for each client. These attention weights are used to perform a weighted aggregation of client updates, giving more weight to clients whose updates are more aligned with the server model.
The Plato strategy keeps the original pipeline in the paper: it stacks floating-point client deltas, applies layer-wise PCA, seeds the estimator with the per-channel median, and feeds query/key pairs through a multi-pass attention block with temperature scaling and ϵ-thresholding before normalising weights, exactly mirroring Algorithm 1 and the PCA discussion in Sections 3.2–4.1 of Wan & Chen (2021). Refer to attack_adaptive_server_strategy.py:425-486 and the shared attention module in attack_adaptive_server_strategy.py:174-289.
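The weighting idea can be illustrated without the learned attention module or PCA. The deliberately simplified sketch below scores each client by cosine similarity to the coordinate-wise median, applies a temperature-scaled softmax, zeroes weights below a threshold, and renormalizes with a uniform fallback. The function names are hypothetical, and the default values simply reuse the scaling factor and threshold shown in the configuration later in this section. It is not the multi-pass attention block used by the strategy.

```python
import math
import statistics


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def robust_weights(client_deltas, scaling_factor=10.0, threshold=0.005):
    """Simplified attention-style weights over flattened client deltas."""
    dim = len(client_deltas[0])
    # Robust reference: coordinate-wise median of the client deltas.
    reference = [
        statistics.median(delta[j] for delta in client_deltas)
        for j in range(dim)
    ]
    scaled = [scaling_factor * cosine(reference, d) for d in client_deltas]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Drop clients whose softmax weight falls below the threshold.
    weights = [w if w >= threshold else 0.0 for w in weights]
    kept = sum(weights)
    if kept <= 0.0 or math.isnan(kept):
        # Fall back to uniform weights rather than dividing by zero.
        return [1.0 / len(client_deltas)] * len(client_deltas)
    return [w / kept for w in weights]
```

In this toy version, an update pointing away from the median direction receives a near-zero softmax weight and is removed by the threshold.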
The attention trainer, included in examples/server_aggregation/attack_adaptive/pretraining, reuses the Plato strategy, learning weights via an L1 loss on captured round projections, matching the paper’s goal of approximating the robust mean by data-driven weighting. Relevant code can be found at examples/server_aggregation/attack_adaptive/pretraining/train_attention.py:78-156.
This example, referred to below as the Plato implementation, also closely matches the authors' reference implementation. The main differences are:
PCA handling is more defensive in Plato: _pca supports both over- and under-determined SVD paths with fallback (attack_adaptive_server_strategy.py:128-153), whereas the original convert_pca.getPCA_torch_over assumes the over-determined case (Attack-Adaptive-Aggregation-in-Federated-Learning/utils/convert_pca.py:19-45).
Plato discovers trainable parameter names from the live trainer context and skips non-floating tensors at stack time (attack_adaptive_server_strategy.py:425-436), while the reference deletes keys by mutating the stacked dict in place (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/attention.py:104-110).
When no checkpoint exists, Plato caches and persists a random attention state so subsequent rounds stay deterministic (attack_adaptive_server_strategy.py:439-466); the upstream code simply expects ./aaa/attention.pt to be present (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/attention.py:114-126).
Plato normalises weights with a NaN guard that falls back to uniform weights (attack_adaptive_server_strategy.py:480-484), while the original normalises blindly (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/attention.py:118-123).
Dataset capture is built into the aggregation step: each round records the projection, attention output, FedAdp-style reference weights, sample counts, and metadata (attack_adaptive_server_strategy.py:487-651). The upstream workflow instead saves PCA tensors and a benign/malicious label vector when --save_model_weights is enabled (Attack-Adaptive-Aggregation-in-Federated-Learning/server.py:237-279 and _main.py:64-85), leaving downstream scripts to derive targets.
Plato's trainer uses a concise CLI with standard argparse flags and optional validation split (examples/server_aggregation/attack_adaptive/pretraining/train_attention.py:21-110); the reference script expects path lists via @args files and logs to per-hyperparameter text files (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/train_attention.py:1-210).
Pretraining the Attack-Adaptive Attention Module
Because Plato saves randomly initialised weights when no checkpoint is present, ensure you replace them with a trained model before benchmarking—otherwise aggregation reduces to a random attention mask. Follow the steps below to obtain a checkpoint that matches the attack-adaptive aggregation algorithm.
1. Capture rounds from a simulation
Edit your experiment TOML so the attack-adaptive strategy records each round. Add the following keys under algorithm (directories are created if missing):

    [algorithm]
    # Aggregation algorithm
    type = "fedavg"

    # Scaling factor (temperature) used by the attention module
    scaling_factor = 10

    # Path to the trained attention model shipped with the attack-adaptive paper.
    # Provide your own checkpoint before running the example.
    attention_model_path = "./attention_model.pt"

    # Number of PCA components per layer before feeding into attention.
    pca_components = 10

    # Threshold applied after the softmax step (epsilon in the paper).
    threshold = 0.005

    # Attention network hyperparameters from the reference implementation.
    attention_loops = 5
    attention_hidden = 32

    # Optional: set to capture per-round tensors for pretraining the attention module.
    # dataset_capture_dir = "./attack_adaptive_dataset"
Once algorithm.dataset_capture_dir is set, each run produces a timestamped folder under
./attack_adaptive_dataset/ (for example run_20251015-211214) containing one
round_XXXXX.pt file per round and a metadata.json summary. If the key is
left unset, capture is skipped and no folder will be created.
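Before training, it can be useful to confirm that a capture run actually produced round files. The sketch below uses only pathlib and relies on the folder and file naming described above (run_* folders containing round_*.pt files); the helper name latest_capture is hypothetical and is not part of Plato.

```python
from pathlib import Path


def latest_capture(capture_root):
    """Return (latest_run_dir, sorted_round_files) under capture_root.

    Returns (None, []) when the root or any run folder is missing.
    """
    root = Path(capture_root)
    if not root.is_dir():
        return None, []
    # Timestamped folder names sort chronologically as plain strings.
    runs = sorted(p for p in root.glob("run_*") if p.is_dir())
    if not runs:
        return None, []
    latest = runs[-1]
    rounds = sorted(latest.glob("round_*.pt"))
    return latest, rounds
```

For example, latest_capture("./attack_adaptive_dataset") returns the most recent run folder together with its round files, or (None, []) if capture was never enabled.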
2. Train the attention network
Point the pretraining script to the captured directory and choose where to save
the checkpoint:
--val-ratio (default 0.1): fraction of rounds used for validation.
--epsilon, --scale, --attention-loops, --hidden-size: set these to match the paper's hyperparameters.
The command prints the training and validation losses each epoch and reports the
best checkpoint once training finishes.
3. Re-run with the trained model
Ensure algorithm.attention_model_path in the experiment TOML points to the
new checkpoint, then launch a fresh run. The server will load the pretrained
weights automatically.
Tip: Capture rounds from several scenarios (with/without attacks) and
merge them into a single directory before training to improve robustness.