Server Aggregation Algorithms

FedAtt

FedAtt is a server aggregation algorithm that aggregates client updates with a layer-wise attention mechanism based on the similarity between the server and client models. The objective is to improve the accuracy or perplexity of the trained model within the same number of communication rounds.


cd examples/server_aggregation/fedatt
uv run fedatt.py -c fedatt_FashionMNIST_lenet5.toml

Reference: S. Ji, S. Pan, G. Long, X. Li, J. Jiang, Z. Huang. "Learning Private Neural Language Modeling with Attentive Aggregation," in Proc. International Joint Conference on Neural Networks (IJCNN), 2019.

Alignment with the paper

The strategy’s attention scores (fedatt_server_strategy.py:63-78) line up with Eq. (2) of Ji et al. (2019) by softmaxing the layer-wise parameter distances; the update step (fedatt_server_strategy.py:80-101) implements Eq. (4) with the optional Gaussian randomization from Eq. (6), using epsilon and magnitude as the paper’s $\epsilon$ and $\beta$.

The author's own reference implementation, shaoxiongji/fed-att/src/agg/aggregate.py, follows the same pipeline (norm-based attention → softmax → weighted sum of server–client deltas → stepsize + noise), and every operation has a direct counterpart in the current Plato strategy implementation in fedatt_server_strategy.py.
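The pipeline described above (norm-based attention, softmax, weighted sum of server-client deltas, step size plus noise) can be sketched for a single layer in plain Python. This is a minimal illustration rather than the code in fedatt_server_strategy.py: the function name fedatt_layer_update, the flat-list layer representation, the L2 norm, and applying the noise once per coordinate are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fedatt_layer_update(server_layer, client_layers, epsilon=1.0,
                        magnitude=0.0, rng=None):
    """Hypothetical single-layer FedAtt step on flat lists of floats.

    epsilon is the step size and magnitude scales the Gaussian noise,
    mirroring the configuration names mentioned in the text.
    """
    rng = rng or random.Random(0)
    # Layer-wise distance between the server and each client (L2 norm assumed).
    distances = [
        math.sqrt(sum((s - c) ** 2 for s, c in zip(server_layer, client)))
        for client in client_layers
    ]
    # Attention scores: softmax of the distances across clients.
    attention = softmax(distances)
    new_layer = []
    for j, s in enumerate(server_layer):
        # Attention-weighted sum of server-client differences for coordinate j.
        step = sum(a * (s - client[j])
                   for a, client in zip(attention, client_layers))
        # Optional Gaussian randomization, applied once per coordinate here.
        step += magnitude * rng.gauss(0.0, 1.0)
        new_layer.append(s - epsilon * step)
    return attention, new_layer
```

With magnitude set to 0 the update is deterministic, which makes it easy to check that the server moves toward the clients in proportion to their attention scores.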

FedAdp

FedAdp is another server aggregation algorithm. It exploits the implicit connection between the data distribution on a client and that client's contribution to the global model, which the server measures by inferring gradient information from participating clients.


cd examples/server_aggregation/fedadp/
uv run fedadp.py -c fedadp_FashionMNIST_lenet5.toml

Reference: H. Wu, P. Wang. "Fast-Convergent Federated Learning with Adaptive Weighting," in IEEE Trans. on Cognitive Communications and Networking (TCCN), 2021.

Alignment with the paper

The original paper did not come with source code. This example implementation follows Algorithm 1 in Wu & Wang (TCCN 2021) in the following ways: sample-weighted aggregation of local gradients (Alg. 1 line 9 / Eq. 7) happens in aggregate_deltas; the smoothed angle update (Eq. 8) is implemented via the running average in calc_contribution; node contribution uses the Gompertz mapping (Eq. 9) at fedadp_server_strategy.py:139-147; and final weights apply the Softmax-style normalization with dataset sizes (Eq. 10) in calc_adaptive_weighting. The guardrails for zero norms/weights simply prevent numerical issues and do not change the prescribed behavior.
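The three steps named above (smoothed angle, Gompertz mapping, softmax-style weights) can be sketched as follows. The formulas are written from the usual statement of Eq. 8-10 and should be read as assumptions to check against the paper; the function names and the default alpha = 5.0 are illustrative and are not taken from fedadp_server_strategy.py.

```python
import math


def smoothed_angle(previous, angle, round_number):
    """Running average of the angle across rounds (assumed form of Eq. 8)."""
    if previous is None or round_number <= 1:
        return angle
    return ((round_number - 1) * previous + angle) / round_number


def gompertz(angle, alpha=5.0):
    """Map a smoothed angle (radians) to a contribution (assumed form of Eq. 9)."""
    return alpha * (1.0 - math.exp(-math.exp(-alpha * (angle - 1.0))))


def adaptive_weights(smoothed_angles, num_samples, alpha=5.0):
    """Softmax-style weights scaled by dataset size (assumed form of Eq. 10)."""
    scores = [
        n * math.exp(gompertz(a, alpha))
        for a, n in zip(smoothed_angles, num_samples)
    ]
    total = sum(scores)
    if total == 0:
        # Guardrail: fall back to uniform weights when every score is zero.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]
```

A client whose gradient stays closely aligned with the global gradient (small angle) receives a larger weight than a client with the same dataset size but a larger angle.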

FedNova

FedNova addresses the objective inconsistency problem in heterogeneous federated optimization, where clients may perform different numbers of local training epochs. The algorithm normalizes local updates based on the number of local steps taken, enabling effective aggregation across heterogeneous clients. On the client side, each client randomly selects the number of local epochs (between 2 and max_local_epochs) and trains accordingly. The number of epochs is included in the report sent to the server. The server then computes the effective number of steps (tau_eff) across all clients and normalizes each client's update by their individual number of local epochs, ensuring fair aggregation.


cd examples/server_aggregation/fednova/
uv run fednova.py -c fednova_MNIST_lenet5.toml

Key configuration parameters:

algorithm.max_local_epochs: Maximum number of local epochs (default: 10)
algorithm.pattern: Pattern for selecting local epochs (uniform_random or constant)
trainer.epochs: Base number of epochs (used when pattern is constant)
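How the three parameters above could interact on the client side is sketched below. The helper choose_local_epochs is hypothetical and is not a Plato API; treating both bounds as inclusive is also an assumption.

```python
import random


def choose_local_epochs(pattern, max_local_epochs=10, base_epochs=5, rng=None):
    """Hypothetical helper: pick the number of local epochs for one client."""
    rng = rng or random.Random()
    if pattern == "constant":
        # Every client trains for the base number of epochs (trainer.epochs).
        return base_epochs
    if pattern == "uniform_random":
        # Each client draws uniformly between 2 and max_local_epochs.
        return rng.randint(2, max_local_epochs)
    raise ValueError(f"Unknown pattern: {pattern!r}")
```

The value returned here is what the client would include in its report so that the server can normalize the update.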

Reference: J. Wang, Q. Liu, H. Liang, G. Joshi, H. V. Poor. "Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization," in Proc. NeurIPS, 2020.

Alignment with the paper

The strategy implements Eq. (12) from Wang et al. (NeurIPS 2020): it first accumulates $\tau_{\mathrm{eff}} = \sum_i p_i \tau_i$, then averages the normalized updates with weight $p_i \tau_{\mathrm{eff}} / \tau_i$. This matches the description in §5/fn.11-12 of the paper.

The official FedNova code — JYWa/FedNova, distoptim/FedNova.py — uses the same scaling factor tau_eff / local_normalizing_vec, confirming that this example implementation mirrors the reference source.
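The weighting can be illustrated with a few lines of plain Python. This is a sketch on flat lists of floats, assuming $p_i$ is each client's share of the total sample count; the function name fednova_aggregate is illustrative and does not come from the Plato source.

```python
def fednova_aggregate(deltas, num_samples, local_steps):
    """Combine client deltas with weight p_i * tau_eff / tau_i."""
    total_samples = sum(num_samples)
    # p_i: relative dataset size of each client (assumed definition).
    shares = [n / total_samples for n in num_samples]
    # tau_eff = sum_i p_i * tau_i
    tau_eff = sum(p * tau for p, tau in zip(shares, local_steps))
    aggregated = [0.0] * len(deltas[0])
    for delta, p, tau in zip(deltas, shares, local_steps):
        weight = p * tau_eff / tau
        for j, value in enumerate(delta):
            aggregated[j] += weight * value
    return tau_eff, aggregated
```

For example, if two equally sized clients each move 1.0 per local step but run 2 and 8 steps, their raw deltas are 2.0 and 8.0; after normalization both contribute equally, and the aggregate equals tau_eff times the shared per-step movement.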

FedDF

FedDF (Federated Distillation and Fusion) replaces direct parameter averaging with server-side distillation on a proxy set. The server selects a deterministic unlabeled proxy subset, ships those proxy inputs alongside the current global weights, and each client returns teacher logits on that shared payload instead of ordinary weight deltas. The server then aggregates those logits and distills their ensemble into the next global model using temperature-scaled soft targets.


cd examples/server_aggregation/feddf/
uv run feddf.py -c feddf_MNIST_lenet5.toml

Key configuration parameters:

algorithm.proxy_set_size: Number of unlabeled proxy samples used on the server for distillation.
algorithm.proxy_batch_size: Batch size for iterating through the proxy set.
algorithm.proxy_seed: Seed used to select the deterministic proxy subset shared by clients and server.
algorithm.temperature: Softmax temperature used to smooth teacher logits before distillation.
algorithm.distillation_epochs: Number of server-side distillation passes per round.
algorithm.distillation_batch_size: Batch size for the distillation optimizer.
algorithm.learning_rate: Learning rate for the server-side distillation optimizer.

Reference: Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi. "Ensemble Distillation for Robust Model Fusion in Federated Learning," arXiv:2006.07242, 2020.

Alignment with the paper

The module split follows the FedDF workflow directly: feddf.py stays as a thin launcher, feddf_server.py packages the current global weights together with the shared proxy inputs, feddf_client.py performs the standard local update and then emits logits on that server-supplied proxy payload, feddf_server_strategy.py resolves the deterministic proxy subset and routes the logits payload through direct weight aggregation, and feddf_algorithm.py encapsulates the weighted-logit ensemble plus the temperature-scaled KL distillation step.

The configuration surface above mirrors the paper’s core knobs. proxy_set_size, proxy_batch_size, and proxy_seed control the deterministic unlabeled proxy data used for ensemble distillation, temperature shapes the softened teacher distribution, and distillation_epochs, distillation_batch_size, and learning_rate control the server-side student optimization that replaces direct averaging.
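The weighted-logit ensemble and the temperature-scaled KL term can be sketched for a single proxy sample as follows. The code is a plain-Python illustration, not the contents of feddf_algorithm.py; the function names are hypothetical, and some distillation implementations additionally multiply the loss by the squared temperature, which is omitted here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_logits(client_logits, weights):
    """Weighted average of per-client logits for one proxy sample."""
    total = sum(weights)
    num_classes = len(client_logits[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, client_logits)) / total
        for c in range(num_classes)
    ]


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
```

The loss is zero when the student reproduces the ensemble exactly and grows as the two softened distributions diverge; a higher temperature flattens both distributions before they are compared.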

MOON (client-training customization)

MOON is included in Plato as a client-training customization rather than as a server-aggregation rule. The server still performs sample-weighted FedAvg; the distinguishing mechanism is the contrastive local objective together with the historical-model buffer maintained by each client.

See 5. Algorithms with Customized Client Training Loops for the runnable example, key configuration parameters, and the implementation alignment notes for MOON.

Attack-Adaptive Aggregation

Attack-Adaptive Aggregation is a robust server aggregation algorithm designed to defend against malicious clients in federated learning. The algorithm works by calculating the cosine similarity between the baseline weights and each client's weight deltas, then applying a temperature-scaled softmax to compute attention weights for each client. These attention weights are used to perform a weighted aggregation of client updates, giving more weight to clients whose updates are more aligned with the server model.


cd examples/server_aggregation/attack_adaptive/
uv run attack_adaptive.py -c attack_adaptive_MNIST_lenet5.toml

Key configuration parameters:

algorithm.scaling_factor: Temperature scaling factor for softmax (default: 10)
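The simplified description above (cosine similarity followed by a temperature-scaled softmax) can be sketched as follows. The sketch leaves out the PCA projection and the learned attention module, and multiplying the similarities by scaling_factor before the softmax is an assumption about how the temperature is applied; the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two flat vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def attention_weights(baseline, client_deltas, scaling_factor=10.0):
    """Softmax over scaled cosine similarities, one weight per client."""
    scaled = [scaling_factor * cosine_similarity(baseline, d)
              for d in client_deltas]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

A larger scaling_factor sharpens the softmax, so an update pointing away from the baseline direction receives a weight close to zero.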

Reference: C. P. Wan, Q. Chen. "Robust Federated Learning with Attack-Adaptive Aggregation," 2021.

Alignment with the paper

The Plato strategy keeps the original pipeline in the paper: it stacks floating-point client deltas, applies layer-wise PCA, seeds the estimator with the per-channel median, and feeds query/key pairs through a multi-pass attention block with temperature scaling and $\epsilon$ -thresholding before normalising weights, exactly mirroring Algorithm 1 and the PCA discussion in Sections 3.2–4.1 of Wan & Chen (2021). Refer to attack_adaptive_server_strategy.py:425-486 and the shared attention module in attack_adaptive_server_strategy.py:174-289.

The attention trainer, included in examples/server_aggregation/attack_adaptive/pretraining, reuses the Plato strategy, learning weights via an L1 loss on captured round projections, matching the paper’s goal of approximating the robust mean by data-driven weighting. Relevant code can be found at examples/server_aggregation/attack_adaptive/pretraining/train_attention.py:78-156.

This example, referred to as the Plato implementation, also closely matches the author's reference implementation, with the following differences:

PCA handling is more defensive in Plato: _pca supports both over- and under-determined SVD paths with fallback (attack_adaptive_server_strategy.py:128-153), whereas the original convert_pca.getPCA_torch_over assumes the over-determined case (Attack-Adaptive-Aggregation-in-Federated-Learning/utils/convert_pca.py:19-45).
Plato discovers trainable parameter names from the live trainer context and skips non-floating tensors at stack time (attack_adaptive_server_strategy.py:425-436), while the reference deletes keys by mutating the stacked dict in place (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/attention.py:104-110).
When no checkpoint exists, Plato caches and persists a random attention state so subsequent rounds stay deterministic (attack_adaptive_server_strategy.py:439-466); the upstream code simply expects ./aaa/attention.pt to be present (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/attention.py:114-126).
Plato normalises weights with a NaN guard that falls back to uniform weights (attack_adaptive_server_strategy.py:480-484), while the original normalises blindly (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/attention.py:118-123).
Dataset capture is built into the aggregation step: each round records the projection, attention output, FedAdp-style reference weights, sample counts, and metadata (attack_adaptive_server_strategy.py:487-651). The upstream workflow instead saves PCA tensors and a benign/malicious label vector when --save_model_weights is enabled (Attack-Adaptive-Aggregation-in-Federated-Learning/server.py:237-279 and _main.py:64-85), leaving downstream scripts to derive targets.
Plato’s trainer uses a concise CLI with standard argparse flags and optional validation split (examples/server_aggregation/attack_adaptive/pretraining/train_attention.py:21-110); the reference script expects path lists via @args files and logs to per-hyperparameter text files (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/train_attention.py:1-210).
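Of the differences listed above, the NaN guard is the easiest to illustrate in isolation. The following is a minimal sketch of that idea on a plain list of floats; the function name normalise_with_fallback and the exact conditions that trigger the fallback are assumptions rather than a copy of the Plato code.

```python
import math


def normalise_with_fallback(weights):
    """Normalise weights to sum to 1, falling back to uniform weights."""
    if not weights:
        return []
    total = sum(weights)
    invalid = (
        not math.isfinite(total)
        or total <= 0
        or any(not math.isfinite(w) for w in weights)
    )
    if invalid:
        # NaN, infinite, or all-zero input: give every client equal weight.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]
```

Without such a guard, a single NaN score would propagate through the division and corrupt every aggregated parameter.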

Pretraining the Attack-Adaptive Attention Module

Because Plato saves randomly initialised weights when no checkpoint is present, ensure you replace them with a trained model before benchmarking—otherwise aggregation reduces to a random attention mask. Follow the steps below to obtain a checkpoint that matches the attack-adaptive aggregation algorithm.

1. Capture rounds from a simulation

Edit your experiment TOML so the attack-adaptive strategy records each round. Add the following keys under algorithm (directories are created if missing):


[algorithm]

# Aggregation algorithm
type = "fedavg"

# Scaling factor (temperature) used by the attention module
scaling_factor = 10

# Path to the trained attention model shipped with the attack-adaptive paper.
# Provide your own checkpoint before running the example.
attention_model_path = "./attention_model.pt"

# Number of PCA components per layer before feeding into attention.
pca_components = 10

# Threshold applied after the softmax step (epsilon in the paper).
threshold = 0.005

# Attention network hyperparameters from the reference implementation.
attention_loops = 5
attention_hidden = 32

# Optional: uncomment to capture per-round tensors for pretraining the attention module.
# dataset_capture_dir = "./attack_adaptive_dataset"

Run the experiment as usual:


cd examples/server_aggregation/attack_adaptive
uv run attack_adaptive.py -c attack_adaptive_MNIST_lenet5.toml

Once algorithm.dataset_capture_dir is set, each run produces a timestamped folder under ./attack_adaptive_dataset/ (for example run_20251015-211214) containing one round_XXXXX.pt file per round and a metadata.json summary. If the key is left unset, capture is skipped and no folder will be created.
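Before training, it can help to confirm that a run folder contains what is described above. The helper below is a hypothetical convenience script, not part of Plato: it only counts the round_*.pt files and parses metadata.json as generic JSON, without assuming anything about the fields inside.

```python
import json
from pathlib import Path


def summarise_capture(run_dir):
    """Hypothetical helper: list captured rounds and load metadata.json."""
    run_dir = Path(run_dir)
    # One round_XXXXX.pt file is expected per captured round.
    round_files = sorted(run_dir.glob("round_*.pt"))
    metadata_path = run_dir / "metadata.json"
    metadata = None
    if metadata_path.exists():
        metadata = json.loads(metadata_path.read_text())
    return {
        "num_rounds": len(round_files),
        "round_files": [path.name for path in round_files],
        "metadata": metadata,
    }
```

Calling summarise_capture on a run folder such as ./attack_adaptive_dataset/run_20251015-211214 returns the number of captured rounds, which should match the number of rounds in the experiment.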

2. Train the attention network

Point the pretraining script to the captured directory and choose where to save the checkpoint:


cd examples/server_aggregation/attack_adaptive/pretraining
uv run train_attention.py \
  --dataset-dir ../attack_adaptive_dataset/run_20251015-211214 \
  --save-path ../attention_model.pt \
  --epochs 200

Additional options:

--batch-size (default 16)
--learning-rate (default 1e-4)
--val-ratio (default 0.1, fraction of rounds used for validation)
--epsilon, --scale, --attention-loops, --hidden-size to match paper hyperparameters.

The command prints the training and validation losses each epoch and reports the best checkpoint once training finishes.

3. Re-run with the trained model

Ensure algorithm.attention_model_path in the experiment TOML points to the new checkpoint, then launch a fresh run. The server will load the pretrained weights automatically.

Tip: Capture rounds from several scenarios (with/without attacks) and merge them into a single directory before training to improve robustness.