Server Aggregation Algorithms

FedAtt

FedAtt is a server aggregation algorithm that aggregates client updates with a layer-wise attention mechanism based on the similarity between the server and client models. The objective is to improve the accuracy or perplexity of the trained model within the same number of communication rounds.

cd examples/server_aggregation/fedatt
uv run fedatt.py -c fedatt_FashionMNIST_lenet5.toml

Reference: S. Ji, S. Pan, G. Long, X. Li, J. Jiang, Z. Huang. "Learning Private Neural Language Modeling with Attentive Aggregation," in Proc. International Joint Conference on Neural Networks (IJCNN), 2019.

Alignment with the paper

The strategy’s attention scores (fedatt_server_strategy.py:63-78) line up with Eq. (2) of Ji et al. (2019) by softmaxing the layer-wise parameter distances; the update step (fedatt_server_strategy.py:80-101) implements Eq. (4) with the optional Gaussian randomization from Eq. (6), using epsilon and magnitude as the paper’s \(\epsilon\) and \(\beta\).

The author's own reference implementation, shaoxiongji/fed-att/src/agg/aggregate.py, follows the same pipeline (norm-based attention → softmax → weighted sum of server–client deltas → stepsize + noise), and every operation has a direct counterpart in the current Plato strategy implementation in fedatt_server_strategy.py.
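For orientation, here is a minimal sketch of that pipeline in PyTorch, assuming per-layer weight dictionaries; the function name and defaults are illustrative, not the strategy's actual API:

import torch
import torch.nn.functional as F

def fedatt_aggregate(server_weights, client_weights, epsilon=1.0, magnitude=0.001):
    # A hedged sketch of layer-wise attentive aggregation, not Plato's exact code.
    new_weights = {}
    for name, w_server in server_weights.items():
        # Eq. (2): softmax over layer-wise parameter distances
        distances = torch.stack(
            [torch.norm(w_server - w[name]) for w in client_weights]
        )
        attention = F.softmax(distances, dim=0)

        # Eq. (4): step towards the attention-weighted client parameters,
        # with Eq. (6)'s Gaussian randomization (magnitude playing the paper's beta)
        delta = sum(
            a * (w_server - w[name]) for a, w in zip(attention, client_weights)
        )
        noise = magnitude * torch.randn_like(w_server)
        new_weights[name] = w_server - epsilon * (delta + noise)
    return new_weights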


FedAdp

FedAdp is another server aggregation algorithm, which exploits the implicit connection between the data distribution on a client and that client's contribution to the global model, measured at the server by inferring the gradient information of participating clients.

cd examples/server_aggregation/fedadp/
uv run fedadp.py -c fedadp_FashionMNIST_lenet5.toml

Reference: H. Wu, P. Wang. "Fast-Convergent Federated Learning with Adaptive Weighting," in IEEE Trans. on Cognitive Communications and Networking (TCCN), 2021.

Alignment with the paper

The original paper did not come with source code. This example implementation follows Algorithm 1 in Wu & Wang (TCCN 2021) in the following ways:

  • Sample-weighted aggregation of local gradients (Alg. 1 line 9 / Eq. 7) happens in aggregate_deltas.
  • The smoothed angle update (Eq. 8) is implemented via the running average in calc_contribution.
  • Node contribution uses the Gompertz mapping (Eq. 9) at fedadp_server_strategy.py:139-147.
  • Final weights apply the softmax-style normalization with dataset sizes (Eq. 10) in calc_adaptive_weighting.

The guardrails for zero norms/weights simply prevent numerical issues and do not change the prescribed behavior.
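As a hedged illustration of Eqs. (9) and (10), with illustrative function names and the paper's \(\alpha = 5\):

import math

def gompertz_contribution(angle, alpha=5.0):
    # Eq. (9): map the smoothed angle (in radians) to a contribution score
    return alpha * (1.0 - math.exp(-math.exp(-alpha * (angle - 1.0))))

def adaptive_weights(smoothed_angles, num_samples):
    # Eq. (10): softmax-style normalization scaled by per-client dataset sizes
    scores = [
        n * math.exp(gompertz_contribution(a))
        for a, n in zip(smoothed_angles, num_samples)
    ]
    total = sum(scores)
    return [s / total for s in scores]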


FedNova

FedNova addresses the objective inconsistency problem in heterogeneous federated optimization, where clients may perform different numbers of local training epochs. The algorithm normalizes local updates based on the number of local steps taken, enabling effective aggregation across heterogeneous clients. On the client side, each client randomly selects the number of local epochs (between 2 and max_local_epochs) and trains accordingly. The number of epochs is included in the report sent to the server. The server then computes the effective number of steps (tau_eff) across all clients and normalizes each client's update by their individual number of local epochs, ensuring fair aggregation.

cd examples/server_aggregation/fednova/
uv run fednova.py -c fednova_MNIST_lenet5.toml

Key configuration parameters:

  • algorithm.max_local_epochs: Maximum number of local epochs (default: 10)
  • algorithm.pattern: Pattern for selecting local epochs (uniform_random or constant)
  • trainer.epochs: Base number of epochs (used when pattern is constant)
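A minimal sketch of how these keys might drive epoch selection on the client (the helper name is illustrative; the actual logic lives in the example's client code):

import random

def select_local_epochs(pattern, max_local_epochs=10, base_epochs=5):
    # "uniform_random": draw the epoch count uniformly from [2, max_local_epochs];
    # any other pattern (e.g. "constant") falls back to trainer.epochs
    if pattern == "uniform_random":
        return random.randint(2, max_local_epochs)
    return base_epochs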

Reference: J. Wang, Q. Liu, H. Liang, G. Joshi, H. V. Poor. "Tackling the Objective Inconsistency Problem in Heterogeneous Federated Optimization," in Proc. NeurIPS, 2020.

Alignment with the paper

The strategy implements Eq. (12) from Wang et al. (NeurIPS 2020): it first accumulates \(\tau_{\mathrm{eff}} = \sum_i p_i \tau_i\), then averages the normalized client updates with weight \(p_i \tau_{\mathrm{eff}} / \tau_i\). This matches the description in §5/fn.11-12 of the paper.

The official FedNova code — JYWa/FedNova, distoptim/FedNova.py — uses the same scaling factor tau_eff / local_normalizing_vec, confirming that this example implementation mirrors the reference source.
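A hedged sketch of that normalized averaging, assuming per-client deltas are dictionaries keyed by layer name (the helper is illustrative, not the strategy's API):

def fednova_aggregate(deltas, num_samples, local_steps):
    # p_i: relative dataset sizes; tau_i: per-client local steps
    total = sum(num_samples)
    weights = [n / total for n in num_samples]
    tau_eff = sum(p * tau for p, tau in zip(weights, local_steps))

    # Eq. (12): weight each client's delta by p_i * tau_eff / tau_i
    aggregated = {}
    for name in deltas[0]:
        aggregated[name] = sum(
            (p * tau_eff / tau) * delta[name]
            for p, tau, delta in zip(weights, local_steps, deltas)
        )
    return aggregated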


MOON

MOON (Model-Contrastive Federated Learning) enhances standard FedAvg by adding a model-level contrastive regularizer. Each client augments the shared model with a projection head, clones the incoming global model as a positive anchor, and reuses a small buffer of its historical checkpoints as negatives. The server still performs sample-weighted averaging but records a short history of global states for downstream analysis or warm restarts.

cd examples/server_aggregation/moon/
uv run moon.py -c moon_MNIST_lenet5.toml

Key configuration parameters:

  • algorithm.mu: Weight assigned to the contrastive term (default: 5.0).
  • algorithm.temperature: Softmax temperature applied to cosine similarities (default: 0.5).
  • algorithm.history_size: Number of historical local models cached per client as negatives (default: 2).
  • trainer.model_name: Name used for checkpointing the projection-ready backbone (default: moon_lenet5).

Reference: Q. Li, B. He, D. Song. "Model-Contrastive Federated Learning," in Proc. CVPR, 2021.

Alignment with the paper

Here’s how Plato's implementation lines up with Li et al. (CVPR 2021) and the authors’ reference implementation:

  • Projection head & representations – moon_model.py:31-79 implements the LeNet-style backbone plus a two-layer projection head, returning both logits and L2-normalised embeddings. The paper's Eq. (3) (and typical contrastive-learning practice) calls for that projection step; the public repo's simple CNN head even hints at it (they keep the projection MLP commented out). Keeping the projection in our model is therefore faithful, and it helps the cosine similarities stay well behaved.

  • Local training objective – moon_trainer.py:26-152 combines the supervised cross-entropy with the temperature-scaled contrastive loss exactly like Eq. (1): positives come from the frozen global model, negatives from the stored local-history models, using the same \(\mu\) and \(\tau\) hyper-parameters exposed in the config (moon_MNIST_lenet5.toml:41-45). This mirrors train_net_fedcon in the reference implementation, which also weights the contrastive term by \(\mu\) and uses CrossEntropy on logits built from cosine similarities.

  • Historical model buffer – the client keeps a FIFO queue of past local checkpoints (moon_client.py:21-64), equivalent to model_buffer_size in the paper and the author's reference implementation; that buffer is fed into the trainer through the strategy context so MOON always has negatives available.

  • Server aggregation – the server still performs sample-weighted FedAvg (moon_server.py:12-35, moon_server_strategy.py:19-63), matching the MOON design which leaves the aggregation rule unchanged. The extra global-history deque is bookkeeping-only.

  • Shared architecture – moon.py:8-15 now instantiates MoonModel once and passes it into both the client and server (model=model). That guarantees the projection-enabled architecture is shared exactly, as required for the contrastive comparisons.

The only intentional deviation is that we L2-normalise the projection outputs before computing cosine similarities (moon_model.py:76-79), which the paper assumes implicitly and improves stability. Aside from that, the workflow, hyperparameters, and loss all line up with the CVPR paper and the publicly released PyTorch reference.
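To make the objective concrete, here is a hedged sketch of the local loss (Eq. 1), assuming the model returns classifier logits plus L2-normalised projections; the function name is illustrative:

import torch
import torch.nn.functional as F

def moon_loss(logits, z, z_global, z_negatives, labels, mu=5.0, temperature=0.5):
    # Supervised term: standard cross-entropy on the classifier logits
    ce = F.cross_entropy(logits, labels)

    # Contrastive term: the frozen global model is the positive (column 0);
    # cached historical local models supply the negatives
    pos = F.cosine_similarity(z, z_global, dim=-1).unsqueeze(1)
    negs = [F.cosine_similarity(z, zn, dim=-1).unsqueeze(1) for zn in z_negatives]
    sims = torch.cat([pos] + negs, dim=1) / temperature
    targets = torch.zeros(z.size(0), dtype=torch.long, device=z.device)
    con = F.cross_entropy(sims, targets)

    return ce + mu * con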


Attack-Adaptive Aggregation

Attack-Adaptive Aggregation is a robust server aggregation algorithm designed to defend against malicious clients in federated learning. The algorithm works by calculating the cosine similarity between the baseline weights and each client's weight deltas, then applying a temperature-scaled softmax to compute attention weights for each client. These attention weights are used to perform a weighted aggregation of client updates, giving more weight to clients whose updates are more aligned with the server model.
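A simplified sketch of that weighting step, flattening deltas and omitting the PCA and multi-pass attention covered below (the helper is illustrative):

import torch
import torch.nn.functional as F

def attention_weights(baseline, client_deltas, scaling_factor=10.0):
    # Flatten the baseline weights and each client's delta into vectors
    base = torch.cat([v.flatten() for v in baseline.values()])
    sims = torch.stack([
        F.cosine_similarity(
            base, torch.cat([v.flatten() for v in d.values()]), dim=0
        )
        for d in client_deltas
    ])
    # Temperature-scaled softmax converts similarities into aggregation weights
    return F.softmax(scaling_factor * sims, dim=0)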

cd examples/server_aggregation/attack_adaptive/
uv run attack_adaptive.py -c attack_adaptive_MNIST_lenet5.toml

Key configuration parameters:

  • algorithm.scaling_factor: Temperature scaling factor for softmax (default: 10)

Reference: C. P. Wan, Q. Chen. "Robust Federated Learning with Attack-Adaptive Aggregation," 2021.

Alignment with the paper

The Plato strategy keeps the paper's original pipeline: it stacks floating-point client deltas, applies layer-wise PCA, seeds the estimator with the per-channel median, and feeds query/key pairs through a multi-pass attention block with temperature scaling and \(\epsilon\)-thresholding before normalising the weights, exactly mirroring Algorithm 1 and the PCA discussion in Sections 3.2–4.1 of Wan & Chen (2021). Refer to attack_adaptive_server_strategy.py:425-486 and the shared attention module in attack_adaptive_server_strategy.py:174-289.
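For intuition, a hedged sketch of the layer-wise PCA projection, using torch.pca_lowrank for brevity (the actual _pca handles the over- and under-determined SVD paths explicitly):

import torch

def project_layer(stacked_deltas, k=10):
    # stacked_deltas: tensor of shape (num_clients, layer_dim)
    q = min(k, *stacked_deltas.shape)
    # pca_lowrank centers the input by default and returns (U, S, V)
    _, _, v = torch.pca_lowrank(stacked_deltas, q=q)
    centered = stacked_deltas - stacked_deltas.mean(dim=0, keepdim=True)
    return centered @ v[:, :q]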

The attention trainer, included in examples/server_aggregation/attack_adaptive/pretraining, reuses the Plato strategy and learns its weights via an L1 loss on captured round projections, matching the paper's goal of approximating the robust mean by data-driven weighting. Relevant code can be found at examples/server_aggregation/attack_adaptive/pretraining/train_attention.py:78-156.

This example, referred to as the Plato implementation, also matches the author's reference implementation closely.

  • PCA handling is more defensive in Plato: _pca supports both over- and under-determined SVD paths with fallback (attack_adaptive_server_strategy.py:128-153), whereas the original convert_pca.getPCA_torch_over assumes the over-determined case (Attack-Adaptive-Aggregation-in-Federated-Learning/utils/convert_pca.py:19-45).
  • Plato discovers trainable parameter names from the live trainer context and skips non-floating tensors at stack time (attack_adaptive_server_strategy.py:425-436), while the reference deletes keys by mutating the stacked dict in place (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/attention.py:104-110).
  • When no checkpoint exists, Plato caches and persists a random attention state so subsequent rounds stay deterministic (attack_adaptive_server_strategy.py:439-466); the upstream code simply expects ./aaa/attention.pt to be present (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/attention.py:114-126).
  • Plato normalises weights with a NaN guard that falls back to uniform weights (attack_adaptive_server_strategy.py:480-484), while the original normalises blindly (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/attention.py:118-123).
  • Dataset capture is built into the aggregation step: each round records the projection, attention output, FedAdp-style reference weights, sample counts, and metadata (attack_adaptive_server_strategy.py:487-651). The upstream workflow instead saves PCA tensors and a benign/malicious label vector when --save_model_weights is enabled (Attack-Adaptive-Aggregation-in-Federated-Learning/server.py:237-279 and _main.py:64-85), leaving downstream scripts to derive targets.
  • Plato's trainer uses a concise CLI with standard argparse flags and an optional validation split (examples/server_aggregation/attack_adaptive/pretraining/train_attention.py:21-110); the reference script expects path lists via @args files and logs to per-hyperparameter text files (Attack-Adaptive-Aggregation-in-Federated-Learning/aaa/train_attention.py:1-210).

Pretraining the Attack-Adaptive Attention Module

Because Plato saves randomly initialised weights when no checkpoint is present, ensure you replace them with a trained model before benchmarking—otherwise aggregation reduces to a random attention mask. Follow the steps below to obtain a checkpoint that matches the attack-adaptive aggregation algorithm.

1. Capture rounds from a simulation

Edit your experiment TOML so the attack-adaptive strategy records each round. Add the following keys under algorithm (directories are created if missing):

[algorithm]

# Aggregation algorithm
type = "fedavg"

# Scaling factor (temperature) used by the attention module
scaling_factor = 10

# Path to the trained attention model for the attack-adaptive strategy.
# Provide your own checkpoint before running the example.
attention_model_path = "./attention_model.pt"

# Number of PCA components per layer before feeding into attention.
pca_components = 10

# Threshold applied after the softmax step (epsilon in the paper).
threshold = 0.005

# Attention network hyperparameters from the reference implementation.
attention_loops = 5
attention_hidden = 32

# Optional: set to capture per-round tensors for pretraining the attention module.
# dataset_capture_dir = "./attack_adaptive_dataset"

Run the experiment as usual:

cd examples/server_aggregation/attack_adaptive
uv run attack_adaptive.py -c attack_adaptive_MNIST_lenet5.toml

Once algorithm.dataset_capture_dir is set, each run produces a timestamped folder under ./attack_adaptive_dataset/ (for example run_20251015-211214) containing one round_XXXXX.pt file per round and a metadata.json summary. If the key is left unset, capture is skipped and no folder will be created.
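To sanity-check a capture before training, the round files can be loaded directly; the exact record fields depend on the strategy version, so this sketch only prints what it finds:

import json
import torch

run_dir = "attack_adaptive_dataset/run_20251015-211214"  # adjust to your run

record = torch.load(f"{run_dir}/round_00001.pt")
print(type(record), record.keys() if isinstance(record, dict) else "")

with open(f"{run_dir}/metadata.json") as f:
    print(json.load(f))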

2. Train the attention network

Point the pretraining script to the captured directory and choose where to save the checkpoint:

cd examples/server_aggregation/attack_adaptive/pretraining
uv run train_attention.py \
  --dataset-dir ../attack_adaptive_dataset/run_20251015-211214 \
  --save-path ../attention_model.pt \
  --epochs 200

Additional options:

  • --batch-size (default 16)
  • --learning-rate (default 1e-4)
  • --val-ratio (default 0.1, fraction of rounds used for validation)
  • --epsilon, --scale, --attention-loops, --hidden-size to match paper hyperparameters.

The command prints the training and validation losses each epoch and reports the best checkpoint once training finishes.

3. Re-run with the trained model

Ensure algorithm.attention_model_path in the experiment TOML points to the new checkpoint, then launch a fresh run. The server will load the pretrained weights automatically.

Tip: Capture rounds from several scenarios (with/without attacks) and merge them into a single directory before training to improve robustness.