Nanochat in Plato

Plato includes a Nanochat integration for federated training plus optional CORE evaluation. This guide documents the working end-to-end setup for the current repository state.

Reference files:

  • configs/Nanochat/synthetic_micro.toml
  • configs/Nanochat/parquet_micro.toml
  • examples/nanochat/README.md
  • external/nanochat/

What you need to know first

There are two separate pieces of setup:

  1. Running Nanochat training inside Plato
  2. Running Nanochat CORE evaluation

The current synthetic_micro.toml configuration already enables:

[evaluation]
type = "nanochat_core"

So if you run that config as-is, you need both:

  • the Nanochat Rust tokenizer extension (rustbpe)
  • a trained tokenizer under ~/.cache/nanochat/tokenizer/

The CORE evaluation bundle itself is downloaded automatically on first use, but the tokenizer is not.

Step 1: Clone Plato and initialize submodules

From a fresh checkout:

git clone git@github.com:TL-System/plato.git
cd plato
git submodule update --init --recursive

The Nanochat code lives in the submodule:

  • external/nanochat

If this step is skipped, Nanochat imports and build steps will fail.

Step 2: Install Plato with the Nanochat extra

From the repository root:

uv sync --extra nanochat

This installs the Python dependencies needed by Plato's Nanochat integration.

Step 3: Install maturin

The Rust tokenizer build step uses maturin. Install it once if it is not already available on your machine:

uv tool install maturin

If you already have maturin, you can skip this step.

Step 4: Build the Nanochat Rust tokenizer extension for Plato's environment

From the repository root:

uv run --extra nanochat maturin develop --release --manifest-path external/nanochat/rustbpe/Cargo.toml

This builds and installs the rustbpe extension into the environment used by uv run in the Plato repository.
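To confirm the extension landed in the environment Plato actually uses, you can check that the module is importable. A minimal sketch (the helper name is ours, not part of Plato or Nanochat):

```python
import importlib.util

def extension_available(module_name: str = "rustbpe") -> bool:
    """Return True if the compiled extension is importable from this environment."""
    return importlib.util.find_spec(module_name) is not None
```

If this returns False when run under `uv run --extra nanochat python`, re-run the maturin step above.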

Troubleshooting: Both VIRTUAL_ENV and CONDA_PREFIX are set

If maturin fails with a message like this:

Both VIRTUAL_ENV and CONDA_PREFIX are set. Please unset one of them

run:

unset CONDA_PREFIX
uv run --extra nanochat maturin develop --release --manifest-path external/nanochat/rustbpe/Cargo.toml

Step 5A: Run the training-only synthetic smoke test

If you only want to verify that Nanochat training works and do not need CORE evaluation yet, first make a local copy of the config and disable the evaluation block there:

cp configs/Nanochat/synthetic_micro.toml /tmp/nanochat_synthetic_train_only.toml

Edit /tmp/nanochat_synthetic_train_only.toml and comment out:

# [evaluation]
# type = "nanochat_core"
# max_per_task = 16

Then run:

uv run --extra nanochat python plato.py --config /tmp/nanochat_synthetic_train_only.toml --cpu

This launches a 1-round synthetic training smoke test on CPU without requiring tokenizer setup for CORE evaluation.

Step 5B: Prepare the tokenizer required by CORE evaluation

If you want to run configs/Nanochat/synthetic_micro.toml without editing it, you must prepare a tokenizer first.

Why this extra step is needed

Nanochat's CORE evaluator expects these tokenizer artifacts under:

  • ~/.cache/nanochat/tokenizer/tokenizer.pkl
  • ~/.cache/nanochat/tokenizer/token_bytes.pt

Plato can auto-download the CORE evaluation bundle, but it does not auto-create the tokenizer.
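Before launching a CORE-enabled run, you can check whether the tokenizer artifacts are in place. This small sketch only assumes the cache layout listed above; the helper itself is illustrative, not part of Plato:

```python
from pathlib import Path

REQUIRED = ("tokenizer.pkl", "token_bytes.pt")

def missing_tokenizer_files(cache_dir: str = "~/.cache/nanochat") -> list:
    """Return the names of required tokenizer artifacts that are absent."""
    tok_dir = Path(cache_dir).expanduser() / "tokenizer"
    return [name for name in REQUIRED if not (tok_dir / name).is_file()]
```

An empty list means the CORE-enabled config can run as-is; otherwise follow 5B.1 and 5B.2 below.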

5B.1 Download Nanochat base data shards

Run this from the Plato repository root:

PYTHONPATH=external/nanochat:$PYTHONPATH \
uv run --extra nanochat python external/nanochat/nanochat/dataset.py -n 2

This downloads data into:

  • ~/.cache/nanochat/base_data/

Download at least 2 shards

Do not download just one shard.

Nanochat's tokenizer training script uses all but the last shard for the training split and reserves the last shard for validation. With only one shard, the training split is empty and tokenizer training produces an unusable tokenizer.
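The split rule described above can be illustrated with a tiny sketch (the slicing mirrors the described behavior; the function name and variables are ours):

```python
def split_shards(shards: list) -> tuple:
    """All but the last shard go to training; the last shard is held out for validation."""
    return shards[:-1], shards[-1:]

train, val = split_shards(["shard_0000"])
# train == [] : with a single shard the training split is empty,
# which is why tokenizer training needs at least two shards
```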

5B.2 Train the tokenizer

For the micro config, use a small vocabulary size matching the smoke-test setup:

PYTHONPATH=external/nanochat:$PYTHONPATH \
uv run --extra nanochat python external/nanochat/scripts/tok_train.py --max_chars 1000000 --vocab_size 512

This should create:

  • ~/.cache/nanochat/tokenizer/tokenizer.pkl
  • ~/.cache/nanochat/tokenizer/token_bytes.pt

Step 6: Run the synthetic Nanochat config with CORE evaluation

Once the tokenizer exists, run:

uv run --extra nanochat python plato.py --config configs/Nanochat/synthetic_micro.toml --cpu

What to expect:

  • training runs on CPU
  • one client participates in one round
  • the CORE evaluation bundle is downloaded automatically on the first run if missing
  • some CORE examples may be skipped when they exceed the tiny micro model's sequence_len

That last point is expected for this micro configuration.
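The skipping behavior amounts to a length filter over tokenized examples. A hypothetical sketch of the idea, not Plato's actual implementation:

```python
def filter_by_context(token_lists, sequence_len):
    """Drop examples whose token count exceeds the model's context window."""
    kept = [toks for toks in token_lists if len(toks) <= sequence_len]
    skipped = len(token_lists) - len(kept)
    return kept, skipped
```

With the micro model's very small sequence_len, a noticeable fraction of CORE examples can fall into the skipped bucket.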

Step 7: Run the Parquet micro config

The repository also includes:

  • configs/Nanochat/parquet_micro.toml

Run it with:

uv run --extra nanochat python plato.py --config configs/Nanochat/parquet_micro.toml

Notes:

  • this config is set up for mode = "parquet"
  • it reuses Nanochat's cached base data under ~/.cache/nanochat/base_data/
  • it also requires the tokenizer prepared in Step 5B, because Parquet tokenization and CORE evaluation both depend on it
  • by default it will reuse the tokenizer trained in Step 5B; if you train a different tokenizer or vocabulary, keep the tokenizer, cached data flow, and model settings aligned
  • it defaults to device = "cuda"
  • it performs many more rounds than the synthetic smoke test
  • it is intended for a more realistic run than the CPU micro smoke test

You may need to adjust device, batch counts, rounds, and data volume for your hardware.
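For example, a small CPU-only trial might override a few values in a local copy of the config. The key names below are illustrative assumptions; verify them against parquet_micro.toml itself before editing:

```toml
# Hypothetical overrides for a reduced CPU trial; check the actual key
# names and sections in configs/Nanochat/parquet_micro.toml.
device = "cpu"   # the config defaults to device = "cuda"
rounds = 2       # far fewer rounds than the default run
```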

Common failure modes

ModuleNotFoundError: No module named 'nanochat'

This usually happens when running upstream Nanochat scripts directly.

Use the commands in this guide from the Plato repository root and include:

PYTHONPATH=external/nanochat:$PYTHONPATH

when calling upstream scripts such as:

  • external/nanochat/nanochat/dataset.py
  • external/nanochat/scripts/tok_train.py

You do not need this extra PYTHONPATH prefix when running plato.py, because Plato's Nanochat integration adds the submodule path internally.
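Conceptually, the integration does something like the following before importing nanochat. This is a sketch of the idea, not Plato's literal code, and it assumes the repository root is the current working directory:

```python
import sys
from pathlib import Path

# Prepend the submodule directory so `import nanochat` resolves
# without an external PYTHONPATH prefix.
nanochat_dir = str(Path("external/nanochat").resolve())
if nanochat_dir not in sys.path:
    sys.path.insert(0, nanochat_dir)
```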

AttributeError: module 'rustbpe' has no attribute 'Tokenizer'

The Rust extension was not built for Plato's active Python environment.

Re-run:

uv run --extra nanochat maturin develop --release --manifest-path external/nanochat/rustbpe/Cargo.toml

FileNotFoundError: ... ~/.cache/nanochat/tokenizer/tokenizer.pkl

The tokenizer has not been trained yet.

Follow Step 5B above: download at least two base data shards, then train the tokenizer.

CORE evaluation crashes on long prompts

The synthetic_micro.toml model uses a very small sequence length. Some CORE prompts can exceed that limit.

Plato skips examples that do not fit into the model context window during Nanochat CORE evaluation, instead of aborting the entire run.

Minimal training-only smoke test

  1. git submodule update --init --recursive
  2. uv sync --extra nanochat
  3. uv tool install maturin
  4. uv run --extra nanochat maturin develop --release --manifest-path external/nanochat/rustbpe/Cargo.toml
  5. cp configs/Nanochat/synthetic_micro.toml /tmp/nanochat_synthetic_train_only.toml
  6. comment out [evaluation] in /tmp/nanochat_synthetic_train_only.toml
  7. uv run --extra nanochat python plato.py --config /tmp/nanochat_synthetic_train_only.toml --cpu

Full synthetic run with CORE evaluation

  1. git submodule update --init --recursive
  2. uv sync --extra nanochat
  3. uv tool install maturin
  4. uv run --extra nanochat maturin develop --release --manifest-path external/nanochat/rustbpe/Cargo.toml
  5. PYTHONPATH=external/nanochat:$PYTHONPATH uv run --extra nanochat python external/nanochat/nanochat/dataset.py -n 2
  6. PYTHONPATH=external/nanochat:$PYTHONPATH uv run --extra nanochat python external/nanochat/scripts/tok_train.py --max_chars 1000000 --vocab_size 512
  7. uv run --extra nanochat python plato.py --config configs/Nanochat/synthetic_micro.toml --cpu