Layout-MAE Architecture Audit

Status: ViT-S MAE prototype — RFC 0001 recipe, no pretrained weights yet. Last audited: 2026-05-23 against docs/rfcs/0001-base-model.md and the canonical MAE reference (He et al., Masked Autoencoders Are Scalable Vision Learners, CVPR 2022).

OpenLithoHub ships a LayoutMAE module (src/openlithohub/models/layout_mae.py) implementing a ViT-S masked-autoencoder over rasterised layout patches, intended as a self-supervised pretraining base for downstream OPC / hotspot tasks. This page records what we implement, what RFC 0001 specifies, and where they diverge.

Audit method

We compared src/openlithohub/models/layout_mae.py against:

docs/rfcs/0001-base-model.md — the project RFC that pins the recipe.
He et al., Masked Autoencoders Are Scalable Vision Learners (CVPR 2022, arXiv:2111.06377) — canonical MAE reference. Bib key: He2022_MAE (added to docs/references.bib 2026-05-23).

Confidence levels:

A: verified against the source in this repo, with paper-side claims grounded in the canonical MAE reference, which is pinned in docs/references.bib as He2022_MAE.
B: verified against the values stated in RFC 0001.
C: derived from the canonical MAE reference in cases where the paper PDF would otherwise be needed. Now that He2022_MAE is in the bib, items previously dual-marked A / C are upgraded to A.

What RFC 0001 specifies

Item	RFC 0001 specification	Confidence
Encoder	ViT-S — `embed_dim=384`, `depth=12`, `num_heads=6`.	B
Decoder	Lightweight ViT — `decoder_embed_dim=256`, `decoder_depth=4`, `decoder_num_heads=8`.	B
Patch	16×16-pixel patches over 256×256 inputs (a 16×16 grid, i.e. 256 patches per image).	B
Masking	Random 75% mask ratio.	B
Loss	L1 reconstruction over masked patches only.	B
Position embedding	2D sin-cos (non-learned).	B
Pretraining target	200k steps on A100 — v0.2 deliverable, not in this prototype.	B
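The quantities implied by the table above can be derived with a few lines of arithmetic. The sketch below is illustrative only: `RecipeSketch` is a hypothetical stand-in that copies the RFC 0001 values from the table, and it is not the project's `LayoutMAEConfig`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecipeSketch:
    """Hypothetical container for the RFC 0001 values (not LayoutMAEConfig)."""

    image_size: int = 256
    patch_size: int = 16
    mask_ratio: float = 0.75
    embed_dim: int = 384
    num_heads: int = 6
    decoder_embed_dim: int = 256
    decoder_num_heads: int = 8

    @property
    def grid(self) -> int:
        # Number of patches along one side of the image.
        return self.image_size // self.patch_size

    @property
    def num_patches(self) -> int:
        # Total number of patch tokens per image.
        return self.grid ** 2

    @property
    def num_visible(self) -> int:
        # Tokens the encoder sees once the mask ratio is applied.
        return int(self.num_patches * (1 - self.mask_ratio))

    @property
    def encoder_head_dim(self) -> int:
        return self.embed_dim // self.num_heads

    @property
    def decoder_head_dim(self) -> int:
        return self.decoder_embed_dim // self.decoder_num_heads


recipe = RecipeSketch()
print(recipe.grid, recipe.num_patches, recipe.num_visible)  # 16 256 64
print(recipe.encoder_head_dim, recipe.decoder_head_dim)     # 64 32
```

With a 75% mask ratio the encoder processes 64 of the 256 patch tokens per image; the decoder operates on all 256 once mask tokens are restored.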

What OpenLithoHub implements

src/openlithohub/models/layout_mae.py :: LayoutMAE:

Item	OpenLithoHub	Matches RFC / paper?	Confidence
Encoder	ViT-S — `embed_dim=384`, `depth=12`, `num_heads=6` (`LayoutMAEConfig` defaults, ll. ~36–38).	Yes.	A
Decoder	`decoder_embed_dim=256`, `decoder_depth=4`, `decoder_num_heads=8` (ll. ~39–41).	Yes.	A
Patch	`patch_size=16`, `image_size=256`, `in_channels=1` (single-channel rasterised layout).	Yes for spatial dims; canonical MAE uses 3-channel ImageNet. Single-channel is the project's domain choice.	A
Patch embedding	`nn.Conv2d` with kernel=stride=patch_size (ll. ~95–97).	Yes — standard ViT patchify-via-conv.	A
Position embedding	2D sin-cos via `_sincos_pos_embed`, non-learned (`requires_grad=False`, ll. ~98–100).	Yes — matches MAE paper §A.1.	A
Random masking	Per batch: sample noise, argsort it to obtain a shuffle, and keep the first `n*(1-mask_ratio)` indices (ll. ~141–154). Default `mask_ratio=0.75`.	Yes. This is the canonical MAE shuffle algorithm.	A
Encoder block	`LayerNorm → MultiheadAttention → residual → LayerNorm → MLP(GELU) → residual`, pre-norm (`_Block`, ll. ~70–84).	Yes — standard pre-norm Transformer.	A
Decoder forward pass	Linear projection from encoder dim to `decoder_embed_dim`, mask-token expansion + restore-via-gather, decoder-side pos-embed addition, `decoder_depth` blocks, final linear → flat patch values (ll. ~178–189).	Yes. Matches the MAE decoder described in §3 of the paper.	A
Mask token	Single learned embedding of shape `(1, 1, decoder_embed_dim)`, init `N(0, 0.02)` (ll. ~107–108).	Yes — matches MAE paper.	A
Reconstruction loss	L1 over masked patches only: `(pred − target).abs().mean(dim=-1)` weighted by `mask`, summed and normalised (ll. ~192–199).	Matches RFC 0001; diverges from the paper. The MAE paper uses MSE (mean squared error per patch), whereas RFC 0001 explicitly chooses L1. This is a documented divergence.	A
Patch normalization	Not implemented. The MAE paper (§3, "Reconstruction target") describes a variant that normalizes each target patch by its own mean and std before computing the loss.	No. RFC 0001 does not mention this. Possible quality gap on natural-image-like layouts; less likely to matter on binary rasterised layouts (mean/std are nearly constant per patch).	A
`train_step`	Single step: forward → loss → backward → optimizer.step. Returns scalar loss.	Yes — minimal training-loop primitive.	A
`set_decoder` / fine-tune adapter API	Not implemented. RFC 0001 marks this as a v0.2 follow-up.	N/A — by design for the v0.1 prototype.	A
Pretrained weights	None. The 200k-step A100 pretrain is a v0.2 deliverable.	N/A — by design.	A
HF Hub `from_pretrained`	Not wired for this model.	N/A — by design until weights exist.	A
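The random-masking and restore-via-gather rows above follow the shuffle pattern from the MAE reference. A minimal NumPy sketch of that pattern is given below. It is an illustration, not the code in layout_mae.py: the function names, the NumPy backend, and the choice to keep the token width unchanged (the module projects to `decoder_embed_dim` before restoring) are all assumptions made for brevity.

```python
import numpy as np


def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Argsort-shuffle masking on a (batch, n, dim) array.

    Returns the kept tokens, a binary mask in original patch order
    (1 = masked, 0 = kept), and the indices that undo the shuffle.
    """
    rng = np.random.default_rng() if rng is None else rng
    b, n, _ = tokens.shape
    n_keep = int(n * (1 - mask_ratio))

    noise = rng.random((b, n))                      # one noise value per patch
    ids_shuffle = np.argsort(noise, axis=1)         # random permutation per row
    ids_restore = np.argsort(ids_shuffle, axis=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :n_keep]
    kept = np.take_along_axis(tokens, ids_keep[:, :, None], axis=1)

    mask = np.ones((b, n))
    mask[:, :n_keep] = 0                            # first n_keep shuffled slots are visible
    mask = np.take_along_axis(mask, ids_restore, axis=1)
    return kept, mask, ids_restore


def restore(kept, ids_restore, mask_token):
    """Append mask tokens, then unshuffle back to original patch order."""
    b, n = ids_restore.shape
    n_masked = n - kept.shape[1]
    filler = np.broadcast_to(mask_token, (b, n_masked, kept.shape[2]))
    full = np.concatenate([kept, filler], axis=1)
    return np.take_along_axis(full, ids_restore[:, :, None], axis=1)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 256, 8))
kept, mask, ids_restore = random_masking(tokens, rng=rng)
full = restore(kept, ids_restore, mask_token=np.zeros(8))
print(kept.shape, int(mask[0].sum()))  # (2, 64, 8) 192
```

The second argsort is what makes the restore cheap: it yields, for every original patch position, its slot in the shuffled sequence, so a single gather puts visible tokens and mask tokens back in place.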

Findings

Architecture matches the canonical MAE recipe. ViT-S encoder, lightweight ViT decoder, 75% mask ratio, per-batch shuffle masking, sin-cos pos-embeds, mask-token-and-gather decoder input — all in place.
L1 reconstruction loss diverges from the MAE paper. The paper uses MSE; we use L1 per RFC 0001 §Architecture. Implication: absolute pretrain loss numbers are not comparable to the published MAE / SimMIM values. L1 is more robust to the binary-edge structure of rasterised layouts (where MSE would be dominated by edge pixels), so the divergence is well-motivated.
No per-patch target normalization. The MAE paper describes normalizing each target patch by its own mean and std before computing the loss. We do not. For binary rasterised layouts this is expected to matter little: pixel values are almost all in {0, 1}, so every patch already spans the same range. Worth revisiting if the input ever becomes anti-aliased greyscale.
No pretrained weights. The v0.1 prototype is the recipe, not a pretrained model. A LayoutMAE() constructed today has randomly initialised weights, so it is useful only for the train_step smoke test or as a starting point for project-internal pretraining. The v0.2 deliverable in RFC 0001 calls for 200k pretraining steps on an A100; that work is outside the scope of this audit.
No fine-tune adapter API. RFC 0001 explicitly defers this to v0.2. The encode() method is the documented frozen-feature path for any consumer that wants to pretrain elsewhere and consume features here.
He2022_MAE is now in docs/references.bib (added 2026-05-23). Paper-side architecture claims that were previously dual-marked A / C are now A — the canonical citation is pinned in the bib, so the audit no longer relies on an out-of-tree paper note. Adding docs/papers/He2022_MAE.pdf would be a further nicety but is no longer required for A confidence on these rows.
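To make the loss-related findings concrete, the following NumPy sketch computes a masked reconstruction loss with a switch between L1 and MSE and an optional per-patch target normalization. It is an assumption-laden illustration rather than the module's `reconstruction_loss`: the function name, the `kind` and `norm_target` parameters, and the `eps` value are all hypothetical.

```python
import numpy as np


def masked_recon_loss(pred, target, mask, kind="l1", norm_target=False, eps=1e-6):
    """Mean per-patch error over masked patches only.

    pred, target: (batch, n_patches, pixels_per_patch)
    mask:         (batch, n_patches), 1 = masked (scored), 0 = visible (ignored)
    """
    if norm_target:
        # Per-patch standardisation; eps guards flat patches with zero variance.
        mean = target.mean(axis=-1, keepdims=True)
        var = target.var(axis=-1, keepdims=True)
        target = (target - mean) / np.sqrt(var + eps)

    diff = pred - target
    if kind == "l1":
        per_patch = np.abs(diff).mean(axis=-1)
    else:
        per_patch = (diff ** 2).mean(axis=-1)
    return float((per_patch * mask).sum() / mask.sum())


rng = np.random.default_rng(0)
target = (rng.random((2, 4, 256)) > 0.5).astype(float)  # binary "layout" patches
pred = np.full_like(target, 0.5)                        # constant mid-grey guess
mask = np.array([[1, 1, 1, 0], [0, 1, 1, 1]], dtype=float)

print(masked_recon_loss(pred, target, mask, "l1"))   # 0.5
print(masked_recon_loss(pred, target, mask, "mse"))  # 0.25
```

The same prediction scores 0.5 under L1 and 0.25 under MSE, which is why absolute loss values from this module should not be read against numbers reported with a different loss or target normalization.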

Implications for users

Treat this as a recipe, not a model. Random-init LayoutMAE() will not give meaningful features. The work to pretrain on a layout corpus is downstream of this audit.
Don't compare reconstruction loss values to published MAE numbers. The loss here is L1 rather than MSE and the targets are not patch-normalized, so the absolute scale differs.
Citation hygiene: if you publish using this module, cite He2022_MAE (now in docs/references.bib) for architecture lineage, and docs/rfcs/0001-base-model.md for the project-level recipe choices (L1, single-channel, 256×256, 16×16 patches).

Re-audit triggers

Re-run this audit when any of the following change:

LayoutMAEConfig defaults change (encoder/decoder dims, depth, mask_ratio).
reconstruction_loss switches from L1 to MSE, or gains target patch-normalization.
set_decoder / fine-tune adapter API lands.
A from_pretrained path lands and v0.2 pretrained weights are published.
The He et al. MAE paper PDF lands in docs/papers/ — a nicety since He2022_MAE is already in docs/references.bib (2026-05-23), but useful for any future claim that needs a direct quote rather than a citation.
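The first trigger can be checked mechanically. The sketch below compares a mapping of current defaults against the values this audit recorded. `RFC_0001_PINNED` and `drifted_fields` are hypothetical names, and feeding the function from `LayoutMAEConfig` (for example via `dataclasses.asdict`, if the config is a dataclass) is an assumption about the project's code.

```python
# Values recorded in this audit; the dictionary itself is illustrative.
RFC_0001_PINNED = {
    "embed_dim": 384,
    "depth": 12,
    "num_heads": 6,
    "decoder_embed_dim": 256,
    "decoder_depth": 4,
    "decoder_num_heads": 8,
    "patch_size": 16,
    "image_size": 256,
    "mask_ratio": 0.75,
}


def drifted_fields(current):
    """Return {field: (pinned, current)} for each pinned value that differs or is missing."""
    return {
        name: (pinned, current.get(name))
        for name, pinned in RFC_0001_PINNED.items()
        if current.get(name) != pinned
    }


# Simulate a change to the mask ratio.
changed = {**RFC_0001_PINNED, "mask_ratio": 0.6}
print(drifted_fields(RFC_0001_PINNED))  # {}
print(drifted_fields(changed))          # {'mask_ratio': (0.75, 0.6)}
```

A non-empty result would be a signal to re-run the audit, not a failure in itself.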