RFC 0002 — Layout Tokens¶
| Status | Draft |
| Author | OpenLithoHub maintainers |
| Created | 2026-05-19 |
| Targets | v0.2 |
| Related | RFC 0001 (Layout-MAE Base Model), openlithohub.synth.DiffusionLayoutGenerator |
Summary¶
Define a polygon-level tokenisation of layouts so a transformer can generate layouts as sequences of tokens instead of rasters. This unlocks:
- Autoregressive layout generation that respects PDK rules at the tokeniser level — no post-hoc DRC repair pass.
- Conditioning on tokenised PDK rules (so a single model can serve multiple PDKs).
- Length-prefixed compression: 256×256 raster → ~1k tokens for typical patches.
This RFC defines only the tokeniser format and the public interface.
The transformer that consumes these tokens is out of scope; the
intended downstream consumer is DiffusionLayoutGenerator (which we
expect to replace with an autoregressive token model in v0.2).
Why tokens¶
- PDK awareness is built in: every coordinate is snapped to the manufacturing grid and every emitted polygon respects min-width and min-spacing by construction (a token sequence that violates these is unrepresentable, not just discouraged).
- Lossless: round-trip layout → tokens → layout matches at the pixel level, unlike rasterisation at finite pixel pitch.
- Variable-resolution: the same token stream can render at any pixel size; useful when downstream tools expect different grid resolutions.
- Cheaper attention: typical 256×256 layouts have <2k polygon vertices but ~65k pixels.
Non-goals¶
- Replacing rasters everywhere. RFC 0001's MAE backbone stays raster- based; tokens are an additional representation, not a replacement.
- Hierarchical / cell-level tokenisation. v0.2 is flat polygon tokens.
- Net-level / connectivity tokens. We tokenise geometry only; netlist awareness is a v0.3 problem.
Token vocabulary¶
Five token classes:
| Class | Encoding | Notes |
|---|---|---|
<bos> |
reserved id 0 | Start of layout. |
<eos> |
reserved id 1 | End of layout. |
<polygon> |
reserved id 2 | Marks the start of a polygon. |
<vertex> |
(x, y) tuple, quantised to PDK grid | 2 ints per vertex. |
<close> |
reserved id 3 | Closes the current polygon. |
Coordinate quantisation: each (x, y) is snapped to a multiple of the PDK manufacturing grid (typically 1nm or 0.5nm). The vocabulary size is then ~2 × (canvas_size / grid). For a 256nm canvas at 1nm grid: ~512 distinct coordinate ids.
For full PDK-rule awareness we additionally support a rule-prefix sequence — a small, fixed-format header that encodes (pdk_name, pixel_size_nm, min_width_nm, min_spacing_nm). This lets a single model condition on the target PDK at inference time.
Sequence format¶
Polygons are emitted in scan order (top-to-bottom, left-to-right of their bounding-box minimum), and vertices within each polygon are emitted clockwise starting from the top-left vertex of the polygon's bounding box. This canonicalisation is required — without it the same layout has many valid token sequences and the model wastes capacity modelling that ambiguity.
Public API¶
from openlithohub.tokens import LayoutTokenizer
tokenizer = LayoutTokenizer.from_pdk("freepdk45")
# Mask tensor → token ids.
ids = tokenizer.encode(mask_tensor) # (T,) int64
# Token ids → mask tensor.
mask = tokenizer.decode(ids, size=256)
# Round-trip should be exact.
assert torch.equal(mask, tokenizer.decode(tokenizer.encode(mask), size=mask.shape[-1]))
The tokenizer is implemented as a pure Python class on top of
shapely — no Torch ops needed for tokenisation itself.
Encoder pipeline¶
- Vectorise: extract polygon outlines from the binary mask via marching-squares + Douglas-Peucker simplification.
- Snap: round each vertex to the PDK grid; drop polygons whose
simplification yields fewer than 3 vertices or whose area is below
min_area_nm2. - Canonicalise: order polygons + vertices as defined above.
- Emit: write
<bos> <pdk-rules> <polygon> ... <eos>.
Decoder pipeline¶
- Parse: split sequence at
<polygon>/<close>boundaries. - Reject: drop sequences where any polygon has self-intersection
or area below
min_area_nm2. (Reject silently is fine — sampling re-tries.) - Rasterise: use
shapelypolygon → raster at the requested pixel size.
Prior art¶
- PolyGen (Nash et al., 2020): autoregressive polygon meshes for 3D geometry. We borrow the (vertex, face) factoring and the canonical-ordering trick.
- VectorMask internal/proprietary work at major fabs is known to use polygon tokenisation for ML-OPC; details are not public but the shape of the problem is the same.
- DeepSDF / CodeOPC academic work uses signed-distance fields rather than polygons — orthogonal approach, not pursued here.
Evaluation¶
Tokeniser-only evaluation (no model in the loop):
- Round-trip exactness: encode-then-decode on synthetic + public layouts gives identical pixel masks. Target: 100% on synthetic, ≥99% on public (the ≤1% loss is from sub-grid features that don't survive snapping — those are MRC violations anyway).
- Compression ratio: tokens-per-pixel for representative patches. Target: <5% (256×256 = 65k pixels → <3.3k tokens).
- Vocabulary size: <4k tokens including all coordinate ids for the 256nm canvas. (Larger canvases use windowed encoding.)
Open questions¶
- Variable-coordinate vocabulary: emitting (x, y) as two ints doubles sequence length; emitting them as a single packed id grows the vocab quadratically. We pick two-int emission for v0.2 and revisit if attention cost dominates.
- Multi-layer layouts: Manhattan layouts often have multiple metal layers; v0.2 tokenises one layer at a time. Layer-stack tokenisation is a v0.3 question.
- Holes: polygons with holes (e.g., ring shapes) are represented
as outer + inner contours separated by a
<hole>token. Spec'd in v0.2; not the common case.
Rollout¶
- v0.2 alpha: tokeniser implementation, round-trip tests, no model.
- v0.2 beta: small autoregressive transformer trained on synthetic layouts, replacing the diffusion stub.
- v0.2 GA: documented public API + a HuggingFace dataset of pre- tokenised public layouts.
Alternatives considered¶
- Pixel-level autoregression (PixelCNN-style on rasters): does not scale beyond ~64×64 patches and offers no PDK-rule guarantees.
- Bezier-curve tokens: nice for analog cells but 95% of digital layout is Manhattan polygons, so the added vocab/complexity is not worth it for v0.2.
- DSL-based sequences (gates, routes): captures intent better but is fab-specific. We tokenise post-place-and-route geometry, not intent.