trmn/llm

Purpose
Architecture
Deployment constraint
Conventions
Gotchas
Key Files

Purpose

NOT a user-facing chat feature. This is infrastructure for automatic procedural dialogue between characters in the show. The chat panel was a test harness only. Future use: generate spoken lines for characters as part of the deterministic life-simulation loop.

Architecture

The LLM runs in the browser; its weights are bundled into the site by the Nix build. No backend, no external inference at runtime.

Model: roneneldan/TinyStories-1M — GPT-Neo architecture, model_type: "gpt_neo", hidden=64, vocab=50257 (GPT-2 BPE tokenizer). Output is surreal/nonsensical children's story prose — intentionally acceptable for this project.

Final ONNX INT8 size: 15 MB (well under Cloudflare Pages 25 MiB per-file limit).

Runtime stack:

onnxruntime-web loaded as a global <script> tag (not ESM importmap)
@huggingface/transformers AutoTokenizer only (no pipeline) — tokenizer files served from /model/TinyStories-1M/
Custom top-k generation loop in JS; no KV cache (full sequence re-processed each step)
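
The sampling step of a top-k loop like the one listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `sampleTopK` and the injectable `rng` parameter are assumptions made for the example, and the defaults mirror the top-k=40 and temperature=0.9 values from Conventions.

```javascript
// Sketch of top-k sampling with temperature over one row of logits.
// `logits` is a plain array or Float32Array with one entry per vocab token.
// `rng` is a hypothetical injectable random source (defaults to Math.random).
function sampleTopK(logits, k = 40, temperature = 0.9, rng = Math.random) {
  // Token ids of the k largest logits, in descending order.
  // A full sort is simple but not the fastest option for a 50257-entry row.
  const top = Array.from({ length: logits.length }, (_, i) => i)
    .sort((a, b) => logits[b] - logits[a])
    .slice(0, k);

  // Temperature-scaled softmax weights, shifted by the max for stability.
  const maxLogit = logits[top[0]];
  const weights = top.map((i) => Math.exp((logits[i] - maxLogit) / temperature));
  const total = weights.reduce((sum, w) => sum + w, 0);

  // Draw one token id from the renormalized distribution.
  let r = rng() * total;
  for (let j = 0; j < top.length; j++) {
    r -= weights[j];
    if (r <= 0) return top[j];
  }
  return top[top.length - 1]; // guard against floating-point leftovers
}
```

With k=1 this reduces to greedy decoding, which can be handy when debugging the loop.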

Deployment constraint

Cloudflare Pages has a 25 MiB per-file limit. This is why TinyStories-1M (15 MB) was chosen over TinyStories-8M (40 MB). TinyStories-8M's embedding alone (50257 vocab × hidden=256 × 4 bytes FP32) is about 51.5 MB, and quantize_dynamic can't shrink it because the lookup is a Gather op, which it skips. TinyStories-1M's hidden=64 makes the embedding 4× smaller (about 12.9 MB).
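
The embedding arithmetic can be checked with a few lines of JavaScript. This is only a back-of-the-envelope sketch: the helper name `embeddingBytes` and the constant names are made up for the example, and it counts the token-embedding matrix only, not the rest of the model.

```javascript
// Size in bytes of an unquantized token-embedding matrix:
// vocab × hidden × bytes per parameter (4 for FP32).
function embeddingBytes(vocabSize, hiddenSize, bytesPerParam = 4) {
  return vocabSize * hiddenSize * bytesPerParam;
}

const VOCAB = 50257;                      // GPT-2 BPE vocabulary size
const PAGES_LIMIT_BYTES = 25 * 1024 * 1024; // Cloudflare Pages 25 MiB per-file limit

const emb8M = embeddingBytes(VOCAB, 256); // 51,463,168 bytes
const emb1M = embeddingBytes(VOCAB, 64);  // 12,865,792 bytes

console.log(emb8M > PAGES_LIMIT_BYTES);   // true: too big even before other weights
console.log(emb1M < PAGES_LIMIT_BYTES);   // true: leaves room for the rest of the model
```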

Conventions

ONNX model has ONLY input_ids (int64) as input; output: logits (float32 [1, seq, 50257])
attention_mask is NOT an input to the exported ONNX graph (was patched away during tracing)
Tokenizer: env.localModelPath = '/model/'; env.allowRemoteModels = false; AutoTokenizer.from_pretrained('TinyStories-1M')
Prompt must be story-style: `A man was asked "${text}". He smiled and said, "`
- Do NOT use dialogue-format prompts like "you: X\nhim: ": the model immediately predicts \n and the output is empty
Stop: cut at the first complete sentence, matching /^(.*?[.!?])/s against the decoded text after stripping " characters (replace(/"/g, ''))
Top-k=40, temperature=0.9, max 60 new tokens
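
The prompt and stop conventions above might be wrapped in two small helpers, sketched here. The function names `buildPrompt` and `extractFirstSentence` are hypothetical, and the sketch assumes the regex is applied to the generated continuation only (the text after the prompt), since the prompt itself already contains a full stop.

```javascript
// Wrap the input text in the story-style prompt from the conventions.
function buildPrompt(text) {
  return `A man was asked "${text}". He smiled and said, "`;
}

// Given the decoded continuation (assumed: text generated after the prompt),
// strip double quotes and keep the first complete sentence.
// Returns null while no sentence terminator has been generated yet.
function extractFirstSentence(continuation) {
  const cleaned = continuation.replace(/"/g, '');
  const match = cleaned.match(/^(.*?[.!?])/s);
  return match ? match[1].trim() : null;
}
```

Returning null for an unfinished sentence lets a generation loop use the same helper as its stop check.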

Gotchas

Nix build:

Do NOT use optimum — use torch.onnx.export directly with dynamo=False
Add onnxscript to nativeBuildInputs (PyTorch 2.12 lazily imports it even for legacy export)
Patch create_causal_mask to lambda **kwargs: None before tracing (transformers 5.5.4 incompatibility)
quantize_dynamic skips Gather (embedding) — the embedding dominates file size for large vocab models
Tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json) are identical between TinyStories-1M and 8M — same GPT-2 BPE tokenizer; only config.json and pytorch_model.bin differ
nativeBuildInputs: transformers onnxruntime onnxscript torch (no optimum)

JS runtime:

Load onnxruntime-web via <script> tag BEFORE the importmap/module scripts
Only pass input_ids to session.run() — passing attention_mask throws INVALID_ARGUMENT
Generation is slow (no KV cache): each step re-runs the full accumulated sequence
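
A cache-free loop of the kind described above could look roughly like this. It is a sketch under stated assumptions, not the code in index.html: `generateIds`, `pickNext`, and `shouldStop` are hypothetical names, `session` is assumed to be an onnxruntime-web InferenceSession for the exported model, and `ort` is passed in as a parameter (rather than read from the global) only to keep the example self-contained.

```javascript
// Sketch: autoregressive generation without a KV cache.
// pickNext(row) maps the last position's logits to a token id (e.g. a top-k sampler).
// shouldStop(newIds) is an optional early-exit check (e.g. "first sentence complete").
async function generateIds(session, ort, promptIds, pickNext,
                           maxNewTokens = 60, shouldStop = () => false) {
  const ids = [...promptIds];
  for (let step = 0; step < maxNewTokens; step++) {
    // The whole accumulated sequence is fed again on every step.
    const data = BigInt64Array.from(ids.map((id) => BigInt(id)));
    const inputIds = new ort.Tensor('int64', data, [1, ids.length]);

    // Only input_ids is passed; the exported graph has no attention_mask input.
    const { logits } = await session.run({ input_ids: inputIds });

    // logits has shape [1, seq, vocab]; keep the row for the last position.
    const vocab = logits.dims[2];
    const lastRow = logits.data.subarray((ids.length - 1) * vocab, ids.length * vocab);

    ids.push(pickNext(lastRow));
    if (shouldStop(ids.slice(promptIds.length))) break;
  }
  return ids.slice(promptIds.length); // newly generated token ids only
}
```

Because each step processes every earlier token again, total work grows roughly quadratically with output length, which is tolerable at a 60-token cap on a model this small.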

Key Files

flake.nix — tinyStoriesOnnx sub-derivation: fetchurl + torch.onnx.export + quantize_dynamic
index.html — Two separate module scripts: Three.js scene + inference (ort global + AutoTokenizer)
