:PROPERTIES:
:ID: 4b44cf43-6106-4498-81a3-b23ebb25dabf
:END:

#+title: trmn/llm
#+filetags: :project: :knowledge: :llm:

** Purpose
NOT a user-facing chat feature. This is infrastructure for *automatic procedural dialogue* between
characters in the show. The chat panel was a test harness only. Future use: generate spoken lines
for characters as part of the deterministic life-simulation loop.

** Architecture
In-browser LLM weights bundled into the site via ~nix build~. No backend, no external inference at runtime.

Model: ~roneneldan/TinyStories-1M~ — GPT-Neo architecture, ~model_type: "gpt_neo"~,
hidden=64, vocab=50257 (GPT-2 BPE tokenizer). Output is surreal/nonsensical children's story prose —
intentionally acceptable for this project.

Final ONNX INT8 size: *15 MB* (well under the Cloudflare Pages 25 MiB per-file limit).

Runtime stack:
- ~onnxruntime-web~ loaded as a global ~<script>~ tag (not ESM importmap)
- ~@huggingface/transformers~ AutoTokenizer only (no pipeline) — tokenizer files served from ~/model/TinyStories-1M/~
- Custom top-k generation loop in JS; no KV cache (full sequence re-processed each step)

** Deployment constraint
Cloudflare Pages has a *25 MiB per-file limit*. This is why TinyStories-1M (15 MB) was chosen over 8M (40 MB).
TinyStories-8M's embedding alone (50257 vocab × hidden=256 × 4 bytes FP32) is about 51.5 MB, and ~quantize_dynamic~ can't touch it
(Gather op). TinyStories-1M's hidden=64 makes the embedding 4× smaller (about 12.9 MB).
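The size arithmetic can be checked with a few lines of JS. This is only a worked calculation using the vocab and hidden sizes quoted in this note; the names ~embeddingBytes~ and ~PAGES_LIMIT_BYTES~ are made up for the sketch.

```javascript
// Worked arithmetic for the FP32 embedding table, using figures from this note.
const VOCAB = 50257;                         // GPT-2 BPE vocabulary size
const BYTES_FP32 = 4;                        // quantize_dynamic leaves Gather weights in FP32
const PAGES_LIMIT_BYTES = 25 * 1024 * 1024;  // 25 MiB per-file limit

// Hypothetical helper: bytes taken by a vocab x hidden FP32 embedding.
function embeddingBytes(hidden) {
  return VOCAB * hidden * BYTES_FP32;
}

const emb1M = embeddingBytes(64);   // TinyStories-1M
const emb8M = embeddingBytes(256);  // TinyStories-8M

console.log((emb1M / 1e6).toFixed(1), 'MB');  // 12.9 MB
console.log((emb8M / 1e6).toFixed(1), 'MB');  // 51.5 MB
console.log(emb8M > PAGES_LIMIT_BYTES);       // true: the embedding alone exceeds the limit
console.log(emb8M / emb1M);                   // 4
```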

** Conventions
- ONNX model has ONLY ~input_ids~ (int64) as input; output: ~logits~ (float32 [1, seq, 50257])
- ~attention_mask~ is NOT an input to the exported ONNX graph (was patched away during tracing)
- Tokenizer: ~env.localModelPath = '/model/'; env.allowRemoteModels = false; AutoTokenizer.from_pretrained('TinyStories-1M')~
- Prompt must be story-style: ~`A man was asked "${text}". He smiled and said, "`~
  - Do NOT use dialogue-format prompts like ~"you: X\nhim: "~: the model immediately predicts ~\n~ and the output is empty
- Stop: cut at the first complete sentence (~/^(.*?[.!?])/s~) of the decoded text, after stripping ~"~ (~replace(/"/g, '')~)
- Top-k=40, temperature=0.9, max 60 new tokens
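A minimal sketch of the conventions above as pure functions. The prompt template, regex, and top-k/temperature values come from this note; the function names (~buildPrompt~, ~firstSentence~, ~sampleTopK~) and the injectable ~rand~ parameter are hypothetical, and the sketch assumes the stop rule is applied to the decoded continuation only, not to the prompt.

```javascript
const TOP_K = 40;
const TEMPERATURE = 0.9;

// Story-style prompt; the trailing open quote invites a spoken line.
function buildPrompt(text) {
  return `A man was asked "${text}". He smiled and said, "`;
}

// Strip double quotes, then keep everything up to the first . ! or ?
// Returns null while no complete sentence has been generated yet.
function firstSentence(decoded) {
  const cleaned = decoded.replace(/"/g, '');
  const match = cleaned.match(/^(.*?[.!?])/s);
  return match ? match[1] : null;
}

// Top-k sampling with temperature over one position's logits.
// `rand` is injectable so the function can be tested deterministically.
function sampleTopK(logits, k = TOP_K, temperature = TEMPERATURE, rand = Math.random) {
  // Indices of the k largest logits, highest first.
  const top = Array.from(logits, (_, i) => i)
    .sort((a, b) => logits[b] - logits[a])
    .slice(0, k);
  // Softmax weights over the kept logits (max subtracted for stability).
  const maxLogit = logits[top[0]];
  const weights = top.map((i) => Math.exp((logits[i] - maxLogit) / temperature));
  const total = weights.reduce((a, b) => a + b, 0);
  // Draw one index in proportion to its weight.
  let r = rand() * total;
  for (let j = 0; j < top.length; j++) {
    r -= weights[j];
    if (r < 0) return top[j];
  }
  return top[top.length - 1];
}
```

Sorting the full vocabulary on every step is not the fastest way to find the top k, but it keeps the sketch short; with no KV cache the model forward pass is likely the dominant cost anyway.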

** Gotchas
*Nix build:*
- Do NOT use ~optimum~ — use ~torch.onnx.export~ directly with ~dynamo=False~
- Add ~onnxscript~ to nativeBuildInputs (PyTorch 2.12 lazily imports it even for legacy export)
- Patch ~create_causal_mask~ to ~lambda **kwargs: None~ before tracing (transformers 5.5.4 incompatibility)
- ~quantize_dynamic~ skips Gather (embedding) — the embedding dominates file size for large vocab models
- Tokenizer files (~tokenizer.json~, ~tokenizer_config.json~, ~special_tokens_map.json~) are identical
  between TinyStories-1M and 8M — same GPT-2 BPE tokenizer; only ~config.json~ and ~pytorch_model.bin~ differ
- nativeBuildInputs: ~transformers onnxruntime onnxscript torch~ (no optimum)

*JS runtime:*
- Load ~onnxruntime-web~ via ~<script>~ tag BEFORE the importmap/module scripts
- Only pass ~input_ids~ to ~session.run()~ — passing ~attention_mask~ throws INVALID_ARGUMENT
- Generation is slow (no KV cache): each step re-runs the full accumulated sequence
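The runtime points above can be sketched as a cache-free loop. This is an illustration under assumptions, not the code in ~index.html~: ~generate~, ~pickNext~ and ~shouldStop~ are hypothetical names, and ~ort~ and ~session~ are passed in as parameters (in the page they would be the ~onnxruntime-web~ global and an already created inference session).

```javascript
// Cache-free generation: every step feeds the whole sequence again.
// pickNext(logits) chooses a token id from the last position's logits;
// shouldStop(newIds) lets the caller end early (e.g. once a sentence is complete).
async function generate(ort, session, promptIds, pickNext,
                        maxNewTokens = 60, vocab = 50257,
                        shouldStop = () => false) {
  const ids = promptIds.slice();
  for (let step = 0; step < maxNewTokens; step++) {
    // int64 tensors take BigInt data in JS.
    const data = BigInt64Array.from(ids, (id) => BigInt(id));
    const inputIds = new ort.Tensor('int64', data, [1, ids.length]);
    // Feed ONLY input_ids; the exported graph has no attention_mask input.
    const { logits } = await session.run({ input_ids: inputIds });
    // logits.data is a flat [1, seq, vocab] buffer; keep the last position.
    const last = logits.data.subarray((ids.length - 1) * vocab, ids.length * vocab);
    ids.push(pickNext(last));
    if (shouldStop(ids.slice(promptIds.length))) break;
  }
  // Return only the newly generated token ids.
  return ids.slice(promptIds.length);
}
```

Because the sequence grows by one token per step and is re-run in full each time, total work grows roughly quadratically with the number of generated tokens, which is why the 60-token cap matters.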

** Key Files
- ~flake.nix~ — tinyStoriesOnnx sub-derivation: fetchurl + torch.onnx.export + quantize_dynamic
- ~index.html~ — Two separate module scripts: Three.js scene + inference (ort global + AutoTokenizer)