trmn/llm
Purpose
NOT a user-facing chat feature. This is infrastructure for automatic procedural dialogue between characters in the show. The chat panel was a test harness only. Future use: generate spoken lines for characters as part of the deterministic life-simulation loop.
Architecture
In-browser LLM weights bundled into the site via nix build. No backend, no external inference at runtime.
Model: roneneldan/TinyStories-1M — GPT-Neo architecture, model_type: "gpt_neo",
hidden=64, vocab=50257 (GPT-2 BPE tokenizer). Output is surreal/nonsensical children's story prose —
intentionally acceptable for this project.
Final ONNX INT8 size: 15 MB (well under Cloudflare Pages 25 MiB per-file limit).
Runtime stack:
- `onnxruntime-web` loaded as a global `<script>` tag (not ESM importmap)
- `@huggingface/transformers` AutoTokenizer only (no pipeline); tokenizer files served from `/model/TinyStories-1M/`
- Custom top-k generation loop in JS; no KV cache (full sequence re-processed each step)
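A minimal sketch of the sampling step inside a custom top-k loop like the one listed above. This is not the project's actual code: the function name `sampleTopK` and the injectable `rng` parameter are assumptions for illustration, and the defaults (k=40, temperature=0.9) are taken from the Conventions section.

```javascript
// Sample one token id from a single row of logits using top-k filtering
// and temperature scaling. `sampleTopK` and `rng` are hypothetical names;
// `rng` is injectable so the function can be tested deterministically.
function sampleTopK(logits, k = 40, temperature = 0.9, rng = Math.random) {
  // Pair each logit with its token id and keep the k largest.
  const indexed = Array.from(logits, (value, id) => ({ id, value }));
  indexed.sort((a, b) => b.value - a.value);
  const top = indexed.slice(0, Math.min(k, indexed.length));

  // Unnormalized softmax over the survivors, scaled by temperature.
  // Subtracting the max keeps Math.exp from overflowing.
  const maxValue = top[0].value;
  const weights = top.map((t) => Math.exp((t.value - maxValue) / temperature));
  const total = weights.reduce((sum, w) => sum + w, 0);

  // Walk the cumulative distribution until it passes the random draw.
  let threshold = rng() * total;
  for (let i = 0; i < top.length; i++) {
    threshold -= weights[i];
    if (threshold <= 0) return top[i].id;
  }
  return top[top.length - 1].id; // guard against floating-point drift
}
```

Sorting the full 50257-entry row on every step is simple rather than fast; a partial selection would be cheaper, but the model forward pass likely dominates anyway.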
Deployment constraint
Cloudflare Pages has a 25 MiB per-file limit. This is why TinyStories-1M (15 MB) was chosen over 8M (40 MB).
TinyStories-8M's embedding alone (50k vocab × hidden=256 × FP32) is about 52 MB, and quantize_dynamic can't shrink it because it skips the Gather op. TinyStories-1M's hidden=64 makes the embedding 4× smaller.
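The embedding arithmetic above can be checked directly. A small sketch; the helper name `embeddingBytes` is made up for illustration, and the numbers are the vocab and hidden sizes quoted in this document.

```javascript
// FP32 embedding table size = vocab × hidden × 4 bytes.
const VOCAB_SIZE = 50257; // GPT-2 BPE vocabulary
const BYTES_PER_FP32 = 4;
const CLOUDFLARE_LIMIT_BYTES = 25 * 1024 * 1024; // 25 MiB per-file limit

// Hypothetical helper: bytes used by an unquantized embedding table.
function embeddingBytes(hiddenSize) {
  return VOCAB_SIZE * hiddenSize * BYTES_PER_FP32;
}

const bytes1M = embeddingBytes(64);  // 12865792 bytes, about 12.9 MB
const bytes8M = embeddingBytes(256); // 51463168 bytes, about 51.5 MB

console.log(bytes8M / bytes1M);                // 4
console.log(bytes8M > CLOUDFLARE_LIMIT_BYTES); // true: too big on its own
console.log(bytes1M < CLOUDFLARE_LIMIT_BYTES); // true
```

Since the embedding stays FP32 after dynamic quantization, it accounts for most of the 15 MB final file in the 1M model.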
Conventions
- ONNX model has ONLY `input_ids` (int64) as input; output: `logits` (float32 [1, seq, 50257])
- `attention_mask` is NOT an input to the exported ONNX graph (was patched away during tracing)
- Tokenizer: `env.localModelPath = '/model/'; env.allowRemoteModels = false; AutoTokenizer.from_pretrained('TinyStories-1M')`
- Prompt must be story-style: `A man was asked "${text}". He smiled and said, "`
- Do NOT use dialogue-format prompts like `"you: X\nhim: "`: the model immediately predicts `\n`, giving empty output
- Stop: first complete sentence (`/^(.*?[.!?])/s`) on decoded text with `"` stripped (`replace(/"/g, '')`)
- Top-k=40, temperature=0.9, max 60 new tokens
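The prompt and stop conventions above can be sketched as two small helpers. The function names `buildPrompt` and `extractFirstSentence` are assumptions, as is the choice to apply the stop check to the decoded continuation only (without the prompt); the template and both regexes are the ones listed above.

```javascript
// Wrap the input in the story-style prompt the model responds to.
function buildPrompt(text) {
  return `A man was asked "${text}". He smiled and said, "`;
}

// Return the first complete sentence of the decoded continuation,
// or null if no sentence-ending punctuation has been generated yet.
// Assumes `decoded` holds only the newly generated text.
function extractFirstSentence(decoded) {
  const cleaned = decoded.replace(/"/g, ''); // strip double quotes
  const match = cleaned.match(/^(.*?[.!?])/s); // lazy: stop at first . ! or ?
  return match ? match[1] : null;
}
```

A generation loop could call `extractFirstSentence` after each decoded step and stop as soon as it returns a non-null value, falling back to the 60-token cap otherwise.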
Gotchas
Nix build:
- Do NOT use `optimum`; use `torch.onnx.export` directly with `dynamo=False`
- Add `onnxscript` to nativeBuildInputs (PyTorch 2.12 lazily imports it even for legacy export)
- Patch `create_causal_mask` to `lambda **kwargs: None` before tracing (transformers 5.5.4 incompatibility)
- `quantize_dynamic` skips Gather (embedding); the embedding dominates file size for large vocab models
- Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`) are identical between TinyStories-1M and 8M (same GPT-2 BPE tokenizer); only `config.json` and `pytorch_model.bin` differ
- nativeBuildInputs: `transformers onnxruntime onnxscript torch` (no optimum)
JS runtime:
- Load `onnxruntime-web` via `<script>` tag BEFORE the importmap/module scripts
- Only pass `input_ids` to `session.run()`; passing `attention_mask` throws INVALID_ARGUMENT
- Generation is slow (no KV cache): each step re-runs full accumulated sequence
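A rough sketch of what the no-KV-cache loop implies, under stated assumptions: `generate` and `pickToken` are hypothetical names, and `ort` and `session` are passed in as parameters (rather than read from the global) so the loop can be exercised with stand-ins. The sentence-based stop check is omitted because it needs the tokenizer to decode.

```javascript
// Generate up to `maxNewTokens` token ids, re-running the whole sequence
// on every step because the exported graph has no KV cache.
// `ort` is the onnxruntime-web namespace, `session` an InferenceSession,
// `pickToken` a hypothetical sampler that maps a logits row to a token id.
async function generate(ort, session, promptIds, pickToken, maxNewTokens = 60) {
  const ids = [...promptIds];
  for (let step = 0; step < maxNewTokens; step++) {
    // int64 tensors are backed by BigInt64Array.
    const data = BigInt64Array.from(ids, (id) => BigInt(id));
    const inputIds = new ort.Tensor('int64', data, [1, ids.length]);

    // Feed ONLY input_ids; the graph has no attention_mask input.
    const { logits } = await session.run({ input_ids: inputIds });

    // logits has shape [1, seq, vocab]; only the last position predicts
    // the next token.
    const vocab = logits.dims[2];
    const lastRow = logits.data.subarray((ids.length - 1) * vocab, ids.length * vocab);
    ids.push(pickToken(lastRow));
  }
  return ids.slice(promptIds.length); // newly generated ids only
}
```

Each step processes one more token than the last, so the total work grows roughly quadratically with output length; with a 60-token cap and a 1M-parameter model that cost is tolerable.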
Key Files
- `flake.nix`: tinyStoriesOnnx sub-derivation (fetchurl + torch.onnx.export + quantize_dynamic)
- `index.html`: two separate module scripts, Three.js scene + inference (ort global + AutoTokenizer)