roam/trmnllm.org
2026-06-17 07:36:46 +03:00

:PROPERTIES:
:ID: 4b44cf43-6106-4498-81a3-b23ebb25dabf
:END:
#+title: trmn/llm
#+filetags: :project: :knowledge: :llm:
** Purpose
NOT a user-facing chat feature. This is infrastructure for *automatic procedural dialogue* between
characters in the show. The chat panel was a test harness only. Future use: generate spoken lines
for characters as part of the deterministic life-simulation loop.
** Architecture
In-browser LLM weights bundled into the site via ~nix build~. No backend, no external inference at runtime.
Model: ~roneneldan/TinyStories-1M~ — GPT-Neo architecture, ~model_type: "gpt_neo"~,
hidden=64, vocab=50257 (GPT-2 BPE tokenizer). Output is surreal/nonsensical children's story prose —
intentionally acceptable for this project.
Final ONNX INT8 size: *15 MB* (well under the Cloudflare Pages 25 MiB per-file limit).
Runtime stack:
- ~onnxruntime-web~ loaded as a global ~<script>~ tag (not ESM importmap)
- ~@huggingface/transformers~ AutoTokenizer only (no pipeline) — tokenizer files served from ~/model/TinyStories-1M/~
- Custom top-k generation loop in JS; no KV cache (full sequence re-processed each step)
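The custom loop can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: ~sampleTopK~, ~generate~, and the ~forward~ callback are hypothetical names, and ~forward~ stands in for whatever runs the ONNX session and returns the logits of the last position. The defaults (k=40, temperature=0.9, 60 new tokens) are the values listed under Conventions in this note.

```javascript
// Sample one token id from the logits of the last position.
// `rng` is injectable (an assumption for testability); it defaults to Math.random.
function sampleTopK(logits, k = 40, temperature = 0.9, rng = Math.random) {
  // Indices of the k largest logits, best first.
  const top = Array.from(logits.keys())
    .sort((a, b) => logits[b] - logits[a])
    .slice(0, k);

  // Temperature-scaled softmax weights over the kept logits only.
  const maxLogit = logits[top[0]];
  const weights = top.map((i) => Math.exp((logits[i] - maxLogit) / temperature));
  const total = weights.reduce((a, b) => a + b, 0);

  // Inverse-CDF draw over the unnormalized weights.
  let r = rng() * total;
  for (let j = 0; j < top.length; j++) {
    r -= weights[j];
    if (r <= 0) return top[j];
  }
  return top[top.length - 1]; // guard against floating-point leftovers
}

// Generation without a KV cache: every step feeds the whole sequence back in.
// `forward(ids)` is a hypothetical async callback that returns the logits
// (one value per vocabulary entry) for the last position of `ids`.
async function generate(forward, promptIds, maxNewTokens = 60, sample = sampleTopK) {
  const ids = [...promptIds];
  for (let step = 0; step < maxNewTokens; step++) {
    const logits = await forward(ids); // full sequence re-processed each step
    ids.push(sample(logits));
  }
  return ids.slice(promptIds.length); // only the newly generated token ids
}
```

Sorting the whole vocabulary on every step is the simplest way to find the top k; with 50257 entries it is cheap next to the model's forward pass, so the sketch does not bother with a partial selection.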
** Deployment constraint
Cloudflare Pages has a *25 MiB per-file limit*. This is why TinyStories-1M (15 MB) was chosen over 8M (40 MB).
TinyStories-8M's embedding alone (50257 vocab × hidden=256 × 4 bytes for FP32) is about 51.5 MB, and ~quantize_dynamic~ can't touch it
(Gather op). TinyStories-1M's hidden=64 makes the embedding 4× smaller (about 12.9 MB).
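The size argument can be checked with a few lines of arithmetic. The sketch below only multiplies out the numbers already given in this note (vocab=50257, hidden=64 and 256, 4 bytes per FP32 value); the helper name ~embeddingBytes~ is made up for the illustration.

```javascript
// Size of an embedding table: vocab rows × hidden columns × bytes per value.
function embeddingBytes(vocab, hidden, bytesPerParam = 4) {
  return vocab * hidden * bytesPerParam;
}

const VOCAB = 50257;                 // GPT-2 BPE vocabulary size
const MIB = 1024 * 1024;
const PAGES_LIMIT_BYTES = 25 * MIB;  // Cloudflare Pages per-file limit

const sizes = {
  "TinyStories-1M": embeddingBytes(VOCAB, 64),
  "TinyStories-8M": embeddingBytes(VOCAB, 256),
};

for (const [name, bytes] of Object.entries(sizes)) {
  const verdict = bytes < PAGES_LIMIT_BYTES ? "fits" : "exceeds the limit";
  console.log(`${name}: ${(bytes / 1e6).toFixed(1)} MB embedding, ${verdict}`);
}
// TinyStories-1M: 12.9 MB embedding, fits
// TinyStories-8M: 51.5 MB embedding, exceeds the limit
```

Because the FP32 embedding is the part that stays unquantized, it sets a floor on the exported file size regardless of how well the rest of the graph compresses.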
** Conventions
- ONNX model has ONLY ~input_ids~ (int64) as input; output: ~logits~ (float32 [1, seq, 50257])
- ~attention_mask~ is NOT an input to the exported ONNX graph (was patched away during tracing)
- Tokenizer: ~env.localModelPath = '/model/'; env.allowRemoteModels = false; AutoTokenizer.from_pretrained('TinyStories-1M')~
- Prompt must be story-style: ~`A man was asked "${text}". He smiled and said, "`~
- Do NOT use dialogue-format prompts like ~"you: X\nhim: "~; the model immediately predicts ~\n~, which yields empty output
- Stop at the first complete sentence: match ~/^(.*?[.!?])/s~ against the decoded text after stripping ~"~ (~replace(/"/g, '')~)
- Top-k=40, temperature=0.9, max 60 new tokens
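The prompt and stop conventions above might look like this in code. It is a sketch under one assumption that this note does not state: the stop rule is applied to the decoded continuation only (the newly generated tokens), not to the prompt plus continuation. The function names ~buildPrompt~ and ~firstSentence~ are hypothetical.

```javascript
// Wrap the input in the story-style frame from the conventions above.
function buildPrompt(text) {
  return `A man was asked "${text}". He smiled and said, "`;
}

// Strip double quotes, then keep everything up to the first '.', '!' or '?'.
// The /s flag lets '.' in the pattern match newlines too.
// Returns null while no complete sentence has been generated yet.
function firstSentence(decodedContinuation) {
  const cleaned = decodedContinuation.replace(/"/g, "");
  const match = cleaned.match(/^(.*?[.!?])/s);
  return match ? match[1] : null;
}

console.log(buildPrompt("Where is the cat?"));
console.log(firstSentence('I like cake!" The man was happy.'));
```

Returning ~null~ for an unfinished sentence lets a caller keep generating until either a sentence completes or the 60-token budget runs out.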
** Gotchas
*Nix build:*
- Do NOT use ~optimum~ — use ~torch.onnx.export~ directly with ~dynamo=False~
- Add ~onnxscript~ to nativeBuildInputs (PyTorch 2.12 lazily imports it even for legacy export)
- Patch ~create_causal_mask~ to ~lambda **kwargs: None~ before tracing (transformers 5.5.4 incompatibility)
- ~quantize_dynamic~ skips Gather (embedding) — the embedding dominates file size for large vocab models
- Tokenizer files (~tokenizer.json~, ~tokenizer_config.json~, ~special_tokens_map.json~) are identical
  between TinyStories-1M and 8M (same GPT-2 BPE tokenizer); only ~config.json~ and ~pytorch_model.bin~ differ
- nativeBuildInputs: ~transformers onnxruntime onnxscript torch~ (no optimum)
*JS runtime:*
- Load ~onnxruntime-web~ via ~<script>~ tag BEFORE the importmap/module scripts
- Only pass ~input_ids~ to ~session.run()~ — passing ~attention_mask~ throws INVALID_ARGUMENT
- Generation is slow (no KV cache): each step re-runs the full accumulated sequence, so per-step cost grows with output length
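A single forward step that respects the ~input_ids~-only rule could be written along these lines. This is a hedged sketch: ~lastTokenLogits~ is a hypothetical helper, ~ort~ and ~session~ are passed in as parameters rather than read from globals so the function can be exercised with stand-ins, and the output name ~logits~ and the [1, seq, 50257] layout are taken from the Conventions section of this note.

```javascript
// Run the model on the full sequence and return the logits of the last position.
// `ort` is the onnxruntime-web namespace, `session` an InferenceSession.
async function lastTokenLogits(ort, session, ids, vocabSize = 50257) {
  // int64 tensors are backed by BigInt64Array in onnxruntime-web.
  const data = BigInt64Array.from(ids, (id) => BigInt(id));
  const inputIds = new ort.Tensor("int64", data, [1, ids.length]);

  // Only input_ids is fed; the exported graph has no attention_mask input.
  const { logits } = await session.run({ input_ids: inputIds });

  // logits.data is flat [1, seq, vocab]; slice out the final row.
  const start = (ids.length - 1) * vocabSize;
  return logits.data.subarray(start, start + vocabSize);
}
```

In the browser, ~ort~ would be the global provided by the ~<script>~ tag and ~session~ the result of ~ort.InferenceSession.create(...)~; the model URL is not given in this note, so it is left out here.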
** Key Files
- ~flake.nix~ — tinyStoriesOnnx sub-derivation: fetchurl + torch.onnx.export + quantize_dynamic
- ~index.html~ — Two separate module scripts: Three.js scene + inference (ort global + AutoTokenizer)