backup: 2026-06-17 07:36

2026-06-17 07:36:46 +03:00
parent 78bcf45137
commit 4806acf57d
13 changed files with 544 additions and 66 deletions
@@ -0,0 +1,58 @@
+:PROPERTIES:
+:ID: 4b44cf43-6106-4498-81a3-b23ebb25dabf
+:END:
+
+#+title: trmn/llm
+#+filetags: :project: :knowledge: :llm:
+
+** Purpose
+NOT a user-facing chat feature. This is infrastructure for *automatic procedural dialogue* between
+characters in the show. The chat panel was a test harness only. Future use: generate spoken lines
+for characters as part of the deterministic life-simulation loop.
+
+** Architecture
+In-browser LLM; the weights are bundled into the site via ~nix build~. No backend, no external inference at runtime.
+
+Model: ~roneneldan/TinyStories-1M~ — GPT-Neo architecture, ~model_type: "gpt_neo"~,
+hidden=64, vocab=50257 (GPT-2 BPE tokenizer). Output is surreal/nonsensical children's story prose —
+intentionally acceptable for this project.
+
+Final ONNX INT8 size: *15 MB* (well under Cloudflare Pages 25 MiB per-file limit).
+
+Runtime stack:
+- ~onnxruntime-web~ loaded as a global ~<script>~ tag (not ESM importmap)
+- ~@huggingface/transformers~ AutoTokenizer only (no pipeline) — tokenizer files served from ~/model/TinyStories-1M/~
+- Custom top-k generation loop in JS; no KV cache (full sequence re-processed each step)
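The "custom top-k generation loop" mentioned in the runtime stack can be illustrated with a minimal sampling sketch. This is not the project's actual code: the function name `sampleTopK` and the injectable `rng` parameter are assumptions made for illustration. It keeps the k highest logits, applies temperature, and draws one token id from the resulting softmax distribution.

```javascript
// Hypothetical helper: sample one token id from the k highest logits.
// `logits` is the Float32Array of vocabulary scores for the last position.
function sampleTopK(logits, k = 40, temperature = 0.9, rng = Math.random) {
  // Rank all vocabulary indices by logit, highest first, and keep the top k.
  // A full sort is simple rather than fast; fine for a sketch.
  const top = Array.from(logits, (_, i) => i)
    .sort((a, b) => logits[b] - logits[a])
    .slice(0, k);

  // Temperature-scaled softmax weights, shifted by the max for stability.
  const maxLogit = logits[top[0]];
  const weights = top.map((i) => Math.exp((logits[i] - maxLogit) / temperature));
  const total = weights.reduce((sum, w) => sum + w, 0);

  // Walk the cumulative distribution until the random draw is used up.
  let r = rng() * total;
  for (let j = 0; j < top.length; j++) {
    r -= weights[j];
    if (r <= 0) return top[j];
  }
  return top[top.length - 1]; // guard against floating-point leftovers
}
```

Passing a fixed `rng` makes a draw reproducible, which may matter if the dialogue has to stay in step with a deterministic simulation loop.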
+
+** Deployment constraint
+Cloudflare Pages has a *25 MiB per-file limit*. This is why TinyStories-1M (15 MB) was chosen over 8M (40 MB).
+TinyStories-8M's embedding alone (50257 vocab × hidden=256 × 4 bytes FP32) is about 51 MB, and ~quantize_dynamic~
+can't shrink it because the lookup is a Gather op. TinyStories-1M's hidden=64 makes the embedding 4× smaller.
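The size argument above is plain arithmetic, so it can be checked with a short sketch. The helper name `embeddingBytes` is made up for illustration; the vocabulary and hidden sizes are the ones quoted in these notes, and 4 bytes per parameter assumes an unquantized FP32 embedding table.

```javascript
// Hypothetical helper: size in bytes of a [vocab, hidden] embedding table.
function embeddingBytes(vocabSize, hiddenSize, bytesPerParam = 4) {
  return vocabSize * hiddenSize * bytesPerParam;
}

const MIB = 1024 * 1024;
const VOCAB = 50257; // GPT-2 BPE vocabulary size

const size8M = embeddingBytes(VOCAB, 256); // 51,463,168 bytes
const size1M = embeddingBytes(VOCAB, 64);  // 12,865,792 bytes

console.log((size8M / MIB).toFixed(2), "MiB for hidden=256");
console.log((size1M / MIB).toFixed(2), "MiB for hidden=64");
console.log("ratio:", size8M / size1M);
```

At hidden=256 the table alone is roughly 49 MiB, well past the 25 MiB limit, while at hidden=64 it is roughly 12 MiB.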
+
+** Conventions
+- ONNX model has ONLY ~input_ids~ (int64) as input; output: ~logits~ (float32 [1, seq, 50257])
+- ~attention_mask~ is NOT an input to the exported ONNX graph (was patched away during tracing)
+- Tokenizer: ~env.localModelPath = '/model/'; env.allowRemoteModels = false; AutoTokenizer.from_pretrained('TinyStories-1M')~
+- Prompt must be story-style: ~`A man was asked "${text}". He smiled and said, "`~
+  - Do NOT use dialogue-format prompts like ~"you: X\nhim: "~: the model immediately predicts ~\n~, giving empty output
+- Stop at the first complete sentence: match ~/^(.*?[.!?])/s~ against the decoded text after stripping ~"~ (~replace(/"/g, '')~)
+- Top-k=40, temperature=0.9, max 60 new tokens
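The prompt and stop conventions above can be sketched as two small helpers. The names `buildPrompt` and `firstSentence` are hypothetical, and the sketch assumes the stop rule is applied to the decoded continuation only (the text generated after the prompt), which the notes do not state explicitly.

```javascript
// Hypothetical helper: wrap the input in the story-style prompt.
function buildPrompt(text) {
  return `A man was asked "${text}". He smiled and said, "`;
}

// Hypothetical helper: return the first complete sentence of the decoded
// continuation, or null if no sentence terminator has been generated yet.
function firstSentence(decoded) {
  const cleaned = decoded.replace(/"/g, ''); // strip double quotes first
  const match = cleaned.match(/^(.*?[.!?])/s); // lazy match up to first . ! or ?
  return match ? match[1] : null;
}
```

A generation loop could call `firstSentence` after each new token and stop as soon as it returns a non-null value, or once the 60-token budget is exhausted.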
+
+** Gotchas
+*Nix build:*
+- Do NOT use ~optimum~ — use ~torch.onnx.export~ directly with ~dynamo=False~
+- Add ~onnxscript~ to nativeBuildInputs (PyTorch 2.12 lazily imports it even for legacy export)
+- Patch ~create_causal_mask~ to ~lambda **kwargs: None~ before tracing (transformers 5.5.4 incompatibility)
+- ~quantize_dynamic~ skips Gather (embedding) — the embedding dominates file size for large vocab models
+- Tokenizer files (~tokenizer.json~, ~tokenizer_config.json~, ~special_tokens_map.json~) are identical
+  between TinyStories-1M and 8M — same GPT-2 BPE tokenizer; only ~config.json~ and ~pytorch_model.bin~ differ
+- nativeBuildInputs: ~transformers onnxruntime onnxscript torch~ (no optimum)
+
+*JS runtime:*
+- Load ~onnxruntime-web~ via ~<script>~ tag BEFORE the importmap/module scripts
+- Only pass ~input_ids~ to ~session.run()~ — passing ~attention_mask~ throws INVALID_ARGUMENT
+- Generation is slow (no KV cache): each step re-runs the full accumulated sequence
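A single decoding step under these constraints might look like the sketch below. The function names are hypothetical, and `ort` and `session` are passed in as parameters only to keep the sketch self-contained; in the page itself `ort` would be the global from the `<script>` tag and `session` an already created `ort.InferenceSession`. Only `input_ids` is fed, and the whole sequence is re-sent every step because there is no KV cache.

```javascript
// Hypothetical helper: slice the final position's scores out of a flat
// logits buffer laid out as [1, seqLen, vocabSize].
function lastPositionLogits(flatLogits, seqLen, vocabSize) {
  const start = (seqLen - 1) * vocabSize;
  return flatLogits.subarray(start, start + vocabSize);
}

// Hypothetical helper: run the model once on the full token sequence and
// return the logits for the next token.
async function stepLogits(ort, session, ids) {
  // int64 tensors in onnxruntime-web are backed by BigInt64Array.
  const data = BigInt64Array.from(ids, (id) => BigInt(id));
  const inputIds = new ort.Tensor('int64', data, [1, ids.length]);

  // Feed input_ids only; the exported graph has no attention_mask input.
  const { logits } = await session.run({ input_ids: inputIds });

  const [, seqLen, vocabSize] = logits.dims;
  return lastPositionLogits(logits.data, seqLen, vocabSize);
}
```

Each call processes every token generated so far, so the per-step cost grows with the sequence length; with a 60-token cap this stays tolerable for a model this small.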
+
+** Key Files
+- ~flake.nix~ — tinyStoriesOnnx sub-derivation: fetchurl + torch.onnx.export + quantize_dynamic
+- ~index.html~ — Two separate module scripts: Three.js scene + inference (ort global + AutoTokenizer)