backup: 2026-06-17 07:36
:PROPERTIES:
:ID: 4b44cf43-6106-4498-81a3-b23ebb25dabf
:END:

#+title: trmn/llm
#+filetags: :project: :knowledge: :llm:

** Purpose
NOT a user-facing chat feature. This is infrastructure for *automatic procedural dialogue* between
characters in the show. The chat panel was a test harness only. Future use: generate spoken lines
for characters as part of the deterministic life-simulation loop.

** Architecture
In-browser LLM weights bundled into the site via ~nix build~. No backend, no external inference at runtime.

Model: ~roneneldan/TinyStories-1M~ (GPT-Neo architecture, ~model_type: "gpt_neo"~,
hidden=64, vocab=50257, GPT-2 BPE tokenizer). Output is surreal, nonsensical children's-story prose,
which is intentionally acceptable for this project.

Final ONNX INT8 size: *15 MB* (well under the Cloudflare Pages 25 MiB per-file limit).

Runtime stack:
- ~onnxruntime-web~ loaded as a global ~<script>~ tag (not ESM importmap)
- ~@huggingface/transformers~ AutoTokenizer only (no pipeline); tokenizer files are served from ~/model/TinyStories-1M/~
- Custom top-k generation loop in JS; no KV cache (full sequence re-processed each step)

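A minimal sketch of the top-k sampling step that such a loop needs. This is not the code in ~index.html~: the name ~sampleTopK~ and the injectable ~rng~ parameter are assumptions for illustration, and the defaults mirror the top-k and temperature values listed under Conventions.

```javascript
// Hypothetical helper: pick one token id from the logits of the last position.
// `logits` is an array-like of length vocabSize (e.g. a Float32Array).
// `rng` returns a number in [0, 1); it is injectable so tests can be deterministic.
function sampleTopK(logits, k = 40, temperature = 0.9, rng = Math.random) {
  // Sort token ids by descending logit and keep the k best.
  const order = Array.from({ length: logits.length }, (_, i) => i);
  order.sort((a, b) => logits[b] - logits[a]);
  const top = order.slice(0, Math.min(k, order.length));

  // Temperature-scaled softmax weights over the kept ids.
  // Subtracting the max logit keeps Math.exp from overflowing.
  const maxLogit = logits[top[0]];
  const weights = top.map((i) => Math.exp((logits[i] - maxLogit) / temperature));
  const total = weights.reduce((sum, w) => sum + w, 0);

  // Inverse-CDF sampling: walk the weights until the random mass is used up.
  let r = rng() * total;
  for (let j = 0; j < top.length; j++) {
    r -= weights[j];
    if (r < 0) return top[j];
  }
  return top[top.length - 1]; // guard against floating-point round-off
}
```

Sorting the full 50257-entry vocabulary on every step is simple but not fast; a partial selection would be cheaper and is left out to keep the sketch short.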
** Deployment constraint
Cloudflare Pages has a *25 MiB per-file limit*. This is why TinyStories-1M (15 MB) was chosen over 8M (40 MB).
TinyStories-8M's embedding alone (50257 vocab × hidden=256 × 4-byte FP32) is about 51.5 MB, and ~quantize_dynamic~
cannot shrink it because the lookup is a Gather op, which it skips. TinyStories-1M's hidden=64 makes the
embedding 4× smaller (about 12.9 MB).

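The size argument above can be checked with a few lines of arithmetic. This sketch counts only the FP32 token-embedding matrix (vocab × hidden × 4 bytes) and ignores every other tensor in the file; ~embeddingBytes~ and the constant names are illustrative, not taken from the repo.

```javascript
const VOCAB_SIZE = 50257;                   // GPT-2 BPE vocabulary size
const BYTES_PER_FP32 = 4;
const PAGES_LIMIT_BYTES = 25 * 1024 * 1024; // 25 MiB per-file limit

// Bytes used by an unquantized (FP32) embedding table of shape [vocab, hidden].
function embeddingBytes(hidden, vocab = VOCAB_SIZE) {
  return vocab * hidden * BYTES_PER_FP32;
}

const bytes8M = embeddingBytes(256); // 51463168 bytes
const bytes1M = embeddingBytes(64);  // 12865792 bytes

console.log("8M embedding:", (bytes8M / 1e6).toFixed(1), "MB, fits:", bytes8M < PAGES_LIMIT_BYTES);
console.log("1M embedding:", (bytes1M / 1e6).toFixed(1), "MB, fits:", bytes1M < PAGES_LIMIT_BYTES);
```

Because the embedding stays in FP32 after dynamic quantization, it acts as a floor on the exported file size: the hidden=256 table alone is already over the limit, while the hidden=64 table leaves room for the rest of the model.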
** Conventions
- The ONNX model takes ONLY ~input_ids~ (int64) as input; output: ~logits~ (float32 [1, seq, 50257])
- ~attention_mask~ is NOT an input to the exported ONNX graph (it was patched away during tracing)
- Tokenizer: ~env.localModelPath = '/model/'; env.allowRemoteModels = false; AutoTokenizer.from_pretrained('TinyStories-1M')~
- Prompt must be story-style: ~`A man was asked "${text}". He smiled and said, "`~
- Do NOT use dialogue-format prompts like ~"you: X\nhim: "~; the model immediately predicts ~\n~, giving empty output
- Stop at the first complete sentence: match ~/^(.*?[.!?])/s~ against the decoded text after stripping ~"~ (~replace(/"/g, '')~)
- Top-k=40, temperature=0.9, max 60 new tokens

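The prompt and stop conventions above can be sketched as two small helpers. The names ~buildPrompt~ and ~firstSentence~ are made up for this example, and it assumes the regex is applied to the newly generated text only; the notes do not say whether the prompt is included in the decoded string.

```javascript
// Wrap the incoming line in the story-style frame the model expects.
function buildPrompt(text) {
  return `A man was asked "${text}". He smiled and said, "`;
}

// Return the first complete sentence of the decoded continuation,
// or null if no sentence terminator has been generated yet.
function firstSentence(decoded) {
  const cleaned = decoded.replace(/"/g, "");   // strip double quotes first
  const match = cleaned.match(/^(.*?[.!?])/s); // `s` lets `.` span newlines
  return match ? match[1] : null;
}

console.log(buildPrompt("Where is the cat?"));
console.log(firstSentence('I like cake! It is good." Then'));
```

Returning ~null~ while no terminator has appeared lets a caller keep generating until either a sentence completes or the 60-token budget runs out.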
** Gotchas
*Nix build:*
- Do NOT use ~optimum~; use ~torch.onnx.export~ directly with ~dynamo=False~
- Add ~onnxscript~ to nativeBuildInputs (PyTorch 2.12 lazily imports it even for legacy export)
- Patch ~create_causal_mask~ to ~lambda **kwargs: None~ before tracing (transformers 5.5.4 incompatibility)
- ~quantize_dynamic~ skips Gather (embedding), so the embedding dominates file size for large-vocab models
- Tokenizer files (~tokenizer.json~, ~tokenizer_config.json~, ~special_tokens_map.json~) are identical
  between TinyStories-1M and 8M (same GPT-2 BPE tokenizer); only ~config.json~ and ~pytorch_model.bin~ differ
- nativeBuildInputs: ~transformers onnxruntime onnxscript torch~ (no optimum)

*JS runtime:*
- Load ~onnxruntime-web~ via ~<script>~ tag BEFORE the importmap/module scripts
- Only pass ~input_ids~ to ~session.run()~; passing ~attention_mask~ throws INVALID_ARGUMENT
- Generation is slow (no KV cache): each step re-runs the full accumulated sequence

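A rough sketch of a generation loop with no KV cache, under the constraints listed above. It is not the code in ~index.html~: ~generate~, ~makeOrtForward~, ~pickToken~ and ~shouldStop~ are hypothetical names, and ~makeOrtForward~ assumes the global ~ort~ from the ~<script>~ tag plus an already created inference session.

```javascript
// Hypothetical adapter around an onnxruntime-web session. It feeds ONLY
// input_ids and returns the logits of the last position.
function makeOrtForward(session, vocabSize = 50257) {
  return async (ids) => {
    // int64 tensors need BigInt data in onnxruntime-web.
    const data = BigInt64Array.from(ids, (id) => BigInt(id));
    const input_ids = new ort.Tensor("int64", data, [1, ids.length]);
    const { logits } = await session.run({ input_ids }); // no attention_mask
    // logits is flat float32 data for shape [1, seq, vocab]; slice the last row.
    const start = (ids.length - 1) * vocabSize;
    return logits.data.subarray(start, start + vocabSize);
  };
}

// Token-by-token loop. `forward(ids)` resolves to last-position logits,
// `pickToken(logits)` returns a token id, `shouldStop(generated)` ends early.
async function generate(forward, promptIds, pickToken,
                        maxNewTokens = 60, shouldStop = () => false) {
  const ids = promptIds.slice();
  const generated = [];
  for (let step = 0; step < maxNewTokens; step++) {
    // No KV cache: the whole accumulated sequence is re-run on every step.
    const logits = await forward(ids);
    const next = pickToken(logits);
    ids.push(next);
    generated.push(next);
    if (shouldStop(generated)) break;
  }
  return generated;
}
```

Keeping the model call behind ~forward~ means the loop can be exercised with a stub, without loading the ONNX file. The cost is visible in the structure: step n processes the prompt plus n tokens, so total work grows quadratically with output length, which is tolerable at 60 new tokens on a model this small.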
** Key Files
- ~flake.nix~ :: tinyStoriesOnnx sub-derivation: fetchurl + torch.onnx.export + quantize_dynamic
- ~index.html~ :: Two separate module scripts: Three.js scene + inference (ort global + AutoTokenizer)