2026-06-17 07:36:46 +03:00

trmn/llm

Purpose

NOT a user-facing chat feature. This is infrastructure for automatic procedural dialogue between characters in the show. The chat panel was a test harness only. Future use: generate spoken lines for characters as part of the deterministic life-simulation loop.

Architecture

In-browser LLM; the weights are bundled into the site by the Nix build. No backend, no external inference at runtime.

Model: roneneldan/TinyStories-1M — GPT-Neo architecture, model_type: "gpt_neo", hidden=64, vocab=50257 (GPT-2 BPE tokenizer). Output is surreal/nonsensical children's story prose — intentionally acceptable for this project.

Final ONNX INT8 size: 15 MB (well under Cloudflare Pages 25 MiB per-file limit).

Runtime stack:

  • onnxruntime-web loaded as a global <script> tag (not ESM importmap)
  • @huggingface/transformers AutoTokenizer only (no pipeline) — tokenizer files served from /model/TinyStories-1M/
  • Custom top-k generation loop in JS; no KV cache (full sequence re-processed each step)
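The top-k step of the custom generation loop could look roughly like the sketch below. The notes only fix top-k=40 and temperature=0.9; the function name `sampleTopK`, the injectable `rand` parameter, and the exact softmax and inverse-CDF details are assumptions about a conventional implementation, not the code in index.html.

```javascript
// Sample one token id from a single row of logits (length = vocab size).
// Hypothetical helper: keeps the k largest logits, applies temperature,
// then draws from the resulting softmax distribution.
function sampleTopK(logits, k = 40, temperature = 0.9, rand = Math.random) {
  // Indices of the k largest logits, best first.
  const top = Array.from(logits.keys())
    .sort((a, b) => logits[b] - logits[a])
    .slice(0, k);

  // Unnormalised softmax weights; subtracting the max keeps exp() stable.
  const maxLogit = logits[top[0]];
  const weights = top.map((i) => Math.exp((logits[i] - maxLogit) / temperature));
  const total = weights.reduce((sum, w) => sum + w, 0);

  // Inverse-CDF draw over the k surviving tokens.
  let r = rand() * total;
  for (let j = 0; j < top.length; j++) {
    r -= weights[j];
    if (r <= 0) return top[j];
  }
  return top[top.length - 1]; // guard against floating-point leftovers
}
```

Sorting the whole vocabulary on every step is simple rather than fast; with a 50257-entry vocabulary and no KV cache, the model forward pass is still likely to dominate the step time.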

Deployment constraint

Cloudflare Pages has a 25 MiB per-file limit. This is why TinyStories-1M (15 MB) was chosen over 8M (40 MB). TinyStories-8M's embedding table alone (50257 vocab × hidden=256 × 4 bytes FP32) is about 51.5 MB, and quantize_dynamic can't shrink it because the embedding lookup is a Gather op, which it skips. TinyStories-1M's hidden=64 makes the embedding 4× smaller (about 12.9 MB).
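The size argument above is plain arithmetic and can be checked with a few lines. This is a sketch only: `embeddingBytes` and the constant names are hypothetical, and it counts just the FP32 token-embedding table, not position embeddings or the rest of the graph.

```javascript
const VOCAB = 50257;                        // GPT-2 BPE vocabulary size
const BYTES_FP32 = 4;                       // embedding stays FP32 (Gather is not quantized)
const MIB = 1024 * 1024;
const PAGES_LIMIT_BYTES = 25 * MIB;         // Cloudflare Pages per-file limit

// Size in bytes of an FP32 token-embedding table of shape [vocab, hidden].
function embeddingBytes(hidden, vocab = VOCAB) {
  return vocab * hidden * BYTES_FP32;
}

for (const [name, hidden] of [['TinyStories-1M', 64], ['TinyStories-8M', 256]]) {
  const bytes = embeddingBytes(hidden);
  console.log(
    `${name}: ${(bytes / 1e6).toFixed(1)} MB (${(bytes / MIB).toFixed(1)} MiB), ` +
    `embedding alone fits under limit: ${bytes <= PAGES_LIMIT_BYTES}`
  );
}
```

With these numbers the 8M embedding table alone is 51,463,168 bytes, roughly twice the per-file limit, while the 1M table is 12,865,792 bytes.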

Conventions

  • ONNX model has ONLY input_ids (int64) as input; output: logits (float32 [1, seq, 50257])
  • attention_mask is NOT an input to the exported ONNX graph (was patched away during tracing)
  • Tokenizer: env.localModelPath = '/model/'; env.allowRemoteModels = false; AutoTokenizer.from_pretrained('TinyStories-1M')
  • Prompt must be story-style: `A man was asked "${text}". He smiled and said, "`

    • Do NOT use dialogue-format prompts like "you: X\nhim: ": the model immediately predicts \n, so the output is empty
  • Stop: cut at the first complete sentence, i.e. match /^(.*?[.!?])/s against the decoded text after stripping " characters (replace(/"/g, ''))
  • Top-k=40, temperature=0.9, max 60 new tokens
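The prompt and stop conventions above can be sketched as two small helpers. The names `buildPrompt` and `firstSentence` are hypothetical, and the sketch assumes the regex is applied to the newly generated continuation only; run against prompt plus continuation it would match inside the prompt itself.

```javascript
// Wrap the input in the story-style frame the model responds to.
function buildPrompt(text) {
  return `A man was asked "${text}". He smiled and said, "`;
}

// Return the first complete sentence of the decoded continuation,
// or null if no sentence terminator has been generated yet.
function firstSentence(decoded) {
  const cleaned = decoded.replace(/"/g, '');      // strip quote characters
  const match = cleaned.match(/^(.*?[.!?])/s);    // lazy match up to first . ! or ?
  return match ? match[1] : null;
}
```

A caller would decode the new tokens after each step and stop as soon as `firstSentence` returns a non-null value, or when the 60-token budget runs out.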

Gotchas

Nix build:

  • Do NOT use optimum — use torch.onnx.export directly with dynamo=False
  • Add onnxscript to nativeBuildInputs (PyTorch 2.12 lazily imports it even for legacy export)
  • Patch create_causal_mask to lambda **kwargs: None before tracing (transformers 5.5.4 incompatibility)
  • quantize_dynamic skips Gather (embedding) — the embedding dominates file size for large vocab models
  • Tokenizer files (tokenizer.json, tokenizer_config.json, special_tokens_map.json) are identical between TinyStories-1M and 8M — same GPT-2 BPE tokenizer; only config.json and pytorch_model.bin differ
  • nativeBuildInputs: transformers onnxruntime onnxscript torch (no optimum)

JS runtime:

  • Load onnxruntime-web via <script> tag BEFORE the importmap/module scripts
  • Only pass input_ids to session.run() — passing attention_mask throws INVALID_ARGUMENT
  • Generation is slow (no KV cache): each step re-runs the full accumulated sequence
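The two runtime points above (feed only input_ids, re-run the whole sequence each step) can be sketched as a loop like the following. It is an illustration under assumptions, not the code in index.html: `generate`, `pickNext`, and `shouldStop` are hypothetical names, and `ort` is passed as a parameter (rather than read from the global) only so the loop can be exercised with a stub.

```javascript
// Autoregressive loop without a KV cache.
// ort:       the onnxruntime-web namespace (provides ort.Tensor)
// session:   an InferenceSession for the exported model
// promptIds: array of token ids (plain numbers) from the tokenizer
// pickNext:  maps the last position's logits row to a token id (e.g. a top-k sampler)
// shouldStop: optional predicate over the ids generated so far
async function generate(ort, session, promptIds, pickNext,
                        maxNewTokens = 60, shouldStop = () => false) {
  const ids = [...promptIds];
  for (let step = 0; step < maxNewTokens; step++) {
    // The full accumulated sequence is rebuilt and re-run on every step.
    const input = new ort.Tensor(
      'int64', BigInt64Array.from(ids.map(BigInt)), [1, ids.length]);

    // input_ids is the graph's only input; no attention_mask is passed.
    const { logits } = await session.run({ input_ids: input });

    // logits has shape [1, seq, vocab]; only the last position matters.
    const vocab = logits.dims[2];
    const last = logits.data.subarray((ids.length - 1) * vocab, ids.length * vocab);

    ids.push(pickNext(last));
    if (shouldStop(ids.slice(promptIds.length))) break;
  }
  return ids.slice(promptIds.length); // newly generated ids only
}
```

Because step n processes the prompt plus n earlier tokens, total work grows roughly quadratically with output length, which is why the 60-token cap and the early sentence stop matter here.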

Key Files

  • flake.nix — tinyStoriesOnnx sub-derivation: fetchurl + torch.onnx.export + quantize_dynamic
  • index.html — Two separate module scripts: Three.js scene + inference (ort global + AutoTokenizer)