In this tutorial, we explore the implementation of OpenMythos, a theoretical reconstruction of the Claude Mythos architecture that enables deeper reasoning through iterative computation rather than increased parameter count. We build and analyze models using both GQA and MLA attention mechanisms, examine memory efficiency through KV-cache comparisons, and validate stability via the spectral properties of the recurrent update. We then train the model on a structured parity task and study how increasing loop depth at inference improves performance without retraining. Along the way, we also explore adaptive computation via ACT halting and monitor expert utilization in the MoE layers, providing a comprehensive, hands-on understanding of this emerging architecture.
import subprocess, sys
try:
    import open_mythos  # noqa: F401
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
                           "open-mythos"])

import math, time, copy
from collections import Counter, defaultdict
import numpy as np
import torch, torch.nn as nn, torch.nn.functional as F
import matplotlib.pyplot as plt

from open_mythos.main import (
    OpenMythos, MythosConfig,
    ACTHalting, MoEFFN,
)

torch.manual_seed(0); np.random.seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"▸ device = {device} | torch = {torch.__version__}")
def make_config(attn_type: str, *, dim=128, n_heads=4, n_experts=4,
                max_loops=8, seq_len=128, vocab=256):
    base = dict(
        vocab_size=vocab, dim=dim, n_heads=n_heads,
        max_seq_len=seq_len, max_loop_iters=max_loops,
        prelude_layers=1, coda_layers=1,
        n_experts=n_experts, n_shared_experts=1,
        n_experts_per_tok=2, expert_dim=dim // 2,
        lora_rank=8, attn_type=attn_type,
    )
    if attn_type == "gqa":
        return MythosConfig(**base, n_kv_heads=2)
    return MythosConfig(
        **base, n_kv_heads=n_heads,
        kv_lora_rank=32, q_lora_rank=64,
        qk_rope_head_dim=16, qk_nope_head_dim=16, v_head_dim=16,
    )

cfg_gqa = make_config("gqa")
cfg_mla = make_config("mla")
m_gqa = OpenMythos(cfg_gqa).to(device)
m_mla = OpenMythos(cfg_mla).to(device)

print("\n─── Part 1 ─ model sizes ──────────────────────────────")
print(f"GQA params : {sum(p.numel() for p in m_gqa.parameters()):>10,}")
print(f"MLA params : {sum(p.numel() for p in m_mla.parameters()):>10,}")
We install and import all required dependencies and initialize the environment for running OpenMythos. We construct configurations for both GQA and MLA attention mechanisms and instantiate their respective models. We also compare their parameter counts to understand how architectural differences affect model scale.
def cache_bytes(kv: dict) -> int:
    total = 0
    for entry in kv.values():
        for t in entry.values():
            total += t.element_size() * t.numel()
    return total

x = torch.randint(0, 256, (1, 64), device=device)
ck_gqa, ck_mla = {}, {}
with torch.no_grad():
    m_gqa(x, n_loops=4, kv_cache=ck_gqa)
    m_mla(x, n_loops=4, kv_cache=ck_mla)

gqa_kb = cache_bytes(ck_gqa) / 1024
mla_kb = cache_bytes(ck_mla) / 1024
print("\n─── Part 2 ─ KV-cache footprint (1×64 tokens, 4 loops) ─")
print(f"GQA cache : {gqa_kb:6.2f} KB ({len(ck_gqa)} layer-keys)")
print(f"MLA cache : {mla_kb:6.2f} KB ({len(ck_mla)} layer-keys)")
print(f"ratio     : MLA is ≈{gqa_kb / max(mla_kb, 1e-9):.2f}× smaller")
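To see where the savings come from, here is a back-of-the-envelope estimate of per-token, per-layer cache size. The formulas are assumptions about the usual cache layouts (GQA stores K and V per kv-head; MLA stores one compressed latent of width `kv_lora_rank` plus a shared RoPE key), not something read from the library:

```python
# Rough per-token KV-cache estimate for one layer at fp32.
BYTES = 4                       # fp32
dim, n_heads = 128, 4
head_dim = dim // n_heads       # 32, matching the config above

# GQA: 2 kv-heads, each caching a K and a V vector of size head_dim
gqa_per_token = 2 * 2 * head_dim * BYTES

# MLA: one compressed KV latent plus one shared RoPE key per token
kv_lora_rank, qk_rope_head_dim = 32, 16
mla_per_token = (kv_lora_rank + qk_rope_head_dim) * BYTES

print(f"GQA : {gqa_per_token} B/token")   # 512 B/token
print(f"MLA : {mla_per_token} B/token")   # 192 B/token
print(f"MLA ≈ {gqa_per_token / mla_per_token:.1f}× smaller")
```

Under these assumptions the compression ratio depends only on the config, not the sequence length, which is why MLA's advantage grows linearly in total bytes as the context gets longer.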
def show_stability(model, tag):
    A = model.recurrent.injection.get_A()
    print(f"{tag:3s} ρ(A): min={A.min():.4f} max={A.max():.4f} "
          f"mean={A.mean():.4f} stable={bool((A < 1).all() and (A > 0).all())}")

print("\n─── Part 3 ─ spectral radius at init ──────────────────")
show_stability(m_gqa, "GQA")
show_stability(m_mla, "MLA")

opt = torch.optim.Adam(m_mla.parameters(), lr=1.0)
for _ in range(30):
    loss = m_mla(torch.randint(0, 256, (2, 16), device=device),
                 n_loops=2).square().mean()
    opt.zero_grad(); loss.backward(); opt.step()
show_stability(m_mla, "MLA after abusive training (lr=1.0, 30 steps)")
We compute and compare the KV-cache memory footprint for both GQA and MLA attention types across forward passes. We then check the stability of the recurrent component by analyzing the spectral radius of the matrix A. We further stress-test the model under extreme training conditions to confirm that stability is preserved.
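Why does stability survive even abusive optimization? A plausible mechanism (our assumption about how `get_A()` might be parameterized, not the library's confirmed implementation) is that the diagonal of A is produced by a sigmoid, so every entry stays strictly inside (0, 1) no matter how large the raw weights grow:

```python
import math, random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Raw weights w take large, unconstrained random steps, standing in for
# aggressive gradient updates; A = sigmoid(w) nevertheless stays bounded
# in (0, 1), so the recurrent update remains contractive.
w = [0.0] * 8
for _ in range(50):
    w = [wi + random.uniform(-0.5, 0.5) for wi in w]
A = [sigmoid(wi) for wi in w]
print(all(0.0 < a < 1.0 for a in A))  # → True
```

The same reasoning explains the `stable=True` flag printed by `show_stability`: the check `(A < 1).all() and (A > 0).all()` can only fail if the parameterization itself allows A outside the unit interval.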
VOCAB = 64
SEQ_LEN = 24

def make_batch(batch=64, seq_len=SEQ_LEN):
    x = torch.randint(1, 3, (batch, seq_len), device=device)
    bits = x - 1
    parity = bits.cumsum(dim=1) % 2
    y = parity + 1
    return x, y

cfg = MythosConfig(
    vocab_size=VOCAB, dim=64, n_heads=4, n_kv_heads=2,
    max_seq_len=SEQ_LEN + 4, max_loop_iters=16,
    prelude_layers=1, coda_layers=1,
    n_experts=4, n_shared_experts=1, n_experts_per_tok=2,
    expert_dim=32, lora_rank=4, attn_type="gqa",
    act_threshold=0.99,
)
model = OpenMythos(cfg).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
T_TRAIN = 3

print("\n─── Part 5 ─ training (T_train = 3) ───────────────────")
print(f"params: {sum(p.numel() for p in model.parameters()):,}")
losses = []
t0 = time.time()
for step in range(600):
    x, y = make_batch(64)
    logits = model(x, n_loops=T_TRAIN)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), y.reshape(-1))
    opt.zero_grad(); loss.backward()
    opt.step()
    losses.append(loss.item())
    if step % 100 == 0 or step == 599:
        with torch.no_grad():
            acc = (logits.argmax(-1) == y).float().mean().item()
        print(f"step {step:3d} loss={loss.item():.4f} acc@T3={acc:.3f}")
print(f"training wallclock: {time.time() - t0:.1f}s")
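The label construction is easy to get wrong, so here is a pure-Python sanity check of what `make_batch` computes: tokens are 1/2, `bits = token - 1`, and the target at position i is the parity of `bits[0..i]` mapped back into the 1/2 token range:

```python
# Reference implementation of the cumulative-parity targets, using an
# XOR accumulator instead of cumsum-mod-2 (the two are equivalent).
def parity_targets(tokens):
    running, out = 0, []
    for t in tokens:
        running ^= (t - 1)       # XOR-accumulate the bit
        out.append(running + 1)  # map parity {0,1} back to tokens {1,2}
    return out

print(parity_targets([1, 2, 1, 1, 2, 2, 1, 2]))
# → [1, 2, 2, 2, 1, 2, 2, 1]
```

This is the same 8-token pattern reused as the generation prompt later in the tutorial, so the expected continuation there can be checked against this function.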
We define a cumulative parity task to train our model on a structured sequential problem. We initialize the OpenMythos model with a fixed loop depth and train it using cross-entropy loss. Throughout training, we track loss and accuracy to evaluate how well the model learns under constrained depth.
model.eval()
T_sweep = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]
accs = []
with torch.no_grad():
    x_eval, y_eval = make_batch(512)
    for T in T_sweep:
        logits = model(x_eval, n_loops=T)
        accs.append((logits.argmax(-1) == y_eval).float().mean().item())

print("\n─── Part 6 ─ depth extrapolation (T_train=3) ──────────")
for T, a in zip(T_sweep, accs):
    bar = "█" * int(a * 40)
    marker = "  ← trained here" if T == T_TRAIN else ""
    print(f"T={T:2d} acc={a:.3f} {bar}{marker}")
halt_trace: list[torch.Tensor] = []
orig_halt = model.recurrent.act.forward

def halt_hook(self, h):
    p = orig_halt(h)
    halt_trace.append(p.detach().cpu())
    return p

model.recurrent.act.forward = halt_hook.__get__(model.recurrent.act, ACTHalting)
with torch.no_grad():
    x_h, _ = make_batch(1)
    _ = model(x_h, n_loops=16)
model.recurrent.act.forward = orig_halt

halts = torch.stack(halt_trace, dim=0)[:, 0].numpy()
print(f"\n─── Part 7 ─ ACT halting matrix (loops × positions) ───")
print(f"shape: {halts.shape} | "
      f"mean halt-prob per loop: "
      f"{', '.join(f'{v:.2f}' for v in halts.mean(1))}")
We evaluate the trained model by varying the number of inference loops to test depth extrapolation. We observe how increasing loop depth improves accuracy without retraining the model. We also instrument the ACT mechanism to capture halting probabilities at each sequence position and iteration.
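For readers unfamiliar with ACT, the halting probabilities we just captured are typically combined like this: loop until the cumulative halt probability crosses the threshold, then give the final step the remaining mass so the per-step weights sum to one. This is an illustrative sketch of that standard rule, not OpenMythos's exact implementation:

```python
# ACT-style halting sketch: accumulate per-loop halt probabilities until
# the threshold (cfg.act_threshold = 0.99 above) is crossed; the step
# that crosses it receives the remainder 1 - cum so weights sum to 1.
def act_weights(halt_probs, threshold=0.99):
    weights, cum = [], 0.0
    for p in halt_probs:
        if cum + p >= threshold:       # this loop crosses the threshold
            weights.append(1.0 - cum)  # remainder goes to the last step
            return weights
        weights.append(p)
        cum += p
    weights.append(1.0 - cum)          # loop budget exhausted: remainder
    return weights

w = act_weights([0.2, 0.3, 0.4, 0.5])
print(w, sum(w))  # four weights summing to 1.0 (up to float error)
```

A position with a confident early halt probability therefore contributes almost nothing in later loops, which is exactly the pattern to look for in the halting heatmap plotted at the end.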
expert_hits = Counter()
orig_moe = model.recurrent.block.ffn.forward

def moe_hook(self, x):
    flat = x.view(-1, x.shape[-1])
    logits = self.router(flat) + self.router_bias
    scores = F.softmax(logits, dim=-1)
    _, idx = scores.topk(self.topk, dim=-1)
    for e in idx.flatten().tolist():
        expert_hits[e] += 1
    return orig_moe(x)

model.recurrent.block.ffn.forward = moe_hook.__get__(
    model.recurrent.block.ffn, MoEFFN)
with torch.no_grad():
    x_m, _ = make_batch(32)
    _ = model(x_m, n_loops=T_TRAIN)
model.recurrent.block.ffn.forward = orig_moe

print("\n─── Part 8 ─ MoE expert utilization ───────────────────")
total = sum(expert_hits.values())
for eid in range(cfg.n_experts):
    share = expert_hits.get(eid, 0) / max(total, 1)
    print(f"expert {eid}: {share*100:5.2f}% of topk slots")
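The routing rule the hook is counting can be reproduced in a few lines of plain Python. This is a standalone sketch of top-2 softmax routing with made-up router logits (the logits and expert count are illustrative, not taken from the trained model):

```python
import math
from collections import Counter

# Top-k softmax routing: softmax the router logits, keep the k largest.
# Softmax is monotonic, so the top-k indices match those of the raw logits.
def topk_route(logits, k=2):
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

hits = Counter()
token_logits = [[2.0, 1.0, 0.1, 0.0],   # three tokens, four experts
                [0.0, 3.0, 0.2, 2.5],
                [1.5, 0.1, 2.0, 0.3]]
for logits in token_logits:
    hits.update(topk_route(logits))
print(dict(hits))  # expert index -> number of top-2 slots won
```

With balanced routing each expert should win roughly `topk / n_experts` of the slots (50% here for 2-of-4); a heavily skewed distribution in Part 8's output would indicate router collapse.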
prompt = torch.tensor([[1, 2, 1, 1, 2, 2, 1, 2]], device=device)
print("\n─── Part 9 ─ generation ───────────────────────────────")
print(f"prompt (parity pattern): {prompt.tolist()[0]}")
for T_gen in [1, 4, 12]:
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=8,
                             n_loops=T_gen, temperature=0.1, top_k=2)
    print(f"T_gen={T_gen:2d} → {out.tolist()[0]}")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].plot(losses)
axes[0].set_title("Training loss (parity task)")
axes[0].set_xlabel("step"); axes[0].set_ylabel("cross-entropy")
axes[0].grid(alpha=0.3)
axes[1].plot(T_sweep, accs, "o-", linewidth=2, markersize=8)
axes[1].axvline(T_TRAIN, color="red", linestyle="--",
                label=f"T_train = {T_TRAIN}")
axes[1].set_title("Depth extrapolation: accuracy vs inference loops")
axes[1].set_xlabel("n_loops at inference"); axes[1].set_ylabel("accuracy")
axes[1].legend(); axes[1].grid(alpha=0.3); axes[1].set_ylim(0, 1.05)
im = axes[2].imshow(halts, aspect="auto", cmap="viridis",
                    vmin=0, vmax=halts.max())
axes[2].set_title("ACT halting probability\n(loop t × position)")
axes[2].set_xlabel("position"); axes[2].set_ylabel("loop iteration t")
plt.colorbar(im, ax=axes[2], fraction=0.046, pad=0.04)
plt.tight_layout()
plt.savefig("openmythos_tutorial.png", dpi=120, bbox_inches="tight")
plt.show()
We analyze expert utilization in the MoE layer by tracking how tokens are routed across experts. We then generate sequences at different loop depths to observe their effect on the outputs. Finally, we visualize training loss, depth extrapolation performance, and ACT halting behavior through plots.
In conclusion, we demonstrated that OpenMythos effectively leverages looped computation to achieve depth extrapolation, enabling the model to improve accuracy simply by increasing the number of inference-time loops. We observed that the recurrent mechanism remains stable even under extreme training conditions, and that MLA attention substantially reduces KV-cache memory usage compared to GQA. We also saw how ACT enables dynamic computation across sequence positions and how MoE routing distributes workload across experts. Overall, we established that this architecture offers a compelling path for compute-adaptive reasoning, where we trade extra inference compute for better performance without modifying the model's parameters.

