Zyphra AI has launched ZAYA1-8B, a small Mixture of Experts (MoE) language model with 760 million active parameters and 8.4 billion total parameters. Trained end-to-end on AMD hardware, the model outperforms open-weight models many times its size on math and coding benchmarks, and is now available under an Apache 2.0 license on Hugging Face and as a serverless endpoint on Zyphra Cloud.
With under 1 billion active parameters, ZAYA1-8B achieves scores competitive with first-generation frontier reasoning models like DeepSeek-R1-0528, Gemini-2.5-Pro, and Claude 4.5 Sonnet on challenging mathematical reasoning tasks. With its novel test-time compute method called Markovian RSA, it surpasses Claude 4.5 Sonnet and GPT-5-High on HMMT’25 (89.6 vs 88.3) and closes in on frontier open-weight models like DeepSeek-V3.2 on mathematics benchmarks.
What Is a Mixture of Experts Model and Why Does Active Parameter Count Matter?
The distinction between ‘active’ and ‘total’ parameters matters a great deal. In a typical dense model, every parameter is activated for every input token. In a Mixture of Experts model, only a subset of the network’s parameters, the ‘experts’, is activated at inference time. ZAYA1-8B has 8.4B total parameters, but only 760M are active per forward pass. This dramatically reduces inference compute and memory bandwidth requirements while retaining the representational capacity of a much larger model.
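To make the active-versus-total distinction concrete, here is a minimal top-k MoE feed-forward layer in PyTorch. It is an illustrative sketch, not Zyphra’s implementation: the expert count, hidden sizes, and top-k value are arbitrary, and ZAYA1 uses an MLP-based router rather than the plain linear one shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal top-k Mixture-of-Experts layer (illustrative sketch, not ZAYA1's design)."""
    def __init__(self, d_model=1024, d_ff=4096, n_experts=16, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # ZAYA1 replaces this with an MLP router
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)  # each token picks its top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```

Total parameter count grows with the number of experts, but per-token compute grows only with the top-k value, which is how a model can hold 8.4B parameters while activating roughly 760M on each forward pass.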
ZAYA1-8B can be deployed on-device for local LLM applications, run efficiently in test-time compute harnesses, and serve requests at lower latency than dense models with comparable benchmark performance.
https://www.zyphra.com/submit/zaya1-8b
Architecture: MoE++ and Three Key Innovations
ZAYA1-8B is built on Zyphra’s MoE++ architecture, which introduces three specific modifications over standard MoE designs. Together, these form the basis of ZAYA1-8B’s intelligence efficiency, the design goal Zyphra frames as maximizing intelligence extracted per parameter and per FLOP.
- Compressed Convolutional Attention (CCA), a sequence-mixing mechanism developed by Zyphra that operates in a compressed latent space and achieves 8× KV-cache compression versus standard attention. The KV-cache is the memory used during inference to store intermediate attention states; an 8× reduction directly lowers memory requirements at inference time and allows longer effective contexts within the same hardware envelope.
- ZAYA1’s MLP-based router with PID-controller bias balancing. Standard MoE routers typically use linear projections to decide which expert processes a given token. Zyphra replaces this with an MLP-based router and adds PID-controller-style bias balancing to improve routing stability, actively preventing load imbalance across experts, a known failure mode in MoE training (see the sketch after this list).
- Learned residual scaling, which controls residual-norm growth with depth at negligible parameter and FLOP cost. In deep networks, residual-stream norms can grow unstably from layer to layer; learned scaling addresses this without adding meaningful overhead.
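The router change is easiest to see in code. The following is a rough sketch under stated assumptions: the MLP shape, the PID gains, and the exact bias-update rule are illustrative placeholders rather than Zyphra’s published design; it only shows the general mechanism of nudging per-expert routing biases toward a uniform load.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PIDBalancedMLPRouter(nn.Module):
    """MLP router with PID-style bias balancing (illustrative sketch, not ZAYA1's exact design)."""
    def __init__(self, d_model=1024, n_experts=16, top_k=2, kp=0.01, ki=0.001, kd=0.005):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, n_experts)
        )
        self.register_buffer("bias", torch.zeros(n_experts))      # steering bias, not learned
        self.register_buffer("err_sum", torch.zeros(n_experts))   # integral term
        self.register_buffer("err_prev", torch.zeros(n_experts))  # for the derivative term
        self.top_k, self.kp, self.ki, self.kd = top_k, kp, ki, kd

    @torch.no_grad()
    def _update_bias(self, idx, n_tokens, n_experts):
        # Error = each expert's actual share of routed tokens minus the ideal uniform share.
        load = torch.bincount(idx.flatten(), minlength=n_experts).float() / (n_tokens * self.top_k)
        err = load - 1.0 / n_experts
        self.err_sum += err
        # PID correction: push overloaded experts' biases down, underloaded ones up.
        self.bias -= self.kp * err + self.ki * self.err_sum + self.kd * (err - self.err_prev)
        self.err_prev = err

    def forward(self, x):  # x: (tokens, d_model)
        logits = self.mlp(x) + self.bias          # bias only steers routing decisions
        weights, idx = F.softmax(logits, dim=-1).topk(self.top_k, dim=-1)
        if self.training:
            self._update_bias(idx, x.shape[0], logits.shape[-1])
        return weights, idx
```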
Training Infrastructure: Fully Built on AMD
ZAYA1-8B is an MoE model pretrained, midtrained, and supervised fine-tuned on an AMD Instinct MI300 stack. The full training pipeline ran on a cluster of 1,024 AMD Instinct MI300X nodes connected via AMD Pensando Pollara interconnect, in a custom training cluster built with IBM.
Reasoning-First Pretraining and a 5-Stage Post-Training Pipeline
ZAYA1-8B’s performance reflects innovations across the full stack: Zyphra’s MoE++ architecture, reasoning-first pretraining, a reasoning RL cascade methodology, and the novel Markovian RSA test-time compute strategy.
Zyphra’s post-training pipeline consists of five sequential stages:
- The first is a standard SFT stage covering basic chat, instruction following, code, math, and test-time compute (TTC) abilities.
- The second is a reasoning warmup combining mathematical tasks, logic and puzzle solving, and TTC prompts to teach the model to natively self-aggregate candidate solutions.
- Third is a large RLVE-Gym phase with dynamically adjusted puzzle difficulty to train core reasoning circuits.
- Fourth is a large-scale math and code RL phase to deepen performance in these two domains.
- Finally, a relatively lightweight RLHF/RLAIF phase improves chat behavior, instruction following, and writing style.
Zyphra’s research team observed the most substantial capability gains on mathematics and coding during RL, with smaller but meaningful gains in multiple-choice knowledge retrieval (MMLU and GPQA-Diamond) and in non-verifiable tasks such as creative writing.
Markovian RSA: A Novel Test-Time Compute Method
The most technically significant contribution alongside the model is Markovian RSA, a test-time compute (TTC) scheme that combines two prior ideas in a new way.
The first is Recursive Self-Aggregation (RSA), which generates multiple reasoning traces in parallel and aggregates them recursively across iterations. The second is the Markovian thinker idea, which performs reasoning in fixed-duration chunks; only the tail end of the previous chunk is passed to the next, keeping the context window bounded regardless of how long the model reasons.
Markovian RSA combines these: for each prompt, multiple traces are generated in parallel; fixed-length tail segments are extracted from each trace; new aggregation prompts are constructed by sub-sampling from the candidate pool; and these aggregated prompts seed the next round of parallel responses. The result has favorable inference properties: rollout generation is parallelizable, and the Markovian chunking strategy ensures intermediate chain-of-thought lengths never exceed a fixed context window size.
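A compact sketch of that loop, assuming a generic `generate` callable and a made-up aggregation template (the real prompt format, trace counts, and tail lengths are not specified here), looks roughly like this:

```python
import random

def markovian_rsa(prompt, generate, n_traces=8, n_rounds=4, tail_len=2048, agg_size=3):
    """Illustrative sketch of Markovian RSA; `generate` stands in for the model's sampling
    call, and the aggregation template is an assumption, not Zyphra's exact format."""
    # Round 0: sample several independent reasoning traces in parallel for the prompt.
    candidates = [generate(prompt) for _ in range(n_traces)]
    for _ in range(n_rounds):
        # Keep only a fixed-length tail of each trace (characters here, tokens in practice)
        # so the context stays bounded no matter how long the model has reasoned.
        tails = [trace[-tail_len:] for trace in candidates]
        next_candidates = []
        for _ in range(n_traces):
            # Build an aggregation prompt from a random subset of candidate tails
            # (the recursive self-aggregation step), then seed the next round with it.
            subset = random.sample(tails, k=min(agg_size, len(tails)))
            agg_prompt = (prompt + "\n\nCandidate solutions:\n" + "\n---\n".join(subset)
                          + "\n\nAggregate these and continue reasoning:")
            next_candidates.append(generate(agg_prompt))
        candidates = next_candidates
    return candidates
```

Because each round’s prompts contain only fixed-length tails rather than full traces, the context the model sees stays bounded however many rounds run, while the rollouts within a round remain fully parallelizable.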
A key finding is that co-design between the post-training methodology and the inference harness is essential. ZAYA1-8B was trained to understand and respond to Markovian RSA aggregation prompts and chunking starting in SFT and continuing through RL. When Zyphra applied the same methodology to Qwen3-4B-Thinking-2507 without this co-design, the performance uplift was significantly smaller, indicating that the harness and the post-training recipe must be developed together to realize the gains.
With Markovian RSA at an extra-high test-time compute budget of 5.5 million tokens per problem, ZAYA1-8B outperforms DeepSeek-V3.2 and GPT-OSS-High on the challenging APEX-shortlist mathematics benchmark.
Benchmark Results
In the in-class comparison against similarly sized models, ZAYA1-8B scores 89.1 on AIME’26, 71.6 on HMMT Feb.’26, 59.3 on IMO-AnswerBench, 32.2 on APEX-shortlist, 65.8 on LiveCodeBench-v6, and 71.0 on GPQA-Diamond, outperforming Qwen3-4B-Thinking-2507 and Gemma-4-E4B-it across all mathematics and coding categories.
Against larger open-weight models, ZAYA1-8B with 760M active parameters surpasses Mistral-Small-4-119B (6B active, 119B total) on math and coding benchmarks specifically, scoring 89.1 vs 86.4 on AIME’26, 71.6 vs 70.6 on HMMT Feb.’26, and 63.8 vs 57.9 on LiveCodeBench-v6. Mistral-Small-4-119B retains advantages on GPQA-Diamond (77.2 vs 71.0) and MMLU-Pro (81.6 vs 74.2), where knowledge breadth matters more than mathematical reasoning depth.
Key Takeaways
- ZAYA1-8B delivers frontier-level math and coding performance with only 760M active parameters, outperforming open-weight models many times its size.
- Its MoE++ architecture introduces three innovations (CCA with 8× KV-cache compression, an MLP-based router with PID-controller bias balancing, and learned residual scaling) to maximize intelligence per parameter.
- A novel test-time compute strategy called Markovian RSA, combining Recursive Self-Aggregation with Markovian chunking, pushes ZAYA1-8B past DeepSeek-V3.2 and GPT-OSS-High on APEX-shortlist at 5.5M tokens per problem.
- ZAYA1-8B is the first MoE model pretrained, midtrained, and SFT’d entirely on AMD Instinct MI300 hardware, on a 1,024-node MI300X cluster built with IBM.
- Released under Apache 2.0, it is available on Hugging Face and Zyphra Cloud.

