IBM launched two new open speech recognition models, Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR, and they make a compelling case for what a ~2B-parameter speech model can do. Both are available on Hugging Face under the Apache 2.0 license.
The pair targets a specific problem that enterprise AI teams know well: most production-grade automatic speech recognition (ASR) systems either demand massive compute or sacrifice accuracy to stay within budget. IBM's bet is that careful architecture choices can let you have it both ways.
What These Models Actually Do
Granite Speech 4.1 2B is a compact and efficient speech-language model designed for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) covering English, French, German, Spanish, Portuguese, and Japanese. Its non-autoregressive counterpart, Granite Speech 4.1 2B-NAR, focuses exclusively on ASR, specifically targeting latency-sensitive deployments, and supports English, French, German, Spanish, and Portuguese, but not Japanese. That is a meaningful distinction: teams that need Japanese transcription or any speech translation capability should reach for the standard autoregressive model.
IBM also quietly released a third variant alongside these two. Granite Speech 4.1 2B-Plus adds speaker-attributed ASR and word-level timestamps for applications where knowing who said what, and exactly when, is a requirement.
Word Error Rate (WER) is the primary metric for measuring transcription quality. Lower is better. A WER of 5% means roughly 5 out of every 100 words are wrong. On the Open ASR Leaderboard (as of April 2026), Granite Speech 4.1 2B scores a mean WER of 5.33. Drilling into benchmark detail: on LibriSpeech clean the model achieves a WER of 1.33, and 2.5 on LibriSpeech other.
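To make the metric concrete, here is a minimal sketch of how WER is typically computed: a word-level Levenshtein (edit) distance between reference and hypothesis, divided by the reference word count. This is an illustrative implementation, not the leaderboard's scoring code (which also applies text normalization).

```python
# Minimal WER sketch: edit distance over words / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of six -> WER of ~16.7%
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

A mean WER of 5.33 therefore means roughly one error per 19 reference words, averaged across the leaderboard's test sets.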
The Architecture, Explained
Both models share the same three-component design at a high level (a speech encoder, a modality adapter, and a language model), though the decoding mechanism diverges significantly.
The first component is the speech encoder. The architecture uses 16 conformer blocks trained with Connectionist Temporal Classification (CTC) with two classification heads, one for graphemic (character-level) outputs and one for BPE units, using frame importance sampling to focus on informative parts of the audio. A Conformer is a neural network layer that combines convolutional layers (good at capturing local acoustic patterns) with attention mechanisms (good at capturing long-range dependencies). CTC is a training technique that lets the model learn from audio-text pairs without needing exact frame-level alignment.
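The alignment-free property of CTC comes from its decoding rule: the head emits one symbol per audio frame (including a special blank), and the transcript is recovered by collapsing repeated symbols and then dropping blanks, so many frame-level alignments map to the same text. A toy sketch of that collapse rule:

```python
# Illustrative CTC collapse rule: merge repeats, then drop blanks.
BLANK = "_"

def ctc_collapse(frames: list[str]) -> str:
    out = []
    prev = None
    for sym in frames:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# Two different frame-level alignments collapse to the same target,
# which is why CTC training needs no frame-level labels:
print(ctc_collapse(list("cc_aa_t")))  # -> "cat"
print(ctc_collapse(list("c_att__")))  # -> "cat"
```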
The second component is a speech-text modality adapter. A 2-layer window query transformer (Q-Former) operates on blocks of 15 1024-dimensional acoustic embeddings coming from the last conformer block, downsampling by a factor of 5 using 3 trainable queries per block and per layer, for a total temporal downsampling factor of 10, resulting in a 10 Hz acoustic embedding rate for the LLM. This adapter bridges the gap between continuous acoustic features and discrete text tokens, compressing the audio representation so the language model can process it efficiently. In the NAR model, the Q-Former has 160M parameters and downsamples the concatenated hidden representations from four encoder layers (layers 4, 8, 12, and 16).
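A quick back-of-the-envelope check of those numbers, under one assumption not stated above: that the encoder emits acoustic embeddings at 100 Hz, a common rate for speech encoders (the stated overall factor of 10 and the 10 Hz output rate are consistent with it).

```python
# Downsampling arithmetic for the Q-Former adapter (figures from the text;
# the 100 Hz encoder rate is an assumption for illustration).
encoder_rate_hz = 100        # assumed encoder embedding rate
block_size = 15              # acoustic embeddings per Q-Former window
queries_per_block = 3        # trainable queries per block

qformer_factor = block_size / queries_per_block  # 15 -> 3 queries, i.e. 5x
total_factor = 10                                # stated overall factor

llm_rate_hz = encoder_rate_hz / total_factor
print(qformer_factor)        # 5.0
print(llm_rate_hz)           # 10.0 embeddings/second reaching the LLM

# For a 30-second utterance the LLM sees only 300 audio positions:
print(30 * llm_rate_hz)
```

Compressing audio to 10 positions per second is what keeps long recordings affordable for the LLM's attention.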
The third component is the language model. Granite Speech 4.1 2B uses an intermediate checkpoint of granite-4.0-1b-base with 128k context length, fine-tuned on all training corpora. In the NAR variant, this becomes a 1B-parameter bidirectional LLM editor: granite-4.0-1b-base with its causal attention mask removed to enable bidirectional context, adapted with LoRA at rank 128 applied to both attention and MLP layers.
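Why LoRA at rank 128 keeps the adaptation cheap: each adapted weight matrix W (shape d_out x d_in) gains two low-rank factors, so only r * (d_in + d_out) parameters are trained instead of d_in * d_out. The dimensions below are hypothetical, chosen only to show the ratio; the model's actual layer shapes are not given above.

```python
# LoRA trainable-parameter count for one adapted matrix:
# delta_W = B @ A, with B (d_out x r) and A (r x d_in).
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

d = 2048                          # hypothetical hidden size
full = d * d                      # dense matrix: ~4.2M params
lora = lora_params(d, d, rank=128)  # low-rank factors: ~0.52M params
print(lora / full)                # 0.125 -> 12.5% of that matrix is trained
```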
The Autoregressive vs. Non-Autoregressive Tradeoff
This is where the two models diverge most sharply, and it has direct consequences for production deployment.
In the standard Granite Speech 4.1 2B, text is generated autoregressively: one token at a time, each depending on every token before it. This produces accurate, stable transcripts with full support for AST, keyword-biased recognition, and punctuation, but is inherently sequential and slower at scale.
Granite Speech 4.1 2B-NAR takes a fundamentally different approach. Rather than decoding tokens one at a time, it edits a CTC hypothesis in a single forward pass using a bidirectional LLM, achieving competitive accuracy with faster inference than autoregressive alternatives. This is the NLE (Non-autoregressive LLM-based Editing) architecture. Concretely: the CTC encoder produces a rough initial transcript, that hypothesis is interleaved with insertion slots, and then a bidirectional LLM predicts edits (copy, insert, delete, or substitute) at all positions simultaneously in one pass.
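A schematic sketch of that editing step (the slot/edit encoding below is illustrative, not IBM's actual format): the hypothesis is interleaved with insertion slots, a bidirectional model would predict one edit per position in parallel, and applying those edits yields the final transcript.

```python
# Illustrative NLE-style edit application over a CTC hypothesis.
SLOT = "<slot>"

def interleave(hyp: list[str]) -> list[str]:
    # [SLOT, t1, SLOT, t2, ..., SLOT]: a slot before/after every token.
    seq = [SLOT]
    for tok in hyp:
        seq += [tok, SLOT]
    return seq

def apply_edits(hyp: list[str], edits: list[str]) -> list[str]:
    # At a slot, the edit is "" (leave empty) or a word to insert.
    # At a token, it is "=" (copy), "-" (delete), or a substitute word.
    out = []
    for pos, op in zip(interleave(hyp), edits):
        if pos == SLOT:
            if op:
                out.append(op)      # insertion
        elif op == "=":
            out.append(pos)         # copy hypothesis token
        elif op != "-":
            out.append(op)          # substitution ("-" deletes)
    return out

hyp = ["the", "cat", "sat", "on", "a", "mat"]       # rough CTC transcript
edits = ["", "=", "", "=", "", "=", "", "=", "", "the", "", "=", ""]
print(" ".join(apply_edits(hyp, edits)))  # the cat sat on the mat
```

Because every edit decision is made from full left and right context in one pass, the cost is one forward pass regardless of transcript length.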
The NAR model measured an RTFx of roughly 1820 on a single H100 GPU using batched inference at batch size 128. RTFx (real-time factor multiplier) measures how many times faster than real time a model can process audio; an RTFx of 1820 means a one-hour audio file can be transcribed in under two seconds on that hardware. One practical constraint engineers should note: the NAR model requires flash_attention_2 for inference, since this backend supports sequence packing and respects the is_causal=False flag.
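The one-hour figure follows directly from the definition of RTFx:

```python
# RTFx sanity check: wall-clock time = audio duration / RTFx.
rtfx = 1820
audio_seconds = 3600                 # one hour of audio
wall_clock = audio_seconds / rtfx
print(round(wall_clock, 2))          # ~1.98 seconds on a single H100
```

Note this is a batched throughput figure (batch size 128), so it describes offline transcription capacity rather than single-stream latency.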
Training Data and Infrastructure
The two models were trained on different datasets. The standard model was trained on 174,000 hours of audio from public corpora for ASR and AST, as well as synthetic datasets tailored to support Japanese ASR, keyword-biased ASR, and speech translation. The NAR model was trained on roughly 130,000 hours of speech across five languages using publicly available datasets including CommonVoice 15, MLS, LibriSpeech, LibriHeavy, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and SwitchBoard.
The infrastructure gap between the two is equally telling. The standard model's training was completed in 30 days (26 days for the encoder and 4 days for the projector) on 8 H100 GPUs. The NAR model trained in just 3 days on 16 H100 GPUs (2 nodes) for 5 epochs, a much lighter training run that reflects the architectural simplicity of editing over full autoregressive generation.
Key Takeaways
Here are five key takeaways:
- IBM released two open ASR models, Granite Speech 4.1 2B (autoregressive) and Granite Speech 4.1 2B-NAR (non-autoregressive), both ~2B parameters and Apache 2.0 licensed.
- The standard model achieves a mean WER of 5.33 on the Open ASR Leaderboard and supports 6 languages for ASR (including Japanese), bidirectional speech translation, keyword biasing, and punctuation/truecasing, making it competitive with models several times its size.
- The NAR model trades capabilities for speed: it drops Japanese, AST, and keyword biasing, but delivers an RTFx of ~1820 on a single H100 GPU by editing a CTC hypothesis in a single forward pass rather than generating tokens one at a time.
- The architecture has three core components: a 16-layer Conformer encoder trained with dual-head CTC, a 2-layer window Q-Former projector that downsamples audio to a 10 Hz embedding rate, and a fine-tuned granite-4.0-1b-base language model.
- A third variant, Granite Speech 4.1 2B-Plus, also exists, extending the standard model with speaker-attributed ASR and word-level timestamps for applications where speaker identity and precise timing are required.
Check out the models: Granite Speech 4.1 2B and Granite Speech 4.1 2B (NAR).

