Elon Musk’s AI company xAI has launched two standalone audio APIs, a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API, both built on the same infrastructure that powers Grok Voice on mobile apps, Tesla vehicles, and Starlink customer support. The release moves xAI squarely into the competitive speech API market currently occupied by ElevenLabs, Deepgram, and AssemblyAI.
What Is the Grok Speech-to-Text API?
Speech-to-Text is the technology that converts spoken audio into written text. For developers building meeting transcription tools, voice agents, call center analytics, or accessibility features, an STT API is a core building block. Rather than developing this from scratch, developers call an endpoint, send audio, and receive a structured transcript in return.
The Grok STT API is now generally available, offering transcription across 25 languages with both batch and streaming modes. Batch mode is designed for processing pre-recorded audio files, while streaming mode enables real-time transcription as audio is captured. Pricing is kept simple: Speech-to-Text costs $0.10 per hour for batch and $0.20 per hour for streaming.
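To make the pricing concrete, here is a minimal Python sketch that turns the two hourly rates quoted above into a cost estimate. The function name `estimate_stt_cost` is hypothetical, and the sketch assumes billing is strictly proportional to audio duration; the article does not describe rounding or minimum-charge rules.

```python
from decimal import Decimal

# Rates quoted in the article, in USD per hour of audio.
STT_RATE_PER_HOUR = {"batch": Decimal("0.10"), "streaming": Decimal("0.20")}


def estimate_stt_cost(audio_seconds: float, mode: str = "batch") -> Decimal:
    """Estimate the USD cost of transcribing `audio_seconds` of audio.

    Assumption: cost scales linearly with duration. Real billing
    granularity is not stated in the article.
    """
    if mode not in STT_RATE_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    return (hours * STT_RATE_PER_HOUR[mode]).quantize(Decimal("0.0001"))


if __name__ == "__main__":
    # 1,000 hours of recorded calls processed in batch mode.
    print(estimate_stt_cost(1000 * 3600, "batch"))
    # A 90-minute live meeting transcribed in streaming mode.
    print(estimate_stt_cost(90 * 60, "streaming"))
```

Under these assumptions, a 1,000-hour call archive works out to about $100 in batch mode, and the same audio would cost twice that if streamed.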
The API includes word-level timestamps, speaker diarization, and multichannel support, along with intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more. It also accepts 12 audio formats: 9 container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and 3 raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request.
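A client could check a file against these limits before uploading it. The sketch below is illustrative only: `check_upload` is a hypothetical helper, the mapping from format names to file extensions is an assumption, and "500 MB" is read as decimal megabytes. Raw PCM, µ-law, and A-law streams are left out because headerless audio usually has no standard extension to test.

```python
from pathlib import Path

# Container formats listed in the article, written as lowercase extensions.
# Assumption: each format is identified by a same-named file extension.
CONTAINER_EXTENSIONS = {"wav", "mp3", "ogg", "opus", "flac", "aac", "mp4", "m4a", "mkv"}

# Assumption: the 500 MB cap means 500 * 10^6 bytes.
MAX_FILE_BYTES = 500 * 1000 * 1000


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems: list[str] = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in CONTAINER_EXTENSIONS:
        problems.append(f"unrecognized extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_FILE_BYTES}-byte cap")
    return problems


if __name__ == "__main__":
    print(check_upload("standup.MP3", 42_000_000))
    print(check_upload("archive.wma", 600_000_000))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a file at once.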
Speaker diarization is the process of separating audio by individual speakers, answering the question ‘who said what.’ This is critical for multi-speaker recordings like meetings, interviews, or customer calls. Word-level timestamps assign precise start and end times to each word in the transcript, enabling use cases like subtitle generation, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms like ‘one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents’ into readable structured output: “$167,983.15.”
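As an illustration of how word-level timestamps and speaker labels combine in the subtitle use case, the following sketch groups consecutive words by speaker and writes one SRT cue per turn. The `Word` record is a hypothetical shape for a transcript entry, not the Grok API's actual response schema, which the article does not show.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical transcript entry: one word with timing and a speaker label."""
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    speaker: str


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[Word]) -> str:
    """Emit one SRT cue per uninterrupted speaker turn."""
    turns: list[list[Word]] = []
    for w in words:
        if turns and turns[-1][-1].speaker == w.speaker:
            turns[-1].append(w)  # same speaker keeps talking
        else:
            turns.append([w])    # speaker change starts a new cue
    cues = []
    for i, turn in enumerate(turns, start=1):
        text = " ".join(w.text for w in turn)
        cues.append(
            f"{i}\n{srt_time(turn[0].start)} --> {srt_time(turn[-1].end)}\n"
            f"{turn[0].speaker}: {text}\n"
        )
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4, "S1"),
        Word("there", 0.45, 0.8, "S1"),
        Word("Hi", 1.0, 1.2, "S2"),
    ]
    print(words_to_srt(sample))
```

A production subtitle tool would also split long turns by duration or line length; this sketch only shows how the two features fit together.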
Benchmark Performance
The xAI research team is making strong claims on accuracy. On phone call entity recognition (names, account numbers, dates), Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. That is a substantial margin if it holds in production. For video and podcast transcription, Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing at 3.0% and 3.2% respectively. The xAI team also reports a 6.9% word error rate on general audio benchmarks.
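For readers who want to check such figures on their own audio, word error rate is conventionally computed as the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The sketch below implements that standard definition. It is not xAI's evaluation code, and the entity-recognition error rates above may be measured differently. Text normalization here is limited to lowercasing and whitespace splitting, which is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.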
https://x.ai/information/grok-stt-and-tts-apis
What Is the Grok Text-to-Speech API?
Text-to-Speech converts written text into spoken audio. Developers use TTS APIs to power voice assistants, read-aloud features, podcast generation, IVR (interactive voice response) systems, and accessibility tools.
The Grok TTS API delivers fast, natural speech synthesis with detailed control via speech tags, and is priced at $4.20 per 1 million characters. The API accepts up to 15,000 characters per REST request; for longer content, a WebSocket streaming endpoint is available that has no text length limit and begins returning audio before the full input is processed. The API supports 20 languages and 5 distinct voices: Ara, Eve, Leo, Rex, and Sal, with Eve set as the default.
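A client that stays on the REST endpoint has to keep each request under the 15,000-character limit. The sketch below shows one way to split long text on sentence boundaries and estimate the cost from the quoted price. Both helper names are hypothetical. It assumes every character in the request is billed, including any speech tags, which the article does not confirm.

```python
import re
from decimal import Decimal

MAX_CHARS_PER_REQUEST = 15_000           # REST limit quoted in the article
USD_PER_MILLION_CHARS = Decimal("4.20")  # price quoted in the article


def split_for_rest(text: str, limit: int = MAX_CHARS_PER_REQUEST) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is itself longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def estimate_tts_cost(num_chars: int) -> Decimal:
    """Estimate USD cost, assuming every character is billed at the flat rate."""
    return Decimal(num_chars) / Decimal(1_000_000) * USD_PER_MILLION_CHARS


if __name__ == "__main__":
    article = "This is a sentence. " * 2000  # about 40,000 characters
    parts = split_for_rest(article)
    print(len(parts), [len(p) for p in parts])
    print(estimate_tts_cost(sum(len(p) for p in parts)))
```

Splitting at sentence ends avoids cutting a word or phrase mid-utterance, which would likely produce an audible seam where the audio segments are joined.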
Beyond voice selection, developers can inject inline and wrapping speech tags to control delivery. These include inline tags like [laugh], [sigh], and [breath], and wrapping tags that enclose a span of text, letting developers create engaging, lifelike delivery without complex markup. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output.
Key Takeaways
- xAI has launched two standalone audio APIs, Grok Speech-to-Text (STT) and Text-to-Speech (TTS), built on the same production stack already serving millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support.
- The Grok STT API offers real-time and batch transcription across 25 languages with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats, priced at $0.10/hour for batch and $0.20/hour for streaming.
- On phone call entity recognition benchmarks, Grok STT reports a 5.0% error rate, significantly outperforming ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%), with particularly strong performance in medical, legal, and financial use cases.
- The Grok TTS API supports 5 expressive voices (Ara, Eve, Leo, Rex, Sal) across 20 languages, with inline and wrapping speech tags such as [laugh] and [sigh] giving developers fine-grained control over vocal delivery, priced at $4.20 per 1 million characters.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

