Inference efficiency has quietly become one of the most consequential bottlenecks in AI deployment. As agentic coding systems such as Claude Code, Codex, and Cursor scale from developer tools to infrastructure powering software development at large, the underlying inference engines serving those requests are under increasing strain. Researchers at the LightSeek Foundation have introduced TokenSpeed, an open-source LLM inference engine released under the MIT license and designed specifically for the demands of agentic workloads. The engine is currently in preview.
Why Agentic Inference Is a Different Problem
To understand why TokenSpeed's design choices matter, it helps to understand what makes agentic inference hard. Coding agents don't behave like a typical chatbot turn: contexts routinely exceed 50K tokens, and conversations often span dozens of turns. This puts simultaneous pressure on two metrics: per-GPU TPM (tokens per minute), which determines how many users a single GPU can serve, and per-user TPS (tokens per second), which determines whether an individual user perceives the system as responsive. Most public benchmarks don't fully capture this behavior.
TokenSpeed is designed to maximize both. The objective is to maximize per-GPU TPM while maintaining a per-user TPS floor, typically 70 TPS and sometimes 200 TPS or higher.
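The relationship between the two metrics is simple arithmetic: at a fixed per-user TPS floor, per-GPU TPM scales with how many requests the engine can batch concurrently. Here is a minimal sketch of that tradeoff; the 70 TPS floor comes from the article, while the concurrency and TPS figures are purely illustrative:

```python
# Per-GPU TPM = concurrent requests x per-user TPS x 60 seconds.
# The engine's goal: push concurrency (and hence TPM) as high as possible
# without letting any individual user's TPS drop below the floor.

TPS_FLOOR = 70  # per-user tokens/second floor cited in the article

def per_gpu_tpm(concurrent_requests: int, per_user_tps: float) -> float:
    """Aggregate tokens per minute served by one GPU."""
    return concurrent_requests * per_user_tps * 60

# Illustrative numbers: batching more users raises aggregate TPM,
# but per-user TPS typically degrades as the batch grows.
for users, tps in [(1, 200.0), (16, 110.0), (64, 75.0)]:
    assert tps >= TPS_FLOOR, "configuration violates the latency floor"
    print(f"{users:>3} users @ {tps:5.1f} TPS/user -> "
          f"{per_gpu_tpm(users, tps):,.0f} TPM/GPU")
```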
Architecture: Five Interlocking Subsystems
TokenSpeed's architecture is built around five design pillars: a compiler-backed modeling mechanism for parallelism, a high-performance scheduler, a safe KV resource reuse constraint, a pluggable layered kernel system that supports heterogeneous accelerators, and SMG integration for a low-overhead CPU-side request entrypoint.
The modeling layer uses a native SPMD (Single Program, Multiple Data) approach. SPMD is a parallel execution model in which all processes run the same program on different subsets of the data, a common pattern in distributed deep learning. Rather than requiring developers to hand-write the communication logic between processes, TokenSpeed lets them attach I/O placement annotations at module boundaries; a lightweight static compiler then automatically generates the required collective operations during model construction.
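The article does not show TokenSpeed's annotation syntax, so the following is a hypothetical sketch of the pattern it describes: the developer declares how a module's inputs and outputs are placed across ranks, and a compiler pass inserts the matching collective. The `placement` decorator and the all-reduce insertion point are assumptions, not TokenSpeed's actual API:

```python
import torch
import torch.distributed as dist

def placement(inputs: str, outputs: str):
    """Illustrative decorator recording I/O placement on a module class."""
    def wrap(cls):
        cls._placement = {"inputs": inputs, "outputs": outputs}
        return cls
    return wrap

# Hypothetical annotation: input arrives sharded along the hidden dim,
# output must be replicated. A static compiler pass walking the module
# graph at construction time would see this boundary and generate the
# collective that satisfies it -- here, an all-reduce after the matmul.
@placement(inputs="sharded(hidden)", outputs="replicated")
class RowParallelLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Each rank holds only its slice of the weight.
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        partial = x @ self.weight.t()   # per-rank partial sums
        if dist.is_initialized():       # in TokenSpeed's scheme this call is
            dist.all_reduce(partial)    # generated by the compiler, not written
        return partial                  # by the model author
```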
The scheduler makes a structural split between the control plane and the execution plane. The control plane is implemented in C++ as a finite-state machine that works with the type system to enforce safe resource management, including KV cache state transfer and usage, at compile time rather than at runtime. Request lifecycle, KV cache resources, and overlap timing are represented through explicit FSM transitions and ownership semantics, so correctness is enforced by a verifiable control system rather than by convention. Because these constraints live in the type system instead of runtime convention, errors in KV cache management, one of the most error-prone areas in LLM serving, are caught earlier. The execution plane is implemented in Python to preserve development velocity, enabling faster feature iteration and lower cognitive load for developers.
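TokenSpeed's control plane is C++ and the article does not publish its state machine, so the Python sketch below only illustrates the general discipline being described: a request's KV resources can be touched only in states that an explicit transition table allows, and they are released at exactly one transition. All state names and transitions here are invented for illustration:

```python
from enum import Enum, auto

class ReqState(Enum):
    QUEUED = auto()
    PREFILLING = auto()
    DECODING = auto()
    FINISHED = auto()

# Explicit transition table: anything not listed is a bug caught at the
# transition site, instead of surfacing later as KV cache corruption.
ALLOWED = {
    (ReqState.QUEUED, ReqState.PREFILLING),
    (ReqState.PREFILLING, ReqState.DECODING),
    (ReqState.DECODING, ReqState.FINISHED),
}

class Request:
    def __init__(self) -> None:
        self.state = ReqState.QUEUED
        self.kv_blocks: list[int] = []  # KV cache blocks owned by this request

    def transition(self, new: ReqState) -> None:
        if (self.state, new) not in ALLOWED:
            raise RuntimeError(f"illegal transition {self.state} -> {new}")
        if new is ReqState.FINISHED:
            # Ownership semantics: blocks are released exactly once, at a
            # single well-defined transition, never ad hoc.
            self.kv_blocks.clear()
        self.state = new
```

In the real C++ control plane this discipline is pushed further: encoded via the type system and ownership semantics, an illegal use fails at compile time rather than raising at runtime as this sketch does.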
The kernel layer treats GPU kernels as a first-class modular subsystem rather than baking them into the engine core. It provides a portable public API, a centralized registry and selection model, and an extensible plugin mechanism for heterogeneous accelerators, which means the engine is not locked to NVIDIA hardware. The team has also developed one of the fastest MLA (Multi-head Latent Attention) kernels for agentic workloads on NVIDIA Blackwell. In the decode kernel, q_seqlen and num_heads are grouped to fully utilize Tensor Cores, since num_heads is small in some of these use cases. The binary prefill kernel includes a fine-tuned softmax implementation. Notably, the TokenSpeed MLA kernel has been adopted by vLLM.
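The article describes the kernel layer only at the level of a registry, a selection model, and a plugin mechanism. A minimal sketch of that shape, where all names and the selection criterion are assumptions:

```python
# Hypothetical kernel registry: kernels self-register for an
# (operation, backend) pair, and the engine selects by the hardware it
# detects, so supporting a new accelerator means adding a plugin rather
# than patching the engine core.

KERNEL_REGISTRY: dict[tuple[str, str], callable] = {}

def register_kernel(op: str, backend: str):
    def wrap(fn):
        KERNEL_REGISTRY[(op, backend)] = fn
        return fn
    return wrap

@register_kernel("mla_decode", "cuda")
def mla_decode_cuda(*args, **kwargs):
    ...  # would dispatch to the tuned Blackwell MLA kernel

@register_kernel("mla_decode", "cpu")
def mla_decode_reference(*args, **kwargs):
    ...  # slow reference implementation, useful for testing

def select_kernel(op: str, backend: str):
    try:
        return KERNEL_REGISTRY[(op, backend)]
    except KeyError:
        raise NotImplementedError(f"no {op} kernel for backend {backend!r}")
```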
Source: https://lightseek.org/weblog/lightseek-tokenspeed.html
Finally, TokenSpeed integrates SMG, a PyTorch-native component, as a low-overhead CPU-side request entrypoint, reducing the handoff cost between CPU orchestration and GPU execution.
Benchmark Results Against TensorRT-LLM on NVIDIA B200
It is worth noting upfront that these benchmarks cover single (non-disaggregated) deployment only. Prefill-decode (PD) disaggregation support is still undergoing cleanup and may be covered in a dedicated follow-up from the TokenSpeed team.
Working with the EvalScope team, the researchers evaluated TokenSpeed on SWE-smith traces, which closely mirror production coding-agent traffic, benchmarking it against TensorRT-LLM, the current state of the art on NVIDIA Blackwell. The test model was Kimi K2.5.
For coding agents operating above 70 TPS/user, the best configuration is Attention TP4 + MoE TP4, where TokenSpeed dominates TensorRT-LLM across the entire Pareto frontier: roughly 9% faster in the min-latency case (batch size 1), and roughly 11% higher throughput around 100 TPS/user. TP4 here refers to tensor parallelism across four GPUs, a technique that shards model weights across multiple devices to reduce per-device memory pressure and latency.
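As a concrete picture of what TP4 means for a single linear layer, here is a generic column-sharding sketch (standard tensor parallelism, not TokenSpeed code; shapes are arbitrary):

```python
import torch

# Tensor parallelism (TP4): shard one weight matrix across 4 devices.
# Each device computes its slice of the output; per-device memory and
# compute drop by ~4x, at the cost of a collective to reassemble (or
# further consume) the sharded output.

tp = 4
hidden, out = 1024, 4096
w = torch.randn(out, hidden)            # full weight, shown for illustration
shards = torch.chunk(w, tp, dim=0)      # one (out/tp, hidden) shard per GPU

x = torch.randn(2, hidden)              # a small batch of activations
partials = [x @ s.t() for s in shards]  # would run on 4 GPUs in practice
y = torch.cat(partials, dim=-1)         # an all-gather in a real deployment

# The sharded computation reproduces the unsharded result.
assert torch.allclose(y, x @ w.t(), atol=1e-5)
```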
On the MLA kernel, the gains are most pronounced at the decode stage. The decode kernel folds the query-sequence axis into the head axis to better fill the BMM1 M tile, improving Tensor Core utilization. The binary-version prefill kernel uses NVIDIA-internal knobs to fine-tune the softmax implementation, outperforming TensorRT-LLM's MLA across all five typical prefill workloads for coding agents with a long prefix KV cache. Combined with other optimizations, this nearly halves latency relative to TensorRT-LLM on typical decode workloads with speculative decoding at batch sizes 4, 8, and 16 with a long prefix KV cache.
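The head-folding trick can be pictured as a reshape: with speculative decoding, each request decodes a handful of query positions at once, and folding those positions into the head axis gives the first batched matmul (BMM1, queries against keys) a taller M dimension to fill its Tensor Core tile. This is a shape-level illustration only, not the kernel itself, and all dimensions are made up:

```python
import torch

# During speculative decode, q_seqlen is small (e.g. 4 draft tokens) and
# num_heads can also be small, so neither axis alone fills the M tile of
# BMM1. Folding q_seqlen into the head axis yields
# M = q_seqlen * num_heads rows per request, improving utilization.

batch, q_seqlen, num_heads, head_dim = 8, 4, 16, 128
kv_len = 4096  # stands in for a long prefix KV cache

q = torch.randn(batch, q_seqlen, num_heads, head_dim)
k = torch.randn(batch, kv_len, head_dim)  # simplified: one shared K per
                                          # request, as in MLA's latent form

# Fold (q_seqlen, num_heads) into a single M axis of 64 rows per request.
q_folded = q.reshape(batch, q_seqlen * num_heads, head_dim)
scores = torch.bmm(q_folded, k.transpose(1, 2))  # BMM1: (8, 64, 4096)
print(scores.shape)
```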
Key Takeaways
- TokenSpeed is a new MIT-licensed, open-source LLM inference engine from the LightSeek Foundation, built specifically for agentic workloads. (Available in preview.)
- Its scheduler uses a C++ finite-state machine to enforce KV cache safety at compile time, while keeping the execution plane in Python for usability.
- On NVIDIA B200, TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput at 100 TPS/user on Kimi K2.5.
- The TokenSpeed MLA kernel nearly halves decode latency vs. TensorRT-LLM on speculative decoding workloads and has already been adopted by vLLM.

