Large language models are remarkably capable, yet frustratingly opaque. When a model misbehaves, producing responses in the wrong language, repeating itself endlessly, or refusing safe requests, AI developers have very few tools for diagnosing why it happened at the level of internal computations. That is the problem Qwen-Scope is built to solve.
The Qwen Team has just released Qwen-Scope, an open-source suite of sparse autoencoders (SAEs) trained on the Qwen3 and Qwen3.5 model families. The release includes 14 groups of SAE weights across 7 model variants: five dense models (Qwen3-1.7B, Qwen3-8B, Qwen3.5-2B, Qwen3.5-9B, and Qwen3.5-27B) and two mixture-of-experts (MoE) models (Qwen3-30B-A3B and Qwen3.5-35B-A3B).
What Is a Sparse Autoencoder, and Why Should You Care?
Think of a sparse autoencoder as a translation layer between raw neural network activations and human-understandable concepts. When an LLM processes text, it produces high-dimensional hidden states (vectors with thousands of numbers) that are difficult to interpret directly. An SAE learns to decompose these activations into a large dictionary of sparse latent features, where each input activates only a small subset of features. Each of those features tends to correspond to a specific, interpretable concept: a language, a style, a safety-relevant behavior.
Concretely, for each backbone and transformer layer, Qwen-Scope trains a separate SAE to reconstruct residual-stream activations using a sparse set of latent features. The SAE encoder maps each activation to an overcomplete latent representation, and a Top-k activation rule keeps only the largest k latent activations for reconstruction (with k set to either 50 or 100 in the release). For dense backbones, the SAE width scales to 16× the model hidden size; for MoE backbones, standard SAEs use 32K width (16× expansion), and wider SAEs up to 128K width (64× expansion) are also released to capture finer-grained representation structure.
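A minimal sketch of what a Top-k SAE forward pass looks like. The weights, shapes, and function names here are toy stand-ins, not the released implementation, which may differ in details such as normalization or bias handling:

```python
import numpy as np

def topk_sae_forward(h, W_enc, b_enc, W_dec, b_dec, k=50):
    """Sketch of a Top-k SAE forward pass over one activation vector.

    h: residual-stream activation, shape (d_model,)
    W_enc: (d_model, d_sae) encoder weights; d_sae is e.g. 16x d_model
    W_dec: (d_sae, d_model) decoder (feature-dictionary) weights
    """
    # Encode into an overcomplete latent space with a ReLU.
    z = np.maximum(h @ W_enc + b_enc, 0.0)
    # Top-k rule: keep only the k largest latents, zero out the rest.
    idx = np.argpartition(z, -k)[-k:]
    z_sparse = np.zeros_like(z)
    z_sparse[idx] = z[idx]
    # Reconstruct the original activation from the sparse code.
    h_hat = z_sparse @ W_dec + b_dec
    return z_sparse, h_hat

# Toy usage: d_model=64 with 16x expansion -> d_sae=1024.
rng = np.random.default_rng(0)
d_model, d_sae, k = 64, 1024, 50
h = rng.standard_normal(d_model)
z, h_hat = topk_sae_forward(
    h,
    rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model),
    np.zeros(d_sae),
    rng.standard_normal((d_sae, d_model)) / np.sqrt(d_sae),
    np.zeros(d_model),
    k=k,
)
```

The sparse code `z` has at most k nonzero entries; those entries are the "active features" the rest of the release builds on.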
The result’s a layer-wise characteristic dictionary for each transformer layer throughout all seven backbones. One essential technical element: Qwen3.5-27B is the one spine whose SAEs are skilled on the instruct variant; all different six backbones use their base mannequin checkpoints.
Four Ways Qwen-Scope Changes the Development Workflow
1. Inference-Time Steering
The most immediate application is steering: influencing model output without modifying any model weights. The idea rests on a well-supported hypothesis: high-level behaviors are encoded as directions in the model's internal representation space. By adding or subtracting a feature direction from the residual stream at inference time using the formula h′ ← h + αd, where h is the hidden state, d is the SAE feature direction, and α controls strength, engineers can push the model toward or away from specific behaviors.
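The update itself is a one-liner. This toy sketch (random vectors standing in for real hidden states and a real SAE decoder direction) shows how subtracting a direction suppresses the corresponding feature:

```python
import numpy as np

def steer(h, d, alpha):
    """Apply the steering update h' = h + alpha * d.

    h: hidden states, shape (seq_len, d_model)
    d: unit-norm SAE feature direction, shape (d_model,)
    alpha > 0 amplifies the feature; alpha < 0 suppresses it.
    """
    return h + alpha * d

# Toy example: suppress a direction the current activations contain.
rng = np.random.default_rng(1)
d = rng.standard_normal(32)
d /= np.linalg.norm(d)                      # unit-norm feature direction
h = rng.standard_normal((5, 32)) + 3.0 * d  # activations with the feature baked in
h_steered = steer(h, d, alpha=-3.0)         # subtract the direction to suppress it
```

In practice this would run inside a forward hook on the chosen layer's residual stream; the numbers above are purely illustrative.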
The research team demonstrates two case studies on Qwen3 models. In the first, a model prompted in English unexpectedly mixes in Chinese text. Ranking SAE features by activation strength reveals a highly activated Chinese-language feature (id: 6159). Suppressing it during generation removes the language mixing entirely. In the second, activating a classical-Chinese feature (id: 36398) successfully steers a story-continuation task toward a classical literary style. Both examples required zero weight updates.
https://qianwen-res.oss-accelerate.aliyuncs.com/qwen-scope/Qwen_Scope.pdf
2. Evaluation Analysis Without Running Models
Evaluating LLMs typically means running many forward passes across large benchmark datasets, which is expensive in compute and time. Qwen-Scope proposes a cheaper alternative: using SAE feature activations as a representation-level proxy for benchmark evaluation.
The core insight is that when a model processes a benchmark sample, the SAE decomposes its activation into a sparse set of active features, each interpretable as a 'micro-capability.' A benchmark whose samples all activate the same features is redundant; two benchmarks that activate largely overlapping feature sets are similar. The research team defines a feature redundancy metric that achieves a Spearman rank correlation of ρ ≈ 0.85 with performance-based redundancy across 17 widely used benchmarks, including MMLU, GSM8K, MATH, EvalPlus, and GPQA-Diamond, without running a single model evaluation. The analysis also reveals that 63% of GSM8K's features are already covered by MATH, suggesting that evaluation suites containing MATH can safely omit GSM8K with minimal loss of discriminative information.
The framework also extends to inter-benchmark similarity: the research team measures feature overlap between pairs of benchmarks to determine whether they probe the same capabilities. After controlling for general model ability by partialing out MMLU scores, the partial Pearson correlation between feature overlap and performance-based similarity across 28 benchmark pairs improves to 75.5%, evidence that feature overlap captures benchmark-specific capability similarity rather than just general model quality. This has a direct practical implication: benchmarks with low mutual feature overlap probe distinct capabilities and should both be retained; benchmarks with high overlap are candidates for consolidation.
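If each benchmark is summarized as the set of SAE feature ids its samples activate, coverage and overlap reduce to simple set arithmetic. The metrics below are illustrative simplifications, not the paper's exact definitions, and the feature-id sets are made up:

```python
def feature_coverage(target, reference):
    """Fraction of `target`'s active features already present in `reference`."""
    return len(target & reference) / len(target)

def feature_overlap(a, b):
    """Symmetric Jaccard overlap between two benchmarks' feature sets."""
    return len(a & b) / len(a | b)

# Hypothetical feature-id sets for two math benchmarks.
gsm8k = {3, 17, 42, 99, 204, 511, 777, 1024, 2048, 4096}
math_ = {3, 17, 42, 99, 204, 511, 900, 1300, 2048, 5000, 6000, 7000}

coverage = feature_coverage(gsm8k, math_)  # 0.7: 70% of GSM8K covered by MATH
overlap = feature_overlap(gsm8k, math_)
```

A high coverage value is the signal for pruning a benchmark from a suite; a low pairwise overlap is the signal for keeping both.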
3. Data-Centric Workflows: Toxicity Classification and Safety Data Synthesis
SAE features also prove effective as lightweight classifiers. The research team builds a multilingual toxicity classifier across 13 languages using a simple two-stage pipeline: identify SAE features that fire more frequently on toxic examples than on clean ones (using a small discovery set), then apply an OR-rule over those features on held-out test data, with no additional classifier head and no gradient-based fitting. On English, this achieves an F1 score above 0.90 on both Qwen3-1.7B and Qwen3-8B. The research team further shows that features discovered in English transfer meaningfully to other languages without rediscovery: performance declines with linguistic distance (strongest for European languages like Russian and French, weaker for Arabic, Chinese, and Amharic), and scaling to Qwen3-8B improves both the level and stability of cross-lingual transfer. Crucially, using only 10% of the original discovery data still recovers about 99% of classification performance, demonstrating strong data efficiency.
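The two-stage pipeline can be sketched in a few lines. The fire-rate margin and the tiny binary matrices below are illustrative choices, not values from the release:

```python
import numpy as np

def discover_toxic_features(acts_toxic, acts_clean, margin=0.2):
    """Stage 1: pick features that fire more often on toxic than clean text.

    acts_*: binary activation matrices, shape (n_samples, n_features),
    where entry [i, j] = 1 if feature j fired on sample i. Returns indices
    of features whose toxic fire-rate exceeds the clean fire-rate by at
    least `margin` (an illustrative threshold).
    """
    rate_toxic = acts_toxic.mean(axis=0)
    rate_clean = acts_clean.mean(axis=0)
    return np.where(rate_toxic - rate_clean >= margin)[0]

def or_rule_predict(acts, toxic_features):
    """Stage 2: flag a sample as toxic if ANY discovered feature fires."""
    return acts[:, toxic_features].any(axis=1)

# Toy discovery set: feature 0 is toxicity-linked, feature 1 is noise.
acts_toxic = np.array([[1, 0], [1, 1], [1, 0], [0, 1]])
acts_clean = np.array([[0, 0], [0, 1], [0, 0], [0, 1]])
feats = discover_toxic_features(acts_toxic, acts_clean)
preds = or_rule_predict(np.array([[1, 0], [0, 1]]), feats)
```

Because stage 2 is a fixed rule over already-computed activations, the classifier costs essentially nothing beyond the SAE forward pass itself.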
On the synthesis side, the research team introduces a feature-driven safety data synthesis pipeline: identify safety-relevant SAE features that are missing from existing supervision, generate prompt-completion pairs designed to activate those features, and verify retention in feature space. Under a matched budget, feature-driven synthesis achieves 99.74% coverage of the target safety feature set, compared to the significantly lower coverage achieved by pure sampling or random safety-related synthesis. Adding 4k feature-driven synthetic examples to 4k real safety examples produces a safety accuracy of 77.75, approaching the performance of training on 120k safety-only examples.
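The generate-then-verify loop might look like the sketch below. `generate` and `activate` are hypothetical stand-ins for an LLM generator and an SAE feature extractor; the stub data is invented to make the loop runnable:

```python
def synthesize_until_covered(target_features, generate, activate, budget):
    """Feature-driven synthesis loop: aim generation at still-missing
    features, and keep only examples verified (in feature space) to
    activate at least one of them. Illustrative, not the paper's code.
    """
    kept, covered = [], set()
    for _ in range(budget):
        missing = target_features - covered
        if not missing:
            break                      # full coverage reached early
        example = generate(missing)    # generation targeted at missing features
        fired = activate(example)      # SAE features the example actually hits
        hits = fired & missing
        if hits:                       # verification step: keep only if useful
            kept.append(example)
            covered |= hits
    coverage = len(covered) / len(target_features)
    return kept, coverage

# Deterministic stubs standing in for the generator and SAE.
target = {1, 2, 3}
stream = iter([("ex_a", {1, 2}), ("ex_b", {9}), ("ex_c", {3})])
store = {}

def generate(missing):
    name, fired = next(stream)
    store[name] = fired
    return name

def activate(example):
    return store[example]

kept, cov = synthesize_until_covered(target, generate, activate, budget=10)
```

Here `ex_b` activates no missing feature and is discarded, while the two kept examples drive coverage to 100%.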
4. Post-Training: Supervised Fine-Tuning and Reinforcement Learning
Perhaps the most technically novel contribution is using SAE features as signals during training, not just at inference time.
For supervised fine-tuning, the research team addresses unexpected code-switching, where multilingual LLMs spontaneously produce tokens in an unintended language. Their method, called Sparse Autoencoder-guided Supervised Fine-Tuning (SASFT), first identifies language-specific features via a monolinguality score, then introduces an auxiliary regularization loss that suppresses those feature activations during training on non-target-language data. Across five models spanning three model families (Gemma-2, Llama-3.1, and Qwen3) and three target languages (Chinese, Russian, and Korean), SASFT achieves over 50% reduction in code-switching ratio in the majority of experimental settings, with complete elimination in certain configurations (e.g., Qwen3-1.7B on Korean), while maintaining performance on six multilingual benchmarks.
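One plausible form for such an auxiliary term is a penalty on the mean activation of the suppressed features, added to the usual cross-entropy loss. The weighting and the numbers below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sasft_aux_loss(sae_latents, suppressed_ids, lam=0.1):
    """SASFT-style auxiliary penalty: mean activation of language-specific
    features that should stay silent on non-target-language data.

    sae_latents: SAE activations for a batch, shape (n_tokens, n_features)
    suppressed_ids: feature ids flagged (e.g. via a monolinguality score)
    lam: regularization weight (illustrative hyperparameter).
    The total training loss would be: ce_loss + sasft_aux_loss(...).
    """
    penalty = sae_latents[:, suppressed_ids].mean()
    return lam * penalty

# Toy batch: feature 2 is a Chinese-language feature firing on English data.
latents = np.array([[0.0, 0.3, 2.0],
                    [0.1, 0.0, 1.5]])
aux = sasft_aux_loss(latents, suppressed_ids=[2], lam=0.1)
```

Because the penalty is differentiable in the latents, gradient descent pushes the model to keep those features quiet on the wrong-language data without touching other capabilities directly.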
For reinforcement learning, the research team tackles infinite repetition, a low-frequency but disruptive failure mode in which models loop on repeated content. Standard online RL rarely encounters repetitive rollouts, so it cannot learn a strong corrective signal. Qwen-Scope addresses this by using SAE feature steering to synthetically generate one repetition-biased rollout per training group, which is then incorporated as a rare negative sample in the DAPO RL pipeline. The result: the repetition ratio drops sharply and consistently across Qwen3-1.7B, Qwen3-8B, and Qwen3-30B-A3B, while average benchmark performance remains competitive with vanilla RL.
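To see why one injected rollout matters, consider group-relative advantages of the kind DAPO-style pipelines use. The sketch below uses a simplified normalization and made-up rewards; it is not the exact DAPO objective:

```python
import numpy as np

def group_advantages(rewards):
    """Group-relative advantages: reward minus the group mean, scaled by
    the group std. A simplified GRPO/DAPO-style computation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# A training group of ordinary on-policy rollout rewards...
on_policy = [1.0, 1.0, 0.0, 1.0]
# ...plus one steering-induced repetition-biased rollout, scored as a failure.
rewards = on_policy + [0.0]
adv = group_advantages(rewards)
```

The injected repetitive rollout receives a negative advantage, supplying the corrective gradient that ordinary sampling almost never produces, precisely because the model rarely repeats on its own.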
Check out the Paper, Weights, and Technical details.

