Modern large language models are no longer trained solely on raw web text. Increasingly, companies are using powerful “teacher” models to help train smaller or more efficient “student” models. This process, broadly known as LLM distillation or model-to-model training, has become a key technique for building high-performing models at lower computational cost. Meta used its massive Llama 4 Behemoth model to help train Llama 4 Scout and Maverick, while Google leveraged Gemini models during the development of Gemma 2 and Gemma 3. Similarly, DeepSeek distilled reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama-based models.
The core idea is simple: instead of learning only from human-written text, a student model can also learn from the outputs, probabilities, reasoning traces, or behaviors of another LLM. This allows smaller models to inherit capabilities such as reasoning, instruction following, and structured generation from much larger systems. Distillation can happen during pre-training, where teacher and student models are trained together, or during post-training, where a fully trained teacher transfers knowledge to a separate student model.
In this article, we’ll explore three major approaches to training one LLM using another: soft-label distillation, where the student learns from the teacher’s probability distributions; hard-label distillation, where the student imitates the teacher’s generated outputs; and co-distillation, where multiple models learn collaboratively by sharing predictions and behaviors during training.
Soft-Label Distillation
Soft-label distillation is a training technique in which a smaller student LLM learns by imitating the output probability distribution of a larger teacher LLM. Instead of training only on the correct next token, the student is trained to match the teacher’s softmax probabilities across the entire vocabulary. For example, if the teacher predicts the next token with probabilities like “cat” = 70%, “dog” = 20%, and “animal” = 10%, the student learns not just the final answer but also the relationships and uncertainty between different tokens. This richer signal is often called the teacher’s “dark knowledge” because it contains hidden information about reasoning patterns and semantic understanding.
The biggest advantage of soft-label distillation is that it allows smaller models to inherit capabilities from much larger models while remaining faster and cheaper to deploy. Because the student learns from the teacher’s full probability distribution, training becomes more stable and informative compared to learning from hard one-token targets alone. However, this method also comes with practical challenges. To generate soft labels, you need access to the teacher model’s logits or weights, which is often not possible with closed-source models. In addition, storing probability distributions over vocabularies containing 100k+ tokens for every training position becomes extremely memory-intensive at LLM scale, making pure soft-label distillation expensive for trillion-token datasets.
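Below is a minimal PyTorch sketch of what the soft-label objective typically looks like. It assumes you already have teacher and student logits for the same batch of shape `(batch, vocab_size)` and ground-truth token ids of shape `(batch,)`; the `temperature` and `alpha` hyperparameters are illustrative choices, not values used by any of the labs mentioned above.

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, targets,
                                 temperature=2.0, alpha=0.5):
    """Blend KL divergence against the teacher's soft labels with
    standard cross-entropy against the ground-truth tokens."""
    # Soften both distributions with a temperature so the student also
    # sees the teacher's relative confidence in low-probability tokens.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # KL divergence term, scaled by T^2 so gradient magnitudes stay
    # comparable across temperatures (the standard Hinton et al. correction).
    kd_loss = F.kl_div(student_log_probs, teacher_probs,
                       reduction="batchmean") * (temperature ** 2)

    # Ordinary next-token cross-entropy against the hard targets.
    ce_loss = F.cross_entropy(student_logits, targets)

    return alpha * kd_loss + (1 - alpha) * ce_loss
```

A higher temperature flattens both distributions so the student pays more attention to the teacher’s low-probability tokens, while `alpha` balances imitating the teacher against learning directly from the ground-truth labels.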
Hard-Label Distillation
Hard-label distillation is a simpler approach in which the student LLM learns only from the teacher model’s final predicted output instead of its full probability distribution. In this setup, a pre-trained teacher model generates the most likely next token or response, and the student model is trained with standard supervised learning to reproduce that output. The teacher essentially acts as a high-quality annotator that creates synthetic training data for the student. DeepSeek used this approach to distill reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama 3.1 models.
Unlike soft-label distillation, the student never sees the teacher’s internal confidence scores or token relationships; it only learns the final answer. This makes hard-label distillation computationally much cheaper and easier to implement, since there is no need to store massive probability distributions for every token. It is also especially useful when working with proprietary “black-box” models such as the GPT-4 APIs, where developers only have access to generated text and not the underlying logits. While hard labels carry less information than soft labels, they remain highly effective for instruction tuning, reasoning datasets, synthetic data generation, and domain-specific fine-tuning tasks.
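The workflow boils down to two stages: the teacher generates responses, and the student is fine-tuned on them as ordinary supervised data. The sketch below uses the Hugging Face `transformers` generation API; the model names, prompt list, and fine-tuning step are placeholders standing in for a real teacher checkpoint, dataset, and SFT pipeline.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ids, used only for illustration.
teacher_name = "your-org/large-teacher-model"
student_name = "your-org/small-student-model"

tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, device_map="auto")

prompts = ["Explain why the sky appears blue.", "Solve step by step: 17 * 24 = ?"]

# 1) The teacher acts as an annotator: generate responses as synthetic data.
synthetic_pairs = []
for p in prompts:
    inputs = tok(p, return_tensors="pt").to(teacher.device)
    out = teacher.generate(**inputs, max_new_tokens=256)
    # Keep only the newly generated tokens, not the prompt.
    response = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
    synthetic_pairs.append({"prompt": p, "response": response})

# 2) The student is then fine-tuned on these (prompt, response) pairs with
#    ordinary next-token cross-entropy, e.g. via a standard SFT trainer.
```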
Co-distillation
Co-distillation is a training approach in which the teacher and student models are trained together instead of using a fixed, pre-trained teacher. In this setup, the teacher LLM and student LLM process the same training data simultaneously and each produce their own softmax probability distributions. The teacher is trained normally on the ground-truth hard labels, while the student learns by matching the teacher’s soft labels alongside the actual correct answers. Meta used a form of this approach while training Llama 4 Scout and Maverick alongside the larger Llama 4 Behemoth model.
One challenge with co-distillation is that the teacher model is not fully trained during the early stages, so its predictions may initially be noisy or inaccurate. To overcome this, the student is typically trained on a combination of soft-label distillation loss and standard hard-label cross-entropy loss. This creates a more stable learning signal while still allowing knowledge transfer between models. Unlike traditional one-way distillation, co-distillation lets both models improve together during training, often leading to better performance, stronger reasoning transfer, and smaller performance gaps between the teacher and student models.
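A minimal sketch of one co-distillation training step is shown below, assuming `teacher` and `student` are plain PyTorch modules mapping token ids to next-token logits of shape `(batch, seq, vocab)`. The loss weighting mirrors the soft-label example above and is illustrative rather than the exact recipe used for Llama 4.

```python
import torch
import torch.nn.functional as F

def co_distillation_step(teacher, student, batch, t_opt, s_opt,
                         temperature=2.0, alpha=0.5):
    """One joint update: the teacher trains on ground-truth labels, the
    student trains on a mix of the teacher's soft labels and the same
    ground-truth labels."""
    input_ids, targets = batch  # token ids and shifted next-token targets

    # Teacher update: standard cross-entropy on the hard labels.
    t_logits = teacher(input_ids)
    t_loss = F.cross_entropy(t_logits.view(-1, t_logits.size(-1)), targets.view(-1))
    t_opt.zero_grad()
    t_loss.backward()
    t_opt.step()

    # Student update: KL against the (detached) teacher distribution plus
    # cross-entropy on the hard labels, which stabilises early training
    # while the teacher's predictions are still noisy.
    s_logits = student(input_ids)
    kd = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                  F.softmax(t_logits.detach() / temperature, dim=-1),
                  reduction="batchmean") * (temperature ** 2)
    ce = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)), targets.view(-1))
    s_loss = alpha * kd + (1 - alpha) * ce
    s_opt.zero_grad()
    s_loss.backward()
    s_opt.step()

    return t_loss.item(), s_loss.item()
```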
Comparing the Three Distillation Techniques
Soft-label distillation transfers the richest form of knowledge because the student learns from the teacher’s full probability distribution instead of only the final answer. This helps smaller models capture reasoning patterns, uncertainty, and relationships between tokens, often leading to stronger overall performance. However, it is computationally expensive, requires access to the teacher’s logits or weights, and becomes difficult to scale because storing probability distributions over massive vocabularies consumes enormous memory.
Hard-label distillation is simpler and more practical. The student learns only from the teacher’s final generated outputs, making it much cheaper and easier to implement. It works especially well with proprietary black-box models like the GPT-4 APIs, where internal probabilities are unavailable. While this approach loses some of the deeper “dark knowledge” present in soft labels, it remains highly effective for instruction tuning, synthetic data generation, and task-specific fine-tuning.
Co-distillation takes a collaborative approach in which teacher and student models learn together during training. The teacher improves while simultaneously guiding the student, allowing both models to benefit from shared learning signals. This can reduce the performance gap seen in traditional one-way distillation, but it also makes training more complex since the teacher’s predictions are initially unstable. In practice, soft-label distillation is preferred for maximum knowledge transfer, hard-label distillation for scalability and practicality, and co-distillation for large-scale joint training setups.

