In this tutorial, we work directly with Qwen3.5 models distilled with Claude-style reasoning and set up a Colab pipeline that lets us swap between a 27B GGUF variant and a lightweight 2B 4-bit model with a single flag. We begin by validating GPU availability, then conditionally install either llama.cpp or transformers with bitsandbytes, depending on the chosen path. Both branches are unified through shared generate_fn and stream_fn interfaces, ensuring consistent inference across backends. We also implement a ChatSession class for multi-turn interaction and build utilities to parse reasoning traces, allowing us to explicitly separate reasoning from final outputs during execution.
MODEL_PATH = "2B_HF"

import torch

if not torch.cuda.is_available():
    raise RuntimeError(
        "❌ No GPU! Go to Runtime → Change runtime type → T4 GPU."
    )
gpu_name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"✅ GPU: {gpu_name} — {vram_gb:.1f} GB VRAM")

import subprocess, sys, os, re, time

generate_fn = None
stream_fn = None
We initialize the run by setting the model path flag and checking whether a GPU is available on the system. We retrieve and print the GPU name along with available VRAM to make sure the environment meets the requirements. We also import all required base libraries and define placeholders for the unified generation functions that will be assigned later.
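Instead of hardcoding the flag, we could derive it from the VRAM we just measured. The helper below is our own sketch, not part of the tutorial; the 20 GB threshold is a rough assumption based on the ~16.5 GB Q4_K_M file needing headroom for context and activations.

```python
# Hypothetical helper: pick a MODEL_PATH flag from available VRAM.
# Threshold is an assumption: the Q4_K_M 27B GGUF weighs ~16.5 GB on disk,
# so comfortable GPU offload wants roughly 20 GB; anything smaller falls
# back to the 4-bit 2B model.
def choose_model_path(vram_gb: float) -> str:
    """Return a MODEL_PATH flag based on how much VRAM is available."""
    return "27B_GGUF" if vram_gb >= 20 else "2B_HF"

print(choose_model_path(40.0))  # e.g. an A100 → 27B_GGUF
print(choose_model_path(15.8))  # e.g. a T4  → 2B_HF
```

On a Colab T4 (~16 GB) this keeps us on the 2B path by default.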
if MODEL_PATH == "27B_GGUF":
    print("\n📦 Installing llama-cpp-python with CUDA (takes 3-5 min)…")
    env = os.environ.copy()
    env["CMAKE_ARGS"] = "-DGGML_CUDA=on"
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "-q", "llama-cpp-python", "huggingface_hub"],
        env=env,
    )
    print("✅ Installed.\n")

    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    GGUF_REPO = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"
    GGUF_FILE = "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf"
    print(f"⏳ Downloading {GGUF_FILE} (~16.5 GB)… grab a coffee ☕")
    model_path = hf_hub_download(repo_id=GGUF_REPO, filename=GGUF_FILE)
    print(f"✅ Downloaded: {model_path}\n")

    print("⏳ Loading into llama.cpp (GPU offload)…")
    llm = Llama(
        model_path=model_path,
        n_ctx=8192,
        n_gpu_layers=40,
        n_threads=4,
        verbose=False,
    )
    print("✅ 27B GGUF model loaded!\n")

    def generate_fn(
        prompt, system_prompt="You are a helpful assistant. Think step-by-step.",
        max_new_tokens=2048, temperature=0.6, top_p=0.95, **kwargs
    ):
        output = llm.create_chat_completion(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
            max_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
        )
        return output["choices"][0]["message"]["content"]

    def stream_fn(
        prompt, system_prompt="You are a helpful assistant. Think step-by-step.",
        max_new_tokens=2048, temperature=0.6, top_p=0.95,
    ):
        print("⏳ Streaming output:\n")
        for chunk in llm.create_chat_completion(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
            ],
            max_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            stream=True,
        ):
            delta = chunk["choices"][0].get("delta", {})
            text = delta.get("content", "")
            if text:
                print(text, end="", flush=True)
        print()

    class ChatSession:
        def __init__(self, system_prompt="You are a helpful assistant. Think step-by-step."):
            self.messages = [{"role": "system", "content": system_prompt}]

        def chat(self, user_message, temperature=0.6):
            self.messages.append({"role": "user", "content": user_message})
            output = llm.create_chat_completion(
                messages=self.messages, max_tokens=2048,
                temperature=temperature, top_p=0.95,
            )
            resp = output["choices"][0]["message"]["content"]
            self.messages.append({"role": "assistant", "content": resp})
            return resp
We handle the 27B GGUF path by installing llama.cpp with CUDA support and downloading the Qwen3.5 27B distilled model from Hugging Face. We load the model with GPU offloading and define a standardized generate_fn and stream_fn for inference and streaming outputs. We also implement a ChatSession class to maintain conversation history for multi-turn interactions.
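The streaming loop above relies on llama.cpp emitting OpenAI-style chunks, where each chunk carries a "delta" dict that may or may not contain a piece of content. A minimal sketch of that accumulation pattern, exercised on hand-made chunks standing in for what `llm.create_chat_completion(stream=True)` yields:

```python
# Sketch of the OpenAI-style streaming protocol used by llama-cpp-python:
# each chunk's delta may hold a "content" fragment; concatenating the
# fragments in order reconstructs the full response.
def collect_stream(chunks) -> str:
    """Concatenate the content deltas from a stream of chat-completion chunks."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content", "")
        if text:
            parts.append(text)
    return "".join(parts)

# Fake chunks for illustration (not real model output).
fake = [
    {"choices": [{"delta": {"role": "assistant"}}]},  # first chunk: role only
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
    {"choices": [{"delta": {}}]},                     # final chunk: empty delta
]
print(collect_stream(fake))  # → Hello, world
```

This is why stream_fn guards with `if text:` — the first and last chunks typically carry no content.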
elif MODEL_PATH == "2B_HF":
    print("\n📦 Installing transformers + bitsandbytes…")
    subprocess.check_call([
        sys.executable, "-m", "pip", "install", "-q",
        "transformers @ git+https://github.com/huggingface/transformers.git@main",
        "accelerate", "bitsandbytes", "sentencepiece", "protobuf",
    ])
    print("✅ Installed.\n")

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer

    HF_MODEL_ID = "Jackrong/Qwen3.5-2B-Claude-4.6-Opus-Reasoning-Distilled"
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    print(f"⏳ Loading {HF_MODEL_ID} in 4-bit…")
    tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        HF_MODEL_ID,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    print(f"✅ Model loaded! Memory: {model.get_memory_footprint() / 1e9:.2f} GB\n")

    def generate_fn(
        prompt, system_prompt="You are a helpful assistant. Think step-by-step.",
        max_new_tokens=2048, temperature=0.6, top_p=0.95,
        repetition_penalty=1.05, do_sample=True, **kwargs
    ):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output_ids = model.generate(
                **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
                top_p=top_p, repetition_penalty=repetition_penalty, do_sample=do_sample,
            )
        generated = output_ids[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(generated, skip_special_tokens=True)

    def stream_fn(
        prompt, system_prompt="You are a helpful assistant. Think step-by-step.",
        max_new_tokens=2048, temperature=0.6, top_p=0.95,
    ):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        print("⏳ Streaming output:\n")
        with torch.no_grad():
            model.generate(
                **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
                top_p=top_p, do_sample=True, streamer=streamer,
            )

    class ChatSession:
        def __init__(self, system_prompt="You are a helpful assistant. Think step-by-step."):
            self.messages = [{"role": "system", "content": system_prompt}]

        def chat(self, user_message, temperature=0.6):
            self.messages.append({"role": "user", "content": user_message})
            text = tokenizer.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
            inputs = tokenizer(text, return_tensors="pt").to(model.device)
            with torch.no_grad():
                output_ids = model.generate(
                    **inputs, max_new_tokens=2048, temperature=temperature, top_p=0.95, do_sample=True,
                )
            generated = output_ids[0][inputs["input_ids"].shape[1]:]
            resp = tokenizer.decode(generated, skip_special_tokens=True)
            self.messages.append({"role": "assistant", "content": resp})
            return resp

else:
    raise ValueError("MODEL_PATH must be '27B_GGUF' or '2B_HF'")
We implement the lightweight 2B path using transformers with 4-bit quantization via bitsandbytes. We load the Qwen3.5 2B distilled model efficiently onto the GPU and configure generation parameters for controlled sampling. We again define unified generation, streaming, and chat session logic so that both model paths behave identically during execution.
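To see why the NF4 config matters on a T4, a back-of-the-envelope footprint estimate (our own arithmetic, not from the tutorial): 4-bit weights cost about half a byte per parameter, plus a small overhead for quantization constants that double quantization keeps to a few percent.

```python
# Rough NF4 memory estimate: params * 0.5 bytes plus a small assumed
# overhead for quantization constants (the 5% figure is an assumption).
def estimate_nf4_gb(n_params_billion: float, overhead: float = 0.05) -> float:
    """Approximate 4-bit weight footprint in GB for a given parameter count."""
    return n_params_billion * 0.5 * (1 + overhead)

print(f"2B  in NF4: ~{estimate_nf4_gb(2):.2f} GB")
print(f"27B in NF4: ~{estimate_nf4_gb(27):.2f} GB")
```

The ~1 GB figure for the 2B model is consistent with what `model.get_memory_footprint()` reports after loading, and explains why this path leaves plenty of headroom for the KV cache.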
def parse_thinking(response: str) -> tuple:
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if m:
        return m.group(1).strip(), response[m.end():].strip()
    return "", response.strip()

def display_response(response: str):
    thinking, answer = parse_thinking(response)
    if thinking:
        print("🧠 THINKING:")
        print("-" * 60)
        print(thinking[:1500] + ("\n… [truncated]" if len(thinking) > 1500 else ""))
        print("-" * 60)
    print("\n💬 ANSWER:")
    print(answer)

print("✅ All helpers ready. Running tests…\n")
We define helper functions to extract reasoning traces enclosed inside <think> tags and separate them from final answers. We create a display utility that formats and prints both the thinking process and the response in a structured way. This lets us inspect how the Qwen-based model reasons internally during generation.
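The split can be checked without a model at all. Here is a self-contained copy of the helper applied to a made-up response string (the sample text is ours, for illustration only):

```python
import re

# Same <think>-tag splitter as above, exercised on a synthetic response.
def parse_thinking(response: str) -> tuple:
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if m:
        return m.group(1).strip(), response[m.end():].strip()
    return "", response.strip()

sample = "<think>3 - 1.5 = 1.5, then 1.5 + 5 = 6.5</think>You end up with 6.5 apples."
thinking, answer = parse_thinking(sample)
print(thinking)  # → 3 - 1.5 = 1.5, then 1.5 + 5 = 6.5
print(answer)    # → You end up with 6.5 apples.
```

Note the fallback: if no <think> block is present (for example when we suppress it via the system prompt), the whole response is returned as the answer with an empty trace.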
print("=" * 70)
print("📝 TEST 1: Basic reasoning")
print("=" * 70)
response = generate_fn(
    "If I have 3 apples and give away half, then buy 5 more, how many do I have? "
    "Explain your reasoning."
)
display_response(response)

print("\n" + "=" * 70)
print("📝 TEST 2: Streaming output")
print("=" * 70)
stream_fn(
    "Explain the difference between concurrency and parallelism. "
    "Give a real-world analogy for each."
)

print("\n" + "=" * 70)
print("📝 TEST 3: Thinking ON vs OFF")
print("=" * 70)
question = "What is the capital of France?"
print("\n— Thinking ON (default) —")
resp = generate_fn(question)
display_response(resp)
print("\n— Thinking OFF (concise) —")
resp = generate_fn(
    question,
    system_prompt="Answer directly and concisely. Do not use <think> tags.",
    max_new_tokens=256,
)
display_response(resp)

print("\n" + "=" * 70)
print("📝 TEST 4: Bat & ball trick question")
print("=" * 70)
response = generate_fn(
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
    "How much does the ball cost? Show full reasoning and verify.",
    system_prompt="You are a precise mathematical reasoner. Set up equations and verify.",
    temperature=0.3,
)
display_response(response)

print("\n" + "=" * 70)
print("📝 TEST 5: Train meeting problem")
print("=" * 70)
response = generate_fn(
    "A train leaves Station A at 9:00 AM at 60 mph toward Station B. "
    "Another leaves Station B at 10:00 AM at 80 mph toward Station A. "
    "Stations are 280 miles apart. When and where do they meet?",
    temperature=0.3,
)
display_response(response)
print("\n" + "=" * 70)
print("📝 TEST 6: Logic puzzle (5 houses)")
print("=" * 70)
response = generate_fn(
    "Five houses in a row are painted different colors. "
    "The red house is left of the blue house. "
    "The green house is in the middle. "
    "The yellow house is not next to the blue house. "
    "The white house is at one end. "
    "What is the order from left to right?",
    temperature=0.3,
    max_new_tokens=3000,
)
display_response(response)

print("\n" + "=" * 70)
print("📝 TEST 7: Code generation — longest palindromic substring")
print("=" * 70)
response = generate_fn(
    "Write a Python function to find the longest palindromic substring "
    "using Manacher's algorithm. Include docstring, type hints, and tests.",
    system_prompt="You are an expert Python programmer. Think through the algorithm carefully.",
    max_new_tokens=3000,
    temperature=0.3,
)
display_response(response)
print("\n" + "=" * 70)
print("📝 TEST 8: Multi-turn conversation (physics tutor)")
print("=" * 70)
session = ChatSession(
    system_prompt="You are a knowledgeable physics tutor. Explain clearly with examples."
)
turns = [
    "What is the Heisenberg uncertainty principle?",
    "Can you give me a concrete example with actual numbers?",
    "How does this relate to quantum tunneling?",
]
for i, q in enumerate(turns, 1):
    print(f"\n{'─'*60}")
    print(f"👤 Turn {i}: {q}")
    print(f"{'─'*60}")
    resp = session.chat(q, temperature=0.5)
    _, answer = parse_thinking(resp)
    print(f"🤖 {answer[:1000]}{'…' if len(answer) > 1000 else ''}")

print("\n" + "=" * 70)
print("📝 TEST 9: Temperature comparison — creative writing")
print("=" * 70)
creative_prompt = "Write a one-paragraph opening for a sci-fi story about AI consciousness."
configs = [
    {"label": "Low temp (0.1)", "temperature": 0.1, "top_p": 0.9},
    {"label": "Med temp (0.6)", "temperature": 0.6, "top_p": 0.95},
    {"label": "High temp (1.0)", "temperature": 1.0, "top_p": 0.98},
]
for cfg in configs:
    print(f"\n🎛️ {cfg['label']}")
    print("-" * 60)
    start = time.time()
    resp = generate_fn(
        creative_prompt,
        system_prompt="You are a creative fiction writer.",
        max_new_tokens=512,
        temperature=cfg["temperature"],
        top_p=cfg["top_p"],
    )
    elapsed = time.time() - start
    _, answer = parse_thinking(resp)
    print(answer[:600])
    print(f"⏱️ {elapsed:.1f}s")
print("\n" + "=" * 70)
print("📝 TEST 10: Speed benchmark")
print("=" * 70)
start = time.time()
resp = generate_fn(
    "Explain how a neural network learns, step-by-step, for a beginner.",
    system_prompt="You are a patient, clear teacher.",
    max_new_tokens=1024,
)
elapsed = time.time() - start
approx_tokens = int(len(resp.split()) * 1.3)
print(f"~{approx_tokens} tokens in {elapsed:.1f}s")
print(f"~{approx_tokens / elapsed:.1f} tokens/sec")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

import gc
for name in ["model", "llm"]:
    if name in globals():
        del globals()[name]
gc.collect()
torch.cuda.empty_cache()
print(f"\n✅ Memory freed. VRAM: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

print("\n" + "=" * 70)
print("🎉 Tutorial complete!")
print("=" * 70)
We run a comprehensive test suite that evaluates the model across reasoning, streaming, logic puzzles, code generation, and multi-turn conversation. We compare outputs under different temperature settings and measure performance in terms of speed and token throughput. Finally, we clean up memory and free GPU resources, ensuring the notebook remains reusable for further experiments.
In conclusion, we have a compact but versatile setup for running Qwen3.5-based reasoning models enhanced with Claude-style distillation across different hardware constraints. The script abstracts backend differences while exposing consistent generation, streaming, and conversational interfaces, making it easy to experiment with reasoning behavior. Through the test suite, we probe how the model handles structured reasoning, edge-case questions, and longer multi-step tasks, while also measuring speed and memory usage. What we end up with is not just a demo, but a reusable scaffold for evaluating and extending Qwen-based reasoning systems in Colab without changing the core code.
