
In this tutorial, we work directly with Qwen3.5 models distilled with Claude-style reasoning and set up a Colab pipeline that lets us switch between a 27B GGUF variant and a lightweight 2B 4-bit version with a single flag. We start by validating GPU availability, then conditionally install either llama.cpp or transformers with bitsandbytes, depending on the chosen path. Both branches are unified through shared generate_fn and stream_fn interfaces, ensuring consistent inference across backends. We also implement a ChatSession class for multi-turn interaction and build utilities to parse <think> traces, allowing us to explicitly separate reasoning from final outputs during execution.

MODEL_PATH = "2B_HF"


import torch


if not torch.cuda.is_available():
   raise RuntimeError(
       "❌ No GPU! Go to Runtime → Change runtime type → T4 GPU."
   )


gpu_name = torch.cuda.get_device_name(0)
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
print(f"✅ GPU: {gpu_name} — {vram_gb:.1f} GB VRAM")


import subprocess, sys, os, re, time


generate_fn = None
stream_fn = None

We initialize execution by setting the model-path flag and checking whether a GPU is available on the system. We retrieve and print the GPU name along with available VRAM to confirm the environment meets the requirements. We also import all required base libraries and define placeholders for the unified generation functions that will be assigned later.
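Before either backend is installed, the single-flag dispatch pattern itself can be sketched in isolation — a minimal illustration with dummy stand-ins for the llama.cpp and transformers backends (the helper names here are hypothetical, not part of the tutorial's pipeline):

```python
# Minimal sketch of the single-flag backend dispatch used below.
# The two "backends" are dummies standing in for llama.cpp / transformers.

def _gguf_generate(prompt: str) -> str:
    return f"[gguf] {prompt}"

def _hf_generate(prompt: str) -> str:
    return f"[hf] {prompt}"

def build_generate_fn(model_path: str):
    # One flag selects the backend; callers only ever see generate_fn.
    if model_path == "27B_GGUF":
        return _gguf_generate
    elif model_path == "2B_HF":
        return _hf_generate
    raise ValueError("model_path must be '27B_GGUF' or '2B_HF'")

generate = build_generate_fn("2B_HF")
print(generate("hello"))  # → [hf] hello
```

Because callers only ever touch the returned function, the rest of the notebook stays identical no matter which branch ran.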

if MODEL_PATH == "27B_GGUF":
   print("\n📦 Installing llama-cpp-python with CUDA (takes 3-5 min)...")
   env = os.environ.copy()
   env["CMAKE_ARGS"] = "-DGGML_CUDA=on"
   subprocess.check_call(
       [sys.executable, "-m", "pip", "install", "-q", "llama-cpp-python", "huggingface_hub"],
       env=env,
   )
   print("✅ Installed.\n")


   from huggingface_hub import hf_hub_download
   from llama_cpp import Llama


   GGUF_REPO = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"
   GGUF_FILE = "Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf"


   print(f"⏳ Downloading {GGUF_FILE} (~16.5 GB)... grab a coffee ☕")
   model_path = hf_hub_download(repo_id=GGUF_REPO, filename=GGUF_FILE)
   print(f"✅ Downloaded: {model_path}\n")


   print("⏳ Loading into llama.cpp (GPU offload)...")
   llm = Llama(
       model_path=model_path,
       n_ctx=8192,
       n_gpu_layers=40,
       n_threads=4,
       verbose=False,
   )
   print("✅ 27B GGUF model loaded!\n")


   def generate_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95, **kwargs
   ):
       output = llm.create_chat_completion(
           messages=[
               {"role": "system", "content": system_prompt},
               {"role": "user", "content": prompt},
           ],
           max_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p,
       )
       return output["choices"][0]["message"]["content"]


   def stream_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
   ):
       print("⏳ Streaming output:\n")
       for chunk in llm.create_chat_completion(
           messages=[
               {"role": "system", "content": system_prompt},
               {"role": "user", "content": prompt},
           ],
           max_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p,
           stream=True,
       ):
           delta = chunk["choices"][0].get("delta", {})
           text = delta.get("content", "")
           if text:
               print(text, end="", flush=True)
       print()


   class ChatSession:
       def __init__(self, system_prompt="You are a helpful assistant. Think step by step."):
           self.messages = [{"role": "system", "content": system_prompt}]
       def chat(self, user_message, temperature=0.6):
           self.messages.append({"role": "user", "content": user_message})
           output = llm.create_chat_completion(
               messages=self.messages, max_tokens=2048,
               temperature=temperature, top_p=0.95,
           )
           resp = output["choices"][0]["message"]["content"]
           self.messages.append({"role": "assistant", "content": resp})
           return resp

We handle the 27B GGUF path by installing llama.cpp with CUDA support and downloading the Qwen3.5 27B distilled model from Hugging Face. We load the model with GPU offloading and define a standardized generate_fn and stream_fn for inference and streaming outputs. We also implement a ChatSession class to maintain conversation history for multi-turn interactions.
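The streaming loop in stream_fn assumes llama-cpp-python's OpenAI-style chunk format, where each chunk carries a delta dict that may or may not include a content key. The extraction pattern can be exercised against hand-built mock chunks (the chunk contents below are illustrative, not captured output):

```python
# Mock chunks mimicking the OpenAI-style streaming format that
# llama-cpp-python emits; the payloads here are illustrative.
mock_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},   # first chunk: role only
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
    {"choices": [{"delta": {}}]},                      # final chunk: empty delta
]

def collect_stream(chunks) -> str:
    # Same extraction pattern as stream_fn: tolerate missing keys.
    pieces = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        text = delta.get("content", "")
        if text:
            pieces.append(text)
    return "".join(pieces)

print(collect_stream(mock_chunks))  # → Hello, world
```

Using .get with defaults is what lets the loop survive the role-only first chunk and the empty final chunk without special-casing them.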

elif MODEL_PATH == "2B_HF":
   print("\n📦 Installing transformers + bitsandbytes...")
   subprocess.check_call([
       sys.executable, "-m", "pip", "install", "-q",
       "transformers @ git+https://github.com/huggingface/transformers.git@main",
       "accelerate", "bitsandbytes", "sentencepiece", "protobuf",
   ])
   print("✅ Installed.\n")


   from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer


   HF_MODEL_ID = "Jackrong/Qwen3.5-2B-Claude-4.6-Opus-Reasoning-Distilled"


   bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_compute_dtype=torch.bfloat16,
       bnb_4bit_use_double_quant=True,
   )


   print(f"⏳ Loading {HF_MODEL_ID} in 4-bit...")
   tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_ID, trust_remote_code=True)
   model = AutoModelForCausalLM.from_pretrained(
       HF_MODEL_ID,
       quantization_config=bnb_config,
       device_map="auto",
       trust_remote_code=True,
       torch_dtype=torch.bfloat16,
   )
   print(f"✅ Model loaded! Memory: {model.get_memory_footprint() / 1e9:.2f} GB\n")


   def generate_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
       repetition_penalty=1.05, do_sample=True, **kwargs
   ):
       messages = [
           {"role": "system", "content": system_prompt},
           {"role": "user", "content": prompt},
       ]
       text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
       inputs = tokenizer(text, return_tensors="pt").to(model.device)
       with torch.no_grad():
           output_ids = model.generate(
               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
               top_p=top_p, repetition_penalty=repetition_penalty, do_sample=do_sample,
           )
       generated = output_ids[0][inputs["input_ids"].shape[1]:]
       return tokenizer.decode(generated, skip_special_tokens=True)


   def stream_fn(
       prompt, system_prompt="You are a helpful assistant. Think step by step.",
       max_new_tokens=2048, temperature=0.6, top_p=0.95,
   ):
       messages = [
           {"role": "system", "content": system_prompt},
           {"role": "user", "content": prompt},
       ]
       text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
       inputs = tokenizer(text, return_tensors="pt").to(model.device)
       streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
       print("⏳ Streaming output:\n")
       with torch.no_grad():
           model.generate(
               **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
               top_p=top_p, do_sample=True, streamer=streamer,
           )


   class ChatSession:
       def __init__(self, system_prompt="You are a helpful assistant. Think step by step."):
           self.messages = [{"role": "system", "content": system_prompt}]
       def chat(self, user_message, temperature=0.6):
           self.messages.append({"role": "user", "content": user_message})
           text = tokenizer.apply_chat_template(self.messages, tokenize=False, add_generation_prompt=True)
           inputs = tokenizer(text, return_tensors="pt").to(model.device)
           with torch.no_grad():
               output_ids = model.generate(
                   **inputs, max_new_tokens=2048, temperature=temperature, top_p=0.95, do_sample=True,
               )
           generated = output_ids[0][inputs["input_ids"].shape[1]:]
           resp = tokenizer.decode(generated, skip_special_tokens=True)
           self.messages.append({"role": "assistant", "content": resp})
           return resp
else:
   raise ValueError("MODEL_PATH must be '27B_GGUF' or '2B_HF'")

We implement the lightweight 2B path using transformers with 4-bit quantization through bitsandbytes. We load the Qwen3.5 2B distilled model efficiently onto the GPU and configure generation parameters for controlled sampling. We again define unified generation, streaming, and chat-session logic so that both model paths behave identically during execution.
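As a rough sanity check on the 4-bit footprint, NF4 stores each weight in about half a byte, with some overhead for quantization constants and for layers kept in higher precision; a back-of-the-envelope estimate (the overhead ratio here is an assumption, not a measured value):

```python
def approx_nf4_footprint_gb(n_params: float, overhead_ratio: float = 0.1) -> float:
    """Rough NF4 memory estimate: ~0.5 bytes per weight, plus overhead
    for quantization constants and non-quantized layers (assumed ~10%)."""
    return n_params * 0.5 * (1 + overhead_ratio) / 1e9

# A ~2B-parameter model should land in the low single-digit GB range,
# which is what makes it comfortable on a T4.
print(f"~{approx_nf4_footprint_gb(2e9):.1f} GB")  # → ~1.1 GB
```

The figure printed by model.get_memory_footprint() above will typically be somewhat higher, since embeddings and other modules may stay in bfloat16.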

def parse_thinking(response: str) -> tuple:
   m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
   if m:
       return m.group(1).strip(), response[m.end():].strip()
   return "", response.strip()




def display_response(response: str):
   thinking, answer = parse_thinking(response)
   if thinking:
       print("🧠 THINKING:")
       print("-" * 60)
       print(thinking[:1500] + ("\n... [truncated]" if len(thinking) > 1500 else ""))
       print("-" * 60)
   print("\n💬 ANSWER:")
   print(answer)




print("✅ All helpers ready. Running tests...\n")

We define helper functions to extract reasoning traces enclosed within <think> tags and separate them from final answers. We create a display utility that formats and prints both the thinking process and the response in a structured way. This lets us inspect how the Qwen-based model reasons internally during generation.
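As a quick standalone check, the tag-splitting pattern used by parse_thinking can be exercised on a synthetic response (the demo string is made up for illustration):

```python
import re

def split_think(response: str) -> tuple:
    # Same pattern as parse_thinking: pull out the <think>...</think> block.
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if m:
        return m.group(1).strip(), response[m.end():].strip()
    return "", response.strip()

demo = "<think>3 - 1.5 = 1.5, then 1.5 + 5 = 6.5</think>You end up with 6.5 apples."
reasoning, answer = split_think(demo)
print(reasoning)  # → 3 - 1.5 = 1.5, then 1.5 + 5 = 6.5
print(answer)     # → You end up with 6.5 apples.
```

Responses without tags fall through cleanly: the whole string is returned as the answer with an empty reasoning trace, so downstream display code needs no special case.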

print("=" * 70)
print("📝 TEST 1: Basic reasoning")
print("=" * 70)


response = generate_fn(
   "If I have 3 apples and give away half, then buy 5 more, how many do I have? "
   "Explain your reasoning."
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 2: Streaming output")
print("=" * 70)


stream_fn(
   "Explain the difference between concurrency and parallelism. "
   "Give a real-world analogy for each."
)


print("\n" + "=" * 70)
print("📝 TEST 3: Thinking ON vs OFF")
print("=" * 70)


question = "What is the capital of France?"


print("\n--- Thinking ON (default) ---")
resp = generate_fn(question)
display_response(resp)


print("\n--- Thinking OFF (concise) ---")
resp = generate_fn(
   question,
   system_prompt="Answer directly and concisely. Do not use <think> tags.",
   max_new_tokens=256,
)
display_response(resp)


print("\n" + "=" * 70)
print("📝 TEST 4: Bat & ball trick question")
print("=" * 70)


response = generate_fn(
   "A bat and a ball cost $1.10 in total. "
   "The bat costs $1.00 more than the ball. "
   "How much does the ball cost? Show full reasoning and verify.",
   system_prompt="You are a precise mathematical reasoner. Set up equations and verify.",
   temperature=0.3,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 5: Train meeting problem")
print("=" * 70)


response = generate_fn(
   "A train leaves Station A at 9:00 AM at 60 mph toward Station B. "
   "Another leaves Station B at 10:00 AM at 80 mph toward Station A. "
   "The stations are 280 miles apart. When and where do they meet?",
   temperature=0.3,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 6: Logic puzzle (5 houses)")
print("=" * 70)


response = generate_fn(
   "Five houses in a row are painted different colors. "
   "The red house is left of the blue house. "
   "The green house is in the middle. "
   "The yellow house is not next to the blue house. "
   "The white house is at one end. "
   "What is the order from left to right?",
   temperature=0.3,
   max_new_tokens=3000,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 7: Code generation — longest palindromic substring")
print("=" * 70)


response = generate_fn(
   "Write a Python function to find the longest palindromic substring "
   "using Manacher's algorithm. Include a docstring, type hints, and tests.",
   system_prompt="You are an expert Python programmer. Think through the algorithm carefully.",
   max_new_tokens=3000,
   temperature=0.3,
)
display_response(response)


print("\n" + "=" * 70)
print("📝 TEST 8: Multi-turn conversation (physics tutor)")
print("=" * 70)


session = ChatSession(
   system_prompt="You are a knowledgeable physics tutor. Explain clearly with examples."
)


turns = [
   "What is the Heisenberg uncertainty principle?",
   "Can you give me a concrete example with actual numbers?",
   "How does this relate to quantum tunneling?",
]


for i, q in enumerate(turns, 1):
   print(f"\n{'─'*60}")
   print(f"👤 Turn {i}: {q}")
   print(f"{'─'*60}")
   resp = session.chat(q, temperature=0.5)
   _, answer = parse_thinking(resp)
   print(f"🤖 {answer[:1000]}{'...' if len(answer) > 1000 else ''}")


print("\n" + "=" * 70)
print("📝 TEST 9: Temperature comparison — creative writing")
print("=" * 70)


creative_prompt = "Write a one-paragraph opening for a sci-fi story about AI consciousness."


configs = [
   {"label": "Low temp (0.1)",  "temperature": 0.1, "top_p": 0.9},
   {"label": "Med temp (0.6)",  "temperature": 0.6, "top_p": 0.95},
   {"label": "High temp (1.0)", "temperature": 1.0, "top_p": 0.98},
]


for cfg in configs:
   print(f"\n🎛️  {cfg['label']}")
   print("-" * 60)
   start = time.time()
   resp = generate_fn(
       creative_prompt,
       system_prompt="You are a creative fiction writer.",
       max_new_tokens=512,
       temperature=cfg["temperature"],
       top_p=cfg["top_p"],
   )
   elapsed = time.time() - start
   _, answer = parse_thinking(resp)
   print(answer[:600])
   print(f"⏱️  {elapsed:.1f}s")


print("\n" + "=" * 70)
print("📝 TEST 10: Speed benchmark")
print("=" * 70)


start = time.time()
resp = generate_fn(
   "Explain how a neural network learns, step by step, for a beginner.",
   system_prompt="You are a patient, clear teacher.",
   max_new_tokens=1024,
)
elapsed = time.time() - start


approx_tokens = int(len(resp.split()) * 1.3)
print(f"~{approx_tokens} tokens in {elapsed:.1f}s")
print(f"~{approx_tokens / elapsed:.1f} tokens/sec")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")


import gc


for name in ["model", "llm"]:
   if name in globals():
       del globals()[name]
gc.collect()
torch.cuda.empty_cache()


print(f"\n✅ Memory freed. VRAM: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print("\n" + "=" * 70)
print("🎉 Tutorial complete!")
print("=" * 70)

We run a comprehensive test suite that evaluates the model across reasoning, streaming, logic puzzles, code generation, and multi-turn conversations. We compare outputs under different temperature settings and measure performance in terms of speed and token throughput. Finally, we clean up memory and free GPU resources, ensuring the notebook remains reusable for further experiments.

In conclusion, we have a compact but versatile setup for running Qwen3.5-based reasoning models enhanced with Claude-style distillation under different hardware constraints. The script abstracts backend differences while exposing consistent generation, streaming, and conversational interfaces, making it easy to experiment with reasoning behavior. Through the test suite, we probe how the model handles structured reasoning, edge-case questions, and longer multi-step tasks, while also measuring speed and memory usage. What we end up with is not just a demo, but a reusable scaffold for evaluating and extending Qwen-based reasoning systems in Colab without changing the core code.



