In this tutorial, we work with Microsoft's OpenMementos dataset and explore how reasoning traces are structured as blocks and mementos in a practical, Colab-ready workflow. We stream the dataset efficiently, parse its special-token format, examine how reasoning and summaries are organized, and measure the compression provided by the memento representation across different domains. As we move through the analysis, we also visualize dataset patterns, align the streamed format with the richer full subset, simulate inference-time compression, and prepare the data for supervised fine-tuning. In this way, we build both an intuitive and a technical understanding of how OpenMementos captures long-form reasoning while preserving compact summaries that can support efficient training and inference.
!pip install -q -U datasets transformers matplotlib pandas

import re, itertools, textwrap
from collections import Counter
from typing import Dict
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset

DATASET = "microsoft/OpenMementos"

# Stream the train split so we can inspect examples without a full download
ds_stream = load_dataset(DATASET, split="train", streaming=True)
first_row = next(iter(ds_stream))
print("Columns :", list(first_row.keys()))
print("Domain :", first_row["domain"], "| Source:", first_row["source"])
print("Problem head:", first_row["problem"][:160].replace("\n", " "), "...")
We install the required libraries and import the core tools needed for dataset streaming, parsing, analysis, and visualization. We then connect to the Microsoft OpenMementos dataset in streaming mode to inspect it without downloading the entire dataset locally. By reading the first example, we begin to understand the dataset schema, the problem format, and the domain and source metadata attached to each reasoning trace.
BLOCK_RE = re.compile(r"<\|block_start\|>(.*?)<\|block_end\|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<\|summary_start\|>(.*?)<\|summary_end\|>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_memento(response: str) -> Dict:
    """Split a response into reasoning blocks, memento summaries, and the final answer."""
    blocks = [m.strip() for m in BLOCK_RE.findall(response)]
    summaries = [m.strip() for m in SUMMARY_RE.findall(response)]
    think_m = THINK_RE.search(response)
    final_ans = response.split("</think>")[-1].strip() if "</think>" in response else ""
    return {"blocks": blocks, "summaries": summaries,
            "reasoning": (think_m.group(1) if think_m else ""),
            "final_answer": final_ans}

parsed = parse_memento(first_row["response"])
print(f"\n→ {len(parsed['blocks'])} blocks, {len(parsed['summaries'])} mementos parsed")
print("First block :", parsed["blocks"][0][:140].replace("\n", " "), "...")
print("First memento :", parsed["summaries"][0][:140].replace("\n", " "), "...")
N_SAMPLES = 500
rows = []
for i, ex in enumerate(itertools.islice(
        load_dataset(DATASET, split="train", streaming=True), N_SAMPLES)):
    p = parse_memento(ex["response"])
    # Skip malformed traces where blocks and summaries do not pair up
    if not p["blocks"] or len(p["blocks"]) != len(p["summaries"]):
        continue
    blk_c = sum(len(b) for b in p["blocks"])
    sum_c = sum(len(s) for s in p["summaries"])
    blk_w = sum(len(b.split()) for b in p["blocks"])
    sum_w = sum(len(s.split()) for s in p["summaries"])
    rows.append(dict(domain=ex["domain"], source=ex["source"],
                     n_blocks=len(p["blocks"]),
                     block_chars=blk_c, summ_chars=sum_c,
                     block_words=blk_w, summ_words=sum_w,
                     compress_char=sum_c / max(blk_c, 1),
                     compress_word=sum_w / max(blk_w, 1)))
    if (i + 1) % 100 == 0:
        print(f" processed {i+1}/{N_SAMPLES}")

df = pd.DataFrame(rows)
print(f"\nAnalyzed {len(df)} rows. Domain counts:")
print(df["domain"].value_counts().to_string())

per_dom = df.groupby("domain").agg(
    n=("domain", "count"),
    median_blocks=("n_blocks", "median"),
    median_block_words=("block_words", "median"),
    median_summ_words=("summ_words", "median"),
    median_char_ratio=("compress_char", "median"),
    median_word_ratio=("compress_word", "median"),
).round(3)
print("\nPer-domain medians (ratio = mementos / blocks):")
print(per_dom.to_string())
We define the regex-based parser that extracts reasoning blocks, memento summaries, the main thinking section, and the final answer from each response. We test the parser on the first streamed example and confirm that the block-summary structure is captured correctly. We then run a streaming analysis over several hundred samples to compute block counts, word counts, character counts, and compression ratios, which helps us study how the dataset behaves across examples and domains.
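As a quick sanity check that does not depend on the live stream, the same parser can be exercised on a tiny hand-written trace that mimics the dataset's special-token layout. The string below is illustrative only, not taken from OpenMementos; the regexes and parser are repeated so the cell runs standalone.

```python
import re
from typing import Dict

# Same regexes and parser as defined above, repeated for a standalone cell
BLOCK_RE = re.compile(r"<\|block_start\|>(.*?)<\|block_end\|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<\|summary_start\|>(.*?)<\|summary_end\|>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_memento(response: str) -> Dict:
    blocks = [m.strip() for m in BLOCK_RE.findall(response)]
    summaries = [m.strip() for m in SUMMARY_RE.findall(response)]
    think_m = THINK_RE.search(response)
    final_ans = response.split("</think>")[-1].strip() if "</think>" in response else ""
    return {"blocks": blocks, "summaries": summaries,
            "reasoning": think_m.group(1) if think_m else "",
            "final_answer": final_ans}

# A made-up two-block trace in the OpenMementos token format
toy = ("<think>"
       "<|block_start|>Try x = 2: 2**2 + 3*2 = 10, so it fits.<|block_end|>"
       "<|summary_start|>x = 2 satisfies the equation.<|summary_end|>"
       "<|block_start|>Check negatives: x = -5 also gives 25 - 15 = 10.<|block_end|>"
       "<|summary_start|>x = -5 is a second root.<|summary_end|>"
       "</think>"
       "The solutions are x = 2 and x = -5.")

p = parse_memento(toy)
print(len(p["blocks"]), len(p["summaries"]))  # → 2 2
print(p["final_answer"])                      # → The solutions are x = 2 and x = -5.
```

Because the regexes are anchored on the paired start/end tokens, any trace whose blocks and summaries fail to pair up is easy to detect by comparing the two list lengths, which is exactly the filter used in the streaming loop above.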
def compress_trace(response: str, keep_last_k: int = 1) -> str:
    """Replace all but the last k reasoning blocks with their memento summaries."""
    blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response)
    if not blocks or len(blocks) != len(summaries):
        return response
    out, n = ["<think>"], len(blocks)
    for i, (b, s) in enumerate(zip(blocks, summaries)):
        if i >= n - keep_last_k:
            out.append(f"<|block_start|>{b}<|block_end|>")
            out.append(f"<|summary_start|>{s}<|summary_end|>")
        else:
            out.append(f"<|summary_start|>{s}<|summary_end|>")
    out.append("</think>")
    out.append(response.split("</think>")[-1])
    return "\n".join(out)

orig, comp = first_row["response"], compress_trace(first_row["response"], 1)
print(f"\nOriginal : {len(orig):>8,} chars")
print(f"Compressed : {len(comp):>8,} chars ({len(comp)/len(orig)*100:.1f}% of original)")

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
MEM_TOKENS = ["<|block_start|>", "<|block_end|>",
              "<|summary_start|>", "<|summary_end|>",
              "<think>", "</think>"]
tok.add_special_tokens({"additional_special_tokens": MEM_TOKENS})

def tlen(s): return len(tok(s, add_special_tokens=False).input_ids)

blk_tok = sum(tlen(b) for b in parsed["blocks"])
sum_tok = sum(tlen(s) for s in parsed["summaries"])
print("\nTrace-level token compression for this example:")
print(f" block tokens = {blk_tok}")
print(f" memento tokens = {sum_tok}")
print(f" compression = {blk_tok / max(sum_tok, 1):.2f}× (paper reports ~6×)")

def to_chat(ex):
    return {"messages": [
        {"role": "user", "content": ex["problem"]},
        {"role": "assistant", "content": ex["response"]},
    ]}

chat_stream = load_dataset(DATASET, split="train", streaming=True).map(to_chat)
chat_ex = next(iter(chat_stream))
print("\nSFT chat example (truncated):")
for m in chat_ex["messages"]:
    print(f" [{m['role']:9s}] {m['content'][:130].replace(chr(10), ' ')}...")
We visualize the dataset's structural patterns by plotting block counts, compression ratios, and the relationship between block size and memento size. We compare these distributions across domains to see how reasoning organization differs between math, code, and science examples. We also stream one example from the full subset and inspect its additional sentence-level and block-alignment fields, which helps us understand the richer internal annotation pipeline behind the dataset.
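The plotting code itself is not shown in the cells above, so here is a minimal sketch of how those views could be produced from the `df` assembled during the streaming analysis. The column names are the ones computed earlier; the tiny fallback frame exists only so the cell can run on its own.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive notebook
import matplotlib.pyplot as plt

# `df` normally comes from the streaming analysis above; build a small
# stand-in only if it is missing so this cell runs standalone.
try:
    df
except NameError:
    df = pd.DataFrame(dict(
        domain=["math", "code", "science", "math", "code"],
        n_blocks=[4, 6, 3, 5, 7],
        block_words=[800, 1200, 500, 900, 1400],
        summ_words=[120, 200, 90, 150, 230],
        compress_word=[0.15, 0.17, 0.18, 0.17, 0.16],
    ))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Distribution of blocks per trace
df["n_blocks"].plot.hist(ax=axes[0], bins=20, title="Blocks per trace")

# Word-level compression ratio (mementos / blocks)
df["compress_word"].plot.hist(ax=axes[1], bins=20, title="Word compression ratio")

# Block size vs memento size, colored by domain
for dom, g in df.groupby("domain"):
    axes[2].scatter(g["block_words"], g["summ_words"], label=dom, s=14, alpha=0.7)
axes[2].set(xlabel="block words", ylabel="memento words", title="Block vs memento size")
axes[2].legend()

plt.tight_layout()
plt.savefig("openmementos_stats.png")
```

Histograms make the per-domain shifts in block counts and compression ratios easy to eyeball, while the scatter panel shows whether longer blocks get proportionally longer mementos or whether summary length saturates.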
We simulate inference-time compression by rewriting a reasoning trace so that older blocks are replaced by their mementos while the most recent blocks remain intact. We then compare the original and compressed trace lengths to see how much context can be reduced in practice. After that, we bring in a tokenizer, add the special memento tokens, measure token-level compression, and convert the dataset to an SFT-style chat format suitable for training workflows.
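To see the rewrite behave predictably, compress_trace can be run on a small hand-built trace (illustrative only, not from the dataset): with keep_last_k=1, every block except the last collapses to its memento while all summaries and the final answer survive.

```python
import re

BLOCK_RE = re.compile(r"<\|block_start\|>(.*?)<\|block_end\|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<\|summary_start\|>(.*?)<\|summary_end\|>", re.DOTALL)

# Same logic as the compress_trace defined above, repeated for a standalone cell
def compress_trace(response: str, keep_last_k: int = 1) -> str:
    blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response)
    if not blocks or len(blocks) != len(summaries):
        return response
    out, n = ["<think>"], len(blocks)
    for i, (b, s) in enumerate(zip(blocks, summaries)):
        if i >= n - keep_last_k:
            out.append(f"<|block_start|>{b}<|block_end|>")
            out.append(f"<|summary_start|>{s}<|summary_end|>")
        else:
            out.append(f"<|summary_start|>{s}<|summary_end|>")
    out.append("</think>")
    out.append(response.split("</think>")[-1])
    return "\n".join(out)

# Made-up three-block trace in the same token format
toy = ("<think>"
       "<|block_start|>LONG DERIVATION ONE<|block_end|>"
       "<|summary_start|>short note one<|summary_end|>"
       "<|block_start|>LONG DERIVATION TWO<|block_end|>"
       "<|summary_start|>short note two<|summary_end|>"
       "<|block_start|>FINAL CHECK<|block_end|>"
       "<|summary_start|>final note<|summary_end|>"
       "</think>Answer: 42")

comp = compress_trace(toy, keep_last_k=1)
print("DERIVATION ONE" in comp, "FINAL CHECK" in comp)  # → False True
print(len(comp) < len(toy))                             # → True
```

Raising keep_last_k trades context savings for fidelity: keep_last_k equal to the block count reproduces the full trace, while keep_last_k=0 would keep only mementos.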
def render_trace(response: str, width: int = 220) -> None:
    """Pretty-print a trace block by block, pairing each block with its memento."""
    p = parse_memento(response)
    print("=" * 72)
    print(f"{len(p['blocks'])} blocks · {len(p['summaries'])} mementos")
    print("=" * 72)
    for i, (b, s) in enumerate(zip(p["blocks"], p["summaries"]), 1):
        ratio = len(s) / max(len(b), 1) * 100
        print(f"\n BLOCK {i} ({len(b):,} chars)")
        print(textwrap.indent(textwrap.shorten(b.replace("\n", " "), width=width), "   "))
        print(f" MEMENTO {i} ({len(s):,} chars · {ratio:.1f}% of block)")
        print(textwrap.indent(textwrap.shorten(s.replace("\n", " "), width=width), "   "))
    if p["final_answer"]:
        print("\n★ FINAL ANSWER")
        print(textwrap.indent(textwrap.shorten(p["final_answer"].replace("\n", " "),
                                               width=width * 2), "   "))

render_trace(first_row["response"])
We build a pretty-printer that renders a single reasoning trace in a much more readable block-by-block format. We display each block alongside its paired memento and compute the summary's size relative to the original block, making the compression effect easy to inspect manually. By running this renderer on the first example, we get a clean qualitative view of how OpenMementos organizes reasoning and preserves essential information through summaries.
In conclusion, we gained a clear view of how OpenMementos represents reasoning as a sequence of detailed blocks paired with concise mementos, and we saw why this structure is useful for context compression. We parsed real examples, computed domain-level statistics, compared block and summary lengths, and observed how compressed traces can reduce token usage while still retaining key information. We also aligned the streamed dataset format with the full subset, converted the data to an SFT-ready chat structure, and built tools to inspect traces more clearly. Through this end-to-end workflow, we understand the dataset itself and see how it can serve as a practical foundation for studying reasoning traces, memory-style summarization, and efficient long-context model behavior.
Check out the Full Codes here. Also, feel free to follow us on Twitter and don't forget to join our 130k+ ML SubReddit and Subscribe to our Newsletter. You can also join us on Telegram.
The post A Coding Implementation on Microsoft's OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation appeared first on MarkTechPost.
