In this tutorial, we learn how to use the ParseBench dataset to evaluate document parsing systems in a structured, practical way. We begin by loading the dataset directly from Hugging Face, inspecting its multiple dimensions, such as text, tables, charts, and layout, and transforming it into a unified dataframe for deeper analysis. As we progress, we identify key fields, detect linked PDFs, and build a lightweight baseline using PyMuPDF to extract and compare text. Throughout the process, we focus on creating a flexible pipeline that allows us to understand the dataset schema, evaluate parsing quality, and prepare inputs for more advanced OCR or vision-language models.
!pip install -q -U datasets huggingface_hub pandas matplotlib rich pymupdf rapidfuzz tqdm

import json, re, textwrap, random, math
from pathlib import Path
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from huggingface_hub import hf_hub_download, list_repo_files
from rapidfuzz import fuzz
import fitz

console = Console()
DATASET_ID = "llamaindex/ParseBench"
WORKDIR = Path("/content/parsebench_tutorial")
WORKDIR.mkdir(parents=True, exist_ok=True)
console.print(Panel.fit("Advanced ParseBench Tutorial on Google Colab", style="bold green"))

# List repository contents and separate annotation files from source PDFs.
files = list_repo_files(DATASET_ID, repo_type="dataset")
jsonl_files = [f for f in files if f.endswith(".jsonl")]
pdf_files = [f for f in files if f.endswith(".pdf")]
console.print(f"Found {len(jsonl_files)} JSONL files")
console.print(f"Found {len(pdf_files)} PDF files")

table = Table(title="ParseBench JSONL Files")
table.add_column("File")
table.add_column("Dimension")
for f in jsonl_files:
    table.add_row(f, Path(f).stem)
console.print(table)
We install all required libraries and set up our working environment for the tutorial. We initialize the dataset source and prepare a workspace to store all outputs. We also fetch and list all JSONL and PDF files from the ParseBench repository to understand the dataset structure.
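Before loading everything, we can sanity-check the listing with a quick peek at one file. The snippet below is a minimal sketch that reuses the objects defined above and assumes at least one JSONL file was found; it reads only the first record to preview its top-level fields.

# Minimal sketch: preview the first record of the first JSONL file.
# Assumes jsonl_files is non-empty (produced by list_repo_files above).
preview_path = hf_hub_download(repo_id=DATASET_ID, filename=jsonl_files[0], repo_type="dataset")
with open(preview_path, "r", encoding="utf-8") as fp:
    first_record = json.loads(fp.readline())
print(sorted(first_record.keys()))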
def load_jsonl_from_hf(filename, max_rows=None):
    # Download a JSONL file from the dataset repo and parse it line by line.
    path = hf_hub_download(repo_id=DATASET_ID, filename=filename, repo_type="dataset")
    rows = []
    with open(path, "r", encoding="utf-8") as fp:
        for i, line in enumerate(fp):
            if max_rows and i >= max_rows:
                break
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows, path

def flatten_dict(d, parent_key="", sep="."):
    # Recursively flatten nested dicts into dot-separated keys.
    items = {}
    if isinstance(d, dict):
        for k, v in d.items():
            new_key = f"{parent_key}{sep}{k}" if parent_key else str(k)
            if isinstance(v, dict):
                items.update(flatten_dict(v, new_key, sep=sep))
            else:
                items[new_key] = v
    return items

dimension_data = {}
for jf in jsonl_files:
    rows, local_path = load_jsonl_from_hf(jf)
    dimension_data[Path(jf).stem] = rows
    console.print(f"{jf}: {len(rows)} examples loaded")

# Summarize each dimension: example count and the most common flattened fields.
summary_rows = []
for dim, rows in dimension_data.items():
    keys = Counter()
    for r in rows[:100]:
        keys.update(flatten_dict(r).keys())
    summary_rows.append({
        "dimension": dim,
        "examples": len(rows),
        "top_fields": ", ".join([k for k, _ in keys.most_common(12)])
    })
summary_df = pd.DataFrame(summary_rows)
display(summary_df)

plt.figure(figsize=(10, 5))
plt.bar(summary_df["dimension"], summary_df["examples"])
plt.title("ParseBench Examples by Dimension")
plt.xlabel("Dimension")
plt.ylabel("Number of Examples")
plt.xticks(rotation=30, ha="right")
plt.show()

for dim, rows in dimension_data.items():
    console.print(Panel.fit(f"Sample schema for {dim}", style="bold cyan"))
    if rows:
        console.print(json.dumps(rows[0], indent=2)[:3000])
We load the JSONL files from the dataset and convert them into usable Python objects. We flatten nested structures so we can analyze them easily in a tabular format. We also summarize each dimension and visualize the distribution of examples across different parsing tasks.
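To make the flattening step concrete, here is a small illustration on a toy record (hypothetical data, not an actual ParseBench row) showing how nested keys collapse into dot-separated column names:

# Toy record (hypothetical) to illustrate how flatten_dict behaves.
toy = {"id": 7, "layout": {"bbox": [0, 0, 10, 10], "page": 1}, "meta": {"source": {"file": "a.pdf"}}}
print(flatten_dict(toy))
# {'id': 7, 'layout.bbox': [0, 0, 10, 10], 'layout.page': 1, 'meta.source.file': 'a.pdf'}

Note that lists are kept as values rather than recursed into, which keeps bounding boxes and similar fields intact.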
all_records = []
for dim, rows in dimension_data.items():
    for i, r in enumerate(rows):
        flat = flatten_dict(r)
        flat["_dimension"] = dim
        flat["_row_id"] = i
        all_records.append(flat)
df = pd.DataFrame(all_records)
console.print(f"Combined dataframe shape: {df.shape}")
display(df.head())

# Coverage report: how many rows populate each flattened column.
missing_report = []
for col in df.columns:
    missing_report.append({
        "column": col,
        "non_null": int(df[col].notna().sum()),
        "missing": int(df[col].isna().sum()),
        "coverage_pct": round(100 * df[col].notna().mean(), 2)
    })
missing_df = pd.DataFrame(missing_report).sort_values("coverage_pct", ascending=False)
display(missing_df.head(40))

def find_candidate_columns(df, keywords):
    # Return columns whose names contain any of the given keywords.
    cols = []
    for c in df.columns:
        lc = c.lower()
        if any(k.lower() in lc for k in keywords):
            cols.append(c)
    return cols

doc_cols = find_candidate_columns(df, ["doc", "pdf", "file", "path", "source", "image"])
text_cols = find_candidate_columns(df, ["text", "content", "markdown", "ground", "answer", "expected", "target", "reference"])
rule_cols = find_candidate_columns(df, ["rule", "check", "assert", "criteria", "question", "prompt"])
bbox_cols = find_candidate_columns(df, ["bbox", "box", "polygon", "coordinates", "layout"])
console.print("[bold]Potential doc columns:[/bold]", doc_cols[:30])
console.print("[bold]Potential text/reference columns:[/bold]", text_cols[:30])
console.print("[bold]Potential rule/question columns:[/bold]", rule_cols[:30])
console.print("[bold]Potential layout columns:[/bold]", bbox_cols[:30])
We combine all parsed records into a single dataframe for unified analysis. We assess missing values and identify which fields are most informative across the dataset. We also detect candidate columns related to documents, text, rules, and layout to guide downstream processing.
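If we also want to see which detected columns appear in which dimension, a short follow-up like the sketch below cross-tabulates coverage; it reuses df, doc_cols, and text_cols from above and is an optional check rather than part of the benchmark itself.

# Sketch: per-dimension coverage for the first few detected columns.
check_cols = list(dict.fromkeys(doc_cols + text_cols))[:8]
if check_cols:
    coverage = df.groupby("_dimension")[check_cols].apply(lambda g: g.notna().mean().round(2))
    display(coverage)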
def pick_first_existing(row, candidates):
    # Return the first non-empty value among the candidate columns.
    for c in candidates:
        if c in row and pd.notna(row[c]):
            value = row[c]
            if isinstance(value, str) and value.strip():
                return value
            if not isinstance(value, str):
                return value
    return None

def normalize_text(x):
    # Collapse whitespace and lowercase for tolerant comparison.
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return ""
    x = str(x)
    x = re.sub(r"\s+", " ", x)
    return x.strip().lower()

def simple_text_similarity(a, b):
    a = normalize_text(a)
    b = normalize_text(b)
    if not a or not b:
        return None
    return fuzz.token_set_ratio(a, b) / 100

def locate_pdf_path(value):
    # Match a record value against the repo's PDF paths, by filename or stem.
    if value is None:
        return None
    value = str(value)
    candidates = []
    if value.endswith(".pdf"):
        candidates.append(value)
        candidates.extend([f for f in pdf_files if f.endswith(value.split("/")[-1])])
    else:
        candidates.extend([
            f for f in pdf_files
            if value in f or Path(f).stem in value or value in Path(f).stem
        ])
    return candidates[0] if candidates else None

def extract_pdf_text_from_hf(pdf_repo_path, max_pages=2):
    local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
    doc = fitz.open(local_pdf)
    texts = []
    for page_idx in range(min(max_pages, len(doc))):
        texts.append(doc[page_idx].get_text("text"))
    doc.close()
    return "\n".join(texts), local_pdf

def render_pdf_first_page(pdf_repo_path, zoom=2):
    local_pdf = hf_hub_download(repo_id=DATASET_ID, filename=pdf_repo_path, repo_type="dataset")
    doc = fitz.open(local_pdf)
    page = doc[0]
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    out_path = WORKDIR / (Path(pdf_repo_path).stem + "_page1.png")
    pix.save(out_path)
    doc.close()
    return out_path

# Sample rows and look for linked PDFs among the candidate document columns.
sample_records = df.sample(min(25, len(df)), random_state=42).to_dict("records")
pdf_candidates = []
for row in sample_records:
    for c in doc_cols:
        pdf_path = locate_pdf_path(row.get(c))
        if pdf_path:
            pdf_candidates.append((row["_dimension"], row["_row_id"], pdf_path))
            break
pdf_candidates = list(dict.fromkeys(pdf_candidates))
console.print(f"Detected {len(pdf_candidates)} PDF-linked sampled records")

if pdf_candidates:
    dim, row_id, pdf_path = pdf_candidates[0]
    console.print(Panel.fit(f"Rendering sample PDF\nDimension: {dim}\nRow: {row_id}\nPDF: {pdf_path}", style="bold yellow"))
    image_path = render_pdf_first_page(pdf_path)
    img = plt.imread(image_path)
    plt.figure(figsize=(10, 12))
    plt.imshow(img)
    plt.axis("off")
    plt.title(f"{dim}: {Path(pdf_path).name}")
    plt.show()
else:
    console.print("[yellow]No PDF-linked rows were detected from the sample.[/yellow]")
We define helper functions for text normalization, similarity scoring, and PDF handling. We locate and download PDF files associated with dataset entries and extract their text content. We also display a sample PDF page for visual inspection of the document structure.
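Before relying on the similarity scores, it is worth sanity-checking the helpers on toy strings; the expected outputs below follow from the normalization rules and RapidFuzz's token_set_ratio semantics.

# Quick sanity checks on the helpers (toy strings, not benchmark data).
print(normalize_text("  Total   Revenue\n2024 "))  # "total revenue 2024"
print(simple_text_similarity("Total Revenue 2024", "2024 total revenue"))  # 1.0, same token set
print(simple_text_similarity("alpha beta", ""))  # None, an empty side short-circuits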
preferred_gt_cols = [
    c for c in text_cols
    if any(k in c.lower() for k in ["ground", "expected", "target", "answer", "content", "text", "markdown", "reference"])
]

evaluation_rows = []
eval_sample = df.sample(min(50, len(df)), random_state=7).to_dict("records")
for row in tqdm(eval_sample, desc="Running lightweight PDF text extraction baseline"):
    # Find a linked PDF for this row, if any.
    pdf_path = None
    for c in doc_cols:
        pdf_path = locate_pdf_path(row.get(c))
        if pdf_path:
            break
    if not pdf_path:
        evaluation_rows.append({
            "dimension": row.get("_dimension"),
            "row_id": row.get("_row_id"),
            "pdf": None,
            "ground_truth_column": None,
            "similarity_score": None,
            "status": "no_pdf_detected"
        })
        continue
    # Find a reference (ground-truth) field for this row, if any.
    gt_col = None
    gt = None
    for c in preferred_gt_cols:
        if c in row and pd.notna(row[c]):
            gt_col = c
            gt = row[c]
            break
    if gt is None:
        evaluation_rows.append({
            "dimension": row.get("_dimension"),
            "row_id": row.get("_row_id"),
            "pdf": pdf_path,
            "ground_truth_column": None,
            "similarity_score": None,
            "status": "no_reference_detected"
        })
        continue
    try:
        extracted, local_pdf = extract_pdf_text_from_hf(pdf_path, max_pages=2)
        score = simple_text_similarity(extracted, gt)
        evaluation_rows.append({
            "dimension": row.get("_dimension"),
            "row_id": row.get("_row_id"),
            "pdf": pdf_path,
            "ground_truth_column": gt_col,
            "similarity_score": score,
            "extracted_chars": len(extracted),
            "ground_truth_chars": len(str(gt)),
            "status": "scored"
        })
    except Exception as e:
        evaluation_rows.append({
            "dimension": row.get("_dimension"),
            "row_id": row.get("_row_id"),
            "pdf": pdf_path,
            "ground_truth_column": gt_col,
            "similarity_score": None,
            "status": "error",
            "error": str(e)
        })

eval_df = pd.DataFrame(evaluation_rows)
if eval_df.empty:
    eval_df = pd.DataFrame(columns=[
        "dimension", "row_id", "pdf", "ground_truth_column",
        "similarity_score", "extracted_chars", "ground_truth_chars",
        "status", "error"
    ])
display(eval_df.head(30))
if "status" in eval_df.columns:
    display(eval_df["status"].value_counts().reset_index().rename(columns={"index": "status", "status": "count"}))

if not eval_df.empty and "similarity_score" in eval_df.columns:
    valid_eval = eval_df.dropna(subset=["similarity_score"])
    if len(valid_eval):
        console.print(f"Average lightweight text similarity: {valid_eval['similarity_score'].mean():.3f}")
        plt.figure(figsize=(8, 5))
        plt.hist(valid_eval["similarity_score"], bins=10)
        plt.title("Lightweight Baseline Similarity Distribution")
        plt.xlabel("RapidFuzz Token Set Similarity")
        plt.ylabel("Count")
        plt.show()
        per_dim = valid_eval.groupby("dimension")["similarity_score"].mean().reset_index()
        display(per_dim)
        plt.figure(figsize=(9, 5))
        plt.bar(per_dim["dimension"], per_dim["similarity_score"])
        plt.title("Average Baseline Similarity by Dimension")
        plt.xlabel("Dimension")
        plt.ylabel("Average Similarity")
        plt.xticks(rotation=30, ha="right")
        plt.show()
    else:
        console.print("[yellow]No valid similarity scores were produced. This usually means sampled rows did not contain both detectable PDFs and reference text.[/yellow]")
else:
    console.print("[yellow]No similarity_score column found.[/yellow]")
We run a lightweight evaluation pipeline by comparing extracted text with available reference fields. We compute similarity scores and analyze how well simple extraction performs across different dimensions. We also visualize the results to understand performance trends and limitations.
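Since we will likely want to compare these numbers against stronger parsers later, a small optional export step (a sketch using the workspace defined earlier) keeps the raw scores on disk:

# Persist the baseline evaluation for later comparison runs.
eval_out = WORKDIR / "parsebench_baseline_eval.csv"
eval_df.to_csv(eval_out, index=False)
console.print(f"Saved baseline evaluation to: {eval_out}")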
def inspect_dimension(dimension_name, n=3):
    rows = dimension_data.get(dimension_name, [])
    console.print(Panel.fit(f"Inspecting {dimension_name}: {len(rows)} rows", style="bold magenta"))
    for idx, row in enumerate(rows[:n]):
        console.print(f"\n[bold]Example {idx}[/bold]")
        console.print(json.dumps(row, indent=2)[:2500])

for dim in list(dimension_data.keys())[:5]:
    inspect_dimension(dim, n=1)

def make_parsebench_subset(dimension=None, n=20, seed=123):
    subset = df.copy()
    if dimension:
        subset = subset[subset["_dimension"] == dimension]
    if len(subset) == 0:
        return subset
    return subset.sample(min(n, len(subset)), random_state=seed)

subset = make_parsebench_subset(n=20)
display(subset.head())

def create_llm_parser_prompt(row):
    # Build a reusable evaluation prompt for an external OCR or VLM parser.
    dimension = row.get("_dimension", "unknown")
    candidate_truth = pick_first_existing(row, preferred_gt_cols)
    rule_hint = pick_first_existing(row, rule_cols)
    prompt = f"""
You are evaluating a document parser on ParseBench.
Dimension:
{dimension}
Task:
Parse the PDF page into a structured representation that preserves the information needed for agentic workflows.
Relevant benchmark hint or rule:
{rule_hint if rule_hint is not None else "No obvious rule field detected."}
Reference field preview:
{str(candidate_truth)[:1000] if candidate_truth is not None else "No obvious reference field detected."}
Return:
1. Markdown representation
2. Extracted tables as JSON arrays when tables exist
3. Extracted chart values as JSON when charts exist
4. Layout-sensitive notes when visual grounding matters
"""
    return textwrap.dedent(prompt).strip()

prompt_examples = []
if len(subset):
    for _, row in subset.head(3).iterrows():
        prompt_examples.append(create_llm_parser_prompt(row.to_dict()))
if prompt_examples:
    console.print(Panel.fit("Example prompt for testing an external OCR or VLM parser", style="bold blue"))
    console.print(prompt_examples[0])
else:
    console.print("[yellow]No prompt examples could be created because the subset is empty.[/yellow]")

def compare_parser_outputs(reference, candidate):
    # Score a candidate parser output against a reference string.
    return {
        "token_set_similarity": simple_text_similarity(reference, candidate),
        "partial_ratio": fuzz.partial_ratio(normalize_text(reference), normalize_text(candidate)) / 100 if reference and candidate else None,
        "candidate_length": len(str(candidate)) if candidate else 0,
        "reference_length": len(str(reference)) if reference else 0
    }

if not eval_df.empty and "similarity_score" in eval_df.columns:
    scored_eval = eval_df.dropna(subset=["similarity_score"])
    if len(scored_eval):
        best = scored_eval.sort_values("similarity_score", ascending=False).head(1)
        worst = scored_eval.sort_values("similarity_score", ascending=True).head(1)
        console.print(Panel.fit("Best lightweight baseline example", style="bold green"))
        display(best)
        console.print(Panel.fit("Worst lightweight baseline example", style="bold red"))
        display(worst)
    else:
        console.print("[yellow]No valid similarity scores were available for best/worst comparison.[/yellow]")

output_path = WORKDIR / "parsebench_flattened_sample.csv"
df.head(500).to_csv(output_path, index=False)
console.print(f"Saved flattened sample to: {output_path}")
console.print(Panel.fit("""
Tutorial complete.
What we build:
1. Load ParseBench data directly from Hugging Face.
2. Inspect benchmark dimensions and schemas.
3. Flatten records into a dataframe.
4. Detect linked PDFs and render sample pages when possible.
5. Run a lightweight PyMuPDF extraction baseline.
6. Score extracted text when reference fields are available.
7. Generate reusable prompts for OCR, VLM, and document parser evaluation.
""", style="bold green"))
We inspect dataset samples and create subsets for experimentation. We generate structured prompts for evaluating external parsing systems, such as OCR and vision-language models. Also, we compare outputs, identify best and worst cases, and save processed data for future use.
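As a usage sketch, once an external OCR or VLM returns text for a row, we can score it with compare_parser_outputs; the candidate string below is a hypothetical placeholder standing in for a real model response.

# Hypothetical candidate output standing in for a real OCR/VLM response.
row = subset.iloc[0].to_dict() if len(subset) else {}
reference = pick_first_existing(row, preferred_gt_cols)
candidate = "Example parsed markdown output from an external model"  # placeholder
if reference is not None:
    print(compare_parser_outputs(reference, candidate))
else:
    print("Sampled row has no reference field to score against.")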
In conclusion, we built a complete workflow that allows us to analyze, evaluate, and experiment with document parsing using the ParseBench dataset. We extracted and compared text content and also generated structured prompts for testing external parsing systems, such as OCR engines and VLMs. This approach helps us move beyond simple text extraction and toward building agent-ready representations that preserve structure, layout, and semantic meaning. Also, we established a strong foundation that we can extend further for benchmarking, improving parsing models, and integrating document understanding into real-world AI pipelines.