In this tutorial, we explore how to use Google's LangExtract library to transform unstructured text into structured, machine-readable information. We begin by installing the required dependencies and securely configuring our OpenAI API key to leverage powerful language models for extraction tasks. We then build a reusable extraction pipeline that lets us process a variety of document types, including contracts, meeting notes, product announcements, and operational logs. Through carefully designed prompts and example annotations, we demonstrate how LangExtract can identify entities, actions, deadlines, risks, and other structured attributes while grounding them to their exact source spans. Finally, we visualize the extracted information and organize it into tabular datasets, enabling downstream analytics, automation workflows, and decision-making systems.
!pip -q install -U "langextract[openai]" pandas IPython
import os
import json
import textwrap
import getpass
import pandas as pd
OPENAI_API_KEY = getpass.getpass("Enter OPENAI_API_KEY: ")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
import langextract as lx
from IPython.display import display, HTML
We install the required libraries, including LangExtract, Pandas, and IPython, so that our Colab environment is ready for structured extraction tasks. We securely request the OpenAI API key from the user and store it as an environment variable for safe access at runtime. We then import the core libraries needed to run LangExtract, display results, and handle structured outputs.
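As a small convenience, the key prompt can be wrapped in a guard so that a key already present in the environment is reused instead of asking again. This is a minimal sketch under that assumption, not part of the original notebook; `ensure_openai_key` is a hypothetical helper name.

```python
import os
import getpass

def ensure_openai_key(env=os.environ):
    # Prompt for the key only when none is configured, so re-running
    # the setup cell in the same session does not ask twice.
    if not env.get("OPENAI_API_KEY"):
        env["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI_API_KEY: ")
    return env["OPENAI_API_KEY"]
```

Calling `ensure_openai_key()` once at the top of the notebook then makes every later cell safe to re-run.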
MODEL_ID = "gpt-4o-mini"

def run_extraction(
    text_or_documents,
    prompt_description,
    examples,
    output_stem,
    model_id=MODEL_ID,
    extraction_passes=1,
    max_workers=4,
    max_char_buffer=1800,
):
    result = lx.extract(
        text_or_documents=text_or_documents,
        prompt_description=prompt_description,
        examples=examples,
        model_id=model_id,
        api_key=os.environ["OPENAI_API_KEY"],
        fence_output=True,
        use_schema_constraints=False,
        extraction_passes=extraction_passes,
        max_workers=max_workers,
        max_char_buffer=max_char_buffer,
    )
    jsonl_name = f"{output_stem}.jsonl"
    html_name = f"{output_stem}.html"
    lx.io.save_annotated_documents([result], output_name=jsonl_name, output_dir=".")
    html_content = lx.visualize(jsonl_name)
    with open(html_name, "w", encoding="utf-8") as f:
        if hasattr(html_content, "data"):
            f.write(html_content.data)
        else:
            f.write(html_content)
    return result, jsonl_name, html_name
def extraction_rows(result):
    rows = []
    for ex in result.extractions:
        start_pos = None
        end_pos = None
        if getattr(ex, "char_interval", None):
            start_pos = ex.char_interval.start_pos
            end_pos = ex.char_interval.end_pos
        rows.append({
            "class": ex.extraction_class,
            "text": ex.extraction_text,
            "attributes": json.dumps(ex.attributes or {}, ensure_ascii=False),
            "start": start_pos,
            "end": end_pos,
        })
    return pd.DataFrame(rows)
def preview_result(title, result, html_name, max_rows=50):
    print("=" * 80)
    print(title)
    print("=" * 80)
    print(f"Total extractions: {len(result.extractions)}")
    df = extraction_rows(result)
    display(df.head(max_rows))
    display(HTML(
        f'<p>Open interactive visualization: '
        f'<a href="{html_name}" target="_blank">{html_name}</a></p>'
    ))
We define the core utility functions that power the entire extraction pipeline. We create a reusable run_extraction function that sends text to the LangExtract engine and generates both JSONL and HTML outputs. We also define helper functions to convert the extraction results into tabular rows and preview them interactively in the notebook.
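Because extraction_rows stores each extraction's attributes as a JSON string, downstream filtering often wants them expanded into real columns. A minimal sketch on a synthetic DataFrame in the same shape (not live pipeline output):

```python
import json
import pandas as pd

# Synthetic rows in the shape extraction_rows produces:
# the attributes column holds JSON-encoded dicts.
df = pd.DataFrame({
    "class": ["obligation", "penalty"],
    "text": ["shall deliver the equipment", "2% monthly penalty"],
    "attributes": [
        json.dumps({"party_name": "Acme Corp", "risk_level": "medium"}),
        json.dumps({"category": "late_payment", "risk_level": "high"}),
    ],
})

# Decode the JSON column into real attribute columns and join them back.
attrs = pd.json_normalize(df["attributes"].map(json.loads).tolist())
expanded = df.drop(columns="attributes").join(attrs)
print(expanded[["class", "risk_level"]])
```

Attributes absent from a given row (here, party_name on the penalty row) simply come back as NaN, so the expanded frame stays rectangular.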
contract_prompt = textwrap.dedent("""
    Extract contract-risk information in order of appearance.
    Rules:
    1. Use exact text spans from the source. Do not paraphrase extraction_text.
    2. Extract the following classes when present:
       - party
       - obligation
       - deadline
       - payment_term
       - penalty
       - termination_clause
       - governing_law
    3. Add useful attributes:
       - party_name for obligations or payment terms when relevant
       - risk_level as low, medium, or high
       - category for the business meaning
    4. Keep output grounded to the exact wording in the source.
    5. Do not merge non-contiguous spans into one extraction.
""")
contract_examples = [
    lx.data.ExampleData(
        text=(
            "Acme Corp shall deliver the equipment by March 15, 2026. "
            "The Client must pay within 10 days of invoice receipt. "
            "Late payment incurs a 2% monthly penalty. "
            "This agreement is governed by the laws of Ontario."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="party",
                extraction_text="Acme Corp",
                attributes={"category": "supplier", "risk_level": "low"},
            ),
            lx.data.Extraction(
                extraction_class="obligation",
                extraction_text="shall deliver the equipment",
                attributes={"party_name": "Acme Corp", "category": "delivery", "risk_level": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="deadline",
                extraction_text="by March 15, 2026",
                attributes={"category": "delivery_deadline", "risk_level": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="party",
                extraction_text="The Client",
                attributes={"category": "customer", "risk_level": "low"},
            ),
            lx.data.Extraction(
                extraction_class="payment_term",
                extraction_text="must pay within 10 days of invoice receipt",
                attributes={"party_name": "The Client", "category": "payment", "risk_level": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="penalty",
                extraction_text="2% monthly penalty",
                attributes={"category": "late_payment", "risk_level": "high"},
            ),
            lx.data.Extraction(
                extraction_class="governing_law",
                extraction_text="laws of Ontario",
                attributes={"category": "legal_jurisdiction", "risk_level": "low"},
            ),
        ],
    )
]
contract_text = """
BluePeak Analytics shall provide a production-ready dashboard and underlying ETL pipeline no later than April 30, 2026.
North Ridge Manufacturing will remit payment within 7 calendar days after final acceptance.
If payment is delayed beyond 15 days, BluePeak Analytics may suspend support services and charge interest at 1.5% per month.
This Agreement shall be governed by the laws of British Columbia.
"""
contract_result, contract_jsonl, contract_html = run_extraction(
    text_or_documents=contract_text,
    prompt_description=contract_prompt,
    examples=contract_examples,
    output_stem="contract_risk_extraction",
    extraction_passes=2,
    max_workers=4,
    max_char_buffer=1400,
)
preview_result("USE CASE 1 — Contract risk extraction", contract_result, contract_html)
We build a contract intelligence workflow by defining a detailed prompt and structured examples. We provide LangExtract with annotated training-style examples so that it understands how to identify entities such as obligations, deadlines, penalties, and governing laws. We then run the extraction pipeline on a contract text and preview the structured risk-related outputs.
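One natural downstream step is to surface only the high-risk spans from the tabular view. The sketch below works on hand-made rows in the extraction_rows shape rather than live contract_result output:

```python
import json
import pandas as pd

# Hand-made rows mimicking extraction_rows(contract_result) output.
rows = pd.DataFrame({
    "class": ["penalty", "deadline", "governing_law"],
    "text": ["1.5% per month", "no later than April 30, 2026", "laws of British Columbia"],
    "attributes": [
        json.dumps({"risk_level": "high"}),
        json.dumps({"risk_level": "medium"}),
        json.dumps({"risk_level": "low"}),
    ],
})

# Lift risk_level out of the JSON attributes, then keep only high-risk spans.
rows["risk_level"] = rows["attributes"].map(lambda s: json.loads(s).get("risk_level"))
high_risk = rows[rows["risk_level"] == "high"]
print(high_risk[["class", "text"]])
```

The same pattern applied to the real contract DataFrame gives a ready-made escalation list for legal review.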
meeting_prompt = textwrap.dedent("""
    Extract action items from meeting notes in order of appearance.
    Rules:
    1. Use exact text spans from the source. No paraphrasing in extraction_text.
    2. Extract these classes when present:
       - assignee
       - action_item
       - due_date
       - blocker
       - decision
    3. Add attributes:
       - priority as low, medium, or high
       - workstream when inferable from local context
       - owner for action_item when tied to a named assignee
    4. Keep all spans grounded to the source text.
    5. Preserve order of appearance.
""")
meeting_examples = [
    lx.data.ExampleData(
        text=(
            "Sarah will finalize the launch email by Friday. "
            "The team decided to postpone the webinar. "
            "Blocked by missing legal approval."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="assignee",
                extraction_text="Sarah",
                attributes={"priority": "medium", "workstream": "marketing"},
            ),
            lx.data.Extraction(
                extraction_class="action_item",
                extraction_text="will finalize the launch email",
                attributes={"owner": "Sarah", "priority": "high", "workstream": "marketing"},
            ),
            lx.data.Extraction(
                extraction_class="due_date",
                extraction_text="by Friday",
                attributes={"priority": "medium", "workstream": "marketing"},
            ),
            lx.data.Extraction(
                extraction_class="decision",
                extraction_text="decided to postpone the webinar",
                attributes={"priority": "medium", "workstream": "events"},
            ),
            lx.data.Extraction(
                extraction_class="blocker",
                extraction_text="missing legal approval",
                attributes={"priority": "high", "workstream": "compliance"},
            ),
        ],
    )
]
meeting_text = """
Arjun will prepare the revised pricing sheet by Tuesday evening.
Mina to confirm the enterprise customer's data residency requirements this week.
The team agreed to ship the pilot only for the Oman region first.
Blocked by pending security review from the client's IT team.
Ravi will draft the rollback plan before the production cutover.
"""
meeting_result, meeting_jsonl, meeting_html = run_extraction(
    text_or_documents=meeting_text,
    prompt_description=meeting_prompt,
    examples=meeting_examples,
    output_stem="meeting_action_extraction",
    extraction_passes=2,
    max_workers=4,
    max_char_buffer=1400,
)
preview_result("USE CASE 2 — Meeting notes to action tracker", meeting_result, meeting_html)
We design a meeting intelligence extractor that focuses on action items, decisions, assignees, and blockers. We again provide example annotations to help the model structure meeting information consistently. We execute the extraction on meeting notes and display the resulting structured task tracker.
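To turn the raw rows into a per-person task tracker, the action_item rows can be paired with their owner and priority attributes. A sketch on synthetic rows in the extraction_rows shape (not live meeting_result output):

```python
import json
import pandas as pd

rows = pd.DataFrame({
    "class": ["assignee", "action_item", "action_item"],
    "text": ["Arjun", "will prepare the revised pricing sheet", "will draft the rollback plan"],
    "attributes": [
        json.dumps({"priority": "medium"}),
        json.dumps({"owner": "Arjun", "priority": "high"}),
        json.dumps({"owner": "Ravi", "priority": "medium"}),
    ],
})

# Keep only action items, then decode owner and priority for each one.
actions = rows[rows["class"] == "action_item"].copy()
actions["owner"] = actions["attributes"].map(lambda s: json.loads(s).get("owner"))
actions["priority"] = actions["attributes"].map(lambda s: json.loads(s).get("priority"))
tracker = actions[["owner", "text", "priority"]].sort_values("owner")
print(tracker)
```

Sorting or grouping on the owner column then yields a simple per-assignee to-do view that can feed a task board export.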
longdoc_prompt = textwrap.dedent("""
    Extract product launch intelligence in order of appearance.
    Rules:
    1. Use exact text spans from the source.
    2. Extract:
       - company
       - product
       - launch_date
       - region
       - metric
       - partnership
    3. Add attributes:
       - category
       - significance as low, medium, or high
    4. Keep the extraction grounded in the original text.
    5. Do not paraphrase the extracted span.
""")
longdoc_examples = [
    lx.data.ExampleData(
        text=(
            "Nova Robotics launched Atlas Mini in Europe on 12 January 2026. "
            "The company reported 18% faster picking speed and partnered with Helix Warehousing."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="company",
                extraction_text="Nova Robotics",
                attributes={"category": "vendor", "significance": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="product",
                extraction_text="Atlas Mini",
                attributes={"category": "product_name", "significance": "high"},
            ),
            lx.data.Extraction(
                extraction_class="region",
                extraction_text="Europe",
                attributes={"category": "market", "significance": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="launch_date",
                extraction_text="12 January 2026",
                attributes={"category": "timeline", "significance": "medium"},
            ),
            lx.data.Extraction(
                extraction_class="metric",
                extraction_text="18% faster picking speed",
                attributes={"category": "performance_claim", "significance": "high"},
            ),
            lx.data.Extraction(
                extraction_class="partnership",
                extraction_text="partnered with Helix Warehousing",
                attributes={"category": "go_to_market", "significance": "medium"},
            ),
        ],
    )
]
long_text = """
Vertex Dynamics launched FleetSense 3.0 for industrial logistics teams across the GCC on 5 February 2026.
The company said the release improves the accuracy of route deviation detection by 22% and reduces manual review time by 31%.
In the first rollout phase, the platform will support Oman and the United Arab Emirates.
Vertex Dynamics also partnered with Falcon Telematics to integrate live driver behavior events into the dashboard.
A week later, FleetSense 3.0 added a risk-scoring module for safety managers.
The update gives supervisors a daily ranked list of high-risk trips and exception events.
The company described the module as especially valuable for oilfield transport operations and contractor fleet audits.
By late February 2026, the team announced a pilot with Desert Haul Services.
The pilot covers 240 heavy vehicles and focuses on speeding up incident triage, compliance review, and evidence retrieval.
Internal testing showed analysts could assemble review packets in under 8 minutes instead of the previous 20 minutes.
"""
longdoc_result, longdoc_jsonl, longdoc_html = run_extraction(
    text_or_documents=long_text,
    prompt_description=longdoc_prompt,
    examples=longdoc_examples,
    output_stem="long_document_extraction",
    extraction_passes=3,
    max_workers=8,
    max_char_buffer=1000,
)
preview_result("USE CASE 3 — Long-document extraction", longdoc_result, longdoc_html)
batch_docs = [
    """
    The supplier must replace defective batteries within 14 days of written notice.
    Any unresolved safety issue may trigger immediate suspension of shipments.
    """,
    """
    Priya will circulate the revised onboarding checklist tomorrow morning.
    The team approved the API deprecation plan for the legacy endpoint.
    """,
    """
    Orbit Health launched a remote triage assistant in Singapore on 14 March 2026.
    The company claims the assistant reduces nurse intake time by 17%.
    """,
]
batch_prompt = textwrap.dedent("""
    Extract operationally useful spans in order of appearance.
    Allowed classes:
    - obligation
    - deadline
    - penalty
    - assignee
    - action_item
    - decision
    - company
    - product
    - launch_date
    - metric
    Use exact text only and attach a simple attribute:
    - source_type
""")
batch_examples = [
    lx.data.ExampleData(
        text="Jordan will submit the report by Monday. Late delivery incurs a service credit.",
        extractions=[
            lx.data.Extraction(
                extraction_class="assignee",
                extraction_text="Jordan",
                attributes={"source_type": "meeting"},
            ),
            lx.data.Extraction(
                extraction_class="action_item",
                extraction_text="will submit the report",
                attributes={"source_type": "meeting"},
            ),
            lx.data.Extraction(
                extraction_class="deadline",
                extraction_text="by Monday",
                attributes={"source_type": "meeting"},
            ),
            lx.data.Extraction(
                extraction_class="penalty",
                extraction_text="service credit",
                attributes={"source_type": "contract"},
            ),
        ],
    )
]
batch_results = []
for idx, doc in enumerate(batch_docs, start=1):
    res, jsonl_name, html_name = run_extraction(
        text_or_documents=doc,
        prompt_description=batch_prompt,
        examples=batch_examples,
        output_stem=f"batch_doc_{idx}",
        extraction_passes=2,
        max_workers=4,
        max_char_buffer=1200,
    )
    df = extraction_rows(res)
    df.insert(0, "document_id", idx)
    batch_results.append(df)
    print(f"Finished doc {idx} -> {html_name}")

batch_df = pd.concat(batch_results, ignore_index=True)
print("\nCombined batch output")
display(batch_df)
print("\nContract extraction counts by class")
display(
    extraction_rows(contract_result)
    .groupby("class", as_index=False)
    .size()
    .sort_values("size", ascending=False)
)

print("\nMeeting action items only")
meeting_df = extraction_rows(meeting_result)
display(meeting_df[meeting_df["class"] == "action_item"])

print("\nLong-document metrics only")
longdoc_df = extraction_rows(longdoc_result)
display(longdoc_df[longdoc_df["class"] == "metric"])
final_df = pd.concat([
    extraction_rows(contract_result).assign(use_case="contract_risk"),
    extraction_rows(meeting_result).assign(use_case="meeting_actions"),
    extraction_rows(longdoc_result).assign(use_case="long_document"),
], ignore_index=True)
final_df.to_csv("langextract_tutorial_outputs.csv", index=False)
print("\nSaved CSV: langextract_tutorial_outputs.csv")

print("\nGenerated files:")
for name in [
    contract_jsonl, contract_html,
    meeting_jsonl, meeting_html,
    longdoc_jsonl, longdoc_html,
    "langextract_tutorial_outputs.csv",
]:
    print(" -", name)
We implement a long-document intelligence pipeline capable of extracting structured insights from large narrative text. We run the extraction across product launch reports and operational documents, and also demonstrate batch processing across multiple documents. We then analyze the extracted results, filter key classes, and export the structured dataset to a CSV file for downstream analysis.
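The combined frame lends itself to a quick cross-tabulation of how many extractions each use case produced per class. Illustrated on a small synthetic frame with the same use_case and class columns:

```python
import pandas as pd

# Synthetic stand-in for the combined final_df.
final_df = pd.DataFrame({
    "use_case": ["contract_risk", "contract_risk", "meeting_actions", "long_document"],
    "class": ["penalty", "deadline", "action_item", "metric"],
})

# Count extractions per use case and class; absent combinations become 0.
counts = (
    final_df.groupby(["use_case", "class"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

On the real data, this one table makes it easy to spot which document types yield dense extractions and which classes rarely fire, which is useful when tuning prompts or extraction_passes.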
In conclusion, we built an advanced LangExtract workflow that converts complex text documents into structured datasets with traceable source grounding. We ran multiple extraction scenarios, including contract risk analysis, meeting action tracking, long-document intelligence extraction, and batch processing across several documents. We also visualized the extractions and exported the final structured results to a CSV file for further analysis. Through this process, we saw how prompt design, example-based extraction, and scalable processing strategies let us build robust information extraction systems with minimal code.