Struggling to make AI systems dependable and consistent? Many teams face the same problem. A strong LLM gives great results, but a cheaper model often fails on the same task. This makes production systems hard to scale. Harness engineering offers a solution. Instead of changing the model, you build a system around it. You use prompts, tools, middleware, and evaluation to guide the model toward reliable outputs. In this article, I've built a reliable AI coding agent using LangChain's DeepAgents and LangSmith. We also test its performance using standard benchmarks.
What is Harness Engineering?
Harness engineering focuses on building a structured system around an LLM to improve reliability. Instead of changing the models, you control the environments in which they operate. A harness includes a system prompt, tools or APIs, a testing setup, and middleware that guides the model's behavior. The goal is to improve task success and manage costs while using the same underlying model.
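In code terms, these pieces bundle around a single model call. Here is a minimal, framework-agnostic sketch of that idea (all names here are illustrative, not any library's API):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Harness:
    """Toy harness: a fixed system prompt and middleware wrapped around one model call."""
    model: Callable[[str], str]                 # the underlying LLM call
    system_prompt: str
    middleware: List[Callable] = field(default_factory=list)

    def run(self, task: str) -> str:
        call = self.model
        for mw in self.middleware:              # each middleware wraps the previous call
            call = mw(call)
        return call(self.system_prompt + "\n" + task)

# A trivial middleware that logs every prompt before forwarding it
def logging_middleware(call):
    def wrapped(prompt):
        print(f"[call] {len(prompt)} chars")
        return call(prompt)
    return wrapped
```

The point of the pattern is that the model itself stays untouched; reliability comes from what surrounds it.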
For this article, we use LangChain's DeepAgents library. DeepAgents acts as an agent harness with built-in capabilities such as task planning, an in-memory virtual file system, and sub-agent spawning. These features will help structure the agent's workflow and make it more reliable.
Also Read: A Guide to LangGraph and LangSmith for Building AI Agents
Evaluation and Metrics
HumanEval is a benchmark comprising 164 hand-crafted Python problems used to evaluate functional correctness; we can use this data to test the AI agents that we will build.
- Pass@1 (First-Shot Success): The percentage of problems solved correctly by the model in a single attempt. This is the gold standard for production systems, where users expect a correct answer in one go.
- Pass@k (Multi-Sample Success): The probability that at least one of k generated samples is correct. This is used to measure the model's knowledge or exploration power.
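Pass@k has a standard unbiased estimator, introduced in the HumanEval paper: given n samples of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k). A small helper makes this concrete:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), where n = samples drawn, c = correct samples."""
    if n - c < k:  # every size-k draw must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 correct out of 10 samples, budget of 1 draw
print(pass_at_k(10, 3, 1))
```

With k = 1 this reduces to the plain fraction of correct samples, which is what we compute later in the article.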
Building a Coding Agent with Harness Engineering
We will build a coding agent and evaluate it on the benchmarks and metrics that we have defined. The agent will be implemented using the DeepAgents library by LangChain and will use the ideas behind harness engineering to build the AI system.
Pre-Requisites (API Keys)
- Go to the LangSmith dashboard and click on the 'Setup Observability' button. Then you will see this screen. Now, click on the 'Generate API Key' option and keep the LangSmith key handy.
- We will also require an OpenAI API key, and we will use the gpt-5-mini model as the brain of the system. You can get your hands on the API key from this link.
Installations
!git clone https://github.com/openai/human-eval.git
!sed -i '/evaluate_functional_correctness/d' human-eval/setup.py
!pip install -qU ./human-eval deepagents langchain-openai
Initializations
import os
from google.colab import userdata
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGSMITH_API_KEY'] = userdata.get('LANGSMITH_API_KEY')
os.environ['LANGSMITH_PROJECT'] = 'DeepAgent'
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
Defining the Prompts
from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

ls = Client()

PROMPTS = {
    "coding-agent-1": (
        "You are a Python coding assistant.\n"
        "Given a function signature and docstring, complete the implementation.\n"
        "Return ONLY the completed Python function - no prose, no markdown fences."
    ),
    "coding-agent-2": (
        "You are a Python coding assistant with a self-verification discipline.\n"
        "Steps you MUST follow:\n"
        "1. Read the docstring and edge cases carefully.\n"
        "2. Write the implementation.\n"
        "3. Mentally run the provided examples against your code.\n"
        "4. If any example fails, rewrite and repeat step 3.\n"
        "Return ONLY the completed Python function. No prose, no markdown fences."
    ),
    "coding-agent-3": (
        "You are an expert Python engineer. Think step-by-step before coding.\n"
        "\nProcess:\n"
        "\n"
        " - Restate what the function must do in one sentence.\n"
        " - List corner cases (empty inputs, negatives, large values).\n"
        " - Choose the simplest correct algorithm.\n"
        "\n"
        "Then output the completed Python function verbatim - no markdown, no explanation."
    ),
}

for name, text in PROMPTS.items():
    prompt = ChatPromptTemplate.from_messages(
        [("system", text), ("human", "{input}")]
    )
    ls.push_prompt(name, object=prompt)
    print(f"pushed: {name}")
Output:
pushed: coding-agent-1
pushed: coding-agent-2
pushed: coding-agent-3
We have defined and pushed the prompts to LangSmith. You can verify the same in the Prompts section of the LangSmith dashboard:
Defining our First Agent
from deepagents import create_deep_agent
from langchain.chat_models import init_chat_model

PROMPT = "coding-agent-1"
pulled = ls.pull_prompt(PROMPT)
system_prompt = pulled.messages[0].prompt.template
print(f"Loaded prompt: {PROMPT}")
print(system_prompt[:120], "...")

model = init_chat_model("openai:gpt-5-mini")

# Creating the DeepAgent
agent = create_deep_agent(
    model=model,
    system_prompt=system_prompt,
)
print("\nAgent ready")
The agent should now be ready to use; it uses the 'coding-agent-1' prompt that we defined earlier.
Test the Agent
# Download the HumanEval benchmark dataset (164 Python coding problems)
!wget -q https://github.com/openai/human-eval/raw/master/data/HumanEval.jsonl.gz -O HumanEval.jsonl.gz

# Import required libraries
import gzip
import json

# Function to read the HumanEval dataset
def read_problems(path="HumanEval.jsonl.gz"):
    problems = {}
    try:
        with gzip.open(path, "rt") as f:
            for line in f:
                p = json.loads(line)
                problems[p["task_id"]] = p
    except FileNotFoundError:
        print("Dataset file not found.")
    return problems

# Load all problems
problems = read_problems()

# Extract task IDs
task_ids = list(problems.keys())

# Print total number of problems
print(f"Total problems: {len(task_ids)}")

# Optional: inspect the first problem
example = problems[task_ids[0]]
print("\nExample Task ID:", example["task_id"])
print("\nPrompt:\n", example["prompt"])
print("\nCanonical Solution:\n", example["canonical_solution"])
Total problems: 164
We now have 164 coding problems that we can use to test the system.
Generating Code with the Agent
import re

def extract_code(text: str, prompt: str) -> str:
    """Return just the completed function, stripping any markdown wrapping."""
    text = re.sub(r"```python\s*", "", text)
    text = re.sub(r"```\s*", "", text)
    if text.strip().startswith("def "):
        return text.strip()
    return prompt + text

def solve(problem: dict) -> str:
    result = agent.invoke(
        {"messages": [{"role": "user", "content": problem["prompt"]}]},
        config={
            "metadata": {
                "task_id": problem["task_id"],
                "prompt_name": PROMPT,
            }
        },
    )
    raw = result["messages"][-1].content
    return extract_code(raw, problem["prompt"])

# Test the system on the first problem before running the full evaluation
sample = problems[task_ids[0]]
code = solve(sample)
print(code)
Output:
Great! We have a working system. Let's test it on 5 coding problems now!
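The loop that produces the `results` list scored below follows the same shape as the one we run later for the new agent. As a sketch, it can be factored into a reusable helper (the name `run_eval` and the callable parameters are ours, not the library's; `check` is meant to be human-eval's `check_correctness`):

```python
import time

def run_eval(solve, check, problems, task_ids, n=5, timeout=5):
    """Run `solve` on the first n problems and score each completion with `check`,
    recording pass/fail status and wall-clock latency."""
    results = []
    for task_id in task_ids[:n]:
        problem = problems[task_id]
        t0 = time.time()
        code = solve(problem)                    # agent generates a completion
        latency = time.time() - t0
        outcome = check(problem, code, timeout=timeout)
        results.append({
            "task_id": task_id,
            "passed": outcome["passed"],
            "latency_s": round(latency, 2),
            "code": code,
        })
        print(f"{'PASS' if outcome['passed'] else 'FAIL'} {task_id:30s} {latency:.1f}s")
    return results

# results = run_eval(solve, check_correctness, problems, task_ids)
```

Passing `solve` and `check` in as arguments lets the same loop score both agents without duplication.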
import pandas as pd

# Calculate pass@1 and average latency
passed = sum(r["passed"] for r in results)
pass_at_1 = passed / len(results)
avg_latency = sum(r["latency_s"] for r in results) / len(results)
print(f"Results : pass@1 = {pass_at_1:.2%} ({passed}/{len(results)})")
print(f"Avg latency = {avg_latency:.1f}s")

# Convert results to a DataFrame for easier inspection
df = pd.DataFrame(results)
print(df[["task_id", "passed", "latency_s"]].to_string(index=False))

# Print failed tasks for debugging
print("\n── Failures ──")
for _, row in df[~df["passed"]].iterrows():
    print(f"\n{'─'*60}")
    print(f"TASK: {row['task_id']}")
    print(row["code"][:400])  # Show first 400 chars of code
Output:
Great! We ran the tests successfully, and we can see the latency of each as well. Let's open LangSmith to see the token usage, cost, and other details.
Open LangSmith -> Go to the Tracing section -> Open the DeepAgent project:
This will be helpful for comparing our results with the new agent that we will build.
Defining a New Agent
from deepagents import create_deep_agent
from langchain.agents.middleware import ModelCallLimitMiddleware
from langchain.chat_models import init_chat_model

SYSTEM_PROMPT = "coding-agent-3"
pulled = ls.pull_prompt(SYSTEM_PROMPT)
system_prompt = pulled.messages[0].prompt.template

# Build the agent
base_model = init_chat_model("openai:gpt-5-mini")
new_agent = create_deep_agent(
    model=base_model,
    system_prompt=system_prompt,
    middleware=[
        # Limit model calls to 2 per invocation
        ModelCallLimitMiddleware(
            run_limit=2,
            exit_behavior="end",
        ),
    ],
)

def solve(problem: dict) -> str:
    result = new_agent.invoke(
        {"messages": [{"role": "user", "content": problem["prompt"]}]},
        config={
            "metadata": {
                "task_id": problem["task_id"],
                "prompt_name": SYSTEM_PROMPT,
            }
        },
    )
    raw = result["messages"][-1].content
    return extract_code(raw, problem["prompt"])
Testing the New Agent
import time
from human_eval.execution import check_correctness

N_PROBLEMS = 5
TIMEOUT = 5  # seconds per test case

results = []
for task_id in task_ids[:N_PROBLEMS]:
    problem = problems[task_id]
    t0 = time.time()
    # Solve the problem using the agent
    code = solve(problem)
    latency = time.time() - t0
    # Check correctness of the generated code
    result = check_correctness(problem, code, timeout=TIMEOUT)
    results.append({
        "task_id": task_id,
        "passed": result["passed"],
        "latency_s": round(latency, 2),
        "code": code,
    })
    status = "PASS" if result["passed"] else "FAIL"
    print(f"{status} {task_id:30s} {latency:.1f}s")
We can see that our prompt-3 has passed 4 problems but has failed to solve one coding problem.
Conclusion
Does this mean our prompt-1 was better? The answer is not as simple as that; we would have to run pass@1 tests multiple times to check the consistency of the agent, and with a test set much larger than 5 problems. This helps us find the average latency, cost, and the most critical factor: task reliability. Also, finding and plugging in the right middleware can help the system perform according to our needs; there are middlewares available to extend the capabilities of the agent and to control the number of model calls, tool calls, and much more. It is important to evaluate the agent, and LangSmith can indeed assist with traceability, storing the prompts, and surfacing errors (if any) from the agent. It is important to note that while prompt engineering focuses on the input, harness engineering focuses on the environment and constraints.
Frequently Asked Questions
Q1. What is middleware?
A. Middleware is software that acts as a bridge between components, enabling communication and extending an agent's capabilities.
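As a framework-agnostic sketch (none of these names are LangChain's API), a middleware can be pictured as a wrapper that intercepts calls, here enforcing a budget in the spirit of the `ModelCallLimitMiddleware` used earlier:

```python
def call_limit_middleware(call_model, limit=2):
    """Wrap a model-call function so it refuses calls beyond `limit`,
    mimicking the idea behind a call-limit middleware."""
    state = {"calls": 0}
    def wrapped(prompt):
        if state["calls"] >= limit:
            raise RuntimeError("model call limit reached")
        state["calls"] += 1
        return call_model(prompt)
    return wrapped

limited = call_limit_middleware(lambda p: "ok", limit=2)
print(limited("a"), limited("b"))  # two calls succeed; a third would raise
```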
Q2. What are alternatives to LangSmith?
A. Popular alternatives for LLM tracing and monitoring include Langfuse, Arize Phoenix, etc.
Q3. What benchmarks are considered industry standard for evaluating coding agents?
A. Industry benchmarks include SWE-bench and BigCodeBench for measuring real-world coding performance.
Passionate about technology and innovation, a graduate of Vellore Institute of Technology. Currently working as a Data Science Trainee, focusing on Data Science. Deeply interested in Deep Learning and Generative AI, eager to explore cutting-edge techniques to solve complex problems and create impactful solutions.