Methods to Construct and Evolve a Customized OpenAI Agent with A-Evolve Utilizing Benchmarks, Abilities, Reminiscence, and Workspace Mutations

On this tutorial, we work immediately with the A-Evolve framework in Colab and construct an entire evolutionary agent pipeline from the bottom up. We arrange the repository, configure an OpenAI-powered agent, outline a customized benchmark, and construct our personal evolution engine to see how A-Evolve really improves an agent by means of iterative workspace mutations. By the code, we use the framework’s core abstractions for prompts, abilities, reminiscence, benchmarking, and evolution, which assist us perceive not simply find out how to run A-Evolve, but in addition find out how to prolong it in a sensible, Colab-friendly means.

import os
import sys
import json
import textwrap
import subprocess
import shutil
from pathlib import Path
from getpass import getpass
from collections import Counter, defaultdict

subprocess.check_call([sys.executable, “-m”, “pip”, “install”, “-q”, “openai>=1.30.0”, “pyyaml>=6.0”, “matplotlib>=3.8”])
REPO_DIR = Path(“/content material/a-evolve”)
if REPO_DIR.exists():
shutil.rmtree(REPO_DIR)
subprocess.check_call([“git”, “clone”, “–depth”, “1”, “https://github.com/A-EVO-Lab/a-evolve.git”, str(REPO_DIR)])
sys.path.insert(0, str(REPO_DIR))

if not os.environ.get(“OPENAI_API_KEY”):
os.environ[“OPENAI_API_KEY”] = getpass(“Enter your OpenAI API key: “).strip()

OPENAI_MODEL = “gpt-4o-mini”

import yaml
import matplotlib.pyplot as plt

import agent_evolve as ae
from agent_evolve.protocol.base_agent import BaseAgent
from agent_evolve.benchmarks.base import BenchmarkAdapter
from agent_evolve.engine.base import EvolutionEngine
from agent_evolve.sorts import Job, Trajectory, Suggestions, StepResult
from agent_evolve.contract.workspace import AgentWorkspace
from openai import OpenAI

consumer = OpenAI(api_key=os.environ[“OPENAI_API_KEY”])

WORKSPACE_ROOT = Path(“/content material/a_evolve_demo_workspace”)
if WORKSPACE_ROOT.exists():
shutil.rmtree(WORKSPACE_ROOT)

(WORKSPACE_ROOT / “prompts”).mkdir(dad and mom=True, exist_ok=True)
(WORKSPACE_ROOT / “abilities”).mkdir(dad and mom=True, exist_ok=True)
(WORKSPACE_ROOT / “reminiscence”).mkdir(dad and mom=True, exist_ok=True)
(WORKSPACE_ROOT / “instruments”).mkdir(dad and mom=True, exist_ok=True)

manifest = banana
with open(WORKSPACE_ROOT / “manifest.yaml”, “w”) as f:
yaml.dump(manifest, f, sort_keys=False)

initial_system_prompt = textwrap.dedent(“””
You’re a exact text-transformation agent.

Clear up the duty precisely.
Be concise.
Return solely the ultimate reply with no rationalization except the duty explicitly asks for JSON.
“””).strip()

(WORKSPACE_ROOT / “prompts” / “system.md”).write_text(initial_system_prompt)

We put together the complete Colab surroundings wanted to run the tutorial from begin to end. We set up the required packages, clone the A-Evolve repository, load the framework imports, and securely gather the OpenAI API key for mannequin entry. We additionally outline the workspace construction and initialize the manifest and system immediate, offering our evolving agent with a sound place to begin inside the A-Evolve framework.

def build_dataset():
practice = [
zebra”
,
berry,
cherry”
,
zebra”
,
mango”
,
lion,
mango”
,
{
“id”: “train-08”,
“rule”: “vowel_parity”,
“input”: “Word: education”,
“answer”: “ODD”
},
]

holdout = [
{
“id”: “holdout-01”,
“rule”: “json_sum”,
“input”: “Numbers: 100, 1, 9”,
“answer”: ‘{“sum”:110}’
},
{
“id”: “holdout-02”,
“rule”: “acronym_upper”,
“input”: “Create the acronym from: artificial general intelligence”,
“answer”: “AGI”
},
mango”
,
{
“id”: “holdout-04”,
“rule”: “vowel_parity”,
“input”: “Word: aeroplane”,
“answer”: “ODD”
},
]
return practice, holdout

TRAIN_DATA, HOLDOUT_DATA = build_dataset()

def normalize_text(x: str) -> str:
return x.strip().substitute(” “, “”)

class MiniTextBenchmark(BenchmarkAdapter):
def __init__(self):
self.practice = TRAIN_DATA
self.holdout = HOLDOUT_DATA

def get_tasks(self, cut up: str = “practice”, restrict: int = 10):
knowledge = self.practice if cut up == “practice” else self.holdout
duties = []
for row in knowledge[:limit]:
duties.append(
Job(
id=row[“id”],
enter=row[“input”],
metadata={
“rule”: row[“rule”],
“reply”: row[“answer”]
}
)
)
return duties

def consider(self, process: Job, trajectory: Trajectory):
pred = trajectory.output.strip()
gold = process.metadata[“answer”].strip()
success = normalize_text(pred) == normalize_text(gold)
element = {
“rule”: process.metadata[“rule”],
“gold”: gold,
“pred”: pred,
“enter”: process.enter,
“success”: success
}
rating = 1.0 if success else 0.0
return Suggestions(
success=success,
rating=rating,
element=json.dumps(element, ensure_ascii=False),
uncooked=element
)

SKILL_ROUTING = {
“json_sum”: [“json”, “sum”],
“acronym_upper”: [“acronym”, “uppercase”],
“pipe_unique_sorted_lower”: [“unique”, “sorted”, “lowercase”, “pipe”],
“vowel_parity”: [“vowel”, “odd”, “even”, “parity”]
}

We outline the coaching and holdout datasets used to measure the agent earlier than and after evolution. We construct a customized benchmark class that packages every instance into A-Evolve duties and evaluates predictions in opposition to actual anticipated outputs. We additionally arrange the routing hints for abilities, which prepares the system to attach completely different process sorts with the best behavioral patterns later within the workflow.

class ColabAEResolverAgent(BaseAgent):
def __init__(self, workspace_dir: str | Path, mannequin: str = OPENAI_MODEL):
self.mannequin = mannequin
tremendous().__init__(workspace_dir)

def _pick_relevant_skills(self, process: Job):
rule = process.metadata.get(“rule”, “”)
chosen = []
for talent in self.abilities:
hay = f”{talent.title} {talent.description}”.decrease()
if rule == “json_sum” and (“json” in hay or “sum” in hay):
chosen.append(talent)
elif rule == “acronym_upper” and (“acronym” in hay or “uppercase” in hay):
chosen.append(talent)
elif rule == “pipe_unique_sorted_lower” and any(ok in hay for ok in [“unique”, “sorted”, “lowercase”, “pipe”]):
chosen.append(talent)
elif rule == “vowel_parity” and any(ok in hay for ok in [“vowel”, “odd”, “even”, “parity”]):
chosen.append(talent)
return chosen[:3]

def remedy(self, process: Job) -> Trajectory:
relevant_skills = self._pick_relevant_skills(process)
relevant_skill_texts = []
for s in relevant_skills:
relevant_skill_texts.append(self.get_skill_content(s.title))

memory_text = “n”.be part of(
[f”- {m.get(‘content’, ”)}” for m in self.memories[-8:]]
).strip()

skill_block = “nn”.be part of(relevant_skill_texts).strip()
if not skill_block:
skill_block = “(no abilities loaded but)”

if not memory_text:
memory_text = “(no reminiscence but)”

user_prompt = textwrap.dedent(f”””
TASK RULE: {process.metadata.get(“rule”)}
TASK INPUT:
{process.enter}

ACTIVE SYSTEM PROMPT:
{self.system_prompt}

RELEVANT SKILLS:
{skill_block}

RECENT MEMORIES:
{memory_text}

Clear up the duty precisely.
Return solely the ultimate reply.
“””).strip()

response = consumer.chat.completions.create(
mannequin=self.mannequin,
temperature=0,
messages=[
{“role”: “system”, “content”: “You are an exact text-transformation agent.”},
{“role”: “user”, “content”: user_prompt}
]
)

output = (response.selections[0].message.content material or “”).strip()

self.keep in mind(
content material=f”Job {process.id} beneath rule {process.metadata.get(‘rule’)} produced output: {output}”,
class=”episodic”
)

return Trajectory(
task_id=process.id,
output=output,
steps=[
{
“rule”: task.metadata.get(“rule”),
“used_skills”: [s.name for s in relevant_skills],
“system_prompt_chars”: len(self.system_prompt),
“memory_items_seen”: len(self.recollections)
}
]
)

SKILL_TEMPLATES = {
“json_sum”: textwrap.dedent(“””
—
title: json-sum-exact
description: Add all integers and output strict compact JSON with the only key sum.
—
# JSON Sum Actual

Process:
1. Extract all integers from the duty enter.
2. Add them.
3. Return precisely one compact JSON object on this format:
{“sum”:NUMBER}
4. Don’t add areas, explanations, markdown, or further keys.
“””).strip(),

“acronym_upper”: textwrap.dedent(“””
—
title: acronym-upper-exact
description: Construct an uppercase acronym by taking the primary letter of every phrase.
—
# Acronym Higher Actual

Process:
1. Determine the phrase after the colon.
2. Take the primary letter of every phrase.
3. Convert each letter to uppercase.
4. Return solely the ultimate acronym, with no punctuation or rationalization.
“””).strip(),

“pipe_unique_sorted_lower”: textwrap.dedent(“””
—
title: pipe-unique-sorted-lower
description: Normalize tokens to lowercase, deduplicate them, type ascending, and be part of them with pipes.
—
# Pipe Distinctive Sorted Decrease

Process:
1. Learn the token checklist after the colon.
2. Cut up by commas.
3. Trim areas and lowercase each token.
4. Take away duplicates.
5. Type alphabetically ascending.
6. Be part of with “|” and return solely the ultimate string.
“””).strip(),

“vowel_parity”: textwrap.dedent(“””
—
title: vowel-parity-exact
description: Depend vowels within the phrase and output ODD or EVEN solely.
—
# Vowel Parity Actual

Process:
1. Learn the goal phrase after the colon.
2. Depend vowels utilizing a, e, i, o, u.
3. If the depend is odd, output ODD.
4. If the depend is even, output EVEN.
5. Return solely ODD or EVEN with no further textual content.
“””).strip(),
}

PROMPT_APPENDIX = textwrap.dedent(“””
## STRICT OUTPUT CONTRACT
– Output solely the ultimate reply.
– By no means clarify your reasoning.
– If a process expects JSON, return compact JSON with actual keys solely.
– When a related talent exists, observe it actually.
– Actual format is extra essential than being conversational.
“””).strip()

We implement the customized A-Evolve agent that reads the energetic immediate, abilities, and reminiscence from the workspace and makes use of OpenAI to unravel every process. We design the agent so it selects related abilities, injects latest reminiscence, and returns trajectories within the construction anticipated by the framework. We additionally outline the talent templates and the strict output contract, which function the primary elements that the evolution engine can add to enhance efficiency over time.

class ColabMutationEngine(EvolutionEngine):
def __init__(self):
self.cycle_count = 0

def step(self, workspace: AgentWorkspace, observations, historical past, trial):
self.cycle_count += 1

failed_by_rule = defaultdict(checklist)
for obs in observations:
if not obs.suggestions.success:
failed_by_rule[obs.task.metadata[“rule”]].append({
“task_id”: obs.process.id,
“enter”: obs.process.enter,
“gold”: obs.process.metadata[“answer”],
“pred”: obs.trajectory.output
})

mutated = False
summaries = []

current_prompt = workspace.read_prompt()
if “STRICT OUTPUT CONTRACT” not in current_prompt:
workspace.write_prompt(current_prompt.rstrip() + “nn” + PROMPT_APPENDIX + “n”)
mutated = True
summaries.append(“immediate hardened”)

existing_skill_names = {s.title for s in workspace.list_skills()}

needed_rule_to_skill_name = {
“json_sum”: “json-sum-exact”,
“acronym_upper”: “acronym-upper-exact”,
“pipe_unique_sorted_lower”: “pipe-unique-sorted-lower”,
“vowel_parity”: “vowel-parity-exact”,
}

for rule, fails in failed_by_rule.objects():
skill_name = needed_rule_to_skill_name[rule]
if skill_name not in existing_skill_names:
workspace.write_skill(skill_name, SKILL_TEMPLATES[rule])
mutated = True
summaries.append(f”added talent {skill_name}”)

workspace.add_memory({
“content material”: f”Cycle {self.cycle_count}: rule={rule} failed {len(fails)} time(s). Widespread failure sample: output formatting or process mismatch. Gold examples should be adopted precisely.”,
“rule”: rule,
“examples”: fails[:2]
}, class=”episodic”)

if not failed_by_rule:
workspace.add_memory({
“content material”: f”Cycle {self.cycle_count}: all present coaching duties succeeded. Protect actual formatting conduct.”
}, class=”episodic”)

abstract = ” | “.be part of(summaries) if summaries else “no mutation wanted”
return StepResult(
mutated=mutated,
abstract=abstract,
metadata={
“failed_rules”: checklist(failed_by_rule.keys()),
“num_failed_rules”: len(failed_by_rule),
“cycle”: self.cycle_count
}
)

def evaluate_split(agent, benchmark, cut up=”practice”):
duties = benchmark.get_tasks(cut up=cut up, restrict=100)
rows = []
complete = 0
appropriate = 0
for process in duties:
traj = agent.remedy(process)
fb = benchmark.consider(process, traj)
rows.append({
“task_id”: process.id,
“rule”: process.metadata[“rule”],
“enter”: process.enter,
“gold”: process.metadata[“answer”],
“pred”: traj.output,
“rating”: fb.rating,
“success”: fb.success
})
complete += 1
appropriate += int(fb.success)
rating = appropriate / max(complete, 1)
return rating, rows

def print_table(rows, title, max_rows=20):
print(“n” + “=” * 110)
print(title)
print(“=” * 110)
proven = rows[:max_rows]
for r in proven:
print(f”[{r[‘task_id’]}] rule={r[‘rule’]}”)
print(f” enter : {r[‘input’]}”)
print(f” gold : {r[‘gold’]}”)
print(f” pred : {r[‘pred’]}”)
print(f” rating : {r[‘score’]} success={r[‘success’]}”)
print(“-” * 110)

def show_workspace(root: Path):
print(“n” + “=” * 110)
print(“EVOLVED WORKSPACE SNAPSHOT”)
print(“=” * 110)
for path in sorted(root.rglob(“*”)):
rel = path.relative_to(root)
if path.is_dir():
print(f”[DIR ] {rel}/”)
else:
print(f”[FILE] {rel}”)

def show_skill_contents(root: Path):
skill_files = sorted((root / “abilities”).glob(“*/SKILL.md”))
print(“n” + “=” * 110)
print(“SKILL FILES”)
print(“=” * 110)
if not skill_files:
print(“No talent information but.”)
for sf in skill_files:
print(f”n— {sf.dad or mum.title}/SKILL.md —“)
print(sf.read_text())

We construct a customized evolution engine that inspects failures and decides find out how to mutate the workspace. We use it to harden the immediate, add lacking abilities, and retailer episodic reminiscence in order that the agent regularly learns higher formatting and task-specific conduct throughout cycles. We additionally outline analysis and reporting utilities that assist us rating the agent, examine predictions, and look at the developed workspace clearly.

benchmark = MiniTextBenchmark()
agent = ColabAEResolverAgent(WORKSPACE_ROOT, mannequin=OPENAI_MODEL)
engine = ColabMutationEngine()

baseline_train_score, baseline_train_rows = evaluate_split(agent, benchmark, cut up=”practice”)
baseline_holdout_score, baseline_holdout_rows = evaluate_split(agent, benchmark, cut up=”holdout”)

print(f”Baseline practice rating : {baseline_train_score:.3f}”)
print(f”Baseline holdout rating : {baseline_holdout_score:.3f}”)

print_table(baseline_train_rows, “BASELINE TRAIN RESULTS”)
print_table(baseline_holdout_rows, “BASELINE HOLDOUT RESULTS”)

config = ae.EvolveConfig(
batch_size=8,
max_cycles=4,
egl_window=2
)

evolver = ae.Evolver(
agent=agent,
benchmark=benchmark,
config=config,
engine=engine
)

outcome = evolver.run(cycles=4)

print(“n” + “=” * 110)
print(“A-EVOLVE RUN SUMMARY”)
print(“=” * 110)
print(f”Cycles accomplished : {outcome.cycles_completed}”)
print(f”Remaining practice rating: {outcome.final_score:.3f}”)
print(f”Rating historical past : {outcome.score_history}”)
print(f”Converged : {outcome.converged}”)

agent.reload_from_fs()
final_train_score, final_train_rows = evaluate_split(agent, benchmark, cut up=”practice”)
final_holdout_score, final_holdout_rows = evaluate_split(agent, benchmark, cut up=”holdout”)

print(f”nFinal practice rating : {final_train_score:.3f}”)
print(f”Remaining holdout rating : {final_holdout_score:.3f}”)

print_table(final_train_rows, “FINAL TRAIN RESULTS”)
print_table(final_holdout_rows, “FINAL HOLDOUT RESULTS”)

show_workspace(WORKSPACE_ROOT)
show_skill_contents(WORKSPACE_ROOT)

print(“n” + “=” * 110)
print(“FINAL SYSTEM PROMPT”)
print(“=” * 110)
print((WORKSPACE_ROOT / “prompts” / “system.md”).read_text())

episodic_path = WORKSPACE_ROOT / “reminiscence” / “episodic.jsonl”
if episodic_path.exists():
print(“n” + “=” * 110)
print(“RECENT EPISODIC MEMORY”)
print(“=” * 110)
strains = episodic_path.read_text().strip().splitlines()
for line in strains[-10:]:
print(line)

plt.determine(figsize=(8, 4))
plt.plot(vary(1, len(outcome.score_history) + 1), outcome.score_history, marker=”o”)
plt.xlabel(“Evolution cycle”)
plt.ylabel(“Prepare rating”)
plt.title(“A-Evolve rating historical past”)
plt.grid(True)
plt.present()

print(“n” + “=” * 110)
print(“COMPARISON”)
print(“=” * 110)
print(f”Prepare : {baseline_train_score:.3f} -> {final_train_score:.3f}”)
print(f”Holdout : {baseline_holdout_score:.3f} -> {final_holdout_score:.3f}”)

improved_rules = []
for earlier than, after in zip(sorted(baseline_train_rows, key=lambda x: x[“task_id”]), sorted(final_train_rows, key=lambda x: x[“task_id”])):
if (not earlier than[“success”]) and after[“success”]:
improved_rules.append(after[“rule”])

print(f”Improved practice circumstances by rule: {dict(Counter(improved_rules))}”)

print(“nDone. This pocket book used the actual A-Evolve framework and demonstrated:”)
print(“1) a sound agent workspace”)
print(“2) a BaseAgent subclass”)
print(“3) a BenchmarkAdapter subclass”)
print(“4) an EvolutionEngine subclass”)
print(“5) immediate / talent / reminiscence mutations throughout A-Evolve cycles”)

We put all the things collectively and run the complete A-Evolve loop from baseline analysis to post-evolution evaluation. We measure the agent earlier than coaching, execute a number of evolution cycles, reload the workspace, after which evaluate the ultimate practice and holdout efficiency to see what improves. We additionally examine the developed immediate, abilities, reminiscence, and rating historical past, which lets us clearly observe how the framework transforms the agent step-by-step.

In conclusion, we efficiently constructed and ran a full A-Evolve workflow reasonably than simply inspecting the repository at a floor stage. We created a sound workspace, plugged in a customized agent, benchmarked it on structured duties, after which developed its conduct by modifying prompts, including abilities, and storing reminiscence throughout cycles. Additionally, we noticed how A-Evolve’s design allows us to deal with agent enchancment as a repeatable engineering course of, by which we are able to measure baseline efficiency, apply managed mutations, and observe how the system turns into extra correct over time.

Try the Full Coding Pocket book right here. Additionally, be happy to observe us on Twitter and don’t neglect to hitch our 120k+ ML SubReddit and Subscribe to our E-newsletter. Wait! are you on telegram? now you possibly can be part of us on telegram as effectively.

What's Hot

NYT Strands hints and solutions for Tuesday, Might 12 (sport #800)

OpenAI Introduces Dawn: A Cybersecurity Initiative That Places Codex Safety on the Middle of Vulnerability Detection and Patch Validation

FAQ on hantavirus and outbreak on cruise ship Hondius

OpenAI Introduces Dawn: A Cybersecurity Initiative That Places Codex Safety on the Middle of Vulnerability Detection and Patch Validation

College students Boo Graduation Speaker After She Calls AI the ‘Subsequent Industrial Revolution’

10 GitHub Repositories to Grasp FastAPI

Ilya Sutskever Stands by His Function in Sam Altman’s OpenAI Ouster: ‘I Didn’t Need It to Be Destroyed’

Constructing internet search-enabled brokers with Strands and Exa

Understanding LLM Distillation Methods – MarkTechPost

NYT Strands hints and solutions for Tuesday, Might 12 (sport #800)

OpenAI Introduces Dawn: A Cybersecurity Initiative That Places Codex Safety on the Middle of Vulnerability Detection and Patch Validation

FAQ on hantavirus and outbreak on cruise ship Hondius

NYT Strands hints and solutions for Tuesday, Might 12 (sport #800)

OpenAI Introduces Dawn: A Cybersecurity Initiative That Places Codex Safety on the Middle of Vulnerability Detection and Patch Validation

FAQ on hantavirus and outbreak on cruise ship Hondius

Usefull link

categories

What's Hot

Related Posts

Usefull link

categories