# Judge prompty: an LLM grader that checks each answer against the expected fact
# and replies with strict JSON {"score": 0|1, "reason": "..."}.
# temperature=0 + response_format json_object keep the grader deterministic and parseable.
(WORK_DIR / "decide.prompty").write_text("""---
name: Judge
model:
  api: chat
  configuration:
    type: openai
    connection: open_ai_connection
    model: gpt-4o-mini
  parameters:
    temperature: 0
    max_tokens: 150
    response_format: {type: json_object}
inputs:
  question: {type: string}
  answer: {type: string}
  expected: {type: string}
---
system:
You are an exacting grader. Decide whether the assistant's answer contains the expected fact (case-insensitive, allowing reasonable phrasing/synonyms). Reply ONLY as JSON: {"score": 0 or 1, "reason": "..."}.
user:
Question: {{question}}
Expected: {{expected}}
Answer: {{answer}}
""")
# Flex-flow evaluator module: wraps the judge prompty per line and adds
# run-level aggregation via __aggregate__.
(WORK_DIR / "eval_flow.py").write_text(textwrap.dedent('''
    import json
    from pathlib import Path

    from promptflow.core import Prompty
    from promptflow.tracing import trace

    BASE = Path(__file__).parent


    class Evaluator:
        """Per-line LLM judge plus run-level metric aggregation."""

        def __init__(self):
            # Load the judge prompty sitting next to this module.
            self.judge = Prompty.load(source=BASE / "decide.prompty")

        @trace
        def __call__(self, question: str, answer: str, expected: str) -> dict:
            """Grade one answer; returns {"score": 0|1, "reason": str}."""
            raw = self.judge(question=question, answer=answer, expected=expected)
            if isinstance(raw, str):
                # Judge may return a JSON string instead of a dict; parse defensively
                # and score 0 on garbage rather than failing the whole line.
                try:
                    raw = json.loads(raw)
                except Exception:
                    raw = {"score": 0, "reason": f"unparseable:{raw[:80]}"}
            return {"score": int(raw.get("score", 0)), "reason": str(raw.get("reason", ""))}

        def __aggregate__(self, line_results):
            """Run-level aggregation. Whatever this returns shows up in pf.get_metrics()."""
            # Skip falsy entries (failed lines yield no result dict).
            scores = [r["score"] for r in line_results if r]
            return {
                "accuracy": (sum(scores) / len(scores)) if scores else 0.0,
                "passed": sum(scores),
                "total": len(scores),
            }
'''))
# Flex-flow spec: points promptflow at the Evaluator class as the flow entry.
(WORK_DIR / "eval.flex.yaml").write_text(
    "$schema: https://azuremlschemas.azureedge.net/promptflow/latest/Flow.schema.json\n"
    "entry: eval_flow:Evaluator\n"
)
print("\n=== Evaluation run ===")
# Evaluate the base run's outputs line-by-line with the flex-flow judge.
eval_run = pf.run(
    flow=str(WORK_DIR / "eval.flex.yaml"),
    data=str(data_path),
    run=base_run,  # link to the base run so ${run.outputs.*} resolves
    column_mapping={
        # NOTE(review): assumes data_path has "question"/"expected" fields and the
        # base run emits an "answer" output — confirm against the upstream flow.
        "question": "${data.question}",
        "expected": "${data.expected}",
        "answer": "${run.outputs.answer}",
    },
    stream=True,
)

eval_details = pf.get_details(eval_run)
print(eval_details)

print("\n=== Aggregated metrics (from __aggregate__) ===")
print(json.dumps(pf.get_metrics(eval_run), indent=2))

# Cross-check the aggregated accuracy by recomputing it from per-line outputs.
import pandas as pd

if "outputs.score" in eval_details.columns:
    # coerce + fillna(0): treat unparseable/missing scores as failures.
    s = pd.to_numeric(eval_details["outputs.score"], errors="coerce").fillna(0)
    print(f"Manual accuracy: {s.mean():.2%} ({int(s.sum())}/{len(s)})")

