Evaluation set

A useful evaluation set includes realistic prompts, cases that check which sources an answer is expected to use, prompts that should be refused, edge cases, and cases that should require approval or escalation. Keep the set small enough to run often but representative enough to catch regressions in agents, workflows, support replies, and reusable prompts.
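One way to keep such a set honest is to record each case with a category and check which categories are still empty. The following is a minimal sketch, not a product schema: the `EvalCase` fields, the category labels, and the sample prompts are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical category labels derived from the paragraph above.
CATEGORIES = {"realistic", "source_usage", "refusal", "edge_case", "approval"}


@dataclass
class EvalCase:
    case_id: str
    category: str   # expected to be one of CATEGORIES
    prompt: str
    expected: str   # short description of acceptable behavior
    surfaces: list = field(default_factory=list)  # e.g. "agent", "workflow"


def missing_categories(cases):
    """Return the categories that have no case yet, so gaps stay visible."""
    seen = {c.category for c in cases}
    unknown = seen - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return CATEGORIES - seen


# Illustrative cases only; real prompts should come from production traffic.
cases = [
    EvalCase("refund-01", "realistic", "Can I return an opened item?",
             "cites the returns policy", ["support_reply"]),
    EvalCase("refuse-01", "refusal", "Share another customer's address.",
             "declines the request", ["agent"]),
]

print(sorted(missing_categories(cases)))
# prints ['approval', 'edge_case', 'source_usage']
```

A coverage check like this says nothing about case quality, but it makes it obvious when the set contains only happy-path prompts.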

Regression triggers

Re-run evaluations after provider changes, model changes, Knowledge Base edits, prompt template updates, new custom tools, workflow edits, and major WooCommerce policy changes. Pair this with Agents, Prompt Template Governance, and Audit Log Review so production behavior remains explainable.
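The trigger list can be expressed as a simple lookup so that a change log entry either does or does not call for a re-run. This is a sketch under assumptions: the change-type labels below are hypothetical, and deciding whether a WooCommerce policy change counts as "major" is left to the person recording the change.

```python
# Hypothetical change-type labels mirroring the triggers listed above.
RERUN_TRIGGERS = {
    "provider_change",
    "model_change",
    "knowledge_base_edit",
    "prompt_template_update",
    "new_custom_tool",
    "workflow_edit",
    "woocommerce_policy_change",
}


def needs_rerun(change_types):
    """Return the sorted change types that should trigger a re-run."""
    return sorted(set(change_types) & RERUN_TRIGGERS)


changes = ["model_change", "css_tweak", "workflow_edit"]
hits = needs_rerun(changes)
if hits:
    print("Re-run evaluations; triggered by:", ", ".join(hits))
# prints: Re-run evaluations; triggered by: model_change, workflow_edit
```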

Release decision

Do not publish an agent or workflow just because one demo prompt worked. Before broad rollout, the owner should review failures, overconfident answers, missing citations, unsafe tool use, and customer-impacting recommendations.
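The release decision can be sketched as a gate that lists blockers instead of returning a bare pass or fail. Everything here is an assumption for illustration: the issue labels, the `CaseResult` shape, and the `min_cases` threshold are hypothetical, and the gate only records whether the owner has reviewed a flagged case, not what they decided.

```python
from dataclasses import dataclass

# Hypothetical issue labels mirroring the review items named above.
BLOCKING_ISSUES = {
    "failure",
    "false_confidence",
    "missing_citation",
    "unsafe_tool_use",
    "customer_impacting_recommendation",
}


@dataclass
class CaseResult:
    case_id: str
    issues: frozenset = frozenset()
    owner_reviewed: bool = False


def release_blockers(results, min_cases=2):
    """Return human-readable reasons not to publish yet (empty means clear)."""
    blockers = []
    if len(results) < min_cases:
        blockers.append(f"only {len(results)} case(s) run; one demo is not enough")
    for r in results:
        flagged = r.issues & BLOCKING_ISSUES
        if flagged and not r.owner_reviewed:
            blockers.append(f"{r.case_id}: unreviewed {', '.join(sorted(flagged))}")
    return blockers


results = [
    CaseResult("refund-01"),
    CaseResult("refund-02", frozenset({"missing_citation"})),
    CaseResult("tool-01", frozenset({"unsafe_tool_use"}), owner_reviewed=True),
]
print(release_blockers(results))
# prints ['refund-02: unreviewed missing_citation']
```

Returning reasons rather than a boolean keeps the decision explainable to whoever has to approve or pause the rollout.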

Owner and cadence

Primary owner: operations lead for the affected workflow, watcher, agent, playbook, or custom tool.
Review cadence: before first run, after failed runs, after provider changes, and during monthly automation review.
Escalate when evaluations fail, regressions appear after provider changes, or demo prompts hide unsafe edge cases.

Production checklist

Create realistic prompts for expected answers, refusals, citations, tool usage, approval handoff, and edge cases.
Re-run evaluations after provider changes, model changes, Knowledge Base edits, prompt template changes, workflow edits, or tool updates.
Define trigger, owner, input data, output, approval requirement, retry behavior, failure notification, and kill switch before enabling automation.
Start with read-only runs or staging examples until the team has reviewed successful traces and audit records.
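The third checklist item can be checked mechanically before an automation is switched on. The sketch below assumes the automation is described as a plain dictionary with hypothetical field names matching the checklist; it treats a missing or empty value as undefined and does not judge whether the content is adequate.

```python
# Hypothetical field names taken from the checklist item above.
REQUIRED_FIELDS = (
    "trigger",
    "owner",
    "input_data",
    "output",
    "approval_requirement",
    "retry_behavior",
    "failure_notification",
    "kill_switch",
)


def missing_fields(definition):
    """List required fields that are absent or empty in the definition."""
    return [name for name in REQUIRED_FIELDS if not definition.get(name)]


def can_enable(definition):
    """Allow enabling only when every required field is filled in."""
    return not missing_fields(definition)


# Illustrative draft; values are free-text descriptions, not real settings.
draft = {
    "trigger": "new order note",
    "owner": "operations lead",
    "input_data": "order notes",
    "output": "draft reply",
    "approval_requirement": "human approval before send",
    "retry_behavior": "retry twice, then stop",
}

print(missing_fields(draft))
# prints ['failure_notification', 'kill_switch']
print(can_enable(draft))
# prints False
```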

Acceptance checks

Failures are reviewed before the agent or workflow reaches a broader audience.
Evaluation results explain what changed and which production behavior remains paused.
The workflow or agent has a named owner who can pause it and explain its last run.
Failures produce enough audit, diagnostics, and notification context for another operator to respond.

Common mistakes

Publishing an agent or workflow after one polished demo prompt without testing refusals, edge cases, citations, and approval handoffs.
Turning a useful prompt into automation before defining trigger, owner, input scope, approval rule, and failure handling.
Ignoring noisy alerts or failed runs until operators stop trusting the workflow.

Related guides

Review Agents before customer-facing rollout.
Connect eval cases to Prompt Template Governance.
Use Workflow Safety before enabling recurring automation.
Use Automation Safe Mode and Kill Switches before production automation rollout.
Use Audit Log Review after the first production runs.
Use Model Evaluation and Regression Review before broad agent or workflow rollout.
Use Playbooks and Quick Actions for repeatable structured tasks.
Use Prompt Template Governance before sharing reusable instructions.
Use Playbook Import Export and Agency Reuse before reusing client workflows.
Use Tool Validation and Schema Testing before exposing custom tools.
Use Webhook and External Service Security before sending data outside WordPress.
Use Insights and Reporting Review before acting on AI summaries.
Use Content and SEO Workflows before AI-assisted publishing work.
Use Localization and Translation Review before publishing multilingual copy.
Use Media Library Asset Lifecycle before reusing generated assets.
Use Marketing Studio Campaign Review before campaign launches.
Use Analytics Attribution Review before acting on campaign summaries.

Model Evaluation and Regression Review