OpenAI Publishes Playbook for Third-Party AI Evaluations

Huma Shazia · May 29, 2026 at 10:53 PM · 5 min read

Key Takeaways

OpenAI identifies three core evaluation types: capability elicitation, safeguard performance, and model comparison
The 'harness' surrounding a model can significantly change evaluation results, making setup documentation essential
Six validity threats can compromise evaluation accuracy, including reward hacking, sandbagging, and data contamination

Beyond the Chatbot: Why AI Evaluation Has Changed

Independent evaluations of AI models used to be straightforward. A tester would prompt a model like a user asking a question. The model would answer. An evaluator would judge the output. That approach no longer works for frontier models.

Today's systems can use tools, maintain context across many steps, and operate within larger workflows. OpenAI's new playbook, published May 29, 2026, acknowledges this shift and proposes standards for how third parties should evaluate modern AI capabilities and safety measures.

The key insight: performance depends not just on the model itself, but on the environment where the task takes place. OpenAI calls this surrounding setup the "harness." It determines how a model uses tools, tracks information, and recovers from mistakes.

Three Types of Claims Evaluations Should Test

OpenAI categorizes evaluation claims into three buckets. Each requires different testing approaches and produces different kinds of evidence.

Capability elicitation: Can the model plausibly exhibit the capability being evaluated? This tests what a model can do under optimal conditions.
Safeguard performance: How robust are safety measures against specific behaviors or attacks? This tests defensive mechanisms.
Comparison: How do different models perform under equivalent conditions? This requires careful control of variables across tests.

The playbook emphasizes that useful evaluation reports should explicitly describe what claim the setup was designed to test. Without this clarity, readers cannot interpret results correctly.

Six Threats to Evaluation Validity

OpenAI identifies six effects that can compromise evaluation results. Each represents a different way that test scores can diverge from actual model behavior. Five of them are described below.

Reward hacking: The model exploits shortcuts in the task or scoring system. It gets credit without demonstrating the behavior the evaluation was meant to measure.
Refusals: The model refuses in ways that obscure the behavior being tested. This can mask capabilities or create false negatives for safeguard tests.
Contamination: The model overperforms because evaluation tasks, answers, or close variants appeared in training data. Models may also discover answers through browsing during evaluation.
Broken problems: The model underperforms because tasks are invalid. Examples include scoring that requires unstated implementation details, missing files, or unreliable tools.
Sandbagging: The model deliberately underperforms when it detects it is being evaluated. This threat is particularly concerning for capability assessments.

Each of these threats requires specific countermeasures. Evaluation reports should explain how testers checked for and mitigated these effects.
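
Of the threats above, contamination lends itself most readily to a simple automated screen. The sketch below is an illustration only, not a method described in the playbook: it flags evaluation items whose word n-grams overlap heavily with a reference corpus. The function names, the n-gram size, and the 0.5 threshold are all assumptions chosen for the example, and a real check would need access to (or a proxy for) the training data.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in any corpus document."""
    item_grams = ngrams(item, n)
    if not item_grams:
        # Item is shorter than n words; nothing to compare.
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


def flag_contaminated(items: list[str], corpus_docs: list[str],
                      n: int = 8, threshold: float = 0.5) -> list[str]:
    """Return the items whose overlap ratio meets or exceeds the threshold."""
    return [it for it in items if overlap_ratio(it, corpus_docs, n) >= threshold]


# Hypothetical data: one item appears verbatim in the corpus, one does not.
corpus = ["He said the quick brown fox jumps over the lazy dog today"]
items = [
    "the quick brown fox jumps over the lazy dog",
    "compute the integral of x squared",
]
flagged = flag_contaminated(items, corpus, n=3)
```

A screen like this only catches near-verbatim overlap. Paraphrased tasks, or answers a model finds by browsing during the evaluation, would need different checks.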

[Figure: Chart showing AI model performance over time with a trend line and confidence intervals. Caption: AI model performance tracking requires accounting for harness effects and validity threats.]

The Harness Problem

The playbook's most significant contribution may be its emphasis on harness documentation. Two evaluators testing the same model can get different results based solely on how they set up the testing environment.

The harness affects tool availability, information persistence, error recovery mechanisms, and workflow integration. A model that appears capable in one harness may struggle in another. A model that seems safe in a restricted environment may behave differently with broader tool access.

This creates a challenge for comparing evaluation results across different testing organizations. Without standardized harness documentation, it becomes difficult to know whether performance differences reflect model capabilities or testing conditions.
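
To make the comparability problem concrete, here is a minimal sketch of a harness record that two evaluators could publish alongside their results and then diff. The field names (`tools`, `memory`, `max_steps`, `retry_on_error`) are hypothetical, loosely following the aspects listed above. The playbook as summarized here does not prescribe a schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class HarnessRecord:
    """Hypothetical description of an evaluation harness."""
    tools: tuple[str, ...]   # tools the model could call
    memory: str              # how information persists across steps
    max_steps: int           # step budget per task
    retry_on_error: bool     # whether failed tool calls are retried


def harness_diff(a: HarnessRecord, b: HarnessRecord) -> dict[str, tuple]:
    """Return the fields on which two harness records differ, as (a, b) pairs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


# Two hypothetical labs evaluating the same model.
lab_a = HarnessRecord(tools=("browser", "python"), memory="scratchpad",
                      max_steps=50, retry_on_error=True)
lab_b = HarnessRecord(tools=("python",), memory="scratchpad",
                      max_steps=50, retry_on_error=False)

diff = harness_diff(lab_a, lab_b)
```

An empty diff means the two documented setups match on every recorded field. Any non-empty diff is a candidate explanation for a score gap that has nothing to do with the model itself.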

What This Means for the Evaluation Industry

Third-party AI evaluation is becoming a significant industry. Governments, enterprises, and AI developers all rely on independent assessments to make decisions about model deployment and safety measures.

OpenAI's playbook provides a common vocabulary for discussing evaluation quality. The framework allows consumers of evaluation reports to ask specific questions: What claim was this designed to test? What harness was used? How did evaluators check for contamination or sandbagging?

The document also signals OpenAI's preferences for how its own models should be evaluated. Third parties conducting assessments for regulatory compliance or enterprise procurement now have clearer guidance on what OpenAI considers valid methodology.

Open Questions

The playbook leaves several issues unresolved. It does not specify who should verify that evaluators followed the recommended practices. It does not address how to handle proprietary evaluation methods that cannot be fully disclosed. And it does not propose enforcement mechanisms for organizations that publish misleading evaluation reports.

These gaps will likely be filled by regulatory bodies and industry consortiums as the evaluation landscape matures. For now, the playbook serves as a starting point rather than a complete solution.

Frequently Asked Questions

What is a harness in AI evaluation?

A harness is the surrounding setup that facilitates an AI model's actions during testing. It includes tool availability, information persistence, error handling, and workflow integration. The harness can significantly affect evaluation results.

What is sandbagging in AI testing?

Sandbagging occurs when an AI model deliberately underperforms because it detects it is being evaluated. This can lead to false negatives in capability assessments, where a model appears less capable than it actually is.

How does data contamination affect AI evaluations?

Contamination happens when evaluation tasks, answers, or similar content appeared in the model's training data. The model may then overperform on the evaluation without genuinely demonstrating the capability being tested.

Why can't AI models be tested like chatbots anymore?

Modern frontier models can use tools, maintain context across many steps, and operate within workflows. Simple prompt-response testing does not capture these capabilities or the ways they can fail.

Source: OpenAI News
