OpenAI Publishes Playbook for Third-Party AI Evaluations

Key Takeaways

- OpenAI identifies three core evaluation types: capability elicitation, safeguard performance, and model comparison
- The 'harness' surrounding a model can significantly change evaluation results, making setup documentation essential
- Six validity threats can compromise evaluation accuracy, including reward hacking, sandbagging, and data contamination
Beyond the Chatbot: Why AI Evaluation Has Changed
Independent evaluations of AI models used to be straightforward. A tester would prompt a model the way a user asks a question, the model would answer, and an evaluator would judge the output. That approach no longer works for frontier models.
Today's systems can use tools, maintain context across many steps, and operate within larger workflows. OpenAI's new playbook, published May 29, 2026, acknowledges this shift and proposes standards for how third parties should evaluate modern AI capabilities and safety measures.
The key insight: performance depends not just on the model itself, but on the environment where the task takes place. OpenAI calls this surrounding setup the "harness." It determines how a model uses tools, tracks information, and recovers from mistakes.
Three Types of Claims Evaluations Should Test
OpenAI categorizes evaluation claims into three buckets. Each requires different testing approaches and produces different kinds of evidence.
- Capability elicitation: Can the model plausibly exhibit the capability being evaluated? This tests what a model can do under optimal conditions.
- Safeguard performance: How robust are safety measures against specific behaviors or attacks? This tests defensive mechanisms.
- Comparison: How do different models perform under equivalent conditions? This requires careful control of variables across tests.
The playbook emphasizes that useful evaluation reports should explicitly describe what claim the setup was designed to test. Without this clarity, readers cannot interpret results correctly.
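One way to make the claim explicit is to treat it as a required field of the report itself. The sketch below is illustrative only: the `ClaimType` enum, the `EvalReport` fields, and the example values are hypothetical names chosen for this article, not a schema taken from OpenAI's playbook.

```python
from dataclasses import dataclass
from enum import Enum


class ClaimType(Enum):
    # The three claim types summarized above.
    CAPABILITY_ELICITATION = "capability_elicitation"
    SAFEGUARD_PERFORMANCE = "safeguard_performance"
    COMPARISON = "comparison"


@dataclass(frozen=True)
class EvalReport:
    # Hypothetical report header; the field names are assumptions.
    title: str
    claim: ClaimType
    claim_statement: str

    def __post_init__(self):
        # Reject reports that pick a claim type but never state the claim.
        if not self.claim_statement.strip():
            raise ValueError("report must state the claim it was designed to test")


report = EvalReport(
    title="Example agentic coding evaluation",
    claim=ClaimType.CAPABILITY_ELICITATION,
    claim_statement="Tests what the model can do with full tool access.",
)
print(report.claim.value)
```

Making the claim a typed field rather than free text means a reader (or a script) can filter reports by what they were designed to show before comparing any scores.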
Six Threats to Evaluation Validity
OpenAI identifies six effects that can compromise evaluation results, each a different way that test scores can diverge from actual model behavior. Five of them are summarized here.
- Reward hacking: The model exploits shortcuts in the task or scoring system. It gets credit without demonstrating the behavior the evaluation was meant to measure.
- Refusals: The model refuses in ways that obscure the behavior being tested. This can mask capabilities or create false negatives for safeguard tests.
- Contamination: The model overperforms because evaluation tasks, answers, or close variants appeared in training data. Models may also discover answers through browsing during evaluation.
- Broken problems: The model underperforms because tasks are invalid. Examples include scoring that requires unstated implementation details, missing files, or unreliable tools.
- Sandbagging: The model deliberately underperforms when it detects it is being evaluated. This threat is particularly concerning for capability assessments.
Each of these threats requires specific countermeasures. Evaluation reports should explain how testers checked for and mitigated these effects.
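As an illustration of what one such check might look like, the sketch below flags possible contamination by measuring word-level n-gram overlap between an evaluation task and a document from a reference corpus. This is a common heuristic, not a method the playbook is known to prescribe; the function names, the default `n = 5`, and the sample strings are all assumptions made for this example.

```python
from __future__ import annotations


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    # Lowercase, whitespace-tokenized word n-grams.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(task: str, corpus_doc: str, n: int = 5) -> float:
    # Fraction of the task's n-grams that also appear in the corpus document.
    task_grams = ngrams(task, n)
    if not task_grams:
        return 0.0  # task shorter than n words: nothing to compare
    return len(task_grams & ngrams(corpus_doc, n)) / len(task_grams)


task = "write a function that returns the sum of two integers"
leaked = "please write a function that returns the sum of two integers quickly"
unrelated = "the weather today is sunny with a light breeze from the west"

print(overlap_fraction(task, leaked))
print(overlap_fraction(task, unrelated))
```

A high overlap does not prove the model saw the task, and a low one does not rule out paraphrased variants, so a score like this is best reported as one piece of evidence alongside the evaluator's other checks.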

The Harness Problem
The playbook's most significant contribution may be its emphasis on harness documentation. Two evaluators testing the same model can get different results based solely on how they set up the testing environment.
The harness affects tool availability, information persistence, error recovery mechanisms, and workflow integration. A model that appears capable in one harness may struggle in another. A model that seems safe in a restricted environment may behave differently with broader tool access.
This creates a challenge for comparing evaluation results across different testing organizations. Without standardized harness documentation, it becomes difficult to know whether performance differences reflect model capabilities or testing conditions.
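A minimal sketch of what structured harness documentation could look like follows. The `HarnessSpec` fields mirror the four aspects named above, but the field names, the `harness_diff` helper, and the two example labs are hypothetical and do not come from the playbook.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HarnessSpec:
    # Hypothetical fields: tool availability, information persistence,
    # error recovery, and workflow integration.
    tools: tuple[str, ...]
    persistent_memory: bool
    max_retries_on_error: int
    workflow: str


def harness_diff(a: HarnessSpec, b: HarnessSpec) -> dict[str, tuple]:
    # Return only the fields where the two harnesses differ.
    a_fields, b_fields = asdict(a), asdict(b)
    return {
        name: (a_fields[name], b_fields[name])
        for name in a_fields
        if a_fields[name] != b_fields[name]
    }


lab_a = HarnessSpec(("shell", "browser"), True, 3, "multi_step_agent")
lab_b = HarnessSpec(("shell",), True, 0, "multi_step_agent")

for field_name, (left, right) in harness_diff(lab_a, lab_b).items():
    print(f"{field_name}: lab A = {left!r}, lab B = {right!r}")
```

If two organizations published records like these next to their scores, a reader could see at a glance that the labs differed in tool access and retry behavior before attributing a score gap to the model.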
What This Means for the Evaluation Industry
Third-party AI evaluation is becoming a significant industry. Governments, enterprises, and AI developers all rely on independent assessments to make decisions about model deployment and safety measures.
OpenAI's playbook provides a common vocabulary for discussing evaluation quality. The framework allows consumers of evaluation reports to ask specific questions: What claim was this designed to test? What harness was used? How did evaluators check for contamination or sandbagging?
The document also signals OpenAI's preferences for how its own models should be evaluated. Third parties conducting assessments for regulatory compliance or enterprise procurement now have clearer guidance on what OpenAI considers valid methodology.
Logicity's Take
Open Questions
The playbook leaves several issues unresolved. It does not specify who should verify that evaluators followed the recommended practices. It does not address how to handle proprietary evaluation methods that cannot be fully disclosed. And it does not propose enforcement mechanisms for organizations that publish misleading evaluation reports.
These gaps will likely be filled by regulatory bodies and industry consortiums as the evaluation landscape matures. For now, the playbook serves as a starting point rather than a complete solution.
Frequently Asked Questions
What is a harness in AI evaluation?
A harness is the surrounding setup that facilitates an AI model's actions during testing. It includes tool availability, information persistence, error handling, and workflow integration. The harness can significantly affect evaluation results.
What is sandbagging in AI testing?
Sandbagging occurs when an AI model deliberately underperforms because it detects it is being evaluated. This can lead to false negatives in capability assessments, where a model appears less capable than it actually is.
How does data contamination affect AI evaluations?
Contamination happens when evaluation tasks, answers, or similar content appeared in the model's training data. The model may then overperform on the evaluation without genuinely demonstrating the capability being tested.
Why can't AI models be tested like chatbots anymore?
Modern frontier models can use tools, maintain context across many steps, and operate within workflows. Simple prompt-response testing does not capture these capabilities or the ways they can fail.
Source: OpenAI News
Huma Shazia
Senior AI & Tech Writer