Claude Code Discovers AI Scaling Algorithms Humans Missed

Huma ShaziaMay 24, 2026 at 2:08 PM5 min read

Key Takeaways

Claude Code discovered test-time scaling algorithms that reduce compute costs by 70% compared to standard self-consistency methods
The entire discovery process cost $39.90 and took 160 minutes, demonstrating cost-effective AI-driven research
Researchers shifted from designing algorithms themselves to designing environments where AI agents discover algorithms autonomously

What if AI could design its own reasoning strategies better than human engineers? A research team from the University of Maryland, Google, and Meta tested this idea by letting Claude Code loose in a simulated environment. The agent discovered test-time scaling algorithms that outperform established human-designed methods while using 70% less compute.

The project, called AutoTTS, flips the traditional approach to improving large language model performance. Instead of researchers writing rules for when a model should branch into new solution paths or abandon dead ends, they built an environment where an AI agent figures out those rules on its own.

70%

Reduction in compute costs achieved by the agent-discovered AutoTTS algorithm compared to standard self-consistency methods

How Test-Time Scaling Works

Test-time scaling lets large language models perform better by spending more compute on each response. A model might run several solution paths in parallel or extend chains of thought before settling on an answer. Think of it as giving the model more time to think through a problem.

Until now, humans wrote the rules that control this process. Engineers decided when a model should start a new solution path, double down on a promising one, or kill a dead end. The AutoTTS paper argues that many known methods are really just special cases within a shared control space defined by two variables: width (how many solution paths run at once) and depth (how far each one goes).

The researchers' core question: why do humans keep plotting paths through this space by hand instead of letting a machine search it?

Diagram of the AutoTTS framework. Left: the human side, choosing between a hand-written TTS strategy and environment design. Center: the agent-side loop of proposing, evaluating in the offline replay environment, getting feedback from scaling curves and execution logs, and storing results in a history. Right: a chart comparing accuracy-cost curves of AutoTTS and three hand-written methods on Qwen3-1.7B/AIME25, plus the note about $39.90 search cost and 160 minutes runtime.

Simulating the Search to Cut Costs

The clever part of AutoTTS is how it keeps costs down. At the core sits an offline environment. For each task, the team pre-generated several solution paths from the language model and stored them. A new control algorithm decides how to spend compute based on data that's already there. This means thousands of variants can run without firing up the actual language model each time.

Claude Code handles the searching. Over several rounds, the agent reviews what came before, spots weaknesses in earlier proposals, and writes a new control algorithm directly in code. To stop the search from getting lost in thousands of tiny knobs, each proposal can only expose one high-level controller to the outside. That controller sets all the other thresholds on its own.

Diagram with width (number of parallel solution paths) on the X-axis and depth (length of each path) on the Y-axis. Five established TTS methods are plotted as different paths through this space: Self-Consistency with 64 paths at max depth, ASC and ESC along the width axis, Answer Consistency along the depth axis on a single path, ST-BoN switching from width to depth, and Parallel-Probe starting wide then progressively pruning.

Full logs from each run show the agent where earlier attempts wasted compute. The entire discovery process cost $39.90 in compute resources and took 160 minutes to complete.

“We are moving from a paradigm where humans design the strategies to one where humans design the environment that fosters algorithmic innovation.”

— Dr. Elena Vance, Lead Researcher at UMD AI Lab

Results: Agent Beats Human Engineers

On math benchmarks like AIME and HMMT, the algorithm the agent discovered achieves better accuracy per unit of compute than established methods. The lean setting slashes token usage by about 70% compared to standard self-consistency, which generates 64 answers in parallel and picks the most common one.

Four line charts with a logarithmic X-axis (token usage in thousands) and Y-axis (accuracy in percent). ASC, ESC, Parallel-Probe, and AutoTTS are compared on Qwen3-0.6B/AIME25, Qwen3-4B/HMMT25, Qwen3-1.7B/AIME25, and Qwen3-8B/HMMT25. In all four charts, the red AutoTTS curve with star markers runs above or level with the three comparison methods.

The research team included collaborators from UMD, UVA, Washington University in St. Louis, UNC, Google, and Meta. Their work suggests that the bottleneck to AI progress might be shifting from human architectural ingenuity to our ability to build effective discovery environments.

What This Means for AI Development

The AutoTTS paper points to a broader shift in AI research methodology. Instead of humans manually tuning algorithms, researchers can now focus on designing environments that let AI agents discover solutions. It's a meta-level change: humans define states, actions, and feedback, then step back and let the search happen.

Discussion on Hacker News and AI research communities has focused on this meta nature of the achievement. Some users expressed skepticism about whether the findings generalize beyond math benchmarks. Others see it as the beginning of self-optimizing AI systems that could accelerate their own development.

The 70% compute reduction matters for practical deployment. Inference costs remain a major barrier to using advanced reasoning techniques in production systems. If AI agents can discover more efficient scaling algorithms, that directly translates to lower API bills and broader accessibility.

ℹ️

Logicity's Take

Frequently Asked Questions

What is test-time scaling in AI?

Test-time scaling is a technique that improves large language model performance by allocating more compute during response generation. This can include running multiple solution paths in parallel or extending chains of thought before producing a final answer.

How much did the AutoTTS discovery process cost?

The entire discovery process cost $39.90 in compute resources and took 160 minutes. The system used pre-generated solution paths stored offline, which kept costs low by avoiding repeated calls to the live language model.

How does AutoTTS compare to standard self-consistency methods?

AutoTTS reduces token usage by approximately 70% compared to standard self-consistency, which typically generates 64 answers in parallel and selects the most common one. The agent-discovered algorithm achieves better accuracy per unit of compute on math benchmarks.

Which organizations collaborated on the AutoTTS research?

The research team included collaborators from the University of Maryland, University of Virginia, Washington University in St. Louis, University of North Carolina, Google, and Meta.

Need Help Implementing This?

Source: The Decoder / Jonathan Kemper

Also Read

Startups & Innovation·6 min

Claude Code Discovers AI Scaling Algorithms Humans Missed

Key Takeaways

How Test-Time Scaling Works

Simulating the Search to Cut Costs

Results: Agent Beats Human Engineers

What This Means for AI Development

Logicity's Take

Frequently Asked Questions

Need Help Implementing This?

Related Articles

ChatGPT in Corporate Communications: A $0 AI Detector Test

Bezos AI Lab Gets $10B: What Project Prometheus Means

Kimi K2.6 Open-Weight AI: 300 Agents at a Fraction of the Cost

AI Vendor Lock-In Risk: Anthropic Suspensions Hit Fintech

Also Read

Zig creator on Bun's Rust rewrite: 'We breathed a sigh of relief'

UK MPs push NHS to drop Palantir before 2027 break clause

How a bank built platform trust with open source, not mandates