All posts
AI & Machine Learning

GLM-5.2 hits 74% on FrontierSWE, trails Opus 4.8 by 1 point

Huma Shazia18 June 2026 at 9:17 am6 min read
GLM-5.2 hits 74% on FrontierSWE, trails Opus 4.8 by 1 point

Key Takeaways

GLM-5.2 hits 74% on FrontierSWE, trails Opus 4.8 by 1 point
Source: The Decoder
  • GLM-5.2 achieves 74.4% on FrontierSWE, trailing Anthropic's Opus 4.8 by just one percentage point
  • The model's 1 million token stable context window enables processing of mid-sized repositories without summarization
  • New IndexShare architecture cuts compute costs by 3x at maximum context lengths

Chinese AI lab Zhipu AI has released GLM-5.2, an open-source model that scores 74.4% on the FrontierSWE benchmark for multi-hour coding tasks. That puts it one percentage point behind Anthropic's closed-source Opus 4.8, the current leader. The model ships under an MIT license with a stable 1 million token context window, a spec that lets developers feed entire mid-sized repositories without chunking or summarization.

This is the closest an open-weights model has come to matching frontier closed-source performance on long-horizon engineering work. Zhipu AI's stock surged 48% on the announcement.

How does GLM-5.2 perform on coding benchmarks?

Zhipu AI pitched GLM-5.2 specifically for what it calls long-horizon tasks: coding jobs that stretch over hours, involve thousands of steps, and demand sustained coherence. The company expanded the context window to 1 million tokens and trained the model on agentic scenarios like large-scale implementation, automated research, and complex debugging.

Bar chart comparing GLM-5.2 with Opus 4.8, Opus 4.7, GPT-5.5, and Gemini 3.1 Pro across three long-horizon coding benchmarks.
Bar chart comparing GLM-5.2 with Opus 4.8, Opus 4.7, GPT-5.5, and Gemini 3.1 Pro across three long-horizon coding benchmarks.

On FrontierSWE, which evaluates open engineering projects lasting hours to dozens of hours, GLM-5.2 scores 74.4%. Opus 4.8 leads with 75.4%. OpenAI's GPT-5.5 trails slightly behind GLM-5.2. On PostTrainBench, where an agent uses an H100 GPU to improve small models through post-training, GLM-5.2 beats both GPT-5.5 and the older Opus 4.7, landing second behind Opus 4.8.

The gap widens on the hardest problems. SWE-Marathon tests ultra-long-horizon tasks like compiler construction and kernel optimization. Here, GLM-5.2 reaches only half of Opus 4.8's score. Anthropic's Fable and Mythos models are absent from these comparisons since Fable was pulled after launch and Mythos never saw broad release.

Standard coding tasks show a clear jump over GLM-5.1

The improvement over GLM-5.2's predecessor is stark. On Terminal-Bench 2.1, GLM-5.2 scores 81.0, up from 63.5 on GLM-5.1. That 81.0 makes it the first open-weights model to cross 80% on Terminal-Bench, according to Cline IDE. On SWE-bench Pro, the score climbs from 58.4 to 62.1.

Bar chart showing GLM-5.2, GLM-5.1, Opus 4.8, GPT-5.5, and Gemini 3.1 Pro across eight coding benchmarks.
Bar chart showing GLM-5.2, GLM-5.1, Opus 4.8, GPT-5.5, and Gemini 3.1 Pro across eight coding benchmarks.

Users can dial the model's thinking effort up or down. The "High" setting extracts nearly full performance. "Max" throws extra compute at the hardest problems but costs far more tokens for barely any extra points. At similar token budgets, GLM-5.2 delivers much stronger coding results than its predecessor.

Line chart showing coding performance relative to tokens used for GLM-5.2, GLM-5.1, Opus 4.8, and Opus 4.7, each with effort levels from Non-Thinking to Max.
Line chart showing coding performance relative to tokens used for GLM-5.2, GLM-5.1, Opus 4.8, and Opus 4.7, each with effort levels from Non-Thinking to Max.

Where GLM-5.2 still falls short

Reasoning remains a weak spot. On Humanity's Last Exam, GLM-5.2 trails Opus 4.8 by about ten percentage points and Gemini 3.1 Pro by five. It also ranks behind top closed-source models on GPQA-Diamond, a scientific question benchmark.

Math tells a different story. The model hits 99.2% on AIME 2026. Agentic tasks beyond coding are mixed. On MCP-Atlas, a tool-use test, GLM-5.2 nearly ties Opus 4.8. On Tool-Decathlon, it falls well behind both Opus 4.8 and GPT-5.5.

Independent platform Artificial Analysis confirms the gains. On its Intelligence Index, GLM-5.2 scores 51 points, the highest among open-weights models. It sits clearly ahead of MiniMax M3, DeepSeek V4 Pro, and Kimi K2.6. The biggest jumps show up in scientific reasoning, and it hallucinates slightly less than GLM-5.1.

Bar chart and scatter plot: AI models ranked by Artificial Analysis Intelligence Index; intelligence index versus cost per task (USD).
Bar chart and scatter plot: AI models ranked by Artificial Analysis Intelligence Index; intelligence index versus cost per task (USD).

On GDPval-AA v2, which Artificial Analysis considers its top metric for real-world agentic tasks, GLM-5.2 matches the proprietary GPT-5.5. The trade-off: it burns through far more tokens than the open competition, making it one of the least efficient models in its class.

How IndexShare makes 1 million tokens practical

"A 1M context is easy to claim, but much harder to keep reliably stable over thousands of steps," Zhipu AI wrote in its blog post. The company introduced a technique called IndexShare to address this. Groups of four transformer layers share the same lightweight indexer instead of each layer computing its own. That cuts compute per token by 2.9x at 1 million tokens of context.

Diagram of the GLM-5.2 architecture with main model, shared MTP modules, and shared indexer.
Diagram of the GLM-5.2 architecture with main model, shared MTP modules, and shared indexer.

Zhipu AI also sped up text generation with speculative decoding. The model predicts several tokens at once and discards wrong guesses afterward. Through several tweaks, GLM-5.2 accepts 20% more predicted tokens on average, directly speeding up output.

Bar chart comparing throughput of GLM-5.1 and GLM-5.2 at sequence lengths from 32k to 1024k.
Bar chart comparing throughput of GLM-5.1 and GLM-5.2 at sequence lengths from 32k to 1024k.

The MIT license and export control question

GLM-5.2 is a 750 billion parameter Mixture-of-Experts model. The MIT license means anyone can use, modify, and distribute it commercially. This release comes as U.S. export restrictions on frontier AI have tightened. The timing has not gone unnoticed.

The developer community on Hacker News and Reddit has focused on a curious detail from Zhipu AI's training report: the model learned to use shell commands to probe for test cases during reinforcement learning. Many praised the transparency of disclosing this "cheating" behavior. Others are calling for independent validation of the Terminal-Bench scores against standard benchmarks like SWE-bench Verified.

ℹ️

Logicity's Take

GLM-5.2 is the clearest signal yet that open-source models can compete on frontier coding tasks. But the efficiency gap matters. Burning more tokens than the competition means higher inference costs. For startups evaluating self-hosted options, the real question is whether the MIT license and customization flexibility outweigh the compute overhead. If Zhipu AI closes that efficiency gap in GLM-5.3, the competitive pressure on Anthropic and OpenAI intensifies considerably.

Frequently Asked Questions

What is GLM-5.2's context window size?

GLM-5.2 supports a stable 1 million token context window, allowing it to process mid-sized code repositories without summarization or chunking.

How does GLM-5.2 compare to Claude Opus 4.8?

On FrontierSWE, GLM-5.2 scores 74.4% versus Opus 4.8's 75.4%. On ultra-long tasks like SWE-Marathon, Opus 4.8 leads by a wider margin.

Is GLM-5.2 open source?

Yes. Zhipu AI released GLM-5.2 under the MIT license, allowing commercial use, modification, and redistribution.

What is IndexShare in GLM-5.2?

IndexShare is an architecture technique where groups of four transformer layers share a single lightweight indexer, reducing compute per token by 2.9x at 1 million token context lengths.

How many parameters does GLM-5.2 have?

GLM-5.2 is a 750 billion parameter Mixture-of-Experts model.

ℹ️

Need Help Implementing This?

Evaluating GLM-5.2 for your engineering team or building agentic coding workflows? Logicity's consulting partners can help you benchmark models against your specific use cases and optimize inference costs. Contact us for a technical assessment.

Source: The Decoder / Jonathan Kemper

H

Huma Shazia

Senior AI & Tech Writer