The Coding Assistant Breakdown: More Tokens Please

SemiAnalysis 4 Information level 4 Published: 2026-04-24T22:15 Scraped: 2026-05-02 10:32

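AI Industry News

Summary

Several leading labs have recently released new coding models in quick succession, with OpenAI officially launching GPT-5.5, built on the all-new "Spud" pre-training architecture. GLM-5.1, Qwen3.6-Plus, Kimi K2.6, Gemini 3.1 Pro, and other models have followed one after another, and the pace of iteration across the industry has accelerated markedly. Vendors' new models generally treat agentic coding and long-horizon task handling as core directions, reflecting intensifying technical competition in the AI coding assistant market.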

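Objective Facts

OpenAI officially released the GPT-5.5 model, based on the "Spud" pre-training architecture.
Over the past three months, multiple coding models, including GLM-5.1 and Qwen3.6-Plus, have been released in quick succession.
Newly released models generally focus on agentic coding and long-horizon task handling.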

OpenAI Anthropic GPT-5.5 GLM-5.1 Qwen3.6-Plus Kimi K2.6 Gemini 3.1 Pro Spud Capybara

Original Text

Since we called out the Claude Code inflection point on February 5th, we have seen a flurry of model releases: Opus, Mythos, Codex, Gemini, DeepSeek, Kimi, Qwen, GLM, MiniMax, Composer, Muse Spark, and more. Today we will break down all of these major model releases, explain when you can and can’t trust the benchmarks, and give our predictions for the future of the agentic coding market.
First we have to highlight GPT-5.5 from OpenAI. In our view, GPT-5.5 is now materially better at some tasks than all other models. We believe that GPT-5.5 has arrived at the frontier. This is a huge change from November when Opus 4.5 was released. At that time, and for the 6 months since, OpenAI’s coding model was not world class in most metrics, leading to Opus being our daily driver. GPT-5.5 is now integrated in our daily work.
Meet the Models
There’s been at least one major lab releasing a new checkpoint purpose-built for coding every week for the past 3 months. GLM-5.1, Qwen3.6-Plus, Kimi K2.6, Composer 2, and Gemini 3.1 Pro all emphasize “agentic coding,” “long-horizon tasks,” or similar capabilities in their headlines. February was a particularly busy month.
Source: SemiAnalysis Tokenomics Dashboard
New checkpoints are cool, but entirely new pre-trains are what really get the people going. Heading into April, the San Francisco rumor mill was ablaze with talk about Capybara and Spud. These are codenames for Anthropic and OpenAI’s newest pre-trains. With the release of GPT-5.5 yesterday, we now have something concrete to discuss.
GPT-5.5
GPT-5.5 is the first public release based on “Spud”. As OpenAI’s first new scale-up in pre-training since the failed GPT-4.5 (sorry, Garlic doesn't count), expectations are obviously high. And despite both NVIDIA and OpenAI claiming with precise language that the model was “trained” on a 100k GB200 NVL72 cluster, this “training” is post-training (RL) only. It never achieved that scale.
OpenAI’s flagship model has historically been cheaper than Anthropic’s, but at $5 per million input tokens and $30 per million output tokens, GPT-5.5’s API price will be 2x more expensive than GPT-5.4 and slightly more expensive than Opus 4.7. The API went live this morning after a brief ChatGPT/Codex-only window due to safety concerns. We’ve been testing the model via Codex and API during an alpha testing period and describe that experience later in this article.
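To make the list prices concrete, here is a minimal sketch of the per-request arithmetic. The $5 and $30 per-million-token rates come from the text above; the function name `request_cost_usd` and the example token counts are hypothetical and chosen only for illustration, and caching or volume discounts are ignored.

```python
# List prices quoted above for GPT-5.5, in USD per one million tokens.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 30.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the quoted list prices (no caching or discounts assumed)."""
    return (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


# Hypothetical agentic turn: 200k tokens of context in, 10k tokens out.
example_cost = request_cost_usd(200_000, 10_000)
print(f"${example_cost:.2f}")  # $1.30
```

In this made-up turn the input side accounts for $1.00 of the $1.30, which is why context-heavy agentic sessions can be dominated by input pricing even though output tokens cost 6x more each.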
Like all their other models, OpenAI will also be offering a priority tier for GPT-5.5 priced at 2.5x the standard rate. Figuring out how to charge users more money for faster tokens is becoming increasingly important, and it’s worth clarifying that priority is totally different from fast mode. Fast mode just makes some vague guarantees like “2.5x faster for 6x the price,” whereas priority makes more conservative, concrete SLAs (e.g. > 50 tokens/sec > 99% of the time). Both Anthropic and OpenAI offer fast mode and priority tiers, but we think Opus 4.6 Fast is the only SKU that’s gained real traction.
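The difference between a vague speed-up claim and a concrete SLA can be sketched as a simple check over measured decode speeds. The thresholds below mirror the illustrative "> 50 tokens/sec > 99% of the time" example from the text; the function name `meets_speed_sla` and the sample data are hypothetical, and real provider SLAs may be defined over different windows or metrics.

```python
def meets_speed_sla(
    tokens_per_sec: list[float],
    min_speed: float = 50.0,
    min_fraction: float = 0.99,
) -> bool:
    """True if more than `min_fraction` of samples are faster than `min_speed`."""
    if not tokens_per_sec:
        raise ValueError("need at least one sample")
    fast = sum(1 for speed in tokens_per_sec if speed > min_speed)
    return fast / len(tokens_per_sec) > min_fraction


# Hypothetical measurements: 1,000 requests, a handful of them slow.
mostly_fast = [80.0] * 995 + [30.0] * 5    # 99.5% above threshold
too_many_slow = [80.0] * 980 + [30.0] * 20  # 98.0% above threshold

print(meets_speed_sla(mostly_fast))    # True
print(meets_speed_sla(too_many_slow))  # False
```

A "2.5x faster" fast mode offers no equivalent pass/fail test, which is the distinction the paragraph above draws.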
Separately, OpenAI also offers GPT-5.3-Codex-Spark, but that’s a totally different model built to run on Cerebras. Specifically, it is a distilled version of GPT-5.3. There’s a big difference between offering faster tokens via running smaller batch sizes, changing the reasoning depth, and routing requests to a priority queue without changing the underlying model (priority and fast mode) vs running a dumber, smaller model (Codex Spark).
Source: SemiAnalysis
Also released is GPT-5.5 Pro, which is only available via ChatGPT and API. It’s meant for scientific research or long range reasoning tasks instead of everyday agentic work. GPT-5.5 Pro earned SOTA scores on BrowseComp and FrontierMath, and is priced at the same $30/180 as GPT-5.4 Pro. We expect to see more announcements about GPT-5.5 Pro making scientific discoveries soon.
Both the standard and pro models offer different levels of reasoning (xhigh, high, medium, low, and non-reasoning), which trade off cost against capability. As has been clear since the release of strawberry/o1, higher reasoning levels lead to better outputs but require more tokens, and users have to wait longer for a response.
Relatedly, OpenAI advertised in their model card that GPT-5.5 scores higher on benchmarks than 5.4 while simultaneously using fewer tokens. In other words, it’s more “token efficient.” This is an extremely important concept to understand, and we believe it will become a major talking point this year. As we explained and quantified to Tokenomics model subscribers last week, cost per task, not cost per token, is the true north star metric that determines model pricing. Mythos may be 5x more expensive than Opus on a per token basis, but much of that price increase is nullified because Mythos can solve the same problem using fewer tokens. It may also deliver a faster end-to-end response.
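The cost-per-task argument reduces to one multiplication, sketched below. Only the 5x per-token price multiple comes from the text; the blended prices and per-task token counts are hypothetical placeholders, not measured values for Mythos or Opus.

```python
def cost_per_task(price_usd_per_mtok: float, tokens_per_task: float) -> float:
    """Cost of one task: price per million tokens times tokens consumed."""
    return price_usd_per_mtok * tokens_per_task / 1_000_000


def breakeven_token_fraction(price_multiple: float) -> float:
    """Share of the cheaper model's tokens at which the pricier model costs the same per task."""
    return 1.0 / price_multiple


# Hypothetical baseline: $10 per million tokens (blended), 1M tokens per task.
baseline = cost_per_task(10.0, 1_000_000)
# Hypothetical pricier model: 5x the per-token price, but only 40% of the tokens.
pricier = cost_per_task(50.0, 400_000)

print(f"baseline ${baseline:.2f}, pricier ${pricier:.2f}")
print(f"per-task multiple: {pricier / baseline:.1f}x")
print(f"break-even token share at 5x: {breakeven_token_fraction(5.0):.0%}")
```

Under these assumed numbers a 5x per-token premium shrinks to 2x per task, and it would disappear entirely if the pricier model needed only 20% of the tokens.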
Source: OpenAI
Opus 4.7
This all comes a short week after Anthropic’s release of Claude Opus 4.7, a drop-in replacement for Claude Opus 4.6. Opus has been the daily driver for most of SemiAnalysis, and Opus 4.7 is a small improvement. With improved scores on many benchmarks and predictably good vibes, but not a step change, 4.7 has been reluctantly adopted by our team members. Why? Fast mode does not exist yet. For the first time, we have found that many of our engineers are willing to sacrifice a bit of quality (but not too much) for faster speed, claiming that the 2.5x faster for 6x the price tradeoff lets them hit “flow state”.
Source: of our frustration (i.e. Dylan on X)
In practice, the noticeable changes moving from Opus 4.6 → Opus 4.7 have been from features/functionality rather than raw performance. In general, these models have gotten so good that most day-to-day tasks are accomplished successfully, with our engineers’ criticisms of a code edit or PR being more about style, approach, architectural decisions, and token efficiency (i.e. speed) rather than success on functional tests. It is increasingly rare for any of these coding models to go haywire and botch a commit completely.
As a result, the noticeable changes in this transition are:
High-resolution image support, and a clear increase in RL training objectives that include the use of screenshots for frontend styling rather than running tests programmatically via headless browsers and tools like Playwright
An “xhigh” reasoning effort option that slots in between “high” and “max” on the hierarchy of effort (i.e. how much time the model is going to spend reasoning about a task, described earlier)
Thinking content is omitted by default. Of course, you still get charged for these tokens, but you have to opt in to see them.
Task budgets (in beta, and API only) where the model is given a suggestion on how efficiently to complete a task. If the model is given a task budget that is too restrictive, it can take shortcuts or refuse. This is different from max_tokens, which is a hard restriction on output length
Updated token counting, the most critical change when it comes to pricing. 4.7 uses a new tokenizer, which trades off improved performance via more granular token counting for more total token usage. They admit directly that this will lead to increases of up to 35% in token usage. Implicitly, this is a price increase of up to 35%!
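The tokenizer point in the last item is worth spelling out numerically: if the per-token price is unchanged and the same text now counts as more tokens, the bill scales by the same factor. The sketch below assumes the worst-case 35% figure quoted above applies uniformly to a workload; the $100 starting cost and the function name `effective_cost` are hypothetical.

```python
# Upper bound on extra tokens from the new tokenizer, as quoted in the text.
MAX_TOKEN_INFLATION = 0.35


def effective_cost(old_cost_usd: float, token_inflation: float = MAX_TOKEN_INFLATION) -> float:
    """Cost of the same workload after token counts grow, at an unchanged per-token price."""
    return old_cost_usd * (1.0 + token_inflation)


# Hypothetical workload that cost $100 under the old tokenizer.
worst_case = effective_cost(100.0)
print(f"${worst_case:.2f}")  # $135.00
```

Actual inflation will vary by content, so 35% should be read as a ceiling rather than a typical value.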
On model behavior changes, the biggest thing we have noticed in our testing is how 4.7 is using fewer tool calls by default, and using reasoning more. The jury is still out on the benefits here, but in general we don’t like it. Anthropic suggests raising the reasoning effort from high to xhigh or max to increase tool usage. And it seems that our users are doing exactly this in order to let the model bring in enough context to successfully complete a complex task or form a complete multi-step plan. Not exactly the token efficiency tradeoff claimed in the announcement.
Notably, many people have been accusing Anthropic of intentionally degrading the 4.6 model in the lead-up to the 4.7 release. Anthropic has categorically denied these claims, but multiple engineers at SemiAnalysis independently said that over the last few weeks the changes in 4.6 performance have made them “feel a little schizo”. And of course, they were right.
On April 23, a week after the Opus 4.7 release, Anthropic posted a postmortem detailing three bugs that they found in March/April. All three were present for weeks, and affected basically all users of Claude Code. One of the bugs is trivial, two are interesting, and all are real. When the harness is part of the product, the model gets blamed.
Source: Anthropic Postmortem
Notably, the three timelines are March 4 to April 7, March 26 to April 10, and April 16 to April 20. This is weeks and weeks of bugs going unnoticed. Bugs that were introduced by Claude, and likely root-caused by Claude. Live by the sword, die by the sword.
DeepSeek V4
The long-awaited DeepSeek V4 drop is here. DeepSeek took the world by storm last year with its R1 release and since then there have been legitimate questions in the AI community about whether open source models will commoditize intelligence. For those keeping score at home, DeepSeek crashed the market so hard that CEOs were scrambling to explain Jevons paradox. This seems to have played out quite clearly in the 16 months since, with the Great GPU Shortage now upon us.
V4 is an improvement over V3, but it didn’t crash the market today. That said, the achievements of DeepSeek should not be discounted. They open-sourced the weights, a detailed technical report, and updated libraries such as DeepEP, DeepGEMM, and FlashMLA that are widely used by labs around the world. Ironically, DeepSeek is helping American open source AI stay alive.
This release includes two models: DeepSeek-V4-Pro and DeepSeek-V4-Flash. The former is 1.6T total / 49B active, and the latter is 284B total / 13B active. Pro is a step up from V3, which was 671B total / 37B active, while Flash is a step down. We believe that both these architectures are still meaningfully behind their closed-source counterparts on the frontier in terms of both total and active parameter counts. We detail more about how we model the architectures of leading closed source frontier models in our Tokenomics model.
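The parameter counts above translate directly into sparsity figures. The short sketch below computes the share of parameters active per token for each model; the counts are the ones quoted in the text (in billions), while the dictionary and function names are purely illustrative.

```python
# (total parameters, active parameters) in billions, as quoted above.
MODELS = {
    "DeepSeek-V4-Pro": (1600.0, 49.0),
    "DeepSeek-V4-Flash": (284.0, 13.0),
    "DeepSeek-V3": (671.0, 37.0),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_b / total_b


fractions = {name: active_fraction(*sizes) for name, sizes in MODELS.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} active")
# DeepSeek-V4-Pro: 3.1% active
# DeepSeek-V4-Flash: 4.6% active
# DeepSeek-V3: 5.5% active
```

By this measure V4-Pro is sparser than V3: total parameters grow about 2.4x while active parameters grow only about 1.3x.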
The core advancement of V4 over V3 is a move from a 128k context window to 1M context. As a result, all of the main technical advancements are focused on long context performance. These include:
Compressed Sparse Attention (CSA)
Heavily Compressed Attention (HCA)
Manifold-Constrained Hyper-Connections (mHC)
Together, these result in the following claim: “In the one-million-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2.” That’s a 90% reduction in KV cache, way more impactful than Google’s TurboQuant paper last month! NAND Flash investors, watch out.
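The quoted ratios can be turned into reductions and a rough capacity estimate. The 27% and 10% figures come from the claim quoted above; the capacity function assumes a fixed memory budget in which KV cache is the only binding constraint, which is a simplification, and the baseline of 4 concurrent sequences is hypothetical.

```python
# Ratios quoted from the DeepSeek claim (V4-Pro relative to V3.2 at 1M context).
FLOPS_RATIO = 0.27
KV_CACHE_RATIO = 0.10

flops_reduction = 1.0 - FLOPS_RATIO
kv_reduction = 1.0 - KV_CACHE_RATIO


def sequences_in_same_budget(baseline_sequences: float, kv_ratio: float) -> float:
    """Long-context sequences that fit in a fixed KV-cache budget.

    Assumes KV cache is the only limit, so capacity scales as 1 / kv_ratio.
    """
    return baseline_sequences / kv_ratio


# Hypothetical: a deployment that previously held 4 one-million-token sequences.
capacity = sequences_in_same_budget(4, KV_CACHE_RATIO)

print(f"FLOPs reduction: {flops_reduction:.0%}")   # 73%
print(f"KV cache reduction: {kv_reduction:.0%}")   # 90%
print(f"sequences in same budget: {capacity:.0f}")  # 40
```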
On benchmarks, DeepSeek did not feel that standard benchmarks were good at capturing real-world task capability, so they introduced their own set of agentic benchmarks to measure how V4 compared against other SOTA models: Chinese writing, retrieval augmented search, a suite of white-collar tasks with long horizons, and coding. V4 Pro was able to compete with top models on all these tasks but lags behind in key areas. For instance, on especially difficult Chinese writing tasks, Claude Opus 4.7 still beats DeepSeek V4 Pro. Claude mogs Chinese models in its own language.
Unfortunately, using public announcements on model performance benchmarks as a proxy for real world performance is unreliable. Conflicting incentives cause these labs to publish certain benchmarks and not others. Consider this example, in which DeepSeek takes a shot at the Kimi and GLM APIs:
Source: DeepSeek V4 Technical Report
This is the reason why the SemiAnalysis Tokenomics Dashboard tracks all major model performance claims, pricing, release dates, and usage disclosures in an unbiased manner. We also do our own hands-on testing of all the major models. Below is an example of our tracking of meaningful benchmark performance across the major model releases. We will explain later why benchmarks are bad.
Source: Tokenomics Model
Source: Tokenomics Model
DeepSeek also open sourced a Mega-Kernel inside of DeepGEMM that supports both NVIDIA GPUs and Huawei Ascend NPUs. NPU support is claimed, but only the code for SM90 (Hopper) and SM100 (Blackwell) GPUs is released publicly. It is likely a goal to run a meaningful portion of the future inference traffic on Ascends. It is notable however that the parameter size fits just inside the memory domain of an 8x H20 HGX at FP4.
Source: DeepSeek V4 Technical Report
Mega MoE performance across various batch sizes is described in a PR:
Source: DeepGEMM repo
Of course, the key contribution of DeepSeek V4 is that it is open source. Thanks to an all-nighter, our InferenceX team, collaborating with 10x engineers from vLLM/Inferact and NVIDIA, has published day-zero support on our H200 cluster. Support for Blackwell and AMD GPUs using vLLM, SGLang and TRT-LLM with Dynamo is a work in progress.
Source: inferencex.com
Interestingly, with day-zero support on H200 at FP8, this model hits ~150 tok/sec of throughput per GPU at 20 tok/sec interactivity on 8k in 1k out. For reference, V3 hits ~1.3k to 2.3k tok/sec of throughput per GPU at 20 tok/sec interactivity on 8k in 1k out. This is a new model and we expect meaningful optimization in the coming weeks. Watch inferencex.com for real-time improvements.
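To put the day-zero gap in perspective, the sketch below compares the throughput figures quoted above and converts them into a rough count of concurrent users per GPU. The conversion rests on two assumptions the text does not state: that the throughput figure counts only generated tokens, and that every stream decodes at exactly the 20 tok/sec interactivity target. Treat the stream counts as order-of-magnitude estimates only.

```python
# Figures quoted above, all at 20 tok/sec interactivity on 8k in / 1k out.
V4_DAY_ZERO_TOK_PER_SEC_PER_GPU = 150.0
V3_TOK_PER_SEC_PER_GPU_RANGE = (1300.0, 2300.0)
INTERACTIVITY_TOK_PER_SEC = 20.0


def concurrent_streams(throughput_per_gpu: float, interactivity: float) -> float:
    """Approximate streams per GPU, assuming throughput is output tokens only."""
    return throughput_per_gpu / interactivity


# How far V3 is ahead of day-zero V4 at the same interactivity.
gap_low, gap_high = (
    v3 / V4_DAY_ZERO_TOK_PER_SEC_PER_GPU for v3 in V3_TOK_PER_SEC_PER_GPU_RANGE
)
v4_streams = concurrent_streams(V4_DAY_ZERO_TOK_PER_SEC_PER_GPU, INTERACTIVITY_TOK_PER_SEC)

print(f"V3 advantage: {gap_low:.1f}x to {gap_high:.1f}x")
print(f"V4 day-zero streams per GPU: {v4_streams:.1f}")
```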
Source: DeepSeek V4 model card on Huggingface
Overall, DeepSeek V4 is an exceptional engineering release, and is right behind the SOTA frontier. It will be the lowest cost alternative to closed source models, but its capabilities are not at the leading edge. SemiAnalysis’s workflows likely will not be cannibalized by DeepSeek.
VIBEZ: Our Impressions of GPT-5.5 vs Opus 4.7
SemiAnalysis is famous (infamous?) for shilling Claude, and we’ve been testing GPT-5.5 as part of an alpha program with OpenAI the past few weeks.
We think GPT-5.5 is a significant improvement within Codex specifically. Previously, ~all our engineers used Claude exclusively, and use of ChatGPT models for coding was restricted to wrappers like Cursor. Now, most of our engineers switch between Codex and Claude models depending on the task and IDE preference. Here are some quotes:
“What I have really appreciated about Codex recently is how it pulls in a lot of context before making changes to code. Not like just a structural change, but a change that actually requires non trivial ‘thinking’. 4.7 often feels like it just does a quick Explore and then #yolos changes whereas codex pulls in a shit ton of more granular context from the internet + codebase and then makes a directed effort at the ask”
“Currently I use Codex for reviewing PRs/bug hunting, explaining existing code, and creating/revising documentation. Its better at understanding code structure and reasoning about it.”
However, it’s not all positive for OpenAI. Some of our other engineers complained that Codex is still worse at inferring your true intent than Claude Code. Humans naturally give terse and not particularly well thought out instructions when prompting coding agents, and Codex often listens too literally.
Relatedly, another engineer commented that GPT-5.5 feels too conservative when it comes to actually making code changes. Yes, this improves token efficiency, but it comes at the cost of correctness. A similar tradeoff happened from 4.6 → 4.7 as we described previously. Seeing the words “narrow fix” in the output is now a signal to double check the model’s work.
Here’s a concrete example that illustrates our overall impression of the strengths and weaknesses of Codex vs Claude Code well. We asked both Opus 4.6 and GPT-5.5 to make a new dashboard for our accelerator model and gave them the current tokenomics dashboard as an example. As our institutional subscribers know, this dashboard includes a homepage that links to all the different tabs.
Source: SemiAnalysis
Opus 4.6 made an identical looking homepage, whereas Codex ignored it entirely.
Source: SemiAnalysis
If we specifically asked Codex to copy the homepage in the prompt, we’re sure it would’ve done so, but it was unable to infer this intent itself.
With that said, the actual data Codex included in the dashboard was much more accurate than Claude’s (though to be clear neither was perfect on the first pass). This implies stronger reasoning about the data structures and relationships with a relatively complex Excel file on the part of Codex. Meanwhile, many of Claude’s numbers were straight up hallucinated and it made mistakes like including Nvidia GPUs in TPU charts. This tracks with our overall impression that Codex is “smarter” and better at doing complex reasoning to solve harder, more narrowly scoped tasks, whereas Claude is better for more open ended, greenfield problems.
It’s for these reasons that some of our engineers have settled on the following workflow:
Start off with Claude to create an initial plan/scaffolding for new applications or features, and the first implementation/POC step.
Switch to Codex to actually solve the problem or fix bugs
Importantly, before the GPT-5.5 release, ~all of SemiAnalysis used Claude Code for both of these steps. Our use of ChatGPT models had become restricted to Deep Research on the webapp and wrappers like Cursor Bugbot.
Critically, features in the plugins/CLIs are holding Codex back. For example, many of our engineers prefer fast mode with 1M context and use remote control/sandbox plugins to take sessions from laptop to phone and back. Both of these are currently possible with the Claude Code CLI, VSCode Plugin, web app and mobile app, but not the Codex CLI, VSCode Plugin, desktop app, web app or mobile app.
Even if GPT-5.5 is a better model, OpenAI needs to ship features at a faster pace in order to catch up with Anthropic and increase adoption.
Benchmarks are bad but we need to keep using them anyways
The one thing that is always prominently featured in every new model announcement is a table comparing performance on various benchmarks.
Source: every release, man
It’s very tempting to be able to point to a small set of numbers in order to prove the “objective” superiority of your new model release, but many within the AI community have long lamented that benchmarks are no longer a useful proxy for real-world utility. We tend to agree with this point of view. There’s a big difference between claiming to measure a model’s coding/finance/reasoning abilities vs actually doing so in any meaningful capacity.
That being said, we expect all the labs to continue highlighting improved benchmark performance for all future model releases, and the following section will help you separate the signal from the noise.
Anatomy of a benchmark
Each benchmark consists of 3 things:
Tasks: what the model is actually asked to do
The evaluation method: how the model is actually scored
A harness: what tools, instructions, interface, etc the model is given to solve the task
Really understanding the first two is how you determine if a benchmark is any good or not. To illustrate, we’ll walk through some famous benchmarks below in rough chronological order. This will also give you a sense of how benchmarks have changed over time.
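The three components listed above map naturally onto a small data structure. The sketch below is one hypothetical way to model them, not how any particular benchmark is implemented; every name in it (`Task`, `Benchmark`, `toy_model`, and so on) is invented for illustration, and the "model" is a stub that always answers "4".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """What the model is actually asked to do, plus the reference answer."""
    prompt: str
    reference: str


@dataclass(frozen=True)
class Benchmark:
    tasks: list[Task]
    evaluate: Callable[[str, Task], float]  # evaluation method: output -> score
    harness: Callable[[Task], str]          # harness: builds what the model sees

    def run(self, model: Callable[[str], str]) -> float:
        """Average score of `model` across all tasks."""
        scores = [self.evaluate(model(self.harness(task)), task) for task in self.tasks]
        return sum(scores) / len(scores)


def exact_match(output: str, task: Task) -> float:
    return 1.0 if output.strip() == task.reference else 0.0


def plain_harness(task: Task) -> str:
    return f"Answer with a number only.\n{task.prompt}"


def toy_model(prompt: str) -> str:
    return "4"  # stub standing in for a real model call


toy_benchmark = Benchmark(
    tasks=[Task("What is 2 + 2?", "4"), Task("What is 3 + 3?", "6")],
    evaluate=exact_match,
    harness=plain_harness,
)
toy_score = toy_benchmark.run(toy_model)
print(toy_score)  # 0.5
```

Swapping `exact_match` for a looser scorer, or `plain_harness` for one that grants tools, changes the reported number without changing the model, which is why the evaluation method and harness deserve as much scrutiny as the tasks.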
MMLU and multiple choice/simple answer benchmarks
Released by academic researchers in 2020, Measuring Massive Multitask Language Understanding (MMLU) is a set of 15,908 multiple choice questions covering 57 subjects. These questions were manually collected by university students from online sources like standardized tests and college exams/problem sets. All of them have exactly 4 choices and are publicly available, but they range in difficulty from “elementary” to “advanced professional”.
Example MMLU questions. Source: MMLU
MMLU has a minimal harness that essentially just formats the question into a prompt. Tools like web search are not included.
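A harness of this kind can be very small. The sketch below shows one plausible way to format a four-choice question into a prompt and score letter answers; the exact template used in the original MMLU evaluation is not reproduced here, so the layout, the function names, and the sample question (which is made up, not taken from the dataset) are all assumptions.

```python
LETTERS = "ABCD"


def format_mmlu_prompt(question: str, choices: list[str]) -> str:
    """Lay out a four-choice question as plain text ending in 'Answer:'."""
    if len(choices) != len(LETTERS):
        raise ValueError("MMLU-style questions have exactly 4 choices")
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(predicted_letters: list[str], correct_letters: list[str]) -> float:
    """Fraction of questions where the predicted letter matches the key."""
    if len(predicted_letters) != len(correct_letters):
        raise ValueError("prediction and answer lists must be the same length")
    hits = sum(
        p.strip().upper() == c for p, c in zip(predicted_letters, correct_letters)
    )
    return hits / len(correct_letters)


# Made-up sample question for illustration only.
sample_prompt = format_mmlu_prompt(
    "Which planet is closest to the Sun?",
    ["Venus", "Mercury", "Earth", "Mars"],
)
print(sample_prompt)
sample_accuracy = accuracy(["b", "A", "C", "D"], ["B", "B", "C", "D"])
print(sample_accuracy)  # 0.75
```

Because scoring is just a letter comparison, random guessing already yields about 25% on four-choice questions.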