@rasbt: The MiniMax M2 series was one of the most widely used open-weight LLM series earlier this year. Now, we got a technical report with some int...

@rasbt 3 信息等级 3 发布：2026-05-27T15:07 抓取：2026-05-27 17:19

AI 算力

摘要

MiniMax M2技术报告发布，总结了多项技术发现：选择全注意力机制而非混合滑动窗口；线性/稀疏注意力在生产系统中部署困难且前缀缓存支持差；细粒度MoE（128专家top-8）在2B参数规模下推理和代码能力显著提升；训练流程中增加了软件工程agent行为训练。

客观事实

MiniMax M2采用全注意力机制，放弃混合滑动窗口
稀疏注意力在生产环境中部署困难且前缀缓存支持差
细粒度MoE在2B参数下将MATH从19.6提升至24.1

MiniMax M2系列 DeepSeek

原文

The MiniMax M2 series was one of the most widely used open-weight LLM series earlier this year. Now, we got a technical report with some interesting tidbits. I summarized some of them below:

Full attention as an anti-trend?:

They tried hybrid sliding-window attention variants (like so many others, like Xiaomi MiMo, Laguna, Gemma 4, Arcee, Olmo 3, etc.). But even though there were efficiency gains, they said that the production-quality tradeoffs were not worth it for M2.

Linear and sparse attention deployment issues:

They found that linear and sparse attention are attractive on paper because they reduce the cost of long-context attention, but they are harder to make work well in a production agent system.

In particular, they found that these efficient attention variants may be more fragile when KV-like state or intermediate memory is stored in lower precision.

Also, they have worse prefix caching support, which matters a lot when using coding agents (which reuse a lot of the context).

Fine-grained Mixture-of-Experts (MoEs) are useful:

Finally a recent MoE ablation study! It's only on the 2B-active parameter scale, but hey, better than nothing.

Concretely, they compare a baseline with 32 experts and top-2 routing against a fine-grained setup with 128 experts and top-8 routing.

The fine-grained setup improves MATH from 19.6 to 24.1 and HumanEval from 29.7 to 32.5. That's clearly a win for more fine-grained experts (confirming what the DeepSeek MoE paper reported ~2 years ago).

Sophisticated agent pipeline

It's probably no surprise, but this papers confirms that training for agent-like behavior on software engineering task is now a big component of the training pipeline.

They mine GitHub pull requests, builds runnable Docker environments, extracts task-specific test rewards, etc.

Interleaved thinking for context management

Interestingly, they found that removing reasoning blocks from previous turns results in worse performance, especially in multi-step agent tasks. (Another point why long-context support is so important these days).

Speed rewards

It's common to have token usage penalties, but what's interesting is that the MiniMax team adds a task-completion-time reward that depends on wall-clock time. This is to minimize unnecessary (slow) tool calls. Also, I'm thinking that this would encourage agent parallelization (if supported by the harness)

Self-evolution

Looks like self-evolution is also already a big design component of open-weight LLMs. E.g., the paper says that M2.7 already handles 30 to 50 percent of the daily RL iteration workload, modifies its own scaffold, and completed a 100-round autonomous scaffold optimization cycle with a 30 percent gain on internal evaluations.

likes: 136 | retweets: 19 | replies: 12 | views: 8392