← 返回列表

@rasbt: The MiniMax M2 series was one of the most widely used open-weight LLM series earlier this year. Now, we got a technical report with some int...

@rasbt 3 信息等级 3 1 噪音/剔除;2 较弱;3 普通事实;4 重要行业动态;5 极重大事件。该分数是信息显著性,不是投资建议。 发布:2026-05-27T15:07 抓取:2026-05-27 17:19
🔗 原文链接
摘要

MiniMax M2技术报告发布,总结了多项技术发现:选择全注意力机制而非混合滑动窗口;线性/稀疏注意力在生产系统中部署困难且前缀缓存支持差;细粒度MoE(128专家top-8)在2B参数规模下推理和代码能力显著提升;训练流程中增加了软件工程agent行为训练。

客观事实
  • MiniMax M2采用全注意力机制,放弃混合滑动窗口
  • 稀疏注意力在生产环境中部署困难且前缀缓存支持差
  • 细粒度MoE在2B参数下将MATH从19.6提升至24.1
MiniMax M2系列 DeepSeek

原文

The MiniMax M2 series was one of the most widely used open-weight LLM series earlier this year. Now, we got a technical report with some interesting tidbits. I summarized some of them below:

  1. Full attention as an anti-trend?:

They tried hybrid sliding-window attention variants (like so many others, like Xiaomi MiMo, Laguna, Gemma 4, Arcee, Olmo 3, etc.). But even though there were efficiency gains, they said that the production-quality tradeoffs were not worth it for M2.

  1. Linear and sparse attention deployment issues:

They found that linear and sparse attention are attractive on paper because they reduce the cost of long-context attention, but they are harder to make work well in a production agent system.

In particular, they found that these efficient attention variants may be more fragile when KV-like state or intermediate memory is stored in lower precision.

Also, they have worse prefix caching support, which matters a lot when using coding agents (which reuse a lot of the context).

  1. Fine-grained Mixture-of-Experts (MoEs) are useful:

Finally a recent MoE ablation study! It's only on the 2B-active parameter scale, but hey, better than nothing.

Concretely, they compare a baseline with 32 experts and top-2 routing against a fine-grained setup with 128 experts and top-8 routing.

The fine-grained setup improves MATH from 19.6 to 24.1 and HumanEval from 29.7 to 32.5. That's clearly a win for more fine-grained experts (confirming what the DeepSeek MoE paper reported ~2 years ago).

  1. Sophisticated agent pipeline

It's probably no surprise, but this papers confirms that training for agent-like behavior on software engineering task is now a big component of the training pipeline.

They mine GitHub pull requests, builds runnable Docker environments, extracts task-specific test rewards, etc.

  1. Interleaved thinking for context management

Interestingly, they found that removing reasoning blocks from previous turns results in worse performance, especially in multi-step agent tasks. (Another point why long-context support is so important these days).

  1. Speed rewards

It's common to have token usage penalties, but what's interesting is that the MiniMax team adds a task-completion-time reward that depends on wall-clock time. This is to minimize unnecessary (slow) tool calls. Also, I'm thinking that this would encourage agent parallelization (if supported by the harness)

  1. Self-evolution

Looks like self-evolution is also already a big design component of open-weight LLMs. E.g., the paper says that M2.7 already handles 30 to 50 percent of the daily RL iteration workload, modifies its own scaffold, and completed a 100-round autonomous scaffold optimization cycle with a 30 percent gain on internal evaluations.

likes: 136 | retweets: 19 | replies: 12 | views: 8392