
@danielhanchen: We released experimental MTP Qwen3.6 Unsloth GGUFs! Qwen3.6 27B MTP now runs at 140 tokens/s. Qwen3.6 35B-A3B MTP gets 220 tokens/s generat...

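@danielhanchen · Information level: 3 (scale: 1 noise/discard; 2 weak; 3 ordinary fact; 4 important industry development; 5 extremely major event. The score measures information significance and is not investment advice.) · Published: 2026-05-13T12:20 · Scraped: 2026-05-13 16:03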
Summary

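Unsloth released experimental Qwen3.6 MTP GGUFs. The 27B model runs at 140 tokens/s, and the 35B-A3B model reaches 220 tokens/s generation on a single GPU. Both are reported to be more than 1.4x faster than the original GGUFs with no change in accuracy. The recommended maximum number of draft tokens is 2.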

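Objective facts
  • Qwen3.6 MTP GGUFs were released, with support for speculative decoding
  • The 27B model runs at 140 tokens/s
  • The 35B-A3B model reaches 220 tokens/s generation on a single GPU; both models are reported at a >1.4x speed-up over the original GGUFs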
Qwen Unsloth GPU

Original post

We released experimental MTP Qwen3.6 Unsloth GGUFs!

Qwen3.6 27B MTP now runs at 140 tokens/s. Qwen3.6 35B-A3B MTP gets 220 tokens/s generation on a single GPU.

Qwen3.6 27B and 35B-A3B have >1.4x speed-up over the original GGUFs without any change in accuracy.

Guide + GGUFs + Benchmarks: https://t.co/x9BYC3iXCL

In terms of average speedup, we see a 1.4x for dense models at draft tokens = 2 and for the MoE around 1.15 to 1.2x.

We do not recommend more than 2 draft tokens because the acceptance rate drops precipitously from 83% to 50% with 4 draft tokens, and the forward passes for MTP become less beneficial.

Use --spec-type mtp --spec-draft-n-max 2
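
The trade-off behind the two-draft-token recommendation can be sketched with a rough back-of-the-envelope model. This is only an illustration, not how Unsloth produced its benchmarks. It assumes the quoted acceptance rate is the average fraction of drafted tokens that get accepted, that 83% applies at 2 draft tokens, and that each draft pass costs a fixed fraction of a full forward pass. The function names and the `DRAFT_COST` value are hypothetical; `DRAFT_COST` is picked so the two-draft case lands near the reported 1.4x and is not a measured number.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification step.

    Assumption: the main model always contributes one token, and on average
    acceptance_rate * draft_tokens of the drafted tokens are accepted.
    """
    return 1.0 + acceptance_rate * draft_tokens


def estimated_speedup(acceptance_rate: float, draft_tokens: int, draft_cost: float) -> float:
    """Tokens per step divided by the relative compute cost of that step.

    draft_cost is the assumed cost of one draft pass as a fraction of a
    full forward pass (a hypothetical parameter, not taken from the post).
    """
    step_cost = 1.0 + draft_cost * draft_tokens
    return expected_tokens_per_step(acceptance_rate, draft_tokens) / step_cost


# Hypothetical relative cost of one MTP draft pass; illustrative only.
DRAFT_COST = 0.45

# Acceptance rates quoted in the post: 83%, dropping to 50% at 4 draft tokens.
for rate, n in [(0.83, 2), (0.50, 4)]:
    tokens = expected_tokens_per_step(rate, n)
    speedup = estimated_speedup(rate, n, DRAFT_COST)
    print(f"draft tokens={n}: {tokens:.2f} tokens/step, ~{speedup:.2f}x")
```

Under these assumptions, going from 2 to 4 draft tokens barely raises the tokens accepted per step while doubling the drafting work, so the estimated speed-up shrinks. That matches the post's point that extra MTP forward passes become less beneficial as acceptance falls.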

Thanks to Aman for https://t.co/0WKkIC0kyW!

likes: 268 | retweets: 35 | replies: 22 | views: 16336