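Unsloth has released experimental Qwen3.6 MTP GGUFs. On a single GPU, the 27B model reaches 140 tokens/s and the 35B-A3B model reaches 220 tokens/s, a 1.4x speed-up over the original GGUFs with no change in accuracy. The recommended maximum number of draft tokens is 2.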
We released experimental MTP Qwen3.6 Unsloth GGUFs!
Qwen3.6 27B MTP now runs at 140 tokens/s. Qwen3.6 35B-A3B MTP gets 220 tokens/s generation on a single GPU.
Qwen3.6 27B and 35B-A3B have >1.4x speed-up over the original GGUFs without any change in accuracy.
Guide + GGUFs + Benchmarks: https://t.co/x9BYC3iXCL
On average, we see a 1.4x speedup for dense models with 2 draft tokens, and around 1.15x to 1.2x for the MoE.
We do not recommend more than 2 draft tokens because the acceptance rate drops precipitously from 83% to 50% with 4 draft tokens, and the forward passes for MTP become less beneficial.
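To see why extra draft tokens stop paying off, here is a rough back-of-the-envelope sketch in Python. It assumes the acceptance rate means accepted drafts divided by drafted tokens, that the 83% figure corresponds to 2 draft tokens (the post implies this but does not state it), and that every verification step also yields one token from the main model. The function names are illustrative, and this is not how the published benchmarks were computed.

```python
def expected_tokens_per_step(draft_tokens: int, acceptance_rate: float) -> float:
    """Rough estimate of tokens produced per main-model verification step.

    Assumption (not from the post): acceptance_rate is accepted / drafted,
    and each step always yields one extra token from the main model.
    """
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    return 1.0 + draft_tokens * acceptance_rate


def discarded_drafts_per_step(draft_tokens: int, acceptance_rate: float) -> float:
    """Average number of drafted tokens thrown away per step (same assumptions)."""
    return draft_tokens * (1.0 - acceptance_rate)


# Figures from the post: 83% acceptance (assumed to be at 2 drafts), 50% at 4 drafts.
for n, rate in [(2, 0.83), (4, 0.50)]:
    kept = expected_tokens_per_step(n, rate)
    lost = discarded_drafts_per_step(n, rate)
    print(f"{n} draft tokens: {kept:.2f} tokens/step, {lost:.2f} drafts discarded/step")
```

Under these assumptions, going from 2 to 4 draft tokens raises the expected yield only from about 2.66 to 3.0 tokens per step, while the discarded drafts grow from about 0.34 to 2.0 per step. That matches the post's point that the additional MTP forward passes become less beneficial.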
Use --spec-type mtp --spec-draft-n-max 2
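As a hedged illustration of where these flags go, the sketch below assembles a command line in Python. The two speculative-decoding flags are copied from the post. The binary name `llama-server`, the `-m` model flag, and the placeholder path `model.gguf` are assumptions, since the post does not say which program receives the flags or name a file.

```python
import shlex


def build_mtp_command(binary: str, model_path: str, draft_n_max: int = 2) -> list:
    """Return an argv list that enables MTP speculative decoding.

    The --spec-type and --spec-draft-n-max flags come from the post;
    the binary and the -m flag are assumptions made for illustration.
    """
    if draft_n_max < 1:
        raise ValueError("draft_n_max must be at least 1")
    return [
        binary,
        "-m", model_path,
        "--spec-type", "mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


# Hypothetical binary and model path; substitute the ones from the guide.
cmd = build_mtp_command("llama-server", "model.gguf")
print(shlex.join(cmd))
```

The default of 2 for `draft_n_max` mirrors the post's recommendation. The resulting list can be passed to `subprocess.run` once the binary and model path are replaced with real ones.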
Thanks to Aman for https://t.co/0WKkIC0kyW!