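Qwen3.6 MTP Unsloth GGUFs now run up to 1.8x faster thanks to llama.cpp's new --spec-draft-p-min parameter. MTP GGUF models in several sizes from 0.8B to 9B were also released, with support for two speculative decoding algorithms.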
Qwen3.6 MTP Unsloth GGUFs now run 1.8x faster, increased from 1.4x just two days ago!
This is due to llama.cpp adding --spec-draft-p-min 0.75!
Args have also changed from
--spec-type mtp
to
--spec-type draft-mtp
Also increase --spec-draft-n-max from 2 to 6
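A minimal sketch of putting the updated flags together, in case it helps when scripting launches. The flag names and values come from this post; the `llama-server` binary name, the `-m` model flag usage, the model filename, and the helper `build_mtp_args` are assumptions for illustration, so check the MTP guide for the exact invocation.

```python
import shlex


def build_mtp_args(model_path, p_min=0.75, n_max=6, spec_types=("draft-mtp",)):
    """Assemble a command line using the speculative-decoding flags from the post."""
    return [
        "llama-server",                        # assumption: the post only says "llama.cpp"
        "-m", model_path,                      # hypothetical model path
        "--spec-type", ",".join(spec_types),   # was "mtp" before the change
        "--spec-draft-p-min", str(p_min),      # new flag, 0.75 per the post
        "--spec-draft-n-max", str(n_max),      # raised from 2 to 6
    ]


cmd = build_mtp_args("Qwen3.6-4B-MTP.gguf")    # hypothetical filename
print(shlex.join(cmd))
```

Passing `spec_types=("ngram-mod", "draft-mtp")` produces the comma-separated `--spec-type` value used when combining ngram with MTP.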
We also released Qwen3.6-0.8B, 2B, 4B, 9B MTP GGUFs! We'll be providing more soon!
If the updated branch shows a perf regression for you, set --spec-draft-p-min to 0.0 to get the old behavior. We also provided a plot of the old branch (red) vs the new branch (blue / green).
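To give a rough intuition for why 0.0 restores the old behavior, here is a toy sketch. The post does not describe how the threshold works internally, so this assumes --spec-draft-p-min acts as a confidence cutoff: drafting stops as soon as the draft's top-token probability falls below it, up to --spec-draft-n-max tokens. The function `count_drafted` and the probabilities are made up for illustration and are not llama.cpp code.

```python
def count_drafted(draft_probs, p_min=0.75, n_max=6):
    """Count how many draft tokens get proposed before the cutoff triggers.

    draft_probs: top-token probability at each draft step (toy values).
    Assumption: drafting stops at the first step whose probability is below p_min.
    """
    kept = 0
    for p in draft_probs[:n_max]:   # never draft more than n_max tokens
        if p < p_min:               # low-confidence draft: stop early
            break
        kept += 1
    return kept


probs = [0.98, 0.91, 0.60, 0.95, 0.88, 0.97]   # made-up confidences
print(count_drafted(probs, p_min=0.75))  # 2: stops at the 0.60 step
print(count_drafted(probs, p_min=0.0))   # 6: no cutoff, always drafts n_max
```

Under this assumption, a cutoff of 0.0 never triggers, so every step drafts the full n_max tokens as before, while 0.75 skips drafts that are likely to be rejected.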
You can also combine 2 speculative decoding algos: add ngram via --spec-type ngram-mod,draft-mtp. The perf isn't optimized yet, so I'll do more benchmarks to find better numbers. See https://t.co/0WKkIC0kyW
Guide for MTP: https://t.co/x9BYC3iXCL