News desk

The five stories worth opening first

View the news archive
20 May Reddit r/LocalLLaMA

Qwen 3.6 35B GGUF: NTP vs MTP quantization results across GPUs and CPUs

Hey r/LocalLLaMA, We’ve released our ByteShape Qwen 3.6 35B GGUF quantizations in two families: standard NTP (Next Token Prediction or non-MTP) and MTP. [Blog](https://byteshape.com/blogs/Qwen3.6-35B-A3B/) / [Download NTP Models](https://huggingface.co/byteshape/Qwen3.6-35B-A3B-GGUF) / [Download MTP Models](https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF) **TL;DR** * For NTP, “pick the largest quant that fits” worked surprisingly well. * Lower bpw was not automatically better: our largest model was very hard to beat on quality/speed, including prompt processing and token generation. * MTP gave a real GPU generation-speed boost, usually around 20–40%, but the extra memory footprint can change what fits. * MTP speedup is heavily workload dependent. * CPU MTP was not attractive in our tests, so our CPU recommendation remains NTP. * We excluded MMLU from this release because Qwen 3.6 showed answer-format compliance issues in full precision, making it a noisy quantization-comparison signal. For this release, we tried to make the comparison more of a small hardware study than just a model drop. We benchmarked the original model and a broader set of quantized variants across RT…

Writing archive

Essays and opinion pieces with a point of view

View all posts

Connected projects

The wider publishing ecosystem around Kol's Korner

More from the archive

Recent pieces beyond the front row