Frontier labs have their own variants of MLA and certainly their own balance/sca...

onlyrealcuzzo · 2026-05-28T17:54:53 1779990893

> Frontier labs have their own variants of MLA

Yes, variants that are typically 2-3x worse...

Same with speculative decoding... They all do something, but there are known techniques that are substantially better - they just weren't known when development of the previous models started.

amluto · 2026-05-28T19:05:45 1779995145

How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?

onlyrealcuzzo · 2026-05-28T19:07:33 1779995253

It's useful at the local level, where there will be SOTA models developed...

zozbot234 · 2026-05-28T20:10:34 1779999034

Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law).
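To make the Amdahl's law point concrete, here is a rough sketch of the arithmetic. The 20% prefill share is a made-up illustrative number, not a measurement from the patchset, and the function names are hypothetical. Only the 2x and 3x decode figures come from the numbers above.

```python
def overall_speedup(prefill_fraction: float, decode_speedup: float) -> float:
    """Amdahl's law: prefill is the unaccelerated part, decode the accelerated part.

    prefill_fraction is the share of baseline wall-clock time spent in prefill.
    """
    decode_fraction = 1.0 - prefill_fraction
    return 1.0 / (prefill_fraction + decode_fraction / decode_speedup)


def prefill_share_after(prefill_fraction: float, decode_speedup: float) -> float:
    """Share of the new, shorter wall-clock time that prefill takes up."""
    decode_fraction = 1.0 - prefill_fraction
    return prefill_fraction / (prefill_fraction + decode_fraction / decode_speedup)


if __name__ == "__main__":
    # ASSUMPTION: prefill is 20% of baseline wall-clock time (illustrative only).
    prefill_fraction = 0.20
    for decode_speedup in (2.0, 3.0):
        total = overall_speedup(prefill_fraction, decode_speedup)
        share = prefill_share_after(prefill_fraction, decode_speedup)
        print(f"decode {decode_speedup:.0f}x -> overall {total:.2f}x, "
              f"prefill now {share:.0%} of wall-clock")
```

Under that assumed split, the end-to-end gain is always smaller than the decode-only gain, and prefill's share of the remaining time grows as decode gets faster. A workload with longer prompts (a larger prefill fraction) would see an even smaller overall gain.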

MTP will still be highly valuable for interactive use of course.