
Related question: is it at all feasible to store the cache locally, to offload the memory costs, and then send it over the wire when needed?


No, the cache is a few GB for most usual context sizes. It depends on the model architecture, but if you take Gemma 4 31B at 256K context length, it takes 11.6GB of cache.

Note: I picked the values from a blog and they may be inaccurate, but in pretty much all models the KV cache is very large, and it's probably even larger in Claude.
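
For a rough sense of where numbers like that come from, here is a back-of-the-envelope sketch in Python. It assumes plain full attention with no sliding window or cache quantization, and the layer count, KV head count and head dimension are hypothetical placeholders, not the real configuration of Gemma or Claude.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a full-attention KV cache for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=128 * 1024)
print(f"{size / 2**30:.1f} GiB")
```

Real models often come in lower than this kind of estimate because of sliding-window layers, shared KV heads, or a quantized cache, which is why the figure depends so much on architecture.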


To extend your point: the issue isn't really the cost of storing a cache of that size (server-side SSD storage of a few GB isn't expensive). It's that all that data must be moved quickly onto a GPU, in a system where the main constraint is precisely GPU memory bandwidth. That is ultimately the main cost of the cache. If the only cost were keeping a few tens of GB sitting around on their servers, Anthropic wouldn't need to charge nearly as much as they do for it.
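
To put the "moving it is the cost" point in numbers, here is a minimal sketch that divides cache size by link bandwidth. The 11.6GB figure is the one quoted upthread; both bandwidth values are made-up round numbers for illustration, not measurements of any real hardware or of Anthropic's setup.

```python
def transfer_seconds(size_gb, bandwidth_gb_per_s):
    """Lower bound on the time to move a cache of size_gb over a link."""
    return size_gb / bandwidth_gb_per_s


CACHE_GB = 11.6  # figure quoted upthread, may be inaccurate

# Hypothetical bandwidths (GB/s), assumptions for illustration only.
links = {"host RAM -> GPU": 25.0, "SSD -> host RAM": 3.0}

for name, bw in links.items():
    print(f"{name}: {transfer_seconds(CACHE_GB, bw):.2f} s")
```

Even under generous assumptions, that is time during which the link is busy and the request isn't generating tokens, which is the cost storage pricing alone doesn't capture.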


The cost you're talking about doesn't change based on how long the session is idle. No matter what happens, they're storing that state and bringing it back at some point; the only difference is how long it's stored off the GPU between requests.


Are you sure about that? They charge $6.25 / MTok for 5m TTL cache writes and $10 / MTok for 1hr TTL writes for Opus. Unless you believe Anthropic is dramatically inflating the price of the 1hr TTL, that implies there is some meaningful cost to longer caches, and the numbers are such that it's not just the cost of SSD storage or something. Obviously the details are secret, but if I were to guess, I'd say the 5m cache is stored closer to the GPU or even on a GPU, whereas the 1hr cache is further away and costs more to move onto the GPU. Or some other plausible story - you can invent your own!
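
The arithmetic behind that, as a small sketch. The two prices are the ones quoted in this comment; the 200,000-token prompt is an arbitrary example size.

```python
# Prices per million tokens as quoted above, in USD.
PRICE_5M_TTL = 6.25
PRICE_1H_TTL = 10.00


def cache_write_cost(tokens, price_per_mtok):
    """Cost in USD of writing `tokens` tokens to the cache."""
    return tokens / 1_000_000 * price_per_mtok


tokens = 200_000  # arbitrary example prompt size
cost_5m = cache_write_cost(tokens, PRICE_5M_TTL)
cost_1h = cache_write_cost(tokens, PRICE_1H_TTL)
ratio = PRICE_1H_TTL / PRICE_5M_TTL

print(f"5m TTL: ${cost_5m:.2f}, 1hr TTL: ${cost_1h:.2f}, ratio: {ratio:.2f}x")
```

So the TTL is 12 times longer while the write price is 1.6 times higher, which says the cost is not simply proportional to how long the data sits somewhere.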


Storing on GPU would be the absolute dumbest thing they could do. Locking up the GPU memory for a full hour while waiting for someone else to make a request would result in essentially no GPU memory being available pretty rapidly. This type of caching is available from the cloud providers as well, and it isn't tied to a single session or GPU.


> Storing on GPU would be the absolute dumbest thing they could do

No. It’s not dumb. There will be multiple cache tiers in use, with the fastest and most expensive being on-GPU VRAM with cache-aware routing to specific GPUs, followed by progressive eviction to CPU RAM and perhaps SSD after that. That is how vLLM works, as you can see if you look it up, and you can find plenty of information on the multi-tier approach from inference providers, e.g. the new Inference Engineering book by Philip Kiely.

You are likely correct that the 1hr cached data probably mostly doesn’t live on GPU (although it will depend on capacity, they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.
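
A toy sketch of the tiering idea, to make the eviction behaviour concrete. This is not vLLM's or Anthropic's implementation; the class name, the tier names and the per-tier capacities (counted in entries rather than bytes) are all made up for illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier cache: LRU eviction cascades from fast tiers to slow ones."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first.
        self.names = [name for name, _ in capacities]
        self.caps = [cap for _, cap in capacities]
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # evicted from the slowest tier: the entry is dropped
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        if len(tier) > self.caps[level]:
            old_key, old_value = tier.popitem(last=False)  # least recently used
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)  # drop any stale copy first
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier_it_was_found_in) and promote the entry to the fastest tier."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)
                return value, self.names[level]
        return None, None

    def where(self, key):
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return self.names[level]
        return None


cache = TieredKVCache([("gpu", 2), ("cpu", 2), ("ssd", 2)])
for session in ["a", "b", "c"]:
    cache.put(session, f"kv-for-{session}")

print(cache.where("a"))  # oldest session was pushed down a tier
print(cache.get("a"))    # a hit on a slower tier promotes it back
print(cache.where("b"))  # which in turn pushes another session down
```

In this picture a 5m entry would mostly be hit while still in a fast tier, and a 1hr entry would mostly be found further down, which is consistent with (but does not prove) the pricing difference.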


Yesterday I was playing around with Gemma 4 26B A4B with a 3-bit quant and sizing it for my 16GB 9070XT:

  Total VRAM: 16GB
  Model: ~12GB
  128k context size: ~3.9GB

At least I'm pretty sure I landed on 128k... might have been 64k. Regardless, you can see the massive weight (ha) of the meager context size (at least compared to frontier models).
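
The same budget as a quick sketch, using the approximate figures above. The scaling helper assumes the KV cache grows linearly with context length, which holds for plain full attention but not necessarily for models with sliding-window layers, so treat the 64k number as a guess.

```python
TOTAL_VRAM_GB = 16.0
MODEL_GB = 12.0        # approximate, from the figures above
CONTEXT_128K_GB = 3.9  # approximate, from the figures above


def context_gb(ref_gb, ref_tokens, tokens):
    """Scale a known cache size to another context length, assuming linear growth."""
    return ref_gb * tokens / ref_tokens


headroom = TOTAL_VRAM_GB - MODEL_GB - CONTEXT_128K_GB
at_64k = context_gb(CONTEXT_128K_GB, 128 * 1024, 64 * 1024)

print(f"headroom at 128k: {headroom:.1f} GB")
print(f"estimated cache at 64k: {at_64k:.2f} GB")
```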



