Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

From what I understand you shouldn't wait more than 5min between prompts without compacting or clearing or you'll pay for reinitializing the cache. With compaction you still pay but it's less input tokens. (Is compaction itself free?)


Cache ttl on max subscriptions is 1h, FYI.


Only if you set `ENABLE_PROMPT_CACHING_1H`, which was mentioned in the release notes for a recent Claude Code release but doesn't seem to be in the official docs.


subusers supposedly get it automatic again after the fix (and now also with `DISABLE_TELEMETRY=1`)

but if you are api user you must set `ENABLE_PROMPT_CACHING_1H` as i understood

and when using your own api (via `ANTHROPIC_BASE_URL`) ensure `CLAUDE_CODE_ATTRIBUTION_HEADER=0` is set as well... https://github.com/anthropics/claude-code/issues/50085

and check out the other neckbreakers ive found pukes lots of malicious compliance by feels... :/

[BUG] new sessions will *never* hit a (full)cache #47098 https://github.com/anthropics/claude-code/issues/47098

[BUG] /clear bleeds into the next session (what also breaks cache) #47756 https://github.com/anthropics/claude-code/issues/47756

[BUG] uncachable system prompt caused by includeGitInstructions / CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS -> git status https://github.com/anthropics/claude-code/issues/47107


Bruh. It's getting hard to track down all these MAKE_IT_ACTUALLY_WORK settings that default to off for no reason.


For me it's gotten to the point where I have a wrapper script that applies like 5 environment variables and even patches the system prompt strings prior to every Claude Code invocation.

After the Claude Code source code leak someone discovered that some variables are read directly from the process environment. Can't even trust that setting them in ~/.claude/settings.json will work!

I've actually started asking Claude itself to dissect every Claude Code update in order figure out if it broke some part of the Rube Goldberg machine I was forced to set up.


That's the beginning of Googlification of feature evolution, via population statistics rather than quality.

If it increases a KPI by 5% for 95% of users but torpedos the experience for 5%? Ship it.


that sounds like a win-win to me.

on one hand 95% of users get an improved experience. While a competitor gets the chance to build a business for the remaining 5%.


no way, I didn't realise this worked.

My attention span is such that I get side tracked and wind up taking longer than 5 mins quite a bit :D


That'd be awesome but it doesn't reflect what I see. Do you have a source for that? What I see is if take a quick break the session loses ~5% right at the start of the next prompt processing. (I'm currently on max 5x)


Not at my workstation right now, but simply ask claude to analyze jsonl transcript of any session, there are two cache keys there, one is 5m, another 1h. Only 1h gets set. There are also some entries there that will tell you if request was a cache hit or miss, or if cache rewrite happened. I've had claude test another claude and on max 5x subscription, cache miss only happened if message was sent after 1h, or if session was resumed using /resume or --resume (this is a bug that exists since January - all session resumes will cause a full cache rewrite).

However, cache being hit doesn't necessarily mean Anthropic won't just subtract usage from you as if it wasn't hit. It's Anthropic we're talking about. They can do whatever they want with your usage and then blame you for it.


I have heard that if you have telemetry disabled the cache is 5 minutes, otherwise 1h. No clue how true that is however my experience (with telemetry enabled) has been the 1h cache.


They've acknowledged that as a bug and have fixed it.


It's true as far as I can tell, just by my own checking using `/status`. You can also tell by when the "clear" reminder hint shows up. Also if you look at the leaked claude code you can see that almost everything in the main thread is cached with 1H TTL (I believe subagents use 5 minute TTL)


>pay for reinitializing the cache

Why can't they save the kv cache to disk then later reload it to memory?


Probably because the costly operation is loading it onto the GPU, doesn't matter if it's from disk or from your request.


The point of prompt caching is to save on prefill which for large contexts (common for agentic workloads) is quite expensive per token. So there is a context length where storing that KV-cache is worth it, because loading it back in is more efficient than recomputing it. For larger SOTA models, the KV cache unit size is also much smaller compared to the compute cost of prefill, so caching becomes worthwhile even for smaller context.


Isn't that how the kv cache currently works? Of course they could decide to hold on to cache items for longer than an hour, but the storage requirements are pretty significant while the chance of sessions resumption slinks rapidly.


The storage requirements for large-model KV caches are actually comparatively tiny: the per-token size grows far less than model parameters. Of course, we're talking "tiny" for stashing them on bulk storage and slowly fetching them back to RAM. But that should still be viable for very long context, since the time for running prefill is quadratic.


We only have open models to go by, so looking at GLM 5.1 for instance, we're talking about almost 300 GB of kv-cache for a full context window of 200k tokens.

That's hardly tiny.


It’s a shitload of data, and it only works if all the tokens are 100% identical, i.e. all the attention values are exactly the same.

Typically it’s cached for about 5 minutes, you can pay extra for longer caches.


If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same. That's the whole point of this caching in the first place.

If you're willing to incur a latency penalty on a "cold resume" (which is fine for most use-cases), why couldn't they just move it to disk. The size of the KV cache should scale on the order of something like (context_length * n_layers * residual_length). I think for a standard V3-MoE model at 1M token length, this should be on the order of 100G at FP16? And you can surely play tricks with KV compression (e.g. the recent TurboQuant paper). It doesn't seem like an outrageous amount of data to put onto cheap scratch HDD (and it doesn't grow indefinitely since really old conversations can be discarded).


> If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same.

Correct, when you’re using the API you can choose between 60 minute or 5 minute cache writes for this reason, but I believe the subscription doesn’t offer this. 60 minute cache writes are about 25% more expensive than regular cache writes.

I don’t have insights into internals at Anthropic so I don’t know where the pain point is for increasing cache sizes.


Yeah the caching change is probably 90% of “i run out of usage so fast now!” Issues.


Ah I can see how my phrasing might be misleading, but these prompts were made within 5 minutes of each other, the timing I mentioned were what Claude spent working.


is it 5 mins between constant prompting/work or 5 mins as in if i step away from the comp for 5 mins and comp back and prompt again im not subject to reinit?

if it's the latter that's crazy. i dont even know what to do there, compactions already feel like a memory wipe




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: