From what I understand you shouldn't wait more than 5min between prompts without...

gck1 · 2026-04-18T20:20:35 1776543635

Cache ttl on max subscriptions is 1h, FYI.

bashtoni · 2026-04-18T22:25:44 1776551144

Only if you set `ENABLE_PROMPT_CACHING_1H`, which was mentioned in the release notes for a recent Claude Code release but doesn't seem to be in the official docs.

g4cg54g54 · 2026-04-18T22:36:24 1776551784

subusers supposedly get it automatic again after the fix (and now also with `DISABLE_TELEMETRY=1`)

but if you are api user you must set `ENABLE_PROMPT_CACHING_1H` as i understood

and when using your own api (via `ANTHROPIC_BASE_URL`) ensure `CLAUDE_CODE_ATTRIBUTION_HEADER=0` is set as well... https://github.com/anthropics/claude-code/issues/50085

and check out the other neckbreakers ive found pukes lots of malicious compliance by feels... :/

[BUG] new sessions will *never* hit a (full)cache #47098 https://github.com/anthropics/claude-code/issues/47098

[BUG] /clear bleeds into the next session (what also breaks cache) #47756 https://github.com/anthropics/claude-code/issues/47756

[BUG] uncachable system prompt caused by includeGitInstructions / CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS -> git status https://github.com/anthropics/claude-code/issues/47107

andersa · 2026-04-18T22:29:38 1776551378

Bruh. It's getting hard to track down all these MAKE_IT_ACTUALLY_WORK settings that default to off for no reason.

matheusmoreira · 2026-04-19T17:21:01 1776619261

For me it's gotten to the point where I have a wrapper script that applies like 5 environment variables and even patches the system prompt strings prior to every Claude Code invocation.

After the Claude Code source code leak someone discovered that some variables are read directly from the process environment. Can't even trust that setting them in ~/.claude/settings.json will work!

I've actually started asking Claude itself to dissect every Claude Code update in order figure out if it broke some part of the Rube Goldberg machine I was forced to set up.

ethbr1 · 2026-04-19T13:41:29 1776606089

That's the beginning of Googlification of feature evolution, via population statistics rather than quality.

If it increases a KPI by 5% for 95% of users but torpedos the experience for 5%? Ship it.

hnben · 2026-04-21T12:53:52 1776776032

that sounds like a win-win to me.

on one hand 95% of users get an improved experience. While a competitor gets the chance to build a business for the remaining 5%.

plaguuuuuu · 2026-04-19T02:42:44 1776566564

no way, I didn't realise this worked.

My attention span is such that I get side tracked and wind up taking longer than 5 mins quite a bit :D

_blk · 2026-04-18T20:33:05 1776544385

That'd be awesome but it doesn't reflect what I see. Do you have a source for that? What I see is if take a quick break the session loses ~5% right at the start of the next prompt processing. (I'm currently on max 5x)

gck1 · 2026-04-18T20:44:04 1776545044

Not at my workstation right now, but simply ask claude to analyze jsonl transcript of any session, there are two cache keys there, one is 5m, another 1h. Only 1h gets set. There are also some entries there that will tell you if request was a cache hit or miss, or if cache rewrite happened. I've had claude test another claude and on max 5x subscription, cache miss only happened if message was sent after 1h, or if session was resumed using /resume or --resume (this is a bug that exists since January - all session resumes will cause a full cache rewrite).

However, cache being hit doesn't necessarily mean Anthropic won't just subtract usage from you as if it wasn't hit. It's Anthropic we're talking about. They can do whatever they want with your usage and then blame you for it.

Fabricio20 · 2026-04-18T21:00:51 1776546051

I have heard that if you have telemetry disabled the cache is 5 minutes, otherwise 1h. No clue how true that is however my experience (with telemetry enabled) has been the 1h cache.

HarHarVeryFunny · 2026-04-18T21:29:04 1776547744

They've acknowledged that as a bug and have fixed it.

ethanj8011 · 2026-04-18T20:43:31 1776545011

It's true as far as I can tell, just by my own checking using `/status`. You can also tell by when the "clear" reminder hint shows up. Also if you look at the leaked claude code you can see that almost everything in the main thread is cached with 1H TTL (I believe subagents use 5 minute TTL)

krackers · 2026-04-18T22:03:47 1776549827

>pay for reinitializing the cache

Why can't they save the kv cache to disk then later reload it to memory?

stavros · 2026-04-18T23:57:42 1776556662

Probably because the costly operation is loading it onto the GPU, doesn't matter if it's from disk or from your request.

zozbot234 · 2026-04-19T00:12:22 1776557542

The point of prompt caching is to save on prefill which for large contexts (common for agentic workloads) is quite expensive per token. So there is a context length where storing that KV-cache is worth it, because loading it back in is more efficient than recomputing it. For larger SOTA models, the KV cache unit size is also much smaller compared to the compute cost of prefill, so caching becomes worthwhile even for smaller context.

vanviegen · 2026-04-19T07:07:22 1776582442

Isn't that how the kv cache currently works? Of course they could decide to hold on to cache items for longer than an hour, but the storage requirements are pretty significant while the chance of sessions resumption slinks rapidly.

zozbot234 · 2026-04-19T07:29:38 1776583778

The storage requirements for large-model KV caches are actually comparatively tiny: the per-token size grows far less than model parameters. Of course, we're talking "tiny" for stashing them on bulk storage and slowly fetching them back to RAM. But that should still be viable for very long context, since the time for running prefill is quadratic.

vanviegen · 2026-04-19T09:19:47 1776590387

We only have open models to go by, so looking at GLM 5.1 for instance, we're talking about almost 300 GB of kv-cache for a full context window of 200k tokens.

That's hardly tiny.

stingraycharles · 2026-04-19T02:04:50 1776564290

It’s a shitload of data, and it only works if all the tokens are 100% identical, i.e. all the attention values are exactly the same.

Typically it’s cached for about 5 minutes, you can pay extra for longer caches.

krackers · 2026-04-19T03:02:24 1776567744

If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same. That's the whole point of this caching in the first place.

If you're willing to incur a latency penalty on a "cold resume" (which is fine for most use-cases), why couldn't they just move it to disk. The size of the KV cache should scale on the order of something like (context_length * n_layers * residual_length). I think for a standard V3-MoE model at 1M token length, this should be on the order of 100G at FP16? And you can surely play tricks with KV compression (e.g. the recent TurboQuant paper). It doesn't seem like an outrageous amount of data to put onto cheap scratch HDD (and it doesn't grow indefinitely since really old conversations can be discarded).

stingraycharles · 2026-04-19T03:42:28 1776570148

> If I have a conversation with claude then come back 30 minutes later to resume the conversation, the KV values for that prefill prefix are going to be exactly the same.

Correct, when you’re using the API you can choose between 60 minute or 5 minute cache writes for this reason, but I believe the subscription doesn’t offer this. 60 minute cache writes are about 25% more expensive than regular cache writes.

I don’t have insights into internals at Anthropic so I don’t know where the pain point is for increasing cache sizes.

conception · 2026-04-18T18:39:00 1776537540

Yeah the caching change is probably 90% of “i run out of usage so fast now!” Issues.

hgoel · 2026-04-18T18:15:27 1776536127

Ah I can see how my phrasing might be misleading, but these prompts were made within 5 minutes of each other, the timing I mentioned were what Claude spent working.

trueno · 2026-04-18T20:01:29 1776542489

is it 5 mins between constant prompting/work or 5 mins as in if i step away from the comp for 5 mins and comp back and prompt again im not subject to reinit?

if it's the latter that's crazy. i dont even know what to do there, compactions already feel like a memory wipe