This looks fantastic. It also answers the question of whether the "off-by-one" softmax is relevant*

My naive question is...does it work? But that sounds dismissive. At length:

It shows that the baseline model can't respond past a certain length, whereas the proposed model does continue to respond.

But can a model that continues to respond retrieve information far "in the past"?

The demo video is too low-level, at least for my brain. It shows that one model stops responding while the proposed one continues.

I spent about 5 minutes going frame by frame to see if the proposed model ever has to "recall" information from further back, but it looks like no.

Perfection here isn't necessary or even possible AFAIK, i.e. I don't expect it to recall page 1 100% accurately at page 1000. But can it recall _anything_ from it, even if it ignores it?

The great thing about this era and work is we can check. But I hope someone has it up in a HuggingFace space before I figure out how to run it myself. :P

I'm leaning no, based on the sliding window thing. It sounds like there are 4 fixed tokens, then the last (context size - 4) tokens, and that's it.
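To make my reading concrete, here's a rough sketch of which token positions would stay visible under that scheme. This is only my interpretation of the description, not the project's actual code; `kept_positions`, `context_size`, and `num_fixed` are names I made up for illustration.

```python
def kept_positions(seq_len, context_size, num_fixed=4):
    """Return the token positions still visible after seq_len tokens.

    Assumption (my reading, not the real implementation): the first
    num_fixed tokens are always kept, plus the most recent
    (context_size - num_fixed) tokens. Everything in between is dropped.
    """
    if context_size <= num_fixed:
        raise ValueError("context_size must be larger than num_fixed")
    if seq_len <= context_size:
        # Nothing has been evicted yet.
        return list(range(seq_len))
    recent = context_size - num_fixed
    return list(range(num_fixed)) + list(range(seq_len - recent, seq_len))


kept = kept_positions(seq_len=1000, context_size=16)
print(kept)         # the 4 fixed positions, then the 12 most recent ones
print(500 in kept)  # a token from the "middle" is no longer visible
```

If that reading is right, anything between the fixed tokens and the recent window is simply gone, so there would be nothing left to recall from.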

* At the time, there were two camps. One: it's just some random person saying it, and there's prior art on implementations that already do the off-by-one. Two: you'd be surprised how often little things go unnoticed by large groups, and they do matter.
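For anyone who missed that discussion: as I understand it, the "off-by-one" softmax just adds 1 to the denominator, so the attention weights are allowed to sum to less than 1 (a head can effectively attend to nothing). A minimal sketch under that assumption, with function names of my own choosing:

```python
import math


def softmax(xs):
    """Standard softmax: exp(x_i) / sum_j exp(x_j)."""
    m = max(xs)  # shift by the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_off_by_one(xs):
    """Assumed "off-by-one" variant: exp(x_i) / (1 + sum_j exp(x_j)).

    The extra 1 behaves like an implicit logit of 0, so the shift m
    has to take that 0 into account as well.
    """
    m = max(max(xs), 0.0)
    exps = [math.exp(x - m) for x in xs]
    total = math.exp(-m) + sum(exps)  # exp(-m) is the shifted "+1"
    return [e / total for e in exps]


scores = [-10.0, -10.0, -10.0]
print(sum(softmax(scores)))             # weights are forced to sum to 1
print(sum(softmax_off_by_one(scores)))  # weights can stay close to 0
```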