I got qwen3.6:27B running on my 4090 (24GB) with ~128K context leveraging some of the recent turboquant/rotorquant memory optimizations for activations. Highly suggest going up to that. the q4_xl+rotorquant combo is pretty good.
What is your exp on performance +40k tokens? I've not gone past that as I've heard reports that were problems start to arise. I'd be happy to know your experience in that regard.
I'm super happy with the performance, I generally run with 2 parallel slots so I only get about 128K context window. My experience with all llms is that they get more forgetful if you use the full window. (256-512K is the sweet spot for frontier models, 128k works for me with this current qwen)
I forked it to also add rotorquant. This is a specific optimization that uses clifford rotors instead of static compile time random purmutation to store the activations. Reduces space and parameter count for the storage.
Some reference code if you want to throw your agent at it. https://github.com/rapatel0/rq-models