Qwen 3.7 Preview

sleepyeldrazi · 2026-05-18T17:32:20 1779125540

I don't think I can handle another small model release by qwen, I'm still trying to find the limits of 3.6 27B and they are already threatening us with a new one?

But jokes aside, I love the fast iteration, these are most probably again finetunes on the 3.5 architecture that appear better in internal testing, which is still very nice to see. Putting more and more pressure on the bigger labs to perform better is always a good thing.

genxy · 2026-05-18T18:01:02 1779127262

How good must their training pipelines be? Releasing publicly and at this rate has made them very efficient.

sleepyeldrazi · 2026-05-18T18:10:23 1779127823

Finetuning takes few resources; base model training is the slow and expensive part. Architecturally, the 3.5 models are identical to their 3.6 counterparts, which is why there is a consensus that those are probably finetunes rather than re-trained from scratch, much like the finetunes you will see many people publish on huggingface.

genxy · 2026-05-18T18:49:17 1779130157

Understood, but look at their larger cadence over the years and the breadth of models. They are clearly not all finetunes. Meta for all its billions, doesn't have anything comparable.

fgonzag · 2026-05-18T22:15:22 1779142522

In the Chinese AI scene, there seem to be two separate types of companies.

Companies or labs like deepseek that produce fewer but larger and more innovative models, and so seem to be more research oriented.

Then there are companies like z.ai (GLM), Minimax, and Qwen which focus more on commercializing the AI and so produce far more versions, but with far fewer improvements between them (usually fine tunes).

Commercial providers like anthropic probably do the same thing, maybe even without labeling it as a different version if the model is similar enough.

bachmeier · 2026-05-18T21:07:54 1779138474

> Meta for all its billions, doesn't have anything comparable.

Maybe nothing released to the public. I don't know that all of their models are public. I think all they really care about is that they aren't relying on one or two cloud providers for a critical piece of their infrastructure.

Computer0 · 2026-05-18T20:56:28 1779137788

competent leadership goes a long way

throwa356262 · 2026-05-19T17:57:25 1779213445

That was true up to Qwen 3.5, everything after that is made by the same people that made Gemini 3.1 suck

cyanydeez · 2026-05-18T23:17:08 1779146228

still waiting for an update to Qwen3-Coder-Next

kethinov · 2026-05-18T17:50:52 1779126652

Can someone explain what the current state of model benchmarking is? If you try to look up what the best locally runnable model is, you get a bunch of random blog posts using idiosyncratic criteria to rank things seemingly based on one dude's opinion.

Ideally I would love to see a leaderboard with relatively objective ranking criteria that 1. lets you filter by open weight / locally runnable, 2. filter by date of release (nothing older than x), and 3. is agnostic to hardware requirements. I just want to know what the best model is. Let me worry about how I will afford to run it.

I love the llmfit project for seeing what will run on your hardware, but it would be nice to know what I'm missing out on by not having better hardware, thus why objective hardware-agnostic ratings would be helpful.

vessenes · 2026-05-18T17:55:12 1779126912

That would be nice, but it's not going to be possible.

Any open benchmark has a very short life, since it will be pulled in and DPO / RL trained quickly for benchmaxxing purposes. So, you'll need a private test to have a hope of something fair. (These also get leaked over time, btw, so even then there's a window of usability).

These are expensive to run.

Now consider that there might be 15-20 viable quants for a given open model release; someone would have to want to pay for these private evals to be run on them. Even then, a good read through unsloth's commits and blog posts will remind you that there's quite a lot of engineering work to be done to get model inference working properly, even for models released by frontier or near-frontier labs. So, you'd want to make sure that you have a replicable 'best engineered' deployment to evaluate, or at least one that's closest to your hardware and fits the bill.

Upshot - it's much faster to download and try out a model, and possibly cheaper too. Well, cheaper since hugging face is paying the bandwidth bills.

cyanydeez · 2026-05-18T23:21:18 1779146478

there are benchmarks that have nothing to do with the training material, but with how the models are capable of things like reading code: https://needle-bench.cc/

Generally, you give them a document and you ask them to retrieve some subsection of the document then rate them on what they retrieved.

You can always find enough random documents, or create your own, to always run these and you can make it arbitrarily long. It's definitely a valid non-maxxable context test.
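A minimal sketch of the retrieval test described above, assuming a random-text haystack and a verbatim-match score. The `stand_in_model` function is a hypothetical placeholder for a real model call, and the `SECRET:` marker is made up for illustration.

```python
import random
import string


def make_haystack(n_lines, needle, seed=0):
    """Build n_lines of random filler text and hide `needle` at a random line."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + " "
    lines = ["".join(rng.choices(alphabet, k=60)) for _ in range(n_lines)]
    position = rng.randrange(n_lines + 1)
    lines.insert(position, needle)
    return "\n".join(lines), position


def score_retrieval(answer, needle):
    """Pass only if the model's answer contains the needle verbatim."""
    return needle.strip() in answer


def stand_in_model(prompt):
    """Hypothetical placeholder for a real model call: returns the marked line."""
    for line in prompt.splitlines():
        if line.startswith("SECRET:"):
            return line
    return ""


needle = "SECRET: the launch code is 7421"
document, position = make_haystack(n_lines=2000, needle=needle)
answer = stand_in_model(document + "\n\nQuote the line that starts with SECRET:")
print(position, score_retrieval(answer, needle))
```

Raising `n_lines` makes the document arbitrarily long, and changing `seed` gives a fresh haystack that cannot have been seen in training.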

jononor · 2026-05-19T05:16:20 1779167780

This seems like a viable eval strategy. Presumably finding a bug requires some degree of understanding of the code, beyond just information retrieval. However it probably does not measure things like prompt adherence or ability to create code that implements a specification?

cyanydeez · 2026-05-19T15:06:44 1779203204

you can extend the test pretty easily. run through design turns and ask it for it again and again. effectively measure context length.

ask it to modify lines 120-130 and add more context, etc.

we have rudimentary pre-LLM algorithms that can measure hamming distance and hashing.

you could even go all https://en.wikipedia.org/wiki/Jabberwocky to see if its sense of context is easily polluted.

the point though is there are benchmarks beyond pelican on a bike that can't be tokenmaxxed and prove real value in capabilities
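A rough sketch of how the hamming distance and hashing checks mentioned above could grade a "modify lines 120-130" style task: hash everything outside the requested range and confirm it did not change. The function names here are illustrative, not taken from any existing benchmark.

```python
import hashlib


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def digest(lines):
    """SHA-256 of a list of lines, used as a cheap equality fingerprint."""
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def untouched_outside(original, edited, start, end):
    """True if every line outside the 1-based range [start, end] is unchanged.

    The tail is compared from the end of the file, so the edited range is
    allowed to grow or shrink.
    """
    o, e = original.splitlines(), edited.splitlines()
    tail_len = len(o) - end
    head_ok = digest(o[:start - 1]) == digest(e[:start - 1])
    tail_ok = digest(o[end:]) == digest(e[len(e) - tail_len:])
    return head_ok and tail_ok


original = "\n".join(f"line {i}" for i in range(1, 11))
good_edit = original.replace("line 5", "line five\nextra line")
bad_edit = original.replace("line 2", "line two")

print(untouched_outside(original, good_edit, 5, 5))
print(untouched_outside(original, bad_edit, 5, 5))
print(hamming("karolin", "kathrin"))
```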

sigmoid10 · 2026-05-18T18:00:07 1779127207

>I just want to know what the best model is. Let me worry about how I will afford to run it.

This is a very typical manager question that I suppose many people have who fail to see the simple truth: There is no "best" model. There are only best models for certain use-cases. Sometimes you'll find these in custom community leaderboards on platforms like huggingface, but for most business applications you'll probably have to come up with your own benchmark. Most common benchmarks are pretty worthless by now because all the usual ones are being gamed hard by model providers, to the point that there are now sometimes drastic differences between models that perform very similarly on common benchmarks.

sleepyeldrazi · 2026-05-18T18:07:12 1779127632

The best thing I have come up with is just make a bunch of prompts / tasks that I personally care about and need a model to know how to do. As an example, when qwen3.6 27B dropped, I ran it, kimi, claude and glm 5/5.1 on a bunch of LLM-architecture specific tasks (stuff like 'implement an incremental KV-cache for autoregressive transformer inference' or 'implement flash Attention backward pass with D-optimization') and analyze the results, who made tests, are the tests valid, does their implementation actually work or are they only claiming it to, that sort of thing.

It is a day/weekend worth of work, but I think this is the best way to determine if the model fits your need specifically. This is what led me to finding out that qwen 27b outperformed even kimi on those tasks, and that opus tries gaslighting me when I give it a spec of something that has been proven, but no published solution exists online. All other models gave their best shot at solving it, opus just said it's not possible (even when I gave it the finished working product that obviously works).

Especially for small models (but also big ones) I think the only way to know if a model will improve your workflow is this: personal benchmarks, expanded over time, run in private.

Alifatisk · 2026-05-19T16:50:38 1779209438

> Ideally I would love to see a leaderboard with relatively objective ranking criteria that 1. lets you filter by open weight / locally runnable, 2. filter by date of release (nothing older than x), and 3. is agnostic to hardware requirements. I just want to know what the best model is. Let me worry about how I will afford to run it.

Stick to artificialanalysis.ai, it has become the norm.

lofaszvanitt · 2026-05-18T22:57:20 1779145040

benchmarks = bs

kelsey98765431 · 2026-05-18T17:14:23 1779124463

https://xcancel.com/Alibaba_Qwen/status/2056403591464984753

> Qwen3.7 Preview lands on Arena ！

> Here come Qwen3.7-Max-Preview & Qwen3.7-Plus-Preview. Alibaba now #6 lab in Text, #5 in Vision.

> Can't wait to release Qwen3.7 series models！Stay tuned! @arena

rspoerri · 2026-05-18T17:26:59 1779125219

I am very interested in seeing new qwen models. Qwen3.6 27b is the first one that can do things and doesn't constantly lose "its mind", and that can be run on a 3090 with a good context size. But it sometimes gets into a loop.

BillStrong · 2026-05-18T17:53:48 1779126828

Look on HuggingFace, there is a template that is supposed to fix the updates for the Qwen Models.

https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

Maybe will help you?

tedivm · 2026-05-18T17:38:06 1779125886

I've completely replaced GitHub Copilot using Sonnet 3.6 with OpenCode using Qwen3.6 27b, and it's been a great experience.

verdverm · 2026-05-18T17:54:20 1779126860

Similar, but I'm using 35B A3B variation with experimental MTP support

OpenCode is pretty good too

danielbln · 2026-05-18T18:18:33 1779128313

A3B is especially nice, MoE really shines on memory-bandwidth-constrained platforms like the DGX Spark.

verdverm · 2026-05-18T21:26:29 1779139589

looks like MTP support has now been merged and also updated unsloth quants to go with it (not just the extras, all of 'em!)

2001zhaozhao · 2026-05-18T18:42:38 1779129758

Is Sonnet 3.6 a typo? Claude Sonnet 3.6 (aka 3.5 New) is an ancient model from 2024

satvikpendem · 2026-05-18T19:07:07 1779131227

Pretty sure they meant 4.6

tedivm · 2026-05-19T01:05:32 1779152732

Yeah that was a typo, I meant 4.6.

Jach · 2026-05-18T21:22:52 1779139372

I sort of thought this about qwen3.5 35b, finally a local model that isn't a complete waste of electricity, but "upgrading" to 3.6 35b left me disappointed. It seemed more like a downgrade. But honestly I've barely used either. Subjectively they still seem far from the frontier models, but for what they can do, it's great to be able to do locally.

boppo1 · 2026-05-19T00:49:29 1779151769

How are they just for chat / questions?

Jach · 2026-05-19T20:19:13 1779221953

Pretty decent, it's given similar book recommendations as Claude when I feed it my list of read books and thoughts on them. You'll have to tell them to never use emojis. I was using 3.5 a while ago to generate some flavor text while I was playing a bit of an old-school dungeon-crawler game (it's like Wizardry), a genre I don't particularly enjoy much, but it's funner with the flavor text. Worth setting up something like open webui or other front-ends since a pure CLI experience via ollama is pretty bad.

giancarlostoro · 2026-05-18T17:27:59 1779125279

I had a flavor of an older version of Qwen (I forget which one to be fair) that was coding along, then lost itself in a loop, I was so confused, it was just a random greenfield "lets see how it does" type of project anyway.

hydra-f · 2026-05-18T17:30:31 1779125431

Vision has become totally underappreciated, whereas I believe it brings important advantages to a model

Also, a big caveat in using Qwen models has always been its speech patterns. I do wonder how Google made the Gemma lineup so good at this

Let's hope Alibaba continues to open source its models

jwr · 2026-05-18T17:35:59 1779125759

Agreed. Incidentally, in my testing, qwen models (qwen3.6-35b-a3b and earlier 3.5) are WAY better with vision than gemma4-26b-a4b. I would normally want to stick with gemma4 only (I use it for spam filtering), but it just doesn't cut it for vision work, and qwen models do.

tredre3 · 2026-05-18T18:44:21 1779129861

That has been my experience as well.

Qwen 3.5/3.6 are far better at vision. Even the 9B model beats Gemma 4 31B in my use case. They describe the scene more accurately and they focus on the important elements like a human would.

Gemma 4 frequently misses important elements, doesn't understand what things are, and is very coy even if you ask for lots of detail. You have to give it hints "hey what's that round thing on the left" to get half decent answers.

(Yes I did set the min-tokens correctly. I also tested bf16 and Q8 to make sure it wasn't a quant issue.)

It's unfortunate because Gemma 4 is so so so much better at natural language interactions.

greenavocado · 2026-05-18T17:48:53 1779126533

God I love qwen3.6-35b-a3b especially Q8

verdverm · 2026-05-18T17:56:22 1779126982

I second this notion, I am impressed daily with what little Qwen can do

argee · 2026-05-18T21:22:43 1779139363

> qwen models (qwen3.6-35b-a3b and earlier 3.5) are WAY better with vision than gemma4-26b-a4b

Can you give an example? And/or is there a benchmark specifically for this?

jwr · 2026-05-19T08:57:52 1779181072

I have my own benchmark: a photo of filament boxes in my drawer. The task is to list (inventory) the filaments. It's not an easy task: the letters are small, the boxes are in various orientations, the labels have lots of writing on them, etc.

This is what qwen produces (basically perfect):

Fiberlogy (Blue Boxes - Left Side)

ASA Black: 750 g (0.75 kg)

Easy PLA Graphite: 850 g (0.85 kg)

Ref Easy PLA Gray: 850 g (0.85 kg)

Ref Easy PLA White: 850 g (0.85 kg)

Ref Easy PLA Inox: 850 g (0.85 kg)

Bambu Lab (Brown Boxes - Right Side)

ASA White: 750 g (0.75 kg)

PLA-CF (Burgundy Red): 1,000 g (1 kg)

PAHT-CF (Black): 500 g (0.5 kg)

PET-CF (Black): 500 g (0.5 kg)

PETG-CF (Black): 1,000 g (1 kg)

----------

And this is gemma-4-26b-a4b (basically useless):

### *Fiberlogy Filaments*

All Fiberlogy items listed below share these common specifications: Made in Poland, $1.75\text{mm}$ Diameter, $0.75\text{kg}$ Weight, and Print Temperature of $200-230^{\circ}\text{C}$.

* *PLA Black* (Barcode: 5903281997986)

* *Easy PLA Graphite* (Barcode: 5903281994057)

* *Ref Easy PLA Grey*

* *Ref Easy PLA White* (Barcode: 5903281994386)

* *Ref Easy PLA True...* (Text is cut off)

**

### *Bambu Lab Filaments*

All Bambu Lab items listed below share these common specifications: Made in China, Removable Spool (Do Not Remove).

* *PLA-CF (Carbon Fiber Reinforced)*

    \*   Color: Burgundy Red
    \*   Diameter: $1.75 \pm 0.02\text{mm}$
    \*   Weight: $1.0\text{kg}$
    \*   Suggested Drying Conditions: $45^{\circ}\text{C}$ for $6-12$ hours

* *PAHT-CF (High Temperature Polyamide with Carbon Fiber)*

    \*   Color: Black
    \*   Diameter: $1.75 \pm 0.02\text{mm}$
    \*   Weight: $0.5\text{kg}$
    \*   Suggested Drying Conditions: $80^{\circ}\text{C}$ for $6-12$ hours

* *PETG-CF (Carbon Fiber Reinforced)*

    \*   Color: Black
    \*   Diameter: $1.75 \pm 0.02\text{mm}$
    \*   Weight: $1.0\text{kg}$

argee · 2026-05-19T16:27:25 1779208045

Thanks. Did you set the image min/max tokens for Gemma4 to 1120 for this? This might not be a fair comparison without that, due to the differences in architecture.

https://www.reddit.com/r/KoboldAI/comments/1sjnjic/imagemin_...

https://github.com/ollama/ollama/issues/15626

I think 1120 vs 280 tokens is a big difference, and you were perhaps using the latter value?

jwr · 2026-05-19T18:24:48 1779215088

I did not, and I had no idea such a setting even existed. This could definitely change things. However, I don't see a way to set this in LM Studio, which is what I currently use to run models.

argee · 2026-05-19T18:40:28 1779216028

Seems like you can't set it, for now. There's an issue for it: https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1...

benterix · 2026-05-19T10:50:12 1779187812

Thanks, that's very useful. I find people's small individual tests more important than the usual benchmarks that tend to be gamed by every single lab.

bachmeier · 2026-05-18T20:52:00 1779137520

I'm not much interested in vibe coding (for those who aren't aware that LLMs have other uses). The specific model I've been using with Ollama is hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL and it's amazing how fast it is on 64 GB of RAM and i5-13400 CPU. No GPU on this computer. Gemma 4 E4B will think for a couple of minutes vs 3-5 seconds for Qwen. It's hard to believe how much you can do with such limited hardware using their models.

throwa356262 · 2026-05-19T18:16:46 1779214606

Since you are using unsloth models from HF, why not use Unsloth Studio instead of Ollama?

It is supposed to be faster + they will update the new models multiple times during the first month to correct bugs and performance issues

https://unsloth.ai/docs/new/studio#quickstart

metalliqaz · 2026-05-18T23:25:33 1779146733

I have a much more powerful PC and I would not call Qwen3-Coder-30B-A3B "fast" on my machine by any stretch of the word. How are you running it?

pancsta · 2026-05-19T07:17:19 1779175039

Probably in the chat / no sys prompt mode for docs only. Try to refac a codebase using a CPU only…

maille · 2026-05-18T22:44:07 1779144247

What are your use cases?

monksy · 2026-05-18T21:59:41 1779141581

For the safer link: https://xcancel.com/Alibaba_Qwen/status/2056403591464984753

trilogic · 2026-05-18T17:46:25 1779126385

Qwen 3.6 35B (finetuned) is so good that it became standard open weights for everyday use. Is not far at all from proprietary models if you give it tools, skills and agents etc, it can actually finish the job. (Thank you Qwen team, appreciated). Using opensource now we can definitely rely to design from scratch very complicated architecture and build pretty fast the full pack. Wish to see Europe AI unleashed, wake up.

Aurornis · 2026-05-18T18:28:11 1779128891

> Is not far at all from proprietary models if you give it tools, skills and agents etc,

I use Qwen 3.6 27B, the dense version of this model which is slightly better.

I don't agree that it's close at all. Maybe for some small, easy tasks, but not for working on real codebases. It's amazing for something I can run at home, but the difference between it and Opus or GPT-5.5 is huge.

trilogic · 2026-05-18T18:46:39 1779129999

Really, how so? Because we work with codebases daily, can you give us a concrete example? In our case we work on consumer hardware (ish), 10 million ctx (1 million output, 1 million input proven, sometimes it loops or breaks at over 500k ctx but at ~17tps linear). It can read the full codebase, unleash agents, and write to disk, editing and patching files, creating a full app in 3-4 minutes. It can do web search and RAG pretty fast, it understands and fixes the user query and sys prompts, and adapts/fixes them if needed on the fly. I am wondering what more do you do?

trilogic · 2026-05-18T18:52:33 1779130353

Edit: Forgot to mention that it can process images and pdf, and 100s of other files, it can even create presentations in code or mermaid, svg, charts js etc. Here a basic version of it: https://hugston.com/chat

rspoerri · 2026-05-18T19:09:27 1779131367

how do you do 1mio context with qwen3.6 27b, that only supports 256k? and what hardware would you run that on? 2 * 3090 is afaik currently at max 256k context.

nyrikki · 2026-05-18T19:44:49 1779133489

You can get all the Qwen 3.x models up to ~1 million tokens using YaRN with llama.cpp.[0]

Personally I am using `--no-context-shift` and feeding in context back in on failure at the harness level.

I have 2x1080ti + 1xTitanV that have a full 262,144 tokens context on 262,144 tokens with `-sm tensor` at 62.04 t/s which isn't so bad.

But I also have a 1x3090 running unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL at 41.89 t/s but with only 130k context, but if you have a modular programming style both work pretty well.

But play with YaRN if you really need it.

[0]https://qwen.readthedocs.io/en/v3.0/run_locally/llama.cpp.ht...
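A small sketch of the arithmetic behind the YaRN suggestion above, assuming the 262,144-token native context mentioned in this thread and a target of roughly 1 million tokens. The flag names follow the pattern shown in the linked Qwen llama.cpp docs; treat them as an assumption and check them against `llama-server --help` for your build.

```python
import math


def yarn_scale(native_ctx, target_ctx):
    """Smallest integer RoPE scale factor whose scaled window covers target_ctx."""
    return max(1, math.ceil(target_ctx / native_ctx))


native = 262_144      # native context reported in the thread
target = 1_000_000    # the "~1 million tokens" goal

factor = yarn_scale(native, target)
flags = [
    "--rope-scaling", "yarn",
    "--rope-scale", str(factor),
    "--yarn-orig-ctx", str(native),
    "--ctx-size", str(native * factor),
]
print(" ".join(flags))
```

With these inputs the factor comes out at 4, giving a 1,048,576-token window. The KV cache grows with the window, so VRAM is likely to be the real limit well before that.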

Vaskivo · 2026-05-18T20:59:14 1779137954

How can you get it to run at 41 t/s? I also have a single 3090 and even with MTP can't break 20 t/s.

Here's my setup:

  llama-server
  --port 9999
  --model /MODELS/LLMs/Qwen3.6-27B-UD-Q4_K_XL.gguf
  --ctx-size 128000
  --threads 12
  --flash-attn on
  --device CUDA0
  --jinja
  --gpu-layers 52
  --mmproj /MODELS/LLMs/Qwen3.6-27B-mmproj-F16.gguf
  --cache-type-k q8_0
  --cache-type-v q8_0
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0
  --spec-type draft-mtp --spec-draft-n-max 2

(I'm not filling out 100% of the VRAM, as I have other stuff I need it for.)

nyrikki · 2026-05-18T21:34:25 1779140065

(Note UPDATED config)

Ya, if you are using the CPU it may slow down quickly.

This may be a bit huge and overcomplicated, on this host I am running it on an AMD Ryzen 7 5700G so that I can use the APU to dedicate the 3090.

    podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 \
    --ctx-size 131072 \
    --no-mmproj-offload \
    --no-context-shift \
    --kv-unified \
    --spec-type draft-mtp \
    --spec-draft-n-max 6 \
    --spec-draft-p-min 0.75 \
    -fa on --jinja --no-mmap \
    --cache-ram -1 \
    --no-warmup -np 1 \
    -n 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.6 \
    --min-p 0.00 \
    --top-k 20 \
    --top-p 0.95 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.05 \
    --fit off \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --prio 3 \
    --poll 100 \
    --port 8080 \
    --host 0.0.0.0

I am just building the container with:

     podman build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .

And here are the logs from a 'make me a flappy bird program in python' webui prompt.

     prompt eval time =     105.86 ms /    19 tokens (    5.57 ms per token,   179.47 tokens per second)
       eval time =  100549.41 ms /  4608 tokens (   21.82 ms per token,    45.83 tokens per second)
      total time =  100655.28 ms /  4627 tokens
     draft acceptance rate = 0.47215 ( 3408 accepted /  7218 generated)

I am down to ~25.54 t/s with a 95% full context.
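For anyone reading these logs for the first time, the headline numbers can be recomputed from the raw counts. This sketch only uses figures copied from the log lines above.

```python
def tokens_per_second(tokens, millis):
    """Throughput from a token count and an elapsed time in milliseconds."""
    return tokens / (millis / 1000.0)


# Values copied from the "eval time" and "draft acceptance rate" log lines.
eval_tps = tokens_per_second(4608, 100549.41)
ms_per_token = 100549.41 / 4608
acceptance = 3408 / 7218

print(f"{eval_tps:.2f} t/s, {ms_per_token:.2f} ms/token, acceptance {acceptance:.5f}")
```

An acceptance rate near 0.47 with `--spec-draft-n-max 6` means more than half of the drafted tokens were thrown away.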

nyrikki · 2026-05-18T22:01:06 1779141666

That config looked too complicated, getting rid of the --prio 3 and --poll 100, setting the draft-n-max to now recommended values, etc... kicked it up to 61 t/s

I think that was all about some earlier crashes.

     podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 \
    --ctx-size 128000 \
    --no-mmproj-offload \
    --no-context-shift \
    --kv-unified \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    --spec-draft-p-min 0.75 \
    -fa on --jinja --no-mmap \
    --cache-ram -1 \
    --no-warmup -np 1 \
    -n 32768 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --temp 0.6 \
    --min-p 0.00 \
    --top-k 20 \
    --top-p 0.95 \
    --presence-penalty 0.0 \
    --repeat-penalty 1.05 \
    --fit off \
    --reasoning on \
    --chat-template-kwargs '{"preserve_thinking":true}' \
    --port 8080 \
    --host 0.0.0.0

Vaskivo · 2026-05-19T08:25:09 1779179109

Yeah, having even a little bit in the CPU tanks the t/s...

But thanks. I've learned a few more configurations to tinker with.

omneity · 2026-05-18T19:30:01 1779132601

You can increase the context window beyond its max trained context using RoPE scaling[0] which will require more VRAM.

But you can increase your context window for the same VRAM by quantizing the KV cache with FP8 (double the context) or TurboQuant (more than double)[1].

0: https://medium.com/@leannetan/extending-context-length-with-...

1: https://docs.vllm.ai/en/latest/features/quantization/quantiz...
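A back-of-the-envelope sketch of why halving the KV cache element size doubles the context that fits in the same VRAM. The layer count, KV head count, and head dimension below are hypothetical placeholders for a plain-attention transformer, not the real Qwen 3.6 27B configuration, and the 8 GiB budget is likewise made up.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context(budget_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Largest token count whose KV cache fits in budget_bytes."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return budget_bytes // per_token


# Hypothetical architecture and VRAM budget, for illustration only.
arch = dict(n_layers=64, n_kv_heads=8, head_dim=128)
budget = 8 * 1024**3  # 8 GiB set aside for the KV cache

ctx_fp16 = max_context(budget, bytes_per_elem=2, **arch)
ctx_fp8 = max_context(budget, bytes_per_elem=1, **arch)
print(ctx_fp16, ctx_fp8)
```

Under these assumptions the FP16 cache costs 256 KiB per token, so the budget holds 32,768 tokens at FP16 and 65,536 at FP8.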

trilogic · 2026-05-18T19:37:11 1779133031

We managed to increase the ctx for whatever llm model that is GGUFED, here the experimental tests: https://www.reddit.com/r/Hugston/

tedivm · 2026-05-18T18:39:46 1779129586

I've had the opposite experience, and have built multiple fantastic applications with Qwen3.6 27b. What quantization have you tested with?

hedgehog · 2026-05-18T18:41:33 1779129693

Similarly I haven't seen Qwen 27B as remotely competitive with Opus, at least Q4 hooked up to Claude Code. What harness are you using?

trilogic · 2026-05-18T19:05:32 1779131132

As funny as it may sound, a q4_k_m well converted and quantized properly (and finetuned, imperative) would do the job. The 27b may be good but it is heavy, it burns the hardware. I personally prefer the 397B if I am stuck and can't progress, it can still run at 7 tps. Now with MTP (multi-token prediction) it nearly doubles the speed (reached 82tps today with the 35b 100000ctx). I recommend you give it a try.

0xbadcafebee · 2026-05-18T21:46:27 1779140787

> not for working on real codebases

You don't pick just one model to "work on real codebases". You use a very advanced model to plan, and a not-very-advanced, cheaper, faster model to execute planned tasks. This saves money and speeds up work. This is the guidance from Anthropic & OpenAI.

storus · 2026-05-18T19:46:16 1779133576

It's 3.7-max; max was never open-weighted before. I don't see any smaller models in that tweet.

b3ing · 2026-05-18T18:55:13 1779130513

For coding it’s really bad. Writing is ok, chat is good. It’ll get better but it’s not that close yet

nullc · 2026-05-18T23:58:34 1779148714

Bad is mystifying. Unassisted but for handing it a pile of PDFs of relevant academic papers and my initial codebase, I had hermes agent based on qwen-3.6 27B implement karatsuba multiplication of characteristic-2 polynomials in C++ in an existing codebase with an internal field arithmetic library. It correctly found the 'obvious' optimizations using the field properties. Then I had it implement the recursive halfgcd algorithm for these polynomials using it.

It wrote extensive test cases and validated them with mutation testing (per my standard instructions)-- took many tries getting the algorithms right but with the tests handy it found and fixed the errors.

It's inconceivable to me to call it bad!

jedisct1 · 2026-05-18T20:34:26 1779136466

Depends on the language and harness, I guess.

It works really well for me, at least for Python and JavaScript, with swival.dev as a harness.

kajecounterhack · 2026-05-18T20:48:31 1779137311

You should probably disclose that you're the author of swival.dev, but nice project :)

mettamage · 2026-05-18T17:52:32 1779126752

Do you have a good resource on how to finetune a model like Qwen? I am curious to try it out.

trilogic · 2026-05-18T18:03:35 1779127415

Here is a dataset you can choose from: https://huggingface.co/datasets/Avtrkrb/combined-reasoning-o... Get 10,000 samples from it according to your needs and go for it. The key (in my opinion) is not cutting the Sequence Length, among other things. Any traditional finetuning repo will do; if your hardware supports it, Unsloth is faster.

verdverm · 2026-05-18T17:57:01 1779127021

Unsloth has good resources

ethanpil · 2026-05-18T21:20:17 1779139217

Can you share the GGUF for this specific success story? I'd like to try it for myself.

mempko · 2026-05-18T18:06:33 1779127593

I love that open weight models are catching up so quickly. Also hilarious how far behind Grok is. I guess demand for Grok must be poor if Anthropic is able to rent resources from xAI.

ac29 · 2026-05-18T21:29:41 1779139781

Just to be clear, "Plus" and "Max" Qwen models are closed. Seems likely smaller open versions will be released, but that's not what was announced today

svachalek · 2026-05-18T18:10:44 1779127844

To play devil's advocate I do feel like Grok has a unique "feel" to it. All the Chinese models feel like GPT or Claude distillations, but Grok has a certain unique way of saying and doing things. But that said, it also feels a year behind the state of the art.

SwellJoe · 2026-05-18T20:03:26 1779134606

With an Austrian accent, perhaps?

kennywinker · 2026-05-18T22:57:28 1779145048

Yes, but sort of mechanical… like a mecha… mecha something.

martheen · 2026-05-19T09:06:33 1779181593

For an artificial it's weirdly obsessed with farmers in a certain country across the ocean

throwa356262 · 2026-05-19T19:07:54 1779217674

Would xAI do this if Grok was competitive?

https://www.nytimes.com/2026/04/03/business/spacex-ipo-grok-...

Havoc · 2026-05-18T18:57:31 1779130651

So glad they’re holding steady on open weights.

At least for now. Worried the Chinese team will change their mind once they have parity

the_duke · 2026-05-18T21:23:56 1779139436

Of course they will.

Right now they want to prevent the US labs from gaining any sort of self-reinforcing oligopoly on the space, and to let the ecosystem in China flourish.

That will all die sooner or later.

giancarlostoro · 2026-05-18T17:26:59 1779125219

There I was waiting on a smaller version of Qwen 3.6 to drop so I can run it on my Mac, and then bam, they drop this.

alfiedotwtf · 2026-05-18T23:05:26 1779145526

The jump from 3.5 to 3.6 was noticeable and set the bar. If they can keep the momentum, I’d pretty much say Qwen and China won the AI wars

satvikpendem · 2026-05-18T19:08:20 1779131300

Will they release the large models as open weight too? So far it seems only 35 or 27 B etc models are being released with nothing larger unlike before.

julianlam · 2026-05-18T22:56:56 1779145016

Gemma 4 and Qwen 3.6 were when my local inference experiments graduated from toy challenges with much hand holding to actually full day back and forth with good ability to utilise tool calls to discover how things are glued together.

I'm not talking about greenfield dev, I'm talking about interfacing with an existing decade old codebase.

0xbadcafebee · 2026-05-18T21:45:11 1779140711

I stopped caring about benchmarks at MiniMax M2.5. I no longer want more advanced models. I want cheaper models that don't slow down when everyone else is online.

brianwawok · 2026-05-18T22:12:24 1779142344

Run locally and you can now do it on an airplane

raffael_de · 2026-05-18T20:10:18 1779135018

I have a tangential question. Provided that it is correct that current proprietary models are offered at below cost-covering rates (I believe this is a consensus if I'm not mistaken¹); what factor (multiplication) would have to be applied approximately to current rates to reach break even?

¹: I think I read this a couple of times but I'm not sure if correct to begin with. Can this be substantiated based on annual financial reporting or other published business metrics by OpenAI, Anthropic et al.?

Onavo · 2026-05-18T17:26:27 1779125187

Where's Grok 4.3 on the leaderboard?

zzleeper · 2026-05-18T19:24:51 1779132291

There's a Grok 4.20 at #10? Maybe they just skipped version numbers for the 420 luls (are we 15 or what? wtf)

catketch · 2026-05-18T18:54:18 1779130458

nubg · 2026-05-18T18:50:10 1779130210

lmao at opus 4.7 being a downgrade

SwellJoe · 2026-05-18T20:04:47 1779134687

They made it less sycophantic. Which is a good thing for mental health, but maybe a bad thing for popularity contests.

vessenes · 2026-05-18T17:51:35 1779126695

Today I learned Meta's new model is preferred to everything but claude. That is .. a real surprise! Congrats to the Meta team.

vessenes · 2026-05-19T02:13:29 1779156809

I don’t mind a principled downvote, but can a downvoter explain his or her reasoning? Genuinely curious. I found the linked rankings surprising, and think of myself as relatively well informed. Please, enlighten me..