Let me test this hypothesis. I run a small business that pays for Google workspa...

int_19h · on May 9, 2024

The main reason is that GPT-4 is still significantly better than everything else.

Time will tell if OpenAI will be able to retain the lead in the race, though. While there's no public competing model with equal power yet, competitors are definitely much closer than they were before, and keep advancing. But, of course, GPT-5 might be another major leap.

MrDarcy · on May 9, 2024

Confirmed. I asked both Gemini and GPT4 to assist with a proto3 rpc service I'm working on. The initial prompt was well specified. Both provided nearly exactly the same proto file, which was correct and idiomatic.

However, I then asked both, "Update the resource.spec.model field to represent any valid JSON object."

Gemini told me to use a google.protobuf.Any type.

GPT4 told me to use a google.protobuf.Struct type.

Both are valid, but the Struct is more correct and avoids a ton of issues with middleboxes.
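To illustrate the difference, here is a stdlib-only sketch of the two JSON shapes. The payload is made up, and `is_json_object` is a hypothetical helper rather than part of any protobuf library. It assumes the proto3 JSON mapping, where a Struct serializes as the plain JSON object, while an Any carries an "@type" URL and, for a well-known type like Struct, nests the payload under "value".

```python
import json
import math


def is_json_object(value):
    """True if value is a dict with string keys whose contents are plain JSON values."""
    def ok(v):
        if v is None or isinstance(v, (bool, int, str)):
            return True
        if isinstance(v, float):
            # NaN and infinity have no JSON representation.
            return math.isfinite(v)
        if isinstance(v, list):
            return all(ok(x) for x in v)
        if isinstance(v, dict):
            return all(isinstance(k, str) and ok(x) for k, x in v.items())
        return False

    return isinstance(value, dict) and ok(value)


# Hypothetical contents for resource.spec.model; the values are illustrative only.
model = {"name": "example", "layers": [1, 2, 3], "options": {"dropout": 0.1, "bias": True}}

# Struct: the JSON form is the object itself.
as_struct_json = json.dumps(model, sort_keys=True)

# Any: the JSON form adds a type URL and wraps the payload.
as_any_json = json.dumps(
    {"@type": "type.googleapis.com/google.protobuf.Struct", "value": model},
    sort_keys=True,
)
```

With Struct, anything between client and server that speaks JSON sees an ordinary object. With Any, a consumer has to resolve the type URL before it can interpret the payload.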

Anyway, it's a sample size of 1, but it does seem like GPT4 is better, even with prompts as well specified as I can muster.

CuriouslyC · on May 9, 2024

You need to specify a perspective to write code from (e.g. a software architect who values maintainability and extensibility over performance or code terseness), and prompt models to use the most idiomatic or correct technique. GPT4 is tuned to need less of this, but it will improve GPT4's answers as well.

CuriouslyC · on May 9, 2024

That's not really true. With well-written prompts, GPT4 is better at some things and worse at others than Claude/Llama3. GPT4 only appears to be the best by a wide margin if your benchmark suite is vague, poorly specified prompts and your metric for evaluation is "did it guess what I wanted accurately".

int_19h · on May 9, 2024

My benchmark is giving it novel (i.e. guaranteed to not be in the training set) logical puzzles that require actual reasoning ability, and seeing how it performs.

By that benchmark, GPT-4 significantly outperforms both LLaMA 3 and Claude in my personal experience.

CuriouslyC · on May 9, 2024

That's occurring because you're giving it weak prompts, like I said. GPT4 has been trained to do things like chain of thought by default, whereas you have to tell Llama/Claude to do some of that stuff. If you update your prompts to suggest reasoning strategies and tell it to perform some chain of thought beforehand, the difference between models should disappear.
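As a rough sketch of what such a prompt could look like: the `build_prompt` function, its parameters, and the exact wording below are illustrative assumptions, not a tested recipe or any vendor's API.

```python
def build_prompt(task, perspective=None, reasoning_steps=(), think_first=True):
    """Assemble a plain-text prompt; all phrasing here is illustrative only."""
    parts = []
    if perspective:
        # Optional role or perspective for the model to answer from.
        parts.append(f"You are {perspective}.")
    parts.append(f"Task: {task}")
    if reasoning_steps:
        # Explicitly suggested reasoning strategies, one per line.
        parts.append("Suggested reasoning strategies:")
        parts.extend(f"- {step}" for step in reasoning_steps)
    if think_first:
        # Ask for chain of thought before the final answer.
        parts.append("Think through the problem step by step before giving your final answer.")
    return "\n".join(parts)


prompt = build_prompt(
    "Solve the puzzle described below.",
    perspective="a careful logician",
    reasoning_steps=["List every constraint", "Check each candidate against all constraints"],
)
```

The same string can then be sent to each model, so any remaining gap is not down to one model doing chain of thought by default.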

int_19h · on May 9, 2024

You are assuming a great deal. No, you can absolutely come up with puzzles where no amount of forced CoT will make the others perform at GPT-4 level.

Hell, there are puzzles where you can literally point out where the answer is wrong and ask the model to correct itself, and it will just keep walking in circles making the same mistakes over and over again.

CuriouslyC · on May 9, 2024

Llama3 and Claude also work well; they're good at different types of code and problem solving. The only thing ChatGPT does clearly better than the rest is infer meaning from poorly worded/written prompts.

fwlr · on May 9, 2024

No financial incentive or relationships to disclose, just a satisfied user: I found that SuperMaven was a better “coding copilot”. If you happen to use VSCode I’d check that one out this afternoon.