Chat GPT is just GPT version 3.5. OpenAI released many other versions of GPT before that. In fact, Open AI became really popular around the time of the GPT 2 which was a fairly good chat model.
Also, the Transformer architecture was not created by OpenAI so LLMs were a thing way before OpenAI existed :)
GPT-2 was not a fairly good chat model, it was a completely incoherent completion model. GPT-3 was not much better overall (take any entry level 1B sized model you can find today and it'll steamroll it in every way, hell probably even smaller ones), and the public at large never really had any access to it, I vaguely recall GPT 3 being locked behind an approval only paid API or something unfeasible like that. Nobody cared until instruct tunes happened.
OpenAI had a real issue with making (for their time) great models but streching their rollout over months. They gave access to press and some twitter users, everyone else had to apply for their use case only to be put on the waitlist. That completely killed any momentum.
The first version of ChatGPT wasn't a huge leap from simulating chat with instruction-tuned GPT 3.5, the real innovation was scaling it to the point where they could give the world immediate and free access. That built the hype, and that success allowed them to make future ChatGPT versions a lot better than the instruction-tuned models ever were.
The main reason ChatGPT took off was:
1) Response time of the API of that quality was 10x quicker than the Davinci-instruct-3 model that was released in summer 2022, making interaction more feasible with lower wait times and with concurrency
2) OpenAI strictly banned chat applications on the GPT API; even summarising with more than 150 tokens required your to submit a use case for review; I built an app around this in October 2022, got through the review, and it was then pointless as everybody could just use ChatGPT for the purposes of my apps new feature).
It was not possible for anybody to have just whacked the instruct models of GPT-3 into an interface for both the restrictions and latency issues that existed prior to ChatGPT. I agree with you on instruct vs ChatGPT and would further say the real innovation was entirely systematic, scaling and changing the interface. Instruct tuning was far more impactful than conversational model tuning because instruct enabled so many synthesizing use cases beyond the training data.
"Instruct tuning was far more impactful than conversational model tuning because instruct enabled so many synthesizing use cases beyond the training data."
I saw many model providers nowadays provide instruct model in name as chat model. What difference between instruct tuning and conversational model tuning specifically?
The best papers to read are the T5 paper which introduced intstruction training.
BERT showed that training with two tasks (next sentence and mask fill) was more effective than solely one task.
T5 showed that multiple instructions could be used for one task (token prediction) like not just translating, but also summarizing. They suggested this could generalize (it did)
GPT-2 showed with just token prediction and no instructions you could represent good text; GPT-3 showed this was coherent and also that sufficient context was reliably continued by models(and impacted by the format of training data, e.g. StackOverflow used Q: A: in the training data, so prompts using Q: and A: worked very well for conversation-mimicking).
Davinci-instruct essentially made GPT-3 outputs reliable, because they "corrected model outputs" not just to follow the implicit continued context but to follow text instructions with general english in the users submitted prompt. They could change this to always follow a chat format (e.g. use Pronouns and refer to the user with "You") which seems to work more naturally, but the original instruct worked based on simple commands which are responded to without the chat format (e.g. no "I am sorry" - just no token, no "I believe the book you are looking for is:") etc.
Nowadays most instruct models do actually use prompt formats and training datasets which are conversational (check out the various formats in LM studio) anyway, so the difference is lost.
Afaik there's no difference, instruct and chat are used interchangeably. Mistral calls their tunes "modelname-Instruct", Meta calles them "modelname-chat".
Strictly speaking instruct tuning would mean having one instruction and one answer, but the models are typically smart enough to still get it if you chain them together and most tuning datasets do contain examples of some back and forth discussion. That might be more what could be considered a chat tune, but in practice it's not a hard distinction.
You are saying that after having experienced all the subsequent versions. GPT-2 was fairly good, not impressive but fairly good. People were using for all sorts of stuff for the fun of it. The GPT 3 versions were really impressive and had everyone here super excited
I'd argue the GPT-3 results were really cherry picked by the few people who had access, at least if the old versions of 3.5 and turbo are anything to go by. The hype would've died instantly if anyone had actually tried them themselves and realized that there's no consistency.
If you want to try out GPT-2 to refresh your memory, here [0] is an online demo. It's bad, I'd say worse than classical graph/tree based autocomplete. I'm fairly sure Swiftkey makes more coherent sentences.
Open AI when they gave press access to gpt said that you must not publish the raw output for AI safety reasons. So naturally people self selected the best outputs to share.
The point isn't the models but the structure. Let's say you wanted AI to compare Phone 1 and Phone 2.
GPT-3 was originally a completion model. Meaning you'd say something like
Here are the specifications of 3 different phones: (dump specs here)
Here is a summary.
Phone 0
pros: cheap, tough, long battery life.
cons: ugly, low resolution.
Phone 1
pros:
And then GPT would fill it out. Phone 0 didn't matter, it was just there to get GPT in the mood.
Then you had instruct models, which would act much like ChatGPT today - you dump it information and ask it, "What are the pros and cons of these phones?" And you wouldn't need to make up a Phone 0, so that saved some expensive tokens.
But the problem with these is you did a thing and it was done. Let's say you wanted to do something else with this information.
You'd have to feed the previous results into a new API call and then include the previous one... but you might only want the better phone's result and exclude the other. Langchain was great at this. It kept everything neatly together so you could see what you were doing.
But today, with chat models, you wouldn't need it. You'd just follow up the first question with another question. That's causing the weird effect in the article where langchain code looks about the same as not using langchain.
Also, the Transformer architecture was not created by OpenAI so LLMs were a thing way before OpenAI existed :)