I don't think it's particularly helpful to dive into transformer models, positional encodings and self-attention at the very start of an introduction to LLMs. Understanding how those work does little to help explain what LLMs can do and how you can use them.
I tend to stick with the higher level explanation that they can predict the next word (or next sentence) based on their training text, and then emphasize that while this sounds pretty limited it's actually capable of doing all sorts of impressive things once you scale it up enough.
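To make "predict the next word based on the training text" concrete, here is a minimal sketch using a bigram counter. This is not how an LLM works internally; it is the simplest model I can think of that predicts a next word from probabilities. The corpus and the function names (`train_bigram`, `predict_next`, `generate`) are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, max_words=5):
    """Repeatedly append the predicted next word, stopping at an unseen word."""
    out = [start]
    for _ in range(max_words):
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)


# Toy training text (an assumption for this sketch, not real training data).
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)

print(predict_next(model, "the"))   # "cat" follows "the" more often than any other word
print(predict_next(model, "dog"))   # None: "dog" never appears in the corpus
print(generate(model, "the", max_words=3))
```

The "scale it up" part of the explanation is everything this toy lacks: far more text, and a model that conditions on much more than the single previous word.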
I'm not sure it's even helpful to talk about transformers at all[0]. Modern LLMs differ from the pure transformer architecture in many details. The elephant in the room is that they're not even full transformers, only the decoder part.
[0]: One reason: Never once did I need to know the transformer architecture in order to be able to use these models (prompt engineering, chaining, working with local models, etc.).
I'd argue that knowing concepts such as RoPE, Mirostat, monkeypatching, etc. is much more crucial than knowing how transformer models work.
> I tend to stick with the higher level explanation that they can predict the next word (or next sentence) based on their training text,
I think the same way, but I think it reduces LLMs into "black boxes"—many other models can also predict next tokens based on probabilities. I think we need something that at least captures the general mechanism by which LLMs predict the next token.
It's adorable to see how many intro writings on the subject throw up the famous "Attention Is All You Need" diagram as if that explains anything whatsoever.
> I think we need something that at least captures the general mechanism by which LLMs predict the next token.
This bit varies a lot since the capabilities involved in prediction depend on the data. If the text is a math book and the prompt is "... three plus five apples is a total of " the crucial capability is arithmetic (plus of course NL capabilities). On the other hand, if you're completing a post from /r/relationship_advice the capabilities involved will be (vaguely) maintaining literary tone, theory of mind, psychology, etc. Within a text the capabilities needed will also vary a lot, where you might need theory of mind at some crucial inflection points, but most of the time it's already clear what is going to be said and the model just has to get the wording right.
So, my take would be to really think hard about the data to understand how predictions might be made.
The way LLMs process text is a lot like a typical human (top-of-the-head answer vs actually analyzing the question in depth). This is a recurring pattern with the current LLMs I've interacted with (e.g., A: "x" Q: "but what about y?" A: "oh, you're right, now that I [actually] think about it, really z"). Example (Anthropic's Claude):
Q: if one basket contains 5 apples and another basket contains 3 plums, and another basket contains 2 walnuts how many fruits are there combined?
A: The first basket contains 5 apples. The second basket contains 3 plums. The third basket contains 2 walnuts. Adding up the numbers of fruits in each basket, there are a total of 5 + 3 + 2 = 10 fruits combined.
Q: is a walnut a fruit?
A: You're right, a walnut is not technically a fruit. A walnut is a type of tree nut. To correctly answer the original question:
The first basket contains 5 apples. The second basket contains 3 plums. While the third basket contains 2 walnuts, walnuts are not fruits. So the total number of fruits combined is 5 apples + 3 plums = 8 fruits.
An interesting fundamental limitation of transformer LLMs is that the architecture gives the model no option to think a bit more or to review its answer: the quantity and structure of the computations it will do are fixed by the architecture, without any conditionals. But repeating the same prompt means that the model gets to run a second iteration with the result of the first iteration as input data, and that can enable it to do more.
Of course, if we wanted to generalize this, we should have a structure where the model itself can actually make the decision to continue working for one more iteration, and also ensure that it can retain some of its temporary work-in-progress notes (currently it can see only what the previous iteration output as part of the answer).
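The structure described here can be sketched as an outer loop around the model. This is only an illustration under assumptions: `step` stands in for a model call, `State`, `iterate` and `toy_step` are hypothetical names, and the toy step just replays the apples/plums/walnuts exchange instead of calling any real model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:
    """What carries over between iterations: the answer so far plus scratch notes."""
    answer: str = ""
    notes: List[str] = field(default_factory=list)


# A step takes the question and the current state and returns
# (new_answer, new_note, wants_another_iteration).
StepFn = Callable[[str, State], Tuple[str, str, bool]]


def iterate(question: str, step: StepFn, max_iterations: int = 5) -> State:
    """Run `step` until it says it is done or the iteration budget runs out."""
    state = State()
    for _ in range(max_iterations):
        answer, note, wants_more = step(question, state)
        state.answer = answer
        state.notes.append(note)  # work-in-progress notes survive to the next pass
        if not wants_more:
            break
    return state


def toy_step(question: str, state: State) -> Tuple[str, str, bool]:
    """Hypothetical stand-in for a model call: quick answer first, review second."""
    if not state.notes:
        return "10 fruits", "counted every item; not yet checked each one is a fruit", True
    return "8 fruits", "walnuts are nuts, so they are excluded from the count", False


result = iterate("How many fruits are there combined?", toy_step)
print(result.answer)
print(result.notes)
```

The two differences from simply asking again are that the decision to continue comes from the step itself (`wants_more`), and that `notes` is kept separately from the visible answer.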
"Think before you speak: Training Language Models With Pause Tokens"
Basic idea is to teach the LLM to occasionally insert a "pause" token, which outputs nothing but gives it a chance to perform another round of operations on the way to the answer.
"three plus five apples is a total of..." is a really interesting example, because it doesn't actually require arithmetic at all.
A language model trained on enough text will be able to complete this just based on having encountered the pattern "three plus five SOMETHING is a total of..." enough times in its training data.
This becomes even more apparent when you work with smaller models, such as the 7B-class models that can run on a laptop. They can often solve small arithmetic problems like this while having no chance at all of working with larger numbers that they haven't ever encountered in their training data.
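A deliberately exaggerated sketch of that point: a "model" that is nothing but a lookup table of patterns seen in training. Real small models generalize more than this, so treat it as a caricature. The function names and the digits-only training text are assumptions made for the example.

```python
def make_training_text(max_operand):
    """Generate every 'a plus b is c' line for operands from 0 to max_operand."""
    return [
        f"{a} plus {b} is {a + b}"
        for a in range(max_operand + 1)
        for b in range(max_operand + 1)
    ]


def train_pattern_table(lines):
    """Memorize each prompt ('a plus b is') together with the word that followed it."""
    table = {}
    for line in lines:
        prompt, answer = line.rsplit(" ", 1)
        table[prompt] = answer
    return table


def complete(table, prompt):
    """Return the memorized completion, or None if the pattern was never seen."""
    return table.get(prompt)


# "Training data" only ever contains single-digit operands.
table = train_pattern_table(make_training_text(9))

print(complete(table, "3 plus 5 is"))      # seen during training, so it is recalled
print(complete(table, "123 plus 456 is"))  # never seen, so there is nothing to recall
```

No arithmetic happens at completion time: the small sum is answered from memory and the large one cannot be answered at all.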
I really like using those smaller models as tools to better understand how this technology works.
I wanted to talk about what powers LLMs, which I believe is important. The answer to that is transformers. While I may not have delved deeper into how a transformer actually works, I tried to explain the concepts in the simplest way possible.
Right. For example, they are quite useful at generating code and translating the output to Python dictionaries (or JSON). That makes them very useful in code integrations.
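A minimal sketch of that kind of integration, assuming the model was asked to reply with a JSON object and wrapped it in some chatter. The reply string is invented for the example and `parse_model_json` is a hypothetical helper, not part of any library; only the standard `json` module is used.

```python
import json


def parse_model_json(raw):
    """Parse the span from the first '{' to the last '}' in a model reply as JSON."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])


# Invented model reply: JSON surrounded by conversational text.
reply = (
    "Sure! Here is the data:\n"
    '{"title": "Intro to LLMs", "keyterms": ["transformer", "token"]}\n'
    "Let me know if you need more."
)

data = parse_model_json(reply)
print(data["title"])
print(data["keyterms"])
```

Since `json.JSONDecodeError` is a subclass of `ValueError`, callers can catch `ValueError` for both the "no JSON" and the "malformed JSON" cases and retry the prompt.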
One application of casting things to dictionaries is that it makes it easy to grab keyterms from the ingested texts. With the right datastore, we can take the binary set objects and query them very quickly for similarity overlap of other texts, similar to how vector search might return a cosine distance (multi-dimensional angle offset?).
I was thinking these sets could be used to segment vector space, and the only thing that makes that possible is that LLMs are great to use for things like this.
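One way to read "similarity overlap" of keyterm sets is Jaccard similarity (size of the intersection divided by size of the union). That reading is my assumption, as are the document names and keyterms below; plain Python sets stand in for whatever set type the datastore provides.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_by_overlap(query_terms, docs):
    """Return document names sorted from most to least similar to the query."""
    return sorted(docs, key=lambda name: jaccard(query_terms, docs[name]), reverse=True)


# Hypothetical keyterm sets, e.g. extracted by an LLM from ingested texts.
docs = {
    "intro_post": {"transformer", "attention", "token"},
    "fruit_quiz": {"apple", "plum", "walnut"},
    "prompt_guide": {"token", "prompt", "attention", "json"},
}
query = {"attention", "token", "prompt"}

for name in rank_by_overlap(query, docs):
    print(name, jaccard(query, docs[name]))
```

Unlike cosine distance over embeddings, this only sees exact term matches, so it works best as a cheap first filter rather than a replacement for vector search.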
just wanted to follow up and say this was a fantastic talk. the personal AI ethics point about never publishing anything that takes the reader longer to read than it took you to write was really thought provoking. that time asymmetry between content producer and consumer is a fascinating lens to view generative AI ethics through.
My observation from this is that the process by which humans assemble and summarize information, at least at a somewhat high level, is pretty darn close to the way that LLMs do it. I think that falls apart when you want to talk about deeper learning, drawing inferences and so on but if you are just trying to pull together an executive summary on XYZ, an LLM with some fact checking gets you pretty far IMHO.
Up until recently I was editor at an AI company, and parts of this read exactly like some of the outputs I'd get out of GPT-4/ChatGPT Plus.
Many of the linked references are too recent to be in the training corpus (Llama 2, for example), so unless there's some web-search component to this it looks like an LLM wrote the first draft, and a human went through to edit, add links, and populate with images, etc.