>Instead of a single transformer, we combine (i) a stack of heterogeneous DNNs paired with small language models as perception modules
It seems that we're reinventing the brain's organs one by one from first principles. (Though Transformer + Common Crawl unintentionally builds a whole bunch of them we don't even understand yet.)
I found some broader context and the whole thing is indeed very harness-shaped:
Well, Harness is the wrong word here... "environment/tools the LLM interacts with" definitely fits though. Or "other organoid" to use the previous metaphor.
Workflows are the most common, there's a pipeline like processing loan documents before data gets loaded to the next step or translating user comments before being stored in the database.
Agents are where you have a chat based system or a brain of sorts that calls many tools to achieve a user goal. The model doing this is a lot better at non deterministic task which then delegates to Interfaze for specific deterministic actions like OCR, Web extract then consumes that data. That's the article you referenced :)
The graph doesn't exactly make it clear but it describes a pipeline that goes beyond the LLM, so the CNN could be a separate model there.