So Transformers are Architects now? Michael Bay comes up with the wildest things …


Jokes aside, the Transformer Architecture is basically why this all started. So, without getting too technical, I wanted to share what architecture even means when it comes to models and then briefly go into the Transformer one.

Robots going to Architecture School

The first thing that I needed to fully comprehend was what the Architecture of a model actually refers to. It's the way a model is built and how its different parts talk to each other.

So, think of the Org Chart in a Company as the architecture of said company: it shapes how new information is captured and distributed, as well as how scalable the whole thing is. Years ago, there was a famous Org Chart Meme about this. See below.

[Image: the famous Org Chart meme]

Having worked at two of the companies in there, I can tell you one is spot on. The other, not so sure; maybe I was too junior to see it.

Anyways, this is what architecture is. Different models are built differently, yielding different results. To push another analogy about models, I wanted to use Excel as an example.

Say we had values from A1 to A10 and we needed to add them up. There are several different ways we could approach this. We could simply do =A1+A2+A3 … Or we could do =SUM(A1:A10), or even =SUMIFS(A1:A10,A1:A10,">0") (don't roast me). But you get the idea. Each of these methods can represent a different model architecture, and each one has its pros and cons and can be more or less efficient.
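If Excel isn't your thing, here's the same idea in plain Python. The ten numbers are made up purely for illustration; each snippet mirrors one of the spreadsheet formulas above, and they all get to an answer through a different "architecture":

```python
# Three ways to add up the same ten numbers, mirroring the Excel formulas.
# The values here are arbitrary, just something to sum.
values = [3, -1, 4, 1, 5, -9, 2, 6, 5, 3]

# Like =A1+A2+A3...: spell out every single addition by hand
total_manual = (values[0] + values[1] + values[2] + values[3] + values[4]
                + values[5] + values[6] + values[7] + values[8] + values[9])

# Like =SUM(A1:A10): lean on a built-in aggregate
total_sum = sum(values)

# Like =SUMIFS(A1:A10, A1:A10, ">0"): only add the positive ones
total_positive = sum(v for v in values if v > 0)

print(total_manual, total_sum, total_positive)  # 19 19 29
```

Same spreadsheet, three routes to a total: the first is tedious and error-prone, the second is compact, and the third changes what actually gets counted. That last point is the one that matters for models, since architecture choices don't just change speed, they can change the result.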

So, that's what a Model architecture is, now …

Autobots, Assemble! Or where do Transformers come in?

They don't, but the Transformer architecture does.

This idea came from a 2017 Google research paper titled “Attention Is All You Need.” It introduced a new way for computers to understand the meaning of long sequences of text. Before this, models processed one word at a time, trying to remember the words that came before. The longer the sentence, the harder it became to hold on to the early context.

The Transformer Architecture changed that. Instead of reading words one by one, it looked at all the words at the same time and, using a mechanism called "attention," decided which words were most important for understanding each section of the string of text. This meant the model could grasp the full context of a sentence or paragraph more effectively. To add to that, since it didn't need to work sequentially (one word at a time), it could process more data in parallel (at the same time), making training faster and unlocking the scale needed for today's large models.

<aside> 💡

Basically, robots stopped reading like humans, a word at a time, and started reading like … robots? All words at once.

</aside>
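To make the sequential-vs-parallel point concrete, here's a toy sketch (not a real model, just a counting exercise): a sequential reader needs one dependent step per word, because each step has to wait for the state from the previous one, while a parallel reader can see every word in a single pass.

```python
# Toy illustration of why parallelism matters for training speed.
sentence = "attention is all you need".split()

# Sequential, old-school style: each step depends on the one before it,
# so the steps cannot overlap -- five words means five dependent steps.
sequential_steps = 0
memory = []
for word in sentence:
    memory.append(word)      # context built up one word at a time
    sequential_steps += 1

# Parallel, Transformer style: every word is visible at once,
# so a single pass over the whole sentence suffices.
parallel_passes = 1
all_at_once = list(sentence)  # the whole sentence, in one shot

print(sequential_steps, parallel_passes)  # 5 1
```

Five dependent steps versus one pass doesn't sound like much for a five-word sentence, but stretch that to billions of words of training data and the difference is exactly what made today's giant models trainable.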


So… what is “attention” anyway?

So, this Attention mechanism we spoke about, what is it? Well, it's exactly what it sounds like: a way for models to identify which parts of a long string of text they need to pay attention to.
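Here's a bare-bones sketch of that idea in plain Python, no ML libraries. The tiny two-number "meaning" vectors are completely made up for illustration; the real mechanics are the two steps shown: score how similar each word is to the word we're focusing on, then turn the scores into weights that add up to 1 (a softmax), so the model knows where to look:

```python
import math

# Hypothetical 2-d "meaning" vectors, hand-picked so that "bank" and
# "river" look similar. Real models learn vectors with hundreds of
# dimensions; two is enough to show the idea.
vectors = {
    "the":   [0.1, 0.1],
    "bank":  [0.9, 0.8],
    "river": [0.8, 0.9],
    "ran":   [0.2, 0.1],
}

def attention_weights(query_word):
    # Step 1: dot product = how similar each word is to the query word
    q = vectors[query_word]
    scores = {w: q[0] * v[0] + q[1] * v[1] for w, v in vectors.items()}
    # Step 2: softmax turns raw scores into weights that sum to 1
    exp = {w: math.exp(s) for w, s in scores.items()}
    total = sum(exp.values())
    return {w: e / total for w, e in exp.items()}

weights = attention_weights("bank")
# "river" ends up with a much bigger weight than "the" or "ran":
# when reading "bank", the model pays attention to "river".
```

That's the whole trick in miniature: for each word, attention produces a set of weights over every other word, and those weights decide whose meaning gets mixed in.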