Traditional machine learning models are built to generalize. That works well in many cases, but it is also their Achilles' heel: they often miss the sharp edges, the anomalies, the rare patterns that matter most. Transformers changed the game.
In this post, I introduce a new mental model for understanding attention: the 3D similarity cube. By exploring how queries, keys, and values interact spatially, we can see why transformers don't just generalize – they discriminate, focus, and understand.
1. The Blind Spots of Traditional ML Models
Classical models often rely on:
- Averaging effects
- Loss minimization across all data
- Rigid feature importance
As a result, they tend to smooth out the very things we might care about: black swan events, sharp transitions, or rare-but-crucial relationships. These are edge cases – and traditional models don’t handle them well.
2. Attention: From Generalization to Distinction
Transformers introduced attention – a mechanism that asks:
What parts of the input are relevant to this specific token right now?
Unlike pooling or averaging, attention allows the model to:
- Dynamically assign importance per token
- Learn contextual relevance
- Focus on what matters instead of everything equally
It’s not about compressing the data. It’s about amplifying the meaningful parts.
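To make that contrast concrete, here is a tiny NumPy sketch with made-up token embeddings and a hypothetical query vector (all values illustrative): a flat average dilutes the token that stands out, while a softmax over similarities leans toward it.

```python
# A minimal sketch (NumPy, toy numbers) contrasting uniform pooling with
# attention-style weighting. All embeddings and the query are illustrative.
import numpy as np

# Three token embeddings, dimension 4 (made-up values)
tokens = np.array([
    [1.0, 0.0, 0.0, 0.0],   # a "rare" token that stands out
    [0.1, 0.9, 0.2, 0.1],
    [0.1, 0.8, 0.3, 0.2],
])

# Mean pooling: every token gets the same weight (1/3), so the rare token is diluted.
pooled = tokens.mean(axis=0)

# Attention-style weighting for a query that resembles the rare token.
query = np.array([0.9, 0.1, 0.0, 0.0])
scores = tokens @ query                          # similarity of each token to the query
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: data-dependent relevance
focused = weights @ tokens                       # weighted combination, not a flat average

print("uniform weights  :", np.full(3, 1 / 3).round(3))
print("attention weights:", weights.round(3))
print("pooled output    :", pooled.round(3))
print("focused output   :", focused.round(3))
```

The point of the toy example is only the shape of the behavior: the weights come from the data itself, so the output shifts toward whatever the query deems relevant.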
3. Attention as a 3D Similarity Cube
At the heart of the transformer lies the raw attention computation:
Attention(Q, K, V) = softmax((Q × Kᵀ) / sqrt(d_k)) × V
This isn’t just math – it’s a dynamic system of relationships. Here’s what’s happening:
- QKᵀ / √d_k creates a similarity matrix between every query and every key.
- The softmax transforms that matrix into attention weights – a relevance map.
- Multiplying by V means each output is a weighted combination of values, focused by similarity.
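Here is a minimal NumPy sketch of that computation, with random inputs and illustrative shapes. A real transformer layer adds learned projections for Q, K, and V, masking, and multiple heads, none of which are shown here.

```python
# A minimal sketch of scaled dot-product attention (NumPy, illustrative shapes).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v), weights (n_q, n_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity matrix, shape (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys: the relevance map
    return weights @ V, weights                       # weighted combination of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))    # 5 queries, d_k = 8
K = rng.normal(size=(7, 8))    # 7 keys
V = rng.normal(size=(7, 16))   # 7 values, d_v = 16

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (5, 16) (5, 7)
```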
Now, imagine this whole operation as scanning through a 3D cube:
- One axis holds queries
- One holds keys
- One holds values
The attention mechanism traverses this cube, slicing through it based on query-key similarity, then pulling a weighted projection from the value space. The cube isn’t just a metaphor – it’s a way to visualize the full relational structure being navigated in real time.
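One loose but literal way to render the cube in code: broadcast the attention weights against the values to get a (queries × keys × value-dimensions) tensor of weighted contributions, then sum over the key axis to collapse it back into the usual attention output. The sketch below (NumPy, the same illustrative shapes as above) is just that reading of the metaphor, not a claim about how any library implements attention.

```python
# A sketch of the "similarity cube" as a 3D tensor (NumPy, illustrative shapes).
import numpy as np

rng = np.random.default_rng(0)
n_q, n_k, d_k, d_v = 5, 7, 8, 16
Q = rng.normal(size=(n_q, d_k))
K = rng.normal(size=(n_k, d_k))
V = rng.normal(size=(n_k, d_v))

scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# The "cube": entry [q, k, :] is key k's value vector scaled by how much query q attends to it.
cube = weights[:, :, None] * V[None, :, :]     # shape (n_q, n_k, d_v)

# Slicing along the query axis gives that query's weighted view of the values;
# summing over the key axis collapses the cube into the ordinary attention output.
output = cube.sum(axis=1)
print(cube.shape)                        # (5, 7, 16)
print(np.allclose(output, weights @ V))  # True
```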
This is how transformers don’t just generalize – they build context-sensitive, directional meaning across time and space.
Conclusion
Where traditional models flatten complexity, transformers unfold it. This "cube" perspective helps explain how transformers find hidden structure – not just by pattern-matching, but by creating rich, multidimensional comparisons at every layer.