— LEARNING / PAPER NOTES — 001
Attention Is All You Need, Revisited
TL;DR
A reread focused on which transformer design choices proved foundational and which were quickly replaced.
takeaways
- Residual stream + multi-head attention became a durable scaling interface.
- Positional representations, normalization, and activations all moved substantially.
- The block structure survived because its boundaries were more important than many initial internals.