— LEARNING / PAPER NOTES — 001

Attention Is All You Need, Revisited

TL;DR

A reread focused on which transformer design choices proved foundational and which were quickly replaced.

takeaways

  1. Residual stream + multi-head attention became a durable scaling interface.
  2. Positional representations, normalization, and activations all moved substantially.
  3. The block structure survived because its boundaries were more important than many initial internals.