Notes & Takeaways
Stages of creating LLMs
- Data preparation and sampling, implementing the attention mechanism, and building the model architecture
- Pretraining on a large unlabeled corpus to build the foundation model
- Finetuning (i.e., further training) the foundation model for a specific domain or task
Transformer architecture
The original transformer is an encoder-decoder model.
The encoder block has a multi-head attention layer followed by a feed-forward layer.
The decoder block has a masked multi-head attention layer, a second multi-head attention layer (which attends over the encoder's output), and a feed-forward layer.
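To make the block structure concrete, here's a rough PyTorch sketch of a single encoder block. This is my own toy version, not the book's code; the dimensions are made up, and real implementations differ in details like dropout and norm placement.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Toy encoder block: self-attention + feed-forward, with a
    residual connection and layer norm around each sublayer."""
    def __init__(self, embed_dim=64, num_heads=4, ff_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Self-attention: every token attends to every other token.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)    # residual + norm
        # Position-wise feed-forward applied to each token.
        x = self.norm2(x + self.ff(x))  # residual + norm
        return x

x = torch.randn(1, 10, 64)    # (batch, seq_len, embed_dim)
print(EncoderBlock()(x).shape)  # torch.Size([1, 10, 64])
```

A decoder block would look similar, but with a causal mask on its first attention layer and a second attention layer that takes the encoder's output as keys and values.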
GPT architecture
Decoder-only. It keeps the masked multi-head attention and feed-forward layers and drops the encoder (and with it the cross-attention) entirely.
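As far as I understand, "masked" here just means a causal mask that hides future tokens, so each position can only attend to itself and earlier positions. A toy illustration (not from the book):

```python
import torch

seq_len = 5
# True marks the positions a token must NOT attend to (the future).
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
```

In PyTorch, a mask like this can be passed as attn_mask to nn.MultiheadAttention, where True means "do not attend."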
Other takeaways
- The transformer architecture is not easy, man, especially for a noob who has never read a paper or implemented an architecture before. Truthfully speaking, I'm still stuck on the scaled dot-product attention notation (Q, K, V). I think I'm getting ahead of myself by trying to understand this in chapter 1 when it's covered in chapter 3 of the book. I'll just take it step by step from here, I guess. (My rough attempt is sketched below.)
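For my own future reference, here's my rough reading of the formula, softmax(QK^T / sqrt(d_k)) V, as PyTorch code. It's a sketch ahead of chapter 3, so the learned linear projections that normally produce Q, K, and V are skipped:

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k), one row per token.
    d_k = K.shape[-1]
    # Similarity of every query to every key -> (seq_len, seq_len) scores.
    # Dividing by sqrt(d_k) keeps the scores in a range where the
    # softmax doesn't saturate.
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    # Each output row is a weighted average of the value vectors.
    return weights @ V

x = torch.randn(10, 64)  # 10 tokens, 64-dim embeddings
# In a real model Q, K, V come from separate learned projections of x;
# using x directly for all three is just for illustration.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([10, 64])
```

The way I'm currently thinking of it: Q is what each token is looking for, K is what each token offers for matching, and V is what actually gets passed along once a match is weighted.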
Overall, a straightforward chapter with a very high-level overview of things. The hands-on work starts in chapter 2, I think!