Notes & Takeaways
Stages of creating LLMs
- Data preparation and sampling, implementing the attention mechanism, and building the model architecture
- Pretraining on a large unlabeled corpus to build the foundation model
- Finetuning (i.e., further training) the foundation model for a specific domain or task
Transformer architecture
The original transformer is an encoder-decoder model.
The encoder block has a multi-head attention layer followed by a feed-forward layer.
The decoder block has a masked multi-head attention layer, a second multi-head attention layer (which attends over the encoder's output), and a feed-forward layer.
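To make the block structure concrete, here's a rough PyTorch sketch of a single encoder block. This is my own toy version, not the book's code; the dimensions are made up, and real implementations differ in details like dropout and norm placement.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Toy encoder block: self-attention + feed-forward, with a
    residual connection and layer norm around each sublayer."""
    def __init__(self, embed_dim=64, num_heads=4, ff_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Self-attention: every token attends to every other token.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)    # residual + norm
        # Position-wise feed-forward applied to each token.
        x = self.norm2(x + self.ff(x))  # residual + norm
        return x

x = torch.randn(1, 10, 64)    # (batch, seq_len, embed_dim)
print(EncoderBlock()(x).shape)  # torch.Size([1, 10, 64])
```

A decoder block would look similar, but with a causal mask on its first attention layer and a second attention layer that takes the encoder's output as keys and values.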
GPT architecture
Decoder-only. It keeps the masked multi-head attention and feed-forward layers and drops the encoder (and with it the cross-attention) entirely.
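As far as I understand, "masked" here just means a causal mask that hides future tokens, so each position can only attend to itself and earlier positions. A toy illustration (not from the book):

```python
import torch

seq_len = 5
# True marks the positions a token must NOT attend to (the future).
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(mask)
# tensor([[False,  True,  True,  True,  True],
#         [False, False,  True,  True,  True],
#         [False, False, False,  True,  True],
#         [False, False, False, False,  True],
#         [False, False, False, False, False]])
```

In PyTorch, a mask like this can be passed as attn_mask to nn.MultiheadAttention, where True means "do not attend."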
Other takeaways
- The transformer architecture is not easy, man, especially for a noob who has never read a paper or implemented an architecture before. Truthfully speaking, I'm still stuck on the scaled dot-product attention notation (Q, K, V). I think I'm getting ahead of myself by trying to understand this in chapter 1 when it's covered in chapter 3 of the book. I'll just take it step by step from here, I guess. (My rough attempt is sketched below.)
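For my own future reference, here's my rough reading of the formula, softmax(QK^T / sqrt(d_k)) V, as PyTorch code. It's a sketch ahead of chapter 3, so the learned linear projections that normally produce Q, K, and V are skipped:

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k), one row per token.
    d_k = K.shape[-1]
    # Similarity of every query to every key -> (seq_len, seq_len) scores.
    # Dividing by sqrt(d_k) keeps the scores in a range where the
    # softmax doesn't saturate.
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    # Each output row is a weighted average of the value vectors.
    return weights @ V

x = torch.randn(10, 64)  # 10 tokens, 64-dim embeddings
# In a real model Q, K, V come from separate learned projections of x;
# using x directly for all three is just for illustration.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([10, 64])
```

The way I'm currently thinking of it: Q is what each token is looking for, K is what each token offers for matching, and V is what actually gets passed along once a match is weighted.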
Overall, a straightforward chapter with a very high-level overview of things. The hands-on work starts in chapter 2, I think!