1. Vectors in NLP

2. Self-Attention Mechanism

3. Multi-head Attention

4. Transformers

5. BERT

Conclusion

Understanding these concepts provides a solid foundation for working with state-of-the-art NLP models like BERT and GPT.

Jargon

Top-k: Restrict sampling to the k highest-probability tokens, then draw one of them at random, weighted by their renormalized probabilities. A sketch follows below.
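A minimal sketch of top-k sampling over a raw logit vector, using NumPy; the function name and the default k are illustrative, not taken from any particular library:

```python
import numpy as np

def sample_top_k(logits, k=5, rng=None):
    """Sample a token index from the k most probable tokens (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    # Softmax over the logits (max-subtracted for numerical stability).
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    # Keep only the k most probable tokens and renormalize over them.
    top_idx = np.argsort(probs)[-k:]
    top_probs = probs[top_idx] / probs[top_idx].sum()
    # Random-weighted draw among the survivors.
    return rng.choice(top_idx, p=top_probs)
```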

Top-p (nucleus sampling): Restrict sampling to the smallest set of highest-probability tokens whose cumulative probability reaches p, then draw from that set with the same random-weighted strategy (see the sketch below).
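Under the same assumptions (raw logits, NumPy, illustrative names), top-p keeps the smallest prefix of the probability-sorted tokens whose cumulative probability covers p:

```python
import numpy as np

def sample_top_p(logits, p=0.9, rng=None):
    """Nucleus sampling: sample from the smallest token set whose
    cumulative probability reaches p (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    # Sort tokens from most to least probable and accumulate probability mass.
    order = np.argsort(probs)[::-1]
    cdf = np.cumsum(probs[order])
    # Smallest prefix whose cumulative probability covers p.
    cutoff = int(np.searchsorted(cdf, p)) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=nucleus_probs)
```

Unlike top-k, the size of the sampled set adapts to the distribution: a confident model may leave only one or two tokens in the nucleus, while a flat distribution keeps many.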

Temperature: A scaling factor applied to the logits before the softmax, which reshapes the resulting probabilities. Higher temperature flattens the distribution (more randomness); lower temperature sharpens it (more deterministic output). A sketch follows below.
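As a sketch (again with illustrative names), temperature scaling is just a division of the logits before the softmax:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax with temperature scaling: T > 1 flattens the
    distribution, T < 1 sharpens it (illustrative sketch)."""
    scaled = logits / temperature
    exp = np.exp(scaled - np.max(scaled))  # max-subtraction for numerical stability
    return exp / exp.sum()
```

At temperature 1.0 this reduces to the ordinary softmax; as the temperature approaches 0 the output approaches a one-hot pick of the most likely token.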