LLM Study Diary #3: PyTorch

Sofia

Continuation of the course. This lesson is mostly about PyTorch. It covers tensors as the core building blocks for parameters, gradients, and optimizer states, and then moves on to floating-point representations, including FP32 (full precision), BF16 (brain float, often preferred for deep learning), and the move toward FP8 for efficiency.

Float Data Types

He introduces einops as a more readable and robust alternative to standard PyTorch indexing (e.g., -1, -2), helping developers manage dimensions without confusion. You can think of it as attaching name tags to tensor dimensions. For example, in

z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2")

the output tensor's dimensions are named batch, seq1, and seq2.

Next comes a deep dive into counting the total number of floating-point operations. The instructor establishes the rule of thumb that training requires approximately 6 × parameters × tokens FLOPs, the total of roughly 2 × parameters × tokens for the forward pass and 4 × parameters × tokens for the backward pass.

Note: If you forgot what the forward pass and backpropagation are, here is a video that walks through the math behind training a simple neural network: The Math behind Neural Networks

He demonstrates building a simple linear model, implementing custom optimizers like AdaGrad to understand how optimizer state persists across steps, and the importance of proper initialization (e.g., Xavier initialization) for numerical stability in deep networks.

Finally, there is practical advice on data loading with memmap to handle massive datasets (only the needed slices are loaded into memory), the importance of checkpointing to prevent losing progress (this is reminiscent of batch versus streaming processing), and the interplay between hardware constraints and model architecture. Below are a few small sketches of these ideas.
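To make the float types above more concrete, here is a minimal sketch (my own, not from the course) comparing FP32 and BF16 tensors in PyTorch; FP8 dtypes only exist in newer PyTorch builds, so I leave them out.

```python
import torch

# FP32 (full precision): 4 bytes per element
x_fp32 = torch.randn(4, 4, dtype=torch.float32)

# BF16 (brain float): 2 bytes per element, same exponent range as FP32
# but fewer mantissa bits, which is why it is popular for deep learning.
x_bf16 = x_fp32.to(torch.bfloat16)

print(x_fp32.element_size(), x_bf16.element_size())  # 4 2
print(x_fp32.dtype, x_bf16.dtype)                    # torch.float32 torch.bfloat16
```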
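And here is the einsum example fleshed out into something runnable, assuming the einops package is installed; the shapes are made up for illustration.

```python
import torch
from einops import einsum

batch, seq1, seq2, hidden = 2, 3, 5, 8   # example sizes, chosen arbitrarily
x = torch.randn(batch, seq1, hidden)
y = torch.randn(batch, seq2, hidden)

# Contract over the shared "hidden" axis; the remaining named axes
# (batch, seq1, seq2) describe the output, attention-score style.
z = einsum(x, y, "batch seq1 hidden, batch seq2 hidden -> batch seq1 seq2")
print(z.shape)  # torch.Size([2, 3, 5])

# The same contraction with plain torch.einsum and single-letter names:
z2 = torch.einsum("bih,bjh->bij", x, y)
print(torch.allclose(z, z2))  # True
```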
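The 6 × parameters × tokens rule of thumb as a tiny back-of-the-envelope calculation (the model size and token count below are made up purely for illustration):

```python
# Rule of thumb: training FLOPs ≈ 6 * parameters * tokens
# (≈ 2 * params * tokens for the forward pass, ≈ 4 * params * tokens for backward).
params = 7e9    # a 7B-parameter model
tokens = 1e12   # 1 trillion training tokens

forward_flops  = 2 * params * tokens
backward_flops = 4 * params * tokens
total_flops    = forward_flops + backward_flops   # = 6 * params * tokens

print(f"total ≈ {total_flops:.1e} FLOPs")  # total ≈ 4.2e+22 FLOPs
```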
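For the optimizer part, here is a minimal sketch of what a hand-rolled AdaGrad might look like, built on the standard torch.optim.Optimizer interface, together with Xavier initialization of a simple linear model. This is my own reconstruction, not the instructor's exact code.

```python
import torch
import torch.nn as nn

class AdaGrad(torch.optim.Optimizer):
    """Minimal AdaGrad: per-parameter state (sum of squared grads) persists across steps."""
    def __init__(self, params, lr=1e-2, eps=1e-10):
        super().__init__(params, dict(lr=lr, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]               # persists between calls to step()
                if "sum_sq" not in state:
                    state["sum_sq"] = torch.zeros_like(p)
                state["sum_sq"] += p.grad ** 2      # accumulate squared gradients
                p -= group["lr"] * p.grad / (state["sum_sq"].sqrt() + group["eps"])

# Simple linear model with Xavier initialization for numerical stability
model = nn.Linear(16, 4)
nn.init.xavier_uniform_(model.weight)
nn.init.zeros_(model.bias)

opt = AdaGrad(model.parameters(), lr=0.1)
x, y = torch.randn(8, 16), torch.randn(8, 4)
for _ in range(3):                                  # a few training steps
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(loss.item())
```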
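And a small sketch of memmap-based data loading plus checkpointing. The file name tokens.bin and the helper functions are hypothetical, just to show the pattern of touching only the slices you need.

```python
import numpy as np
import torch

# Write a tiny dummy token file so the sketch runs end-to-end
# (in practice this would be a huge pre-tokenized dataset).
np.random.randint(0, 50_000, size=10_000, dtype=np.uint16).tofile("tokens.bin")

# memmap: the file is mapped, not read; only indexed slices get paged into memory.
data = np.memmap("tokens.bin", dtype=np.uint16, mode="r")

def get_batch(batch_size=4, seq_len=128):
    starts = np.random.randint(0, len(data) - seq_len, size=batch_size)
    batch = np.stack([data[s : s + seq_len] for s in starts])
    return torch.from_numpy(batch.astype(np.int64))

print(get_batch().shape)  # torch.Size([4, 128])

# Checkpointing: persist model + optimizer state so a crash does not lose progress.
def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)
```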