Chapter 11: The Full GPT - Assembling the Model
What You'll Build

Four files that together make the project complete:

- Model.cs - the GptModel class that holds all parameters and implements the full forward pass (replacing the simplified Forward function from Chapters 6-7)
- AdamOptimiser.cs - a reusable class wrapping the Adam state and update from Chapter 7
- FullTraining.cs - the real training loop that runs GptModel across 10,000 steps
- Program.cs - the finalised dispatcher with the "full" case wired up

Prerequisites: all previous chapters.

A few design notes before the code.

The Forward method takes a single token at a time, not the whole sequence at once. The KV cache (passed in as parameters) holds the context from previous positions. This is the same one-token-at-a-time approach from Chapter 9: we process tokens sequentially during both training and inference.

Each document or sample needs its own fresh KV cache. The model provides CreateKvCache() for that, and the caller passes it back into every Forward call for that sequence.

The parameter dictionary uses string keys like "wte", "wpe", "lm_head", and "layer0.attn_wq". That means a typo in a key is a runtime error rather than a compile error, but it is the most direct mapping to how PyTorch stores model weights: if you ever wanted to load real GPT-2 checkpoints, the keys would line up.

To keep the C# code readable inside Forward, we add private property aliases (TokenEmbeddings, PositionEmbeddings, OutputProjection) over the most-used dict entries. The cryptic two-letter names live in the dict; the descriptive C# names live in the code that uses them.

Forward itself is short because it delegates to two private methods (AttentionBlock and MlpBlock) that mirror the "communicate, compute" framing from Chapter 10. Each block contains a pre-norm, the transformation, and the residual add.

// --- Model.cs ---
using static MicroGPT.Helpers;

namespace MicroGPT;

public class GptModel
{
    // The state dict keys follow PyTorch / GPT-2 convention (wte = weight token embedding,
    // wpe = weight position embedding, etc.) so this code can map directly to PyTorch
    // checkpoints if you ever want to load real GPT-2 weights. The aliased properties
    // below give us readable C# names to use inside Forward without losing that bridge.
    private readonly Dictionary<string, List<List<Value>>> _stateDict;
    private readonly int _embeddingSize;
    private readonly int _headCount;
    private readonly int _layerCount;
    private readonly int _headDimension;

    private List<List<Value>> TokenEmbeddings => _stateDict["wte"];
    private List<List<Value>> PositionEmbeddings => _stateDict["wpe"];
    private List<List<Value>> OutputProjection => _stateDict["lm_head"];

    /// <summary>All trainable parameters, flattened into a single list for the optimiser.</summary>
    public List<Value> Parameters { get; }

    public GptModel(
        int vocabSize,
        int embeddingSize,
        int headCount,
        int layerCount,
        int maxSequenceLength,
        Random random
    )
    {
        _embeddingSize = embeddingSize;
        _headCount = headCount;
        _layerCount = layerCount;
        _headDimension = embeddingSize / headCount;

        _stateDict = new Dictionary<string, List<List<Value>>>
        {
            ["wte"] = CreateMatrix(random, vocabSize, embeddingSize),
            ["wpe"] = CreateMatrix(random, maxSequenceLength, embeddingSize),
            ["lm_head"] = CreateMatrix(random, vocabSize, embeddingSize),
        };

        for (int i = 0; i < layerCount; i++)
        {
            _stateDict[$"layer{i}.attn_wq"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.attn_wk"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.attn_wv"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.attn_wo"] = CreateMatrix(random, embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.mlp_fc1"] = CreateMatrix(random, 4 * embeddingSize, embeddingSize);
            _stateDict[$"layer{i}.mlp_fc2"] = CreateMatrix(random, embeddingSize, 4 * embeddingSize);
        }

        // Dictionary enumeration order is not guaranteed by the spec. In .NET Core+
        // it preserves insertion order in practice, so Adam's momentum[]/squaredGradAvg[]
        // line up across runs - but if that implementation detail ever changes, switch
        // to a List<KeyValuePair<string, List<List<Value>>>> to make the order explicit.
        Parameters = [.. _stateDict.Values.SelectMany(mat => mat).SelectMany(row => row)];
    }

    public List<Value> Forward(
        int tokenId,
        int posId,
        List<List<Value>>[] keys,
        List<List<Value>>[] values
    )
    {
        // Token embedding + position embedding forms the initial residual stream.
        List<Value> tokenEmbedding = TokenEmbeddings[tokenId];
        List<Value> positionEmbedding = PositionEmbeddings[posId];
        var x = new List<Value>();
        for (int i = 0; i < _embeddingSize; i++)
        {
            x.Add(tokenEmbedding[i] + positionEmbedding[i]);
        }

        // Each layer: communicate (attention), then compute (MLP).
        for (int layer = 0; layer < _layerCount; layer++)
        {
            x = AttentionBlock(x, layer, keys, values);
            x = MlpBlock(x, layer);
        }

        // Final norm, then project to vocabulary logits.
        x = RmsNorm(x);
        return Linear(x, OutputProjection);
    }

    private List<Value> AttentionBlock(
        List<Value> x,
        int layerIndex,
        List<List<Value>>[] keys,
        List<List<Value>>[] values
    )
    {
        var xResidual = new List<Value>(x);
        x = RmsNorm(x);

        // Project into query/key/value, then append this position's K and V to the cache.
        List<Value> query = Linear(x, _stateDict[$"layer{layerIndex}.attn_wq"]);
        List<Value> key = Linear(x, _stateDict[$"layer{layerIndex}.attn_wk"]);
        List<Value> value = Linear(x, _stateDict[$"layer{layerIndex}.attn_wv"]);
        keys[layerIndex].Add(key);
        values[layerIndex].Add(value);

        var concatenatedHeads = new List<Value>();
        for (int h = 0; h < _headCount; h++)
        {
            int headStart = h * _headDimension;
            List<Value> queryForHead = query.GetRange(headStart, _headDimension);

            // Scaled dot product of this head's query against every cached key.
            var attentionLogits = new List<Value>();
            int cachedCount = keys[layerIndex].Count;
            for (int t = 0; t < cachedCount; t++)
            {
                List<Value> keyForHead = keys[layerIndex][t].GetRange(headStart, _headDimension);
                var dot = new Value(0);
                for (int j = 0; j < _headDimension; j++)
                {
                    dot += queryForHead[j] * keyForHead[j];
                }
                attentionLogits.Add(dot * (1.0 / Math.Sqrt(_headDimension)));
            }

            List<Value> attentionWeights = Softmax(attentionLogits);

            // Weighted sum of the cached values for this head.
            var headOutput = new List<Value>();
            for (int j = 0; j < _headDimension; j++)
            {
                headOutput.Add(new Value(0));
            }
            for (int t = 0; t < cachedCount; t++)
            {
                List<Value> valueForHead = values[layerIndex][t].GetRange(headStart, _headDimension);
                Value w = attentionWeights[t];
                for (int j = 0; j < _headDimension; j++)
                {
                    headOutput[j] += w * valueForHead[j];
                }
            }
            concatenatedHeads.AddRange(headOutput);
        }

        // Mix the heads back together, then add the residual.
        x = Linear(concatenatedHeads, _stateDict[$"layer{layerIndex}.attn_wo"]);
        for (int i = 0; i < _embeddingSize; i++)
        {
            x[i] += xResidual[i];
        }
        return x;
    }

    private List<Value> MlpBlock(List<Value> x, int layerIndex)
    {
        var xResidual = new List<Value>(x);
        x = RmsNorm(x);
        x = Linear(x, _stateDict[$"layer{layerIndex}.mlp_fc1"]); // expand to 4x width
        x = [.. x.Select(xi => xi.Relu())];
        x = Linear(x, _stateDict[$"layer{layerIndex}.mlp_fc2"]); // project back down
        for (int i = 0; i < _embeddingSize; i++)
        {
            x[i] += xResidual[i];
        }
        return x;
    }

    /// <summary>Creates a fresh KV cache for a new document/sample.</summary>
    public List<List<Value>>[] CreateKvCache()
    {
        var cache = new List<List<Value>>[_layerCount];
        for (int i = 0; i < _layerCount; i++)
        {
            cache[i] = new List<List<Value>>();
        }
        return cache;
    }
}
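Before moving on to the optimiser, it helps to see the caller contract in one place. Here is a minimal sketch (not one of the four project files) that drives GptModel by hand, assuming Chapter 8's 27-token vocabulary; the literal token ids below are made up for illustration, not taken from the real tokenizer:

var random = new Random(42); // any fixed seed works
var model = new GptModel(
    vocabSize: 27,
    embeddingSize: 16,
    headCount: 4,
    layerCount: 1,
    maxSequenceLength: 8,
    random: random
);

// One fresh cache pair per sequence - Forward appends this position's key/value,
// so attention at position N sees positions 0..N and nothing else.
List<List<Value>>[] keys = model.CreateKvCache();
List<List<Value>>[] values = model.CreateKvCache();

int[] tokens = [0, 5, 13]; // hypothetical ids, e.g. BOS, 'e', 'm'
for (int pos = 0; pos < tokens.Length; pos++)
{
    List<Value> logits = model.Forward(tokens[pos], pos, keys, values);
    // logits.Count == 27; Softmax(logits) turns them into next-token probabilities.
}

Reusing a cache across sequences would leak one sequence's context into the next sequence's attention - the KV cache is the only cross-position state the model has.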
The optimiser is Chapter 7's Adam state and update, repackaged as a class so FullTraining can stay focused on the loop itself:

// --- AdamOptimiser.cs ---
namespace MicroGPT;

public class AdamOptimiser
{
    // Momentum and squared-gradient decay rates, plus the divide-by-zero guard.
    private const double Beta1 = 0.9;
    private const double Beta2 = 0.95;
    private const double Epsilon = 1e-8;

    private readonly IReadOnlyList<Value> _parameters;
    private readonly double[] _momentum;
    private readonly double[] _squaredGradAvg;
    private readonly double _baseLearningRate;
    private readonly int _totalSteps;

    public AdamOptimiser(IReadOnlyList<Value> parameters, double learningRate, int totalSteps)
    {
        _parameters = parameters;
        _momentum = new double[parameters.Count];
        _squaredGradAvg = new double[parameters.Count];
        _baseLearningRate = learningRate;
        _totalSteps = totalSteps;
    }

    // Reset every parameter's gradient to zero. Call before each Backward.
    public void ZeroGrad()
    {
        foreach (Value p in _parameters)
        {
            p.Grad = 0;
        }
    }

    // Apply one Adam update to every parameter using its current Grad.
    public void Step(int step)
    {
        // Linearly decay the learning rate from the base rate toward zero.
        double currentLearningRate = _baseLearningRate * (1 - (double)step / _totalSteps);
        for (int i = 0; i < _parameters.Count; i++)
        {
            Value p = _parameters[i];
            _momentum[i] = Beta1 * _momentum[i] + (1 - Beta1) * p.Grad;
            _squaredGradAvg[i] = Beta2 * _squaredGradAvg[i] + (1 - Beta2) * p.Grad * p.Grad;
            // Bias correction: step is zero-based, hence step + 1.
            double momentumHat = _momentum[i] / (1 - Math.Pow(Beta1, step + 1));
            double squaredHat = _squaredGradAvg[i] / (1 - Math.Pow(Beta2, step + 1));
            p.Data -= currentLearningRate * momentumHat / (Math.Sqrt(squaredHat) + Epsilon);
        }
    }
}
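To make the decay schedule concrete, here is a tiny standalone sketch (plain arithmetic, not part of the project) that evaluates currentLearningRate at a few steps, assuming an illustrative base rate of 0.01 and 10,000 total steps:

double baseLearningRate = 0.01;
int totalSteps = 10_000;
foreach (int step in new[] { 0, 2_500, 5_000, 9_999 })
{
    // Same formula as Step(): linear decay from the base rate toward zero.
    double lr = baseLearningRate * (1 - (double)step / totalSteps);
    Console.WriteLine($"step {step,5}: lr = {lr:F6}");
}
// Prints 0.010000, 0.007500, 0.005000, 0.000001 - early steps take large strides,
// late steps make only fine adjustments.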
With the model and optimiser in place, the training loop ties them together:

// --- FullTraining.cs ---
using static MicroGPT.Helpers;

namespace MicroGPT;

public static class FullTraining
{
    public static void Run()
    {
        // ── Hyperparameters ────────────────────────────────────
        const int embeddingSize = 16;
        const int headCount = 4;
        const int layerCount = 1;
        const int maxSequenceLength = 8;
        const int numSteps = 10_000;
        const double learningRate = 0.01; // base Adam rate; match your Chapter 7 value
        var random = new Random(42);

        // ── Data ───────────────────────────────────────────────
        List<string> docs = Tokenizer.LoadDocs("input.txt", random);
        var tokenizer = new Tokenizer(docs);
        Console.WriteLine($"num docs: {docs.Count}");
        Console.WriteLine($"vocab size: {tokenizer.VocabSize}");

        // ── Model ──────────────────────────────────────────────
        var model = new GptModel(
            tokenizer.VocabSize,
            embeddingSize,
            headCount,
            layerCount,
            maxSequenceLength,
            random
        );
        Console.WriteLine($"num params: {model.Parameters.Count}");

        // ── Training Loop ──────────────────────────────────────
        var optimiser = new AdamOptimiser(model.Parameters, learningRate, numSteps);

        // Reusable buffers for Backward (see Chapter 2's convenience overload for the
        // simpler allocating version - here we hoist them out of the hot loop so 10,000
        // training steps don't allocate 10,000 fresh sets).
        var topo = new List<Value>();
        var visited = new HashSet<Value>();
        var backwardStack = new Stack<Value>();

        // Running average to smooth out the noisy per-step loss.
        double avgLoss = 0.0;
        // Milestone tracking so we can report the previous milestone's avg loss
        // alongside the current one every 1000 steps.
        double lastMilestoneLoss = 0.0;

        for (int step = 0; step < numSteps; step++)
        {
            // Cycle through the shuffled documents, one per step.
            string doc = docs[step % docs.Count];

            // Wrap the document in BOS markers: BOS, c1, ..., cN, BOS.
            var tokens = new List<int> { tokenizer.Bos };
            tokens.AddRange(doc.Select(tokenizer.Encode));
            tokens.Add(tokenizer.Bos);

            // Any name longer than maxSequenceLength - 1 is silently truncated here.
            int tokenCount = Math.Min(maxSequenceLength, tokens.Count - 1);

            List<List<Value>>[] keys = model.CreateKvCache();
            List<List<Value>>[] values = model.CreateKvCache();

            var losses = new List<Value>();
            for (int posId = 0; posId < tokenCount; posId++)
            {
                List<Value> logits = model.Forward(tokens[posId], posId, keys, values);
                List<Value> probabilities = Softmax(logits);
                losses.Add(-probabilities[tokens[posId + 1]].Log());
            }

            var loss = new Value(0);
            foreach (Value l in losses)
            {
                loss += l;
            }
            loss *= 1.0 / tokenCount;

            // Track running average (exponential moving average with alpha = 0.01).
            avgLoss = step == 0 ? loss.Data : 0.99 * avgLoss + 0.01 * loss.Data;
            if (step == 0)
            {
                lastMilestoneLoss = avgLoss;
            }

            optimiser.ZeroGrad();
            topo.Clear();
            visited.Clear();
            backwardStack.Clear();
            loss.Backward(topo, visited, backwardStack);
            optimiser.Step(step);

            if (step == 0 || (step + 1) % 100 == 0)
            {
                Console.WriteLine(
                    $"step {step + 1,5} / {numSteps,5} | loss {loss.Data:F4} | avg {avgLoss:F4}"
                );
            }

            // Every 1000 steps, print a milestone showing overall progress.
            if ((step + 1) % 1000 == 0)
            {
                Console.WriteLine(
                    $"  [milestone] avg loss: {avgLoss:F4} (was {lastMilestoneLoss:F4})"
                );
                lastMilestoneLoss = avgLoss;
            }
        }

        // Chapter 12's inference loop lives here too - see the next chapter.
    }
}

A reminder about name length. As in Chapter 7, maxSequenceLength = 8 means any name longer than 7 characters is silently truncated during training: an eight-letter name like "isabella" keeps all its characters, but the model never trains on the final character-to-BOS transition, so it never learns where such names end. This is why the generated samples in Chapter 12 lean toward shorter names - the model simply hasn't seen the tails of longer ones during training. Raising maxSequenceLength to 16 removes the truncation at roughly 2x the training cost.

With FullTraining.cs in place, we can finalise Program.cs. You've been uncommenting one case in the dispatcher at the end of every chapter since Chapter 1. Now uncomment the final case "full" line and replace the temporary sanity-check default from Chapter 0 with the usage message. After this edit your Program.cs should look like this:

// --- Program.cs ---
//
// Dispatcher: `dotnet run -- chN` runs a specific chapter exercise,
// `dotnet run -- full` runs the full training + inference,
// `dotnet run` (no args) defaults to the full run.
namespace MicroGPT;

public static class Program
{
    public static void Main(string[] args)
    {
        string chapter = args.Length > 0 ? args[0].ToLowerInvariant() : "full";
        switch (chapter)
        {
            case "gradcheck": GradientCheck.RunAll(); break;
            case "ch1": Chapter1Exercise.Run(); break;
            case "ch2": Chapter2Exercise.Run(); break;
            case "ch3": Chapter3Exercise.Run(); break;
            case "ch4": Chapter4Exercise.Run(); break;
            case "ch5": Chapter5Exercise.Run(); break;
            case "ch6": Chapter6Exercise.Run(); break;
            case "ch7": Chapter7Exercise.Run(); break;
            case "ch8": Chapter8Exercise.Run(); break;
            case "ch9": Chapter9Exercise.Run(); break;
            case "ch10": Chapter10Exercise.Run(); break;
            case "full": FullTraining.Run(); break;
            default:
                Console.WriteLine($"Unknown chapter: {chapter}");
                Console.WriteLine("Usage: dotnet run -- [gradcheck|ch1..ch10|full]");
                break;
        }
    }
}

Two things changed from the Chapter 0 skeleton: the "full" case is now wired to FullTraining.Run(), and the default no-args value is "full" instead of "" so that dotnet run with no arguments runs the full training and inference.

With everything wired up, you can invoke any part of the project uniformly:

dotnet run -- ch1     # Chapter 1 exercise
dotnet run -- ch10    # Chapter 10 exercise
dotnet run -- full    # full training + inference (Chapters 11-12)
dotnet run            # same as "full"

With embeddingSize=16, headCount=4, layerCount=1, maxSequenceLength=8, and vocabSize=27, the parameter count works out as follows:

- Token embeddings (wte): 27 x 16 = 432
- Position embeddings (wpe): 8 x 16 = 128
- Output projection (lm_head): 27 x 16 = 432
- Per layer: Q(256) + K(256) + V(256) + O(256) + FC1(1024) + FC2(1024) = 3,072
- Total: 432 + 128 + 432 + 3,072 = 4,064
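If you want to verify that arithmetic against the num params line FullTraining prints at startup, here is a standalone sketch (plain arithmetic, independent of the model code):

int vocabSize = 27, embeddingSize = 16, layerCount = 1, maxSequenceLength = 8;

int embeddingParams =
    vocabSize * embeddingSize           // wte: 432
    + maxSequenceLength * embeddingSize // wpe: 128
    + vocabSize * embeddingSize;        // lm_head: 432

int perLayerParams =
    4 * embeddingSize * embeddingSize        // wq + wk + wv + wo: 4 x 256
    + 2 * 4 * embeddingSize * embeddingSize; // fc1 + fc2: 1024 + 1024

Console.WriteLine(embeddingParams + layerCount * perLayerParams); // 4064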
For perspective: GPT-2's largest variant had 1.5 billion parameters, and GPT-4-class models have hundreds of billions. The architecture is the same; it is just much wider and deeper.

Training is now fully wired up. If you want to confirm everything works before adding inference, run it now:

dotnet run -c Release -- full

You'll see a loss line every 100 steps and a [milestone] line every 1000. The running average should drop from ~3.3 toward ~2.37 over 10,000 steps (5-15 minutes on a modern CPU in Release mode). Per-step loss is noisy, so watch the avg column for the trend. Generation is the next chapter, so this run ends after training without producing any names. If you'd rather wait and run once with both training and inference in place, skip this and head straight into Chapter 12.
