Deep Dive: Building GPT from scratch - part 9

learning from Andrej Karpathy

Welcome back to the series on Starter AI!

Today, after the dramatic cliff-hanger from part 8, we’re coming back to finish our Generative Pretrained Transformer (GPT) that generates infinite Shakespeare on demand (re-implementing nanoGPT). We’re almost done, so let’s jump straight into it!

The roadmap

The goal of this series is to implement a GPT from scratch, and to actually understand everything needed to do that. We’re following Andrej’s Zero To Hero videos. If you missed a previous part, catch up here:

To follow along, subscribe to the newsletter at starterai.dev. You can also follow me on LinkedIn.

GPT - Tiny Shakespeare

Today’s lecture is called “Let's build GPT: from scratch, in code, spelled out.”, and we’re picking up at the 01:02:00 mark.

We’re continuing to build a small GPT that we’ll train on Shakespeare’s works (a toy dataset under 1 MB, so about a million characters). By the end we’ll have re-implemented nanoGPT, which in turn is a rewrite of minGPT.
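If you’d like to sanity-check the dataset yourself, here’s a quick sketch. It assumes you’ve already saved the Tiny Shakespeare text locally as input.txt (as in the lecture); the printed numbers are approximate.

```python
# Assumes the Tiny Shakespeare corpus has been saved locally as input.txt.
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(len(text))          # roughly 1.1 million characters
print(len(set(text)))     # a tiny vocabulary of ~65 unique characters
print(text[:100])         # the familiar "First Citizen:" opening
```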

In the previous episode, we laid the groundwork: we implemented a bigram model as a starting point and practised the maths and matrix multiplication that let our samples “communicate with the past”. That felt vague last time, but it finally pays off… now!
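As a quick refresher, here’s a minimal sketch of that trick in PyTorch, along the lines of the lecture’s earlier “versions”: a lower-triangular mask plus softmax turns each position into a uniform average of itself and everything before it, using nothing more than a matrix multiply. Variable names here are mine.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2          # batch, time, channels
x = torch.randn(B, T, C)

# Lower-triangular mask: position t may only "see" positions <= t.
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)   # each row becomes a uniform average over the past

out = wei @ x                  # (T, T) @ (B, T, C) -> (B, T, C), batched matmul
print(out.shape)               # torch.Size([4, 8, 2])
```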

Context

Matrix multiplication - if you, like me, still need to stop and think about how the dimensions line up, I found the visualisation at matrixmultiplication.xyz quite useful.

Deep Residual Learning for Image Recognition. Andrej uses this paper to motivate residual (skip) connections, one of the design tweaks that helps the deeper network train well.

Dropout paper. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, one of the improvements discussed in the lecture. It effectively switches off a random subset of neurons on each training pass to prevent overfitting.
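To see what that means concretely, here’s a tiny illustration using PyTorch’s nn.Dropout (not code from the lecture): during training it zeroes a random subset of activations and rescales the survivors; in eval mode it does nothing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)   # drop 20% of activations during training
x = torch.ones(1, 8)

drop.train()
print(drop(x))             # some entries zeroed, survivors scaled by 1/(1-p) = 1.25

drop.eval()
print(drop(x))             # identity at inference time
```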

The GPT-3 paper. “Language Models are Few-Shot Learners”, the original GPT-3 paper from OpenAI.

Video + timestamps

Building the "self-attention" - continued

01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention

01:11:38 note 1: attention as communication

01:12:46 note 2: attention has no notion of space, operates over sets

01:13:40 note 3: there is no communication across batch dimension

01:14:14 note 4: encoder blocks vs. decoder blocks

01:15:39 note 5: attention vs. self-attention vs. cross-attention

01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)
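To tie the notes above together, here’s a minimal single head of self-attention in PyTorch, in the spirit of the lecture’s “version 4” (my variable names; treat it as an illustration rather than the exact lecture code). Note the masking of the future, which makes this a decoder-style head, and the 1/sqrt(head_size) scaling that keeps the softmax from saturating at initialisation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32             # batch, time, embedding channels
head_size = 16
x = torch.randn(B, T, C)

key   = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)               # each (B, T, head_size)

# Affinities between tokens, scaled so the softmax stays diffuse at init.
wei = q @ k.transpose(-2, -1) * head_size**-0.5    # (B, T, T)

# Decoder block: mask out the future so tokens only attend to the past.
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

out = wei @ v                                      # (B, T, head_size)
print(out.shape)                                   # torch.Size([4, 8, 16])
```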

Building the Transformer

01:19:11 inserting a single self-attention block to our network

01:21:59 multi-headed self-attention

01:24:25 feedforward layers of transformer block

01:26:48 residual connections

01:32:51 layernorm (and its relationship to our previous batchnorm)

01:37:49 scaling up the model! creating a few variables. adding dropout
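And here’s how those pieces fit together into a single Transformer block, as a hedged sketch roughly in the shape the lecture builds towards: multi-headed self-attention plus a feedforward layer, each wrapped in a residual connection with pre-layernorm and dropout. Hyperparameters and class names here are illustrative, not the lecture’s exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, n_head, block_size, dropout = 64, 4, 32, 0.1   # illustrative hyperparameters

class Head(nn.Module):
    """One head of masked (decoder-style) self-attention."""
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5   # (B, T, T), scaled
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                      # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel, concatenated then projected."""
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))

class FeedForward(nn.Module):
    """Per-token MLP with the 4x expansion used in the Transformer paper."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: communication (attention) then computation (MLP)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))      # residual + pre-layernorm
        x = x + self.ffwd(self.ln2(x))
        return x

x = torch.randn(2, block_size, n_embd)
print(Block(n_embd, n_head)(x).shape)     # torch.Size([2, 32, 64])
```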

Notes on Transformer

01:42:39 encoder vs. decoder vs. both (?) Transformers

01:46:22 walkthrough of nanoGPT, batched multi-headed self-attention

01:48:53 back to ChatGPT, GPT-3, pre-training vs. finetuning, RLHF

01:54:32 conclusions

Summary

So there you have it. The generated text kind of looks like Shakespeare, at least until you read it and try to make sense of it. Just squint a bit. If you don’t happen to have an A100 lying around, there are plenty of cloud providers who will happily rent you one by the hour.

The last few sections also discuss how this is a decoder-only model: unlike the original encoder-decoder Transformer, there is no encoder to condition on, so our generation process simply babbles on from an empty context rather than responding to a prompt.

We also got some insights from Andrej on how our toy model compares to the GPT-3 paper in order of magnitude: training data size, training time and number of parameters. Today was a good day. “Go forth and transform!”

What’s next

Next time we’re moving on to the last video of this series: the GPT tokenizer.

As always, subscribe to this newsletter at starterai.dev to get the next parts in your mailbox!

Share with a friend

If you like this series, please forward to a friend!

Feedback

How did you like it? Was it easy to follow? What should I change for the next time?

Please reach out on LinkedIn and let me know!
