
Deep Dive: Building GPT from scratch - part 8

learning from Andrej Karpathy

Hello and welcome back to this deep dive series on Starter AI!

Today, we’re finally jumping into the Generative Pretrained Transformer (GPT) to generate infinite Shakespeare on demand, re-implementing nanoGPT. You’ll be glad to see all the new skills from the previous 7 parts come into play now!

Due to popular demand, we’re breaking this lecture into two one-hour sessions. Enjoy!

The roadmap

The goal of this series is to implement a GPT from scratch, and to understand everything needed to do that. We’re following Andrej’s Zero To Hero videos. Catch up here:

To follow along, subscribe to the newsletter at starterai.dev. You can also follow me on LinkedIn.

GPT - groundwork for tiny Shakespeare

Today’s lecture is called “Let's build GPT: from scratch, in code, spelled out.”, and we’re tackling the first hour.

We’re building a small GPT that we’ll train on Shakespeare’s works (a toy dataset under 1 MB, so about a million characters), and we’ll end up re-implementing nanoGPT, which in turn is a rewrite of minGPT.

NanoGPT reproduces GPT-2 124M on OpenWebText (the actual OpenAI training set is not open, so this is a best-effort reproduction of the dataset), and it also lets you load the official OpenAI weights. All that in about 300 lines of PyTorch for the model, plus another 300 lines for the boilerplate training loop. Easily one of the coolest repos on GitHub at the moment.

To keep things simple, it’s a character-level model, so it will learn and produce one character at a time.

You will see a lot of material from the previous lectures. For example, we’ll use our old friend, the bigram model from part 3 of this series, as a baseline.

The cliff-hanger I’m going to leave you on today is at the 1:02 mark, just before we jump into implementing self-attention.

Context

Attention Is All You Need is a seminal paper that changed the field of artificial intelligence in 15 pages (10, if you don’t count the references and visualisations). As you read it, you’ll see that the authors didn’t see it coming either. This is the paper we’re following today to build the tiny GPT.

Tokenizer. We want our model to operate on text, but neural networks can only deal with numbers. To work around that, we need a representation (encoding) of text as numbers. Today’s lecture uses characters, so each character corresponds to a token. In the real world that’s pretty inefficient, so tokens tend to be chunks of words (and every model is trained with a particular tokenizer). OpenAI has a nice explainer page where you can play with their tokenizers. The most recent one is tiktoken, and it’s open source. For their models, a token is on average about 3/4 of a word, so 100 tokens correspond to roughly 75 words. Andrej also mentions other tokenizers, like SentencePiece from Google.
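Since today’s model is character-level, the encoding boils down to a lookup table between characters and integers. Here’s a minimal sketch of that idea in Python, along the lines of what the lecture does (the file name and variable names are illustrative):

```python
# Build a character-level tokenizer from the raw text.
with open("input.txt", "r", encoding="utf-8") as f:  # the tiny Shakespeare dump
    text = f.read()

chars = sorted(set(text))                     # vocabulary: every distinct character
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character

encode = lambda s: [stoi[c] for c in s]             # string -> list of ints
decode = lambda ids: "".join(itos[i] for i in ids)  # list of ints -> string

print(encode("hii there"))
print(decode(encode("hii there")))  # round-trips back to "hii there"
```

A sub-word tokenizer like tiktoken trades a much larger vocabulary for much shorter sequences; the character-level version keeps the vocabulary tiny (about 65 symbols for tiny Shakespeare) at the cost of longer sequences.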

And with that, let’s jump into the lecture.

Video + timestamps

Baseline language modelling, code setup

00:07:52 reading and exploring the data

00:09:28 tokenization, train/val split

00:14:27 data loader: batches of chunks of data

00:22:11 simplest baseline: bigram language model, loss, generation (sketched in code right after this list)

00:34:53 training the bigram model

00:38:00 port our code to a script
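
For reference, here’s a condensed sketch of that bigram baseline in PyTorch, in the spirit of the lecture’s code (hyperparameters and names are illustrative; a vocab_size of 65 assumes the tiny Shakespeare character set):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class BigramLanguageModel(nn.Module):
    """Each token directly looks up the logits for the next token."""

    def __init__(self, vocab_size):
        super().__init__()
        # A (vocab_size x vocab_size) table: row i holds the logits for the token after i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)       # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # only the last time step matters
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)      # append and keep going
        return idx


model = BigramLanguageModel(vocab_size=65)
context = torch.zeros((1, 1), dtype=torch.long)  # start generation from token id 0
print(model.generate(context, max_new_tokens=100)[0].tolist())
```

Before training the output is pure noise; after training it improves, but only slightly, because each prediction sees just one character of context. That limitation is exactly what self-attention will fix.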

Building the "self-attention"

00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation (all three versions are sketched right after this list)

00:47:11 the trick in self-attention: matrix multiply as weighted aggregation

00:51:54 version 2: using matrix multiply

00:54:42 version 3: adding softmax

00:58:26 minor code cleanup

01:00:18 positional encoding

… the cliff-hanger!
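
Here’s a condensed sketch of those three versions of the averaging trick, roughly as the lecture builds them up (the shapes and the seed are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2                 # batch, time, channels
x = torch.randn(B, T, C)

# Version 1: for each position, average itself and everything before it (for-loops).
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)

# Version 2: the same aggregation as a matrix multiply with a lower-triangular weight matrix.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)   # each row sums to 1, so this averages
xbow2 = wei @ x                            # (T, T) @ (B, T, C) -> (B, T, C)

# Version 3: start from scores, mask out the future, softmax into weights.
# In self-attention the zeros become learned affinities instead of uniform weights.
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float("-inf"))  # no peeking at future tokens
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x

print(torch.allclose(xbow, xbow2, atol=1e-6), torch.allclose(xbow, xbow3, atol=1e-6))
```

All three produce the same result; the matrix form turns the aggregation into a single batched operation, and the softmax form is the one self-attention will reuse next time, with data-dependent weights.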


Summary

Today we’ve picked up a lot of elements from the previous lectures and laid the groundwork for our GPT.

We’ve implemented a bigram model, and we’ve practised some maths and matrix multiplication so that our samples can “communicate with the past”, which will help us understand self-attention (I promise this becomes clearer next time). And then we’re pausing just before the season finale.

What’s next

Next time we’re jumping right into the action to implement the self-attention mechanism and the transformer, and to watch the miracles happen.

Make sure you subscribe to this newsletter at starterai.dev so you don’t miss the next part!

Share with a friend

If you like this series, please forward it to a friend!

Feedback

How did you like it? Was it easy to follow? What should I change for next time?

Please reach out on LinkedIn and let me know!
