Deep learning, from the neuron up to LoRA, GRPO & agents.
Every core idea — taught with plain intuition, a worked example without code, then real code in PyTorch, TensorFlow/Keras, and 🤗 Transformers. Diagrams for the parts that are hard to picture. From backprop and CNNs through quantization, PEFT, RLHF/DPO/GRPO, Mixture-of-Experts, scaled training, serving, and the LangChain / LangGraph / Deep Agents stack. Written for engineers who want depth, not slogans.
SECTION 01 What deep learning actually is
Where it sits inside AI, why "deep", and the one trick that makes the whole field work.
Artificial intelligence is the broad goal of making machines do things that seem to require intelligence. Machine learning (ML) is one route to that goal: instead of writing rules by hand, you give a program examples and let it find the rules itself. Deep learning (DL) is a sub-field of ML built on neural networks with many layers — the "deep" simply means many stacked layers of transformation, not anything mystical.
The defining move of deep learning is learned representations. Classical ML often needs a human to hand-design the features (e.g. "count the edges in this image, measure their angles"). A deep network instead learns its own features, layer by layer: early layers detect edges, middle layers assemble them into textures and parts, late layers recognise whole objects. Nobody told it what an edge is — it discovered that edges were useful for the task you optimised it on.
Imagine teaching a child to recognise a cat. You don't give a checklist ("4 legs, whiskers, pointed ears"). You show thousands of cats and not-cats and correct them when they're wrong. Over time the child's brain self-organises the concept. A deep network does the same: it sees many labelled images, makes a guess, measures how wrong it was, and nudges its millions of internal numbers a tiny bit toward "less wrong". Repeat millions of times. That nudging loop — guess, measure error, adjust — is deep learning. Everything else is detail about how to do the adjusting efficiently for different shapes of data (images, text, audio).
When to reach for deep learning (and when not to)
| Use deep learning when… | Prefer classical ML / rules when… |
|---|---|
| Data is high-dimensional & unstructured: images, audio, raw text, video. | Data is small (hundreds–few thousand rows) and tabular. |
| You have lots of data (tens of thousands+ examples) or a pre-trained model to fine-tune. | You need full interpretability / an auditable decision. |
| The mapping from input→output is complex and you can't write the rules. | Gradient-boosted trees (XGBoost/LightGBM) already beat it — common on tabular data. |
| You can tolerate a black-box model and GPU compute. | Latency/compute budget is tiny or no GPU is available. |
On structured/tabular data, gradient-boosted trees frequently outperform neural nets with far less tuning. Deep learning's home turf is perception (vision, speech) and language. "Use a transformer" is not the answer to every problem — match the tool to the data.
SECTION 02 The artificial neuron
The single unit every deep network is built from — a weighted sum, a bias, and a squashing function.
A biological neuron receives signals on its dendrites, sums them, and fires if the total crosses a threshold. The artificial neuron (or unit) is a deliberately crude maths version of that idea. It does exactly three things:
- Weighted sum. Each input xᵢ is multiplied by a weight wᵢ (how much that input matters) and they're all added up.
- Add a bias. A learnable constant b shifts the result, letting the neuron fire more or less easily.
- Activation. The sum is passed through a non-linear function σ that decides the output. Without this step the whole network would collapse into one big linear function (see §5).
y = σ(w·x + b)
Here w·x is the dot product of the weight vector and the input vector. The weights and bias are the learnable parameters; training is the search for good values of them.
Figure: a single artificial neuron (w₂ matters most). Inputs are scaled, summed with a bias, then squashed by an activation σ to produce the output y.
Suppose a neuron decides "should I go for a run?" Inputs: x₁ = weather is nice (1/0), x₂ = I have free time (1/0), x₃ = I'm tired (1/0). You care a lot about free time, somewhat about weather, and being tired pushes you the other way, so the learned weights might be w₁=2, w₂=3, w₃=−4 with bias b=−1. If today weather=1, time=1, tired=0, the sum is 2·1 + 3·1 + (−4)·0 − 1 = 4. A positive number → activation fires → "go run". Tired=1 instead gives 2 + 3 − 4 − 1 = 0 → borderline. The network learned these weights from your past behaviour; you never wrote the rule.
neuron.py (Python)
```python
import numpy as np

# One neuron, 3 inputs
x = np.array([1.0, 1.0, 0.0])    # weather, free time, tired
w = np.array([2.0, 3.0, -4.0])   # learned weights
b = -1.0                         # learned bias

def relu(z):                     # a common activation (see §5)
    return max(0.0, z)

z = np.dot(w, x) + b             # weighted sum + bias -> 4.0
y = relu(z)                      # activation -> 4.0
print(f"pre-activation z = {z}, output y = {y}")

# A whole layer is just many neurons stacked: a matrix multiply.
W = np.array([[2., 3., -4.],     # neuron 1
              [1., -1., 1.]])    # neuron 2
b_vec = np.array([-1.0, 0.5])
layer_out = np.maximum(0.0, W @ x + b_vec)   # ReLU over the vector
print("layer output:", layer_out)
```
That last block is the crucial leap: a layer is just a matrix multiply plus a bias plus an
activation. Stack several of these and you have a deep network. Everything that follows is about
choosing the activations, measuring error, and adjusting W and b automatically.
SECTION 03 Tensors & the math you actually need
A tensor is just an n-dimensional array. Four ideas cover most of what happens inside a network.
Everything flowing through a neural network — inputs, weights, activations, gradients — is a tensor: a grid of numbers with some number of axes (dimensions). The jargon maps onto things you know:
| Name | Rank (axes) | Example | Shape |
|---|---|---|---|
| Scalar | 0 | a single loss value 3.14 | () |
| Vector | 1 | one word embedding, one data row | (768,) |
| Matrix | 2 | a batch of rows; a weight layer | (batch, features) |
| 3-D tensor | 3 | a batch of token sequences | (batch, seq_len, dim) |
| 4-D tensor | 4 | a batch of RGB images | (batch, channels, H, W) |
You do not need heavy mathematics to be productive, but four ideas recur constantly:
1 · Matrix multiplication — the workhorse
A layer computing Y = X·W + b is a matrix multiply. If X is
(batch, in) and W is (in, out), the result is (batch, out).
The inner dimensions must match — most shape bugs are a mismatch here.
2 · Broadcasting
When you add a bias vector of shape (out,) to a matrix (batch, out), the
framework broadcasts the vector across every row automatically. Broadcasting lets
small tensors stretch to fit big ones without copying memory — but it can also silently create wrong shapes,
so check.
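A small sketch of the silent-shape hazard described above, written with NumPy (PyTorch follows the same broadcasting rules). The arrays and variable names are made up for illustration.

```python
import numpy as np

preds = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1): a column of predictions
targets = np.array([1.0, 2.0, 3.0])       # shape (3,):   a flat vector of targets

# Intended: an element-wise difference of 3 numbers.
# Actual: (3, 1) - (3,) broadcasts to a (3, 3) grid of all pairwise differences.
wrong = preds - targets
print(wrong.shape)                         # (3, 3)

# Fix: make the shapes agree explicitly before combining them.
right = preds.squeeze(1) - targets
print(right.shape)                         # (3,)

# The wrong version still produces a number, so nothing crashes.
print((wrong ** 2).mean(), (right ** 2).mean())
```

The predictions here are perfect, yet the broadcast version reports a non-zero error. That is why checking shapes before a reduction is worth the extra line.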
3 · The gradient
The gradient is the vector of partial derivatives of the loss with respect to every parameter. It points in the direction of steepest increase of the loss; training walks the opposite way. You will almost never compute it by hand — autograd does it (§7) — but you must understand what it means: "if I nudge this weight up a hair, does the error go up or down, and how fast?"
4 · The chain rule
A network is a chain of functions: loss(layer3(layer2(layer1(x)))). The chain rule from
calculus lets you compute how the final loss changes with respect to an early weight by multiplying the
local derivatives along the chain. This single rule is the mathematical engine of backpropagation.
∂loss/∂w = ∂loss/∂y · ∂y/∂z · ∂z/∂w
Read right-to-left: how the weight w affects the pre-activation z, how that affects the output y, how that affects the loss. Multiply them to get the full effect. Backprop just applies this across millions of parameters efficiently.
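To make the chain rule concrete, here is a sketch for one sigmoid neuron with a squared-error loss, checked against a finite-difference estimate (nudging the weight "a hair" in each direction). The numbers chosen for w, x, b and the target are arbitrary, and the sigmoid and squared-error choices are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, x=1.5, b=0.1, target=1.0):
    z = w * x + b          # pre-activation
    y = sigmoid(z)         # output
    return (y - target) ** 2

w, x, b, target = 0.8, 1.5, 0.1, 1.0

# Forward pass, keeping the intermediate values
z = w * x + b
y = sigmoid(z)

# Local derivatives, one per link in the chain
dloss_dy = 2.0 * (y - target)   # how the loss reacts to the output
dy_dz = y * (1.0 - y)           # derivative of the sigmoid
dz_dw = x                       # how the pre-activation reacts to the weight

grad_chain = dloss_dy * dy_dz * dz_dw

# Sanity check: nudge w slightly in both directions and measure the slope
eps = 1e-6
grad_numeric = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)

print(grad_chain, grad_numeric)   # the two values should agree closely
```

The gradient is negative here, meaning a slightly larger w would lower the loss.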
tensors.py (Python)
```python
import torch

# Create tensors
x = torch.randn(32, 784)     # batch of 32 flattened 28x28 images
W = torch.randn(784, 128)    # weight matrix: 784 in -> 128 out
b = torch.randn(128)         # bias vector

# A linear layer, by hand
y = x @ W + b                # @ is matmul; b is BROADCAST over 32 rows
print(x.shape, "@", W.shape, "->", y.shape)   # shapes: (32,784) @ (784,128) -> (32,128)

# Reshaping / moving axes (constant in real code)
imgs = torch.randn(32, 3, 224, 224)    # (batch, channels, H, W)
flat = imgs.reshape(32, -1)            # -> (32, 150528)
moved = imgs.permute(0, 2, 3, 1)       # NCHW -> NHWC: (32,224,224,3)

# Everything runs on a GPU by moving the tensor there
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)                       # same API, faster hardware
```
When a model breaks, print the shapes. A large share of deep-learning bugs are tensors that are the wrong shape, on the wrong device (CPU vs GPU), or the wrong dtype (float32 vs int64). Make print(x.shape, x.dtype, x.device) a reflex.
SECTION 04 The MLP & the forward pass
Stack layers of neurons, push data through, get a prediction. This is the "hello world" of deep nets.
A multi-layer perceptron (MLP), also called a fully-connected or dense network, is layers of neurons where every neuron connects to every neuron in the next layer. Data enters the input layer, flows through one or more hidden layers, and exits the output layer. Computing the output from the input is the forward pass.
Figure: a multi-layer perceptron; each layer computes activation(X·W + b). "Deep" just means more hidden layers. The output layer's size matches the task (e.g. 2 neurons for a 2-class problem).
Predicting a house price from 3 numbers: size, bedrooms, age. The input layer holds those 3 values. Hidden layer 1 might learn combinations like "big-and-new" or "small-and-old". Hidden layer 2 combines those into higher-level notions like "desirable family home". The single output neuron emits a price. Each layer builds richer concepts from the layer below, and the network decides what those concepts are, guided only by how well the final price matches reality.
mlp_pytorch.py (PyTorch)
```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 16),     # input 3 -> hidden 16
    nn.ReLU(),
    nn.Linear(16, 16),    # hidden -> hidden 16
    nn.ReLU(),
    nn.Linear(16, 1),     # hidden -> output 1 (price)
)
# forward pass: prediction = model(x)
```
mlp_keras.py (TensorFlow / Keras)
```python
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(3,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),    # linear output for regression
])
# forward pass: prediction = model(x)
```
Notice how similar they are. Once you know the concepts, switching frameworks is mostly learning
new spellings for the same nouns: nn.Linear = Dense, both stack into a
Sequential.
SECTION 05 Activation functions
The non-linearity that lets networks model curves, not just lines. Without it, depth is pointless.
Here is the single most important fact about activations: stacking linear layers without a non-linearity gives you… one linear layer. W₂(W₁x) = (W₂W₁)x is still just a matrix multiply. The activation function inserts a bend between layers, and bends are what let a network approximate essentially any continuous function, given enough units: curved decision boundaries, complex mappings, the works. (This is the gist of the universal approximation theorem.)
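The collapse of stacked linear layers is easy to check numerically. A minimal sketch with hand-picked 2-dimensional matrices (the values are arbitrary, chosen so the arithmetic is easy to follow by eye):

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])    # first "layer"
W2 = np.array([[1.0, 1.0]])     # second "layer"
x = np.array([1.0, 0.0])

two_linear_layers = W2 @ (W1 @ x)
one_merged_layer = (W2 @ W1) @ x
print(np.allclose(two_linear_layers, one_merged_layer))   # True

# Put a ReLU between the layers and the single merged matrix no longer matches
with_relu = W2 @ np.maximum(0.0, W1 @ x)
print(two_linear_layers, with_relu)                        # [0.] [1.]
```

Without the ReLU the two hidden units cancel exactly. With it, the negative unit is clipped to zero, so the output can no longer be written as one matrix times x.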
| Function | Output range | Use it for | Watch out for |
|---|---|---|---|
| ReLU | [0, ∞) | Default for hidden layers | "Dying ReLU": units stuck at 0 |
| Leaky ReLU / GELU | Leaky ReLU: (−∞, ∞); GELU: ≈ [−0.17, ∞) | Hidden layers; GELU in transformers | Slightly more compute |
| Sigmoid | (0, 1) | Binary classification output | Saturation → vanishing gradients |
| Tanh | (−1, 1) | RNN/LSTM state updates, zero-centered outputs | Also saturates |
| Softmax | (0, 1), sums to 1 | Multi-class output (a probability dist.) | Not a hidden-layer activation (attention, §13, applies it to scores) |
Think of softmax as turning raw scores into a probability vote. If the final layer outputs raw scores [2.0, 1.0, 0.1] for "cat / dog / bird", softmax converts them to roughly [0.66, 0.24, 0.10]. They're now positive and sum to 1, so you can read them as "66% confident it's a cat". The biggest score still wins, but you also get a graded confidence for every class (not necessarily a well-calibrated one), which the loss function (§6) needs.
activations.py (Python)
```python
import torch
import torch.nn.functional as F

z = torch.tensor([2.0, 1.0, 0.1])

F.relu(z)             # [2.0, 1.0, 0.1]       unchanged here; negatives would become 0
torch.sigmoid(z)      # ≈ [0.88, 0.73, 0.52]  each squashed to (0, 1)
torch.tanh(z)         # ≈ [0.96, 0.76, 0.10]  squashed to (-1, 1)
F.softmax(z, dim=0)   # ≈ [0.66, 0.24, 0.10]  a probability distribution
F.gelu(z)             # smooth ReLU-like curve used in BERT/GPT
```
Don't put a softmax or sigmoid on the output and also use a loss that
applies it internally. CrossEntropyLoss in PyTorch and from_logits=True in Keras
expect raw logits (no final activation). Applying softmax twice silently hurts training.
Feed raw scores to those losses.
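A sketch of why the double softmax hurts, using a hand-written NumPy softmax and cross-entropy rather than any framework's loss (the helper names are illustrative, not a library API):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy_from_logits(logits, true_idx):
    # What a logits-based loss does internally: softmax, then -log of the true class
    return -np.log(softmax(logits)[true_idx])

logits = np.array([2.0, 1.0, 0.1])   # raw scores; the true class is index 0

correct = cross_entropy_from_logits(logits, 0)
# Mistake: the model already applied softmax, then the loss applies it again
doubled = cross_entropy_from_logits(softmax(logits), 0)

print(softmax(logits))            # roughly [0.66, 0.24, 0.10]
print(softmax(softmax(logits)))   # much flatter: confidence is squeezed out
print(correct, doubled)
```

Because softmax outputs all lie between 0 and 1, a second softmax sees almost no difference between them. The loss stays high even for a good prediction and the gradients shrink accordingly.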
SECTION 06 Loss functions: measuring "how wrong"
A single number that scores the prediction against the truth. Training = making this number small.
The loss (or cost/objective) function takes the network's prediction and the true answer and outputs one number: bigger = more wrong. The entire goal of training is to minimise this number. Your choice of loss defines what "good" means, so it must match the task.
| Task | Loss | What it measures |
|---|---|---|
| Regression (predict a number) | MSE (mean squared error) | Average squared gap between prediction and target |
| Regression, robust to outliers | MAE / Huber | Absolute gap (Huber: squared near zero, absolute beyond); large errors are penalized less severely than with MSE |
| Binary classification | Binary cross-entropy | How surprised the model is by the true 0/1 label |
| Multi-class classification | Cross-entropy | How much probability mass landed on the wrong class |
loss = −Σᵢ yᵢ · log(ŷᵢ)
In cross-entropy, y is the true distribution (usually 1 for the correct class, 0 elsewhere) and ŷ the predicted probabilities. It heavily punishes confident wrong answers: predicting 0.01 for the true class gives a penalty of −log(0.01) ≈ 4.6, against roughly 0.01 for predicting 0.99.
Cross-entropy is "surprise". If the true label is cat and the model says cat with 99% confidence, it was barely surprised → tiny loss. If it confidently said dog at 99%, the truth is a shock → large loss. This asymmetry is exactly what you want: a model that is confidently wrong should be punished far more than one that was unsure. MSE behaves similarly for numbers — being off by 10 costs 100× more than being off by 1, because the gap is squared.
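The "surprise" numbers are quick to compute. A minimal sketch using only the standard library; the three probabilities are example values:

```python
import math

# Cross-entropy for one example is -log(probability given to the true class)
losses = {p: -math.log(p) for p in (0.99, 0.5, 0.01)}
for p_true, loss in losses.items():
    print(f"p(true class) = {p_true} -> loss = {loss:.3f}")

# Squared error grows with the square of the gap
err_small, err_large = 1.0, 10.0
ratio = err_large ** 2 / err_small ** 2
print(f"off by 10 costs {ratio:.0f}x more than off by 1")
```

Going from 99% to 1% confidence in the true class multiplies the loss by several hundred, which is the asymmetry the paragraph above describes.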
losses.py (Python)
```python
import torch
import torch.nn as nn

# Regression
pred = torch.tensor([3.2, 5.0])
target = torch.tensor([3.0, 4.5])
mse = nn.MSELoss()(pred, target)              # -> 0.145

# Multi-class classification (3 classes, batch of 2)
logits = torch.tensor([[2.0, 0.5, 0.1],       # raw scores, NOT softmaxed
                       [0.2, 0.1, 3.0]])
labels = torch.tensor([0, 2])                 # true class indices
ce = nn.CrossEntropyLoss()(logits, labels)    # applies softmax internally

print(mse.item(), ce.item())
```
SECTION 07 Backpropagation: the learning engine
How the network figures out which way to nudge every weight to lower the loss. The single most important algorithm in the field.
You have a loss number. Now what? You need to know, for each of the millions of weights, "if I increase this weight slightly, does the loss go up or down, and by how much?" That quantity is the gradient of the loss with respect to that weight. Backpropagation ("backprop") computes all of them in one efficient backward sweep.
It works in two phases:
- Forward pass — push the input through the network, computing and remembering each intermediate value, until you reach the loss.
- Backward pass — start from the loss and walk backwards, applying the chain rule (§3) at every step to push the gradient back to each weight. Each layer hands the layer before it the message "here's how much you contributed to the error."
Figure: the backward pass computes ∂loss/∂W for every parameter. The optimizer (§8) then uses those gradients to update the weights.
A factory ships a defective product (the loss). To assign blame, the manager walks the assembly line backwards: the packaging station was 10% responsible, the welding station 60%, the parts supplier 30%. Each station now knows exactly how much to adjust. Backprop is this blame-assignment, done with calculus: it distributes "responsibility for the error" back through every layer, in proportion to how much each weight influenced the outcome. The beautiful part: it reuses the calculations from later layers when computing earlier ones, so the whole backward pass costs only a small constant multiple of one forward pass.
autograd.py (PyTorch)
```python
import torch

# w is a parameter we want gradients for
w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])

y = w * x                  # forward: 6.0
loss = (y - 10) ** 2       # target is 10 -> loss = (6-10)^2 = 16
loss.backward()            # BACKPROP: fills w.grad with d(loss)/dw
print(w.grad)              # tensor([-24.]) -> negative: increasing w lowers loss

# You almost never call backward yourself on toy expressions;
# in real training it's one line inside the loop (see section 10).
```
Backprop (popularised 1986) made it computationally feasible to train deep networks. Modern frameworks build a computational graph of every operation during the forward pass, then automatically differentiate it — this is automatic differentiation (autodiff). You write only the forward math; the gradients come free.
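To show what "build a graph, then differentiate it" can look like, here is a toy scalar autodiff sketch in plain Python. The `Value` class is hypothetical and written only for this illustration; real frameworks are far more general and are not implemented this way line for line. It reuses the numbers from the autograd example above.

```python
class Value:
    """A scalar that remembers how it was computed (toy example)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # the Values this one was built from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order the graph so each node is handled after everything that uses it
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                  # d(loss)/d(loss)
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad   # chain rule, accumulated

# loss = (w * x - 10)^2 with w = 2, x = 3
w, x = Value(2.0), Value(3.0)
diff = w * x + (-10.0)
loss = diff * diff
loss.backward()
print(loss.data, w.grad)   # 16.0 -24.0
```

The forward pass records each operation's local derivative; the backward pass multiplies and sums them from the loss back to the inputs. Note that gradients are accumulated with `+=`, which is also why stale gradients need clearing between steps.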
SECTION 08 Gradient descent & optimizers
Backprop says which way is downhill. The optimizer decides how big a step to take, and how to be smart about it.
Picture the loss as a landscape of hills and valleys, with the network's weights as your coordinates. You want the lowest valley. The gradient points uphill, so you step the opposite way. That's gradient descent:
w ← w − η · ∂loss/∂w
η (eta) is the learning rate: the step size, and the single most important hyperparameter you'll tune. Too large → you overshoot and diverge. Too small → training crawls or gets stuck. Typical values: 1e-3 to 1e-5.
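The effect of the learning rate shows up even on a one-parameter problem. A sketch on a made-up quadratic loss, (3w − 10)², whose minimum sits at w = 10/3; the three learning rates are chosen to show crawling, converging and diverging, and are not recommendations for real networks.

```python
def loss(w):
    return (3.0 * w - 10.0) ** 2        # minimum at w = 10/3

def grad(w):
    return 2.0 * (3.0 * w - 10.0) * 3.0

def descend(lr, steps=20, w=0.0):
    for _ in range(steps):
        w = w - lr * grad(w)            # w <- w - eta * d(loss)/dw
    return w

for lr in (0.001, 0.05, 0.12):
    w_final = descend(lr)
    print(f"lr={lr}: w={w_final:.4f}, loss={loss(w_final):.4g}")
```

With the smallest rate the loss has barely moved after 20 steps, the middle rate lands on the minimum, and the largest overshoots further on every step and ends with a higher loss than it started with.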
Batch, stochastic, and mini-batch
Computing the gradient over the entire dataset every step is accurate but slow. Stochastic gradient descent (SGD) uses one example at a time (noisy, fast). Mini-batch gradient descent — the universal practical choice — uses a small batch (e.g. 32–256 examples), balancing speed and stability. One pass over the whole dataset is an epoch.
Smarter optimizers
Plain SGD can be slow and get stuck. Modern optimizers add tricks:
- Momentum — accumulate a running velocity so you barrel through small bumps and flat regions, like a ball rolling downhill rather than taking timid steps.
- Adam — the default for most work. Keeps a per-parameter adaptive learning rate (parameters with consistently small gradients get bigger steps) plus momentum. Robust and forgiving.
- AdamW: Adam with decoupled weight decay (the decay is applied directly to the weights instead of being folded into the gradient); the standard for training transformers.
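The update rules behind these names are short. A NumPy sketch of one common formulation of momentum and of Adam; the function names and default hyperparameters are assumptions for illustration, not any framework's exact implementation.

```python
import numpy as np

def sgd_momentum_step(w, g, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + g          # running velocity
    return w - lr * velocity, velocity

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g               # running mean of gradients (momentum)
    v = b2 * v + (1 - b2) * g ** 2          # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)               # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Two parameters whose gradients differ by a factor of 1000
w = np.array([1.0, 1.0])
g = np.array([10.0, 0.01])
w_new, m, v = adam_step(w, g, m=np.zeros(2), v=np.zeros(2), t=1)
print(w - w_new)   # both steps are close to lr, despite the gradient gap

w_mom, velocity = sgd_momentum_step(1.0, 2.0, velocity=0.0)
print(w_mom, velocity)
```

Dividing by the root of the squared-gradient average is what gives each parameter its own effective step size: the parameter with the tiny gradient moves about as far as the one with the large gradient.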
optimizers.py (PyTorch)
```python
import torch

# `model` and `loss` come from the surrounding training code
# Pass the model's parameters and a learning rate
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One optimization step (inside the training loop):
opt.zero_grad()    # clear old gradients (they accumulate otherwise!)
loss.backward()    # backprop: compute new gradients
opt.step()         # apply the update rule to every weight

# Keras equivalent:
# model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
```
Forgetting opt.zero_grad(). PyTorch accumulates gradients by default, so if you
skip it, this step's gradients pile on top of the last step's — and training quietly goes haywire. Always
zero, then backward, then step.
SECTION 09 Regularization & generalization
The real goal isn't a low training loss — it's performing well on data the model has never seen. These techniques fight memorization.
A network with millions of parameters can simply memorise the training set, scoring perfectly on it while failing on new data. That's overfitting. The opposite — too simple to capture the pattern at all — is underfitting. The art is landing in between: generalization.
The toolkit
| Technique | What it does |
|---|---|
| Train/val/test split | Hold out data the model never trains on, so you can measure true generalization. |
| Dropout | Randomly zero out a fraction of neurons each step, forcing the network not to rely on any single unit. A regularizer that mimics training an ensemble. |
| Weight decay (L2) | Penalize large weights, nudging the model toward simpler functions. |
| Early stopping | Stop training when validation loss stops improving (Figure 7). |
| Data augmentation | Create new training examples by transforming existing ones (flip/crop/rotate images, paraphrase text). More effective variety = less memorization. |
| Batch normalization | Normalize each layer's inputs per mini-batch. Stabilizes & speeds training, with a mild regularizing side-effect. |
A student who memorises last year's exam answers aces the practice test but bombs the real exam with new questions — that's overfitting. Dropout is like randomly making some study notes unavailable each night: the student is forced to understand the material rather than lean on one memorised sheet. Data augmentation is studying many rephrased versions of each problem. Early stopping is putting the books down once mock-exam scores stop improving, before you start over-memorising trivia.
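Early stopping is usually implemented with a "patience" counter. A minimal sketch in plain Python; the function name, the patience of 3 and the validation curve are all invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best = val_loss
            epochs_without_improvement = 0   # in real code: save a checkpoint here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1               # never triggered: ran to the end

# Made-up validation curve: improves, then starts to overfit
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
stop = early_stopping_epoch(history)
print(stop, min(history))   # 6 0.5
```

Training halts at epoch 6, three epochs after the best validation loss at epoch 3. In practice you would then restore the weights saved at that best epoch.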
regularization.py (PyTorch)
```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),     # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.3),       # zero 30% of activations during training
    nn.Linear(256, 10),
)

# Weight decay (L2) is set on the optimizer:
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# IMPORTANT: dropout/batchnorm behave differently in train vs eval!
model.train()    # enables dropout + batchnorm updates
# ... training ...
model.eval()     # disables dropout, freezes batchnorm stats for inference
```
Call model.train() before training and model.eval() before evaluating/inference.
Forgetting eval() leaves dropout active at test time and makes batchnorm use batch stats — a
classic cause of "my accuracy is randomly worse at inference."
SECTION 10 The training loop: everything together
Sections 4–9 in one place. Memorize this rhythm and you can train anything.
All the pieces now connect into a single repeating cycle. Internalise these five steps and the rest of deep learning is variations on the theme:
- Forward — run a mini-batch through the model to get predictions.
- Loss — compare predictions to targets.
- Backward — backprop to compute gradients (after zeroing old ones).
- Step — optimizer updates the weights.
- Repeat — over every batch, for many epochs, validating periodically.
train.py (PyTorch)
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# MyModel, EPOCHS, train_loader and val_loader are placeholders you define
# for your own task (the loaders are DataLoader instances).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = MyModel().to(device)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for epoch in range(EPOCHS):
    # ---- train ----
    model.train()
    for xb, yb in train_loader:          # mini-batches
        xb, yb = xb.to(device), yb.to(device)
        preds = model(xb)                # 1. forward
        loss = loss_fn(preds, yb)        # 2. loss
        opt.zero_grad()                  # clear old grads
        loss.backward()                  # 3. backward
        opt.step()                       # 4. update weights

    # ---- validate ----
    model.eval()
    correct = 0
    with torch.no_grad():                # no gradients needed
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            correct += (model(xb).argmax(1) == yb).sum().item()
    print(f"epoch {epoch}: val acc = {correct/len(val_loader.dataset):.3f}")
```
train_keras.py (TensorFlow / Keras)
```python
# Keras wraps the loop in .fit(): convenient, less explicit.
# The string loss below assumes the model's last layer outputs probabilities (softmax).
model.compile(optimizer="adamw",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)
```
PyTorch makes you write the loop — more code, total control, easy to debug, dominant in research.
Keras hides it in .fit() — less code, faster to prototype. 🤗 Transformers' Trainer
(§16) is a third option that handles the loop and distributed training, logging, and checkpoints for
you. Learn the explicit loop first; the magic ones make sense once you know what they're hiding.
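For a version of the loop that runs end to end with nothing but NumPy, here is a sketch that trains logistic regression on synthetic data, with the backward step written out by hand. The dataset, learning rate, batch size and epoch count are arbitrary choices for the demo, and accuracy is measured on the training data only (there is no validation split here).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable data (an assumption for the demo)
X = rng.normal(size=(512, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
lr, batch_size, epochs = 0.5, 64, 20

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

for epoch in range(epochs):
    order = rng.permutation(len(X))                 # reshuffle every epoch
    for start in range(0, len(X), batch_size):      # mini-batches
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]

        p = predict_proba(xb, w, b)                 # 1. forward
        loss = -np.mean(yb * np.log(p + 1e-12)
                        + (1 - yb) * np.log(1 - p + 1e-12))   # 2. loss
        grad_w = xb.T @ (p - yb) / len(xb)          # 3. backward (by hand)
        grad_b = np.mean(p - yb)
        w = w - lr * grad_w                         # 4. step
        b = b - lr * grad_b

accuracy = np.mean((predict_proba(X, w, b) > 0.5) == (y == 1))
print(f"final batch loss = {loss:.3f}, train accuracy = {accuracy:.3f}")
```

The structure is the same forward, loss, backward, step rhythm as the PyTorch loop; the only difference is that the gradient formulas are written by hand instead of coming from autograd.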
SECTION 11 Convolutional networks: seeing
The architecture that cracked computer vision. It exploits the fact that in images, nearby pixels are related and patterns repeat.
Why not just feed an image into an MLP? A modest 224×224 RGB image is 150,528 numbers; a single dense layer to 1,000 units would need 150 million weights — and it would treat a cat in the top-left as completely unrelated to a cat in the bottom-right. Convolutional neural networks (CNNs) fix both problems with two ideas:
- Local connectivity. A small filter (kernel), e.g. 3×3, slides across the image looking at one little patch at a time. Vision is local — an edge is defined by a few neighbouring pixels.
- Weight sharing. The same filter is reused at every position. If a filter learns to detect a vertical edge, it detects vertical edges everywhere: far fewer parameters, and built-in translation equivariance (shift the input and the feature map shifts with it; pooling then adds approximate invariance to small shifts).
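Both ideas fit in a few lines of NumPy. The sketch below slides a hand-written vertical-edge kernel over a tiny image (as in deep-learning frameworks, this is technically cross-correlation, with no kernel flip). The image and kernel values are made up; a trained network would learn its own kernels.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]      # local connectivity
            out[i, j] = np.sum(patch * kernel)     # same weights at every position
    return out

# 6x6 image: dark left half, bright right half -> one vertical edge
image = np.zeros((6, 6))
image[:, 3:] = 1.0

kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d_valid(image, kernel)
print(response)          # every row is [0, 3, 3, 0]: strong only at the edge

dense_weights = (6 * 6) * response.size   # a dense layer mapping 36 pixels to 16 outputs
print(kernel.size, "shared weights vs", dense_weights, "dense weights")
```

Nine shared weights produce the same 4×4 output that a fully connected layer would need 576 weights for, and the detector fires wherever the edge appears.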
The full CNN recipe
A typical CNN alternates convolution → activation → pooling, repeated, then flattens into a small MLP head for the final prediction:
- Conv layers extract features; early ones find edges, deeper ones find object parts.
- Pooling (e.g. max-pool 2×2) downsamples the feature maps, shrinking spatial size, cutting compute, and adding robustness to small shifts.
- Flatten + dense head turns the final feature maps into class scores.
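The size of the flattened feature vector follows from simple arithmetic on each layer. A sketch of the standard output-size formula, applied to two rounds of 3×3 convolution (padding 1) and 2×2 max-pooling; the 32×32 input size and the 64 final channels are assumptions chosen for the example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                        # assumed 32x32 input image
size = conv_out(size, kernel=3, padding=1)       # 3x3 conv, padding 1 -> 32
size = conv_out(size, kernel=2, stride=2)        # 2x2 max-pool        -> 16
size = conv_out(size, kernel=3, padding=1)       # 3x3 conv, padding 1 -> 16
size = conv_out(size, kernel=2, stride=2)        # 2x2 max-pool        -> 8

channels = 64                                    # feature maps after the last conv
flat_features = channels * size * size
print(size, flat_features)   # 8 4096
```

If the input resolution changes, the dense head's input size changes with it, which is a frequent source of shape errors when adapting a CNN to new images.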
Recognising a face. The first conv layer's filters fire on tiny edges and color blobs. The next layer combines edges into eyes, noses, mouth corners. A deeper layer combines those parts into "a face arrangement". Pooling between them means it doesn't matter if the face is a few pixels left or right — the same eye-detector fires regardless. You designed none of these detectors; the network grew them because they reduced the loss on labelled faces.
cnn.py (PyTorch)
```python
import torch.nn as nn

# Assumes 32x32 RGB inputs (e.g. CIFAR-10 sized): two 2x2 pools take 32 -> 16 -> 8,
# which is where the 64 * 8 * 8 in the first Linear layer comes from.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),    # 3 RGB in -> 32 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                               # halve spatial size
    nn.Conv2d(32, 64, kernel_size=3, padding=1),   # 32 -> 64 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128),                    # dense head
    nn.ReLU(),
    nn.Linear(128, 10),                            # 10 classes
)

# For real projects you rarely build from scratch: you fine-tune a
# pretrained ResNet/EfficientNet (see Section 15 on transfer learning).
```
Almost nobody trains a vision CNN from scratch anymore. You take a model pretrained on ImageNet (ResNet, EfficientNet, or a Vision Transformer) and fine-tune it on your data — better accuracy with a fraction of the data and compute. More on this in §15.
SECTION 12 Recurrent networks: sequences & memory
For data with order — text, speech, time series. The idea that ruled NLP before transformers, and still worth understanding.
An MLP and a CNN see a fixed-size input all at once. But language and time series are sequences of arbitrary length where order matters ("dog bites man" ≠ "man bites dog"). A recurrent neural network (RNN) processes a sequence one step at a time, carrying a hidden state — a running memory — from one step to the next. At each step it combines the new input with the memory of everything seen so far.
The vanishing gradient problem & the LSTM fix
Plain RNNs struggle to remember things from far back: during backprop the gradient is multiplied at every timestep and shrinks toward zero over long sequences — the vanishing gradient problem. By the time it reaches the start of a long sentence, the learning signal has evaporated.
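The shrinking can be watched directly in a one-unit RNN. A sketch assuming a scalar tanh RNN with recurrent weight 0.9 and random inputs (all illustrative choices): the gradient of the last hidden state with respect to the first is a product of one factor per timestep.

```python
import numpy as np

# Scalar RNN: h_t = tanh(w * h_{t-1} + x_t)
# d h_T / d h_0 = product over t of  w * (1 - h_t^2)
rng = np.random.default_rng(0)
w = 0.9                      # recurrent weight (illustrative value)
h = 0.0
grad = 1.0
history = []

for t in range(100):
    x_t = rng.normal()
    h = np.tanh(w * h + x_t)
    grad *= w * (1.0 - h ** 2)    # chain rule through this timestep
    history.append(abs(grad))

for t in (0, 9, 49, 99):
    print(f"after {t + 1:3d} steps: |gradient| = {history[t]:.3e}")
```

Each factor has magnitude at most 0.9, so after 100 steps the gradient is bounded by 0.9¹⁰⁰ (about 3e-5) and is typically far smaller. With a recurrent weight well above 1 the same product can blow up instead, which is the exploding-gradient counterpart.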
The LSTM (Long Short-Term Memory) and the simpler GRU solve this with gates — small neural mechanisms that learn what to keep, what to forget, and what to output from a protected "cell state" that flows through largely unchanged. This lets information survive across hundreds of steps.
Reading "The keys to the cabinet … are on the table." To choose "are" over "is", the model must remember, across many intervening words, that the subject was plural ("keys"). A plain RNN tends to forget by then. An LSTM's forget gate learns to hold onto "subject = plural" in its cell state until the verb arrives, then uses it. The gates are themselves learned — the network discovers what's worth remembering.
lstm.py (PyTorch)
```python
import torch
import torch.nn as nn

# Classify the sentiment of a sequence of word embeddings
class SentimentLSTM(nn.Module):
    def __init__(self, vocab, dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)              # token id -> vector
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                   # pos / neg

    def forward(self, x):                   # x: (batch, seq_len)
        e = self.embed(x)                   # (batch, seq, dim)
        out, (h_n, c_n) = self.lstm(e)      # h_n = final memory
        return self.head(h_n[-1])           # classify from it
```
Transformers (§13) have largely replaced RNNs for language because they process a whole sequence in parallel and model long-range dependencies better. RNNs/LSTMs are still useful for streaming data, very long time series on tight compute, and on-device settings — and understanding them clarifies why attention was such a leap.
SECTION 13 Transformers & attention
The architecture behind GPT, BERT, and essentially every modern large model. Understand attention and you understand the engine of the AI boom.
In 2017 the paper "Attention Is All You Need" introduced the Transformer, discarding recurrence entirely. Its core insight: instead of passing memory step-by-step, let every token look directly at every other token and decide how much each one matters for understanding it. That mechanism is self-attention. Because all tokens are processed at once, transformers parallelise beautifully on GPUs — which is what made training enormous models practical.
Self-attention: Query, Key, Value
Each token produces three vectors via learned projections:
- Query (Q) — "what am I looking for?"
- Key (K) — "what do I offer / what am I about?"
- Value (V) — "the actual information I'll pass on if attended to."
A token's query is compared (dot product) against every token's key to produce attention scores — how relevant each other token is. Softmax turns those into weights that sum to 1, and the output is the weighted sum of all the values. In short: each token gathers a custom blend of information from the whole sequence, weighted by relevance.
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
The √dₖ divisor keeps the dot products from growing too large (which would saturate the softmax). Multi-head attention runs several of these in parallel, each head free to focus on a different kind of relationship (syntax, coreference, topic), then concatenates them.
The full transformer block
A transformer is a stack of identical blocks. Each block is: multi-head self-attention → add & normalize → feed-forward MLP → add & normalize. Two more pieces make it work:
- Positional encoding. Since attention sees the sequence as an unordered set, we add a signal encoding each token's position, so "dog bites man" differs from "man bites dog".
- Residual connections + layer norm. The "add & normalize" steps let gradients flow through very deep stacks (dozens of blocks) without vanishing — the trick that lets transformers be huge.
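One common choice for the position signal is the fixed sinusoidal scheme from the original Transformer paper; many current models use learned or rotary variants instead. A NumPy sketch, with random vectors standing in for real token embeddings:

```python
import numpy as np

def sinusoidal_positions(seq_len, dim):
    """Fixed sin/cos position signal; dim is assumed to be even."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))    # (dim/2,)
    angles = positions * freqs                               # (seq_len, dim/2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)                             # even columns
    pe[:, 1::2] = np.cos(angles)                             # odd columns
    return pe

# Toy embeddings: both orderings contain exactly the same token vectors
rng = np.random.default_rng(0)
dog, bites, man = rng.normal(size=(3, 8))
a = np.stack([dog, bites, man])      # "dog bites man"
b = np.stack([man, bites, dog])      # "man bites dog"

pe = sinusoidal_positions(3, 8)
dog_first = a[0] + pe[0]             # "dog" at position 0
dog_last = b[2] + pe[2]              # "dog" at position 2

print(np.allclose(a[0], b[2]))             # True: same token, same raw embedding
print(np.allclose(dog_first, dog_last))    # False: position now changes the vector
```

After the position signal is added, the same word carries a different vector depending on where it sits, so attention can tell the two sentences apart.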
Encoder, decoder, and the model families
| Family | Structure | Best at | Examples |
|---|---|---|---|
| Encoder-only | Reads the whole input at once (bidirectional) | Understanding: classification, embeddings, search | BERT, RoBERTa |
| Decoder-only | Predicts the next token left-to-right (causal) | Generation: chat, completion, code | GPT, Llama, Mistral |
| Encoder–decoder | Encode input, then generate output | Translation, summarization (seq-to-seq) | T5, BART |
Think of self-attention as a meeting where everyone can hear everyone. To decide what "it" refers to, the word "it" effectively asks the room, "who here is a noun I might be standing in for?" Every other word answers with how relevant it is; "animal" answers loudest. "It" then updates its understanding by blending in mostly "animal". A decoder (like GPT) is the same meeting but with a rule: you may only listen to words before you, never after — which is exactly what's needed to predict the next word one at a time.
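The decoder's "only listen to earlier words" rule is usually enforced with a causal mask on the attention scores. A NumPy sketch with random queries and keys as stand-ins (the function names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention_weights(Q, K):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq, seq)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)             # hide later tokens
    return softmax(scores)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))     # 4 tokens, 8-dim queries (random stand-ins)
K = rng.normal(size=(4, 8))

weights = causal_attention_weights(Q, K)
print(np.round(weights, 2))     # everything above the diagonal is zero
```

Setting future scores to negative infinity before the softmax gives those positions exactly zero weight, while each row still sums to 1 over the tokens it is allowed to see. The first token can only attend to itself.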
attention.py (PyTorch)
```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, dim), one sequence of token vectors
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                    # learned projections
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5     # (seq, seq)
    weights = F.softmax(scores, dim=-1)                 # how much each token attends
    return weights @ V                                  # weighted blend of values

# In practice you use the built-in, optimized version:
# torch.nn.MultiheadAttention, or just load a pretrained transformer (§16).
```
Transformers removed the sequential bottleneck of RNNs (parallel training), handle long-range dependencies directly (any token to any token in one step), and scale predictably — bigger model + more data + more compute reliably yields better performance. That predictable scaling is precisely what fuelled the era of large language models.
SECTION 14 Embeddings: meaning as geometry
How networks turn words, images, or anything discrete into vectors where distance equals similarity. The quiet idea behind search, RAG, and recommendations.
A network can't multiply the word "king". An embedding maps each discrete item (a word, a product, a user) to a dense vector of, say, 768 numbers. Crucially, these vectors are learned so that items used in similar ways end up near each other in the space. Meaning becomes geometry: similarity is just distance.
Relationships become directions in the space. The "royalty" and "gender" concepts emerge as consistent vector offsets — never programmed, just a by-product of training on how words co-occur.
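To make "relationships become directions" concrete, here is a toy sketch with hand-made 2-D vectors. The numbers are invented for illustration (real embeddings are learned and have hundreds of dimensions), but the arithmetic is the same: subtract one offset, add another, and look for the nearest neighbour.

```python
import math

# Hand-made toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only).
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.7, 0.05],
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman: remove the "male" offset, add the "female" one.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(nearest)
```

With these toy values the nearest word to the shifted vector is "queen"; in a trained embedding space the match is approximate rather than exact.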
This is how semantic search and RAG (retrieval-augmented generation) work. You embed every document in your knowledge base into vectors and store them. When a user asks a question, you embed the question too, then find the document vectors nearest to it — those are the most relevant passages, even if they share no exact keywords ("car trouble" finds a doc about "vehicle won't start"). Recommendation systems do the same with products and users.
embeddings.py (🤗 sentence-transformers)

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")        # small, fast embedder
docs = ["How do I reset my password?",
        "Ways to recover account access",
        "What's the weather in Paris?"]
emb = model.encode(docs, convert_to_tensor=True)       # (3, 384) vectors
query = model.encode("I forgot my login", convert_to_tensor=True)
scores = util.cos_sim(query, emb)                      # cosine similarity
print(scores)   # highest for the first two docs, low for the weather one
```
SECTION 15 Transfer learning & fine-tuning
Don't start from zero. Take a model that already learned general features on a huge dataset, and adapt it to your task with a fraction of the data.
Training a large model from scratch needs millions of examples and serious compute. Transfer learning sidesteps this: start from a model pretrained on a giant generic dataset (ImageNet for vision, web-scale text for language), then adapt it to your specific task. The pretrained model already "knows" edges, shapes, grammar, and facts — you just teach it your last mile.
Two main strategies:
- Feature extraction. Freeze the pretrained body, replace only the final layer (the "head"), and train just that head on your data. Fast, needs little data, lower ceiling.
- Fine-tuning. Unfreeze some or all of the pretrained weights and continue training them (at a small learning rate) on your data. More data and compute, higher accuracy.
You want to classify 10 species of local birds but have only 500 photos — nowhere near enough to train a vision model from scratch. Instead you take a ResNet that already learned, from 1.2 million ImageNet images, what feathers, beaks, edges, and textures look like. You snip off its 1,000-class head, bolt on a fresh 10-class head, and train. Because the hard, general visual work is already done, your 500 photos are enough to get strong accuracy. It's standing on the shoulders of a model that already learned to see.
finetune_vision.py (PyTorch / torchvision)

```python
import torch
import torch.nn as nn
from torchvision import models

# 1. load a model pretrained on ImageNet
net = models.resnet50(weights="IMAGENET1K_V2")

# 2. freeze the body (feature extraction)
for p in net.parameters():
    p.requires_grad = False

# 3. replace the head for OUR 10 bird classes (this part trains)
net.fc = nn.Linear(net.fc.in_features, 10)

# 4. train only the new head with the standard loop from Section 10
opt = torch.optim.Adam(net.fc.parameters(), lr=1e-3)

# To FINE-TUNE instead: unfreeze later layers and use a tiny LR (e.g. 1e-5).
```
Pretrain (or download) → fine-tune → deploy. For language, this is exactly what creating a custom chatbot or classifier looks like: start from a pretrained LLM and fine-tune on your domain. Parameter-efficient methods like LoRA fine-tune only a tiny set of added weights, making it cheap to adapt even billion-parameter models on a single GPU.
SECTION 16 Hugging Face & the pretrained ecosystem
The practical fast-track. Thousands of ready-to-use models, three lines of code to inference, a standard recipe to fine-tune.
The 🤗 Transformers library is the de-facto hub for pretrained models. You rarely build a transformer by hand; you download one of hundreds of thousands of community models from the Hugging Face Hub and either use it directly or fine-tune it. Three abstraction levels, from easiest to most controlled:
Level 1 — pipeline: inference in 3 lines
pipeline.py (🤗 Transformers)

```python
from transformers import pipeline

clf = pipeline("sentiment-analysis")             # downloads a model for you
print(clf("I absolutely loved this guide!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]

# the same API covers dozens of tasks:
pipeline("summarization")
pipeline("question-answering")
pipeline("text-generation", model="gpt2")
pipeline("zero-shot-classification")             # classify into labels you invent
pipeline("automatic-speech-recognition")         # audio -> text
```
Level 2 — tokenizer + model: full control over inference
When you need the logits, embeddings, or custom decoding, load the tokenizer (which turns text into the integer token IDs the model expects) and the model separately.
model_direct.py (🤗 Transformers)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

inputs = tok("Transformers make this easy.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # raw scores
probs = logits.softmax(-1)                # -> probabilities
print(model.config.id2label[probs.argmax().item()])   # 'POSITIVE'
```
Level 3 — Trainer: fine-tune on your own data
The Trainer handles the entire training loop, evaluation, checkpointing, mixed precision, and multi-GPU: you supply a dataset and a config.
finetune_text.py (🤗 Transformers)

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

ds = load_dataset("imdb")                                       # 1. data
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ds = ds.map(lambda b: tok(b["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(     # 2. model
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="out", num_train_epochs=2,  # 3. config
                         per_device_train_batch_size=16,
                         eval_strategy="epoch",
                         learning_rate=2e-5,
                         fp16=True)                             # fp16 needs a CUDA GPU

Trainer(model=model, args=args,                                 # 4. train
        train_dataset=ds["train"], eval_dataset=ds["test"],
        processing_class=tok,   # named `tokenizer=` in older transformers releases
        ).train()
```
Hugging Face is to deep learning what package managers (npm, pip) are to software: instead of writing a sentiment model, an OCR model, and a translator from scratch (each months of work), you install pretrained ones and compose them. The Hub is the "app store" of models; pipeline is the one-click install; Trainer is the recipe for teaching a downloaded model your specific job. This is why a small team can ship serious NLP in days.
When picking a model from the Hub: match the task (text-classification, summarization, ASR…), check the size against your hardware (a 7B-parameter model needs ~14 GB of VRAM for the fp16 weights alone), prefer models with a permissive license and healthy download and like counts, and read the model card for its training data and known limitations. Smaller distilled models (DistilBERT, MiniLM) are often the right call for production latency.
SECTION 17 Generative models
Networks that don't just classify — they create. The lineage behind image generators, voice cloning, and beyond.
Everything so far has been discriminative: map an input to a label or number. Generative models learn the underlying distribution of the data so they can produce new samples that look like it — new images, audio, molecules, text. Three influential families:
Autoencoders & VAEs
An autoencoder squeezes input through a narrow bottleneck (the latent code) and reconstructs it. By forcing everything through a few numbers, it learns a compressed, meaningful representation. A variational autoencoder (VAE) makes the latent space smooth and continuous, so you can sample new points from it and decode them into novel outputs.
GANs — the adversarial game
A generative adversarial network pits two networks against each other. The generator tries to produce fakes; the discriminator tries to tell fakes from real. They train together — as the discriminator gets sharper, the generator is forced to produce more convincing samples, until the fakes are nearly indistinguishable from real data.
Diffusion models — the modern image engine
Diffusion models (behind most state-of-the-art image generators) learn to reverse a noising process. During training, real images are progressively corrupted with noise until they're pure static; the model learns to undo one step of noising. To generate, you start from pure noise and run the learned denoiser repeatedly, and a coherent image gradually emerges. Conditioning the denoiser on a text embedding gives you text-to-image.
GAN as an art forger and detective: a forger paints fakes, a detective judges them. Each time the detective spots a fake, the forger learns and improves; each new forgery trains the detective to be pickier. After thousands of rounds the forgeries are gallery-quality. Diffusion as a sculptor: imagine starting with a block of TV static and "carving away" the noise that doesn't belong, a little each pass, guided by the prompt "a cat in a spacesuit", until a clean image of exactly that remains.
generate_image.py (🤗 diffusers)

```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a cat in a spacesuit, digital art",
             num_inference_steps=30,            # how many denoising steps
             guidance_scale=7.5).images[0]      # how strongly to follow the prompt
image.save("cat.png")
```
SECTION 18 Engineering & deployment
The unglamorous work that decides whether a model ships. What separates a notebook demo from production.
Make training fast and stable
- Use the GPU. Deep learning is matrix math; GPUs do it 10–100× faster than CPUs. Confirm your tensors and model are on cuda: a model silently running on CPU is a common "why is this so slow?" culprit.
- Mixed precision (fp16/bf16). Compute in 16-bit where safe to roughly halve memory and speed things up, keeping critical parts in 32-bit. One flag in most frameworks.
- Efficient data loading. Use multiple worker processes so the CPU prepares the next batch while the GPU trains on the current one. A starved GPU is wasted money.
- Gradient accumulation. Simulate a large batch on limited memory by summing gradients over several small batches before stepping.
- Learning-rate scheduling & warmup. Ramp the LR up then decay it; standard for transformers and a reliable accuracy boost.
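Gradient accumulation from the list above can be checked by hand on a toy problem. The sketch below assumes a one-parameter model y = w·x with mean-squared-error loss (both chosen only for illustration) and shows that averaging the gradients of equal-sized micro-batches reproduces the full-batch gradient.

```python
def grad_mse(w, batch):
    """d/dw of mean((w*x - y)^2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Toy data, roughly y = 2x (values invented for the example).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
        (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]
w = 0.5

# Gradient of one big batch of 8.
full_grad = grad_mse(w, data)

# Same gradient from 4 micro-batches of 2: scale each by 1/num_micro_batches
# and sum, then take a single optimizer step with the accumulated value.
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]
accumulated = 0.0
for mb in micro_batches:
    accumulated += grad_mse(w, mb) / len(micro_batches)

print(full_grad, accumulated)
```

The equality relies on the micro-batches having the same size; with uneven micro-batches the scaling would need to be weighted by example count.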
Debugging checklist (in order)
| Symptom | First things to check |
|---|---|
| Loss is NaN | Learning rate too high; bad input normalization; log(0) in a custom loss. Lower LR, clip gradients. |
| Loss won't decrease | Forgot zero_grad(); LR too low/high; labels misaligned with inputs; model on wrong device. |
| Great train acc, bad val acc | Overfitting — add dropout/augmentation/weight decay, get more data, or stop earlier (§9). |
| Worse at inference than training | Forgot model.eval(); different preprocessing at inference time. |
| Out-of-memory (OOM) | Reduce batch size; enable mixed precision; use gradient accumulation/checkpointing. |
Overfit a single batch on purpose. Take 2–4 examples and train until the loss is near zero. If your model can't memorize a tiny batch, there's a bug in the model, loss, or data pipeline — no amount of tuning will help until you fix it. If it can, your machinery works and the rest is data and regularization.
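As a sketch of that sanity check, the following trains a tiny logistic-regression model with plain gradient descent on a fixed batch of four made-up examples. It is a stand-in for a real network (the model, data, learning rate, and step count are all assumptions for illustration); the point is only that the loss on the memorized batch should collapse toward zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, batch):
    """Mean binary cross-entropy of a 2-feature logistic model on the batch."""
    total = 0.0
    for x, y in batch:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(batch)

# One tiny, fixed batch: the label equals the first feature.
batch = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.5
initial_loss = mean_loss(w, b, batch)

for step in range(2000):
    gw, gb = [0.0, 0.0], 0.0
    for x, y in batch:
        err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y   # dLoss/dlogit
        gw[0] += err * x[0] / len(batch)
        gw[1] += err * x[1] / len(batch)
        gb += err / len(batch)
    w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
    b -= lr * gb

final_loss = mean_loss(w, b, batch)
print(f"loss before: {initial_loss:.3f}  after: {final_loss:.4f}")
```

If the "after" loss refused to drop on a batch this small, the bug would be in the model, loss, or data handling rather than in the hyperparameters.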
From trained model to product
- Save & version the model weights and the exact preprocessing alongside them.
- Export to a portable format (ONNX, TorchScript, or a SavedModel) for serving outside Python.
- Optimize for inference: quantization (run in int8) and distillation (train a small model to mimic a big one) cut latency and cost dramatically.
- Serve behind an API (FastAPI, TorchServe, TF Serving, or a managed endpoint), batch requests, and monitor.
- Monitor for drift. Real-world data shifts over time; track input distributions and prediction quality, and plan to retrain.
deploy.py (PyTorch + FastAPI)

```python
import torch
from fastapi import FastAPI

# `model` (an nn.Module) and `preprocess` (text -> input tensor) are assumed
# to be defined elsewhere, exactly as they were for training.

# --- save & load weights ---
torch.save(model.state_dict(), "model.pt")
model.load_state_dict(torch.load("model.pt"))
model.eval()                            # ALWAYS eval before serving

# --- minimal inference API ---
app = FastAPI()

@app.post("/predict")
def predict(payload: dict):
    x = preprocess(payload["text"])     # same prep as training!
    with torch.no_grad():
        logits = model(x)
    return {"label": int(logits.argmax(-1))}

# run: uvicorn deploy:app --host 0.0.0.0 --port 8000
```
SECTION 19 Quantization: making big models fit
A model's weights are just numbers. Store them in fewer bits and the model shrinks 2–8× — often with almost no loss in quality. This is what lets a 70B model run on a single GPU.
By default weights live in 32-bit (fp32) or 16-bit (fp16/bf16) floats. Quantization maps them to low-bit integers — typically int8 or int4 — storing a scale factor so they can be approximately reconstructed. Memory scales directly with bit-width, and because generation is memory-bandwidth bound, smaller weights also mean faster inference.
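The "integers plus a scale factor" idea can be sketched with the simplest possible scheme, symmetric absmax int8 quantization with one scale for the whole tensor. This is an illustrative assumption, not what GPTQ, AWQ, or NF4 do internally (those use per-group scales and smarter rounding); the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: one shared scale, integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original floats."""
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.003, 0.88, -0.15]     # made-up fp32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q)
print("scale:", scale)
print("max reconstruction error:", max_error)
```

The reconstruction error of any single weight is bounded by half the scale, which is why one large outlier (it inflates the scale) hurts every other value in the tensor.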
| Format | Bits | 7B model size | Use |
|---|---|---|---|
| fp32 | 32 | ~28 GB | legacy / high-precision training |
| fp16 / bf16 | 16 | ~14 GB | standard training & inference (bf16 preferred on Ampere+) |
| int8 | 8 | ~7 GB | inference, light quality loss |
| int4 (NF4) | 4 | ~3.5 GB | inference + QLoRA fine-tuning |
| FP8 | 8 | ~7 GB | training/inference on H100/Blackwell |
Two timing strategies: post-training quantization (PTQ) — quantize an already-trained model (fast, the common case; methods include GPTQ, AWQ, bitsandbytes NF4, and GGUF for llama.cpp) — and quantization-aware training (QAT), which simulates quantization during training for the best low-bit accuracy at higher cost. NF4 ("4-bit NormalFloat") is a data type tuned for the bell-curve distribution of neural-net weights, and "double quantization" even quantizes the scale factors.
Think of a RAW photo vs a JPEG. The RAW file stores every pixel in full precision: gorgeous, but huge. The JPEG throws away detail your eye can't notice and is a fraction of the size. Quantization is JPEG for model weights: a 70B model in fp16 needs ~140 GB (two 80 GB A100s); in 4-bit it needs ~40 GB and runs on a single 80 GB GPU, while answering almost identically. You trade a sliver of fidelity for a roughly 3–4× cut in memory and hardware cost.
load_4bit.py (Transformers + bitsandbytes)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit NF4 config: the recipe behind QLoRA
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat-4, tuned for weight distributions
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,   # math runs in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",                       # place layers across available GPUs/CPU
)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# The 8B model now occupies ~5 GB instead of ~16 GB.
```
Quantize for inference and for QLoRA fine-tuning almost for free with int4/int8. Reach for GPTQ/AWQ when you want the fastest pre-baked inference weights, GGUF for llama.cpp/Ollama on laptops, and FP8 on H100/Blackwell for training. Watch for activation outliers — a few large activations can hurt naive low-bit schemes, which is exactly what methods like LLM.int8() and AWQ are designed to handle.
SECTION 20 LoRA, QLoRA & parameter-efficient fine-tuning
Full fine-tuning rewrites every weight — one giant checkpoint per task and a huge bill. PEFT freezes the model and trains a tiny set of new parameters instead. This is how almost all custom LLMs are built today.
The key idea behind LoRA (Low-Rank Adaptation): when you fine-tune, the change to a weight matrix is approximately low-rank, so it doesn't need full degrees of freedom. So freeze the original weight W (size d×k) and learn its update as a product of two skinny matrices:

W′ = W + ΔW = W + (α/r)·B·A, where B is d×r, A is r×k, and the rank r is much smaller than d and k.
Only A and B train. With r=8 on a 4096×4096 layer you update ~65K params instead of ~16.7M — about 0.4%. A scaling factor α/r controls the update's strength. At deploy time you can fold B·A back into W, so there is zero extra inference latency.
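The parameter arithmetic and the "fold B·A back into W" claim can both be checked in a few lines. This is a sketch with plain Python lists; the tiny 2×2 matrices and the helper names are invented for the example.

```python
def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 4096
r = 8
full = d * k
lora = lora_param_count(d, k, r)
fraction = lora / full
print(f"full: {full:,}  LoRA: {lora:,}  fraction: {fraction:.2%}")
# full: 16,777,216  LoRA: 65,536  fraction: 0.39%

def matmul(X, Y):
    """Naive matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

# Tiny merge check with d = k = 2, rank 1, alpha = 2 (made-up values).
W = [[1.0, 0.0], [0.0, 1.0]]          # frozen base weight, d x k
B = [[0.5], [-1.0]]                   # d x r
A = [[2.0, 1.0]]                      # r x k
alpha, rank = 2.0, 1
x = [[3.0], [4.0]]                    # input as a k x 1 column

# Adapter kept separate: W x + (alpha/r) * B (A x)
side = matmul(B, matmul(A, x))
separate = [[wx[0] + (alpha / rank) * s[0]] for wx, s in zip(matmul(W, x), side)]

# Adapter merged: (W + (alpha/r) * B A) x
delta = matmul(B, A)
W_merged = [[w + (alpha / rank) * dw for w, dw in zip(wr, dr)] for wr, dr in zip(W, delta)]
merged = matmul(W_merged, x)
print(separate, merged)
```

Both paths give the same output, which is why a merged adapter adds no inference latency.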
The family worth knowing:
- LoRA — the workhorse. Tiny adapters, mergeable, hot-swappable.
- QLoRA — LoRA on top of a 4-bit (NF4) frozen base + paged optimizers. Massive memory savings; the default for fine-tuning on consumer GPUs.
- DoRA — decomposes each weight into magnitude and direction and applies LoRA to the direction. Closes most of the remaining gap to full fine-tuning at the same parameter cost.
- Others: LoRA+ (different LRs for A and B), rsLoRA (rank-stabilized scaling), PiSSA (better init), VeRA (shared frozen random matrices, even fewer params), IA³, and the soft-prompt family (prefix-, prompt-, P-tuning) and classic adapters.
You have one 13B base model and need five specialists: legal, medical, code, customer-support, and sales. Full fine-tuning means five separate expensive runs and five ~26 GB checkpoints to store and serve. With LoRA you keep one frozen base and train five adapter files of ~10–50 MB each, hot-swapping them per request. It's one game console with five cartridges instead of buying five consoles.
qlora_finetune.py (PEFT + TRL + Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
import torch

model_id = "meta-llama/Llama-3.1-8B"

# 1. load base in 4-bit (the "Q" in QLoRA) -- frozen
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# 2. attach LoRA adapters (the only thing that trains)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",      # attention
                    "gate_proj", "up_proj", "down_proj"],        # MLP -> "all-linear"
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # e.g. "trainable: 0.42% of all params"

# 3. train like normal -- TRL handles the SFT loop
# `my_dataset` is your own SFT dataset (not defined in this snippet)
trainer = SFTTrainer(model=model, train_dataset=my_dataset,
                     args=SFTConfig(output_dir="out", per_device_train_batch_size=2,
                                    gradient_accumulation_steps=8, bf16=True,
                                    num_train_epochs=1, learning_rate=2e-4))
trainer.train()

# 4. save the tiny adapter (a few dozen MB), or merge for zero-latency deploy:
model.save_pretrained("my-lora-adapter")
# merged = model.merge_and_unload()   # fold B*A into W for serving
```
Start with rank r=16–64, α = 2×r, and target all linear layers (attention + MLP) — this consistently beats attaching LoRA to attention only. Learning rates are higher than full fine-tuning (1e-4 to 3e-4). Use QLoRA when memory is tight, DoRA when you want to squeeze out the last accuracy points, and merge_and_unload() before serving so inference is as fast as the base model.
SECTION 21 Alignment with RLHF & PPO
Pretraining makes a model knowledgeable; supervised fine-tuning makes it follow instructions. But "the most likely next token" isn't the same as "the answer a human prefers." Alignment closes that gap.
RLHF (Reinforcement Learning from Human Feedback) is the classic three-stage recipe that turned raw LLMs into helpful assistants:
- SFT — fine-tune on high-quality demonstration (prompt → ideal answer) pairs.
- Reward model (RM) — show humans two answers to the same prompt; they pick the better one. Train a model to predict that preference. Mathematically it uses the Bradley–Terry model: the probability that answer w beats l is σ(r(w) − r(l)).
- RL optimization (PPO) — let the policy generate answers, score them with the RM, and update the policy to earn higher reward — while a KL penalty keeps it from drifting too far from the SFT model (so it can't "cheat" the reward).
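The two formulas behind stages 2 and 3 fit in a few lines. The sketch below assumes scalar rewards and uses the common per-sample KL estimate, log π(y|x) − log π_ref(y|x), for the penalty; the function names and all numbers are illustrative, not taken from a specific library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_prob(r_w, r_l):
    """Bradley-Terry: probability that the answer scored r_w beats the one scored r_l."""
    return sigmoid(r_w - r_l)

def reward_model_loss(r_w, r_l):
    """Negative log-likelihood of the human's choice (w preferred over l)."""
    return -math.log(preference_prob(r_w, r_l))

def penalized_reward(rm_score, logp_policy, logp_ref, kl_coef):
    """RM score minus a KL-style penalty for drifting away from the SFT reference."""
    return rm_score - kl_coef * (logp_policy - logp_ref)

print(preference_prob(2.0, 0.0))          # RM already agrees with the human
print(reward_model_loss(2.0, 0.0))        # small loss
print(reward_model_loss(0.0, 2.0))        # large loss: RM ranked the pair backwards
print(penalized_reward(1.0, logp_policy=-10.0, logp_ref=-12.0, kl_coef=0.1))
```

In the last line the policy assigns the response a higher log-probability than the reference does, so part of the reward is taken back; a larger `kl_coef` shortens the leash.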
Imagine training a chef. SFT is having them copy recipes from a great cookbook. The reward model is a food critic you've trained to taste a dish and give it a score the way real diners would. PPO is the chef cooking freely, the critic scoring each plate, and the chef adjusting to win higher scores — but with a rule that they can't stray too far from real cooking (the KL leash), or they'd just drown everything in sugar to game the critic. That "gaming" failure is called reward hacking.
ppo_rlhf.py (TRL, Transformer Reinforcement Learning)

```python
# Uses the classic PPOTrainer API from older TRL releases; recent releases changed it.
import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline

# policy = SFT model + a value head (the critic PPO needs)
policy = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model")
ref = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model")   # frozen anchor
tok = AutoTokenizer.from_pretrained("my-sft-model")

# a trained reward model scores responses (here via a text-classification pipeline)
reward_fn = pipeline("text-classification", model="my-reward-model")

ppo = PPOTrainer(PPOConfig(batch_size=32, learning_rate=1e-5), policy, ref, tok)

# `dataloader` is assumed to yield batches whose "input_ids" is a list of 1-D prompt tensors
for batch in dataloader:
    queries = batch["input_ids"]
    responses = ppo.generate(queries, return_prompt=False, max_new_tokens=128)  # policy acts
    texts = [tok.decode(r) for r in responses]
    rewards = [torch.tensor(out["score"]) for out in reward_fn(texts)]          # RM judges
    # PPO step: push policy toward higher reward, clipped + KL-penalized vs ref
    ppo.step(queries, responses, rewards)
```
Optimize a proxy hard enough and the model finds loopholes: padding answers with flattery, gaming length, or exploiting RM blind spots. The KL coefficient is your main defense — too low and the model degenerates, too high and it never improves. This fragility (plus PPO's 4-model memory cost and tuning difficulty) is why the field largely shifted to the simpler methods next.
SECTION 22 DPO, GRPO & modern preference optimization
PPO works but is complex, unstable, and memory-hungry. The 2023–2025 wave of methods gets the same alignment with far less machinery — and powers today's reasoning models.
DPO (Direct Preference Optimization) made the key observation that your language model is secretly its own reward model. Instead of training a separate RM and running RL, DPO derives a single classification-style loss directly on preference pairs (prompt, chosen, rejected). It nudges the policy to raise the log-probability of chosen answers and lower it for rejected ones, with a coefficient β that plays the role of the KL leash — all measured relative to a frozen reference model. No reward model, no rollouts, no critic: just stable supervised-style training.
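For a single preference pair, the DPO loss described above reduces to one line of arithmetic on four sequence log-probabilities. The sketch below uses made-up log-probabilities and a hypothetical helper name; real trainers compute these sums over tokens and average over a batch.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * [(chosen shift) - (rejected shift)]) for one pair.

    A "shift" is how much the policy's log-probability moved relative to the
    frozen reference model.
    """
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    return math.log1p(math.exp(-margin))      # equals -log(sigmoid(margin))

# Policy identical to the reference: no preference learned yet, loss = ln 2.
start = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has raised the chosen answer and lowered the rejected one.
better = dpo_loss(-3.0, -9.0, -5.0, -7.0)

# Policy moved the wrong way.
worse = dpo_loss(-7.0, -5.0, -5.0, -7.0)

print(start, better, worse)
```

Only the movement relative to the reference matters, which is how β ends up playing the role of the KL leash.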
GRPO (Group Relative Policy Optimization, from DeepSeek) keeps RL's online strength but drops PPO's expensive critic. For each prompt it samples a group of G answers, scores them, and uses the group's mean and standard deviation to compute each answer's advantage — "how much better than my other attempts was this one?" No value network needed. It shines with verifiable rewards (RLVR): math answers you can check, code that either passes tests or doesn't. GRPO on verifiable rewards is what produced DeepSeek-R1's reasoning ability.
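The group-relative advantage is simple enough to compute by hand. This sketch assumes binary correctness rewards and normalizes with the population standard deviation plus a small epsilon; implementations differ in such details, so treat the exact normalization as an assumption.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Advantage of each sampled answer relative to its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# G = 8 answers to the same prompt; 3 were verified correct (reward 1.0).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
print([round(a, 3) for a in adv])

# If every answer gets the same reward, there is nothing to learn from.
flat = group_advantages([1.0] * 8)
print(flat)
```

Correct answers get a positive advantage, incorrect ones a negative advantage, and a group where all answers score the same contributes no gradient signal.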
The broader family: IPO (fixes a DPO overfitting failure mode), KTO (needs only a binary good/bad label per sample — no pairs — using prospect-theory utility), ORPO (folds preference into SFT with no reference model), SimPO (reference-free), and DAPO (GRPO refinements: dynamic sampling, asymmetric "clip-higher").
DPO is teaching with side-by-side flashcards: "this reply is better than that one" — the model learns the preference directly, no separate judge required. GRPO is a classroom exercise: hand eight students the same math problem, check which solutions are actually correct, and reward the ones that beat the class average. You never wrote a grading rubric (a critic) — correctness plus relative ranking is the whole signal. That's why GRPO is perfect for math and code, where "right or wrong" is checkable.
preference_opt.py (TRL)

```python
# ---------- DPO: offline, from preference pairs ----------
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("my-sft-model")
tok = AutoTokenizer.from_pretrained("my-sft-model")

# dataset columns: {"prompt", "chosen", "rejected"}
dpo = DPOTrainer(model=model, args=DPOConfig(beta=0.1, output_dir="dpo-out"),
                 train_dataset=pref_dataset, processing_class=tok)
dpo.train()   # no reward model, no rollouts -- just stable training

# ---------- GRPO: online RL with a verifiable reward ----------
from trl import GRPOTrainer, GRPOConfig

def reward_correct(completions, answer, **kw):
    # +1 if the model's final answer matches ground truth, else 0 (RLVR)
    # `extract` is a hypothetical helper that pulls the final answer out of a completion
    return [1.0 if extract(c) == a else 0.0 for c, a in zip(completions, answer)]

grpo = GRPOTrainer(
    model="my-sft-model",
    reward_funcs=reward_correct,
    args=GRPOConfig(num_generations=8,        # G samples per prompt -> the "group"
                    output_dir="grpo-out", learning_rate=1e-6),
    train_dataset=math_dataset,
)
grpo.train()
```
Have a dataset of preference pairs and want stable, cheap alignment → DPO (or KTO if you only have thumbs-up/down). Want to push reasoning on tasks with a checkable answer (math, code, tool success) → GRPO/RLVR. Need maximum control over a complex reward or true online exploration → PPO. The modern default for most teams is: SFT → DPO, with GRPO when a verifiable reward is available.
SECTION 23 Inside a modern Transformer
The 2017 Transformer in Section 13 still describes the skeleton, but every production model (Llama, Qwen, Mistral, DeepSeek, Gemma) swaps in a set of upgrades for speed, length, and quality. Here's the modern parts list.
| Component | 2017 original | Modern replacement | Why |
|---|---|---|---|
| Position | Sinusoidal / learned | RoPE (rotary), ALiBi | relative, extrapolates to longer context |
| Normalization | LayerNorm (post) | RMSNorm, pre-norm | cheaper, more stable training |
| FFN activation | ReLU | SwiGLU (gated) | ~1–2% better perplexity |
| Attention heads | Multi-head (MHA) | GQA / MQA / MLA | shrinks the KV cache → longer context |
| Attention kernel | Naive softmax(QKᵀ)V | FlashAttention 1/2/3 | IO-aware, never stores the N×N matrix |
| Capacity | Dense FFN | Mixture-of-Experts | huge total params, few active per token |
Grouped-Query Attention (GQA) is the 2025 default. Standard attention gives every query head its own key/value head — but the KV cache (the stored keys/values for past tokens) dominates memory at long context. MQA shares a single KV head across all queries (tiny cache, slight quality hit); GQA is the sweet spot: a few KV heads shared by groups of query heads. DeepSeek's MLA pushes further by compressing KV into a low-rank latent.
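A back-of-the-envelope calculation shows how much the number of KV heads matters. The configuration below (32 layers, head dimension 128, fp16 cache, 8192-token context) is an assumed Llama-3-8B-like shape used only for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys + values (the leading factor 2) for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed model shape: 32 layers, head_dim 128, fp16 cache, one 8192-token sequence.
layers, head_dim, seq_len = 32, 128, 8192

for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) / 2**30
    print(f"{name}: {kv_heads:>2} KV heads -> {gib:.3f} GiB per sequence")
```

Under these assumptions the cache drops from 4 GiB per sequence with full multi-head attention to 1 GiB with 8 KV heads, which translates directly into more concurrent requests or longer context on the same GPU.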
Mixture-of-Experts (MoE) replaces a dense feed-forward block with many parallel "expert" FFNs plus a tiny router that sends each token to only the top-k experts (often 2). The model can hold enormous knowledge (high total parameters) while each token only pays for a few experts (low active parameters) — e.g. Mixtral, DeepSeek-V3, Llama 4.
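Top-k routing can be sketched without any tensors. In the toy below the "experts" are plain functions on a number and the router logits are given directly; in a real model the experts are FFNs and the logits come from a small learned linear layer. Renormalizing the softmax over only the selected experts is one common choice, assumed here.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items()), gates

# Four toy "experts"; only two of them run for this token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]       # made-up router scores for one token
y, gates = moe_forward(3.0, experts, router_logits)
print(gates, y)
```

Experts 1 and 3 tie for the top score, so each gets a gate of 0.5 and the other two experts are never evaluated for this token.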
GQA is carpooling for memory: instead of every query commuter driving a private key/value car, they share a few cars, freeing up the parking lot (KV cache) for far longer trips (context). MoE is a large hospital with a triage desk: each patient (token) is routed to just the two relevant specialists out of a hundred on staff. You get the expertise of a hundred doctors but only pay for two consultations per patient.
modern_block.py (PyTorch + Transformers)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):                     # cheaper than LayerNorm
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.g * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):                      # gated FFN used by Llama/Qwen
    def __init__(self, d, hidden):
        super().__init__()
        self.w_gate = nn.Linear(d, hidden, bias=False)
        self.w_up = nn.Linear(d, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# FlashAttention is exact attention done IO-efficiently. In practice you just
# enable the fused kernel -- PyTorch picks it automatically.
# q, k, v are assumed to be (batch, heads, seq, head_dim) tensors defined elsewhere.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # uses Flash when available

# ...or tell Hugging Face to use it when loading a model:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16)
```
SECTION 24 Training at scale
A 70B model won't even fit on one GPU — Adam's optimizer states alone can need hundreds of gigabytes. Scaling means splitting the work, and trading compute for memory.
First, the memory budget. Training a parameter in mixed precision costs far more than the weight itself: a copy of the weight, its gradient, and Adam's two optimizer states (often kept in fp32). That's why optimizer states, not the model, usually dominate. The toolkit:
- Mixed precision (bf16/fp16, FP8 on H100+) — do math in 16/8-bit, keep a master copy where needed. ~2× faster, ~half the memory. bf16 is preferred on Ampere and newer for its stability.
- Gradient accumulation — run several small micro-batches and sum their gradients before stepping, simulating a large batch you couldn't otherwise fit.
- Gradient (activation) checkpointing — don't store every activation for the backward pass; recompute them on the fly. Trades extra compute for big memory savings.
- ZeRO (DeepSpeed) / FSDP (PyTorch) — shard the optimizer states, gradients, and even parameters across N GPUs so each holds ~1/N. This is sharded data parallelism and is the standard way to train large models.
- Parallelism dimensions — data (same model, different batches), tensor (split a single matrix multiply across GPUs), pipeline (different layers on different GPUs), plus sequence/context and expert parallelism. Combining them is "3D parallelism."
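The memory budget described at the top of this section can be turned into a rough calculator. The byte counts below assume a common mixed-precision Adam setup (bf16 weights and gradients, an fp32 master copy, and two fp32 optimizer states) and ignore activations entirely, so the result is a lower bound rather than a measurement.

```python
def training_bytes_per_param(weight_bytes=2, grad_bytes=2,
                             master_bytes=4, adam_state_bytes=8):
    """bf16 weight + bf16 gradient + fp32 master copy + Adam's two fp32 states."""
    return weight_bytes + grad_bytes + master_bytes + adam_state_bytes

def training_memory_gb(n_params, n_gpus=1):
    """Per-GPU memory in GB if everything is sharded evenly (ZeRO-3 / full-shard FSDP)."""
    return n_params * training_bytes_per_param() / n_gpus / 1e9

params = 70e9
print(training_bytes_per_param(), "bytes per parameter")
print(training_memory_gb(params), "GB unsharded")
print(training_memory_gb(params, n_gpus=8), "GB per GPU across 8 GPUs")
```

Under these assumptions a 70B model needs about 1,120 GB before a single activation is stored, and the two Adam states alone account for 560 GB of that, which is why sharding is not optional at this scale.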
You're moving a house but own only small trucks. One truck can't hold the house — not even all the furniture from one room. So you split everything across eight trucks (FSDP/ZeRO shards the weights, gradients, and optimizer state), and when you need a specific couch you radio the truck that has it (gather-on-demand). Gradient checkpointing is deciding not to keep packing boxes around — you'll just rebuild a few when needed, saving space at the cost of a little extra work.
train_at_scale.py (Transformers + Accelerate)

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,     # small micro-batch that fits in memory
    gradient_accumulation_steps=16,    # ...but train as if batch = 2 * 16 * n_gpus
    bf16=True,                         # mixed precision (Ampere/Hopper+)
    gradient_checkpointing=True,       # recompute activations -> big memory save
    optim="adamw_torch_fused",
    # multi-GPU sharding via FSDP (or use a DeepSpeed ZeRO-3 config file):
    fsdp="full_shard auto_wrap",
    fsdp_config={"transformer_layer_cls_to_wrap": "LlamaDecoderLayer"},
    logging_steps=10, num_train_epochs=1, learning_rate=2e-5,
)

# Launch across GPUs with:  accelerate launch train_at_scale.py
# (Accelerate / torchrun handle the distributed process group for you.)
```
Out of memory? In order: enable bf16, turn on gradient checkpointing, lower the micro-batch and raise accumulation to keep the effective batch, switch to QLoRA (Section 20) so the base is 4-bit, then shard with FSDP/ZeRO-3. Find the largest micro-batch that fits, then scale the effective batch with accumulation — throughput loves big batches.
SECTION 25 Inference & serving
Training is a one-time cost; serving is forever. Generation is autoregressive — one token at a time — so the bottleneck is usually memory bandwidth, not raw compute. The whole game is keeping the GPU busy.
- KV cache — each new token attends to all previous ones. Recomputing their keys/values every step would be quadratic, so you cache them. The cache grows with sequence length and dominates memory at long context (this is what GQA in Section 23 shrinks).
- Continuous batching — classic batching waits for the whole batch to finish; continuous batching (vLLM) swaps finished requests out and new ones in every step, so the GPU never idles. Big throughput win under real traffic.
- PagedAttention — manages the KV cache like OS virtual memory in fixed "pages," eliminating fragmentation and letting you pack far more concurrent requests.
- Speculative decoding — a small fast "draft" model proposes several tokens; the big model verifies them all in one parallel pass, accepting the longest correct prefix. Same output distribution, often 2–3× faster.
- Quantized inference — int4/int8/FP8 weights (Section 19) cut memory and boost bandwidth-bound decoding.
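To see why the KV cache dominates memory at long context, a back-of-the-envelope sketch helps. The cache holds one key and one value vector per layer, per KV head, per token. The shape below (32 layers, head dimension 128, 8 KV heads with GQA versus 32 without) is an assumption chosen to resemble an 8B-class model, not a measured figure for any specific checkpoint.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: 2 tensors (K and V) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 2 ** 30
# Assumed 8B-class shape, fp16/bf16 cache (2 bytes per value), 8192-token context.
with_gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
without_gqa = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)

print(f"per request with GQA:    {with_gqa / GIB:.1f} GiB")
print(f"per request without GQA: {without_gqa / GIB:.1f} GiB")
```

Multiply by the number of concurrent requests and the cache quickly rivals the weights themselves, which is the pressure that PagedAttention and GQA are designed to relieve.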
Speculative decoding is a junior writer drafting a whole sentence while a senior editor — who's slow but authoritative — approves it in a single glance instead of writing word by word. If the draft is good, four or five words get accepted for the price of one editorial pass. Continuous batching is a ride-share van that picks up and drops off riders mid-route instead of waiting for a full bus, so a seat is never empty and the engine never idles.
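The ride-share picture of continuous batching can be checked with a tiny simulation. This is a toy sketch under stated assumptions: each request needs a fixed number of decode steps, every step serves all active requests at once, and the function names are hypothetical rather than part of any serving library.

```python
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes; short ones sit idle."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Freed slots are refilled from the queue at every decode step."""
    pending = list(lengths)   # FIFO queue of remaining-steps per waiting request
    active: List[int] = []    # remaining steps for requests currently on the GPU
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))            # admit a waiting request
        steps += 1                                    # one decode step for all active
        active = [r - 1 for r in active if r > 1]     # drop requests that just finished
    return steps


# Two long requests mixed with six short ones, four slots on the GPU.
workload = [100, 10, 10, 10, 100, 10, 10, 10]
print("static:    ", static_batching_steps(workload, batch_size=4))
print("continuous:", continuous_batching_steps(workload, batch_size=4))
```

With this workload the static scheduler pays for the long request twice (200 steps), while the continuous scheduler overlaps the two long requests and finishes in 110.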
serve.py (vLLM + Transformers)

```python
# ---------- High-throughput serving with vLLM ----------
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          # quantization="awq",   # quantized weights: enable only when `model`
          #                       # points at an AWQ-quantized checkpoint
          gpu_memory_utilization=0.9)
# PagedAttention + continuous batching are automatic
out = llm.generate(["Explain KV cache in one line."],
                   SamplingParams(temperature=0.7, max_tokens=128))
print(out[0].outputs[0].text)
# Or expose an OpenAI-compatible server:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct

# ---------- Speculative decoding with a draft model ----------
from transformers import AutoModelForCausalLM, AutoTokenizer

# The draft must share the target's tokenizer for assisted generation to work.
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
draft = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

ids = tok("The capital of France is", return_tensors="pt").input_ids
# Same output as greedy decoding with the target alone, just faster.
out = target.generate(ids, assistant_model=draft, max_new_tokens=40)
print(tok.decode(out[0]))
```
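The assistant_model argument hides the accept/reject logic, so here is a toy sketch of the greedy variant of speculative decoding. The "models" are plain functions invented for illustration, and two simplifications are assumed: the target's per-position checks are looped rather than batched into one forward pass, and the bonus token a real implementation gains when a whole draft is accepted is left out.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy "model": token prefix -> next token


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The cheap draft proposes up to k tokens, one after another.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks every proposed position. A real system gets all of
        #    these predictions from ONE batched forward pass; the toy just loops.
        passes += 1
        accepted: List[int] = []
        for token in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)     # always keep the target's own choice
            if token != expected:         # first mismatch: discard the rest of the draft
                break
        seq.extend(accepted)
    return seq, passes


# Toy stand-ins (assumptions): the draft agrees with the target except when the
# prefix length is a multiple of 4.
def target_model(prefix: List[int]) -> int:
    return (sum(prefix) * 7 + 3) % 11


def draft_model(prefix: List[int]) -> int:
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 11


tokens, passes = speculative_decode(target_model, draft_model, [1, 2, 3], n_new=12)
print(tokens == greedy_decode(target_model, [1, 2, 3], 12), passes)
```

Every token that ends up in the sequence is the one the target itself would have chosen, so the output matches plain greedy decoding exactly; the speedup comes only from needing 4 verification passes here instead of 12 sequential target steps.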
For production throughput on GPUs: vLLM, TGI, SGLang, or TensorRT-LLM. For a laptop or a single small GPU: llama.cpp / Ollama with GGUF quantized weights. Optimize for the metric that matters — latency (time to first token) for chat, throughput (tokens/sec across users) for batch jobs — they pull in different directions.
SECTION 26 Reasoning, RAG & agents
The frontier moved from "predict the next token" to "use compute at inference time to think, retrieve, and act." These are the patterns behind today's most capable systems.
- Chain-of-thought & self-consistency — prompting the model to reason step-by-step, then sampling several reasoning paths and voting on the answer, reliably boosts accuracy on hard problems.
- Reasoning models (o1 / DeepSeek-R1 style) — trained (often with GRPO + verifiable rewards from Section 22) to produce long internal reasoning before answering. Their key property is test-time compute scaling: let them think longer and accuracy goes up, no retraining required.
- RAG (Retrieval-Augmented Generation) — embed your documents into a vector store; at query time retrieve the most relevant chunks and stuff them into the prompt. Grounds answers in fresh or private knowledge and cuts hallucination — without touching the model's weights.
- Agents — an LLM wrapped in a loop that can call tools (search, code execution, APIs) via function calling, observe results, and decide the next step until the task is done (the ReAct pattern). The Model Context Protocol (MCP) is an emerging standard for connecting models to tools and data sources.
- Multimodal (VLMs) — pair a vision (or audio) encoder with an LLM so it can see images and read documents, not just text.
- Distillation — train a small, cheap model to imitate a big one's outputs, capturing much of the quality at a fraction of the serving cost.
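Self-consistency from the list above is simple enough to sketch in a few lines: sample several answers, normalize them, and return the majority. The sample_answer callable stands in for whatever LLM call you use with a non-zero temperature; it and the canned answers are hypothetical placeholders.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_answer: Callable[[str], str],
                           question: str, n_samples: int = 5) -> str:
    """Sample several reasoning paths and return the most common final answer."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    votes = Counter(a.strip().lower() for a in answers)   # normalize before voting
    return votes.most_common(1)[0][0]


# Stand-in sampler (assumption): replays canned final answers instead of calling a model.
canned = iter(["42", "41", "42", " 42 ", "40"])
winner = self_consistent_answer(lambda q: next(canned), "What is 6 * 7?")
print(winner)
```

In practice the sampler would return only the extracted final answer from each chain of thought, since voting over full reasoning text would almost never produce a majority.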
A reasoning model is a student who shows their work on scratch paper before answering — and we let them use as much scratch paper as the problem deserves; harder questions simply get more thinking. An agent is that same student who can also get up, look things up in the library (search), use a calculator, run an experiment (code), come back, and keep going — looping until the assignment is actually finished, not just until they've produced one paragraph. RAG is letting them bring the open textbook into the exam so answers are grounded in the real source, not memory.
rag_and_agent.py (sentence-transformers + an LLM client)

```python
# ---------- RAG: retrieve, then generate grounded on real docs ----------
from sentence_transformers import SentenceTransformer
import numpy as np

emb = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Our refund window is 30 days.", "Support hours are 9-5 GMT."]  # ...and the rest of your documents
doc_vecs = emb.encode(docs, normalize_embeddings=True)

def retrieve(query, k=3):
    q = emb.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                        # cosine similarity (vectors are unit-normalized)
    return [docs[i] for i in np.argsort(scores)[-k:][::-1]]

def answer(query, llm):
    context = "\n".join(retrieve(query))         # top-k relevant chunks
    prompt = f"Use ONLY this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)                           # grounded, less hallucination

# ---------- Agent loop: LLM decides which tool to call, then observes ----------
# web_search, run_code, and parse are placeholders you supply; they are not defined here.
TOOLS = {"search": web_search, "python": run_code}    # name -> callable

def agent(goal, llm, max_steps=6):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        step = llm("\n".join(history) + "\nThink, then output: TOOL <name> <args> OR FINAL <answer>")
        history.append(step)
        if step.startswith("FINAL"):
            return step[6:]
        name, args = parse(step)                 # e.g. "search", "vLLM throughput"
        observation = TOOLS[name](args)          # act
        history.append(f"Observation: {observation}")    # observe -> loop
    return "stopped: step budget exhausted"
```
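The agent loop above calls a parse helper that the listing never defines. One possible version is sketched below, assuming the model follows the "TOOL <name> <args>" format from the prompt; the scan-every-line behaviour and the error handling are choices made for this sketch, not something the original specifies.

```python
from typing import Tuple


def parse(step: str) -> Tuple[str, str]:
    """Extract (tool_name, args) from model output containing 'TOOL <name> <args>'."""
    for line in step.splitlines():
        line = line.strip()
        if line.startswith("TOOL "):
            parts = line.split(maxsplit=2)       # ["TOOL", name, rest-of-line]
            name = parts[1]
            args = parts[2] if len(parts) > 2 else ""
            return name, args
    raise ValueError(f"no TOOL line in model output: {step!r}")


print(parse("TOOL search vLLM throughput"))
print(parse("I should compute this first.\nTOOL python print(2 + 2)"))
```

Scanning line by line tolerates a model that "thinks" before emitting its tool call. Raising on malformed output lets the loop catch the error and feed it back as an observation instead of crashing on a bad dictionary lookup.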
Need the model to know new, changing, or private facts? → RAG (cheap, updatable, citable). Need it to behave differently — a tone, a format, a skill, a domain style? → fine-tune (LoRA/QLoRA from Section 20). Need both? Most serious systems combine them: fine-tune the behavior, retrieve the facts, wrap it in an agent loop, and serve it with vLLM.
SECTION 27 The LangChain stack: overview
Section 26 covered the concepts. This is the production stack most teams actually reach for to build them: three layers that fit together, plus the observability glue that ties them together.
The LangChain ecosystem is best understood as a layered stack. Each layer adds abstraction on top of the one below, so you pick the level of control your task needs:
- LangGraph: the low-level orchestration runtime. It runs agents as a stateful graph of nodes and edges, and provides the hard infrastructure: durable execution (resume after a crash), persistence/memory, streaming, and human-in-the-loop. Use it when you need precise, custom control flow such as loops, branches, and approvals.
- LangChain: the agent framework built on LangGraph. Its headline pieces are a standard model interface (swap OpenAI ↔ Anthropic ↔ Google by changing one string, no lock-in) and create_agent: a minimal, configurable harness = model + tools + prompt + a tool-calling loop, extendable with middleware. This is the default starting point for most agents.
- Deep Agents: a batteries-included harness on top of LangChain, for complex, long-running, multi-step tasks. create_deep_agent ships with planning (a write_todos tool), a virtual filesystem with automatic context compression, subagent spawning (a task tool that delegates to isolated agents), long-term memory, skills, and human-in-the-loop, all on by default.
- LangSmith: the cross-cutting observability layer. It traces every model/tool call, helps debug behavior, and evaluates outputs for agents built with any of the three.

Figure: Deep Agents builds on LangChain (create_agent), which sits on the LangGraph runtime; LangSmith observes all of them. Pick your layer by how much control you need.

Think of building a house. LangGraph is the foundation, plumbing, and wiring: the durable infrastructure that holds state and survives a power cut (resume where you left off). LangChain is the framing and standard fittings: every appliance brand (any model provider) plugs into the same socket, and create_agent hands you a pre-framed room. Deep Agents is the fully furnished smart home. It already has a planner on the wall, filing cabinets (the virtual filesystem), assistants you can dispatch (subagents), and a memory of past visits, so it is move-in ready for big, messy jobs. You choose how much is pre-built versus how much you wire yourself.
01_langchain_agent.py (LangChain: create_agent)

```python
# pip install -qU langchain "langchain[anthropic]"
from langchain.agents import create_agent

def get_weather(city: str) -> str:
    """Get the weather for a given city."""    # docstring = the tool description
    return f"It's always sunny in {city}!"

agent = create_agent(
    model="claude-sonnet-4-6",   # swap to "openai:gpt-5.4" or "google_genai:gemini-3.5-flash"
    tools=[get_weather],
    system_prompt="You are a helpful assistant.",
)

result = agent.invoke(
    {"messages": [{"role": "user", "content": "What's the weather in San Francisco?"}]}
)
print(result["messages"][-1].content_blocks)
# create_agent runs the loop: model -> (maybe call tool) -> observe -> repeat -> final answer.
```
02_langchain_rag.py (LangChain: agentic RAG)

```python
# pip install -qU langchain langchain-community langchain-chroma "langchain[anthropic]"
# (the OpenAI embeddings below also need langchain-openai; WebBaseLoader needs beautifulsoup4)
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain.embeddings import init_embeddings
from langchain_core.tools import create_retriever_tool   # import path assumed stable across versions
from langchain.agents import create_agent

# 1. INGEST: load -> split into chunks -> embed -> store in a vector DB
docs = WebBaseLoader("https://example.com/handbook").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150).split_documents(docs)
store = Chroma.from_documents(chunks, init_embeddings("openai:text-embedding-3-small"))

# 2. turn the retriever into a TOOL the agent can choose to call
retriever_tool = create_retriever_tool(
    store.as_retriever(search_kwargs={"k": 4}),
    name="company_handbook",
    description="Search the company handbook for policies, benefits, and procedures.",
)

# 3. the agent now decides WHEN to retrieve -> this is "agentic RAG"
agent = create_agent(model="claude-sonnet-4-6", tools=[retriever_tool],
                     system_prompt="Answer using the handbook. Cite what you find.")
print(agent.invoke({"messages": [{"role": "user", "content": "How many vacation days do I get?"}]}))
```
When a simple loop isn't enough — you need explicit branching, cycles, a grading step, or a human approval gate — you drop to LangGraph and draw the control flow yourself as a graph. State flows between nodes; edges (including conditional ones) decide what runs next.
03_langgraph_graph.py (LangGraph: StateGraph)

```python
# pip install -U langgraph
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.checkpoint.memory import InMemorySaver

# nodes are just functions: (state) -> partial state update
def retrieve(state: MessagesState):
    query = state["messages"][-1].content
    hits = store.as_retriever().invoke(query)          # your vector store
    return {"messages": [{"role": "system", "content": f"Context:\n{hits}"}]}

def generate(state: MessagesState):
    answer = model.invoke(state["messages"])           # your chat model
    return {"messages": [answer]}

# wire the graph: START -> retrieve -> generate -> END
g = StateGraph(MessagesState)
g.add_node(retrieve)
g.add_node(generate)
g.add_edge(START, "retrieve")
g.add_edge("retrieve", "generate")
g.add_edge("generate", END)

# a checkpointer gives durable execution + conversation memory across turns
graph = g.compile(checkpointer=InMemorySaver())
graph.invoke({"messages": [{"role": "user", "content": "Summarize the refund policy."}]},
             config={"configurable": {"thread_id": "user-42"}})   # same thread = remembered
# Add g.add_conditional_edges(...) for the grade/retry branch, or interrupt() for human approval.
```
At the top of the stack, Deep Agents gives you all the hard parts of a capable autonomous agent for free. create_deep_agent has the same tool-calling core, but ships with a planner, a virtual filesystem that compresses context as runs grow long, the ability to spawn isolated subagents for parallel subtasks, and long-term memory — ideal for deep research, codebase work, or any task with many steps.
04_deep_agent.py (Deep Agents: create_deep_agent)

```python
# pip install -qU deepagents langchain-anthropic
from deepagents import create_deep_agent

def web_search(query: str) -> str:
    """Search the web and return the top results."""
    return run_search(query)    # plug in Tavily/SerpAPI/your own retriever here

# create_deep_agent bundles planning (write_todos), a virtual filesystem with
# automatic context compression, subagent spawning (the `task` tool), and memory.
agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[web_search],
    system_prompt=(
        "You are an expert researcher. Plan first, delegate sub-questions to "
        "subagents, save findings to files, then write a cited report."
    ),
)

# Give it a big, multi-step task -- it will plan, spawn subagents, and manage its own context.
result = agent.invoke({"messages": [{"role": "user", "content":
    "Compare QLoRA, DoRA and full fine-tuning on cost and quality. Write a sourced summary."}]})
print(result["messages"][-1].content)
```
Most agents → LangChain create_agent (model + tools + prompt, swap providers freely). Need custom control flow — cycles, grading/routing, human-in-the-loop, durable long runs — → LangGraph. Big, open-ended, multi-step autonomy (research, coding, ops) → Deep Agents, which gives you planning, a filesystem, subagents, and memory out of the box. And in all three, RAG is the same move: wrap a retriever as a tool. Wire up LangSmith from day one to see what your agent is actually doing.
SECTION 28 LangChain in depth: create_agent, tools, structured output & middleware
Section 27 showed the shape. Now the full surface: how the agent loop actually runs, how tools and structured output work, and the middleware system that is the real reason to use LangChain.
Agent = model + harness. create_agent builds a graph that runs the classic loop — call the model, and if the model asked for tools, run them, feed results back, and repeat until the model returns a final answer (no more tool calls). The signature exposes every lever you'll need:
create_agent surface

```
create_agent(model, tools=[…], system_prompt=…,
             middleware=[…],                       # hooks around the loop (the key feature)
             response_format=…,                    # structured output schema (Pydantic / strategy)
             checkpointer=…,                       # short-term memory + durability
             state_schema=…, context_schema=…)     # custom state / runtime context
```
A few things make LangChain more than a thin wrapper:
- Standard model interface. One string switches providers ("claude-sonnet-4-6", "openai:gpt-5.4", "google_genai:gemini-3.5-flash") with a unified response shape (.content_blocks). No provider lock-in.
- Tools are just functions. A typed Python function with a docstring becomes a tool; the docstring is the description the model reads to decide when to call it.
- State. The agent carries an AgentState (a messages list, extendable with custom fields like user_id). A checkpointer persists it per thread_id so a conversation is remembered across turns.
- Structured output. Pass response_format a Pydantic model and the agent returns parsed, validated data in structured_response instead of free text.

before_agent/after_agent bracket the whole run; before_model/after_model and wrap_model_call surround each model call; wrap_tool_call surrounds each tool. This is where guardrails, summarization, retries, and limits live.

Middleware is like the staff around a chef in a busy kitchen. The chef (model) just cooks. But an expediter checks every order before it reaches the chef (before_model), a quality inspector tastes each plate before it leaves (after_model), a runner handles the actual fetching of ingredients (wrap_tool_call), and a manager caps how many dishes get made so costs don't explode (call limits). You compose the staff you need; the chef's job never changes. That separation is why one summarization or PII-redaction middleware drops into any agent unchanged.
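To make the hook idea concrete without depending on any framework, here is a toy sketch of a model loop wrapped by composable middleware. It borrows the hook names before_model and after_model from the text, but the classes, the string-based messages, and run_agent are all hypothetical and are not LangChain's implementation.

```python
import re
from typing import Callable, List, Optional


class Middleware:
    """Base class: every hook is optional and returns a state update or None."""

    def before_model(self, state: dict) -> Optional[dict]:
        return None

    def after_model(self, state: dict) -> Optional[dict]:
        return None


class CallLimit(Middleware):
    """The 'manager': stop the loop once the model has been called too often."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls

    def before_model(self, state: dict) -> Optional[dict]:
        if state["model_calls"] >= self.max_calls:
            return {"jump_to": "end"}
        return None


class RedactEmails(Middleware):
    """The 'quality inspector': scrub e-mail addresses from each model reply."""

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def after_model(self, state: dict) -> Optional[dict]:
        cleaned = self.EMAIL.sub("[REDACTED_EMAIL]", state["messages"][-1])
        return {"messages": state["messages"][:-1] + [cleaned]}


def run_agent(model: Callable[[List[str]], str], middleware: List[Middleware],
              user_message: str, max_steps: int = 10) -> dict:
    state = {"messages": [user_message], "model_calls": 0}
    for _ in range(max_steps):
        for mw in middleware:                       # before hooks, first to last
            update = mw.before_model(state) or {}
            if update.get("jump_to") == "end":
                return state                        # short-circuit the loop
            state.update(update)
        state["messages"] = state["messages"] + [model(state["messages"])]
        state["model_calls"] += 1
        for mw in reversed(middleware):             # after hooks, last to first
            state.update(mw.after_model(state) or {})
        if state["messages"][-1].startswith("FINAL"):
            break
    return state


# Stand-in model (assumption): never finishes and keeps leaking an address.
chatty_model = lambda messages: "Sure, write to bob@example.com for details."
final = run_agent(chatty_model, [CallLimit(3), RedactEmails()], "How do I get help?")
print(final["model_calls"], final["messages"][-1])
```

Neither middleware knows about the other or about the model, which is the property the kitchen analogy describes: each concern is added or removed by editing one list.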
01_lc_core.py (LangChain: create_agent)

```python
from langchain.agents import create_agent
from langgraph.checkpoint.memory import InMemorySaver
from pydantic import BaseModel, Field

# 1. a tool is a typed function; its docstring is the description the model reads
def search_flights(origin: str, destination: str, date: str) -> str:
    """Search available flights between two cities on a given date."""
    return f"3 flights from {origin} to {destination} on {date}: ..."

# 2. structured output: ask for parsed, validated data instead of prose
class FlightPick(BaseModel):
    airline: str = Field(description="chosen airline")
    price_usd: float
    depart_time: str

agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[search_flights],
    system_prompt="You are a travel assistant. Pick the best-value flight.",
    response_format=FlightPick,        # -> result["structured_response"] is a FlightPick
    checkpointer=InMemorySaver(),      # -> short-term memory across turns
)

cfg = {"configurable": {"thread_id": "trip-1"}}
res = agent.invoke({"messages": [{"role": "user", "content": "Cheapest flight SFO to JFK on Dec 5?"}]},
                   config=cfg)
print(res["structured_response"])      # FlightPick(airline=..., price_usd=..., ...)
# Because of the thread_id, a follow-up "what about Dec 6?" remembers the context.
```
02_lc_middleware.py (LangChain: built-in middleware)

```python
from langchain.agents import create_agent
from langchain.agents.middleware import (
    SummarizationMiddleware,     # compress old history near the context limit
    ToolCallLimitMiddleware,     # cap tool calls (cost / runaway protection)
    ModelFallbackMiddleware,     # retry on a backup model if the primary fails
    PIIMiddleware,               # detect & redact sensitive data
    HumanInTheLoopMiddleware,    # pause for approval on risky tools
)
from langgraph.checkpoint.memory import InMemorySaver

# search_tool and send_email_tool are your own tools; they are not defined here.
agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[search_tool, send_email_tool],
    checkpointer=InMemorySaver(),    # required by HITL
    middleware=[
        PIIMiddleware("email", strategy="redact", apply_to_input=True),
        SummarizationMiddleware(model="gpt-5.4-mini",
                                trigger=("fraction", 0.8), keep=("messages", 20)),
        ToolCallLimitMiddleware(thread_limit=20, run_limit=8),
        ModelFallbackMiddleware("gpt-5.4-mini", "openai:gpt-5.4"),
        HumanInTheLoopMiddleware(interrupt_on={    # ask a human before sending
            "send_email_tool": {"allowed_decisions": ["approve", "edit", "reject"]},
            "search_tool": False,                  # auto-run safe tools
        }),
    ],
)
# Each middleware handles ONE concern and composes with the others -- no agent rewrite.
```
03_lc_custom_mw.py (LangChain: @hooks)

```python
from langchain.agents import create_agent
from langchain.agents.middleware import before_model, dynamic_prompt

# inject a fresh, per-request system prompt (e.g. user preferences from a store)
@dynamic_prompt
def personalized_prompt(request) -> str:
    user = request.runtime.context.get("user_name", "there")
    return f"You are a concise assistant. Address the user as {user}."

# a guardrail that runs right before every model call
@before_model(can_jump_to=["end"])        # declare the jump target this hook may use
def log_and_guard(state, runtime) -> dict | None:   # hooks receive (state, runtime)
    last = state["messages"][-1].content
    if "wire transfer" in last.lower():
        return {"jump_to": "end"}         # short-circuit the loop
    return None                           # None -> proceed normally

agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[...],                          # your tools go here
    middleware=[personalized_prompt, log_and_guard],
    context_schema=dict,                  # lets us pass runtime context
)
agent.invoke({"messages": [{"role": "user", "content": "hi"}]},
             context={"user_name": "Sara"})
```
Build the core agent with model + tools + system_prompt, then add capabilities as middleware: summarization for long chats, limits for cost, PII for compliance, human-in-the-loop for risky actions, fallback for resilience. The six hooks — before/after_agent, before/after_model, wrap_model_call, wrap_tool_call — cover essentially any interception you'll want, and a hook can return jump_to to redirect the loop.
SECTION 29 LangGraph in depth: state, edges, persistence & human-in-the-loop
When the loop isn't the right shape — you need branching, cycles, parallelism, durability, or a human gate — you drop to LangGraph and build the control flow as an explicit graph.
Everything in LangGraph is three ideas: state (a typed object that flows through the graph), nodes (functions that read state and return an update), and edges (what runs next). You build it, then compile() it into a runnable.
- State & reducers. State is a TypedDict. By default a node's returned keys overwrite state; annotate a field with a reducer like add_messages or operator.add to append instead, which is essential so parallel branches don't clobber each other.
- Edges. add_edge(a, b) is unconditional; add_conditional_edges(node, router_fn) branches on a function's return; START and END are the entry/exit. Cycles are allowed, and that's how loops and retries work.
- Command. A node can return Command(goto="x", update={...}) to set state and choose the next node in one move: dynamic routing without a separate conditional edge.
- Persistence. Compile with a checkpointer (InMemorySaver for dev, SqliteSaver single-server, PostgresSaver at scale). State is saved per thread_id after every step, giving memory and durable execution (resume after a crash).
- Human-in-the-loop. Call interrupt(payload) inside a node to pause indefinitely; resume by re-invoking with Command(resume=value), which becomes the return value of interrupt().
- Streaming & parallelism. graph.stream(..., stream_mode=) emits updates, values, messages, or custom events; Send fans out work to parallel branches (map-reduce).
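The overwrite-versus-reducer rule from the first bullet is easy to see in plain Python. The sketch below is a stand-in for the merge step, not LangGraph's code: apply_update and the reducers mapping are hypothetical names, and operator.add plays the role of an appending reducer.

```python
import operator
from typing import Any, Callable, Dict

Reducer = Callable[[Any, Any], Any]


def apply_update(state: Dict[str, Any], update: Dict[str, Any],
                 reducers: Dict[str, Reducer]) -> Dict[str, Any]:
    """Merge a node's partial update into state: reduce if a reducer exists, else overwrite."""
    new_state = dict(state)
    for key, value in update.items():
        reducer = reducers.get(key)
        if reducer is not None and key in new_state:
            new_state[key] = reducer(new_state[key], value)   # combine old and new
        else:
            new_state[key] = value                            # default: overwrite
    return new_state


start = {"messages": ["hi"], "attempts": 0}
branch_a = {"messages": ["from A"], "attempts": 1}
branch_b = {"messages": ["from B"], "attempts": 1}

# With a reducer on "messages", both parallel branches survive the merge.
reducers = {"messages": operator.add}
merged = apply_update(apply_update(start, branch_a, reducers), branch_b, reducers)

# Without one, the second branch silently erases the first.
clobbered = apply_update(apply_update(start, branch_a, {}), branch_b, {})

print(merged["messages"])
print(clobbered["messages"])
```

Note that "attempts" has no reducer in either run, so it is simply overwritten, which is the right behaviour for a counter that a single node owns.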
LangChain's create_agent is an automatic car — one pedal, it handles the gears. LangGraph is a manual transmission: more to operate, but you control exactly when to shift, brake, loop back, or pull over for a passenger (a human). The checkpointer is the car's black box — it records the journey continuously, so if the engine cuts out on a mountain road, you restart from the last marker instead of the bottom of the hill. The reducer is the rule that when two passengers add to the shared shopping list at once, the items get combined rather than one erasing the other.
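The "black box" behaviour of a checkpointer can be sketched with a dictionary: save the position and state after every step, keyed by thread id, and start from the saved position on the next call. Everything here (run_with_checkpoints, the step functions, the simulated crash) is a hypothetical toy meant to show the idea, not the LangGraph checkpointer interface.

```python
from typing import Callable, Dict, List, Tuple

Step = Callable[[dict], dict]


def run_with_checkpoints(steps: List[Step], initial_state: dict, thread_id: str,
                         checkpoints: Dict[str, Tuple[int, dict]]) -> dict:
    """Run steps in order, saving (next_step_index, state) after each one."""
    next_step, state = checkpoints.get(thread_id, (0, initial_state))
    for index in range(next_step, len(steps)):
        state = steps[index](state)                    # may raise mid-run
        checkpoints[thread_id] = (index + 1, state)    # durable marker
    return state


calls = {"a": 0, "b": 0, "c": 0}


def step_a(state: dict) -> dict:
    calls["a"] += 1
    return {**state, "log": state["log"] + ["a"]}


def step_b(state: dict) -> dict:
    calls["b"] += 1
    if calls["b"] == 1:
        raise RuntimeError("simulated crash")          # engine cuts out once
    return {**state, "log": state["log"] + ["b"]}


def step_c(state: dict) -> dict:
    calls["c"] += 1
    return {**state, "log": state["log"] + ["c"]}


checkpoints: Dict[str, Tuple[int, dict]] = {}
steps = [step_a, step_b, step_c]
try:
    run_with_checkpoints(steps, {"log": []}, "user-42", checkpoints)
except RuntimeError:
    pass                                               # first run dies inside step_b

final = run_with_checkpoints(steps, {"log": []}, "user-42", checkpoints)
print(final["log"], calls)
```

On the second call step_a is not repeated: the run resumes at the step that failed, using the state saved just before it. Steps return new dictionaries rather than mutating the old one so that a saved checkpoint is never altered by later work.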
01_lg_graph.py (LangGraph: StateGraph)

```python
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import InMemorySaver

# 1. STATE: messages append (reducer); attempts overwrite by default
class State(TypedDict):
    messages: Annotated[list, add_messages]   # reducer -> appends, never clobbers
    attempts: int

# 2. NODES: (state) -> partial update
#    (model and execute_tool_calls are your own chat model and tool runner)
def call_model(state: State):
    reply = model.invoke(state["messages"])
    return {"messages": [reply], "attempts": state.get("attempts", 0) + 1}

def run_tools(state: State):
    results = execute_tool_calls(state["messages"][-1])
    return {"messages": results}

# 3. ROUTER for a conditional edge: keep looping or stop
def should_continue(state: State) -> str:
    last = state["messages"][-1]
    if last.tool_calls and state["attempts"] < 5:
        return "tools"
    return END

# 4. WIRE the graph (note the cycle tools -> agent)
g = StateGraph(State)
g.add_node("agent", call_model)
g.add_node("tools", run_tools)
g.add_edge(START, "agent")
g.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
g.add_edge("tools", "agent")                  # loop back

graph = g.compile(checkpointer=InMemorySaver())
graph.invoke({"messages": [{"role": "user", "content": "Plan my week."}], "attempts": 0},
             config={"configurable": {"thread_id": "u1"}})
```
02_lg_hitl.py (LangGraph: interrupt / Command)

```python
from langgraph.types import Command, interrupt

def approve_purchase(state: State):
    # pause the graph and surface a payload to the caller
    decision = interrupt({"action": "buy", "item": state["item"], "cost": state["cost"]})
    if decision == "approve":
        return {"messages": [{"role": "system", "content": "Purchase approved."}]}
    return {"messages": [{"role": "system", "content": "Purchase cancelled."}]}

# ... add_node("approve", approve_purchase) ... compile(checkpointer=...)
cfg = {"configurable": {"thread_id": "order-9"}}
result = graph.invoke({...}, config=cfg)
# Execution pauses at interrupt(); result surfaces the pending interrupt payload:
print(result["__interrupt__"])    # (Interrupt(value={'action': 'buy', ...}),)

# A human reviews, then we RESUME -- the resume value becomes interrupt()'s return:
graph.invoke(Command(resume="approve"), config=cfg)    # continues exactly where it paused
```
03_lg_command_stream.py (LangGraph: Command / stream / Postgres)

```python
from langgraph.types import Command

# A node that BOTH updates state AND picks the next node -- no conditional edge needed
def supervisor(state: State) -> Command:
    nxt = "researcher" if needs_research(state) else "writer"
    return Command(goto=nxt, update={"messages": [route_note(nxt)]})

# Stream intermediate progress instead of waiting for the final result.
# With a LIST of stream modes, each item is a (mode, chunk) tuple.
for mode, chunk in graph.stream(inputs, config=cfg, stream_mode=["updates", "messages"]):
    print(mode, chunk)    # "updates" = per-node state deltas; "messages" = token stream

# Durable, multi-instance memory for production: swap the checkpointer
from langgraph.checkpoint.postgres import PostgresSaver

with PostgresSaver.from_conn_string("postgresql://...") as saver:
    saver.setup()                             # first run only: creates the checkpoint tables
    graph = g.compile(checkpointer=saver)     # survives restarts, scales across servers
```
Reach for it when you need explicit control flow (branching, bounded cycles, multi-agent supervisors), durable long runs that survive crashes, human approval gates mid-run, or fine-grained streaming. Keep state small and typed, use reducers only where branches merge, and put conditional edges only at real decision points. For a plain tool-calling assistant, stay on create_agent — it compiles down to a LangGraph graph anyway.
SECTION 30 Deep Agents in depth: planning, filesystem, subagents & skills
Deep Agents is the "batteries-included" harness — the same tool-calling loop, but pre-loaded with the machinery that makes agents survive long, messy, multi-step tasks. It's the open-source distillation of what makes tools like Claude Code work.
create_deep_agent returns a compiled LangGraph graph (so you keep streaming, checkpointers, and tracing), but it ships with four built-in capabilities turned on by default, each implemented as middleware you can override:
- Planning: a write_todos/read_todos tool (TodoListMiddleware) lets the agent decompose a task into a tracked checklist and adapt it as it learns.
- Virtual filesystem: ls, read_file, write_file, edit_file, glob, grep (FilesystemMiddleware). Crucially, it offloads large tool results to files automatically so the context window doesn't overflow on long runs. Backends are pluggable: in-memory state, local disk, a LangGraph Store, a shell-capable backend, or a sandbox.
- Subagents: a task tool (SubAgentMiddleware) spawns general-purpose or specialized subagents in isolated context windows; each does its subtask and returns only a summary, keeping the main context clean. Async subagents run in the background with progress checks and cancellation.
- Context management & memory: automatic summarization when context grows large (keeping recent messages), plus long-term memory across threads via the LangGraph Memory Store, and skills (reusable workflows/domain knowledge loaded into the prompt).
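The offloading behaviour in the filesystem bullet can be illustrated with a toy: when a tool result is too large, write it to a file and hand the model a short pointer instead. The VirtualFS class, offload_if_large, the path layout, and the thresholds below are all assumptions made for this sketch and do not reflect how Deep Agents implements it.

```python
from typing import Dict


class VirtualFS:
    """A dictionary pretending to be a filesystem."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}

    def write_file(self, path: str, content: str) -> None:
        self.files[path] = content

    def read_file(self, path: str, offset: int = 0, limit: int = 2000) -> str:
        return self.files[path][offset:offset + limit]   # read back in slices


def offload_if_large(tool_name: str, result: str, fs: VirtualFS,
                     max_chars: int = 500, preview_chars: int = 100) -> str:
    """Return small results unchanged; park large ones in a file and return a pointer."""
    if len(result) <= max_chars:
        return result
    path = f"/tool_results/{tool_name}_{len(fs.files)}.txt"
    fs.write_file(path, result)
    return f"[{len(result)} chars saved to {path}; preview: {result[:preview_chars]}...]"


fs = VirtualFS()
small = offload_if_large("web_search", "Two short hits.", fs)
large = offload_if_large("web_search", "x" * 10_000, fs)
print(small)
print(large[:60])
```

The context window now carries a pointer of a couple of hundred characters instead of ten thousand, and the agent can still page through the full text with read_file when it needs the detail.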
A basic agent is one person with a notepad trying to hold an entire project in their head — they run out of room fast. A Deep Agent is a project lead with an office: a whiteboard for the plan (write_todos), filing cabinets so findings live on paper instead of cramming the desk (the filesystem — and it files away bulky documents automatically), and a team of assistants they can hand self-contained sub-tasks to, each working in their own room and reporting back a one-paragraph summary (subagents in isolated context). That's why Deep Agents handles a "project" where a plain agent chokes on the third step.
01_deepagent.py (Deep Agents: create_deep_agent)

```python
from deepagents import create_deep_agent

def web_search(query: str) -> str:
    """Search the web and return results."""
    return run_tavily(query)    # your own search helper

# specialized subagents: each gets its OWN isolated context window
research_subagent = {
    "name": "researcher",
    "description": "Deeply researches one focused sub-question and returns a summary.",
    "system_prompt": "You are a meticulous researcher. Cite sources.",
    "tools": [web_search],
}
critic_subagent = {
    "name": "critic",
    "description": "Reviews a draft for accuracy and gaps.",
    "system_prompt": "You are a sharp editor. List concrete fixes.",
}

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[web_search],
    system_prompt=("You are a research lead. First write_todos to plan. "
                   "Delegate sub-questions to the researcher subagent, save findings "
                   "to files, then have the critic review before the final report."),
    subagents=[research_subagent, critic_subagent],
)

result = agent.invoke({"messages": [{"role": "user", "content":
    "Write a sourced report comparing LoRA, QLoRA, and DoRA on cost and quality."}]})
print(result["messages"][-1].content)
# Under the hood: write_todos -> task(researcher) x N -> write_file -> task(critic) -> report
```
02_deepagent_backends.py (Deep Agents: backends + HITL)

```python
from deepagents import create_deep_agent
from deepagents.backends import LocalShellBackend    # filesystem + real shell `execute`
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.store.memory import InMemoryStore

# A coding-style agent: real local files + shell, long-term memory, approval gates
agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[],
    backend=LocalShellBackend(workspace_root="/workspace"),   # ls/read/write/edit + execute
    system_prompt="You are a coding agent. Plan, edit files, run tests, then summarize.",
    checkpointer=InMemorySaver(),    # durable + resumable (also enables interrupts)
    store=InMemoryStore(),           # long-term memory across threads (swap for a DB)
    interrupt_on={"execute": {"allowed_decisions": ["approve", "reject"]}},   # gate shell cmds
)

# Because it returns a compiled LangGraph graph, you can stream subagent activity.
# With a single string stream_mode, each item is just the chunk (no mode tuple).
for chunk in agent.stream({"messages": [{"role": "user", "content": "Fix the failing test in utils.py"}]},
                          config={"configurable": {"thread_id": "dev-1"}},
                          stream_mode="updates"):
    print(chunk)
```
Deep Agents follows a trust-the-model design: the agent can do anything its tools and backend allow, including running shell commands and editing files. That power needs fences. Use a sandbox backend (not your host) for untrusted work, declare filesystem permission rules to bound read/write access, gate dangerous tools like execute behind human-in-the-loop, and cap runs with model/tool call limits. Always wire up LangSmith so you can see what a long autonomous run actually did.
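One of the fences above, bounding file access to a workspace, comes down to a path check. The sketch below shows the idea with purely lexical normalization; is_within_root and guarded_write are hypothetical helpers, not a Deep Agents API, and a real sandbox also has to deal with symlinks and process isolation, which this does not.

```python
import posixpath
from typing import Dict


def is_within_root(root: str, requested: str) -> bool:
    """True if `requested`, resolved against `root`, stays inside `root`."""
    root = posixpath.normpath(root)
    target = posixpath.normpath(posixpath.join(root, requested))  # collapses ".."
    return target == root or target.startswith(root.rstrip("/") + "/")


def guarded_write(files: Dict[str, str], root: str, path: str, content: str) -> None:
    """Write into an in-memory file map only when the path is inside the workspace."""
    if not is_within_root(root, path):
        raise PermissionError(f"write outside {root} blocked: {path}")
    files[posixpath.normpath(posixpath.join(root, path))] = content


files: Dict[str, str] = {}
guarded_write(files, "/workspace", "src/utils.py", "print('ok')")
for attempt in ["../etc/passwd", "/etc/passwd", "/workspace-evil/x"]:
    print(attempt, is_within_root("/workspace", attempt))
```

Comparing against the root plus a trailing slash matters: a bare prefix test would wrongly accept "/workspace-evil/x" as being inside "/workspace".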
Choose Deep Agents when the task is a project: research, coding, or multi-step ops that need planning, files, and delegation. Choose LangChain create_agent for a focused assistant (a model, tools, and a few middleware). Choose LangGraph when the control flow itself is custom and the standard agent loop isn't the right shape. The three nest cleanly: Deep Agents is built on create_agent, which is built on LangGraph, so you can always drop a level for more control without leaving the ecosystem.
SECTION 31
The one-page cheat sheet
Everything above, compressed. Keep this nearby.
Vocabulary in one line each
| Term | Meaning |
|---|---|
| Parameter / weight | A learnable number the model adjusts during training. |
| Hyperparameter | A setting you choose (learning rate, batch size, #layers). |
| Forward pass | Run input through the model to get a prediction. |
| Loss | One number scoring how wrong the prediction is. |
| Backprop | Compute the gradient of the loss w.r.t. every weight. |
| Gradient | Direction + rate the loss changes as a weight changes. |
| Optimizer | Uses gradients to update weights (SGD, Adam, AdamW). |
| Epoch | One full pass over the training set. |
| Batch | A small group of examples processed together. |
| Overfitting | Memorizing train data, failing on new data. |
| Logits | Raw model scores before softmax/sigmoid. |
| Embedding | A learned vector representing a discrete item. |
| Fine-tuning | Adapting a pretrained model to your task. |
| Attention | Each token weighs how much every other token matters. |
| Quantization | Store weights in fewer bits (int8/int4) to shrink + speed up. |
| LoRA | Train tiny low-rank adapters instead of all weights. |
| QLoRA | LoRA on a 4-bit frozen base — fine-tune big models on one GPU. |
| RLHF | Align a model to human preference: SFT → reward model → PPO. |
| DPO | Align directly from preference pairs — no reward model or RL. |
| GRPO | Online RL without a critic; great for verifiable (math/code) rewards. |
| MoE | Many expert FFNs; a router activates only a few per token. |
| GQA | Query heads share KV heads — smaller KV cache, longer context. |
| KV cache | Cached keys/values of past tokens, so each new token attends to them without recomputing the whole prefix. |
| RAG | Retrieve relevant docs and feed them into the prompt. |
| Agent | An LLM in a loop that calls tools, observes, and iterates. |
| Distillation | Train a small model to imitate a big one's outputs. |
| LangChain | Agent framework: create_agent = model + tools + prompt + middleware. |
| LangGraph | Orchestration runtime: stateful graph, durable execution, persistence. |
| Deep Agents | Batteries-included harness: planning, filesystem, subagents, memory. |
| Middleware | Hooks around the agent loop (summarize, limit, guard, fallback). |
| Checkpointer | Saves graph state per thread — memory + crash-resume. |
| Subagent | Delegated agent with isolated context that returns a summary. |
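The LoRA and Quantization rows are easier to remember with numbers attached. The following is a back-of-the-envelope sketch, not a measurement: the 4096 × 4096 matrix, the rank of 8, and the 7-billion-parameter model are hypothetical sizes picked for illustration, and the memory figure counts weights only (no activations, optimizer state, or quantization scales).

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def weight_memory_gib(n_params: int, bits_per_weight: float) -> float:
    """Storage for the weights alone, ignoring activations and optimizer state."""
    return n_params * bits_per_weight / 8 / 2**30


# Hypothetical 4096 x 4096 projection matrix with a rank-8 adapter.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"full matrix: {full:,} params, LoRA adapter: {lora:,} params "
      f"({100 * lora / full:.2f}% of full)")

# Hypothetical 7-billion-parameter model stored at different precisions.
for name, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_memory_gib(7_000_000_000, bits):.1f} GiB")
```

QLoRA combines both effects: the frozen base sits in the 4-bit row, while only the small adapter is trained in higher precision.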
Pick the architecture
| Your data is… | Reach for… |
|---|---|
| Tabular / structured | Gradient-boosted trees first; MLP if you must |
| Images / video | CNN or Vision Transformer (fine-tuned) |
| Text / language | Transformer (BERT to understand, GPT-class to generate) |
| Time series / streaming audio | LSTM/GRU, Temporal CNN, or a transformer |
| Need to generate images | Diffusion model |
| Need semantic search / RAG | Embedding model + vector search |
Modern LLM toolbox — pick the right lever
| Your goal… | Reach for… |
|---|---|
| Make a big model fit / run cheaper | Quantization (4-bit NF4, AWQ, GGUF) · §19 |
| Customize behavior on a budget | LoRA / QLoRA / DoRA · §20 |
| Align to human preference (have pairs) | DPO (or KTO for binary labels) · §22 |
| Improve reasoning with a checkable reward | GRPO / RLVR · §22 |
| Maximum reward control / online RL | RLHF + PPO · §21 |
| Longer context / faster attention | GQA + FlashAttention + RoPE · §23 |
| Huge capacity, modest compute/token | Mixture-of-Experts · §23 |
| Train a model too big for one GPU | bf16 + gradient checkpointing + FSDP/ZeRO · §24 |
| Serve fast to many users | vLLM (PagedAttention, continuous batching) · §25 |
| Speed up generation, same output | Speculative decoding · §25 |
| Give the model fresh / private knowledge | RAG · §26 |
| Let it use tools & act | Agent loop + function calling · §26 |
| Build an agent fast, swap providers freely | LangChain create_agent + middleware · §28 |
| Custom control flow / durability / HITL | LangGraph StateGraph + checkpointer · §29 |
| Long, multi-step "project" autonomy | Deep Agents (planning + files + subagents) · §30 |
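For the attention and serving rows, it helps to see why the number of KV heads matters so much. This is a rough sizing sketch under assumed numbers: 32 layers, a head dimension of 128, an fp16 cache, and an 8,192-token context are hypothetical, and `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """Keys + values stored for every layer, KV head, and cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


# Same hypothetical model twice: one KV head per query head (MHA) vs. 8 shared KV heads (GQA).
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

print(f"MHA (32 KV heads): {mha / 2**30:.2f} GiB per sequence")
print(f"GQA (8 KV heads):  {gqa / 2**30:.2f} GiB per sequence")
print(f"reduction: {mha // gqa}x")
```

The cache grows linearly with context length and batch size, so cutting the KV heads frees memory that a server can spend on longer prompts or more concurrent users.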
The loop you'll write a thousand times
```python
for epoch in range(epochs):
    for xb, yb in loader:
        preds = model(xb)            # forward
        loss = loss_fn(preds, yb)
        opt.zero_grad()              # clear grads ← don't forget!
        loss.backward()              # backprop
        opt.step()                   # update
```
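Why the clear step matters: gradients are added to whatever is already stored, so skipping the reset mixes the previous batch's gradient into the current one. The sketch below mimics that behavior in plain Python so it runs without any framework. `Param` and `backward` are hypothetical stand-ins for a trainable tensor and autograd, and the data (points on y = 2x) is made up for illustration.

```python
class Param:
    """Tiny stand-in for a trainable tensor (illustration only, not PyTorch)."""

    def __init__(self, value: float) -> None:
        self.value = value
        self.grad = 0.0


def backward(w: Param, x: float, y: float) -> None:
    """Add d/dw of the squared error (w*x - y)**2 to w.grad, as autograd would."""
    w.grad += 2 * (w.value * x - y) * x


# 1) What happens when the clear step is skipped between two batches.
w = Param(0.0)
backward(w, 1.0, 2.0)
backward(w, 2.0, 4.0)          # no reset: this gradient is added to the last one
stale = w.grad

w.grad = 0.0                   # the zero_grad() step
backward(w, 2.0, 4.0)
fresh = w.grad
print(f"without reset: {stale}, with reset: {fresh}")

# 2) The same five steps as the loop above, fitting y = 2x with one weight.
w = Param(0.0)
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
for epoch in range(50):
    for x, y in data:
        pred = w.value * x             # forward
        loss = (pred - y) ** 2         # loss
        w.grad = 0.0                   # clear grads
        backward(w, x, y)              # backprop
        w.value -= 0.05 * w.grad       # update
print(f"learned w = {w.value:.4f}")
```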
Golden rules
- Print shapes when anything breaks. Most bugs are shape/device/dtype.
- Overfit one batch to validate your pipeline before scaling up.
- Match loss & final activation — feed logits to cross-entropy, not softmaxed values.
- Toggle train()/eval() around training vs inference.
- Don't train from scratch if a pretrained model exists; fine-tune it.
- The learning rate is the hyperparameter that matters most. Tune it first.
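The "match loss & final activation" rule is the one most often broken silently, so here is a small numeric illustration in plain Python. The helper names and the example scores are made up. The sketch relies on the fact that a logits-based cross-entropy (PyTorch's `nn.CrossEntropyLoss`, for instance) applies softmax internally, so feeding it probabilities applies softmax twice.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits: List[float], target: int) -> float:
    """What a logits-based loss computes: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])


logits = [8.0, 1.0, -2.0]        # a confident, correct prediction for class 0
correct = cross_entropy_from_logits(logits, target=0)

# Bug: softmax applied in the model, then again inside the loss.
double = cross_entropy_from_logits(softmax(logits), target=0)

print(f"logits in:        loss = {correct:.4f}")
print(f"probabilities in: loss = {double:.4f}")
```

Probabilities all lie between 0 and 1, so the second softmax squashes them toward uniform. With three classes the loss can never fall below log(1 + 2/e) ≈ 0.55, however confident the model is, and training stalls without raising any error.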
You now have the full skeleton. Deepen it by building: train an MNIST classifier (the "hello world"), fine-tune a 🤗 model on a dataset you care about, then read the official docs as reference rather than cover-to-cover — pytorch.org, tensorflow.org, and huggingface.co/docs/transformers. The concepts here are the map; the docs are the territory.