PyTorch · TensorFlow · 🤗 Transformers
A complete technical reference

Deep learning, from the neuron up to LoRA, GRPO & agents.

Every core idea — taught with plain intuition, a worked example without code, then real code in PyTorch, TensorFlow/Keras, and 🤗 Transformers. Diagrams for the parts that are hard to picture. From backprop and CNNs through quantization, PEFT, RLHF/DPO/GRPO, Mixture-of-Experts, scaled training, serving, and the LangChain / LangGraph / Deep Agents stack. Written for engineers who want depth, not slogans.

31 sections · intuition + code · custom diagrams · PyTorch · TF · HF · neuron → agents

SECTION 01 · What deep learning actually is

Where it sits inside AI, why "deep", and the one trick that makes the whole field work.

Artificial intelligence is the broad goal of making machines do things that seem to require intelligence. Machine learning (ML) is one route to that goal: instead of writing rules by hand, you give a program examples and let it find the rules itself. Deep learning (DL) is a sub-field of ML built on neural networks with many layers — the "deep" simply means many stacked layers of transformation, not anything mystical.

The defining move of deep learning is learned representations. Classical ML often needs a human to hand-design the features (e.g. "count the edges in this image, measure their angles"). A deep network instead learns its own features, layer by layer: early layers detect edges, middle layers assemble them into textures and parts, late layers recognise whole objects. Nobody told it what an edge is — it discovered that edges were useful for the task you optimised it on.

[Diagram: nested sets, Artificial Intelligence ⊃ Machine Learning ⊃ Neural Nets ⊃ Deep Learning. Side panel: Classical ML (human designs features → model learns weights) vs. Deep Learning (model learns features and weights, end-to-end).]
Figure 1. Deep learning is a subset of ML, which is a subset of AI. Its signature is learning the features automatically rather than having an engineer hand-craft them.
Example — no code

Imagine teaching a child to recognise a cat. You don't give a checklist ("4 legs, whiskers, pointed ears"). You show thousands of cats and not-cats and correct them when they're wrong. Over time the child's brain self-organises the concept. A deep network does the same: it sees many labelled images, makes a guess, measures how wrong it was, and nudges its millions of internal numbers a tiny bit toward "less wrong". Repeat millions of times. That nudging loop — guess, measure error, adjust — is deep learning. Everything else is detail about how to do the adjusting efficiently for different shapes of data (images, text, audio).

When to reach for deep learning (and when not to)

Use deep learning when… | Prefer classical ML / rules when…
Data is high-dimensional & unstructured: images, audio, raw text, video. | Data is small (hundreds–few thousand rows) and tabular.
You have lots of data (tens of thousands+ examples) or a pre-trained model to fine-tune. | You need full interpretability / an auditable decision.
The mapping from input→output is complex and you can't write the rules. | Gradient-boosted trees (XGBoost/LightGBM) already beat it, which is common on tabular data.
You can tolerate a black-box model and GPU compute. | Latency/compute budget is tiny or no GPU is available.
Engineer's note

On structured/tabular data, gradient-boosted trees frequently outperform neural nets with far less tuning. Deep learning's home turf is perception (vision, speech) and language. "Use a transformer" is not the answer to every problem — match the tool to the data.


SECTION 02 · The artificial neuron

The single unit every deep network is built from — a weighted sum, a bias, and a squashing function.

A biological neuron receives signals on its dendrites, sums them, and fires if the total crosses a threshold. The artificial neuron (or unit) is a deliberately crude maths version of that idea. It does exactly three things:

  1. Weighted sum. Each input xᵢ is multiplied by a weight wᵢ (how much that input matters) and they're all added up.
  2. Add a bias. A learnable constant b shifts the result, letting the neuron fire more or less easily.
  3. Activation. The sum is passed through a non-linear function σ that decides the output. Without this step the whole network would collapse into one big linear function (see §5).
The neuron in one line
y = σ( w₁x₁ + w₂x₂ + … + wₙxₙ + b ) = σ( w·x + b )

w·x is the dot product of the weight vector and input vector. The weights and bias are the learnable parameters — training is the search for good values of these.

[Diagram: inputs x₁, x₂, x₃ with weights w₁, w₂, w₃ → weighted sum + bias b → activation σ (non-linearity) → output y.]
Figure 2. A single neuron. Thicker arrows = larger weights (here w₂ matters most). Inputs are scaled, summed with a bias, then squashed by an activation σ to produce the output y.
Example — no code

Suppose a neuron decides "should I go for a run?" Inputs: x₁=weather is nice (1/0), x₂=I have free time (1/0), x₃=I'm tired (1/0). You care a lot about free time, somewhat about weather, and being tired pushes you the other way — so the learned weights might be w₁=2, w₂=3, w₃=−4 with bias b=−1. If today weather=1, time=1, tired=0, the sum is 2·1 + 3·1 + (−4)·0 − 1 = 4. A positive number → activation fires → "go run". Tired=1 instead gives 0 → borderline. The network learned these weights from your past behaviour; you never wrote the rule.

›_ Example — with code
neuron.py · python
    import numpy as np

    # one neuron, 3 inputs
    x = np.array([1.0, 1.0, 0.0])    # weather, free time, tired
    w = np.array([2.0, 3.0, -4.0])   # learned weights
    b = -1.0                         # learned bias

    def relu(z):                     # a common activation (see §5)
        return max(0.0, z)

    z = np.dot(w, x) + b             # weighted sum + bias -> 4.0
    y = relu(z)                      # activation -> 4.0
    print(f"pre-activation z = {z}, output y = {y}")

    # A whole layer is just many neurons stacked: a matrix multiply.
    W = np.array([[2., 3., -4.],     # neuron 1
                  [1., -1., 1.]])    # neuron 2
    b_vec = np.array([-1.0, 0.5])
    layer_out = np.maximum(0.0, W @ x + b_vec)   # ReLU over the vector
    print("layer output:", layer_out)

That last block is the crucial leap: a layer is just a matrix multiply plus a bias plus an activation. Stack several of these and you have a deep network. Everything that follows is about choosing the activations, measuring error, and adjusting W and b automatically.


SECTION 03 · Tensors & the math you actually need

A tensor is just an n-dimensional array. Four operations cover ~90% of what happens inside a network.

Everything flowing through a neural network — inputs, weights, activations, gradients — is a tensor: a grid of numbers with some number of axes (dimensions). The jargon maps onto things you know:

Name | Rank (axes) | Example | Shape
Scalar | 0 | a single loss value, e.g. 3.14 | ()
Vector | 1 | one word embedding, one data row | (768,)
Matrix | 2 | a batch of rows; a weight layer | (batch, features)
3-D tensor | 3 | a batch of token sequences | (batch, seq_len, dim)
4-D tensor | 4 | a batch of RGB images | (batch, channels, H, W)

You do not need heavy mathematics to be productive, but four ideas recur constantly:

1 · Matrix multiplication — the workhorse

A layer computing Y = X·W + b is a matrix multiply. If X is (batch, in) and W is (in, out), the result is (batch, out). The inner dimensions must match — most shape bugs are a mismatch here.

2 · Broadcasting

When you add a bias vector of shape (out,) to a matrix (batch, out), the framework broadcasts the vector across every row automatically. Broadcasting lets small tensors stretch to fit big ones without copying memory — but it can also silently create wrong shapes, so check.

3 · The gradient

The gradient is the vector of partial derivatives of the loss with respect to every parameter. It points in the direction of steepest increase of the loss; training walks the opposite way. You will almost never compute it by hand — autograd does it (§7) — but you must understand what it means: "if I nudge this weight up a hair, does the error go up or down, and how fast?"

4 · The chain rule

A network is a chain of functions: loss(layer3(layer2(layer1(x)))). The chain rule from calculus lets you compute how the final loss changes with respect to an early weight by multiplying the local derivatives along the chain. This single rule is the mathematical engine of backpropagation.

The chain rule, the only formula to memorise
dL/dw = (dL/dy) · (dy/dz) · (dz/dw)

Read right-to-left: how the weight affects the pre-activation, how that affects the output, how that affects the loss. Multiply them to get the full effect. Backprop just applies this across millions of parameters efficiently.
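
To make the three factors concrete, here is a minimal sketch in plain Python for a single sigmoid neuron with a squared-error loss. The numbers (`w`, `x`, `b`, `target`) are illustrative assumptions, not values from the text. The chain-rule product is checked against a finite-difference estimate, which is the "nudge the weight a hair" idea from the gradient section.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for(w, x=1.5, b=0.1, target=1.0):
    """Forward pass of one sigmoid neuron with a squared-error loss."""
    z = w * x + b            # pre-activation
    y = sigmoid(z)           # output
    return (y - target) ** 2

# Illustrative values (assumed for this sketch).
w, x, b, target = 0.8, 1.5, 0.1, 1.0

# Forward pass, keeping the intermediates.
z = w * x + b
y = sigmoid(z)

# One local derivative per link in the chain.
dL_dy = 2.0 * (y - target)   # how the loss reacts to the output
dy_dz = y * (1.0 - y)        # derivative of the sigmoid at z
dz_dw = x                    # how the pre-activation reacts to the weight

chain_grad = dL_dy * dy_dz * dz_dw

# Independent check: nudge w slightly in both directions (central difference).
eps = 1e-6
numeric_grad = (loss_for(w + eps) - loss_for(w - eps)) / (2 * eps)

print(f"chain rule: {chain_grad:.6f}   finite difference: {numeric_grad:.6f}")
```

The two numbers agree to several decimal places. The gradient is negative here, meaning that increasing `w` would lower the loss.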

›_ Example — with code (PyTorch tensors)
tensors.py · python
    import torch

    # create tensors
    x = torch.randn(32, 784)     # batch of 32 flattened 28x28 images
    W = torch.randn(784, 128)    # weight matrix: 784 in -> 128 out
    b = torch.randn(128)         # bias vector

    # a linear layer, by hand
    y = x @ W + b                # @ is matmul; b is BROADCAST over 32 rows
    print(x.shape, "@", W.shape, "->", y.shape)   # shapes: (32,784) @ (784,128) -> (32,128)

    # reshaping / moving axes (needed constantly in real code)
    imgs = torch.randn(32, 3, 224, 224)   # (batch, channels, H, W)
    flat = imgs.reshape(32, -1)           # -> (32, 150528)
    moved = imgs.permute(0, 2, 3, 1)      # NCHW -> NHWC: (32,224,224,3)

    # everything runs on a GPU by moving the tensor there
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = x.to(device)                      # same API, faster hardware
Debugging mantra

When a model breaks, print the shapes. A large share of deep-learning bugs come down to tensors that are the wrong shape, on the wrong device (CPU vs GPU), or the wrong dtype (float32 vs int64). Make print(x.shape, x.dtype, x.device) a reflex.


SECTION 04 · The MLP & the forward pass

Stack layers of neurons, push data through, get a prediction. This is the "hello world" of deep nets.

A multi-layer perceptron (MLP), also called a fully-connected or dense network, is layers of neurons where every neuron connects to every neuron in the next layer. Data enters the input layer, flows through one or more hidden layers, and exits the output layer. Computing the output from the input is the forward pass.

[Diagram: input (3) → hidden 1 (4) → hidden 2 (4) → output (2); the forward pass runs left to right.]
Figure 3. A 3-layer MLP. Each layer is activation(X·W + b). "Deep" just means more hidden layers. The output layer's size matches the task (e.g. 2 neurons for a 2-class problem).
Example — no code

Predicting a house price from 3 numbers: size, bedrooms, age. The input layer holds those 3 values. Hidden layer 1 might learn combinations like "big-and-new" or "small-and-old". Hidden layer 2 combines those into higher-level notions like "desirable family home". The single output neuron emits a price. Each layer builds richer concepts from the layer below — and the network decides what those concepts are, guided only by how well the final price matches reality.

›_ Example — with code: the same MLP in 3 frameworks
mlp_pytorch.py · PyTorch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(3, 16),     # input 3 -> hidden 16
        nn.ReLU(),
        nn.Linear(16, 16),    # hidden -> hidden 16
        nn.ReLU(),
        nn.Linear(16, 1),     # hidden -> output 1 (price)
    )
    # forward pass: prediction = model(x)
mlp_keras.py · TensorFlow / Keras
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(3,)),                    # 3 input features
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1),                      # linear output for regression
    ])
    # forward pass: prediction = model(x)

Notice how similar they are. Once you know the concepts, switching frameworks is mostly learning new spellings for the same nouns: nn.Linear = Dense, both stack into a Sequential.


SECTION 05 · Activation functions

The non-linearity that lets networks model curves, not just lines. Without it, depth is pointless.

Here is the single most important fact about activations: stacking linear layers without a non-linearity gives you… one linear layer. W₂(W₁x) = (W₂W₁)x is still just a matrix multiply. The activation function inserts a bend between layers, and bends are what let a network approximate any function — curved decision boundaries, complex mappings, the works. (This is the gist of the universal approximation theorem.)
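
The collapse is easy to verify numerically. The sketch below uses plain Python lists and made-up weights (`W1`, `W2` are assumptions for illustration). It shows that two stacked linear layers give the same output as one merged layer, and that inserting a ReLU between them breaks that equivalence.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    """Multiply a matrix by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def relu(v):
    return [max(0.0, x) for x in v]

# Illustrative weights: layer 1 maps 2 -> 3 features, layer 2 maps 3 -> 2.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]
W2 = [[2.0, 0.0, -1.0], [1.0, 1.0, 1.0]]
x = [1.0, 2.0]

two_linear_layers = matvec(W2, matvec(W1, x))      # W2 (W1 x)
one_merged_layer = matvec(matmul(W2, W1), x)       # (W2 W1) x, a single matrix
with_relu = matvec(W2, relu(matvec(W1, x)))        # a bend between the layers

print("two linear layers:", two_linear_layers)
print("one merged layer: ", one_merged_layer)
print("with ReLU between:", with_relu)
```

The first two results are identical, so the second layer added no expressive power. Only the version with the activation computes something a single matrix cannot.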

[Plots: ReLU, max(0, x) · Sigmoid → (0, 1) · Tanh → (−1, 1).]
Figure 4. Three classic activations. ReLU is the modern default for hidden layers — cheap and avoids vanishing gradients. Sigmoid and tanh saturate at the ends (flat = tiny gradient), which is why they're mostly relegated to output layers and gates now.
Function | Output range | Use it for | Watch out for
ReLU | [0, ∞) | Default for hidden layers | "Dying ReLU": units stuck at 0
Leaky ReLU / GELU | Leaky ReLU: (−∞, ∞); GELU: about [−0.17, ∞) | Hidden layers; GELU in transformers | Slightly more compute
Sigmoid | (0, 1) | Binary classification output; LSTM/GRU gates | Saturation → vanishing gradients
Tanh | (−1, 1) | RNN hidden states, LSTM candidate values; zero-centered outputs | Also saturates
Softmax | (0, 1), sums to 1 | Multi-class output (a probability dist.) | Output layer only, not a hidden-layer activation
Example — no code

Think of softmax as turning raw scores into a probability vote. If the final layer outputs raw scores [2.0, 1.0, 0.1] for "cat / dog / bird", softmax converts them to roughly [0.66, 0.24, 0.10]. They're now positive and sum to 1, so you can read them as "66% of the probability mass is on cat". The biggest score still wins, but you also get a graded sense of confidence (not necessarily a well-calibrated one), which the loss function (§6) needs.

›_ Example — with code
activations.py · python
    import torch
    import torch.nn.functional as F

    z = torch.tensor([2.0, 1.0, 0.1])

    F.relu(z)            # tensor([2.0, 1.0, 0.1])     negatives -> 0 (none here, so unchanged)
    torch.sigmoid(z)     # tensor([0.88, 0.73, 0.52])  each squashed to (0,1)
    torch.tanh(z)        # tensor([0.96, 0.76, 0.10])  squashed to (-1,1)
    F.softmax(z, dim=0)  # tensor([0.66, 0.24, 0.10])  a probability distribution
    F.gelu(z)            # smooth ReLU-like curve used in BERT/GPT
! Common mistake

Don't put a softmax or sigmoid on the output and also use a loss that applies it internally. PyTorch's CrossEntropyLoss, and Keras losses constructed with from_logits=True, expect raw logits (no final activation). Applying softmax twice silently hurts training. Feed raw scores to those losses.
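
A small sketch of why the double application hurts, in plain Python. The `softmax` helper is written here for illustration and is not a framework function; the scores are the [2.0, 1.0, 0.1] from the example above. Applying softmax to values that are already probabilities squeezes them toward uniform, so the loss sees a much less confident prediction than the model actually made.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]     # raw scores
once = softmax(logits)       # what the loss computes internally
twice = softmax(once)        # what the loss sees if the model ALSO applied softmax

print("softmax once: ", [round(p, 2) for p in once])
print("softmax twice:", [round(p, 2) for p in twice])
```

The second line is visibly flatter than the first: the top class loses probability and the spread between classes shrinks, which weakens the learning signal.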


SECTION 06 · Loss functions — measuring "how wrong"

A single number that scores the prediction against the truth. Training = making this number small.

The loss (or cost/objective) function takes the network's prediction and the true answer and outputs one number: bigger = more wrong. The entire goal of training is to minimise this number. Your choice of loss defines what "good" means, so it must match the task.

Task | Loss | What it measures
Regression (predict a number) | MSE (mean squared error) | Average squared gap between prediction and target
Regression, robust to outliers | MAE / Huber | Absolute gap; big errors are punished less severely than under MSE
Binary classification | Binary cross-entropy | How surprised the model is by the true 0/1 label
Multi-class classification | Cross-entropy | How much probability mass landed on the wrong class
The two you'll use most
MSE = (1/N) · Σ (ŷᵢ − yᵢ)²
CrossEntropy = − Σ yᵢ · log(ŷᵢ)

In cross-entropy, y is the true distribution (usually 1 for the correct class, 0 elsewhere) and ŷ the predicted probabilities. It heavily punishes confident wrong answers: predicting 0.01 for the true class gives a huge −log(0.01) penalty.
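
The formula can be evaluated by hand to see the asymmetry. This is a minimal sketch in plain Python; the three predicted distributions are made up for illustration.

```python
import math

def cross_entropy(true_dist, predicted):
    """-sum(y_i * log(yhat_i)); classes with y_i == 0 contribute nothing."""
    return -sum(y * math.log(p) for y, p in zip(true_dist, predicted) if y > 0)

y_true = [1.0, 0.0, 0.0]                # one-hot: class 0 is the correct class

confident_right = [0.98, 0.01, 0.01]
unsure = [0.40, 0.30, 0.30]
confident_wrong = [0.01, 0.98, 0.01]

loss_right = cross_entropy(y_true, confident_right)
loss_unsure = cross_entropy(y_true, unsure)
loss_wrong = cross_entropy(y_true, confident_wrong)

print(f"confident and right: {loss_right:.3f}")
print(f"unsure:              {loss_unsure:.3f}")
print(f"confident and wrong: {loss_wrong:.3f}")
```

With a one-hot target the sum reduces to −log of the probability given to the true class, so the confidently wrong prediction pays exactly the −log(0.01) penalty described above.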

Example — no code

Cross-entropy is "surprise". If the true label is cat and the model says cat with 99% confidence, it was barely surprised → tiny loss. If it confidently said dog at 99%, the truth is a shock → large loss. This asymmetry is exactly what you want: a model that is confidently wrong should be punished far more than one that was unsure. MSE behaves similarly for numbers — being off by 10 costs 100× more than being off by 1, because the gap is squared.

›_ Example — with code
losses.py · python
    import torch
    import torch.nn as nn

    # regression
    pred = torch.tensor([3.2, 5.0])
    target = torch.tensor([3.0, 4.5])
    mse = nn.MSELoss()(pred, target)             # -> 0.145

    # multi-class classification (3 classes, batch of 2)
    logits = torch.tensor([[2.0, 0.5, 0.1],      # raw scores, NOT softmaxed
                           [0.2, 0.1, 3.0]])
    labels = torch.tensor([0, 2])                # true class indices
    ce = nn.CrossEntropyLoss()(logits, labels)   # applies softmax internally

    print(mse.item(), ce.item())

SECTION 07 · Backpropagation — the learning engine

How the network figures out which way to nudge every weight to lower the loss. The single most important algorithm in the field.

You have a loss number. Now what? You need to know, for each of the millions of weights, "if I increase this weight slightly, does the loss go up or down, and by how much?" That quantity is the gradient of the loss with respect to that weight. Backpropagation ("backprop") computes all of them in one efficient backward sweep.

It works in two phases:

  1. Forward pass — push the input through the network, computing and remembering each intermediate value, until you reach the loss.
  2. Backward pass — start from the loss and walk backwards, applying the chain rule (§3) at every step to push the gradient back to each weight. Each layer hands the layer before it the message "here's how much you contributed to the error."
[Diagram: input x → layer 1 (W₁, b₁) → layer 2 (W₂, b₂) → output ŷ → loss. Forward arrows compute the prediction and loss; backward arrows push gradients to every weight.]
Figure 5. Backprop = a forward pass (blue) to get the loss, then a backward pass (red, dashed) that uses the chain rule to compute ∂loss/∂W for every parameter. The optimizer (§8) then uses those gradients to update the weights.
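
The two phases can be written out by hand for a tiny network. The sketch below is plain Python under stated assumptions: one scalar input, one ReLU hidden unit, one linear output, a squared-error loss, and made-up parameter values. The forward pass caches its intermediates, the backward pass reuses them, and each hand-derived gradient is checked against a finite difference.

```python
def forward(params, x, target):
    """Forward pass; returns the loss and the intermediates backprop needs."""
    w1, b1, w2, b2 = params
    z1 = w1 * x + b1
    h = max(0.0, z1)              # ReLU hidden unit
    y_hat = w2 * h + b2           # linear output
    loss = (y_hat - target) ** 2
    return loss, (z1, h, y_hat)

def backward(params, x, target, cache):
    """Backward pass: walk from the loss back to every parameter."""
    w1, b1, w2, b2 = params
    z1, h, y_hat = cache
    d_yhat = 2.0 * (y_hat - target)           # dL/dy_hat
    d_w2 = d_yhat * h                         # layer 2 parameters
    d_b2 = d_yhat
    d_h = d_yhat * w2                         # message handed back to layer 1
    d_z1 = d_h * (1.0 if z1 > 0 else 0.0)     # through the ReLU
    d_w1 = d_z1 * x                           # layer 1 parameters
    d_b1 = d_z1
    return [d_w1, d_b1, d_w2, d_b2]

params = [0.5, 0.1, -1.5, 0.2]    # illustrative w1, b1, w2, b2
x, target = 2.0, 1.0

loss, cache = forward(params, x, target)
grads = backward(params, x, target, cache)

# Check every gradient with a central finite difference.
eps = 1e-6
numeric = []
for i in range(len(params)):
    up = list(params)
    down = list(params)
    up[i] += eps
    down[i] -= eps
    numeric.append((forward(up, x, target)[0] - forward(down, x, target)[0]) / (2 * eps))

print("backprop:", [round(g, 4) for g in grads])
print("numeric: ", [round(g, 4) for g in numeric])
```

Note how `d_yhat` is computed once and reused for every parameter, and `d_h` is the only thing layer 1 needs to know about layer 2. That reuse is what keeps the backward sweep cheap.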
Example — no code

A factory ships a defective product (the loss). To assign blame, the manager walks the assembly line backwards: the packaging station was 10% responsible, the welding station 60%, the parts supplier 30%. Each station now knows exactly how much to adjust. Backprop is this blame-assignment, done with calculus: it distributes "responsibility for the error" back through every layer, in proportion to how much each weight influenced the outcome. The beautiful part: it reuses the calculations from later layers when computing earlier ones, so the whole backward pass costs only a small constant multiple (roughly twice) of one forward pass, rather than one extra pass per weight.

›_ Example — with code (autograd does it for you)
autograd.py · PyTorch
    import torch

    # w is a parameter we want gradients for
    w = torch.tensor([2.0], requires_grad=True)
    x = torch.tensor([3.0])

    y = w * x                # forward: 6.0
    loss = (y - 10) ** 2     # target is 10 -> loss = (6-10)^2 = 16

    loss.backward()          # BACKPROP: fills w.grad with d(loss)/dw
    print(w.grad)            # tensor([-24.]) -> negative: increasing w lowers loss

    # you almost never call backward yourself on toy expressions;
    # in real training it's one line inside the loop (see section 10).
Why it changed everything

Backprop (popularised 1986) made it computationally feasible to train deep networks. Modern frameworks build a computational graph of every operation during the forward pass, then automatically differentiate it — this is automatic differentiation (autodiff). You write only the forward math; the gradients come free.


SECTION 08 · Gradient descent & optimizers

Backprop says which way is downhill. The optimizer decides how big a step to take, and how to be smart about it.

Picture the loss as a landscape of hills and valleys, with the network's weights as your coordinates. You want the lowest valley. The gradient points uphill, so you step the opposite way. That's gradient descent:

The update rule
w ← w − η · (∂loss/∂w)

η (eta) is the learning rate — the step size, and the single most important hyperparameter you'll tune. Too large → you overshoot and diverge. Too small → training crawls or gets stuck. Typical values: 1e-3 to 1e-5.
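
The effect of the step size shows up even on a one-dimensional toy loss. The sketch below minimises loss(w) = (w − 3)² in plain Python. The three learning rates are chosen for this toy function only and are not recommendations for real networks, where the workable range depends on the model and optimizer.

```python
def descend(lr, steps=50, w=0.0):
    """Run gradient descent on loss(w) = (w - 3)^2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)    # d(loss)/dw
        w = w - lr * grad         # the update rule: w <- w - lr * grad
    return w

too_small = descend(lr=0.001)     # crawls: barely moves in 50 steps
reasonable = descend(lr=0.1)      # converges close to the minimum at w = 3
too_large = descend(lr=1.1)       # overshoots further each step and diverges

print(f"lr=0.001 -> w = {too_small:.3f}")
print(f"lr=0.1   -> w = {reasonable:.3f}")
print(f"lr=1.1   -> w = {too_large:.1f}")
```

On this function each step multiplies the distance to the minimum by (1 − 2·lr). A factor just below 1 crawls, a factor well inside (−1, 1) converges, and a factor beyond −1 blows up.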

[Plot: loss surface over weight value; descent steps run from the start (high loss) down to the minimum.]
Figure 6. Gradient descent rolls downhill on the loss surface. Each red dot is one update step; the step size is the learning rate. Real loss surfaces have millions of dimensions, but the intuition — follow the slope down — is the same.

Batch, stochastic, and mini-batch

Computing the gradient over the entire dataset every step is accurate but slow. Stochastic gradient descent (SGD) uses one example at a time (noisy, fast). Mini-batch gradient descent — the universal practical choice — uses a small batch (e.g. 32–256 examples), balancing speed and stability. One pass over the whole dataset is an epoch.
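
A minimal mini-batch loop in plain Python, fitting a line to synthetic data. The data-generating rule (y = 2x + 1 plus noise), the batch size, and the learning rate are all assumptions chosen for illustration. Each update uses the gradient averaged over one mini-batch only, and one full sweep over the shuffled data is one epoch.

```python
import random

random.seed(0)

# Synthetic data from y = 2x + 1 plus a little noise (illustrative only).
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.05)) for x in xs]

w, b = 0.0, 0.0
lr, batch_size, epochs = 0.1, 32, 30

for epoch in range(epochs):
    random.shuffle(data)                               # new batch order each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the mean squared error, averaged over this mini-batch only.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
        w -= lr * grad_w
        b -= lr * grad_b

updates_per_epoch = -(-len(data) // batch_size)        # ceil(200 / 32) = 7
print(f"w = {w:.2f}, b = {b:.2f}, updates per epoch = {updates_per_epoch}")
```

The last batch of each epoch is smaller than the rest (200 is not a multiple of 32); dividing by `len(batch)` keeps the averaged gradient on the same scale.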

Smarter optimizers

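Plain SGD can be slow and get stuck. Modern optimizers add tricks: momentum keeps a running average of past gradients so steps build up speed along consistent directions, and adaptive methods scale the step size separately for each parameter. Adam combines both; AdamW additionally decouples weight decay from the gradient update.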
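
As one concrete example of such an optimizer, here is a minimal single-parameter sketch of the standard Adam update in plain Python. The function name `adam_minimize`, the toy loss (w − 3)², and the step count are assumptions for illustration; the default `beta1`, `beta2`, and `eps` values are the commonly used ones. Framework implementations apply the same arithmetic to every parameter tensor.

```python
import math

def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimal single-parameter Adam, following the standard update equations."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g            # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g        # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)               # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps) # step scaled by gradient history
    return w

# Toy problem: minimise (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(f"w after Adam: {w_final:.3f}")
```

Dividing by the root of the squared-gradient average means the early steps have a size close to `lr` regardless of how large the raw gradient is, which is part of why Adam tends to need less learning-rate tuning than plain SGD.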

›_ Example — with code
optimizers.py · PyTorch
    import torch

    model = torch.nn.Linear(3, 1)    # stand-in model so the snippet runs on its own

    # pass the model's parameters and a learning rate
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # one optimization step (inside the training loop):
    loss = model(torch.randn(8, 3)).pow(2).mean()   # stand-in loss on random data
    opt.zero_grad()    # clear old gradients (they accumulate otherwise!)
    loss.backward()    # backprop: compute new gradients
    opt.step()         # apply the update rule to every weight

    # Keras equivalent:
    # model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
! The #1 silent bug

Forgetting opt.zero_grad(). PyTorch accumulates gradients by default, so if you skip it, this step's gradients pile on top of the last step's — and training quietly goes haywire. Always zero, then backward, then step.


SECTION 09 · Regularization & generalization

The real goal isn't a low training loss — it's performing well on data the model has never seen. These techniques fight memorization.

A network with millions of parameters can simply memorise the training set, scoring perfectly on it while failing on new data. That's overfitting. The opposite — too simple to capture the pattern at all — is underfitting. The art is landing in between: generalization.

[Plot: loss vs. training time (epochs). Training loss keeps falling; validation loss bottoms out ("stop here ✓") and then rises into the overfitting zone.]
Figure 7. The tell-tale sign of overfitting: training loss keeps dropping while validation loss bottoms out and then climbs. The gap between the two curves is the overfit. Stopping at the green line (early stopping) keeps the best-generalizing model.
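
The stopping rule in Figure 7 is usually implemented with a "patience" counter. The sketch below is plain Python; the function name, the patience value, and the validation curve are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs in a row fail to beat the best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch    # in practice: save a checkpoint here
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Made-up validation curve: improves, bottoms out at epoch 4, then climbs.
val_curve = [0.90, 0.70, 0.55, 0.48, 0.45, 0.46, 0.49, 0.53, 0.60]
best, stopped = early_stopping_epoch(val_curve, patience=3)
print(f"best epoch = {best}, stopped at epoch = {stopped}")
```

Training halts a few epochs after the minimum, and the weights saved at the best epoch are the ones you keep.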

The toolkit

Technique | What it does
Train/val/test split | Hold out data the model never trains on, so you can measure true generalization.
Dropout | Randomly zero out a fraction of neurons each step, forcing the network not to rely on any single unit. A regularizer that mimics training an ensemble.
Weight decay (L2) | Penalize large weights, nudging the model toward simpler functions.
Early stopping | Stop training when validation loss stops improving (Figure 7).
Data augmentation | Create new training examples by transforming existing ones (flip/crop/rotate images, paraphrase text). More effective variety = less memorization.
Batch normalization | Normalize each layer's inputs per mini-batch. Stabilizes & speeds training, with a mild regularizing side-effect.
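
Dropout is simple enough to write from scratch. The sketch below is plain Python and uses the "inverted" variant: surviving activations are scaled up by 1/(1 − p) during training so that the expected value is unchanged, and evaluation mode is a plain pass-through. The helper name and the all-ones input are assumptions for illustration.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)                  # eval mode: identity
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10_000

train_out = dropout(acts, p=0.3, training=True, rng=rng)
eval_out = dropout(acts, p=0.3, training=False, rng=rng)

zero_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)

print(f"zeroed in training: {zero_fraction:.2%}, mean activation: {train_mean:.3f}")
print("eval output unchanged:", eval_out == acts)
```

About 30% of units are zeroed in training, yet the mean activation stays near 1.0 because of the rescaling. That is why no correction is needed at inference time.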
Example — no code

A student who memorises last year's exam answers aces the practice test but bombs the real exam with new questions — that's overfitting. Dropout is like randomly making some study notes unavailable each night: the student is forced to understand the material rather than lean on one memorised sheet. Data augmentation is studying many rephrased versions of each problem. Early stopping is putting the books down once mock-exam scores stop improving, before you start over-memorising trivia.

›_ Example — with code
regularization.py · PyTorch
    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.BatchNorm1d(256),    # batch normalization
        nn.ReLU(),
        nn.Dropout(p=0.3),      # zero 30% of activations during training
        nn.Linear(256, 10),
    )

    # weight decay (L2) is set on the optimizer:
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

    # IMPORTANT: dropout/batchnorm behave differently in train vs eval!
    model.train()   # enables dropout + batchnorm updates
    # ... training ...
    model.eval()    # disables dropout, freezes batchnorm stats for inference
! Don't forget the mode switch

Call model.train() before training and model.eval() before evaluating/inference. Forgetting eval() leaves dropout active at test time and makes batchnorm use batch stats — a classic cause of "my accuracy is randomly worse at inference."


SECTION 10 · The training loop — everything together

Sections 4–9 in one place. Memorize this rhythm and you can train anything.

All the pieces now connect into a single repeating cycle. Internalise these five steps and the rest of deep learning is variations on the theme:

  1. Forward — run a mini-batch through the model to get predictions.
  2. Loss — compare predictions to targets.
  3. Backward — backprop to compute gradients (after zeroing old ones).
  4. Step — optimizer updates the weights.
  5. Repeat — over every batch, for many epochs, validating periodically.
[Diagram: 1 · forward → 2 · loss → 3 · backward → 4 · step → 5 · repeat, for every batch & epoch.]
Figure 8. The canonical training loop. This same five-step cycle trains a tiny MLP and a billion-parameter transformer alike — only the model and data change.
›_ Example — a complete, runnable PyTorch loop
train.py · PyTorch
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader

    # Assumes MyModel, EPOCHS, train_loader and val_loader (DataLoader objects)
    # are defined elsewhere for your task.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = MyModel().to(device)
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for epoch in range(EPOCHS):
        # ---- train ----
        model.train()
        for xb, yb in train_loader:                  # mini-batches
            xb, yb = xb.to(device), yb.to(device)
            preds = model(xb)                        # 1. forward
            loss = loss_fn(preds, yb)                # 2. loss
            opt.zero_grad()                          # clear old grads
            loss.backward()                          # 3. backward
            opt.step()                               # 4. update weights

        # ---- validate ----
        model.eval()
        correct = 0
        with torch.no_grad():                        # no gradients needed
            for xb, yb in val_loader:
                xb, yb = xb.to(device), yb.to(device)
                correct += (model(xb).argmax(1) == yb).sum().item()
        print(f"epoch {epoch}: val acc = {correct/len(val_loader.dataset):.3f}")
›_ The same in Keras — the loop is hidden
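train_keras.py · TensorFlow / Keras
    # Keras wraps the loop in .fit(): convenient, less explicit.
    # The string loss below expects softmax probabilities from the model; for raw
    # logits use keras.losses.SparseCategoricalCrossentropy(from_logits=True).
    model.compile(optimizer="adamw",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)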
The two philosophies

PyTorch makes you write the loop — more code, total control, easy to debug, dominant in research. Keras hides it in .fit() — less code, faster to prototype. 🤗 Transformers' Trainer (§16) is a third option that handles the loop and distributed training, logging, and checkpoints for you. Learn the explicit loop first; the magic ones make sense once you know what they're hiding.


SECTION 11 · Convolutional networks — seeing

The architecture that cracked computer vision. It exploits the fact that in images, nearby pixels are related and patterns repeat.

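Why not just feed an image into an MLP? A modest 224×224 RGB image is 150,528 numbers; a single dense layer to 1,000 units would need about 150 million weights, and it would treat a cat in the top-left as completely unrelated to a cat in the bottom-right. Convolutional neural networks (CNNs) fix both problems with two ideas: local connectivity (each unit looks only at a small patch of the image, through a small filter) and weight sharing (the same filter is reused at every position).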

[Diagram: a 3×3 filter slides over a 5×5 input image; convolving produces a feature map that is brighter where the filter found its pattern. Stack many filters → many maps.]
Figure 9. A convolution: a small filter slides over the image computing a dot product at each position, producing a feature map that lights up where the filter's pattern appears. A conv layer has many filters, each learning a different pattern (edges, corners, textures).
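
The sliding dot product is a few lines of plain Python. In this sketch the 5×5 image and the vertical-edge filter are hand-written for illustration; a trained layer would learn its own filter values. Like deep-learning frameworks, the function does not flip the kernel (strictly speaking it is a cross-correlation), and it uses no padding and a stride of 1.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2-D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the filter and the patch under it.
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# 5x5 image: dark on the left (0), bright on the right (1) -> one vertical edge.
image = [[0, 0, 0, 1, 1] for _ in range(5)]

# Hand-written vertical-edge filter: responds to dark-to-bright, left to right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

The 3×3 feature map is zero where the patch is uniform and large where the patch straddles the edge. The same nine weights are reused at every position, which is the weight sharing that keeps the parameter count small.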

The full CNN recipe

A typical CNN alternates convolution → activation → pooling, repeated, then flattens into a small MLP head for the final prediction:

[Diagram: image → conv+relu → pool → conv+relu → pool → flatten → dense head → class. Spatial size shrinks while the number of feature channels grows, so abstraction increases.]
Figure 10. A classic CNN pipeline. As data flows right, the spatial resolution decreases while the feature depth increases — the network trades "where" for "what".
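
Tracking shapes through such a pipeline is mostly arithmetic. The sketch below applies the usual output-size formula, floor((size + 2·padding − kernel) / stride) + 1, stage by stage in plain Python. The 32×32 RGB input and the 32/64 channel counts are assumptions for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed 32x32 RGB input; channel counts are illustrative.
h = w = 32
channels = 3
stages = [
    ("conv 3x3, pad 1 -> 32 ch", dict(kernel=3, padding=1), 32),
    ("max-pool 2x2",             dict(kernel=2, stride=2), None),
    ("conv 3x3, pad 1 -> 64 ch", dict(kernel=3, padding=1), 64),
    ("max-pool 2x2",             dict(kernel=2, stride=2), None),
]

shapes = [(channels, h, w)]
for name, kwargs, out_channels in stages:
    h, w = conv_out(h, **kwargs), conv_out(w, **kwargs)
    channels = out_channels or channels       # pooling keeps the channel count
    shapes.append((channels, h, w))
    print(f"{name:26s} -> {(channels, h, w)}")

flat_features = channels * h * w              # input size of the dense head
print("flattened features:", flat_features)
```

The spatial size goes 32 → 16 → 8 while the channels go 3 → 32 → 64, so the dense head after flattening must accept 64 · 8 · 8 = 4096 inputs. Getting this number wrong is a common source of shape errors.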
Example — no code

Recognising a face. The first conv layer's filters fire on tiny edges and color blobs. The next layer combines edges into eyes, noses, mouth corners. A deeper layer combines those parts into "a face arrangement". Pooling between them means it doesn't matter if the face is a few pixels left or right — the same eye-detector fires regardless. You designed none of these detectors; the network grew them because they reduced the loss on labelled faces.

›_ Example — with code (a small image classifier)
cnn.py · PyTorch
    import torch.nn as nn

    # Assumes 32x32 RGB inputs (e.g. CIFAR-sized): two 2x2 pools give 32 -> 16 -> 8,
    # which is where the 64 * 8 * 8 below comes from.
    cnn = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1),    # 3 RGB in -> 32 feature maps
        nn.ReLU(),
        nn.MaxPool2d(2),                               # halve spatial size
        nn.Conv2d(32, 64, kernel_size=3, padding=1),   # 32 -> 64 feature maps
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, 128), nn.ReLU(),         # dense head
        nn.Linear(128, 10),                            # 10 classes
    )

    # For real projects you rarely build from scratch: you fine-tune a
    # pretrained ResNet/EfficientNet (see Section 15 on transfer learning).
Practical reality

Almost nobody trains a vision CNN from scratch anymore. You take a model pretrained on ImageNet (ResNet, EfficientNet, or a Vision Transformer) and fine-tune it on your data — better accuracy with a fraction of the data and compute. More on this in §15.


SECTION 12 · Recurrent networks — sequences & memory

For data with order — text, speech, time series. The idea that ruled NLP before transformers, and still worth understanding.

An MLP and a CNN see a fixed-size input all at once. But language and time series are sequences of arbitrary length where order matters ("dog bites man" ≠ "man bites dog"). A recurrent neural network (RNN) processes a sequence one step at a time, carrying a hidden state — a running memory — from one step to the next. At each step it combines the new input with the memory of everything seen so far.

[Diagram: four RNN cells in a row reading "the", "cat", "sat", "down", linked by the hidden state; one shared cell, applied at every timestep.]
Figure 11. An RNN "unrolled" over a sentence. It's actually one cell reused at each timestep; the teal arrow is the hidden state passing memory forward. The same weights process every word — like reading left to right, updating your understanding as you go.
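
The recurrence in Figure 11 can be sketched in plain Python as a plain (Elman-style) RNN step, h_t = tanh(W_xh·x_t + W_hh·h_prev + b_h). All sizes and weights below are made up for illustration. Running the same two tokens in opposite orders shows that the final hidden state depends on order.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One timestep of a plain RNN: mix the new input with the previous memory."""
    h_t = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_t.append(math.tanh(total))
    return h_t

# Illustrative weights: 2-dim inputs, 3-dim hidden state.
W_xh = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
W_hh = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1], [0.1, -0.2, 0.4]]
b_h = [0.0, 0.1, -0.1]

def run(sequence):
    h = [0.0, 0.0, 0.0]                  # empty memory before the first token
    for x_t in sequence:                 # the SAME weights are reused at each step
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h

tok_a, tok_b = [1.0, 0.0], [0.0, 1.0]    # two toy "word vectors"
h_ab = run([tok_a, tok_b])
h_ba = run([tok_b, tok_a])

print("a then b:", [round(v, 3) for v in h_ab])
print("b then a:", [round(v, 3) for v in h_ba])
```

The loop in `run` accepts a sequence of any length, because the parameters belong to the cell rather than to a position in the sequence.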

The vanishing gradient problem & the LSTM fix

Plain RNNs struggle to remember things from far back: during backprop the gradient is multiplied at every timestep and shrinks toward zero over long sequences — the vanishing gradient problem. By the time it reaches the start of a long sentence, the learning signal has evaporated.
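
A back-of-the-envelope sketch of the effect, in plain Python. It makes a strong simplifying assumption: the gradient is scaled by the same scalar factor at every timestep, whereas in a real RNN the factor comes from the recurrent weights and the activation's derivative and varies from step to step.

```python
def gradient_scale(per_step_factor, timesteps):
    """Size of a gradient after being multiplied by the same factor each timestep."""
    scale = 1.0
    for _ in range(timesteps):
        scale *= per_step_factor
    return scale

for factor in (0.9, 1.0, 1.1):
    scales = [f"{gradient_scale(factor, t):.2e}" for t in (10, 50, 100)]
    print(f"factor {factor}: after 10/50/100 steps -> {scales}")
```

A factor only slightly below 1 leaves almost nothing of the signal after 100 steps. A factor slightly above 1 blows up instead, which is the mirror-image "exploding gradient" problem.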

The LSTM (Long Short-Term Memory) and the simpler GRU solve this with gates — small neural mechanisms that learn what to keep, what to forget, and what to output from a protected "cell state" that flows through largely unchanged. This lets information survive across hundreds of steps.
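
To show how gating protects the cell state, here is a scalar LSTM step in plain Python using the standard gate equations. The parameters are hand-picked for illustration, not learned: the input gate opens only for a large input, and the forget gate otherwise stays close to 1. A single strong input is followed by 100 quiet steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell (standard gate equations)."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])     # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])     # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])     # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])   # candidate value
    c = f * c_prev + i * g       # cell state: keep some of the old, write some new
    h = o * math.tanh(c)         # hidden state passed to the next step
    return h, c

# Hand-picked (not learned) parameters for this illustration.
params = dict(wf=-4.0, uf=0.0, bf=6.0,    # forget gate near 1 unless x is large
              wi=4.0, ui=0.0, bi=-6.0,    # input gate near 0 unless x is large
              wo=0.0, uo=0.0, bo=4.0,     # output gate mostly open
              wg=1.0, ug=0.0, bg=0.0)

h, c = 0.0, 0.0
sequence = [3.0] + [0.0] * 100            # one strong signal, then 100 quiet steps
history = []
for x in sequence:
    h, c = lstm_step(x, h, c, params)
    history.append(c)

print(f"cell state right after the signal: {history[0]:.3f}")
print(f"cell state 100 steps later:        {history[-1]:.3f}")
```

The cell state is updated additively and scaled by a forget gate near 1, so most of the stored value is still there after 100 steps. Compare that with a signal multiplied by 0.9 per step, which would be essentially gone.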

Example — no code

Reading "The keys to the cabinet … are on the table." To choose "are" over "is", the model must remember, across many intervening words, that the subject was plural ("keys"). A plain RNN tends to forget by then. An LSTM's forget gate learns to hold onto "subject = plural" in its cell state until the verb arrives, then uses it. The gates are themselves learned — the network discovers what's worth remembering.

›_ Example — with code
lstm.py · PyTorch
    import torch
    import torch.nn as nn

    # classify the sentiment of a sequence of word-embeddings
    class SentimentLSTM(nn.Module):
        def __init__(self, vocab, dim=128, hidden=256):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)            # token id -> vector
            self.lstm = nn.LSTM(dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)                 # pos / neg

        def forward(self, x):                 # x: (batch, seq_len)
            e = self.embed(x)                 # (batch, seq, dim)
            out, (h_n, c_n) = self.lstm(e)    # h_n = final memory
            return self.head(h_n[-1])         # classify from it
Where RNNs stand today

Transformers (§13) have largely replaced RNNs for language because they process a whole sequence in parallel and model long-range dependencies better. RNNs/LSTMs are still useful for streaming data, very long time series on tight compute, and on-device settings — and understanding them clarifies why attention was such a leap.


SECTION 13 · Transformers & attention

The architecture behind GPT, BERT, and essentially every modern large model. Understand attention and you understand the engine of the AI boom.

In 2017 the paper "Attention Is All You Need" introduced the Transformer, discarding recurrence entirely. Its core insight: instead of passing memory step-by-step, let every token look directly at every other token and decide how much each one matters for understanding it. That mechanism is self-attention. Because all tokens are processed at once, transformers parallelise beautifully on GPUs — which is what made training enormous models practical.

Self-attention: Query, Key, Value

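Each token produces three vectors via learned projections: a query (what this token is looking for), a key (what this token offers for matching), and a value (the content it passes along when attended to).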

A token's query is compared (dot product) against every token's key to produce attention scores — how relevant each other token is. Softmax turns those into weights that sum to 1, and the output is the weighted sum of all the values. In short: each token gathers a custom blend of information from the whole sequence, weighted by relevance.

Scaled dot-product attention
Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

The √dₖ divisor keeps the dot products from growing too large (which would saturate the softmax). Multi-head attention runs several of these in parallel, each head free to focus on a different kind of relationship (syntax, coreference, topic), then concatenates them.
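
A quick numeric sketch of why the divisor matters. The scores and the key dimension below are made-up values chosen for illustration, and `softmax` is a plain-Python helper rather than a library call.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64                      # assumed key dimension
raw = [16.0, 8.0, 0.0]        # hypothetical unscaled dot products q·k
scaled = [s / math.sqrt(d_k) for s in raw]   # divide by sqrt(64) = 8

unscaled_weights = softmax(raw)
scaled_weights = softmax(scaled)
print([round(w, 4) for w in unscaled_weights])
print([round(w, 4) for w in scaled_weights])
```

Without scaling, almost all of the weight lands on a single token, so the softmax is saturated and its gradients are tiny. After dividing by √dₖ the top token keeps roughly two-thirds of the weight and the others still contribute.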

[Diagram: tokens "The animal didn't cross because it"; the query from "it" looks back over the sentence and weights "animal" highest.]
Figure 12. Self-attention resolving a pronoun. When processing "it", the model's query matches the key of "animal" most strongly (thick line), so it pulls in that token's value — correctly linking "it" to "animal". Each token computes such a weighted view of the whole sentence.

The full transformer block

A transformer is a stack of identical blocks. Each block is: multi-head self-attention → add & normalize → feed-forward MLP → add & normalize. Two more pieces make it work:

[Diagram: Transformer block (×N): Multi-Head Self-Attention → Add & LayerNorm → Feed-Forward MLP → Add & LayerNorm, with residual paths feeding each "Add".]
Figure 13. One transformer block. Stack N of these (GPT-class models use dozens to over a hundred). The red residual paths and LayerNorm are what make such depth trainable.

Encoder, decoder, and the model families

Family | Structure | Best at | Examples
Encoder-only | Reads the whole input at once (bidirectional) | Understanding: classification, embeddings, search | BERT, RoBERTa
Decoder-only | Predicts the next token left-to-right (causal) | Generation: chat, completion, code | GPT, Llama, Mistral
Encoder–decoder | Encode input, then generate output | Translation, summarization (seq-to-seq) | T5, BART
Example — no code

Think of self-attention as a meeting where everyone can hear everyone. To decide what "it" refers to, the word "it" effectively asks the room, "who here is a noun I might be standing in for?" Every other word answers with how relevant it is; "animal" answers loudest. "It" then updates its understanding by blending in mostly "animal". A decoder (like GPT) is the same meeting but with a rule: you may only listen to words before you, never after — which is exactly what's needed to predict the next word one at a time.
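
The "only listen to words before you" rule is a causal mask. Below is a minimal plain-Python sketch of it. The score matrix is made up, and `causal_attention_weights` is a hypothetical helper, not a library function. Real implementations usually get the same effect by setting future scores to −∞ before the softmax.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax where token i may only attend to tokens 0..i.

    `scores` is a square list-of-lists of raw attention scores
    (rows are queries, columns are keys)."""
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]             # keys at or before position i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - i - 1)   # future tokens get 0
        weights.append(row)
    return weights


scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 1.0, 0.0]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, and every entry above the diagonal is zero, so no position ever sees a later token.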

›_ Example — self-attention from scratch
attention.py · PyTorch
import torch, torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, dim), one sequence of token vectors
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                      # learned projections
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5       # (seq, seq)
    weights = F.softmax(scores, dim=-1)                   # how much each token attends
    return weights @ V                                    # weighted blend of values

# In practice you use the built-in, optimized version:
# torch.nn.MultiheadAttention, or just load a pretrained transformer (§16).

Why this won

Transformers removed the sequential bottleneck of RNNs (parallel training), handle long-range dependencies directly (any token to any token in one step), and scale predictably — bigger model + more data + more compute reliably yields better performance. That predictable scaling is precisely what fuelled the era of large language models.


SECTION 14 · Embeddings: meaning as geometry

How networks turn words, images, or anything discrete into vectors where distance equals similarity. The quiet idea behind search, RAG, and recommendations.

A network can't multiply the word "king". An embedding maps each discrete item (a word, a product, a user) to a dense vector of, say, 768 numbers. Crucially, these vectors are learned so that items used in similar ways end up near each other in the space. Meaning becomes geometry: similarity is just distance.

The famous example
vec("king") − vec("man") + vec("woman") ≈ vec("queen")

Relationships become directions in the space. The "royalty" and "gender" concepts emerge as consistent vector offsets — never programmed, just a by-product of training on how words co-occur.
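
The arithmetic can be tried on a toy space. The 3-D vectors below are hand-made for illustration, with axes loosely meaning royalty, male, and female. They are not real learned embeddings, and `analogy` is a hypothetical helper.

```python
import math

# Hand-made toy vectors (hypothetical): axes are [royalty, male, female].
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.3, 0.3],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))   # prints: queen
```

In this toy space the offsets line up exactly by construction. In real embeddings the match is only approximate, which is why the equation above uses ≈.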

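[Diagram: a 2-D sketch of a high-dimensional embedding space. "cat", "dog", and "hamster" cluster together as pets; "man" → "king" and "woman" → "queen" share the same "+royalty" offset, while a "gender" direction links "man" to "woman" and "king" to "queen".]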
Figure 14. Embeddings place related items close together and encode relationships as consistent directions. Real spaces have hundreds of dimensions; this is a 2-D cartoon of the idea.
Example — no code

This is how semantic search and RAG (retrieval-augmented generation) work. You embed every document in your knowledge base into vectors and store them. When a user asks a question, you embed the question too, then find the document vectors nearest to it — those are the most relevant passages, even if they share no exact keywords ("car trouble" finds a doc about "vehicle won't start"). Recommendation systems do the same with products and users.

›_ Example — with code (sentence embeddings + similarity)
embeddings.py · 🤗 sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small, fast embedder

docs = ["How do I reset my password?",
        "Ways to recover account access",
        "What's the weather in Paris?"]
emb = model.encode(docs, convert_to_tensor=True)            # (3, 384) vectors

query = model.encode("I forgot my login", convert_to_tensor=True)
scores = util.cos_sim(query, emb)                           # cosine similarity
print(scores)   # highest for the first two docs, low for the weather one

SECTION 15 · Transfer learning & fine-tuning

Don't start from zero. Take a model that already learned general features on a huge dataset, and adapt it to your task with a fraction of the data.

Training a large model from scratch needs millions of examples and serious compute. Transfer learning sidesteps this: start from a model pretrained on a giant generic dataset (ImageNet for vision, web-scale text for language), then adapt it to your specific task. The pretrained model already "knows" edges, shapes, grammar, and facts — you just teach it your last mile.

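Two main strategies: feature extraction, where the pretrained body stays frozen and only a new head is trained, and fine-tuning, where part of the body is also unfrozen and trained at a low learning rate.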

[Diagram: input → pretrained body (frozen ❄️, knows generic features, keep it) → new head (🔥, trained on your data) → output. Feature extraction: train only the new head. Fine-tuning: also unfreeze part of the body at a low LR.]
Figure 15. Transfer learning: reuse the pretrained body, attach a fresh head for your task. This is how the vast majority of real-world deep learning gets done today.
Example — no code

You want to classify 10 species of local birds but have only 500 photos — nowhere near enough to train a vision model from scratch. Instead you take a ResNet that already learned, from 1.2 million ImageNet images, what feathers, beaks, edges, and textures look like. You snip off its 1,000-class head, bolt on a fresh 10-class head, and train. Because the hard, general visual work is already done, your 500 photos are enough to get strong accuracy. It's standing on the shoulders of a model that already learned to see.

›_ Example — with code (fine-tune a pretrained CNN)
finetune_vision.py · PyTorch / torchvision
import torch, torch.nn as nn
from torchvision import models

# 1. load a model pretrained on ImageNet
net = models.resnet50(weights="IMAGENET1K_V2")

# 2. freeze the body (feature extraction)
for p in net.parameters():
    p.requires_grad = False

# 3. replace the head for OUR 10 bird classes (this part trains)
net.fc = nn.Linear(net.fc.in_features, 10)

# 4. train only the new head with the standard loop from Section 10
opt = torch.optim.Adam(net.fc.parameters(), lr=1e-3)

# To FINE-TUNE instead: unfreeze later layers and use a tiny LR (e.g. 1e-5).

The modern default workflow

Pretrain (or download) → fine-tune → deploy. For language, this is exactly what creating a custom chatbot or classifier looks like: start from a pretrained LLM and fine-tune on your domain. Parameter-efficient methods like LoRA fine-tune only a tiny set of added weights, making it cheap to adapt even billion-parameter models on a single GPU.


SECTION 16 · Hugging Face & the pretrained ecosystem

The practical fast-track. Thousands of ready-to-use models, three lines of code to inference, a standard recipe to fine-tune.

The 🤗 Transformers library is the de-facto standard interface to pretrained models. You rarely build a transformer by hand; you download one of hundreds of thousands of community models from the Hugging Face Hub and either use it directly or fine-tune it. The library offers three abstraction levels, from easiest to most controlled:

Level 1 — pipeline: inference in 3 lines

›_ Example — with code
pipeline.py · 🤗 Transformers
from transformers import pipeline

clf = pipeline("sentiment-analysis")   # downloads a model for you
print(clf("I absolutely loved this guide!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]

# the same API covers dozens of tasks:
pipeline("summarization")
pipeline("question-answering")
pipeline("text-generation", model="gpt2")
pipeline("zero-shot-classification")       # classify into labels you invent
pipeline("automatic-speech-recognition")   # audio -> text

Level 2 — tokenizer + model: full control over inference

When you need the logits, embeddings, or custom decoding, load the tokenizer (which turns text into the integer token IDs the model expects) and the model separately.

›_ Example — with code
model_direct.py · 🤗 Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

inputs = tok("Transformers make this easy.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # raw scores
probs = logits.softmax(-1)               # -> probabilities
print(model.config.id2label[probs.argmax().item()])   # 'POSITIVE'

Level 3 — Trainer: fine-tune on your own data

The Trainer handles the entire training loop, evaluation, checkpointing, mixed precision, and multi-GPU — you supply a dataset and a config.

›_ Example — with code (fine-tuning)
finetune_text.py · 🤗 Transformers
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

ds = load_dataset("imdb")                                        # 1. data
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ds = ds.map(lambda b: tok(b["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(      # 2. model
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="out", num_train_epochs=2,   # 3. config
                         per_device_train_batch_size=16,
                         eval_strategy="epoch",
                         learning_rate=2e-5, fp16=True)

Trainer(model=model, args=args,                                  # 4. train
        train_dataset=ds["train"], eval_dataset=ds["test"],
        tokenizer=tok).train()

Example — no code

Hugging Face is to deep learning what package managers (npm, pip) are to software: instead of writing a sentiment model, an OCR model, and a translator from scratch — each months of work — you install pretrained ones and compose them. The Hub is the "app store" of models; pipeline is the one-click install; Trainer is the recipe for teaching a downloaded model your specific job. This is why a small team can ship serious NLP in days.

How to pick a model on the Hub

Match the task (text-classification, summarization, ASR…), check the size against your hardware (a 7B-param model needs ~14 GB of VRAM in fp16 for the weights alone), prefer models with a permissive license and healthy download and like counts, and read the model card for its training data and known limitations. Smaller distilled models (DistilBERT, MiniLM) are often the right call for production latency.


SECTION 17 · Generative models

Networks that don't just classify — they create. The lineage behind image generators, voice cloning, and beyond.

Everything so far has been discriminative: map an input to a label or number. Generative models learn the underlying distribution of the data so they can produce new samples that look like it — new images, audio, molecules, text. Three influential families:

Autoencoders & VAEs

An autoencoder squeezes input through a narrow bottleneck (the latent code) and reconstructs it. By forcing everything through a few numbers, it learns a compressed, meaningful representation. A variational autoencoder (VAE) makes the latent space smooth and continuous, so you can sample new points from it and decode them into novel outputs.

GANs — the adversarial game

A generative adversarial network pits two networks against each other. The generator tries to produce fakes; the discriminator tries to tell fakes from real. They train together — as the discriminator gets sharper, the generator is forced to produce more convincing samples, until the fakes are nearly indistinguishable from real data.

Diffusion models — the modern image engine

Diffusion models (behind most state-of-the-art image generators) learn to reverse a noising process. During training, real images are progressively corrupted with noise until they're pure static; the model learns to undo one step of noising. To generate, you start from pure noise and run the learned denoiser repeatedly, and a coherent image gradually emerges. Conditioning the denoiser on a text embedding gives you text-to-image.
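
The noising half of this process is simple enough to sketch in plain Python. The linear schedule and its constants (1,000 steps, noise rate rising from 1e-4 to 0.02) are an assumption borrowed from a common DDPM-style setup, not something this guide specifies, and the four-number "image" is a stand-in. The denoiser itself, the learned part, is not shown.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule.

    alpha_bars[t] is the fraction of the original signal's variance that
    survives after t + 1 noising steps."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Jump straight to a noised sample: sqrt(a) * x0 + sqrt(1 - a) * gaussian noise."""
    return [math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
            for v in x0]


alpha_bars = alpha_bar_schedule()
rng = random.Random(0)
pixels = [1.0, -1.0, 0.5, 0.0]                    # a stand-in "image"
slightly_noisy = add_noise(pixels, alpha_bars[10], rng)
pure_static = add_noise(pixels, alpha_bars[-1], rng)
print(alpha_bars[0], alpha_bars[-1])
```

Early in the schedule nearly all of the signal survives; by the last step the retained fraction is close to zero, so the sample is essentially pure static. Training asks the model to predict the noise that was mixed in at a randomly chosen step.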

[Diagram: top row (training): noise is added step by step until the image is pure static. Bottom row (generation): the model denoises from static back into a new image.]
Figure 16. Diffusion. The model is trained to remove noise (top, red). To generate, you run that learned denoiser in reverse from pure noise (bottom, teal), and an image materializes — optionally steered by a text prompt.
Example — no code

GAN as an art forger and detective: a forger paints fakes, a detective judges them. Each time the detective spots a fake, the forger learns and improves; each new forgery trains the detective to be pickier. After thousands of rounds the forgeries are gallery-quality. Diffusion as a sculptor: imagine starting with a block of TV static and "carving away" the noise that doesn't belong, a little each pass, guided by the prompt "a cat in a spacesuit", until a clean image of exactly that remains.

›_ Example — with code (text-to-image with diffusers)
generate_image.py · 🤗 diffusers
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("a cat in a spacesuit, digital art",
             num_inference_steps=30,           # how many denoising steps
             guidance_scale=7.5).images[0]     # how strongly to follow the prompt
image.save("cat.png")

SECTION 18 · Engineering & deployment

The unglamorous work that decides whether a model ships. What separates a notebook demo from production.

Make training fast and stable

Debugging checklist (in order)

Symptom | First things to check
Loss is NaN | Learning rate too high; bad input normalization; log(0) in a custom loss. Lower LR, clip gradients.
Loss won't decrease | Forgot zero_grad(); LR too low/high; labels misaligned with inputs; model on wrong device.
Great train acc, bad val acc | Overfitting: add dropout/augmentation/weight decay, get more data, or stop earlier (§9).
Worse at inference than training | Forgot model.eval(); different preprocessing at inference time.
Out-of-memory (OOM) | Reduce batch size; enable mixed precision; use gradient accumulation/checkpointing.
The most useful debugging trick

Overfit a single batch on purpose. Take 2–4 examples and train until the loss is near zero. If your model can't memorize a tiny batch, there's a bug in the model, loss, or data pipeline — no amount of tuning will help until you fix it. If it can, your machinery works and the rest is data and regularization.
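
Here is the trick in miniature. To stay dependency-free, the sketch swaps the neural network for a tiny logistic-regression model written in plain Python; the four examples, the learning rate, and the step count are all made up for illustration. The same pattern applies to a real PyTorch training loop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# A tiny "batch" of 4 examples the model should be able to memorize.
inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [0.0, 0.0, 0.0, 1.0]

w = [0.0, 0.0]
b = 0.0
lr = 1.0
losses = []

for step in range(3000):
    grad_w, grad_b, loss = [0.0, 0.0], 0.0, 0.0
    for (x1, x2), y in zip(inputs, labels):
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)        # avoid log(0)
        loss += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        err = p - y                                 # d(loss)/d(logit) for cross-entropy
        grad_w[0] += err * x1
        grad_w[1] += err * x2
        grad_b += err
    n = len(inputs)
    losses.append(loss / n)                         # mean loss before this update
    w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
    b -= lr * grad_b / n

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.4f}")
```

The loss starts near 0.69 (ln 2, a coin flip) and should fall close to zero. If the equivalent run on your own model plateaus instead, suspect the model, loss, or data pipeline before touching hyperparameters.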

From trained model to product

›_ Example — with code (save / load / serve)
deploy.py · PyTorch + FastAPI
import torch
from fastapi import FastAPI

# `model` is your trained nn.Module and `preprocess` is your training-time
# text-to-tensor function; both are assumed to be defined elsewhere.

# --- save & load weights ---
torch.save(model.state_dict(), "model.pt")
model.load_state_dict(torch.load("model.pt"))
model.eval()                                # ALWAYS eval before serving

# --- minimal inference API ---
app = FastAPI()

@app.post("/predict")
def predict(payload: dict):
    x = preprocess(payload["text"])         # same prep as training!
    with torch.no_grad():
        logits = model(x)
    return {"label": int(logits.argmax(-1))}

# run: uvicorn deploy:app --host 0.0.0.0 --port 8000


SECTION 19 · Quantization: making big models fit

A model's weights are just numbers. Store them in fewer bits and the model shrinks 2–8× — often with almost no loss in quality. This is what lets a 70B model run on a single GPU.

By default weights live in 32-bit (fp32) or 16-bit (fp16/bf16) floats. Quantization maps them to low-bit integers — typically int8 or int4 — storing a scale factor so they can be approximately reconstructed. Memory scales directly with bit-width, and because generation is memory-bandwidth bound, smaller weights also mean faster inference.

Format | Bits | 7B model size | Use
fp32 | 32 | ~28 GB | legacy / high-precision training
fp16 / bf16 | 16 | ~14 GB | standard training & inference (bf16 preferred on Ampere+)
int8 | 8 | ~7 GB | inference, light quality loss
int4 (NF4) | 4 | ~3.5 GB | inference + QLoRA fine-tuning
FP8 | 8 | ~7 GB | training/inference on H100/Blackwell

Two timing strategies: post-training quantization (PTQ) — quantize an already-trained model (fast, the common case; methods include GPTQ, AWQ, bitsandbytes NF4, and GGUF for llama.cpp) — and quantization-aware training (QAT), which simulates quantization during training for the best low-bit accuracy at higher cost. NF4 ("4-bit NormalFloat") is a data type tuned for the bell-curve distribution of neural-net weights, and "double quantization" even quantizes the scale factors.
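
Post-training quantization in miniature: a sketch of the simplest scheme, symmetric "absmax" int8 with one scale factor per group of weights. The weights are made-up numbers and the helper names are hypothetical. The production methods named above (GPTQ, AWQ, NF4) are considerably more sophisticated than this.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127].

    Assumes at least one weight is non-zero (otherwise the scale would be 0)."""
    scale = max(abs(w) for w in weights) / 127.0   # one scale for the whole group
    q = [round(w / scale) for w in weights]        # stored as small integers
    return q, scale


def dequantize(q, scale):
    """Approximate reconstruction of the original floats."""
    return [v * scale for v in q]


weights = [0.12, -0.50, 0.03, 0.31, -0.07]         # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one small integer plus a shared scale, and the reconstruction error is at most half a quantization step. A single very large weight stretches the scale and costs precision for everything else in the group, which is why outliers matter.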

[Diagram: the same 7B model at four precisions, memory shrinking with bit-width: fp32 (32-bit) ~28 GB, fp16 (16-bit) ~14 GB, int8 (8-bit) ~7 GB, int4 (4-bit) ~3.5 GB.]
Figure 17. Lower precision → smaller footprint → fits on cheaper hardware and runs faster. int4 fits a 7B model in <4 GB with minimal quality loss for inference.
Example — no code

Think of a RAW photo vs a JPEG. The RAW file stores every pixel in full precision: gorgeous, but huge. The JPEG throws away detail your eye can't notice and is a fraction of the size. Quantization is JPEG for model weights: a 70B model in fp16 needs ~140 GB (two 80 GB A100s); in 4-bit it needs ~40 GB and runs on a single GPU, while answering almost identically. You trade a sliver of fidelity for roughly a 3–4× cut in memory, and with it a much smaller hardware bill.
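
The sizes quoted in this section follow from one line of arithmetic: parameters × bits ÷ 8. A minimal sketch, where `weight_memory_gb` is a hypothetical helper that counts the weights only:

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, params in (("7B", 7e9), ("70B", 70e9)):
    for bits in (32, 16, 8, 4):
        print(f"{name} @ {bits}-bit: {weight_memory_gb(params, bits):.1f} GB")
```

The raw 4-bit figure for a 70B model is 35 GB. Quantization constants, activations, and the KV cache come on top of that in practice, which is why real deployments land nearer the ~40 GB quoted above.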

›_ Example — with code (load any model in 4-bit)
load_4bit.py · Transformers + bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit NF4 config: the recipe behind QLoRA
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat-4, tuned for weight distributions
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,   # math runs in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",                       # place layers across available GPUs/CPU
)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# The 8B model now occupies ~5 GB instead of ~16 GB.

Rule of thumb

Int8 and int4 quantization come almost for free, both for inference and for QLoRA fine-tuning. Reach for GPTQ/AWQ when you want the fastest pre-baked inference weights, GGUF for llama.cpp/Ollama on laptops, and FP8 on H100/Blackwell for training. Watch for activation outliers: a few large activations can hurt naive low-bit schemes, which is exactly what methods like LLM.int8() and AWQ are designed to handle.


SECTION 20 · LoRA, QLoRA & parameter-efficient fine-tuning

Full fine-tuning rewrites every weight — one giant checkpoint per task and a huge bill. PEFT freezes the model and trains a tiny set of new parameters instead. This is how almost all custom LLMs are built today.

The key idea behind LoRA (Low-Rank Adaptation): the change that fine-tuning makes to a weight matrix tends to be approximately low-rank, so it doesn't need the matrix's full degrees of freedom. So freeze the original weight W (size d×k) and learn its update as a product of two skinny matrices:

The LoRA decomposition
W′ = W + ΔW = W + B·A    (B is d×r,  A is r×k,  rank r ≪ d,k)

Only A and B train. With r=8 on a 4096×4096 layer you update ~65K params instead of ~16.7M — about 0.4%. A scaling factor α/r controls the update's strength. At deploy time you can fold B·A back into W, so there is zero extra inference latency.
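
Both claims, the parameter count and the merge, can be checked with a few lines of plain Python. The helper names and the tiny 2×2 example are illustrative assumptions; real code would use tensors and the PEFT library rather than nested lists.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k


def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def merge_lora(W, B, A, alpha: float, r: int):
    """Fold the scaled low-rank update into the frozen weight: W + (alpha / r) * B @ A."""
    BA = matmul(B, A)
    s = alpha / r
    return [[W[i][j] + s * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


full = 4096 * 4096
lora = lora_param_count(4096, 4096, r=8)
print(lora, full, f"{100 * lora / full:.2f}%")

# Tiny rank-1 example (made-up numbers): d = k = 2, r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, -0.5]]
merged = merge_lora(W, B, A, alpha=2.0, r=1)
print(merged)
```

The count comes out at 65,536 trainable values against 16,777,216 frozen ones. The merged matrix has the same shape as W, which is why serving a merged model costs nothing extra.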

[Diagram: frozen weight W (d × k, huge) plus the trainable update ΔW = B (d × r) × A (r × k), which is tiny (~0.4%). Only the thin B and A matrices train; the big frozen W never moves.]
Figure 18. LoRA learns a low-rank update B·A alongside the frozen weight. QLoRA adds one trick: keep W in 4-bit (Section 19) while the adapters train in bf16 — enough to fine-tune a 65B model on a single 48 GB GPU.

The family worth knowing:

Example — no code

You have one 13B base model and need five specialists: legal, medical, code, customer-support, and sales. Full fine-tuning means five separate expensive runs and five ~26 GB checkpoints to store and serve. With LoRA you keep one frozen base and train five adapter files of ~10–50 MB each, hot-swapping them per request. It's one game console with five cartridges instead of buying five consoles.

›_ Example — with code (QLoRA fine-tune with PEFT + TRL)
qlora_finetune.py · PEFT + TRL + Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
import torch

model_id = "meta-llama/Llama-3.1-8B"

# 1. load base in 4-bit (the "Q" in QLoRA) -- frozen
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# 2. attach LoRA adapters (the only thing that trains)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",      # attention
                    "gate_proj", "up_proj", "down_proj"],        # MLP -> "all-linear"
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # e.g. "trainable: 0.42% of all params"

# 3. train like normal -- TRL handles the SFT loop
trainer = SFTTrainer(model=model, train_dataset=my_dataset,
                     args=SFTConfig(output_dir="out", per_device_train_batch_size=2,
                                    gradient_accumulation_steps=8, bf16=True,
                                    num_train_epochs=1, learning_rate=2e-4))
trainer.train()

# 4. save the tiny adapter (a few dozen MB), or merge for zero-latency deploy:
model.save_pretrained("my-lora-adapter")
# merged = model.merge_and_unload()   # fold B*A into W for serving

Practical settings

Start with rank r=16–64, α = 2×r, and target all linear layers (attention + MLP) — this consistently beats attaching LoRA to attention only. Learning rates are higher than full fine-tuning (1e-4 to 3e-4). Use QLoRA when memory is tight, DoRA when you want to squeeze out the last accuracy points, and merge_and_unload() before serving so inference is as fast as the base model.


SECTION 21 · Alignment with RLHF & PPO

Pretraining makes a model knowledgeable; supervised fine-tuning makes it follow instructions. But "the most likely next token" isn't the same as "the answer a human prefers." Alignment closes that gap.

RLHF (Reinforcement Learning from Human Feedback) is the classic three-stage recipe that turned raw LLMs into helpful assistants:

  1. SFT — fine-tune on high-quality demonstration (prompt → ideal answer) pairs.
  2. Reward model (RM) — show humans two answers to the same prompt; they pick the better one. Train a model to predict that preference. Mathematically it uses the Bradley–Terry model: the probability that answer w beats l is σ(r(w) − r(l)).
  3. RL optimization (PPO) — let the policy generate answers, score them with the RM, and update the policy to earn higher reward — while a KL penalty keeps it from drifting too far from the SFT model (so it can't "cheat" the reward).
[Diagram: 1 · SFT (imitate good answers) → 2 · Reward model (learns human taste) → 3 · PPO (maximize reward). The policy generates an answer, the RM scores it, and the reward updates the policy, with a KL leash to the SFT model so it can't reward-hack.]
Figure 19. The RLHF pipeline. PPO is powerful but heavy: it juggles four models in memory — policy, frozen reference, reward model, and a value/critic network — which is exactly the cost the next section's methods try to avoid.
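
Stage 2's Bradley–Terry objective is small enough to write out. In this sketch the scalar rewards are made-up numbers standing in for a reward model's outputs, and the function names are hypothetical.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice under Bradley-Terry."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


print(preference_probability(2.0, 0.0))   # RM agrees with the human
print(reward_model_loss(2.0, 0.0))        # small loss
print(reward_model_loss(0.0, 2.0))        # RM disagrees with the human: large loss
```

Only the difference between the two rewards matters, so the reward model learns a relative scale, not an absolute one. Training minimizes this loss averaged over many human-labelled pairs.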
Example — no code

Imagine training a chef. SFT is having them copy recipes from a great cookbook. The reward model is a food critic you've trained to taste a dish and give it a score the way real diners would. PPO is the chef cooking freely, the critic scoring each plate, and the chef adjusting to win higher scores — but with a rule that they can't stray too far from real cooking (the KL leash), or they'd just drown everything in sugar to game the critic. That "gaming" failure is called reward hacking.

›_ Example — with code (PPO with TRL)
ppo_rlhf.py · TRL (Transformer Reinforcement Learning)
# Written against the PPOTrainer interface of older TRL releases; later
# releases reworked this API, so check the docs for your installed version.
import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline

# policy = SFT model + a value head (the critic PPO needs)
policy = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model")
ref = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model")   # frozen anchor
tok = AutoTokenizer.from_pretrained("my-sft-model")

# a trained reward model scores responses (here via a text-classification pipeline)
reward_fn = pipeline("text-classification", model="my-reward-model")

ppo = PPOTrainer(PPOConfig(batch_size=32, learning_rate=1e-5), policy, ref, tok)

for batch in dataloader:                 # dataloader: your batches of tokenized prompts
    queries = batch["input_ids"]
    responses = ppo.generate(queries, max_new_tokens=128)    # policy acts
    texts = [tok.decode(r) for r in responses]
    # RM judges; step() expects one scalar tensor per response
    rewards = [torch.tensor(out["score"]) for out in reward_fn(texts)]
    # PPO step: push policy toward higher reward, clipped + KL-penalized vs ref
    ppo.step(queries, responses, rewards)

! Reward hacking

Optimize a proxy hard enough and the model finds loopholes: padding answers with flattery, gaming length, or exploiting RM blind spots. The KL coefficient is your main defense — too low and the model degenerates, too high and it never improves. This fragility (plus PPO's 4-model memory cost and tuning difficulty) is why the field largely shifted to the simpler methods next.


SECTION 22 · DPO, GRPO & modern preference optimization

PPO works but is complex, unstable, and memory-hungry. The 2023–2025 wave of methods gets the same alignment with far less machinery — and powers today's reasoning models.

DPO (Direct Preference Optimization) made the key observation that your language model is secretly its own reward model. Instead of training a separate RM and running RL, DPO derives a single classification-style loss directly on preference pairs (prompt, chosen, rejected). It nudges the policy to raise the log-probability of chosen answers and lower it for rejected ones, with a coefficient β that plays the role of the KL leash — all measured relative to a frozen reference model. No reward model, no rollouts, no critic: just stable supervised-style training.
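
For a single preference pair the DPO loss needs only four numbers: the summed log-probability of each answer under the policy and under the frozen reference. A minimal sketch with made-up log-probabilities; `dpo_loss` is a hypothetical helper, not TRL's implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from summed sequence log-probabilities."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logit = beta * (policy_margin - ref_margin)
    return math.log(1.0 + math.exp(-logit))        # equals -log(sigmoid(logit))


# Made-up log-probabilities for illustration.
same_as_ref = dpo_loss(-12.0, -12.5, -12.0, -12.5)   # policy identical to the reference
improved = dpo_loss(-10.0, -15.0, -12.0, -12.5)      # policy favors "chosen" more than ref does
print(same_as_ref, improved)
```

When the policy matches the reference the loss is ln 2, the same as a coin flip. It falls only as the policy widens the gap between chosen and rejected beyond what the reference already had, and β sets how strongly that extra gap is rewarded.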

GRPO (Group Relative Policy Optimization, from DeepSeek) keeps RL's online strength but drops PPO's expensive critic. For each prompt it samples a group of G answers, scores them, and uses the group's mean and standard deviation to compute each answer's advantage — "how much better than my other attempts was this one?" No value network needed. It shines with verifiable rewards (RLVR): math answers you can check, code that either passes tests or doesn't. GRPO on verifiable rewards is what produced DeepSeek-R1's reasoning ability.
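
The group-relative advantage is just a per-prompt z-score. In this sketch the rewards are made up, `group_advantages` is a hypothetical helper, and the population standard deviation is an assumption; an actual implementation may use the sample version.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled answers."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)   # population variance
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# G = 8 samples for one math prompt; 1.0 = verified correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
print([round(a, 3) for a in advantages])
```

The two correct answers get a positive advantage and the six wrong ones a smaller negative one. If every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.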

The two objectives, in words
DPO:  loss = −log σ( β · ( [ log pθ(chosen) − log pθ(rejected) ] − [ log pref(chosen) − log pref(rejected) ] ) )
GRPO:  advantage Aᵢ = ( rᵢ − mean(r) ) / ( std(r) + ε )  over a group of G samples

The broader family: IPO (fixes a DPO overfitting failure mode), KTO (needs only a binary good/bad label per sample — no pairs — using prospect-theory utility), ORPO (folds preference into SFT with no reference model), SimPO (reference-free), and DAPO (GRPO refinements: dynamic sampling, asymmetric "clip-higher").

[Diagram: three panels. PPO: policy + reference + reward model + critic/value; 4 models, online; most flexible, least stable. DPO: policy + reference; no RM, no critic, no RL; 2 models, offline; stable, cheap, needs preference pairs. GRPO: policy + reference; group of G samples, no critic (group baseline); online, verifiable reward; powers reasoning models.]
Figure 20. The trade-off: PPO is the most general but heaviest; DPO is dead-simple offline preference learning; GRPO keeps online RL but removes the critic, making RL-for-reasoning affordable.
Example — no code

DPO is teaching with side-by-side flashcards: "this reply is better than that one" — the model learns the preference directly, no separate judge required. GRPO is a classroom exercise: hand eight students the same math problem, check which solutions are actually correct, and reward the ones that beat the class average. You never wrote a grading rubric (a critic) — correctness plus relative ranking is the whole signal. That's why GRPO is perfect for math and code, where "right or wrong" is checkable.

›_ Example — with code (DPO and GRPO with TRL)
preference_opt.py · TRL
# ---------- DPO: offline, from preference pairs ----------
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("my-sft-model")
tok = AutoTokenizer.from_pretrained("my-sft-model")

# dataset columns: {"prompt", "chosen", "rejected"}
dpo = DPOTrainer(model=model, args=DPOConfig(beta=0.1, output_dir="dpo-out"),
                 train_dataset=pref_dataset, processing_class=tok)
dpo.train()   # no reward model, no rollouts -- just stable training

# ---------- GRPO: online RL with a verifiable reward ----------
from trl import GRPOTrainer, GRPOConfig

def reward_correct(completions, answer, **kw):
    # +1 if the model's final answer matches ground truth, else 0 (RLVR)
    # extract() is your own answer parser (not defined here)
    return [1.0 if extract(c) == a else 0.0 for c, a in zip(completions, answer)]

grpo = GRPOTrainer(
    model="my-sft-model",
    reward_funcs=reward_correct,
    args=GRPOConfig(num_generations=8,        # G samples per prompt -> the "group"
                    output_dir="grpo-out", learning_rate=1e-6),
    train_dataset=math_dataset,
)
grpo.train()

Which one?

Have a dataset of preference pairs and want stable, cheap alignment → DPO (or KTO if you only have thumbs-up/down). Want to push reasoning on tasks with a checkable answer (math, code, tool success) → GRPO/RLVR. Need maximum control over a complex reward or true online exploration → PPO. The modern default for most teams is: SFT → DPO, with GRPO when a verifiable reward is available.


SECTION 23 · Inside a modern Transformer

The 2017 Transformer in Section 13 still describes the skeleton, but every production model (Llama, Qwen, Mistral, DeepSeek, Gemma) swaps in a set of upgrades for speed, length, and quality. Here's the modern parts list.

Component | 2017 original | Modern replacement | Why
Position | Sinusoidal / learned | RoPE (rotary), ALiBi | relative, extrapolates to longer context
Normalization | LayerNorm (post) | RMSNorm, pre-norm | cheaper, more stable training
FFN activation | ReLU | SwiGLU (gated) | ~1–2% better perplexity
Attention heads | Multi-head (MHA) | GQA / MQA / MLA | shrinks the KV cache → longer context
Attention kernel | Naive softmax(QKᵀ)V | FlashAttention 1/2/3 | IO-aware, never stores the N×N matrix
Capacity | Dense FFN | Mixture-of-Experts | huge total params, few active per token

Grouped-Query Attention (GQA) is the 2025 default. Standard attention gives every query head its own key/value head — but the KV cache (the stored keys/values for past tokens) dominates memory at long context. MQA shares a single KV head across all queries (tiny cache, slight quality hit); GQA is the sweet spot: a few KV heads shared by groups of query heads. DeepSeek's MLA pushes further by compressing KV into a low-rank latent.

[Diagram: MHA = 4 Q heads · 4 KV heads (biggest cache); GQA = 4 Q · 2 KV (balanced); MQA = 4 Q · 1 KV (smallest cache).]
Figure 21. Query heads (indigo) vs key/value heads (teal). Fewer KV heads → a smaller KV cache → longer context and faster decoding. GQA is the modern compromise.
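A quick back-of-the-envelope calculation shows how much the KV-head count matters. The sketch below assumes an 8B-class shape (32 layers, 32 query heads, head dimension 128), bf16 storage, and an 8,192-token context; these numbers are illustrative assumptions, not a claim about any specific checkpoint.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # 2x for keys and values; one vector per layer, per KV head, per cached token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

GiB = 1024 ** 3
shape = dict(n_layers=32, head_dim=128, seq_len=8192)  # assumed shape, bf16 (2 bytes)

for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_bytes(n_kv_heads=kv_heads, **shape)
    print(f"{name}: {kv_heads:>2} KV heads -> {size / GiB:.3f} GiB per sequence")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, which is the whole point of GQA and MQA: the query heads are untouched, only the stored keys and values are shared.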

Mixture-of-Experts (MoE) replaces a dense feed-forward block with many parallel "expert" FFNs plus a tiny router that sends each token to only the top-k experts (often 2). The model can hold enormous knowledge (high total parameters) while each token only pays for a few experts (low active parameters) — e.g. Mixtral, DeepSeek-V3, Llama 4.

[Diagram: token → router (pick top-2) → Expert 1 ✓, Expert 2, Expert 3 ✓, Expert 4, …, Expert N. Only 2 of N experts run per token.]
Figure 22. A Mixture-of-Experts layer. The router activates only a couple of experts per token, decoupling total capacity from per-token compute.
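The routing step can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the router logits are given directly (in a real model they come from a small learned linear layer), the experts are stand-in functions, and the selected weights are renormalized to sum to 1, which is one common convention rather than the only one.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, router_logits, experts, k=2):
    """Run only the top-k experts and mix their outputs by renormalized router weight."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(token)
    for i in top:
        y = experts[i](token)          # only the selected experts are evaluated
        w = probs[i] / norm
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

# Toy experts: expert i multiplies the token by (i + 1) and records that it ran
calls = []
def make_expert(i):
    def expert(x):
        calls.append(i)
        return [(i + 1) * v for v in x]
    return expert

experts = [make_expert(i) for i in range(4)]
out, chosen = moe_layer([1.0, 2.0], router_logits=[2.0, 0.0, 2.0, -1.0], experts=experts)
print("experts used:", chosen, "output:", out)
```

Four experts exist, but only two are called for this token, so per-token compute tracks k rather than the total expert count.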
Example — no code

GQA is carpooling for memory: instead of every query commuter driving a private key/value car, they share a few cars, freeing up the parking lot (KV cache) for far longer trips (context). MoE is a large hospital with a triage desk: each patient (token) is routed to just the two relevant specialists out of a hundred on staff. You get the expertise of a hundred doctors but only pay for two consultations per patient.

›_ Example — with code (a modern block + FlashAttention)
modern_block.py · PyTorch + Transformers
import torch, torch.nn as nn, torch.nn.functional as F

class RMSNorm(nn.Module):                     # cheaper than LayerNorm
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(d))
        self.eps = eps
    def forward(self, x):
        return self.g * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):                      # gated FFN used by Llama/Qwen
    def __init__(self, d, hidden):
        super().__init__()
        self.w_gate = nn.Linear(d, hidden, bias=False)
        self.w_up   = nn.Linear(d, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d, bias=False)
    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# FlashAttention is exact attention done IO-efficiently. In practice you just
# enable the fused kernel -- PyTorch picks it automatically:
q = k = v = torch.randn(1, 8, 16, 64)         # example tensors: (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # uses Flash when available

# ...or tell Hugging Face to use it when loading a model (requires the flash-attn package):
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16)

SECTION 24 · Training at scale

A 70B model won't even fit on one GPU — Adam's optimizer states alone can need hundreds of gigabytes. Scaling means splitting the work, and trading compute for memory.

First, the memory budget. Training a parameter in mixed precision costs far more than the weight itself: a copy of the weight, its gradient, and Adam's two optimizer states (often kept in fp32). That's why optimizer states, not the model, usually dominate. The toolkit:

[Diagram: four ways to split the work. Data parallel: the full model on every GPU, same weights, different data. Tensor parallel: one layer split across GPUs (½ layer each). Pipeline parallel: different layers per GPU (layers 1–20, layers 21–40). FSDP / ZeRO-3: shards params + grads + optimizer states across all GPUs, so each holds ~1/N.]
Figure 23. Ways to split a too-big job. In practice large runs combine several — e.g. FSDP for sharding plus tensor/pipeline parallel for the very biggest models.
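The memory budget is easy to estimate by hand. The sketch below uses the usual mixed-precision accounting of roughly 16 bytes per parameter (bf16 weight and gradient, plus an fp32 master copy and Adam's two fp32 states); it ignores activations and framework overhead, and assumes perfectly even sharding, so treat the output as a lower bound rather than a measurement.

```python
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

def training_memory_gb(n_params, n_gpus=1):
    """Per-GPU gigabytes for weights, gradients, and optimizer state under full sharding."""
    total_bytes = n_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / n_gpus / 1e9

n = 70e9  # a 70B-parameter model
print(f"Adam states alone: {n * 8 / 1e9:.0f} GB")
for gpus in (1, 8, 64):
    print(f"{gpus:>2} GPU(s), fully sharded: {training_memory_gb(n, gpus):.1f} GB each")
```

The two Adam states account for half of the total, which is why sharding the optimizer state (ZeRO stage 1 and up) is usually the first thing that makes a large run fit.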
Example — no code

You're moving a house but own only small trucks. One truck can't hold the house — not even all the furniture from one room. So you split everything across eight trucks (FSDP/ZeRO shards the weights, gradients, and optimizer state), and when you need a specific couch you radio the truck that has it (gather-on-demand). Gradient checkpointing is deciding not to keep packing boxes around — you'll just rebuild a few when needed, saving space at the cost of a little extra work.

›_ Example — with code (scale knobs in the Trainer)
train_at_scale.py · Transformers + Accelerate
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,      # small micro-batch that fits in memory
    gradient_accumulation_steps=16,     # ...but train as if batch = 2 * 16 * n_gpus
    bf16=True,                          # mixed precision (Ampere/Hopper+)
    gradient_checkpointing=True,        # recompute activations -> big memory save
    optim="adamw_torch_fused",
    # multi-GPU sharding via FSDP (or use a DeepSpeed ZeRO-3 config file):
    fsdp="full_shard auto_wrap",
    fsdp_config={"transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"]},  # list of layer class names
    logging_steps=10,
    num_train_epochs=1,
    learning_rate=2e-5,
)
# Launch across GPUs with:  accelerate launch train_at_scale.py
# (Accelerate / torchrun handle the distributed process group for you.)
The OOM playbook

Out of memory? In order: enable bf16, turn on gradient checkpointing, lower the micro-batch and raise accumulation to keep the effective batch, switch to QLoRA (Section 20) so the base is 4-bit, then shard with FSDP/ZeRO-3. Find the largest micro-batch that fits, then scale the effective batch with accumulation — throughput loves big batches.


SECTION 25 · Inference & serving

Training is a one-time cost; serving is forever. Generation is autoregressive — one token at a time — so the bottleneck is usually memory bandwidth, not raw compute. The whole game is keeping the GPU busy.

[Diagram: speculative decoding. The small draft model quickly proposes 5 tokens ("the cat sat on rug"); the big target model verifies all 5 in ONE pass. The longest correct prefix ("the cat sat on") is accepted in a single expensive step, and decoding resumes from the first rejected token. Output is identical to plain decoding.]
Figure 24. Speculative decoding amortizes the big model's cost across several tokens at once — cheap drafting, one expensive verification.
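The accept-the-longest-prefix rule is short enough to write out. The sketch below covers the greedy case only, with toy deterministic functions standing in for both models (both are hypothetical, and the draft is rigged to be wrong at every fourth position). In a real system the target scores all proposed positions in one batched forward pass; here that single pass is simulated by a loop and counted once. Production implementations also take a free "bonus" token when every draft token is accepted, which is omitted here for brevity.

```python
def greedy_decode(next_token, prompt, n_new):
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq

def speculative_decode(target_next, draft_next, prompt, n_new, k=5):
    seq, goal, target_passes = list(prompt), len(prompt) + n_new, 0
    while len(seq) < goal:
        # 1. Draft proposes up to k tokens autoregressively (cheap)
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft_next(seq + proposal))
        # 2. Target checks every proposed position (one expensive pass in a real model)
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)        # draft agreed with the target: keep it
            else:
                accepted.append(expected)   # first mismatch: take the target's token and stop
                break
        seq.extend(accepted)
    return seq, target_passes

def target_next(seq):                       # toy "big model"
    return (sum(seq) * 31 + len(seq)) % 50

def draft_next(seq):                        # toy "small model": wrong when len(seq) % 4 == 0
    guess = target_next(seq)
    return guess if len(seq) % 4 else (guess + 1) % 50

prompt = [1, 2, 3]
fast, passes = speculative_decode(target_next, draft_next, prompt, n_new=20)
slow = greedy_decode(target_next, prompt, n_new=20)
print("identical output:", fast == slow, "| target passes:", passes, "vs 20 for plain decoding")
```

Every token that enters the sequence is one the target itself would have chosen, so the result matches plain greedy decoding; the speedup comes entirely from needing fewer target passes.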
Example — no code

Speculative decoding is a junior writer drafting a whole sentence while a senior editor — who's slow but authoritative — approves it in a single glance instead of writing word by word. If the draft is good, four or five words get accepted for the price of one editorial pass. Continuous batching is a ride-share van that picks up and drops off riders mid-route instead of waiting for a full bus, so a seat is never empty and the engine never idles.

›_ Example — with code (serve with vLLM; speculative in Transformers)
serve.py · vLLM + Transformers
# ---------- High-throughput serving with vLLM ----------
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          # quantization="awq",         # enable only when `model` points at an AWQ-quantized checkpoint
          gpu_memory_utilization=0.9)   # PagedAttention + continuous batching are automatic
out = llm.generate(["Explain KV cache in one line."],
                   SamplingParams(temperature=0.7, max_tokens=128))
print(out[0].outputs[0].text)
# Or expose an OpenAI-compatible server:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct

# ---------- Speculative decoding with a draft model ----------
from transformers import AutoModelForCausalLM, AutoTokenizer

# The draft must share the target's tokenizer. For a 70B target you will also
# want torch_dtype=torch.bfloat16 and device_map="auto" (multi-GPU) in practice.
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
draft  = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

ids = tok("The capital of France is", return_tensors="pt").input_ids
# Same result as plain greedy decoding (same distribution when sampling), just faster
out = target.generate(ids, assistant_model=draft, max_new_tokens=40)
print(tok.decode(out[0]))
Picking a serving stack

For production throughput on GPUs: vLLM, TGI, SGLang, or TensorRT-LLM. For a laptop or a single small GPU: llama.cpp / Ollama with GGUF quantized weights. Optimize for the metric that matters — latency (time to first token) for chat, throughput (tokens/sec across users) for batch jobs — they pull in different directions.


SECTION 26 · Reasoning, RAG & agents

The frontier moved from "predict the next token" to "use compute at inference time to think, retrieve, and act." These are the patterns behind today's most capable systems.

[Diagram: the LLM (plan / decide) acts by calling a tool (🔍 Search, ⚙ Run code, 💾 Database, 🌐 API), observes the result, and loops. Plan → call a tool → read the result → repeat until the goal is met.]
Figure 25. The agent loop: an LLM that can reason, call tools, observe what comes back, and iterate — the basis of coding assistants, research agents, and autonomous workflows.
Example — no code

A reasoning model is a student who shows their work on scratch paper before answering — and we let them use as much scratch paper as the problem deserves; harder questions simply get more thinking. An agent is that same student who can also get up, look things up in the library (search), use a calculator, run an experiment (code), come back, and keep going — looping until the assignment is actually finished, not just until they've produced one paragraph. RAG is letting them bring the open textbook into the exam so answers are grounded in the real source, not memory.

›_ Example — with code (a minimal RAG + tool-using agent loop)
rag_and_agent.py · sentence-transformers + an LLM client
# ---------- RAG: retrieve, then generate grounded on real docs ----------
from sentence_transformers import SentenceTransformer
import numpy as np

emb = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Our refund window is 30 days.", "Support hours are 9-5 GMT."]  # ... plus the rest of your corpus
doc_vecs = emb.encode(docs, normalize_embeddings=True)

def retrieve(query, k=3):
    q = emb.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                        # cosine similarity (vectors are unit-length)
    return [docs[i] for i in np.argsort(scores)[-k:][::-1]]

def answer(query, llm):
    context = "\n".join(retrieve(query))         # top-k relevant chunks
    prompt = f"Use ONLY this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)                           # grounded, less hallucination

# ---------- Agent loop: LLM decides which tool to call, then observes ----------
# web_search, run_code, and parse are helpers you supply; llm is any str -> str callable
TOOLS = {"search": web_search, "python": run_code}   # name -> callable

def agent(goal, llm, max_steps=6):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        step = llm("\n".join(history) +
                   "\nThink, then output: TOOL <name> <args>  OR  FINAL <answer>")
        history.append(step)
        if step.startswith("FINAL"):
            return step[6:]                      # strip the "FINAL " prefix
        name, args = parse(step)                 # e.g. "search", "vLLM throughput"
        observation = TOOLS[name](args)          # act
        history.append(f"Observation: {observation}")   # observe -> loop
    return "stopped: step budget exhausted"
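The agent loop above calls a `parse` helper without defining it. One possible version is sketched below; it is a hypothetical helper that assumes the model follows the `TOOL <name> <args>` line format from the prompt, and it scans every line so that any "thinking" text the model emits first is skipped.

```python
def parse(step: str) -> tuple[str, str]:
    """Split a 'TOOL <name> <args>' line into (name, args)."""
    for line in step.splitlines():
        line = line.strip()
        if line.startswith("TOOL "):
            # maxsplit=2 keeps everything after the tool name as one argument string
            _, name, *rest = line.split(maxsplit=2)
            return name, (rest[0] if rest else "")
    raise ValueError(f"no TOOL line found in: {step!r}")

print(parse("TOOL search vLLM throughput"))
print(parse("I should compute this.\nTOOL python print(2 + 2)"))
```

Raising on malformed output, rather than guessing, lets the caller decide whether to retry the model or stop. A production agent would more likely use the provider's native tool-calling API than string parsing.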
RAG vs fine-tuning — the decision

Need the model to know new, changing, or private facts? → RAG (cheap, updatable, citable). Need it to behave differently — a tone, a format, a skill, a domain style? → fine-tune (LoRA/QLoRA from Section 20). Need both? Most serious systems combine them: fine-tune the behavior, retrieve the facts, wrap it in an agent loop, and serve it with vLLM.


SECTION 27 · The LangChain stack: overview

Section 26 covered the concepts. This is the production stack most teams actually reach for to build them: three layers that fit together, plus the observability glue that ties them together.

The LangChain ecosystem is best understood as a layered stack. Each layer adds abstraction on top of the one below, so you pick the level of control your task needs:

[Diagram: three stacked layers, top to bottom. Deep Agents (create_deep_agent): planning · virtual filesystem · subagents · memory · HITL. LangChain (create_agent): standard model interface · tools · prompt · middleware loop. LangGraph (orchestration runtime): stateful graph · durable execution · persistence · streaming · HITL. LangSmith (trace + eval) runs alongside all three. Each layer is built on the one below: drop down a layer when you need more control, stay high for speed. RAG appears at every layer the same way, as a retriever exposed as a tool.]
Figure 26. The LangChain stack. Deep Agents sits on LangChain (create_agent), which sits on the LangGraph runtime; LangSmith observes all of them. Pick your layer by how much control you need.
Example — no code

Think of building a house. LangGraph is the foundation, plumbing, and wiring — the durable infrastructure that holds state and survives a power cut (resume where you left off). LangChain is the framing and standard fittings: every appliance brand (any model provider) plugs into the same socket, and create_agent hands you a pre-framed room. Deep Agents is the fully furnished smart home — it already has a planner on the wall, filing cabinets (the virtual filesystem), assistants you can dispatch (subagents), and a memory of past visits — move-in ready for big, messy jobs. You choose how much is pre-built versus how much you wire yourself.

›_ Example — with code (1 · LangChain: an agent, provider-swappable)
01_langchain_agent.py · LangChain (create_agent)
# pip install -qU langchain "langchain[anthropic]"
from langchain.agents import create_agent

def get_weather(city: str) -> str:
    """Get the weather for a given city."""      # docstring = the tool description
    return f"It's always sunny in {city}!"

agent = create_agent(
    model="claude-sonnet-4-6",   # swap to "openai:gpt-5.4" or "google_genai:gemini-3.5-flash"
    tools=[get_weather],
    system_prompt="You are a helpful assistant.",
)

result = agent.invoke(
    {"messages": [{"role": "user", "content": "What's the weather in San Francisco?"}]}
)
print(result["messages"][-1].content_blocks)
# create_agent runs the loop: model -> (maybe call tool) -> observe -> repeat -> final answer.
›_ Example — with code (2 · RAG with LangChain: a retriever is a tool)
02_langchain_rag.py · LangChain (agentic RAG)
# pip install -qU langchain langchain-community langchain-chroma langchain-openai "langchain[anthropic]"
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain.embeddings import init_embeddings
from langchain_core.tools import create_retriever_tool
from langchain.agents import create_agent

# 1. INGEST: load -> split into chunks -> embed -> store in a vector DB
docs = WebBaseLoader("https://example.com/handbook").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150).split_documents(docs)
store = Chroma.from_documents(chunks, init_embeddings("openai:text-embedding-3-small"))

# 2. turn the retriever into a TOOL the agent can choose to call
retriever_tool = create_retriever_tool(
    store.as_retriever(search_kwargs={"k": 4}),
    name="company_handbook",
    description="Search the company handbook for policies, benefits, and procedures.",
)

# 3. the agent now decides WHEN to retrieve -> this is "agentic RAG"
agent = create_agent(model="claude-sonnet-4-6",
                     tools=[retriever_tool],
                     system_prompt="Answer using the handbook. Cite what you find.")
print(agent.invoke({"messages": [{"role": "user", "content": "How many vacation days do I get?"}]}))

When a simple loop isn't enough — you need explicit branching, cycles, a grading step, or a human approval gate — you drop to LangGraph and draw the control flow yourself as a graph. State flows between nodes; edges (including conditional ones) decide what runs next.

[Diagram: an agentic-RAG graph in LangGraph. START → retrieve (vector search) → grade (relevant?) → generate (answer) → END, with a "not relevant → rewrite & retry" edge looping back to retrieve.]
Figure 27. A LangGraph graph makes control flow explicit: retrieve → grade the results → generate, with a conditional edge that loops back to rewrite the query when the retrieved context isn't good enough. This "self-correcting" pattern is hard to express as a plain loop.
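Before reaching for the library, it can help to see the control flow of Figure 27 with no framework at all. The sketch below is plain Python, not LangGraph's API: `run_graph`, the toy corpus, and the stand-in rewrite step are all hypothetical, and the retry cap of 2 is an arbitrary choice.

```python
def run_graph(nodes, edges, state, start, max_steps=20):
    """Tiny graph runner: a node returns a partial state update, an edge picks the next node."""
    current = start
    for _ in range(max_steps):
        if current == "END":
            return state
        state = {**state, **nodes[current](state)}   # merge the node's update into state
        current = edges[current](state)              # conditional edges are just functions
    raise RuntimeError("step budget exhausted")

CORPUS = {"refund policy": "Refunds are accepted within 30 days."}   # toy exact-match index

def retrieve(state):
    return {"context": CORPUS.get(state["query"], "")}

def grade(state):
    return {"relevant": bool(state["context"])}

def rewrite(state):   # stand-in for an LLM that rephrases the query
    return {"query": "refund policy", "retries": state["retries"] + 1}

def generate(state):
    return {"answer": f"Based on the handbook: {state['context']}"}

nodes = {"retrieve": retrieve, "grade": grade, "rewrite": rewrite, "generate": generate}
edges = {
    "retrieve": lambda s: "grade",
    "grade": lambda s: "generate" if s["relevant"] or s["retries"] >= 2 else "rewrite",
    "rewrite": lambda s: "retrieve",                 # the cycle back to retrieval
    "generate": lambda s: "END",
}

final = run_graph(nodes, edges, {"query": "can I get my money back?", "retries": 0}, "retrieve")
print(final["answer"], "| retries:", final["retries"])
```

The first retrieval misses, the grader routes to the rewrite node, and the second pass succeeds. LangGraph adds typed state, persistence, and streaming on top of this same shape, which is what makes the loop safe to run in production.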
›_ Example — with code (3 · LangGraph: a stateful graph with memory)
03_langgraph_graph.py · LangGraph (StateGraph)
# pip install -U langgraph
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.checkpoint.memory import InMemorySaver

# nodes are just functions: (state) -> partial state update
def retrieve(state: MessagesState):
    query = state["messages"][-1].content
    hits = store.as_retriever().invoke(query)        # your vector store
    return {"messages": [{"role": "system", "content": f"Context:\n{hits}"}]}

def generate(state: MessagesState):
    answer = model.invoke(state["messages"])          # your chat model
    return {"messages": [answer]}

# wire the graph: START -> retrieve -> generate -> END
g = StateGraph(MessagesState)
g.add_node(retrieve)                                  # node name defaults to the function name
g.add_node(generate)
g.add_edge(START, "retrieve")
g.add_edge("retrieve", "generate")
g.add_edge("generate", END)

# a checkpointer gives durable execution + conversation memory across turns
graph = g.compile(checkpointer=InMemorySaver())
graph.invoke({"messages": [{"role": "user", "content": "Summarize the refund policy."}]},
             config={"configurable": {"thread_id": "user-42"}})   # same thread = remembered
# Add g.add_conditional_edges(...) for the grade/retry branch, or interrupt() for human approval.

At the top of the stack, Deep Agents gives you all the hard parts of a capable autonomous agent for free. create_deep_agent has the same tool-calling core, but ships with a planner, a virtual filesystem that compresses context as runs grow long, the ability to spawn isolated subagents for parallel subtasks, and long-term memory — ideal for deep research, codebase work, or any task with many steps.

›_ Example — with code (4 · Deep Agents: a research agent with planning + subagents)
04_deep_agent.py · Deep Agents (create_deep_agent)
# pip install -qU deepagents langchain-anthropic
from deepagents import create_deep_agent

def web_search(query: str) -> str:
    """Search the web and return the top results."""
    return run_search(query)    # plug in Tavily/SerpAPI/your own retriever here

# create_deep_agent bundles planning (write_todos), a virtual filesystem with
# automatic context compression, subagent spawning (the `task` tool), and memory.
agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[web_search],
    system_prompt=(
        "You are an expert researcher. Plan first, delegate sub-questions to "
        "subagents, save findings to files, then write a cited report."
    ),
)

# Give it a big, multi-step task -- it will plan, spawn subagents, and manage its own context.
result = agent.invoke({"messages": [{"role": "user", "content":
    "Compare QLoRA, DoRA and full fine-tuning on cost and quality. Write a sourced summary."}]})
print(result["messages"][-1].content)
Which layer should I use?

Most agents → LangChain create_agent (model + tools + prompt, swap providers freely). Need custom control flow — cycles, grading/routing, human-in-the-loop, durable long runs — → LangGraph. Big, open-ended, multi-step autonomy (research, coding, ops) → Deep Agents, which gives you planning, a filesystem, subagents, and memory out of the box. And in all three, RAG is the same move: wrap a retriever as a tool. Wire up LangSmith from day one to see what your agent is actually doing.


SECTION 28 · LangChain in depth: create_agent, tools, structured output & middleware

Section 27 showed the shape. Now the full surface: how the agent loop actually runs, how tools and structured output work, and the middleware system that is the real reason to use LangChain.

Agent = model + harness. create_agent builds a graph that runs the classic loop — call the model, and if the model asked for tools, run them, feed results back, and repeat until the model returns a final answer (no more tool calls). The signature exposes every lever you'll need:

The create_agent surface
create_agent(
  model,  tools=[…],  system_prompt=…,
  middleware=[…],   # hooks around the loop (the key feature)
  response_format=…,  # structured output schema (Pydantic / strategy)
  checkpointer=…,     # short-term memory + durability
  state_schema=…, context_schema=…)  # custom state / runtime context

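A few things make LangChain more than a thin wrapper: middleware hooks around every stage of the loop, schema-validated structured output, and checkpointer-backed memory.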

[Diagram: the create_agent loop, wrapped by middleware hooks. before_agent and after_agent bracket the run. The Model node (decide: answer or call tools) sits inside before_model / after_model and wrap_model_call. The Tools node (run requested calls) sits inside wrap_tool_call. Tool calls go to Tools, results are fed back to the Model, and a reply with no tool calls becomes the final answer. Middleware can read or modify state at every hook boundary: prompts, requests, tool calls, results.]
Figure 28. The agent loop and the middleware hooks that wrap it: before_agent/after_agent bracket the whole run; before_model/after_model and wrap_model_call surround each model call; wrap_tool_call surrounds each tool. This is where guardrails, summarization, retries, and limits live.
Example — no code

Middleware is like the staff around a chef in a busy kitchen. The chef (model) just cooks. But an expediter checks every order before it reaches the chef (before_model), a quality inspector tastes each plate before it leaves (after_model), a runner handles the actual fetching of ingredients (wrap_tool_call), and a manager caps how many dishes get made so costs don't explode (call limits). You compose the staff you need; the chef's job never changes. That separation is why one summarization or PII-redaction middleware drops into any agent unchanged.
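The kitchen analogy maps onto very little code. The sketch below is a framework-free illustration of the idea, not LangChain's implementation: `run_agent`, the hook signatures, and the scripted `toy_model` are all hypothetical, chosen only to show hooks composing around an unchanged loop.

```python
import re

def run_agent(model, tools, messages, before_model=(), wrap_tool_call=None, max_turns=10):
    def call_tool(name, arg):
        return tools[name](arg)
    for _ in range(max_turns):
        for hook in before_model:              # e.g. redaction, summarization, guards
            messages = hook(messages)
        reply = model(messages)
        messages = messages + [reply]
        if "tool" not in reply:                # no tool requested -> final answer
            return messages
        if wrap_tool_call:                     # e.g. logging, retries, approval gates
            result = wrap_tool_call(call_tool, reply["tool"], reply["arg"])
        else:
            result = call_tool(reply["tool"], reply["arg"])
        messages = messages + [{"role": "tool", "content": result}]
    return messages

def redact_emails(messages):                   # a before_model-style hook
    return [{**m, "content": re.sub(r"\S+@\S+", "[EMAIL]", m["content"])} for m in messages]

tool_log = []
def log_tool(call, name, arg):                 # a wrap_tool_call-style hook
    tool_log.append(name)
    return call(name, arg)

def toy_model(messages):                       # scripted stand-in for an LLM
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": f"Done: {messages[-1]['content']}"}
    return {"role": "assistant", "content": "", "tool": "lookup", "arg": messages[-1]["content"]}

history = run_agent(
    toy_model,
    tools={"lookup": lambda q: f"found record for {q}"},
    messages=[{"role": "user", "content": "find bob@example.com"}],
    before_model=[redact_emails],
    wrap_tool_call=log_tool,
)
print(history[-1]["content"], "| tools called:", tool_log)
```

Neither the model nor the tool knows the hooks exist, which is the property that lets one redaction or logging middleware drop into any agent unchanged.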

›_ Example — with code (1 · tools, structured output, and memory)
01_lc_core.py · LangChain (create_agent)
from langchain.agents import create_agent
from langgraph.checkpoint.memory import InMemorySaver
from pydantic import BaseModel, Field

# 1. a tool is a typed function; its docstring is the description the model reads
def search_flights(origin: str, destination: str, date: str) -> str:
    """Search available flights between two cities on a given date."""
    return f"3 flights from {origin} to {destination} on {date}: ..."

# 2. structured output: ask for parsed, validated data instead of prose
class FlightPick(BaseModel):
    airline: str = Field(description="chosen airline")
    price_usd: float
    depart_time: str

agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[search_flights],
    system_prompt="You are a travel assistant. Pick the best-value flight.",
    response_format=FlightPick,       # -> result["structured_response"] is a FlightPick
    checkpointer=InMemorySaver(),     # -> short-term memory across turns
)

cfg = {"configurable": {"thread_id": "trip-1"}}
res = agent.invoke({"messages": [{"role": "user", "content": "Cheapest flight SFO to JFK on Dec 5?"}]},
                   config=cfg)
print(res["structured_response"])     # FlightPick(airline=..., price_usd=..., ...)
# Because of the thread_id, a follow-up "what about Dec 6?" remembers the context.
›_ Example — with code (2 · production middleware: summarize, limit, guard, fall back)
02_lc_middleware.py · LangChain (built-in middleware)
from langchain.agents import create_agent
from langchain.agents.middleware import (
    SummarizationMiddleware,     # compress old history near the context limit
    ToolCallLimitMiddleware,     # cap tool calls (cost / runaway protection)
    ModelFallbackMiddleware,     # retry on a backup model if the primary fails
    PIIMiddleware,               # detect & redact sensitive data
    HumanInTheLoopMiddleware,    # pause for approval on risky tools
)
from langgraph.checkpoint.memory import InMemorySaver

# search_tool and send_email_tool are your own tools, defined elsewhere
agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[search_tool, send_email_tool],
    checkpointer=InMemorySaver(),            # required by HITL
    middleware=[
        PIIMiddleware("email", strategy="redact", apply_to_input=True),
        SummarizationMiddleware(model="gpt-5.4-mini",
                                trigger=("fraction", 0.8), keep=("messages", 20)),
        ToolCallLimitMiddleware(thread_limit=20, run_limit=8),
        ModelFallbackMiddleware("gpt-5.4-mini", "openai:gpt-5.4"),
        HumanInTheLoopMiddleware(interrupt_on={   # keys are tool names
            "send_email_tool": {"allowed_decisions": ["approve", "edit", "reject"]},  # ask a human before sending
            "search_tool": False,                 # auto-run safe tools
        }),
    ],
)
# Each middleware handles ONE concern and composes with the others -- no agent rewrite.
›_ Example — with code (3 · custom middleware via hooks)
03_lc_custom_mw.py · LangChain (@hooks)
from langchain.agents import create_agent
from langchain.agents.middleware import before_model, dynamic_prompt

# inject a fresh, per-request system prompt (e.g. user preferences from a store)
@dynamic_prompt
def personalized_prompt(request) -> str:
    user = request.runtime.context.get("user_name", "there")
    return f"You are a concise assistant. Address the user as {user}."

# a guardrail that runs right before every model call;
# a hook that returns jump_to must declare its allowed targets up front
@before_model(can_jump_to=["end"])
def log_and_guard(state, runtime) -> dict | None:
    last = state["messages"][-1].content
    if "wire transfer" in last.lower():
        return {"jump_to": "end"}        # short-circuit the loop
    return None                          # None -> proceed normally

agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[...],
    middleware=[personalized_prompt, log_and_guard],
    context_schema=dict,                 # lets us pass runtime context
)
agent.invoke({"messages": [{"role": "user", "content": "hi"}]},
             context={"user_name": "Sara"})
The mental model

Build the core agent with model + tools + system_prompt, then add capabilities as middleware: summarization for long chats, limits for cost, PII for compliance, human-in-the-loop for risky actions, fallback for resilience. The six hooks — before/after_agent, before/after_model, wrap_model_call, wrap_tool_call — cover essentially any interception you'll want, and a hook can return jump_to to redirect the loop.


SECTION 29 · LangGraph in depth: state, edges, persistence & human-in-the-loop

When the loop isn't the right shape — you need branching, cycles, parallelism, durability, or a human gate — you drop to LangGraph and build the control flow as an explicit graph.

Everything in LangGraph is three ideas: state (a typed object that flows through the graph), nodes (functions that read state and return an update), and edges (what runs next). You build it, then compile() it into a runnable.

[Diagram: anatomy of a compiled StateGraph. State: {messages: add_messages, attempts: +}, where reducers merge updates. START → agent → route? (conditional) → tools or END; tool results loop back to agent; done → END. The checkpointer saves state per thread_id after every step.]
Figure 29. A StateGraph: typed state with reducers flows through nodes; a conditional edge routes to tools or END; tool results loop back; the checkpointer persists state every step so the run is both stateful and resumable.
Example — no code

LangChain's create_agent is an automatic car — one pedal, it handles the gears. LangGraph is a manual transmission: more to operate, but you control exactly when to shift, brake, loop back, or pull over for a passenger (a human). The checkpointer is the car's black box — it records the journey continuously, so if the engine cuts out on a mountain road, you restart from the last marker instead of the bottom of the hill. The reducer is the rule that when two passengers add to the shared shopping list at once, the items get combined rather than one erasing the other.
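The reducer rule from the shopping-list analogy fits in a dozen lines. This is a plain-Python sketch of the merge semantics, not LangGraph's code: `apply_update` is a hypothetical helper, and `operator.add` stands in for `add_messages`, which in the real library also handles message IDs and format coercion.

```python
import operator

def apply_update(state, update, reducers):
    """Merge a node's partial update: use the key's reducer if it has one, else overwrite."""
    merged = dict(state)
    for key, value in update.items():
        if key in reducers and key in merged:
            merged[key] = reducers[key](merged[key], value)   # combine old and new
        else:
            merged[key] = value                               # default: last write wins
    return merged

reducers = {"messages": operator.add}    # list + list -> append; "attempts" has no reducer

state = {"messages": ["user: plan my week"], "attempts": 0}
state = apply_update(state, {"messages": ["ai: calling calendar tool"], "attempts": 1}, reducers)
state = apply_update(state, {"messages": ["tool: 3 meetings found"]}, reducers)
print(state)
```

Each node returns only the keys it changed; the history accumulates because `messages` has a reducer, while `attempts` is simply replaced and untouched keys carry over as they were.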

›_ Example — with code (1 · state, reducers, conditional edges, a cyclic loop)
01_lg_graph.py · LangGraph (StateGraph)
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import InMemorySaver

# 1. STATE: messages append (reducer); attempts overwrite by default
class State(TypedDict):
    messages: Annotated[list, add_messages]    # reducer -> appends, never clobbers
    attempts: int

# 2. NODES: (state) -> partial update
def call_model(state: State):
    reply = model.invoke(state["messages"])               # your chat model, with tools bound
    return {"messages": [reply], "attempts": state.get("attempts", 0) + 1}

def run_tools(state: State):
    results = execute_tool_calls(state["messages"][-1])   # your tool executor
    return {"messages": results}

# 3. ROUTER for a conditional edge: keep looping or stop
def should_continue(state: State) -> str:
    last = state["messages"][-1]
    if last.tool_calls and state["attempts"] < 5:
        return "tools"
    return END

# 4. WIRE the graph (note the cycle tools -> agent)
g = StateGraph(State)
g.add_node("agent", call_model)
g.add_node("tools", run_tools)
g.add_edge(START, "agent")
g.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
g.add_edge("tools", "agent")                   # loop back

graph = g.compile(checkpointer=InMemorySaver())
graph.invoke({"messages": [{"role": "user", "content": "Plan my week."}], "attempts": 0},
             config={"configurable": {"thread_id": "u1"}})
›_ Example — with code (2 · human-in-the-loop with interrupt + resume)
02_lg_hitl.py · LangGraph (interrupt / Command)
from langgraph.types import Command, interrupt

# assumes State also declares `item` and `cost` fields
def approve_purchase(state: State):
    # pause the graph and surface a payload to the caller
    decision = interrupt({"action": "buy", "item": state["item"], "cost": state["cost"]})
    if decision == "approve":
        return {"messages": [{"role": "system", "content": "Purchase approved."}]}
    return {"messages": [{"role": "system", "content": "Purchase cancelled."}]}

# ... add_node("approve", approve_purchase) ... compile(checkpointer=...)
cfg = {"configurable": {"thread_id": "order-9"}}
result = graph.invoke({...}, config=cfg)
# Execution pauses at interrupt(); result surfaces the pending interrupt payload:
print(result["__interrupt__"])    # (Interrupt(value={'action': 'buy', ...}),)

# A human reviews, then we RESUME -- the resume value becomes interrupt()'s return:
graph.invoke(Command(resume="approve"), config=cfg)    # continues exactly where it paused
›_ Example — with code (3 · Command routing, streaming, durable Postgres memory)
03_lg_command_stream.py · LangGraph (Command / stream / Postgres)
from langgraph.types import Command

# A node that BOTH updates state AND picks the next node -- no conditional edge needed
# (needs_research and route_note are your own helpers)
def supervisor(state: State) -> Command:
    nxt = "researcher" if needs_research(state) else "writer"
    return Command(goto=nxt, update={"messages": [route_note(nxt)]})

# Stream intermediate progress instead of waiting for the final result
for mode, chunk in graph.stream(inputs, config=cfg, stream_mode=["updates", "messages"]):
    print(mode, chunk)    # "updates" = per-node state deltas; "messages" = token stream

# Durable, multi-instance memory for production: swap the checkpointer
from langgraph.checkpoint.postgres import PostgresSaver

with PostgresSaver.from_conn_string("postgresql://...") as saver:
    saver.setup()                             # creates the checkpoint tables; needed on first use
    graph = g.compile(checkpointer=saver)     # survives restarts, scales across servers
    # invoke / stream the graph inside this block, while the connection is open
When to drop to LangGraph

Reach for it when you need explicit control flow (branching, bounded cycles, multi-agent supervisors), durable long runs that survive crashes, human approval gates mid-run, or fine-grained streaming. Keep state small and typed, use reducers only where branches merge, and put conditional edges only at real decision points. For a plain tool-calling assistant, stay on create_agent — it compiles down to a LangGraph graph anyway.


SECTION 30 · Deep Agents in depth: planning, filesystem, subagents & skills

Deep Agents is the "batteries-included" harness — the same tool-calling loop, but pre-loaded with the machinery that makes agents survive long, messy, multi-step tasks. It's the open-source distillation of what makes tools like Claude Code work.

create_deep_agent returns a compiled LangGraph graph (so you keep streaming, checkpointers, and tracing), but it ships with four built-in capabilities turned on by default, each implemented as middleware you can override:

[Diagram: a Deep Agent's tool-calling core surrounded by four built-ins. 📋 Planning: write_todos / read_todos. 📁 Filesystem: read/write/edit + offload. 🧑 Subagents: the task tool, isolated context. 🧠 Memory + skills: summarize · store · skills. All four ship on by default; each is middleware on top of LangChain's create_agent, on the LangGraph runtime.]
Figure 30. The Deep Agents harness: a plain tool-calling core surrounded by built-in planning, a context-offloading filesystem, subagent delegation, and memory/skills — the parts you'd otherwise hand-build for any serious long-running agent.
Example — no code

A basic agent is one person with a notepad trying to hold an entire project in their head — they run out of room fast. A Deep Agent is a project lead with an office: a whiteboard for the plan (write_todos), filing cabinets so findings live on paper instead of cramming the desk (the filesystem — and it files away bulky documents automatically), and a team of assistants they can hand self-contained sub-tasks to, each working in their own room and reporting back a one-paragraph summary (subagents in isolated context). That's why Deep Agents handles a "project" where a plain agent chokes on the third step.
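Two of those office habits, filing bulky documents away and handing work to an assistant in a separate room, can be sketched without any framework. The code below is an illustration of the ideas only, not how the deepagents package implements them: both function names, the 200-character threshold, and the file-naming scheme are hypothetical.

```python
def offload_large_results(messages, files, max_chars=200):
    """Move oversized tool outputs into a virtual filesystem, leaving a short pointer behind."""
    compact = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and len(msg["content"]) > max_chars:
            path = f"/results/tool_{i}.txt"                 # hypothetical naming scheme
            files[path] = msg["content"]                    # full text lives in the "filing cabinet"
            msg = {**msg, "content": f"[{len(msg['content'])} chars saved to {path}]"}
        compact.append(msg)
    return compact

def run_subagent(task, worker):
    """Give a subagent a fresh context and hand only its summary back to the parent."""
    private_history = [{"role": "user", "content": task}]   # the parent's history is not shared
    return worker(private_history)

files = {}
history = [
    {"role": "user", "content": "Research LoRA variants."},
    {"role": "tool", "content": "x" * 5000},                # a bulky search result
]
history = offload_large_results(history, files)
print(history[1]["content"])

summary = run_subagent("Summarize QLoRA in one line.",
                       worker=lambda h: f"summary for: {h[0]['content']}")
print(summary)
```

In both cases the parent's context stays small: it holds a pointer instead of the document, and a summary instead of the subagent's full working transcript.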

›_ Example — with code (1 · a deep agent with custom subagents)
01_deepagent.py · Deep Agents: create_deep_agent

from deepagents import create_deep_agent


def web_search(query: str) -> str:
    """Search the web and return results."""
    # run_tavily is not defined in this snippet; it stands in for your own search client call.
    return run_tavily(query)


# Specialized subagents: each gets its OWN isolated context window.
research_subagent = {
    "name": "researcher",
    "description": "Deeply researches one focused sub-question and returns a summary.",
    "system_prompt": "You are a meticulous researcher. Cite sources.",
    "tools": [web_search],
}
critic_subagent = {
    "name": "critic",
    "description": "Reviews a draft for accuracy and gaps.",
    "system_prompt": "You are a sharp editor. List concrete fixes.",
}

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[web_search],
    system_prompt=(
        "You are a research lead. First write_todos to plan. "
        "Delegate sub-questions to the researcher subagent, save findings "
        "to files, then have the critic review before the final report."
    ),
    subagents=[research_subagent, critic_subagent],
)

result = agent.invoke({"messages": [{"role": "user", "content": "Write a sourced report comparing LoRA, QLoRA, and DoRA on cost and quality."}]})
print(result["messages"][-1].content)
# Under the hood: write_todos -> task(researcher) x N -> write_file -> task(critic) -> report
›_ Example — with code (2 · backends, shell, memory & human approval)
02_deepagent_backends.py · Deep Agents: backends + HITL

from deepagents import create_deep_agent
from deepagents.backends import LocalShellBackend  # filesystem + real shell `execute`
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.store.memory import InMemoryStore

# A coding-style agent: real local files + shell, long-term memory, approval gates
agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[],
    backend=LocalShellBackend(workspace_root="/workspace"),  # ls/read/write/edit + execute
    system_prompt="You are a coding agent. Plan, edit files, run tests, then summarize.",
    # Resumable within this process and required for interrupts;
    # swap for a database-backed checkpointer to survive restarts.
    checkpointer=InMemorySaver(),
    store=InMemoryStore(),  # long-term memory across threads (swap for a DB)
    interrupt_on={"execute": {"allowed_decisions": ["approve", "reject"]}},  # gate shell cmds
)

# Because it returns a compiled LangGraph graph, you can stream subagent activity.
# subgraphs=True makes each item a (namespace, chunk) pair, so nested subagent runs are included.
for namespace, chunk in agent.stream(
    {"messages": [{"role": "user", "content": "Fix the failing test in utils.py"}]},
    config={"configurable": {"thread_id": "dev-1"}},
    stream_mode="updates",
    subgraphs=True,
):
    print(namespace, chunk)
! "Trust the LLM" — with guardrails

Deep Agents follows a trust-the-model design: the agent can do anything its tools and backend allow, including running shell commands and editing files. That power needs fences. Use a sandbox backend (not your host) for untrusted work, declare filesystem permission rules to bound read/write access, gate dangerous tools like execute behind human-in-the-loop, and cap runs with model/tool call limits. Always wire up LangSmith so you can see what a long autonomous run actually did.
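Two of these fences, an approval gate and a call cap, are simple enough to sketch without any framework. The wrapper below is an illustration under stated assumptions, not Deep Agents' mechanism: `guarded`, `fake_execute`, `reviewer`, and `ToolCallLimitExceeded` are hypothetical names, and the "shell" never runs anything.

```python
from typing import Callable


class ToolCallLimitExceeded(RuntimeError):
    """Raised when a run tries to exceed its tool-call budget."""


def guarded(tool: Callable[[str], str], *, approve: Callable[[str], bool],
            max_calls: int) -> Callable[[str], str]:
    """Wrap `tool` with an approval gate and a hard cap on calls."""
    calls = 0

    def wrapper(arg: str) -> str:
        nonlocal calls
        if calls >= max_calls:
            raise ToolCallLimitExceeded(f"limit of {max_calls} calls reached")
        calls += 1                       # rejected attempts also spend budget
        if not approve(arg):
            return f"REJECTED: {arg}"
        return tool(arg)

    return wrapper


def fake_execute(command: str) -> str:
    """Stand-in for a shell tool; it never runs anything."""
    return f"ran: {command}"


def reviewer(command: str) -> bool:
    """Stand-in for a human reviewer: refuse anything that deletes files."""
    return "rm " not in command


execute = guarded(fake_execute, approve=reviewer, max_calls=2)
first = execute("pytest -q")         # approved and run
second = execute("rm -rf build")     # rejected by the reviewer
try:
    execute("ls")                    # third call exceeds the budget
    third = "allowed"
except ToolCallLimitExceeded:
    third = "blocked"
print(first, second, third, sep=" | ")
```

Counting rejected attempts against the budget is a deliberate choice in this sketch: it stops a run from looping forever on a command the reviewer keeps refusing.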

Choosing your altitude, one more time

Deep Agents when the task is a project: research, coding, multi-step ops needing planning + files + delegation. LangChain create_agent when it's a focused assistant (model + tools + a few middleware). LangGraph when the control flow itself is custom and the agent loop isn't the right shape. They nest cleanly — Deep Agents is built on create_agent, which is built on LangGraph — so you can always drop a level for more control without leaving the ecosystem.

SECTION 31 The one-page cheat sheet

Everything above, compressed. Keep this nearby.

Vocabulary in one line each

| Term | Meaning |
| --- | --- |
| Parameter / weight | A learnable number the model adjusts during training. |
| Hyperparameter | A setting you choose (learning rate, batch size, #layers). |
| Forward pass | Run input through the model to get a prediction. |
| Loss | One number scoring how wrong the prediction is. |
| Backprop | Compute the gradient of the loss w.r.t. every weight. |
| Gradient | Direction + rate the loss changes as a weight changes. |
| Optimizer | Uses gradients to update weights (SGD, Adam, AdamW). |
| Epoch | One full pass over the training set. |
| Batch | A small group of examples processed together. |
| Overfitting | Memorizing train data, failing on new data. |
| Logits | Raw model scores before softmax/sigmoid. |
| Embedding | A learned vector representing a discrete item. |
| Fine-tuning | Adapting a pretrained model to your task. |
| Attention | Each token weighs how much every other token matters. |
| Quantization | Store weights in fewer bits (int8/int4) to shrink + speed up. |
| LoRA | Train tiny low-rank adapters instead of all weights. |
| QLoRA | LoRA on a 4-bit frozen base; fine-tune big models on one GPU. |
| RLHF | Align a model to human preference: SFT → reward model → PPO. |
| DPO | Align directly from preference pairs, with no reward model or RL. |
| GRPO | Online RL without a critic; great for verifiable (math/code) rewards. |
| MoE | Many expert FFNs; a router activates only a few per token. |
| GQA | Query heads share KV heads: smaller KV cache, longer context. |
| KV cache | Stored past keys/values so generation isn't quadratic. |
| RAG | Retrieve relevant docs and feed them into the prompt. |
| Agent | An LLM in a loop that calls tools, observes, and iterates. |
| Distillation | Train a small model to imitate a big one's outputs. |
| LangChain | Agent framework: create_agent = model + tools + prompt + middleware. |
| LangGraph | Orchestration runtime: stateful graph, durable execution, persistence. |
| Deep Agents | Batteries-included harness: planning, filesystem, subagents, memory. |
| Middleware | Hooks around the agent loop (summarize, limit, guard, fallback). |
| Checkpointer | Saves graph state per thread: memory + crash-resume. |
| Subagent | Delegated agent with isolated context that returns a summary. |
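The LoRA row is easier to remember with the arithmetic behind "tiny". For a frozen weight matrix of shape d × k, a rank-r adapter trains two small matrices, B (d × r) and A (r × k), so it adds r · (d + k) trainable weights. The sizes below (a 4096 × 4096 projection, rank 8) are hypothetical, chosen only to make the ratio visible.

```python
def full_params(d: int, k: int) -> int:
    """Weights in one dense d x k matrix."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """B is d x r and A is r x k, so the adapter has r * (d + k) weights."""
    return r * (d + k)


d = k = 4096   # hypothetical projection size
r = 8          # hypothetical adapter rank

full = full_params(d, k)          # 16,777,216
adapter = lora_params(d, k, r)    # 65,536
fraction = adapter / full         # 1/256

print(f"full: {full:,}  adapter: {adapter:,}  trainable share: {fraction:.2%}")
```

At these sizes the adapter is 1/256 of the matrix it modifies, about 0.39%, which is why optimizer state and gradients for LoRA fit where full fine-tuning does not.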

Pick the architecture

| Your data is… | Reach for… |
| --- | --- |
| Tabular / structured | Gradient-boosted trees first; MLP if you must |
| Images / video | CNN or Vision Transformer (fine-tuned) |
| Text / language | Transformer (BERT to understand, GPT-class to generate) |
| Time series / streaming audio | LSTM/GRU, Temporal CNN, or a transformer |
| Need to generate images | Diffusion model |
| Need semantic search / RAG | Embedding model + vector search |

Modern LLM toolbox — pick the right lever

| Your goal… | Reach for… |
| --- | --- |
| Make a big model fit / run cheaper | Quantization (int4 NF4, AWQ, GGUF) · §19 |
| Customize behavior on a budget | LoRA / QLoRA / DoRA · §20 |
| Align to human preference (have pairs) | DPO (or KTO for binary labels) · §22 |
| Improve reasoning with a checkable reward | GRPO / RLVR · §22 |
| Maximum reward control / online RL | RLHF + PPO · §21 |
| Longer context / faster attention | GQA + FlashAttention + RoPE · §23 |
| Huge capacity, modest compute/token | Mixture-of-Experts · §23 |
| Train a model too big for one GPU | bf16 + checkpointing + FSDP/ZeRO · §24 |
| Serve fast to many users | vLLM (PagedAttention, continuous batching) · §25 |
| Speed up generation, same output | Speculative decoding · §25 |
| Give the model fresh / private knowledge | RAG · §26 |
| Let it use tools & act | Agent loop + function calling · §26 |
| Build an agent fast, swap providers freely | LangChain create_agent + middleware · §28 |
| Custom control flow / durability / HITL | LangGraph StateGraph + checkpointer · §29 |
| Long, multi-step "project" autonomy | Deep Agents (planning + files + subagents) · §30 |
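For the GQA and serving rows, a back-of-the-envelope estimate shows what fewer KV heads buy you. The cache holds one key and one value vector per layer, per KV head, per token, so its size is 2 × layers × KV heads × head dimension × sequence length × batch × bytes per value. The model shape below is hypothetical, and real servers add overhead (paging, fragmentation) that this sketch ignores.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_value: int = 2) -> int:
    """KV cache size: one K and one V vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# Hypothetical model: 32 layers, 32 query heads of width 128, 4096-token context, fp16.
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)  # every head keeps its own K/V
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)   # 4 query heads share each KV head

print(f"MHA: {mha / 2**30:.2f} GiB per sequence")   # MHA: 2.00 GiB per sequence
print(f"GQA: {gqa / 2**30:.2f} GiB per sequence")   # GQA: 0.50 GiB per sequence
```

The same memory budget therefore holds four times as many concurrent sequences, or a context four times as long, once the KV head count drops from 32 to 8.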

The loop you'll write a thousand times

for epoch in range(epochs):
    for xb, yb in loader:
        preds = model(xb)         # forward
        loss  = loss_fn(preds, yb)
        opt.zero_grad()           # clear grads  ← don't forget!
        loss.backward()           # backprop
        opt.step()                # update
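
To see those five steps run end to end without installing a framework, here is a dependency-free analogue. It is a sketch, not PyTorch: the "model" is a single weight `w` fitted to y = 2x, the gradient is derived by hand instead of by autograd, and the data, learning rate, and epoch count are arbitrary choices for illustration.

```python
# Toy data for y = 2x; the "model" is a single weight w.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0          # the one learnable parameter
lr = 0.05        # learning rate (arbitrary for this toy)
epochs = 50
grad = 0.0       # plays the role of w.grad

for epoch in range(epochs):
    for xb, yb in data:                   # one example per "batch"
        pred = w * xb                     # forward
        loss = (pred - yb) ** 2           # squared error
        grad = 0.0                        # clear grads (the zero_grad step)
        grad += 2 * (pred - yb) * xb      # backprop by hand: d(loss)/dw, accumulated
        w -= lr * grad                    # update (plain SGD)

print(f"learned w = {w:.6f}")
```

The `grad += ...` line mirrors how gradients accumulate in PyTorch: delete the `grad = 0.0` reset and every update would also include all earlier gradients, which is the bug the "don't forget!" comment warns about.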

Golden rules

Where to go next

You now have the full skeleton. Deepen it by building: train an MNIST classifier (the "hello world"), fine-tune a 🤗 model on a dataset you care about, then read the official docs as reference rather than cover-to-cover — pytorch.org, tensorflow.org, and huggingface.co/docs/transformers. The concepts here are the map; the docs are the territory.