A complete technical reference

Deep learning, from the neuron up to LoRA, GRPO & agents.

Every core idea — taught with plain intuition, a worked example without code, then real code in PyTorch, TensorFlow/Keras, and 🤗 Transformers. Diagrams for the parts that are hard to picture. From backprop and CNNs through quantization, PEFT, RLHF/DPO/GRPO, Mixture-of-Experts, scaled training, serving, and the LangChain / LangGraph / Deep Agents stack. Written for engineers who want depth, not slogans.

31 sections | intuition + code | custom diagrams | PyTorch · TF · HF | neuron → agents

SECTION 01. What deep learning actually is

Where it sits inside AI, why "deep", and the one trick that makes the whole field work.

Artificial intelligence is the broad goal of making machines do things that seem to require intelligence. Machine learning (ML) is one route to that goal: instead of writing rules by hand, you give a program examples and let it find the rules itself. Deep learning (DL) is a sub-field of ML built on neural networks with many layers — the "deep" simply means many stacked layers of transformation, not anything mystical.

The defining move of deep learning is learned representations. Classical ML often needs a human to hand-design the features (e.g. "count the edges in this image, measure their angles"). A deep network instead learns its own features, layer by layer: early layers detect edges, middle layers assemble them into textures and parts, late layers recognise whole objects. Nobody told it what an edge is — it discovered that edges were useful for the task you optimised it on.

Figure 1. Deep learning is a subset of ML, which is a subset of AI. Its signature is learning the features automatically rather than having an engineer hand-craft them.

◆ Example — no code

Imagine teaching a child to recognise a cat. You don't give a checklist ("4 legs, whiskers, pointed ears"). You show thousands of cats and not-cats and correct them when they're wrong. Over time the child's brain self-organises the concept. A deep network does the same: it sees many labelled images, makes a guess, measures how wrong it was, and nudges its millions of internal numbers a tiny bit toward "less wrong". Repeat millions of times. That nudging loop — guess, measure error, adjust — is deep learning. Everything else is detail about how to do the adjusting efficiently for different shapes of data (images, text, audio).

When to reach for deep learning (and when not to)

Use deep learning when…	Prefer classical ML / rules when…
Data is high-dimensional & unstructured: images, audio, raw text, video.	Data is small (hundreds–few thousand rows) and tabular.
You have lots of data (tens of thousands+ examples) or a pre-trained model to fine-tune.	You need full interpretability / an auditable decision.
The mapping from input→output is complex and you can't write the rules.	Gradient-boosted trees (XGBoost/LightGBM) already beat it — common on tabular data.
You can tolerate a black-box model and GPU compute.	Latency/compute budget is tiny or no GPU is available.

↗ Engineer's note

On structured/tabular data, gradient-boosted trees frequently outperform neural nets with far less tuning. Deep learning's home turf is perception (vision, speech) and language. "Use a transformer" is not the answer to every problem — match the tool to the data.

SECTION 02. The artificial neuron

The single unit every deep network is built from — a weighted sum, a bias, and a squashing function.

A biological neuron receives signals on its dendrites, sums them, and fires if the total crosses a threshold. The artificial neuron (or unit) is a deliberately crude maths version of that idea. It does exactly three things:

Weighted sum. Each input x_i is multiplied by a weight w_i (how much that input matters) and they're all added up.
Add a bias. A learnable constant b shifts the result, letting the neuron fire more or less easily.
Activation. The sum is passed through a non-linear function σ that decides the output. Without this step the whole network would collapse into one big linear function (see §5).

∑ The neuron in one line

y = σ( w₁x₁ + w₂x₂ + … + wₙxₙ + b ) = σ( w·x + b )

w·x is the dot product of the weight vector and input vector. The weights and bias are the learnable parameters — training is the search for good values of these.

Figure 2. A single neuron. Thicker arrows = larger weights (here w₂ matters most). Inputs are scaled, summed with a bias, then squashed by an activation σ to produce the output y.

◆ Example — no code

Suppose a neuron decides "should I go for a run?" Inputs: x₁=weather is nice (1/0), x₂=I have free time (1/0), x₃=I'm tired (1/0). You care a lot about free time, somewhat about weather, and being tired pushes you the other way — so the learned weights might be w₁=2, w₂=3, w₃=−4 with bias b=−1. If today weather=1, time=1, tired=0, the sum is 2·1 + 3·1 + (−4)·0 − 1 = 4. A positive number → activation fires → "go run". Tired=1 instead gives 0 → borderline. The network learned these weights from your past behaviour; you never wrote the rule.

›_ Example — with code

neuron.py · python
import numpy as np

# one neuron, 3 inputs
x = np.array([1.0, 1.0, 0.0])          # weather, free time, tired
w = np.array([2.0, 3.0, -4.0])         # learned weights
b = -1.0                               # learned bias

def relu(z):                           # a common activation (see §5)
    return max(0.0, z)

z = np.dot(w, x) + b                   # weighted sum + bias  -> 4.0
y = relu(z)                            # activation           -> 4.0
print(f"pre-activation z = {z},  output y = {y}")
# A whole layer is just many neurons stacked: a matrix multiply.
W = np.array([[2., 3., -4.],           # neuron 1
              [1., -1., 1.]])          # neuron 2
b_vec = np.array([-1.0, 0.5])
layer_out = np.maximum(0.0, W @ x + b_vec)   # ReLU over the vector
print("layer output:", layer_out)

That last block is the crucial leap: a layer is just a matrix multiply plus a bias plus an activation. Stack several of these and you have a deep network. Everything that follows is about choosing the activations, measuring error, and adjusting W and b automatically.

SECTION 03. Tensors & the math you actually need

A tensor is just an n-dimensional array. Four operations cover ~90% of what happens inside a network.

Everything flowing through a neural network — inputs, weights, activations, gradients — is a tensor: a grid of numbers with some number of axes (dimensions). The jargon maps onto things you know:

Name	Rank (axes)	Example	Shape
Scalar	0	a single loss value `3.14`	`()`
Vector	1	one word embedding, one data row	`(768,)`
Matrix	2	a batch of rows; a weight layer	`(batch, features)`
3-D tensor	3	a batch of token sequences	`(batch, seq_len, dim)`
4-D tensor	4	a batch of RGB images	`(batch, channels, H, W)`

You do not need heavy mathematics to be productive, but four ideas recur constantly:

1 · Matrix multiplication — the workhorse

A layer computing Y = X·W + b is a matrix multiply. If X is (batch, in) and W is (in, out), the result is (batch, out). The inner dimensions must match — most shape bugs are a mismatch here.

2 · Broadcasting

When you add a bias vector of shape (out,) to a matrix (batch, out), the framework broadcasts the vector across every row automatically. Broadcasting lets small tensors stretch to fit big ones without copying memory — but it can also silently create wrong shapes, so check.
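
Both sides of broadcasting can be seen in a few lines. This is a minimal NumPy sketch; the array names and values are illustrative assumptions, not taken from any library or dataset.

```python
import numpy as np

# Intended broadcast: a bias of shape (out,) is added to every row of (batch, out).
acts = np.zeros((4, 3))                  # (batch=4, out=3)
bias = np.array([1.0, 2.0, 3.0])         # (out=3,)
shifted = acts + bias                    # bias is stretched across the 4 rows
print(shifted.shape)                     # (4, 3)

# Silent bug: predictions kept as a column, targets as a flat vector.
preds = np.array([[1.0], [2.0], [3.0], [4.0]])    # (4, 1)
targets = np.array([1.0, 2.0, 3.0, 4.0])          # (4,)
diff_wrong = preds - targets             # broadcasts to (4, 4): every pair, not row by row
diff_right = preds.squeeze(1) - targets  # (4,): the element-wise gap that was intended
print(diff_wrong.shape, diff_right.shape)

# The predictions are perfect, yet the mis-broadcast "error" is not zero.
mse_wrong = float((diff_wrong ** 2).mean())
mse_right = float((diff_right ** 2).mean())
print(mse_wrong, mse_right)              # 2.5 0.0
```

No exception is raised in the buggy case, which is why checking shapes before computing a loss is worth the extra line.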

3 · The gradient

The gradient is the vector of partial derivatives of the loss with respect to every parameter. It points in the direction of steepest increase of the loss; training walks the opposite way. You will almost never compute it by hand — autograd does it (§7) — but you must understand what it means: "if I nudge this weight up a hair, does the error go up or down, and how fast?"

4 · The chain rule

A network is a chain of functions: loss(layer3(layer2(layer1(x)))). The chain rule from calculus lets you compute how the final loss changes with respect to an early weight by multiplying the local derivatives along the chain. This single rule is the mathematical engine of backpropagation.

∂ The chain rule, the only formula to memorise

dL/dw = (dL/dy) · (dy/dz) · (dz/dw)

Read right-to-left: how the weight affects the pre-activation, how that affects the output, how that affects the loss. Multiply them to get the full effect. Backprop just applies this across millions of parameters efficiently.
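
To make the formula concrete, here is a small framework-free sketch for a single sigmoid neuron with a squared-error loss. The input, weight, bias and target values are assumptions chosen only for the demonstration; the chain-rule product is checked against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for(w, x=1.5, b=0.1, target=1.0):
    """Forward pass of one neuron followed by a squared-error loss."""
    z = w * x + b            # pre-activation
    y = sigmoid(z)           # output
    return (y - target) ** 2

# Illustrative values (assumptions, chosen only for the demo).
w, x, b, target = 0.4, 1.5, 0.1, 1.0

# Forward pass, keeping the intermediates the backward pass needs.
z = w * x + b
y = sigmoid(z)

# Local derivatives, one per link in the chain.
dL_dy = 2.0 * (y - target)   # how the output affects the loss
dy_dz = y * (1.0 - y)        # sigmoid derivative
dz_dw = x                    # how the weight affects the pre-activation

grad_chain = dL_dy * dy_dz * dz_dw

# Independent check: central finite difference on the whole function.
eps = 1e-6
grad_numeric = (loss_for(w + eps) - loss_for(w - eps)) / (2 * eps)

print(grad_chain, grad_numeric)   # the two estimates agree to several decimals
```

The gradient comes out negative here, meaning a slightly larger w would lower the loss.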

›_ Example — with code (PyTorch tensors)

tensors.py · python
import torch

# create tensors
x = torch.randn(32, 784)        # batch of 32 flattened 28x28 images
W = torch.randn(784, 128)       # weight matrix: 784 in -> 128 out
b = torch.randn(128)            # bias vector

# a linear layer, by hand
y = x @ W + b                   # @ is matmul; b is BROADCAST over 32 rows
print(x.shape, "@", W.shape, "->", y.shape)   # (32,784) @ (784,128) -> (32,128)

# reshaping / moving axes (constant in real code)
imgs = torch.randn(32, 3, 224, 224)           # (batch, channels, H, W)
flat = imgs.reshape(32, -1)                   # -> (32, 150528)
moved = imgs.permute(0, 2, 3, 1)              # NCHW -> NHWC: (32,224,224,3)

# everything runs on a GPU by moving the tensor there
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)                              # same API, faster hardware

↗ Debugging mantra

When a model breaks, print the shapes. A large share of deep-learning bugs come down to tensors that are the wrong shape, on the wrong device (CPU vs GPU), or the wrong dtype (float32 vs int64). Make print(x.shape, x.dtype, x.device) a reflex.

SECTION 04. The MLP & the forward pass

Stack layers of neurons, push data through, get a prediction. This is the "hello world" of deep nets.

A multi-layer perceptron (MLP), also called a fully-connected or dense network, is layers of neurons where every neuron connects to every neuron in the next layer. Data enters the input layer, flows through one or more hidden layers, and exits the output layer. Computing the output from the input is the forward pass.

Figure 3. A 3-layer MLP. Each layer is activation(X·W + b). "Deep" just means more hidden layers. The output layer's size matches the task (e.g. 2 neurons for a 2-class problem).

◆ Example — no code

Predicting a house price from 3 numbers: size, bedrooms, age. The input layer holds those 3 values. Hidden layer 1 might learn combinations like "big-and-new" or "small-and-old". Hidden layer 2 combines those into higher-level notions like "desirable family home". The single output neuron emits a price. Each layer builds richer concepts from the layer below — and the network decides what those concepts are, guided only by how well the final price matches reality.

›_ Example — with code: the same MLP in 2 frameworks

mlp_pytorch.py · PyTorch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 16),   # input 3  -> hidden 16
    nn.ReLU(),
    nn.Linear(16, 16),  # hidden   -> hidden 16
    nn.ReLU(),
    nn.Linear(16, 1),   # hidden   -> output 1 (price)
)
# forward pass: prediction = model(x)

mlp_keras.py · TensorFlow / Keras
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(3,)),               # 3 input features
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),                 # linear output for regression
])
# forward pass: prediction = model(x)

Notice how similar they are. Once you know the concepts, switching frameworks is mostly learning new spellings for the same nouns: nn.Linear = Dense, both stack into a Sequential.

SECTION 05. Activation functions

The non-linearity that lets networks model curves, not just lines. Without it, depth is pointless.

Here is the single most important fact about activations: stacking linear layers without a non-linearity gives you… one linear layer. W₂(W₁x) = (W₂W₁)x is still just a matrix multiply. The activation function inserts a bend between layers, and bends are what let a network with enough units approximate essentially any continuous function: curved decision boundaries, complex mappings, the works. (This is the gist of the universal approximation theorem.)
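
The collapse is easy to check numerically. The sketch below uses small hand-picked matrices (an assumption made so the numbers are easy to follow): the two stacked linear layers equal one merged matrix, while the same weights with a ReLU in between compute something no single linear layer can.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])     # layer 1: 2 inputs -> 2 hidden units
W2 = np.array([[1.0, 1.0]])      # layer 2: 2 hidden -> 1 output
x = np.array([1.0, 3.0])

two_layers = W2 @ (W1 @ x)                  # stacked linear layers, no activation
one_layer = (W2 @ W1) @ x                   # a single layer with the merged matrix W2 W1
with_relu = W2 @ np.maximum(0.0, W1 @ x)    # same weights, ReLU in between

print(two_layers, one_layer)     # [0.] [0.]  identical: the stack collapsed
print(with_relu)                 # [2.]       here the network computes |x1 - x2|
```

With these particular weights the purely linear stack outputs 0 for every input, whereas the ReLU version computes the absolute difference of the two inputs, a function with a bend in it.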

Figure 4. Three classic activations. ReLU is the modern default for hidden layers: it is cheap, and its gradient does not shrink for positive inputs, which mitigates vanishing gradients. Sigmoid and tanh saturate at the ends (flat = tiny gradient), which is why they're mostly relegated to output layers and gates now.

Function	Output range	Use it for	Watch out for
`ReLU`	[0, ∞)	Default for hidden layers	"Dying ReLU": units stuck at 0
`Leaky ReLU / GELU`	Leaky ReLU: (−∞, ∞); GELU: ≈[−0.17, ∞)	Hidden layers; GELU in transformers	Slightly more compute
`Sigmoid`	(0, 1)	Binary classification output; LSTM/GRU gates	Saturation → vanishing gradients
`Tanh`	(−1, 1)	RNN/LSTM hidden and candidate states, zero-centered data	Also saturates
`Softmax`	(0,1), sums to 1	Multi-class output (a probability dist.)	Output layer only, not hidden

◆ Example — no code

Think of softmax as turning raw scores into a probability vote. If the final layer outputs raw scores [2.0, 1.0, 0.1] for "cat / dog / bird", softmax converts them to roughly [0.66, 0.24, 0.10]. They're now positive and sum to 1, so you can read them as "66% confident it's a cat". The biggest score still wins, but you also get a graded measure of confidence (not necessarily a well-calibrated one), which the loss function (§6) needs.
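
The arithmetic behind those numbers fits in a few lines. This is a plain-Python sketch; the helper name `softmax` and the max-subtraction trick are a common way to write it, not something specific to this guide.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])                  # cat / dog / bird
print([round(p, 2) for p in probs])               # [0.66, 0.24, 0.1]

# Shifting every score by the same constant leaves the result unchanged,
# which is why subtracting the max is safe.
shifted = softmax([1002.0, 1001.0, 1000.1])
print([round(p, 2) for p in shifted])             # [0.66, 0.24, 0.1]
```

Without the max-subtraction, `math.exp(1002.0)` would overflow; with it, large logits are handled without changing the answer.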

›_ Example — with code

activations.py · python
import torch, torch.nn.functional as F

z = torch.tensor([2.0, 1.0, 0.1])

F.relu(z)        # tensor([2.0, 1.0, 0.1])      negatives -> 0
torch.sigmoid(z) # tensor([0.88, 0.73, 0.52])   each squashed to (0,1)
torch.tanh(z)    # tensor([0.96, 0.76, 0.10])   squashed to (-1,1)
F.softmax(z, 0)  # tensor([0.66, 0.24, 0.10])   a probability distribution
F.gelu(z)        # smooth ReLU-like curve used in BERT/GPT

! Common mistake

Don't put a softmax or sigmoid on the output and also use a loss that applies it internally. CrossEntropyLoss in PyTorch and from_logits=True in Keras expect raw logits (no final activation). Applying softmax twice silently hurts training. Feed raw scores to those losses.

SECTION 06. Loss functions — measuring "how wrong"

A single number that scores the prediction against the truth. Training = making this number small.

The loss (or cost/objective) function takes the network's prediction and the true answer and outputs one number: bigger = more wrong. The entire goal of training is to minimise this number. Your choice of loss defines what "good" means, so it must match the task.

Task	Loss	What it measures
Regression (predict a number)	`MSE` (mean squared error)	Average squared gap between prediction and target
Regression, robust to outliers	`MAE` / Huber	Absolute gap; less punished by big errors
Binary classification	`Binary cross-entropy`	How surprised the model is by the true 0/1 label
Multi-class classification	`Cross-entropy`	How much probability mass landed on the wrong class

∑ The two you'll use most

MSE = (1/N) · Σ (ŷᵢ − yᵢ)²

CrossEntropy = − Σ yᵢ · log(ŷᵢ)

In cross-entropy, y is the true distribution (usually 1 for the correct class, 0 elsewhere) and ŷ the predicted probabilities. It heavily punishes confident wrong answers: predicting 0.01 for the true class gives a huge −log(0.01) penalty.

◆ Example — no code

Cross-entropy is "surprise". If the true label is cat and the model says cat with 99% confidence, it was barely surprised → tiny loss. If it confidently said dog at 99%, the truth is a shock → large loss. This asymmetry is exactly what you want: a model that is confidently wrong should be punished far more than one that was unsure. MSE behaves similarly for numbers — being off by 10 costs 100× more than being off by 1, because the gap is squared.
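
The "surprise" numbers can be computed directly. A small plain-Python sketch, assuming a two-class cat/dog problem with a one-hot target; the function names are illustrative.

```python
import math

def cross_entropy(true_index, probs):
    """Cross-entropy for a one-hot target: minus the log of the true class's probability."""
    return -math.log(probs[true_index])

CAT, DOG = 0, 1
confident_right = cross_entropy(CAT, [0.99, 0.01])   # model says cat at 99%
unsure          = cross_entropy(CAT, [0.50, 0.50])   # coin flip
confident_wrong = cross_entropy(CAT, [0.01, 0.99])   # model says dog at 99%
print(round(confident_right, 3), round(unsure, 3), round(confident_wrong, 3))
# 0.01 0.693 4.605

def squared_error(pred, target):
    return (pred - target) ** 2

# Being off by 10 costs 100x more than being off by 1.
print(squared_error(11.0, 1.0) / squared_error(2.0, 1.0))   # 100.0
```

The confidently wrong prediction costs several hundred times more than the confidently right one, while the unsure prediction sits in between.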

›_ Example — with code

losses.py · python
import torch, torch.nn as nn

# regression
pred = torch.tensor([3.2, 5.0]); target = torch.tensor([3.0, 4.5])
mse = nn.MSELoss()(pred, target)            # -> 0.145

# multi-class classification (3 classes, batch of 2)
logits = torch.tensor([[2.0, 0.5, 0.1],     # raw scores, NOT softmaxed
                       [0.2, 0.1, 3.0]])
labels = torch.tensor([0, 2])               # true class indices
ce = nn.CrossEntropyLoss()(logits, labels)  # applies softmax internally
print(mse.item(), ce.item())

SECTION 07. Backpropagation — the learning engine

How the network figures out which way to nudge every weight to lower the loss. The single most important algorithm in the field.

You have a loss number. Now what? You need to know, for each of the millions of weights, "if I increase this weight slightly, does the loss go up or down, and by how much?" That quantity is the gradient of the loss with respect to that weight. Backpropagation ("backprop") computes all of them in one efficient backward sweep.

It works in two phases:

Forward pass — push the input through the network, computing and remembering each intermediate value, until you reach the loss.
Backward pass — start from the loss and walk backwards, applying the chain rule (§3) at every step to push the gradient back to each weight. Each layer hands the layer before it the message "here's how much you contributed to the error."

Figure 5. Backprop = a forward pass (blue) to get the loss, then a backward pass (red, dashed) that uses the chain rule to compute ∂loss/∂W for every parameter. The optimizer (§8) then uses those gradients to update the weights.

◆ Example — no code

A factory ships a defective product (the loss). To assign blame, the manager walks the assembly line backwards: the packaging station was 10% responsible, the welding station 60%, the parts supplier 30%. Each station now knows exactly how much to adjust. Backprop is this blame-assignment, done with calculus: it distributes "responsibility for the error" back through every layer, in proportion to how much each weight influenced the outcome. The beautiful part: it reuses the calculations from later layers when computing earlier ones, so the whole backward pass costs only a small constant multiple of one forward pass (roughly twice the arithmetic).
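
The blame-assignment can be written out by hand for a tiny two-layer chain. This is a sketch with scalar weights and made-up values (all of them assumptions), meant only to show how the gradient from the later layer is reused for the earlier one.

```python
def forward(w1, w2, x, target):
    z1 = w1 * x                 # layer 1 pre-activation
    h = max(0.0, z1)            # ReLU
    y = w2 * h                  # layer 2 (linear output)
    loss = (y - target) ** 2
    return z1, h, y, loss

x, target = 3.0, 10.0
w1, w2 = 0.5, 2.0

# Forward pass: compute and remember the intermediates.
z1, h, y, loss = forward(w1, w2, x, target)

# Backward pass: start at the loss and walk towards the input.
dL_dy = 2.0 * (y - target)                    # -14.0, computed once and reused below
dL_dw2 = dL_dy * h                            # gradient for the last layer's weight
dL_dh = dL_dy * w2                            # "blame" handed back to layer 1
dL_dz1 = dL_dh * (1.0 if z1 > 0 else 0.0)     # through the ReLU
dL_dw1 = dL_dz1 * x                           # gradient for the first layer's weight

print(loss, dL_dw2, dL_dw1)                   # 49.0 -21.0 -84.0
```

Note that `dL_dy` is computed a single time and feeds both weight gradients; in a deep network that reuse at every layer is what keeps the backward sweep cheap.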

›_ Example — with code (autograd does it for you)

autograd.py · PyTorch
import torch

# w is a parameter we want gradients for
w = torch.tensor([2.0], requires_grad=True)
x = torch.tensor([3.0])

y    = w * x          # forward: 6.0
loss = (y - 10) ** 2  # target is 10 -> loss = (6-10)^2 = 16

loss.backward()       # BACKPROP: fills w.grad with d(loss)/dw
print(w.grad)         # tensor([-24.])  -> negative: increasing w lowers loss

# you almost never call backward yourself on toy expressions;
# in real training it's one line inside the loop (see section 10).

↗ Why it changed everything

Backprop (popularised 1986) made it computationally feasible to train deep networks. Modern frameworks build a computational graph of every operation during the forward pass, then automatically differentiate it — this is automatic differentiation (autodiff). You write only the forward math; the gradients come free.

SECTION 08. Gradient descent & optimizers

Backprop says which way is downhill. The optimizer decides how big a step to take, and how to be smart about it.

Picture the loss as a landscape of hills and valleys, with the network's weights as your coordinates. You want the lowest valley. The gradient points uphill, so you step the opposite way. That's gradient descent:

∇ The update rule

w ← w − η · (∂loss/∂w)

η (eta) is the learning rate — the step size, and the single most important hyperparameter you'll tune. Too large → you overshoot and diverge. Too small → training crawls or gets stuck. Typical values: 1e-3 to 1e-5.
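
The effect of the learning rate shows up even on a one-weight problem. The sketch below reuses the toy loss (3w − 10)² from the autograd example in §7; the three learning rates are assumptions picked to land in the "too small", "reasonable" and "too large" regimes for this particular loss.

```python
def grad(w):
    """Gradient of loss(w) = (3*w - 10)**2."""
    return 2.0 * (3.0 * w - 10.0) * 3.0

def descend(lr, steps=50, w=2.0):
    for _ in range(steps):
        w = w - lr * grad(w)        # w <- w - eta * dloss/dw
    return w

best = 10.0 / 3.0                   # the minimum, where 3*w = 10
for lr in (0.001, 0.05, 0.12):
    w_final = descend(lr)
    print(f"lr={lr}: w={w_final:.4f}, distance from optimum={abs(w_final - best):.4f}")
```

After 50 steps the smallest rate is still well short of the optimum, the middle one has converged, and the largest one has overshot further on every step and diverged. The thresholds depend on the curvature of the loss, so they do not transfer to other problems.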

Figure 6. Gradient descent rolls downhill on the loss surface. Each red dot is one update step; the step size is the learning rate. Real loss surfaces have millions of dimensions, but the intuition — follow the slope down — is the same.

Batch, stochastic, and mini-batch

Computing the gradient over the entire dataset every step is accurate but slow. Stochastic gradient descent (SGD) uses one example at a time (noisy, fast). Mini-batch gradient descent — the universal practical choice — uses a small batch (e.g. 32–256 examples), balancing speed and stability. One pass over the whole dataset is an epoch.

Smarter optimizers

Plain SGD can be slow and get stuck. Modern optimizers add tricks:

Momentum — accumulate a running velocity so you barrel through small bumps and flat regions, like a ball rolling downhill rather than taking timid steps.
Adam — the default for most work. Keeps a per-parameter adaptive learning rate (parameters with consistently small gradients get bigger steps) plus momentum. Robust and forgiving.
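AdamW — Adam with weight decay decoupled from the gradient update (applied directly to the weights rather than folded into the loss as an L2 term); the standard for training transformers.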
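
The update rules behind these optimizers are short enough to write by hand. Below is a single-parameter sketch on the toy loss (3w − 10)²; the hyperparameter defaults are assumptions for the demo, and real implementations apply the same arithmetic to every tensor element.

```python
import math

def grad(w):
    return 2.0 * (3.0 * w - 10.0) * 3.0          # d/dw of (3w - 10)^2

def sgd_momentum(w, lr=0.01, beta=0.9, steps=200):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)     # running velocity
        w -= lr * velocity
    return w

def adam(w, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * weight_decay * w               # AdamW-style decay, applied directly to w
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)   # step scaled per parameter
    return w

best = 10.0 / 3.0
print(abs(sgd_momentum(2.0) - best), abs(adam(2.0) - best))   # both end close to 0
```

Dividing by the square root of `v_hat` is what gives parameters with consistently small gradients a relatively larger step. With `weight_decay` left at 0 the function is plain Adam; a positive value pulls the weight towards zero on every step, independent of the gradient.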

›_ Example — with code

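optimizers.py · PyTorch
import torch
import torch.nn as nn

# placeholder model and batch so the snippet runs on its own
model = nn.Linear(3, 1)
x, y = torch.randn(8, 3), torch.randn(8, 1)

# pass the model's parameters and a learning rate
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# one optimization step (inside the training loop):
loss = nn.functional.mse_loss(model(x), y)   # forward pass + loss
opt.zero_grad()     # clear old gradients (they accumulate otherwise!)
loss.backward()     # backprop: compute new gradients
opt.step()          # apply the update rule to every weight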

# Keras equivalent:
# model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")

! The #1 silent bug

Forgetting opt.zero_grad(). PyTorch accumulates gradients by default, so if you skip it, this step's gradients pile on top of the last step's — and training quietly goes haywire. Always zero, then backward, then step.

SECTION 09. Regularization & generalization

The real goal isn't a low training loss — it's performing well on data the model has never seen. These techniques fight memorization.

A network with millions of parameters can simply memorise the training set, scoring perfectly on it while failing on new data. That's overfitting. The opposite — too simple to capture the pattern at all — is underfitting. The art is landing in between: generalization.

Figure 7. The tell-tale sign of overfitting: training loss keeps dropping while validation loss bottoms out and then climbs. The gap between the two curves is the overfit. Stopping at the green line (early stopping) keeps the best-generalizing model.

The toolkit

Technique	What it does
Train/val/test split	Hold out data the model never trains on, so you can measure true generalization.
Dropout	Randomly zero out a fraction of neurons each step, forcing the network not to rely on any single unit. A regularizer that mimics training an ensemble.
Weight decay (L2)	Penalize large weights, nudging the model toward simpler functions.
Early stopping	Stop training when validation loss stops improving (Figure 7).
Data augmentation	Create new training examples by transforming existing ones (flip/crop/rotate images, paraphrase text). More effective variety = less memorization.
Batch normalization	Normalize each layer's inputs per mini-batch. Stabilizes & speeds training, with a mild regularizing side-effect.
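
Of the techniques in the table, early stopping is the simplest to sketch without a framework. The helper below and its `patience` parameter are illustrative assumptions (a common formulation, not a specific library API), and the validation curve is made up to resemble Figure 7.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to beat the best loss seen.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: the checkpoint to keep
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch          # stop here, restore the best checkpoint
    return best_epoch, len(val_losses) - 1        # never triggered: ran to the end

# Made-up validation curve: falls, bottoms out, then climbs.
curve = [0.90, 0.62, 0.48, 0.41, 0.39, 0.40, 0.43, 0.47, 0.52]
print(early_stopping_epoch(curve))                # (4, 7)
```

In a real run you would save the model weights whenever a new best is recorded and reload them after stopping, since the weights at the stop epoch are already past the best point.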

◆ Example — no code

A student who memorises last year's exam answers aces the practice test but bombs the real exam with new questions — that's overfitting. Dropout is like randomly making some study notes unavailable each night: the student is forced to understand the material rather than lean on one memorised sheet. Data augmentation is studying many rephrased versions of each problem. Early stopping is putting the books down once mock-exam scores stop improving, before you start over-memorising trivia.

›_ Example — with code

regularization.py · PyTorch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),    # batch normalization
    nn.ReLU(),
    nn.Dropout(p=0.3),      # zero 30% of activations during training
    nn.Linear(256, 10),
)

# weight decay is set on the optimizer (AdamW applies it directly to the weights):
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# IMPORTANT: dropout/batchnorm behave differently in train vs eval!
model.train()   # enables dropout + batchnorm updates
# ... training ...
model.eval()    # disables dropout, freezes batchnorm stats for inference

! Don't forget the mode switch

Call model.train() before training and model.eval() before evaluating/inference. Forgetting eval() leaves dropout active at test time and makes batchnorm use batch stats — a classic cause of "my accuracy is randomly worse at inference."
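
Why the mode matters becomes clearer once dropout is written out. This is a plain-Python sketch of inverted dropout, the scheme in which survivors are scaled up by 1/(1 − p) during training so that nothing needs rescaling at inference; the function and its arguments are illustrative, not a library API.

```python
import random

def dropout(activations, p=0.3, training=True, rng=random):
    """Inverted dropout: zero each value with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)              # eval mode: pass values through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                        # fixed seed so the demo is repeatable
acts = [1.0] * 10000
train_out = dropout(acts, p=0.3, training=True, rng=rng)
eval_out = dropout(acts, p=0.3, training=False)

zeroed = train_out.count(0.0) / len(train_out)
mean_train = sum(train_out) / len(train_out)
print(round(zeroed, 2), round(mean_train, 2), eval_out[:3])
```

In training mode about 30% of the values are zeroed while the average stays near 1; in eval mode the input passes through untouched. Leaving the training branch on at inference therefore injects random noise into every prediction.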

SECTION 10. The training loop — everything together

Sections 4–9 in one place. Memorize this rhythm and you can train anything.

All the pieces now connect into a single repeating cycle. Internalise these five steps and the rest of deep learning is variations on the theme:

Forward — run a mini-batch through the model to get predictions.
Loss — compare predictions to targets.
Backward — backprop to compute gradients (after zeroing old ones).
Step — optimizer updates the weights.
Repeat — over every batch, for many epochs, validating periodically.

Figure 8. The canonical training loop. This same five-step cycle trains a tiny MLP and a billion-parameter transformer alike — only the model and data change.

›_ Example — a complete, runnable PyTorch loop

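train.py · PyTorch
import torch, torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model so the script runs end to end; swap in your own.
EPOCHS = 3
X, y = torch.randn(1000, 20), torch.randint(0, 3, (1000,))   # 20 features, 3 classes
train_loader = DataLoader(TensorDataset(X[:800], y[:800]), batch_size=32, shuffle=True)
val_loader   = DataLoader(TensorDataset(X[800:], y[800:]), batch_size=32)

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
    def forward(self, x):
        return self.net(x)                       # raw logits

device = "cuda" if torch.cuda.is_available() else "cpu"
model  = MyModel().to(device)
loss_fn = nn.CrossEntropyLoss()
opt    = torch.optim.AdamW(model.parameters(), lr=1e-3)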

for epoch in range(EPOCHS):
    # ---- train ----
    model.train()
    for xb, yb in train_loader:                 # mini-batches
        xb, yb = xb.to(device), yb.to(device)
        preds  = model(xb)                       # 1. forward
        loss   = loss_fn(preds, yb)              # 2. loss
        opt.zero_grad()                          #    clear old grads
        loss.backward()                          # 3. backward
        opt.step()                               # 4. update weights

    # ---- validate ----
    model.eval()
    correct = 0
    with torch.no_grad():                        # no gradients needed
        for xb, yb in val_loader:
            xb, yb = xb.to(device), yb.to(device)
            correct += (model(xb).argmax(1) == yb).sum().item()
    print(f"epoch {epoch}: val acc = {correct/len(val_loader.dataset):.3f}")

›_ The same in Keras — the loop is hidden

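train_keras.py · TensorFlow / Keras
# Keras wraps the loop in .fit(): convenient, less explicit.
# The string loss below expects probabilities, so the model must end in softmax;
# for raw logits use keras.losses.SparseCategoricalCrossentropy(from_logits=True).
model.compile(optimizer="adamw", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)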

↗ The two philosophies

PyTorch makes you write the loop — more code, total control, easy to debug, dominant in research. Keras hides it in .fit() — less code, faster to prototype. 🤗 Transformers' Trainer (§16) is a third option that handles the loop and distributed training, logging, and checkpoints for you. Learn the explicit loop first; the magic ones make sense once you know what they're hiding.

SECTION 11. Convolutional networks — seeing

The architecture that cracked computer vision. It exploits the fact that in images, nearby pixels are related and patterns repeat.

Why not just feed an image into an MLP? A modest 224×224 RGB image is 150,528 numbers; a single dense layer to 1,000 units would need 150 million weights — and it would treat a cat in the top-left as completely unrelated to a cat in the bottom-right. Convolutional neural networks (CNNs) fix both problems with two ideas:

Local connectivity. A small filter (kernel), e.g. 3×3, slides across the image looking at one little patch at a time. Vision is local — an edge is defined by a few neighbouring pixels.
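Weight sharing. The same filter is reused at every position. If a filter learns to detect a vertical edge, it detects vertical edges everywhere: drastically fewer parameters, and built-in translation equivariance (shift the input and the feature map shifts with it), which pooling then turns into approximate translation invariance.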

Figure 9. A convolution: a small filter slides over the image computing a dot product at each position, producing a feature map that lights up where the filter's pattern appears. A conv layer has many filters, each learning a different pattern (edges, corners, textures).
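
The sliding-filter operation can be written as two loops. This NumPy sketch uses a hand-written vertical-edge kernel and a toy image (both assumptions for illustration; a trained network learns its own filters). Like deep-learning frameworks, it does not flip the kernel, so strictly speaking it is a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), one dot product per position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)     # same weights reused at every position
    return out

# A 5x6 image: dark on the left (0), bright on the right (1), so one vertical edge.
image = np.array([[0, 0, 0, 1, 1, 1]] * 5, dtype=float)

# A hand-written vertical-edge filter.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)     # (3, 4)
print(feature_map[0])        # [0. 3. 3. 0.]
```

The feature map is zero over the flat regions and positive exactly where the window straddles the dark-to-bright edge, which is the "lights up where the pattern appears" behaviour described in the caption.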

The full CNN recipe

A typical CNN alternates convolution → activation → pooling, repeated, then flattens into a small MLP head for the final prediction:

Conv layers extract features; early ones find edges, deeper ones find object parts.
Pooling (e.g. max-pool 2×2) downsamples the feature maps, shrinking spatial size, cutting compute, and adding robustness to small shifts.
Flatten + dense head turns the final feature maps into class scores.
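
Tracking the spatial size through such a pipeline is simple arithmetic. The sketch below assumes a 32×32 RGB input and two conv 3×3 (padding 1) plus max-pool 2×2 stages with 32 and 64 filters; those sizes are assumptions for the walk-through, and `conv_out` is an illustrative helper implementing the standard output-size formula.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial size after a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

size, channels = 32, 3                      # assumed input: 32x32 RGB
size = conv_out(size, kernel=3, padding=1)  # conv 3x3, padding 1 -> 32
channels = 32
size = conv_out(size, kernel=2, stride=2)   # max-pool 2x2 -> 16
size = conv_out(size, kernel=3, padding=1)  # conv 3x3, padding 1 -> 16
channels = 64
size = conv_out(size, kernel=2, stride=2)   # max-pool 2x2 -> 8
flat_features = channels * size * size      # what the flatten step hands to the dense head
print(size, flat_features)                  # 8 4096

# Parameter counts: weight sharing versus a dense layer on a 224x224 RGB image.
conv_params = 3 * 32 * 3 * 3 + 32           # 32 filters of 3x3x3, plus 32 biases
dense_params = 224 * 224 * 3 * 1000 + 1000  # every input value to each of 1,000 units
print(conv_params, dense_params)            # 896 150529000
```

The first dense layer after the flatten must accept exactly `flat_features` inputs, so a different input resolution means a different head size.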

Figure 10. A classic CNN pipeline. As data flows right, the spatial resolution decreases while the feature depth increases — the network trades "where" for "what".

◆ Example — no code

Recognising a face. The first conv layer's filters fire on tiny edges and color blobs. The next layer combines edges into eyes, noses, mouth corners. A deeper layer combines those parts into "a face arrangement". Pooling between them means it doesn't matter if the face is a few pixels left or right — the same eye-detector fires regardless. You designed none of these detectors; the network grew them because they reduced the loss on labelled faces.

›_ Example — with code (a small image classifier)

cnn.py · PyTorch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),  # 3 RGB in -> 32 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # halve spatial size
    nn.Conv2d(32, 64, kernel_size=3, padding=1), # 32 -> 64 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 128), nn.ReLU(),       # dense head; 8x8 assumes 32x32 inputs
    nn.Linear(128, 10),                          # 10 classes
)
# For real projects you rarely build from scratch — you fine-tune a
# pretrained ResNet/EfficientNet (see Section 15 on transfer learning).

↗ Practical reality

Almost nobody trains a vision CNN from scratch anymore. You take a model pretrained on ImageNet (ResNet, EfficientNet, or a Vision Transformer) and fine-tune it on your data — better accuracy with a fraction of the data and compute. More on this in §15.

SECTION 12. Recurrent networks — sequences & memory

For data with order — text, speech, time series. The idea that ruled NLP before transformers, and still worth understanding.

An MLP and a CNN see a fixed-size input all at once. But language and time series are sequences of arbitrary length where order matters ("dog bites man" ≠ "man bites dog"). A recurrent neural network (RNN) processes a sequence one step at a time, carrying a hidden state — a running memory — from one step to the next. At each step it combines the new input with the memory of everything seen so far.

Figure 11. An RNN "unrolled" over a sentence. It's actually one cell reused at each timestep; the teal arrow is the hidden state passing memory forward. The same weights process every word — like reading left to right, updating your understanding as you go.
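
A single recurrent cell reused across timesteps looks like this in NumPy. The sizes, the random weights and the stand-in "word vectors" are all assumptions for the sketch; a real model would learn both the weights and the embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# One set of weights, shared by every timestep.
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))    # input  -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))   # hidden -> hidden (memory path)
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """Combine the new input with the running memory."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def run(sequence):
    h = np.zeros(hidden_dim)             # empty memory before the first token
    for x_t in sequence:                 # one step per element, in order
        h = rnn_step(x_t, h)
    return h

# Stand-in word vectors (random placeholders).
dog, bites, man = rng.normal(size=(3, input_dim))

h_a = run([dog, bites, man])
h_b = run([man, bites, dog])
print(h_a.shape)                         # (3,)
print(np.allclose(h_a, h_b))             # False: same words, different order
```

Because `run` simply loops, it accepts sequences of any length, and the final hidden state depends on the order in which the inputs arrived.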

The vanishing gradient problem & the LSTM fix

Plain RNNs struggle to remember things from far back: during backprop the gradient is multiplied at every timestep and shrinks toward zero over long sequences — the vanishing gradient problem. By the time it reaches the start of a long sentence, the learning signal has evaporated.
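
A back-of-the-envelope sketch shows how quickly repeated multiplication erases the signal. The per-step factors 0.9 and 1.1 are assumptions standing in for the local derivative at each timestep; in a real RNN the factor varies from step to step.

```python
def gradient_after(steps, factor):
    """Size of a gradient of 1.0 after being multiplied by `factor` at every timestep."""
    g = 1.0
    for _ in range(steps):
        g *= factor
    return g

for steps in (10, 50, 100):
    print(steps, gradient_after(steps, 0.9), gradient_after(steps, 1.1))
```

With a factor slightly below 1 the gradient is down to a few hundred-thousandths after 100 steps. The mirror case, a factor slightly above 1, grows into the thousands instead, which is the related exploding gradient problem.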

The LSTM (Long Short-Term Memory) and the simpler GRU solve this with gates — small neural mechanisms that learn what to keep, what to forget, and what to output from a protected "cell state" that flows through largely unchanged. This lets information survive across hundreds of steps.

◆ Example — no code

Reading "The keys to the cabinet … are on the table." To choose "are" over "is", the model must remember, across many intervening words, that the subject was plural ("keys"). A plain RNN tends to forget by then. An LSTM's forget gate learns to hold onto "subject = plural" in its cell state until the verb arrives, then uses it. The gates are themselves learned — the network discovers what's worth remembering.

›_ Example — with code

lstm.py · PyTorch
import torch, torch.nn as nn

# classify the sentiment of a sequence of token ids (embedded inside the model)
class SentimentLSTM(nn.Module):
    def __init__(self, vocab, dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)        # token id -> vector
        self.lstm  = nn.LSTM(dim, hidden, batch_first=True)
        self.head  = nn.Linear(hidden, 2)            # pos / neg
    def forward(self, x):                            # x: (batch, seq_len)
        e = self.embed(x)                            # (batch, seq, dim)
        out, (h_n, c_n) = self.lstm(e)               # h_n = final memory
        return self.head(h_n[-1])                    # classify from it

↗ Where RNNs stand today

Transformers (§13) have largely replaced RNNs for language because they process a whole sequence in parallel and model long-range dependencies better. RNNs/LSTMs are still useful for streaming data, very long time series on tight compute, and on-device settings — and understanding them clarifies why attention was such a leap.

SECTION 13Transformers & attention

The architecture behind GPT, BERT, and essentially every modern large model. Understand attention and you understand the engine of the AI boom.

In 2017 the paper "Attention Is All You Need" introduced the Transformer, discarding recurrence entirely. Its core insight: instead of passing memory step-by-step, let every token look directly at every other token and decide how much each one matters for understanding it. That mechanism is self-attention. Because all tokens are processed at once, transformers parallelise beautifully on GPUs — which is what made training enormous models practical.

Self-attention: Query, Key, Value

Each token produces three vectors via learned projections:

Query (Q) — "what am I looking for?"
Key (K) — "what do I offer / what am I about?"
Value (V) — "the actual information I'll pass on if attended to."

A token's query is compared (dot product) against every token's key to produce attention scores — how relevant each other token is. Softmax turns those into weights that sum to 1, and the output is the weighted sum of all the values. In short: each token gathers a custom blend of information from the whole sequence, weighted by relevance.

∑ Scaled dot-product attention

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

The √dₖ divisor keeps the dot products from growing too large (which would saturate the softmax). Multi-head attention runs several of these in parallel, each head free to focus on a different kind of relationship (syntax, coreference, topic), then concatenates them.
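
As a rough sketch of the multi-head idea, the model dimension can be split into several smaller heads, each running scaled dot-product attention independently before the results are concatenated. The output projection `Wo` and all sizes are assumptions added for illustration (an output projection is standard practice, but it is not described above).

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """x: (seq, dim). Split dim into n_heads heads, attend in each, then recombine."""
    seq, dim = x.shape
    d_head = dim // n_heads

    def split(t):                          # (seq, dim) -> (heads, seq, d_head)
        return t.view(seq, n_heads, d_head).transpose(0, 1)

    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = Q @ K.transpose(-2, -1) / d_head ** 0.5     # (heads, seq, seq)
    weights = F.softmax(scores, dim=-1)                  # each row sums to 1
    heads = weights @ V                                  # (heads, seq, d_head)
    merged = heads.transpose(0, 1).reshape(seq, dim)     # concatenate the heads
    return merged @ Wo, weights

torch.manual_seed(0)
seq, dim, n_heads = 6, 32, 4
x = torch.randn(seq, dim)
Wq, Wk, Wv, Wo = (torch.randn(dim, dim) / dim ** 0.5 for _ in range(4))
out, weights = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape, weights.shape)
```

Each of the 4 heads gets its own (seq, seq) weight matrix, so different heads are free to attend to different tokens for the same query position.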

Figure 12. Self-attention resolving a pronoun. When processing "it", the model's query matches the key of "animal" most strongly (thick line), so it pulls in that token's value — correctly linking "it" to "animal". Each token computes such a weighted view of the whole sentence.

The full transformer block

A transformer is a stack of identical blocks. Each block is: multi-head self-attention → add & normalize → feed-forward MLP → add & normalize. Two more pieces make it work:

Positional encoding. Since attention sees the sequence as an unordered set, we add a signal encoding each token's position, so "dog bites man" differs from "man bites dog".
Residual connections + layer norm. The "add & normalize" steps let gradients flow through very deep stacks (dozens of blocks) without vanishing — the trick that lets transformers be huge.
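
The block and the positional signal described above can be sketched as follows. This uses the fixed sine/cosine encoding and the "add & normalize after each sub-layer" ordering from the text; the class name, sizes, and GELU activation are assumptions, and many GPT-style models instead normalize before each sub-layer.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Fixed sine/cosine position signal, one row per position (dim must be even)."""
    pos = torch.arange(seq_len).unsqueeze(1)                       # (seq, 1)
    freq = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

class TransformerBlock(nn.Module):
    def __init__(self, dim=64, n_heads=4, mlp_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        attn_out, _ = self.attn(x, x, x)       # self-attention: Q, K, V all come from x
        x = self.norm1(x + attn_out)           # add (residual) & normalize
        x = self.norm2(x + self.mlp(x))        # feed-forward, then add & normalize
        return x

torch.manual_seed(0)
tokens = torch.randn(2, 10, 64)                # stand-in for token embeddings
pe = sinusoidal_positions(10, 64)
x = tokens + pe                                # inject order information
block = TransformerBlock().eval()
with torch.no_grad():
    y = block(x)
print(y.shape)
```

Because input and output shapes match, blocks like this can be stacked N times without any glue code.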

Figure 13. One transformer block. Stack N of these (GPT-class models use dozens to over a hundred). The red residual paths and LayerNorm are what make such depth trainable.

Encoder, decoder, and the model families

Family	Structure	Best at	Examples
Encoder-only	Reads the whole input at once (bidirectional)	Understanding: classification, embeddings, search	BERT, RoBERTa
Decoder-only	Predicts the next token left-to-right (causal)	Generation: chat, completion, code	GPT, Llama, Mistral
Encoder–decoder	Encode input, then generate output	Translation, summarization (seq-to-seq)	T5, BART

◆ Example — no code

Think of self-attention as a meeting where everyone can hear everyone. To decide what "it" refers to, the word "it" effectively asks the room, "who here is a noun I might be standing in for?" Every other word answers with how relevant it is; "animal" answers loudest. "It" then updates its understanding by blending in mostly "animal". A decoder (like GPT) is the same meeting but with a rule: you may only listen to words before you, never after — which is exactly what's needed to predict the next word one at a time.

›_ Example — self-attention from scratch

attention.pyPyTorch
import torch, torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # x: (seq_len, dim)  — one sequence of token vectors
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # learned projections
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / d_k ** 0.5   # (seq, seq)
    weights = F.softmax(scores, dim=-1)        # how much each token attends
    return weights @ V                         # weighted blend of values

# In practice you use the built-in, optimized version:
# torch.nn.MultiheadAttention, or just load a pretrained transformer (§16).
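
The decoder rule from the meeting analogy ("you may only listen to words before you") is a small change to the same function: mask out future positions before the softmax. A minimal sketch, with the function name and sizes chosen for illustration:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, Wq, Wk, Wv):
    """Self-attention where token i may only attend to tokens 0..i."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = (Q @ K.transpose(-2, -1)) / Q.size(-1) ** 0.5        # (seq, seq)
    seq = x.size(0)
    future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))            # hide later tokens
    weights = F.softmax(scores, dim=-1)                           # -inf becomes weight 0
    return weights @ V, weights

torch.manual_seed(0)
seq, dim = 5, 16
x = torch.randn(seq, dim)
Wq, Wk, Wv = (torch.randn(dim, dim) / dim ** 0.5 for _ in range(3))
out, weights = causal_self_attention(x, Wq, Wk, Wv)

# Changing the last token must not affect any earlier position's output.
x_changed = x.clone()
x_changed[-1] += 1.0
out_changed, _ = causal_self_attention(x_changed, Wq, Wk, Wv)
print(torch.allclose(out[:-1], out_changed[:-1]))
```

The final check is the property that makes next-token training valid: an earlier position's output cannot leak information from later tokens.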

↗ Why this won

Transformers removed the sequential bottleneck of RNNs (parallel training), handle long-range dependencies directly (any token to any token in one step), and scale predictably — bigger model + more data + more compute reliably yields better performance. That predictable scaling is precisely what fuelled the era of large language models.

SECTION 14Embeddings — meaning as geometry

How networks turn words, images, or anything discrete into vectors where distance equals similarity. The quiet idea behind search, RAG, and recommendations.

A network can't multiply the word "king". An embedding maps each discrete item (a word, a product, a user) to a dense vector of, say, 768 numbers. Crucially, these vectors are learned so that items used in similar ways end up near each other in the space. Meaning becomes geometry: similarity is just distance.

∑ The famous example

vec("king") − vec("man") + vec("woman") ≈ vec("queen")

Relationships become directions in the space. The "royalty" and "gender" concepts emerge as consistent vector offsets — never programmed, just a by-product of training on how words co-occur.
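
The arithmetic can be tried on a toy space. The vectors below are hand-made 2-D stand-ins (one axis loosely "royalty", the other "gender"), not learned embeddings, so this only illustrates the mechanics of the analogy lookup.

```python
import torch
import torch.nn.functional as F

# Hand-made toy vectors; real embeddings are learned and have hundreds of dimensions.
toy = {
    "king":   torch.tensor([0.9,  0.8]),
    "queen":  torch.tensor([0.9, -0.8]),
    "man":    torch.tensor([0.1,  0.8]),
    "woman":  torch.tensor([0.1, -0.8]),
    "prince": torch.tensor([0.7,  0.7]),
}

def analogy(a, b, c, vectors):
    """Word nearest (by cosine) to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates,
               key=lambda w: F.cosine_similarity(target, candidates[w], dim=0).item())

result = analogy("king", "man", "woman", toy)
print(result)   # queen
```

Excluding the three input words from the candidates is the usual convention; without it the nearest vector is often one of the inputs themselves.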

Figure 14. Embeddings place related items close together and encode relationships as consistent directions. Real spaces have hundreds of dimensions; this is a 2-D cartoon of the idea.

◆ Example — no code

This is how semantic search and RAG (retrieval-augmented generation) work. You embed every document in your knowledge base into vectors and store them. When a user asks a question, you embed the question too, then find the document vectors nearest to it — those are the most relevant passages, even if they share no exact keywords ("car trouble" finds a doc about "vehicle won't start"). Recommendation systems do the same with products and users.

›_ Example — with code (sentence embeddings + similarity)

embeddings.py🤗 sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # small, fast embedder
docs = ["How do I reset my password?",
        "Ways to recover account access",
        "What's the weather in Paris?"]
emb = model.encode(docs, convert_to_tensor=True)  # (3, 384) vectors

query = model.encode("I forgot my login", convert_to_tensor=True)
scores = util.cos_sim(query, emb)                 # cosine similarity
print(scores)  # highest for the first two docs, low for the weather one

SECTION 15Transfer learning & fine-tuning

Don't start from zero. Take a model that already learned general features on a huge dataset, and adapt it to your task with a fraction of the data.

Training a large model from scratch needs millions of examples and serious compute. Transfer learning sidesteps this: start from a model pretrained on a giant generic dataset (ImageNet for vision, web-scale text for language), then adapt it to your specific task. The pretrained model already "knows" edges, shapes, grammar, and facts — you just teach it your last mile.

Two main strategies:

Feature extraction. Freeze the pretrained body, replace only the final layer (the "head"), and train just that head on your data. Fast, needs little data, lower ceiling.
Fine-tuning. Unfreeze some or all of the pretrained weights and continue training them (at a small learning rate) on your data. More data and compute, higher accuracy.

Figure 15. Transfer learning: reuse the pretrained body, attach a fresh head for your task. This is how the vast majority of real-world deep learning gets done today.

◆ Example — no code

You want to classify 10 species of local birds but have only 500 photos — nowhere near enough to train a vision model from scratch. Instead you take a ResNet that already learned, from 1.2 million ImageNet images, what feathers, beaks, edges, and textures look like. You snip off its 1,000-class head, bolt on a fresh 10-class head, and train. Because the hard, general visual work is already done, your 500 photos are enough to get strong accuracy. It's standing on the shoulders of a model that already learned to see.

›_ Example — with code (fine-tune a pretrained CNN)

finetune_vision.pyPyTorch / torchvision
import torch, torch.nn as nn
from torchvision import models

# 1. load a model pretrained on ImageNet
net = models.resnet50(weights="IMAGENET1K_V2")

# 2. freeze the body (feature extraction)
for p in net.parameters():
    p.requires_grad = False

# 3. replace the head for OUR 10 bird classes (this part trains)
net.fc = nn.Linear(net.fc.in_features, 10)

# 4. train only the new head with the standard loop from Section 10
opt = torch.optim.Adam(net.fc.parameters(), lr=1e-3)
# To FINE-TUNE instead: unfreeze later layers and use a tiny LR (e.g. 1e-5).
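
A sketch of the fine-tuning variant mentioned in the last comment: unfreeze only the later layers and give them a much smaller learning rate than the fresh head. To stay self-contained it uses a small stand-in network rather than a downloaded ResNet; `body` and `head` are hypothetical names.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained network: `body` plays the pretrained layers, `head` the new classifier.
body = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 10)
model = nn.Sequential(body, head)

# Freeze everything, then unfreeze only the last body layer and the head.
for p in model.parameters():
    p.requires_grad = False
for p in body[2].parameters():
    p.requires_grad = True
for p in head.parameters():
    p.requires_grad = True

# Two parameter groups: a tiny LR for pretrained weights, a larger one for the fresh head.
opt = torch.optim.Adam([
    {"params": body[2].parameters(), "lr": 1e-5},
    {"params": head.parameters(),    "lr": 1e-3},
])

torch.manual_seed(0)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
frozen_before = body[0].weight.clone()
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
print(torch.equal(body[0].weight, frozen_before))   # the frozen layer did not move
```

With a torchvision ResNet the same pattern would apply to its last stage and `fc` layer; the small learning rate keeps the pretrained features from being overwritten by a few noisy early updates.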

↗ The modern default workflow

Pretrain (or download) → fine-tune → deploy. For language, this is exactly what creating a custom chatbot or classifier looks like: start from a pretrained LLM and fine-tune on your domain. Parameter-efficient methods like LoRA fine-tune only a tiny set of added weights, making it cheap to adapt even billion-parameter models on a single GPU.

SECTION 16Hugging Face & the pretrained ecosystem

The practical fast-track. Thousands of ready-to-use models, three lines of code to inference, a standard recipe to fine-tune.

The 🤗 Transformers library is the de-facto hub for pretrained models. You rarely build a transformer by hand; you download one of hundreds of thousands of community models from the Hugging Face Hub and either use it directly or fine-tune it. Three abstraction levels, from easiest to most controlled:

Level 1 — `pipeline`: inference in 3 lines

›_ Example — with code

pipeline.py🤗 Transformers
from transformers import pipeline

clf = pipeline("sentiment-analysis")          # downloads a model for you
print(clf("I absolutely loved this guide!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]

# the same API covers dozens of tasks:
pipeline("summarization")
pipeline("question-answering")
pipeline("text-generation", model="gpt2")
pipeline("zero-shot-classification")          # classify into labels you invent
pipeline("automatic-speech-recognition")      # audio -> text

Level 2 — tokenizer + model: full control over inference

When you need the logits, embeddings, or custom decoding, load the tokenizer (which turns text into the integer token IDs the model expects) and the model separately.

›_ Example — with code

model_direct.py🤗 Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok   = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

inputs = tok("Transformers make this easy.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # raw scores
probs = logits.softmax(-1)                    # -> probabilities
print(model.config.id2label[probs.argmax().item()])   # 'POSITIVE'

Level 3 — `Trainer`: fine-tune on your own data

The Trainer handles the entire training loop, evaluation, checkpointing, mixed precision, and multi-GPU — you supply a dataset and a config.

›_ Example — with code (fine-tuning)

finetune_text.py🤗 Transformers
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset

ds  = load_dataset("imdb")                                  # 1. data
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ds  = ds.map(lambda b: tok(b["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(  # 2. model
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="out", num_train_epochs=2,  # 3. config
    per_device_train_batch_size=16, eval_strategy="epoch",
    learning_rate=2e-5, fp16=True)

Trainer(model=model, args=args,                              # 4. train
        train_dataset=ds["train"], eval_dataset=ds["test"],
        processing_class=tok).train()   # named `tokenizer=` in older Transformers releases

◆ Example — no code

Hugging Face is to deep learning what package managers (npm, pip) are to software: instead of writing a sentiment model, an OCR model, and a translator from scratch — each months of work — you install pretrained ones and compose them. The Hub is the "app store" of models; pipeline is the one-click install; Trainer is the recipe for teaching a downloaded model your specific job. This is why a small team can ship serious NLP in days.

↗ How to pick a model on the Hub

Match the task (text-classification, summarization, ASR…), check the size vs your hardware (a 7B-param model needs ~14 GB VRAM in fp16), prefer models with a permissive license and good downloads/likes, and read the model card for its training data and known limitations. Smaller distilled models (DistilBERT, MiniLM) are often the right call for production latency.
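
The "size vs your hardware" check is simple arithmetic: parameters times bytes per parameter. A small helper, assuming 1 GB = 1e9 bytes and counting weights only (activations, KV cache, and optimizer state come on top):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B @ {bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 7B @ 32-bit: 28.0 GB
# 7B @ 16-bit: 14.0 GB
# 7B @  8-bit: 7.0 GB
# 7B @  4-bit: 3.5 GB
```

Treat the result as a lower bound when sizing a GPU; leaving headroom beyond the raw weight size is a sensible default.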

SECTION 17Generative models

Networks that don't just classify — they create. The lineage behind image generators, voice cloning, and beyond.

Everything so far has been discriminative: map an input to a label or number. Generative models learn the underlying distribution of the data so they can produce new samples that look like it — new images, audio, molecules, text. Three influential families:

Autoencoders & VAEs

An autoencoder squeezes input through a narrow bottleneck (the latent code) and reconstructs it. By forcing everything through a few numbers, it learns a compressed, meaningful representation. A variational autoencoder (VAE) makes the latent space smooth and continuous, so you can sample new points from it and decode them into novel outputs.

GANs — the adversarial game

A generative adversarial network pits two networks against each other. The generator tries to produce fakes; the discriminator tries to tell fakes from real. They train together — as the discriminator gets sharper, the generator is forced to produce more convincing samples, until the fakes are nearly indistinguishable from real data.

Diffusion models — the modern image engine

Diffusion models (behind most state-of-the-art image generators) learn to reverse a noising process. During training, real images are progressively corrupted with noise until they're pure static; the model learns to undo one step of noising. To generate, you start from pure noise and run the learned denoiser repeatedly, and a coherent image gradually emerges. Conditioning the denoiser on a text embedding gives you text-to-image.
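
One common way to set this up (a DDPM-style sketch, assumed here rather than taken from the text) is to mix clean data with Gaussian noise according to a schedule and train the network to predict the noise that was added. The linear schedule values, the toy 16-D "images", and the tiny MLP denoiser are all illustrative assumptions.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # noise added at each step (linear schedule)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)      # fraction of original signal left after t steps

def add_noise(x0, t, noise):
    """Jump straight to step t of the noising process: mix clean data with Gaussian noise."""
    a = alpha_bar[t].sqrt().view(-1, 1)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1)
    return a * x0 + s * noise

# Toy denoiser on 16-D vectors instead of images; it sees the noisy sample plus the timestep.
denoiser = nn.Sequential(nn.Linear(16 + 1, 64), nn.ReLU(), nn.Linear(64, 16))

torch.manual_seed(0)
x0 = torch.randn(32, 16)                           # stand-in for a batch of clean data
t = torch.randint(0, T, (32,))                     # a random noise level per sample
noise = torch.randn_like(x0)
x_t = add_noise(x0, t, noise)

# Training objective: predict the noise that was added, so it can later be removed.
inp = torch.cat([x_t, t.float().view(-1, 1) / T], dim=1)
loss = nn.functional.mse_loss(denoiser(inp), noise)
loss.backward()
print(alpha_bar[0].item(), alpha_bar[-1].item())
```

`alpha_bar` starts near 1 (almost all signal) and ends near 0 (almost pure static), which matches the "progressively corrupted until pure static" description.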

Figure 16. Diffusion. The model is trained to remove noise (top, red). To generate, you start from pure noise and apply that learned denoiser step by step (bottom, teal), reversing the noising process until an image materializes, optionally steered by a text prompt.

◆ Example — no code

GAN as an art forger and detective: a forger paints fakes, a detective judges them. Each time the detective spots a fake, the forger learns and improves; each new forgery trains the detective to be pickier. After thousands of rounds the forgeries are gallery-quality. Diffusion as a sculptor: imagine starting with a block of TV static and "carving away" the noise that doesn't belong, a little each pass, guided by the prompt "a cat in a spacesuit", until a clean image of exactly that remains.

›_ Example — with code (text-to-image with diffusers)

generate_image.py🤗 diffusers
from diffusers import StableDiffusionPipeline
import torch

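pipe = StableDiffusionPipeline.from_pretrained(
    # The original "runwayml/stable-diffusion-v1-5" repo appears to have been taken down;
    # this community mirror ID is an assumption, so verify it on the Hub before relying on it.
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")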

image = pipe("a cat in a spacesuit, digital art",
             num_inference_steps=30,        # how many denoising steps
             guidance_scale=7.5).images[0]  # how strongly to follow the prompt
image.save("cat.png")

SECTION 18Engineering & deployment

The unglamorous work that decides whether a model ships. What separates a notebook demo from production.

Make training fast and stable

Use the GPU. Deep learning is matrix math; GPUs do it 10–100× faster than CPUs. Confirm your tensors and model are on cuda — a model silently running on CPU is a common "why is this so slow?" culprit.
Mixed precision (fp16/bf16). Compute in 16-bit where safe to roughly halve memory and speed things up, keeping critical parts in 32-bit. One flag in most frameworks.
Efficient data loading. Use multiple worker processes so the CPU prepares the next batch while the GPU trains on the current one. A starved GPU is wasted money.
Gradient accumulation. Simulate a large batch on limited memory by summing gradients over several small batches before stepping.
Learning-rate scheduling & warmup. Ramp the LR up then decay it; standard for transformers and a reliable accuracy boost.
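
Several of these tips fit into one small loop. The sketch below combines gradient accumulation, gradient clipping, and a warmup-then-decay schedule on synthetic data; the step counts and learning rate are arbitrary assumptions, and mixed precision is left out so it runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps, total_steps, accum = 5, 20, 4

def lr_factor(step):
    """Linear warmup to the base LR, then linear decay to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_factor)

optimizer_steps = 0
for micro_step in range(total_steps * accum):
    x, y = torch.randn(8, 20), torch.randint(0, 2, (8,))   # synthetic micro-batch of 8
    loss = nn.functional.cross_entropy(model(x), y)
    (loss / accum).backward()              # scale so the accumulated gradient is an average
    if (micro_step + 1) % accum == 0:      # every `accum` micro-batches = one batch of 32
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        opt.step()
        sched.step()
        opt.zero_grad()
        optimizer_steps += 1

print(optimizer_steps, sched.get_last_lr())
```

Dividing the loss by `accum` makes the accumulated gradient match what a single batch of 32 would produce with a mean-reduced loss, so the learning rate does not need retuning.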

Debugging checklist (in order)

Symptom	First things to check
Loss is `NaN`	Learning rate too high; bad input normalization; `log(0)` in a custom loss. Lower LR, clip gradients.
Loss won't decrease	Forgot `zero_grad()`; LR too low/high; labels misaligned with inputs; model on wrong device.
Great train acc, bad val acc	Overfitting — add dropout/augmentation/weight decay, get more data, or stop earlier (§9).
Worse at inference than training	Forgot `model.eval()`; different preprocessing at inference time.
Out-of-memory (OOM)	Reduce batch size; enable mixed precision; use gradient accumulation/checkpointing.

↗ The most useful debugging trick

Overfit a single batch on purpose. Take 2–4 examples and train until the loss is near zero. If your model can't memorize a tiny batch, there's a bug in the model, loss, or data pipeline — no amount of tuning will help until you fix it. If it can, your machinery works and the rest is data and regularization.
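
A minimal version of that sanity check, using a small stand-in model and four random examples (both are assumptions; substitute your real model and one real batch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.randn(4, 10)                    # one tiny, fixed batch of 4 examples
y = torch.tensor([0, 1, 2, 1])

for step in range(500):                   # same batch every step, on purpose
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

final_loss = loss.item()
accuracy = (model(x).argmax(-1) == y).float().mean().item()
print(f"loss={final_loss:.4f}  acc={accuracy:.2f}")
```

If the loss refuses to approach zero here, suspect label/input misalignment, a detached graph, or a missing `zero_grad()` before reaching for hyperparameter tuning.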

From trained model to product

Save & version the model weights and the exact preprocessing alongside them.
Export to a portable format (ONNX, TorchScript, or a SavedModel) for serving outside Python.
Optimize for inference: quantization (run in int8) and distillation (train a small model to mimic a big one) cut latency and cost dramatically.
Serve behind an API (FastAPI, TorchServe, TF Serving, or a managed endpoint), batch requests, and monitor.
Monitor for drift. Real-world data shifts over time; track input distributions and prediction quality, and plan to retrain.

›_ Example — with code (save / load / serve)

deploy.pyPyTorch + FastAPI
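import torch
from fastapi import FastAPI

# Assumes `model` (an instance of the same nn.Module class used in training) and
# `preprocess` (text -> input tensor with a batch dimension) are defined or imported above.

# --- save & load weights ---
torch.save(model.state_dict(), "model.pt")
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()                                  # ALWAYS eval before serving

# --- minimal inference API ---
app = FastAPI()

@app.post("/predict")
def predict(payload: dict):
    x = preprocess(payload["text"])           # same prep as training!
    with torch.no_grad():
        logits = model(x)
    return {"label": int(logits.argmax(-1))}  # assumes one example per request
# run: uvicorn deploy:app --host 0.0.0.0 --port 8000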

SECTION 19Quantization — making big models fit

A model's weights are just numbers. Store them in fewer bits and the model shrinks 2–8× — often with almost no loss in quality. This is what lets a 70B model run on a single GPU.

By default weights live in 32-bit (fp32) or 16-bit (fp16/bf16) floats. Quantization maps them to low-bit integers — typically int8 or int4 — storing a scale factor so they can be approximately reconstructed. Memory scales directly with bit-width, and because generation is memory-bandwidth bound, smaller weights also mean faster inference.

Format	Bits	7B model size	Use
fp32	32	~28 GB	legacy / high-precision training
fp16 / bf16	16	~14 GB	standard training & inference (bf16 preferred on Ampere+)
int8	8	~7 GB	inference, light quality loss
int4 (NF4)	4	~3.5 GB	inference + QLoRA fine-tuning
FP8	8	~7 GB	training/inference on H100/Blackwell

Two timing strategies: post-training quantization (PTQ) — quantize an already-trained model (fast, the common case; methods include GPTQ, AWQ, bitsandbytes NF4, and GGUF for llama.cpp) — and quantization-aware training (QAT), which simulates quantization during training for the best low-bit accuracy at higher cost. NF4 ("4-bit NormalFloat") is a data type tuned for the bell-curve distribution of neural-net weights, and "double quantization" even quantizes the scale factors.
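
The core "store integers plus a scale factor" idea fits in a few lines. This is a sketch of the simplest scheme (symmetric, one scale for the whole tensor); methods such as GPTQ, AWQ, and NF4 are more elaborate, typically using per-block scales or non-uniform levels.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantization: store int8 values plus one float scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Approximate reconstruction of the original weights."""
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(256, 256) * 0.02          # stand-in for one weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_err = (w - w_hat).abs().max().item()
print(q.dtype, q.element_size(), "byte per weight; max error", max_err)
```

The reconstruction error per weight is bounded by half the scale, which is why a single large outlier (which inflates the scale) hurts every other weight in the tensor.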

Figure 17. Lower precision → smaller footprint → fits on cheaper hardware and runs faster. int4 fits a 7B model in <4 GB with minimal quality loss for inference.

◆ Example — no code

Think of a RAW photo vs a JPEG. The RAW file stores every pixel in full precision: gorgeous, but huge. The JPEG throws away detail your eye can't notice and is a fraction of the size. Quantization is JPEG for model weights: a 70B model in fp16 needs ~140 GB (two 80 GB A100s); in 4-bit it needs ~40 GB and fits on a single 48 GB or 80 GB GPU, while answering almost identically. You trade a sliver of fidelity for a roughly 3.5× smaller memory footprint, and with it a cheaper serving setup.

›_ Example — with code (load any model in 4-bit)

load_4bit.pyTransformers + bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit NF4 config: the recipe behind QLoRA
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat-4, tuned for weight distributions
    bnb_4bit_use_double_quant=True,     # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # math runs in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb,
    device_map="auto",                  # place layers across available GPUs/CPU
)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# The 8B model now occupies ~5 GB instead of ~16 GB.

↗ Rule of thumb

int8 or int4 quantization costs very little quality for inference and for QLoRA fine-tuning. Reach for GPTQ/AWQ when you want the fastest pre-baked inference weights, GGUF for llama.cpp/Ollama on laptops, and FP8 on H100/Blackwell for training. Watch for activation outliers: a few large activations can hurt naive low-bit schemes, which is exactly what methods like LLM.int8() and AWQ are designed to handle.

SECTION 20LoRA, QLoRA & parameter-efficient fine-tuning

Full fine-tuning rewrites every weight — one giant checkpoint per task and a huge bill. PEFT freezes the model and trains a tiny set of new parameters instead. This is how almost all custom LLMs are built today.

The key idea behind LoRA (Low-Rank Adaptation): when you fine-tune, the change to a weight matrix is low-rank — it doesn't need full degrees of freedom. So freeze the original weight W (size d×k) and learn its update as a product of two skinny matrices:

∑ The LoRA decomposition

W′ = W + ΔW = W + B·A (B is d×r, A is r×k, rank r ≪ d,k)

Only A and B train. With r=8 on a 4096×4096 layer you update ~65K params instead of ~16.7M — about 0.4%. A scaling factor α/r controls the update's strength. At deploy time you can fold B·A back into W, so there is zero extra inference latency.
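
A from-scratch sketch of the decomposition, mainly to check the parameter count and the merge claim. The class name `LoRALinear` and the init choices (small random A, zero B) are assumptions in the spirit of the method, not the PEFT library's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # W (and bias) stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)   # r x k
        self.B = nn.Parameter(torch.zeros(base.out_features, r))         # d x r, zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the update into W so inference needs no extra matmuls."""
        merged = nn.Linear(self.base.in_features, self.base.out_features,
                           bias=self.base.bias is not None)
        merged.weight.copy_(self.base.weight + self.scale * (self.B @ self.A))
        if self.base.bias is not None:
            merged.bias.copy_(self.base.bias)
        return merged

torch.manual_seed(0)
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.base.parameters())
print(trainable, total, f"{100 * trainable / total:.2f}%")   # 65536 16777216 0.39%

nn.init.normal_(layer.B, std=0.01)        # pretend training has moved B away from zero
x = torch.randn(2, 4096)
with torch.no_grad():
    same = torch.allclose(layer(x), layer.merge()(x), atol=1e-4)
print(same)
```

Initializing B to zero means the wrapped layer starts out exactly equal to the frozen base, so fine-tuning begins from the pretrained behaviour rather than a perturbed one.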

Figure 18. LoRA learns a low-rank update B·A alongside the frozen weight. QLoRA adds one trick: keep W in 4-bit (Section 19) while the adapters train in bf16 — enough to fine-tune a 65B model on a single 48 GB GPU.

The family worth knowing:

LoRA — the workhorse. Tiny adapters, mergeable, hot-swappable.
QLoRA — LoRA on top of a 4-bit (NF4) frozen base + paged optimizers. Massive memory savings; the default for fine-tuning on consumer GPUs.
DoRA — decomposes each weight into magnitude and direction and applies LoRA to the direction. Closes most of the remaining gap to full fine-tuning at the same parameter cost.
Others: LoRA+ (different LRs for A and B), rsLoRA (rank-stabilized scaling), PiSSA (better init), VeRA (shared frozen random matrices, even fewer params), IA³, and the soft-prompt family (prefix-, prompt-, P-tuning) and classic adapters.

◆ Example — no code

You have one 13B base model and need five specialists: legal, medical, code, customer-support, and sales. Full fine-tuning means five separate expensive runs and five ~26 GB checkpoints to store and serve. With LoRA you keep one frozen base and train five adapter files of ~10–50 MB each, hot-swapping them per request. It's one game console with five cartridges instead of buying five consoles.

›_ Example — with code (QLoRA fine-tune with PEFT + TRL)

qlora_finetune.pyPEFT + TRL + Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
import torch

model_id = "meta-llama/Llama-3.1-8B"

# 1. load base in 4-bit (the "Q" in QLoRA) -- frozen
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_use_double_quant=True,
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# 2. attach LoRA adapters (the only thing that trains)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj",      # attention
                    "gate_proj","up_proj","down_proj"],        # MLP (PEFT also accepts target_modules="all-linear")
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # e.g. "trainable: 0.42% of all params"

# 3. train like normal -- TRL handles the SFT loop
trainer = SFTTrainer(model=model, train_dataset=my_dataset,
                     args=SFTConfig(output_dir="out", per_device_train_batch_size=2,
                                    gradient_accumulation_steps=8, bf16=True,
                                    num_train_epochs=1, learning_rate=2e-4))
trainer.train()

# 4. save the tiny adapter (a few dozen MB), or merge for zero-latency deploy:
model.save_pretrained("my-lora-adapter")
# merged = model.merge_and_unload()   # fold B*A into W for serving

↗ Practical settings

Start with rank r=16–64, α = 2×r, and target all linear layers (attention + MLP) — this consistently beats attaching LoRA to attention only. Learning rates are higher than full fine-tuning (1e-4 to 3e-4). Use QLoRA when memory is tight, DoRA when you want to squeeze out the last accuracy points, and merge_and_unload() before serving so inference is as fast as the base model.

SECTION 21Alignment with RLHF & PPO

Pretraining makes a model knowledgeable; supervised fine-tuning makes it follow instructions. But "the most likely next token" isn't the same as "the answer a human prefers." Alignment closes that gap.

RLHF (Reinforcement Learning from Human Feedback) is the classic three-stage recipe that turned raw LLMs into helpful assistants:

SFT — fine-tune on high-quality demonstration (prompt → ideal answer) pairs.
Reward model (RM) — show humans two answers to the same prompt; they pick the better one. Train a model to predict that preference. Mathematically it uses the Bradley–Terry model: the probability that answer w beats l is σ(r(w) − r(l)).
RL optimization (PPO) — let the policy generate answers, score them with the RM, and update the policy to earn higher reward — while a KL penalty keeps it from drifting too far from the SFT model (so it can't "cheat" the reward).
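
Stages 2 and 3 each reduce to a short formula. The sketch below shows the Bradley–Terry loss for the reward model and a KL-penalized reward for the RL stage; the function names, the `kl_coef` value, and the use of summed response log-probs as a single-sample KL estimate are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: maximize log sigma(r(w) - r(l)) over labelled pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def kl_penalized_reward(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """Reward used in the RL stage: RM score minus a penalty for drifting from the SFT model.
    logp_* are summed log-probs of the sampled response under each model."""
    return rm_score - kl_coef * (logp_policy - logp_ref)

# Reward-model scores for two preference pairs (made-up numbers).
r_chosen = torch.tensor([2.0, 0.5])
r_rejected = torch.tensor([0.0, 1.0])
loss = reward_model_loss(r_chosen, r_rejected)

# A response the policy likes much more than the reference does gets its reward reduced.
shaped = kl_penalized_reward(torch.tensor(1.0), torch.tensor(-10.0), torch.tensor(-12.0))
print(loss.item(), shaped.item())
```

The second pair, where the rejected answer scored higher, contributes most of the loss; that is the signal that pushes the reward model to reorder its scores.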

Figure 19. The RLHF pipeline. PPO is powerful but heavy: it juggles four models in memory — policy, frozen reference, reward model, and a value/critic network — which is exactly the cost the next section's methods try to avoid.

◆ Example — no code

Imagine training a chef. SFT is having them copy recipes from a great cookbook. The reward model is a food critic you've trained to taste a dish and give it a score the way real diners would. PPO is the chef cooking freely, the critic scoring each plate, and the chef adjusting to win higher scores — but with a rule that they can't stray too far from real cooking (the KL leash), or they'd just drown everything in sugar to game the critic. That "gaming" failure is called reward hacking.

›_ Example — with code (PPO with TRL)

ppo_rlhf.pyTRL (Transformer Reinforcement Learning)
# NOTE: this is the classic TRL PPO interface from older releases (before the PPOTrainer
# rewrite, around v0.12); pin an older TRL or adapt to the current docs.
import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer, pipeline

# policy = SFT model + a value head (the critic PPO needs)
policy = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model")
ref    = AutoModelForCausalLMWithValueHead.from_pretrained("my-sft-model")  # frozen anchor
tok    = AutoTokenizer.from_pretrained("my-sft-model")

# a trained reward model scores responses (here via a text-classification pipeline)
reward_fn = pipeline("text-classification", model="my-reward-model")

ppo = PPOTrainer(PPOConfig(batch_size=32, learning_rate=1e-5), policy, ref, tok)

for batch in dataloader:                   # assumed to yield lists of 1-D token-id tensors
    queries   = batch["input_ids"]
    responses = ppo.generate(queries, return_prompt=False,     # policy acts
                             max_new_tokens=128)
    texts     = [tok.decode(r, skip_special_tokens=True) for r in responses]
    # step() expects one scalar tensor per sample, not Python floats
    rewards   = [torch.tensor(out["score"]) for out in reward_fn(texts)]  # RM judges
    # PPO step: push policy toward higher reward, clipped + KL-penalized vs ref
    ppo.step(queries, responses, rewards)

! Reward hacking

Optimize a proxy hard enough and the model finds loopholes: padding answers with flattery, gaming length, or exploiting RM blind spots. The KL coefficient is your main defense — too low and the model degenerates, too high and it never improves. This fragility (plus PPO's 4-model memory cost and tuning difficulty) is why the field largely shifted to the simpler methods next.

SECTION 22DPO, GRPO & modern preference optimization

PPO works but is complex, unstable, and memory-hungry. The 2023–2025 wave of methods gets the same alignment with far less machinery — and powers today's reasoning models.

DPO (Direct Preference Optimization) made the key observation that your language model is secretly a reward model. Instead of training a separate RM and running RL, DPO derives a single classification-style loss directly on preference pairs (prompt, chosen, rejected). It nudges the policy to raise the log-probability of chosen answers and lower it for rejected ones, always measured relative to a frozen reference model, with a coefficient β that plays the role of the KL leash. No reward model, no rollouts, no critic: just stable supervised-style training.

GRPO (Group Relative Policy Optimization, from DeepSeek) keeps RL's online strength but drops PPO's expensive critic. For each prompt it samples a group of G answers, scores them, and uses the group's mean and standard deviation to compute each answer's advantage — "how much better than my other attempts was this one?" No value network needed. It shines with verifiable rewards (RLVR): math answers you can check, code that either passes tests or doesn't. GRPO on verifiable rewards is what produced DeepSeek-R1's reasoning ability.

∑ The two objectives, in words

DPO: loss = −log σ( β · ( [ log pθ(chosen) − log pθ(rejected) ] − [ log p_ref(chosen) − log p_ref(rejected) ] ) )

GRPO: advantage Aᵢ = ( rᵢ − mean(r) ) / ( std(r) + ε ) over a group of G samples
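
To make the DPO objective concrete, here is a minimal pure-Python sketch for a single preference pair. It assumes you already have the summed sequence log-probabilities of each answer under the policy and under the frozen reference; the function name `dpo_loss` and the numbers are illustrative and not taken from TRL. Note that β scales the whole difference between the policy's preference gap and the reference's.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    # How much more strongly the policy prefers "chosen" than the reference does.
    margin = (pol_chosen - pol_rejected) - (ref_chosen - ref_rejected)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))

# Illustrative log-probabilities (made-up numbers, not from a real model).
same_as_ref = dpo_loss(-12.0, -15.0, -12.0, -15.0)   # policy == reference
improved    = dpo_loss(-10.0, -18.0, -12.0, -15.0)   # policy widened the gap
print(round(same_as_ref, 4), round(improved, 4))
```

When the policy matches the reference the margin is zero and the loss is log 2 (about 0.693); widening the gap in favour of the chosen answer pushes the loss lower.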

The broader family: IPO (fixes a DPO overfitting failure mode), KTO (needs only a binary good/bad label per sample — no pairs — using prospect-theory utility), ORPO (folds preference into SFT with no reference model), SimPO (reference-free), and DAPO (GRPO refinements: dynamic sampling, asymmetric "clip-higher").

Figure 20. The trade-off: PPO is the most general but heaviest; DPO is dead-simple offline preference learning; GRPO keeps online RL but removes the critic, making RL-for-reasoning affordable.

◆ Example — no code

DPO is teaching with side-by-side flashcards: "this reply is better than that one" — the model learns the preference directly, no separate judge required. GRPO is a classroom exercise: hand eight students the same math problem, check which solutions are actually correct, and reward the ones that beat the class average. You never wrote a grading rubric (a critic) — correctness plus relative ranking is the whole signal. That's why GRPO is perfect for math and code, where "right or wrong" is checkable.
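
The classroom picture above maps onto a few lines of arithmetic. This is a sketch under stated assumptions: rewards are 0/1 correctness scores for eight sampled answers, and the population standard deviation is used (implementations differ on this detail). `group_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Score each sample relative to the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)                 # population std of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]   # 3 of 8 answers correct
adv = group_advantages(rewards)
print([round(a, 2) for a in adv])

# If every answer gets the same reward, nobody beat the average.
flat = group_advantages([1.0] * 8)
```

Correct answers get a positive advantage and wrong ones a negative advantage. When the whole group ties, every advantage is zero and the prompt contributes no gradient, which is the situation that DAPO-style dynamic sampling is meant to avoid.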

›_ Example — with code (DPO and GRPO with TRL)

preference_opt.py · TRL
# ---------- DPO: offline, from preference pairs ----------
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("my-sft-model")
tok   = AutoTokenizer.from_pretrained("my-sft-model")
# dataset columns: {"prompt", "chosen", "rejected"}
dpo = DPOTrainer(model=model, args=DPOConfig(beta=0.1, output_dir="dpo-out"),
                 train_dataset=pref_dataset, processing_class=tok)
dpo.train()                       # no reward model, no rollouts -- just stable training

# ---------- GRPO: online RL with a verifiable reward ----------
from trl import GRPOTrainer, GRPOConfig

def reward_correct(completions, answer, **kw):
    # +1 if the model's final answer matches ground truth, else 0  (RLVR).
    # `answer` is a dataset column; extract() is your own final-answer parser (not shown).
    return [1.0 if extract(c) == a else 0.0 for c, a in zip(completions, answer)]

grpo = GRPOTrainer(
    model="my-sft-model",
    reward_funcs=reward_correct,
    args=GRPOConfig(num_generations=8,      # G samples per prompt -> the "group"
                    output_dir="grpo-out", learning_rate=1e-6),
    train_dataset=math_dataset,
)
grpo.train()

↗ Which one?

Have a dataset of preference pairs and want stable, cheap alignment → DPO (or KTO if you only have thumbs-up/down). Want to push reasoning on tasks with a checkable answer (math, code, tool success) → GRPO/RLVR. Need maximum control over a complex reward or true online exploration → PPO. The modern default for most teams is: SFT → DPO, with GRPO when a verifiable reward is available.

SECTION 23 Inside a modern Transformer

The 2017 Transformer in Section 13 still describes the skeleton, but every production model (Llama, Qwen, Mistral, DeepSeek, Gemma) swaps in a set of upgrades for speed, length, and quality. Here's the modern parts list.

Component	2017 original	Modern replacement	Why
Position	Sinusoidal / learned	RoPE (rotary), ALiBi	relative positions; ALiBi extrapolates directly, RoPE reaches longer context with frequency scaling
Normalization	LayerNorm (post)	RMSNorm, pre-norm	cheaper, more stable training
FFN activation	ReLU	SwiGLU (gated)	~1–2% better perplexity
Attention heads	Multi-head (MHA)	GQA / MQA / MLA	shrinks the KV cache → longer context
Attention kernel	Naive softmax(QKᵀ)V	FlashAttention 1/2/3	IO-aware, never stores the N×N matrix
Capacity	Dense FFN	Mixture-of-Experts	huge total params, few active per token
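
The "relative" property in the Position row can be checked numerically. The sketch below is a simplified RoPE that rotates adjacent pairs of a vector (real implementations often pair dimensions differently and work on whole tensors); the vectors and positions are arbitrary illustrative values. It shows that the query-key dot product depends only on the distance between the two positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)      # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

near = dot(rope(q, 5), rope(k, 2))          # positions 5 and 2: offset 3
far = dot(rope(q, 105), rope(k, 102))       # same offset, shifted by 100
other = dot(rope(q, 5), rope(k, 4))         # offset 1 gives a different score
print(near, far, other)
```

Because only the offset matters, the attention score between two tokens stays the same wherever the pair sits in the sequence.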

Grouped-Query Attention (GQA) is the 2025 default. Standard attention gives every query head its own key/value head — but the KV cache (the stored keys/values for past tokens) dominates memory at long context. MQA shares a single KV head across all queries (tiny cache, slight quality hit); GQA is the sweet spot: a few KV heads shared by groups of query heads. DeepSeek's MLA pushes further by compressing KV into a low-rank latent.
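
A back-of-the-envelope calculation shows why the number of KV heads matters. The layer count, head size, context length, and 16-bit storage below are illustrative assumptions (roughly an 8B-class model), and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    # Two tensors (K and V) per layer, each [batch, n_kv_heads, seq_len, head_dim].
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

cfg = dict(n_layers=32, head_dim=128, seq_len=32_768)
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=kv_heads, **cfg) / 2**30
    print(f"{name}: {kv_heads:>2} KV heads -> {gib:.1f} GiB per sequence")
```

Under these assumptions a single 32K-token sequence needs 16 GiB of cache with full multi-head attention, 4 GiB with 8 KV heads, and 0.5 GiB with one.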

Figure 21. Query heads (indigo) vs key/value heads (teal). Fewer KV heads → a smaller KV cache → longer context and faster decoding. GQA is the modern compromise.

Mixture-of-Experts (MoE) replaces a dense feed-forward block with many parallel "expert" FFNs plus a tiny router that sends each token to only the top-k experts (often 2). The model can hold enormous knowledge (high total parameters) while each token only pays for a few experts (low active parameters) — e.g. Mixtral, DeepSeek-V3, Llama 4.
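
The routing step can be sketched for a single token in plain Python. This is a toy under stated assumptions: the router logits are given rather than produced by a learned layer, the "experts" are scalar functions standing in for FFNs, and the gate weights are renormalised over the chosen experts (one common choice, not the only one).

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    probs = softmax(router_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)       # renormalise over the chosen k
    y = sum(probs[i] / norm * experts[i](x) for i in top)
    return y, top

# Four toy experts; only two of them run for this token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
y, used = moe_forward(3.0, router_logits=[0.1, 2.0, -1.0, 2.0], experts=experts)
print(y, used)
```

Experts 1 and 3 tie for the highest router score, so each contributes half of the output; experts 0 and 2 are never evaluated, which is where the compute saving comes from.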

Figure 22. A Mixture-of-Experts layer. The router activates only a couple of experts per token, decoupling total capacity from per-token compute.

◆ Example — no code

GQA is carpooling for memory: instead of every query commuter driving a private key/value car, they share a few cars, freeing up the parking lot (KV cache) for far longer trips (context). MoE is a large hospital with a triage desk: each patient (token) is routed to just the two relevant specialists out of a hundred on staff. You get the expertise of a hundred doctors but only pay for two consultations per patient.

›_ Example — with code (a modern block + FlashAttention)

modern_block.py · PyTorch + Transformers
import torch, torch.nn as nn, torch.nn.functional as F

class RMSNorm(nn.Module):                      # cheaper than LayerNorm
    def __init__(self, d, eps=1e-6):
        super().__init__(); self.g = nn.Parameter(torch.ones(d)); self.eps = eps
    def forward(self, x):
        return self.g * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):                        # gated FFN used by Llama/Qwen
    def __init__(self, d, hidden):
        super().__init__()
        self.w_gate = nn.Linear(d, hidden, bias=False)
        self.w_up   = nn.Linear(d, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d, bias=False)
    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# FlashAttention is exact attention done IO-efficiently. In practice you just
# enable the fused kernel -- PyTorch picks it automatically:
q = k = v = torch.randn(1, 8, 128, 64)          # dummy (batch, heads, seq_len, head_dim) inputs
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # uses Flash when available

# ...or tell Hugging Face to use it when loading a model:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16)

SECTION 24 Training at scale

A 70B model won't even fit on one GPU — Adam's optimizer states alone can need hundreds of gigabytes. Scaling means splitting the work, and trading compute for memory.

First, the memory budget. Training a parameter in mixed precision costs far more than the weight itself: a copy of the weight, its gradient, and Adam's two optimizer states (often kept in fp32). That's why optimizer states, not the model, usually dominate. The toolkit:

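Mixed precision (bf16/fp16, FP8 on H100+) — do math in 16/8-bit, keep a master copy where needed. Often up to ~2× faster, with roughly half the activation memory; weights and optimizer state shrink only if you also drop the fp32 master copies. bf16 is preferred on Ampere and newer for its stability.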
Gradient accumulation — run several small micro-batches and sum their gradients before stepping, simulating a large batch you couldn't otherwise fit.
Gradient (activation) checkpointing — don't store every activation for the backward pass; recompute them on the fly. Trades extra compute for big memory savings.
ZeRO (DeepSpeed) / FSDP (PyTorch) — shard the optimizer states, gradients, and even parameters across N GPUs so each holds ~1/N. This is sharded data parallelism and is the standard way to train large models.
Parallelism dimensions — data (same model, different batches), tensor (split a single matrix multiply across GPUs), pipeline (different layers on different GPUs), plus sequence/context and expert parallelism. Combining data, tensor, and pipeline parallelism is what "3D parallelism" refers to.
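
The memory budget described at the start of this section can be turned into rough arithmetic. The sketch assumes a common accounting of 16 bytes per parameter (bf16 weight and gradient, fp32 master weight, two fp32 Adam moments) and ignores activations and buffers entirely; the helper names are hypothetical.

```python
def training_bytes_per_param(weight=2, grad=2, master=4, adam_m=4, adam_v=4):
    """Bytes of model state per parameter under mixed-precision Adam."""
    return weight + grad + master + adam_m + adam_v

def model_state_gb(n_params, n_gpus=1):
    """Model-state memory per GPU in GB (1e9 bytes) when fully sharded."""
    return n_params * training_bytes_per_param() / n_gpus / 1e9

for gpus in (1, 8, 64):
    print(f"70B params, {gpus:>2} GPU(s): {model_state_gb(70e9, gpus):,.1f} GB per GPU")
```

Under these assumptions a 70B model carries about 1,120 GB of model state. Full sharding (ZeRO-3 / FSDP) brings that to roughly 140 GB per GPU on 8 devices and 17.5 GB on 64, before counting activations.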

Figure 23. Ways to split a too-big job. In practice large runs combine several — e.g. FSDP for sharding plus tensor/pipeline parallel for the very biggest models.

◆ Example — no code

You're moving a house but own only small trucks. One truck can't hold the house — not even all the furniture from one room. So you split everything across eight trucks (FSDP/ZeRO shards the weights, gradients, and optimizer state), and when you need a specific couch you radio the truck that has it (gather-on-demand). Gradient checkpointing is deciding not to keep packing boxes around — you'll just rebuild a few when needed, saving space at the cost of a little extra work.

›_ Example — with code (scale knobs in the Trainer)

train_at_scale.py · Transformers + Accelerate
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,      # small micro-batch that fits in memory
    gradient_accumulation_steps=16,     # ...but train as if batch = 2 * 16 * n_gpus
    bf16=True,                          # mixed precision (Ampere/Hopper+)
    gradient_checkpointing=True,        # recompute activations -> big memory save
    optim="adamw_torch_fused",
    # multi-GPU sharding via FSDP (or use a DeepSpeed ZeRO-3 config file):
    fsdp="full_shard auto_wrap",
    fsdp_config={"transformer_layer_cls_to_wrap": "LlamaDecoderLayer"},
    logging_steps=10, num_train_epochs=1, learning_rate=2e-5,
)
# Launch across GPUs with:   accelerate launch train_at_scale.py
# (Accelerate / torchrun handle the distributed process group for you.)

↗ The OOM playbook

Out of memory? In order: enable bf16, turn on gradient checkpointing, lower the micro-batch and raise accumulation to keep the effective batch, switch to QLoRA (Section 20) so the base is 4-bit, then shard with FSDP/ZeRO-3. Find the largest micro-batch that fits, then scale the effective batch with accumulation — throughput loves big batches.
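
The "lower the micro-batch, raise accumulation" step works because the accumulated gradient matches the large-batch gradient. A tiny check with a one-parameter linear model and made-up data, assuming equal-sized micro-batches and a mean-reduced loss:

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
w = 0.5

full = grad_mse(w, xs, ys)                  # one batch of 8

micro, steps = 2, 4                         # 4 micro-batches of 2
accum = 0.0
for s in range(steps):
    lo = s * micro
    # Scale each micro-batch gradient by 1/steps so the sum is a mean.
    accum += grad_mse(w, xs[lo:lo + micro], ys[lo:lo + micro]) / steps

print(full, accum)
```

The two numbers agree up to floating-point rounding. Training frameworks typically apply the same 1/steps scaling to the loss for you.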

SECTION 25 Inference & serving

Training is a one-time cost; serving is forever. Generation is autoregressive — one token at a time — so the bottleneck is usually memory bandwidth, not raw compute. The whole game is keeping the GPU busy.

KV cache — each new token attends to all previous ones. Recomputing their keys/values every step would be quadratic, so you cache them. The cache grows with sequence length and dominates memory at long context (this is what GQA in Section 23 shrinks).
Continuous batching — classic batching waits for the whole batch to finish; continuous batching (vLLM) swaps finished requests out and new ones in every step, so the GPU never idles. Big throughput win under real traffic.
PagedAttention — manages the KV cache like OS virtual memory in fixed "pages," eliminating fragmentation and letting you pack far more concurrent requests.
Speculative decoding — a small fast "draft" model proposes several tokens; the big model verifies them all in one parallel pass, accepting the longest correct prefix. Same output distribution, often 2–3× faster.
Quantized inference — int4/int8/FP8 weights (Section 19) cut memory and boost bandwidth-bound decoding.
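
The gain from continuous batching can be seen in a toy scheduler. The sketch below counts decode steps only, ignores prefill and memory limits, and uses made-up output lengths; it is a simulation of the idea, not how vLLM is implemented.

```python
def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))

def continuous_batching_steps(lengths, slots):
    """Refill a slot as soon as its request finishes."""
    queue, active, steps = list(lengths), [], 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))
        # One decode step: every active request emits a token; finished ones leave.
        active = [n - 1 for n in active if n > 1]
        steps += 1
    return steps

lengths = [2, 9, 3, 8, 2, 4]                # tokens each request will generate
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
print(static_steps, continuous_steps)
```

With two slots the static schedule takes 21 steps because short requests wait on long ones, while the continuous schedule takes 15, close to the 14-step lower bound for 28 tokens.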

Figure 24. Speculative decoding amortizes the big model's cost across several tokens at once — cheap drafting, one expensive verification.

◆ Example — no code

Speculative decoding is a junior writer drafting a whole sentence while a senior editor — who's slow but authoritative — approves it in a single glance instead of writing word by word. If the draft is good, four or five words get accepted for the price of one editorial pass. Continuous batching is a ride-share van that picks up and drops off riders mid-route instead of waiting for a full bus, so a seat is never empty and the engine never idles.
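
The draft-then-verify loop can be sketched with toy deterministic "models". This is the greedy variant only (the sampling variant uses an accept/reject rule to preserve the target distribution), and `target_next` and `draft_next` are hypothetical stand-ins for real forward passes.

```python
def target_next(ctx):
    """Toy 'big model': the next token is a fixed function of the context."""
    return (sum(ctx) * 7 + len(ctx)) % 10

def draft_next(ctx):
    """Toy 'draft model': agrees with the target unless the last token is 3."""
    return 0 if ctx[-1] == 3 else target_next(ctx)

def greedy_decode(prompt, n_new):
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, n_new, k=4):
    out, passes = list(prompt), 0
    goal = len(prompt) + n_new
    while len(out) < goal:
        proposal = []                        # 1. draft proposes k tokens (cheap)
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        passes += 1                          # 2. one target pass checks all k positions
        accepted = []
        for tok in proposal:
            want = target_next(out + accepted)
            accepted.append(want)            # the target's token is always kept
            if want != tok:                  # first mismatch: drop the rest of the draft
                break
        else:
            accepted.append(target_next(out + accepted))   # all matched: bonus token
        out.extend(accepted)
    return out[:goal], passes

prompt = [1, 2]
spec_out, passes = speculative_decode(prompt, n_new=20)
print(spec_out == greedy_decode(prompt, 20), passes)
```

The output is identical to decoding with the target alone, but the target is invoked fewer than 20 times because each verification pass can commit several tokens.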

›_ Example — with code (serve with vLLM; speculative in Transformers)

serve.py · vLLM + Transformers
# ---------- High-throughput serving with vLLM ----------
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          # quantization="awq",        # enable only when `model` points at AWQ-quantized weights
          gpu_memory_utilization=0.9)  # PagedAttention + continuous batching are automatic
out = llm.generate(["Explain KV cache in one line."],
                   SamplingParams(temperature=0.7, max_tokens=128))
print(out[0].outputs[0].text)
# Or expose an OpenAI-compatible server:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct

# ---------- Speculative decoding with a draft model ----------
from transformers import AutoModelForCausalLM, AutoTokenizer
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
draft  = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
ids = tok("The capital of France is", return_tensors="pt").input_ids
out = target.generate(ids, assistant_model=draft, max_new_tokens=40)  # same output, faster
print(tok.decode(out[0]))

↗ Picking a serving stack

For production throughput on GPUs: vLLM, TGI, SGLang, or TensorRT-LLM. For a laptop or a single small GPU: llama.cpp / Ollama with GGUF quantized weights. Optimize for the metric that matters — latency (time to first token) for chat, throughput (tokens/sec across users) for batch jobs — they pull in different directions.

SECTION 26 Reasoning, RAG & agents

The frontier moved from "predict the next token" to "use compute at inference time to think, retrieve, and act." These are the patterns behind today's most capable systems.

Chain-of-thought & self-consistency — prompting the model to reason step-by-step, then sampling several reasoning paths and voting on the answer, reliably boosts accuracy on hard problems.
Reasoning models (o1 / DeepSeek-R1 style) — trained (often with GRPO + verifiable rewards from Section 22) to produce long internal reasoning before answering. Their key property is test-time compute scaling: let them think longer and accuracy goes up, no retraining required.
RAG (Retrieval-Augmented Generation) — embed your documents into a vector store; at query time retrieve the most relevant chunks and stuff them into the prompt. Grounds answers in fresh or private knowledge and cuts hallucination — without touching the model's weights.
Agents — an LLM wrapped in a loop that can call tools (search, code execution, APIs) via function calling, observe results, and decide the next step until the task is done (the ReAct pattern). The Model Context Protocol (MCP) is an emerging standard for connecting models to tools and data sources.
Multimodal (VLMs) — pair a vision (or audio) encoder with an LLM so it can see images and read documents, not just text.
Distillation — train a small, cheap model to imitate a big one's outputs, capturing much of the quality at a fraction of the serving cost.
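
Self-consistency from the first item in this list reduces to sampling several answers and taking a majority vote. In the sketch below, `sample_fn` stands in for one sampled chain-of-thought run that returns a final answer; here it is a canned iterator so the example is deterministic, and all names are illustrative.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and the share of samples that agree."""
    votes = Counter(a.strip() for a in answers)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(answers)

def self_consistency(sample_fn, question, n=5):
    # In practice sample_fn would call an LLM with temperature > 0.
    return majority_vote([sample_fn(question) for _ in range(n)])

canned = iter(["42", "41", "42 ", "43", "42"])   # five pretend reasoning paths
answer, agreement = self_consistency(lambda q: next(canned), "What is 6 * 7?", n=5)
print(answer, agreement)
```

The agreement ratio doubles as a rough confidence signal: a narrow win suggests the question deserves more samples or a closer look.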

Figure 25. The agent loop: an LLM that can reason, call tools, observe what comes back, and iterate — the basis of coding assistants, research agents, and autonomous workflows.

◆ Example — no code

A reasoning model is a student who shows their work on scratch paper before answering — and we let them use as much scratch paper as the problem deserves; harder questions simply get more thinking. An agent is that same student who can also get up, look things up in the library (search), use a calculator, run an experiment (code), come back, and keep going — looping until the assignment is actually finished, not just until they've produced one paragraph. RAG is letting them bring the open textbook into the exam so answers are grounded in the real source, not memory.

›_ Example — with code (a minimal RAG + tool-using agent loop)

rag_and_agent.py · sentence-transformers + an LLM client
# ---------- RAG: retrieve, then generate grounded on real docs ----------
from sentence_transformers import SentenceTransformer
import numpy as np

emb = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Our refund window is 30 days.", "Support hours are 9-5 GMT."]  # ...plus the rest of your documents
doc_vecs = emb.encode(docs, normalize_embeddings=True)

def retrieve(query, k=3):
    q = emb.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                       # cosine similarity
    return [docs[i] for i in np.argsort(scores)[-k:][::-1]]

def answer(query, llm):
    context = "\n".join(retrieve(query))         # top-k relevant chunks
    prompt  = f"Use ONLY this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)                           # grounded, less hallucination

# ---------- Agent loop: LLM decides which tool to call, then observes ----------
TOOLS = {"search": web_search, "python": run_code}   # name -> callable

def agent(goal, llm, max_steps=6):
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        step = llm("\n".join(history) +
                   "\nThink, then end with one line: TOOL <name> <args>  OR  FINAL <answer>")
        history.append(step)
        last = (step.strip().splitlines() or [""])[-1]   # the decision is the last line
        if last.startswith("FINAL"):
            return last[6:]
        name, args = parse(step)                 # e.g. "search", "vLLM throughput"
        observation = TOOLS[name](args)          # act
        history.append(f"Observation: {observation}")   # observe -> loop
    return "stopped: step budget exhausted"
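
The agent loop above calls a `parse` helper that the listing leaves undefined. One possible version, assuming the model follows the `TOOL <name> <args>` line format requested in the prompt (real systems more often rely on structured function calling than on text parsing):

```python
def parse(step):
    """Pull (tool_name, args) out of the last 'TOOL <name> <args>' line of a reply."""
    for line in reversed(step.splitlines()):
        if line.startswith("TOOL "):
            _, name, *rest = line.split(maxsplit=2)
            return name, rest[0] if rest else ""
    raise ValueError(f"no TOOL line in model output: {step!r}")

print(parse("I should look this up.\nTOOL search vLLM throughput"))
```

Scanning from the end tolerates any free-form thinking the model writes before its decision, and raising on a missing line lets the caller decide whether to retry or stop.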

↗ RAG vs fine-tuning — the decision

Need the model to know new, changing, or private facts? → RAG (cheap, updatable, citable). Need it to behave differently — a tone, a format, a skill, a domain style? → fine-tune (LoRA/QLoRA from Section 20). Need both? Most serious systems combine them: fine-tune the behavior, retrieve the facts, wrap it in an agent loop, and serve it with vLLM.

SECTION 27 The LangChain stack: overview

Section 26 covered the concepts. This is the production stack most teams actually reach for to build them — three layers that fit together, plus the observability glue that ties them.

The LangChain ecosystem is best understood as a layered stack. Each layer adds abstraction on top of the one below, so you pick the level of control your task needs:

LangGraph — the low-level orchestration runtime. It runs agents as a stateful graph of nodes and edges, and provides the hard infrastructure: durable execution (resume after a crash), persistence/memory, streaming, and human-in-the-loop. Use it when you need precise, custom control flow — loops, branches, approvals.
LangChain — the agent framework built on LangGraph. Its headline pieces are a standard model interface (swap OpenAI ↔ Anthropic ↔ Google by changing one string, no lock-in) and create_agent: a minimal, configurable harness = model + tools + prompt + a tool-calling loop, extendable with middleware. This is the default starting point for most agents.
Deep Agents — a batteries-included harness on top of LangChain, for complex, long-running, multi-step tasks. create_deep_agent ships with planning (a write_todos tool), a virtual filesystem with automatic context compression, subagent spawning (a task tool that delegates to isolated agents), long-term memory, skills, and human-in-the-loop — all on by default.
LangSmith — the cross-cutting observability layer: trace every model/tool call, debug behavior, and evaluate outputs for agents built with any of the three.

Figure 26. The LangChain stack. Deep Agents sits on LangChain (create_agent), which sits on the LangGraph runtime; LangSmith observes all of them. Pick your layer by how much control you need.

◆ Example — no code

Think of building a house. LangGraph is the foundation, plumbing, and wiring — the durable infrastructure that holds state and survives a power cut (resume where you left off). LangChain is the framing and standard fittings: every appliance brand (any model provider) plugs into the same socket, and create_agent hands you a pre-framed room. Deep Agents is the fully furnished smart home — it already has a planner on the wall, filing cabinets (the virtual filesystem), assistants you can dispatch (subagents), and a memory of past visits — move-in ready for big, messy jobs. You choose how much is pre-built versus how much you wire yourself.

›_ Example — with code (1 · LangChain: an agent, provider-swappable)

01_langchain_agent.py · LangChain — create_agent
# pip install -qU langchain "langchain[anthropic]"
from langchain.agents import create_agent

def get_weather(city: str) -> str:
    """Get the weather for a given city."""        # docstring = the tool description
    return f"It's always sunny in {city}!"

agent = create_agent(
    model="claude-sonnet-4-6",     # swap to "openai:gpt-5.4" or "google_genai:gemini-3.5-flash"
    tools=[get_weather],
    system_prompt="You are a helpful assistant.",
)

result = agent.invoke(
    {"messages": [{"role": "user", "content": "What's the weather in San Francisco?"}]}
)
print(result["messages"][-1].content_blocks)
# create_agent runs the loop: model -> (maybe call tool) -> observe -> repeat -> final answer.

›_ Example — with code (2 · RAG with LangChain: a retriever is a tool)

02_langchain_rag.py · LangChain — agentic RAG
# pip install -qU langchain langchain-community langchain-chroma langchain-openai langchain-text-splitters beautifulsoup4 "langchain[anthropic]"
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain.embeddings import init_embeddings
from langchain_core.tools import create_retriever_tool
from langchain.agents import create_agent

# 1. INGEST: load -> split into chunks -> embed -> store in a vector DB
docs   = WebBaseLoader("https://example.com/handbook").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150).split_documents(docs)
store  = Chroma.from_documents(chunks, init_embeddings("openai:text-embedding-3-small"))

# 2. turn the retriever into a TOOL the agent can choose to call
retriever_tool = create_retriever_tool(
    store.as_retriever(search_kwargs={"k": 4}),
    name="company_handbook",
    description="Search the company handbook for policies, benefits, and procedures.",
)

# 3. the agent now decides WHEN to retrieve -> this is "agentic RAG"
agent = create_agent(model="claude-sonnet-4-6", tools=[retriever_tool],
                     system_prompt="Answer using the handbook. Cite what you find.")
print(agent.invoke({"messages": [{"role": "user", "content": "How many vacation days do I get?"}]}))

When a simple loop isn't enough — you need explicit branching, cycles, a grading step, or a human approval gate — you drop to LangGraph and draw the control flow yourself as a graph. State flows between nodes; edges (including conditional ones) decide what runs next.

Figure 27. A LangGraph graph makes control flow explicit: retrieve → grade the results → generate, with a conditional edge that loops back to rewrite the query when the retrieved context isn't good enough. This "self-correcting" pattern is hard to express as a plain loop.
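
Before reaching for the library, it can help to see the retrieve, grade, rewrite loop as a plain state machine. The sketch below is deliberately library-free and is not the LangGraph API: nodes are functions returning partial state updates, edges are functions returning the next node's name, and the substring "retriever" and rewrite rule are toy assumptions.

```python
def retrieve(state):
    hits = [d for d in state["corpus"] if state["query"].lower() in d.lower()]
    return {"hits": hits}

def grade(state):
    return {"good": len(state["hits"]) > 0}

def rewrite(state):
    # Toy rewrite rule: fall back to the last word of the query.
    return {"query": state["query"].split()[-1], "attempts": state["attempts"] + 1}

def generate(state):
    return {"answer": f"Based on: {state['hits'][0]}" if state["hits"] else "I don't know."}

def route_after_grade(state):
    # The conditional edge: decide which node runs next.
    if state["good"] or state["attempts"] >= 2:
        return "generate"
    return "rewrite"

NODES = {"retrieve": retrieve, "grade": grade, "rewrite": rewrite, "generate": generate}
EDGES = {"retrieve": lambda s: "grade", "grade": route_after_grade,
         "rewrite": lambda s: "retrieve", "generate": lambda s: "END"}

def run(state, start="retrieve"):
    node, trace = start, []
    while node != "END":
        trace.append(node)
        state = {**state, **NODES[node](state)}   # merge the node's partial update
        node = EDGES[node](state)
    return state, trace

state, trace = run({"query": "what is the refund", "attempts": 0,
                    "corpus": ["Our refund window is 30 days.", "Support hours are 9-5 GMT."]})
print(trace, state["answer"])
```

The first retrieval finds nothing, the router sends the run through `rewrite`, and the second pass succeeds. The attempt cap keeps the cycle from looping forever.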

›_ Example — with code (3 · LangGraph: a stateful graph with memory)

03_langgraph_graph.py · LangGraph — StateGraph
# pip install -U langgraph
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.checkpoint.memory import InMemorySaver

# nodes are just functions: (state) -> partial state update
def retrieve(state: MessagesState):
    query = state["messages"][-1].content
    hits  = store.as_retriever().invoke(query)          # your vector store
    return {"messages": [{"role": "system", "content": f"Context:\n{hits}"}]}

def generate(state: MessagesState):
    answer = model.invoke(state["messages"])            # your chat model
    return {"messages": [answer]}

# wire the graph: START -> retrieve -> generate -> END
g = StateGraph(MessagesState)
g.add_node(retrieve); g.add_node(generate)
g.add_edge(START, "retrieve")
g.add_edge("retrieve", "generate")
g.add_edge("generate", END)

# a checkpointer gives conversation memory across turns and resumable runs;
# InMemorySaver lives in RAM, so use a database-backed saver for durability across restarts
graph = g.compile(checkpointer=InMemorySaver())
graph.invoke({"messages": [{"role": "user", "content": "Summarize the refund policy."}]},
             config={"configurable": {"thread_id": "user-42"}})   # same thread = remembered
# Add g.add_conditional_edges(...) for the grade/retry branch, or interrupt() for human approval.

At the top of the stack, Deep Agents gives you all the hard parts of a capable autonomous agent for free. create_deep_agent has the same tool-calling core, but ships with a planner, a virtual filesystem that compresses context as runs grow long, the ability to spawn isolated subagents for parallel subtasks, and long-term memory — ideal for deep research, codebase work, or any task with many steps.

›_ Example — with code (4 · Deep Agents: a research agent with planning + subagents)

04_deep_agent.py · Deep Agents — create_deep_agent
# pip install -qU deepagents langchain-anthropic
from deepagents import create_deep_agent

def web_search(query: str) -> str:
    """Search the web and return the top results."""
    return run_search(query)        # plug in Tavily/SerpAPI/your own retriever here

# create_deep_agent bundles planning (write_todos), a virtual filesystem with
# automatic context compression, subagent spawning (the `task` tool), and memory.
agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[web_search],
    system_prompt=(
        "You are an expert researcher. Plan first, delegate sub-questions to "
        "subagents, save findings to files, then write a cited report."
    ),
)

# Give it a big, multi-step task -- it will plan, spawn subagents, and manage its own context.
result = agent.invoke({"messages": [{"role": "user", "content":
    "Compare QLoRA, DoRA and full fine-tuning on cost and quality. Write a sourced summary."}]})
print(result["messages"][-1].content)

↗ Which layer should I use?

Most agents → LangChain create_agent (model + tools + prompt, swap providers freely). Need custom control flow — cycles, grading/routing, human-in-the-loop, durable long runs — → LangGraph. Big, open-ended, multi-step autonomy (research, coding, ops) → Deep Agents, which gives you planning, a filesystem, subagents, and memory out of the box. And in all three, RAG is the same move: wrap a retriever as a tool. Wire up LangSmith from day one to see what your agent is actually doing.

SECTION 28 LangChain in depth: `create_agent`, tools, structured output & middleware

Section 27 showed the shape. Now the full surface: how the agent loop actually runs, how tools and structured output work, and the middleware system that is the real reason to use LangChain.

Agent = model + harness. create_agent builds a graph that runs the classic loop — call the model, and if the model asked for tools, run them, feed results back, and repeat until the model returns a final answer (no more tool calls). The signature exposes every lever you'll need:

∑ The create_agent surface

create_agent(
  model, tools=[…], system_prompt=…,
  middleware=[…],   # hooks around the loop (the key feature)
  response_format=…,  # structured output schema (Pydantic / strategy)
  checkpointer=…,     # short-term memory + durability
  state_schema=…, context_schema=…)  # custom state / runtime context

A few things make LangChain more than a thin wrapper:

Standard model interface. One string switches providers — "claude-sonnet-4-6", "openai:gpt-5.4", "google_genai:gemini-3.5-flash" — with a unified response shape (.content_blocks). No provider lock-in.
Tools are just functions. A typed Python function with a docstring becomes a tool; the docstring is the description the model reads to decide when to call it.
State. The agent carries an AgentState (a messages list, extendable with custom fields like user_id). A checkpointer persists it per thread_id so a conversation is remembered across turns.
Structured output. Pass response_format a Pydantic model and the agent returns parsed, validated data in structured_response instead of free text.
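
To see why "a typed function with a docstring" is enough to define a tool, the sketch below builds a JSON-schema-like description using only the standard library. It illustrates the idea and is not LangChain's implementation; `describe_tool` is a hypothetical helper, and the `max_results` parameter is added here purely to show an optional argument.

```python
import inspect

def describe_tool(fn):
    """Build a JSON-schema-like tool description from a typed function."""
    type_names = {str: "string", int: "integer", float: "number", bool: "boolean"}
    sig = inspect.signature(fn)
    props = {name: {"type": type_names.get(p.annotation, "string")}
             for name, p in sig.parameters.items()}
    required = [name for name, p in sig.parameters.items()
                if p.default is inspect.Parameter.empty]
    return {"name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object", "properties": props, "required": required}}

def search_flights(origin: str, destination: str, date: str, max_results: int = 3) -> str:
    """Search available flights between two cities on a given date."""
    return f"{max_results} flights from {origin} to {destination} on {date}: ..."

spec = describe_tool(search_flights)
print(spec)
```

The name, description, and parameter types are everything the model sees when it decides whether to call the tool, which is why a precise docstring and accurate type hints matter.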

Figure 28. The agent loop and the middleware hooks that wrap it: before_agent/after_agent bracket the whole run; before_model/after_model and wrap_model_call surround each model call; wrap_tool_call surrounds each tool. This is where guardrails, summarization, retries, and limits live.

◆ Example — no code

Middleware is like the staff around a chef in a busy kitchen. The chef (model) just cooks. But an expediter checks every order before it reaches the chef (before_model), a quality inspector tastes each plate before it leaves (after_model), a runner handles the actual fetching of ingredients (wrap_tool_call), and a manager caps how many dishes get made so costs don't explode (call limits). You compose the staff you need; the chef's job never changes. That separation is why one summarization or PII-redaction middleware drops into any agent unchanged.
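
The kitchen picture can be sketched as a small library-free loop with two hook points. This is a simplification under stated assumptions: hooks here take and return plain message lists or replies, whereas LangChain's hooks receive the agent state and return state updates; the scripted model and the `lookup` tool are made up for the example.

```python
import re

def run_agent(model, tools, messages, before_model=(), after_model=(), max_turns=5):
    """Tool-calling loop with hook points around every model call."""
    for _ in range(max_turns):
        for hook in before_model:            # e.g. redact, summarise, guard
            messages = hook(messages)
        reply = model(messages)
        for hook in after_model:             # e.g. validate or log the reply
            reply = hook(reply)
        messages = messages + [reply]
        if "tool" not in reply:              # no tool call means a final answer
            return messages
        result = tools[reply["tool"]](reply["args"])
        messages = messages + [{"role": "tool", "content": result}]
    return messages

def redact_emails(messages):
    """before_model hook: mask anything that looks like an email address."""
    return [{**m, "content": re.sub(r"\S+@\S+", "[email]", m["content"])}
            if "content" in m else m
            for m in messages]

seen_by_model = []

def scripted_model(messages):
    """Stand-in for an LLM: call a tool once, then answer."""
    seen_by_model.append([m.get("content", "") for m in messages])
    if not any(m.get("role") == "tool" for m in messages):
        return {"role": "assistant", "tool": "lookup", "args": "order 17"}
    return {"role": "assistant", "content": "Your order has shipped."}

history = run_agent(
    scripted_model,
    {"lookup": lambda q: f"status of {q}: shipped"},
    [{"role": "user", "content": "I'm sara@example.com, where is order 17?"}],
    before_model=[redact_emails],
)
print(history[-1]["content"])
```

The model function never changes; the redaction hook sits in front of it and could be reused with any other agent, which is the separation the analogy describes.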

›_ Example — with code (1 · tools, structured output, and memory)

01_lc_core.py · LangChain — create_agent
from langchain.agents import create_agent
from langgraph.checkpoint.memory import InMemorySaver
from pydantic import BaseModel, Field

# 1. a tool is a typed function; its docstring is the description the model reads
def search_flights(origin: str, destination: str, date: str) -> str:
    """Search available flights between two cities on a given date."""
    return f"3 flights from {origin} to {destination} on {date}: ..."

# 2. structured output: ask for parsed, validated data instead of prose
class FlightPick(BaseModel):
    airline: str = Field(description="chosen airline")
    price_usd: float
    depart_time: str

agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[search_flights],
    system_prompt="You are a travel assistant. Pick the best-value flight.",
    response_format=FlightPick,             # -> result["structured_response"] is a FlightPick
    checkpointer=InMemorySaver(),           # -> short-term memory across turns
)

cfg = {"configurable": {"thread_id": "trip-1"}}
res = agent.invoke({"messages": [{"role": "user",
        "content": "Cheapest flight SFO to JFK on Dec 5?"}]}, config=cfg)
print(res["structured_response"])           # FlightPick(airline=..., price_usd=..., ...)
# Because of the thread_id, a follow-up "what about Dec 6?" remembers the context.

›_ Example — with code (2 · production middleware: summarize, limit, guard, fall back)

02_lc_middleware.py · LangChain — built-in middleware
from langchain.agents import create_agent
from langchain.agents.middleware import (
    SummarizationMiddleware,    # compress old history near the context limit
    ToolCallLimitMiddleware,    # cap tool calls (cost / runaway protection)
    ModelFallbackMiddleware,    # retry on a backup model if the primary fails
    PIIMiddleware,              # detect & redact sensitive data
    HumanInTheLoopMiddleware,   # pause for approval on risky tools
)
from langgraph.checkpoint.memory import InMemorySaver

agent = create_agent(
    model="claude-sonnet-4-6",
    tools=[search_tool, send_email_tool],
    checkpointer=InMemorySaver(),                       # required by HITL
    middleware=[
        PIIMiddleware("email", strategy="redact", apply_to_input=True),
        SummarizationMiddleware(model="gpt-5.4-mini",
                                trigger=("fraction", 0.8), keep=("messages", 20)),
        ToolCallLimitMiddleware(thread_limit=20, run_limit=8),
        ModelFallbackMiddleware("gpt-5.4-mini", "openai:gpt-5.4"),
        HumanInTheLoopMiddleware(interrupt_on={          # ask a human before sending
            "send_email_tool": {"allowed_decisions": ["approve", "edit", "reject"]},
            "search_tool": False,                        # auto-run safe tools
        }),
    ],
)
# Each middleware handles ONE concern and composes with the others -- no agent rewrite.

›_ Example — with code (3 · custom middleware via hooks)

03_lc_custom_mw.py · LangChain — @hooks
from langchain.agents import create_agent
from langchain.agents.middleware import before_model, dynamic_prompt

# inject a fresh, per-request system prompt (e.g. user preferences from a store)
@dynamic_prompt
def personalized_prompt(request) -> str:
    user = request.runtime.context.get("user_name", "there")
    return f"You are a concise assistant. Address the user as {user}."

# a guardrail that runs right before every model call
@before_model(can_jump_to=["end"])       # declare the jump target up front
def log_and_guard(state, runtime) -> dict | None:
    last = state["messages"][-1].content
    if isinstance(last, str) and "wire transfer" in last.lower():
        return {"jump_to": "end"}        # short-circuit the loop
    return None                          # None -> proceed normally

agent = create_agent(
    model="claude-sonnet-4-6", tools=[...],
    middleware=[personalized_prompt, log_and_guard],
    context_schema=dict,                 # lets us pass runtime context
)
agent.invoke({"messages": [{"role": "user", "content": "hi"}]},
             context={"user_name": "Sara"})

↗ The mental model

Build the core agent with model + tools + system_prompt, then add capabilities as middleware: summarization for long chats, limits for cost, PII for compliance, human-in-the-loop for risky actions, fallback for resilience. The six hooks — before/after_agent, before/after_model, wrap_model_call, wrap_tool_call — cover essentially any interception you'll want, and a hook can return jump_to to redirect the loop.
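
To make the "wrap" idea concrete, here is a minimal sketch of how wrap-style middleware can compose around a model call. It is a toy in plain Python, not LangChain's implementation: `fake_model`, `redact_email`, `block_wire_transfers`, and `compose` are all hypothetical names invented for the illustration.

```python
from typing import Callable

# Hypothetical stand-ins: a "model" is any function from a message list to a reply.
Model = Callable[[list[str]], str]
Middleware = Callable[[list[str], Model], str]

def fake_model(messages: list[str]) -> str:
    return f"echo: {messages[-1]}"

def redact_email(messages: list[str], call_next: Model) -> str:
    # Before the model: scrub the input, then hand off to the next layer.
    cleaned = [m.replace("sara@example.com", "[EMAIL]") for m in messages]
    return call_next(cleaned)

def block_wire_transfers(messages: list[str], call_next: Model) -> str:
    # Short-circuit: return a reply without calling the model at all.
    if "wire transfer" in messages[-1].lower():
        return "Request blocked by guardrail."
    return call_next(messages)

def _bind(layer: Middleware, inner: Model) -> Model:
    return lambda msgs: layer(msgs, inner)

def compose(model: Model, middleware: list[Middleware]) -> Model:
    # Wrap from the inside out so the FIRST middleware in the list runs outermost.
    wrapped = model
    for layer in reversed(middleware):
        wrapped = _bind(layer, wrapped)
    return wrapped

agent_step = compose(fake_model, [redact_email, block_wire_transfers])
print(agent_step(["mail sara@example.com"]))
print(agent_step(["please do a wire transfer"]))
```

Each layer knows about one concern only, and the order of the list decides which layer sees the request first. That is the property that lets you add or remove a capability without touching the others.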

SECTION 29 · LangGraph in depth: state, edges, persistence & human-in-the-loop

When the loop isn't the right shape — you need branching, cycles, parallelism, durability, or a human gate — you drop to LangGraph and build the control flow as an explicit graph.

Everything in LangGraph is three ideas: state (a typed object that flows through the graph), nodes (functions that read state and return an update), and edges (what runs next). You build it, then compile() it into a runnable.

State & reducers. State is typically a TypedDict. By default a node's returned keys overwrite state; annotate a field with a reducer like add_messages or operator.add to append instead, which is essential so parallel branches don't clobber each other.
Edges. add_edge(a, b) is unconditional; add_conditional_edges(node, router_fn) branches on a function's return; START and END are the entry/exit. Cycles are allowed — that's how loops and retries work.
Command. A node can return Command(goto="x", update={...}) to set state and choose the next node in one move — dynamic routing without a separate conditional edge.
Persistence. Compile with a checkpointer (InMemorySaver for dev, SqliteSaver single-server, PostgresSaver at scale). State is saved per thread_id after every step — giving memory and durable execution (resume after a crash).
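Human-in-the-loop. Call interrupt(payload) inside a node to pause indefinitely (a checkpointer is required, since the paused state has to be stored somewhere); resume by re-invoking with Command(resume=value), which becomes the return value of interrupt(). On resume the node runs again from its first line, so keep any side effects placed before interrupt() idempotent.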
Streaming & parallelism. graph.stream(..., stream_mode=) emits updates, values, messages, or custom events; Send fans out work to parallel branches (map-reduce).
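
The overwrite-versus-reducer rule is easier to see in a few lines of plain Python. This is a sketch of the merge semantics only, under the assumption of a hypothetical `REDUCERS` table and `apply_update` helper; it is not how LangGraph is implemented, and the real library may reject two same-step writes to a field without a reducer rather than silently keeping the last one.

```python
import operator
from typing import Any, Callable

# Hypothetical schema: field name -> reducer. Fields without one are overwritten.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"notes": operator.add}

def apply_update(state: dict, update: dict) -> dict:
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in merged:
            merged[key] = REDUCERS[key](merged[key], value)   # combine old and new
        else:
            merged[key] = value                               # default: last write wins
    return merged

state = {"notes": ["start"], "status": "new"}

# Two parallel branches each return a partial update.
branch_a = {"notes": ["from A"], "status": "a-done"}
branch_b = {"notes": ["from B"], "status": "b-done"}

for update in (branch_a, branch_b):
    state = apply_update(state, update)

print(state["notes"])    # both branches' notes survive
print(state["status"])   # only the last writer's value is kept
```

The takeaway matches the rule above: any field that more than one branch can write to needs a reducer, and everything else can stay a plain overwrite.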

Figure 29. A StateGraph: typed state with reducers flows through nodes; a conditional edge routes to tools or END; tool results loop back; the checkpointer persists state every step so the run is both stateful and resumable.

◆ Example — no code

LangChain's create_agent is an automatic car — one pedal, it handles the gears. LangGraph is a manual transmission: more to operate, but you control exactly when to shift, brake, loop back, or pull over for a passenger (a human). The checkpointer is the car's black box — it records the journey continuously, so if the engine cuts out on a mountain road, you restart from the last marker instead of the bottom of the hill. The reducer is the rule that when two passengers add to the shared shopping list at once, the items get combined rather than one erasing the other.
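
The "black box" part of the analogy can be sketched as a toy checkpointer: save a snapshot per thread after every completed step, and on the next run start from the last snapshot. `ToyCheckpointer` and `run` are hypothetical names for this illustration; the real savers store much more than this.

```python
import copy

class ToyCheckpointer:
    """Hypothetical in-memory store: one list of snapshots per thread_id."""
    def __init__(self):
        self.snapshots: dict[str, list[dict]] = {}

    def save(self, thread_id: str, step: int, state: dict) -> None:
        self.snapshots.setdefault(thread_id, []).append(
            {"step": step, "state": copy.deepcopy(state)})

    def latest(self, thread_id: str):
        history = self.snapshots.get(thread_id)
        return history[-1] if history else None

def run(steps, saver, thread_id, crash_at=None):
    last = saver.latest(thread_id)
    # Resume after the last saved step, or start fresh if nothing was saved.
    start = last["step"] + 1 if last else 0
    state = copy.deepcopy(last["state"]) if last else {"log": []}
    for i in range(start, len(steps)):
        if i == crash_at:
            raise RuntimeError(f"crashed before step {i}")
        state["log"].append(steps[i])
        saver.save(thread_id, i, state)   # persist after every completed step
    return state

saver = ToyCheckpointer()
steps = ["plan", "search", "draft", "review"]
try:
    run(steps, saver, "u1", crash_at=2)   # the engine cuts out mid-journey
except RuntimeError as err:
    print(err)
final = run(steps, saver, "u1")           # picks up at "draft", not at "plan"
print(final["log"])
```

Because snapshots are keyed by `thread_id`, a second thread would start from an empty log: the same mechanism gives both per-conversation memory and crash recovery.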

›_ Example — with code (1 · state, reducers, conditional edges, a cyclic loop)

01_lg_graph.py · LangGraph: StateGraph
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import InMemorySaver

# 1. STATE: messages append (reducer); attempts overwrite by default
class State(TypedDict):
    messages: Annotated[list, add_messages]   # reducer -> appends, never clobbers
    attempts: int

# 2. NODES: (state) -> partial update. `model` and `execute_tool_calls` are assumed to be defined elsewhere.
def call_model(state: State):
    reply = model.invoke(state["messages"])
    return {"messages": [reply], "attempts": state.get("attempts", 0) + 1}

def run_tools(state: State):
    results = execute_tool_calls(state["messages"][-1])
    return {"messages": results}

# 3. ROUTER for a conditional edge: keep looping or stop
def should_continue(state: State) -> str:
    last = state["messages"][-1]
    if last.tool_calls and state["attempts"] < 5:
        return "tools"
    return END

# 4. WIRE the graph (note the cycle tools -> agent)
g = StateGraph(State)
g.add_node("agent", call_model)
g.add_node("tools", run_tools)
g.add_edge(START, "agent")
g.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
g.add_edge("tools", "agent")              # loop back

graph = g.compile(checkpointer=InMemorySaver())
graph.invoke({"messages": [{"role": "user", "content": "Plan my week."}], "attempts": 0},
             config={"configurable": {"thread_id": "u1"}})

›_ Example — with code (2 · human-in-the-loop with interrupt + resume)

02_lg_hitl.py · LangGraph: interrupt / Command
from langgraph.types import Command, interrupt

def approve_purchase(state: State):
    # assumes State also declares `item` and `cost` fields
    # pause the graph and surface a payload to the caller
    decision = interrupt({"action": "buy", "item": state["item"], "cost": state["cost"]})
    if decision == "approve":
        return {"messages": [{"role": "system", "content": "Purchase approved."}]}
    return {"messages": [{"role": "system", "content": "Purchase cancelled."}]}

# ... add_node("approve", approve_purchase) ... compile(checkpointer=...)

cfg = {"configurable": {"thread_id": "order-9"}}
result = graph.invoke({...}, config=cfg)
# Execution pauses at interrupt(); result surfaces the pending interrupt payload:
print(result["__interrupt__"])     # (Interrupt(value={'action': 'buy', ...}),)

# A human reviews, then we RESUME -- the resume value becomes interrupt()'s return:
graph.invoke(Command(resume="approve"), config=cfg)   # continues exactly where it paused
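
One detail of pause-and-resume is easy to miss: the node does not continue from the middle, it runs again from the top, and this time the interrupt call returns the human's answer. The toy below models only that replay behavior in plain Python. `Paused`, `toy_interrupt`, and `invoke` are hypothetical stand-ins, not LangGraph APIs.

```python
class Paused(Exception):
    """Hypothetical signal carrying the payload shown to the human."""
    def __init__(self, payload):
        super().__init__("paused")
        self.payload = payload

_resume_values: list = []    # answers supplied by the caller on resume

def toy_interrupt(payload):
    if _resume_values:
        return _resume_values.pop(0)     # resumed: hand back the human's answer
    raise Paused(payload)                # first pass: stop the node here

calls = {"approve_purchase": 0}

def approve_purchase(state: dict) -> dict:
    calls["approve_purchase"] += 1       # runs on EVERY pass, including the resume
    decision = toy_interrupt({"action": "buy", "item": state["item"]})
    return {"status": "approved" if decision == "approve" else "cancelled"}

def invoke(node, state, resume=None):
    if resume is not None:
        _resume_values.append(resume)
    try:
        return node(state)
    except Paused as pause:
        return {"__interrupt__": pause.payload}

state = {"item": "laptop"}
first = invoke(approve_purchase, state)                     # pauses
second = invoke(approve_purchase, state, resume="approve")  # replays and finishes
print(first)
print(second)
print(calls)
```

The counter ends at 2, which is the practical lesson: anything placed before the interrupt call (charging a card, sending a message) would happen twice, so put side effects after it or make them idempotent.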

›_ Example — with code (3 · Command routing, streaming, durable Postgres memory)

03_lg_command_stream.py · LangGraph: Command / stream / Postgres
from langgraph.types import Command

# A node that BOTH updates state AND picks the next node -- no conditional edge needed
def supervisor(state: State) -> Command:
    nxt = "researcher" if needs_research(state) else "writer"
    return Command(goto=nxt, update={"messages": [route_note(nxt)]})

# Stream intermediate progress instead of waiting for the final result
for mode, chunk in graph.stream(inputs, config=cfg, stream_mode=["updates", "messages"]):
    print(mode, chunk)        # "updates" = per-node state deltas; "messages" = token stream

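# Durable, multi-instance memory for production: swap the checkpointer
from langgraph.checkpoint.postgres import PostgresSaver
with PostgresSaver.from_conn_string("postgresql://...") as saver:
    saver.setup()                           # create the checkpoint tables (needed on first use)
    graph = g.compile(checkpointer=saver)   # survives restarts, scales across servers
    # invoke / stream inside this block, while the connection is still open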

↗ When to drop to LangGraph

Reach for it when you need explicit control flow (branching, bounded cycles, multi-agent supervisors), durable long runs that survive crashes, human approval gates mid-run, or fine-grained streaming. Keep state small and typed, use reducers only where branches merge, and put conditional edges only at real decision points. For a plain tool-calling assistant, stay on create_agent — it compiles down to a LangGraph graph anyway.

SECTION 30 · Deep Agents in depth: planning, filesystem, subagents & skills

Deep Agents is the "batteries-included" harness — the same tool-calling loop, but pre-loaded with the machinery that makes agents survive long, messy, multi-step tasks. It's the open-source distillation of what makes tools like Claude Code work.

create_deep_agent returns a compiled LangGraph graph (so you keep streaming, checkpointers, and tracing), but it ships with four built-in capabilities turned on by default, each implemented as middleware you can override:

Planning — a write_todos / read_todos tool (TodoListMiddleware) lets the agent decompose a task into a tracked checklist and adapt it as it learns.
Virtual filesystem — ls, read_file, write_file, edit_file, glob, grep (FilesystemMiddleware). Crucially, it offloads large tool results to files automatically so the context window doesn't overflow on long runs. Backends are pluggable: in-memory state, local disk, a LangGraph Store, a shell-capable backend, or a sandbox.
Subagents — a task tool (SubAgentMiddleware) spawns general-purpose or specialized subagents in isolated context windows; each does its subtask and returns only a summary, keeping the main context clean. Async subagents run in the background with progress checks and cancellation.
Context management & memory — automatic summarization when context grows large (keeping recent messages), plus long-term memory across threads via the LangGraph Memory Store, and skills (reusable workflows/domain knowledge loaded into the prompt).
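
The context-offloading idea in the filesystem bullet can be sketched in a few lines: when a tool result is large, store the full text in a file and put only a short pointer in the message history. The threshold, path layout, and `offload_if_large` helper are assumptions made up for this example, not Deep Agents defaults.

```python
MAX_INLINE_CHARS = 200     # hypothetical threshold, not the library's default

def offload_if_large(tool_name: str, result: str, files: dict[str, str]) -> str:
    """Return what goes into the message history for one tool result."""
    if len(result) <= MAX_INLINE_CHARS:
        return result                          # small results stay inline
    path = f"/tool_results/{tool_name}_{len(files)}.txt"
    files[path] = result                       # full text lives in the virtual filesystem
    preview = result[:80]
    return f"[{len(result)} chars saved to {path}] preview: {preview}..."

files: dict[str, str] = {}
history = [
    offload_if_large("web_search", "short answer", files),
    offload_if_large("web_search", "x" * 5000, files),
]
context_chars = sum(len(message) for message in history)
print(context_chars, list(files))
```

The agent can still read the full result later with a file-read tool, but the context window only pays for the pointer and a preview.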

Figure 30. The Deep Agents harness: a plain tool-calling core surrounded by built-in planning, a context-offloading filesystem, subagent delegation, and memory/skills — the parts you'd otherwise hand-build for any serious long-running agent.

◆ Example — no code

A basic agent is one person with a notepad trying to hold an entire project in their head — they run out of room fast. A Deep Agent is a project lead with an office: a whiteboard for the plan (write_todos), filing cabinets so findings live on paper instead of cramming the desk (the filesystem — and it files away bulky documents automatically), and a team of assistants they can hand self-contained sub-tasks to, each working in their own room and reporting back a one-paragraph summary (subagents in isolated context). That's why Deep Agents handles a "project" where a plain agent chokes on the third step.

›_ Example — with code (1 · a deep agent with custom subagents)

01_deepagent.py · Deep Agents: create_deep_agent
from deepagents import create_deep_agent

def web_search(query: str) -> str:
    """Search the web and return results."""
    return run_tavily(query)

# specialized subagents: each gets its OWN isolated context window
research_subagent = {
    "name": "researcher",
    "description": "Deeply researches one focused sub-question and returns a summary.",
    "system_prompt": "You are a meticulous researcher. Cite sources.",
    "tools": [web_search],
}
critic_subagent = {
    "name": "critic",
    "description": "Reviews a draft for accuracy and gaps.",
    "system_prompt": "You are a sharp editor. List concrete fixes.",
}

agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[web_search],
    system_prompt=("You are a research lead. First write_todos to plan. "
                   "Delegate sub-questions to the researcher subagent, save findings "
                   "to files, then have the critic review before the final report."),
    subagents=[research_subagent, critic_subagent],
)

result = agent.invoke({"messages": [{"role": "user", "content":
    "Write a sourced report comparing LoRA, QLoRA, and DoRA on cost and quality."}]})
print(result["messages"][-1].content)
# Under the hood: write_todos -> task(researcher) x N -> write_file -> task(critic) -> report

›_ Example — with code (2 · backends, shell, memory & human approval)

02_deepagent_backends.py · Deep Agents: backends + HITL
from deepagents import create_deep_agent
from deepagents.backends import LocalShellBackend     # filesystem + real shell `execute`
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.store.memory import InMemoryStore

# A coding-style agent: real local files + shell, long-term memory, approval gates
agent = create_deep_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[],
    backend=LocalShellBackend(workspace_root="/workspace"),  # ls/read/write/edit + execute
    system_prompt="You are a coding agent. Plan, edit files, run tests, then summarize.",
    checkpointer=InMemorySaver(),     # durable + resumable (also enables interrupts)
    store=InMemoryStore(),            # long-term memory across threads (swap for a DB)
    interrupt_on={"execute": {"allowed_decisions": ["approve", "reject"]}},  # gate shell cmds
)

# Because it returns a compiled LangGraph graph, you can stream subagent activity:
for namespace, chunk in agent.stream({"messages": [{"role": "user",
        "content": "Fix the failing test in utils.py"}]},
        config={"configurable": {"thread_id": "dev-1"}},
        stream_mode="updates", subgraphs=True):   # subgraphs=True -> (namespace, chunk) pairs
    print(namespace, chunk)

! "Trust the LLM" — with guardrails

Deep Agents follows a trust-the-model design: the agent can do anything its tools and backend allow, including running shell commands and editing files. That power needs fences. Use a sandbox backend (not your host) for untrusted work, declare filesystem permission rules to bound read/write access, gate dangerous tools like execute behind human-in-the-loop, and cap runs with model/tool call limits. Always wire up LangSmith so you can see what a long autonomous run actually did.

↗ Choosing your altitude, one more time

Deep Agents when the task is a project: research, coding, multi-step ops needing planning + files + delegation. LangChain create_agent when it's a focused assistant (model + tools + a few middleware). LangGraph when the control flow itself is custom and the agent loop isn't the right shape. They nest cleanly — Deep Agents is built on create_agent, which is built on LangGraph — so you can always drop a level for more control without leaving the ecosystem.

SECTION 31 · The one-page cheat sheet

Everything above, compressed. Keep this nearby.

Vocabulary in one line each

Term	Meaning
Parameter / weight	A learnable number the model adjusts during training.
Hyperparameter	A setting you choose (learning rate, batch size, #layers).
Forward pass	Run input through the model to get a prediction.
Loss	One number scoring how wrong the prediction is.
Backprop	Compute the gradient of the loss w.r.t. every weight.
Gradient	Direction + rate the loss changes as a weight changes.
Optimizer	Uses gradients to update weights (SGD, Adam, AdamW).
Epoch	One full pass over the training set.
Batch	A small group of examples processed together.
Overfitting	Memorizing train data, failing on new data.
Logits	Raw model scores before softmax/sigmoid.
Embedding	A learned vector representing a discrete item.
Fine-tuning	Adapting a pretrained model to your task.
Attention	Each token weighs how much every other token matters.
Quantization	Store weights in fewer bits (int8/int4) to shrink + speed up.
LoRA	Train tiny low-rank adapters instead of all weights.
QLoRA	LoRA on a 4-bit frozen base — fine-tune big models on one GPU.
RLHF	Align a model to human preference: SFT → reward model → PPO.
DPO	Align directly from preference pairs — no reward model or RL.
GRPO	Online RL without a critic; great for verifiable (math/code) rewards.
MoE	Many expert FFNs; a router activates only a few per token.
GQA	Query heads share KV heads — smaller KV cache, longer context.
KV cache	Stored past keys/values, so each new token reuses them instead of recomputing the whole prefix.
RAG	Retrieve relevant docs and feed them into the prompt.
Agent	An LLM in a loop that calls tools, observes, and iterates.
Distillation	Train a small model to imitate a big one's outputs.
LangChain	Agent framework: `create_agent` = model + tools + prompt + middleware.
LangGraph	Orchestration runtime: stateful graph, durable execution, persistence.
Deep Agents	Batteries-included harness: planning, filesystem, subagents, memory.
Middleware	Hooks around the agent loop (summarize, limit, guard, fallback).
Checkpointer	Saves graph state per thread — memory + crash-resume.
Subagent	Delegated agent with isolated context that returns a summary.

Pick the architecture

Your data is…	Reach for…
Tabular / structured	Gradient-boosted trees first; MLP if you must
Images / video	CNN or Vision Transformer (fine-tuned)
Text / language	Transformer (BERT to understand, GPT-class to generate)
Time series / streaming audio	LSTM/GRU, Temporal CNN, or a transformer
Need to generate images	Diffusion model
Need semantic search / RAG	Embedding model + vector search

Modern LLM toolbox — pick the right lever

Your goal…	Reach for…
Make a big model fit / run cheaper	Quantization (int4 NF4, AWQ, GGUF) · §19
Customize behavior on a budget	LoRA / QLoRA / DoRA · §20
Align to human preference (have pairs)	DPO (or KTO for binary labels) · §22
Improve reasoning with a checkable reward	GRPO / RLVR · §22
Maximum reward control / online RL	RLHF + PPO · §21
Longer context / faster attention	GQA + FlashAttention + RoPE · §23
Huge capacity, modest compute/token	Mixture-of-Experts · §23
Train a model too big for one GPU	bf16 + checkpointing + FSDP/ZeRO · §24
Serve fast to many users	vLLM (PagedAttention, continuous batching) · §25
Speed up generation, same output	Speculative decoding · §25
Give the model fresh / private knowledge	RAG · §26
Let it use tools & act	Agent loop + function calling · §26
Build an agent fast, swap providers freely	LangChain `create_agent` + middleware · §28
Custom control flow / durability / HITL	LangGraph `StateGraph` + checkpointer · §29
Long, multi-step "project" autonomy	Deep Agents (planning + files + subagents) · §30
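
For the first two rows of the table, a quick back-of-envelope calculation shows why those levers matter. This is a rough sketch: it counts weights only (no activations, optimizer state, or KV cache), ignores the per-block scale overhead that real quantized formats carry, and uses a hypothetical 7B-parameter model with 4096-wide layers.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """A (rank x d_in) plus B (d_out x rank) for one adapted weight matrix."""
    return rank * (d_in + d_out)

n = 7e9   # hypothetical 7B-parameter model
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(n, bits):.1f} GB")

full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"LoRA r=8 on one 4096x4096 matrix: {lora:,} trainable vs {full:,} frozen")
```

Halving the bits halves the weight memory, and a rank-8 adapter trains well under one percent of the numbers in the matrix it adapts.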

The loop you'll write a thousand times

for epoch in range(epochs):
    for xb, yb in loader:
        preds = model(xb)         # forward
        loss  = loss_fn(preds, yb)
        opt.zero_grad()           # clear grads  ← don't forget!
        loss.backward()           # backprop
        opt.step()                # update
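
Why the "don't forget" on zero_grad? Gradients accumulate across steps unless you clear them. The sketch below mirrors the same five steps in plain Python on a one-weight model (fit y = w * x to data drawn from y = 2x), with a hypothetical `zero_grad` switch so you can see both behaviors. It is a hand-rolled stand-in for illustration, not PyTorch.

```python
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # toy pairs drawn from y = 2x

def train(zero_grad: bool, epochs: int = 50, lr: float = 0.05) -> float:
    w, grad = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = w * x                      # forward
            error = pred - y                  # loss would be error ** 2
            if zero_grad:
                grad = 0.0                    # clear the previous step's gradient
            grad += 2 * error * x             # backward: d(loss)/dw, accumulated like .grad
            w -= lr * grad                    # optimizer step (plain SGD)
    return w

w_good = train(zero_grad=True)
w_bad = train(zero_grad=False)
print(f"with zero_grad:    w = {w_good:.4f}")
print(f"without zero_grad: w = {w_bad:.4f}")
```

With the gradient cleared each step, w settles at 2. Without it, every update also carries all the stale gradients from earlier steps, so w keeps swinging around the answer instead of settling.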

Golden rules

Print shapes when anything breaks. Most bugs are shape/device/dtype.
Overfit one batch to validate your pipeline before scaling up.
Match loss & final activation — feed logits to cross-entropy, not softmaxed values.
Toggle train()/eval() around training vs inference.
Don't train from scratch if a pretrained model exists — fine-tune it.
The learning rate is the hyperparameter that matters most. Tune it first.
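
The logits rule is worth seeing with numbers. A logits-based cross-entropy applies softmax internally, so feeding it probabilities means softmax runs twice. The sketch below uses hand-written `softmax` and `cross_entropy_from_logits` helpers (stand-ins for a framework loss, written for this example) on a confident, correct 3-class prediction.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    m = max(scores)                                   # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits: list[float], target: int) -> float:
    """What a logits-based loss computes: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])

logits = [8.0, 0.0, 0.0]     # a confident, correct prediction for class 0
correct = cross_entropy_from_logits(logits, target=0)

# Bug: softmax applied by hand, then the loss applies softmax again internally.
double = cross_entropy_from_logits(softmax(logits), target=0)

print(f"logits in:        loss = {correct:.4f}")
print(f"probabilities in: loss = {double:.4f}")
```

With probabilities as input, every score is squeezed into the range 0 to 1 before the second softmax, so the loss cannot get close to zero however confident the model is. Training still runs, which is what makes this bug easy to miss.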

◆ Where to go next

You now have the full skeleton. Deepen it by building: train an MNIST classifier (the "hello world"), fine-tune a 🤗 model on a dataset you care about, then read the official docs as reference rather than cover-to-cover — pytorch.org, tensorflow.org, and huggingface.co/docs/transformers. The concepts here are the map; the docs are the territory.

