Machine Learning · scikit-learn

A visual guide to Machine Learning, with examples & code.

Every concept follows one simple rhythm — a plain-English idea, a real example, a diagram you can picture, and code you can run.

scikit-learn 1.9 · 14 sections · 8 models · 16 diagrams · beginner-friendly
01

What is Machine Learning?

In normal programming you write the rules and the computer produces the answer. In machine learning you hand the computer many examples — data plus the correct answers — and it discovers the rules by itself.

[Diagram: Normal programming: data + rules → answers. Machine learning: data + answers → rules. The computer learns the rule from examples, then predicts on new data.]
Programming gives the computer rules; ML gives it examples and lets it find the rules.
Example: To predict house prices, you don't write a price formula. You show the model thousands of houses (size, location, rooms) with their real prices, and it learns the relationship, then prices a brand-new house.
# install once
pip install scikit-learn pandas numpy matplotlib
02

Types of ML

Three families, split by whether your data already contains the correct answers.

Type | Has answers? | What it does | Example
Supervised | Yes | Predicts a value or category | House price · spam filter
Unsupervised | No | Finds hidden groups / patterns | Customer segments
Reinforcement | Reward / penalty | Learns by trial and error | Game-playing · robotics

Inside Supervised there are two jobs: Regression (the answer is a number, like a price) and Classification (the answer is a category, like spam / not spam).

[Diagram: Regression → a number · Classification → a category]
Regression fits a trend to predict numbers; classification finds a boundary to separate categories.
03

How every model works: fit & predict

The single most useful thing here: every supervised scikit-learn model uses the same three methods. Learn them once and you can drive all of them.

[Diagram: .fit(X, y) learn → .predict(X) guess new data → .score(X, y) how good?]
The universal scikit-learn workflow — true for linear models, trees, SVMs, everything.
model.fit(X_train, y_train)   # 1) learn from data
model.predict(X_test)         # 2) make predictions
model.score(X_test, y_test)   # 3) check how good it is

X = the inputs (features that describe the thing). y = the answer you want to predict (the target).
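To make the shared interface concrete, here is a minimal sketch that runs three different classifiers through the same fit / predict / score calls. The iris dataset, the choice of models, and the `models` and `scores` names are assumptions picked for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset: 150 flowers, 4 features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Three very different algorithms, one identical interface
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=3),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # 1) learn
    preds = model.predict(X_test)                # 2) predict
    scores[name] = model.score(X_test, y_test)   # 3) accuracy on unseen data
    print(name, "first predictions:", preds[:5], "accuracy:", round(scores[name], 3))
```

Swapping one model for another changes a single line; the rest of the workflow stays the same.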

04

Train / Test split

You must test the model on data it has never seen, to prove it actually learned instead of just memorizing. So we hold out a slice.

[Diagram: All data split into TRAIN 80% (study) and TEST 20%]
Most data trains the model; the held-out test set is the “exam” on unseen questions.
Example: Like exam prep: you practice on some questions (training), but the real exam uses different questions (testing). That's the only way to prove you understood, not memorized.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% held out for testing
    random_state=42,    # makes the split repeatable
    stratify=y          # keep class balance (classification)
)
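A runnable version of the split, as a sketch on the built-in iris data (an assumption; any labeled dataset works the same way). It prints the resulting shapes and the per-class counts so you can see what `stratify=y` does.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# 150 flowers, 3 classes with 50 flowers each
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% held out for testing
    random_state=42,    # makes the split repeatable
    stratify=y,         # keep the class proportions in both parts
)

print("train shape:", X_train.shape)   # 120 rows
print("test shape: ", X_test.shape)    # 30 rows
print("train class counts:", np.bincount(y_train))   # 40 per class
print("test class counts: ", np.bincount(y_test))    # 10 per class
```

Without `stratify`, the class counts in the test set would vary with the random seed, which matters most when one class is rare.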
05

Preparing data

Scaling — put every column on the same footing

If one column is in the thousands and another in single digits, many models (especially distance-based ones such as KNN and SVM) are fooled into treating the big-number column as more important. Scaling fixes that.

[Diagram: BEFORE (different scales): income vs. age → scale → AFTER (comparable): income vs. age]
Income (≈50,000) would dwarf age (≈35) until scaling brings them onto one comparable range.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from TRAIN only
X_test_scaled  = scaler.transform(X_test)       # just applies it
Golden rule: The scaler learns from the training data only (fit_transform) and is then merely applied (transform) to the test data. Fit it on everything and test-set statistics leak into training, so your test score looks better than it should.
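A small sketch of what the scaler actually stores and applies. The income and age numbers below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: columns are [income, age]
X_train = np.array([[30000.0, 25.0],
                    [50000.0, 35.0],
                    [70000.0, 45.0],
                    [90000.0, 55.0]])
X_test = np.array([[60000.0, 40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns mean/std from TRAIN only
X_test_scaled = scaler.transform(X_test)         # reuses the TRAIN mean/std

print("learned means:", scaler.mean_)                     # [60000, 40]
print("train mean after scaling:", X_train_scaled.mean(axis=0))   # about 0
print("train std after scaling: ", X_train_scaled.std(axis=0))    # about 1
print("test row after scaling:  ", X_test_scaled)                 # [[0, 0]]
```

The test row happens to sit exactly on the training means, so it maps to zero in both columns; both features now live on the same scale.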

Encoding — turn text into numbers

Models only understand numbers, so a city column of "Cairo / Giza / Alex" becomes three 0/1 columns.

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown="ignore")
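Here is a sketch of the encoder in use on the city example. The tiny `cities_train` array and the unseen city "Luxor" are assumptions added for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One text column, four rows (made-up)
cities_train = np.array([["Cairo"], ["Giza"], ["Alex"], ["Cairo"]])

encoder = OneHotEncoder(handle_unknown="ignore")
# The result is a sparse matrix by default; toarray() makes it printable
encoded = encoder.fit_transform(cities_train).toarray()

print(encoder.categories_[0])   # categories are sorted: Alex, Cairo, Giza
print(encoded)                  # one 0/1 column per city

# A city never seen during fit becomes all zeros instead of raising an error
unseen = encoder.transform(np.array([["Luxor"]])).toarray()
print(unseen)
```

`handle_unknown="ignore"` is what keeps prediction from crashing when new data contains a category the encoder never saw.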
06

Supervised models

Eight workhorses. Each card: the idea, a real example, a diagram, and code.

6.1 · Linear Regression

Regression

The idea. Draws the best straight line through your data to describe a relationship and predict numbers.

Example: Bigger house → higher price. The line learns "for every extra square meter, add about X to the price."
[Diagram: scatter of house size (x-axis) against price (y-axis) with a fitted line used to predict]
The fitted line lets you read off the predicted price for any new house size.
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[50],[60],[80],[100],[120],[150]])  # size m²
y = np.array([500,600,780,1000,1150,1450])         # price (k)

model = LinearRegression().fit(X, y)
print("130 m² →", model.predict([[130]])[0])
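To see the "add about X per square meter" rule itself, you can read the fitted slope and intercept. This sketch reuses the same toy numbers; `slope`, `intercept`, and `manual` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[50], [60], [80], [100], [120], [150]])   # size m²
y = np.array([500, 600, 780, 1000, 1150, 1450])         # price (k)

model = LinearRegression().fit(X, y)

slope = model.coef_[0]          # price change per extra square meter
intercept = model.intercept_    # where the line crosses size = 0
print("per extra m²:", round(slope, 2), "k")
print("intercept:   ", round(intercept, 2), "k")

# A prediction is just the line equation: intercept + slope * size
manual = intercept + slope * 130
print("130 m² by hand:   ", round(manual, 1))
print("130 m² by predict:", round(model.predict([[130]])[0], 1))
```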

6.2 · Logistic Regression

Classification

The idea. Despite the name, it's for classification. It outputs the probability of belonging to a class, using an S-shaped curve.

Example: Will a customer buy? It outputs "80% likely" → predicts "buy." Anything above the 0.5 line tips to yes.
[Diagram: S-shaped sigmoid curve rising from 0 to 1, with the 0.5 threshold marked]
The sigmoid maps any input to a probability between 0 and 1; cross the threshold to decide the class.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
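The probability and the 0.5 threshold can be inspected directly with `predict_proba`. This is a sketch on made-up two-class data (minutes on a site vs. whether the visitor bought); the numbers are assumptions chosen only to be easy to read.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: minutes spent on the site, and whether the visitor bought (1) or not (0)
minutes = np.array([[1], [2], [3], [4], [6], [7], [8], [9]])
bought = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(minutes, bought)

new_visitors = np.array([[2], [4], [8]])
proba = model.predict_proba(new_visitors)   # columns: P(not buy), P(buy)
p_buy = proba[:, 1]
labels = model.predict(new_visitors)

# For two classes, predict() is the same as thresholding P(buy) at 0.5
manual = (p_buy > 0.5).astype(int)

print("P(buy):", p_buy.round(2))
print("predict():", labels)
print("threshold by hand:", manual)
```

If false alarms or misses have different costs, you can apply your own threshold to `p_buy` instead of relying on the default.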

6.3 · K-Nearest Neighbors

Classification

The idea. "Tell me who your neighbors are, I'll tell you who you are." It looks at the closest K points and takes a majority vote.

Example: To classify a new flower, it finds the 3 most similar known flowers and goes with the majority type.
[Diagram: K = 3 nearest neighbors of a new point; votes are 2 teal / 1 coral → teal]
The new point inherits the majority label among its K closest neighbors.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("KNN:", model.score(X_te, y_te))   # scale features first — KNN uses distance

6.4 · Decision Tree

Both

The idea. Asks a series of yes/no questions until it reaches a decision. Easy to read and explain.

Example: Loan approval: "Income > 50k? → Age > 30? → Approve." Exactly like a flowchart.
[Diagram: flowchart. income > 50k? no → Reject; yes → age > 30? → Approve or Reject]
Each node splits the data with one question; leaves give the final answer.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
# max_depth limits how many questions → stops it from memorizing
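Because a tree is just a list of questions, you can print it. The sketch below uses `export_text` from `sklearn.tree` on the iris data (an assumption; the loan example above has no dataset attached).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree: at most two questions before it answers
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Turn the fitted tree into readable if/else rules
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
print("depth actually used:", tree.get_depth())
```

Each indented line in the output is one question; the `class:` lines are the leaves.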

6.5 · Random Forest

Strong default

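The idea. Builds many decision trees (scikit-learn uses 100 by default), each on a random sample of the data, and lets them vote. One tree can easily be wrong; the combined vote of the forest is usually more reliable.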

Example: Instead of trusting one doctor, you poll 100 doctors and take the majority diagnosis, which is far more reliable.
[Diagram: tree → A, tree → A, tree → B, … 100 trees vote; majority → A]
Many decorrelated trees vote; averaging cancels out individual mistakes.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
print("forest:", model.score(X_te, y_te))
print(model.feature_importances_)   # which features mattered most

6.6 · Support Vector Machine

Classification

The idea. Draws the boundary that separates classes with the widest possible gap (margin) between them.

Example: Separating cats from dogs with the cleanest line that leaves the biggest safety buffer on both sides.
[Diagram: two classes separated by a boundary line with a margin on each side; circled points = support vectors]
The solid line is the boundary; dashed lines mark the widest margin, set by the support vectors.
from sklearn.svm import SVC
model = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)   # rbf handles curved boundaries; scale first

6.7 · Naive Bayes

Classification

The idea. Uses Bayes' theorem to classify, under the "naive" assumption that features are independent of one another. Very fast and great for text.

Example: Spam detection: emails with "free", "win", "prize" lean spam. It multiplies those word probabilities to decide.
[Diagram: P(spam | "free win prize") = 0.86 vs. P(not spam) = 0.14]
It compares the probability of each class given the words, and picks the larger one.
from sklearn.naive_bayes import GaussianNB
model = GaussianNB().fit(X_tr, y_tr)
print("NB:", model.score(X_te, y_te))
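The snippet above uses GaussianNB on numeric features. For the spam example, a common choice is word counts fed into MultinomialNB; the sketch below assumes that setup, and the six tiny emails and their labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training emails
emails = [
    "win a free prize now",
    "free money win big",
    "claim your free prize",
    "meeting at noon tomorrow",
    "project report attached",
    "lunch tomorrow with the team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# CountVectorizer turns text into word counts; MultinomialNB classifies the counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

new_emails = ["free prize inside", "report for the meeting"]
predictions = model.predict(new_emails)
print(predictions)
print(model.classes_, model.predict_proba(new_emails).round(2))
```

With so little data the probabilities are only illustrative, but the mechanics are the same on a real inbox.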

6.8 · Gradient Boosting

Best on tables

The idea. Builds trees one after another, where each new tree fixes the mistakes of the previous ones. Usually the strongest model for tabular data.

Example: Like studying smart: each round you focus on the questions you got wrong last time, improving step by step.
[Diagram: error (y-axis) falling toward low error as trees are added (x-axis)]
Each tree shaves off some of the leftover error, so the ensemble's training error keeps falling; too many trees can eventually overfit.
from sklearn.ensemble import HistGradientBoostingClassifier
model = HistGradientBoostingClassifier(learning_rate=0.1, max_iter=300).fit(X_tr, y_tr)

Model comparison — when to reach for which

Model | Best for | Needs scaling? | Easy to read? | Speed
Linear / Logistic | Simple, linear relationships | Yes | Yes | Fast
KNN | Small data, clear groups | Yes | Medium | Slow at predict
Decision Tree | Interpretable rules | No | Very | Fast
Random Forest | Strong all-round default | No | Low | Medium
SVM | Clear margins, mid-size data | Yes | Low | Slow on big data
Naive Bayes | Text / many features | No | Medium | Very fast
Gradient Boosting | Top accuracy on tables | No | Low | Medium
07

Measuring how good your model is

Accuracy alone can lie. A cancer test that always says "healthy" is 99% accurate if only 1% are sick — yet useless. That's why we also read precision and recall from the confusion matrix.

Actual \ Predicted | Positive | Negative
Actual + | TP (correct hit) | FN (missed)
Actual − | FP (false alarm) | TN (correct pass)
Precision = TP / (TP+FP) · Recall = TP / (TP+FN). The off-diagonal cells are the two kinds of mistake.
Metric | Question it answers | Care most when…
Accuracy | Overall % correct | Classes are balanced
Precision | Of my "yes" predictions, how many were right? | False alarms are costly (spam filter)
Recall | Of all real positives, how many did I catch? | Misses are costly (cancer screening)
F1 | Balance of precision & recall | You need one number under imbalance
R² / RMSE | How close are numeric predictions? | Regression problems
from sklearn.metrics import classification_report, confusion_matrix
pred = model.predict(X_te)
print(confusion_matrix(y_te, pred))        # rows = actual, columns = predicted; for 0/1 labels: [[TN, FP], [FN, TP]]
print(classification_report(y_te, pred))   # precision / recall / f1 per class
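The "always healthy" cancer test can be checked in a few lines. This sketch builds made-up labels with 1% sick patients (1 = sick, 0 = healthy is an assumption) and a fake model that never predicts sick.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# 1,000 made-up patients: 10 sick (1), 990 healthy (0)
y_true = np.array([1] * 10 + [0] * 990)

# A useless "model" that says healthy for everyone
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# zero_division=0 avoids a warning: there are no positive predictions at all
prec = precision_score(y_true, y_pred, zero_division=0)
cm = confusion_matrix(y_true, y_pred)

print("accuracy: ", acc)    # 0.99
print("recall:   ", rec)    # 0.0, every sick patient was missed
print("precision:", prec)   # 0.0
print(cm)                   # [[990 0] [10 0]] = [[TN, FP], [FN, TP]]
```

Accuracy looks excellent while recall shows the model catches nobody, which is exactly why you read more than one metric on imbalanced data.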
08

Overfitting & Underfitting

The most important idea in practice. A model can be too simple (underfit), just right, or so complex it memorizes noise (overfit).

[Diagram: three fits to the same points. Underfit: too simple · Good fit: captures the trend · Overfit: memorizes noise]
The middle model generalizes; the right one chases every point and fails on new data.
Problem | Meaning | Sign | Fix
Underfitting | Model too simple | Bad on train and test | Stronger model, more features
Overfitting | Memorized the data | Great on train, bad on test | More data, simpler model, regularization
Example: A student who memorizes practice answers (overfit) fails a slightly different exam. One who barely studied (underfit) fails everything. You want the one who understood the concepts.
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0)   # a penalty that stops the model overreacting; bigger alpha = simpler
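One way to spot the "great on train, worse on test" sign is to compare both scores as model complexity grows. The sketch below does this with decision trees of increasing depth on the built-in breast cancer data; the dataset and the depths tried are assumptions, and the exact scores depend on the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

results = {}
for depth in [1, 3, None]:   # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for this depth
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}  test={results[depth][1]:.3f}")
```

A train score that climbs toward 1.0 while the test score stalls or drops is the overfitting signature; low scores on both point to underfitting.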
09

Cross-Validation

Instead of testing once, split the data into 5 parts and test 5 times — each part takes a turn as the test set. Average the scores for a far more trustworthy estimate.

[Diagram: five rows, Fold 1 to Fold 5; in each row a different fifth of the data is the test block and the rest is train]
Every fold serves as the test set exactly once; the final score is the average across all five.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)   # 5 rounds
print("avg:", scores.mean(), "±", scores.std())
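To see what `cross_val_score` does under the hood, here is a sketch that runs the five folds by hand and compares the result. It assumes the iris data and a logistic regression model; for a classifier with `cv=5`, scikit-learn uses stratified folds without shuffling, which is what the manual loop mirrors.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Manual version: each fold takes one turn as the test set
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = LogisticRegression(max_iter=1000)   # fresh model per fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# One-liner version
auto_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("manual:", np.round(manual_scores, 3))
print("auto:  ", np.round(auto_scores, 3))
print("avg:", auto_scores.mean().round(3), "±", auto_scores.std().round(3))
```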
10

Hyperparameter tuning

Every model has settings (like n_neighbors in KNN or max_depth in a tree). GridSearchCV tries every combination for you, scores each one with cross-validation, and keeps the best.

Example Tuning a recipe — automatically trying different oven temperatures and baking times, then keeping the combo that tastes best.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

params = {"n_estimators":[50,100,200], "max_depth":[3,5,10,None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=5)
grid.fit(X_tr, y_tr)
print("best:", grid.best_params_)
best_model = grid.best_estimator_   # ready to use
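It helps to know how much work a grid implies before running it. This sketch uses `ParameterGrid` to list the combinations for the same `params` dictionary; the `n_folds` variable is an illustrative name matching `cv=5` above.

```python
from sklearn.model_selection import ParameterGrid

params = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10, None]}

# Every combination GridSearchCV would try
combos = list(ParameterGrid(params))
n_folds = 5

print("combinations:", len(combos))                       # 3 × 4 = 12
print("model fits with cv=5:", len(combos) * n_folds)     # 60, plus one final refit
print("first combination:", combos[0])
```

Each extra hyperparameter multiplies the count, so grids grow quickly; keep the lists short at first.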
11

Unsupervised models

No answers given — the model discovers structure on its own.

11.1 · K-Means

Clustering

The idea. Automatically groups similar items together, without being told the groups in advance.

Example: A shop discovers customer groups (big spenders, bargain hunters, occasional buyers) purely from behavior. Nobody labeled them first.
[Diagram: scattered points in groups, each group marked with × = cluster center]
Points snap to the nearest center; the centers shift until the groups settle.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(model.labels_)   # which group each point belongs to
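A self-contained sketch of the same call on made-up customer data, so the groups are easy to see. The two columns (visits per month, average basket value) and all the numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up customers: [visits per month, average basket value]
X = np.array([
    [1, 10], [2, 12], [1, 11],      # occasional buyers
    [10, 50], [11, 52], [10, 49],   # big spenders
    [20, 5], [21, 6], [19, 4],      # frequent bargain hunters
], dtype=float)

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = model.labels_

print("group per customer:", labels)
print("cluster centers:\n", model.cluster_centers_.round(1))

# A new customer is assigned to the nearest center
new_customer = np.array([[2.0, 11.0]])
print("new customer joins group:", model.predict(new_customer)[0])
```

The group numbers themselves are arbitrary; what matters is which customers share a number.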

11.2 · PCA

Dimensionality reduction

The idea. Compresses many columns into a few while keeping most of the information — great for speed and for plotting data in 2D.

Example: Summarizing a 300-page book into a 2-page summary that still captures the main story.
[Diagram: 2D points projected onto the main direction, the line of greatest spread]
PCA finds the directions of most variation and projects the data onto them, replacing the original columns with a few new combined ones.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)          # compress to 2 columns
X_2d = pca.fit_transform(X)
print("info kept:", pca.explained_variance_ratio_.sum())
12

Pipelines

Instead of doing scaling and the model in separate steps (and forgetting one), bundle them into a single clean object that runs in order.

[Diagram: raw data → scale → model → predict]
One object handles every step in order — like a coffee machine: grind, brew, pour, one button.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ("scaler", StandardScaler()),   # step 1
    ("model", SVC()),               # step 2
])
pipe.fit(X_tr, y_tr)                   # runs both, in order
print(pipe.score(X_te, y_te))
Bonus: Pipelines also prevent a classic mistake: scaling before splitting and accidentally leaking test info into training.
13

Full project, start to finish

Everything above, working together on a real dataset. Read it line by line.

# 1) imports
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 2) data (tumor diagnosis: benign vs malignant)
data = load_breast_cancer()
X, y = data.data, data.target

# 3) split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 4) pipeline: scaling + model
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# 5) train
pipe.fit(X_tr, y_tr)

# 6) evaluate
print("test accuracy:", pipe.score(X_te, y_te))
cv = cross_val_score(pipe, X, y, cv=5)
print("cross-val:", cv.mean().round(3), "±", cv.std().round(3))
print(classification_report(y_te, pipe.predict(X_te), target_names=data.target_names))

# 7) use it on a new case
print("prediction:", data.target_names[pipe.predict(X_te[0].reshape(1, -1))[0]])
Safety: The tumor example is for learning only. Real medical decisions are made by doctors, not models.
14

Cheat sheet & next steps

Which model should I start with?

Your task | Good first move
Predict a number | LinearRegression → then HistGradientBoostingRegressor
Predict a category | LogisticRegression → then RandomForestClassifier
Best score on a table | HistGradientBoostingClassifier
Group data (no labels) | KMeans
Too many columns | PCA

The one mental model to keep

# Master this and every estimator clicks into place
1. Prepare data  → split, scale, encode (inside a Pipeline)
2. model.fit(X_train, y_train)  → learn
3. model.predict(X_test)  → guess
4. Evaluate honestly  → cross-validation, the right metric
5. Tune & repeat  → GridSearchCV, watch for overfitting

Habits that matter

Do | Why
Run the code & change the numbers | You learn by seeing what breaks
Start simple, then go complex | A simple baseline tells you if complexity helps
Spend time on the data | Cleaning & preparing data is often the bulk of real ML work (commonly quoted as ~80%)
Always keep a held-out test set | It's your only honest measure of generalization

After this guide

1 · Build a project on real data from Kaggle.
2 · Learn pandas for data cleaning.
3 · Then move to deep learning (TensorFlow / PyTorch).