What is Machine Learning?
In traditional programming you write the rules and the computer produces the answer. In machine learning you hand the computer many examples (data plus the correct answers) and it discovers the rules by itself.
# install once
pip install scikit-learn pandas numpy matplotlib
Types of ML
Three families, split by whether your data already contains the correct answers.
| Type | Has answers? | What it does | Example |
|---|---|---|---|
| Supervised | Yes | Predicts a value or category | House price · spam filter |
| Unsupervised | No | Finds hidden groups / patterns | Customer segments |
| Reinforcement | Reward / penalty | Learns by trial and error | Game-playing · robotics |
Inside Supervised there are two jobs: Regression (the answer is a number, like a price) and Classification (the answer is a category, like spam / not spam).
How every model works: fit & predict
The single most useful thing here: every supervised scikit-learn model shares the same three methods. Learn them once and you can drive all of them.
model.fit(X_train, y_train) # 1) learn from data
model.predict(X_test) # 2) make predictions
model.score(X_test, y_test) # 3) check how good it is
X = the inputs (features that describe the thing). y = the answer you want to predict (the target).
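To see that the three calls really are interchangeable across models, here is a minimal sketch. The Iris dataset and the two models are stand-ins chosen for illustration, not part of the original text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is only a stand-in dataset: 150 flowers, 4 features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = {}
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=42)):
    model.fit(X_train, y_train)          # 1) learn from data
    preds = model.predict(X_test)        # 2) make predictions
    acc = model.score(X_test, y_test)    # 3) mean accuracy for classifiers
    results[type(model).__name__] = acc
    print(type(model).__name__, preds[:5], round(acc, 3))
```

For regressors, score returns R² instead of accuracy, but the calls look the same.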
Train / Test split
You must test the model on data it has never seen, to check that it learned general patterns rather than memorizing the training rows. So we hold out a slice of the data for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% held out for testing
random_state=42, # makes the split repeatable
stratify=y # keep class balance (classification only; drop this for regression)
)
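The snippet above assumes X and y already exist. A runnable sketch, using Iris as an assumed example dataset, shows what the split produces and what stratify does to the class counts:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 rows, 3 classes of 50 each
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)
print(np.bincount(y_train))          # [40 40 40]
print(np.bincount(y_test))           # [10 10 10]
```

Without stratify the class counts in the test slice would vary from split to split.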
Preparing data
Scaling — put every column on the same footing
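If one column is in the thousands and another in single digits, models that rely on distances or gradients (KNN, SVM, linear and logistic regression) can be fooled into treating the big-number column as more important. Scaling fixes that. Tree-based models do not need it.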
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learns mean/std from TRAIN only
X_test_scaled = scaler.transform(X_test) # just applies it
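A quick way to check what the scaler did is to look at the column means and standard deviations afterwards. The sketch below uses made-up data (an income-like column and a small-number column); the values are purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is in the tens of thousands, column 1 in single digits
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.normal(50_000, 10_000, 80), rng.normal(3, 1, 80)])
X_test = np.column_stack([rng.normal(50_000, 10_000, 20), rng.normal(3, 1, 20)])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from TRAIN only
X_test_scaled = scaler.transform(X_test)        # reuses the TRAIN mean/std

print(X_train_scaled.mean(axis=0).round(3))  # approximately 0 for both columns
print(X_train_scaled.std(axis=0).round(3))   # approximately 1 for both columns
print(X_test_scaled.mean(axis=0).round(3))   # close to 0, but not exactly
```

The test columns come out only approximately centered, which is expected: the test data was scaled with statistics it did not contribute to.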
The scaler learns the mean and standard deviation from the training data only (fit_transform), then merely applies them (transform) to the test data. Fit it on everything and you've secretly cheated.
Encoding — turn text into numbers
Models only understand numbers, so a city column of "Cairo / Giza / Alex" becomes three 0/1 columns.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown="ignore")
Supervised models
Eight workhorses. Each entry gives the idea and the code.
6.1 · Linear Regression
Regression · The idea. Draws the best-fitting straight line through your data to describe a relationship and predict numbers.
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[50],[60],[80],[100],[120],[150]]) # size m²
y = np.array([500,600,780,1000,1150,1450]) # price (k)
model = LinearRegression().fit(X, y)
print("130 m² →", model.predict([[130]])[0])
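The fitted line itself can be read off the model, and R² / RMSE tell you how closely it follows the points. This sketch reuses the six houses above. With so few rows there is no test split, so the numbers describe fit to the training data only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = np.array([[50], [60], [80], [100], [120], [150]])  # size m²
y = np.array([500, 600, 780, 1000, 1150, 1450])        # price (k)
model = LinearRegression().fit(X, y)

slope, intercept = model.coef_[0], model.intercept_
print(f"price ≈ {slope:.2f} * size + {intercept:.2f}")

pred = model.predict(X)
r2 = r2_score(y, pred)                       # 1.0 would be a perfect fit
rmse = np.sqrt(mean_squared_error(y, pred))  # typical error, in the units of y
print("R²:", round(r2, 4), "RMSE:", round(rmse, 2))
```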
6.2 · Logistic Regression
Classification · The idea. Despite the name, it's for classification. It outputs the probability of belonging to a class, using an S-shaped (sigmoid) curve.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)
print("accuracy:", model.score(X_te, y_te))
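Since the point of logistic regression is the probability, it helps to look at it directly. A small sketch with the same Iris setup as above; predict_proba returns one column per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200).fit(X_tr, y_tr)

proba = model.predict_proba(X_te[:3])  # shape (3 samples, 3 classes)
print(proba.round(3))                  # each row sums to 1
print(model.classes_[proba.argmax(axis=1)])  # most likely class per row
print(model.predict(X_te[:3]))               # same labels as the line above
```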
6.3 · K-Nearest Neighbors
Classification · The idea. "Tell me who your neighbors are, I'll tell you who you are." It looks at the closest K points and takes a majority vote.
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("KNN:", model.score(X_te, y_te)) # scale features first — KNN uses distance
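The comment above says to scale first, so here is a sketch that actually does it and also tries a few values of K. The list of K values is arbitrary. Comparing them on the test set is only to show the effect; in a real project you would pick K with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale with statistics from the training rows only
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

scores = {}
for k in (1, 3, 5, 7):  # arbitrary candidates, for illustration
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr_s, y_tr)
    scores[k] = knn.score(X_te_s, y_te)
print(scores)
```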
6.4 · Decision Tree
Regression and classification · The idea. Asks a series of yes/no questions until it reaches a decision. Easy to read and explain.
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)
# max_depth limits how many questions → stops it from memorizing
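"Easy to read" can be taken literally: scikit-learn can print the questions a fitted tree asks. A minimal sketch on Iris, using export_text:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(data.data, data.target)

# Print the learned yes/no questions as indented text
rules = export_text(model, feature_names=list(data.feature_names))
print(rules)
print("depth:", model.get_depth())  # never more than max_depth
```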
6.5 · Random Forest
Strong default · The idea. Builds many decision trees (100 in the code below) and lets them vote. One tree can be wrong; a whole forest averages out many of the individual mistakes.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
print("forest:", model.score(X_te, y_te))
print(model.feature_importances_) # which features mattered most
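The raw importance array is easier to read when paired with the column names. A small sketch follows; it fits on all of Iris purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(data.data, data.target)

# Pair each feature name with its importance and sort, largest first
ranked = sorted(
    zip(data.feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:20s} {importance:.3f}")
print("sum:", np.round(model.feature_importances_.sum(), 3))  # importances sum to 1
```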
6.6 · Support Vector Machine
Classification · The idea. Draws the boundary that separates classes with the widest possible gap (margin) between them.
from sklearn.svm import SVC
model = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr) # rbf handles curved boundaries; scale first
6.7 · Naive Bayes
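Classification · The idea. Uses probability (Bayes' rule with a naive independence assumption between features) to classify. Very fast and great for text; for word counts the usual variant is MultinomialNB, while the GaussianNB used here suits continuous features.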
from sklearn.naive_bayes import GaussianNB
model = GaussianNB().fit(X_tr, y_tr)
print("NB:", model.score(X_te, y_te))
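For the text use case, a common pairing is word counts plus MultinomialNB. This is a sketch under that assumption. The four training messages and their labels are invented, and far too few for a real filter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy messages, two per class
texts = [
    "win a free prize now",
    "free money claim now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()              # turns each message into word counts
X_counts = vec.fit_transform(texts)
model = MultinomialNB().fit(X_counts, labels)

new = ["claim your free prize", "team meeting tomorrow"]
pred = model.predict(vec.transform(new))
print(list(pred))  # ['spam', 'ham']
```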
6.8 · Gradient Boosting
Best on tables · The idea. Builds trees one after another, where each new tree corrects the mistakes of the previous ones. Often among the strongest models for tabular data.
from sklearn.ensemble import HistGradientBoostingClassifier
model = HistGradientBoostingClassifier(learning_rate=0.1, max_iter=300).fit(X_tr, y_tr)
Model comparison — when to reach for which
| Model | Best for | Needs scaling? | Easy to read? | Speed |
|---|---|---|---|---|
| Linear / Logistic | Simple, linear relationships | Yes | Yes | Fast |
| KNN | Small data, clear groups | Yes | Medium | Slow at predict |
| Decision Tree | Interpretable rules | No | Very | Fast |
| Random Forest | Strong all-round default | No | Low | Medium |
| SVM | Clear margins, mid-size data | Yes | Low | Slow on big data |
| Naive Bayes | Text / many features | No | Medium | Very fast |
| Gradient Boosting | Top accuracy on tables | No | Low | Medium |
Measuring how good your model is
Accuracy alone can lie. A cancer test that always says "healthy" is 99% accurate if only 1% are sick — yet useless. That's why we also read precision and recall from the confusion matrix.
| Metric | Question it answers | Care most when… |
|---|---|---|
| Accuracy | Overall % correct | Classes are balanced |
| Precision | Of my "yes" predictions, how many were right? | False alarms are costly (spam filter) |
| Recall | Of all real positives, how many did I catch? | Misses are costly (cancer screening) |
| F1 | Balance of precision & recall | You need one number under imbalance |
| R² / RMSE | How close are numeric predictions? | Regression problems |
from sklearn.metrics import classification_report, confusion_matrix
pred = model.predict(X_te)
print(confusion_matrix(y_te, pred)) # rows = true classes, columns = predicted classes
print(classification_report(y_te, pred)) # precision / recall / f1 per class
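The "always healthy" cancer test from the start of this section can be reproduced in a few lines. The 1,000 synthetic patients below are an assumption made to match the 1% figure.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

y_true = np.array([1] * 10 + [0] * 990)  # 1 = sick (1% of 1,000 patients)
y_pred = np.zeros_like(y_true)           # a "test" that always says healthy

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)  # no "yes" predictions at all
cm = confusion_matrix(y_true, y_pred)

print("accuracy :", acc)   # 0.99
print("recall   :", rec)   # 0.0, every sick patient was missed
print("precision:", prec)  # 0.0
print(cm)                  # [[990   0] [ 10   0]]
```

The bottom-left cell of the matrix (10) is the count of sick patients predicted healthy, which accuracy alone hides.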
Overfitting & Underfitting
The most important idea in practice. A model can be too simple (underfit), just right, or so complex it memorizes noise (overfit).
| Problem | Meaning | Sign | Fix |
|---|---|---|---|
| Underfitting | Model too simple | Bad on train and test | Stronger model, more features |
| Overfitting | Memorized the data | Great on train, bad on test | More data, simpler model, regularization |
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0) # a penalty that stops the model overreacting; bigger alpha = simpler
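The "Sign" column in the table can be observed directly by comparing train and test scores. One way to sketch it, with trees of different depth on the breast cancer dataset as an assumed example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scores = {}
for depth in (1, 3, None):  # None = grow until every training row is separated
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f} test={scores[depth][1]:.3f}")
```

A train score well above the test score is the overfitting sign; both scores being low is the underfitting sign. The unrestricted tree typically shows the largest gap.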
Cross-Validation
Instead of testing once, split the data into 5 parts and test 5 times — each part takes a turn as the test set. Average the scores for a far more trustworthy estimate.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5) # 5 rounds
print("avg:", scores.mean(), "±", scores.std())
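To make the "each part takes a turn" idea concrete, the loop below does by hand what cross_val_score does internally. It is a sketch that uses Iris and logistic regression as stand-ins.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5)  # what cv=5 means for a classifier

manual = []
for train_idx, test_idx in cv.split(X, y):
    # Train on 4 parts, test on the remaining part
    fold_model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    manual.append(fold_model.score(X[test_idx], y[test_idx]))

auto = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(np.round(manual, 3))
print(np.round(auto, 3))  # same five numbers
```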
Hyperparameter tuning
Every model has settings (like n_neighbors in KNN or max_depth in a tree). GridSearchCV tries every combination you list, scores each one with cross-validation, and keeps the best.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
params = {"n_estimators":[50,100,200], "max_depth":[3,5,10,None]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), params, cv=5)
grid.fit(X_tr, y_tr)
print("best:", grid.best_params_)
best_model = grid.best_estimator_ # ready to use
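"All combinations" multiplies quickly, so it is worth counting them before you run a search. The sketch below uses a small decision-tree grid; the parameter values are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
params = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}

n_combos = len(ParameterGrid(params))  # 4 * 3 = 12 combinations
print(n_combos, "combinations x 5 folds =", n_combos * 5, "fits, plus one final refit")

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=5).fit(X, y)
print("best:", grid.best_params_, "cv score:", round(grid.best_score_, 3))
print("tried:", len(grid.cv_results_["params"]))  # 12
```

grid.cv_results_ holds the score of every combination, not just the winner, which helps when two settings are nearly tied.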
Unsupervised models
No answers given — the model discovers structure on its own.
11.1 · K-Means
Clustering · The idea. Automatically groups similar items together, without being told the groups in advance.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(model.labels_) # which group each point belongs to
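Beyond labels_, a fitted K-Means model exposes the cluster centers and can assign new points to the nearest one. A minimal sketch on Iris follows; the new flower's measurements are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # labels are deliberately ignored
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(np.bincount(model.labels_))       # how many points landed in each group
print(model.cluster_centers_.round(2))  # one "average member" per group

new_flower = np.array([[5.0, 3.4, 1.5, 0.2]])  # made-up measurements
assigned = model.predict(new_flower)[0]        # index of the nearest center
print("assigned to cluster", assigned)
```

The cluster numbers themselves are arbitrary; only the grouping carries meaning.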
11.2 · PCA
Dimensionality reduction · The idea. Compresses many columns into a few while keeping most of the information; great for speed and for plotting data in 2D.
from sklearn.decomposition import PCA
pca = PCA(n_components=2) # compress to 2 columns
X_2d = pca.fit_transform(X)
print("info kept:", pca.explained_variance_ratio_.sum())
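PCA is sensitive to column scale, so a common pattern is to standardize first. The sketch below assumes that pattern on Iris and shows how much each new column contributes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # put the 4 columns on the same footing

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

ratio = pca.explained_variance_ratio_
print(X.shape, "->", X_2d.shape)         # (150, 4) -> (150, 2)
print("per component:", ratio.round(3))  # first component carries the most
print("info kept:", ratio.sum().round(3))
```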
Pipelines
Instead of doing scaling and the model in separate steps (and forgetting one), bundle them into a single clean object that runs in order.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
pipe = Pipeline([
("scaler", StandardScaler()), # step 1
("model", SVC()), # step 2
])
pipe.fit(X_tr, y_tr) # runs both, in order
print(pipe.score(X_te, y_te))
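A pipeline is not magic: it performs the same fit/transform calls you would write by hand. The sketch below checks that on Iris (an assumed dataset) by comparing the two approaches.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline version
pipe = Pipeline([("scaler", StandardScaler()), ("model", SVC())]).fit(X_tr, y_tr)

# Manual version: same steps, written out
scaler = StandardScaler().fit(X_tr)
svc = SVC().fit(scaler.transform(X_tr), y_tr)
manual_pred = svc.predict(scaler.transform(X_te))

same = np.array_equal(pipe.predict(X_te), manual_pred)
print("same predictions:", same)                     # True
print(pipe.named_steps["scaler"].mean_.round(2))     # steps stay accessible by name
```

The practical benefit shows up with cross-validation: passing the pipeline to cross_val_score or GridSearchCV refits the scaler inside each fold, so no test rows leak into the scaling.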
Full project, start to finish
Everything above, working together on a real dataset. Read it line by line.
# 1) imports
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# 2) data (tumor diagnosis: benign vs malignant)
data = load_breast_cancer()
X, y = data.data, data.target
# 3) split
X_tr, X_te, y_tr, y_te = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y)
# 4) pipeline: scaling + model
pipe = Pipeline([
("scaler", StandardScaler()), # trees don't need scaling; kept so you can swap in another model
("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
# 5) train
pipe.fit(X_tr, y_tr)
# 6) evaluate
print("test accuracy:", pipe.score(X_te, y_te))
cv = cross_val_score(pipe, X, y, cv=5)
print("cross-val:", cv.mean().round(3), "±", cv.std().round(3))
print(classification_report(y_te, pipe.predict(X_te), target_names=data.target_names))
# 7) use it on a single case (a test row stands in for a new patient)
print("prediction:", data.target_names[pipe.predict(X_te[0].reshape(1, -1))[0]])
Cheat sheet & next steps
Which model should I start with?
| Your task | Good first move |
|---|---|
| Predict a number | LinearRegression → then HistGradientBoostingRegressor |
| Predict a category | LogisticRegression → then RandomForestClassifier |
| Best score on a table | HistGradientBoostingClassifier |
| Group data (no labels) | KMeans |
| Too many columns | PCA |
The one mental model to keep
1. Prepare data → split, scale, encode (inside a Pipeline)
2. model.fit(X_train, y_train) → learn
3. model.predict(X_test) → guess
4. Evaluate honestly → cross-validation, the right metric
5. Tune & repeat → GridSearchCV, watch for overfitting
Habits that matter
| Do | Why |
|---|---|
| Run the code & change the numbers | You learn by seeing what breaks |
| Start simple, then go complex | A simple baseline tells you if complexity helps |
| Spend time on the data | Cleaning & preparing data usually takes most of the effort in real ML |
| Always keep a held-out test set | It's your only honest measure of generalization |
After this guide
1. Build a project on real data from Kaggle.
2. Learn pandas for data cleaning.
3. Then move to deep learning (TensorFlow / PyTorch).