Machine Learning Model Selection Guide

Choosing the right model is a separate skill from knowing how to code it. This guide focuses on the when — which algorithm fits your problem, your data, and your constraints — and the mistakes that come from choosing wrong.

For the syntax and scikit-learn code, see Machine Learning Notebook. All examples use the parking system dataset.

Task Selection

Before choosing a model, identify what type of problem you are solving.

| Your question | Task | Example |
| --- | --- | --- |
| Predict a number | Regression | Predict parking amount from duration and station |
| Predict a category | Classification | Predict whether payment will be by credit card |
| Find natural groups in unlabelled data | Clustering | Group stations by usage patterns |

If you have a labelled target column, you are doing supervised learning (regression or classification). If you do not, you are doing unsupervised learning (clustering).

Model Selection Table

Use this to find a starting point before reading the full section.

| Situation | Start with |
| --- | --- |
| First model on any problem | 1. Linear / Logistic Regression |
| Need to understand why the model decides | 2. Decision Tree |
| Best out-of-the-box performance on tabular data | 3. Random Forest |
| Best performance, willing to tune | 4. Gradient Boosting (XGBoost) |
| Small dataset, simple distance-based problem | 5. KNN |

Always build a baseline first (see Common Mistakes). A model that barely beats "always predict the mean" is not useful.


1. Linear / Logistic Regression — The Baseline

Use for:
- Linear Regression: predict a numeric target when the relationship with features is roughly linear.
- Logistic Regression: classify into two (or more) categories when you need probabilities, not just labels.

Use when:
- You want a fast, interpretable starting point.
- The number of features is large relative to the number of rows — simpler models generalise better.
- You need to explain the model to non-technical stakeholders (coefficients map directly to feature effects).

Avoid when:
- The relationship between features and target is non-linear (e.g., revenue spikes on weekends but not other days).
- Features interact with each other in complex ways.

Reading the output

# coefficients show the effect of each feature
coef_df = pd.DataFrame({'feature': X.columns, 'coef': model.coef_}).sort_values('coef')

A positive coefficient means increasing that feature raises the predicted value. A large absolute coefficient means a strong effect — but only if features are on the same scale. Always scale features before comparing coefficients.

For Logistic Regression, predict_proba() returns the probability of each class. A threshold of 0.5 is the default, but you can lower it (e.g., 0.3) to catch more positives at the cost of more false alarms.
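The threshold adjustment can be sketched like this. The synthetic dataset stands in for the parking data; all variable names here are placeholders:

```python
# Sketch: classify with a custom probability threshold instead of the 0.5 default.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# column 1 = probability of the positive class
proba = model.predict_proba(X_test)[:, 1]

# default behaviour: threshold 0.5
default_preds = (proba >= 0.5).astype(int)

# lower threshold: catches more positives, at the cost of more false alarms
eager_preds = (proba >= 0.3).astype(int)
print(f"positives at 0.5: {default_preds.sum()}, at 0.3: {eager_preds.sum()}")
```

Lowering the threshold can only add positive predictions, never remove them, which is why recall rises and precision tends to fall.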


2. Decision Tree — Maximum Interpretability

Use when:
- You need to show stakeholders exactly how the model makes decisions.
- The data has clear if/else splits (e.g., parking type = monthly → high amount).
- You are exploring which features matter before building a more complex model.

Avoid when:
- You need strong predictive performance — a single tree overfits easily and rarely wins against Random Forest.
- The data has many continuous features with subtle interactions.

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

plt.figure(figsize=(20, 8))
plot_tree(model, feature_names=X.columns, class_names=['Other', 'Credit Card'],
          filled=True, rounded=True)
plt.show()

Controlling overfitting

| Parameter | Effect |
| --- | --- |
| max_depth | Limits tree depth — most important lever |
| min_samples_leaf | Minimum rows per leaf — prevents tiny, overfit splits |
| min_samples_split | Minimum rows to attempt a split |

Start with max_depth=4 or max_depth=5. A tree deeper than 6 is almost always overfitting. Check: if training accuracy >> test accuracy, reduce depth.
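The depth check can be sketched as a quick sweep. Synthetic data is used as a stand-in here; with an unlimited depth the tree memorises the training set and the train/test gap widens:

```python
# Sketch: watch the train/test gap grow as max_depth increases.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in [2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print(f"max_depth={depth}: train={train_acc:.2f} "
          f"test={test_acc:.2f} gap={train_acc - test_acc:.2f}")
```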


3. Random Forest — The Reliable Default

Use when:
- You want strong performance without extensive tuning.
- You have a mix of numeric and categorical features.
- You want built-in feature importance.

Avoid when:
- You need a fully interpretable model — Random Forest is a black box relative to a single tree.
- Training time or memory is a hard constraint — 100+ trees on a large dataset can be slow.
- You need probabilities to be well-calibrated (use Logistic Regression instead).

Key parameters

| Parameter | What it controls | Typical range |
| --- | --- | --- |
| n_estimators | Number of trees | 100–500 |
| max_depth | Maximum depth per tree | 5–20, or None |
| max_features | Features considered at each split | 'sqrt' (default for classification) |
| min_samples_leaf | Minimum rows per leaf | 1–10 |

More trees help up to a point, after which returns diminish. Start with 100 and increase only if cross-validation scores are still improving.
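That check can be sketched as a small sweep, on synthetic data standing in for the parking set:

```python
# Sketch: increase n_estimators only while cross-validation keeps improving.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=42)

cv_means = {}
for n in [50, 100, 200]:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    cv_means[n] = cross_val_score(model, X, y, cv=5).mean()
    print(f"n_estimators={n}: CV accuracy {cv_means[n]:.3f}")
```

If the score flattens between two settings, keep the smaller forest: it trains and predicts faster for the same quality.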

Feature importance

importance_df = pd.DataFrame({
    'feature':    X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

Feature importance tells you which features the model relies on most. It does not tell you the direction of the effect (use coefficients for that, or SHAP values).


4. Gradient Boosting (XGBoost) — Best Performance

Gradient boosting builds trees sequentially — each tree corrects the errors of the previous one. It typically outperforms Random Forest on structured tabular data.
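To make "each tree corrects the errors of the previous one" concrete, here is a hand-rolled boosting loop for squared error on a toy regression problem. This is a minimal sketch of the idea, not the actual XGBoost implementation:

```python
# Sketch: each new tree is fit on the residuals of the ensemble so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.5
pred = np.full_like(y, y.mean())             # start from the mean prediction
for _ in range(20):
    residuals = y - pred                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * tree.predict(X)  # each tree nudges the prediction

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - pred) ** 2)
print(f"training MSE: {mse_start:.3f} -> {mse_end:.3f}")
```

Each iteration shrinks the training error; the learning_rate scales how much each tree is allowed to correct, which is the same knob as in XGBoost.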

Use when:
- You need the best possible predictive performance.
- You are willing to tune hyperparameters.
- Training time is acceptable (slower than Random Forest per tree, but often needs fewer trees).

Avoid when:
- Your dataset is small (< 1,000 rows) — it will overfit.
- You need a quick first model — use Random Forest first, switch to XGBoost if you need the extra performance.

from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      random_state=42, eval_metric='logloss')
model.fit(X_train, y_train)

Key parameters

| Parameter | What it controls |
| --- | --- |
| n_estimators | Number of trees |
| learning_rate | Step size per tree — lower = more trees needed, often better generalisation |
| max_depth | Tree depth (3–6 is typical, shallower than Random Forest) |
| subsample | Fraction of rows per tree (0.8 = 80%) — reduces overfitting |
| colsample_bytree | Fraction of features per tree (0.8 = 80%) |

Lower learning_rate + more n_estimators almost always improves results. A common starting point: learning_rate=0.05, n_estimators=500.
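The tradeoff can be sketched as a side-by-side comparison. This uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (same idea, no extra dependency), with synthetic data as a placeholder for the parking set:

```python
# Sketch: fewer trees with a large step vs. many trees with a small step.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=42)

fast = GradientBoostingClassifier(n_estimators=100, learning_rate=0.3,
                                  max_depth=3, random_state=42)
slow = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05,
                                  max_depth=3, random_state=42)

for name, model in [("lr=0.3, 100 trees", fast), ("lr=0.05, 500 trees", slow)]:
    score = cross_val_score(model, X, y, cv=3).mean()
    print(f"{name}: CV accuracy {score:.3f}")
```

On real data the slow configuration usually wins by a small margin; whether the extra training time is worth it depends on the project.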


5. KNN — Simple and Assumption-Free

K-Nearest Neighbours classifies a point by the majority label of its k nearest neighbours.

Use when:
- The dataset is small (< 10,000 rows).
- The decision boundary is genuinely based on local similarity.
- You want a non-parametric model with essentially no training phase (fitting just stores the data).

Avoid when:
- The dataset is large — prediction time scales with training set size.
- You have many features — distance becomes meaningless in high dimensions (curse of dimensionality).
- Features have very different scales — always scale before using KNN.

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)   # must scale features first

Tune n_neighbors with cross-validation. Odd values avoid ties in binary classification.
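The tuning step can be sketched with GridSearchCV, on synthetic, pre-scaled data (the candidate values are illustrative, not a recommendation for the parking set):

```python
# Sketch: pick n_neighbors by cross-validation over odd candidate values.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=42)
X_scaled = StandardScaler().fit_transform(X)   # KNN needs scaled features

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': [3, 5, 7, 9, 11]},
                      cv=5)
search.fit(X_scaled, y)
print(f"best k: {search.best_params_['n_neighbors']}, "
      f"CV accuracy: {search.best_score_:.3f}")
```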


Regression vs Classification: Choosing Evaluation Metrics

Regression Metrics

| Metric | Formula | Use when |
| --- | --- | --- |
| MAE | mean(\|y − ŷ\|) | Outliers should not dominate — reports average error in original units |
| RMSE | √mean((y − ŷ)²) | Large errors are especially bad — penalises them more |
| R² | 1 − SS_res/SS_tot | Explaining model quality to stakeholders (0–1 scale) |

R² of 0.85 means the model explains 85% of the variance in the target. An R² below 0.5 is weak for most business use cases.
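The three metrics can be computed side by side; the y_true/y_pred values below are made-up numbers, not parking data:

```python
# Sketch: MAE, RMSE and R² on a tiny hand-made example.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 12.0, 8.0, 15.0, 11.0])
y_pred = np.array([11.0, 11.5, 9.0, 13.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # RMSE = sqrt of MSE
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
```

Note that RMSE ≥ MAE always holds; a large gap between them signals that a few large errors dominate.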

Classification Metrics

| Metric | Use when |
| --- | --- |
| Accuracy | Classes are balanced and all errors cost the same |
| Precision | False positives are costly (spam filter, irrelevant recommendations) |
| Recall | False negatives are costly (fraud detection, medical diagnosis) |
| F1 | Imbalanced classes, both precision and recall matter |
| ROC-AUC | Comparing models regardless of threshold; probability ranking quality |

Accuracy is misleading on imbalanced data. If 95% of sessions are non-credit-card, a model that always predicts "not credit card" scores 95% accuracy but is useless. Use F1 or ROC-AUC instead.
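The trap is easy to demonstrate with a 95/5 synthetic split:

```python
# Sketch: "always predict the majority class" looks great on accuracy,
# but scores zero F1 for the rare class.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)     # 95% class 0, 5% class 1
y_pred = np.zeros(100, dtype=int)         # always predict class 0

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"Accuracy: {acc:.2f}")             # 0.95, looks great
print(f"F1:       {f1:.2f}")              # 0.00, useless for the rare class
```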

Reading a Classification Report

              precision    recall  f1-score   support
       Other       0.94      0.97      0.95      1820
 Credit Card       0.81      0.70      0.75       380
    accuracy                           0.92      2200

- High precision, low recall for "Credit Card": the model is conservative — when it predicts credit card, it's usually right, but it misses many actual credit card payments.
- Lowering the classification threshold (predict_proba > 0.35 instead of > 0.5) will raise recall and lower precision.

Overfitting vs Underfitting

| Symptom | Diagnosis | Fix |
| --- | --- | --- |
| Train score high, test score low | Overfitting — model memorised training data | Reduce max_depth, increase min_samples_leaf, add more data |
| Both scores low | Underfitting — model too simple | Add features, increase model complexity, try a different algorithm |
| Train ≈ test, both acceptable | Good fit | Done |

# always compare train vs test score
print(f"Train: {model.score(X_train, y_train):.3f}")
print(f"Test:  {model.score(X_test,  y_test):.3f}")

A gap of more than 0.10 between train and test is a warning sign of overfitting.


Common Mistakes

1. Skipping the Baseline

Any model must beat a naive baseline to be worth using.

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
dummy = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(dummy, X, y, cv=5, scoring='f1')
print(f"Baseline F1: {scores.mean():.3f}")   # your model must beat this

2. Evaluating on Training Data

# WRONG: inflated score — the model has seen this data
model.score(X_train, y_train)

# CORRECT: held-out test set, touched once at the very end
model.score(X_test, y_test)

3. Fitting the Scaler on the Full Dataset

# WRONG: test data leaks into the scaler's mean and std
scaler.fit(X)
X_scaled = scaler.transform(X)

# CORRECT: fit only on training data
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)

Use a Pipeline to prevent this mistake entirely — it enforces the correct fit/transform order automatically.
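A sketch of the Pipeline approach, with synthetic data as a stand-in: scaling is refitted inside each CV training fold, so the test fold never leaks into the scaler.

```python
# Sketch: Pipeline keeps the scaler's fit restricted to each training fold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),           # fit on each training fold only
    ('model', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)  # no manual fit/transform needed
print(f"CV accuracy: {scores.mean():.3f}")
```

The same pipe object also works with GridSearchCV, so tuned hyperparameters are evaluated without leakage too.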

4. Using Accuracy on Imbalanced Classes

from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# MISLEADING: 95% accuracy when 95% of samples are class 0
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

# BETTER: F1 or ROC-AUC on imbalanced data
print(f"F1: {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_prob):.3f}")

5. Jumping to a Complex Model

Random Forest and XGBoost are not always better than Logistic Regression. A simple model that is 2% worse but fully explainable is often the right business choice. Always build the simple model first.

6. Not Cross-Validating

A single train/test split can be lucky or unlucky. Cross-validation gives a more reliable estimate.

# one split: unreliable
model.fit(X_train, y_train)
model.score(X_test, y_test)

# five folds: more reliable
cross_val_score(model, X, y, cv=5, scoring='f1').mean()

Model Selection Workflow

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd

models = {
    'Baseline':            DummyClassifier(strategy='most_frequent'),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree':       DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_encoded, y, cv=5, scoring='f1')
    results[name] = {'mean_f1': scores.mean().round(3), 'std': scores.std().round(3)}

print(pd.DataFrame(results).T.sort_values('mean_f1', ascending=False))

Decision logic after comparing results:
- If Logistic Regression is close to Random Forest → stick with Logistic Regression (simpler, explainable).
- If Random Forest clearly wins → use it, then consider XGBoost if you need more.
- If all models barely beat the baseline → the problem is in the features, not the algorithm. Add better features before changing models.