Machine Learning Model Selection Guide
Choosing the right model is a separate skill from knowing how to code it. This guide focuses on the when — which algorithm fits your problem, your data, and your constraints — and the mistakes that come from choosing wrong.
For the syntax and scikit-learn code, see Machine Learning Notebook. All examples use the parking system dataset.
Task Selection
Before choosing a model, identify what type of problem you are solving.
| Your question | Task | Example |
|---|---|---|
| Predict a number | Regression | Predict parking amount from duration and station |
| Predict a category | Classification | Predict whether payment will be by credit card |
| Find natural groups in unlabelled data | Clustering | Group stations by usage patterns |
If you have a labelled target column, you are doing supervised learning (regression or classification). If you do not, you are doing unsupervised learning (clustering).
Model Selection Table
Use this to find a starting point before reading the full section.
| Situation | Start with |
|---|---|
| First model on any problem | 1. Linear / Logistic Regression |
| Need to understand why the model decides | 2. Decision Tree |
| Best out-of-the-box performance on tabular data | 3. Random Forest |
| Best performance, willing to tune | 4. Gradient Boosting (XGBoost) |
| Small dataset, simple distance-based problem | 5. KNN |
Always build a baseline first (see Common Mistakes). A model that barely beats "always predict the mean" is not useful.
1. Linear / Logistic Regression — The Baseline
Use for:
- Linear Regression: predict a numeric target when the relationship with features is roughly linear.
- Logistic Regression: classify into two (or more) categories when you need probabilities, not just labels.

Use when:
- You want a fast, interpretable starting point.
- The number of features is large relative to the number of rows — simpler models generalise better.
- You need to explain the model to non-technical stakeholders (coefficients map directly to feature effects).

Avoid when:
- The relationship between features and target is non-linear (e.g., revenue spikes on weekends but not other days).
- Features interact with each other in complex ways.
Reading the output
```python
# coefficients show the effect of each feature
coef_df = pd.DataFrame({'feature': X.columns, 'coef': model.coef_}).sort_values('coef')
```
A positive coefficient means increasing that feature raises the predicted value. A large absolute coefficient means a strong effect — but only if features are on the same scale. Always scale features before comparing coefficients.
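As a sketch of the scaling point: fit on standardised features so coefficient magnitudes are comparable per standard deviation. The feature names (`duration_min`, `is_weekend`) and the synthetic data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = pd.DataFrame({
    'duration_min': rng.uniform(10, 600, 500),  # large-scale feature
    'is_weekend': rng.integers(0, 2, 500),      # 0/1 feature
})
y = 0.5 * X['duration_min'] + 20 * X['is_weekend'] + rng.normal(0, 5, 500)

# standardise first, so coefficients measure effect per standard deviation
model = LinearRegression().fit(StandardScaler().fit_transform(X), y)
coef_df = pd.DataFrame({'feature': X.columns, 'coef': model.coef_}).sort_values('coef')
print(coef_df)
```

Without scaling, `duration_min` would get a tiny raw coefficient purely because its values are large, making the two coefficients incomparable.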
For Logistic Regression, predict_proba() returns the probability of each class. A threshold of 0.5 is the default, but you can lower it (e.g., 0.3) to catch more positives at the cost of more false alarms.
2. Decision Tree — Maximum Interpretability
Use when:
- You need to show stakeholders exactly how the model makes decisions.
- The data has clear if/else splits (e.g., parking type = monthly → high amount).
- You are exploring which features matter before building a more complex model.

Avoid when:
- You need strong predictive performance — a single tree overfits easily and rarely wins against Random Forest.
- The data has many continuous features with subtle interactions.
```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

plt.figure(figsize=(20, 8))
plot_tree(model, feature_names=X.columns, class_names=['Other', 'Credit Card'],
          filled=True, rounded=True)
plt.show()
```
Controlling overfitting
| Parameter | Effect |
|---|---|
| `max_depth` | Limits tree depth — most important lever |
| `min_samples_leaf` | Minimum rows per leaf — prevents tiny, overfit splits |
| `min_samples_split` | Minimum rows to attempt a split |
Start with max_depth=4 or max_depth=5. A tree deeper than 6 is almost always overfitting. Check: if training accuracy >> test accuracy, reduce depth.
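One way to apply that check is to sweep `max_depth` and watch the train/test gap grow. A sketch on synthetic data — the dataset and depth values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in [2, 4, 6, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    # a large gap means the tree memorised the training data
    gap = tree.score(X_train, y_train) - tree.score(X_test, y_test)
    print(f"max_depth={depth}: train-test gap = {gap:.3f}")
```

The unlimited-depth tree reaches perfect training accuracy, so its gap is whatever the test set loses — typically the largest in the sweep.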
3. Random Forest — The Reliable Default
Use when:
- You want strong performance without extensive tuning.
- You have a mix of numeric and categorical features.
- You want built-in feature importance.

Avoid when:
- You need a fully interpretable model — Random Forest is a black box relative to a single tree.
- Training time or memory is a hard constraint — 100+ trees on a large dataset can be slow.
- You need probabilities to be well-calibrated (use Logistic Regression instead).
Key parameters
| Parameter | What it controls | Typical range |
|---|---|---|
| `n_estimators` | Number of trees | 100–500 |
| `max_depth` | Maximum depth per tree | 5–20, or None |
| `max_features` | Features considered at each split | 'sqrt' (default for classification) |
| `min_samples_leaf` | Minimum rows per leaf | 1–10 |
More trees help up to a point, then returns diminish. Start with 100 and increase only if cross-validation scores are still improving.
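A sketch of that stopping rule on synthetic data — increase `n_estimators` only while the cross-validation score keeps moving (the tree counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

scores = {}
for n in [50, 100, 300]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
    scores[n] = cross_val_score(rf, X, y, cv=5).mean()
    print(f"n_estimators={n}: CV accuracy = {scores[n]:.3f}")

# if the jump from 100 to 300 is within noise, stay at 100
```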
Feature importance
```python
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
```
Feature importance tells you which features the model relies on most. It does not tell you the direction of the effect (use coefficients for that, or SHAP values).
4. Gradient Boosting (XGBoost) — Best Performance
Gradient boosting builds trees sequentially — each tree corrects the errors of the previous one. It typically outperforms Random Forest on structured tabular data.
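The "each tree corrects the errors of the previous one" idea can be sketched by hand for regression. This is a toy illustration of the boosting loop — fit each tree to the residuals of the running prediction, then add its correction scaled by a learning rate — not the actual XGBoost implementation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # start from the mean, like a baseline
errors = []

for _ in range(50):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X, residuals)                         # the next tree learns the errors
    prediction += learning_rate * tree.predict(X)  # add its correction, scaled down
    errors.append(np.mean((y - prediction) ** 2))

print(f"training MSE after 1 tree:   {errors[0]:.1f}")
print(f"training MSE after 50 trees: {errors[-1]:.1f}")  # falls as trees accumulate
```

The learning rate keeps any single tree from dominating, which is why a lower rate usually needs more trees.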
Use when:
- You need the best possible predictive performance.
- You are willing to tune hyperparameters.
- Training time is acceptable (slower than Random Forest per tree, but often needs fewer trees).

Avoid when:
- Your dataset is small (< 1,000 rows) — it will overfit.
- You need a quick first model — use Random Forest first, switch to XGBoost if you need the extra performance.
```python
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      random_state=42, eval_metric='logloss')
model.fit(X_train, y_train)
```
Key parameters
| Parameter | What it controls |
|---|---|
| `n_estimators` | Number of trees |
| `learning_rate` | Step size per tree — lower = more trees needed, often better generalisation |
| `max_depth` | Tree depth (3–6 is typical, shallower than Random Forest) |
| `subsample` | Fraction of rows per tree (0.8 = 80%) — reduces overfitting |
| `colsample_bytree` | Fraction of features per tree (0.8 = 80%) |
A lower `learning_rate` combined with more `n_estimators` almost always improves results. A common starting point: `learning_rate=0.05`, `n_estimators=500`.
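If XGBoost is not installed, the same learning-rate/tree-count tradeoff can be sketched with scikit-learn's `GradientBoostingClassifier` — the two configurations below are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

results = {}
for lr, n in [(0.1, 100), (0.05, 500)]:  # fewer big steps vs many small steps
    gb = GradientBoostingClassifier(learning_rate=lr, n_estimators=n,
                                    max_depth=3, random_state=42)
    results[(lr, n)] = cross_val_score(gb, X, y, cv=3).mean()
    print(f"learning_rate={lr}, n_estimators={n}: "
          f"CV accuracy = {results[(lr, n)]:.3f}")
```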
5. KNN — Simple and Assumption-Free
K-Nearest Neighbours classifies a point by the majority label of its k nearest neighbours.
Use when:
- The dataset is small (< 10,000 rows).
- The decision boundary is genuinely based on local similarity.
- You want a non-parametric model with no training phase.

Avoid when:
- The dataset is large — prediction time scales with training set size.
- You have many features — distance becomes meaningless in high dimensions (curse of dimensionality).
- Features have very different scales — always scale before using KNN.
```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)  # must scale features first
```
Tune n_neighbors with cross-validation. Odd values avoid ties in binary classification.
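A sketch of that tuning loop on synthetic data, with scaling handled inside a pipeline so each CV fold is scaled correctly (the candidate k values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

results = {}
for k in [1, 3, 5, 7, 9, 15]:  # odd values avoid ties in binary classification
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[k] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"k={k}: CV accuracy = {results[k]:.3f}")

best_k = max(results, key=results.get)
print("best k:", best_k)
```

Very small k tends to overfit to local noise; very large k oversmooths the boundary — the sweep finds the middle ground.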
Regression vs Classification: Choosing Evaluation Metrics
Regression Metrics
| Metric | Formula | Use when |
|---|---|---|
| MAE | mean(\|y - ŷ\|) | Outliers should not dominate — reports average error in original units |
| RMSE | √mean((y - ŷ)²) | Large errors are especially bad — penalises them more |
| R² | 1 - SS_res/SS_tot | Explaining model quality to stakeholders (0–1 scale) |
R² of 0.85 means the model explains 85% of the variance in the target. An R² below 0.5 is weak for most business use cases.
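A minimal sketch computing all three metrics with `sklearn.metrics` — the target and prediction values are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 12.0, 8.0, 15.0, 11.0])  # toy actual values
y_pred = np.array([11.0, 11.5, 9.0, 13.0, 11.0])  # toy predictions

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # version-safe RMSE
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.2f}")   # 0.90 — average error in original units
print(f"RMSE: {rmse:.2f}")  # 1.12 — the one large error (2.0) pulls it above MAE
print(f"R²:   {r2:.2f}")    # 0.77 — share of variance explained
```

RMSE is always at least as large as MAE; the gap between them grows with the spread of the individual errors.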
Classification Metrics
| Metric | Use when |
|---|---|
| Accuracy | Classes are balanced and all errors cost the same |
| Precision | False positives are costly (spam filter, irrelevant recommendations) |
| Recall | False negatives are costly (fraud detection, medical diagnosis) |
| F1 | Imbalanced classes, both precision and recall matter |
| ROC-AUC | Comparing models regardless of threshold; probability ranking quality |
Accuracy is misleading on imbalanced data. If 95% of sessions are non-credit-card, a model that always predicts "not credit card" scores 95% accuracy but is useless. Use F1 or ROC-AUC instead.
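The pitfall is easy to demonstrate with toy labels matching that 95/5 split:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 95% of sessions are class 0 ("not credit card")
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # a "model" that always predicts class 0

print("Accuracy:", accuracy_score(y_true, y_pred))       # 0.95 — looks strong
print("F1:", f1_score(y_true, y_pred, zero_division=0))  # 0.0 — model is useless
```

F1 collapses to zero because the model never identifies a single positive, which is exactly the failure accuracy hides.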
Reading a Classification Report
```
              precision    recall  f1-score   support

       Other       0.94      0.97      0.95      1820
 Credit Card       0.81      0.70      0.75       380

    accuracy                           0.92      2200
```
- High precision, low recall for "Credit Card": the model is conservative — when it predicts credit card, it's usually right, but it misses many actual credit card payments.
- Lowering the classification threshold (`predict_proba > 0.35` instead of `> 0.5`) will raise recall and lower precision.
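A sketch of that tradeoff on a synthetic imbalanced dataset — the 0.35 threshold and the 80/20 class weights are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                    stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

recalls = {}
for threshold in [0.5, 0.35]:
    pred = (proba > threshold).astype(int)
    recalls[threshold] = recall_score(y_test, pred)
    print(f"threshold={threshold}: precision={precision_score(y_test, pred):.2f}, "
          f"recall={recalls[threshold]:.2f}")
```

Lowering the threshold can only turn negatives into positives, so recall never drops — the cost shows up in precision.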
Overfitting vs Underfitting
| Symptom | Diagnosis | Fix |
|---|---|---|
| Train score high, test score low | Overfitting — model memorised training data | Reduce max_depth, increase min_samples_leaf, add more data |
| Both scores low | Underfitting — model too simple | Add features, increase model complexity, try a different algorithm |
| Train ≈ test, both acceptable | Good fit | Done |
```python
# always compare train vs test score
print(f"Train: {model.score(X_train, y_train):.3f}")
print(f"Test:  {model.score(X_test, y_test):.3f}")
```
A gap of more than 0.10 between train and test is a warning sign of overfitting.
Common Mistakes
1. Skipping the Baseline
Any model must beat a naive baseline to be worth using.
```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

dummy = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(dummy, X, y, cv=5, scoring='f1')
print(f"Baseline F1: {scores.mean():.3f}")  # your model must beat this
```
2. Evaluating on Training Data
```python
# WRONG: inflated score — the model has seen this data
model.score(X_train, y_train)

# CORRECT: held-out test set, touched once at the very end
model.score(X_test, y_test)
```
3. Fitting the Scaler on the Full Dataset
```python
# WRONG: test data leaks into the scaler's mean and std
scaler.fit(X)
X_scaled = scaler.transform(X)

# CORRECT: fit only on training data
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
Use a Pipeline to prevent this mistake entirely — it enforces the correct fit/transform order automatically.
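A minimal sketch of the Pipeline approach on synthetic data — inside cross-validation, the scaler is re-fit on the training portion of every fold automatically, so the test fold never leaks into it:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# scaler and model travel together; fit/transform order is enforced
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```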
4. Using Accuracy on Imbalanced Classes
```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# MISLEADING: 95% accuracy when 95% of samples are class 0
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

# BETTER: F1 or ROC-AUC on imbalanced data
print(f"F1: {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_prob):.3f}")
```
5. Jumping to a Complex Model
Random Forest and XGBoost are not always better than Logistic Regression. A simple model that is 2% worse but fully explainable is often the right business choice. Always build the simple model first.
6. Not Cross-Validating
A single train/test split can be lucky or unlucky. Cross-validation gives a more reliable estimate.
```python
from sklearn.model_selection import cross_val_score

# one split: unreliable
model.fit(X_train, y_train)
model.score(X_test, y_test)

# five folds: more reliable
cross_val_score(model, X, y, cv=5, scoring='f1').mean()
```
Model Selection Workflow
```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd

models = {
    'Baseline': DummyClassifier(strategy='most_frequent'),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_encoded, y, cv=5, scoring='f1')
    results[name] = {'mean_f1': scores.mean().round(3), 'std': scores.std().round(3)}

print(pd.DataFrame(results).T.sort_values('mean_f1', ascending=False))
```
Decision logic after comparing results:
- If Logistic Regression is close to Random Forest → stick with Logistic Regression (simpler, explainable).
- If Random Forest clearly wins → use it, then consider XGBoost if you need more.
- If all models barely beat the baseline → the problem is in the features, not the algorithm. Add better features before changing models.