Machine Learning Model Selection Guide

Picking the right model is a different skill from knowing how to code one. This guide is about the when — which algorithm fits your problem, your data, and your limits — and the mistakes you make when you pick wrong.

For the syntax and scikit-learn code, see Machine Learning Notebook. All examples use the parking system dataset.

Task Selection

Before you choose a model, decide what kind of problem you have.

Your question                          Task            Example
Predict a number                       Regression      Predict parking amount from duration and station
Predict a category                     Classification  Predict whether the payment will be by credit card
Find natural groups in unlabeled data  Clustering      Group stations by usage patterns

If you have a target column to learn from, you are doing supervised learning (regression or classification). If you do not, you are doing unsupervised learning (clustering).

Model Selection Table

Use this to find a starting point before you read the full section.

Situation                                     Start with
First model on any problem                    1. Linear / Logistic Regression
You need to know why the model decides        2. Decision Tree
Best out-of-the-box result on tabular data    3. Random Forest
Best result, willing to tune                  4. Gradient Boosting (XGBoost)
Small dataset, simple distance-based problem  5. KNN

Always build a baseline first (see Common Mistakes). A model that barely beats "always predict the average" is not useful.
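
For a regression task, "always predict the average" can be written down directly. The sketch below is a minimal illustration, not the notebook's code: the data is synthetic, and the names duration_min and amount are assumptions standing in for the parking dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the parking data (assumption): amount grows with duration
rng = np.random.default_rng(42)
duration_min = rng.uniform(10, 300, size=500)
amount = 0.05 * duration_min + rng.normal(0, 1.0, size=500)
X = duration_min.reshape(-1, 1)


def mean_mae(estimator):
    """Average MAE over 5 folds (scikit-learn reports it negated, so flip the sign)."""
    scores = cross_val_score(estimator, X, amount, cv=5,
                             scoring='neg_mean_absolute_error')
    return -scores.mean()


baseline_mae = mean_mae(DummyRegressor(strategy='mean'))  # always predicts the training mean
linear_mae = mean_mae(LinearRegression())

print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Linear MAE:   {linear_mae:.2f}")
```

If the two numbers are close, the features carry little signal and a more complex model is unlikely to rescue the result.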


1. Linear / Logistic Regression — The Baseline

Use for:
  • Linear Regression: predict a number when the link to features is roughly a straight line.
  • Logistic Regression: predict two (or more) categories when you need probabilities, not just labels.

Use when:
  • You want a fast, easy-to-explain starting point.
  • You have many features but few rows; simple models work better in this case.
  • You need to explain the model to non-technical people (each coefficient maps directly to one feature's effect).

Avoid when:
  • The link between features and target is not a straight line (for example, revenue jumps on weekends but not on other days).
  • Features interact with each other in complex ways.

Reading the output

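import pandas as pd
# coefficients show the effect of each feature
# .ravel() flattens the 2-D coef_ of a binary LogisticRegression; LinearRegression is unchanged
coef_df = pd.DataFrame({'feature': X.columns, 'coef': model.coef_.ravel()}).sort_values('coef')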

A positive coefficient means the predicted value goes up when that feature goes up. A large coefficient means a strong effect — but only if all features are on the same scale. Always scale features before you compare coefficients.
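
A small sketch of why scaling matters when you compare coefficients. The data is synthetic and the feature names duration_min and is_weekend are assumptions made for illustration. The scaler is fit on all rows here only because the goal is to read coefficients, not to evaluate the model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'duration_min': rng.uniform(10, 300, size=400),  # values in the tens to hundreds
    'is_weekend': rng.integers(0, 2, size=400),      # values 0 or 1
})
y = 0.05 * X['duration_min'] + 1.0 * X['is_weekend'] + rng.normal(0, 0.2, size=400)

# Coefficients on the raw features: units differ, so sizes are not comparable
raw_coef = LinearRegression().fit(X, y).coef_

# Coefficients after standardizing: each one is the effect of a one-standard-deviation change
X_scaled = StandardScaler().fit_transform(X)
scaled_coef = LinearRegression().fit(X_scaled, y).coef_

coef_df = pd.DataFrame({'feature': X.columns, 'raw': raw_coef, 'scaled': scaled_coef})
print(coef_df)
```

On the raw features is_weekend has the larger coefficient, but after scaling duration_min turns out to move the prediction more across its typical range.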

For Logistic Regression, predict_proba() returns the probability of each class. The default cut-off is 0.5, but you can lower it (for example, 0.3) to catch more positives at the cost of more false alarms.
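
The cut-off trade-off can be checked with a few lines. This is a sketch on synthetic, imbalanced data (an assumption, not the parking dataset); the cut-offs 0.5 and 0.3 are the ones mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 20% positives
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.8, 0.2],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_pos = model.predict_proba(X_test)[:, 1]  # column 1 holds the probability of class 1

results = {}
for cutoff in (0.5, 0.3):
    y_pred = (proba_pos >= cutoff).astype(int)  # apply the cut-off by hand
    results[cutoff] = {
        'precision': precision_score(y_test, y_pred, zero_division=0),
        'recall': recall_score(y_test, y_pred),
    }
    print(cutoff, results[cutoff])
```

Lowering the cut-off can only add positive predictions, so recall never goes down; precision usually pays for it.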


2. Decision Tree — Easiest to Explain

Use when:
  • You need to show people exactly how the model decides.
  • The data has clear if/else splits (for example, parking type = monthly → high amount).
  • You are checking which features matter before you build a more complex model.

Avoid when:
  • You need strong predictions; a single tree overfits easily and rarely beats a Random Forest.
  • The data has many continuous features with subtle interactions.

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

plt.figure(figsize=(20, 8))
# class_names must follow the sorted order of model.classes_ (assumed here: 0 = Other, 1 = Credit Card)
plot_tree(model, feature_names=list(X.columns), class_names=['Other', 'Credit Card'],
          filled=True, rounded=True)
plt.show()

Controlling overfitting

Parameter          Effect
max_depth          Limits the tree depth; the most important lever
min_samples_leaf   Smallest number of rows per leaf; stops tiny, overfit splits
min_samples_split  Smallest number of rows needed to try a split

Start with max_depth=4 or max_depth=5. A single tree much deeper than that often overfits on small and medium datasets, so go deeper only if the test score keeps improving. Check: if the training score is much higher than the test score, lower the depth.
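
The train/test check can be run as a small depth sweep. This sketch uses synthetic data with some label noise (an assumption), so the exact scores will differ on the parking dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which an unrestricted tree will happily memorize
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for depth in (2, 4, 6, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    scores[depth] = {
        'train': tree.score(X_train, y_train),
        'test': tree.score(X_test, y_test),
    }
    print(f"max_depth={depth}: train={scores[depth]['train']:.3f}, "
          f"test={scores[depth]['test']:.3f}")
```

Pick the depth where the test score stops improving, not the one with the highest training score.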


3. Random Forest — The Reliable Default

Use when:
  • You want strong results without much tuning.
  • You have a mix of numeric and categorical features (in scikit-learn the categorical ones still need to be encoded as numbers first).
  • You want feature importance built in.

Avoid when:
  • You need a fully transparent model; Random Forest is a black box compared to a single tree.
  • Training time or memory is tight; 100+ trees on a large dataset can be slow.
  • You need probabilities to be well-calibrated (use Logistic Regression instead).

Key parameters

Parameter         What it controls                  Typical range
n_estimators      Number of trees                   100–500
max_depth         Maximum depth per tree            5–20, or None
max_features      Features tried at each split      'sqrt' (default for classification)
min_samples_leaf  Smallest number of rows per leaf  1–10

More trees help up to a point, after which the gains level off while training time keeps growing. Start with 100 and add more only if cross-validation scores still go up.
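
One way to check whether more trees still help is to sweep n_estimators under cross-validation. A minimal sketch on synthetic data (an assumption); the tree counts are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=42)

cv_f1 = {}
for n_trees in (10, 50, 100, 200):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    # Mean F1 over 5 folds for this forest size
    cv_f1[n_trees] = cross_val_score(forest, X, y, cv=5, scoring='f1').mean()
    print(f"n_estimators={n_trees}: mean F1 = {cv_f1[n_trees]:.3f}")
```

When the score changes only in the third decimal place between two sizes, the smaller forest is usually the better choice.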

Feature importance

importance_df = pd.DataFrame({
    'feature':    X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

Feature importance tells you which features the model relies on most. It does not tell you whether the effect is positive or negative (use coefficients for that, or SHAP values).


4. Gradient Boosting (XGBoost) — Best Results

Gradient boosting builds trees one after another. Each new tree is fit to the errors left by the trees before it. When tuned, it often beats Random Forest on tabular data.

Use when:
  • You need the best possible result.
  • You are willing to tune hyperparameters.
  • Longer training time is acceptable (trees are built in sequence, and tuning means many training runs).

Avoid when:
  • Your dataset is small (under 1,000 rows); boosting overfits easily on so little data.
  • You need a quick first model; use Random Forest first and switch to XGBoost only if you need the extra performance.

from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      random_state=42, eval_metric='logloss')
model.fit(X_train, y_train)

Key parameters

Parameter         What it controls
n_estimators      Number of trees
learning_rate     Step size per tree; lower means you need more trees, but the model often generalizes better
max_depth         Tree depth (3–6 is typical, shallower than Random Forest)
subsample         Share of rows used per tree (0.8 = 80%); lowers overfitting
colsample_bytree  Share of features used per tree (0.8 = 80%)

A lower learning_rate combined with more n_estimators usually improves the result, at the cost of longer training. A common starting point: learning_rate=0.05, n_estimators=500.
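
With a low learning rate it helps to let the model decide how many trees it really needs. The sketch below is an illustration under assumptions: it uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (so it runs without the xgboost package), synthetic data, and early-stopping settings chosen only as examples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GradientBoostingClassifier(
    learning_rate=0.05,
    n_estimators=500,         # upper limit, not a target
    max_depth=4,
    subsample=0.8,            # each tree sees 80% of the training rows
    validation_fraction=0.2,  # held out from the training data for early stopping
    n_iter_no_change=20,      # stop after 20 rounds without validation improvement
    random_state=42,
)
model.fit(X_train, y_train)

print(f"Trees actually built: {model.n_estimators_} of {model.n_estimators}")
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

XGBoost offers its own early-stopping options; check its documentation for the exact parameter names in your version.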


5. KNN — Simple and Free of Assumptions

K-Nearest Neighbors classifies a point by the most common label among its k nearest neighbors.

Use when:
  • The dataset is small (under 10,000 rows).
  • The decision really does depend on local similarity.
  • You want a model with no training step.

Avoid when:
  • The dataset is large; prediction time grows with the size of the training set.
  • You have many features; distance loses meaning in high dimensions (the curse of dimensionality).
  • Features are on very different scales; always scale before using KNN.

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)   # scale features first

Tune n_neighbors with cross-validation. Use odd values to avoid ties in binary classification.
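
A possible way to tune n_neighbors is a grid search over odd values, with scaling inside a Pipeline so each fold is scaled on its own training part. The data is synthetic and the candidate list odd_k is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),     # KNN is distance-based, so scale first
    ('knn', KNeighborsClassifier()),
])
odd_k = [1, 3, 5, 7, 9, 11, 15, 21]  # odd values avoid ties between two classes
search = GridSearchCV(pipe, {'knn__n_neighbors': odd_k}, cv=5, scoring='f1')
search.fit(X, y)

best_k = search.best_params_['knn__n_neighbors']
print(f"Best k: {best_k}, mean F1: {search.best_score_:.3f}")
```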


Regression vs Classification: Choosing Evaluation Metrics

Regression Metrics

Metric  Formula            Use when
MAE     mean(|y - ŷ|)      You don't want outliers to dominate; reports the average error in the same units as the target
RMSE    √mean((y - ŷ)²)    Big errors are especially bad; RMSE punishes them more
R²      1 - SS_res/SS_tot  Explaining model quality to non-technical people (1 is perfect, 0 matches always predicting the mean, negative is worse than that)

R² of 0.85 means the model explains 85% of the variance in the target. R² below 0.5 is weak for most business cases.
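
The three formulas can be worked through by hand on four made-up predictions (the numbers are illustrative, not from the parking dataset) and cross-checked against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 35.0])

residuals = y_true - y_pred                     # [-2, 2, -3, 5]
mae = np.mean(np.abs(residuals))                # (2 + 2 + 3 + 5) / 4 = 3.0
rmse = np.sqrt(np.mean(residuals ** 2))         # sqrt(42 / 4), about 3.24
ss_res = np.sum(residuals ** 2)                 # 42
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 500
r2 = 1 - ss_res / ss_tot                        # 1 - 42/500 = 0.916

# The same metrics from scikit-learn, for comparison
sk_mae = mean_absolute_error(y_true, y_pred)
sk_rmse = np.sqrt(mean_squared_error(y_true, y_pred))
sk_r2 = r2_score(y_true, y_pred)

print(f"by hand: MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
print(f"sklearn: MAE={sk_mae:.2f}  RMSE={sk_rmse:.2f}  R2={sk_r2:.3f}")
```

RMSE comes out higher than MAE because the single error of 5 is squared before averaging.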

Classification Metrics

Metric     Use when
Accuracy   Classes are balanced and all errors cost the same
Precision  False positives are costly (spam filter, irrelevant recommendations)
Recall     False negatives are costly (fraud detection, medical diagnosis)
F1         Imbalanced classes; both precision and recall matter
ROC-AUC    Comparing models without picking a threshold; quality of probability ranking

Accuracy lies on imbalanced data. If 95% of sessions are not credit card, a model that always predicts "not credit card" gets 95% accuracy but is useless. Use F1 or ROC-AUC instead.
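
The 95% example can be reproduced with made-up labels: 95 sessions of class 0 and 5 of class 1, scored against a "model" that always predicts class 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 95 "not credit card" sessions (0) and 5 credit card sessions (1)
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)  # always predicts class 0

accuracy = accuracy_score(y_true, y_pred)
# No positive predictions at all, so F1 is reported as 0
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}")  # 0.95
print(f"F1:       {f1:.2f}")        # 0.00
```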

Reading a Classification Report

              precision    recall  f1-score   support
       Other       0.94      0.97      0.95      1820
 Credit Card       0.81      0.70      0.75       380
    accuracy                           0.92      2200

  • Higher precision (0.81) than recall (0.70) for "Credit Card": the model is cautious. When it says credit card, it is usually right, but it misses about 30% of the real credit card payments.
  • Lowering the cut-off (predict_proba > 0.35 instead of > 0.5) raises recall and usually lowers precision.
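
The numbers in the "Credit Card" row can be checked with simple arithmetic. The values below are copied from the report, so the results are approximate because the report rounds to two decimals.

```python
# Values copied from the "Credit Card" row of the report
precision, recall, support = 0.81, 0.70, 380

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Real credit card payments that the model labelled "Other"
missed = round(support * (1 - recall))

print(f"F1: {f1:.2f}")        # 0.75, matching the f1-score column
print(f"Missed: {missed}")    # about 114 of 380
```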

Overfitting vs Underfitting

Symptom                           Diagnosis                                       Fix
Train score high, test score low  Overfitting: model memorized the training data  Lower max_depth, raise min_samples_leaf, get more data
Both scores low                   Underfitting: model is too simple               Add features, raise model complexity, try a different algorithm
Train ≈ test, both acceptable     Good fit                                        Done

# always compare train vs test score
print(f"Train: {model.score(X_train, y_train):.3f}")
print(f"Test:  {model.score(X_test,  y_test):.3f}")

As a rule of thumb, a gap of more than about 0.10 between the train and test scores is a sign of overfitting.


Common Mistakes

1. Skipping the Baseline

Any model has to beat a simple baseline to be worth using.

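from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

dummy = DummyClassifier(strategy='most_frequent')
# F1 is 0 here if the positive class is the minority, because the dummy never predicts it
scores = cross_val_score(dummy, X, y, cv=5, scoring='f1')
print(f"Baseline F1: {scores.mean():.3f}")   # your model has to beat this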

2. Evaluating on Training Data

# WRONG: too high — the model has seen this data
model.score(X_train, y_train)

# CORRECT: a held-out test set, touched only once at the end
model.score(X_test, y_test)

3. Fitting the Scaler on the Full Dataset

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# WRONG: test data leaks into the scaler's mean and std
scaler.fit(X)
X_scaled = scaler.transform(X)

# CORRECT: fit only on training data
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)

Use a Pipeline to avoid this mistake — it forces the right fit/transform order on its own.
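
A minimal Pipeline sketch, assuming synthetic data and the step names 'scale' and 'clf' (both chosen for illustration). It shows that the scaler inside the pipeline only ever learns from the data passed to fit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)               # scaler statistics come from X_train only
test_score = pipe.score(X_test, y_test)  # X_test is transformed with those statistics

# The fitted scaler's mean matches the training rows, not the full dataset
scaler_mean = pipe.named_steps['scale'].mean_
matches_train = np.allclose(scaler_mean, X_train.mean(axis=0))

# In cross-validation the whole pipeline is refit inside each fold
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='f1')
print(f"Scaler fit on training data only: {matches_train}")
print(f"Test accuracy: {test_score:.3f}, CV F1: {cv_scores.mean():.3f}")
```

Passing the pipeline to cross_val_score instead of a pre-scaled matrix is what keeps each validation fold out of the scaler's statistics.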

4. Using Accuracy on Imbalanced Classes

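from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# MISLEADING: 95% accuracy when 95% of samples are class 0
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

# BETTER: F1 or ROC-AUC on imbalanced data
y_pred_prob = model.predict_proba(X_test)[:, 1]   # ROC-AUC needs scores, not hard labels
print(f"F1: {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_prob):.3f}")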

5. Jumping to a Complex Model

Random Forest and XGBoost are not always better than Logistic Regression. A simple model that is 2% worse but easy to explain is often the right business choice. Always build the simple model first.

6. Not Cross-Validating

A single train/test split can be lucky or unlucky. Cross-validation gives you a more reliable number.

# one split: not reliable
model.fit(X_train, y_train)
model.score(X_test, y_test)

# five folds: more reliable
cross_val_score(model, X, y, cv=5, scoring='f1').mean()

Model Selection Workflow

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd

models = {
    'Baseline':            DummyClassifier(strategy='most_frequent'),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree':       DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_encoded, y, cv=5, scoring='f1')
    results[name] = {'mean_f1': scores.mean().round(3), 'std': scores.std().round(3)}

print(pd.DataFrame(results).T.sort_values('mean_f1', ascending=False))

How to decide after you compare the results:
  • If Logistic Regression is close to Random Forest → stick with Logistic Regression (simpler, easier to explain).
  • If Random Forest clearly wins → use it, then think about XGBoost if you need more.
  • If every model barely beats the baseline → the problem is in the features, not the algorithm. Build better features before you change models.