Machine Learning Model Selection Guide
Choosing the right model is a separate skill from knowing how to code it. This guide focuses on the when — which algorithm fits your problem, your data, and your constraints — and the mistakes that come from choosing wrong.
For the syntax and scikit-learn code, see Machine Learning Notebook. All examples use the parking system dataset.
Task Selection
Before choosing a model, identify what type of problem you are solving.
| Your question | Task | Example |
|---|---|---|
| Predict a number | Regression | Predict parking amount from duration and station |
| Predict a category | Classification | Predict whether payment will be by credit card |
| Find natural groups in unlabelled data | Clustering | Group stations by usage patterns |
If you have a labelled target column, you are doing supervised learning (regression or classification). If you do not, you are doing unsupervised learning (clustering).
Model Selection Table
Use this to find a starting point before reading the full section.
| Situation | Start with |
|---|---|
| First model on any problem | 1. Linear / Logistic Regression |
| Need to understand why the model decides | 2. Decision Tree |
| Best out-of-the-box performance on tabular data | 3. Random Forest |
| Best performance, willing to tune | 4. Gradient Boosting (XGBoost) |
| Small dataset, simple distance-based problem | 5. KNN |
Always build a baseline first (see Common Mistakes). A model that barely beats "always predict the mean" is not useful.
1. Linear / Logistic Regression — The Baseline
Use for:
- Linear Regression: predict a numeric target when the relationship with features is roughly linear.
- Logistic Regression: classify into two (or more) categories when you need probabilities, not just labels.

Use when:
- You want a fast, interpretable starting point.
- The number of features is large relative to the number of rows — simpler models generalise better.
- You need to explain the model to non-technical stakeholders (coefficients map directly to feature effects).

Avoid when:
- The relationship between features and target is non-linear (e.g., revenue spikes on weekends but not other days).
- Features interact with each other in complex ways.
Reading the output
```python
# coefficients show the effect of each feature
coef_df = pd.DataFrame({'feature': X.columns, 'coef': model.coef_}).sort_values('coef')
```
A positive coefficient means increasing that feature raises the predicted value. A large absolute coefficient means a strong effect — but only if features are on the same scale. Always scale features before comparing coefficients.
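As a sketch of the scaling point: fit on standardised features so coefficient magnitudes are comparable per standard deviation. The feature names (`duration_min`, `is_weekend`) and the synthetic data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = pd.DataFrame({
    'duration_min': rng.uniform(10, 600, 500),  # large-scale feature
    'is_weekend': rng.integers(0, 2, 500),      # 0/1 feature
})
y = 0.5 * X['duration_min'] + 20 * X['is_weekend'] + rng.normal(0, 5, 500)

# standardise first, so coefficients measure effect per standard deviation
model = LinearRegression().fit(StandardScaler().fit_transform(X), y)
coef_df = pd.DataFrame({'feature': X.columns, 'coef': model.coef_}).sort_values('coef')
print(coef_df)
```

Without scaling, `duration_min` would get a tiny raw coefficient purely because its values are large, making the two coefficients incomparable.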
For Logistic Regression, predict_proba() returns the probability of each class. A threshold of 0.5 is the default, but you can lower it (e.g., 0.3) to catch more positives at the cost of more false alarms.
2. Decision Tree — Maximum Interpretability
Use when:
- You need to show stakeholders exactly how the model makes decisions.
- The data has clear if/else splits (e.g., parking type = monthly → high amount).
- You are exploring which features matter before building a more complex model.

Avoid when:
- You need strong predictive performance — a single tree overfits easily and rarely wins against Random Forest.
- The data has many continuous features with subtle interactions.
```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

plt.figure(figsize=(20, 8))
plot_tree(model, feature_names=X.columns, class_names=['Other', 'Credit Card'],
          filled=True, rounded=True)
plt.show()
```
Controlling overfitting
| Parameter | Effect |
|---|---|
| `max_depth` | Limits tree depth — most important lever |
| `min_samples_leaf` | Minimum rows per leaf — prevents tiny, overfit splits |
| `min_samples_split` | Minimum rows to attempt a split |
Start with max_depth=4 or max_depth=5. A tree deeper than 6 is almost always overfitting. Check: if training accuracy >> test accuracy, reduce depth.
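One way to apply that check is to sweep `max_depth` and watch the train/test gap grow. A sketch on synthetic data — the dataset and depth values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in [2, 4, 6, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    # a large gap means the tree memorised the training data
    gap = tree.score(X_train, y_train) - tree.score(X_test, y_test)
    print(f"max_depth={depth}: train-test gap = {gap:.3f}")
```

The unlimited-depth tree reaches perfect training accuracy, so its gap is whatever the test set loses — typically the largest in the sweep.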
3. Random Forest — The Reliable Default
Use when:
- You want strong performance without extensive tuning.
- You have a mix of numeric and categorical features.
- You want built-in feature importance.

Avoid when:
- You need a fully interpretable model — Random Forest is a black box relative to a single tree.
- Training time or memory is a hard constraint — 100+ trees on a large dataset can be slow.
- You need probabilities to be well-calibrated (use Logistic Regression instead).
Key parameters
| Parameter | What it controls | Typical range |
|---|---|---|
| `n_estimators` | Number of trees | 100–500 |
| `max_depth` | Maximum depth per tree | 5–20, or None |
| `max_features` | Features considered at each split | 'sqrt' (default for classification) |
| `min_samples_leaf` | Minimum rows per leaf | 1–10 |
More trees help up to a point, then returns diminish. Start with 100 and increase only if cross-validation scores are still improving.
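A sketch of that stopping rule on synthetic data — increase `n_estimators` only while the cross-validation score keeps moving (the tree counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

scores = {}
for n in [50, 100, 300]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
    scores[n] = cross_val_score(rf, X, y, cv=5).mean()
    print(f"n_estimators={n}: CV accuracy = {scores[n]:.3f}")

# if the jump from 100 to 300 is within noise, stay at 100
```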
Feature importance
```python
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
```
Feature importance tells you which features the model relies on most. It does not tell you the direction of the effect (use coefficients for that, or SHAP values).
4. Gradient Boosting (XGBoost) — Best Performance
Gradient boosting builds trees sequentially — each tree corrects the errors of the previous one. It typically outperforms Random Forest on structured tabular data.
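The "each tree corrects the errors of the previous one" idea can be sketched by hand for regression. This is a toy illustration of the boosting loop — fit each tree to the residuals of the running prediction, then add its correction scaled by a learning rate — not the actual XGBoost implementation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # start from the mean, like a baseline
errors = []

for _ in range(50):
    residuals = y - prediction                     # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X, residuals)                         # the next tree learns the errors
    prediction += learning_rate * tree.predict(X)  # add its correction, scaled down
    errors.append(np.mean((y - prediction) ** 2))

print(f"training MSE after 1 tree:   {errors[0]:.1f}")
print(f"training MSE after 50 trees: {errors[-1]:.1f}")  # falls as trees accumulate
```

The learning rate keeps any single tree from dominating, which is why a lower rate usually needs more trees.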
Use when:
- You need the best possible predictive performance.
- You are willing to tune hyperparameters.
- Training time is acceptable (slower than Random Forest per tree, but often needs fewer trees).

Avoid when:
- Your dataset is small (< 1,000 rows) — it will overfit.
- You need a quick first model — use Random Forest first, switch to XGBoost if you need the extra performance.
```python
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      random_state=42, eval_metric='logloss')
model.fit(X_train, y_train)
```
Key parameters
| Parameter | What it controls |
|---|---|
| `n_estimators` | Number of trees |
| `learning_rate` | Step size per tree — lower = more trees needed, often better generalisation |
| `max_depth` | Tree depth (3–6 is typical, shallower than Random Forest) |
| `subsample` | Fraction of rows per tree (0.8 = 80%) — reduces overfitting |
| `colsample_bytree` | Fraction of features per tree (0.8 = 80%) |
A lower `learning_rate` combined with more `n_estimators` almost always improves results. A common starting point: `learning_rate=0.05`, `n_estimators=500`.
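If XGBoost is not installed, the same learning-rate/tree-count tradeoff can be sketched with scikit-learn's `GradientBoostingClassifier` — the two configurations below are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

results = {}
for lr, n in [(0.1, 100), (0.05, 500)]:  # fewer big steps vs many small steps
    gb = GradientBoostingClassifier(learning_rate=lr, n_estimators=n,
                                    max_depth=3, random_state=42)
    results[(lr, n)] = cross_val_score(gb, X, y, cv=3).mean()
    print(f"learning_rate={lr}, n_estimators={n}: "
          f"CV accuracy = {results[(lr, n)]:.3f}")
```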
5. KNN — Simple and Assumption-Free
K-Nearest Neighbours classifies a point by the majority label of its k nearest neighbours.
Use when:
- The dataset is small (< 10,000 rows).
- The decision boundary is genuinely based on local similarity.
- You want a non-parametric model with no training phase.

Avoid when:
- The dataset is large — prediction time scales with training set size.
- You have many features — distance becomes meaningless in high dimensions (curse of dimensionality).
- Features have very different scales — always scale before using KNN.
```python
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)  # must scale features first
```
Tune n_neighbors with cross-validation. Odd values avoid ties in binary classification.
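A sketch of that tuning loop on synthetic data, with scaling handled inside a pipeline so each CV fold is scaled correctly (the candidate k values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

results = {}
for k in [1, 3, 5, 7, 9, 15]:  # odd values avoid ties in binary classification
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[k] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"k={k}: CV accuracy = {results[k]:.3f}")

best_k = max(results, key=results.get)
print("best k:", best_k)
```

Very small k tends to overfit to local noise; very large k oversmooths the boundary — the sweep finds the middle ground.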
Regression vs Classification: Choosing Evaluation Metrics
Regression Metrics
| Metric | Formula | Use when |
|---|---|---|
| MAE | mean(\|y - ŷ\|) | Outliers should not dominate — reports average error in original units |
| RMSE | √mean((y - ŷ)²) | Large errors are especially bad — penalises them more |
| R² | 1 - SS_res/SS_tot | Explaining model quality to stakeholders (0–1 scale) |
R² of 0.85 means the model explains 85% of the variance in the target. An R² below 0.5 is weak for most business use cases.
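A minimal sketch computing all three metrics with `sklearn.metrics` — the target and prediction values are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([10.0, 12.0, 8.0, 15.0, 11.0])  # toy actual values
y_pred = np.array([11.0, 11.5, 9.0, 13.0, 11.0])  # toy predictions

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # version-safe RMSE
r2 = r2_score(y_true, y_pred)

print(f"MAE:  {mae:.2f}")   # 0.90 — average error in original units
print(f"RMSE: {rmse:.2f}")  # 1.12 — the one large error (2.0) pulls it above MAE
print(f"R²:   {r2:.2f}")    # 0.77 — share of variance explained
```

RMSE is always at least as large as MAE; the gap between them grows with the spread of the individual errors.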
Classification Metrics
| Metric | Use when |
|---|---|
| Accuracy | Classes are balanced and all errors cost the same |
| Precision | False positives are costly (spam filter, irrelevant recommendations) |
| Recall | False negatives are costly (fraud detection, medical diagnosis) |
| F1 | Imbalanced classes, both precision and recall matter |
| ROC-AUC | Comparing models regardless of threshold; probability ranking quality |
Accuracy is misleading on imbalanced data. If 95% of sessions are non-credit-card, a model that always predicts "not credit card" scores 95% accuracy but is useless. Use F1 or ROC-AUC instead.
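The pitfall is easy to demonstrate with toy labels matching that 95/5 split:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 95% of sessions are class 0 ("not credit card")
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # a "model" that always predicts class 0

print("Accuracy:", accuracy_score(y_true, y_pred))       # 0.95 — looks strong
print("F1:", f1_score(y_true, y_pred, zero_division=0))  # 0.0 — model is useless
```

F1 collapses to zero because the model never identifies a single positive, which is exactly the failure accuracy hides.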
Reading a Classification Report
```
              precision    recall  f1-score   support

       Other       0.94      0.97      0.95      1820
 Credit Card       0.81      0.70      0.75       380

    accuracy                           0.92      2200
```
- High precision, low recall for "Credit Card": the model is conservative — when it predicts credit card, it's usually right, but it misses many actual credit card payments.
- Lowering the classification threshold (`predict_proba > 0.35` instead of `> 0.5`) will raise recall and lower precision.
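A sketch of that tradeoff on a synthetic imbalanced dataset — the 0.35 threshold and the 80/20 class weights are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                    stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

recalls = {}
for threshold in [0.5, 0.35]:
    pred = (proba > threshold).astype(int)
    recalls[threshold] = recall_score(y_test, pred)
    print(f"threshold={threshold}: precision={precision_score(y_test, pred):.2f}, "
          f"recall={recalls[threshold]:.2f}")
```

Lowering the threshold can only turn negatives into positives, so recall never drops — the cost shows up in precision.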
Overfitting vs Underfitting
| Symptom | Diagnosis | Fix |
|---|---|---|
| Train score high, test score low | Overfitting — model memorised training data | Reduce max_depth, increase min_samples_leaf, add more data |
| Both scores low | Underfitting — model too simple | Add features, increase model complexity, try a different algorithm |
| Train ≈ test, both acceptable | Good fit | Done |
```python
# always compare train vs test score
print(f"Train: {model.score(X_train, y_train):.3f}")
print(f"Test:  {model.score(X_test, y_test):.3f}")
```
A gap of more than 0.10 between train and test is a warning sign of overfitting.
Common Mistakes
1. Skipping the Baseline
Any model must beat a naive baseline to be worth using.
```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

dummy = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(dummy, X, y, cv=5, scoring='f1')
print(f"Baseline F1: {scores.mean():.3f}")  # your model must beat this
```
2. Evaluating on Training Data
```python
# WRONG: inflated score — the model has seen this data
model.score(X_train, y_train)

# CORRECT: held-out test set, touched once at the very end
model.score(X_test, y_test)
```
3. Fitting the Scaler on the Full Dataset
```python
# WRONG: test data leaks into the scaler's mean and std
scaler.fit(X)
X_scaled = scaler.transform(X)

# CORRECT: fit only on training data
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
Use a Pipeline to prevent this mistake entirely — it enforces the correct fit/transform order automatically.
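A minimal sketch of the Pipeline approach on synthetic data — inside cross-validation, the scaler is re-fit on the training portion of every fold automatically, so the test fold never leaks into it:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# scaler and model travel together; fit/transform order is enforced
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f}")
```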
4. Using Accuracy on Imbalanced Classes
```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# MISLEADING: 95% accuracy when 95% of samples are class 0
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

# BETTER: F1 or ROC-AUC on imbalanced data
print(f"F1: {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_prob):.3f}")
```
5. Jumping to a Complex Model
Random Forest and XGBoost are not always better than Logistic Regression. A simple model that is 2% worse but fully explainable is often the right business choice. Always build the simple model first.
6. Not Cross-Validating
A single train/test split can be lucky or unlucky. Cross-validation gives a more reliable estimate.
```python
from sklearn.model_selection import cross_val_score

# one split: unreliable
model.fit(X_train, y_train)
model.score(X_test, y_test)

# five folds: more reliable
cross_val_score(model, X, y, cv=5, scoring='f1').mean()
```
Model Selection Workflow
```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd

models = {
    'Baseline': DummyClassifier(strategy='most_frequent'),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_encoded, y, cv=5, scoring='f1')
    results[name] = {'mean_f1': scores.mean().round(3), 'std': scores.std().round(3)}

print(pd.DataFrame(results).T.sort_values('mean_f1', ascending=False))
```
Decision logic after comparing results:
- If Logistic Regression is close to Random Forest → stick with Logistic Regression (simpler, explainable).
- If Random Forest clearly wins → use it, then consider XGBoost if you need more.
- If all models barely beat the baseline → the problem is in the features, not the algorithm. Add better features before changing models.