Machine Learning Notebook

Machine learning in Python centres on one library: scikit-learn. It provides a consistent API across all models — fit, predict, score — so once you understand the workflow, switching between algorithms is straightforward.

This notebook covers the full workflow: data preparation → model training → evaluation → iteration. For the data manipulation that happens before this, see Pandas Notebook and NumPy Notebook. For choosing the right model and interpreting results, see ML Model Selection Guide.

All examples use the parking system dataset. Two tasks run throughout:

- Regression: predict parking amount from duration, station, and parking type.
- Classification: predict whether a session was paid by credit card.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt

The ML Workflow

Every ML project follows the same five steps:

1. Prepare data      → select features, handle nulls, encode categoricals, scale
2. Split             → train set / test set (never evaluate on training data)
3. Train             → fit a model on the training set
4. Evaluate          → measure performance on the test set
5. Iterate           → tune, try different models, add features

The test set is set aside at the start and touched only once at the end. Using it during development inflates your performance estimates.
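Condensed into code, the five steps look like this (a minimal sketch using a tiny synthetic frame in place of the parking dataset; the column names `duration_mins` and `amount` follow the examples below):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 1. prepare: a tiny synthetic stand-in for the parking data
df = pd.DataFrame({
    'duration_mins': [30, 60, 90, 120, 45, 75, 150, 20],
    'amount':        [2.0, 4.1, 5.9, 8.2, 3.0, 5.0, 10.1, 1.4],
})
X, y = df[['duration_mins']], df['amount']

# 2. split (the test set is now off limits until step 4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 3. train
model = LinearRegression().fit(X_train, y_train)

# 4. evaluate: the one and only look at the test set
print(f"Test R²: {r2_score(y_test, model.predict(X_test)):.3f}")

# 5. iterate: tune hyperparameters, add features, try other models
```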

Data Preparation

Feature Selection

# regression: predict amount
features = ['duration_mins', 'station_code', 'parking_type']
target   = 'amount'

df_model = parking_df[features + [target]].dropna()
X = df_model[features]
y = df_model[target]

# classification: predict credit card payment (binary)
df_model = parking_df[features + ['payment_method']].dropna()
X = df_model[features]
y = (df_model['payment_method'] == 'Credit_Card').astype(int)   # 1 = credit card, 0 = other

Drop rows with nulls in the features or target before splitting — scikit-learn does not handle NaN by default.
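A quick check of how many rows `dropna` will cost you, sketched here on a small synthetic stand-in for `parking_df`:

```python
import numpy as np
import pandas as pd

# synthetic stand-in for parking_df, with a couple of nulls
parking_df = pd.DataFrame({
    'duration_mins': [30, 60, np.nan, 120],
    'amount':        [2.0, np.nan, 5.9, 8.2],
})

# nulls per column, and how many rows survive dropna
print(parking_df.isna().sum())
print(f"rows before: {len(parking_df)}, after dropna: {len(parking_df.dropna())}")
```

If `dropna` removes a large fraction of rows, consider imputation instead of dropping.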

Train / Test Split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)} rows, Test: {len(X_test)} rows")

test_size=0.2 reserves 20% for evaluation. random_state=42 makes the split reproducible.

For classification with imbalanced classes, add stratify=y to preserve the class ratio in both sets:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
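You can verify that stratification worked by comparing class proportions in the two splits. A sketch on a synthetic 80/20 imbalanced target:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic imbalanced binary target: 80% class 0, 20% class 1
y = pd.Series([0] * 80 + [1] * 20)
X = pd.DataFrame({'x': range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# both splits keep (approximately) the 80/20 ratio
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```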

Encoding Categorical Features

scikit-learn requires numeric input. Categorical columns must be encoded first.

OneHotEncoder — for nominal categories (no order: station_code, payment_method):

# produces one binary column per category
enc = OneHotEncoder(drop='first', sparse_output=False)
encoded = enc.fit_transform(X[['parking_type', 'station_code']])

drop='first' removes one redundant column per category to avoid multicollinearity.

LabelEncoder — for ordinal categories or a binary target:

le = LabelEncoder()
y_encoded = le.fit_transform(y_series)    # 'hourly' → 0, 'monthly' → 1, ...

In practice, use ColumnTransformer inside a Pipeline (see below) rather than encoding manually.

Feature Scaling

Tree-based models (Decision Tree, Random Forest) do not need scaling. Linear models and distance-based models do.

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_numeric)
X_test_scaled  = scaler.transform(X_test_numeric)      # use fit from training set only

Always fit the scaler on the training set and apply the same transformation to the test set. Fitting on the full dataset leaks test information into training.


Regression

Predict a continuous numeric value (parking amount).

Linear Regression

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

Inspect coefficients to understand feature contributions:

coef_df = pd.DataFrame({
    'feature':     X_train.columns,
    'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)
print(coef_df)

Decision Tree Regressor

model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

max_depth limits tree depth to prevent overfitting. Without it, the tree will memorise the training data.
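Comparing train and test scores makes the overfitting visible. A sketch on noisy synthetic data: the unrestricted tree scores a perfect R² on the training set but worse on the test set, while the depth-limited tree gives up some training fit for better generalisation.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# noisy synthetic data: y = 2x plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = 2 * X.ravel() + rng.normal(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for depth in [None, 5]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train R²={scores[depth][0]:.2f}, "
          f"test R²={scores[depth][1]:.2f}")
```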

Random Forest Regressor

model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Random Forest builds many trees on random subsets of data and averages their predictions — more robust than a single tree.

Regression Evaluation Metrics

mae  = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # squared=False was removed in recent scikit-learn
r2   = r2_score(y_test, y_pred)

print(f"MAE:  {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
print(f"R²:   {r2:.3f}")
| Metric | What it measures                             | Units          |
|--------|----------------------------------------------|----------------|
| MAE    | Average absolute error                       | Same as target |
| RMSE   | Average error, penalises large mistakes more | Same as target |
| R²     | Proportion of variance explained (0–1)       | Unitless       |

R² of 0.80 means the model explains 80% of the variance in the target. An R² of 1.0 is a perfect fit; 0.0 means the model does no better than predicting the mean.
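The definition behind that interpretation is R² = 1 − SS_res / SS_tot: the residual sum of squares relative to the variance around the mean. A sketch with made-up numbers, checked against scikit-learn:

```python
import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.2, 6.5, 9.5])

ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total variance around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"manual: {r2_manual:.3f}, sklearn: {r2_score(y_test, y_pred):.3f}")
```

Predicting the mean everywhere gives SS_res = SS_tot, hence R² = 0; predicting worse than the mean drives R² negative.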

Residual Plot

Always plot residuals — patterns in the residuals reveal model weaknesses:

residuals = y_test - y_pred

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].scatter(y_pred, residuals, alpha=0.3)
axes[0].axhline(0, color='red', linewidth=1)
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs Predicted')

axes[1].hist(residuals, bins=30)
axes[1].set_title('Residual Distribution')

plt.tight_layout()
plt.show()

Residuals should be randomly scattered around zero. A systematic pattern (curve, funnel shape) means the model is missing something.


Classification

Predict a category (credit card vs not).

Logistic Regression

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)

y_pred      = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]   # probability of class 1

max_iter=1000 prevents convergence warnings on larger datasets.

Decision Tree Classifier

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Random Forest Classifier

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Classification Evaluation Metrics

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=['Other', 'Credit Card']))

classification_report prints precision, recall, and F1 for each class:

| Metric    | Formula               | Meaning                                       |
|-----------|-----------------------|-----------------------------------------------|
| Accuracy  | correct / total       | Overall correctness                           |
| Precision | TP / (TP + FP)        | Of predicted positives, how many are correct? |
| Recall    | TP / (TP + FN)        | Of actual positives, how many did we catch?   |
| F1        | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall         |

When to prioritise recall over precision:

- Fraud detection — missing a fraud (false negative) is worse than a false alarm.
- Medical diagnosis — missing a disease is worse than an unnecessary follow-up.

When to prioritise precision:

- Spam filter — marking a legitimate email as spam is worse than missing spam.

Confusion Matrix

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Other', 'Credit Card']).plot()
plt.title('Confusion Matrix')
plt.show()
|                 | Predicted Negative  | Predicted Positive  |
|-----------------|---------------------|---------------------|
| Actual Negative | True Negative (TN)  | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP)  |
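For a binary problem, `.ravel()` flattens the matrix into the four counts in TN, FP, FN, TP order, which makes the precision and recall formulas above easy to verify by hand. A sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_test = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# binary confusion matrix flattens as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f}  recall={recall:.2f}")
```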

Cross-Validation

A single train/test split can be lucky or unlucky depending on which rows land in the test set. Cross-validation averages performance across multiple splits for a more reliable estimate.

model = RandomForestClassifier(n_estimators=100, random_state=42)
cv    = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring='f1')

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} ± {scores.std():.3f}")

5-fold CV splits data into 5 equal parts, trains on 4 and tests on 1, rotating which part is the test set. The mean score is more reliable than any single split.

Common scoring values: 'accuracy', 'f1', 'roc_auc', 'r2', 'neg_mean_absolute_error'.
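To collect several of these metrics in a single pass, `cross_validate` accepts a list of scorers and returns a dict keyed by `test_<metric>`. A sketch on synthetic data from `make_classification`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# several metrics in one cross-validation run
res = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1', 'roc_auc'])
for metric in ['accuracy', 'f1', 'roc_auc']:
    print(f"{metric}: {res['test_' + metric].mean():.3f}")
```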


Feature Importance

Tree-based models provide built-in feature importance scores — how much each feature contributed to reducing prediction error.

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_encoded, y_train)

importance_df = pd.DataFrame({
    'feature':    feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

importance_df.plot(kind='barh', x='feature', y='importance', legend=False)
plt.title('Feature Importance')
plt.tight_layout()
plt.show()

Feature importance tells you which inputs the model relies on most, but does not tell you the direction of the effect. For that, use a coefficient plot (linear models) or SHAP values (any model).


Pipeline

A Pipeline chains preprocessing and model steps into one object. This prevents the most common ML mistake: fitting the scaler or encoder on the full dataset instead of only the training set.

numeric_features     = ['duration_mins']
categorical_features = ['parking_type', 'station_code']

# preprocessing for each column type
numeric_transformer     = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer,     numeric_features),
    ('cat', categorical_transformer, categorical_features),
])

# full pipeline: preprocessing + model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model',        RandomForestClassifier(n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred))

handle_unknown='ignore' prevents errors when the test set contains a category not seen during training.

Using a Pipeline means:

- pipeline.fit(X_train, y_train) fits all steps on training data only.
- pipeline.predict(X_test) applies the same transformations automatically.
- The entire workflow is one object — easy to save, load, and deploy.

Save and Load a Pipeline

import joblib

joblib.dump(pipeline, 'parking_model.pkl')      # save
pipeline = joblib.load('parking_model.pkl')     # load
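The loaded pipeline accepts raw, untransformed columns; the preprocessing it learned during training is applied automatically at predict time. A self-contained sketch with a minimal synthetic pipeline (the file name matches the example above):

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# fit a minimal pipeline on synthetic data
X = pd.DataFrame({'duration_mins': [10, 20, 30, 40, 50, 60]})
y = [0, 0, 0, 1, 1, 1]
pipeline = Pipeline([('scale', StandardScaler()),
                     ('model', LogisticRegression())]).fit(X, y)

joblib.dump(pipeline, 'parking_model.pkl')
loaded = joblib.load('parking_model.pkl')

# raw columns in, predictions out: scaling happens inside the pipeline
new_rows = pd.DataFrame({'duration_mins': [15, 55]})
print(loaded.predict(new_rows))
```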

Common Workflows

1. Compare Multiple Models

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree':       DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest':       RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_encoded, y, cv=5, scoring='f1')
    results[name] = {'mean': scores.mean(), 'std': scores.std()}

results_df = pd.DataFrame(results).T.sort_values('mean', ascending=False)
print(results_df.round(3))

2. Choosing a Model

| Model                        | Strengths                          | Weaknesses                   |
|------------------------------|------------------------------------|------------------------------|
| Linear / Logistic Regression | Fast, interpretable, good baseline | Assumes linear relationships |
| Decision Tree                | Interpretable, handles mixed types | Overfits without depth limit |
| Random Forest                | Strong out-of-the-box, robust      | Slower, less interpretable   |

Always start with a simple model (linear/logistic regression) as a baseline before trying more complex ones. A simple model that performs nearly as well is almost always preferable.

3. Handling Imbalanced Classes

When one class is much rarer than the other (e.g., 95% non-fraud, 5% fraud), accuracy is misleading — a model that always predicts the majority class scores 95%.

# option 1: class_weight='balanced' — automatically adjusts for imbalance
model = RandomForestClassifier(class_weight='balanced', random_state=42)

# option 2: evaluate with F1 or AUC instead of accuracy
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')

class_weight='balanced' penalises mistakes on the minority class more heavily during training.

4. Baseline Check

Before building any model, establish a naive baseline:

from sklearn.dummy import DummyClassifier, DummyRegressor

# classification baseline: always predict the most common class
dummy = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(dummy, X, y, cv=5, scoring='f1')
print(f"Baseline F1: {scores.mean():.3f}")

# regression baseline: always predict the mean
dummy = DummyRegressor(strategy='mean')
scores = cross_val_score(dummy, X, y, cv=5, scoring='r2')
print(f"Baseline R²: {scores.mean():.3f}")

Your model must beat the baseline to be worth using. A model that barely outperforms "always predict the mean" is not useful.