Machine Learning Notebook
Machine learning in Python centres on one library: scikit-learn. It provides a consistent API across all models — fit, predict, score — so once you understand the workflow, switching between algorithms is straightforward.
This notebook covers the full workflow: data preparation → model training → evaluation → iteration. For the data manipulation that happens before this, see Pandas Notebook and NumPy Notebook. For choosing the right model and interpreting results, see ML Model Selection Guide.
All examples use the parking system dataset. Two tasks run throughout:
- Regression: predict parking amount from duration, station, and parking type.
- Classification: predict whether a session was paid by credit card.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
mean_absolute_error, mean_squared_error, r2_score,
accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt
The ML Workflow
Every ML project follows the same five steps:
1. Prepare data → select features, handle nulls, encode categoricals, scale
2. Split → train set / test set (never evaluate on training data)
3. Train → fit a model on the training set
4. Evaluate → measure performance on the test set
5. Iterate → tune, try different models, add features
The test set is set aside at the start and touched only once at the end. Using it during development inflates your performance estimates.
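The five steps can be sketched end-to-end in a few lines. This is a minimal illustration on synthetic stand-in data (the uniform durations and the 0.05 coefficient are invented, not from the parking dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 1. prepare: synthetic stand-in for the parking data (hypothetical values)
rng = np.random.default_rng(42)
X = rng.uniform(10, 300, size=(500, 1))            # duration in minutes
y = 2.0 + 0.05 * X[:, 0] + rng.normal(0, 1, 500)   # amount with noise

# 2. split: hold out 20% for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. train on the training set only
model = LinearRegression().fit(X_train, y_train)

# 4. evaluate once, on the held-out test set
r2 = r2_score(y_test, model.predict(X_test))

# 5. iterate: tune, swap models, add features — then re-check on the test set
```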
Data Preparation
Feature Selection
# regression: predict amount
features = ['duration_mins', 'station_code', 'parking_type']
target = 'amount'
df_model = parking_df[features + [target]].dropna()
X = df_model[features]
y = df_model[target]
# classification: predict credit card payment (binary)
df_model = parking_df[features + ['payment_method']].dropna()
X = df_model[features]
y = (df_model['payment_method'] == 'Credit_Card').astype(int) # 1 = credit card, 0 = other
Drop rows with nulls in the features or target before splitting — scikit-learn does not handle NaN by default.
Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)} rows, Test: {len(X_test)} rows")
test_size=0.2 reserves 20% for evaluation. random_state=42 makes the split reproducible.
For classification with imbalanced classes, add stratify=y to preserve the class ratio in both sets:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
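The effect of stratify is easy to verify on a small imbalanced example (the 90/10 labels below are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# imbalanced toy labels: 90% class 0, 10% class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# the 10% minority ratio is preserved exactly in both halves
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```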
Encoding Categorical Features
scikit-learn requires numeric input. Categorical columns must be encoded first.
OneHotEncoder — for nominal categories (no order: station_code, payment_method):
# produces one binary column per category
enc = OneHotEncoder(drop='first', sparse_output=False)
encoded = enc.fit_transform(X[['parking_type', 'station_code']])
drop='first' removes the first category's column for each feature to avoid multicollinearity.
LabelEncoder — for encoding a target column (for ordinal feature columns, use OrdinalEncoder instead):
le = LabelEncoder()
y_encoded = le.fit_transform(y_series) # 'hourly' → 0, 'monthly' → 1, ...
In practice, use ColumnTransformer inside a Pipeline (see below) rather than encoding manually.
Feature Scaling
Tree-based models (Decision Tree, Random Forest) do not need scaling. Linear models and distance-based models do.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_numeric)
X_test_scaled = scaler.transform(X_test_numeric) # use fit from training set only
Always fit the scaler on the training set and apply the same transformation to the test set. Fitting on the full dataset leaks test information into training.
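A quick check makes the fit/transform split concrete; the normal samples below are invented stand-ins for the numeric columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(50, 10, size=(80, 1))   # e.g. durations
X_test = rng.normal(50, 10, size=(20, 1))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # statistics come from train only
X_test_scaled = scaler.transform(X_test)         # same mean/std reused, never re-fit

# the training set is exactly standardised; the test set only approximately,
# because it is scaled with the *training* mean and std
print(X_train_scaled.mean(), X_test_scaled.mean())
```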
Regression
Predict a continuous numeric value (parking amount).
Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Inspect coefficients to understand feature contributions:
coef_df = pd.DataFrame({
'feature': X_train.columns,
'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)
print(coef_df)
Decision Tree Regressor
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
max_depth limits tree depth to prevent overfitting. Without it, the tree will memorise the training data.
Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Random Forest builds many trees on random subsets of data and averages their predictions — more robust than a single tree.
Regression Evaluation Metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False was removed in scikit-learn 1.6
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
print(f"R²: {r2:.3f}")
| Metric | What it measures | Units |
|---|---|---|
| MAE | Average absolute error | Same as target |
| RMSE | Average error, penalises large mistakes more | Same as target |
| R² | Proportion of variance explained | Unitless |
R² of 0.80 means the model explains 80% of the variance in the target. An R² of 1.0 is a perfect fit; 0.0 means the model does no better than predicting the mean, and negative values mean it does worse.
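The edge cases of R² can be verified directly (the four target values below are arbitrary):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])

# predicting the mean of y_true gives an R² of exactly 0
y_mean = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, y_mean)
print(r2_mean)   # 0.0

# a model worse than the mean gives a negative R²
r2_bad = r2_score(y_true, y_true[::-1])
print(r2_bad)    # -3.0
```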
Residual Plot
Always plot residuals — patterns in the residuals reveal model weaknesses:
residuals = y_test - y_pred
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(y_pred, residuals, alpha=0.3)
axes[0].axhline(0, color='red', linewidth=1)
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs Predicted')
axes[1].hist(residuals, bins=30)
axes[1].set_title('Residual Distribution')
plt.tight_layout()
plt.show()
Residuals should be randomly scattered around zero. A systematic pattern (curve, funnel shape) means the model is missing something.
Classification
Predict a category (credit card vs not).
Logistic Regression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1] # probability of class 1
max_iter=1000 prevents convergence warnings on larger datasets.
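predict() is just predict_proba thresholded at 0.5, so the probabilities let you trade precision for recall by moving the threshold. A sketch on synthetic binary data (make_classification stands in for the payment task):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# default .predict() is equivalent to thresholding the probability at 0.5
default = model.predict(X_te)
manual = (proba >= 0.5).astype(int)

# a lower threshold flags more sessions as positive: higher recall, lower precision
lenient = (proba >= 0.3).astype(int)
```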
Decision Tree Classifier
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Classification Evaluation Metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=['Other', 'Credit Card']))
classification_report prints precision, recall, and F1 for each class:
| Metric | Formula | Meaning |
|---|---|---|
| Accuracy | correct / total | Overall correctness |
| Precision | TP / (TP + FP) | Of predicted positives, how many are correct? |
| Recall | TP / (TP + FN) | Of actual positives, how many did we catch? |
| F1 | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall |
When to prioritise recall over precision:
- Fraud detection — missing a fraud (false negative) is worse than a false alarm.
- Medical diagnosis — missing a disease is worse than an unnecessary follow-up.
When to prioritise precision:
- Spam filter — marking a legitimate email as spam is worse than missing spam.
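The formulas in the table can be checked by hand against scikit-learn on a tiny example (the labels below are invented, not from the parking data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # TP=2, FN=1, FP=1, TN=4

precision = 2 / (2 + 1)                              # TP / (TP + FP)
recall = 2 / (2 + 1)                                 # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(precision, recall, f1)   # all 2/3 here, matching sklearn's functions
```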
Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Other', 'Credit Card']).plot()
plt.title('Confusion Matrix')
plt.show()
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (TN) | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP) |
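For binary problems, ravel() unpacks the matrix into the four cells in the row order shown above (TN, FP, FN, TP); the labels here are a toy example:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# flatten the 2x2 matrix: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```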
Cross-Validation
A single train/test split can be lucky or unlucky depending on which rows land in the test set. Cross-validation averages performance across multiple splits for a more reliable estimate.
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_encoded, y, cv=cv, scoring='f1')  # features must already be numeric (encoded)
print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} ± {scores.std():.3f}")
5-fold CV splits data into 5 equal parts, trains on 4 and tests on 1, rotating which part is the test set. The mean score is more reliable than any single split.
Common scoring values: 'accuracy', 'f1', 'roc_auc', 'r2', 'neg_mean_absolute_error'.
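To score several of these metrics in one pass, cross_validate accepts a list instead of a single scoring string; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# one key per metric in the results dict: 'test_accuracy', 'test_f1', 'test_roc_auc'
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1', 'roc_auc'])
print(results['test_f1'].mean().round(3))
```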
Feature Importance
Tree-based models provide built-in feature importance scores — how much each feature contributed to the trees' split decisions (impurity reduction).
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_encoded, y_train)
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
importance_df.plot(kind='barh', x='feature', y='importance', legend=False)
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
Feature importance tells you which inputs the model relies on most, but does not tell you the direction of the effect. For that, use a coefficient plot (linear models) or SHAP values (any model).
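A model-agnostic alternative to the built-in scores is permutation importance from sklearn.inspection: shuffle one feature column at a time and measure how much the test score drops. A sketch on synthetic data (make_classification stands in for the parking features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, n_informative=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# shuffle each feature column n_repeats times and measure the drop in test accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean.round(3))
```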
Pipeline
A Pipeline chains preprocessing and model steps into one object. This prevents the most common ML mistake: fitting the scaler or encoder on the full dataset instead of only the training set.
numeric_features = ['duration_mins']
categorical_features = ['parking_type', 'station_code']
# preprocessing for each column type
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
])
# full pipeline: preprocessing + model
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
handle_unknown='ignore' prevents errors when the test set contains a category not seen during training.
Using a Pipeline means:
- pipeline.fit(X_train, y_train) fits all steps on training data only.
- pipeline.predict(X_test) applies the same transformations automatically.
- The entire workflow is one object — easy to save, load, and deploy.
Save and Load a Pipeline
import joblib
joblib.dump(pipeline, 'parking_model.pkl') # save
pipeline = joblib.load('parking_model.pkl') # load
Common Workflows
1. Compare Multiple Models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
scores = cross_val_score(model, X_encoded, y, cv=5, scoring='f1')
results[name] = {'mean': scores.mean(), 'std': scores.std()}
results_df = pd.DataFrame(results).T.sort_values('mean', ascending=False)
print(results_df.round(3))
2. Choosing a Model
| Model | Strengths | Weaknesses |
|---|---|---|
| Linear / Logistic Regression | Fast, interpretable, good baseline | Assumes linear relationships |
| Decision Tree | Interpretable, handles mixed types | Overfits without depth limit |
| Random Forest | Strong out-of-the-box, robust | Slower, less interpretable |
Always start with a simple model (linear/logistic regression) as a baseline before trying more complex ones. A simple model that performs nearly as well is almost always preferable.
3. Handling Imbalanced Classes
When one class is much rarer than the other (e.g., 95% non-fraud, 5% fraud), accuracy is misleading — a model that always predicts the majority class scores 95%.
# option 1: class_weight='balanced' — automatically adjusts for imbalance
model = RandomForestClassifier(class_weight='balanced', random_state=42)
# option 2: evaluate with F1 or AUC instead of accuracy
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
class_weight='balanced' penalises mistakes on the minority class more heavily during training.
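The effect is visible on synthetic imbalanced data (the 95/5 split below is invented): the weighted model predicts the rare class more often, trading some precision for recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# heavily imbalanced toy data: roughly 95% class 0
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

# the weighted model flags more positives, lifting recall on the rare class
n_plain = plain.predict(X_te).sum()
n_balanced = balanced.predict(X_te).sum()
print(recall_score(y_te, plain.predict(X_te)),
      recall_score(y_te, balanced.predict(X_te)))
```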
4. Baseline Check
Before building any model, establish a naive baseline:
from sklearn.dummy import DummyClassifier, DummyRegressor
# classification baseline: always predict the most common class
dummy = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(dummy, X, y, cv=5, scoring='f1')
print(f"Baseline F1: {scores.mean():.3f}")
# regression baseline: always predict the mean
dummy = DummyRegressor(strategy='mean')
scores = cross_val_score(dummy, X, y, cv=5, scoring='r2')
print(f"Baseline R²: {scores.mean():.3f}")
Your model must beat the baseline to be worth using. A model that barely outperforms "always predict the mean" is not useful.