Machine Learning Notebook
Machine learning in Python centres on one library: scikit-learn. It provides a consistent API across all models — fit, predict, score — so once you understand the workflow, switching between algorithms is straightforward.
This notebook covers the full workflow: data preparation → model training → evaluation → iteration. For the data manipulation that happens before this, see Pandas Notebook and NumPy Notebook. For choosing the right model and interpreting results, see ML Model Selection Guide.
All examples use the parking system dataset. Two tasks run throughout:
- Regression: predict parking amount from duration, station, and parking type.
- Classification: predict whether a session was paid by credit card.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
mean_absolute_error, mean_squared_error, r2_score,
accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt
The ML Workflow
Every ML project follows the same five steps:
1. Prepare data → select features, handle nulls, encode categoricals, scale
2. Split → train set / test set (never evaluate on training data)
3. Train → fit a model on the training set
4. Evaluate → measure performance on the test set
5. Iterate → tune, try different models, add features
The test set is set aside at the start and touched only once at the end. Using it during development inflates your performance estimates.
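The five steps can be sketched end-to-end in a few lines. This is a minimal illustration on synthetic stand-in data (the uniform durations and the 0.05 coefficient are invented, not from the parking dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 1. prepare: synthetic stand-in for the parking data (hypothetical values)
rng = np.random.default_rng(42)
X = rng.uniform(10, 300, size=(500, 1))            # duration in minutes
y = 2.0 + 0.05 * X[:, 0] + rng.normal(0, 1, 500)   # amount with noise

# 2. split: hold out 20% for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. train on the training set only
model = LinearRegression().fit(X_train, y_train)

# 4. evaluate once, on the held-out test set
r2 = r2_score(y_test, model.predict(X_test))

# 5. iterate: tune, swap models, add features — then re-check on the test set
```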
Data Preparation
Feature Selection
# regression: predict amount
features = ['duration_mins', 'station_code', 'parking_type']
target = 'amount'
df_model = parking_df[features + [target]].dropna()
X = df_model[features]
y = df_model[target]
# classification: predict credit card payment (binary)
df_model = parking_df[features + ['payment_method']].dropna()
X = df_model[features]
y = (df_model['payment_method'] == 'Credit_Card').astype(int) # 1 = credit card, 0 = other
Drop rows with nulls in the features or target before splitting — scikit-learn does not handle NaN by default.
Train / Test Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"Train: {len(X_train)} rows, Test: {len(X_test)} rows")
test_size=0.2 reserves 20% for evaluation. random_state=42 makes the split reproducible.
For classification with imbalanced classes, add stratify=y to preserve the class ratio in both sets:
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
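The effect of stratify is easy to verify on a small imbalanced example (the 90/10 labels below are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# imbalanced toy labels: 90% class 0, 10% class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# the 10% minority ratio is preserved exactly in both halves
print(y_tr.mean(), y_te.mean())  # 0.1 0.1
```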
Encoding Categorical Features
scikit-learn requires numeric input. Categorical columns must be encoded first.
OneHotEncoder — for nominal categories (no order: station_code, payment_method):
# produces one binary column per category
enc = OneHotEncoder(drop='first', sparse_output=False)
encoded = enc.fit_transform(X[['parking_type', 'station_code']])
drop='first' removes the first category's column for each feature to avoid multicollinearity.
LabelEncoder — for encoding a target column (for ordinal feature columns, use OrdinalEncoder instead):
le = LabelEncoder()
y_encoded = le.fit_transform(y_series) # 'hourly' → 0, 'monthly' → 1, ...
In practice, use ColumnTransformer inside a Pipeline (see below) rather than encoding manually.
Feature Scaling
Tree-based models (Decision Tree, Random Forest) do not need scaling. Linear models and distance-based models do.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_numeric)
X_test_scaled = scaler.transform(X_test_numeric) # use fit from training set only
Always fit the scaler on the training set and apply the same transformation to the test set. Fitting on the full dataset leaks test information into training.
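A quick check makes the fit/transform split concrete; the normal samples below are invented stand-ins for the numeric columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(50, 10, size=(80, 1))   # e.g. durations
X_test = rng.normal(50, 10, size=(20, 1))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # statistics come from train only
X_test_scaled = scaler.transform(X_test)         # same mean/std reused, never re-fit

# the training set is exactly standardised; the test set only approximately,
# because it is scaled with the *training* mean and std
print(X_train_scaled.mean(), X_test_scaled.mean())
```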
Regression
Predict a continuous numeric value (parking amount).
Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Inspect coefficients to understand feature contributions:
coef_df = pd.DataFrame({
'feature': X_train.columns,
'coefficient': model.coef_
}).sort_values('coefficient', ascending=False)
print(coef_df)
Decision Tree Regressor
model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
max_depth limits tree depth to prevent overfitting. Without it, the tree will memorise the training data.
Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Random Forest builds many trees on random subsets of data and averages their predictions — more robust than a single tree.
Regression Evaluation Metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # squared=False was removed in scikit-learn 1.6
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
print(f"R²: {r2:.3f}")
| Metric | What it measures | Units |
|---|---|---|
| MAE | Average absolute error | Same as target |
| RMSE | Average error, penalises large mistakes more | Same as target |
| R² | Proportion of variance explained | Unitless |
R² of 0.80 means the model explains 80% of the variance in the target. An R² of 1.0 is a perfect fit; 0.0 means the model does no better than predicting the mean, and negative values mean it does worse.
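The edge cases of R² can be verified directly (the four target values below are arbitrary):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])

# predicting the mean of y_true gives an R² of exactly 0
y_mean = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, y_mean)
print(r2_mean)   # 0.0

# a model worse than the mean gives a negative R²
r2_bad = r2_score(y_true, y_true[::-1])
print(r2_bad)    # -3.0
```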
Residual Plot
Always plot residuals — patterns in the residuals reveal model weaknesses:
residuals = y_test - y_pred
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(y_pred, residuals, alpha=0.3)
axes[0].axhline(0, color='red', linewidth=1)
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residuals vs Predicted')
axes[1].hist(residuals, bins=30)
axes[1].set_title('Residual Distribution')
plt.tight_layout()
plt.show()
Residuals should be randomly scattered around zero. A systematic pattern (curve, funnel shape) means the model is missing something.
Classification
Predict a category (credit card vs not).
Logistic Regression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1] # probability of class 1
max_iter=1000 prevents convergence warnings on larger datasets.
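predict() is just predict_proba thresholded at 0.5, so the probabilities let you trade precision for recall by moving the threshold. A sketch on synthetic binary data (make_classification stands in for the payment task):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# default .predict() is equivalent to thresholding the probability at 0.5
default = model.predict(X_te)
manual = (proba >= 0.5).astype(int)

# a lower threshold flags more sessions as positive: higher recall, lower precision
lenient = (proba >= 0.3).astype(int)
```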
Decision Tree Classifier
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Classification Evaluation Metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=['Other', 'Credit Card']))
classification_report prints precision, recall, and F1 for each class:
| Metric | Formula | Meaning |
|---|---|---|
| Accuracy | correct / total | Overall correctness |
| Precision | TP / (TP + FP) | Of predicted positives, how many are correct? |
| Recall | TP / (TP + FN) | Of actual positives, how many did we catch? |
| F1 | 2 × (P × R) / (P + R) | Harmonic mean of precision and recall |
When to prioritise recall over precision:
- Fraud detection — missing a fraud (false negative) is worse than a false alarm.
- Medical diagnosis — missing a disease is worse than an unnecessary follow-up.
When to prioritise precision:
- Spam filter — marking a legitimate email as spam is worse than missing spam.
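The formulas in the table can be checked by hand against scikit-learn on a tiny example (the labels below are invented, not from the parking data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]   # TP=2, FN=1, FP=1, TN=4

precision = 2 / (2 + 1)                              # TP / (TP + FP)
recall = 2 / (2 + 1)                                 # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(precision, recall, f1)   # all 2/3 here, matching sklearn's functions
```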
Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Other', 'Credit Card']).plot()
plt.title('Confusion Matrix')
plt.show()
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (TN) | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP) |
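For binary problems, ravel() unpacks the matrix into the four cells in the row order shown above (TN, FP, FN, TP); the labels here are a toy example:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# flatten the 2x2 matrix: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```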
Cross-Validation
A single train/test split can be lucky or unlucky depending on which rows land in the test set. Cross-validation averages performance across multiple splits for a more reliable estimate.
model = RandomForestClassifier(n_estimators=100, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_encoded, y, cv=cv, scoring='f1')  # features must already be numeric (encoded)
print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} ± {scores.std():.3f}")
5-fold CV splits data into 5 equal parts, trains on 4 and tests on 1, rotating which part is the test set. The mean score is more reliable than any single split.
Common scoring values: 'accuracy', 'f1', 'roc_auc', 'r2', 'neg_mean_absolute_error'.
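To score several of these metrics in one pass, cross_validate accepts a list instead of a single scoring string; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# one key per metric in the results dict: 'test_accuracy', 'test_f1', 'test_roc_auc'
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1', 'roc_auc'])
print(results['test_f1'].mean().round(3))
```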
Feature Importance
Tree-based models provide built-in feature importance scores — how much each feature contributed to the trees' split decisions (impurity reduction).
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_encoded, y_train)
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
importance_df.plot(kind='barh', x='feature', y='importance', legend=False)
plt.title('Feature Importance')
plt.tight_layout()
plt.show()
Feature importance tells you which inputs the model relies on most, but does not tell you the direction of the effect. For that, use a coefficient plot (linear models) or SHAP values (any model).
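A model-agnostic alternative to the built-in scores is permutation importance from sklearn.inspection: shuffle one feature column at a time and measure how much the test score drops. A sketch on synthetic data (make_classification stands in for the parking features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, n_informative=2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# shuffle each feature column n_repeats times and measure the drop in test accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean.round(3))
```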
Pipeline
A Pipeline chains preprocessing and model steps into one object. This prevents the most common ML mistake: fitting the scaler or encoder on the full dataset instead of only the training set.
numeric_features = ['duration_mins']
categorical_features = ['parking_type', 'station_code']
# preprocessing for each column type
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first', handle_unknown='ignore')
preprocessor = ColumnTransformer(transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features),
])
# full pipeline: preprocessing + model
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('model', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
handle_unknown='ignore' prevents errors when the test set contains a category not seen during training.
Using a Pipeline means:
- pipeline.fit(X_train, y_train) fits all steps on training data only.
- pipeline.predict(X_test) applies the same transformations automatically.
- The entire workflow is one object — easy to save, load, and deploy.
Save and Load a Pipeline
import joblib
joblib.dump(pipeline, 'parking_model.pkl') # save
pipeline = joblib.load('parking_model.pkl') # load
Common Workflows
1. Compare Multiple Models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
scores = cross_val_score(model, X_encoded, y, cv=5, scoring='f1')
results[name] = {'mean': scores.mean(), 'std': scores.std()}
results_df = pd.DataFrame(results).T.sort_values('mean', ascending=False)
print(results_df.round(3))
2. Choosing a Model
| Model | Strengths | Weaknesses |
|---|---|---|
| Linear / Logistic Regression | Fast, interpretable, good baseline | Assumes linear relationships |
| Decision Tree | Interpretable, handles mixed types | Overfits without depth limit |
| Random Forest | Strong out-of-the-box, robust | Slower, less interpretable |
Always start with a simple model (linear/logistic regression) as a baseline before trying more complex ones. A simple model that performs nearly as well is almost always preferable.
3. Handling Imbalanced Classes
When one class is much rarer than the other (e.g., 95% non-fraud, 5% fraud), accuracy is misleading — a model that always predicts the majority class scores 95%.
# option 1: class_weight='balanced' — automatically adjusts for imbalance
model = RandomForestClassifier(class_weight='balanced', random_state=42)
# option 2: evaluate with F1 or AUC instead of accuracy
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
class_weight='balanced' penalises mistakes on the minority class more heavily during training.
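The effect is visible on synthetic imbalanced data (the 95/5 split below is invented): the weighted model predicts the rare class more often, trading some precision for recall.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# heavily imbalanced toy data: roughly 95% class 0
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

# the weighted model flags more positives, lifting recall on the rare class
n_plain = plain.predict(X_te).sum()
n_balanced = balanced.predict(X_te).sum()
print(recall_score(y_te, plain.predict(X_te)),
      recall_score(y_te, balanced.predict(X_te)))
```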
4. Baseline Check
Before building any model, establish a naive baseline:
from sklearn.dummy import DummyClassifier, DummyRegressor
# classification baseline: always predict the most common class
dummy = DummyClassifier(strategy='most_frequent')
scores = cross_val_score(dummy, X, y, cv=5, scoring='f1')
print(f"Baseline F1: {scores.mean():.3f}")
# regression baseline: always predict the mean
dummy = DummyRegressor(strategy='mean')
scores = cross_val_score(dummy, X, y, cv=5, scoring='r2')
print(f"Baseline R²: {scores.mean():.3f}")
Your model must beat the baseline to be worth using. A model that barely outperforms "always predict the mean" is not useful.