Data Analyst Learning Path

2026-04-25

A data analyst's job is to turn raw data into decisions. The tools on this site form a complete stack for doing that work in Python and SQL. This guide explains where each tool fits in the workflow, the recommended reading order, and how the articles connect to each other.

The DA Workflow

Every analysis follows the same five stages, regardless of the tool or domain:

1. Extract      → pull data from a database or file
2. Clean        → fix types, handle nulls, remove duplicates
3. Analyse      → aggregate, filter, compute metrics
4. Visualise    → communicate findings with charts
5. Present      → build interactive dashboards for stakeholders
6. Model        → go beyond description to prediction

The articles on this site map directly onto these stages.

Stage 1 — Extract: SQL

SQL is the primary language for pulling data. Most business data lives in relational databases, and SQL is how you query it.

Start here: SQL Notebook Core syntax — SELECT, WHERE, GROUP BY, JOINs, CTEs, window functions. This is the foundation everything else builds on.

Next: SQL Patterns Reusable solutions to the problems you'll encounter repeatedly: de-duplication, period-over-period comparisons, cohort analysis, top-N per group, funnel analysis. Each pattern combines the SQL building blocks into a complete approach.

When you're comfortable: SQL Advanced Notebook Window frame clauses, recursive CTEs for date series and hierarchies, GROUPING SETS, views, temporary tables, and EXPLAIN for query performance. These unlock a class of problem that basic SQL cannot solve.

Stage 2 — Clean: Excel + NumPy + Pandas

Once data is extracted, it almost always needs cleaning before analysis. Python handles this layer, and Excel handles the lightweight version used in most workplaces.

Excel Notebook The most widely used data tool in Taiwan's workplace. Covers core functions (XLOOKUP, SUMIFS, IFS), pivot tables, date and text manipulation, and common DA workflows. Read this before NumPy and Pandas if you are new to data work — Excel's concepts (lookup, filter, aggregate, pivot) are the same operations that SQL and Pandas implement programmatically.

Start here: NumPy Notebook NumPy is the numerical foundation of the Python data stack. Pandas is built on top of it. Understanding arrays, vectorization, and broadcasting makes Pandas faster to learn and easier to debug.

Then: Pandas Notebook The primary tool for data cleaning and manipulation in Python. Covers reading files, filtering, handling nulls, merging tables, string and date operations, and ten common DA workflows — from data cleaning pipelines to period-over-period comparisons.

The Pandas workflows directly mirror the SQL patterns — if you know the SQL version, the Pandas translation will feel familiar.

Stage 3 — Analyse

Analysis is the core of the job. This is where SQL and Pandas overlap the most.

SQL is the right choice when: - The data lives in a database and you want results without loading it into Python. - The query involves multiple large tables — the database engine handles joins more efficiently. - You are writing a report that others will re-run directly.

Pandas is the right choice when: - You are working in a notebook and need to iterate quickly. - The analysis involves reshaping, rolling windows, or multi-step transformations. - The output feeds into a visualisation or model.

In practice, most analysts use both: SQL to extract and aggregate, Pandas to reshape and compute derived metrics.

Stage 4 — Visualise: Matplotlib + Seaborn

Charts turn numbers into insights. The two articles here cover complementary skills.

Syntax and code: Matplotlib / Seaborn Notebook Line charts, bar charts, histograms, box plots, scatter plots, heatmaps, subplots, and styling. Seaborn handles most charts with less code; Matplotlib gives fine-grained control. In practice, use both together.

Choosing the right chart: Visualization Selection Guide Knowing how to code a chart and knowing which chart to use are different skills. This guide covers when to use each chart type, when to avoid it, and the mistakes that come from choosing wrong. Read this before presenting findings to stakeholders.

Stage 5 — Present: PowerBI

Charts in Python are for analysis. PowerBI is for delivering findings to stakeholders — interactive dashboards they can filter and explore without writing code.

PowerBI Notebook covers three layers: - Data modeling — star schema, relationships, calculated columns vs measures. Getting this right makes everything else easier. - DAX syntax — the formula language for measures: CALCULATE, time intelligence, DIVIDE, VAR/RETURN, iterator functions. DAX is where most PowerBI analysts struggle; this is the reference to come back to. - Visualization selection — which visual for which question, and the mistakes that come from choosing wrong.

Stage 6 — Statistics: The Reasoning Layer

Statistics is what separates description from insight — it tells you whether a pattern is real or just noise.

Statistics Notebook covers: - Descriptive statistics — mean vs median, standard deviation, percentiles, skewness - Distributions — normal distribution, central limit theorem, outlier detection - Hypothesis testing — t-test, chi-square, Mann-Whitney, p-values, Type I/II errors - Correlation — Pearson, Spearman, correlation vs causation - A/B testing — sample size calculation, running the test, statistical vs practical significance - Common mistakes — p-hacking, Simpson's paradox, survivorship bias

Read this before the ML articles — statistical intuition is the foundation that makes ML results interpretable.

Stage 7 — Model: Machine Learning

Modelling goes beyond describing what happened to predicting what will happen.

Syntax and workflow: Machine Learning Notebook The end-to-end scikit-learn workflow: feature selection, train/test split, encoding, scaling, training, evaluation, cross-validation, feature importance, and Pipeline. Covers both regression (predict a number) and classification (predict a category).

Choosing the right model: ML Model Selection Guide When to use linear regression vs decision trees vs random forest vs XGBoost. How to read evaluation metrics correctly, diagnose overfitting, and run a structured model comparison. Read this alongside the notebook — the notebook shows how, the guide shows when and why.

How the Articles Connect

Excel Notebook ─────────────────────────────────────────────┐
                                                             │
SQL Notebook ──────────────────── SQL Patterns              │
     │                                  │                   │
     └──── SQL Advanced Notebook        │ (same patterns,   │
                                        │  Python side)     │
NumPy Notebook ──── Pandas Notebook ───┘ ◄─────────────────┘
                         │
                         ├──── Matplotlib Notebook ──── Visualization Guide
                         │
                         ├──── PowerBI Notebook
                         │
                         ├──── Statistics Notebook
                         │              │
                         └──── ML Notebook ──────────── ML Model Selection Guide

Each "notebook" article is a syntax reference. Each "guide" or "patterns" article teaches when and why — the judgment layer on top of the syntax.

← Back to Blog

Data Analyst Learning Path

The DA Workflow

Stage 1 — Extract: SQL

Stage 2 — Clean: Excel + NumPy + Pandas

Stage 3 — Analyse

Stage 4 — Visualise: Matplotlib + Seaborn

Stage 5 — Present: PowerBI

Stage 6 — Statistics: The Reasoning Layer

Stage 7 — Model: Machine Learning

Recommended Reading Order

If you are new to data analysis

If you already know SQL

If you already know Python

If you want to move into machine learning

How the Articles Connect