Data Analyst Learning Path
A data analyst's job is to turn raw data into decisions. The tools on this site form a complete stack for doing that work in Python and SQL. This guide explains where each tool fits in the workflow, the recommended reading order, and how the articles connect to each other.
The DA Workflow
Every analysis follows the same seven stages, regardless of the tool or domain:
1. Extract → pull data from a database or file
2. Clean → fix types, handle nulls, remove duplicates
3. Analyse → aggregate, filter, compute metrics
4. Visualise → communicate findings with charts
5. Present → build interactive dashboards for stakeholders
6. Validate → use statistics to separate real patterns from noise
7. Model → go beyond description to prediction
The articles on this site map directly onto these stages.
Stage 1 — Extract: SQL
SQL is the primary language for pulling data. Most business data lives in relational databases, and SQL is how you query it.
Start here: SQL Notebook. Core syntax — SELECT, WHERE, GROUP BY, JOINs, CTEs, and window functions. This is the foundation everything else builds on.
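To make the building blocks concrete, here is a minimal sketch run through Python's built-in sqlite3 module — the table and figures are hypothetical, and any relational database would accept the same query:

```python
import sqlite3

# In-memory database with a small hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
INSERT INTO orders VALUES
  (1, 'alice', 120.0),
  (2, 'alice', 90.0),
  (3, 'bob',   200.0),
  (4, 'carol', 50.0);
""")

# SELECT / WHERE / GROUP BY: total spend per customer, small orders excluded.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 60
    GROUP BY customer
    ORDER BY total DESC
""").fetchall()
print(rows)  # [('alice', 210.0), ('bob', 200.0)]
```

The same SELECT–WHERE–GROUP BY shape recurs in almost every reporting query; JOINs, CTEs, and window functions extend it rather than replace it.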
Next: SQL Patterns. Reusable solutions to the problems you'll encounter repeatedly: de-duplication, period-over-period comparisons, cohort analysis, top-N per group, and funnel analysis. Each pattern combines the SQL building blocks into a complete approach.
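As one example of a pattern, de-duplication is usually done with ROW_NUMBER(). A hedged sketch via sqlite3, assuming a SQLite build with window-function support (3.25+); the events table is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id TEXT, ts TEXT, payload TEXT);
INSERT INTO events VALUES
  ('u1', '2024-01-01', 'a'),
  ('u1', '2024-01-03', 'b'),
  ('u2', '2024-01-02', 'c'),
  ('u2', '2024-01-02', 'c');  -- exact duplicate row
""")

# De-duplication pattern: keep only the latest row per user.
latest = conn.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY ts DESC
               ) AS rn
        FROM events
    )
    SELECT user_id, ts, payload FROM ranked WHERE rn = 1
    ORDER BY user_id
""").fetchall()
print(latest)  # one row per user: u1's latest, one copy of u2's duplicate
```

Swap the ORDER BY inside the window and the same skeleton becomes top-N per group.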
When you're comfortable: SQL Advanced Notebook. Window frame clauses, recursive CTEs for date series and hierarchies, GROUPING SETS, views, temporary tables, and EXPLAIN for query performance. These unlock a class of problems that basic SQL cannot solve.
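A taste of the recursive-CTE technique, again sketched through sqlite3 with an arbitrary date range — generating a continuous date series is the usual first step before filling gaps in daily metrics:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Recursive CTE: each iteration adds one day until the end date is reached.
dates = conn.execute("""
    WITH RECURSIVE days(d) AS (
        SELECT DATE('2024-01-01')
        UNION ALL
        SELECT DATE(d, '+1 day') FROM days WHERE d < '2024-01-05'
    )
    SELECT d FROM days
""").fetchall()
print([row[0] for row in dates])  # five consecutive dates
```

LEFT JOIN your metrics table onto this series and days with no rows show up as NULLs instead of silently disappearing.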
Stage 2 — Clean: Excel + NumPy + Pandas
Once data is extracted, it almost always needs cleaning before analysis. Python handles this layer, and Excel handles the lightweight version used in most workplaces.
Excel Notebook. The most widely used data tool in Taiwanese workplaces. Covers core functions (XLOOKUP, SUMIFS, IFS), pivot tables, date and text manipulation, and common DA workflows. Read this before NumPy and Pandas if you are new to data work — Excel's concepts (lookup, filter, aggregate, pivot) are the same operations that SQL and Pandas implement programmatically.
Start here: NumPy Notebook. NumPy is the numerical foundation of the Python data stack — Pandas is built on top of it. Understanding arrays, vectorization, and broadcasting makes Pandas faster to learn and easier to debug.
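Vectorization and broadcasting in one small sketch, assuming NumPy is installed (the figures are made up):

```python
import numpy as np

# Vectorization: one array expression replaces an explicit Python loop.
prices = np.array([100.0, 250.0, 80.0])
qty = np.array([3, 1, 10])
revenue = prices * qty            # elementwise multiply, no loop

# Broadcasting: the scalar is "stretched" to match the array's shape.
discounted = revenue * (1 - 0.05)  # 5% discount applied to every element
print(revenue.sum())               # 1350.0
```

The same broadcasting rules apply between arrays of different shapes (e.g. a row vector against a column vector), which is where most NumPy confusion — and most of its power — lives.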
Then: Pandas Notebook. The primary tool for data cleaning and manipulation in Python. Covers reading files, filtering, handling nulls, merging tables, string and date operations, and ten common DA workflows — from data cleaning pipelines to period-over-period comparisons.
The Pandas workflows directly mirror the SQL patterns — if you know the SQL version, the Pandas translation will feel familiar.
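To illustrate the mirroring, a minimal cleaning-plus-aggregation sketch, assuming Pandas is installed; the SQL equivalent of each step is noted in the comments, and the data is invented:

```python
import pandas as pd

# Hypothetical raw extract: nulls and duplicates, as real data arrives.
df = pd.DataFrame({
    "customer": ["alice", "alice", "bob", "bob", None],
    "amount": [120.0, 120.0, 200.0, None, 50.0],
})

clean = (
    df.drop_duplicates()             # SQL: SELECT DISTINCT
      .dropna(subset=["customer"])   # SQL: WHERE customer IS NOT NULL
)
totals = clean.groupby("customer")["amount"].sum()  # SQL: GROUP BY customer
print(totals.to_dict())  # {'alice': 120.0, 'bob': 200.0}
```

Note that `sum()` skips NaN by default — the same behaviour as SQL's SUM ignoring NULLs, which is exactly why the translation feels direct.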
Stage 3 — Analyse
Analysis is the core of the job. This is where SQL and Pandas overlap the most.
SQL is the right choice when:
- The data lives in a database and you want results without loading it into Python.
- The query involves multiple large tables — the database engine handles joins more efficiently.
- You are writing a report that others will re-run directly.
Pandas is the right choice when:
- You are working in a notebook and need to iterate quickly.
- The analysis involves reshaping, rolling windows, or multi-step transformations.
- The output feeds into a visualisation or model.
In practice, most analysts use both: SQL to extract and aggregate, Pandas to reshape and compute derived metrics.
Stage 4 — Visualise: Matplotlib + Seaborn
Charts turn numbers into insights. The two articles here cover complementary skills.
Syntax and code: Matplotlib / Seaborn Notebook. Line charts, bar charts, histograms, box plots, scatter plots, heatmaps, subplots, and styling. Seaborn handles most charts with less code; Matplotlib gives fine-grained control. In practice, use both together.
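A minimal Matplotlib sketch of the figure–axes workflow, assuming Matplotlib is installed; the revenue figures are hypothetical and the Agg backend is selected so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; no display needed
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 160]  # hypothetical figures, in thousands

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, revenue, color="steelblue")
ax.set_title("Monthly revenue")
ax.set_ylabel("Revenue (k)")
fig.savefig("monthly_revenue.png", dpi=150)
```

Working through the `ax` object (rather than the `plt.bar(...)` shortcuts) is what gives the fine-grained control mentioned above, and it composes cleanly with subplots.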
Choosing the right chart: Visualization Selection Guide. Knowing how to code a chart and knowing which chart to use are different skills. This guide covers when to use each chart type, when to avoid it, and the mistakes that come from choosing wrong. Read this before presenting findings to stakeholders.
Stage 5 — Present: PowerBI
Charts in Python are for analysis. PowerBI is for delivering findings to stakeholders — interactive dashboards they can filter and explore without writing code.
PowerBI Notebook covers three layers:
- Data modeling — star schema, relationships, calculated columns vs measures. Getting this right makes everything else easier.
- DAX syntax — the formula language for measures: CALCULATE, time intelligence, DIVIDE, VAR/RETURN, iterator functions. DAX is where most PowerBI analysts struggle; this is the reference to come back to.
- Visualization selection — which visual for which question, and the mistakes that come from choosing wrong.
Stage 6 — Statistics: The Reasoning Layer
Statistics is what separates description from insight — it tells you whether a pattern is real or just noise.
Statistics Notebook covers:
- Descriptive statistics — mean vs median, standard deviation, percentiles, skewness
- Distributions — normal distribution, central limit theorem, outlier detection
- Hypothesis testing — t-test, chi-square, Mann-Whitney, p-values, Type I/II errors
- Correlation — Pearson, Spearman, correlation vs causation
- A/B testing — sample size calculation, running the test, statistical vs practical significance
- Common mistakes — p-hacking, Simpson's paradox, survivorship bias
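As a taste of the hypothesis-testing layer, here is a stdlib-only sketch of Welch's t statistic on two invented samples (in practice you would reach for `scipy.stats.ttest_ind` with `equal_var=False` to also get the p-value):

```python
from statistics import mean, variance
from math import sqrt

# Two hypothetical samples, e.g. load times for variants A and B.
a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

# Welch's t statistic: difference in means divided by the combined
# standard error; it does not assume the two variances are equal.
se = sqrt(variance(a) / len(a) + variance(b) / len(b))
t = (mean(a) - mean(b)) / se
print(round(t, 2))  # a large |t| suggests the difference is not noise
```

A |t| this large with 6 observations per group would be well past any conventional significance threshold — the statistic quantifies what eyeballing the two samples only suggests.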
Read this before the ML articles — statistical intuition is the foundation that makes ML results interpretable.
Stage 7 — Model: Machine Learning
Modelling goes beyond describing what happened to predicting what will happen.
Syntax and workflow: Machine Learning Notebook. The end-to-end scikit-learn workflow: feature selection, train/test split, encoding, scaling, training, evaluation, cross-validation, feature importance, and Pipeline. Covers both regression (predict a number) and classification (predict a category).
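The split–scale–train–evaluate loop can be sketched as follows, assuming scikit-learn is installed; synthetic data stands in for a real feature table:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic classification data standing in for a real feature table.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Pipeline: the scaler is fit on the training split only, so no
# information from the test set leaks into preprocessing.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

Swapping the final step for a tree ensemble or a regressor keeps the rest of the pipeline unchanged — which is exactly why the Pipeline pattern is worth learning first.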
Choosing the right model: ML Model Selection Guide. When to use linear regression vs decision trees vs random forest vs XGBoost. How to read evaluation metrics correctly, diagnose overfitting, and run a structured model comparison. Read this alongside the notebook — the notebook shows how, the guide shows when and why.
Recommended Reading Order
If you are new to data analysis
0. Excel Notebook — start here if you have no prior DA experience
1. SQL Notebook — learn to query data from databases
2. NumPy Notebook — Python numerical foundation
3. Pandas Notebook — data cleaning and manipulation
4. Matplotlib / Seaborn Notebook — visualise your findings
5. Visualization Selection Guide — learn to choose the right chart
6. PowerBI Notebook — build dashboards for stakeholders
7. SQL Patterns — apply SQL to real problems
8. Statistics Notebook — understand whether your findings are real
9. Machine Learning Notebook — intro to prediction
10. ML Model Selection Guide — choose models correctly
11. SQL Advanced Notebook — unlock advanced SQL
If you already know SQL
Start at step 2, the NumPy Notebook. The Pandas workflows in the Pandas Notebook include SQL equivalents for each operation, so the transition is direct.
If you already know Python
Start at step 1, the SQL Notebook. Together with SQL Patterns, it covers the patterns most Python analysts need when they move to querying databases directly.
If you want to move into machine learning
Make sure both Pandas and Statistics are solid first — feature engineering is 80% of ML work, and evaluating whether a model result is meaningful requires statistical intuition. Work through the ML articles in order: Notebook before Guide.
How the Articles Connect
Excel Notebook ─────────────────────────────────────────────┐
│
SQL Notebook ──────────────────── SQL Patterns │
│ │ │
└──── SQL Advanced Notebook │ (same patterns, │
│ Python side) │
NumPy Notebook ──── Pandas Notebook ───┘ ◄─────────────────┘
│
├──── Matplotlib Notebook ──── Visualization Guide
│
├──── PowerBI Notebook
│
├──── Statistics Notebook
│ │
└──── ML Notebook ──────────── ML Model Selection Guide
Each "notebook" article is a syntax reference. Each "guide" or "patterns" article teaches when and why — the judgment layer on top of the syntax.