Data Analyst Learning Path

A data analyst turns raw data into decisions. The articles on this site cover the tools you need to do that work in Python and SQL. This guide shows where each tool fits, what to read first, and how the pieces connect.

The DA Workflow

Every analysis follows the same five stages, no matter the tool or the topic:

1. Extract      → pull data from a database or file
2. Clean        → fix types, handle nulls, remove duplicates
3. Analyse      → aggregate, filter, compute metrics
4. Visualise    → communicate findings with charts
5. Present      → build interactive dashboards for stakeholders
6. Model        → go beyond description to prediction

The articles on this site map onto these stages.


Stage 1 — Extract: SQL

SQL is the language you use to pull data from a database. Most business data lives in databases (Postgres, MySQL, SQL Server), and SQL is how you talk to them.

Start here: SQL Notebook The core syntax — SELECT, WHERE, GROUP BY, JOINs, CTEs (a temporary named result you can use inside your query), and window functions. Everything else builds on this.

Next: SQL Patterns Common problems you will see again and again: removing duplicate rows, comparing this month to last month, cohort analysis (grouping users by sign-up month and tracking them over time), top-N per group, and funnel analysis. Each pattern shows how to combine the basic SQL parts into a full answer.

When you are comfortable: SQL Advanced Notebook Window frame clauses, recursive CTEs (a CTE that calls itself, useful for date series and trees), GROUPING SETS, views, temporary tables, and EXPLAIN for checking query speed. These solve problems that basic SQL cannot.


Stage 2 — Clean: Excel + NumPy + Pandas

Data is rarely clean when you get it. You almost always need to fix the column types, fill in blanks, and drop duplicates first. Python handles this layer. Excel handles the lighter version that most offices use.

Excel Notebook The tool most Taiwan offices use every day. The article covers the core functions (XLOOKUP, SUMIFS, IFS), pivot tables, date and text work, and the DA workflows you will see at work. If you are new to data work, read this before NumPy and Pandas. The ideas — lookup, filter, aggregate, pivot — are the same. Excel does them with menus, while SQL and Pandas do them with code.

Start here: NumPy Notebook NumPy is the math base of the Python data world. Pandas is built on top of it. Once you understand arrays, vectorization (running one operation on many numbers at once), and broadcasting, Pandas is much easier to learn and debug.

Then: Pandas Notebook The main tool for cleaning and shaping data in Python. Covers reading files, filtering rows, handling nulls (missing values), joining tables, working with strings and dates, and ten common DA workflows — from cleaning pipelines to comparing this period vs the last one.

The Pandas workflows match the SQL patterns one-to-one. If you know the SQL version, the Pandas one will feel familiar.


Stage 3 — Analyse

Analysis is the heart of the job. SQL and Pandas overlap the most here.

SQL is the right choice when: - The data lives in a database and you want results without loading it into Python. - The query touches several large tables — the database engine joins them faster than Python can. - You are writing a report that other people will re-run later.

Pandas is the right choice when: - You are working in a notebook and need to try things quickly. - The work involves reshaping (turning rows into columns or back), rolling windows (moving averages), or several steps in a row. - The output goes into a chart or a model.

In real work, most analysts use both. SQL pulls and groups the data. Pandas reshapes it and adds the derived numbers.

Applied analysis — RFM: RFM Analysis Notebook The most practical way to group customers by behavior. RFM uses three numbers from each customer's transactions — Recency (how recently they bought), Frequency (how often they buy), and Monetary (how much they spend). The article covers two cases: when you only have the license plate (no membership system) and when you have a member ID (app users). Both cases include full SQL and Python code.


Stage 4 — Visualise: Matplotlib + Seaborn

Charts turn numbers into something people can understand. The two articles here cover two different skills.

Syntax and code: Matplotlib / Seaborn Notebook Line charts, bar charts, histograms, box plots, scatter plots, heatmaps, subplots, and styling. Seaborn handles most charts with less code. Matplotlib gives you fine-grained control. In real work, you use both together.

Choosing the right chart: Visualization Selection Guide Drawing a chart and choosing the right chart are two different skills. This guide covers when each chart type works, when it does not, and the mistakes you make when you pick the wrong one. Read this before you show your results to your team or your boss.


Stage 5 — Present: PowerBI

Python charts are for your own analysis. PowerBI is for sharing results with your team — interactive dashboards they can filter and explore on their own, no code needed.

PowerBI Notebook covers three layers: - Data modeling — star schema, relationships, calculated columns vs measures. If you get this wrong, everything else gets harder. - DAX syntax — the formula language for measures: CALCULATE, time intelligence, DIVIDE, VAR/RETURN, iterator functions. Most PowerBI users struggle with DAX. Use this article as the reference you come back to. - Visualization selection — which visual fits which question, and the mistakes you make when you pick the wrong one.


Stage 6 — Statistics: The Reasoning Layer

Statistics tells you whether a pattern in the data is real or just noise.

Statistics Notebook covers: - Descriptive statistics — mean vs median, standard deviation, percentiles, skewness - Distributions — normal distribution, central limit theorem, outlier detection - Hypothesis testing — checking if a difference between two groups is real. Covers t-test, chi-square, Mann-Whitney, p-values (the chance the result happened by luck), and Type I/II errors (false alarm vs missed signal) - Correlation — Pearson, Spearman, and the difference between correlation and causation - A/B testing — showing two versions to two groups and comparing the result. Covers sample size calculation, running the test, and the difference between statistical significance (the math says yes) and practical significance (it actually matters for the business) - Common mistakes — p-hacking, Simpson's paradox, survivorship bias

Read this before the ML articles. Without basic statistics, you cannot tell whether a model result is meaningful.


Stage 7 — Model: Machine Learning

Modeling moves you from describing what happened to predicting what happens next.

Syntax and workflow: Machine Learning Notebook The full scikit-learn workflow from start to finish: feature selection, train/test split, encoding, scaling, training, evaluation, cross-validation (testing the model on different slices of the data), feature importance, and Pipeline. Covers both regression (predict a number) and classification (predict a category).

Choosing the right model: ML Model Selection Guide When to use linear regression vs decision trees vs random forest vs XGBoost. How to read the evaluation metrics, spot overfitting (when the model memorizes the training data and fails on new data), and compare models in a structured way. Read this alongside the notebook. The notebook shows you how. The guide shows you when and why.


Recommended Reading Order

If you are new to data analysis

  1. Excel Notebook — start here if you have no DA experience
  2. SQL Notebook — learn to query data from databases
  3. NumPy Notebook — the Python math foundation
  4. Pandas Notebook — clean and shape data
  5. Matplotlib / Seaborn Notebook — turn results into charts
  6. Visualization Selection Guide — pick the right chart
  7. PowerBI Notebook — build dashboards for your team
  8. SQL Patterns — apply SQL to real problems
  9. RFM Analysis Notebook — your first end-to-end project
  10. Statistics Notebook — learn whether your findings are real
  11. Machine Learning Notebook — intro to prediction
  12. ML Model Selection Guide — pick the right model
  13. SQL Advanced Notebook — unlock advanced SQL

If you already know SQL

Start at step 2. The Pandas Notebook shows the SQL version next to each Pandas workflow, so the switch is easy.

If you already know Python

Start at step 1. The SQL Notebook and SQL Patterns cover what most Python analysts need when they start pulling data from databases.

If you want to move into machine learning

Get Pandas and Statistics solid first. Feature engineering (preparing the input data for the model) is 80% of ML work, and you need statistics to tell whether a model result is real. Then work through the ML articles in order: Notebook first, then Guide.


How the Articles Connect

Excel Notebook ─────────────────────────────────────────────┐
                                                             │
SQL Notebook ──────────────────── SQL Patterns              │
     │                                  │                   │
     └──── SQL Advanced Notebook        │ (same patterns,   │
                                        │  Python side)     │
NumPy Notebook ──── Pandas Notebook ───┘ ◄─────────────────┘
                         │
                         ├──── RFM Analysis Notebook  (SQL + Pandas applied)
                         │
                         ├──── Matplotlib Notebook ──── Visualization Guide
                         │
                         ├──── PowerBI Notebook
                         │
                         ├──── Statistics Notebook
                         │              │
                         └──── ML Notebook ──────────── ML Model Selection Guide

Each "notebook" article is a syntax reference — the how. Each "guide" or "patterns" article teaches when and why — the judgment that sits on top of the syntax.