Visualization Selection Guide

Picking the right chart is a separate skill from knowing how to code it. This guide focuses on the when — which chart fits your data and your question — and the mistakes that come from choosing wrong.

For the syntax and code, see Matplotlib / Seaborn Notebook. All examples use the same parking dataset.

Chart Selection Table

Use this table to find the right chart before reading the full section.

Your data and question                                         Chart
-------------------------------------------------------------  --------------------------
How does a metric change over time?                            1. Line Chart
How do categories compare on one metric?                       2. Bar Chart
What share does each category contribute to a total?           3. Stacked Bar / Pie
What does the distribution of a single variable look like?     4. Histogram
How does a distribution differ across groups?                  5. Box Plot
Is there a relationship between two numeric variables?         6. Scatter Plot
How do values vary across a two-dimensional grid?              7. Heatmap
How do multiple metrics move together over time?               Combine: Line + dual axis
How does a station compare on several dimensions at once?      Combine: Small multiples

When no single chart tells the whole story, see Dashboard Composition at the end.


1. Line Chart — Trend Over Time

Use when: your x-axis is time and you want to show how a metric changes.

ax.plot(monthly['month_str'], monthly['revenue'], marker='o')

Avoid when:

  • You have fewer than 4 time points. Use a bar chart instead; a line implies continuity that doesn't exist.
  • The x-axis is categorical with no natural order (station names, payment methods). A bar chart is clearer.

What to watch for

Add a rolling average overlay when the raw line is too noisy to read a trend:

monthly['rolling_3m'] = monthly['revenue'].rolling(3, center=True).mean()
ax.plot(monthly['month_str'], monthly['rolling_3m'], linestyle='--', label='3M Avg')

The rolling average should complement the raw line, not replace it. Show both.
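
A minimal sketch of how the centered window behaves, using invented monthly figures rather than the parking dataset. With center=True the first and last months have no complete 3-month window, so the overlay is shorter than the raw line unless you allow partial windows (the rolling_3m_filled column is a hypothetical addition for illustration):

```python
import pandas as pd

# Hypothetical monthly revenue; values are invented for illustration.
monthly = pd.DataFrame({
    "month_str": ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05", "2024-06"],
    "revenue": [100, 130, 90, 140, 110, 150],
})

# Centered 3-month window: each value averages the previous, current, and next month.
monthly["rolling_3m"] = monthly["revenue"].rolling(3, center=True).mean()

# min_periods=1 fills the edges with shorter windows, at the cost of noisier endpoints.
monthly["rolling_3m_filled"] = (
    monthly["revenue"].rolling(3, center=True, min_periods=1).mean()
)

print(monthly)
```

The gaps at both ends are one more reason to keep the raw line on the chart: the smoothed series alone would silently drop the most recent month.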

Common mistake: truncated y-axis

Starting the y-axis at a non-zero value visually exaggerates small changes. Start at 0 unless the variation is small relative to the absolute value (e.g., stock prices), where a zero baseline would flatten the line and hide the trend.

ax.set_ylim(bottom=0)   # enforce zero baseline

2. Bar Chart — Categorical Comparison

Use when: comparing a numeric metric across discrete categories (stations, payment methods, parking types).

sns.barplot(data=station_rev, x='station_code', y='amount', ax=ax)

Avoid when:

  • You have more than ~12 categories. The chart becomes unreadable; filter to top N, or switch to a table.
  • You want to show change over time with many time points. Use a line chart.

Vertical vs. Horizontal

Use horizontal bars when category labels are long — they're easier to read than rotated x-axis labels.

sns.barplot(data=station_rev, y='station_code', x='amount', orient='h', ax=ax)

Sort your bars

Unsorted bars make comparison harder. Always sort by the metric unless the category has a natural order (months, age groups).

station_rev = station_rev.sort_values('amount', ascending=False)

Add value labels for precision

Bar height gives only a rough comparison. When a report needs exact numbers, add value labels.

for container in ax.containers:
    ax.bar_label(container, fmt='%.0f', padding=3)
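
Sorting, horizontal orientation, and value labels can be combined in one chart. The sketch below uses plain matplotlib with invented station totals (the station codes and amounts are assumptions, not values from the parking dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-station totals; station codes and amounts are invented.
station_rev = pd.DataFrame({
    "station_code": ["S03", "S01", "S04", "S02"],
    "amount": [41000, 98000, 12500, 67000],
})

# Sort ascending for barh: matplotlib draws the first row at the bottom,
# so the largest bar ends up on top.
ordered = station_rev.sort_values("amount", ascending=True)

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.barh(ordered["station_code"], ordered["amount"])
labels = ax.bar_label(bars, fmt="%.0f", padding=3)  # requires matplotlib 3.4+
ax.set_xlabel("Revenue (NTD)")
ax.set_xlim(left=0)
fig.tight_layout()
```

Note the reversed sort direction compared with a vertical chart: for horizontal bars, ascending order puts the largest value at the top, where readers look first.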

3. Stacked Bar & Pie — Composition

Use these when you care about what share each part contributes to a whole.

Stacked Bar — Composition That Changes Across Categories

Use when: you want to show both the total and the breakdown by sub-category, across multiple groups.

pivot = parking_df.pivot_table(
    values='amount', index='station_code',
    columns='payment_method', aggfunc='sum', fill_value=0
)
pivot.plot(kind='bar', stacked=True)

Avoid when: you need to compare a specific sub-category across groups. The floating baseline makes it hard to judge anything except the bottom segment and the total.

If you need per-segment comparison, use a grouped bar or separate small charts instead.
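
To see the difference, the sketch below draws the same pivot twice, once stacked and once grouped. The data is invented; only the column names follow the guide:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical transactions; column names follow the guide, values are invented.
parking_df = pd.DataFrame({
    "station_code": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "payment_method": ["cash", "card", "app", "cash", "card", "app", "cash", "card"],
    "amount": [300, 500, 200, 100, 700, 400, 250, 150],
})

pivot = parking_df.pivot_table(
    values="amount", index="station_code",
    columns="payment_method", aggfunc="sum", fill_value=0,
)

fig, (ax_stacked, ax_grouped) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Stacked: totals are easy to read, but upper segments sit on different baselines.
pivot.plot(kind="bar", stacked=True, ax=ax_stacked, title="Stacked")

# Grouped: every bar starts at zero, so one payment method can be compared across stations.
pivot.plot(kind="bar", stacked=False, ax=ax_grouped, title="Grouped")

for ax in (ax_stacked, ax_grouped):
    ax.set_ylabel("Revenue (NTD)")
fig.tight_layout()
```

The grouped version gives up the at-a-glance total in exchange for a common baseline, so pick whichever matches the question being asked.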

Pie / Donut Chart

Use when:

  • There are 5 or fewer categories.
  • You want to emphasize one dominant segment ("Station A contributes 60% of revenue").
  • The audience cares about part-of-whole, not exact values.

Avoid when:

  • You have more than 5 categories. Too many slices become unreadable.
  • You need to compare values. Humans are poor at judging angles; a bar chart is almost always more accurate.
  • You need to show change over time.

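Situation                             Stacked Bar                      Pie
------------------------------------  -------------------------------  ---
Multiple groups with sub-categories   Yes                              No
Single snapshot of composition        Yes                              Yes
≤ 5 categories                        Yes                              Yes
Exact value comparison                Bottom segment and total only    No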

4. Histogram — Single Variable Distribution

Use when: you want to understand how a single numeric variable is distributed — its shape, center, spread, and whether outliers exist.

sns.histplot(parking_df['amount'], bins=30, kde=True, ax=ax)

Avoid raw counts when: you want to compare two groups with very different sizes. The larger group dominates the count scale, so the comparison misleads. Use stat='density' or stat='probability' to normalize.

sns.histplot(data=parking_df, x='amount', hue='parking_type',
             stat='density', common_norm=False, bins=30, kde=True)

common_norm=False normalizes each group independently so the shapes are comparable regardless of group size.
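
The effect of per-group normalization can be checked numerically without drawing anything. This sketch uses NumPy's histogram on two invented groups; it illustrates the idea behind per-group density, not seaborn's internal implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical groups with the same shape but very different sizes.
big_group = rng.normal(loc=60, scale=15, size=10_000)
small_group = rng.normal(loc=60, scale=15, size=500)

bins = np.linspace(0, 120, 31)  # 30 shared bins, so the bars line up

# Raw counts: the small group is dwarfed by the large one.
big_counts, _ = np.histogram(big_group, bins=bins)
small_counts, _ = np.histogram(small_group, bins=bins)

# Density per group: each histogram integrates to 1 over the binned range.
big_density, _ = np.histogram(big_group, bins=bins, density=True)
small_density, _ = np.histogram(small_group, bins=bins, density=True)

bin_width = np.diff(bins)
print(big_counts.max(), small_counts.max())
print((big_density * bin_width).sum(), (small_density * bin_width).sum())
```

On the count scale the tallest bars differ by roughly the ratio of the group sizes; on the density scale both histograms cover the same area, so only differences in shape remain visible.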

Choosing bin count

Too few bins hide shape; too many create noise. bins=30 is a reasonable default for most business data. If the distribution is bimodal (two humps), check whether you should split the data by a categorical variable first.
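
If a fixed count feels arbitrary, data-driven bin rules are one alternative. The sketch below compares a few of the rules NumPy implements via np.histogram_bin_edges on invented right-skewed amounts; according to the seaborn documentation, histplot's bins argument also accepts such rule names, but treat that as something to verify for your installed version:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical right-skewed payment amounts (invented, not the parking dataset).
amounts = rng.lognormal(mean=4.0, sigma=0.6, size=2_000)

# NumPy implements several standard bin-width rules by name.
rule_bins = {}
for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(amounts, bins=rule)
    rule_bins[rule] = len(edges) - 1
    print(f"{rule:>8}: {rule_bins[rule]} bins")

# A fixed count for comparison with the guide's default.
fixed_edges = np.histogram_bin_edges(amounts, bins=30)
```

The rules can disagree substantially on skewed data, so it is worth trying two or three settings before concluding that a bump or gap in the histogram is real.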

What to look for

  • Right skew (long tail to the right): most parking amounts are small with occasional large ones — median is more meaningful than mean.
  • Outliers: gaps in the tail suggest data quality issues or a distinct sub-population.
  • Bimodal: two peaks often signal two different behaviors (e.g., short-term vs. monthly parkers mixed together).
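
The first point can be confirmed with summary statistics alongside the plot. A small sketch on invented right-skewed amounts (a lognormal sample is an assumption chosen to mimic many small payments with a long tail):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical amounts: many small payments plus a long right tail.
amount = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=5_000), name="amount")

summary = {
    "mean": amount.mean(),
    "median": amount.median(),
    "skew": amount.skew(),  # > 0 indicates a longer right tail
}
print(summary)
```

When the mean sits well above the median like this, a few large payments are pulling the average up, and the median is the better description of a typical session.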

5. Box Plot — Distribution Across Groups

Use when: comparing the distribution of a numeric variable across 3 or more categories — especially when you want to see median, spread, and outliers simultaneously.

sns.boxplot(data=parking_df, x='parking_type', y='amount', ax=ax)

Avoid when:

  • You have only 2 groups. A histogram with hue or a simple comparison of means is clearer.
  • Your audience is non-technical. Box plots require explanation (what is an IQR?); a bar chart of means with error bars is more accessible.

Reading a box plot

  • Box = interquartile range (IQR), the middle 50% of data.
  • Line inside box = median.
  • Whiskers = extend to the most extreme data points within 1.5× IQR of the box edges.
  • Dots beyond whiskers = outliers.

The median tells you more than the mean when the data is skewed — a box plot makes this visible.
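
The pieces of a box plot can be computed by hand, which helps when explaining one to a reader. A minimal sketch on a small invented sample, using the common 1.5 × IQR convention (NumPy's default percentile interpolation is assumed):

```python
import numpy as np

# Small invented sample with one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 x IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations that are still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Here the box spans 3.25 to 7.75, the upper whisker stops at 9 rather than at the fence (14.5), and 100 is the only point drawn as an outlier. The mean of this sample is 14.5, far from the median of 5.5, which is the skew the box plot makes visible.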

Overlay individual points for small datasets

When n < 50 per group, the box plot summary can mislead. Show the individual points:

sns.boxplot(data=parking_df, x='parking_type', y='amount', ax=ax)
sns.stripplot(data=parking_df, x='parking_type', y='amount',
              color='black', alpha=0.4, size=2, ax=ax)

6. Scatter Plot — Relationship Between Two Numeric Variables

Use when: investigating whether two numeric variables are correlated — e.g., does parking duration predict payment amount?

sns.scatterplot(data=parking_df, x='duration_mins', y='amount',
                hue='parking_type', alpha=0.5)

Avoid when:

  • One axis is categorical. Use a box plot or bar chart.
  • You have millions of points. Overplotting makes patterns invisible; subsample, use hexbin, or a 2D density plot instead.
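
Two of the overplotting remedies can be sketched side by side with plain matplotlib. The data below is synthetic (the relationship between duration and amount is invented), and big_df is a hypothetical name:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200_000
# Hypothetical sessions: amount grows with duration, plus noise.
duration = rng.gamma(shape=2.0, scale=45.0, size=n)
big_df = pd.DataFrame({
    "duration_mins": duration,
    "amount": 20 + 0.5 * duration + rng.normal(0, 15, size=n),
})

fig, (ax_sample, ax_hex) = plt.subplots(1, 2, figsize=(11, 4))

# Option 1: plot a random subsample; random_state keeps the picture reproducible.
sample = big_df.sample(n=5_000, random_state=0)
ax_sample.scatter(sample["duration_mins"], sample["amount"], s=4, alpha=0.3)
ax_sample.set_title("Random subsample (5,000 points)")

# Option 2: bin all points into hexagons and color by count.
hb = ax_hex.hexbin(big_df["duration_mins"], big_df["amount"], gridsize=40, cmap="Blues")
fig.colorbar(hb, ax=ax_hex, label="Sessions per cell")
ax_hex.set_title("Hexbin (all points)")

for ax in (ax_sample, ax_hex):
    ax.set_xlabel("Duration (minutes)")
    ax.set_ylabel("Amount (NTD)")
fig.tight_layout()
```

Subsampling keeps individual points visible but can miss rare outliers; hexbin uses every row but hides individual observations. Which trade-off is acceptable depends on whether the outliers are part of the story.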

Add a regression line to confirm direction

sns.regplot(data=parking_df, x='duration_mins', y='amount',
            scatter_kws={'alpha': 0.3})

The shaded band is a confidence interval for the fitted line (95% by default). A wide band means the estimate is uncertain, typically because the data are noisy or the sample is small.

Correlation ≠ Causation

A scatter plot shows association, not cause. If duration and amount are correlated, the most likely explanation is simply that pricing is time-based, which is a property of the tariff rather than a discovery.

Common mistake: ignoring outliers

A few outliers can make a real correlation invisible, or create an apparent one that doesn't exist. Always check whether the pattern holds after removing extreme values.
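
One way to run that check is to compute the correlation twice, with and without the extremes. The sketch below uses invented data in which x and y are unrelated except for a single far-away point; the 1st to 99th percentile trim is an arbitrary illustrative choice, not a recommended threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# 50 hypothetical points with no real relationship between x and y...
df = pd.DataFrame({"x": rng.normal(size=50), "y": rng.normal(size=50)})
# ...plus a single extreme point far from the rest.
with_outlier = pd.concat(
    [df, pd.DataFrame({"x": [100.0], "y": [100.0]})], ignore_index=True
)

r_all = with_outlier["x"].corr(with_outlier["y"])

# Drop rows outside the 1st-99th percentile on either axis, then recompute.
lo, hi = with_outlier.quantile(0.01), with_outlier.quantile(0.99)
inside = ((with_outlier >= lo) & (with_outlier <= hi)).all(axis=1)
r_trimmed = with_outlier.loc[inside, "x"].corr(with_outlier.loc[inside, "y"])

print(f"r with outlier: {r_all:.2f}, r after trimming: {r_trimmed:.2f}")
```

A correlation that collapses after trimming was being carried by a handful of points. Report both numbers in that case, and investigate the extreme rows before deleting them.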


7. Heatmap — Values Across a Two-Dimensional Grid

Use when: you have a matrix of values — a correlation table, or a pivot table with two categorical dimensions.

# correlation heatmap
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)

# pivot heatmap: revenue by station × month
sns.heatmap(pivot, annot=True, fmt='.0f', cmap='YlOrRd')

Avoid when:

  • You have fewer than ~4 rows or columns. A table or grouped bar chart is cleaner.
  • You need to compare exact values. Color is harder to read precisely than bar length; add annot=True to print the values if precision matters.

Choosing a color palette

Palette     Use case
----------  ----------------------------------------------------------------------
'coolwarm'  Diverging values (correlation: -1 to +1, growth: negative to positive)
'YlOrRd'    Sequential, one direction (revenue, volume; higher is more intense)
'Blues'     Sequential single color, less visually loud

For diverging palettes, always set vmin and vmax symmetrically around the neutral value (or pass center=0) so that zero consistently maps to the palette's neutral midpoint color:

sns.heatmap(corr, cmap='coolwarm', vmin=-1, vmax=1)
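
Correlations have a known range, but for data such as growth rates the limits have to be derived. A minimal sketch with plain matplotlib and invented growth figures; the same vmin and vmax values could be passed to sns.heatmap:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical month-over-month growth (%) for 3 stations x 4 months; values invented.
growth = np.array([
    [ 2.0, -1.5,  4.0,  0.5],
    [-3.0,  0.0,  1.0,  6.5],
    [ 0.8, -0.4, -2.2,  1.9],
])

# Make the limits symmetric around zero so the neutral midpoint means "no change".
limit = np.abs(growth).max()
vmin, vmax = -limit, limit

fig, ax = plt.subplots(figsize=(5, 3))
im = ax.imshow(growth, cmap="coolwarm", vmin=vmin, vmax=vmax)
fig.colorbar(im, ax=ax, label="Growth (%)")
```

Left to the data's own range (-3.0 to 6.5), zero would land about a third of the way along the palette and small positive growth would be drawn in blue.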

Pivot heatmap vs. line chart for trend data

A pivot heatmap (station × month) lets you scan two dimensions at once — which stations are growing and which months are peaks. A line chart per station is better when you care about the exact shape of the trend. Use both: the heatmap for overview, line charts for follow-up.


Common Mistakes

1. Using a Pie Chart with Too Many Slices

# AVOID: 8 slices, where the angles become indistinguishable
df['payment_method'].value_counts().plot(kind='pie')

# BETTER: bar chart, or keep the top 4 and collapse the rest into "Other" (5 slices)
counts = df['payment_method'].value_counts()
top = counts.head(4).copy()
top['Other'] = counts.iloc[4:].sum()
top.plot(kind='pie')

2. Omitting a Zero Baseline on Bar Charts

# AVOID: y-axis starts at 80000, making a small difference look huge
ax.set_ylim(80000, 120000)

# CORRECT: bar charts should always start at 0
ax.set_ylim(bottom=0)

Line charts may legitimately use a non-zero baseline when the variation is small relative to the absolute value, but bar charts never should — bar length encodes the value.
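
The size of the distortion is simple arithmetic. The sketch below uses a hypothetical helper, apparent_ratio, and invented revenue figures to compare how much taller one bar is drawn than another under each baseline:

```python
def apparent_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar lengths for values a and b when the axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)


station_a, station_b = 100_000, 110_000  # invented revenue figures

true_ratio = apparent_ratio(station_a, station_b)                        # axis starts at 0
truncated_ratio = apparent_ratio(station_a, station_b, baseline=80_000)  # axis starts at 80,000

print(f"true: {true_ratio:.2f}x, drawn with truncated axis: {truncated_ratio:.2f}x")
# prints "true: 1.10x, drawn with truncated axis: 1.50x"
```

A 10% difference is drawn as a 50% difference in bar length once the axis starts at 80,000.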

3. Plotting Means Without Showing Spread

A bar chart of means hides whether the groups have similar distributions or wildly different ones.

# Bar of means (seaborn adds a 95% CI bar by default): says little about each distribution's shape
sns.barplot(data=df, x='parking_type', y='amount')

# Better: box plot, or bar + stripplot overlay
sns.boxplot(data=df, x='parking_type', y='amount')

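If you must use a bar chart of means, at minimum show the spread. Seaborn's default error bar is a 95% confidence interval of the mean; errorbar='sd' (seaborn 0.12 or later) shows the standard deviation instead: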

sns.barplot(data=df, x='parking_type', y='amount', errorbar='sd')

4. Using Color to Encode the Same Variable Twice

# AVOID: x-axis already encodes station_code — color adds no information
sns.barplot(data=station_rev, x='station_code', y='amount',
            hue='station_code')   # redundant

# Use hue only when it encodes a DIFFERENT variable
sns.barplot(data=df, x='station_code', y='amount',
            hue='parking_type')   # hue = a second dimension

5. Not Labeling Axes or Units

# Always set axis labels with units
ax.set_xlabel('Month')
ax.set_ylabel('Revenue (NTD)')
ax.set_title('Monthly Revenue by Station')

A chart without units is incomplete. "Revenue" alone doesn't say NTD, USD, or thousands.
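
When values are large, scaling the ticks and stating the scale in the label keeps the axis readable. A minimal sketch with invented revenue figures; the thousands helper is a hypothetical name:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [1_250_000, 1_410_000, 1_180_000, 1_560_000]  # invented figures, in NTD


def thousands(value, pos=None):
    """Format a tick value given in NTD as thousands, e.g. 1250000 -> '1,250'."""
    return f"{value / 1000:,.0f}"


fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, revenue, marker="o")
ax.yaxis.set_major_formatter(FuncFormatter(thousands))
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (thousand NTD)")  # the unit in the label matches the tick scaling
ax.set_title("Monthly Revenue")
ax.set_ylim(bottom=0)
fig.tight_layout()
```

Only the tick text is rescaled; the plotted data stays in NTD, so the label has to carry the "thousand" for the axis to be read correctly.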

6. Comparing Groups with Very Different Sizes Using Raw Counts

# AVOID: Station A has 10x more sessions, so raw counts mostly reflect station size
sns.countplot(data=df, x='station_code')

# BETTER: normalize to per-session averages
sns.barplot(data=df, x='station_code', y='amount', estimator='mean')

# OR: show both total and average side by side
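
For the side-by-side option, one possible approach is to build a per-station summary first and plot from that. The sketch below uses invented sessions and pandas named aggregation; the output column names (sessions, total_amount, avg_amount) are illustrative:

```python
import pandas as pd

# Hypothetical sessions; station A has far more of them than station B.
df = pd.DataFrame({
    "station_code": ["A"] * 6 + ["B"] * 2,
    "parking_id": range(1, 9),
    "amount": [50, 60, 40, 55, 45, 50, 120, 80],
})

# One row per station: how many sessions, how much in total, and how much per session.
per_station = df.groupby("station_code").agg(
    sessions=("parking_id", "count"),
    total_amount=("amount", "sum"),
    avg_amount=("amount", "mean"),
)
print(per_station)
```

In this example station A leads on total (300 vs. 200) while station B leads per session (100 vs. 50). Plotting both columns in adjacent panels keeps either number from being mistaken for the other.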

Dashboard Composition

Real reporting tasks need multiple charts working together. The rule: each chart answers one question; together they tell one story.

Principle: Overview → Detail

Design dashboards so the reader moves from a high-level summary to specifics.

[Top row]   KPI metrics (total revenue, total visits, avg amount)
[Middle]    Trend over time (line chart — what's happening?)
[Bottom]    Breakdown (bar chart by station, heatmap by station × month)

Example: Monthly Station Report

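import matplotlib.pyplot as plt
import seaborn as sns

# monthly (with rolling_3m and mom_pct columns), station_rev, and parking_df
# are assumed to be prepared beforehand
fig = plt.figure(figsize=(18, 12))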

# top row: trend + MoM change
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)

# bottom row: station breakdown + heatmap
ax3 = fig.add_subplot(2, 2, 3)
ax4 = fig.add_subplot(2, 2, 4)

# top-left: monthly revenue + rolling avg
ax1.plot(monthly['month_str'], monthly['revenue'], marker='o', label='Monthly')
ax1.plot(monthly['month_str'], monthly['rolling_3m'], linestyle='--', label='3M Avg')
ax1.set_title('Monthly Revenue')
ax1.legend()
ax1.tick_params(axis='x', rotation=45)

# top-right: MoM % change (green/red bars)
colors = ['green' if v >= 0 else 'tomato' for v in monthly['mom_pct'].fillna(0)]
ax2.bar(monthly['month_str'], monthly['mom_pct'].fillna(0), color=colors)
ax2.axhline(0, color='black', linewidth=0.8)
ax2.set_title('Month-over-Month Change (%)')
ax2.tick_params(axis='x', rotation=45)

# bottom-left: revenue by station
sns.barplot(data=station_rev, x='station_code', y='amount', ax=ax3)
ax3.set_title('Total Revenue by Station')
for c in ax3.containers:
    ax3.bar_label(c, fmt='%.0f', padding=3, fontsize=8)

# bottom-right: station × month heatmap
pivot = parking_df.pivot_table(
    values='amount', index='station_code',
    columns=parking_df['entry_time'].dt.month, aggfunc='sum'
)
sns.heatmap(pivot, annot=True, fmt='.0f', cmap='YlOrRd', ax=ax4)
ax4.set_title('Revenue by Station × Month')
ax4.set_xlabel('Month')

plt.suptitle('Parking System — Monthly Report', fontsize=16, y=1.01)
plt.tight_layout()
plt.savefig('monthly_report.png', dpi=150, bbox_inches='tight')
plt.show()

Reading the composition:

  • Top-left tells you what the trend is. Top-right tells you how fast it's changing.
  • Bottom-left ranks stations by total. Bottom-right shows how much each station earned in each month, the question the bar chart can't answer alone.
  • The line chart and heatmap complement each other: one shows shape, the other shows magnitude across two dimensions simultaneously.
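
The report code above reads monthly['rolling_3m'] and monthly['mom_pct'] without showing where they come from. One plausible way to derive them from transaction-level data is sketched below; the data is invented, and the assumption is that parking_df has entry_time and amount columns as used elsewhere in this guide:

```python
import pandas as pd

# Hypothetical transactions; the real parking_df is assumed to have similar columns.
parking_df = pd.DataFrame({
    "entry_time": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-03", "2024-02-25",
        "2024-03-10", "2024-04-02", "2024-04-18",
    ]),
    "amount": [100, 100, 150, 150, 240, 180, 180],
})

# Sum revenue per calendar month.
monthly = (
    parking_df
    .groupby(parking_df["entry_time"].dt.to_period("M"))["amount"]
    .sum()
    .rename("revenue")
    .reset_index()
)
monthly["month_str"] = monthly["entry_time"].astype(str)  # e.g. '2024-01'

# Smoothed trend and month-over-month change in percent (first month has no predecessor).
monthly["rolling_3m"] = monthly["revenue"].rolling(3, center=True).mean()
monthly["mom_pct"] = monthly["revenue"].pct_change() * 100

print(monthly[["month_str", "revenue", "rolling_3m", "mom_pct"]])
```

The first month's mom_pct is NaN because there is nothing to compare it with, which is why the report code calls fillna(0) before coloring the bars.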