Visualization Selection Guide
Picking the right chart is a separate skill from knowing how to code it. This guide focuses on the when — which chart fits your data and your question — and the mistakes that come from choosing wrong.
For the syntax and code, see Matplotlib / Seaborn Notebook. All examples use the same parking dataset.
Chart Selection Table
Use this table to find the right chart before reading the full section.
| Your data and question | Chart |
|---|---|
| How does a metric change over time? | 1. Line Chart |
| How do categories compare on one metric? | 2. Bar Chart |
| What share does each category contribute to a total? | 3. Stacked Bar / Pie |
| What does the distribution of a single variable look like? | 4. Histogram |
| How does a distribution differ across groups? | 5. Box Plot |
| Is there a relationship between two numeric variables? | 6. Scatter Plot |
| How do values vary across a two-dimensional grid? | 7. Heatmap |
| How do multiple metrics move together over time? | Combine: Line + dual axis |
| How does a station compare on several dimensions at once? | Combine: Small multiples |
When no single chart tells the whole story, see Dashboard Composition at the end.
1. Line Chart — Trend Over Time
Use when: your x-axis is time and you want to show how a metric changes.
ax.plot(monthly['month_str'], monthly['revenue'], marker='o')
Avoid when:
- You have fewer than 4 time points: use a bar chart instead, because a line implies continuity that doesn't exist.
- The x-axis is categorical with no natural order (station names, payment methods): a bar chart is clearer.
What to watch for
Add a rolling average overlay when the raw line is too noisy to read a trend:
monthly['rolling_3m'] = monthly['revenue'].rolling(3, center=True).mean()
ax.plot(monthly['month_str'], monthly['rolling_3m'], linestyle='--', label='3M Avg')
The rolling average should complement the raw line, not replace it. Show both.
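The two lines above assume a `monthly` frame already exists. As a minimal sketch with made-up revenue figures (the numbers are hypothetical, not from the parking dataset), the following shows what a centered window does at the edges: the first and last months have no neighbor on one side, so the overlay starts and ends one point short unless `min_periods` is relaxed.

```python
import pandas as pd

# Hypothetical monthly revenue, made up for illustration only.
monthly = pd.DataFrame({
    "month_str": ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05", "2024-06"],
    "revenue": [100.0, 130.0, 90.0, 140.0, 110.0, 150.0],
})

# Centered 3-month window: averages the previous, current, and next month.
# The first and last rows lack one neighbor, so they come out as NaN.
monthly["rolling_3m"] = monthly["revenue"].rolling(3, center=True).mean()

# Optional variant: min_periods=1 averages whatever months are available,
# so the edges are filled (with a 2-month average instead of 3).
monthly["rolling_3m_filled"] = (
    monthly["revenue"].rolling(3, center=True, min_periods=1).mean()
)

print(monthly)
```

Whether to fill the edges is a judgment call: the filled values are based on less data, so leaving the gaps is often the more honest choice.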
Common mistake: truncated y-axis
Starting the y-axis at a non-zero value visually exaggerates small changes. Start at 0 unless the variation is small relative to the absolute value and a zero baseline would flatten the trend into an unreadable line (e.g., stock prices).
ax.set_ylim(bottom=0) # enforce zero baseline
2. Bar Chart — Categorical Comparison
Use when: comparing a numeric metric across discrete categories (stations, payment methods, parking types).
sns.barplot(data=station_rev, x='station_code', y='amount', ax=ax)
Avoid when:
- You have more than ~12 categories: the chart becomes unreadable. Filter to top N, or switch to a table.
- You want to show change over time with many time points: use a line chart.
Vertical vs. Horizontal
Use horizontal bars when category labels are long — they're easier to read than rotated x-axis labels.
sns.barplot(data=station_rev, y='station_code', x='amount', orient='h', ax=ax)
Sort your bars
Unsorted bars make comparison harder. Always sort by the metric unless the category has a natural order (months, age groups).
station_rev = station_rev.sort_values('amount', ascending=False)
Add value labels for precision
Bar height gives only a rough comparison. When a report needs exact numbers, add value labels.
for container in ax.containers:
    ax.bar_label(container, fmt='%.0f', padding=3)
3. Stacked Bar & Pie — Composition
Use these when you care about what share each part contributes to a whole.
Stacked Bar — Composition That Changes Across Categories
Use when: you want to show both the total and the breakdown by sub-category, across multiple groups.
pivot = parking_df.pivot_table(
    values='amount', index='station_code',
    columns='payment_method', aggfunc='sum', fill_value=0
)
pivot.plot(kind='bar', stacked=True)
Avoid when: you need to compare a specific sub-category across groups. The floating baseline makes it hard to judge anything except the bottom segment and the total.
If you need per-segment comparison, use a grouped bar or separate small charts instead.
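When the question is purely about share rather than totals, one option is to normalize each row of the pivot so every bar sums to 100%. The sketch below uses a tiny made-up frame (the values are hypothetical; only the column names follow the guide) and divides each station's row by its own total.

```python
import pandas as pd

# Hypothetical transactions, made up for illustration; column names follow the guide.
parking_df = pd.DataFrame({
    "station_code": ["A", "A", "A", "B", "B", "C"],
    "payment_method": ["card", "cash", "card", "card", "cash", "cash"],
    "amount": [60, 40, 100, 30, 90, 50],
})

pivot = parking_df.pivot_table(
    values="amount", index="station_code",
    columns="payment_method", aggfunc="sum", fill_value=0,
)

# Divide each row by its own total so every station sums to 1.0 (100%).
shares = pivot.div(pivot.sum(axis=1), axis=0)
print(shares.round(2))

# shares.plot(kind="bar", stacked=True) would then draw a 100% stacked bar.
```

The trade-off is that a 100% stacked bar hides the totals, so it suits the "what is the mix?" question and not the "who is biggest?" question.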
Pie / Donut Chart
Use when:
- There are 5 or fewer categories.
- You want to emphasize one dominant segment ("Station A contributes 60% of revenue").
- The audience cares about part-of-whole, not exact values.
Avoid when:
- You have more than 5 categories: too many slices become unreadable.
- You need to compare values: humans are poor at judging angles. A bar chart is almost always more accurate.
- You need to show change over time.
| Situation | Stacked Bar | Pie |
|---|---|---|
| Multiple groups with sub-categories | ✓ | ✗ |
| Single snapshot of composition | ✓ | ✓ |
| ≤ 5 categories | ✓ | ✓ |
| Exact value comparison | ✗ | ✗ |
4. Histogram — Single Variable Distribution
Use when: you want to understand how a single numeric variable is distributed — its shape, center, spread, and whether outliers exist.
sns.histplot(parking_df['amount'], bins=30, kde=True, ax=ax)
Avoid when: you are comparing groups of very different sizes on the raw count scale. The larger group dominates and the smaller group's shape is hard to see. Use stat='density' or stat='probability' to normalize instead.
sns.histplot(data=parking_df, x='amount', hue='parking_type',
             stat='density', common_norm=False, bins=30, kde=True)
common_norm=False normalizes each group independently so the shapes are comparable regardless of group size.
Choosing bin count
Too few bins hide shape; too many create noise. bins=30 is a reasonable default for most business data. If the distribution is bimodal (two humps), check whether you should split the data by a categorical variable first.
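If a fixed default feels arbitrary, NumPy's reference rules can suggest a bin count from the data itself. The sketch below runs `np.histogram_bin_edges` on synthetic right-skewed amounts (generated here, not taken from the parking dataset) and compares three rules. Treat the results as a starting point to adjust by eye rather than as the correct answer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed amounts, generated for illustration only.
amounts = rng.lognormal(mean=3.5, sigma=0.6, size=2000)

bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    # Each rule name is a built-in NumPy bin-width estimator.
    edges = np.histogram_bin_edges(amounts, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

seaborn's `histplot` documents that its `bins` argument accepts the same rule names, so `bins='fd'` should work there as well.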
What to look for
- Right skew (long tail to the right): most parking amounts are small with occasional large ones — median is more meaningful than mean.
- Outliers: gaps in the tail suggest data quality issues or a distinct sub-population.
- Bimodal: two peaks often signal two different behaviors (e.g., short-term vs. monthly parkers mixed together).
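To make the right-skew point concrete, here is a small worked example with made-up amounts (hypothetical values, not from the dataset): one large transaction is enough to pull the mean far away from what a typical customer pays.

```python
from statistics import mean, median

# Hypothetical parking amounts: mostly small, one very large (right-skewed).
amounts = [20, 30, 30, 40, 40, 50, 60, 600]

print(mean(amounts))    # 108.75, pulled up by the single 600
print(median(amounts))  # 40.0, much closer to the typical transaction
```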
5. Box Plot — Distribution Across Groups
Use when: comparing the distribution of a numeric variable across 3 or more categories — especially when you want to see median, spread, and outliers simultaneously.
sns.boxplot(data=parking_df, x='parking_type', y='amount', ax=ax)
Avoid when:
- You have only 2 groups — a histogram with hue or a simple comparison of means is clearer.
- Your audience is non-technical — box plots require explanation (what is an IQR?). A bar chart of means with error bars is more accessible.
Reading a box plot
- Box = interquartile range (IQR), the middle 50% of data.
- Line inside box = median.
- Whiskers = the most extreme data points that still lie within 1.5× IQR of the box edges.
- Dots beyond whiskers = outliers.
The median tells you more than the mean when the data is skewed — a box plot makes this visible.
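The elements listed above can be computed by hand, which helps when explaining a box plot to someone new to it. This is a sketch on made-up values, using the standard library's `quantiles` with linear interpolation. It assumes the plotting library uses the same quartile convention, which is worth verifying if exact agreement matters.

```python
from statistics import quantiles

# Hypothetical amounts for one group, made up for illustration.
amounts = sorted([10, 20, 30, 40, 50, 60, 70, 80, 300])

# method="inclusive" interpolates linearly between data points.
q1, med, q3 = quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 x IQR beyond the box; whiskers stop at the last point inside them.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [v for v in amounts if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]
outliers = [v for v in amounts if v < lower_fence or v > upper_fence]

print(f"box: {q1} to {q3}, median {med}, IQR {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Here the box spans 30 to 70, the upper fence is 130, the upper whisker stops at 80 (the largest value below the fence), and 300 is drawn as an outlier dot.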
Overlay individual points for small datasets
When n < 50 per group, the box plot summary can mislead. Show the individual points:
sns.boxplot(data=df, x='parking_type', y='amount', ax=ax)
sns.stripplot(data=df, x='parking_type', y='amount',
              color='black', alpha=0.4, size=2, ax=ax)
6. Scatter Plot — Relationship Between Two Numeric Variables
Use when: investigating whether two numeric variables are correlated — e.g., does parking duration predict payment amount?
sns.scatterplot(data=parking_df, x='duration_mins', y='amount',
                hue='parking_type', alpha=0.5)
Avoid when:
- One axis is categorical: use a box plot or bar chart.
- You have millions of points: overplotting makes patterns invisible. Subsample, use hexbin, or a 2D density plot instead.
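Two of the overplotting remedies mentioned above, subsampling and hexbin, can be sketched side by side with plain Matplotlib. The data is synthetic (generated here with an assumed linear price-versus-duration relationship), and the sample size and `gridsize` are arbitrary starting values rather than recommendations.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
# Synthetic sessions: amount grows with duration, plus noise (assumption for the demo).
duration_mins = rng.gamma(shape=2.0, scale=45.0, size=n)
amount = 0.5 * duration_mins + rng.normal(0, 10, size=n)

fig, (ax_sub, ax_hex) = plt.subplots(1, 2, figsize=(12, 5))

# Option 1: plot a random subsample so individual points stay visible.
idx = rng.choice(n, size=5_000, replace=False)
ax_sub.scatter(duration_mins[idx], amount[idx], s=4, alpha=0.3)
ax_sub.set_title("Random subsample (5,000 points)")

# Option 2: hexbin counts all points per hexagonal cell and encodes the count as color.
hb = ax_hex.hexbin(duration_mins, amount, gridsize=40, cmap="Blues", mincnt=1)
fig.colorbar(hb, ax=ax_hex, label="Sessions per bin")
ax_hex.set_title("Hexbin (all points)")

for ax in (ax_sub, ax_hex):
    ax.set_xlabel("Duration (min)")
    ax.set_ylabel("Amount (NTD)")
fig.tight_layout()
```

Subsampling keeps the familiar scatter look but can miss rare points. Hexbin uses every point but trades individual observations for density.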
Add a regression line to confirm direction
sns.regplot(data=parking_df, x='duration_mins', y='amount',
            scatter_kws={'alpha': 0.3})
The shaded band is the confidence interval for the fitted line (95% by default). A wide band means the slope is poorly pinned down, usually because the data is noisy or the sample is small.
Correlation ≠ Causation
A scatter plot shows association, not cause. If duration and amount are correlated, that is most likely because pricing is time-based: it reflects how the tariff works and is not a discovery.
Common mistake: ignoring outliers
A few outliers can make a real correlation invisible, or create an apparent one that doesn't exist. Always check whether the pattern holds after removing extreme values.
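The second case, an apparent correlation created by a single extreme value, is easy to reproduce. The sketch below uses made-up numbers (not from the dataset) in which ten ordinary points show essentially no relationship until one extreme point is added. The `x < 60` cutoff is a hypothetical rule chosen for this toy data only.

```python
import numpy as np

# Ten unrelated points plus one extreme point at (60, 50); values are made up.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 60], dtype=float)
y = np.array([5, 3, 6, 2, 7, 4, 6, 3, 5, 4, 50], dtype=float)

# Pearson correlation on all points, then again without the extreme one.
r_all = np.corrcoef(x, y)[0, 1]
mask = x < 60  # hypothetical cutoff, chosen for this toy data only
r_trimmed = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"with outlier:    r = {r_all:.2f}")
print(f"without outlier: r = {r_trimmed:.2f}")
```

With the extreme point included the coefficient is close to 1. Without it, it is close to 0. Reporting both numbers is usually more informative than silently dropping the point.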
7. Heatmap — Values Across a Two-Dimensional Grid
Use when: you have a matrix of values — a correlation table, or a pivot table with two categorical dimensions.
# correlation heatmap
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
# pivot heatmap: revenue by station × month
sns.heatmap(pivot, annot=True, fmt='.0f', cmap='YlOrRd')
Avoid when:
- You have fewer than ~4 rows or columns — a table or grouped bar chart is cleaner.
- You need to compare exact values — color is harder to read precisely than bar length. Add annot=True to print the values if precision matters.
Choosing a color palette
| Palette | Use case |
|---|---|
| 'coolwarm' | Diverging values (correlation: -1 to +1, growth: negative to positive) |
| 'YlOrRd' | Sequential, one direction (revenue, volume; higher is more intense) |
| 'Blues' | Sequential single color, less visually loud |
Always set vmin and vmax symmetrically around the midpoint for diverging palettes (or pass center=0) so that zero maps to the palette's neutral color in every chart:
sns.heatmap(corr, cmap='coolwarm', vmin=-1, vmax=1)
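The heatmap calls above take `corr` as given. A minimal sketch of how such a matrix might be built with pandas follows. The values are made up, and `entry_hour` is a hypothetical column added only so the matrix has a third variable.

```python
import pandas as pd

# Hypothetical sessions; `entry_hour` is an assumed column, not from the guide.
parking_df = pd.DataFrame({
    "duration_mins": [30, 60, 90, 120, 240, 480],
    "amount": [20, 40, 60, 80, 150, 300],
    "entry_hour": [8, 18, 9, 17, 12, 7],
})

# Pairwise Pearson correlation between the numeric columns.
corr = parking_df[["duration_mins", "amount", "entry_hour"]].corr()
print(corr.round(2))
```

The result is a square, symmetric frame with 1.0 on the diagonal, which is the shape `sns.heatmap` expects for a correlation plot.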
Pivot heatmap vs. line chart for trend data
A pivot heatmap (station × month) lets you scan two dimensions at once — which stations are growing and which months are peaks. A line chart per station is better when you care about the exact shape of the trend. Use both: the heatmap for overview, line charts for follow-up.
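For the follow-up step, "a line chart per station" can be laid out as small multiples with a shared y-axis so the panels stay comparable. The sketch below uses a made-up station × month table (hypothetical revenue values) and plain Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical station x month revenue, made up for illustration.
pivot = pd.DataFrame(
    {1: [120, 80, 40], 2: [130, 70, 45], 3: [150, 75, 60], 4: [170, 60, 90]},
    index=pd.Index(["A", "B", "C"], name="station_code"),
)

# One panel per station; sharey=True keeps the vertical scale identical.
fig, axes = plt.subplots(1, len(pivot), figsize=(12, 3), sharey=True)
for ax, (station, row) in zip(axes, pivot.iterrows()):
    ax.plot(row.index, row.values, marker="o")
    ax.set_title(f"Station {station}")
    ax.set_xlabel("Month")

# Set the shared range once, from zero to slightly above the largest value.
axes[0].set_ylim(0, pivot.to_numpy().max() * 1.1)
axes[0].set_ylabel("Revenue (NTD)")
fig.tight_layout()
```

A shared axis makes magnitudes comparable across stations. If the goal is to compare only the shape of each trend, independent axes may read better.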
Common Mistakes
1. Using a Pie Chart with Too Many Slices
# AVOID: 8 slices, angles become indistinguishable
df['payment_method'].value_counts().plot(kind='pie')
# BETTER: bar chart, or collapse small categories into "Other"
counts = df['payment_method'].value_counts()
top4 = counts.head(4).copy()
top4['Other'] = counts.iloc[4:].sum()  # 4 named slices + "Other" = 5 slices
top4.plot(kind='pie')
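The collapse-into-"Other" step can be wrapped in a small helper so the slice limit is explicit. This is a sketch: `top_n_with_other` is a hypothetical name, not a pandas function, and the counts are made up.

```python
import pandas as pd

def top_n_with_other(counts: pd.Series, n: int = 4, label: str = "Other") -> pd.Series:
    """Keep the n largest entries and sum the remainder into one `label` entry."""
    ordered = counts.sort_values(ascending=False)
    head, tail = ordered.iloc[:n], ordered.iloc[n:]
    if tail.empty:
        return head  # nothing to collapse
    return pd.concat([head, pd.Series({label: tail.sum()})])

# Hypothetical payment-method counts, made up for illustration.
counts = pd.Series({"card": 500, "cash": 300, "app": 120, "transit": 40,
                    "voucher": 25, "invoice": 10, "other_wallet": 5})

slices = top_n_with_other(counts, n=4)
print(slices)
```

With `n=4` the result has five entries in total, which stays within the five-slice guideline. The grand total is unchanged, so the shares remain valid.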
2. Omitting a Zero Baseline on Bar Charts
# AVOID: y-axis starts at 80000, making a small difference look huge
ax.set_ylim(80000, 120000)
# CORRECT: bar charts should always start at 0
ax.set_ylim(bottom=0)
Line charts may legitimately use a non-zero baseline when the variation is small relative to the absolute value, but bar charts never should — bar length encodes the value.
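A little arithmetic shows how much the truncated axis distorts. The sketch below uses the 80000 baseline from the snippet above together with two made-up bar values, and compares the ratio of the drawn bar heights under each baseline. `drawn_ratio` is a hypothetical helper written for this illustration.

```python
def drawn_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn heights of bars b and a when the axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)

# Two hypothetical stations whose revenue differs by 10%.
low, high = 100_000, 110_000

print(drawn_ratio(low, high))          # zero baseline: bar is 1.1x as tall
print(drawn_ratio(low, high, 80_000))  # baseline at 80000: bar is 1.5x as tall
```

A true 10% difference is drawn as a 50% difference, which is why the zero baseline is non-negotiable for bars.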
3. Plotting Means Without Showing Spread
A bar chart of means hides whether the groups have similar distributions or wildly different ones.
# Shows mean only — misleading if distributions differ
sns.barplot(data=df, x='parking_type', y='amount')
# Better: box plot, or bar + stripplot overlay
sns.boxplot(data=df, x='parking_type', y='amount')
If you must use a bar chart of means, at minimum add error bars:
sns.barplot(data=df, x='parking_type', y='amount', errorbar='sd')
4. Using Color to Encode the Same Variable Twice
# AVOID: x-axis already encodes station_code, so color adds no information
sns.barplot(data=station_rev, x='station_code', y='amount',
            hue='station_code')  # redundant
# Use hue only when it encodes a DIFFERENT variable
sns.barplot(data=df, x='station_code', y='amount',
            hue='parking_type')  # hue = a second dimension
5. Not Labeling Axes or Units
# Always set axis labels with units
ax.set_xlabel('Month')
ax.set_ylabel('Revenue (NTD)')
ax.set_title('Monthly Revenue by Station')
A chart without units is incomplete. "Revenue" alone doesn't say NTD, USD, or thousands.
6. Comparing Groups with Very Different Sizes Using Raw Counts
# AVOID: Station A has 10x more sessions — raw count comparison is meaningless
sns.barplot(data=df, x='station_code', y='parking_id', estimator='count')
# BETTER: normalize to per-session averages
sns.barplot(data=df, x='station_code', y='amount', estimator='mean')
# OR: show both total and average side by side
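The "show both total and average" option from the last comment could start from a single summary table. This is a sketch on made-up sessions (hypothetical values; column names follow the guide) using pandas named aggregation.

```python
import pandas as pd

# Hypothetical sessions: station A has 10 sessions, station B only 2.
df = pd.DataFrame({
    "station_code": ["A"] * 10 + ["B"] * 2,
    "parking_id": range(1, 13),
    "amount": [50] * 10 + [120, 80],
})

# One row per station with volume, total, and per-session average.
summary = (
    df.groupby("station_code")
      .agg(sessions=("parking_id", "count"),
           total=("amount", "sum"),
           avg_amount=("amount", "mean"))
      .reset_index()
)
print(summary)
```

Station A wins on total (500 vs. 200) while station B wins on average (100 vs. 50). Plotting `total` and `avg_amount` in adjacent panels makes that contrast visible instead of hiding it behind one metric.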
Dashboard Composition
Real reporting tasks need multiple charts working together. The rule: each chart answers one question; together they tell one story.
Principle: Overview → Detail
Design dashboards so the reader moves from a high-level summary to specifics.
[Top row] KPI metrics (total revenue, total visits, avg amount)
[Middle] Trend over time (line chart — what's happening?)
[Bottom] Breakdown (bar chart by station, heatmap by station × month)
Example: Monthly Station Report
fig = plt.figure(figsize=(18, 12))
# top row: trend + MoM change
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
# bottom row: station breakdown + heatmap
ax3 = fig.add_subplot(2, 2, 3)
ax4 = fig.add_subplot(2, 2, 4)
# top-left: monthly revenue + rolling avg
ax1.plot(monthly['month_str'], monthly['revenue'], marker='o', label='Monthly')
ax1.plot(monthly['month_str'], monthly['rolling_3m'], linestyle='--', label='3M Avg')
ax1.set_title('Monthly Revenue')
ax1.legend()
ax1.tick_params(axis='x', rotation=45)
# top-right: MoM % change (green/red bars)
colors = ['green' if v >= 0 else 'tomato' for v in monthly['mom_pct'].fillna(0)]
ax2.bar(monthly['month_str'], monthly['mom_pct'].fillna(0), color=colors)
ax2.axhline(0, color='black', linewidth=0.8)
ax2.set_title('Month-over-Month Change (%)')
ax2.tick_params(axis='x', rotation=45)
# bottom-left: revenue by station
sns.barplot(data=station_rev, x='station_code', y='amount', ax=ax3)
ax3.set_title('Total Revenue by Station')
for c in ax3.containers:
    ax3.bar_label(c, fmt='%.0f', padding=3, fontsize=8)
# bottom-right: station × month heatmap
pivot = parking_df.pivot_table(
    values='amount', index='station_code',
    columns=parking_df['entry_time'].dt.month, aggfunc='sum'
)
sns.heatmap(pivot, annot=True, fmt='.0f', cmap='YlOrRd', ax=ax4)
ax4.set_title('Revenue by Station × Month')
ax4.set_xlabel('Month')
plt.suptitle('Parking System — Monthly Report', fontsize=16, y=1.01)
plt.tight_layout()
plt.savefig('monthly_report.png', dpi=150, bbox_inches='tight')
plt.show()
Reading the composition:
- Top-left tells you what the trend is. Top-right tells you how fast it's changing.
- Bottom-left ranks stations by total. Bottom-right shows how each station's revenue is spread across months, the question the bar chart can't answer alone.
- The line chart and heatmap complement each other: one shows shape, the other shows magnitude across two dimensions simultaneously.
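The report code takes `monthly` (with `revenue`, `rolling_3m`, and `mom_pct`) and `station_rev` as given. One plausible way to derive them from `parking_df` is sketched below. The rows are made up, and the exact aggregation used in the companion notebook may differ.

```python
import pandas as pd

# Hypothetical transactions; column names follow the guide, values are made up.
parking_df = pd.DataFrame({
    "entry_time": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18",
        "2024-03-07", "2024-03-21", "2024-04-02", "2024-04-25",
    ]),
    "station_code": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "amount": [100, 100, 150, 100, 120, 80, 180, 120],
})

# Monthly revenue, keyed by a sortable "YYYY-MM" string.
monthly = (
    parking_df
    .assign(month_str=parking_df["entry_time"].dt.strftime("%Y-%m"))
    .groupby("month_str", as_index=False)["amount"].sum()
    .rename(columns={"amount": "revenue"})
)
# Centered 3-month average and month-over-month change in percent.
monthly["rolling_3m"] = monthly["revenue"].rolling(3, center=True).mean()
monthly["mom_pct"] = monthly["revenue"].pct_change() * 100

# Total revenue per station, largest first, ready for a sorted bar chart.
station_rev = (
    parking_df.groupby("station_code", as_index=False)["amount"].sum()
    .sort_values("amount", ascending=False)
)

print(monthly)
print(station_rev)
```

The first month has no predecessor, so its `mom_pct` is NaN. That is why the report code calls `fillna(0)` before coloring the bars.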