NumPy Notebook

NumPy is the numerical foundation of the Python data stack. Pandas DataFrames are built on NumPy arrays, and most scientific Python libraries expect NumPy as input. Understanding NumPy lets you write faster, cleaner numeric code — and debug Pandas more effectively.

The core idea in NumPy is vectorization: apply an operation to an entire array at once instead of looping element by element. This is both faster and more readable.

See Pandas Notebook for the DataFrame layer built on top of NumPy.

All examples use parking system data in array form: amounts, durations, and station IDs extracted from the parking dataset.

import numpy as np

Creating Arrays

From Python Lists

amounts    = np.array([120, 450, 80, 230, 560])
durations  = np.array([30.5, 120.0, 15.0, 60.0, 240.0])
station_ids = np.array([1, 2, 1, 3, 2])

Built-in Constructors

np.zeros(5)                    # [0. 0. 0. 0. 0.]
np.ones(5)                     # [1. 1. 1. 1. 1.]
np.full(5, 999)                # [999 999 999 999 999]
np.arange(0, 10, 2)           # [0 2 4 6 8]  — like range()
np.linspace(0, 1, 5)          # [0.  0.25 0.5 0.75 1.]  — evenly spaced
np.eye(3)                      # 3×3 identity matrix

From Pandas

amounts = parking_df['amount'].to_numpy()          # Series → ndarray
matrix  = parking_df[['amount', 'duration_mins']].to_numpy()  # DataFrame → 2D array

Use .to_numpy() instead of .values — it's explicit about returning an ndarray, whereas .values can return a Pandas extension array for some dtypes.

Array Attributes & Inspection

a = np.array([[1, 2, 3], [4, 5, 6]])

a.shape    # (2, 3) — rows × columns
a.ndim     # 2 — number of dimensions
a.size     # 6 — total number of elements
a.dtype    # dtype('int64') — element type

Data Types

np.array([1, 2, 3], dtype=float)      # force float64
np.array([1.5, 2.5]).astype(int)      # cast to int (truncates, does not round)

Common dtypes: int64, float64, bool, str_. NumPy defaults Python ints to int64 and floats to float64 (older NumPy versions on Windows defaulted ints to int32).
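A quick sketch of the truncation note above, assuming nothing beyond plain NumPy: astype(int) drops the fractional part toward zero, while np.round uses round-half-to-even, so the two can disagree on .5 values.

```python
import numpy as np

a = np.array([1.5, 2.5, -1.7])

a.astype(int)              # [ 1  2 -1], truncates toward zero
np.round(a).astype(int)    # [ 2  2 -2], round-half-to-even sends 2.5 to 2
```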

Indexing & Slicing

1D Arrays

a = np.array([120, 450, 80, 230, 560])

a[0]        # 120  — first element
a[-1]       # 560  — last element
a[1:4]      # [450, 80, 230]  — index 1 to 3
a[::2]      # [120, 80, 560]  — every other element
a[::-1]     # [560, 230, 80, 450, 120]  — reversed

2D Arrays

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

m[0, 1]     # 2  — row 0, column 1
m[1, :]     # [4, 5, 6]  — entire row 1
m[:, 2]     # [3, 6, 9]  — entire column 2
m[0:2, 1:]  # [[2, 3], [5, 6]]  — sub-matrix

Boolean Indexing

amounts = np.array([120, 450, 80, 230, 560])

amounts[amounts > 200]          # [450, 230, 560]
amounts[(amounts > 100) & (amounts < 400)]  # [120, 230]

Fancy Indexing (Index with an Array)

idx = np.array([0, 2, 4])
amounts[idx]     # [120, 80, 560]  — select by position array

Reshaping

a = np.arange(12)         # [ 0  1  2  3  4  5  6  7  8  9 10 11]

a.reshape(3, 4)           # 3 rows × 4 columns
a.reshape(3, -1)          # -1 lets NumPy infer the missing dimension → same result
a.reshape(2, 2, 3)        # 3D array

a.flatten()               # always returns a copy — 1D
a.ravel()                 # returns a view when possible — faster

a[:, np.newaxis]          # add a new axis: shape (12,) → (12, 1)
a[np.newaxis, :]          # shape (12,) → (1, 12)

Use -1 in reshape to avoid computing the missing dimension manually — NumPy does it for you.
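The flatten-vs-ravel distinction above matters once you mutate the result; a minimal sketch:

```python
import numpy as np

a = np.arange(6)

v = a.reshape(2, 3).ravel()   # view: shares memory with a (contiguous array)
v[0] = 99                     # writing through the view changes a
f = a.flatten()               # copy: independent memory
f[1] = -1                     # a is untouched

a[0]   # 99
a[1]   # still 1
```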

Math Operations

Element-wise (vectorized)

amounts = np.array([120, 450, 80, 230, 560])

amounts + 100             # add 100 to every element
amounts * 1.05            # apply 5% increase
amounts ** 2              # square every element
np.sqrt(amounts)          # square root
np.log(amounts)           # natural log
np.abs(amounts - 300)     # absolute deviation from 300

All operations apply element-wise without any loop. This is the core NumPy pattern.

Array × Array

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

a + b      # [5, 7, 9]  — element-wise add
a * b      # [4, 10, 18]  — element-wise multiply (NOT dot product)
a @ b      # 32  — dot product

Matrix Operations

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

A @ B              # matrix multiplication
A.T                # transpose
np.linalg.inv(A)   # inverse
np.linalg.det(A)   # determinant

Comparison & Boolean Arrays

amounts = np.array([120, 450, 80, 230, 560])

amounts > 200              # [False  True False  True  True]
amounts == 80              # [False False  True False False]
np.where(amounts > 200, 'High', 'Low')   # ['Low' 'High' 'Low' 'High' 'High']

np.where — Vectorized if/else

# np.where(condition, value_if_true, value_if_false)
labels = np.where(amounts > 300, 'High', 'Low')

# nested: three tiers
tiers = np.where(amounts > 400, 'High',
        np.where(amounts > 200, 'Medium', 'Low'))

Equivalent to CASE WHEN in SQL. For more than 2–3 conditions, prefer np.select over nested np.where calls.

np.select — Multiple Conditions

conditions = [amounts > 400, amounts > 200]
choices    = ['High', 'Medium']
tiers      = np.select(conditions, choices, default='Low')

Aggregation

amounts = np.array([120, 450, 80, 230, 560])

np.sum(amounts)       # 1440
np.mean(amounts)      # 288.0
np.median(amounts)    # 230.0
np.std(amounts)       # standard deviation
np.var(amounts)       # variance
np.min(amounts)       # 80
np.max(amounts)       # 560
np.argmin(amounts)    # 2  — index of minimum
np.argmax(amounts)    # 4  — index of maximum
np.percentile(amounts, [25, 50, 75])   # quartiles
np.cumsum(amounts)    # running total: [120 570 650 880 1440]

Axis Argument for 2D Arrays

m = np.array([[10, 20, 30],
              [40, 50, 60]])

np.sum(m)           # 210  — all elements
np.sum(m, axis=0)   # [50, 70, 90]  — sum per column (collapse rows)
np.sum(m, axis=1)   # [60, 150]  — sum per row (collapse columns)

axis=0 collapses rows (operates down); axis=1 collapses columns (operates across). This is consistent across all NumPy aggregation functions.
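keepdims is worth knowing alongside the axis argument (it is not used elsewhere in this notebook): it keeps the collapsed axis as size 1, so the result broadcasts straight back against the original array.

```python
import numpy as np

m = np.array([[10, 20, 30],
              [40, 50, 60]])

row_means = m.mean(axis=1, keepdims=True)   # shape (2, 1) instead of (2,)
centered  = m - row_means                   # subtract each row's mean from that row
```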

Broadcasting

Broadcasting is NumPy's rule for operating on arrays of different shapes without copying data.

amounts = np.array([120, 450, 80, 230, 560])   # shape (5,)
amounts + 100                                   # scalar broadcasts to every element

row    = np.array([[1, 2, 3]])                  # shape (1, 3)
col    = np.array([[10], [20], [30]])            # shape (3, 1)
row + col                                       # shape (3, 3) — broadcasts both

Broadcasting rule: shapes are compared from the trailing dimension backwards; two dimensions are compatible if they are equal, or one of them is 1. NumPy expands the size-1 dimension to match the other without copying memory.
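Incompatible shapes raise a ValueError rather than silently misbehaving; a small sketch:

```python
import numpy as np

a = np.ones((3, 4))

(a + np.ones(4)).shape     # (3, 4): the (4,) lines up with a's last dimension
try:
    a + np.ones(3)         # last dimensions 4 vs 3: incompatible
except ValueError:
    print("shapes (3, 4) and (3,) do not broadcast")
```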

Practical example: subtract the column mean from each column

m = np.array([[10, 20, 30],
              [40, 50, 60],
              [70, 80, 90]])

col_means = m.mean(axis=0)          # shape (3,) → [40. 50. 60.]
m - col_means                       # shape (3,3) − (3,) → broadcasts row-wise

Common mistake: shape (n,) vs (n,1)

a = np.array([1, 2, 3])        # shape (3,)  — 1D
b = np.array([[1], [2], [3]])  # shape (3,1) — 2D column vector

# (3,) + (3,)  → (3,)    — element-wise
# (3,1) + (1,3) → (3,3)  — outer sum via broadcasting

When broadcasting produces unexpected shapes, check .shape on both operands first.
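A minimal sketch of this failure mode and one way out of it:

```python
import numpy as np

a = np.arange(3)              # shape (3,)
b = a[:, np.newaxis]          # shape (3, 1)

(a + b).shape                 # (3, 3): outer sum, probably not what you wanted
(a + b.ravel()).shape         # (3,): flatten b back to 1D for an element-wise add
```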

Sorting

a = np.array([450, 80, 230, 120, 560])

np.sort(a)                    # [80, 120, 230, 450, 560]  — returns a copy
np.argsort(a)                 # [1, 3, 2, 0, 4]  — indices that would sort a
a[np.argsort(a)]              # same as np.sort(a)

np.sort(a)[::-1]              # descending sort

argsort is useful when you need to sort a second array in the same order:

names   = np.array(['D', 'B', 'C', 'A', 'E'])
amounts = np.array([450, 80, 230, 120, 560])
names[np.argsort(amounts)]    # ['B' 'A' 'C' 'D' 'E']  — names sorted by amount

Random Numbers

rng = np.random.default_rng(seed=42)   # recommended: explicit Generator with seed

rng.integers(1, 100, size=5)            # 5 random integers in [1, 100)
rng.random(5)                           # 5 uniform floats in [0, 1)
rng.normal(loc=0, scale=1, size=5)      # 5 samples from N(0,1)
rng.choice(np.array([10, 20, 30]), size=3, replace=False)  # sample without replacement
rng.shuffle(a)                          # shuffle in place

Always create a Generator with default_rng(seed) for reproducibility. Avoid the legacy np.random.seed() pattern — it uses a global state that can cause subtle bugs in larger codebases.
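Two Generators built from the same seed produce identical streams, which is what reproducibility means in practice; a quick check:

```python
import numpy as np

rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

draws1 = rng1.integers(0, 100, size=5)
draws2 = rng2.integers(0, 100, size=5)
same = np.array_equal(draws1, draws2)   # True: same seed, same stream
```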

Common Patterns

1. Replace a Loop with Vectorization

# SLOW: Python loop
result = []
for x in amounts:
    result.append(x * 1.05 if x > 200 else x)

# FAST: vectorized
result = np.where(amounts > 200, amounts * 1.05, amounts)

The vectorized version is typically 10–100× faster because NumPy operations run in compiled C, not interpreted Python.
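A rough way to measure the gap yourself (a sketch; the array size is arbitrary and timings vary by machine):

```python
import time
import numpy as np

amounts = np.random.default_rng(0).integers(1, 1000, size=1_000_000)

t0 = time.perf_counter()
loop_result = np.array([x * 1.05 if x > 200 else x for x in amounts])
loop_time = time.perf_counter() - t0

t0 = time.perf_counter()
vec_result = np.where(amounts > 200, amounts * 1.05, amounts)
vec_time = time.perf_counter() - t0

same = np.allclose(loop_result, vec_result)   # True: same answer, very different speed
```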

2. Normalize to 0–1 Range (Min-Max Scaling)

normalized = (amounts - amounts.min()) / (amounts.max() - amounts.min())

3. Standardize (Z-Score)

standardized = (amounts - amounts.mean()) / amounts.std()

4. Clip Outliers

p5, p95 = np.percentile(amounts, [5, 95])
clipped = np.clip(amounts, p5, p95)     # values outside [p5, p95] are capped

5. Unique Values and Counts

values, counts = np.unique(station_ids, return_counts=True)
# values: [1 2 3], counts: [2 2 1]

6. Stack and Split Arrays

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

np.hstack([a, b])         # [1 2 3 4 5 6]  — concatenate end-to-end (2D: join side by side)
np.vstack([a, b])         # [[1 2 3], [4 5 6]]  — stack as rows
np.column_stack([a, b])   # [[1 4], [2 5], [3 6]]  — pair into columns

np.split(a, 3)            # [array([1]), array([2]), array([3])]

7. NumPy ↔ Pandas

# Pandas → NumPy
arr = df['amount'].to_numpy()
mat = df[['amount', 'duration_mins']].to_numpy()

# NumPy → Pandas
s  = pd.Series(arr, name='amount')
df = pd.DataFrame(mat, columns=['amount', 'duration_mins'])

# Apply NumPy function to a Pandas column
df['log_amount'] = np.log(df['amount'])
df['clipped']    = np.clip(df['amount'], 0, 500)
df['tier']       = np.where(df['amount'] > 300, 'High', 'Low')

NumPy functions like np.log and np.clip accept a Pandas Series directly and return a Series — no conversion needed. np.where returns a plain ndarray, but assigning it to a column still works because Pandas aligns it by position.

8. Efficient Membership Test

station_ids = np.array([1, 3, 2, 1, 5, 3])
target      = np.array([1, 3])

np.isin(station_ids, target)   # [True True False True False True]
station_ids[np.isin(station_ids, target)]   # [1 3 1 3]

np.isin is the vectorized equivalent of Python's in operator — much faster on large arrays.