NumPy Notebook

NumPy is the numerical foundation of the Python data stack. Pandas DataFrames are built on NumPy arrays, and most scientific Python libraries expect NumPy as input. Understanding NumPy lets you write faster, cleaner numeric code — and debug Pandas more effectively.

The core idea in NumPy is vectorization: apply an operation to an entire array at once instead of looping element by element. This is both faster and more readable.

See Pandas Notebook for the DataFrame layer built on top of NumPy.

All examples use parking system data in array form: amounts, durations, and station IDs extracted from the parking dataset.

import numpy as np

Creating Arrays

From Python Lists

amounts    = np.array([120, 450, 80, 230, 560])
durations  = np.array([30.5, 120.0, 15.0, 60.0, 240.0])
station_ids = np.array([1, 2, 1, 3, 2])

Built-in Constructors

np.zeros(5)                    # [0. 0. 0. 0. 0.]
np.ones(5)                     # [1. 1. 1. 1. 1.]
np.full(5, 999)                # [999 999 999 999 999]
np.arange(0, 10, 2)           # [0 2 4 6 8]  — like range()
np.linspace(0, 1, 5)          # [0.  0.25 0.5 0.75 1.]  — evenly spaced
np.eye(3)                      # 3×3 identity matrix

From Pandas

amounts = parking_df['amount'].to_numpy()          # Series → ndarray
matrix  = parking_df[['amount', 'duration_mins']].to_numpy()  # DataFrame → 2D array

Use .to_numpy() instead of .values — it's explicit about returning an ndarray, whereas .values can return a Pandas extension array for some dtypes.

Array Attributes & Inspection

a = np.array([[1, 2, 3], [4, 5, 6]])

a.shape    # (2, 3) — rows × columns
a.ndim     # 2 — number of dimensions
a.size     # 6 — total number of elements
a.dtype    # dtype('int64') — element type

Data Types

np.array([1, 2, 3], dtype=float)      # force float64
np.array([1.5, 2.5]).astype(int)      # cast to int (truncates, does not round)

Common dtypes: int64, float64, bool, str_. NumPy defaults Python ints to int64 and floats to float64 (older NumPy versions on Windows defaulted ints to int32).
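A quick sketch of the truncation note above, assuming nothing beyond plain NumPy: astype(int) drops the fractional part toward zero, while np.round uses round-half-to-even, so the two can disagree on .5 values.

```python
import numpy as np

a = np.array([1.5, 2.5, -1.7])

a.astype(int)              # [ 1  2 -1], truncates toward zero
np.round(a).astype(int)    # [ 2  2 -2], round-half-to-even sends 2.5 to 2
```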

Indexing & Slicing

1D Arrays

a = np.array([120, 450, 80, 230, 560])

a[0]        # 120  — first element
a[-1]       # 560  — last element
a[1:4]      # [450, 80, 230]  — index 1 to 3
a[::2]      # [120, 80, 560]  — every other element
a[::-1]     # [560, 230, 80, 450, 120]  — reversed

2D Arrays

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

m[0, 1]     # 2  — row 0, column 1
m[1, :]     # [4, 5, 6]  — entire row 1
m[:, 2]     # [3, 6, 9]  — entire column 2
m[0:2, 1:]  # [[2, 3], [5, 6]]  — sub-matrix

Boolean Indexing

amounts = np.array([120, 450, 80, 230, 560])

amounts[amounts > 200]          # [450, 230, 560]
amounts[(amounts > 100) & (amounts < 400)]  # [120, 230]

Fancy Indexing (Index with an Array)

idx = np.array([0, 2, 4])
amounts[idx]     # [120, 80, 560]  — select by position array

Reshaping

a = np.arange(12)         # [ 0  1  2  3  4  5  6  7  8  9 10 11]

a.reshape(3, 4)           # 3 rows × 4 columns
a.reshape(3, -1)          # -1 lets NumPy infer the missing dimension → same result
a.reshape(2, 2, 3)        # 3D array

a.flatten()               # always returns a copy — 1D
a.ravel()                 # returns a view when possible — faster

a[:, np.newaxis]          # add a new axis: shape (12,) → (12, 1)
a[np.newaxis, :]          # shape (12,) → (1, 12)

Use -1 in reshape to avoid computing the missing dimension manually — NumPy does it for you.
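The flatten-vs-ravel distinction above matters once you mutate the result; a minimal sketch:

```python
import numpy as np

a = np.arange(6)

v = a.reshape(2, 3).ravel()   # view: shares memory with a (contiguous array)
v[0] = 99                     # writing through the view changes a
f = a.flatten()               # copy: independent memory
f[1] = -1                     # a is untouched

a[0]   # 99
a[1]   # still 1
```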

Math Operations

Element-wise (vectorized)

amounts = np.array([120, 450, 80, 230, 560])

amounts + 100             # add 100 to every element
amounts * 1.05            # apply 5% increase
amounts ** 2              # square every element
np.sqrt(amounts)          # square root
np.log(amounts)           # natural log
np.abs(amounts - 300)     # absolute deviation from 300

All operations apply element-wise without any loop. This is the core NumPy pattern.

Array × Array

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

a + b      # [5, 7, 9]  — element-wise add
a * b      # [4, 10, 18]  — element-wise multiply (NOT dot product)
a @ b      # 32  — dot product

Matrix Operations

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

A @ B              # matrix multiplication
A.T                # transpose
np.linalg.inv(A)   # inverse
np.linalg.det(A)   # determinant

Comparison & Boolean Arrays

amounts = np.array([120, 450, 80, 230, 560])

amounts > 200              # [False  True False  True  True]
amounts == 80              # [False False  True False False]
np.where(amounts > 200, 'High', 'Low')   # ['Low' 'High' 'Low' 'High' 'High']

np.where — Vectorized if/else

# np.where(condition, value_if_true, value_if_false)
labels = np.where(amounts > 300, 'High', 'Low')

# nested: three tiers
tiers = np.where(amounts > 400, 'High',
        np.where(amounts > 200, 'Medium', 'Low'))

Equivalent to CASE WHEN in SQL. For more than 2–3 conditions, prefer np.select over nested np.where calls.

np.select — Multiple Conditions

conditions = [amounts > 400, amounts > 200]
choices    = ['High', 'Medium']
tiers      = np.select(conditions, choices, default='Low')

Aggregation

amounts = np.array([120, 450, 80, 230, 560])

np.sum(amounts)       # 1440
np.mean(amounts)      # 288.0
np.median(amounts)    # 230.0
np.std(amounts)       # standard deviation
np.var(amounts)       # variance
np.min(amounts)       # 80
np.max(amounts)       # 560
np.argmin(amounts)    # 2  — index of minimum
np.argmax(amounts)    # 4  — index of maximum
np.percentile(amounts, [25, 50, 75])   # quartiles
np.cumsum(amounts)    # running total: [120 570 650 880 1440]

Axis Argument for 2D Arrays

m = np.array([[10, 20, 30],
              [40, 50, 60]])

np.sum(m)           # 210  — all elements
np.sum(m, axis=0)   # [50, 70, 90]  — sum per column (collapse rows)
np.sum(m, axis=1)   # [60, 150]  — sum per row (collapse columns)

axis=0 collapses rows (operates down); axis=1 collapses columns (operates across). This is consistent across all NumPy aggregation functions.
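keepdims is worth knowing alongside the axis argument (it is not used elsewhere in this notebook): it keeps the collapsed axis as size 1, so the result broadcasts straight back against the original array.

```python
import numpy as np

m = np.array([[10, 20, 30],
              [40, 50, 60]])

row_means = m.mean(axis=1, keepdims=True)   # shape (2, 1) instead of (2,)
centered  = m - row_means                   # subtract each row's mean from that row
```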

Broadcasting

Broadcasting is NumPy's rule for operating on arrays of different shapes without copying data.

amounts = np.array([120, 450, 80, 230, 560])   # shape (5,)
amounts + 100                                   # scalar broadcasts to every element

row    = np.array([[1, 2, 3]])                  # shape (1, 3)
col    = np.array([[10], [20], [30]])            # shape (3, 1)
row + col                                       # shape (3, 3) — broadcasts both

Broadcasting rule: shapes are compared from the trailing dimension backwards; two dimensions are compatible if they are equal, or one of them is 1. NumPy expands the size-1 dimension to match the other without copying memory.
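Incompatible shapes raise a ValueError rather than silently misbehaving; a small sketch:

```python
import numpy as np

a = np.ones((3, 4))

(a + np.ones(4)).shape     # (3, 4): the (4,) lines up with a's last dimension
try:
    a + np.ones(3)         # last dimensions 4 vs 3: incompatible
except ValueError:
    print("shapes (3, 4) and (3,) do not broadcast")
```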

Practical example: subtract the column mean from each column

m = np.array([[10, 20, 30],
              [40, 50, 60],
              [70, 80, 90]])

col_means = m.mean(axis=0)          # shape (3,) → [40. 50. 60.]
m - col_means                       # shape (3,3) − (3,) → broadcasts row-wise

Common mistake: shape (n,) vs (n,1)

a = np.array([1, 2, 3])        # shape (3,)  — 1D
b = np.array([[1], [2], [3]])  # shape (3,1) — 2D column vector

# (3,) + (3,)  → (3,)    — element-wise
# (3,1) + (1,3) → (3,3)  — outer sum via broadcasting

When broadcasting produces unexpected shapes, check .shape on both operands first.
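A minimal sketch of this failure mode and one way out of it:

```python
import numpy as np

a = np.arange(3)              # shape (3,)
b = a[:, np.newaxis]          # shape (3, 1)

(a + b).shape                 # (3, 3): outer sum, probably not what you wanted
(a + b.ravel()).shape         # (3,): flatten b back to 1D for an element-wise add
```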

Sorting

a = np.array([450, 80, 230, 120, 560])

np.sort(a)                    # [80, 120, 230, 450, 560]  — returns a copy
np.argsort(a)                 # [1, 3, 2, 0, 4]  — indices that would sort a
a[np.argsort(a)]              # same as np.sort(a)

np.sort(a)[::-1]              # descending sort

argsort is useful when you need to sort a second array in the same order:

names   = np.array(['D', 'B', 'C', 'A', 'E'])
amounts = np.array([450, 80, 230, 120, 560])
names[np.argsort(amounts)]    # ['B' 'A' 'C' 'D' 'E']  — names sorted by amount

Random Numbers

rng = np.random.default_rng(seed=42)   # recommended: explicit Generator with seed

rng.integers(1, 100, size=5)            # 5 random integers in [1, 100)
rng.random(5)                           # 5 uniform floats in [0, 1)
rng.normal(loc=0, scale=1, size=5)      # 5 samples from N(0,1)
rng.choice(np.array([10, 20, 30]), size=3, replace=False)  # sample without replacement
rng.shuffle(a)                          # shuffle in place

Always create a Generator with default_rng(seed) for reproducibility. Avoid the legacy np.random.seed() pattern — it uses a global state that can cause subtle bugs in larger codebases.
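Two Generators built from the same seed produce identical streams, which is what reproducibility means in practice; a quick check:

```python
import numpy as np

rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

draws1 = rng1.integers(0, 100, size=5)
draws2 = rng2.integers(0, 100, size=5)
same = np.array_equal(draws1, draws2)   # True: same seed, same stream
```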

Common Patterns

1. Replace a Loop with Vectorization

# SLOW: Python loop
result = []
for x in amounts:
    result.append(x * 1.05 if x > 200 else x)

# FAST: vectorized
result = np.where(amounts > 200, amounts * 1.05, amounts)

The vectorized version is typically 10–100× faster because NumPy operations run in compiled C, not interpreted Python.
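A rough way to measure the gap yourself (a sketch; the array size is arbitrary and timings vary by machine):

```python
import time
import numpy as np

amounts = np.random.default_rng(0).integers(1, 1000, size=1_000_000)

t0 = time.perf_counter()
loop_result = np.array([x * 1.05 if x > 200 else x for x in amounts])
loop_time = time.perf_counter() - t0

t0 = time.perf_counter()
vec_result = np.where(amounts > 200, amounts * 1.05, amounts)
vec_time = time.perf_counter() - t0

same = np.allclose(loop_result, vec_result)   # True: same answer, very different speed
```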

2. Normalize to 0–1 Range (Min-Max Scaling)

normalized = (amounts - amounts.min()) / (amounts.max() - amounts.min())

3. Standardize (Z-Score)

standardized = (amounts - amounts.mean()) / amounts.std()

4. Clip Outliers

p5, p95 = np.percentile(amounts, [5, 95])
clipped = np.clip(amounts, p5, p95)     # values outside [p5, p95] are capped

5. Unique Values and Counts

values, counts = np.unique(station_ids, return_counts=True)
# values: [1 2 3], counts: [2 2 1]

6. Stack and Split Arrays

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

np.hstack([a, b])         # [1 2 3 4 5 6]  — concatenate end-to-end (2D: join side by side)
np.vstack([a, b])         # [[1 2 3], [4 5 6]]  — stack as rows
np.column_stack([a, b])   # [[1 4], [2 5], [3 6]]  — pair into columns

np.split(a, 3)            # [array([1]), array([2]), array([3])]

7. NumPy ↔ Pandas

# Pandas → NumPy
arr = df['amount'].to_numpy()
mat = df[['amount', 'duration_mins']].to_numpy()

# NumPy → Pandas
s  = pd.Series(arr, name='amount')
df = pd.DataFrame(mat, columns=['amount', 'duration_mins'])

# Apply NumPy function to a Pandas column
df['log_amount'] = np.log(df['amount'])
df['clipped']    = np.clip(df['amount'], 0, 500)
df['tier']       = np.where(df['amount'] > 300, 'High', 'Low')

NumPy functions like np.log and np.clip accept a Pandas Series directly and return a Series — no conversion needed. np.where returns a plain ndarray, but assigning it to a column still works because Pandas aligns it by position.

8. Efficient Membership Test

station_ids = np.array([1, 3, 2, 1, 5, 3])
target      = np.array([1, 3])

np.isin(station_ids, target)   # [True True False True False True]
station_ids[np.isin(station_ids, target)]   # [1 3 1 3]

np.isin is the vectorized equivalent of Python's in operator — much faster on large arrays.