NumPy Notebook
NumPy is the numerical foundation of the Python data ecosystem. Pandas DataFrames are built on NumPy arrays by default, and most scientific Python libraries accept NumPy arrays as input. If you understand NumPy, you can write faster, cleaner numeric code and debug Pandas more easily.
The main idea in NumPy is vectorization: apply an operation to a whole array at once, instead of looping over it element by element in Python. This is faster and easier to read.
See Pandas Notebook for the DataFrame layer that sits on top of NumPy.
All examples use parking system data in array form: amounts, durations, and station IDs from the parking dataset.
import numpy as np
Creating Arrays
From Python Lists
amounts = np.array([120, 450, 80, 230, 560])
durations = np.array([30.5, 120.0, 15.0, 60.0, 240.0])
station_ids = np.array([1, 2, 1, 3, 2])
Built-in Constructors
np.zeros(5) # [0. 0. 0. 0. 0.]
np.ones(5) # [1. 1. 1. 1. 1.]
np.full(5, 999) # [999 999 999 999 999]
np.arange(0, 10, 2) # [0 2 4 6 8] — like range()
np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1.] — evenly spaced
np.eye(3) # 3×3 identity matrix
From Pandas
amounts = parking_df['amount'].to_numpy() # Series → ndarray
matrix = parking_df[['amount', 'duration_mins']].to_numpy() # DataFrame → 2D array
Prefer .to_numpy() over .values. It makes it explicit that you get an ndarray back, whereas .values can return a Pandas extension array for some dtypes.
Array Attributes & Inspection
a = np.array([[1, 2, 3], [4, 5, 6]])
a.shape # (2, 3) — rows × columns
a.ndim # 2 — number of dimensions
a.size # 6 — total number of elements
a.dtype # dtype('int64') — element type
Data Types
np.array([1, 2, 3], dtype=float) # force float64
np.array([1.5, 2.5]).astype(int) # cast to int (drops decimals, does not round)
Common dtypes: int64, float64, bool, str_. By default NumPy uses float64 for floats and the platform integer for integers, which is int64 on most 64-bit systems (int32 on Windows before NumPy 2.0).
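As a small illustration of the cast behaviour described above, the sketch below compares a plain astype(int) with rounding first. The sample values are made up for this example and are not from the parking dataset.

```python
import numpy as np

# Hypothetical durations in minutes (illustrative values only).
durations = np.array([30.7, 119.6, 15.2])

truncated = durations.astype(int)            # fractional part is dropped: [30, 119, 15]
rounded = np.round(durations).astype(int)    # round first, then cast:     [31, 120, 15]

# With negative values, astype(int) truncates toward zero; it does not floor.
neg_truncated = np.array([-1.7, 1.7]).astype(int)   # [-1, 1]
```

If the values should be rounded, calling np.round before the cast is the usual approach.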
Indexing & Slicing
1D Arrays
a = np.array([120, 450, 80, 230, 560])
a[0] # 120 — first element
a[-1] # 560 — last element
a[1:4] # [450, 80, 230] — index 1 to 3
a[::2] # [120, 80, 560] — every other element
a[::-1] # [560, 230, 80, 450, 120] — reversed
2D Arrays
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
m[0, 1] # 2 — row 0, column 1
m[1, :] # [4, 5, 6] — the whole row 1
m[:, 2] # [3, 6, 9] — the whole column 2
m[0:2, 1:] # [[2, 3], [5, 6]] — sub-matrix
Boolean Indexing
amounts = np.array([120, 450, 80, 230, 560])
amounts[amounts > 200] # [450, 230, 560]
amounts[(amounts > 100) & (amounts < 400)] # [120, 230]
Fancy Indexing (Index with an Array)
idx = np.array([0, 2, 4])
amounts[idx] # [120, 80, 560] — pick by an array of positions
Reshaping
a = np.arange(12) # [ 0 1 2 3 4 5 6 7 8 9 10 11]
a.reshape(3, 4) # 3 rows × 4 columns
a.reshape(3, -1) # -1 lets NumPy work out the missing dimension → same result
a.reshape(2, 2, 3) # 3D array
a.flatten() # always returns a copy — 1D
a.ravel() # returns a view when it can — faster
a[:, np.newaxis] # add a new axis: shape (12,) → (12, 1)
a[np.newaxis, :] # shape (12,) → (1, 12)
Use -1 in reshape so you don't have to compute the missing size yourself. NumPy fills it in for you.
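The difference between a view and a copy is easiest to see by writing through one of them. The following is a minimal sketch, assuming a contiguous array (the case where ravel can return a view).

```python
import numpy as np

a = np.arange(12)
m = a.reshape(3, 4)        # reshape of a contiguous array returns a view of a

flat_copy = m.flatten()    # independent copy
flat_view = m.ravel()      # view onto the same memory (m is contiguous)

m[0, 0] = 99               # write through the 2D view

# a and flat_view see the change; flat_copy still holds the old value.
shares_with_view = np.shares_memory(m, flat_view)
shares_with_copy = np.shares_memory(m, flat_copy)
```

When the result will be modified independently of the source array, flatten (or an explicit .copy()) is the safer choice.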
Math Operations
Element-wise (vectorized)
amounts = np.array([120, 450, 80, 230, 560])
amounts + 100 # add 100 to every element
amounts * 1.05 # apply a 5% increase
amounts ** 2 # square every element
np.sqrt(amounts) # square root
np.log(amounts) # natural log
np.abs(amounts - 300) # absolute distance from 300
Every operation runs element by element with no explicit Python loop. This is the core NumPy pattern.
Array × Array
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
a + b # [5, 7, 9] — element-wise add
a * b # [4, 10, 18] — element-wise multiply (NOT dot product)
a @ b # 32 — dot product
Matrix Operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
A @ B # matrix multiplication
A.T # transpose
np.linalg.inv(A) # inverse
np.linalg.det(A) # determinant
Comparison & Boolean Arrays
amounts = np.array([120, 450, 80, 230, 560])
amounts > 200 # [False True False True True]
amounts == 80 # [False False True False False]
np.where(amounts > 200, 'High', 'Low') # ['Low' 'High' 'Low' 'High' 'High']
np.where — Vectorized if/else
# np.where(condition, value_if_true, value_if_false)
labels = np.where(amounts > 300, 'High', 'Low')
# nested: three tiers
tiers = np.where(amounts > 400, 'High',
                 np.where(amounts > 200, 'Medium', 'Low'))
This is the same idea as CASE WHEN in SQL. Nested np.where calls get hard to read beyond two or three conditions; at that point, use np.select instead.
np.select — Multiple Conditions
conditions = [amounts > 400, amounts > 200]
choices = ['High', 'Medium']
tiers = np.select(conditions, choices, default='Low')
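np.select checks the conditions in list order, and the first one that is True wins, so the order matters. The sketch below reuses the amounts array from this section and shows what happens when the broader condition comes first.

```python
import numpy as np

amounts = np.array([120, 450, 80, 230, 560])

# Narrowest condition first: values above 400 are labelled before the > 200 check.
tiers = np.select([amounts > 400, amounts > 200], ['High', 'Medium'], default='Low')

# Broader condition first: every value above 200 matches 'Medium', so 'High' is never used.
shadowed = np.select([amounts > 200, amounts > 400], ['Medium', 'High'], default='Low')

# The correctly ordered np.select matches the nested np.where version.
nested = np.where(amounts > 400, 'High',
                  np.where(amounts > 200, 'Medium', 'Low'))
```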
Aggregation
amounts = np.array([120, 450, 80, 230, 560])
np.sum(amounts) # 1440
np.mean(amounts) # 288.0
np.median(amounts) # 230.0
np.std(amounts) # standard deviation (population form, ddof=0 by default)
np.var(amounts) # variance (population form, ddof=0 by default)
np.min(amounts) # 80
np.max(amounts) # 560
np.argmin(amounts) # 2 — index of the minimum
np.argmax(amounts) # 4 — index of the maximum
np.percentile(amounts, [25, 50, 75]) # quartiles
np.cumsum(amounts) # running total: [120 570 650 880 1440]
Axis Argument for 2D Arrays
m = np.array([[10, 20, 30],
              [40, 50, 60]])
np.sum(m) # 210 — all elements
np.sum(m, axis=0) # [50, 70, 90] — sum per column (collapse rows)
np.sum(m, axis=1) # [60, 150] — sum per row (collapse columns)
axis=0 collapses rows (works down the columns); axis=1 collapses columns (works across the rows). The same rule applies to every NumPy aggregate function.
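One way to remember the axis rule is that the axis you pass is the one that disappears from the shape. The sketch below checks this on the same matrix and also shows keepdims=True, which keeps the reduced axis with length 1 so the result can be combined with the original array.

```python
import numpy as np

m = np.array([[10, 20, 30],
              [40, 50, 60]])                 # shape (2, 3)

col_sums = m.sum(axis=0)                     # shape (3,): axis 0 is removed
row_sums = m.sum(axis=1)                     # shape (2,): axis 1 is removed

row_sums_2d = m.sum(axis=1, keepdims=True)   # shape (2, 1): axis 1 kept with length 1
row_share = m / row_sums_2d                  # each row of row_share sums to 1
```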
Broadcasting
Broadcasting is NumPy's rule for working with arrays of different shapes without copying data.
amounts = np.array([120, 450, 80, 230, 560]) # shape (5,)
amounts + 100 # the scalar broadcasts to every element
row = np.array([[1, 2, 3]]) # shape (1, 3)
col = np.array([[10], [20], [30]]) # shape (3, 1)
row + col # shape (3, 3) — both broadcast
Broadcasting rule: shapes are compared from the last dimension backwards, and two dimensions are compatible if they are equal or if one of them is 1. A missing leading dimension counts as 1. NumPy stretches the size-1 dimension to match the other one without copying memory.
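The rule can be checked without doing any arithmetic. This sketch assumes NumPy 1.20 or later, where np.broadcast_shapes is available, and uses np.broadcast_to to show that the stretched dimension is not materialized in memory.

```python
import numpy as np

# Resulting shape for two compatible inputs.
outer_shape = np.broadcast_shapes((1, 3), (3, 1))   # (3, 3)
row_shape = np.broadcast_shapes((3, 3), (3,))       # (3, 3): (3,) is treated as (1, 3)

# 3 vs 2 in the last dimension: neither equal nor 1, so the operation fails.
try:
    np.ones((3, 3)) + np.ones(2)
    compatible = True
except ValueError:
    compatible = False

# broadcast_to returns a read-only view; the stride along the stretched
# axis is 0, meaning every "row" points at the same three values.
stretched = np.broadcast_to(np.array([1, 2, 3]), (4, 3))
stride_along_new_axis = stretched.strides[0]
```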
Practical example: subtract the column mean from each column
m = np.array([[10, 20, 30],
              [40, 50, 60],
              [70, 80, 90]])
col_means = m.mean(axis=0) # shape (3,) → [40. 50. 60.]
m - col_means # shape (3,3) − (3,) → broadcasts row by row
Common mistake: shape (n,) vs (n,1)
a = np.array([1, 2, 3]) # shape (3,): 1D
b = np.array([[1], [2], [3]]) # shape (3, 1): 2D column vector
a + a # (3,) + (3,) → (3,): element-wise
a + b # (3,) + (3, 1) → (3, 3): a is treated as (1, 3), so this is an outer sum, not element-wise
When broadcasting gives you a strange shape, check .shape on both sides first.
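A typical way this shows up is subtracting a column vector from a 1D array and silently getting a matrix back. The sketch below uses made-up amounts and discounts (hypothetical names, not from the parking dataset) and shows one possible fix.

```python
import numpy as np

amounts = np.array([120.0, 450.0, 80.0])          # shape (3,)
discounts = np.array([[20.0], [50.0], [0.0]])     # shape (3, 1), e.g. a single-column slice

wrong = amounts - discounts            # shape (3, 3): every amount minus every discount
right = amounts - discounts.ravel()    # shape (3,):   element-wise, as intended
```

No error is raised in the first case, which is why checking .shape is usually the quickest diagnosis.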
Sorting
a = np.array([450, 80, 230, 120, 560])
np.sort(a) # [80, 120, 230, 450, 560] — returns a copy
np.argsort(a) # [1, 3, 2, 0, 4] — the indices that would sort a
a[np.argsort(a)] # same as np.sort(a)
np.sort(a)[::-1] # sort descending
argsort is useful when you want to sort a second array in the same order:
names = np.array(['D', 'B', 'C', 'A', 'E'])
amounts = np.array([450, 80, 230, 120, 560])
names[np.argsort(amounts)] # ['B' 'A' 'C' 'D' 'E'] — names sorted by amount
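The same idea extends to a descending order or a top-k selection. A minimal sketch, reusing the names and amounts arrays from above:

```python
import numpy as np

names = np.array(['D', 'B', 'C', 'A', 'E'])
amounts = np.array([450, 80, 230, 120, 560])

order_desc = np.argsort(amounts)[::-1]    # indices from largest to smallest amount
top3_names = names[order_desc[:3]]        # ['E' 'D' 'C']
top3_amounts = amounts[order_desc[:3]]    # [560 450 230]
```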
Random Numbers
rng = np.random.default_rng(seed=42) # recommended: a Generator with a seed
rng.integers(1, 100, size=5) # 5 random integers in [1, 100)
rng.random(5) # 5 uniform floats in [0, 1)
rng.normal(loc=0, scale=1, size=5) # 5 samples from N(0,1)
rng.choice(np.array([10, 20, 30]), size=3, replace=False) # sample without replacement
rng.shuffle(a) # shuffle in place
Always create a Generator with default_rng(seed) so you can reproduce the same numbers later. Avoid the old np.random.seed() pattern, which relies on global state that can cause hidden bugs in larger projects.
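To illustrate what reproducibility means here, the sketch below builds two Generators from the same seed and checks that they produce the same draws. Each Generator carries its own state, so using one does not disturb the other.

```python
import numpy as np

rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

sample_a = rng_a.integers(1, 100, size=5)
sample_b = rng_b.integers(1, 100, size=5)   # same values as sample_a

# Both generators have advanced by the same amount, so they stay in step.
next_a = rng_a.random(3)
next_b = rng_b.random(3)
```

Passing a Generator into the functions that need randomness, rather than seeding globally, keeps this behaviour local to the code that owns it.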
Common Patterns
1. Replace a Loop with Vectorization
# SLOW: Python loop
result = []
for x in amounts:
    result.append(x * 1.05 if x > 200 else x)
# FAST: vectorized
result = np.where(amounts > 200, amounts * 1.05, amounts)
The vectorized version is usually 10–100× faster, because NumPy operations run in compiled C, not in interpreted Python.
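Before swapping a loop for a vectorized expression, it can help to confirm that both give the same numbers. The following sketch does that for the example above. Note that np.where computes both branches for the whole array and returns a single float64 array, while the loop produces a list with mixed integer and float elements.

```python
import numpy as np

amounts = np.array([120, 450, 80, 230, 560])

# Loop version: one Python-level operation per element.
looped = [x * 1.05 if x > 200 else x for x in amounts]

# Vectorized version: both branches are evaluated array-wide, then selected per element.
vectorized = np.where(amounts > 200, amounts * 1.05, amounts)

same_values = np.allclose(looped, vectorized)
```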
2. Normalize to 0–1 Range (Min-Max Scaling)
normalized = (amounts - amounts.min()) / (amounts.max() - amounts.min())
3. Standardize (Z-Score)
standardized = (amounts - amounts.mean()) / amounts.std()
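One detail worth checking when comparing z-scores against Pandas: NumPy's std defaults to the population form (ddof=0), while Pandas Series.std defaults to the sample form (ddof=1). A short sketch of both variants, using the amounts array from earlier:

```python
import numpy as np

amounts = np.array([120, 450, 80, 230, 560])

# Population form (NumPy default, ddof=0): result has mean 0 and population std 1.
z = (amounts - amounts.mean()) / amounts.std()

# Sample form (ddof=1): this is what you would get with the Pandas default.
z_sample = (amounts - amounts.mean()) / amounts.std(ddof=1)
```

For large arrays the two differ very little; for five values, as here, the difference is noticeable.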
4. Clip Outliers
p5, p95 = np.percentile(amounts, [5, 95])
clipped = np.clip(amounts, p5, p95) # values outside [p5, p95] are capped
5. Unique Values and Counts
values, counts = np.unique(station_ids, return_counts=True)
# values: [1 2 3], counts: [2 2 1]
6. Stack and Split Arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.hstack([a, b]) # [1 2 3 4 5 6] — join along columns (in 1D, just concatenate)
np.vstack([a, b]) # [[1 2 3], [4 5 6]] — stack as rows
np.column_stack([a, b]) # [[1 4], [2 5], [3 6]] — pair into columns
np.split(a, 3) # [array([1]), array([2]), array([3])]
7. NumPy ↔ Pandas
import pandas as pd
# Pandas → NumPy
arr = df['amount'].to_numpy()
mat = df[['amount', 'duration_mins']].to_numpy()
# NumPy → Pandas
s = pd.Series(arr, name='amount')
df = pd.DataFrame(mat, columns=['amount', 'duration_mins'])
# Apply a NumPy function to a Pandas column
df['log_amount'] = np.log(df['amount'])
df['clipped'] = np.clip(df['amount'], 0, 500)
df['tier'] = np.where(df['amount'] > 300, 'High', 'Low')
NumPy functions like np.log and np.clip take a Pandas Series directly and return a Series. np.where also accepts a Series but returns a plain ndarray, which is fine for assigning to a column. You don't need to convert first.
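A quick way to see what each call hands back is to inspect the result type. The sketch below assumes pandas is installed and uses a small made-up DataFrame in place of the parking data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parking DataFrame.
df = pd.DataFrame({'amount': [120, 450, 80, 230, 560]})

logged = np.log(df['amount'])                        # ufunc: returns a Series, index preserved
tier = np.where(df['amount'] > 300, 'High', 'Low')   # returns a plain ndarray, no index

# Assigning the ndarray to a column works because the lengths match.
df['tier'] = tier
```

If the index matters later, wrapping the np.where result with pd.Series(tier, index=df.index) keeps it explicit.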
8. Efficient Membership Test
station_ids = np.array([1, 3, 2, 1, 5, 3])
target = np.array([1, 3])
np.isin(station_ids, target) # [True True False True False True]
station_ids[np.isin(station_ids, target)] # [1 3 1 3]
np.isin is the vectorized version of Python's in operator. On large arrays it is much faster than testing each element with in inside a Python loop.
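The complementary filter, keeping everything that is not in the target set, can be written either by negating the mask or with the invert argument. A minimal sketch using the same arrays:

```python
import numpy as np

station_ids = np.array([1, 3, 2, 1, 5, 3])
target = np.array([1, 3])

mask = np.isin(station_ids, target)
others = station_ids[~mask]                                            # [2 5]
also_others = station_ids[np.isin(station_ids, target, invert=True)]   # same result
```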