11 min read

The Shape of Numpy

Shapes of numpy

The central construct in the numpy package is ndarray, namely, n-dimensional arrays. ndarray is a natural extensions to matrices, just like matrices are extensions to vectors, and vectors are extensions to real numbers.

The shape of any matrix can be described with a pair of two integers (a, b) — the first often being the number of rows in the matrix and the second being the number of columns; such quantities can be nicely visualised using a rectangular table with a rows and b columns.

When describing the shapes of ndarrays, however, things become more complicated since we can have more than two axes. We can certainly try to visualise the structure of a multi-axis ndarray as nested ndarrays — that is, treat each entry in a ndarray as another ndarray. For example, for a ndarray whose shape is (3, 2, 3), we may see it as a (3,)-shaped array of (2, 3)-shaped matrix:

import pandas as pd
import numpy as np
np.random.seed(42)
array = np.random.normal(size = 3 * 2 * 3).reshape((3, 2, 3))
print(array)
## [[[ 0.49671415 -0.1382643   0.64768854]
##   [ 1.52302986 -0.23415337 -0.23413696]]
## 
##  [[ 1.57921282  0.76743473 -0.46947439]
##   [ 0.54256004 -0.46341769 -0.46572975]]
## 
##  [[ 0.24196227 -1.91328024 -1.72491783]
##   [-0.56228753 -1.01283112  0.31424733]]]

or a (3, 2)-shaped matrix with each element being a (3,) array:

## row 	 col 	 value
## 0 	 0 	 [ 0.49671415 -0.1382643   0.64768854]
## 0 	 1 	 [ 1.52302986 -0.23415337 -0.23413696]
## 1 	 0 	 [ 1.57921282  0.76743473 -0.46947439]
## 1 	 1 	 [ 0.54256004 -0.46341769 -0.46572975]
## 2 	 0 	 [ 0.24196227 -1.91328024 -1.72491783]
## 2 	 1 	 [-0.56228753 -1.01283112  0.31424733]

This hierarchical view of ndarray, however, can be difficult to work with especially as the number of axes grows larger, especially considering that one can slice a ndarray along arbitrary axes. Another disadvantage of this conceptual model is that it makes it difficult to visualise the outcomes of aggregation operators such as mean ad max, as well as those that operate on the structures of the array, e.g. transpose.

A more consistent and generalised view to a ndarray is to consider it as a table of indices where each column contains the indices in a particular axis (sometimes called the long-format). The above ndarray, when represented in this view, looks like the following:

##  axis_0  axis_1  axis_2     value
##       0       0       0  0.496714
##       0       0       1 -0.138264
##       0       0       2  0.647689
##       0       1       0  1.523030
##       0       1       1 -0.234153
##       0       1       2 -0.234137
##       1       0       0  1.579213
##       1       0       1  0.767435
##       1       0       2 -0.469474
##       1       1       0  0.542560
##       1       1       1 -0.463418
##       1       1       2 -0.465730
##       2       0       0  0.241962
##       2       0       1 -1.913280
##       2       0       2 -1.724918
##       2       1       0 -0.562288
##       2       1       1 -1.012831
##       2       1       2  0.314247

For a pandas data frame, axis 0 corresponds to the row indices and axis 1 records corresponds to the column indices.

Under this view, slicing can be understood as filtering. For example, slicing the above array with (0:2, :, 0:2) will result in the following table:

##  axis_0  axis_1  axis_2     value
##       0       0       0  0.496714
##       0       0       1 -0.138264
##       0       1       0  1.523030
##       0       1       1 -0.234153
##       1       0       0  1.579213
##       1       0       1  0.767435
##       1       1       0  0.542560
##       1       1       1 -0.463418

which is a (2, 2, 2)-shaped ndarray:

## [[[ 0.49671415 -0.1382643 ]
##   [ 1.52302986 -0.23415337]]
## 
##  [[ 1.57921282  0.76743473]
##   [ 0.54256004 -0.46341769]]]

The axis argument in aggregation functions

Many aggregation methods offered by numpy and pandas has an axis argument. The official documentation of the mean function in numpy describes the behaviour of the argument as follows:

axis : None or int or tuple of ints, optional Axis or axes along which the means are computed. The default is to compute the mean of the flattened array.

The documentation, in my opinion, while accurate, is somewhat hard to remember and visualise. The simplest way I have found to understand the behaviour of the axis argument is to think of it as specifying which axes are to be aggregated by the aggregation function. For example, the following code snippet:

print(array.mean(axis=(0)))
## [[ 0.77262975 -0.42803661 -0.51556789]
##  [ 0.50110079 -0.57013406 -0.12853979]]

is sort-of equivalent to the following if we were to work with the pandas frame:

print(frame.groupby(['axis_1', 'axis_2'])['value'].mean())
## axis_1  axis_2
## 0       0         0.772630
##         1        -0.428037
##         2        -0.515568
## 1       0         0.501101
##         1        -0.570134
##         2        -0.128540
## Name: value, dtype: float64

where the axes not specified in the axis argument (i.e. the second and the third axes) were used as the grouping axes.

We can also specify multiple axes to be aggregated by a function:

print(np.random.normal(size=(64, 8, 32)).mean(axis=(0, 2)).shape)
## (8,)

Judging from the shape of the resultant array, we can see that numpy grouped the ndarray by the second axes (axis=1) and averaged the values from the remaining axes.

numpy Shape and Broadcasting

Broadcasting is a very powerful feature implemented in numpy. It allows us to easily perform pairwise operations along certain axes. For example, let us assume that we have the following two-dimensional data drawn from a multivariate normal distribution:

array = np.random.multivariate_normal(mean=np.random.uniform(size=3)*2, cov=np.ones((3, 3))*0.5 + np.eye(3)*0.5, size=10)
print(array)
## [[ 8.99104287e-02  1.62652849e+00  1.14581045e+00]
##  [-5.23281480e-01 -1.17422831e+00 -2.05469629e-03]
##  [ 1.94776298e+00  2.19683967e+00  1.82631528e+00]
##  [-3.97515736e-01  6.43356118e-01 -1.39690647e+00]
##  [-4.41518700e-01 -1.13264403e+00 -6.32800484e-01]
##  [ 2.83284793e+00  2.43326800e+00  2.29754430e+00]
##  [ 2.54662474e+00  2.65454451e+00  7.93348924e-01]
##  [ 2.86645477e+00  2.01298801e+00  1.13233329e+00]
##  [ 1.82682025e+00  1.76805044e+00  1.24128276e+00]
##  [ 9.09527491e-01  1.34364714e+00 -3.17817399e-01]]

and we want to standardise the matrix by each of its columns. For each column, we need to subtract the mean of the column from its elements and then divide the resulting values with the standard deviation of that column. We can compute the column-wise means and standard deviations utilising our knowledge about how the axis argument works:

col_means = array.mean(axis=0) # compute the means of the matrix using column indices as grouppings (i.e. axis 1)
print(col_means.shape, col_means)
## (3,) [1.16576327 1.237235   0.6087056 ]
col_stds   = array.std(axis=0)
print(col_stds.shape, col_stds)
## (3,) [1.33154069 1.31113337 1.10122451]

Now we can use broadcasting to subtract the means vector from the columns of the matrix.

print((array - col_means) / col_stds)
## [[-0.80797594  0.29691372  0.4877342 ]
##  [-1.26848903 -1.83922046 -0.55461923]
##  [ 0.58728939  0.73188944  1.10568706]
##  [-1.17403773 -0.45295078 -1.82125629]
##  [-1.20708438 -1.80750417 -1.12738689]
##  [ 1.25199678  0.91221307  1.53360072]
##  [ 1.03704039  1.08098042  0.16767092]
##  [ 1.27723585  0.59166598  0.47549586]
##  [ 0.49646022  0.40485236  0.5744307 ]
##  [-0.19243556  0.08116042 -0.84135704]]

Note that numpy automatically figured out the correct axis to apply the operations to. Let’s say we now would like to normalise array row-wise by its row-wise sums; if we calculate it naively like in the above example, we receive an exception:

try:
  print((array / array.sum(axis=1)))
except ValueError as e:
  print(e)
## operands could not be broadcast together with shapes (10,3) (10,)

To see why this failed, note that numpy’s documentation states the following about the rules for broadcasting:

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when

  1. they are equal, or
  2. one of them is 1

In our first example, the shape of the mean and standard deviation vectors (i.e. (3,)) can be matched with the right-most dimensions of the array matrix (i.e. (10, 3)), so numpy will subtract/divide the col_means/col_stds vector from the matched axes for each of the entries indexed by the remaining axes. For the second case, numpy cannot find a valid alignment for the shapes of the two ndarrays. To fix it, we need to reshape the row-wise sum vector into a two-axes ndarray with a trailing dimension of 1:

try:
  print((array / array.sum(axis=1).reshape((-1, 1))))
except ValueError as e:
  print(e)
## [[ 3.14125072e-02  5.68269314e-01  4.00318178e-01]
##  [ 3.07891511e-01  6.90899534e-01  1.20895459e-03]
##  [ 3.26208299e-01  3.67923273e-01  3.05868428e-01]
##  [ 3.45345710e-01 -5.58921964e-01  1.21357625e+00]
##  [ 2.00057118e-01  5.13213824e-01  2.86729058e-01]
##  [ 3.74534001e-01  3.21705091e-01  3.03760908e-01]
##  [ 4.24825594e-01  4.42828669e-01  1.32345737e-01]
##  [ 4.76806644e-01  3.34840817e-01  1.88352539e-01]
##  [ 3.77742409e-01  3.65590227e-01  2.56667364e-01]
##  [ 4.69953287e-01  6.94263115e-01 -1.64216401e-01]]

For more complex cases, suppose we are subtracting a ndarray y from another ndarray X, then here are some example scenarios:

Case 1 :
X.shape:  (a, b, c, d)
y.shape:        (c, d)
X - y  :  valid, subtract y from every element indexed by (a, b) in X

Case 2 :
X.shape:  (a, b, c, d)
y.shape:     (a, c, d)
X - y  :  error, no proper shape alignment can be found

Case 3 :
X.shape:  (a, b, c, d)
y.shape:     (b, 1, d)
X - y  :  valid, because 1-shaped axis can match any shape (i.e. wildcard)

The thrid case can be tricky to visualise; I generally find it easier to imagine an “materialised” y where the trailing 1-shaped axes are expanded to match the actual shapes in X and are all populated with the single element indexed by the right-most axis of y.

An alternative way to visualise the operation makes use of the indexed table view to ndarrays developed earlier — just imagine we are joining the two tables on the axes with matching shapes:

X = np.random.uniform(size = (2, 4, 2, 3))
X_frame = pd.DataFrame.from_records(product([0, 1], [0, 1, 2, 3], [0, 1], [0, 1, 2]))
X_frame.columns = ['a', 'b', 'c', 'd']
X_frame.loc[:, 'value'] = X.reshape((-1,))
print('X:')
## X:
print(X_frame.to_string(index=False))
##  a  b  c  d    value
##  0  0  0  0 0.224303
##  0  0  0  1 0.259675
##  0  0  0  2 0.610952
##  0  0  1  0 0.040834
##  0  0  1  1 0.969459
##  0  0  1  2 0.918673
##  0  1  0  0 0.380768
##  0  1  0  1 0.515422
##  0  1  0  2 0.466119
##  0  1  1  0 0.882200
##  0  1  1  1 0.688602
##  0  1  1  2 0.278111
##  0  2  0  0 0.446841
##  0  2  0  1 0.428966
##  0  2  0  2 0.959335
##  0  2  1  0 0.175311
##  0  2  1  1 0.089576
##  0  2  1  2 0.880619
##  0  3  0  0 0.896686
##  0  3  0  1 0.871739
##  0  3  0  2 0.510977
##  0  3  1  0 0.659744
##  0  3  1  1 0.563363
##  0  3  1  2 0.069448
##  1  0  0  0 0.595835
##  1  0  0  1 0.013614
##  1  0  0  2 0.324987
##  1  0  1  0 0.932293
##  1  0  1  1 0.304045
##  1  0  1  2 0.939827
##  1  1  0  0 0.250134
##  1  1  0  1 0.314562
##  1  1  0  2 0.628407
##  1  1  1  0 0.761627
##  1  1  1  1 0.501388
##  1  1  1  2 0.882116
##  1  2  0  0 0.040240
##  1  2  0  1 0.922195
##  1  2  0  2 0.196308
##  1  2  1  0 0.884239
##  1  2  1  1 0.400863
##  1  2  1  2 0.832757
##  1  3  0  0 0.671666
##  1  3  0  1 0.175942
##  1  3  0  2 0.918048
##  1  3  1  0 0.245517
##  1  3  1  1 0.244140
##  1  3  1  2 0.814179
y = np.random.uniform(size = (4, 1, 3))
y_frame = pd.DataFrame.from_records(product([0, 1, 2, 3], [0], [0, 1, 2]))
y_frame.columns = ['b', 'c', 'd']
y_frame.loc[:, 'value'] = y.reshape((-1,))
print('y:')
## y:
print(y_frame.to_string(index=False))
##  b  c  d    value
##  0  0  0 0.105873
##  0  0  1 0.222036
##  0  0  2 0.503678
##  1  0  0 0.830477
##  1  0  1 0.147484
##  1  0  2 0.547668
##  2  0  0 0.711241
##  2  0  1 0.860595
##  2  0  2 0.935566
##  3  0  0 0.784640
##  3  0  1 0.652828
##  3  0  2 0.433620

Joinint the two data frames on b and d; note that we ignore column c because it only has one value in y_frame (corresponds to its shape being 1 for the c axis):

joint_array = pd.merge(X_frame, y_frame, on=['b', 'd']).assign(value=lambda df: df.loc[:, 'value_x'] - df.loc[:, 'value_y'] ).sort_values(by=['a', 'b', 'c_x', 'd']).loc[:, 'value'].values.reshape(X.shape)
print(joint_array)
## [[[[ 0.11842937  0.03763955  0.1072742 ]
##    [-0.06503971  0.74742266  0.4149956 ]]
## 
##   [[-0.44970946  0.36793847 -0.08154882]
##    [ 0.05172275  0.5411179  -0.26955699]]
## 
##   [[-0.26439934 -0.43162925  0.02376961]
##    [-0.53592968 -0.77101889 -0.05494673]]
## 
##   [[ 0.11204608  0.21891069  0.07735683]
##    [-0.12489549 -0.08946535 -0.36417226]]]
## 
## 
##  [[[ 0.48996164 -0.20842176 -0.17869016]
##    [ 0.82641932  0.08200936  0.43614907]]
## 
##   [[-0.58034307  0.16707827  0.08073926]
##    [-0.06885006  0.35390379  0.33444745]]
## 
##   [[-0.67100017  0.06159961 -0.73925776]
##    [ 0.17299815 -0.45973205 -0.1028085 ]]
## 
##   [[-0.11297436 -0.47688677  0.48442748]
##    [-0.53912252 -0.40868828  0.38055881]]]]

We can compare the resulting array with X - y and see that they are identical:

print((X - y) == joint_array)
## [[[[ True  True  True]
##    [ True  True  True]]
## 
##   [[ True  True  True]
##    [ True  True  True]]
## 
##   [[ True  True  True]
##    [ True  True  True]]
## 
##   [[ True  True  True]
##    [ True  True  True]]]
## 
## 
##  [[[ True  True  True]
##    [ True  True  True]]
## 
##   [[ True  True  True]
##    [ True  True  True]]
## 
##   [[ True  True  True]
##    [ True  True  True]]
## 
##   [[ True  True  True]
##    [ True  True  True]]]]