Shapes of numpy
The central construct in the numpy
package is ndarray
, namely, n-dimensional arrays. ndarray
is a natural extensions to matrices, just like matrices are extensions to vectors, and vectors are extensions to real numbers.
The shape of any matrix can be described with a pair of two integers (a, b)
— the first often being the number of rows in the matrix and the second being the number of columns; such quantities can be nicely visualised using a rectangular table with a
rows and b
columns.
When describing the shapes of ndarray
s, however, things become more complicated since we can have more than two axes. We can certainly try to visualise the structure of a multi-axis ndarray
as nested ndarray
s — that is, treat each entry in a ndarray
as another ndarray
. For example, for a ndarray
whose shape is (3, 2, 3)
, we may see it as a (3,)
-shaped array of (2, 3)
-shaped matrix:
import pandas as pd
import numpy as np
np.random.seed(42)
array = np.random.normal(size = 3 * 2 * 3).reshape((3, 2, 3))
print(array)
## [[[ 0.49671415 -0.1382643 0.64768854]
## [ 1.52302986 -0.23415337 -0.23413696]]
##
## [[ 1.57921282 0.76743473 -0.46947439]
## [ 0.54256004 -0.46341769 -0.46572975]]
##
## [[ 0.24196227 -1.91328024 -1.72491783]
## [-0.56228753 -1.01283112 0.31424733]]]
or a (3, 2)
-shaped matrix with each element being a (3,)
array:
## row col value
## 0 0 [ 0.49671415 -0.1382643 0.64768854]
## 0 1 [ 1.52302986 -0.23415337 -0.23413696]
## 1 0 [ 1.57921282 0.76743473 -0.46947439]
## 1 1 [ 0.54256004 -0.46341769 -0.46572975]
## 2 0 [ 0.24196227 -1.91328024 -1.72491783]
## 2 1 [-0.56228753 -1.01283112 0.31424733]
This hierarchical view of ndarray
, however, can be difficult to work with especially as the number of axes grows larger, especially considering that one can slice a ndarray
along arbitrary axes. Another disadvantage of this conceptual model is that it makes it difficult to visualise the outcomes of aggregation operators such as mean
ad max
, as well as those that operate on the structures of the array, e.g. transpose
.
A more consistent and generalised view to a ndarray
is to consider it as a table of indices where each column contains the indices in a particular axis (sometimes called the long-format). The above ndarray
, when represented in this view, looks like the following:
## axis_0 axis_1 axis_2 value
## 0 0 0 0.496714
## 0 0 1 -0.138264
## 0 0 2 0.647689
## 0 1 0 1.523030
## 0 1 1 -0.234153
## 0 1 2 -0.234137
## 1 0 0 1.579213
## 1 0 1 0.767435
## 1 0 2 -0.469474
## 1 1 0 0.542560
## 1 1 1 -0.463418
## 1 1 2 -0.465730
## 2 0 0 0.241962
## 2 0 1 -1.913280
## 2 0 2 -1.724918
## 2 1 0 -0.562288
## 2 1 1 -1.012831
## 2 1 2 0.314247
For a pandas data frame, axis 0
corresponds to the row indices and axis 1
records corresponds to the column indices.
Under this view, slicing can be understood as filtering. For example, slicing the above array with (0:2, :, 0:2)
will result in the following table:
## axis_0 axis_1 axis_2 value
## 0 0 0 0.496714
## 0 0 1 -0.138264
## 0 1 0 1.523030
## 0 1 1 -0.234153
## 1 0 0 1.579213
## 1 0 1 0.767435
## 1 1 0 0.542560
## 1 1 1 -0.463418
which is a (2, 2, 2)
-shaped ndarray
:
## [[[ 0.49671415 -0.1382643 ]
## [ 1.52302986 -0.23415337]]
##
## [[ 1.57921282 0.76743473]
## [ 0.54256004 -0.46341769]]]
The axis
argument in aggregation functions
Many aggregation methods offered by numpy
and pandas
has an axis
argument. The official documentation of the mean
function in numpy
describes the behaviour of the argument as follows:
axis : None or int or tuple of ints, optional Axis or axes along which the means are computed. The default is to compute the mean of the flattened array.
The documentation, in my opinion, while accurate, is somewhat hard to remember and visualise. The simplest way I have found to understand the behaviour of the axis
argument is to think of it as specifying which axes are to be aggregated by the aggregation function. For example, the following code snippet:
print(array.mean(axis=(0)))
## [[ 0.77262975 -0.42803661 -0.51556789]
## [ 0.50110079 -0.57013406 -0.12853979]]
is sort-of equivalent to the following if we were to work with the pandas frame:
print(frame.groupby(['axis_1', 'axis_2'])['value'].mean())
## axis_1 axis_2
## 0 0 0.772630
## 1 -0.428037
## 2 -0.515568
## 1 0 0.501101
## 1 -0.570134
## 2 -0.128540
## Name: value, dtype: float64
where the axes not specified in the axis
argument (i.e. the second and the third axes) were used as the grouping axes.
We can also specify multiple axes to be aggregated by a function:
print(np.random.normal(size=(64, 8, 32)).mean(axis=(0, 2)).shape)
## (8,)
Judging from the shape of the resultant array, we can see that numpy
grouped the ndarray
by the second axes (axis=1
) and averaged the values from the remaining axes.
numpy
Shape and Broadcasting
Broadcasting is a very powerful feature implemented in numpy
. It allows us to easily perform pairwise operations along certain axes. For example, let us assume that we have the following two-dimensional data drawn from a multivariate normal distribution:
array = np.random.multivariate_normal(mean=np.random.uniform(size=3)*2, cov=np.ones((3, 3))*0.5 + np.eye(3)*0.5, size=10)
print(array)
## [[ 8.99104287e-02 1.62652849e+00 1.14581045e+00]
## [-5.23281480e-01 -1.17422831e+00 -2.05469629e-03]
## [ 1.94776298e+00 2.19683967e+00 1.82631528e+00]
## [-3.97515736e-01 6.43356118e-01 -1.39690647e+00]
## [-4.41518700e-01 -1.13264403e+00 -6.32800484e-01]
## [ 2.83284793e+00 2.43326800e+00 2.29754430e+00]
## [ 2.54662474e+00 2.65454451e+00 7.93348924e-01]
## [ 2.86645477e+00 2.01298801e+00 1.13233329e+00]
## [ 1.82682025e+00 1.76805044e+00 1.24128276e+00]
## [ 9.09527491e-01 1.34364714e+00 -3.17817399e-01]]
and we want to standardise the matrix by each of its columns. For each column, we need to subtract the mean of the column from its elements and then divide the resulting values with the standard deviation of that column. We can compute the column-wise means and standard deviations utilising our knowledge about how the axis
argument works:
col_means = array.mean(axis=0) # compute the means of the matrix using column indices as grouppings (i.e. axis 1)
print(col_means.shape, col_means)
## (3,) [1.16576327 1.237235 0.6087056 ]
col_stds = array.std(axis=0)
print(col_stds.shape, col_stds)
## (3,) [1.33154069 1.31113337 1.10122451]
Now we can use broadcasting to subtract the means vector from the columns of the matrix.
print((array - col_means) / col_stds)
## [[-0.80797594 0.29691372 0.4877342 ]
## [-1.26848903 -1.83922046 -0.55461923]
## [ 0.58728939 0.73188944 1.10568706]
## [-1.17403773 -0.45295078 -1.82125629]
## [-1.20708438 -1.80750417 -1.12738689]
## [ 1.25199678 0.91221307 1.53360072]
## [ 1.03704039 1.08098042 0.16767092]
## [ 1.27723585 0.59166598 0.47549586]
## [ 0.49646022 0.40485236 0.5744307 ]
## [-0.19243556 0.08116042 -0.84135704]]
Note that numpy
automatically figured out the correct axis to apply the operations to. Let’s say we now would like to normalise array
row-wise by its row-wise sums; if we calculate it naively like in the above example, we receive an exception:
try:
print((array / array.sum(axis=1)))
except ValueError as e:
print(e)
## operands could not be broadcast together with shapes (10,3) (10,)
To see why this failed, note that numpy
’s documentation states the following about the rules for broadcasting:
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
- they are equal, or
- one of them is 1
In our first example, the shape of the mean and standard deviation vectors (i.e. (3,)
) can be matched with the right-most dimensions of the array
matrix (i.e. (10, 3)
), so numpy
will subtract/divide the col_means
/col_stds
vector from the matched axes for each of the entries indexed by the remaining axes. For the second case, numpy
cannot find a valid alignment for the shapes of the two ndarray
s. To fix it, we need to reshape the row-wise sum vector into a two-axes ndarray
with a trailing dimension of 1
:
try:
print((array / array.sum(axis=1).reshape((-1, 1))))
except ValueError as e:
print(e)
## [[ 3.14125072e-02 5.68269314e-01 4.00318178e-01]
## [ 3.07891511e-01 6.90899534e-01 1.20895459e-03]
## [ 3.26208299e-01 3.67923273e-01 3.05868428e-01]
## [ 3.45345710e-01 -5.58921964e-01 1.21357625e+00]
## [ 2.00057118e-01 5.13213824e-01 2.86729058e-01]
## [ 3.74534001e-01 3.21705091e-01 3.03760908e-01]
## [ 4.24825594e-01 4.42828669e-01 1.32345737e-01]
## [ 4.76806644e-01 3.34840817e-01 1.88352539e-01]
## [ 3.77742409e-01 3.65590227e-01 2.56667364e-01]
## [ 4.69953287e-01 6.94263115e-01 -1.64216401e-01]]
For more complex cases, suppose we are subtracting a ndarray
y
from another ndarray
X
, then here are some example scenarios:
Case 1 :
X.shape: (a, b, c, d)
y.shape: (c, d)
X - y : valid, subtract y from every element indexed by (a, b) in X
Case 2 :
X.shape: (a, b, c, d)
y.shape: (a, c, d)
X - y : error, no proper shape alignment can be found
Case 3 :
X.shape: (a, b, c, d)
y.shape: (b, 1, d)
X - y : valid, because 1-shaped axis can match any shape (i.e. wildcard)
The thrid case can be tricky to visualise; I generally find it easier to imagine an “materialised” y
where the trailing 1
-shaped axes are expanded to match the actual shapes in X
and are all populated with the single element indexed by the right-most axis of y
.
An alternative way to visualise the operation makes use of the indexed table view to ndarray
s developed earlier — just imagine we are joining the two tables on the axes with matching shapes:
X = np.random.uniform(size = (2, 4, 2, 3))
X_frame = pd.DataFrame.from_records(product([0, 1], [0, 1, 2, 3], [0, 1], [0, 1, 2]))
X_frame.columns = ['a', 'b', 'c', 'd']
X_frame.loc[:, 'value'] = X.reshape((-1,))
print('X:')
## X:
print(X_frame.to_string(index=False))
## a b c d value
## 0 0 0 0 0.224303
## 0 0 0 1 0.259675
## 0 0 0 2 0.610952
## 0 0 1 0 0.040834
## 0 0 1 1 0.969459
## 0 0 1 2 0.918673
## 0 1 0 0 0.380768
## 0 1 0 1 0.515422
## 0 1 0 2 0.466119
## 0 1 1 0 0.882200
## 0 1 1 1 0.688602
## 0 1 1 2 0.278111
## 0 2 0 0 0.446841
## 0 2 0 1 0.428966
## 0 2 0 2 0.959335
## 0 2 1 0 0.175311
## 0 2 1 1 0.089576
## 0 2 1 2 0.880619
## 0 3 0 0 0.896686
## 0 3 0 1 0.871739
## 0 3 0 2 0.510977
## 0 3 1 0 0.659744
## 0 3 1 1 0.563363
## 0 3 1 2 0.069448
## 1 0 0 0 0.595835
## 1 0 0 1 0.013614
## 1 0 0 2 0.324987
## 1 0 1 0 0.932293
## 1 0 1 1 0.304045
## 1 0 1 2 0.939827
## 1 1 0 0 0.250134
## 1 1 0 1 0.314562
## 1 1 0 2 0.628407
## 1 1 1 0 0.761627
## 1 1 1 1 0.501388
## 1 1 1 2 0.882116
## 1 2 0 0 0.040240
## 1 2 0 1 0.922195
## 1 2 0 2 0.196308
## 1 2 1 0 0.884239
## 1 2 1 1 0.400863
## 1 2 1 2 0.832757
## 1 3 0 0 0.671666
## 1 3 0 1 0.175942
## 1 3 0 2 0.918048
## 1 3 1 0 0.245517
## 1 3 1 1 0.244140
## 1 3 1 2 0.814179
y = np.random.uniform(size = (4, 1, 3))
y_frame = pd.DataFrame.from_records(product([0, 1, 2, 3], [0], [0, 1, 2]))
y_frame.columns = ['b', 'c', 'd']
y_frame.loc[:, 'value'] = y.reshape((-1,))
print('y:')
## y:
print(y_frame.to_string(index=False))
## b c d value
## 0 0 0 0.105873
## 0 0 1 0.222036
## 0 0 2 0.503678
## 1 0 0 0.830477
## 1 0 1 0.147484
## 1 0 2 0.547668
## 2 0 0 0.711241
## 2 0 1 0.860595
## 2 0 2 0.935566
## 3 0 0 0.784640
## 3 0 1 0.652828
## 3 0 2 0.433620
Joinint the two data frames on b
and d
; note that we ignore column c
because it only has one value in y_frame
(corresponds to its shape being 1
for the c
axis):
joint_array = pd.merge(X_frame, y_frame, on=['b', 'd']).assign(value=lambda df: df.loc[:, 'value_x'] - df.loc[:, 'value_y'] ).sort_values(by=['a', 'b', 'c_x', 'd']).loc[:, 'value'].values.reshape(X.shape)
print(joint_array)
## [[[[ 0.11842937 0.03763955 0.1072742 ]
## [-0.06503971 0.74742266 0.4149956 ]]
##
## [[-0.44970946 0.36793847 -0.08154882]
## [ 0.05172275 0.5411179 -0.26955699]]
##
## [[-0.26439934 -0.43162925 0.02376961]
## [-0.53592968 -0.77101889 -0.05494673]]
##
## [[ 0.11204608 0.21891069 0.07735683]
## [-0.12489549 -0.08946535 -0.36417226]]]
##
##
## [[[ 0.48996164 -0.20842176 -0.17869016]
## [ 0.82641932 0.08200936 0.43614907]]
##
## [[-0.58034307 0.16707827 0.08073926]
## [-0.06885006 0.35390379 0.33444745]]
##
## [[-0.67100017 0.06159961 -0.73925776]
## [ 0.17299815 -0.45973205 -0.1028085 ]]
##
## [[-0.11297436 -0.47688677 0.48442748]
## [-0.53912252 -0.40868828 0.38055881]]]]
We can compare the resulting array with X - y
and see that they are identical:
print((X - y) == joint_array)
## [[[[ True True True]
## [ True True True]]
##
## [[ True True True]
## [ True True True]]
##
## [[ True True True]
## [ True True True]]
##
## [[ True True True]
## [ True True True]]]
##
##
## [[[ True True True]
## [ True True True]]
##
## [[ True True True]
## [ True True True]]
##
## [[ True True True]
## [ True True True]]
##
## [[ True True True]
## [ True True True]]]]