[In Progress] Textbook Summary: The Principles of Deep Learning Theory

The Principles of Deep Learning Theory - Roberts, Yaida, Hanin

Chapter 0: Initialization

A network $f^\ast$ is constructed by sampling weights $\theta \sim p(\theta)$ and training $\theta \to \theta^*$ so that for input data $x$ the network output $f^\ast(x) := f(x; \theta^\ast)$ approximates the truth $f(x).$ The distribution $p(f^\ast)$ of trained networks is of interest.

Chapter 1: Pretraining

Review of the Gaussian distribution.

Chapter 2: Neural Networks

MLPs suffice as a minimal model for an effective theory of deep learning because other architectures, such as CNNs and transformers, can be interpreted as MLPs with constraints imposed on their weights (e.g., locality and weight sharing).
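
For instance, a 1D convolutional layer is just a fully-connected layer whose weight matrix is constrained to be banded and weight-shared. A minimal NumPy sketch of this point (the sizes and names below are my own illustrative choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 10, 3
x = rng.normal(size=n)        # input vector
w = rng.normal(size=k)        # convolutional kernel

# MLP-style dense weight matrix with the conv layer's constraints baked in:
# each row holds the same kernel w (weight sharing), shifted by one column (locality).
M = np.zeros((n - k + 1, n))
for i in range(n - k + 1):
    M[i, i:i + k] = w

conv_out = np.correlate(x, w, mode="valid")   # what the conv layer computes (cross-correlation)
mlp_out = M @ x                               # the same map, as a constrained dense layer
print(np.allclose(conv_out, mlp_out))         # True
```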

We will assume that the initial biases and weights are drawn independently from zero-mean Gaussians, with variance $C_b$ for the biases and variance $C_w / n_{\ell-1}$ for the layer-$\ell$ weights $W^{(\ell)}$ (where $n_\ell$ denotes the width of the $\ell$th layer, so each weight's variance is $C_w$ divided by its fan-in).
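
A minimal sketch of this initialization scheme in NumPy (the helper name, the layer widths, and the defaults $C_w = 1$, $C_b = 0$ below are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out, C_w=1.0, C_b=0.0):
    """Biases ~ N(0, C_b); each weight ~ N(0, C_w / fan_in)."""
    W = rng.normal(0.0, np.sqrt(C_w / fan_in), size=(fan_out, fan_in))
    b = rng.normal(0.0, np.sqrt(C_b), size=fan_out)   # scale 0 just gives zero biases
    return W, b

# Illustrative widths: input 784, two hidden layers of width 128, output 10.
widths = [784, 128, 128, 10]
params = [init_layer(n_in, n_out) for n_in, n_out in zip(widths[:-1], widths[1:])]
```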

The distribution of interest can then be written as $p(f^\ast) = p \left( \left. z^{(L)} \, \right\vert \, \mathcal{D} \right),$ the distribution of the preactivations $z^{(L)}$ of the final ($L$th) layer after training on the data set $\mathcal{D},$ with the randomness coming from the sampling of the initial parameters $\theta.$

Chapter 3: Effective Theory of Deep Linear Networks at Initialization

This chapter considers deep linear networks without biases. The simplest interesting observable is the two-point correlator $\textrm E \left[z_{i_1 \alpha_1}^{(\ell)} z_{i_2 \alpha_2}^{(\ell)}\right],$ where $z_{i\alpha}^{(\ell)} := z_i^{(\ell)}(x_\alpha)$ is the preactivation of the $i$th neuron in layer $\ell$ on the $\alpha$th data point.

The one-point correlator is simpler but not interesting because it is always zero: loosely, we have

$\begin{align*} \textrm E \left[z_\alpha^{(\ell)}\right] &= \textrm E\left[ W^{(\ell)} W^{(\ell - 1)} \cdots W^{(1)} x_\alpha\right] \\ &= \textrm E\left[ W^{(\ell)} \right] \textrm E\left[ W^{(\ell - 1)} \right] \cdots \textrm E\left[ W^{(1)} \right] \textrm E\left[ x_\alpha \right] \\ &= 0 \cdot 0 \cdots 0 \cdot x_\alpha \\ &= 0 \end{align*}$


since the weights are initialized as independent zero-mean Gaussians.
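
As a sanity check of this claim, here is a small Monte Carlo estimate for a deep linear net at initialization (width, depth, and trial count are arbitrary illustrative choices; $C_w = 1$ and no biases, as in this chapter):

```python
import numpy as np

rng = np.random.default_rng(1)
n, L, trials = 32, 8, 10_000
x = rng.normal(size=n)                                   # a fixed input x_alpha

outputs = np.empty((trials, n))
for t in range(trials):
    z = x
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))   # C_w = 1, no biases
        z = W @ z                                            # deep linear net
    outputs[t] = z

# Each component of the empirical mean should be ~ 0, up to O(1/sqrt(trials)) noise.
print(np.abs(outputs.mean(axis=0)).max())
```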

The two-point correlator satisfies a recursion:

$\begin{align*} \textrm E \left[ z_{i_1 \alpha_1}^{(\ell + 1)} z_{i_2 \alpha_2}^{(\ell + 1)} \right] &= \textrm E \left[ \sum_{j} W_{i_1 j}^{(\ell+1)} z_{j \alpha_1}^{(\ell)} \sum_{k} W_{i_2 k}^{(\ell+1)} z_{k \alpha_2}^{(\ell)} \right] \\[5pt] &= \sum_{j, k} \textrm E\left[ W_{i_1 j}^{(\ell+1)} W_{i_2 k}^{(\ell+1)} \right] \textrm E\left[ z_{j \alpha_1}^{(\ell)} z_{k \alpha_2}^{(\ell)}\right] \\[5pt] &= \sum_{j, k} \delta_{i_1 i_2} \, \delta_{jk} \, \dfrac{C_w}{n_\ell} \, \textrm E\left[ z_{j \alpha_1}^{(\ell)} z_{k \alpha_2}^{(\ell)}\right] \\[5pt] &= \delta_{i_1 i_2} \, \dfrac{C_w}{n_\ell} \sum_j \textrm E\left[ z_{j \alpha_1}^{(\ell)} z_{j \alpha_2}^{(\ell)}\right]. \end{align*}$


Here the second line uses the fact that the layer-$(\ell+1)$ weights are independent of the layer-$\ell$ preactivations, the third uses the weight two-point function $\textrm E\left[ W_{i_1 j}^{(\ell+1)} W_{i_2 k}^{(\ell+1)} \right] = \delta_{i_1 i_2} \, \delta_{jk} \, C_w / n_\ell,$ and the last performs the sum over $k.$ The $\delta_{i_1 i_2}$ factor makes the right-hand side vanish unless $i_1 = i_2$; setting $i_1 = i_2 = i$ gives

$\begin{align*} \textrm E \left[ z_{i \alpha_1}^{(\ell + 1)} z_{i \alpha_2}^{(\ell + 1)} \right] &= \dfrac{C_w}{n_\ell} \sum_j \textrm E\left[ z_{j \alpha_1}^{(\ell)} z_{j \alpha_2}^{(\ell)}\right] \end{align*}$


We can simplify this into a geometric recursion. Since the LHS is the same for every $i$ (by symmetry among the neurons), averaging it over $i$ changes nothing, and we can write

$\begin{align*} \dfrac{1}{n_{\ell+1}} \sum_i \textrm E \left[ z_{i \alpha_1}^{(\ell + 1)} z_{i \alpha_2}^{(\ell + 1)} \right] &= \dfrac{C_w}{n_\ell} \sum_j \textrm E\left[ z_{j \alpha_1}^{(\ell)} z_{j \alpha_2}^{(\ell)}\right], \end{align*}$


and by defining $G_{\alpha_1 \alpha_2}^{(\ell)} := \dfrac{1}{n_\ell} \sum\limits_{j} \textrm E\left[ z_{j \alpha_1}^{(\ell)} z_{j \alpha_2}^{(\ell)}\right]$ we reach

$\begin{align*} G_{\alpha_1 \alpha_2}^{(\ell+1)} &= C_w G_{\alpha_1 \alpha_2}^{(\ell)}. \end{align*}$


So the distribution of initial weights directly controls the numerical stability of the net: iterating the recursion gives $G_{\alpha_1 \alpha_2}^{(\ell)} = C_w^{\ell} \, G_{\alpha_1 \alpha_2}^{(0)},$ so unless $C_w = 1$ the preactivations explode $(C_w > 1)$ or vanish $(C_w < 1)$ exponentially with the depth of the net.
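
This geometric behavior is easy to check numerically. The sketch below estimates $G^{(L)} = \frac{1}{n} \sum_j \textrm E\left[\left(z_j^{(L)}\right)^2\right]$ for a single fixed input by averaging over random initializations (the width, depth, and trial count are illustrative choices, not values from the book):

```python
import numpy as np

rng = np.random.default_rng(2)
n, L, trials = 64, 30, 200       # illustrative sizes

def estimate_G(C_w):
    """Monte Carlo estimate of G^(L) = (1/n) sum_j E[(z_j^(L))^2] for one fixed input."""
    x = rng.normal(size=n)
    G0 = np.mean(x ** 2)                 # G^(0) = (1/n) ||x||^2
    GL = 0.0
    for _ in range(trials):
        z = x
        for _ in range(L):
            W = rng.normal(0.0, np.sqrt(C_w / n), size=(n, n))   # deep linear net, no biases
            z = W @ z
        GL += np.mean(z ** 2) / trials   # running average of (1/n) ||z^(L)||^2
    return G0, GL

# The estimates should track C_w^L * G^(0) up to Monte Carlo error.
for C_w in (0.9, 1.0, 1.1):
    G0, GL = estimate_G(C_w)
    print(f"C_w = {C_w}:  G^(0) = {G0:.3f}   G^(L) = {GL:.3e}   theory C_w^L G^(0) = {C_w ** L * G0:.3e}")
```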

A similar analysis applies to general higher-point correlators. Writing $G_{2m}^{(\ell)}$ for the $2m$-point correlator of the layer-$\ell$ preactivations (suppressing its neuron and sample indices) and doing plenty of math, we reach the recursion

$\begin{align*} G_{2m}^{(\ell+1)} &= c_{2m}(n_\ell) C_w^m G_{2m}^{(\ell)} \end{align*}$


where

$\begin{align*} c_{2m}(n) &= \left( 1 + \dfrac{2}{n} \right) \left( 1 + \dfrac{4}{n} \right) \cdots \left( 1 + \dfrac{2m-2}{n} \right). \end{align*}$
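
For example, evaluating the first two cases of this formula (the product is empty when $m = 1$):

$\begin{align*} c_2(n) &= 1, \qquad c_4(n) = 1 + \dfrac{2}{n}, \end{align*}$

so at any width the two-point recursion is exactly geometric, while the four-point correlator picks up an extra factor of $1 + 2/n$ at every layer.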


For a net with an equal number of hidden nodes $n_\ell = n$ for all layers $\ell,$ we have the following asymptotic behavior:

  • If the net has infinite width $(n \to \infty)$ and fixed depth $L,$ then $c_{2m}(n) \to 1$ for every $m,$ so (at criticality, $C_w = 1$) the correlators reduce to those of a Gaussian distribution, just as for a single-layer net. The net is effectively not deep at all.
  • If the net has infinite depth $(L \to \infty)$ and fixed width $(n),$ then $c_{2m}(n) > 1$ for $m \geq 2,$ so the higher-point correlators, and with them the fluctuations of the preactivations, blow up with depth even if $C_w = 1.$

So the sensible regime is the one in which the depth-to-width ratio $r := L / n$ is held fixed and small: $r$ is the parameter that controls how far the net deviates from the infinite-width (Gaussian) limit.
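
To make the role of $r$ concrete, here is a quick evaluation of the cumulative four-point factor $c_4(n)^L = (1 + 2/n)^L$ at criticality $(C_w = 1)$; the particular depths and widths are illustrative choices:

```python
import numpy as np

def c4_total(L, n):
    """Cumulative factor multiplying the 4-point correlator after L layers (C_w = 1)."""
    return (1.0 + 2.0 / n) ** L

# Scaling depth and width together (fixed r = L/n) keeps the factor near exp(2r)...
for L, n in [(10, 100), (50, 500), (250, 2500)]:
    print(f"L = {L:4d}, n = {n:5d}, r = {L / n:.2f}:  c4^L = {c4_total(L, n):.4f}   exp(2r) = {np.exp(2 * L / n):.4f}")

# ...while growing the depth at fixed width blows it up.
for L in (10, 100, 1000):
    print(f"L = {L:5d}, n =   100, r = {L / 100:.2f}:  c4^L = {c4_total(L, 100):.4g}")
```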

Chapter 4