Weight Initialization
- Each layer has a weight matrix \(\textbf{W}^{(l)}\) and a bias vector \(\textbf{b}^{(l)}\) that need to be initialized before training
- When initializing \(\textbf{W}^{(l)}\) and \(\textbf{b}^{(l)}\)
- the values should not all be equal, so that symmetric updates across neurons are broken
- the values should have a suitable variance, so that activations and gradients neither vanish nor explode as they pass through the layers
- Two popular initialization methods from the literature
- Xavier initialization
- normal : \(w_{ij} \sim \mathcal{N}\left(0,\sqrt{\frac{2}{N_i+N_o}}\right)\)
- uniform : \(w_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{N_i+N_o}},\sqrt{\frac{6}{N_i+N_o}}\right)\)
- He initialization
- normal : \(w_{ij} \sim \mathcal{N}\left(0,\sqrt{\frac{2}{N_i}}\right)\)
- uniform : \(w_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{N_i}},\sqrt{\frac{6}{N_i}}\right)\)
where \(N_i\) is the input vector size of the \(l\)th layer and \(N_o\) is its output vector size.
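A minimal NumPy sketch of these rules (the function names, the zero bias initialization, and the \(N_o \times N_i\) weight shape are illustrative assumptions, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so runs are reproducible

def xavier_normal(n_in, n_out):
    # Xavier (Glorot) normal: standard deviation sqrt(2 / (N_i + N_o))
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def xavier_uniform(n_in, n_out):
    # Xavier (Glorot) uniform: limit sqrt(6 / (N_i + N_o))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out):
    # He normal: standard deviation sqrt(2 / N_i)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def he_uniform(n_in, n_out):
    # He uniform: limit sqrt(6 / N_i)
    limit = np.sqrt(6.0 / n_in)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

# Example: a layer mapping 64 inputs to 32 outputs; biases simply start at zero
W = he_normal(64, 32)
b = np.zeros(32)
```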
Loss functions
They are the objective functions of neural networks; the loss for record \(m\) is written \(L_m, \ m=1,2,\dots,N_{data}\)
The sum of the loss values over all records in the dataset is called the cost function \(J\) \[J(\textbf{W},\textbf{b}) = \sum_{m=1}^{N_{data}} L_m\]
- A loss function should have the following properties
- continuous
- sufficiently smooth
- convex
- For regression problems, popular choices of cost functions are
- Mean Squared Error (MSE) \(J(\textbf{W},\textbf{b}) = \frac{1}{N_{data}}\sum_{m=1}^{N_{data}} L_m\)
- Sum of Squared Errors (SSE) \(J(\textbf{W},\textbf{b}) = \sum_{m=1}^{N_{data}} L_m\)
where \(L_m = \left(y_m - \hat{y}_m\right)^2\)
- \(y_m\) - expected (target) value
- \(\hat{y}_m\) - inferred (predicted) value
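A minimal NumPy sketch of these two cost functions (the array values are made up purely for illustration):

```python
import numpy as np

def sse(y, y_hat):
    # Sum of Squared Errors: J = sum_m (y_m - y_hat_m)^2
    return np.sum((y - y_hat) ** 2)

def mse(y, y_hat):
    # Mean Squared Error: J = (1 / N_data) * sum_m (y_m - y_hat_m)^2
    return np.mean((y - y_hat) ** 2)

y     = np.array([1.0, 2.0, 3.0])   # expected (target) values
y_hat = np.array([1.1, 1.9, 3.2])   # inferred (predicted) values
print(sse(y, y_hat))  # ~0.06
print(mse(y, y_hat))  # ~0.02
```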
Back-propagation
- It is the algorithm for computing the gradients of the loss function w.r.t. \(\textbf{W}^{(l)}\) and \(\textbf{b}^{(l)}\)
Let's take the same two-layer network as before
The equations for the hidden layer are \[ \begin{align} \textbf{z}_h &= \textbf{W}_{hi}\textbf{x}+\textbf{b}_h \\ \textbf{a}_h &= f_h\left(\textbf{z}_h\right) \end{align} \]
The equations for the output layer are \[ \begin{align} \textbf{z}_o &= \textbf{W}_{oh}\textbf{a}_h+\textbf{b}_o \\ \textbf{a}_o &= f_o\left(\textbf{z}_o\right) \end{align} \]
- Combining them by substitution gives a single equation for our network \[ \textbf{a}_o = f_{NN}(\textbf{x}) = f_o\left(\textbf{W}_{oh} \left( f_h\left(\textbf{W}_{hi}\textbf{x}+\textbf{b}_h\right)\right)+\textbf{b}_o\right) \]
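A minimal NumPy sketch of this forward pass (the layer sizes, tanh hidden activation, and linear output activation are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary layer sizes for illustration: 3 inputs -> 4 hidden neurons -> 2 outputs
W_hi, b_h = rng.standard_normal((4, 3)), np.zeros(4)
W_oh, b_o = rng.standard_normal((2, 4)), np.zeros(2)

def f_h(z):
    return np.tanh(z)   # assumed hidden activation

def f_o(z):
    return z            # assumed linear output activation

def forward(x):
    # a_o = f_o(W_oh f_h(W_hi x + b_h) + b_o)
    z_h = W_hi @ x + b_h
    a_h = f_h(z_h)
    z_o = W_oh @ a_h + b_o
    return f_o(z_o)

print(forward(np.array([0.5, -1.0, 2.0])))
```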
- Let's take a simpler network with a single neuron in each layer \[ {a}_o = f_{NN}({x}) = f_o\left({w}_{oh} \left( f_h\left({w}_{hi}{x}+{b}_h\right)\right)+{b}_o\right) \]
- Here, the neural network is a composite function.
- Let's take the loss function to be the squared error, \(L_m=(u-a_o)^2\), where \(u\) is the target value
- For convenience, let's split the combined equation into the individual layer equations \[ z_{h} = w_{hi}x+b_h, \ \ a_h = f_h(z_h) \\ z_{o} = w_{oh}a_h+b_o, \ \ a_o = f_o(z_o) \]
Computing the gradients using the chain rule, starting with the innermost weight \(w_{hi}\)
\[ \begin{align} \frac{\partial L_m}{\partial w_{hi}} &= \frac{\partial L_m}{\partial a_h} \frac{\partial a_h}{\partial z_h}\frac{\partial z_h}{\partial w_{hi}} = \frac{\partial L_m}{\partial z_h} \frac{\partial z_h}{\partial w_{hi}} = \delta_h\frac{\partial z_h}{\partial w_{hi}} = \delta_h . x \\ \frac{\partial L_m}{\partial b_h} &= \frac{\partial L_m}{\partial z_h} \frac{\partial z_h}{\partial b_h} = \delta_h . 1 = \delta_h \end{align} \]
where \(\delta_h = \frac{\partial L_m}{\partial z_h}\), and \(\frac{\partial z_h}{\partial w_{hi}} = x\), \(\frac{\partial z_h}{\partial b_h} = 1\) follow directly from \(z_h = w_{hi}x + b_h\)
Hidden layer derivatives
\[\begin{align} \frac{\partial L_m}{\partial w_{hi}}&= \delta_h . x \\ \frac{\partial L_m}{\partial b_{h}} &= \delta_h \\ \delta_h &= \frac{\partial L_m}{\partial z_h} \end{align}\]
- To find \(\delta_h\), apply the chain rule again \[ \begin{align} \delta_h &= \frac{\partial L_m}{\partial z_h} =\frac{\partial L_m}{\partial a_h} \frac{\partial a_h}{\partial z_h} = \frac{\partial L_m}{\partial z_o}\frac{\partial z_o}{\partial a_h}\frac{\partial a_h}{\partial z_h} = \delta_o . w_{oh} . f'_h(z_h) \end{align} \] where \(\delta_o = \frac{\partial L_m}{\partial z_o}\) is the corresponding quantity for the output layer, computed below
- Computing the output-layer derivatives \[ \begin{align} \frac{\partial L_m}{\partial w_{oh}} &= \frac{\partial L_m}{\partial z_o} \frac{\partial z_o}{\partial w_{oh}} = \delta_o . a_h \\ \frac{\partial L_m}{\partial b_{o}} &= \frac{\partial L_m}{\partial z_o} \frac{\partial z_o}{\partial b_{o}} = \delta_o \end{align} \]
- Computing \(\delta_o\) \[ \delta_o = \frac{\partial L_m}{\partial z_o} = \frac{\partial L_m}{\partial a_o}\frac{\partial a_o}{\partial z_o} = L_m' . f_o'(z_o) \]
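For the squared-error loss used here, \(L_m'\) can be evaluated explicitly, which gives \(\delta_o\) in closed form:
\[ L_m' = \frac{\partial}{\partial a_o}\left(u - a_o\right)^2 = -2\left(u - a_o\right), \qquad \delta_o = -2\left(u - a_o\right) f_o'(z_o) \]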
Collecting the results, the gradients for all scalar design variables (weights and biases) are
Hidden layer \[ \begin{align} \frac{\partial L_m}{\partial w_{hi}}&= \delta_h . x \\ \frac{\partial L_m}{\partial b_{h}} &= \delta_h \\ \delta_h &= \frac{\partial L_m}{\partial z_h} = \delta_o . w_{oh} . f'_h(z_h)\\ \end{align} \]
Output layer \[ \begin{align} \frac{\partial L_m}{\partial w_{oh}} &= \delta_o . a_h \\ \frac{\partial L_m}{\partial b_{o}} &= \delta_o \\ \delta_o &= \frac{\partial L_m}{\partial z_o} = L_m' . f_o'(z_o) \\ \end{align} \]
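A minimal sketch of these scalar equations in Python (assuming a tanh hidden activation, a linear output activation, and the squared-error loss; the finite-difference check at the end only illustrates that the analytic gradients are consistent):

```python
import numpy as np

# Assumed activations: tanh hidden unit, linear output unit
def f_h(z):  return np.tanh(z)
def df_h(z): return 1.0 - np.tanh(z) ** 2
def f_o(z):  return z
def df_o(z): return 1.0

def loss_and_grads(x, u, w_hi, b_h, w_oh, b_o):
    # Forward pass through the single-neuron-per-layer network
    z_h = w_hi * x + b_h
    a_h = f_h(z_h)
    z_o = w_oh * a_h + b_o
    a_o = f_o(z_o)
    L = (u - a_o) ** 2
    # Backward pass: the scalar back-propagation equations above
    d_o = -2.0 * (u - a_o) * df_o(z_o)   # delta_o = L_m' f_o'(z_o)
    d_h = d_o * w_oh * df_h(z_h)         # delta_h = delta_o w_oh f_h'(z_h)
    grads = {"w_oh": d_o * a_h, "b_o": d_o, "w_hi": d_h * x, "b_h": d_h}
    return L, grads

# Finite-difference check on w_hi
x, u = 0.7, 1.3
params = {"w_hi": 0.5, "b_h": 0.1, "w_oh": -0.8, "b_o": 0.2}
L, g = loss_and_grads(x, u, **params)
eps = 1e-6
L_eps, _ = loss_and_grads(x, u, **{**params, "w_hi": params["w_hi"] + eps})
print(g["w_hi"], (L_eps - L) / eps)      # the two numbers should agree closely
```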
In vector form, the layer equations are \[ \textbf{z}_{h} = \textbf{W}_{hi}\textbf{x}+\textbf{b}_h, \ \ \textbf{a}_h = f_h(\textbf{z}_h) \\ \textbf{z}_{o} = \textbf{W}_{oh}\textbf{a}_h+\textbf{b}_o, \ \ \textbf{a}_o = f_o(\textbf{z}_o) \]
Loss function \(L_m = \lVert\textbf{u} - \textbf{a}_o\rVert^2\)
Gradients of all vector design variables (weights and biases)
Hidden layer \[ \begin{align} \frac{\partial L_m}{\partial \textbf{W}_{hi}}&= \vec{\delta}_h . \textbf{x}^T \\ \frac{\partial L_m}{\partial \textbf{b}_{h}} &= \vec{\delta}_h \\ \vec{\delta}_h &= \frac{\partial L_m}{\partial \textbf{z}_h} = \textbf{W}_{oh}^T \vec{\delta}_o \odot f'_h(\textbf{z}_h)\\ \end{align} \]
Output layer \[ \begin{align} \frac{\partial L_m}{\partial \textbf{W}_{oh}} &= \vec{\delta}_o . \textbf{a}_h^T \\ \frac{\partial L_m}{\partial \textbf{b}_{o}} &= \vec{\delta}_o \\ \vec{\delta}_o &= \frac{\partial L_m}{\partial \textbf{z}_o} = L_m' \odot f_o'(\textbf{z}_o) \\ \end{align} \]
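A minimal NumPy sketch of the vector form for a single training record (again assuming tanh hidden units, a linear output layer, and the squared-error loss; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 3, 4, 2           # arbitrary sizes for illustration

W_hi, b_h = 0.5 * rng.standard_normal((n_hid, n_in)), np.zeros(n_hid)
W_oh, b_o = 0.5 * rng.standard_normal((n_out, n_hid)), np.zeros(n_out)

x = rng.standard_normal(n_in)          # one input record
u = rng.standard_normal(n_out)         # its target vector

# Forward pass
z_h = W_hi @ x + b_h
a_h = np.tanh(z_h)                     # assumed f_h = tanh
z_o = W_oh @ a_h + b_o
a_o = z_o                              # assumed linear output, f_o(z) = z

# Backward pass: the vector back-propagation equations above
dL_da_o = -2.0 * (u - a_o)             # L_m' for the squared-error loss
delta_o = dL_da_o * 1.0                # elementwise product with f_o'(z_o) = 1
delta_h = (W_oh.T @ delta_o) * (1.0 - a_h ** 2)   # W_oh^T delta_o (.) f_h'(z_h)

grad_W_oh = np.outer(delta_o, a_h)     # delta_o a_h^T
grad_b_o  = delta_o
grad_W_hi = np.outer(delta_h, x)       # delta_h x^T
grad_b_h  = delta_h
```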