Data normalization

  • The process of scaling each input and output variable to the same fixed range

  • In the same example network

  • Input variables are \(x\) and \(y\)

  • Outputs are \(u\), \(v\) and \(p\)

  • And let's say we have \(N_{data}\) data points

  • Normalization has to be performed individually for each variable
  • For example, the normalization equation for \(x\) is \[ \bar{x} = \frac{x - min(x)}{max(x) - min(x)}, \ \ \bar{x} \in [0,1] \]
  • The normalization range depends on the range of the activation function (see the sketch below)
  • The sigmoid function maps to \((0,1)\), while the hyperbolic tangent maps to \((-1,1)\)
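A minimal NumPy sketch of this per-variable min-max normalization; the column layout, sample count, and scaling ranges are illustrative assumptions, not part of the original example.

```python
import numpy as np

def min_max_normalize(data, lo=0.0, hi=1.0):
    """Scale each column of `data` independently to the range [lo, hi].

    Use lo=0, hi=1 for sigmoid activations and lo=-1, hi=1 for tanh.
    """
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    scaled = (data - col_min) / (col_max - col_min)       # each column now in [0, 1]
    return lo + (hi - lo) * scaled, col_min, col_max      # keep min/max to undo later

# Hypothetical dataset: N_data = 100 rows, columns [x, y, u, v, p] with very
# different ranges, as would occur before normalization.
raw = np.random.rand(100, 5) * np.array([10.0, 5.0, 2.0, 0.5, 100.0])
normalized, col_min, col_max = min_max_normalize(raw, lo=-1.0, hi=1.0)  # tanh range
```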

Reason for data normalization

  • Normalization is done to give equal significance to all input/output variables

Back-propagation equations have chain-linked gradients

Hidden layer \[ \begin{align} \frac{\partial L_m}{\partial \textbf{W}_{hi}}&= \vec{\delta}_h . \textbf{x}^T \\ \frac{\partial L_m}{\partial \textbf{b}_{h}} &= \vec{\delta}_h \\ \vec{\delta}_h &= \frac{\partial L_m}{\partial \textbf{z}_h} = \textbf{W}_{oh}^T \vec{\delta}_o \odot f'_h(\textbf{z}_h)\\ \end{align} \]

Output layer \[ \begin{align} \frac{\partial L_m}{\partial \textbf{W}_{oh}} &= \vec{\delta}_o . \textbf{a}_h^T \\ \frac{\partial L_m}{\partial \textbf{b}_{o}} &= \vec{\delta}_o \\ \vec{\delta}_o &= \frac{\partial L_m}{\partial \textbf{z}_o} = L_m' \odot f_o'(\textbf{z}_o) \\ \end{align} \]
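As an illustration, the two blocks of equations above map directly onto a few lines of NumPy. The layer sizes, tanh activations, and mean-squared loss (so that \(L_m'\) is the output error) are assumptions for this sketch, not part of the original derivation.

```python
import numpy as np

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2   # derivative of tanh

# Assumed shapes: 2 inputs (x, y), 5 hidden neurons, 3 outputs (u, v, p)
rng = np.random.default_rng(0)
W_hi, b_h = rng.normal(size=(5, 2)), np.zeros((5, 1))
W_oh, b_o = rng.normal(size=(3, 5)), np.zeros((3, 1))

x      = rng.normal(size=(2, 1))   # one normalized input sample
target = rng.normal(size=(3, 1))   # its normalized output

# Forward pass
z_h = W_hi @ x + b_h
a_h = np.tanh(z_h)
z_o = W_oh @ a_h + b_o
a_o = np.tanh(z_o)

# Backward pass (assuming a mean-squared loss, so L_m' = a_o - target)
delta_o = (a_o - target) * tanh_prime(z_o)       # output-layer delta
dW_oh   = delta_o @ a_h.T                        # dL_m/dW_oh
db_o    = delta_o                                # dL_m/db_o
delta_h = (W_oh.T @ delta_o) * tanh_prime(z_h)   # hidden-layer delta
dW_hi   = delta_h @ x.T                          # dL_m/dW_hi
db_h    = delta_h                                # dL_m/db_h
```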

  • Un-normalized data leads to biased weight updates: variables with larger magnitudes receive larger updates.
  • Because of back-propagation, this can keep the network model from converging/training due to
    • vanishing gradients for inputs/outputs with a smaller range
    • neuron saturation for inputs/outputs with a larger range

Updating the weights

Learning rate \(\alpha\)

  • It is the step size used in gradient descent in the design space (here called the weight space)

  • weights and biases of each layer will be updated in the same way as the optimization problem \[ \begin{align} \textbf{W}^{(l)} :&= \textbf{W}^{(l)} - \alpha \nabla_{\textbf{W}^{(l)}} L_m \\ \textbf{b}^{(l)} :&= \textbf{b}^{(l)} - \alpha \nabla_{\textbf{b}^{(l)}} L_m \\ \end{align} \]

  • Here, unlike in optimization problems, \(\alpha\) is kept fixed due to computational complexity (see the sketch below)
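Continuing the back-propagation sketch above, the fixed-learning-rate update is one line per parameter; \(\alpha = 0.01\) is an arbitrary illustrative value.

```python
# Gradient-descent update with a fixed learning rate (alpha chosen arbitrarily)
alpha = 0.01
W_hi -= alpha * dW_hi
b_h  -= alpha * db_h
W_oh -= alpha * dW_oh
b_o  -= alpha * db_o
```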

Training the network

  • It is the process of optimizing weights and biases of the neural network to model the target function

Flow diagram of optimization process

There are two fundamental optimization algorithms; the others are derived from them

  • Stochastic Gradient Descent
  • Batch Gradient Descent

Optimization algorithms - Stochastic Gradient Descent

  • In this algorithm, the weights are updated after each record in the dataset (see the sketch below)
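A minimal sketch of stochastic gradient descent on a stand-alone linear least-squares model (synthetic data, arbitrary learning rate); the same record-by-record update pattern applies to the network weights above.

```python
import numpy as np

# Synthetic dataset: N_data = 100 records with 2 inputs each
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
Y = X @ np.array([2.0, -1.0]) + 0.5
w, b, alpha = np.zeros(2), 0.0, 0.01

for epoch in range(20):
    for x, y in zip(X, Y):            # one weight update per record
        err = (w @ x + b) - y         # residual of this single record
        w -= alpha * err * x          # gradient of 0.5*err**2 w.r.t. w
        b -= alpha * err              # gradient of 0.5*err**2 w.r.t. b
```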

Optimization algorithms - Batch Gradient Descent

  • In this algorithm, the weights are updated only after computing gradients for all the records in the dataset (see the sketch below)
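In contrast, a batch gradient descent sketch (reusing X, Y, and alpha from the SGD example above) accumulates gradients over the full dataset before each single update.

```python
# Batch gradient descent: one update per iteration, using every record
w, b = np.zeros(2), 0.0
for iteration in range(200):
    err = X @ w + b - Y                    # residuals for all records at once
    w -= alpha * (X.T @ err) / len(Y)      # mean gradient over the full dataset
    b -= alpha * err.mean()
```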

Optimization algorithms - summary

  • Batch Gradient Descent
    • weights are updated at the end of each iteration using the full dataset
    • takes fewer iterations to converge, as each update is performed with all the data at once
    • used for small datasets
  • Stochastic Gradient Descent
    • weights are updated after each record in the dataset
    • memory efficient - only one record's gradients are stored at a time
    • used for large datasets
  • Other optimization algorithms are variants of these two
    • mini-batch gradient descent (sketched below)
    • Adaptive Gradient (AdaGrad)
    • Adaptive Moment Estimation (Adam)
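For completeness, a mini-batch gradient descent sketch (again reusing X, Y, and alpha from the sketches above, with an arbitrary batch size) shows the compromise between the two extremes; AdaGrad and Adam additionally adapt the step size per parameter.

```python
# Mini-batch gradient descent: update after every `batch_size` records
batch_size = 16
w, b = np.zeros(2), 0.0
for epoch in range(50):
    order = np.random.permutation(len(Y))          # shuffle records each epoch
    for start in range(0, len(Y), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w + b - Y[idx]              # residuals for this mini-batch
        w -= alpha * (X[idx].T @ err) / len(idx)   # mean gradient over the batch
        b -= alpha * err.mean()
```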