Data normalization

  • The process of scaling each input and output variable to the same fixed range

  • In the same example network

  • Input variables are \(x\) and \(y\)

  • Outputs are \(u\), \(v\) and \(p\)

  • And let's say we have \(N_{data}\) data points

  • Normalization has to be performed individually for each variable
  • For example, the normalization equation for \(x\) is \[ \bar{x} = \frac{x - min(x)}{max(x) - min(x)}, \ \ \bar{x} \in [0,1] \]
  • The normalization range depends on the range of the activation function (see the sketch below)
  • The sigmoid function maps to \((0,1)\), while the hyperbolic tangent maps to \((-1,1)\)
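A minimal NumPy sketch of this per-variable min-max normalization; the column layout, sample count, and scaling ranges are illustrative assumptions, not part of the original example.

```python
import numpy as np

def min_max_normalize(data, lo=0.0, hi=1.0):
    """Scale each column of `data` independently to the range [lo, hi].

    Use lo=0, hi=1 for sigmoid activations and lo=-1, hi=1 for tanh.
    """
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    scaled = (data - col_min) / (col_max - col_min)       # each column now in [0, 1]
    return lo + (hi - lo) * scaled, col_min, col_max      # keep min/max to undo later

# Hypothetical dataset: N_data = 100 rows, columns [x, y, u, v, p] with very
# different ranges, as would occur before normalization.
raw = np.random.rand(100, 5) * np.array([10.0, 5.0, 2.0, 0.5, 100.0])
normalized, col_min, col_max = min_max_normalize(raw, lo=-1.0, hi=1.0)  # tanh range
```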

Reason for data normalization

  • Normalization is done to give equal significance to all input/output variables

Back-propagation equations have chain-linked gradients

Hidden layer \[ \begin{align} \frac{\partial L_m}{\partial \textbf{W}_{hi}}&= \vec{\delta}_h . \textbf{x}^T \\ \frac{\partial L_m}{\partial \textbf{b}_{h}} &= \vec{\delta}_h \\ \vec{\delta}_h &= \frac{\partial L_m}{\partial \textbf{z}_h} = \textbf{W}_{oh}^T \vec{\delta}_o \odot f'_h(\textbf{z}_h)\\ \end{align} \]

Output layer \[ \begin{align} \frac{\partial L_m}{\partial \textbf{W}_{oh}} &= \vec{\delta}_o . \textbf{a}_h^T \\ \frac{\partial L_m}{\partial \textbf{b}_{o}} &= \vec{\delta}_o \\ \vec{\delta}_o &= \frac{\partial L_m}{\partial \textbf{z}_o} = L_m' \odot f_o'(\textbf{z}_o) \\ \end{align} \]
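As an illustration, the two blocks of equations above map directly onto a few lines of NumPy. The layer sizes, tanh activations, and mean-squared loss (so that \(L_m'\) is the output error) are assumptions for this sketch, not part of the original derivation.

```python
import numpy as np

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2   # derivative of tanh

# Assumed shapes: 2 inputs (x, y), 5 hidden neurons, 3 outputs (u, v, p)
rng = np.random.default_rng(0)
W_hi, b_h = rng.normal(size=(5, 2)), np.zeros((5, 1))
W_oh, b_o = rng.normal(size=(3, 5)), np.zeros((3, 1))

x      = rng.normal(size=(2, 1))   # one normalized input sample
target = rng.normal(size=(3, 1))   # its normalized output

# Forward pass
z_h = W_hi @ x + b_h
a_h = np.tanh(z_h)
z_o = W_oh @ a_h + b_o
a_o = np.tanh(z_o)

# Backward pass (assuming a mean-squared loss, so L_m' = a_o - target)
delta_o = (a_o - target) * tanh_prime(z_o)       # output-layer delta
dW_oh   = delta_o @ a_h.T                        # dL_m/dW_oh
db_o    = delta_o                                # dL_m/db_o
delta_h = (W_oh.T @ delta_o) * tanh_prime(z_h)   # hidden-layer delta
dW_hi   = delta_h @ x.T                          # dL_m/dW_hi
db_h    = delta_h                                # dL_m/db_h
```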

  • Un-normalized data leads to biased weight updates: variables with larger magnitudes receive larger updates.
  • Because of back-propagation, this can keep the network model from converging/training due to
    • vanishing gradients for inputs/outputs with a smaller range
    • neuron saturation for inputs/outputs with a larger range

Updating the weights

Learning rate \(\alpha\)

  • It is the step size used in gradient descent in the design space (here called the weight space)

  • weights and biases of each layer will be updated in the same way as the optimization problem \[ \begin{align} \textbf{W}^{(l)} :&= \textbf{W}^{(l)} - \alpha \nabla_{\textbf{W}^{(l)}} L_m \\ \textbf{b}^{(l)} :&= \textbf{b}^{(l)} - \alpha \nabla_{\textbf{b}^{(l)}} L_m \\ \end{align} \]

  • Here, unlike in optimization problems, \(\alpha\) is kept fixed due to computational complexity (see the sketch below)
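Continuing the back-propagation sketch above, the fixed-learning-rate update is one line per parameter; \(\alpha = 0.01\) is an arbitrary illustrative value.

```python
# Gradient-descent update with a fixed learning rate (alpha chosen arbitrarily)
alpha = 0.01
W_hi -= alpha * dW_hi
b_h  -= alpha * db_h
W_oh -= alpha * dW_oh
b_o  -= alpha * db_o
```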

Training the network

  • It is the process of optimizing weights and biases of the neural network to model the target function

Flow diagram of optimization process

There are two fundamental optimization algorithms; the others are derived from them

  • Stochastic Gradient Descent
  • Batch Gradient Descent

Optimization algorithms - Stochastic Gradient Descent

  • In this algorithm, the weights are updated after each record in the dataset (see the sketch below)
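A minimal sketch of stochastic gradient descent on a stand-alone linear least-squares model (synthetic data, arbitrary learning rate); the same record-by-record update pattern applies to the network weights above.

```python
import numpy as np

# Synthetic dataset: N_data = 100 records with 2 inputs each
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
Y = X @ np.array([2.0, -1.0]) + 0.5
w, b, alpha = np.zeros(2), 0.0, 0.01

for epoch in range(20):
    for x, y in zip(X, Y):            # one weight update per record
        err = (w @ x + b) - y         # residual of this single record
        w -= alpha * err * x          # gradient of 0.5*err**2 w.r.t. w
        b -= alpha * err              # gradient of 0.5*err**2 w.r.t. b
```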

Optimization algorithms - Batch Gradient Descent

  • In this algorithm, the weights are updated only after computing gradients for all the records in the dataset (see the sketch below)
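In contrast, a batch gradient descent sketch (reusing X, Y, and alpha from the SGD example above) accumulates gradients over the full dataset before each single update.

```python
# Batch gradient descent: one update per iteration, using every record
w, b = np.zeros(2), 0.0
for iteration in range(200):
    err = X @ w + b - Y                    # residuals for all records at once
    w -= alpha * (X.T @ err) / len(Y)      # mean gradient over the full dataset
    b -= alpha * err.mean()
```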

Optimization algorithms - summary

  • Batch Gradient Descent
    • weights are updated at the end of each iteration using the full dataset
    • takes fewer iterations to converge, as each update is performed with all the data at once
    • used for small datasets
  • Stochastic Gradient Descent
    • weights are updated after each record in the dataset
    • memory efficient - only one record's gradients are stored at a time
    • used for large datasets
  • Other optimization algorithms are variants of these two
    • mini-batch gradient descent (sketched below)
    • Adaptive Gradient (AdaGrad)
    • Adaptive Moment Estimation (Adam)
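For completeness, a mini-batch gradient descent sketch (again reusing X, Y, and alpha from the sketches above, with an arbitrary batch size) shows the compromise between the two extremes; AdaGrad and Adam additionally adapt the step size per parameter.

```python
# Mini-batch gradient descent: update after every `batch_size` records
batch_size = 16
w, b = np.zeros(2), 0.0
for epoch in range(50):
    order = np.random.permutation(len(Y))          # shuffle records each epoch
    for start in range(0, len(Y), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w + b - Y[idx]              # residuals for this mini-batch
        w -= alpha * (X[idx].T @ err) / len(idx)   # mean gradient over the batch
        b -= alpha * err.mean()
```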