Multi-Disciplinary Optimization Course
2025-04-22
A neuron is a linear or non-linear function that takes in multiple inputs and produces one output.
\[ y = f\left( w_1\times x_1 + w_2\times x_2 + \dots + w_n \times x_n + w_0\right) \]
\[ y = f\left(\sum_{i=1}^n w_i x_i + w_0 \right) = f\left(\textbf{w}^T\textbf{x} + w_0\right) \]
The purpose of the activation function is to introduce non-linearity into the model.
Let \(z = \textbf{w}^T\textbf{x}+w_0\),
\(z\) is called the pre-activation scalar (or vector, for a full layer), where \(\textbf{w} = \{w_1,\dots,w_n\}\) and \(\textbf{x} = \{x_1,\dots,x_n\}\)
\[ y = f\left(z\right) \]
The neuron therefore has a linear component \(z\) wrapped in a non-linear function \(f\)
Some common activation functions include the sigmoid, tanh, and ReLU functions.
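As a minimal sketch only, a single neuron of this form can be written in a few lines of NumPy; the activation choices below mirror the examples above, and all parameter values are illustrative.

```python
import numpy as np

# Common activation functions (their derivatives appear later in back-propagation)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def neuron(x, w, w0, f=np.tanh):
    """Single neuron: y = f(w^T x + w0)."""
    z = np.dot(w, x) + w0   # linear pre-activation
    return f(z)             # non-linear activation

# Example: a neuron with 3 inputs (arbitrary values)
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron(x, w, w0=0.05, f=sigmoid))
```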
Engineering problems often require functions with multiple outputs for a given set of inputs.
Example: 2D incompressible flow over a flat plate, \(f_{NN}: (x, y) \mapsto (u, v, p)\)
Theorem 1 Let \(I_n\) denote the n-dimensional unit cube \([0,1]^n\) and \(C(I_n)\) be the space of continuous functions defined on \(I_n\). Let \(x\in\mathbb{R}^n\) and \(\sigma\) be any continuous discriminatory function. Then the finite sums of the form
\[ G(x) = \sum_{j=1}^N \alpha_j \sigma(y_j^T x + \theta_j) \]
are dense in \(C(I_n)\). In other words, given any \(f\in C(I_n)\) and \(\epsilon>0\), there is a sum \(G(x)\) of the above form for which \[ |G(x) - f(x)| < \epsilon, \quad \forall \; x\in I_n \]
where \(y_j \in \mathbb{R}^n\) and \(\alpha_j, \theta_j \in \mathbb{R}\).
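A minimal sketch of evaluating a finite sum \(G(x)\) of the form in Theorem 1, with the sigmoid as the discriminatory function \(\sigma\); the parameter values are arbitrary placeholders, not a fitted approximation of any target \(f\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def G(x, alpha, Y, theta):
    """Finite sum G(x) = sum_j alpha_j * sigma(y_j^T x + theta_j).

    x     : input point in R^n
    alpha : (N,)   output weights alpha_j
    Y     : (N, n) rows are the vectors y_j
    theta : (N,)   offsets theta_j
    """
    return np.sum(alpha * sigmoid(Y @ x + theta))

# Arbitrary parameters for a 2-input sum with N = 4 terms
rng = np.random.default_rng(0)
alpha = rng.normal(size=4)
Y = rng.normal(size=(4, 2))
theta = rng.normal(size=4)
print(G(np.array([0.3, 0.7]), alpha, Y, theta))
```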
Let's take the same network, with one hidden layer of 4 neurons and an output layer of 3 neurons.
Hidden layer \[ \begin{align} \textbf{z}_h &= \textbf{W}_{hi}\textbf{x}+\textbf{b}_h \\ \textbf{a}_h &= f_h\left(\textbf{z}_h\right) \end{align} \]
Output layer \[ \begin{align} \textbf{z}_o &= \textbf{W}_{oh}\textbf{a}_h+\textbf{b}_o \\ \textbf{a}_o &= f_o\left(\textbf{z}_o\right) \end{align} \]
\(\textbf{a}_o = \{u,v,p\}\) is the vector of outputs estimated by the network
where \(N_i\) is the input vector size and \(N_o\) the output vector size of the \(l\)th layer.
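A minimal NumPy sketch of this forward pass for the flat-plate example (inputs \(x, y\); 4 hidden neurons; outputs \(u, v, p\)). The tanh hidden activation, linear output activation, and random initial weights are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Network sizes: 2 inputs (x, y), 4 hidden neurons, 3 outputs (u, v, p)
n_in, n_h, n_out = 2, 4, 3

# Weights and biases (randomly initialised here)
W_hi = rng.normal(size=(n_h, n_in));  b_h = np.zeros(n_h)
W_oh = rng.normal(size=(n_out, n_h)); b_o = np.zeros(n_out)

f_h = np.tanh              # hidden activation (assumed)
f_o = lambda z: z          # output activation (assumed linear)

def forward(x):
    z_h = W_hi @ x + b_h    # hidden pre-activation
    a_h = f_h(z_h)          # hidden activation
    z_o = W_oh @ a_h + b_o  # output pre-activation
    a_o = f_o(z_o)          # network output {u, v, p}
    return a_o

print(forward(np.array([0.1, 0.2])))
```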
Loss functions are the objective functions of the neural network, one per data record, represented by \(L_m, \ m=1,2,\dots,N_{data}\)
The sum of the loss values over all records in the dataset is called the cost function \(J\) \[J(\textbf{W},\textbf{b}) = \sum_{m=1}^{N_{data}} L_m\]
where \(L_m = \left(y_m - \hat{y}_m\right)^2\)
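A short sketch of the cost as a sum of per-record losses; the `forward` argument stands for the network's forward pass (for instance the sketch above), and the array shapes are assumptions.

```python
import numpy as np

def record_loss(y_m, y_hat_m):
    """Per-record squared loss L_m = (y_m - y_hat_m)^2, summed over the outputs."""
    return np.sum((y_m - y_hat_m) ** 2)

def cost(X, Y, forward):
    """Cost J(W, b): sum of L_m over all N_data records.

    X : (N_data, n_in)  inputs, one record per row
    Y : (N_data, n_out) target outputs
    """
    return sum(record_loss(y_m, forward(x_m)) for x_m, y_m in zip(X, Y))
```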
Let's take the same two-layer network.
The equations for the hidden layer are \[ \begin{align} \textbf{z}_h &= \textbf{W}_{hi}\textbf{x}+\textbf{b}_h \\ \textbf{a}_h &= f_h\left(\textbf{z}_h\right) \end{align} \]
and for the output layer \[ \begin{align} \textbf{z}_o &= \textbf{W}_{oh}\textbf{a}_h+\textbf{b}_o \\ \textbf{a}_o &= f_o\left(\textbf{z}_o\right) \end{align} \]
Compute the gradients using the chain rule, starting with the innermost weight \(w_{hi}\):
\[ \begin{align} \frac{\partial L_m}{\partial w_{hi}} &= \frac{\partial L_m}{\partial z_h} \frac{\partial z_h}{\partial w_{hi}} = \frac{\partial L_m}{\partial a_h} \frac{\partial a_h}{\partial z_h}\frac{\partial z_h}{\partial w_{hi}} = \cdots \\ &=\frac{\partial L_m}{\partial z_h} \frac{\partial z_h}{\partial w_{hi}} = \delta_h\frac{\partial z_h}{\partial w_{hi}} = \delta_h . x \\ \frac{\partial L_m}{\partial b_h} &= \frac{\partial L_m}{\partial z_h} \frac{\partial z_h}{\partial b_h} = \delta_h . 1 = \delta_h \end{align} \]
Hidden layer derivatives
\[\begin{align} \frac{\partial L_m}{\partial w_{hi}}&= \delta_h . x \\ \frac{\partial L_m}{\partial b_{h}} &= \delta_h \\ \delta_h &= \frac{\partial L_m}{\partial z_h} \end{align}\]
Individual layer equations \[ z_{h} = w_{hi}x+b_h, \ \ a_h = f_h(z_h) \\ z_{o} = w_{oh}a_h+b_o, \ \ a_o = f_o(z_o) \]
Loss function \(L_m = (u - a_o)^2\)
Gradients for all scalar design variables (weights and biases)
Hidden layer \[ \begin{align} \frac{\partial L_m}{\partial w_{hi}}&= \delta_h . x \\ \frac{\partial L_m}{\partial b_{h}} &= \delta_h \\ \delta_h &= \frac{\partial L_m}{\partial z_h} = \delta_o . w_{oh} . f'_h(z_h)\\ \end{align} \]
Output layer \[ \begin{align} \frac{\partial L_m}{\partial w_{oh}} &= \delta_o . a_h \\ \frac{\partial L_m}{\partial b_{o}} &= \delta_o \\ \delta_o &= \frac{\partial L_m}{\partial z_o} = L_m' . f_o'(z_o) \\ \end{align} \]
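As a sanity check, the scalar formulas above can be compared against central finite differences. This sketch assumes a tanh hidden activation and a linear output activation (so \(f_o'(z_o) = 1\)); the parameter values are arbitrary.

```python
import numpy as np

# Scalar network: one input, one hidden neuron (tanh), one linear output
def loss(params, x, u):
    w_hi, b_h, w_oh, b_o = params
    a_h = np.tanh(w_hi * x + b_h)     # hidden layer
    a_o = w_oh * a_h + b_o            # output layer, f_o assumed linear
    return (u - a_o) ** 2

def grads(params, x, u):
    """Gradients from the back-propagation formulas above."""
    w_hi, b_h, w_oh, b_o = params
    a_h = np.tanh(w_hi * x + b_h)
    a_o = w_oh * a_h + b_o
    delta_o = -2.0 * (u - a_o)                   # L_m' * f_o'(z_o), with f_o' = 1
    delta_h = delta_o * w_oh * (1.0 - a_h ** 2)  # delta_o * w_oh * f_h'(z_h)
    return np.array([delta_h * x, delta_h, delta_o * a_h, delta_o])

# Finite-difference check of the formulas
params = np.array([0.3, -0.1, 0.8, 0.2])   # w_hi, b_h, w_oh, b_o (arbitrary)
x, u = 0.5, 1.0
eps = 1e-6
fd = np.array([(loss(params + eps * np.eye(4)[i], x, u)
                - loss(params - eps * np.eye(4)[i], x, u)) / (2 * eps)
               for i in range(4)])
print(np.allclose(grads(params, x, u), fd, atol=1e-6))
```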
In vector form \[ \textbf{z}_{h} = \textbf{W}_{hi}\textbf{x}+\textbf{b}_h, \ \ \textbf{a}_h = f_h(\textbf{z}_h) \\ \textbf{z}_{o} = \textbf{W}_{oh}\textbf{a}_h+\textbf{b}_o, \ \ \textbf{a}_o = f_o(\textbf{z}_o) \]
Loss function \(L_m = \left\|\textbf{u} - \textbf{a}_o\right\|^2\)
Gradients of all vector design variables (weights and biases)
Hidden layer \[ \begin{align} \frac{\partial L_m}{\partial \textbf{W}_{hi}}&= \vec{\delta}_h . \textbf{x}^T \\ \frac{\partial L_m}{\partial \textbf{b}_{h}} &= \vec{\delta}_h \\ \vec{\delta}_h &= \frac{\partial L_m}{\partial \textbf{z}_h} = \textbf{W}_{oh}^T \vec{\delta}_o \odot f'_h(\textbf{z}_h)\\ \end{align} \]
Output layer \[ \begin{align} \frac{\partial L_m}{\partial \textbf{W}_{oh}} &= \vec{\delta}_o . \textbf{a}_h^T \\ \frac{\partial L_m}{\partial \textbf{b}_{o}} &= \vec{\delta}_o \\ \vec{\delta}_o &= \frac{\partial L_m}{\partial \textbf{z}_o} = L_m' \odot f_o'(\textbf{z}_o) \\ \end{align} \]
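A minimal NumPy sketch of these vector-form gradients for the 2-4-3 example network. The tanh hidden activation and linear output activation are assumptions (they are not fixed above), and \(L_m'\) is taken as the element-wise derivative \(-2(\textbf{u} - \textbf{a}_o)\) of the squared loss.

```python
import numpy as np

def backprop(x, u, W_hi, b_h, W_oh, b_o, f_h=np.tanh):
    """One back-propagation pass; returns gradients for all weights and biases.

    Assumes f_h = tanh (so f_h'(z) = 1 - tanh(z)^2), a linear output layer
    (f_o'(z) = 1), and the loss L_m = sum((u - a_o)^2).
    """
    # Forward pass
    z_h = W_hi @ x + b_h
    a_h = f_h(z_h)
    z_o = W_oh @ a_h + b_o
    a_o = z_o

    # Backward pass
    dL_da_o = -2.0 * (u - a_o)                    # L_m'
    delta_o = dL_da_o * 1.0                       # elementwise product with f_o'(z_o) = 1
    delta_h = (W_oh.T @ delta_o) * (1.0 - a_h ** 2)

    return {
        "W_oh": np.outer(delta_o, a_h),  # dL/dW_oh = delta_o . a_h^T
        "b_o": delta_o,
        "W_hi": np.outer(delta_h, x),    # dL/dW_hi = delta_h . x^T
        "b_h": delta_h,
    }
```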
Normalization is the process of scaling each input and output variable to the same fixed range.
In the same example network
Input variables are \(x\) and \(y\)
Outputs are \(u\), \(v\) and \(p\)
And let's say we have \(N_{data}\) data points
Back-propagation produces chain-linked gradients: each layer's \(\vec{\delta}\) (see the hidden- and output-layer expressions above) multiplies the weights and activation derivatives of the layers after it, so the scale of the inputs and outputs propagates through every gradient term. This is why the data is scaled to a common range before training.
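A minimal min-max scaling sketch for this example, assuming the \(N_{data}\) records are stored as arrays `X` (columns \(x, y\)) and `U` (columns \(u, v, p\)); the target range \([0, 1]\) is an assumption.

```python
import numpy as np

def minmax_scale(data):
    """Scale each column of `data` (shape: N_data x n_vars) to the range [0, 1]."""
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    return (data - lo) / (hi - lo), (lo, hi)   # keep (lo, hi) to undo the scaling later

def minmax_unscale(scaled, bounds):
    """Map scaled values back to their original range."""
    lo, hi = bounds
    return scaled * (hi - lo) + lo

# X: (N_data, 2) inputs x, y;  U: (N_data, 3) outputs u, v, p
# X_s, x_bounds = minmax_scale(X)
# U_s, u_bounds = minmax_scale(U)
```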
The learning rate \(\alpha\) is the step size used for gradient descent in the design space (here called the weight space).
The weights and biases of each layer are updated in the same way as in a standard optimization problem \[ \begin{align} \textbf{W}^{(l)} :&= \textbf{W}^{(l)} - \alpha \nabla_{\textbf{W}^{(l)}} L_m \\ \textbf{b}^{(l)} :&= \textbf{b}^{(l)} - \alpha \nabla_{\textbf{b}^{(l)}} L_m \\ \end{align} \]
Here, unlike in classical optimization problems, \(\alpha\) is kept fixed because of computational cost (a step-size search at every update would be too expensive).
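A minimal sketch of this update rule with a fixed \(\alpha\); the `backprop` function and parameter dictionary refer to the gradient sketch above and are illustrative, not a prescribed implementation.

```python
alpha = 1e-3   # fixed learning rate (assumed value)

def gd_step(params, grads, alpha=alpha):
    """One gradient-descent update: W := W - alpha * dL/dW, b := b - alpha * dL/db."""
    return {name: value - alpha * grads[name] for name, value in params.items()}

# Usage with the back-propagation sketch above (hypothetical record arrays X, U):
# params = {"W_hi": W_hi, "b_h": b_h, "W_oh": W_oh, "b_o": b_o}
# for x_m, u_m in zip(X, U):                      # loop over the N_data records
#     params = gd_step(params, backprop(x_m, u_m, **params))
```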
There are two fundamental optimization algorithms; the others are derived from them.
Introduction to neural networks - MDO