$$ \newcommand{\diag}[1]{diag({#1})} \newcommand{\reshape}[2]{reshape \left({#1},\ {#2}\right)} \newcommand{\bm}[1]{\textbf{#1}} \newcommand{\der}[2]{\frac{\partial #1}{\partial #2}} $$

2  Forward Propagation in RNN

For the BPTT derivation, we need to know the parameters of RNN that are to be updated. For that, we need to perform forward propagation to get the model equations. The RNN considered for derivation has 2 layers: one hidden layer having \(N^{(1)}\) neurons and one output layer having \(N^{(2)}\) neurons.

Let \(\textbf{x}_t \in \mathbb{R}^n\) be the input vector for current subpart \(t\). And let \(\textbf{a}_{t-1} \in \mathbb{R}^{N^{(1)}}\) be the hidden layer output vector for the previous subpart \(t-1\). Then, the layer-wise forward propagation equations are given below.

2.1 Layer 1 equations

From Figure 1.1, the output size of layer 1 is \(N^{(1)} = 3\) and input vector size \(n=2\).

Let \(\textbf{z}_t=[z_{1,t}, z_{2,t},z_{3,t}]^T\) be the pre-activation output vector. Then the forward propagation equations will be

\[ \begin{aligned} z_{1,t} &= w_{1,1} a_{1,t-1} + w_{1,2} a_{2,t-1} + w_{1,3} a_{3,t-1} + u_{1,1} x_{1,t} + u_{1,2} x_{2,t} + b_1\\ z_{2,t} &= w_{2,1} a_{1,t-1} + w_{2,2} a_{2,t-1} + w_{2,3} a_{3,t-1} + u_{2,1} x_{1,t} + u_{2,2} x_{2,t} + b_2\\ z_{3,t} &= w_{3,1} a_{1,t-1} + w_{3,2} a_{2,t-1} + w_{3,3} a_{3,t-1} + u_{3,1} x_{1,t} + u_{3,2} x_{2,t} + b_3 \end{aligned} \tag{2.1}\]

In vector form, it will be writen as

\[ \textbf{z}_t = \textbf{W} \textbf{a}_{t-1} + \textbf{U} \textbf{x}_t + \textbf{b} \tag{2.2}\]

where, \(\textbf{x}_t = [x_{1,t}, x_{2,t}]^T\) and \(\textbf{a}_{t-1} = [a_{1,t-1},a_{2,t-1},a{3,t-1}]^T\).

\(\textbf{W} \in \mathbb{R}^{N^{(1)}\times N^{(1)}}\), \(\textbf{U} \in \mathbb{R}^{N^{(1)}\times n}\) and \(\textbf{b} \in \mathbb{R}^{N^{(1)}\times 1}\) are the layer parameters.

Now, the Equation 2.2 will be passed through the activation function \(h(.)\) to induce non-linearity and the final output obtained from this layer will be

\[ \textbf{a}_t = h\left(\textbf{z}_t\right) . \tag{2.3}\]

2.2 Layer 2 equations

Equation 2.3 gives the output vector of previous layer, which will be taken as the input vector to the current layer as it happens in regular neural networks.

Let, \(\textbf{o}_t \in \mathbb{R}^{N^{(2)}}\) be the pre-activation output vector for the current layer. Then the forward propagation equation will be

\[ \textbf{o}_t = \textbf{V} \textbf{a}_t + \textbf{c} \tag{2.4}\]

Here, \(\textbf{V}\in\mathbb{R}^{N^{(2)}\times N^{(1)}}\) and \(\textbf{c}\in\mathbb{R}^{N^{(2)}\times 1}\) are the layer parameters.

Then, the activation function will be applied to \(\textbf{o}_t\). Here, for this derivation, the activation function is assumed to be softmax as the primary motive of this work is to develop RNN from scratch for autocompletion of words. So, the input and output will be a word-embedding vector of corresponding characters in the words.

The final output vector of current layer will be

\[ \hat{\textbf{y}}_t = softmax(\textbf{o}_t) \tag{2.5}\]

2.3 Loss calculation

For the case of using softmax activation (which is used for multiclass classification problem), the apropriate loss function will be categorical cross entropy

\[ L_t = -\textbf{y}_t^T \log(\hat{\textbf{y}}_t). \tag{2.6}\]

It can be noted in Equation 2.6 that the loss is calculated sequence-wise, i.e. 1 loss per \(t\). So, the total loss will be

\[ L = \sum_{t=1}^T L_t. \tag{2.7}\]

Next, we will see how to calculate the derivatives of Equation 2.6 with respect to all the RNN model parameters: \(\textbf{W},\textbf{U},\textbf{b},\textbf{V}\) and \(\textbf{c}\).