1 Least Squares

This chapter has description on the following content

Ordinary Least Squares
Weighted Least Squares
Recursive Least Squares
Wiener filtering

The main reference for this chapter is Simon (2006).

1.1 Ordinary Least Squares (OLS)

Here, estimating a constant vector \(\mathbf{x}\in \mathbb{R}^n\) from a noisy measurement vector \(\mathbf{y} \in \mathbb{R}^k\) is performed. It is assumed that each element in \(\mathbf{y}\) is a linear combination of \(\mathbf{x}\) vector with some additional measurement noise vector \(\nu\). The equation representing this scenario is given below.

\[ \mathbf{y} = \mathbf{H}\mathbf{x} + \nu \tag{1.1}\]

Now, defining the measurement residual or error vector \(\epsilon_y\) as

\[ \epsilon_y = \mathbf{y} - \mathbf{H} \hat{\mathbf{x}} \tag{1.2}\]

The most optimal value of \(\mathbf{x}\) is \(\hat{\mathbf{x}}\) that minimizes the sum of squares of individual components of the error vector \(\epsilon_y\). So, constructing the objective function as

\[ J = \sum_{i=1}^k \epsilon_{k,i}^2 = \epsilon_y^T \epsilon_y. \tag{1.3}\]

Substituting Eq. 1.2 in Eq. 1.3 and expanding as

\[ \begin{align} J &= (\mathbf{y} - \mathbf{H} \hat{\mathbf{x}})^T(\mathbf{y} - \mathbf{H} \hat{\mathbf{x}})\\ &= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T \mathbf{H}\hat{\mathbf{x}} -\hat{\mathbf{x}}^T\mathbf{H}^T \mathbf{y} + \hat{\mathbf{x}}^T\mathbf{H}^T\mathbf{H}\hat{\mathbf{x}} . \end{align} \tag{1.4}\]

To find the stationary points, taking derivative of Eq. 1.4 with respect to \(\hat{\mathbf{x}}\) as

\[ \begin{align} \frac{d J}{d \hat{\mathbf{x}}} &= \mathbf{0} - \mathbf{y}^T\mathbf{H} - \mathbf{y}^T\mathbf{H} +2 \hat{\mathbf{x}}^T\mathbf{H}^T\mathbf{H} = \mathbf{0} \\ 2\hat{\mathbf{x}}^T \mathbf{H}^T\mathbf{H} &= 2\mathbf{y}^T\mathbf{H} \\ \mathbf{H}^T\mathbf{H} \hat{\mathbf{x}} &= \mathbf{H}^T\mathbf{y} \\ \hat{\mathbf{x}} &= \left(\mathbf{H}^T\mathbf{H}\right)^{-1} \mathbf{H}^T\mathbf{y} \end{align} \tag{1.5}\]

Above equation gives the final form of the Least Squares estimation.

1.2 Weighted Least Squares (WLS)

In certain scenarios, specific portions of the measurement data \(\mathbf{y}\) may exhibit higher noise levels than others. In such case, the portions with less noise level should be given more weight than the portions with high noise. This is achieved by weighted least squares.

The weight is determined by the covariance matrix of the noise vector \(\nu\). It is assumed that the noise for each measurement is zero-mean and independent. Thus, the measurement noise covariance matrix \(\mathbf{R}\) is defined as

\[ \begin{align} \mathbf{R} &= E(\nu\nu^T) \\ &= \begin{bmatrix} \sigma_1^2 & \dots & 0 \\ \vdots & & \vdots \\ 0 & \dots & \sigma_k^2 \end{bmatrix} \end{align} . \tag{1.6}\]

Hence, the objective function (Eq. 1.3) is written as

\[ J = \sum_{i=1}^k \frac{\epsilon_{y,i}^2}{\sigma_1^2} = \epsilon_y^T \mathbf{R}^{-1} \epsilon_y . \tag{1.7}\]

Expanding above expression,

\[ \begin{align} J &= (\mathbf{y} - \mathbf{H}\hat{\mathbf{x}})^T \mathbf{R}^{-1} (\mathbf{y} - \mathbf{H}\hat{\mathbf{x}}) \\ &= \mathbf{y}^T \mathbf{R}^{-1} \mathbf{y} - \mathbf{y}^T \mathbf{R}^{-1} \mathbf{H} \hat{\mathbf{x}} - \hat{\mathbf{x}}^T\mathbf{H}^T\mathbf{R}^{-1}\mathbf{y} + \hat{\mathbf{x}}^T\mathbf{H}^T\mathbf{R}^{-1}\mathbf{H}\hat{\mathbf{x}} . \end{align} \]

Taking derivative of above expression with respect to \(\hat{\mathbf{x}}\) and setting it to zero yields

\[ \begin{align} \frac{d J}{d \hat{\mathbf{x}}} &= \mathbf{0} - \mathbf{y}^T\mathbf{R}^{-1}\mathbf{H}- \mathbf{y}^T \mathbf{R}^{-1}\mathbf{H} + 2\hat{\mathbf{x}}^T\mathbf{H}^T\mathbf{R}^{-1}\mathbf{H} = \mathbf{0} \\ 2\hat{\mathbf{x}}^T\mathbf{H}^T\mathbf{R}^{-1}\mathbf{H} &= 2 \mathbf{y}^T \mathbf{R}^{-1}\mathbf{H} \\ \mathbf{H}^T\mathbf{R}^{-1}\mathbf{H}\hat{\mathbf{x}} &= \mathbf{H}^T\mathbf{R}^{-1}\mathbf{y} \\ \hat{\mathbf{x}} &= \left(\mathbf{H}^T\mathbf{R}^{-1}\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{R}^{-1}\mathbf{y} . \end{align} \tag{1.8}\]

Above equation gives the final form of weighted least squares.

Note

Weighted least squares generalizes ordinary least squares by incorporating the measurement noise covariance matrix. When the measurements are obtained from a single instrument with independent and identically distributed noise, the covariance matrix reduces to a scaled identity matrix, and the weighted least squares solution becomes equivalent to ordinary least squares. However, in the presence of correlated or heteroskedastic measurement noise, the weighting matrix must be explicitly accounted for.

This is still an offline estimation method

1.3 Recursive Least Squares (RLS)

The ordinary least squares (OLS) (Eq. 1.5) and weighted least squares (WLS) (Eq. 1.8) process data in batches, resulting in a tall \(\mathbf{H}\) matrix that is computationally cumbersome to handle. Moreover, they are unusable in online estimation where the data points will come in sequence. These limitations lead to the recursive least squares (RLS) method. The complete name for this method is Recursive Weighted Least Squares.

A recursive least squares estimator can be written of the form \[ \begin{align} \hat{\mathbf{x}}_k &= \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k\left(y_k - \mathbf{H}_k \hat{\mathbf{x}}_{k-1}\right) \end{align} \tag{1.9}\]

Here, the current estimate \(\hat{\mathbf{x}}_k\) is computed based on the previous estimate \(\hat{\mathbf{x}}_{k-1}\) and the new measurement \(y_k\). The \(\mathbf{K}_k\) is called the estimator gain matrix, and \(\left(y_k - \mathbf{H}_k \hat{\mathbf{x}}_{k-1}\right)\) is the correction term.

\(\mathbf{H}\) matrix

In the case of RLS, \(\mathbf{H}\) is a row vector of dimension \(1\times n\). However, it is denoted with an uppercase bold symbol, \(\mathbf{H}\), to maintain notational consistency with subsequent chapters, where it is treated as a matrix.

Before computing gain matrix \(\mathbf{K}_k\), it is necessary to examine the estimator error mean, it is expressed as:

\[ \begin{align} E(\epsilon_{x,k}) &= E(\mathbf{x} - \hat{\mathbf{x}}_k)\\ &= E \left[\mathbf{x} - \hat{\mathbf{x}}_{k-1} - \mathbf{K}_k\left(y_k- \mathbf{H}_k \hat{\mathbf{x}}_{k-1}\right)\right] \\ &=E \left[\mathbf{x} - \hat{\mathbf{x}}_{k-1} - \mathbf{K}_k\left(\mathbf{H}_k\mathbf{x}+\nu- \mathbf{H}_k \hat{\mathbf{x}}_{k-1}\right)\right] \\ &=E(\mathbf{x}-\hat{\mathbf{x}}_{k-1}) - \mathbf{K}_k\left(\mathbf{H}_k E(\mathbf{x}-\hat{\mathbf{x}}_{k-1})+E(\nu)\right) \\ &= E(\epsilon_{x,k-1}) - \mathbf{K}_k\left(\mathbf{H}_kE(\epsilon_{x,k-1}) + E(\nu)\right) \\ &= \left(\mathbf{I} - \mathbf{K}_k\mathbf{H}_k\right) E(\epsilon_{x,k-1}) - \mathbf{K}_k E(\nu) \end{align} \tag{1.10}\]

Here, \(E(\nu) = 0\) (zero-mean noise was assumed). If \(E(\epsilon_{x,k-1})=0\), then \(E(\epsilon_{x,k}) = 0\). So, if the initial value of \(\hat{\mathbf{x}}_0\) is chosen such that \(\hat{\mathbf{x}}_0 = E(\mathbf{x})\), then \(E(\hat{\mathbf{x}}_k) = E(\mathbf{x}_k)\) for all \(k\)’s. Thus, Eq. 1.10 is called an unbiased estimator.

Note

The property \(E(\hat{\mathbf{x}}_k) = E(\mathbf{x}_k)\) holds regardless the value of gain matrix \(\mathbf{K}_k\). It is a desirable property as it says, on average, the estimate \(\hat{\mathbf{x}}\) will be equal to the true value \(\mathbf{x}\).

Now, to determine \(\mathbf{K}_k\), some optimality criterion has to be chosen. Estimator error mean (Eq. 1.10) cannot be used as it is unbiased regardless of \(\mathbf{K}_k\).

The chosen optimality criterion is to minimize the sum of variances of the estimation errors at time \(k\). Thus, the objective function is

\[ \begin{align} J &= \sum_{i=1}^n E\left[\left(x_i - \hat{x}_i\right)^2\right] \\ &= E\left(\sum_{i=1}^n \epsilon_{x_i,k}^2\right) \\ &= E\left(\epsilon_{x,k}^T\epsilon_{x,k}\right) \\ &= E\left[Trace\left(\epsilon_{x,k}\epsilon_{x,k}^T\right)\right] \\ &= Trace(\mathbf{P}_k) \end{align} \tag{1.11}\]

Note

In Eq. 1.11, \(\epsilon_{x,k}^T\epsilon_{x,k} = Trace(\epsilon_{x,k}\epsilon{x,k}^T\). Only difference is, the RHS of this expression is a matrix. And that matrix is called the estimator error covariance \(\mathbf{P}_k\). It is also called as the estimation uncertainty.

A recursive formula for \(\mathbf{P}_k\) can be obtained by following the process in Eq. 1.10.

\[ \begin{align} \mathbf{P}_k &= E(\epsilon_{x,k}\epsilon_{x,k}^T)\\ &= E\left\{\left[\left(\mathbf{I}-\mathbf{K}_k\mathbf{H}_k\right)\epsilon_{x,k-1}-\mathbf{K}_k\nu_k\right]\left[\left(\mathbf{I}-\mathbf{K}_k\mathbf{H}_k\right)\epsilon_{x,k-1}-\mathbf{K}_k\nu_k\right]^T\right\} . \end{align} \]

Expanding and simplifying above equation yields

\[ \mathbf{P}_k = \left(\mathbf{I}-\mathbf{K}_k\mathbf{H}_k\right)\mathbf{P}_{k-1}\left(\mathbf{I}-\mathbf{K}_k\mathbf{H}_k\right)^T + \mathbf{K}_k\mathbf{R}_k\mathbf{K}_k^T \tag{1.12}\]

Substituting Eq. 1.12 in Eq. 1.11 and taking derivative yields \[ \frac{\partial J_k}{\partial \mathbf{K}_k} = 2\left(\mathbf{I}-\mathbf{K}_k\mathbf{H}_k\right)\mathbf{P}_{k-1}\left(-\mathbf{H}_k^T\right)+2\mathbf{K}_k\mathbf{R}_k \]

Equating above expression to zero gives \[ \begin{align} \mathbf{K}_k\mathbf{R}_k &= \left(\mathbf{I} - \mathbf{K}_k\mathbf{H}_k\right) \mathbf{P}_{k-1} \mathbf{H}_k^T \\ \mathbf{K}_k &= \mathbf{P}_{k-1}\mathbf{H}_k^T\left(\mathbf{H}_k\mathbf{P}_{k-1}\mathbf{H}^T_k + \mathbf{R}_k\right)^{-1} \end{align} \tag{1.13}\]

So, the Eq. 1.9, Eq. 1.12 and Eq. 1.13 form the recursive least squares estimator.

The algorithm is as follows

1.3.1 Recursive Least Squares algorithm

Initialize the estimator as follows \[ \begin{align} \hat{\mathbf{x}}_0 &= E(\mathbf{x}) \\ \hat{\mathbf{P}}_0 &= E\left[\left(\mathbf{x}-\hat{\mathbf{x}}_0\right)\left(\mathbf{x}-\hat{\mathbf{x}}_0\right)^T\right] \end{align} \]
For \(k=1,2,\dots\) do the following
1. Obtain measurement data, assuming like \(\mathbf{y}_k\) is given be the equation \[ \mathbf{y}_k = \mathbf{H}_k\mathbf{x}+\nu_k \]
2. Update the estimate of \(\mathbf{x}\) and the estimation-error covariance \(\mathbf{P}\) as follows: \[ \begin{align} \mathbf{K}_k &= \mathbf{P}_{k-1}\mathbf{H}_k^T\left(\mathbf{H}_k\mathbf{P}_{k-1}\mathbf{H}^T_k + \mathbf{R}_k\right)^{-1} \\ \hat{\mathbf{x}}_k &= \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k\left(y_k - \mathbf{H}_k \hat{\mathbf{x}}_{k-1}\right) \\ \mathbf{P}_k &= \left(\mathbf{I}-\mathbf{K}_k\mathbf{H}_k\right)\mathbf{P}_{k-1}\left(\mathbf{I}-\mathbf{K}_k\mathbf{H}_k\right)^T + \mathbf{K}_k\mathbf{R}_k\mathbf{K}_k^T \end{align} \]

1.3.2 Alternate estimator forms

The expressions for \(\mathbf{P}_k\) and \(\mathbf{K}_k\) can be written in the alternate forms through simplification. The following alternate forms are mathematically identical to the previously equations, but are beneficial from computational point of view.

The below alternate form for \(\mathbf{P}_k\) expression (Eq. 1.12) is obtained by substituting Eq. 1.13 in Eq. 1.12 and then simplified.

\[ \mathbf{P}_k = \left(\mathbf{I} - \mathbf{K}_k\mathbf{H}_k\right)\mathbf{P}_{k-1} \tag{1.14}\]

The above equation is much simple compared to its equivalent Eq. 1.12 but numerical computing problems (i.e. scaling issues) may cause this expression for \(\mathbf{P}_k\) to not be positive definite even if \(\mathbf{P}_{k-1},\mathbf{R}_k\) are positive definite.

Another form is obtained for \(\mathbf{P}_k\),

\[ \mathbf{P}_k = \left[\mathbf{P}_{k-1}^{-1} + \mathbf{H}_k^T\mathbf{R}_k^{-1}\mathbf{H}_k\right]^{-1} . \tag{1.15}\]

Using above expression, the alternate form for the estimator gain \(\mathbf{K}_k\) was derived and the final expression is

\[ \mathbf{K}_k = \mathbf{P}_k\mathbf{H}_k^T\mathbf{R}_k^{-1}. \tag{1.16}\]

1.3.3 General recursive least squares estimation algorithm

This is the summary algorithm with above alternate forms included

Let, the measurement equations are given as

\[ \begin{align} \mathbf{y}_k &= \mathbf{H}_k \mathbf{x} + \mathbf{v}_k \\ \mathbf{x} &= constant \\ E(\mathbf{v}_k) &= 0 \\ E(\mathbf{v}_k\mathbf{v}_i^T) &= \mathbf{R}_k \delta_{k-i} \end{align} \]

Assign the initial estimate of \(\mathbf{x}_0\) and \(\mathbf{P}_0\).
For \(k=1,2,\dots\) do the following \[ \begin{align} \mathbf{K}_k &= \mathbf{P}_{k-1}\mathbf{H}_k^T\left(\mathbf{H}_k\mathbf{P}_{k-1}\mathbf{H}_k^T+\mathbf{R}_k\right)^{-1} \\ &= \mathbf{P}_k\mathbf{H}_k^T\mathbf{R}_k^{-1} \\ \hat{\mathbf{x}}_k &= \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k\left(\mathbf{y}_k - \mathbf{H}_k\hat{\mathbf{x}}_{k-1}\right) \\ \mathbf{P}_k &= \left(\mathbf{I}-\mathbf{K}_k\mathbf{H}_k\right)\mathbf{P}_{k-1}\left(\mathbf{I}-\mathbf{K}_k\mathbf{H}_k\right)^T + \mathbf{K}_k\mathbf{R}_k\mathbf{K}_k^T \\ &= \left[\mathbf{P}_{k-1}^{-1} + \mathbf{H}_k^T\mathbf{R}_k^{-1}\mathbf{H}_k\right]^{-1} \\ &= \left(\mathbf{I} - \mathbf{K}_k\mathbf{H}_k\right)\mathbf{P}_{k-1} \end{align} \]

1.4 Deterministic vs Probabilistic Interpretation of OLS, WLS, and RLS

The fundamental distinction between OLS/WLS and RLS lies in how uncertainty in the unknown parameters is modeled and propagated. OLS and WLS perform deterministic optimization, whereas RLS is commonly formulated as a probabilistic inference method.
In OLS and WLS, the unknown parameters are treated as fixed but unknown constants, and the estimation problem is posed as a static optimization over a finite dataset. These methods do not inherently propagate uncertainty over time.
In contrast, RLS treats the unknown parameters as random variables. Although the parameters are assumed to be constant in reality, modeling them as random variables enables the representation of epistemic uncertainty and facilitates sequential Bayesian updates as new measurements become available.
OLS and WLS are usually presented as deterministic optimization problems, whereas RLS is usually presented as a Bayesian filtering method, even though all three approaches can yield mathematically equivalent point estimates under specific assumptions (e.g., linear models, Gaussian noise, and static parameters).
The RLS algorithm is derived from a Bayesian filtering perspective, where parameter estimates correspond to posterior means and the associated matrices represent posterior covariances. Consequently, a strictly probabilistic interpretation of RLS relies on explicit probabilistic assumptions.
Nevertheless, RLS can also be used as a deterministic recursive numerical algorithm for solving (weighted) least-squares problems. In this interpretation, its internal quantities are viewed purely as algorithmic constructs rather than as random variables and covariances, and the Bayesian interpretation is no longer invoked.

1.5 Estimation error bounds in OLS,WLS and RLS

It is explicit that the covariance \(\mathbf{P}\) is propagated in the RLS and the final estimate of \(\mathbf{P}\) will be used for computing the estimation error bounds as \(\hat{\mathbf{x}} \pm diag(\sqrt{\mathbf{P}})\).

However, OLS and WLS also has estimation covariance computation methods that are generally undiscussed. The covariance in these batch LS methods is computed after parameter \(\hat{\mathbf{x}}\) estimation. The expression for covariance in OLS is given by

\[ Cov\left(\hat{\mathbf{x}}_{OLS}\right) = \hat{\sigma}^2\left(\mathbf{H}^T\mathbf{H}\right)^{-1} \]

Where, \(\hat{\sigma}^2\) is the noise scaling factor defined as

\[ \hat{\sigma}^2 = \frac{1}{k-n} \lVert \mathbf{y} - \mathbf{H}\hat{\mathbf{x}}\rVert ^2 \]

Note that the number of data points \(k\) and the length of unknown parameter vector \(n\) should satisfy the relation \(k> n\). However, \(k=n\) can happen if the system is consistent and have unique solution.

Similarly, the covariance expression for the case of WLS is given as

\[ Cov\left(\hat{\mathbf{x}}_{WLS}\right) = \left(\mathbf{H}^T\mathbf{R}^{-1}\mathbf{H}\right)^{-1} \]

This expression already includes noise scaling if the \(\mathbf{R}\) matrix is known.

In RLS, the covariance \(\mathbf{P}_k\) is recursively evolved with the data points as \(\mathbf{P}_k\) is a part of the algorithm itself.

As a quick points

In OLS/WLS, error bounds come from inverse information matrix
In RLS, error bounds are recursively propagated posterior covariance.

And the common way of describing error bounds in literature is

\[ \hat{\mathbf{x}} \pm 2{SE}\left(\hat{\mathbf{x}}\right) \tag{1.17}\]

Where, \(SE(.)\) is called the standard error operator defined as

\[ SE\left(\hat{\mathbf{x}}\right) = \sqrt{diag\left(\mathbf{P}\right)} \]

Eq. 1.17 gives the estimate with 95% confidence interval.

1.6 Selecting initial guess parameters

OLS and WLS are explicit methods of estimating the parameter vector, but RLS is a sequential update method. Thus, RLS requires an initial guess for the parameter vector \(\mathbf{x}_0\) and the covariance matrix \(\mathbf{P}_0\).

In RLS, the initial parameter estimate \(\mathbf{x}_0\) represents a prior mean and does not affect unbiasedness, while the initial covariance \(\mathbf{P}_0\) governs the rate of convergence and confidence placed on the prior information.

The following methods of initial guess for \(\mathbf{x}_0\) is suggested.

Zero vector, when there is no prior knowledge about the system.
Physically motivated guess, when the physics of the system is known.
Batch OLS/WLS, taking a short interval data points and obtaining \(\mathbf{x}_0\) with OLS/WLS. They have faster convergence and smaller oscillations.

And the following methods were suggested for the initial guess of \(\mathbf{P}_0\).

A large diagonal matrix \(\mathbf{P} = \alpha \mathbf{I}\), used when the \(\mathbf{x}\) range is unknown.
Parameter scaled diagonal matrix. This is best choice as the scale of each state variable in \(\mathbf{x}\) can be different, and choosing the possible variance in the corresponding scale is physically meaningful and numerically stable.
With batch OLS/WLS, as \(\mathbf{P}_0 = \hat{\sigma}^2 \left(\mathbf{H}^T\mathbf{H}\right)^{-1}\)

RLS algorithm trusts \(\mathbf{P}_0\) more than \(\mathbf{x}_0\). Bad \(\mathbf{x}_0\) and large \(\mathbf{P}_0\) makes RLS quickly correct itself, while the case with small \(\mathbf{P}_0\) will yield slow convergence or even biased estimate.

In the case of Kalman Filters, which is an extension of RLS that will be discussed in the later chapters, the initial state vector \(\mathbf{x}_0\) and its corresponding covariance matrix \(\mathbf{P}_0\) has to be selected appropriately as well.

The selection process for KF is similar to that of RLS. The main difference is the state vector itself. The state vector in RLS usually a constant vector, whereas the state vector in KF represents the evolution of the system.

1.7 Equivalence of OLS / WLS and RLS: A Worked Example

This section demonstrates, using a simple example, that batch Ordinary Least Squares (OLS), Weighted Least Squares (WLS), and Recursive Least Squares (RLS) produce identical final parameter estimates when applied to the same data under identical assumptions. The methods differ only in how the data are processed, not in the underlying solution.

1.7.1 Problem formulation

Consider a linear regression model with constant but unknown parameters:

\[ y_k = \mathbf{H}_k \mathbf{x} + v_k \]

where:

\(\mathbf{x} \in \mathbb{R}^2\) is the unknown parameter vector,
\(\mathbf{H}_k \in \mathbb{R}^{1 \times 2}\) is the regressor (observation) row vector,
\(v_k\) is zero-mean measurement noise.

Assume: \[ v_k \sim \mathcal{N}(0, \sigma^2), \quad \text{i.i.d.} \]

Let the true parameter vector be: \[ \mathbf{x} = \begin{bmatrix} 2 \\ -1 \end{bmatrix} \]

1.7.2 Data generation

Let the regressor at time \(k\) be: \[ \mathbf{H}_k = \begin{bmatrix} 1 & k \end{bmatrix}, \quad k = 1,2,3 \]

The corresponding noise-free measurements are: \[ \begin{aligned} y_1 &= 1 \\ y_2 &= 0 \\ y_3 &= -1 \end{aligned} \]

1.7.3 Ordinary Least Squares (OLS)

1.7.3.1 Stacked formulation

Define the stacked observation matrix and measurement vector:

\[ \mathbf{H} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix} \]

1.7.3.2 OLS solution

The OLS estimate minimizes the squared residual norm:

\[ \hat{\mathbf{x}}_{\text{OLS}} = \arg \min_{\mathbf{x}} \| \mathbf{y} - \mathbf{H}\mathbf{x} \|^2 \]

The closed-form solution is:

\[ \hat{\mathbf{x}}_{\text{OLS}} = (\mathbf{H}^\top \mathbf{H})^{-1} \mathbf{H}^\top \mathbf{y} \]

1.7.3.3 Numerical result

\[ \mathbf{H}^\top \mathbf{H} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \]

\[ (\mathbf{H}^\top \mathbf{H})^{-1} = \frac{1}{6} \begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} \]

\[ \mathbf{H}^\top \mathbf{y} = \begin{bmatrix} 0 \\ -2 \end{bmatrix} \]

Therefore:

\[ \boxed{ \hat{\mathbf{x}}_{\text{OLS}} = \begin{bmatrix} 2 \\ -1 \end{bmatrix} } \]

1.7.4 Weighted Least Squares (WLS)

1.7.4.1 Noise model

Assume the measurement noise covariance is:

\[ \mathbf{R} = \sigma^2 \mathbf{I} \]

Then the weighting matrix is: \[ \mathbf{W} = \mathbf{R}^{-1} \]

1.7.4.2 WLS solution

The WLS estimate minimizes the weighted residual:

\[ \hat{\mathbf{x}}_{\text{WLS}} = \arg \min_{\mathbf{x}} (\mathbf{y} - \mathbf{H}\mathbf{x})^\top \mathbf{W} (\mathbf{y} - \mathbf{H}\mathbf{x}) \]

The closed-form solution is:

\[ \hat{\mathbf{x}}_{\text{WLS}} = (\mathbf{H}^\top \mathbf{R}^{-1} \mathbf{H})^{-1} \mathbf{H}^\top \mathbf{R}^{-1} \mathbf{y} \]

Since \(\mathbf{R}^{-1} = \sigma^{-2} \mathbf{I}\), the scalar factor cancels out, yielding:

\[ \boxed{ \hat{\mathbf{x}}_{\text{WLS}} = \hat{\mathbf{x}}_{\text{OLS}} } \]

Thus:

\[ \hat{\mathbf{x}}_{\text{WLS}} = \begin{bmatrix} 2 \\ -1 \end{bmatrix} \]

1.7.5 Recursive Least Squares (RLS)

RLS estimates the same parameter vector sequentially as new data arrive.

1.7.5.1 RLS update equations without forgetting factor

\[ \begin{aligned} \mathbf{K}_k &= \frac{\mathbf{P}_{k-1} \mathbf{H}_k^\top} {1 + \mathbf{H}_k \mathbf{P}_{k-1} \mathbf{H}_k^\top} \\ \hat{\mathbf{x}}_k &= \hat{\mathbf{x}}_{k-1} + \mathbf{K}_k \left(y_k - \mathbf{H}_k \hat{\mathbf{x}}_{k-1}\right) \\ \mathbf{P}_k &= (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k)\mathbf{P}_{k-1} \end{aligned} \]

1.7.5.2 Initialization

Choose: \[ \hat{\mathbf{x}}_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \mathbf{P}_0 = \alpha \mathbf{I}, \quad \alpha \gg 1 \]

This represents a non-informative prior with large initial uncertainty.

1.7.5.3 Recursive estimation result

After processing measurements \(k = 1,2,3\), the RLS estimate converges to:

\[ \boxed{ \hat{\mathbf{x}}_3 = \begin{bmatrix} 2 \\ -1 \end{bmatrix} } \]

and the covariance becomes:

\[ \boxed{ \mathbf{P}_3 = (\mathbf{H}^\top \mathbf{H})^{-1} } \]

1.7.6 Interpretation

For RLS with:

No forgetting factor (\(\lambda = 1\)),
Identical data,
Correct noise assumptions,

the inverse covariance evolves as:

\[ \mathbf{P}_k^{-1} = \sum_{i=1}^k \mathbf{H}_i^\top \mathbf{H}_i \]

After all measurements are processed:

\[ \mathbf{P}_N^{-1} = \mathbf{H}^\top \mathbf{H} \]

and the parameter estimate satisfies:

\[ \hat{\mathbf{x}}_N = (\mathbf{H}^\top \mathbf{H})^{-1} \mathbf{H}^\top \mathbf{y} \]

which is exactly the OLS and WLS solution.

Thus

OLS / WLS solve the normal equations using all data at once.
RLS solves the same equations incrementally via rank-one updates.
The final estimate is identical when the same information is used.

Thus, RLS can be interpreted as an online implementation of WLS.

And the equivalence no longer holds if: - A forgetting factor (\(\lambda < 1\)) is used, - Parameters are time-varying, - Noise statistics are mismatched, - Numerical truncation or regularization is introduced.

Recursive least squares produces the same parameter estimates as batch ordinary or weighted least squares when applied to the same data with identical noise assumptions, differing only in the manner in which information is accumulated and processed.

1.8 Wiener Filtering

Wiener filtering was given in the reference Simon (2006) but mentioned that it is included from a historical perspective. It is not used for the state estimation anymore. However, some key points were collected for the record. They are given below.

Wiener filtering occurs in the frequency domain rather than time domain.
It uses Fourier transform to transform the problem from time to frequency domain.
The main concept of this method is that the power spectrum of the output \(y(t)\) is a function of the Fourier transform of the impulse response system \(G(\omega)\) and the power spectrum of the input \(x(t)\). Impulse response \(g(t)\) (whose transformed version is \(G(\omega)\)) is actually an input control.
Three methods were demonstrated for obtaining optimal filter: Parametric filter optimization, General filter optimization, and (non-)causal filter optimization.
In parametric filter optimization, the filter is assumed to have a functional form with design variables and it reduces to a parameter estimation problem.
In general filter optimization, the optimal filter \(f(t)\) is found by minimizing the expected value of estimation error.
The causal/non-causal filter optimization is similar to the general filter optimization except the data usage differs in estimation. Non-causal filter method uses past,present and future data for estimation, whereas the causal filter uses only past and present data for the estimation.

1.9 Summary

The detailed derivation of final expressions for the methods: ordinary least squares, weighted least squares and recursive least squares, were understood and presented for reference. The key information on Wiener filtering were presented and the book Simon (2006) can be a starting point for the detailed reference.

This lays the foundation for the Kalman filters.