Powell’s Dogleg for Least Squares Minimization from Scratch
I’ve recently released the Rust dogleg
crate and I thought I’d document everything that’s needed to implement a least-squares
Dogleg optimizer from scratch. You probably don’t want to do that, but if you
did, then here’s everything you need to know.
1 Foreword and References
I’ll aim to go into details around the algorithm and implementation, rather than leaving it at a high-level overview. However, since this will be a long article, I’ll need to focus on Dogleg specifically and I can only touch on adjacent topics. Additionally, I will focus on the topic of unconstrained least squares minimization. Dogleg itself is a general purpose minimization algorithm and not restricted to least squares. However, when applied to this important subproblem, we can exploit the structure of the problem in such a way that we can gain some dramatic increases in robustness.
Obviously none of this is original and the best references I know on the topic are Nocedal&Wright “Numerical Optimization” 2nd ed. (in short N&W), Madsen et al. “Methods for Non-Linear Least Squares Problems”, the Ceres Solver documentation and source code, and (to a lesser extent, but surprisingly) the Minpack User Guide and the Modern Minpack source code. I’ll be pretty sloppy with my citations because most of them would just be N&W or Madsen et al.
2 Trust Region Algorithms
Let’s start from the basics: we’d like to minimize a scalar-valued objective function \(f:\mathbb{R}^n \rightarrow \mathbb{R}\) with respect to its parameters \(\boldsymbol{x} \in \mathbb{R}^n\):
\[\min_{\boldsymbol{x}} f(\boldsymbol{x}). \tag{1}\label{min-f}\]Since we are dealing with least-squares minimization, we’ll later set the objective function to the 2-norm of a vector of residuals, but for now that’s not important.
The important thing is that trust region methods are iterative methods that produce a sequence \(\boldsymbol{x}_k\) that (hopefully) converges to an (at least) local1 minimum \(\boldsymbol{x}^\star\). They do this by approximating the function \(f\) in each step \(k\) inside a model function \(m_k : \mathbb{R}^n \rightarrow \mathbb{R}\), which (hopefully) does a reasonable job of approximating the function inside a so-called trust region. The trust region is a multidimensional ellipsoid2, which is shrunk or expanded by how well the model predicted the actual behavior of the function. The model function is typically3 chosen as the quadratic function taken from the 2nd order Taylor expansion of \(f(\boldsymbol{x}_k+ \boldsymbol{p})\) around \(\boldsymbol{x}_k\):
\[\begin{eqnarray} m_k(\boldsymbol{p}) &:=& f_k + \boldsymbol{g}_k^T \boldsymbol{p} + \frac{1}{2} \boldsymbol{p}^T \boldsymbol{B} \boldsymbol{p} \tag{2a} \label{mk-def} \\ &\approx& f(\boldsymbol{x}_k+\boldsymbol{p}) \\ \boldsymbol{g}_k &:=& \nabla f(\boldsymbol{x}_k) \in \mathbb{R}^n, \tag{2b} \label{g-def} \end{eqnarray}\]where \(f_k = f(\boldsymbol{x}_k)\) and \(\boldsymbol{g}_k\) is the gradient of \(f\). \(\boldsymbol{B} \in \mathbb{R}^{n \times n}\) is a symmetric matrix that’s either the Hessian of \(f\) or an approximation of it. We’ll come back to this again later and make this more concrete for the least squares case, but let’s continue with the general purpose trust region description just a little more. In each step \(k\) of the algorithm we try to find a candidate step \(\boldsymbol{p}_k \in \mathbb{R}^n\) by minimizing our current model \(m_k\) inside the trust region. Formally:
\[\boldsymbol{p}_k = \arg \min_{\boldsymbol{p}} m_k(\boldsymbol{p}) \text{ , s.t. } \lVert \boldsymbol{p} \rVert \leq \Delta \tag{3} \label{pk-def}\]where the trust region radius \(\Delta \in \mathbb{R}\) and \(\lVert . \rVert\) is the euclidian vector norm or L2 norm. But wait, eagle-eyed readers will have spotted that \(\lVert \boldsymbol{p} \rVert \leq \Delta\) doesn’t define an elliptical trust region, but a spherical one. This leads into the topic of scaling, which is one of the trickier aspects to get right in practice, but not because the theory is hard. I promise I’ll get back to that, but let’s proceed without scaling for now.
I’ve called \(\boldsymbol{p}_k\) a candidate step, because it’s not automatically accepted. In the following, we’ll introduce a value \(\rho_k \in \mathbb{R}\) to serve both as an acceptance criterion for the step and as a decision criterion when to enlarge or shrink the trust region. It’s defined as the ratio of the actual reduction over the predicted reduction:
\[\rho_k = \frac{f(\boldsymbol{x}_k) - f(\boldsymbol{x}_k+\boldsymbol{p}_k)}{m_k(\boldsymbol{0})-m_k(\boldsymbol{p}_k)} \tag {4} \label{rho-k}.\]Note that the denominator (the predicted reduction) will always be nonnegative (cf. N&W pp 68, 69). Thus, a \(\rho_k > 0\) tells us that the objective function would be reduced by taking the step, whereas the \(\rho_k \leq 0\) indicates that it wouldn’t be. So using a small value \(\rho_\min \geq 0\) as a threshold \(\rho_k > \rho_\min\) allows us to accept or reject the step. N&W give the bound \(\rho_\min \in [0, \frac{1}{4})\), Madsen et al use \(\rho_\min = 0\), Minpack uses \(\rho = 10^{-4}\), and Ceres leaves it configurable but defaults to \(\rho_\min = 10^{-3}\). Very confusing, but at the end not the most important thing in the world.
The other thing that \(\rho_k\) can do for us is help us understand whether the model did a decent job at approximating our actual objective function. If it did a good job, we increase the trust region radius for the next step; if it did a bad job, we shrink the radius. If the model was neither particularly good nor particularly bad at approximating our objective function, we’ll just leave the trust region radius as is. All of this is summarized in the following algorithm.
Algorithm 1 (Trust Region)
- Given:
- initial guess \(\boldsymbol{x}_0\)
- initial trust region radius \(\Delta_0 > 0\)
- step acceptance threshold \(\rho_\min \in [0,\frac{1}{4}]\)
whilestopping criterionnotreached- Obtain step \(\boldsymbol{p}_k\) by (approximately) solving \(\eqref{pk-def}\).
- Calculate \(\rho_k\) using \(\eqref{rho-k}\).
if\(\rho_k > \frac{3}{4}\) then- let \(\Delta_{k+1} = \max(\Delta_k,3 \cdot \lVert \boldsymbol{p}_k \rVert)\)
else if\(\rho_k < \frac{1}{4}\)- let \(\Delta_{k+1} = \Delta_k/2\)
else- let \(\Delta_{k+1} = \Delta_k\)
end ifif\(\rho_k > \rho_\min\)- let \(\boldsymbol{x}_{k+1} = \boldsymbol{x}_k + \boldsymbol{p}_k\)
else- let \(\boldsymbol{x}_{k+1} = \boldsymbol{x}_k\)
end if
end while
The high level outline of a typical trust region algorithm, including Dogleg, cf Algorithm 4.1 in N&W and Algorithm 3.21 in the Madsen et al. paper. I promise that the handwavy parts will be explained in detail later. The one thing I won’t ever explain are the magical 1/4 and 3/4 constants, cf. Nocedal and Madsen for details (but not that many).
Typically, and in this article, the trust region itself is a multidimensional ellipsoid.
3 Least Squares Minimization
For least squares minimization we are concerned with minimizing, you guessed it, a sum of squares:
\[f(\boldsymbol{x}) = \frac{1}{2}\sum_{i=1}^{m} r_i(\boldsymbol{x})^2 = \frac{1}{2} \lVert \boldsymbol{r}(\boldsymbol{x})\rVert^2 \tag{5} \label{f-lsqr},\]where \(\boldsymbol{r}(\boldsymbol{x}) = (r_1(\boldsymbol{x}),\dots, r_m(\boldsymbol{x})) \in \mathbb{R}^m\) is called the residual vector or the vector of residuals. It turns out (cf. N&W p. 245-247), that we can write the gradient and the Hessian of \(f\) in terms of the Jacobian matrix \(\boldsymbol{J}(\boldsymbol{x})\) of \(\boldsymbol{r}(\boldsymbol{x})\). The Jacobian is defined as:
\[\boldsymbol{J}(\boldsymbol{x})= \left(\begin{matrix} \nabla r_1(\boldsymbol{x})^T \\ \vdots \\ \nabla r_m(\boldsymbol{x})^T \\ \end{matrix}\right) \in \mathbb{R}^{m \times n}, \tag{6} \label{jac-r}\]and now we can write the gradient and Hessian of \(f\) as
\[\begin{eqnarray} \nabla f(\boldsymbol{x}) &=:& \boldsymbol{g}(\boldsymbol{x}) = \boldsymbol{J}(\boldsymbol{x})^T \boldsymbol{r} \tag{7.1} \label{grad-f} \\ \nabla^2 f(\boldsymbol{x}) &=& \boldsymbol{J}(\boldsymbol{x})^T \boldsymbol{J}(\boldsymbol{x}) \\ &+& \sum_{i=1}^{m} r_i(\boldsymbol{x}) \nabla^2 r_i(\boldsymbol{x}) \\ &\approx& \boldsymbol{J}(\boldsymbol{x})^T \boldsymbol{J}(\boldsymbol{x}) \tag{7.2} \label{hessian-f}. \end{eqnarray}\]The approximation \(\boldsymbol{B} \approx \boldsymbol{J}^T \boldsymbol{J}\) is typically used for the Hessian of \(f\) and we’ll plug the results above into eq. \(\eqref{mk-def}\):
\[\begin{eqnarray} m_k(\boldsymbol{p}) &=& \frac{1}{2} \lVert \boldsymbol{r}(\boldsymbol{x}_k) \rVert^2 + \boldsymbol{r}(\boldsymbol{x}_k)^T \boldsymbol{J}(\boldsymbol{x}_k)\boldsymbol{p} \\ &+& \frac{1}{2} (\boldsymbol{J}(\boldsymbol{x}_k)\boldsymbol{p})^T \boldsymbol{J}(\boldsymbol{x}_k)\boldsymbol{p} \tag{8.1} \label{mk-lsqr1} \\ &=& \frac{1}{2} \lVert \boldsymbol{r}(\boldsymbol{x}_k) + \boldsymbol{J}(\boldsymbol{x}_k)\boldsymbol{p} \rVert^2 \tag{8.2} \label{mk-lsqr2}. \end{eqnarray}\]Both formulations are useful. Just as an aside, the second line reveals an interesting insight: since \(m_k(\boldsymbol{p})\) is meant to approximate \(f(\boldsymbol{x_k}+\boldsymbol{p}) = \frac{1}{2}\lVert \boldsymbol{r}(\boldsymbol{x_k}+\boldsymbol{p})\rVert^2\), we can see that what we did so far actually implies a linear approximation of the residuals \(\boldsymbol{r}(\boldsymbol{x}+\boldsymbol{p}) \approx \boldsymbol{r}(\boldsymbol{x}) + \boldsymbol{J}(\boldsymbol{x})\boldsymbol{p}\). Not terribly important, but worth noting. We have now taken two big steps on the way to solving the least squares problem with the Dogleg algorithm, but we still need to know the first thing about the Dogleg algorithm. So let’s look into that next.
4 Dogleg Basics
The basic idea of the dogleg algorithm is to approximate a solution to \(\eqref{pk-def}\) by combining two steps: the Gauss-Newton step and the steepest descent step. Please see N&W pp. 73 and Madsen et al sections 3.1 and 3.3 for theoretical background on those steps; in this article I’m just going to assume them as given. Note also that this section already specializes Dogleg to least squares problems.
The Gauss-Newton step \(\boldsymbol{p}_{gn}\) is given as the solution to
\[\begin{eqnarray} \boldsymbol{J}(\boldsymbol{x})^T \boldsymbol{J}(\boldsymbol{x}) \boldsymbol{p}_{gn} = - \boldsymbol{J}^T(\boldsymbol{x}) \boldsymbol{r}(\boldsymbol{x}) \label{normal-eqs} \tag{9.1} \\ \Leftrightarrow \boldsymbol{p}_{gn}(\boldsymbol{x}) = \arg \min_{\boldsymbol{p}} \lVert \boldsymbol{J}(\boldsymbol{x}) \boldsymbol{p} + \boldsymbol{r}(\boldsymbol{x}) \rVert^2, \tag{9.2} \label{p-gn} \end{eqnarray}\]which is just the least squares solution to the following system of equations
\[\boldsymbol{J}(\boldsymbol{x}) \boldsymbol{p}_{gn} \simeq -\boldsymbol{r}(\boldsymbol{x}). \tag{10}\]Note the minus sign on the right hand side. It’s possible to solve this system by directly inverting the normal equations $$\eqref{normal-eqs}, but that becomes numerically unstable quickly. Using matrix decompositions is the way to go here and we’ll return to this later. For now, let’s look at the steepest descent step, which is given as:
\[\boldsymbol{p}_{sd}(\boldsymbol{x}) = -\frac{\lVert \boldsymbol{g}(\boldsymbol{x})\rVert^2}{\lVert \boldsymbol{J}(\boldsymbol{x})\boldsymbol{g}(\boldsymbol{x})\rVert^2} \boldsymbol{g}(\boldsymbol{x}), \tag{11} \label{p-sd}\]where \(\boldsymbol{g}\) is the gradient of \(f\) as defined in eq. \(\eqref{grad-f}\). The Dogleg algorithm searches for the minimum on a path of finite length, which is parametrized using a \(\tau \in [0,2]\) like so:
\[\boldsymbol{p}(\tau) = \left\{ \begin{array}{ll} \tau \, \boldsymbol{p}_{sd}, & \tau \in [0,1] \\ \boldsymbol{p}_{sd} + (\tau-1) (\boldsymbol{p}_{gn}-\boldsymbol{p}_{sd}), & \tau \in (1,2] . \\ \end{array} \right. \tag{12} \label{dogleg-path}\]This definition of the path looks a bit unwieldy at first sight, but it has a very simple geometric interpretation: the path first follows the steepest descent step to its “tip” and from there in a straight line goes to the “tip” of the Gauss-Newton step. Overall, this creates a path with a corner in it, as illustrated in Figure 1.
It can be shown that \(m_k(\boldsymbol{p}(\tau))\) will decrease along the dogleg path and that the path will have at most one intersection with the trust region boundary4. To take the biggest possible step that’s still inside the trust region, we can choose the dogleg step \(\boldsymbol{p}_{dl}\) as the point on the dogleg path where it intersects the trust region boundary. Otherwise the dogleg step is just the end of the path, which is the Gauss-Newton step. So the algorithm for choosing the Dogleg step is:
Algorithm 2 (Classic Dogleg Step)
- Given: Jacobian \(\boldsymbol{J}\), gradient \(\boldsymbol{g}\), residuals \(\boldsymbol{r}\), all evaluated at current \(\boldsymbol{x}_k\)
- Calculate \(\boldsymbol{p}_{sd}\) as in eq. \(\eqref{p-sd}\)
if\(\lVert \boldsymbol{p}_{sd} \rVert \geq \Delta\)- return \(\boldsymbol{p}_{dl} = \Delta \; \boldsymbol{p}_{sd} / \lVert \boldsymbol{p}_{sd}\rVert\)
end if- Calculate \(\boldsymbol{p}_{gn}\) as in eq. \(\eqref{p-gn}\)
if\(\lVert \boldsymbol{p}_{gn} \rVert \leq \Delta\)- return \(\boldsymbol{p}_{dl} = \boldsymbol{p}_{gn}\)
else- Find \(\tau_{dl} \in [1,2)\) such that \(\lVert \boldsymbol{p}(\tau_{dl}) \rVert = \Delta\) with \(\boldsymbol{p}(\tau)\) from \(\eqref{dogleg-path}\)
- return \(\boldsymbol{p}_{dl} = \boldsymbol{p}(\tau_{dl})\)
end if
We call this the classic dogleg step, because later in the article we will make one tiny modification to the calculation of the Gauss-Newton step, that brings a significant stability improvement in practice.
To obtain the magical \(\tau_{dl}\) in the third if branch, we have to
realize that we are now in the \(\tau \in (1,2)\) path segment, because
the case \(\tau \leq 1\) is covered by the first branch and the case
\(\tau = 2\) is covered in the second branch. The condition
\(\lVert \boldsymbol{p}(\tau_{dl}) \rVert = \Delta\) is equivalent to the quadratic
equation
It can be shown that the solution to this equation can always be written as follows (see Appendix A):
\[\begin{eqnarray} \tau_{dl} &=& 1 - \xi + \sqrt{\frac{\Delta^2-\lVert \boldsymbol{p}_{sd}\rVert^2}{\lVert \boldsymbol{p}_{gn}-\boldsymbol{p}_{sd}\rVert^2} + \xi^2} \tag{14.1} \label{tau-dl} \\ \xi &:=& \frac{\boldsymbol{p}_{sd}^T (\boldsymbol{p}_{gn} - \boldsymbol{p}_{sd})}{\lVert \boldsymbol{p}_{gn}-\boldsymbol{p}_{sd}\rVert^2} \tag{14.2}. \end{eqnarray}\]In principle, we now have all the pieces together. We just use the Dogleg step from Algorithm 2 and plug this into Algorithm 1 as the step that approximately solves \(\eqref{pk-def}\). However, there are a couple of pesky little details that we still have to take care of.
5 Parameter Scaling
Parameter scaling is used to address the fact that an objective function can be very sensitive to changes in certain parameters, while it’s less sensitive to others. In this case, we say that the objective function is poorly scaled, which can manifest as the minimizer being in a narrow valley. In this case, elliptical trust regions are a much better fit than the spherical trust regions we’ve considered so far. The test suite of my Dogleg implementation taught me that compensating for parameter scaling is one of the most important things to implement for going from a decent implementation to a great one.
Elliptical trust regions are defined by
\[\begin{eqnarray} \lVert \boldsymbol{D} \boldsymbol{p} \rVert &\leq& \Delta, \tag{15.1} \label{elliptical-tr} \\ \boldsymbol{D} &:=& \text{diag}(d_1,\dots,d_n) \in \mathbb{R}^{n \times n},\; \tag{15.2} \label{d-def} \\ d_j &>& 0 \end{eqnarray}\]where \(\boldsymbol{D}\) is an invertible diagonal matrix. Now the optimization problem in eq. \(\eqref{pk-def}\) becomes
\[\boldsymbol{p}_k = \arg \min_{\boldsymbol{p}} m_k(\boldsymbol{p}) \text{ , s.t. } \lVert \boldsymbol{D p} \rVert \leq \Delta \tag{16} \label{pk-elliptical},\]where the only change to the problem before is the elliptical shape of the trust region. There are ways to express the solution to this problem using the elliptical bounds, but a much simpler alternative is to perform a coordinate transform such that we are now working in a scaled coordinate system for the steps.
\[\boldsymbol{\widetilde{p}} = \boldsymbol{D p}. \tag{17} \label{scaled-p}\]By substituting this into \(\eqref{pk-elliptical}\), we obtain a minimization problem with spherical bounds in the scaled step coordinates:
\[\begin{eqnarray} \boldsymbol{\widetilde{p}}_k &=& \arg \min_{\boldsymbol{\widetilde{p}}} \widetilde{m}_k(\boldsymbol{\widetilde{p}}) \text{ , s.t. } \lVert \boldsymbol{\widetilde{p}} \rVert \leq \Delta \tag{18.1} \label{mk-scaled}, \\ \widetilde{m}_k(\boldsymbol{\widetilde{p}}) &:=& \frac{1}{2} \lVert \boldsymbol{r}(\boldsymbol{x}) \rVert^2 + \boldsymbol{\widetilde{g}}(\boldsymbol{x})^T \boldsymbol{\widetilde{p}} \\ &+& \frac{1}{2} (\boldsymbol{\widetilde{J}}(\boldsymbol{x})\boldsymbol{\widetilde{p}})^T \boldsymbol{\widetilde{J}}(\boldsymbol{x})\boldsymbol{\widetilde{p}} \tag{18.2} \label{mk-scaled-1} \\ &=& \frac{1}{2} \lVert \boldsymbol{r}(\boldsymbol{x}) + \boldsymbol{\widetilde{J}}(\boldsymbol{x})\boldsymbol{\widetilde{p}} \rVert^2 \tag{18.3} \label{mk-scaled2} \\ \boldsymbol{\widetilde{J}}(\boldsymbol{x}) &:=& \boldsymbol{J}(\boldsymbol{x}) \boldsymbol{D}^{-1} \tag{18.4} \label{j-scaled} \\ \boldsymbol{\widetilde{g}}(\boldsymbol{x}) &:=& \boldsymbol{D}^{-1} \boldsymbol{g}(\boldsymbol{x}) = \boldsymbol{\widetilde{J}}(\boldsymbol{x})^T \boldsymbol{r}(\boldsymbol{x}) \tag{18.5} \label{g-scaled}, \end{eqnarray}\]where we call \(\boldsymbol{\widetilde{g}}\) the scaled gradient and \(\boldsymbol{\widetilde{J}}\) the scaled Jacobian. The neat thing is that, at each iteration, we can just calculate the scaled dogleg step using the same math as above as long as we consistently use the scaled Jacobian and gradient for the calculations. Only to calculate the next evaluation point \(\boldsymbol{x}_{k+1} = \boldsymbol{x}_k + \boldsymbol{p}_k\), do we have to translate the step into unscaled coordinates by using \(\boldsymbol{p}_k = \boldsymbol{D}^{-1} \widetilde{\boldsymbol{p}}_k\).
5.1 Practical Parameter Scaling
Now, let’s look at how to actually implement practically useful parameter scaling. Madsen et al. don’t mention scaling at all, while N&W simply state: “Information to construct the scaling matrix \(\boldsymbol{D}\) can be derived from the second derivatives […]. We can allow \(\boldsymbol{D}\) to change from iteration to iteration” (N&W section 4.5). For more practically useful information, we have to turn to gold-standard implementations, like Ceres Solver or Minpack. The Minpack User Guide section 2.5 gives some good theoretical and practical insight. However, the best performing scaling I’ve found is implemented in the Ceres Solver5.
5.1.1 Static and Dynamic Scaling
Curiously, Ceres combines two types of scaling. To be able to talk about these two types, we are going to introduce some (non-standard) vocabulary: Static Scaling \(\boldsymbol{D}_s\) and Dynamic Scaling \(\boldsymbol{D}_d\). Ceres calls those Jacobi scaling, and diagonal scaling, respectively. I find the names misleading because both matrices are diagonal matrices, both matrices act on the Jacobian (among other things), and they are calculated based on the Jacobian (though slightly differently).
The matrix \(\boldsymbol{D}_s\) is calculated once in the first iteration and stays the same for the whole runtime of the algorithm. Its intent is to “improve the conditioning of the Jacobian”6:
\[\begin{eqnarray} \boldsymbol{D}_s &=& \text{diag}(d_{s,1},\,\dots\,,d_{s,n}) \in \mathbb{R}^{n \times n} \tag{19.1} \label{Ds} \\ d_{s,i} &=& \lVert\boldsymbol{j}_i\rVert+1 \tag{19.2}, \\ \end{eqnarray}\]where \(\boldsymbol{j}_i\) is the \(i\)-th column of the Jacobian. For this matrix, since it’s only calculated once in the first iteration, this is the Jacobian at initialization. Using this scaling indeed makes the algorithm perform better in my test suite, but I’ll be honest: this was a bit of a surprise to me, once I understood what was going on.
The dynamic scaling matrix \(\boldsymbol{D}_d^{(k)}\) is updated every time the Jacobian changes, i.e. every time a step is taken. The dynamic scaling implemented by Ceres is very similar to the scaling described in the Minpack User guide. Its primary intent is to give us an adaptive elliptical trust region. I’ve included the \((k)\) superscript index to make the explicit dependence on the step obvious. It’s calculated as 7:
\[\begin{eqnarray} \boldsymbol{D}_d^{(k)} &=& \text{diag}(d_{d,1}^{(k)},\,\dots\,,d_{d,n}^{(k)}) \in \mathbb{R}^{n \times n} \tag{20.1} \label{Dd} \\ d_{d,i}^{(k)} &=& \max(\min(d_{max}, \lVert\boldsymbol{j}_i^{(k)}\rVert),d_{min}), \tag{20.2} \\ \end{eqnarray}\]where \(\boldsymbol{j}_i^{(k)}\) again denotes the \(i\)-th column of the Jacobian evaluated at the current step with index \(k\). The diagonal elements are just the column-norms of the Jacobian clamped to the range \([d_{min}, d_{max}]\), where the default values for the endpoints of the range in Ceres are:
\[\begin{eqnarray} d_{min} &=& 10^{-3} \\ d_{max} &=& 10^{16} \\ \end{eqnarray}\]We can combine those scaling matrices into one step dependent diagonal scaling matrix \(\boldsymbol{D}_k\), where \(k\) is the index of the current step and \(\boldsymbol{x}_k\) is the current iterate.
\[\boldsymbol{D}_k = \boldsymbol{D}_s \boldsymbol{D}_d^{(k)} \tag{21} \label{Dk}\]5.1.2 Implementing Parameter Scaling
Don’t worry if this has all been a bit confusing so far, we’re going to tie it together now. At each iteration, we calculate the scaling matrix \(\boldsymbol{D}_k\) as given in \(\eqref{Dk}\). We can then use this matrix to calculate the scaled Jacobian \(\widetilde{\boldsymbol{J}}\) from the Jacobian \(\boldsymbol{J}\), using \(\eqref{j-scaled}\) and from that the scaled gradient using \(\eqref{g-scaled}\).
We calculate the dogleg step in scaled space by using the same algorithm as described above. The two ingredients to the dogleg step are the \(\widetilde{\boldsymbol{p}}_{gn}\) and \(\widetilde{\boldsymbol{p}}_{sd}\), which are the Gauss-Newton and the steepest descent step, respectively, both in scaled space. From that, we calculate the final dogleg step in scaled space \(\widetilde{\boldsymbol{p}}_{dl}\) using Algorithm 2. We can then use Algorithm 1 to perform the trust region step, but we have to be sure to unscale the dogleg step using \(\boldsymbol{p}_{dl} = \boldsymbol{D}^{-1} \widetilde{\boldsymbol{p}}_{dl}\) before using it to get the next step \(\boldsymbol{x}_{k+1} = \boldsymbol{x}_k + \boldsymbol{p}_k\).
To obtain \(\widetilde{\boldsymbol{p}}_{sd}\), we just modify \(\eqref{p-sd}\) with the scaled gradient and Jacobian:
\[\widetilde{\boldsymbol{p}}_{sd}(\boldsymbol{x}) = -\frac{\lVert \widetilde{\boldsymbol{g}}(\boldsymbol{x})\rVert^2}{\lVert \widetilde{\boldsymbol{J}}(\boldsymbol{x})\widetilde{\boldsymbol{g}}(\boldsymbol{x})\rVert^2} \widetilde{\boldsymbol{g}}(\boldsymbol{x}), \tag{22} \label{p-sd-scaled}\]In the next section, I’ll explain how to calculate the Gauss-Newton step, since I wanted to introduce regularization to it as well. Note that we’ll never actually need to form the scaled Jacobian to calculate \(\eqref{p-sd-scaled}\), because we can just rewrite the matrix vector product in the denominator as \(\widetilde{\boldsymbol{J}}\widetilde{\boldsymbol{g}} = \boldsymbol{J}(\boldsymbol{D}^{-1} \widetilde{\boldsymbol{g}})\), which saves a couple of FLOPS, but probably won’t matter that much in the grand scheme of things.
6 Finding the Regularized Gauss-Newton Step
Instead of using \(\eqref{normal-eqs}\) to calculate the Gauss-Newton step using suitable matrix decompositions, we’ll modify the equations to calculate the regularized Gauss-Newton step for increased stability. None of the sources I know for the Dogleg Algorithm mentions this, but Ceres does it to measurably improve the stability of the algorithm8. The regularization is performed in scaled space such that the regularized normal equations are:
\[(\widetilde{\boldsymbol{J}}^T \widetilde{\boldsymbol{J}} + \mu \boldsymbol{I}) \widetilde{\boldsymbol{p}}_{gn} = - \widetilde{\boldsymbol{J}}^T \boldsymbol{r} \label{normal-eqs-scaled-regularized} \tag{23}, \\\]where I’ve omitted the explicit dependency on \(\boldsymbol{x}\) for notational convenience. Here \(\mu \in \mathbb{R}\) is a scalar with \(\mu > 0\) and \(\boldsymbol{I} \in \mathbb{R}^{n \times n}\) is the identity matrix. Together they improve the conditioning of the approximate Hessian \(\widetilde{\boldsymbol{J}}^T\widetilde{\boldsymbol{J}}\) in scaled space. By substituting \(\widetilde{\boldsymbol{J}} = \boldsymbol{J}\boldsymbol{D}^{-1}\), \(\boldsymbol{p}_{gn} = \boldsymbol{D}^{-1}\widetilde{\boldsymbol{p}}_{gn}\) and then rewriting a little, we can see that equation \(\eqref{normal-eqs-scaled-regularized}\) is equivalent to solving the following unscaled system:
\[(\boldsymbol{J}^T \boldsymbol{J} + \mu \boldsymbol{D}^2) \boldsymbol{p}_{gn} = - \boldsymbol{J}^T \boldsymbol{r} \label{normal-eqs-unscaled-regularized} \tag{24}, \\\]which is also equivalent to the following (augmented) least squares problem:
\[\boldsymbol{p}_{gn} = \arg\min_{\boldsymbol{p}} \left\lVert \left(\begin{matrix} \boldsymbol{J} \\ \sqrt{\mu}\boldsymbol{D} \end{matrix}\right) \boldsymbol{p} - \left(\begin{matrix} \boldsymbol{-r} \\ \boldsymbol{0} \end{matrix}\right) \right\rVert^2 \label{ceres-augmented-system-regularized} \tag{25}\]which is just another way of saying to solve the following system in a least-squares sense:
\[\left(\begin{matrix} \boldsymbol{J} \\ \sqrt{\mu}\boldsymbol{D} \end{matrix}\right) \boldsymbol{p}_{gn} \simeq -\left(\begin{matrix} \boldsymbol{r} \\ \boldsymbol{0} \end{matrix}\right) \tag{26} \label{augmented-least-squares-sense}.\]Ceres uses QR decomposition to solve this augmented system in a least squares sense9, but also allows other solvers to be selected. It’s also possible to use singular value decomposition(SVD) to solve the system. That one even allows us to reuse the SVD of \(\boldsymbol{J}\) for different \(\mu\), but is typically not worth its high computational cost compared to other solvers, which is why it’s not available in Ceres.
All formulations \(\eqref{normal-eqs-scaled-regularized}\) - \(\eqref{augmented-least-squares-sense}\) are equivalent, but there’s one small caveat: if we use any of the formulations \(\eqref{normal-eqs-unscaled-regularized}\) - \(\eqref{augmented-least-squares-sense}\), then we solve for the unscaled Gauss-Newton step, and we have to transform it into scaled space using \(\widetilde{\boldsymbol{p}}_{gn} = \boldsymbol{D} \boldsymbol{p}_{gn}\) for use in Algorithm 2 together with \(\widetilde{\boldsymbol{p}}_{sd}\) to calculate the scaled Dogleg step \(\widetilde{\boldsymbol{p}}_{dl}\). Only after the complete scaled Dogleg step has been calculated, can we unscale it to obtain the next iterate \(\boldsymbol{x}_{k+1}\). If that all seems a bit roundabout, it’s of course fine to use \(\eqref{normal-eqs-scaled-regularized}\), but the advantage of using the other formulations is that we (like in the previous section) can use those without having to ever form the scaled Jacobian explicitly.
6.1 Updating the Regularization Parameter \(\mu\)
To see how to choose the regularization parameter, we’ll dig into the Ceres source code again. There are two important values \(\mu_{min} = 10^{-8}\) and \(\mu_{max} = 1\), which are hard-coded10. The parameter \(\mu\) starts out with \(\mu_{min}\) and is always restricted to the range \([\mu_{min}, \mu_{max}]\) 11. When a linear solver fails to solve the system \(\eqref{augmented-least-squares-sense}\) in a least squares sense, the value of \(\mu\) gets increased by a factor \(10\) and the solution is retried 12. If the solver cannot solve before hitting the limit \(\mu_{max}\), the optimization is terminated with failure. The value of \(\mu\) persists between iterations, but it is decreased by a factor of \(5\) every time a step is accepted13.
This scheme adds a little bit of complexity, but as I said before it is worth it. Luckily, if we’re using a solver that can’t fail, then \(\mu\) always stays at \(\mu_{min}\). Examples of linear solvers that can’t fail are e.g. SVD14 or column-pivoted QR, the latter of which is an option in Ceres.
7 Stopping Criteria
Before putting this all together, there’s one more thing I completely glossed
over in Algorithm 1 and that is “while stopping criterion not reached”. So
let’s talk about stopping criteria, specifically convergence criteria that
tell us whether we think a solution is good enough or whether we’re possibly
stuck. N&W doesn’t define stopping criteria, but Madsen et al., the Minpack
User Guide section 2.3, and (obviously) the Ceres source code do. I’ll list the criteria
in different sections and we’ll see there’s a decent amount of overlap as well.
7.1 Minpack Stopping Criteria
I’ll label the stopping criteria from Minpack with MPK-1, MPK-2,… to distinguish them from the criteria of other sources. Firstly, Minpack describes three convergence criteria to try and estimate how far our current iterate is away from an optimal solution.
-
MPK-1: The Xtol Criterion estimates how close we are to a true solution and is stated as
\[\Delta \leq x_{tol} \cdot \lVert \boldsymbol{Dx} \rVert, \tag{27} \label{minpack-xtol}\]where \(x_{tol} \in \mathbb{R}\) and \(x_{tol} > 0\). Formally, this criterion tries to guarantee a bound on the distance of the current parameter estimate \(\boldsymbol{x}_k\) from the true, unknown optimum \(\boldsymbol{x}^\star\) such that \(\lVert \boldsymbol{D}(\boldsymbol{x}_k-\boldsymbol{x}^\star) \rVert \leq x_{tol} \lVert \boldsymbol{Dx}^\star \rVert\).
-
MPK-2: The Ftol Criterion tries to bound \(f(\boldsymbol{x}) \leq (1+f_{tol}) \cdot f(\boldsymbol{x}^\star)\). Again \(f_{tol} \in \mathbb{R}\) and \(f_{tol} > 0\) is a user-supplied tolerance. Since the true optimum is unknown, the test is defined in terms of the normalized actual reduction and the normalized predicted reduction defined as
\[\begin{eqnarray} \text{ACTRED} &:=& \frac{f(\boldsymbol{x}_{k})- f(\boldsymbol{x}_{k+1})}{f(\boldsymbol{x}_{k})} \\[0.5em] \text{PREDRED} &:=& \frac{ m_k(\boldsymbol{0})- m_k(\boldsymbol{p}_{k})}{ f(\boldsymbol{x}_{k})}. \\ \end{eqnarray}\]The criterion is considered fulfilled if the following three conditions are met:
\[\begin{eqnarray} \text{PREDRED} &\leq& f_{tol} \tag{28.1} \\\ |\text{ACTRED}| &\leq& f_{tol} \tag{28.2} \\ \text{ACTRED} &\leq& 2\cdot\text{PREDRED} \tag{28.3} \end{eqnarray}\]Since this criterion attempts to set a relative bound on the residuals compared to their optimum, this won’t be hit if \(f(\boldsymbol{x}^\star)=0\), unless our residuals also vanish. It’s therefore useful to supplement this criterion with an absolute check \(f(\boldsymbol{x}_k) < \epsilon_f\), where \(\epsilon_f\) is a small tolerance, possibly in the order of machine precision.
-
MPK-3: The Gtol Criterion tries to estimate if we’re at a minimum by checking whether the gradient at the current iterate vanishes. Instead of just comparing the maximum absolute value of the gradient against a threshold, this criterion checks:
\[\max_i\left\{ \frac{| \boldsymbol{j}_i^T \; \boldsymbol{r}(\boldsymbol{x}_k)|}{\lVert \boldsymbol{j}_i\rVert \cdot \lVert \boldsymbol{r}(\boldsymbol{x}_k)\rVert}\right\} \leq g_{tol}, \tag{29} \label{minpack-gtol},\]where \(\boldsymbol{j}_i\) is the \(i\)-th column of the Jacobian evaluated at the current parameters \(\boldsymbol{x}_k\). Formulating the gradient criterion like this makes the criterion more robust to scaling in \(f\), but it might not always confer a practical advantage. Again \(g_{tol} > 0\) is a user supplied real number.
By default, the Minpack implementation sets \(g_{tol} = 0\) and the user guide gives some guidance on the values of \(x_{tol}\) and \(f_{tol}\):
In general, \(x_{tol}\) and \(f_{tol}\) should be smaller than \(10^{-5}\); recommended values for these tolerances are on the order of the square root of the machine precision.
The Minpack implementation actually checks each of the three criteria above twice. First, it performs the checks with the user supplied tolerances. If any of the convergence criteria is hit, then the optimizer terminates successfully. Second, it performs the checks again but uses machine epsilon instead of the user supplied tolerances. If any of those checks hit, then the optimization terminates with failure because it’s reasonable to assume that no further progress can be made.
The Minpack implementation actually checks one more hard stopping criterion, and that’s this one15:
- MPK-4: The Max Eval criterion makes the algorithm terminate when the number of times the function \(f\) was evaluated exceeds a certain limit \(N_{feval}\). At other places in the libraries similar parameters have default values of \(N_{feval} = 100\cdot(n+1)\) or \(200\cdot (n+1)\), where \(n\) is the number of elements in \(\boldsymbol{x}\), i.e. the number of variables.
The number of function evaluations is highly correlated to the number of iterations, but it increases every time that a step is evaluated rather than every time a step is actually accepted. Since function evaluations may be costly, this is a more stringent bound on the limit of computations to perform than the number of iterations, where each iteration possibly tests many steps.
7.2 Madsen et al. Stopping Criteria
To label the stopping criteria in the Madsen, Nielsen, and Tingleff paper I’ll use the prefix MNT. The following four stopping criteria are defined:
-
MNT-1: The Max Gradient Element criterion terminates the optimization with success under the condition
\[\lVert \boldsymbol{g} \rVert_{\infty} \leq \epsilon_g,\]where \(\boldsymbol{g}\) is the gradient at the current iterate \(\boldsymbol{x}_k\) and \(\epsilon_g > 0\) is a real number.
-
MNT-2: The Step Norm Criterion terminates the optimization with success
when
\[\lVert \boldsymbol{p}_{dl} \rVert \leq \epsilon_p \cdot (\lVert \boldsymbol{x}_k \rVert + \epsilon_p),\]where \(\boldsymbol{p}_{dl}\) is the current unscaled Dogleg step and \(\epsilon_p > 0\) is a real number.
-
MNT-3: The Max Residual Criterion terminates with success if
\[\lVert \boldsymbol{r} \rVert_{\infty} \leq \epsilon_r,\]where \(\boldsymbol{r}\) is the residual vector, evaluated at the current iterate \(\boldsymbol{x}_k\) and \(\epsilon_r > 0\) a real number.
-
MNT-4 The Max Iteration Count Criterion terminates with failure if the iteration count surpasses a defined limit \(N_{iter}\):
\[k > N_{iter}\]
We can already see that there’s a lot of conceptual overlap in the stopping criteria between Minpack and Madsen et al.. But there’s also a decent amount of subtle (and not so subtle) differences. Although the Minpack criteria look more sophisticated, my experience is that they don’t always perform better in practice.
7.3 Ceres’ Stopping Criteria
Ceres introduces some new stopping criteria, but also takes almost all the ones from Madsen et al. like so:
- MNT-1 with default gradient tolerance of \(\epsilon_g = 10^{-10}\)
- MNT-2 with default parameter tolerance of \(\epsilon_p = 10^{-8}\)
- MNT-4 with a default iteration limit of \(N_{iter} = 50\)
Additionally, Ceres defines a number of custom stopping criteria, which I’ll prefix with CRS
-
CRS-1: The Function Tolerance Criterion16 is used instead of MNT-3 and terminates successfully if
\[|\text{ACTRED}| \leq \epsilon_a,\]where \(\text{ACTRED}\) is defined as in MPK-2 and \(\epsilon_a = 10^{-6}\) by default.
-
CRS-2: The Min Trust Region Radius17 criterion terminates with success if the trust region radius \(\Delta\) is below a certain threshold:
\[\Delta \leq \Delta_{min},\]where by default \(\Delta_{min} = 10^{-32}\).
-
CRS-3: The Max Solver Time18 criterion terminates with failure if the total time that the solver took exceeds a given limit19.
All the criteria above (including the ones used from Madsen et al.) make
the Ceres solver indicate CONVERGENCE on success and NO_CONVERGENCE on
failure. However, Ceres also has a distinct hard FAILURE mode, which is
different from NO_CONVERGENCE. The following criteria terminate the algorithm
in a hard failure mode:
-
CRS-4: Hard failure is triggered if function or Jacobian evaluation fails or if the linear solver cannot find a regularized Gauss-Newton step despite having hit the maximum regularization parameter \(\mu\).
-
CRS-5: Hard failure is triggered if too many consecutive steps were rejected20. By default, the limit is 5 consecutive invalid steps.
7.4 Which Stopping Criteria To Use?
For my dogleg crate I picked a combination
of the Minpack and the Ceres criteria, which will probably evolve a little
over time21. If you had to pick only one set, my recommendation
would be to just go with the Ceres criteria.
8 Putting It Together
I was considering re-stating Algorithm 1 and Algorithm 2 by adding all the gory details we discussed. But honestly, not that much changes. We only have to make the following additions to Algorithm 1:
- Use the real set of stopping criteria that we decided on in section 7.
- Calculate the diagonal weighting matrix \(\boldsymbol{D}_k\) for each step as given by \(\eqref{Dk}\).
- Calculate the scaled and regularized step as described in sections 5 and 6. Make sure to unscale the step right before evaluating \(\boldsymbol{x}_{k+1} = \boldsymbol{x}_k +\boldsymbol{p}_k\).
For Algorithm 2 we don’t have to make any adjustments other than making sure that we calculate the steepest descent step and the regularized Gauss-Newton step in scaled space. Again, the step must only be unscaled once Algorithm 2 completes.
And that’s it. That’s all you need to put together your own implementation
of a high quality least squares minimization algorithm using the Dogleg Algorithm.
If you’re interested in a Rust implementation, please check out my
open-source dogleg crate.
Appendix A: Finding \(\tau_{dl}\)
Since it’s not really spelled out in the other referenced material, I’ll show how to obtain the solution to \(\eqref{tau-eqn}\):
\[\begin{eqnarray} \Delta^2 &=& \lVert \boldsymbol{p}_{sd} + (\tau-1) (\boldsymbol{p}_{gn}-\boldsymbol{p}_{sd}) \rVert^2 \tag{A1} \\ \Leftrightarrow \Delta^2 &=& \lVert \boldsymbol{p}_{sd}\rVert^2 + 2(\tau-1) \boldsymbol{p}_{sd}^T(\boldsymbol{p}_{gn}-\boldsymbol{p}_{sd}) \\[0.2em] &\,& + (\tau-1)^2 \lVert(\boldsymbol{p}_{gn}-\boldsymbol{p}_{sd}) \rVert^2, \end{eqnarray}\]where we know that the solution \(\tau\) must be in the range \((1,2)\). Let’s introduce the following definitions
\[\begin{eqnarray} x &:=& \tau -1 \in (0,1) \tag{A2.1} \\ a &:=& \lVert \boldsymbol{p}_{sd}\rVert^2 \geq 0 \tag {A2.2} \\ b &:=& \boldsymbol{p}_{sd}^T(\boldsymbol{p}_{gn}-\boldsymbol{p}_{sd}) \geq 0 \tag{A2.3} \\ c &:=& \lVert(\boldsymbol{p}_{gn}-\boldsymbol{p}_{sd}) \rVert^2 > 0, \tag{A2.4} \\ \end{eqnarray}\]where the inequalities follow trivially, except for \(b\geq 0\) and \(c>0\). The inequality \(b \geq 0\) follows from Lemma 4.2 on p.75 of N&W. For the strict inequality \(c>0\), we have to look at Algorithm 2 again. Obviously \(c \geq 0\) because it’s a norm squared. So the one thing we need to prove for \(c>0\) is \(\boldsymbol{p}_{gn} \neq \boldsymbol{p}_{sd}\). Let’s assume \(\boldsymbol{p}_{gn} = \boldsymbol{p}_{sd} = \boldsymbol{p}\). In that case either the first condition \(\lVert\boldsymbol{p}\rVert \geq \Delta\) of Algorithm 2 or its second condition \(\lVert\boldsymbol{p}\rVert \leq \Delta\) kicks in. Thus, at the time where we’re trying to solve the quadratic equation, we know that \(c >0\). Writing the equation using the definitions above gives us
\[\Delta^2 = a + 2 b x + c x^2 \tag{A3}\]and since we know \(c\neq 0\), we can write the solution as
\[x = -\frac{b}{c} \pm \sqrt{\frac{c\Delta^2+b^2-ac}{c^2}}. \tag{A4}\]Since we know from the inequalities above that \(b/c \geq 0\), the only way in which the solution can satisfy \(x>0\) is when the sign before the square root is positive. Therefore, the positive solution must be the unique valid root. It’s also quite easy to prove that \(x \in (0,1)\), but that’s left as an exercise to the reader.
From that, the statement \(\eqref{tau-dl}\) follows. Note that for numerical reasons it might still be necessary to tackle the case \(c\approx 0\), which can manifest when \(|c| < \epsilon\), where \(\epsilon\) might be in the order of machine precision.
-
Convergence guarantees of solver methods are their own beast that I won’t touch at all in this article. Everyone that has ever worked with optimization algorithms knows that finding global optima is often a pipe dream and even finding a local optimum can be highly sensitive to starting conditions, implementation details, condition numbers, birthdates, star signs, etc etc… ↩
-
Other shapes are available. One very common case is a spherical trust region, which is just a special case of the ellipsoid. Another common case would be a box-shaped region in hyperspace. ↩
-
You guessed it: it doesn’t have to be quadratic. See e.g. pp 25, 26 in N&W 2nd ed. for a demonstration how a linear model leads to a steepest descent algorithm. ↩
-
See N&W pp. 74, in particular Lemma 4.2 for this. For this lemma to hold, we need the Jacobian \(\boldsymbol{J}\) to have full rank. We’ll later see that we can also use Dogleg in practice for all Jacobians, if we use a regularized Gauss-Newton step, rather than the vanilla step defined in \(\eqref{p-gn}\). ↩
-
Where “best” is evaluated against the suite of test problems that I use. This is the famous MGH test suite described by More, Garbow, and Hilstrom in “Testing Unconstrained Optimization Software”. ↩
-
See
trust_region_minimizer.cc:265and following lines in the Ceres Solver source code. From the comments in the Ceres source code we can piece together that the scaling is meant to act roughly as \(\text{diag}(\boldsymbol{H})^{-1}\), where \(\boldsymbol{H} \approx \boldsymbol{J}^T \boldsymbol{J}\) is the Hessian of \(f\). The addition of \(1\) is there to counteract division by small numbers. It’s important to note that Ceres actually defines the scaling matrix as the inverse of the matrix that I gave, but they apply the matrix itself (not its inverse) to the Jacobian from the right hand side. To make their matrix consistent with my notation (where always the inverse of a scaling matrix is applied to the Jacobian from the right), I have to invert the definition. That means the scaling is exactly the same both in this document and in Ceres, I’ve just chosen notational consistency. ↩ -
See
dogleg_strategy.cc:117and following, as well astrust_region_strategy.h:71. But note thesqrtoperation in the actual diagonal matrix, which means the actual enforced clamping range is given by thesqrtof the values. ↩ -
See
dogleg_strategy.cc:517. Again, I can confirm that this measure makes the algorithm perform better on my test suite. ↩ -
The last two formulations actually show how Ceres solves the system. First the matrix \(\sqrt{\mu} \boldsymbol{D}\) is formed in
dogleg_strategy.cc:561and passed to the solver options of the least squares solver. Then e.g. in thedense_qr_solver.cc:48we can see that the diagonal is used as described to construct the augmented system before solving this with QR decomposition. ↩ -
See
dogleg_strategy.cc:60. ↩ -
See
dogleg_strategy.cc:63anddogleg_strategy:533. ↩ -
See
doglec_strategy.cc:632. ↩ -
If you’re interested in how to solve the regularized normal equations (in scaled space) with SVD, you can have a look at my code here. ↩
-
The Minpack implementation I’m referring to is actually of the Levenberg-Marquardt algorithm but those criteria are also applicable to Dogleg, see
minpack.f90:1488. ↩ -
As of the time of writing, that time limit defaults to a time period of more than 30 years, which is basically unlimited for all intents and purposes… ↩

Comments
Join the discussion for this article on this ticket. Comments appear on this page instantly.
No comments yet. Be the first to comment!