- The rationality of the step size
In the following section, we would begin to discuss two conditions which a_k(the step size) have to satisfy. If the two conditions are both satisfied, we could say that the proper step size, from x_k to x_(k+1), has been found. The first condition called “sufficient decrease condition”. Intuitively, this condition keeps that the value at x_(k+1) should be smaller than x_(k)`s, which is required to reach the convergence. The second one is called “curvature condition”. This condition keeps that the derivative of f at x_k must be smaller than x_(k+1)`s.
- Sufficient decrease condition
A proper a_k must satisfy sufficient decrease condition. We could make the following descriptions about this condition:
f( x_(k+1) ) <= f( x_k ) + c_1 * a_k * ▽f_k^T * p_k;
if we plugin EQ1, we have
f( x_k + a_k * p_k ) <= f( x_k ) _ c_1 * a_k * ▽f_k^T * p_k
where f( x_k ) is the value of f at x_k; ▽f_k means the gradient at x_k; p_k means the direction from x_k to x_(k+1); a_k means the step size and c_1 is a constant that belongs to [0, 1]
If p_k is the decent direction of function f, we have ▽f_k * p_k < 0. So f( x_k + 1 )<f( x_k ).
As only a_k is a variable, we could rewrite as followed:
Ø(a_k) ≤ l(a_k), where
We could visualize the condition as below:
If the step size could make Ø(a_k) in the acceptable interval, the condition is satisfied.
- Curvature condition
We could describe this condition as following:
In the above equation, ▽f_(k+1) is the gradient at x_(k+1); p_k is the decent direction and c_2 is an constant belongs to [0 1] and usually set as 0.9.
When p_k is the decent direction of the function f, then ▽f_k * p_k < 0
So, it means ▽f _k+1 ≥ c_2 * ▽f_k
From the plot below, we could see that the equation above requires that the decent “speed” at x_(k+1)should be smaller than x_k`s.
- Wolfe condition
Wolfe condition could be seemed as the combination of the two conditions mentioned above, which means the a_k should meet the following equations.
We could visualize the condition as:
In some real optimization application, we may have to use another condition, called string Wolfe condition, which is stemmed from the Wolfe condition. In this case, the a_k need to meet the followings:
The only difference between those two conditions is the strong Wolfe condition avoids the possible positive value of ▽f(x_k + a_k * p_k).
To be continued.