How does the momentum term for the backpropagation algorithm work?

Question Detail: 

When updating the weights of a neural network using the backpropagation algorithm with a momentum term, should the learning rate be applied to the momentum term as well?

Most of the information I could find about using momentum has equations that look something like this:

$W_{i}' = W_{i} - \alpha \Delta W_i + \mu \Delta W_{i-1}$

where $\alpha$ is the learning rate, and $\mu$ is the momentum term.

If $\mu$ is larger than $\alpha$, then in the next iteration the $\Delta W$ from the previous iteration will have a greater influence on the weight than the current one.

Is this the purpose of the momentum term, or should the equation look more like this?

$W_{i}' = W_{i} - \alpha( \Delta W_i + \mu \Delta W_{i-1})$

i.e., scaling everything by the learning rate?
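To make the two candidates concrete, here is a minimal Python sketch (all values illustrative) on $E(w) = w^2$; in the first variant $\Delta W_{i-1}$ is read as the previous full weight change, and in the second as the previous gradient:

    # Minimal sketch contrasting the two candidate updates on E(w) = w^2,
    # where dE/dw = 2w. alpha and mu are illustrative values only.
    alpha, mu = 0.1, 0.5
    grad = lambda w: 2.0 * w

    # Variant 1: momentum outside the learning rate; Delta W_{i-1} is read
    # as the previous full weight change (the usual convention).
    w, step_prev = 1.0, 0.0
    for _ in range(20):
        step = -alpha * grad(w) + mu * step_prev
        w, step_prev = w + step, step
    print("variant 1:", w)  # approaches the minimum at w = 0

    # Variant 2: the learning rate scales the momentum term as well;
    # here Delta W_{i-1} is read as the previous gradient.
    w, g_prev = 1.0, 0.0
    for _ in range(20):
        g = grad(w)
        w, g_prev = w - alpha * (g + mu * g_prev), g
    print("variant 2:", w)  # also approaches 0, but with a shorter memory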

Asked By: guskenny83
Best Answer from StackOverflow

Question Source: http://cs.stackexchange.com/questions/31874

Answered By: nico

Using backpropagation with momentum in a network with $n$ different weights $W_k$, the $i$-th correction for weight $W_k$ is given by

$\Delta W_k(i) = -\alpha \frac{\partial E}{\partial W_k} + \mu \Delta W_k(i-1)$

where $\frac{\partial E}{\partial W_k}$ is the partial derivative of the loss $E$ with respect to $W_k$. The weight is then updated as $W_k' = W_k + \Delta W_k(i)$; note that the learning rate $\alpha$ scales only the gradient term, while the momentum term carries over the entire previous correction.
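As a minimal sketch of this rule in code (loss_grad is a hypothetical function returning $\frac{\partial E}{\partial W}$ for the current weights; alpha and mu are illustrative values):

    import numpy as np

    # Minimal sketch of the update rule above. loss_grad is a hypothetical
    # function returning dE/dW for the current weights.
    def momentum_step(W, delta_prev, loss_grad, alpha=0.1, mu=0.9):
        # Delta W(i) = -alpha * dE/dW + mu * Delta W(i-1)
        delta = -alpha * loss_grad(W) + mu * delta_prev
        return W + delta, delta

    # Example: minimize E(W) = 0.5 * ||W||^2, whose gradient is dE/dW = W.
    W, delta = np.array([1.0, -2.0]), np.zeros(2)
    for _ in range(100):
        W, delta = momentum_step(W, delta, lambda W: W)
    print(W)  # close to the minimum at the origin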

Introducing the momentum term attenuates oscillations in the gradient descent. The geometric idea behind it is probably best understood in terms of an eigenspace analysis in the linear case. If the ratio between the largest and smallest eigenvalue is large, then gradient descent is slow even with a well-chosen learning rate, due to the poor conditioning of the matrix. Momentum introduces some balancing in the update between the eigenvectors associated with the smaller and larger eigenvalues.
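To illustrate this point, the following sketch (constants chosen purely for demonstration) runs plain gradient descent and the momentum update on a quadratic whose Hessian has eigenvalues 1 and 100, i.e. condition number 100:

    import numpy as np

    # E(W) = 0.5 * W^T H W with eigenvalues 1 and 100 (condition number 100).
    H = np.diag([1.0, 100.0])
    grad = lambda W: H @ W
    alpha, mu, steps = 0.01, 0.9, 200

    # Plain gradient descent: slow along the eigendirection with eigenvalue 1.
    W = np.array([1.0, 1.0])
    for _ in range(steps):
        W = W - alpha * grad(W)
    print("plain GD:     ", W)

    # Gradient descent with momentum: both components approach zero.
    W, delta = np.array([1.0, 1.0]), np.zeros(2)
    for _ in range(steps):
        delta = -alpha * grad(W) + mu * delta
        W = W + delta
    print("with momentum:", W)

With these values, plain gradient descent leaves a visible error in the first (small-eigenvalue) component, while the momentum version drives both components close to zero.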

For more detail I refer to Chapter 8 of R. Rojas, Neural Networks: A Systematic Introduction:

http://page.mi.fu-berlin.de/rojas/neural/chapter/K8.pdf
