1 minute read

RMSprop is similar to gradient descent with momentum, but it also takes the sign of the gradient into account, roughly like this (a rough sketch follows the list):

  • If the last two steps have the same gradient sign (++ or −−), multiply the learning rate by some factor > 1
  • Otherwise, multiply the learning rate by some factor < 1
  • Keep the step size between 1e-6 and 50
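
As a minimal sketch of that sign rule for a single parameter (not taken from any particular library): the multiplicative factors 1.2 and 0.5 below are assumptions, only the bounds [1e-6, 50] come from the list above.

```python
import numpy as np

def sign_rule_step(theta, grad, prev_grad, step_size,
                   up=1.2, down=0.5, lo=1e-6, hi=50.0):
    """One update of the sign-based rule described above.

    `up` and `down` are assumed multiplicative factors; the
    step-size bounds [1e-6, 50] are the ones from the text.
    """
    if np.sign(grad) == np.sign(prev_grad) and grad != 0:
        step_size = min(step_size * up, hi)    # same sign twice: grow the step
    else:
        step_size = max(step_size * down, lo)  # sign flipped: shrink the step
    theta = theta - step_size * np.sign(grad)  # move against the gradient
    return theta, step_size
```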

For the \(k\)th epoch, RMSprop updates as follows:

\[\begin{aligned} S_k &:= \beta S_{k-1} + (1 - \beta)\, \nabla L \odot \nabla L \\ \theta_k &:= \theta_{k-1} - \alpha \frac{\nabla L}{\sqrt{S_k} + \epsilon} \end{aligned}\]
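
A minimal NumPy sketch of this update for a vector of parameters; the default values for `alpha`, `beta`, and `eps`, and the toy loss in the usage example, are assumptions, not from the post.

```python
import numpy as np

def rmsprop_step(theta, grad, S, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop update; S is the running average of squared gradients."""
    S = beta * S + (1 - beta) * grad**2            # S_k = beta*S_{k-1} + (1-beta)*(grad L)^2
    theta = theta - alpha * grad / (np.sqrt(S) + eps)
    return theta, S

# Toy usage: minimize f(theta) = ||theta||^2
theta = np.array([1.0, -2.0])
S = np.zeros_like(theta)
for _ in range(100):
    grad = 2 * theta                               # gradient of ||theta||^2
    theta, S = rmsprop_step(theta, grad, S)
```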

We can rearrange the last term, where \(\epsilon \ll \sqrt{S}\), by dividing the numerator and denominator by \(\sqrt{S}\).

\[\begin{aligned} \frac{1}{\sqrt{S} + \epsilon } & = \frac{1}{\sqrt{S}} * \frac{1}{1 + \epsilon/ \sqrt{S} } \\ & = \frac{1}{\sqrt{S}} * (1 + \epsilon/ \sqrt{S} )^{-1} \end{aligned}\]

Now, Taylor expand to first order using \((1+x)^n \approx 1 + nx\):

\[\begin{aligned} & = \frac{1}{\sqrt{S}} * (1 - \frac{\epsilon}{\sqrt{S}}) \\ & = \frac{1}{\sqrt{S}} - \frac{\epsilon}{\vert S \vert } \end{aligned}\]
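
A quick numeric sanity check of that expansion (the values of \(S\) and \(\epsilon\) below are arbitrary choices; since \(S \geq 0\), \(\vert S \vert = S\)):

```python
import numpy as np

S, eps = 0.04, 1e-8                   # arbitrary example values
exact  = 1 / (np.sqrt(S) + eps)       # original term
approx = 1 / np.sqrt(S) - eps / S     # first-order expansion
print(exact, approx)                  # agree to first order in eps/sqrt(S)
```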

This then makes the update algorithm:

\[\begin{aligned} S_k &:= \beta S_{k-1} + (1 - \beta)\, \nabla L \odot \nabla L \\ \theta_k &:= \theta_{k-1} - \alpha\, \nabla L \left( \frac{1}{\sqrt{S_k}} - \frac{\epsilon}{\vert S_k \vert} \right) \end{aligned}\]
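
For completeness, the same step with the expanded term, again as a rough NumPy sketch (note that \(S\) has to stay strictly positive for the \(1/\vert S \vert\) term to be defined):

```python
import numpy as np

def rmsprop_step_expanded(theta, grad, S, alpha=0.01, beta=0.9, eps=1e-8):
    """RMSprop update using the first-order expanded term."""
    S = beta * S + (1 - beta) * grad**2
    precond = 1 / np.sqrt(S) - eps / np.abs(S)   # expanded 1/(sqrt(S) + eps)
    theta = theta - alpha * grad * precond
    return theta, S
```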

Is this any more efficient? I doubt it, but it was worth doing the math and seeing the expanded formula to think about it. The catch is that what is simpler on paper isn’t necessarily an easier problem for a computer.
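
If someone wanted to check, a rough timing comparison of the two forms of the term could look like the sketch below; the array size, offset, and repeat count are arbitrary, and the result will depend on hardware and BLAS/NumPy versions.

```python
import timeit
import numpy as np

S = np.random.rand(1_000_000) + 1e-3   # keep S away from zero
eps = 1e-8

t_orig = timeit.timeit(lambda: 1 / (np.sqrt(S) + eps), number=100)
t_exp  = timeit.timeit(lambda: 1 / np.sqrt(S) - eps / S, number=100)
print(f"original: {t_orig:.3f}s   expanded: {t_exp:.3f}s")
```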