A statistic used to represent the similarity (or diversity) of sample sets. For two sets, A and B, the Jaccard index, J, is calculated as follows.
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
Solution:
J(A,B) = |A∩B| / |A∪B|
= |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}|
= 3/9 ≈ 0.33.
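As a quick check, here is a minimal Python sketch of the same calculation (the set names and the jaccard_index helper are just for illustration):

```python
# Jaccard index: size of the intersection over size of the union.
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

def jaccard_index(a, b):
    return len(a & b) / len(a | b)

print(jaccard_index(A, B))  # 3/9 ≈ 0.333
```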
A statistic similar to the above, but one that measures the dissimilarity between two sets. The index and the distance sum to one.
D(A,B) = 1 - J(A,B)
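Continuing the sketch above, the distance is just the complement of the index (the jaccard_distance name is illustrative):

```python
# Jaccard distance: one minus the Jaccard index (reuses jaccard_index from above).
def jaccard_distance(a, b):
    return 1 - jaccard_index(a, b)

print(jaccard_index(A, B) + jaccard_distance(A, B))  # 1.0, the two sum to one
```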
Compared to regular SGD, SGD with momentum can greatly reduce the time to convergence. The general idea is to add a fraction of the previous update to the current update. The exponential moving average (EMA) is an average of points within a window that puts greater weight on more recent points (a simple moving average treats every point as equally significant). Here, V is the EMA.
\[\begin{aligned} V_1 &= \beta V_0 + (1 - \beta) S_1 \\ V_2 &= \beta V_1 + (1 - \beta) S_2 \\ &\;\;\vdots \end{aligned}\] S is the sequence. EMAs are common in market forecasting, so S is often the price at time t. β is defined on [0, 1], and a value around 0.90 is typically used (a short code sketch of this recurrence appears after the momentum update below). The continuous update for SGD with momentum, using Andrew Ng’s notation, is as follows:
\[\begin{aligned} V_1 &:= \beta V_0 + (1 - \beta) \nabla_w L(W, X, y) \\ W &:= W - \alpha V_1 \end{aligned}\] α is the learning rate, as always. To be clear, \(\nabla_w L\) is the gradient of the loss function with respect to the weights. SGD with momentum tends to work better than plain SGD because the moving average of batch gradients is a closer estimate of the full gradient than any single batch gradient. The momentum also helps push the update through a ravine in the right direction, whereas plain SGD tends to oscillate back and forth across the ravine’s short axis. In my course’s notation:
v[k+1] = β v[k] + ∇_θ L(θ[k])
θ[k+1] = θ[k] - α * v[k+1]
Here, v is the velocity term, β is the adjustable momentum term, θ denotes the model parameters (i.e., the weights and biases), and α is the learning rate, as always.
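The NumPy sketch below shows both the EMA recurrence and one momentum step in the course’s notation; the function names, the toy loss, the price series, and the hyperparameter values are all arbitrary choices for illustration.

```python
import numpy as np

def ema(series, beta=0.9, v0=0.0):
    """Exponential moving average: V_t = beta * V_{t-1} + (1 - beta) * S_t."""
    v, out = v0, []
    for s in series:
        v = beta * v + (1 - beta) * s
        out.append(v)
    return np.array(out)

def momentum_step(theta, v, grad_fn, alpha=0.1, beta=0.9):
    """One SGD-with-momentum step in the course's notation."""
    g = grad_fn(theta)            # ∇_θ L(θ[k])
    v = beta * v + g              # v[k+1] = β v[k] + ∇_θ L(θ[k])
    theta = theta - alpha * v     # θ[k+1] = θ[k] - α v[k+1]
    return theta, v

# EMA of a made-up price series.
print(ema(np.array([10.0, 10.5, 10.2, 11.0, 11.3])))

# Minimize the toy loss L(θ) = ||θ||² / 2, whose gradient is just θ.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, v = momentum_step(theta, v, grad_fn=lambda t: t)
print(theta)  # heads toward the minimum at [0, 0]
```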
RMS prop is another variant of SGD intended to improve convergence speed. Once again, the idea is to dampen oscillations in directions where you are close to the minimum and to accelerate movement in directions where you are far from it. Here, we keep a moving average of the squared gradients for each weight. As before, \(\nabla_\theta L\) is the gradient of the cost with respect to the weights.
\[\begin{aligned} & S_{k+1} = \beta S_k + (1 - \beta)(\nabla_\theta L \cdot \nabla_\theta L) \\ & \theta_{k+1} = \theta_k - \alpha \frac{\nabla_\theta L}{\sqrt{S_{k+1}} + \epsilon} \end{aligned}\] As is conventional, \(\epsilon\) is a tiny value; it is added so that the update does not blow up when the square-root term becomes very small.
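A minimal sketch of this update on the same toy quadratic loss (the rmsprop_step name and the hyperparameter values are assumptions for illustration):

```python
import numpy as np

def rmsprop_step(theta, S, grad_fn, alpha=0.05, beta=0.9, eps=1e-8):
    """One RMS prop step: scale each weight's update by its recent gradient magnitude."""
    g = grad_fn(theta)
    S = beta * S + (1 - beta) * g * g               # moving average of squared gradients
    theta = theta - alpha * g / (np.sqrt(S) + eps)  # per-weight scaled step
    return theta, S

# Toy loss L(θ) = ||θ||² / 2, so ∇_θ L = θ.
theta, S = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    theta, S = rmsprop_step(theta, S, grad_fn=lambda t: t)
print(theta)  # settles near [0, 0] (the final steps are roughly ±alpha in size)
```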
Adaptive Moment Estimator
Adam is a combination of RMS prop and SGD with momentum. It uses the squared gradients to scale the learning rate (like RMS prop), and it uses a moving average of the gradient (like SGD with momentum).
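A sketch combining the two ideas, following the standard Adam formulation (which also includes the bias-correction terms shown below, not discussed above); the names and hyperparameter values are illustrative:

```python
import numpy as np

def adam_step(theta, m, s, k, grad_fn, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum-style gradient average plus RMS prop-style scaling."""
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g        # moving average of gradients (momentum part)
    s = beta2 * s + (1 - beta2) * g * g    # moving average of squared gradients (RMS prop part)
    m_hat = m / (1 - beta1 ** k)           # bias correction for the zero initialization
    s_hat = s / (1 - beta2 ** k)
    theta = theta - alpha * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s

# Same toy quadratic loss as before: ∇_θ L = θ.
theta = np.array([5.0, -3.0])
m, s = np.zeros(2), np.zeros(2)
for k in range(1, 2001):
    theta, m, s = adam_step(theta, m, s, k, grad_fn=lambda t: t)
print(theta)  # approaches [0, 0]
```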
The gradient acts on a scalar field (a scalar-valued differentiable function) and returns a vector. For example, \(\nabla v = [\frac{\partial v}{\partial x}, \frac{\partial v}{\partial y}, \frac{\partial v}{\partial z}]\). The gradient points in the direction of greatest increase.
The divergence acts on a vector field and returns a scalar. It is the outward flux of the vector field per unit volume at a point. For example, \(\nabla \cdot v = \frac{\partial v_x}{\partial x} + \frac{\partial v_y}{\partial y} + \frac{\partial v_z}{\partial z}\).
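A quick SymPy check of both definitions (the particular fields chosen here are arbitrary examples):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')

# Gradient of the scalar field v = x**2 + y**2 + z**2 is the vector [2x, 2y, 2z].
v = x**2 + y**2 + z**2
grad_v = [sp.diff(v, var) for var in (x, y, z)]
print(grad_v)   # [2*x, 2*y, 2*z]

# Divergence of the vector field F = (x**2, y**2, z**2) is the scalar 2x + 2y + 2z.
F = (x**2, y**2, z**2)
div_F = sum(sp.diff(Fi, var) for Fi, var in zip(F, (x, y, z)))
print(div_F)    # 2*x + 2*y + 2*z
```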
“Learn the BASICS of physics and mathematics inside out. You can read about and be inspired by work at the cutting edge. But if you don’t learn the basics you will never reach your potential to contribute to our understanding. I encounter many kids who want to jump over the ‘old’ stuff and learn only about research at the frontier. That is a huge mistake. Take the time now to build a solid foundation.” - Brian Greene on learning physics