A statistic used to represent the similarity (or diversity) of sample sets. For two sets, A and B, the Jaccard index, J, is calculated as follows.
A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}
Solution:
J(A,B) = |A∩B| / |A∪B|
= |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}|
= 3/9 ≈ 0.33.
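As a quick check, here is a minimal Python sketch of the same calculation (the set names and the jaccard_index helper are just for illustration):

```python
# Jaccard index: size of the intersection over size of the union.
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

def jaccard_index(a, b):
    return len(a & b) / len(a | b)

print(jaccard_index(A, B))  # 3/9 ≈ 0.333
```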
A statistic similar to the above, but one that measures the dissimilarity between two sets. The index and the distance sum to one.
D(A,B) = 1 - J(A,B)
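Continuing the sketch above, the distance is just the complement of the index (the jaccard_distance name is illustrative):

```python
# Jaccard distance: one minus the Jaccard index (reuses jaccard_index from above).
def jaccard_distance(a, b):
    return 1 - jaccard_index(a, b)

print(jaccard_index(A, B) + jaccard_distance(A, B))  # 1.0, the two sum to one
```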
Compared to regular SGD, SGD with momentum can greatly reduce the time to convergence. The general idea is to add a fraction of the previous update to the current update. The exponential moving average (EMA) is an average of points within a window that puts greater weight on more recent points (a simple moving average treats every point as equally significant). Here, V is the EMA.
\[\begin{aligned} V_1 &= \beta V_0 + (1 - \beta) S_1 \\ V_2 &= \beta V_1 + (1 - \beta) S_2 \\ &\;\;\vdots \end{aligned}\] S is the sequence. EMAs are common in market forecasting, so S is often the price at time t. β is defined on [0, 1], and a value around 0.90 is typically used (a short code sketch of this recurrence appears after the momentum update below). The continuous update for SGD with momentum, using Andrew Ng’s notation, is as follows:
\[\begin{aligned} V_1 &:= \beta V_0 + (1 - \beta) \nabla_w L(W, X, y) \\ W &:= W - \alpha V_1 \end{aligned}\] α is the learning rate, as always. To be clear, \(\nabla_w L\) is the gradient of the loss function with respect to the weights. SGD with momentum tends to work better than plain SGD because the moving average of batch gradients is a closer estimate of the full gradient than any single batch gradient. The momentum also helps push the update through a ravine in the right direction, whereas plain SGD tends to oscillate back and forth across the ravine’s short axis. In my course’s notation:
v[k+1] = β v[k] + ∇_θ L(θ[k])
θ[k+1] = θ[k] - α * v[k+1]
Here, v is the velocity term, β is the adjustable momentum term, θ denotes the model parameters (i.e., the weights and biases), and α is the learning rate, as always.
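The NumPy sketch below shows both the EMA recurrence and one momentum step in the course’s notation; the function names, the toy loss, the price series, and the hyperparameter values are all arbitrary choices for illustration.

```python
import numpy as np

def ema(series, beta=0.9, v0=0.0):
    """Exponential moving average: V_t = beta * V_{t-1} + (1 - beta) * S_t."""
    v, out = v0, []
    for s in series:
        v = beta * v + (1 - beta) * s
        out.append(v)
    return np.array(out)

def momentum_step(theta, v, grad_fn, alpha=0.1, beta=0.9):
    """One SGD-with-momentum step in the course's notation."""
    g = grad_fn(theta)            # ∇_θ L(θ[k])
    v = beta * v + g              # v[k+1] = β v[k] + ∇_θ L(θ[k])
    theta = theta - alpha * v     # θ[k+1] = θ[k] - α v[k+1]
    return theta, v

# EMA of a made-up price series.
print(ema(np.array([10.0, 10.5, 10.2, 11.0, 11.3])))

# Minimize the toy loss L(θ) = ||θ||² / 2, whose gradient is just θ.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, v = momentum_step(theta, v, grad_fn=lambda t: t)
print(theta)  # heads toward the minimum at [0, 0]
```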
RMS prop is another variant of SGD intended to improve convergence speed. Once again, the idea is to dampen oscillations in directions where you are close to the minimum and to accelerate movement in directions where you are far from it. Here, we keep a moving average of the squared gradients for each weight. As before, \(\nabla_\theta L\) is the gradient of the cost with respect to the weights.
\[\begin{aligned} & S_{k+1} = \beta S_k + (1 - \beta)(\nabla_\theta L \cdot \nabla_\theta L) \\ & \theta_{k+1} = \theta_k - \alpha \frac{\nabla_\theta L}{\sqrt{S_{k+1}} + \epsilon} \end{aligned}\] As is conventional, \(\epsilon\) is a tiny value; it is added so that the update does not blow up when the square-root term becomes very small.
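A minimal sketch of this update on the same toy quadratic loss (the rmsprop_step name and the hyperparameter values are assumptions for illustration):

```python
import numpy as np

def rmsprop_step(theta, S, grad_fn, alpha=0.05, beta=0.9, eps=1e-8):
    """One RMS prop step: scale each weight's update by its recent gradient magnitude."""
    g = grad_fn(theta)
    S = beta * S + (1 - beta) * g * g               # moving average of squared gradients
    theta = theta - alpha * g / (np.sqrt(S) + eps)  # per-weight scaled step
    return theta, S

# Toy loss L(θ) = ||θ||² / 2, so ∇_θ L = θ.
theta, S = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(500):
    theta, S = rmsprop_step(theta, S, grad_fn=lambda t: t)
print(theta)  # settles near [0, 0] (the final steps are roughly ±alpha in size)
```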
Adaptive Moment Estimator
Adam is a combination of RMS prop and SGD with momentum. It uses the squared gradients to scale the learning rate (like RMS prop), and it uses a moving average of the gradient (like SGD with momentum).
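A sketch combining the two ideas, following the standard Adam formulation (which also includes the bias-correction terms shown below, not discussed above); the names and hyperparameter values are illustrative:

```python
import numpy as np

def adam_step(theta, m, s, k, grad_fn, alpha=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum-style gradient average plus RMS prop-style scaling."""
    g = grad_fn(theta)
    m = beta1 * m + (1 - beta1) * g        # moving average of gradients (momentum part)
    s = beta2 * s + (1 - beta2) * g * g    # moving average of squared gradients (RMS prop part)
    m_hat = m / (1 - beta1 ** k)           # bias correction for the zero initialization
    s_hat = s / (1 - beta2 ** k)
    theta = theta - alpha * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s

# Same toy quadratic loss as before: ∇_θ L = θ.
theta = np.array([5.0, -3.0])
m, s = np.zeros(2), np.zeros(2)
for k in range(1, 2001):
    theta, m, s = adam_step(theta, m, s, k, grad_fn=lambda t: t)
print(theta)  # approaches [0, 0]
```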
The gradient acts on a scalar field (a scalar-valued differentiable function) and returns a vector. For example, \(\nabla v = [\frac{\partial v}{\partial x}, \frac{\partial v}{\partial y}, \frac{\partial v}{\partial z}]\). The gradient points in the direction of greatest increase.
The divergence acts on a vector field and returns a scalar. It is the outward flux of the vector field per unit volume at a point. For example, \(\nabla \cdot v = \frac{\partial v_x}{\partial x} + \frac{\partial v_y}{\partial y} + \frac{\partial v_z}{\partial z}\).
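A quick SymPy check of both definitions (the particular fields chosen here are arbitrary examples):

```python
import sympy as sp

x, y, z = sp.symbols('x y z')

# Gradient of the scalar field v = x**2 + y**2 + z**2 is the vector [2x, 2y, 2z].
v = x**2 + y**2 + z**2
grad_v = [sp.diff(v, var) for var in (x, y, z)]
print(grad_v)   # [2*x, 2*y, 2*z]

# Divergence of the vector field F = (x**2, y**2, z**2) is the scalar 2x + 2y + 2z.
F = (x**2, y**2, z**2)
div_F = sum(sp.diff(Fi, var) for Fi, var in zip(F, (x, y, z)))
print(div_F)    # 2*x + 2*y + 2*z
```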
“Learn the BASICS of physics and mathematics inside out. You can read about and be inspired by work at the cutting edge. But if you don’t learn the basics you will never reach your potential to contribute to our understanding. I encounter many kids who want to jump over the ‘old’ stuff and learn only about research at the frontier. That is a huge mistake. Take the time now to build a solid foundation.” - Brian Greene on learning physics