Week of October 25th
SVD and Eigendecomposition, XGBoost, etc.
What I did
- I started watching an MIT lecture series on Topics in Mathematics with Applications in Finance. Honestly, I started watching it to help me fall asleep, but it’s actually really interesting, and has made me remember how much I miss quality lecturers. The second lecture ended up being a quick review of the important topics in linear algebra.
- I finished week 3 of my Bayesian statistics course. I needed to re-watch and read up on some of the previous topics in order to get to the point where I felt that I really understood all the new material, which is why I had been procrastinating on continuing the course. But now I am feeling more than comfortable with the material and looking forward to finishing the course next week and moving on to the next one.
What I learned
Eigen and Singular Value Decomposition
- Recall that \(v\) and \(\lambda\) are an eigenvector and corresponding eigenvalue of a matrix \(A\) if they satisfy
\[\mathrm{A}v = \lambda v\]
- You can think of this relationship as meaning that the linear transformation \(A\) acts on the vector \(v\) by simply scaling it. This relationship only works for square matrices. Eigendecomposition (EVD) works if the matrix is diagonalizable. A square matrix is diagonalizable if there is an invertible matrix \(U\) such that \(A = U D U^{-1}\), where \(D\) is diagonal (i.e. 0 everywhere except the diagonal); if \(A\) is symmetric, \(U\) can even be chosen orthonormal. Recall that a matrix is orthonormal if multiplying it by its transpose gives the identity (i.e. \(UU^T = \mathbb{I}\))1. In the case where \(A\) is diagonalizable, the columns of \(U\) define the directions of the eigenvectors and \(D\) defines by how much these vectors get scaled (the diagonal elements of \(D\) are the eigenvalues). Also, note that symmetric matrices always have real eigenvalues.
Eigendecomposition is a way of representing a diagonalizable matrix in terms of its eigenvalues and eigenvectors (a quick numpy sanity check is sketched below).
Singular value decomposition (SVD) is essentially a generalized form of the eigenvector/eigenvalue relation. Here, an \(m \times n\) matrix \(A\) is related to two orthonormal matrices \(U\) and \(V\) and a diagonal matrix \(\Sigma\) such that
\[\underset{m \times n}{\mathrm{A}} = \underset{m\times m}{U} ~~ \underset{m \times n}{\Sigma} ~~ \underset{n\times n}{V^T}\]
SVD allows you to map from a higher-dimensional vector space into a lower one (e.g. \(\mathbb{R}^4 \rightarrow \mathbb{R}^3\)), where the matrix \(A\) scales vectors in \(V\) to the corresponding vectors in \(U\). Some vectors in \(V\) might go away (i.e. get scaled to 0). This is also sketched in numpy below.
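Here is a minimal numpy sketch (my own toy example, not from the lecture) that checks the eigendecomposition relation \(A = U D U^{-1}\) on a small symmetric matrix:

```python
import numpy as np

# Toy symmetric matrix, so it is guaranteed to be diagonalizable
# with real eigenvalues and an orthonormal eigenvector matrix U.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# eigh is the routine specialized for symmetric matrices;
# the eigenvectors come back as the columns of U.
eigenvalues, U = np.linalg.eigh(A)
D = np.diag(eigenvalues)

# Reconstruct A from its eigendecomposition: A = U D U^{-1}.
print(np.allclose(A, U @ D @ np.linalg.inv(U)))  # True

# Because U is orthonormal here, its inverse is just its transpose.
print(np.allclose(U @ U.T, np.eye(2)))           # True
```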
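And a similar sketch for the SVD, mainly to show the shapes involved (again a toy example of my own):

```python
import numpy as np

# A maps vectors from R^4 to R^3, i.e. a 3x4 matrix, so m = 3 and n = 4.
A = np.random.default_rng(0).normal(size=(3, 4))

# full_matrices=True returns U (3x3), the singular values, and V^T (4x4).
U, singular_values, Vt = np.linalg.svd(A, full_matrices=True)

# Build the 3x4 Sigma with the singular values on its diagonal;
# the "extra" direction in V is one that effectively gets scaled to 0.
Sigma = np.zeros((3, 4))
np.fill_diagonal(Sigma, singular_values)

# Verify the decomposition A = U Sigma V^T.
print(np.allclose(A, U @ Sigma @ Vt))  # True
```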
1 Recall that to invert a \(2 \times 2\) matrix (i.e. make \(U \rightarrow U^{-1}\)) you swap the two diagonal entries, negate the off-diagonal entries, and divide everything by the (original) determinant.
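Written out, that \(2 \times 2\) rule is
\[\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}\]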
What is Gradient Boosting, Really?
- First it is necessary to understand the concept of a weak learner or weak predictor. This is, for example, a decision tree (or even a single node split) that is good at predicting one specific thing, but bad at the whole picture. As an example, think of a dog classifier where a particular section of a decision tree is good at predicting that it's an animal, but useless at classifying it as a cat or a dog. The idea of AdaBoost, or adaptive boosting, is to keep the existing weak learners fixed and sequentially add new ones that focus on the examples the current ensemble still gets wrong.2 Together, the combination of these weak learners makes a strong classifier: an aggregate of weakly performing (but specialized) decision trees that sum together to make a well-performing classifier (a small sklearn sketch is below the footnote). The idea of gradient boosting is sort of a generalized version of AdaBoost that can work for regression and multi-class classification as well.
- I won't get too into the weeds here, but the idea is to use a loss function, take its gradient with respect to the current model's predictions, and fit each new tree to those gradients (the pseudo-residuals) to determine how to grow it. Generally a tree of depth 4-8 performs best, and the deeper your trees are, the fewer trees you will need. This is basically how XGBoost works, but it has lots of regularization options and is super well optimized for parallel training (a hedged example is also below).
2 Note, you don’t want a neural network to do this. Basically the whole point of adding dropout layers is to prevent this sort of specialization in order to make the NN more robust.
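To make the weak-learner idea concrete, here is a minimal sketch with scikit-learn (a toy example of my own, not from any particular source): each depth-1 decision tree ("stump") is weak on its own, but AdaBoost combines a few hundred of them into a much stronger classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A single depth-1 tree (a "stump") is a weak learner on its own.
stump = DecisionTreeClassifier(max_depth=1)
print("one stump:", cross_val_score(stump, X, y).mean())

# AdaBoost sequentially adds stumps, each focusing on the examples the
# current ensemble still gets wrong. (Older scikit-learn versions call
# the `estimator` argument `base_estimator`.)
ensemble = AdaBoostClassifier(estimator=stump, n_estimators=200, random_state=0)
print("boosted stumps:", cross_val_score(ensemble, X, y).mean())
```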
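And a hedged sketch of the XGBoost version; the parameter values here are just illustrative starting points, not recommendations from any source:

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting round fits a new tree to the gradients of the loss with
# respect to the current predictions. Deeper trees generally mean fewer
# rounds are needed; reg_lambda/reg_alpha are the L2/L1 regularization terms.
model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=5,        # depth in the 4-8 range mentioned above
    learning_rate=0.1,
    reg_lambda=1.0,     # L2 regularization
    reg_alpha=0.0,      # L1 regularization
    n_jobs=-1,          # parallel tree construction
)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```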
What I will do next
- I couldn't find a way to host my sklearn model without paying for a cloud server or hosting it myself. Tensorflow.js would work, but it's either not possible to convert an sklearn model or at least very complicated. I thought of another good project idea that could be hosted via tensorflow.js: I would train a TensorFlow or Keras model with images from several French impressionists and have it predict the painter for unseen paintings. It would be cool to make a website where you could give it a link to a new photo and it would guess, and you could tell it whether it's correct or not (i.e. people other than me could help train it). It could also have a game version where you compete against the model (a rough Keras sketch of the model part is at the end of this list).
- I know this is silly, but I'm going to do my chatbot in TensorFlow this time so I can eventually turn it into a tensorflow.js script and host it on a website. I'm just going to walk through whatever TensorFlow provides as a chatbot example/tutorial and then apply transfer learning with my Facebook Messenger data.
- Complete week 4 of the Bayesian Statistics course in R.
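For the impressionist-classifier idea above, here is a rough, hedged sketch of how I might set it up in Keras with transfer learning. Everything specific here is made up for illustration: the `paintings/<painter>/*.jpg` directory layout, the image size, and the choice of MobileNetV2 as the pretrained backbone.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)

# Hypothetical directory layout: paintings/<painter_name>/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "paintings", image_size=IMG_SIZE, batch_size=32)
num_painters = len(train_ds.class_names)

# Start from an ImageNet-pretrained backbone and freeze it,
# so only a small classification head gets trained on top.
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects inputs in [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_painters, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)

# The saved model could later be converted for hosting with tensorflow.js.
model.save("impressionist_classifier.keras")
```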