
KRR, SVD, and PCA

What I did

  • I thought about how Bayesian statistics can be applied to more real-life scenarios. Specifically, how to choose the best product on Amazon given different numbers of reviews for similarly highly-rated products.
  • Reviewed KRR, since that was one of the most useful algorithms in the transport integral predictions, and PCA since it was the most important technique in the whole project.

What I learned

  • Gradient descent (GD) is fine for large datasets with parallel processing, though you may want to consider SGD or an optimizer with an adaptive learning rate. For a plain linear regression, though, GD is pretty inefficient and, being iterative, never gives you the exact answer. If you do linear regression with scikit-learn or NumPy, they solve the least-squares problem directly using the Singular Value Decomposition (SVD). SVD is common in NLP for dimensionality reduction, and it's just as useful here for linear regression's least-squares problem (a small sketch of both approaches is below).
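    To make that concrete, here's a minimal sketch (my own toy data, not from the original notes) solving the same least-squares problem two ways: NumPy's SVD-based `lstsq` gives the exact solution in one shot, while batch gradient descent only approaches it over many iterations.

```python
import numpy as np

# Toy linear-regression data: y = 3 + 2*x + noise
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])  # bias column + one feature
y = X @ np.array([3.0, 2.0]) + 0.1 * rng.standard_normal(100)

# Exact least-squares solution via SVD (what np.linalg.lstsq does internally)
theta_svd, *_ = np.linalg.lstsq(X, y, rcond=None)

# Batch gradient descent on the same RSS objective
theta_gd = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = 2 / len(y) * X.T @ (X @ theta_gd - y)  # gradient of the mean squared error
    theta_gd -= lr * grad

print("SVD solution:", theta_svd)   # essentially exact
print("GD solution: ", theta_gd)    # close, but only after many iterations
```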
  • I like to think of Kernel Ridge Regression (KRR) as ridge (L2) regularization combined with the kernel trick. Recall that ridge regression is the usual RSS (residual sum of squares) loss plus a regularization term of

    \[\lambda \sum_i \theta_i^2\]

    This adds bias (obviously) and pushes the model to be simpler, since it penalizes large thetas (the feature weights). The second part is combining this regression model with the kernel trick, where you map your original features into a new, higher-dimensional feature space made of combinations of the original features. For example, if your original features are only

    \[(x_1, x_2)\]

    the mapped features could be

    \[(x_1, x_2, \sqrt{2}\, x_1 x_2, x_1^2, x_2^2)\]

    The coefficient on the cross term isn't arbitrary: the mapping is built so that the dot product of two mapped points equals a simple kernel function of the original points. That is the kernel trick: you never have to construct the mapping explicitly, because the kernel computes those dot products in the new space directly from the original features. This lets you introduce new combinations and complexity into the model, but of course it can lead to over-fitting without proper regularization. That's why KRR, which pairs the kernel trick with the ridge penalty, is popular (see the sketch right after this bullet).
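    A minimal sketch of both pieces (my own toy example, not from the original post): an explicit feature map like the one above reproduces the dot products of the corresponding polynomial kernel, and scikit-learn's `KernelRidge` then combines a kernel with the L2 penalty.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

def phi(v):
    """Explicit feature map matching the kernel k(x, z) = (x . z + 1/2)^2,
    including a constant feature of 1/2 that doesn't depend on the data."""
    return np.array([0.5, v[0], v[1], np.sqrt(2) * v[0] * v[1], v[0]**2, v[1]**2])

# Kernel trick: the dot product of the mapped points equals the kernel
# evaluated on the original points; no explicit mapping needed.
print(phi(x) @ phi(z))       # 20.25
print((x @ z + 0.5) ** 2)    # 20.25

# KRR = the kernel trick plus an L2 (ridge) penalty. A quick fit on toy data
# with scikit-learn's KernelRidge (alpha is the ridge strength lambda):
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5)
model.fit(X, y)
print(model.predict([[0.0], [1.5]]))
```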

  • Principal Component Analysis (PCA) is a super popular and useful technique for reducing dimensionality in problems. I find toy examples clearer than explanations and definitions. Consider a problem with 3 features (dimensions). Plot the data so that each perpendicular axis is a feature and each point sits at its feature values. In my TI ML work, each axis would be a Coulomb interaction feature. You don't even consider the TI (the predicted values); you're only looking at how the values of feature 1, feature 2, etc. relate to each other. So, plot the 3 features, each on its own axis, center the data by shifting it so it's centered around the origin (keeping all relative distances the same), and then fit the line through the direction of maximum variance. You can think of this as minimizing the sum of squared distances between each point and its projection onto the best-fit line; in the actual calculation, however, it's easier to maximize the sum of squared distances (SS) of the projections from the origin, which is equivalent because each centered point's squared distance to the line plus its squared projection length is fixed (Pythagoras). This line's direction is the eigenvector for PC1, and the loading scores for PC1 are the vector components of that line (i.e. the feature-1 and feature-2 components that make up the eigenvector). Next, pick the line perpendicular to the first best-fit line, project the same data onto it, and compute the SS of those projections from the origin as well. Then do the same for the third perpendicular line. The interesting values are the three explained variances (a numerical sketch is below):

    \[\text{explained } \sigma^2 \text{ of PC1} = \frac{SS(\text{projection 1})}{\sum_{i=1}^{3} SS(\text{projection } i)}\]
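    Here's a minimal numerical sketch of that recipe with NumPy (my own toy data; real libraries do essentially this via the SVD of the centered data matrix): center, get the principal directions, project, and compute the explained variance of each component.

```python
import numpy as np

# Toy data: 200 samples, 3 features where two are strongly correlated
rng = np.random.default_rng(0)
latent = rng.standard_normal((200, 1))
X = np.hstack([latent + 0.1 * rng.standard_normal((200, 1)),
               2 * latent + 0.1 * rng.standard_normal((200, 1)),
               rng.standard_normal((200, 1))])

# 1) Center each feature around the origin
Xc = X - X.mean(axis=0)

# 2) SVD of the centered data: rows of Vt are the principal directions
#    (eigenvectors), i.e. the loading scores for PC1, PC2, PC3
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print("PC1 loading scores:", Vt[0])

# 3) Project the data onto each PC and take the sum of squared distances
#    of the projections from the origin
projections = Xc @ Vt.T              # coordinates of each point along PC1, PC2, PC3
ss = (projections ** 2).sum(axis=0)  # equals S**2

# 4) Explained variance of each PC, as in the formula above
explained = ss / ss.sum()
print("explained variance ratios:", explained)
```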

Where I was

  • I think I’ll include this section every now and again when I’m in the process of moving around a lot
  • This week I was in Montpellier, Nîmes, and Marseille.

What I will do next

  • Calculate “what’s the probability I get COVID-19?”, given my circumstances in France. I can compare that to the probability of contracting the virus in Munich and then quantify how great a risk the program has taken by scheduling this mandatory class in Montpellier.
  • Another good problem would be to think about how the statistics for vaccine trials are probably done. There’s probably some Bayesian inference to estimate the probability that the vaccine is safe and effective, given knowledge about the patients who received the placebo, the success and failure rates of producing antibodies, etc.