4 minute read

Note: last week's post is missing. Most of what I did is included in the prior week's post, and the rest is included here. It wasn't enough to merit its own week, because I was too busy traveling and preparing my residency permit application (so I don't get deported when my travel visa expires in 7 weeks).

What I did

  • I’ve picked up the statistics course again
  • I’ve continued to work through confetti.ai’s readings and some of the practice problems on their site. It looks very helpful for any upcoming data science interviews, and a good way to make sure I’ve mastered the fundamentals.
  • TexStudio has a new version with much better dark mode support. I’ve found the GitHub repo, and I’m curious how difficult it would be to change the menu bar to something more like the Atom extension “title-bar-replacer.” I love how productive I am in TexStudio, but its biggest fault has always been how ugly it is compared to something like Atom or VS Code. After looking into it, the Atom extension works because Atom is an Electron application, which lets you modify the source so Atom launches as a frameless window. That approach won’t work with TexStudio, and I suspect the same modification may not even be possible there.

What I learned

  • Gradient descent from scratch, this time with logistic regression instead of linear regression. The only import needed was math, for \(\log\) and \(\exp\).
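Here’s a minimal sketch of what that looks like (the function and variable names are my own, and the toy data is made up):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(p, y):
    # cross-entropy for one example; this is where math.log comes in
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def train(xs, ys, lr=0.1, epochs=2000):
    # fit weight w and bias b by gradient descent on the log loss
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        dw = db = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            dw += (p - y) * x   # gradient of the loss w.r.t. w
            db += (p - y)       # gradient of the loss w.r.t. b
        w -= lr * dw / n
        b -= lr * db / n
    return w, b

w, b = train([0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1])   # toy data
print(sigmoid(w * 0.5 + b), sigmoid(w * 2.5 + b))  # low prob, high prob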
  • underscores in Python
    • _variable is just a convention, roughly equivalent to a private variable in Java, i.e. for internal use only.
    • __variable is interpreted in a special way by the Python interpreter: the name gets mangled to _ClassName__variable, which keeps a subclass from accidentally clobbering the parent’s attribute (so it’s about avoiding name clashes, not immutability like Java’s final). In the below example __id is mangled to _Person__id so the variable is not overridden in any subclasses.
class Person:
    def __init__(self):
        self.name = 'Sarah'   # public attribute
        self._age = 26        # "internal use only" by convention
        self.__id = 30        # name-mangled to _Person__id

>>> p = Person()
>>> dir(p)
['_Person__id', ..., '_age', 'name']

>>> p.name
'Sarah'

>>> p._age
26

>>> p.__id
AttributeError: 'Person' object has no attribute '__id'

>>> p._Person__id
30

# note: a subclass can't actually override __id; its own __id is
# mangled separately, so both values coexist on the instance
class Friend(Person):
    def __init__(self):
        super().__init__()  # runs Person.__init__, which sets _Person__id
        self.__id = 25      # mangled to _Friend__id

>>> fr = Friend()

>>> fr._Person__id
30
>>> fr._Friend__id
25

-

  • variable_ is just a convention for when you want to name a variable something that is already a keyword, like:
class Teacher:
    def __init__(self, name, class_):
        self.name = name
        self.class_ = class_   # trailing underscore dodges the `class` keyword

-

  • __method__ names are “dunder” (double underscore) methods, Python’s special methods. I think I wrote about these in an earlier post.
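A quick toy example (my own, not from the course) of how dunder methods hook into built-in syntax:

class Money:
    def __init__(self, cents):
        self.cents = cents

    def __repr__(self):        # controls how the REPL displays the object
        return f'Money({self.cents})'

    def __add__(self, other):  # makes the + operator work on Money objects
        return Money(self.cents + other.cents)

>>> Money(150) + Money(50)
Money(200)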

  • _ is used as a variable name when a call returns a value you don’t care about. The example below is contrived, but it should make the idea clear.

import numpy as np

for i, _ in enumerate(np.zeros(3)):   # ignore the array values themselves
    print(i)
# 0
# 1
# 2
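A more typical use (my own examples) is discarding values when unpacking:

lo, _, hi = sorted([3, 1, 2])   # ignore the median
for _ in range(3):
    print('hi')                 # loop where the index is never used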

Random Stats Review

  • Remember that z-score is:
\[z = \frac{x- \mu}{\sigma}\]
  • and that t-score is:
\[t = \frac{\bar{x} - \mu}{s/\sqrt{n}}\]
  • use the t-score when you don’t know the population standard deviation; this matters most for small samples (n < 30), since for large n the t-distribution is nearly identical to the normal.
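As a quick numeric sanity check (my own toy numbers, standard library only):

import math
import statistics

sample = [5.1, 4.9, 5.3, 5.0, 4.8]   # n = 5: small sample, unknown sigma
n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)          # sample std dev (divides by n - 1)

mu = 5.0                              # hypothesized population mean
t = (xbar - mu) / (s / math.sqrt(n))
print(t)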

  • Chain rule of probability:

\[P(a \cap b \cap c) = P(a \mid b \cap c) * P(b \mid c) * P(c)\]

or this can also be written as:

\[P(a, b, c) = P(a \mid b, c) * P(b \mid c)* P(c)\]
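A tiny sanity check of the identity (my own toy joint distribution):

from itertools import product

# P(a, b, c) over binary variables; the eight probabilities sum to 1
P = dict(zip(product([0, 1], repeat=3),
             [.10, .15, .05, .20, .10, .15, .05, .20]))

P_bc = sum(P[(a, 1, 1)] for a in [0, 1])                   # P(b=1, c=1)
P_c = sum(P[(a, b, 1)] for a in [0, 1] for b in [0, 1])    # P(c=1)

lhs = P[(1, 1, 1)]                                         # P(a=1, b=1, c=1)
rhs = (P[(1, 1, 1)] / P_bc) * (P_bc / P_c) * P_c           # chain rule factors
print(lhs, rhs)  # the factors telescope, so the two sides match exactly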

Correlation Coefficients and Regression

  • Regressing y onto x is not usually the same as regressing x onto y. They would only be the same if both variables are standardized (converted to z-scores). This is because the slope of your regression line is:
\[\overbrace{\hat\beta_1=\frac{\text{Cov}(x,y)}{\text{Var}(x)}}^{y\text{ on } x}~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\overbrace{\hat\beta_1=\frac{\text{Cov}(y,x)}{\text{Var}(y)}}^{x\text{ on }y}\]
  • Notice how if you standardize them, both x and y will have a variance of 1. This makes the slope equal to the Pearson correlation coefficient r.
\[r = \frac{\text{Cov}(x,y)}{\sigma_x \sigma_y} = \frac{\sum\left((x-\mu_x)(y-\mu_y)\right)}{\sqrt{\sum(x-\mu_x)^2}\sqrt{\sum(y-\mu_y)^2}}\]

So you can kind of think of r as the slope of the best-fit line when both the independent and dependent variables are standardized.
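Here’s a quick numpy check of both claims (my own simulated data):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

cov = np.cov(x, y)[0, 1]
beta_y_on_x = cov / np.var(x, ddof=1)   # slope of y regressed on x
beta_x_on_y = cov / np.var(y, ddof=1)   # slope of x regressed on y

# standardize both variables; the slope now equals Pearson's r
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_std = np.cov(zx, zy)[0, 1] / np.var(zx, ddof=1)
r = np.corrcoef(x, y)[0, 1]

print(beta_y_on_x, beta_x_on_y)  # different
print(beta_std, r)               # equal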

NLP

  • Use the cosine similarity between two word embeddings (word2vec vectors) to compare text similarity. It measures the angle between the two vectors rather than the Euclidean distance between them (cosine distance is just 1 − cosine similarity). Cosine similarity is:
\[\frac{A \cdot B}{||A|| * ||B||} = \frac{\sum(A_i *B_i)}{\sqrt{\sum A_i^2}\sqrt{\sum B_i^2}}\]
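A minimal implementation straight from that formula (toy vectors, not real word2vec embeddings):

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

king = [0.9, 0.1, 0.4]
queen = [0.85, 0.15, 0.5]
print(cosine_similarity(king, queen))  # close to 1, so very similar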

An Overlooked Python Basic

  • It’s kind of embarrassing I didn’t know this, but up until now it hadn’t been a problem. I had run into an issue where importing a function or a variable from a different Python script (similar to how you’d import a method from a different class in Java) would run the entire script. The reason is that importing a module executes all of its top-level code; def isn’t so much a declaration as a statement the interpreter executes to create the function object. To keep script-only code from running on import, you wrap it in if __name__ == "__main__":, which is only true when the file is executed directly. I’ve seen this idiom so many times before and never really understood why it was there. I’m kind of amazed I got this far while working around it.
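A minimal two-file illustration (the file names and contents are my own):

# helpers.py
def greet(name):
    return f'Hello, {name}!'

# runs only with `python helpers.py`, not on import
if __name__ == '__main__':
    print(greet('world'))

# main.py (a separate file)
from helpers import greet   # does NOT trigger the __main__ block above
print(greet('Sarah'))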

What I will do next

  • Keep the idea of modifying and recompiling TexStudio on the back burner. It would be nice to go through that whole process myself. I know I’m a data scientist, not a software engineer, but having some full-stack knowledge is always useful.
  • Keep prepping for data science interviews with the confetti.ai website and focus on some NLP review in particular if I have time.