
What I did

  • Worked on the backprop post; wrote the backprop section itself.
  • Made major updates to my resume. Created a new branch for a one-page version tailored specifically to data science jobs.
  • Read another chapter of the stats textbook.
  • I’ve spent a lot of time this week reflecting on my privilege and how different my experiences growing up have been. Black lives matter. While this blog is only meant to serve as a form of note-taking for what I am learning in data science, I mention this introspection here because it may seem like I didn’t do much learning this week. In reality, I have done a lot of learning and growing, just as a person rather than as a data scientist.

What I learned

  • With regard to the backprop post, I feel much more confident in how backprop works. It’s not that I didn’t understand it before, but that I have a deeper understanding now. It’s the difference between doing a physics problem and explaining one. These posts are my Oxbridge-style tutorials.
  • T-tests: a hypothesis test for comparing two population means. Use the pooled t-test only when the sample standard deviations \(s_1\) and \(s_2\) are the same (or very close) and the samples are independent (obviously); otherwise use the unpooled (Welch) version.
\[H_0 : \mu_1 - \mu_2 = 0 \\ \text{Pooled} \\ t = \frac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{1/n_1 + 1/n_2}} \\ S_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 }{n_1 + n_2 - 2}} \\ \text{Unpooled} \\ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \\ df = \text{(there's an exact formula, Welch-Satterthwaite, but it's long; just pick the smaller of } n_1 - 1 \text{ and } n_2 - 1\text{)}\]
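
As a numerical sanity check on these formulas, here's a minimal sketch with made-up samples: it computes the unpooled t statistic by hand and compares it against scipy.stats.ttest_ind, where equal_var=False selects the unpooled (Welch) test and equal_var=True the pooled one:

    import numpy as np
    from scipy import stats

    # Made-up example samples; all parameters here are arbitrary
    rng = np.random.default_rng(0)
    x1 = rng.normal(loc=5.0, scale=2.0, size=30)
    x2 = rng.normal(loc=6.0, scale=2.1, size=40)

    # Unpooled t statistic, straight from the formula above
    se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
    t_manual = (x1.mean() - x2.mean()) / se

    # scipy's built-in two-sample t-test
    t_scipy, p_value = stats.ttest_ind(x1, x2, equal_var=False)

    print(t_manual, t_scipy)  # the two statistics should match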

From the unpooled standard error and those degrees of freedom, the \((1-\alpha) \cdot 100\%\) confidence interval is:

\[(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{s_1^2/n_1 + s_2^2/n_2}\]

for a two-tailed hypothesis test, where \(t_{\alpha/2}\) comes from the t-distribution with the degrees of freedom mentioned above.
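
The interval is easy to compute numerically as well. A minimal sketch, reusing the made-up samples from the t-statistic check above together with the conservative df shortcut:

    import numpy as np
    from scipy import stats

    # Same made-up samples as in the t-statistic sketch above
    rng = np.random.default_rng(0)
    x1 = rng.normal(loc=5.0, scale=2.0, size=30)
    x2 = rng.normal(loc=6.0, scale=2.1, size=40)

    alpha = 0.05
    diff = x1.mean() - x2.mean()
    se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
    df = min(len(x1) - 1, len(x2) - 1)       # conservative shortcut for df
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-tailed critical value

    print(diff - t_crit * se, diff + t_crit * se)  # 95% CI for mu_1 - mu_2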

  • If each population is normal (or the sample size is large enough for the central limit theorem to apply), then the sampling distribution of \(\bar{x}_i\) is also normally distributed, with mean \(\mu_i\), standard error \(\frac{\sigma_i}{\sqrt{n_i}}\), and estimated standard error \(\frac{s_i}{\sqrt{n_i}}\), where \(i\) is 1 or 2 for the respective sample.
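
As a quick check on that standard-error claim, here's a minimal simulation sketch; the population parameters are arbitrary, made up just for illustration:

    import numpy as np

    # Arbitrary (made-up) population parameters and sample size
    rng = np.random.default_rng(0)
    mu, sigma, n = 10.0, 3.0, 25

    # Draw many samples of size n and record each sample mean
    sample_means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

    print(sample_means.std())  # empirical spread of the sample means
    print(sigma / np.sqrt(n))  # theoretical standard error, sigma / sqrt(n)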

  • Pandas Series vs. NumPy Array Review

    • pd.Series([1,2,3]) is a 1D, indexed array.
    import pandas as pd

    ds = pd.Series([2,4,6])
    print(ds)
    > 0    2
    > 1    4
    > 2    6
    > dtype: int64

    # .values returns the underlying numpy array
    print(ds.values)
    > [2 4 6]

    # Indexing by label (here the default integer index)
    print(ds[2])
    > 6
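
    • The practical difference from a plain numpy array is the labeled index: arithmetic aligns on labels, not positions. A minimal sketch with made-up labels:
    import pandas as pd

    # Series align on index labels, so 'x' matches 'x' regardless of position
    a = pd.Series([1, 2, 3], index=["x", "y", "z"])
    b = pd.Series([10, 20, 30], index=["z", "y", "x"])

    print(a + b)
    > x    31
    > y    22
    > z    13
    > dtype: int64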
    

What I will do next

  • I really need to start another Kaggle project.
  • The next i2dl project is on PyTorch. Time to master PyTorch.
  • It would be a good idea to read through my previous posts to help consolidate what I’ve covered into more permanent knowledge.