What I did
- Worked on the backprop post and wrote the backprop section.
- Made major updates to my resume: created a new branch for a one-page version tailored specifically to data science jobs.
- Read another chapter of the stats textbook.
- I’ve spent a lot of time this week reflecting on my privilege and how different my experiences growing up have been. Black lives matter. While this blog is only meant to serve as a form of note-taking for what I am learning in data science, I mention this introspection here because it may seem like I didn’t do much learning this week. In reality, I have done a lot of learning and growing, just as a person instead of as a data scientist.
What I learned
- Regarding the backprop post, I feel much more confident in how backprop works. It’s not that I didn’t understand it before, but that I have a better understanding now. It’s the difference between doing a physics problem and explaining one. These posts are my Oxbridge-style tutorials.
- T-tests: hypothesis tests that use a t statistic to compare two population means. \(s_1\) and \(s_2\) are the standard deviations of the respective samples. Use the pooled version only when the sample standard deviations are the same (or very close); in either case the samples must be independent (obviously).
\[H_0 : \mu_1 - \mu_2 = 0 \\
\text{Pooled:} \\
t = \frac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{1/n_1 + 1/n_2}}, \quad
S_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}} \\
\text{Unpooled:} \\
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \\
\text{df: there's an exact formula (Welch-Satterthwaite), but it's long. A conservative shortcut is the smaller of } n_1 - 1 \text{ and } n_2 - 1\]
This means that the \((1-\alpha) \cdot 100\%\) confidence interval for a two-tailed hypothesis test is:
\[(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{s_1^2/n_1 + s_2^2/n_2}\]
where \(t_{\alpha/2}\) is calculated from the degrees of freedom I mentioned above (see the sketch below).
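To make the formulas concrete, here is a minimal sketch that computes both t statistics by hand and checks them against SciPy. The sample data and parameters are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(loc=5.0, scale=2.0, size=30)  # made-up sample 1
x2 = rng.normal(loc=6.0, scale=2.0, size=25)  # made-up sample 2

n1, n2 = len(x1), len(x2)
s1, s2 = x1.std(ddof=1), x2.std(ddof=1)  # sample standard deviations

# Pooled: assumes the two population standard deviations are (nearly) equal
sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t_pooled = (x1.mean() - x2.mean()) / (sp * np.sqrt(1 / n1 + 1 / n2))

# Unpooled (Welch): no equal-variance assumption
t_unpooled = (x1.mean() - x2.mean()) / np.sqrt(s1**2 / n1 + s2**2 / n2)

# Both should match scipy's built-in two-sample t-test
print(t_pooled, stats.ttest_ind(x1, x2, equal_var=True).statistic)
print(t_unpooled, stats.ttest_ind(x1, x2, equal_var=False).statistic)

# 95% CI for mu1 - mu2 using the conservative df = min(n1, n2) - 1 shortcut
df = min(n1, n2) - 1
t_crit = stats.t.ppf(1 - 0.05 / 2, df)
se = np.sqrt(s1**2 / n1 + s2**2 / n2)
diff = x1.mean() - x2.mean()
print(diff - t_crit * se, diff + t_crit * se)
```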
- If each population is normal (or the sample size is large enough for the central limit theorem to apply), then the sampling distribution of \(\bar{x}_i\) is also normally distributed with mean \(\mu_i\), standard error \(\frac{\sigma_i}{\sqrt{n_i}}\), and estimated standard error \(\frac{s_i}{\sqrt{n_i}}\), where \(i\) indexes sample 1 or 2. A quick simulation follows below.
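Here is a quick simulation of that claim: draw many samples, and the spread of the sample means should come out close to \(\sigma/\sqrt{n}\). The population parameters are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 10.0, 3.0, 50  # made-up population parameters

# Draw 100,000 samples of size n and record each sample mean
sample_means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print(sample_means.mean())       # close to mu = 10.0
print(sample_means.std(ddof=1))  # close to sigma / sqrt(n) ~ 0.424
print(sigma / np.sqrt(n))
```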
- Pandas Series vs NumPy array review: pd.Series([1, 2, 3]) is a 1D, indexed array.
import pandas as pd
ds = pd.Series([2, 4, 6])
print(ds)
> 0    2
> 1    4
> 2    6
> dtype: int64
print(ds.values)
# Returns the underlying numpy array
> [2 4 6]
print(ds[2])
> 6
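Since I was reviewing the difference, here is a small sketch of one practical consequence of the index: arithmetic on Series aligns on labels, while the underlying NumPy arrays combine positionally. The example data is made up.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20, 30], index=["c", "b", "a"])

print(a + b)                # aligned by label: a=31, b=22, c=13
print(a.values + b.values)  # raw arrays add by position: [11 22 33]
```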
What I will do next
- I really need to start another Kaggle project.
- The next i2dl project is on PyTorch. Time to master PyTorch.
- It would be a good idea to read through my previous posts to help consolidate what I’ve covered into something more permanent.