Week of November 22nd

R equivalencies in Python.

What I did

I finished the Bayesian stats course on Saturday, but I wanted to redo the final exercises in Python instead of R.
I worked on some multiple linear regression and some more traditional statistics in Python. I also practiced with matplotlib.

What I learned

I had already learned about p-values, but I hadn’t seen one in the context of a linear regression. It struck me as odd, so I looked up what it actually refers to.

from scipy import stats
# x and y are 1-D arrays of equal length
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

According to the documentation, p_value is the “Two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero, using Wald Test with t-distribution of the test statistic.” In other words, it is the probability of obtaining a slope at least as extreme as the measured one, assuming the true slope is zero. For linregress, the test statistic is the slope divided by its standard error, compared against a t-distribution with n - 2 degrees of freedom. The general form of a two-tailed p-value, written here for a z-test of a mean, is:

\[\text{p-value} = 2P(Z > |z_0|), \quad \text{where } z_0 = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\]
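To convince myself of what that p-value means for a regression, here is a minimal sketch that recomputes it by hand. The data are synthetic (made up for illustration), and the check assumes the Wald test from the documentation: the slope divided by its standard error, compared against a t-distribution with n - 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Synthetic data (hypothetical): y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=3.0, size=x.size)

result = stats.linregress(x, y)

# Wald test by hand: t = slope / standard error, with n - 2 degrees of freedom.
t_stat = result.slope / result.stderr
dof = x.size - 2

# Two-sided p-value: twice the upper-tail probability of |t|.
p_manual = 2 * stats.t.sf(abs(t_stat), dof)

print(f"slope = {result.slope:.3f}")
print(f"p-value from linregress = {result.pvalue:.3g}")
print(f"p-value by hand         = {p_manual:.3g}")
```

The two p-values should agree up to floating-point error, which is the same "2 times the tail probability" idea as the formula above, just with a t-distribution instead of a normal one.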

A good Python equivalent to regression in R is the statsmodels formula API:

import statsmodels.formula.api as smf
# regress accuracy onto avg_dist (df is a pandas DataFrame with both columns)
mod = smf.ols(formula='accuracy ~ avg_dist', data=df)
res = mod.fit()
print(res.summary())
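Here is a self-contained sketch of the same idea that can be run without my real data. The DataFrame is synthetic, and the column values are invented purely for illustration; only the column names match the snippet above. It also cross-checks the slope's p-value against stats.linregress, since for a single predictor both should be running the same t-test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical data standing in for the real df.
rng = np.random.default_rng(1)
avg_dist = rng.uniform(0, 5, size=50)
accuracy = 0.9 - 0.05 * avg_dist + rng.normal(scale=0.1, size=50)
df = pd.DataFrame({"accuracy": accuracy, "avg_dist": avg_dist})

# R-style formula: response on the left, predictor on the right.
res = smf.ols(formula="accuracy ~ avg_dist", data=df).fit()

# Fitted coefficients and their two-sided p-values, indexed by term name.
print(res.params)
print(res.pvalues)

# Cross-check against scipy for the single-predictor case.
lr = stats.linregress(df["avg_dist"], df["accuracy"])
print(f"scipy slope = {lr.slope:.4f}, scipy p-value = {lr.pvalue:.3g}")
```

The individual pieces (res.params, res.pvalues, res.rsquared) are handy when I only need one number rather than the full summary table.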

What I will do next

I should redo the last section of that stats course in R. Since most jobs require Python or R, I figured I should just know how to do everything in Python. But there are definitely some things that are better in R, such as getting statistical summaries and plotting.
Practice manipulating dataframes with pandas. It’s been a while since I learned them, and I keep having to look up basic things. I will find a W3 or Kaggle walk-through for pandas to refresh my memory.