Week of November 22nd
R equivalencies in Python.
What I did
- I finished the bayesian stats course on Saturday, but I wanted to redo the final exercises in python instead of R.
- I worked on some multi-linear regression and some more traditional statistics in Python. Also
matplotlib
.
What I learned
- So, I already learned about the p-value test statistic, but I hadn’t seen it in the context of a linear regression. It struck me as odd and I needed to look up what it actually was referring to.
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
According to the documentation, the p_value returns a “Two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero, using Wald Test with t-distribution of the test statistic.” Essentially, what is the probability of obtaining a result at least as extreme as the measured slope, assuming the null hypothesis of zero. For a 2-tailed test, the p-value is:
\[\text{p value} = 2P(Z > |z_0|) \\ \text{where } Z_0 = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}\]- A good python equivalent to regression in R, is:
import statsmodels.formula.api as sfm
# regress acc onto dist
mod = sfm.ols(formula='accuracy ~ avg_dist', data = df)
res=mod.fit()
print(res.summary())
What I Will do next
- I should redo the last section of that stats course in R. Since most jobs require python or R, I figured I should just know how to do everything in python. But there are definitely some things that are better in R, such as ways to get statistical information and plotting.
- Practice manipulating dataframes with pandas. It’s been awhile since I learned them, and I keep having to keep looking up basic things. I will find some W3 or Kaggle walk-through for pandas to refresh myself.