Bayesian analysis of Amazon reviews in R and the most important functions in Pandas
R
.# Bayesian Analysis of Amazon Reviews
# cable A, $19.99
# 222 reviews, 4.4 stars, 74% positive (4 & 5 stars)
# Cable B, $15.99
# 638 reviews, 4.2 stars, 79% positive (4 & 5 stars)
# Csble C, $15.99
# 293 reviews, 4.5 stars, 88% positive (4 & 5 stars)
theta = seq(0, 1, .01)
plot(theta, dbeta(theta, 1 + 222*.74, 1 + 222*.26), col = "red", type="l", ylim = c(0,25),
main="Amazon Ratings",
xlab="Positive sentiment",
ylab="",
sub="for cables") # A
lines(theta, dbeta(theta, 1 + 638*.79, 1 + 638*.21), col = "blue", type="l") # B
lines(theta, dbeta(theta, 1 + 293*.88, 1 + 293*.12), col = "green") # C
legend("topleft", legend=c("A", "B", "C"),
col=c("red", "blue", "green"), lwd=(1))
Pandas
uses loc
nd iloc
. Use iloc
for index based selections in the row, column format. iloc
uses stdlib index, so the first index is included and the last is excluded. For example:import pandas as pd
df = pd.DataFrame({'Column 1': [50, 21], 'Column 2': [131, 2]})
df.iloc[0] #gives all of 0th row (i.e.)
df.iloc[:,0] #gives entire column of 'Column 1'
df.iloc[:2] #gives rows 0, and 1 of all the columns. Note how this differs from loc
loc
for label based selections. loc
indexes inclusively. Remember this by thinking about how loc
should be inclusive because you typically pass it strings. For example:import pandas as pd
df = pd.DataFrame[{'Column 1': [10, 20], 'Column 2': [30, 40]}]
df.loc[0, 'Column 1'] #gives row 0, column 1
df.loc[:, ['Column 1', 'Column 2']] #gives all rows for the 2 columns
df.loc[:1] #gives rows 0, and 1 of all the columns #note how this differs from iloc
reviews
where the entry in the column country
is “Italy.”reviews.loc[reviews.country == 'Italy']
# Here's a useful one I forgot about
reviews.loc[reviews.country.isin(['Italy', 'France'])]
reviews.loc[reviews.price.notnull()]
#for col1 I mean something like df.reviews or df['reviews'], I'm just trying to be more generalizable
df.describe()
df.col1.mean()
df.col1.unique()
df.col1.value_counts()
While Python is not a functional language, it has adopted a few functional features, such as the lambda function and map().
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
apply
function to enact it.def remean_points(row):
row.points = row.points - review_points_mean
return row
reviews.apply(remean_points, axis='columns')
idxmax
.bargain_wine = reviews.iloc[(reviews.points/reviews.price).idxmax()].title
# or in 2 lines:
bargain_index = (reviews.points/reviews.price).idxmax()
bargain_name = reviews.loc[bargain_idx, 'title']
# remember the the .loc gives the row of bargain_index and the column (i.e. value) 'title'