Week of December 27th
Bayesian analysis of Amazon reviews in R and the most important functions in Pandas
What I did
- I wrote some R code to help decide which Amazon product to buy, using some Bayesian statistics
- Reviewed some Pandas operations
- I spent a lot of time reading Obama’s new book. While unrelated to my blog, it’s worth mentioning, since that’s where most of my time went this holiday week.
What I Learned
- I needed to decide which cable to buy from Amazon and wanted to use some Bayesian statistics to determine which product had better reviews, given that the products have very different numbers of reviews. I have some thoughts on my methodology (to come). For now, here’s the code in R.
# Bayesian Analysis of Amazon Reviews
# Cable A, $19.99
# 222 reviews, 4.4 stars, 74% positive (4 & 5 stars)
# Cable B, $15.99
# 638 reviews, 4.2 stars, 79% positive (4 & 5 stars)
# Cable C, $15.99
# 293 reviews, 4.5 stars, 88% positive (4 & 5 stars)
# Posterior for each cable: Beta(1 + positives, 1 + negatives), i.e. a uniform Beta(1, 1) prior
theta = seq(0, 1, .01)
plot(theta, dbeta(theta, 1 + 222*.74, 1 + 222*.26), col = "red", type = "l", ylim = c(0, 25),
     main = "Amazon Ratings",
     xlab = "Positive sentiment",
     ylab = "",
     sub = "for cables") # A
lines(theta, dbeta(theta, 1 + 638*.79, 1 + 638*.21), col = "blue", type = "l") # B
lines(theta, dbeta(theta, 1 + 293*.88, 1 + 293*.12), col = "green") # C
legend("topleft", legend = c("A", "B", "C"),
       col = c("red", "blue", "green"), lwd = 1)
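The curves above can also be summarized numerically. As a rough sketch in Python (standard library only), assuming the same uniform Beta(1, 1) prior implied by the `1 +` terms in the R code, the snippet below computes each posterior mean and uses Monte Carlo draws to estimate the probability that cable C's positive rate exceeds cable B's. The names `posterior_params`, `posterior_mean`, and `prob_greater` are made up for this illustration, and the seed and number of draws are arbitrary.

```python
import random

# (number of reviews, share of 4- and 5-star reviews), copied from the R comments
cables = {"A": (222, 0.74), "B": (638, 0.79), "C": (293, 0.88)}


def posterior_params(n_reviews, share_positive):
    """Beta posterior parameters under a uniform Beta(1, 1) prior."""
    positives = n_reviews * share_positive
    negatives = n_reviews * (1 - share_positive)
    return 1 + positives, 1 + negatives


def posterior_mean(n_reviews, share_positive):
    """The mean of a Beta(a, b) distribution is a / (a + b)."""
    a, b = posterior_params(n_reviews, share_positive)
    return a / (a + b)


def prob_greater(first, second, draws=20_000, seed=0):
    """Monte Carlo estimate of P(theta_first > theta_second)."""
    rng = random.Random(seed)
    a1, b1 = posterior_params(*cables[first])
    a2, b2 = posterior_params(*cables[second])
    wins = sum(rng.betavariate(a1, b1) > rng.betavariate(a2, b2) for _ in range(draws))
    return wins / draws


for name, (n, share) in cables.items():
    print(name, round(posterior_mean(n, share), 3))
print("P(C > B) is roughly", prob_greater("C", "B"))
```

The posterior mean pulls each raw percentage slightly toward 0.5, and the pull is stronger for products with fewer reviews, which is the point of using the Beta posterior when review counts differ.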
- Indexing in Pandas uses loc and iloc. Use iloc for index-based (positional) selections in row, column format. iloc follows the Python stdlib slicing convention, so the first index is included and the last is excluded. For example:
import pandas as pd
df = pd.DataFrame({'Column 1': [50, 21], 'Column 2': [131, 2]})
df.iloc[0]     # gives all of row 0 (i.e. 50 and 131)
df.iloc[:, 0]  # gives the entire 'Column 1' column
df.iloc[:2]    # gives rows 0 and 1 of all the columns; note how this differs from loc
- Use loc for label-based selections. loc indexes inclusively. One way to remember this: you typically pass loc string labels, and with labels there is no obvious "one past the end" value, so including the end label is the natural choice. For example:
import pandas as pd
df = pd.DataFrame({'Column 1': [10, 20], 'Column 2': [30, 40]})
df.loc[0, 'Column 1']                # gives the value in row 0 of 'Column 1' (10)
df.loc[:, ['Column 1', 'Column 2']]  # gives all rows for the 2 columns
df.loc[:1]                           # gives rows 0 and 1 of all the columns; note how this differs from iloc
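To see the two slicing conventions side by side, here is a small sketch using a made-up three-row frame with the default integer index, where the same slice `:1` returns a different number of rows depending on the indexer.

```python
import pandas as pd

# Hypothetical data, only for illustration
df = pd.DataFrame({"Column 1": [10, 20, 30], "Column 2": [40, 50, 60]})

by_position = df.iloc[:1]  # positions 0 up to, but not including, 1
by_label = df.loc[:1]      # labels from the start through 1, end included

print(len(by_position))  # 1 row
print(len(by_label))     # 2 rows
```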
- Selecting conditional data. For example, select all entries in the dataframe reviews where the entry in the column country is “Italy.”
reviews.loc[reviews.country == 'Italy']
# Here's a useful one I forgot about
reviews.loc[reviews.country.isin(['Italy', 'France'])]
reviews.loc[reviews.price.notnull()]
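The `reviews` frame used here is not defined in this post, so below is a minimal sketch with a tiny made-up stand-in (the column names `country`, `price`, and `points` mirror the ones used in these notes; the values are invented). It also shows that boolean masks can be combined with `&` and `|`, as long as each condition is wrapped in parentheses.

```python
import pandas as pd

# Made-up stand-in for the reviews data; values are invented for illustration
reviews = pd.DataFrame({
    "country": ["Italy", "France", "US", "Italy"],
    "price": [15.0, None, 20.0, 40.0],
    "points": [87, 90, 85, 92],
})

italian = reviews.loc[reviews.country == "Italy"]
priced = reviews.loc[reviews.price.notnull()]

# Combine masks with & (and) or | (or); the parentheses are required
good_italian = reviews.loc[(reviews.country == "Italy") & (reviews.points >= 90)]

print(len(italian), len(priced), len(good_italian))
```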
- Some useful functions for dataframes and columns within:
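# col1 stands in for any column, e.g. df.reviews or df['reviews']
df.describe()           # summary statistics (numeric columns by default)
df.col1.mean()          # mean of the column
df.col1.unique()        # array of the distinct values
df.col1.value_counts()  # how often each distinct value occurs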
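As a quick illustration of what these calls return, here is a small sketch on an invented two-column frame (the column names and values are assumptions made up for this example).

```python
import pandas as pd

# Invented data, only for illustration
df = pd.DataFrame({"country": ["Italy", "France", "Italy"], "points": [87, 90, 92]})

mean_points = df.points.mean()
distinct = list(df.country.unique())  # distinct values, in order of first appearance
counts = df.country.value_counts()    # Series indexed by value, most frequent first

print(mean_points)
print(distinct)
print(counts["Italy"])
```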
Mapping
While Python is not a functional language, it has adopted a few functional features, such as the lambda function and map().
- There are a couple of ways to map one set of values onto another. For example, to re-center a column so that its mean is zero (subtract the mean from every value):
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
- You can also write your own mapping function and use the pandas apply function to run it on every row.
def remean_points(row):
row.points = row.points - review_points_mean
return row
reviews.apply(remean_points, axis='columns')
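For a simple arithmetic transformation like this one, plain vectorized subtraction gives the same result as map() and is usually the more idiomatic choice. A small sketch with invented numbers; note that neither call modifies `reviews` in place, each returns a new Series.

```python
import pandas as pd

# Invented data, only for illustration
reviews = pd.DataFrame({"points": [80, 90, 100]})
review_points_mean = reviews.points.mean()

mapped = reviews.points.map(lambda p: p - review_points_mean)
vectorized = reviews.points - review_points_mean

print(mapped.tolist())
print(mapped.tolist() == vectorized.tolist())
```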
- If you had a dataframe of wines and you wanted to return the title of the wine with the best points-to-price ratio, you’d need idxmax, which returns the index label of the largest value.
bargain_wine = reviews.loc[(reviews.points / reviews.price).idxmax(), 'title']
# or in 2 lines:
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_name = reviews.loc[bargain_idx, 'title']
# remember that .loc takes the row label (bargain_idx) and the column label ('title')
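One detail worth spelling out: idxmax returns an index label, not a position, so loc is the safer lookup whenever the index is not the default 0..n-1 range. A minimal sketch with an invented wine table whose index starts at 101 (titles, prices, and labels are all made up):

```python
import pandas as pd

# Invented data with a non-default index, only for illustration
reviews = pd.DataFrame(
    {
        "title": ["Wine A", "Wine B", "Wine C"],
        "points": [90, 85, 88],
        "price": [30.0, 10.0, 22.0],
    },
    index=[101, 102, 103],
)

ratio = reviews.points / reviews.price  # 3.0, 8.5, 4.0
best_label = ratio.idxmax()             # the label 102, not the position 1
bargain_title = reviews.loc[best_label, "title"]

print(best_label, bargain_title)
```

With this index, `reviews.iloc[best_label]` would raise an IndexError, because there is no row at position 102.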
What I Will Do Next
- I really need to just spend a week and get this chatbot done. It’s been increasingly difficult, though, since the semester gets busier towards the end. I’m also anxious to start writing my thesis. While I’m probably ahead of all the other thesis students, I still feel far behind where I was when I wrote my bachelor’s thesis at Williams.