3 minute read

Bayesian analysis of Amazon reviews in R and the most important functions in Pandas

What I did

  • I wrote some R code to help determine which Amazon product to buy using some Bayesian Statistics
  • Reviewed some Pandas operations
  • I spent a lot of time reading Obama’s new book. While, unrelated to my blog, it’s worth mentioning since that’s actually where I spent most of my time this holiday week.

What I Learned

  • I needed to decide which cable to buy from Amazon and wanted to incorporate some Bayesian statistics into determining which product had better reviews (with a disparity in the number of reviews for each product). I have some thoughts on my methodology (to come). For now, here’s the code in R.
# Bayesian Analysis of Amazon Reviews

# cable A, $19.99
# 222 reviews, 4.4 stars, 74% positive (4 & 5 stars)
# Cable B, $15.99
# 638 reviews, 4.2 stars, 79% positive (4 & 5 stars)
# Csble C, $15.99
# 293 reviews, 4.5 stars, 88% positive (4 & 5 stars)
theta = seq(0, 1, .01)
plot(theta, dbeta(theta, 1 + 222*.74, 1 + 222*.26), col = "red", type="l", ylim = c(0,25),
    main="Amazon Ratings",
    xlab="Positive sentiment",
    ylab="",
    sub="for cables") # A
lines(theta, dbeta(theta, 1 + 638*.79, 1 + 638*.21), col = "blue", type="l") # B
lines(theta, dbeta(theta, 1 + 293*.88, 1 + 293*.12), col = "green") # C
legend("topleft", legend=c("A", "B", "C"),
       col=c("red", "blue", "green"), lwd=(1))

  • Indexing in Pandas uses loc nd iloc. Use iloc for index based selections in the row, column format. iloc uses stdlib index, so the first index is included and the last is excluded. For example:
import pandas as pd
df = pd.DataFrame({'Column 1': [50, 21], 'Column 2': [131, 2]})
df.iloc[0] #gives all of 0th row (i.e.)
df.iloc[:,0] #gives entire column of 'Column 1'
df.iloc[:2] #gives rows 0, and 1 of all the columns. Note how this differs from loc
  • Use loc for label based selections. loc indexes inclusively. Remember this by thinking about how loc should be inclusive because you typically pass it strings. For example:
import pandas as pd
df = pd.DataFrame[{'Column 1': [10, 20], 'Column 2': [30, 40]}]
df.loc[0, 'Column 1'] #gives row 0, column 1
df.loc[:, ['Column 1', 'Column 2']] #gives all rows for the 2 columns
df.loc[:1] #gives rows 0, and 1 of all the columns #note how this differs from iloc
  • Selecting conditional data. For example, select all entries in the dataframe reviews where the entry in the column country is “Italy.”
reviews.loc[reviews.country == 'Italy']

# Here's a useful one I forgot about
reviews.loc[reviews.country.isin(['Italy', 'France'])]
reviews.loc[reviews.price.notnull()]
  • Some useful functions for dataframes and columns within:
#for col1 I mean something like df.reviews or df['reviews'], I'm just trying to be more generalizable
df.describe()
df.col1.mean()
df.col1.unique()
df.col1.value_counts()

Mapping

While Python is not a functional language, it has adopted a few functional features, such as the lambda function and map().

  • To map one set of values into another, there are a couple of ways. For example, to re-mean the set of values:
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
  • You can also write your own mapping function and use the pandas apply function to enact it.
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')
  • If you had a dataframe of wines and you wanted to return the title of the wine with the best points to price ratio, you’d need idxmax.
bargain_wine = reviews.iloc[(reviews.points/reviews.price).idxmax()].title
# or in 2 lines:
bargain_index = (reviews.points/reviews.price).idxmax()
bargain_name = reviews.loc[bargain_idx, 'title']
# remember the the .loc gives the row of bargain_index and the column (i.e. value) 'title'

What I Will Do Next

  • I really need to just spend a week and get this chatbot done. It’s been increasingly difficult though, since the semester gets busier towards the end. I’m also anxious to start writing my thesis. While I’m probably ahead of all the other thesis students, I still feel far behind where I was when I wrote my bachelors thesis at Williams.