SQL, Naive Bayes, Python tips
With this new version of my chatbot, my plan is to train the PyTorch model on the Cornell Movie-Dialogs Corpus and then apply transfer learning with my Facebook Messenger dataset. Recently I’ve seen some interesting data augmentation techniques for natural language processing, such as translating a sentence into another language and back, then using this back-translated sentence as a new, different sentence with the same meaning. I think I saw this with Martian AI, but I came across a 2019 paper that I would like to read for my next Journal Club post.
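As a sketch of the idea, back-translation could look something like this with the Hugging Face MarianMT models (my assumption for the translation backend; the paper may well do it differently):

```python
from transformers import MarianMTModel, MarianTokenizer

def load(model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tokenizer, model

def translate(texts, tokenizer, model):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    return [tokenizer.decode(g, skip_special_tokens=True)
            for g in model.generate(**batch)]

en_fr = load("Helsinki-NLP/opus-mt-en-fr")  # English -> French
fr_en = load("Helsinki-NLP/opus-mt-fr-en")  # French -> English

sentence = ["do you want to grab dinner later tonight?"]
french = translate(sentence, *en_fr)
augmented = translate(french, *fr_en)  # same meaning, hopefully new phrasing
print(augmented)
```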
I went into the weeds with how Naive Bayes works for text classification. Recall Bayes’ Theorem:
\[P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}\]
Explicitly, in my mystery message sender classifier,
\[P(Sender = Rohan \mid \text{mystery message}) = \frac{P(\text{mystery message} \mid Rohan) \, P(Rohan)}{P(\text{mystery message})}\]Now what exactly is each term?
\[P(\text{mystery message} \mid Rohan) = \prod_{\text{word in message}} \frac{\text{Number of times word occurs in Rohan's count vector}}{\text{Total number of words in Rohan's dictionary}}\] \[P(Rohan) = \frac{\text{Number of messages sent by } Rohan}{\text{Number of messages sent by } everyone}\] \[P(\text{mystery message}) = \prod_{\text{word in message}} \frac{\text{Number of times word occurs in all messages}}{\text{Total number of words (with repeats) in full dictionary}}\]Note, by "Rohan's count vector", I mean the Python dictionary where each key is a unique word and the corresponding value is the number of times it shows up in all the messages sent by Rohan. I call this the count vector because the most common way to create this dictionary is to use `from sklearn.feature_extraction.text import CountVectorizer`.
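For concreteness, here’s a quick sketch of building that dictionary with `CountVectorizer` (the messages are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

rohans_messages = ["want to grab lunch?", "lunch sounds great"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(rohans_messages)  # one row per message

# Collapse the per-message counts into the dictionary described above.
count_vector = dict(zip(vectorizer.get_feature_names_out(),
                        counts.sum(axis=0).A1))
print(count_vector)  # {'grab': 1, 'great': 1, 'lunch': 2, 'sounds': 1, ...}
```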
I’m also not positive that I defined \(P(Rohan)\) correctly. It might actually be:
\[P(Rohan) = \frac{\text{Total number of words in messages sent by } Rohan}{\text{Total number of words in messages sent by } everyone}\]
Presumably the values of these two definitions are very similar. I think my original definition is the correct one based on how you write a spam classifier. In a spam classifier, you’d say the probability that an email is spam (as a prior, i.e. given no info about the email) is just the number of spam emails you’ve received divided by the total number of emails. Using this analogy, I think my original definition is correct, but I’m not going to spend a lot of time dissecting scikit-learn’s multinomial Naive Bayes classifier.
I assume the reason this is called Naive Bayes is that there is an assumed orthogonality, or rather, that each word in the sentence is assumed to be independent of the other words (given the sender), which is what lets \(P(\text{mystery message} \mid Rohan)\) factor into the product over words above.
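To tie the pieces together, here’s a minimal sketch of the whole classifier in scikit-learn (the messages and senders are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "want to grab lunch?",
    "did you finish the problem set?",
    "lunch sounds great",
    "the problem set was brutal",
]
senders = ["Rohan", "Alex", "Rohan", "Alex"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)  # each row is a message's count vector

clf = MultinomialNB()
clf.fit(X, senders)

mystery = vectorizer.transform(["lunch was brutal"])
print(clf.predict(mystery))        # most likely sender
print(clf.predict_proba(mystery))  # P(sender | mystery message) per sender
```

Also, if I’m reading the scikit-learn docs right, `MultinomialNB` computes its default prior as the fraction of training messages per sender, which would match my original definition of \(P(Rohan)\).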
To merge `a = [1, 2]` and `b = [3, 4]` into `c = [1, 2, 3, 4]`, you can do `a.extend(b)`, which modifies `a` in place. Watch out, though: `extend` returns `None`, so `c = a.extend(b)` would actually set `c` to `None`; to build a new list instead, use `c = a + b`.
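For reference, both versions side by side:

```python
a = [1, 2]
b = [3, 4]

c = a + b    # builds a new list: [1, 2, 3, 4]

a.extend(b)  # modifies a in place; a is now [1, 2, 3, 4]
# Gotcha: extend returns None, so `c = a.extend(b)` would set c to None.
```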
I’ve been using `pickle` to save data that needs to be loaded by another script, but since I’m just saving dictionaries it makes way more sense to save them as JSON files.
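The switch is tiny; something like this (filenames made up):

```python
import json

counts = {"lunch": 2, "grab": 1}

# In the script that produces the data:
with open("counts.json", "w") as f:
    json.dump(counts, f)

# In the script that consumes it:
with open("counts.json") as f:
    counts = json.load(f)
```

One caveat: JSON keys must be strings and values must be JSON-serializable, so anything fancier than plain dictionaries, lists, strings, and numbers would still need `pickle`.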
Eventually I’d like to deploy the chatbot in the browser, and I was hoping to avoid `javascript` in doing so, but I guess one can only avoid JavaScript for so long. I’ve looked a little at tensorflow.js and ONNX.js, and it seems doable, though more work than building out the chatbot itself.
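If I go the ONNX.js route, the Python side would presumably be a `torch.onnx.export` call, something like this sketch (the model and input shape are stand-ins, not my actual chatbot):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the real chatbot model
model.eval()

dummy_input = torch.randn(1, 10)  # example input with the right shape
torch.onnx.export(model, dummy_input, "chatbot.onnx")
```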