SQL, Naive Bayes, Python tips
With this new version of my chatbot, my plan is to train the PyTorch model on the Cornell Movie-Dialogs Corpus and then apply transfer learning with my Facebook Messenger dataset. Recently I’ve seen some interesting data augmentation techniques for natural language processing, such as translating a sentence into another language and back, then using this back-translated sentence as a new, different sentence with the same meaning. I think I saw this with Martian AI, but I came across a 2019 paper that I would like to read for my next Journal Club post.
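As a sketch of the idea, back-translation could look something like this with the Hugging Face MarianMT models (my assumption for the translation backend; the paper may well do it differently):

```python
from transformers import MarianMTModel, MarianTokenizer

def load(model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    return tokenizer, model

def translate(texts, tokenizer, model):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    return [tokenizer.decode(g, skip_special_tokens=True)
            for g in model.generate(**batch)]

en_fr = load("Helsinki-NLP/opus-mt-en-fr")  # English -> French
fr_en = load("Helsinki-NLP/opus-mt-fr-en")  # French -> English

sentence = ["do you want to grab dinner later tonight?"]
french = translate(sentence, *en_fr)
augmented = translate(french, *fr_en)  # same meaning, hopefully new phrasing
print(augmented)
```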
I went into the weeds with how Naive Bayes works for text classification. Recall Bayes’ Theorem:
\[P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}\]
Explicitly, in my mystery message sender classifier,
\[P(Sender = Rohan \mid \text{mystery message}) = \frac{P(\text{mystery message} \mid Rohan) \, P(Rohan)}{P(\text{mystery message})}\]Now what exactly is each term?
\[P(\text{mystery message} \mid Rohan) = \prod_{\text{word in message}} \frac{\text{Number of times word occurs in Rohan's count vector}}{\text{Total number of words in Rohan's dictionary}}\] \[P(Rohan) = \frac{\text{Number of messages sent by } Rohan}{\text{Number of messages sent by } everyone}\] \[P(\text{mystery message}) = \prod_{\text{word in message}} \frac{\text{Number of times word occurs in all messages}}{\text{Total number of words (with repeats) in full dictionary}}\]Note, by "Rohan's count vector", I mean the Python dictionary where each key is a unique word and the corresponding value is the number of times it shows up in all the messages sent by Rohan. I call this the count vector because the most common way to create this dictionary is to use `from sklearn.feature_extraction.text import CountVectorizer`.
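For concreteness, here’s a quick sketch of building that dictionary with `CountVectorizer` (the messages are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

rohans_messages = ["want to grab lunch?", "lunch sounds great"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(rohans_messages)  # one row per message

# Collapse the per-message counts into the dictionary described above.
count_vector = dict(zip(vectorizer.get_feature_names_out(),
                        counts.sum(axis=0).A1))
print(count_vector)  # {'grab': 1, 'great': 1, 'lunch': 2, 'sounds': 1, ...}
```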
I’m also not positive that I defined \(P(Rohan)\) correctly. It might actually be:
\[P(Rohan) = \frac{\text{Total number of words in messages sent by } Rohan}{\text{Total number of words in messages sent by } everyone}\]
Presumably the values of these two definitions are very similar. I think my original definition is the correct one based on how you write a spam classifier. In a spam classifier, you’d say the probability that an email is spam (as a prior, i.e. given no info about the email) is just the number of spam emails you’ve received divided by the total number of emails. Using this analogy, I think my original definition is correct, but I’m not going to spend a lot of time dissecting scikit-learn’s multinomial Naive Bayes classifier.
I assume the reason this is called Naive Bayes is that there is an assumed orthogonality, or rather, that each word in the sentence is assumed to be independent of the other words (given the sender), which is what lets \(P(\text{mystery message} \mid Rohan)\) factor into the product over words above.
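To tie the pieces together, here’s a minimal sketch of the whole classifier in scikit-learn (the messages and senders are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "want to grab lunch?",
    "did you finish the problem set?",
    "lunch sounds great",
    "the problem set was brutal",
]
senders = ["Rohan", "Alex", "Rohan", "Alex"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)  # each row is a message's count vector

clf = MultinomialNB()
clf.fit(X, senders)

mystery = vectorizer.transform(["lunch was brutal"])
print(clf.predict(mystery))        # most likely sender
print(clf.predict_proba(mystery))  # P(sender | mystery message) per sender
```

Also, if I’m reading the scikit-learn docs right, `MultinomialNB` computes its default prior as the fraction of training messages per sender, which would match my original definition of \(P(Rohan)\).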
To merge `a = [1, 2]` and `b = [3, 4]` into `c = [1, 2, 3, 4]`, you can do `a.extend(b)`, which modifies `a` in place. Watch out, though: `extend` returns `None`, so `c = a.extend(b)` would actually set `c` to `None`; to build a new list instead, use `c = a + b`.
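For reference, both versions side by side:

```python
a = [1, 2]
b = [3, 4]

c = a + b    # builds a new list: [1, 2, 3, 4]

a.extend(b)  # modifies a in place; a is now [1, 2, 3, 4]
# Gotcha: extend returns None, so `c = a.extend(b)` would set c to None.
```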
I’ve been using `pickle` to save data that needs to be loaded by another script, but since I’m just saving dictionaries it makes way more sense to save them as JSON files.
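The switch is tiny; something like this (filenames made up):

```python
import json

counts = {"lunch": 2, "grab": 1}

# In the script that produces the data:
with open("counts.json", "w") as f:
    json.dump(counts, f)

# In the script that consumes it:
with open("counts.json") as f:
    counts = json.load(f)
```

One caveat: JSON keys must be strings and values must be JSON-serializable, so anything fancier than plain dictionaries, lists, strings, and numbers would still need `pickle`.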
Eventually I’d like to deploy the chatbot in the browser, and I was hoping to avoid `javascript` in doing so, but I guess one can only avoid JavaScript for so long. I’ve looked a little at tensorflow.js and ONNX.js, and it seems doable, though more work than building out the chatbot itself.
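If I go the ONNX.js route, the Python side would presumably be a `torch.onnx.export` call, something like this sketch (the model and input shape are stand-ins, not my actual chatbot):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the real chatbot model
model.eval()

dummy_input = torch.randn(1, 10)  # example input with the right shape
torch.onnx.export(model, dummy_input, "chatbot.onnx")
```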