SQL, Naive Bayes, Python tips

What I did

What I learned

\[P(A \mid B) = \frac{P(B \mid A) ~ P(A)}{P(B)}\]

Explicitly, in my mystery message sender classifier,

\[P(Sender = Rohan \mid \text{mystery message}) = \frac{P(\text{mystery message} \mid Rohan) ~ P(Rohan)}{P(\text{mystery message})}\]

Now what exactly is each term?

\[P(\text{mystery message} \mid Rohan) = \prod_{\text{word in message}} \frac{\text{Number of times word occurs in Rohan's count vector}}{\text{Total number of words (with repeats) in Rohan's dictionary}}\]

\[P(Rohan) = \frac{\text{Number of messages sent by } Rohan}{\text{Number of messages sent by } everyone}\]

\[P(\text{mystery message}) = \prod_{\text{word in message}} \frac{\text{Number of times word occurs in all messages}}{\text{Total number of words (with repeats) in full dictionary}}\]
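To make the three terms concrete, here is a minimal plain-Python sketch of them. The toy messages, the second sender "Priya", and the helper names count_vector and posterior are all made up for illustration; this is not how scikit-learn implements it. It follows the formulas literally, so there is no smoothing: a word the sender never used zeroes the likelihood, and a word that appears in no message at all would cause a division by zero.

```python
from collections import Counter

# Hypothetical toy data: (sender, message) pairs invented for illustration.
messages = [
    ("Rohan", "lunch at noon"),
    ("Rohan", "lunch was great"),
    ("Priya", "meeting at noon"),
]

def count_vector(texts):
    """Map each unique word to the number of times it occurs in texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def posterior(sender, mystery, messages):
    """P(sender | mystery message), built from the three terms above."""
    sender_counts = count_vector(t for s, t in messages if s == sender)
    all_counts = count_vector(t for _, t in messages)
    sender_total = sum(sender_counts.values())  # words with repeats
    all_total = sum(all_counts.values())        # words with repeats

    likelihood = 1.0  # P(mystery message | sender)
    evidence = 1.0    # P(mystery message)
    for word in mystery.lower().split():
        # Counter returns 0 for unseen words, so no KeyError here.
        likelihood *= sender_counts[word] / sender_total
        evidence *= all_counts[word] / all_total

    # P(sender): fraction of all messages sent by this sender.
    prior = sum(1 for s, _ in messages if s == sender) / len(messages)
    return likelihood * prior / evidence

p_rohan = posterior("Rohan", "lunch at noon", messages)
p_priya = posterior("Priya", "lunch at noon", messages)
print(p_rohan, p_priya)
```

On this toy data Rohan gets the higher score, because Priya never used the word "lunch". In practice you would probably want to add smoothing and sum log-probabilities instead of multiplying raw probabilities, since long messages drive the product toward zero.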

Note that by Rohan's count vector, I mean the Python dictionary where each key is a unique word and the corresponding value is the number of times it shows up in all the messages sent by Rohan. I call this the count vector because the most common way to create this dictionary is to use CountVectorizer from sklearn.feature_extraction.text. I’m also not positive that I defined \(P(Rohan)\) correctly. It might actually be:

\[P(Rohan) = \frac{\text{Number of words sent by } Rohan}{\text{Number of words sent by } everyone}\]

Presumably the two definitions give very similar values, at least when everyone writes messages of similar length. I think my original definition is the correct one, based on how you write a spam classifier. In a spam classifier, you’d say the probability that an email is spam (as a prior, i.e. given no info about the email) is just the number of spam emails you’ve received divided by the total number of emails. Using this analogy, I think my original definition is correct, but I’m not going to spend a lot of time dissecting scikit-learn's multinomial naive Bayes classifier.
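A quick way to see how far apart the two candidate priors can be is to compute both on a small example. The sketch below uses made-up messages and two hypothetical helpers, prior_by_messages and prior_by_words. For what it's worth, scikit-learn's MultinomialNB (with the default fit_prior=True) estimates class priors from the number of training samples per class, which corresponds to the message-count version.

```python
# Hypothetical toy data: one sender writes short messages, the other a long one.
messages = [
    ("Rohan", "lunch at noon"),
    ("Rohan", "ok"),
    ("Priya", "the meeting moved to three so bring the slides"),
]

def prior_by_messages(sender, messages):
    """Fraction of all messages that were sent by sender."""
    sent = sum(1 for s, _ in messages if s == sender)
    return sent / len(messages)

def prior_by_words(sender, messages):
    """Fraction of all words (with repeats) that were sent by sender."""
    sender_words = sum(len(t.split()) for s, t in messages if s == sender)
    all_words = sum(len(t.split()) for _, t in messages)
    return sender_words / all_words

print(prior_by_messages("Rohan", messages))  # 2 of 3 messages
print(prior_by_words("Rohan", messages))     # 4 of 13 words
```

Here the two priors disagree noticeably (2/3 versus 4/13) because the message lengths are so uneven; with similar message lengths they would land much closer together.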

I assume this is called Naive Bayes because of an assumed orthogonality, or rather, the assumption that each word in the message is independent of the other words (given the sender).
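Written out, the independence assumption is what lets the likelihood turn into a simple product over words. Here \(w_1, \ldots, w_n\) is my own notation for the words of the mystery message. Without the assumption, the chain rule gives

\[P(w_1, \ldots, w_n \mid Rohan) = P(w_1 \mid Rohan) ~ P(w_2 \mid w_1, Rohan) \cdots P(w_n \mid w_1, \ldots, w_{n-1}, Rohan)\]

and with it, each factor drops its dependence on the earlier words:

\[P(w_1, \ldots, w_n \mid Rohan) = \prod_{i=1}^{n} P(w_i \mid Rohan)\]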

Two Overlooked Python Basics

What I will do next