Text Classification in NLP — Naive Bayes

Abhinav Rai
7 min readJan 7, 2017

We can use Supervised Machine Learning:


  • a document d
  • a fixed set of classes C = { c1, c2, … , cn }
  • a training set of m documents that we have pre-determined to belong to a specific class

We train our classifier using the training set, and result in a learned classifier.

We can then use this learned classifier to classify new documents.

Notation: we use Υ(d) = C to represent our classifier, where Υ() is the classifier, d is the document, and c is the class we assigned to the document.

(Google’s mark as spam button probably works this way).

Naive Bayes Classifier

This is a simple (naive) classification method based on Bayes rule. It relies on a very simple representation of the document (called the bag of words representation)

Imagine we have 2 classes ( positive and negative ), and our input is a text representing a review of a movie. We want to know whether the review was positive or negative. So we may have a bag of positive words (e.g. love, amazing,hilarious, great), and a bag of negative words (e.g. hate, terrible).

We may then count the number of times each of those words appears in the document, in order to classify the document as positive or negative.

This technique works well for topic classification; say we have a set of academic papers, and we want to classify them into different topics (computer science, biology, mathematics).

For difference between Naive Bayes & Multinomial Naive Bayes:

  • Naive Bayes is generic. Multinomial Naive Bayes is a specific instance of Naive Bayes where the P(Featurei|Class) follows multinomial distribution (word counts, probabilities, etc.)
  • More Information can be found here.

Bayes’ Rule applied to Documents and Classes

For a document d and a class c, and using Bayes’ rule,

P( c | d ) = [ P( d | c ) x P( c ) ] / [ P( d ) ]

The class mapping for a given document is the class which has the maximum value of the above probability.

Since all probabilities have P( d ) as their denominator, we can eliminate the denominator, and simply compare the different values of the numerator:

P( c | d ) = P( d | c ) x P( c )

Now, what do we mean by the term P( d | c ) ?

Let’s represent the document as a set of features (words or tokens) x1, x2, x3, …

We can then re-write P( d | c ) as:

P( x1, x2, x3, … , xn | c )

What about P( c ) ? How do we calculate it?

=> P( c ) is the total probability of a class. => How often does this class occur in total?

E.g. in the case of classes positive and negative, we would be calculating the probability that any given review is positive ornegative, without actually analyzing the current input document.

This is calculated by counting the relative frequencies of each class in a corpus.

E.g. out of 10 reviews we have seen, 3 have been classified as positive.

=> P ( positive ) = 3 / 10

Now let’s go back to the first term in the Naive Bayes equation:

P( d | c ), or P( x1, x2, x3, … , xn | c ).

How do we actually calculate this?

We use some assumptions to simplify the computation of this probability:

  • the Bag of Words assumption => assume the position of the words in the document doesn’t matter.
  • Conditional Independence => Assume the feature probabilities P( xi | cj ) are independent given the class c.

It is important to note that both of these assumptions aren’t actually correct — of course, the order of words matter, and they are not independent. A phrase like this movie was incredibly terrible shows an example of how both of these assumptions don't hold up in regular english.

However, these assumptions greatly simplify the complexity of calculating the classification probability. And in practice, we can calculate probabilities with a reasonable level of accuracy given these assumptions.


To calculate the Naive Bayes probability, P( d | c ) x P( c ), we calculate P( xi | c ) for each xi in d, and multiply them together.

Then we multiply the result by P( c ) for the current class. We do this for each of our classes, and choose the class that has the maximum overall value.

How do we learn the values of P ( c ) and P ( xi | c ) ?

=> We can use Maximum Likelihood estimates.

Simply put, we look at frequency counts.

P ( ci ) = [ Num documents that have been classified as ci ] / [ Num documents ]

In english, Out of all the documents, how many of them were in class i refers the above probability.

P ( wi | cj ) = [ count( wi, cj ) ] / [ Σw∈V count ( w, cj ) ]

In english, The probability of word i given class j is the count that the word occurred in documents of class j, divided by the sum of the counts of each word in our vocabulary in class j refers the above probability.

So for the denominator, we iterate through each word in our vocabulary, look up the frequency that it has occurred in class j, and add these up.

Problems with Maximum Likelihood Estimates

What if we haven’t seen any training documents with the word fantastic in our class positive ?

In this case, P ( fantastic | positive ) = 0

=> This is BAD

Since we are calculating the overall probability of the class by multiplying individual probabilities for each word, we would end up with an overall probability of 0 for the positive class.

So how do we fix this issue?

We can use a Smoothing Algorithm, for example Add-one smoothing (or Laplace smoothing).

  • Laplace Smoothing

We modify our conditional word probability by adding 1 to the numerator and modifying the denominator as such:

P ( wi | cj ) = [ count( wi, cj ) + 1 ] / [ Σw∈V( count ( w, cj ) + 1 ) ]

This can be simplified to

P ( wi | cj ) = [ count( wi, cj ) + 1 ] / [ Σw∈V( count ( w, cj ) ) + |V| ]

where |V| is our vocabulary size (we can do this since we are adding 1 for each word in the vocabulary in the previous equation).

Note here: Create a Mega Document for calculating P ( wi | cj ) which contains all the documents whose class is cj.

Relationship with Language Model: Here the features x1,x2, x3 etc. Can be any sort of feature. But if we use only word features, then they become unigram language models. They have an important similarity to language models. Multiplying all features is equivalent to getting probability of the sentence in Language model (Unigram here). Therefore Naive Bayes can be used as Language Model.

Precision, Recall & F-measure

We need more accurate measure than contingency table (True, false positive and negative) as talked in my blog “Basics of NLP”.

Precision: % of selected items that are correct. Tp (True Positive) / Tp + Fp(False Positive)

Recall: % of correct items that are selected. Tp / Tp + Fn

There is a tradeoff between precision & recall. A standard way proposed to combine this measure is F-measure. F-Measure is weighted harmonic mean.

Mostly balanced F measure (F1 measure) is used.


F=1 / (k/P + (1-k)/R) = (B*B + 1)*R / (B*B*P + R)

for F1, put B(Beta)=1.

Micro and Macro Average

So in Summary, to Machine-Learn your Naive-Bayes Classifier:


  • an input document
  • the category that this document belongs to

We do:

  • increment the count of total documents we have learned from N.
  • increment the count of documents that have been mapped to this category Nc.
  • if we encounter new words in this document, add them to our vocabulary, and update our vocabulary size |V|.
  • update count( w, c ) => the frequency with which each word in the document has been mapped to this category.
  • update count ( c ) => the total count of all words that have been mapped to this class.

So when we are confronted with a new document, we calculate for each class:

P( c ) = Nc / N

=> how many documents were mapped to class c, divided by the total number of documents we have ever looked at. This is the overall, or prior probability of this class.

Then we iterate thru each word in the document, and calculate:

P( w | c ) = [ count( w, c ) + 1 ] / [ count( c ) + |V| ]

=> the count of how many times this word has appeared in class c, plus 1, divided by the total count of all words that have ever been mapped to class c, plus the vocabulary size. This uses the Laplace-Smoothing, so we don’t get tripped up by words we’ve never seen before. This equation is used both for words we have seen, as well as words we haven’t seen.

=> we multiply each P( w | c ) for each word w in the new document, then multiply by P( c ), and the result is the probability that this document belongs to this class.

Some Ways that we can tweak our Naive Bayes Classifier

Depending on the domain we are working with, we can do things like

  • Collapse Part Numbers or Chemical Names into a single token
  • Upweighting (counting a word as if it occurred twice)
  • Feature selection (since not all words in the document are usually important in assigning it a class, we can look for specific words in the document that are good indicators of a particular class, and drop the other words — those that are viewed to be semantically empty)

=> If we have a sentence that contains a title word, we can upweight the sentence (multiply all the words in it by 2 or 3 for example), or we can upweight the title word itself (multiply it by a constant).

Choosing Classifier

  • if there is No data -> Handwritten rules!
  • Less training data -> Naive Bayes is best
  • Reasonable amount -> SVMs & Logical Regression. Can also use decision trees



Abhinav Rai

Buliding Engagebud | Guitarist | Traveller | Entrepreneur