
Millions of tweets are posted every day. Analyzing them helps us understand how the public is responding to a particular event. To get the sentiment of tweets, we can use the Naive Bayes classification algorithm, which is simply an application of Bayes’ rule.


Bayes Rule

Bayes’ rule describes the probability of an event given prior knowledge that another, related event has occurred.

Let A and B be two events. The probability of occurrence of event A given that event B has already occurred is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Similarly, the probability of occurrence of event B given that event A has already occurred is

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

Combining these two equations, we can rewrite them as

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Let’s take a look at tweets and how we are going to extract features from them.

We will work with two corpora of tweets: positive tweets and negative tweets.


Positive tweets: ‘I am happy because I am learning NLP,’ ‘I am happy, not sad.’


Negative tweets: ‘I am sad, I am not learning NLP,’ ‘I am sad, not happy.’


Preprocessing

We need to preprocess our data: this shrinks the vocabulary, which saves memory and reduces the amount of computation.


  1. Lowercase: We convert all text to lower case so that words like ‘Learning’ and ‘learning’ are treated as the same word.
  2. Removing punctuation, URLs, names: We remove punctuation, URLs, and names or hashtags because they contribute little to the sentiment analysis of a tweet.
  3. Removing stopwords: Stopwords such as ‘the’ and ‘is’ do not contribute to sentiment, so they are removed.
  4. Stemming: Words like ‘took’ and ‘taking’ are treated as the same word and converted to their base word, here ‘take’ (strictly, mapping an irregular form such as ‘took’ to ‘take’ requires lemmatization rather than plain suffix stripping). This saves a lot of memory and time.
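The four steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the stopword set and the `simple_stem` helper are hypothetical stand-ins introduced here (a real pipeline would more likely use a full stopword list and a proper stemmer such as NLTK's `PorterStemmer`), and the function name `preprocess` is our own.

```python
import re
import string

# Tiny illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"the", "is", "a", "an", "and", "to", "of"}


def simple_stem(word):
    """Very crude suffix stripper, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    text = tweet.lower()                            # 1. lowercase
    text = re.sub(r"https?://\S+", " ", text)       # 2. remove URLs
    text = re.sub(r"[@#]\w+", " ", text)            # 2. remove @names and #hashtags
    text = text.translate(
        str.maketrans("", "", string.punctuation))  # 2. remove ASCII punctuation
    tokens = [t for t in text.split()
              if t not in STOPWORDS]                # 3. remove stopwords
    return [simple_stem(t) for t in tokens]         # 4. stem


tokens = preprocess("Learning NLP is fun! https://example.com @user #nlp")
print(tokens)
```

Note that the worked example in the rest of this article keeps every word (including ‘i’ and ‘am’), so it corresponds to applying only the lowercasing and punctuation steps.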

Probabilistic approach:

To get the probability statistics for the words, we create a dictionary of the words and count how often each one occurs in the positive tweets and in the negative tweets.
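As a rough sketch, the dictionary of counts could be built as follows. The four tweets are the ones listed earlier; the `tokenize` and `build_freqs` helpers are our own names, and we assume tokenization is just lowercasing plus punctuation removal (no stopword removal or stemming), since that is what the counts quoted in this article imply.

```python
import string
from collections import Counter

positive_tweets = ["I am happy because I am learning NLP", "I am happy, not sad."]
negative_tweets = ["I am sad, I am not learning NLP", "I am sad, not happy."]


def tokenize(tweet):
    """Lowercase, strip ASCII punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return tweet.lower().translate(table).split()


def build_freqs(pos_tweets, neg_tweets):
    """Map (word, class) pairs to the number of times the word occurs in that class."""
    freqs = Counter()
    for label, tweets in (("pos", pos_tweets), ("neg", neg_tweets)):
        for tweet in tweets:
            for word in tokenize(tweet):
                freqs[(word, label)] += 1
    return freqs


freqs = build_freqs(positive_tweets, negative_tweets)
print(freqs[("i", "pos")], freqs[("sad", "neg")])
```

Keying the counter on `(word, class)` keeps both classes in a single dictionary, and a missing pair simply counts as 0.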


Let’s see how these word counts help us find the probability of a word for both classes. The word ‘i’ occurs three times in the positive corpus, and the total number of words in the positive corpus (counting repeats) is 13. Therefore, the probability of occurrence of the word ‘i’ given that the tweet is positive is

$$P(\text{i} \mid \text{pos}) = \frac{\mathrm{freq}(\text{i}, \text{pos})}{N_{\text{pos}}} = \frac{3}{13} \approx 0.23$$

In general, $P(w \mid \text{class}) = \mathrm{freq}(w, \text{class}) / N_{\text{class}}$, where freq denotes the frequency of occurrence of a word, class ∈ {pos, neg}, and $N_{\text{class}}$ is the total number of words in that class.

Doing this for every word in our vocabulary, we get a table like this:
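The table can be recomputed from the four example tweets with a short script. This is a sketch under the same assumption as before (lowercase and strip punctuation only); `tokenize` and `word_probs` are illustrative names, and the printed values are simply whatever the counts give, not a copy of the original figure.

```python
import string
from collections import Counter

positive_tweets = ["I am happy because I am learning NLP", "I am happy, not sad."]
negative_tweets = ["I am sad, I am not learning NLP", "I am sad, not happy."]


def tokenize(tweet):
    table = str.maketrans("", "", string.punctuation)
    return tweet.lower().translate(table).split()


def word_probs(tweets):
    """Return P(word | class) for every word seen in the class, plus the class word total."""
    counts = Counter(word for tweet in tweets for word in tokenize(tweet))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}, total


p_pos, n_pos = word_probs(positive_tweets)
p_neg, n_neg = word_probs(negative_tweets)

vocab = sorted(set(p_pos) | set(p_neg))
print(f"{'word':10s} {'pos':>6s} {'neg':>6s}")
for word in vocab:
    # Words never seen in a class get probability 0 here (no smoothing yet).
    print(f"{word:10s} {p_pos.get(word, 0.0):6.3f} {p_neg.get(word, 0.0):6.3f}")
```

Each column sums to 1, and any word that never appears in a class shows up with probability 0 in that column.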


In Naive Bayes, we measure how much each word contributes to the sentiment by taking the ratio of its probability of occurrence in the positive class to its probability of occurrence in the negative class. For example, the probability of occurrence of the word ‘sad’ is higher for the negative class than for the positive class. We compute this ratio for every word with the formula:

$$\text{ratio}(w) = \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$$

This ratio is known as the likelihood, and its value lies in (0, ∞). A value tending to zero indicates that the word is far less likely to occur in a positive tweet than in a negative tweet, while a value tending to infinity indicates that it is far less likely to occur in a negative tweet than in a positive tweet. In other words, a high ratio implies positivity, and a ratio of 1 means that the word is neutral.

Laplace Smoothing

Some words may occur in only one class. A word that does not occur in the negative class has probability 0 there, which makes the ratio undefined. We use the Laplace smoothing technique to handle this situation. Let’s see how the equation changes when we apply Laplace smoothing:

$$P(w \mid \text{class}) = \frac{\mathrm{freq}(w, \text{class}) + 1}{N_{\text{class}} + V}$$

Here $N_{\text{class}}$ is the total number of words in the class and $V$ is the number of unique words in the vocabulary.

Adding 1 to the numerator makes every probability non-zero. This added constant is called the alpha factor and is usually chosen in (0, 1]; when it is set to 1, the smoothing is called Laplace smoothing. Because the number of unique words V is added to the denominator at the same time, the probabilities still sum to 1.

In our example there are eight unique words, which gives us V = 8.

After Laplace smoothing, the probability table looks like this:
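The smoothed table and the per-word ratios can be recomputed with a sketch like the one below. It assumes the four example tweets, lowercase-and-strip-punctuation tokenization, and a general additive constant `alpha` (with `alpha=1.0` giving Laplace smoothing); the function names are our own.

```python
import string
from collections import Counter

positive_tweets = ["I am happy because I am learning NLP", "I am happy, not sad."]
negative_tweets = ["I am sad, I am not learning NLP", "I am sad, not happy."]


def tokenize(tweet):
    table = str.maketrans("", "", string.punctuation)
    return tweet.lower().translate(table).split()


def smoothed_probs(counts, vocab, alpha=1.0):
    """P(w | class) = (freq + alpha) / (N_class + alpha * V)."""
    total = sum(counts.values())
    V = len(vocab)
    return {w: (counts[w] + alpha) / (total + alpha * V) for w in vocab}


pos_counts = Counter(w for t in positive_tweets for w in tokenize(t))
neg_counts = Counter(w for t in negative_tweets for w in tokenize(t))
vocab = sorted(set(pos_counts) | set(neg_counts))

p_pos = smoothed_probs(pos_counts, vocab)
p_neg = smoothed_probs(neg_counts, vocab)
ratio = {w: p_pos[w] / p_neg[w] for w in vocab}

for w in vocab:
    print(f"{w:10s} {p_pos[w]:.3f} {p_neg[w]:.3f} ratio={ratio[w]:.2f}")
```

With smoothing, ‘because’ (which never occurs in the negative tweets) gets a small non-zero negative-class probability, so its ratio is finite.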


Naive Bayes:

To estimate the sentiment of a tweet, we take the product of the probability ratios of all the words that occur in the tweet. Note that words which are not present in our vocabulary do not contribute and are treated as neutral. The equation for Naive Bayes in our application looks like this:

$$\prod_{i=1}^{m} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$$

m = number of words in a tweet, w = set of words in a tweet

Since the data can be imbalanced, which would bias the results toward a particular class, we multiply the above equation by a prior factor: the ratio of the probability of positive tweets to the probability of negative tweets.

$$\frac{P(\text{pos})}{P(\text{neg})} \prod_{i=1}^{m} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$$

Complete equation of Naive Bayes

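Since we are taking the product of all these ratios, we can end up with a number too large or too small to be stored on our device. This is where the concept of log-likelihood comes in: we take the log of our Naive Bayes equation, which turns the product into a sum.

$$\log\left(\frac{P(\text{pos})}{P(\text{neg})} \prod_{i=1}^{m} \frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}\right) = \log\frac{P(\text{pos})}{P(\text{neg})} + \sum_{i=1}^{m} \log\frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$$

After taking the log of the likelihood equation, the scale changes as follows: a value below 0 indicates a negative tweet, 0 is neutral, and a value above 0 indicates a positive tweet.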
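The storage problem that motivates the log can be seen numerically. In the sketch below the ratios are made-up values chosen only to trigger the effect (100 words, each with ratio 1e-5); they do not come from the example tweets.

```python
import math

# Hypothetical per-word ratios, chosen only to demonstrate underflow.
ratios = [1e-5] * 100

# Multiplying the ratios directly: the true value, 1e-500, is smaller than
# the smallest positive double, so the running product collapses to 0.0.
product = 1.0
for r in ratios:
    product *= r

# Summing the logs instead keeps the same information in a comfortable range.
log_sum = sum(math.log(r) for r in ratios)

print(product)
print(log_sum)
```

The product carries no usable information once it has underflowed, whereas the sum of logs still tells us clearly that the score is strongly negative.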


Let’s see an example. Tweet: ‘I am happy because I am learning.’

Summing the log ratios of the words in the tweet, together with the log prior, gives the overall log-likelihood for our tweet.

The value of the overall log-likelihood of the tweet is greater than zero, which implies that the tweet is positive.
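Putting the pieces together, a minimal end-to-end sketch might look like this. It assumes the four example tweets as training data, lowercase-and-strip-punctuation tokenization, Laplace smoothing, and natural logarithms; `train` and `score` are illustrative names rather than any library's API.

```python
import math
import string
from collections import Counter

positive_tweets = ["I am happy because I am learning NLP", "I am happy, not sad."]
negative_tweets = ["I am sad, I am not learning NLP", "I am sad, not happy."]


def tokenize(tweet):
    table = str.maketrans("", "", string.punctuation)
    return tweet.lower().translate(table).split()


def train(pos_tweets, neg_tweets):
    """Return the log prior and the smoothed log ratio of every vocabulary word."""
    pos_counts = Counter(w for t in pos_tweets for w in tokenize(t))
    neg_counts = Counter(w for t in neg_tweets for w in tokenize(t))
    vocab = set(pos_counts) | set(neg_counts)
    V = len(vocab)
    n_pos, n_neg = sum(pos_counts.values()), sum(neg_counts.values())

    log_prior = math.log(len(pos_tweets) / len(neg_tweets))
    log_ratio = {
        w: math.log((pos_counts[w] + 1) / (n_pos + V))
        - math.log((neg_counts[w] + 1) / (n_neg + V))
        for w in vocab
    }
    return log_prior, log_ratio


def score(tweet, log_prior, log_ratio):
    """Overall log-likelihood; words outside the vocabulary contribute 0 (neutral)."""
    return log_prior + sum(log_ratio.get(w, 0.0) for w in tokenize(tweet))


log_prior, log_ratio = train(positive_tweets, negative_tweets)
s = score("I am happy because I am learning", log_prior, log_ratio)
print(s, "positive" if s > 0 else "negative" if s < 0 else "neutral")
```

Under these assumptions only ‘happy’ and ‘because’ have non-zero log ratios in the example tweet, so the score works out to log(1.5) + log(2) = log(3) ≈ 1.10, which is greater than zero. This figure is our own calculation from the four tweets and may differ from the one shown in the original illustration.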


Drawbacks:

  1. The Naive Bayes algorithm assumes that the words in a tweet are independent of each other.
  2. Relative frequencies in the corpus: Sometimes people block particular types of tweets, such as offensive ones, which leads to an imbalance in the data.
  3. Word order: Changing the order of words can change the sentiment, but Naive Bayes cannot capture that.
  4. Removal of punctuation: Recall that during preprocessing we removed punctuation, which can change the sentiment of a tweet. Here is an example: ‘My beloved grandmother :( ’

Conclusion:

Naive Bayes is a straightforward and powerful algorithm, and once you know your data you can preprocess it accordingly. It is also used in many other applications, such as spam classification and loan approval.

