Text Classification - News Category using Naïve Bayes

maheshkamineni35
Apr 18, 2022
3 min read

Updated: Apr 22, 2022

About the Algorithm: Naïve Bayes is a very simple algorithm based on conditional probability and counting. Essentially, the model is a probability table that is built up from the training data. To predict a new observation, you simply “look up” the class probabilities in the “probability table” based on its feature values.

It’s called “naïve” because its core assumption of conditional independence (i.e. all input features are independent of one another once the class is known) rarely holds in the real world.
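The “probability table” idea can be sketched in a few lines of Python. This is only an illustration on three made-up headlines, not the code used in this post; the function name build_probability_table and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def build_probability_table(examples):
    """Build class priors and per-class word probabilities by counting.

    `examples` is a list of (tokens, label) pairs. Hypothetical helper,
    written for illustration only.
    """
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for tokens, label in examples:
        word_counts[label].update(tokens)

    total_docs = sum(class_counts.values())
    # P(class): share of training examples carrying each label
    priors = {c: n / total_docs for c, n in class_counts.items()}
    # P(word | class): share of the class's word occurrences that are this word
    likelihoods = {}
    for c, counts in word_counts.items():
        n_words = sum(counts.values())
        likelihoods[c] = {w: k / n_words for w, k in counts.items()}
    return priors, likelihoods


# Made-up toy data, not taken from the HuffPost dataset
toy = [
    (["team", "wins", "final"], "SPORTS"),
    (["team", "loses", "match"], "SPORTS"),
    (["senate", "passes", "bill"], "POLITICS"),
]
priors, likelihoods = build_probability_table(toy)
print(priors["SPORTS"])               # prior: 2 of the 3 toy headlines
print(likelihoods["SPORTS"]["team"])  # "team" is 2 of 6 SPORTS words
```

Training is nothing more than counting, which is why the model is cheap to build and to update when new data arrives.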

Strengths: Even though the conditional independence assumption rarely holds, NB models perform surprisingly well in practice, especially for their simplicity. They are easy to implement and can scale with your dataset.

Here are some areas where this algorithm finds applications:

Text Classification
- Naïve Bayes is most often used in text classification, where its independence assumption keeps the model simple and it handles multi-class problems well. Thanks to its speed and efficiency, it is often competitive with more complex algorithms.

Sentiment Analysis
- One of the most prominent areas of machine learning is sentiment analysis, and this algorithm is quite useful there as well. Sentiment analysis focuses on identifying whether the customers think positively or negatively about a certain topic (product or service).

Recommender Systems
- Combined with collaborative filtering, a Naïve Bayes classifier can be used to build a recommender system that predicts whether a user would like a particular product (or resource). Amazon, Netflix, and Flipkart are prominent companies that use recommender systems to suggest products to their customers.

About the Dataset:

This dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost. The model trained on this dataset could be used to identify tags for untracked news articles or to identify the type of language used in different news articles.

The file contains 202,372 records. Each JSON record contains the following attributes:

category: Category the article belongs to

headline: Headline of the article

authors: Person(s) who authored the article

link: Link to the post

short_description: Short description of the article

date: Date the article was published

Link to the Dataset: https://www.kaggle.com/rmisra/news-category-dataset
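As a rough sketch of how such records could be read, the snippet below assumes the file stores one JSON object per line and keeps only the headline and category fields. The helper name parse_records and the two sample records are made up for illustration; with the real file you would pass an open file handle instead of the sample list.

```python
import json


def parse_records(lines):
    """Parse JSON-lines text into (headline, category) pairs.

    Assumes one JSON object per line; blank lines are skipped.
    Hypothetical helper, not the loader used in this post.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        records.append((rec["headline"], rec["category"]))
    return records


# Made-up records with the attribute names listed above
sample_lines = [
    '{"category": "SPORTS", "headline": "Example headline one", '
    '"authors": "A. Writer", "link": "https://example.com/1", '
    '"short_description": "Made-up text.", "date": "2018-01-01"}',
    "",
    '{"category": "POLITICS", "headline": "Example headline two", '
    '"authors": "B. Writer", "link": "https://example.com/2", '
    '"short_description": "More made-up text.", "date": "2018-01-02"}',
]
records = parse_records(sample_lines)
print(records[0])
```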

Experimentation:

First, we need to import the necessary packages.

After importing the packages and loading the dataset into Colab, we need to remove punctuation marks and extra symbols from the text.
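One possible way to do this cleaning step is shown below. It is a sketch, not the exact code behind this post: the function name clean_text is made up, and lowercasing the text is an extra assumption that the post does not state.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Lowercase, strip punctuation and symbols, collapse whitespace.

    Hypothetical helper; lowercasing is an assumption, not a step
    described in the post.
    """
    text = text.lower().translate(_PUNCT_TABLE)
    # Replace any remaining non-word symbols (e.g. curly quotes) with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Hello,   World!"))
```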

Next, stop words and rare words are removed. For example, any word that occurs fewer than five times in the dataset is omitted.

Stop words are the words in a stop list (or stoplist or negative dictionary) which are filtered out (i.e. stopped) before or after processing of natural language data (text).[1] There is no single universal list of stop words used by all natural language processing tools, nor any agreed-upon rules for identifying stop words, and indeed not all tools even use such a list. Therefore, any group of words can be chosen as the stop words for a given purpose.
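The filtering described above could look roughly like this. The stop list here is a tiny illustrative set chosen for the example, not the list used in this post, and filter_tokens is a hypothetical helper; the threshold of five follows the rule that words occurring fewer than five times are dropped.

```python
from collections import Counter

# Tiny illustrative stop list (an assumption, not the list used in the post)
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "for", "on"}


def filter_tokens(docs, stop_words=STOP_WORDS, min_count=5):
    """Drop stop words and words seen fewer than `min_count` times.

    `docs` is a list of token lists; counts are taken over the whole corpus.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [t for t in doc if t not in stop_words and counts[t] >= min_count]
        for doc in docs
    ]


# Made-up corpus: "wins" occurs 5 times (kept), "rare" only once (dropped)
docs = [["the", "team", "wins"]] * 5 + [["the", "rare", "team"]]
filtered = filter_tokens(docs)
print(filtered[0], filtered[-1])
```

Counting over the whole corpus before filtering matters: a word can be rare in one headline and still frequent overall.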

FINDING PROBABILITY

The probability is found with Bayes' theorem: P(y | X) = [P(X | y) P(y)] / P(X)

Here, y stands for the class variable (in this project, the news category an article belongs to). X stands for the features.

X = (x1, x2, x3, ..., xn)
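Under the naïve independence assumption, P(X | y) factorizes into P(x1 | y) P(x2 | y) ... P(xn | y), and P(X) is the same for every class, so the predicted class is the one with the largest P(y) times that product. The sketch below works this out on made-up headlines. It sums logarithms instead of multiplying small numbers, and it adds Laplace (add-one) smoothing so that a word never seen in a class does not force the probability to zero; both choices, and the names train and predict, are assumptions made for this example rather than details taken from the post.

```python
import math
from collections import Counter


def train(examples):
    """Count documents per class and words per class (hypothetical helper)."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in examples:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(tokens, class_docs, word_counts, vocab, alpha=1.0):
    """Return (best_label, scores), where scores hold log P(y) + sum log P(xi | y)."""
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior, log P(y)
        # Smoothed denominator: class word total plus alpha per vocabulary word
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for tok in tokens:
            if tok not in vocab:
                continue  # words never seen in training are ignored
            score += math.log((word_counts[label][tok] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get), scores


# Made-up toy data, not taken from the HuffPost dataset
toy = [
    (["team", "wins", "final"], "SPORTS"),
    (["team", "loses", "match"], "SPORTS"),
    (["senate", "passes", "bill"], "POLITICS"),
]
class_docs, word_counts, vocab = train(toy)
label, scores = predict(["team", "wins"], class_docs, word_counts, vocab)
print(label)  # SPORTS
```

The scores are unnormalized log probabilities; dividing by P(X) is skipped because it does not change which class comes out on top.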

If we assume each set as classes

hence the probability of each class can be