A Deep Dive into Text Classification with TF-IDF


Text classification is a core task in Natural Language Processing (NLP). In this blog post, we walk through the building blocks of a text classification workflow using Python, pandas, NLTK, and scikit-learn. Our practical example is a small set of travel- and food-related sentences, which we use to show how TF-IDF (Term Frequency-Inverse Document Frequency) turns text into numerical features.

Setting up the Data:

Our dataset consists of four short sentences about travel and food, each tagged with a category (‘t’ for travel and ‘f’ for food).

import pandas as pd

content = ["i will be travelling to mumbai in train", 
           "i will be eating in train", 
           "i love travel alot", 
           "i love to eat south indian food"]

classes = ['t','f','t','f']

dic = {'category': classes, 'description': content}

df = pd.DataFrame(dic)

The table representation of the data is as follows:

category  description
t         i will be travelling to mumbai in train
f         i will be eating in train
t         i love travel a lot
f         i love to eat south indian food

Fig: Sample dataset

Text Preprocessing:

A crucial step before classification involves text preprocessing, including stemming to reduce words to their root form. Here, the PorterStemmer from NLTK aids in this transformation.

from nltk.stem import PorterStemmer

ps = PorterStemmer()

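# Pool every sentence, then stem each whitespace-separated token
all_words = " ".join(content)
stem_words = [ps.stem(w) for w in all_words.split()]
vocabulary = set(stem_words)  # unique stems across the whole dataset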
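
The snippet above pools all sentences into a single vocabulary. For classification, it is usually more convenient to keep the stems grouped by sentence so that each row of the dataset can be vectorized on its own. The following is a minimal sketch of that idea; the helper name `stem_sentence` is hypothetical and not part of NLTK, and only two of the sample sentences are used to keep the output short.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_sentence(sentence):
    """Stem each whitespace-separated token and rejoin into one string.

    Hypothetical helper for illustration; not an NLTK function.
    """
    return " ".join(ps.stem(token) for token in sentence.split())

content = ["i will be travelling to mumbai in train",
           "i will be eating in train"]

# One stemmed string per original sentence, in the same order
stemmed_content = [stem_sentence(s) for s in content]

for original, stemmed in zip(content, stemmed_content):
    print(f"{original!r} -> {stemmed!r}")
```

With the Porter stemmer, "travelling" is reduced to "travel" and "eating" to "eat", so different forms of the same word end up as the same token.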

Feature Extraction with TF-IDF:

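Next, the TfidfVectorizer from scikit-learn transforms raw text into numerical features. A word's weight grows with how often it appears in a document and shrinks with how many documents in the corpus contain it, so words that appear everywhere contribute less than words that are distinctive. The snippet uses a small generic corpus of four sentences to keep the output easy to read.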

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
# fit_transform returns a sparse matrix; convert it to a dense array for display
X = vectorizer.fit_transform(corpus).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out()

# One row per sentence, one column per vocabulary term
df_transformed = pd.DataFrame(X, index=corpus, columns=words)
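
To make the weighting less of a black box, the sketch below recomputes the same matrix by hand and compares it with the vectorizer's output. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row). The variable names `tfidf_manual` and `tfidf_sklearn` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts per document (the "TF" part)
counts = CountVectorizer().fit_transform(corpus).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed inverse document frequency (the "IDF" part), assuming default settings
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit Euclidean length
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

If the two matrices agree, the vectorizer is doing nothing more than counting, reweighting by IDF, and normalizing. Changing parameters such as `smooth_idf`, `sublinear_tf`, or `norm` would change the formula, and the manual version would need to follow.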

Unveiling Insights:

The two steps above do most of the groundwork for text classification. Stemming collapses word variants such as "travelling" and "travel" into a single token, and TF-IDF turns each sentence into a weighted numerical vector. Once text is in this form, standard machine learning tools such as the classifiers in scikit-learn can be trained on it like on any other tabular data.
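
The post stops at feature extraction, so the following is only a hedged sketch of how the final classification step could look on the travel and food sentences. The choice of `MultinomialNB` and the two test sentences are assumptions made for illustration; the original example does not specify a classifier, and stemming is left out to keep the sketch short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

content = ["i will be travelling to mumbai in train",
           "i will be eating in train",
           "i love travel a lot",
           "i love to eat south indian food"]
classes = ['t', 'f', 't', 'f']

# Chain TF-IDF feature extraction with a Naive Bayes classifier (assumed choice)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(content, classes)

# Hypothetical unseen sentences, invented for this sketch
new_sentences = ["i love to eat food", "travelling to mumbai"]
predictions = model.predict(new_sentences)

for sentence, label in zip(new_sentences, predictions):
    print(f"{sentence!r} -> {label}")
```

Four training sentences are far too few to evaluate a model, so the predictions here only show the mechanics of fitting and predicting. A realistic workflow would use many more labelled examples and a held-out test set.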


In conclusion, text preprocessing and TF-IDF feature extraction form the foundation of a text classification workflow. Armed with these techniques, analysts and data scientists can extract useful signals from ever-growing volumes of text and support decision-making across various domains.

About the Author:

I am Kishore Kumar K, a dedicated data scientist with a passion for unraveling insights hidden within complex datasets. With an MBA in Business Analytics and a BCA in Computer Applications, I have honed my skills in statistical analysis, machine learning, and data visualization.

