NLP

A Deep Dive into Text Classification with TF-IDF

January 5, 2024

Introduction:

Text classification is a cornerstone of Natural Language Processing (NLP) and a practical way to get value out of textual data. In this blog post, we walk through the building blocks of text classification using Python, Pandas, NLTK, and scikit-learn. Our practical example revolves around travel and food-related sentences and shows how TF-IDF (Term Frequency-Inverse Document Frequency) turns them into numerical features.

Setting up the Data:

Our dataset encapsulates the essence of travel and food experiences, with each sentence tagged with a category (‘t’ for travel and ‘f’ for food).

import pandas as pd

content = ["i will be travelling to mumbai in train", 
           "i will be eating in train", 
           "i love travel a lot", 
           "i love to eat south indian food"]

classes = ['t','f','t','f']

dic = {'category': classes, 'description': content}

df = pd.DataFrame(dic)

The table representation of the data is as follows:

category	description
t	i will be travelling to mumbai in train
f	i will be eating in train
t	i love travel a lot
f	i love to eat south indian food

Fig: Sample Dataset

Text Preprocessing:

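A crucial step before classification is text preprocessing, which includes stemming: reducing words to their root form so that variants such as "travelling" and "travel" are treated as the same term. Here, the PorterStemmer from NLTK performs this transformation, and the set of stemmed words forms the vocabulary.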

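from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Pool every sentence into one string, then stem each whitespace-separated word
all_words = " ".join(content)
stem_words = [ps.stem(w) for w in all_words.split()]

# The unique stems form the vocabulary of the dataset
vocabulary = set(stem_words)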
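The vocabulary above pools all sentences together. If the stemmed text is meant to be fed to a vectorizer later, it usually helps to stem each sentence separately so that document boundaries are kept. The following is a minimal sketch of that idea; the names `stem_sentence` and `stemmed_content` are illustrative and not part of the original post.

```python
from nltk.stem import PorterStemmer

content = ["i will be travelling to mumbai in train",
           "i will be eating in train",
           "i love travel a lot",
           "i love to eat south indian food"]

ps = PorterStemmer()


def stem_sentence(sentence):
    """Stem every whitespace-separated word and rebuild the sentence."""
    return " ".join(ps.stem(word) for word in sentence.split())


# One stemmed string per original sentence, in the same order
stemmed_content = [stem_sentence(sentence) for sentence in content]

for original, stemmed in zip(content, stemmed_content):
    print(f"{original!r} -> {stemmed!r}")
```

The resulting list has the same length and order as `content`, so it can replace the `description` column of the DataFrame if stemmed text is preferred.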

Feature Extraction with TF-IDF:

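Moving forward, the TfidfVectorizer from scikit-learn transforms raw text into numerical features. Each word's weight rises with how often it appears in a document and falls with how many documents in the corpus contain it, so words that appear everywhere count for less. The snippet below uses a small generic corpus to show the mechanics; the same steps apply to the travel and food sentences.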

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

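vectorizer = TfidfVectorizer()
# Fit on the corpus and convert the sparse result to a dense NumPy array
X = vectorizer.fit_transform(corpus).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out()
sentences = list(corpus)

# One row per sentence, one column per vocabulary term
df_transformed = pd.DataFrame(X, index=sentences, columns=words)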
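To make the weights less of a black box, here is a rough pure-Python sketch of the calculation that TfidfVectorizer performs with its default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The function name `tfidf` is illustrative, and the tokenization is simplified: the corpus is lowercased and stripped of punctuation by hand instead of reproducing scikit-learn's tokenizer.

```python
import math
from collections import Counter

# Same corpus as above, lowercased and with punctuation removed by hand
corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]


def tfidf(docs):
    """Return (idf per term, one {term: weight} dict per document)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize so every document vector has length 1
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return idf, rows


idf, rows = tfidf(corpus)
print(round(idf["this"], 4), round(idf["second"], 4))
print(round(rows[1]["document"], 4))
```

A word such as "this", which occurs in every sentence, gets the lowest possible IDF, while "second", which occurs in only one sentence, gets a higher one.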

Unveiling Insights:

Our journey through text classification reveals the significance of text preprocessing and TF-IDF in deciphering meaningful patterns within textual data. The amalgamation of NLP techniques and machine learning tools empowers data enthusiasts to navigate and derive insights from diverse text datasets.
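The walkthrough above stops at feature extraction. As a sketch of the classification step itself, the TF-IDF features of the travel and food sentences can be passed to a classifier. The choice of Multinomial Naive Bayes, the use of `make_pipeline`, and the two test sentences are assumptions made for illustration; the original post does not name a classifier, and four training sentences are far too few for a reliable model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

content = ["i will be travelling to mumbai in train",
           "i will be eating in train",
           "i love travel a lot",
           "i love to eat south indian food"]
classes = ['t', 'f', 't', 'f']

# Chain vectorization and classification so raw text goes in, labels come out
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(content, classes)

# Hypothetical unseen sentences, used only to show the predict call
new_sentences = ["i will travel to delhi", "i love to eat dosa"]
predictions = model.predict(new_sentences)

for sentence, label in zip(new_sentences, predictions):
    print(f"{label}: {sentence}")
```

Wrapping both steps in a pipeline means the vectorizer is fitted only on the training sentences and then reused unchanged when predicting.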

Conclusion:

In conclusion, this exploration showcases the potential of NLP and TF-IDF in text analysis. Armed with the knowledge of text preprocessing, feature extraction, and classification techniques, analysts and data scientists can extract valuable insights from the ever-expanding volume of textual information, enhancing decision-making processes across various domains.

About the Author:

I am Kishore Kumar K, a dedicated data scientist with a passion for unraveling insights hidden within complex datasets. With an MBA in Business Analytics and a BCA in Computer Applications, I have honed my skills in statistical analysis, machine learning, and data visualization.
