Unveiling the Power of Word Embeddings with Gensim

In the realm of Natural Language Processing (NLP), word embeddings have emerged as a game-changer. Unlike traditional approaches that use words as features, word embeddings leverage dense, low-dimensional vectors to capture the meaning and usage of a word. One pioneering model in this domain is Word2Vec, developed by Tomas Mikolov and team at Google. In this blog post, we’ll delve into the world of word embeddings using the original Word2Vec approach, implemented with the Gensim library.

Training Word Embeddings

Training word embeddings with Gensim is a breeze. All you need is a corpus of sentences in the language of interest. For our exploration, we’ll use 5,000,000 sentences from Dutch Wikipedia. Let’s jump into the code:

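import os
import gensim

class SentenceCorpus(object):
    """Restartable iterable that yields one tokenized sentence per line of a file."""
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # The file is expected to be pre-tokenized: one sentence per line, tokens separated by spaces.
        with open(self.filename, "r", encoding="utf-8") as f:
            for line in f:
                tokens = line.strip().split()
                yield tokens

WIKI_FILE = os.path.join("../data", "nlwiki_20170620_tok_small.txt")
sentences = SentenceCorpus(WIKI_FILE)

# Gensim 4.0 renamed the `size` parameter to `vector_size`; use `size=100` on Gensim 3.x.
model = gensim.models.Word2Vec(sentences, min_count=100, window=5, vector_size=100)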
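
Gensim passes over the corpus more than once (once to build the vocabulary, then once per training epoch), which is why the corpus is wrapped in a class with an __iter__ method rather than a plain generator. The sketch below illustrates the difference on a tiny temporary file; the file contents and the sentence_generator helper are made up for illustration.

```python
import os
import tempfile

class SentenceCorpus(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # Every call to iter() reopens the file, so each pass starts from the top.
        with open(self.filename, "r", encoding="utf-8") as f:
            for line in f:
                yield line.strip().split()

def sentence_generator(filename):
    # Hypothetical one-shot alternative: a plain generator function.
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            yield line.strip().split()

# Write a tiny illustrative corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("de koning woont in het paleis\nde koningin drinkt koffie\n")
    toy_path = tmp.name

corpus = SentenceCorpus(toy_path)
first_pass = list(corpus)
second_pass = list(corpus)   # a fresh pass over the same sentences

gen = sentence_generator(toy_path)
gen_first = list(gen)
gen_second = list(gen)       # empty: the generator is already exhausted

os.remove(toy_path)

print(len(first_pass), len(second_pass))
print(len(gen_first), len(gen_second))
```

The class-based corpus yields the same sentences on every pass, whereas the generator has nothing left to give after the first one.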

Using Word Embeddings

Now that we have our embeddings trained, let’s explore their capabilities. We can access the embeddings using the wv attribute of the model. For instance:

# Retrieving the embedding for the word "koning" (king)
king_embedding = model.wv["koning"]

We can also measure the similarity between two words:

similarity_king_queen = model.wv.similarity("koning", "koningin")  # Expected: high
similarity_king_coffee = model.wv.similarity("koning", "koffie")  # Expected: low
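
The similarity score is the cosine of the angle between the two word vectors. Here is a minimal sketch in plain Python, with made-up 3-dimensional vectors standing in for real embeddings (the values and the cosine_similarity helper are illustrative and not taken from the trained model):

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (illustrative values only).
toy_vectors = {
    "koning": [0.9, 0.8, 0.1],
    "koningin": [0.8, 0.9, 0.2],
    "koffie": [0.1, 0.0, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["koning"], toy_vectors["koningin"])
sim_king_coffee = cosine_similarity(toy_vectors["koning"], toy_vectors["koffie"])
print(round(sim_king_queen, 3), round(sim_king_coffee, 3))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0.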

Furthermore, finding words most similar to a target word is straightforward:

similar_words_to_king = model.wv.similar_by_word("koning", topn=10)

The model even allows us to explore analogies:

analogy_result = model.wv.most_similar(positive=['vrouw', 'koning'], negative=["man"], topn=10)
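
Conceptually, an analogy query adds the vectors of the positive words, subtracts those of the negative words, and returns the words closest to the result by cosine similarity, leaving out the query words themselves. The sketch below mimics that on hand-made 2-dimensional vectors (one axis loosely for royalty, one for gender). The vectors and the analogy helper are illustrative assumptions, and Gensim's own implementation differs in details such as how the vectors are normalized before they are combined.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hand-made toy vectors: [royalty, gender] (illustrative values only).
toy_vectors = {
    "koning":   [1.0,  1.0],
    "koningin": [1.0, -1.0],
    "man":      [0.0,  1.0],
    "vrouw":    [0.0, -1.0],
    "koffie":   [-1.0, 0.1],
}

def analogy(positive, negative, vectors, topn=3):
    # Build the query vector: sum of positives minus sum of negatives.
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    # Rank all remaining words by cosine similarity to the query.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine_similarity(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

result = analogy(positive=["vrouw", "koning"], negative=["man"], vectors=toy_vectors)
print(result[0][0])  # koningin
```

With real embeddings the top result is not guaranteed to be the expected word, which is why the query above asks for the ten best candidates.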

Visualizing Embeddings

Embeddings live in a high-dimensional space, which makes them hard to visualize directly. We use t-distributed Stochastic Neighbor Embedding (t-SNE) to project the embeddings of the 200 words most similar to "belgië" (Belgium) into a 2D space:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

target_word = "belgië"
selected_words = [w[0] for w in model.wv.most_similar(positive=[target_word], topn=200)]
embeddings = [model.wv[w] for w in selected_words]

mapped_embeddings = TSNE(n_components=2, metric='cosine', init='pca').fit_transform(embeddings)

# Plotting the 2D embeddings
plt.scatter(mapped_embeddings[:, 0], mapped_embeddings[:, 1])

# Annotating words on the plot
for i, txt in enumerate(selected_words):
    plt.annotate(txt, (mapped_embeddings[i, 0], mapped_embeddings[i, 1]))

plt.show()

Exploring Hyperparameters

Choosing the right hyperparameters is crucial. We evaluate the impact of embedding size and context window:

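import pandas as pd

sizes = [100, 200, 300]
windows = [2, 5, 10]

# Results table (assumed to be a pandas DataFrame): one column per embedding size, one row per window.
df = pd.DataFrame(index=windows, columns=sizes, dtype=float)

# `evaluate` and `word2pos` are not shown in this post; they are assumed to be defined elsewhere.
for size in sizes:
    for window in windows:
        # Gensim 4.0 renamed `size` to `vector_size`; use `size=size` on Gensim 3.x.
        model = gensim.models.Word2Vec(sentences, min_count=100, window=window, vector_size=size)
        acc = evaluate(model, word2pos)
        df.loc[window, size] = acc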

The results suggest that smaller contexts tend to work better, and 200-dimensional embeddings strike a balance.

Clustering Embeddings

Clustered embeddings can be valuable for tasks like Named Entity Recognition. We use agglomerative clustering and save the clusters to a file:

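from collections import defaultdict

from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

# Gensim 4.x exposes the vocabulary as `index_to_key`; on Gensim 3.x use list(model.wv.vocab).
vocab = list(model.wv.index_to_key)
vectors = [model.wv[w] for w in vocab]
vectors_norm = normalize(vectors)

clusterer = AgglomerativeClustering(n_clusters=500)
clusters = clusterer.fit_predict(vectors_norm)

# Group the words by their assigned cluster ID
cluster_dictionary = defaultdict(list)
for word, cluster_id in zip(vocab, clusters):
    cluster_dictionary[cluster_id].append(word)

# Save clusters to a file
with open("data/clusters_nl.tsv", "w", encoding="utf-8") as o:
    for c in cluster_dictionary:
        for w in cluster_dictionary[c]:
            o.write(f"{w}\t{c}\n")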
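
To use the clusters as features later, for example in an NER tagger, the TSV file can be read back into a word-to-cluster lookup. Here is a minimal sketch, assuming the word, tab, cluster ID format written by the loop above; the sample rows, the load_clusters helper, and the -1 fallback for unknown words are made up for illustration:

```python
import os
import tempfile

def load_clusters(path):
    # Read "word<TAB>cluster_id" lines into a dictionary.
    word2cluster = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            word, cluster_id = line.split("\t")
            word2cluster[word] = int(cluster_id)
    return word2cluster

# Stand-in for data/clusters_nl.tsv with a few invented rows.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, encoding="utf-8") as tmp:
    tmp.write("belgië\t12\nnederland\t12\nkoning\t87\n")
    sample_path = tmp.name

word2cluster = load_clusters(sample_path)
os.remove(sample_path)

# One cluster ID per token; words without a cluster get -1.
tokens = ["de", "koning", "van", "belgië"]
cluster_features = [word2cluster.get(tok, -1) for tok in tokens]
print(cluster_features)
```

Words that share a cluster ID can then share statistics in a downstream model, even when one of them is rare in the labeled training data.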

Conclusion

Word embeddings open up exciting possibilities in NLP, allowing us to model word meanings and discover semantic relationships. Gensim’s Word2Vec implementation empowers us to navigate this landscape effortlessly. From training embeddings to visualizing and fine-tuning, word embeddings offer a rich playground for language exploration.

In future experiments, we’ll leverage these embeddings for Named Entity Recognition and other advanced NLP tasks. Stay tuned for more insights into the fascinating world of word embeddings!
