Vectorizing Text — TF-IDF, Word2Vec & GloVe

Convert words and documents into numerical vectors that capture semantic meaning.

Intermediate · 20 min read

Text Vectorization Methods

Machine learning models work with numbers, not words. Vectorization converts text into numerical representations. The quality of this representation determines how much semantic information the model can capture.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "The cat sat on the mat",
    "The dog chased the cat up the tree",
    "A bird sat on the tree branch",
]

vectorizer = TfidfVectorizer(stop_words='english', max_features=10)
X = vectorizer.fit_transform(corpus)

df = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"doc{i+1}" for i in range(len(corpus))]
)
print(df.round(3))
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

sentences = [
    "the king wore a crown and ruled the kingdom",
    "the queen sat beside the king on the throne",
    "a man walked through the forest alone",
    "a woman sang a beautiful song in the palace",
]

tokenized = [simple_preprocess(s) for s in sentences]
model = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, sg=1, epochs=200)

print(model.wv.similarity('king', 'queen'))   # high
print(model.wv.similarity('king', 'forest'))  # low

# Analogy: king - man + woman ≈ queen
result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(f"king - man + woman = {result[0][0]}")
Word2Vec GloVe
Predicts context words (local) Factorizes co-occurrence matrix (global)
Skip-gram or CBOW training Captures global corpus statistics
Fast to train on large corpora Slightly better on analogy tasks

Part of the NLP & Language Models series on Tekivex. Browse all tutorials or explore our open-source products.