Text Preprocessing

Tokenization, stemming, lemmatization, stopword removal, and cleaning pipelines that convert raw text into model-ready input.

Beginner · 16 min read

Why Preprocess Text?

Raw text is noisy: it contains HTML tags, punctuation, inconsistent capitalization, and words that carry little meaning for the task (stopwords). Preprocessing turns that messy input into clean, consistently structured tokens that a model can consume.
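To make the problem concrete, here is a small standard-library sketch that counts how many distinct surface forms one word takes before and after a minimal cleanup. The review string and the cleanup rules are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical noisy input: the same word in four different surface forms.
reviews = "Great phone. great battery! GREAT camera, <b>great</b> price"

# Naive approach: split on whitespace and count the raw strings.
raw_counts = Counter(reviews.split())
variants = [tok for tok in raw_counts if "great" in tok.lower()]

# Minimal cleanup: drop tags, lowercase, keep only runs of letters.
cleaned = re.sub(r"<[^>]+>", " ", reviews).lower()
clean_counts = Counter(re.findall(r"[a-z]+", cleaned))

print("Raw variants of 'great':", variants)
print("Count after cleanup:", clean_counts["great"])
```

Without cleanup, a model sees four unrelated vocabulary entries. After cleanup it sees one word that occurs four times.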

Tokenization

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
import spacy

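# Punkt tokenizer data: newer NLTK releases look for 'punkt_tab', older ones for 'punkt'
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)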
text = "Dr. Smith went to the U.S.A. He bought a car for $50,000."

words = word_tokenize(text)
print("Words:", words)

sentences = sent_tokenize(text)
print("Sentences:", sentences)

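# spaCy: rule-based tokenizer that also splits contractions ("can't" -> "ca", "n't")
# Requires the model: python -m spacy download en_core_web_sm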
nlp = spacy.load("en_core_web_sm")
doc = nlp("I can't believe it's already 2025!")
print("spaCy tokens:", [token.text for token in doc])
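For comparison, the sketch below tokenizes the same sentence using only the standard library. The regex pattern is an illustrative assumption, not what NLTK or spaCy do internally. It shows the kind of mistakes (split abbreviations, broken-up prices) that library tokenizers are built to reduce.

```python
import re

text = "Dr. Smith went to the U.S.A. He bought a car for $50,000."

# Whitespace splitting leaves punctuation glued to the words.
naive_tokens = text.split()

# Illustrative pattern: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print("Whitespace:", naive_tokens)
print("Regex:     ", regex_tokens)
```

The whitespace version keeps "$50,000." as one token with a trailing period, while the regex version breaks "U.S.A." and the price into fragments. Neither is what a downstream model usually wants.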

Stemming vs Lemmatization

Stemming	Lemmatization
Crude rule-based truncation	Uses vocabulary and morphological analysis
"running" → "run", "studies" → "studi"	"running" → "run", "better" → "good" (given the part of speech)
Fast, but may produce non-words	Slower, but returns dictionary forms (lemmas)
Use: search engines, IR systems	Use: sentiment analysis, chatbots
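The contrast in the table can be sketched with two toy functions. Both are deliberately simplified assumptions for illustration: `toy_stem` is not the Porter algorithm, and `toy_lemma` uses a four-entry hand-written dictionary where a real lemmatizer would use a full vocabulary and part-of-speech information.

```python
# Toy suffix list, checked in order (hypothetical, far smaller than a real stemmer's rules).
SUFFIXES = ("ing", "es", "ed", "ly", "s")

# Toy lemma dictionary (hypothetical stand-in for a real vocabulary).
LEMMAS = {"running": "run", "studies": "study", "better": "good", "cats": "cat"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word: str) -> str:
    """Look the word up in the dictionary and fall back to the word itself."""
    return LEMMAS.get(word, word)


for w in ["running", "studies", "better", "cats"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemma(w)}")
```

The toy stemmer turns "running" into "runn" and "studies" into "studi", and it cannot relate "better" to "good" at all. Real stemmers add further rules (for example, undoubling the final consonant so that "running" becomes "run"), but they still work on surface strings and never consult a dictionary.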

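A complete cleaning pipeline that combines these steps, using spaCy for stopword filtering and lemmatization:

import re
import string

import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")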

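def preprocess(text: str) -> list[str]:
    """Clean raw text and return a list of lemmas."""
    text = text.lower()                       # normalize case
    text = re.sub(r'https?://\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>', '', text)         # drop HTML tags
    # Remove ASCII punctuation (must come after tag removal, which relies on < and >)
    text = text.translate(str.maketrans('', '', string.punctuation))
    doc = nlp(text)
    # Keep lemmas of alphabetic, non-stopword tokens longer than two characters
    return [
        token.lemma_ for token in doc
        if not token.is_stop and token.is_alpha and len(token.text) > 2
    ]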

print(preprocess("The cats are running quickly across the green field!"))
# ['cat', 'run', 'quickly', 'green', 'field']
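To see what each cleaning step contributes, the sketch below replays the string-level steps of `preprocess` one at a time, using only the standard library (no spaCy). The helper name `trace_cleaning` and the sample string are assumptions made for this illustration.

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def trace_cleaning(text: str) -> list[tuple[str, str]]:
    """Return (step name, text after that step) for each cleaning step."""
    trace = []
    text = text.lower()
    trace.append(("lowercase", text))
    text = re.sub(r"https?://\S+", "", text)
    trace.append(("strip URLs", text))
    text = re.sub(r"<.*?>", "", text)
    trace.append(("strip HTML tags", text))
    text = text.translate(PUNCT_TABLE)
    trace.append(("strip punctuation", text))
    return trace


sample = "<p>Read MORE at https://example.com/docs, it's free!</p>"
for name, result in trace_cleaning(sample):
    print(f"{name:18} {result!r}")
```

Two details are worth noticing in the trace. The URL pattern consumes everything up to the next whitespace, including the trailing comma. Stripping punctuation before tokenization also turns "it's" into "its", so contractions no longer reach the tokenizer intact. Whether that matters depends on the task.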

Part of the NLP & Language Models series on Tekivex.