Design a Search Engine (Google)

Web crawling, inverted index construction, ranking with TF-IDF and PageRank, query processing pipeline, and building a distributed search at web scale.

Advanced · 68 min read

System Components

Component Role
Web Crawler Discover and download web pages (BFS from seed URLs)
Link Extractor Parse HTML; extract URLs; add to frontier queue
Content Store Store raw HTML pages (distributed file store)
Indexer Parse content; build inverted index; compute TF-IDF scores
PageRank Batch job: compute authority scores from link graph
Query Processor Parse query; lookup index; merge + rank results; return top K
Cache Cache results for popular queries (LFU eviction)

Inverted Index

An inverted index maps each word to a list of documents containing it (called a posting list). This is the core data structure that allows Google to search billions of pages in milliseconds.

// inverted_index["system"] → [
//   { docId: "url1", tf: 0.05, positions: [12, 45] },
//   { docId: "url2", tf: 0.03, positions: [7]      },
// ]

// TF-IDF scoring:
// TF  = (term frequency in doc) / (total terms in doc)
// IDF = log(total docs / docs containing term)
// Score = TF * IDF — rare terms in relevant docs rank higher

function score(term: string, docId: string): number {
  const tf = termFrequency(term, docId);
  const idf = Math.log(totalDocs / docsContaining(term));
  const pageRank = getPageRank(docId);
  return tf * idf * Math.log(1 + pageRank); // Combine relevance + authority
}

Web Crawler Design

  • Frontier queue — priority queue of URLs to crawl (prioritise fresh/high-PageRank pages)
  • Politeness — respect robots.txt; rate limit per domain (max 1 req/sec)
  • DNS caching — cache DNS lookups to avoid per-URL resolution overhead
  • Deduplication — URL fingerprint (simhash) to skip near-duplicate pages
  • Distributed — consistent hash assigns URL ranges to crawler workers

Part of the System Design series on Tekivex. Browse all tutorials or explore our open-source products.