Big Data Fundamentals

The 5 Vs, batch vs stream processing, Lambda and Kappa architectures, and the modern data lakehouse — how hyperscalers process petabytes.

Advanced · 12 min read

The 5 Vs of Big Data

V	Meaning	Example Challenge
Volume	Terabytes to petabytes	Can't fit in a single machine's RAM or disk
Velocity	High rate of data arrival	Millions of events/second from IoT sensors
Variety	Structured, semi-structured, unstructured	JSON logs + CSV exports + video files
Veracity	Data quality and accuracy	Missing values, duplicate events, schema drift
Value	Business insight from data	Actionable analytics that justify storage costs

Batch vs Stream Processing

Batch Processing	Stream Processing
Process a finite dataset all at once	Process events as they arrive
High throughput, high latency (hours)	Low latency (ms to seconds)
Apache Spark, Hadoop MapReduce	Apache Kafka + Flink, Spark Streaming
ETL jobs, ML training, monthly reports	Real-time dashboards, fraud detection
Cheaper to run, simpler to debug	More complex, more infrastructure

Lambda Architecture

Lambda architecture runs two processing paths: a batch layer (accurate, slow) and a speed layer (approximate, fast). A serving layer merges results. The downside: you maintain two codebases.

Modern Data Lakehouse

The Data Lakehouse (Databricks Delta Lake, Apache Iceberg, Apache Hudi) unifies the cheap storage of a data lake (S3/GCS) with the ACID transactions and schema enforcement of a data warehouse. One system, one codebase, one truth.

Layer	Technology	Role
Ingestion	Kafka, Kinesis, Firehose	Stream events from services
Storage	S3 + Iceberg/Delta	Cheap, durable, queryable
Processing	Spark, Flink, dbt	Transform raw → curated → aggregated
Query	Trino, Athena, BigQuery	Ad-hoc SQL on PB-scale data
BI / ML	Tableau, Looker, SageMaker	Consume insights

Part of the System Design series on Tekivex. Browse all tutorials or explore our open-source products.