Apache Iceberg
Open-source table format for massive data lakes, providing ACID transactions and efficient metadata management for analytical storage.
Updated on January 29, 2026
Apache Iceberg is a high-performance open-source table format designed to handle petabyte-scale datasets in data lake architectures. Unlike traditional formats like Parquet or ORC that only manage file storage, Iceberg provides a table management layer with transactional guarantees, scalable schema tracking, and time travel capabilities. Created at Netflix and now an Apache top-level project, it solves critical consistency and performance issues in modern data architectures.
Fundamentals of Apache Iceberg
- Three-layer architecture: data files (Parquet/ORC/Avro), manifest files (file lists), and table metadata (snapshots, schemas, partitioning)
- Full ACID transactions with serializable isolation, enabling consistent reads even during concurrent writes
- Schema evolution without rewrites: add, drop, and rename columns without affecting existing data
- Hidden partitioning: partition transformations applied automatically without exposing structure to queries
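The three metadata layers above are not opaque: Iceberg exposes each of them as a queryable system table alongside the data table itself. A minimal sketch, assuming a table named `my_catalog.db.events` (illustrative name):

```sql
-- Snapshot history (one row per commit)
SELECT snapshot_id, committed_at, operation FROM my_catalog.db.events.snapshots;

-- Manifest files referenced by the current snapshot
SELECT path, length FROM my_catalog.db.events.manifests;

-- Individual data files with their statistics
SELECT file_path, record_count FROM my_catalog.db.events.files;
```

These metadata tables are what query planners use for partition and file pruning; inspecting them directly is a common way to debug small-file or metadata-bloat issues.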
Strategic Benefits
- Optimal performance: partition and file pruning based on detailed statistics, drastically reducing scanned data volume
- Built-in time travel: access any historical snapshot of the table for audits, reproductions, or rollbacks
- Multi-engine compatibility: works natively with Spark, Flink, Trino, Hive, and Presto, with no dependence on a proprietary ecosystem
- Metadata scalability: optimized structure to handle millions of partitions without planning performance degradation
- Advanced atomic operations: MERGE, UPDATE, DELETE performed transactionally on distributed tables
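The time travel and rollback capabilities listed above can be exercised directly in Spark SQL. A hedged sketch, with an illustrative snapshot ID and the same hypothetical `my_catalog.db.events` table:

```sql
-- Read an older state of the table by snapshot ID or timestamp
SELECT * FROM my_catalog.db.events VERSION AS OF 8744736658442914487;
SELECT * FROM my_catalog.db.events TIMESTAMP AS OF '2024-01-15 10:00:00';

-- Restore the table to a known-good snapshot after a bad write
CALL my_catalog.system.rollback_to_snapshot('db.events', 8744736658442914487);
```

Because a rollback only swaps which snapshot the metadata points at, it is a fast metadata-only operation: no data files are rewritten.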
Practical Example: Iceberg Table Architecture
from pyspark.sql import SparkSession
# Spark configuration for Iceberg
spark = SparkSession.builder \
    .appName("IcebergDemo") \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.0") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.my_catalog.type", "hadoop") \
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://my-bucket/warehouse") \
    .getOrCreate()
# Create Iceberg table with hidden partitioning
spark.sql("""
    CREATE TABLE my_catalog.db.events (
        event_id STRING,
        user_id LONG,
        event_type STRING,
        event_timestamp TIMESTAMP,
        metadata MAP<STRING, STRING>
    )
    USING iceberg
    PARTITIONED BY (days(event_timestamp), event_type)
    TBLPROPERTIES (
        'write.format.default' = 'parquet',
        'write.metadata.compression-codec' = 'gzip'
    )
""")
# Insert data
df = spark.read.json("s3://source/events/*.json")
df.writeTo("my_catalog.db.events").append()
# Time travel: read the table as of a past instant
# (the as-of-timestamp option expects epoch milliseconds, not a date string)
spark.read \
    .option("as-of-timestamp", "1705312800000") \
    .table("my_catalog.db.events") \
    .show()  # 1705312800000 = 2024-01-15 10:00:00 UTC
# Atomic MERGE operation ("updates" must be a registered temp view or table)
spark.sql("""
    MERGE INTO my_catalog.db.events t
    USING updates s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
# Schema evolution without downtime
spark.sql("""
    ALTER TABLE my_catalog.db.events
    ADD COLUMN device_type STRING AFTER event_type
""")
Implementation Steps
- Catalog selection: choose between Hive Metastore, AWS Glue, Nessie, or JDBC based on existing infrastructure
- Storage configuration: define warehouse on S3, ADLS, GCS, or HDFS with appropriate permissions
- Compute engine integration: configure Spark, Flink, or Trino with Iceberg extensions and connectors
- Partitioning strategy: define hidden transformations (days, hours, bucket) based on query patterns
- Retention policy: configure snapshot expiration and cleanup to optimize storage costs
- Progressive migration: use migration procedures to convert existing tables (Hive, Delta) to Iceberg
- Monitoring: implement tracking for compaction metrics, snapshot count, and table size
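The progressive-migration step can use Iceberg's built-in Spark procedures. A sketch, assuming a Hive table `db.legacy_events` (hypothetical name):

```sql
-- Create a temporary Iceberg table over the existing files to validate first,
-- leaving the source table untouched
CALL my_catalog.system.snapshot('db.legacy_events', 'db.legacy_events_iceberg');

-- Then migrate in place once validated (replaces the source table's metadata)
CALL my_catalog.system.migrate('db.legacy_events');
```

Both procedures reuse the existing data files, so the migration itself moves no data; only table metadata is written.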
Performance Optimization
Regularly execute maintenance operations: REWRITE DATA FILES to optimize file sizes (avoid small files), EXPIRE SNAPSHOTS to remove obsolete history, and REWRITE MANIFESTS to consolidate fragmented metadata. These operations maintain optimal long-term performance.
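These maintenance operations map to Iceberg's Spark stored procedures. A minimal sketch, with illustrative table name, file-size target, and retention window:

```sql
-- Compact small files (target-file-size-bytes here is 512 MB, illustrative)
CALL my_catalog.system.rewrite_data_files(
    table => 'db.events',
    options => map('target-file-size-bytes', '536870912'));

-- Expire snapshots older than a cutoff, keeping at least the last 10
CALL my_catalog.system.expire_snapshots(
    table => 'db.events',
    older_than => TIMESTAMP '2024-01-08 00:00:00',
    retain_last => 10);

-- Consolidate fragmented manifest files
CALL my_catalog.system.rewrite_manifests('db.events');
```

In production these are typically scheduled as periodic jobs; note that expiring snapshots permanently limits how far back time travel can reach.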
Tools and Ecosystem
- Apache Spark: primary engine for batch and streaming operations on Iceberg tables
- Apache Flink: real-time streaming with native support for ACID Iceberg writes
- Trino/Presto: high-performance interactive SQL querying on Iceberg data lakes
- Nessie: Git-like catalog providing branches, tags, and versioning for Iceberg tables
- AWS Glue/Azure Purview: managed catalogs with Iceberg support for centralized metadata
- dbt: SQL transformations with incremental materialization support on Iceberg
- Tableau/Looker: visualization and BI directly on Iceberg tables via JDBC/ODBC connectors
Apache Iceberg represents a major evolution in data lake architecture, bringing database-style guarantees to cloud-storage scale. Its ability to provide ACID transactions, time travel, and schema evolution without sacrificing performance makes it a preferred choice for modern data architectures that require reliability and flexibility. By unifying batch and streaming workloads under a standardized, vendor-neutral format, Iceberg reduces operational complexity while improving data governance and infrastructure cost efficiency.

