From Glue to Horizon: Our Real Journey Building an Iceberg Lakehouse on Snowflake
We set out to build an open lakehouse: Iceberg tables on AWS S3, Spark/Glue for pipelines, Snowflake for analytics compute. What could go wrong? Everything—from query performance implosions to uncloneable dynamic tables. Here's the unfiltered journey, including why we pivoted to Snowflake-managed Iceberg via Horizon Catalog and abandoned automated dynamic tables for explicit, observable incremental processing.

- ❌ Phase 1 failed: Glue-generated Iceberg files (32–64 MB) caused 5–10x slower Snowflake queries
- ✅ Phase 2 wins: Snowflake-managed Iceberg auto-compacts to 256–512 MB files, queries run 2–5x faster, ~35% cost savings
- ⚠️ Dynamic Tables gotcha: cannot be cloned, opaque refresh timing—unusable for production DevOps
- ✅ Solution: explicit Streams + Tasks on log tables—boring, debuggable, production-ready
- 🎯 Key decision: the data team owns Bronze→Silver only; the business owns Gold (saved endless remodeling debates)
- 💰 Reality check: "cloud-neutral" = readable across engines, not free migration

Before diving into architecture decisions, here's who we are:

- Skills: strong SQL/dbt, limited Spark/Scala experience
- Priorities: ship fast, avoid operational black boxes
- Constraint: no dedicated DevOps for Glue cluster tuning

This shaped every decision below. A team fluent in Spark would have made different trade-offs.

The goal: unify data from SAP HANA (change data capture), Salesforce Data Cloud, and raw event streams into a single cloud-neutral lakehouse—no proprietary lock-in, full cross-tool interoperability.

We evaluated Delta Lake, Apache Hudi, and Apache Iceberg a couple of years ago.
Here's the comparison that drove our decision:

| Criterion | Iceberg | Delta Lake | Hudi |
| --- | --- | --- | --- |
| Partition evolution | ✅ Change without rewrite | ❌ Requires full table rewrite | ❌ Not supported |
| Primary keys | ✅ Native support | ❌ Not supported | ✅ Supported |
| Automated compaction | ✅ MAINTAIN ICEBERG TABLE | ⚠️ Manual (OPTIMIZE tuning) | ⚠️ Manual/semi-auto |
| Schema evolution | ✅ Full (add/drop/rename/reorder) | ✅ Add/drop columns | ⚠️ Limited |
| Engine compatibility | ✅ Spark, Trino, Flink, Snowflake, Dremio | ⚠️ Spark-first, limited others | ⚠️ Spark-first |
| Platform support | ✅ AWS Glue/Athena, Azure, GCP, Snowflake | ✅ AWS, Azure (Fabric native), Databricks | ⚠️ AWS, limited Azure |
| File format flexibility | ✅ Parquet, ORC, Avro | ⚠️ Parquet only | ✅ Parquet, ORC, Avro |
| Community governance | ✅ Apache Foundation (vendor-neutral) | ⚠️ Databricks-controlled | ✅ Apache Foundation |

Iceberg's winning factors for our context:

- **Partition evolution without rewrites:** our SAP data's partitioning strategy evolved over time (daily → monthly as the data matured). Iceberg lets us change partition specs without rewriting billions of rows. Delta requires a full table rewrite—a multi-day, multi-TB operation we couldn't afford.
- **Snowflake-managed support:** only Iceberg offers native managed tables in Snowflake Horizon Catalog with automatic compaction. Delta/Hudi would lock us into external-table limitations, with the Phase 1 performance issues described below.
- **Automated compaction:** Snowflake's MAINTAIN ICEBERG TABLE handles file optimization automatically. Delta requires manual compaction tuning in Spark—expertise our SQL-first team doesn't have.
- **Primary key enforcement:** Iceberg supports primary keys natively, critical for our SAP source data integrity (customer IDs, order numbers). Delta lacks this—you must enforce it in application logic.
- **Vendor-neutral governance:** Apache Foundation stewardship means no single vendor controls the spec. Delta's governance is tied to Databricks' business interests.
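Partition evolution is a metadata-only operation in Iceberg. A minimal Spark SQL sketch, assuming a hypothetical `glue_catalog.silver.vbak` table that was originally partitioned daily on its `erdat` date column:

```sql
-- Metadata-only spec change: existing files keep the old daily layout,
-- new writes use the monthly layout. No historical data is rewritten.
ALTER TABLE glue_catalog.silver.vbak
  REPLACE PARTITION FIELD days(erdat) WITH months(erdat);
```

Queries filtering on `erdat` keep pruning correctly across both specs, because Iceberg records which partition spec each data file was written with.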
Bottom line: Iceberg was the safe bet for multi-engine flexibility without vendor lock-in. If you're Azure-only with Fabric, Delta is pragmatic. If you're Databricks-native, Delta is the path of least resistance. But for AWS + Snowflake + future optionality, Iceberg was the only choice.

Three differentiators closed the decision:

1. **Separation of storage & compute.** Scale workloads independently to meet business demands while enabling detailed chargeback per team or domain—without disrupting other workloads. → Deep dive: Understanding Snowflake Virtual Warehouses
2. **Automatic caching + smart pruning (RELY constraints).** Sub-second query performance on petabyte-scale data through intelligent result/metadata caching and constraint-based optimization. → Deep dive: RELY Constraint for Star-Schema Queries
3. **Compute billing precision.** POC result: our thousands of BI dashboards were charged only for calculation time—not for data transfer to BI tools. Estimated 40–60% cost savings vs. competitors that bill for the full query duration.
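As a concrete sketch of the RELY mechanism (table and constraint names are hypothetical): Snowflake does not enforce these constraints, but declaring them with RELY lets the optimizer use them, for example to eliminate unnecessary joins in star-schema queries.

```sql
-- Unenforced key declarations; RELY tells the optimizer to trust them.
ALTER TABLE gold.dim_customer
  ADD CONSTRAINT pk_dim_customer PRIMARY KEY (customer_id) RELY;

ALTER TABLE gold.fact_sales_order
  ADD CONSTRAINT fk_fact_customer
  FOREIGN KEY (customer_id) REFERENCES gold.dim_customer (customer_id) RELY;
```

A BI query that joins the fact to the dimension without selecting any dimension columns can then skip the join entirely. Only declare RELY if your pipeline actually guarantees the keys; the optimizer takes them on faith.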
Our lakehouse follows the classic three-layer medallion model, tailored for SAP source systems:

```
┌─────────────────────────────────────────────────────────────┐
│                        DATA SOURCES                         │
│   SAP HANA (CDC)  │  Salesforce Data Cloud  │  Raw Streams  │
└────────────┬──────────────────┬─────────────────────┬───────┘
             │                  │                     │
             ▼                  ▼                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 🥉 BRONZE LAYER — AWS Glue Spark → Iceberg writes to S3     │
│    Raw ingestion, no transformation, full fidelity          │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 🥈 SILVER LAYER — Iceberg Tables (AWS-hosted, S3)           │
│    Matches source table structures                          │
│    e.g., KNA1 (Customers), MARA (Materials), VBAK (Orders)  │
└─────────────────────────┬───────────────────────────────────┘
                          │  Stream on log_table
                          ▼
┌─────────────────────────────────────────────────────────────┐
│ 🥇 GOLD LAYER — Snowflake-Managed Iceberg (Horizon Cat.)    │
│    Proper dimensions & facts with business names            │
│    e.g., DIM_CUSTOMER, FACT_SALES_ORDER, DIM_PRODUCT        │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
              BI Tools / ML / Fabric Export
```

Sounds ideal? Early reality: Snowflake choked on our Glue-generated files. Here's what we learned the hard way: don't own Gold if you don't have to.

Why this boundary matters:

- **Silver = source truth:** matches SAP table structures (KNA1, MARA, VBAK). Data engineering controls quality, structure, and change tracking.
- **Gold = business semantics:** DIM_CUSTOMER, FACT_SALES_ORDER. Business teams decide how to model, aggregate, and interpret.
- **Boundary = contract:** Silver provides clean, change-tracked source data with explicit SLAs; business teams own the downstream transformations.

What this boundary spared us:

- ❌ Endless "why did the customer count change?"
debates (business definition shifts, not data quality issues)
- ❌ Remodeling dimensions every quarter when business logic evolves
- ❌ Being the bottleneck for every dashboard request
- ❌ Owning interpretations of business rules we don't fully understand

Your team may differ, but define the ownership boundary early or you'll own every downstream interpretation forever. The tools (dbt, Iceberg, Snowflake) don't enforce this—you must.

**Phase 1: Glue-managed Iceberg + Snowflake external tables**

We started simple: Spark jobs in Glue created Iceberg tables registered in the Glue Catalog; Snowflake linked them as external Iceberg tables. Snowflake's Iceberg scanner is optimized for specific file characteristics per the official recommendations:

| Parameter | Snowflake recommendation | What Glue produced |
| --- | --- | --- |
| File size | 256–512 MB | 32–64 MB (many small files) |
| Row group size | 16–256 MB | < 16 MB |
| Row groups per file | Multiple (for parallelism) | Often 1 |

The result: full table scans instead of pruned reads, and query times 5–10x slower than expected.

```
┌──────────────────────────────────────────────────────┐
│                 PHASE 1 ARCHITECTURE                 │
│                                                      │
│  AWS Glue Spark ──writes──▶ Iceberg (Glue Catalog)   │
│                                  │                   │
│                            S3 Parquet files          │
│                            (small, fragmented)       │
│                                  │                   │
│  Snowflake ◀──external table─────┘                   │
│  (slow scans, no auto-compaction)                    │
└──────────────────────────────────────────────────────┘
```

Option A: hire Spark expertise, tune file compaction settings

- Timeline: 3–6 months
- Ongoing cost: maintain Spark expertise, monitor file sizes
- Risk: our team lacks Spark internals experience

Option B: Snowflake-managed Iceberg via Horizon Catalog

- Timeline: 2 weeks
- Ongoing cost: Snowflake MAINTAIN ICEBERG TABLE compute
- Upside: compaction is handled automatically, and the team stays in its SQL/dbt comfort zone

We chose Option B: our team's strength is SQL/dbt, not Spark internals. Let Snowflake handle the file lifecycle. Snowflake-managed Iceberg tables put Snowflake in charge of the table lifecycle on your S3 bucket—Horizon Catalog governs metadata, access, and interoperability.
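In practice, Option B looks roughly like this. A hedged sketch, assuming an external volume (here named `s3_lakehouse_vol`) has already been configured for the S3 bucket; all object names are hypothetical:

```sql
-- Snowflake-managed Iceberg: the data stays in your S3 bucket, but
-- Snowflake (Horizon) owns the metadata and runs compaction/maintenance.
CREATE ICEBERG TABLE gold.dim_customer (
  customer_id   NUMBER,
  customer_name STRING,
  valid_from    TIMESTAMP_NTZ
)
  CATALOG         = 'SNOWFLAKE'        -- Snowflake as the Iceberg catalog
  EXTERNAL_VOLUME = 's3_lakehouse_vol'
  BASE_LOCATION   = 'gold/dim_customer/';
```

The table remains readable as plain Iceberg by external engines; only the write path and maintenance move under Snowflake's control.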
**Phase 2: Snowflake-managed Iceberg via Horizon Catalog**

```
┌──────────────────────────────────────────────────────────┐
│                  PHASE 2 ARCHITECTURE                    │
│                                                          │
│  AWS Glue Spark ──writes──▶ Snowflake Horizon Catalog    │
│       │                          │                       │
│       │       Horizon manages Iceberg metadata           │
│       │       Auto-compaction to 256–512 MB files        │
│       │                          │                       │
│       │                     S3 (your bucket)             │
│       │                     Optimal Parquet layout       │
│       │                          │                       │
│  Snowflake ◀──native read────────┘                       │
│  (2–5x faster, full pruning, cloneable*)                 │
│                                                          │
│  BI Tools   ◀── Snowflake compute                        │
│  Glue/Spark ◀── Iceberg open format (bidirectional)      │
└──────────────────────────────────────────────────────────┘
```

- ✅ 2–5x query speedup vs. Phase 1 external tables
- ✅ Automatic compaction to optimal file sizes via MAINTAIN ICEBERG TABLE
- ✅ Bidirectional access: Glue/Spark can write, Snowflake reads natively; BI tools use Snowflake compute
- ✅ Open format preserved: the Gold layer stays exportable to Fabric/Polaris later

Trade-off: Snowflake compute is billed for maintenance runs—but total ops cost is lower than Phase 1's slow queries burning warehouse credits.

**The Dynamic Tables dead end**

Dynamic Tables on Iceberg sounded perfect—zero-code pipelines, automatic refresh. We hit two walls that forced us back to explicit patterns.

Problem 1 — no cloning support:

```sql
-- This FAILS silently on dynamic Iceberg tables
CREATE DATABASE prod_clone CLONE prod_db;
-- Dynamic Iceberg tables are simply skipped in the clone
```

Database and schema clones skip dynamic Iceberg tables entirely. This is a DevOps killer—no dev/test environment parity, no blue-green deploys.

Problem 2 — opaque refresh timing: incremental refresh latency was unpredictable and nearly impossible to debug for SLA enforcement. Monitoring refresh lag and debugging failures was guesswork, with no visibility into what triggered a refresh or why it was delayed.

Dynamic Tables promise automation but hide:

- when refreshes actually run
- what triggered a refresh
- how to debug failures in production at 3 a.m.

Streams + Tasks mean more code, but code you can debug in production. Architecture decision-makers: optimize for production support, not dev convenience.
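The observability difference is concrete: with explicit objects, refresh history is one query away. A sketch using Snowflake's standard `TASK_HISTORY` table function and `SYSTEM$STREAM_HAS_DATA`, with hypothetical task/stream names (`gold_refresh`, `log_changes`):

```sql
-- Recent executions of the refresh task: state, timing, and any error.
SELECT name, state, scheduled_time, completed_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'GOLD_REFRESH'))
ORDER BY scheduled_time DESC
LIMIT 20;

-- Is unconsumed change data currently waiting in the stream?
SELECT SYSTEM$STREAM_HAS_DATA('LOG_CHANGES');
```

At 3 a.m., this is the difference between reading an error message and guessing at a black box.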
Resolution: we abandoned dynamic tables for Silver → Gold and went back to explicit Streams + Tasks.

```
┌─────────────────────────────────────────────────────┐
│          SILVER → GOLD PIPELINE (Current)           │
│                                                     │
│  Silver Layer                                       │
│     │                                               │
│     ├──▶ silver.log_table (tracks all changes)      │
│     │              │                                │
│     │              ▼                                │
│     │    CREATE STREAM log_changes                  │
│     │      ON TABLE silver.log_table;   ◀── stream  │
│     │              │                                │
│     │              ▼                                │
│     │    CREATE TASK gold_refresh       ◀── task    │
│     │      SCHEDULE = '5 MINUTE'                    │
│     │      AS MERGE INTO gold.customer_360          │
│     │         USING log_changes ...;                │
│     │              │                                │
│     │              ▼                                │
│  Gold Layer (Snowflake-Managed Iceberg)             │
└─────────────────────────────────────────────────────┘
```

```sql
-- The explicit, cloneable, debuggable pattern
CREATE STREAM log_changes ON TABLE silver.log_table;

CREATE TASK gold_refresh
  WAREHOUSE = compute_xs
  SCHEDULE  = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('log_changes')
AS
  MERGE INTO gold.customer_360 tgt
  USING (
    SELECT * FROM log_changes
    WHERE metadata$action = 'INSERT'
  ) src
  ON tgt.customer_id = src.customer_id
  WHEN MATCHED THEN UPDATE SET ...
  WHEN NOT MATCHED THEN INSERT ...;
```

- ✅ Full cloning support (dev/prod parity restored)
- ✅ Transparent costs—every execution logged in TASK_HISTORY
- ✅ Debuggable—stream offset visible, failures isolated
- ✅ Fabric-friendly—Gold can be exported as Iceberg/Parquet later

How "cloud-neutral" holds up in practice:

- ✅ Open format preserved: we can read the Iceberg tables from Spark, Trino, Athena
- ✅ Gold exportable: Stream to Parquet → Fabric mirroring works
- ⚠️ Fabric constraint: OneLake wants Delta and the same Azure region for zero-copy
- ⚠️ Horizon lock-in: Snowflake-managed Iceberg metadata is tied to Horizon Catalog

Not: "deploy anywhere tomorrow with zero effort."
Actually: "readable by multiple engines, movable with effort."

What a move to Fabric would realistically cost us:

| Layer | Migration effort | Estimated timeline |
| --- | --- | --- |
| Bronze → Silver ingestion | Rewrite Glue jobs to Delta | 2–3 weeks |
| Silver → Gold dbt models | Port to Fabric SQL (syntax diffs) | 1–2 months |
| Gold Iceberg tables | Export as Parquet, re-create in Fabric Warehouse | 1–2 weeks |
```
┌────────────────────────────────────────────────────────────┐
│                     CATALOG LANDSCAPE                      │
│                                                            │
│  Our Setup: AWS us-east-1                                  │
│  ┌────────────────────────────────────────────────────┐    │
│  │ Snowflake Horizon Catalog (current)                │    │
│  │ + AWS Glue Catalog (bronze/silver ingestion)       │    │
│  └────────────────────────────────────────────────────┘    │
│                                                            │
│  Future Options:                                           │
│  ┌─────────────────┐  ┌──────────────────────────────┐     │
│  │ Polaris Catalog │  │ Microsoft Fabric OneLake     │     │
│  │ (Snowflake SaaS)│  │ (wants Delta; Azure-region   │     │
│  │ Maturing fast   │  │ only for zero-copy)          │     │
│  └─────────────────┘  └──────────────────────────────┘     │
│                                                            │
│  Gold Streams → Parquet export → Fabric compatible ✅      │
└────────────────────────────────────────────────────────────┘
```

We prioritized Snowflake ecosystem depth over day-1 multi-cloud portability. Iceberg gave us an exit path, not a free exit. For teams that need Fabric OneLake zero-copy, starting with Delta on Azure is the pragmatic choice.

Phase 1 (Glue + external Iceberg):

- Glue: compute costs to write and maintain the Iceberg tables
- Snowflake query costs: 3x higher due to full scans

Phase 2 (Snowflake-managed Iceberg):

- MAINTAIN ICEBERG TABLE: billed in warehouse credits
- Query costs: 60% reduction (pruning and caching work correctly)
- Net savings: ~35% monthly vs. Phase 1

Takeaway: optimize for query performance where your users actually spend time, not just for ingestion costs.

Lessons learned:

- Target 256–512 MB Parquet files, or let Snowflake's MAINTAIN ICEBERG TABLE handle it automatically. Small files kill Snowflake performance.
- Dynamic Iceberg tables are powerful but not clone-safe for databases/schemas. Test your DevOps workflow before committing to production.
- If your team uses both Spark and Snowflake, Horizon gives you the best of both without choosing sides: write with Spark, read with Snowflake—all on the same Iceberg tables.
- Explicit Streams + Tasks win in production ops: observable at 3 a.m., debuggable from logs, cloneable for dev/test.
- Don't try to migrate all layers at once. Start with Gold (highest query frequency), validate performance, then move Silver.
- Bronze→Silver→Gold isn't just a stack of technical layers—it's a set of organizational boundaries. Decide who owns what before the first production table.

| Your situation | Recommendation | Why |
| --- | --- | --- |
| Team strong in Spark, need multi-engine | Glue-managed Iceberg + external tables | Keep expertise where it is |
| Team SQL-first, Snowflake primary engine | Snowflake-managed Iceberg | Let Snowflake handle the file lifecycle |
| Must support Fabric OneLake zero-copy | Delta Lake on Azure | Iceberg works but is not zero-copy on Fabric |
| Need dev/prod clones + SLA guarantees | Streams + Tasks (avoid Dynamic Tables) | Observable, debuggable |
| Bronze/Silver only (like us) | dbt incremental + Developer Toolkit | Explicit watermark control |

Bottom line: cloud-neutral means readable across tools, not free migration. Choose the platform that matches your team's strengths, and use open formats as portability insurance.

What's next for us:

- Investigate Polaris Catalog, or wait for Horizon Catalog to offer true multi-vendor Iceberg metadata management
- Evaluate the cost/performance of streaming directly from Kafka → Snowflake Iceberg
- Explore Iceberg v3 features for enhanced BCDR and CDC capabilities

Since our original evaluation, Snowflake released Apache Iceberg v3 support in public preview (March 2026) with capabilities that address several of our pain points and unlock new architectural patterns.

1. **Cross-region replication for Snowflake-managed tables**
   - What it enables: BCDR failover and replication groups for Iceberg tables across regions
   - Why it matters: previously, our DR strategy required complex Parquet exports. Now Snowflake-managed Iceberg tables can replicate with full consistency (including row lineage and deletion vectors)
   - Architecture impact: we can deploy active-passive DR without custom tooling
   - 📖 Replication Config Docs
2. **Catalog-linked databases**
   - What it enables: connect to remote Iceberg catalogs (AWS Glue, Polaris, etc.)
with automatic namespace discovery and read/write support
   - Why it matters: our Phase 1 external-table limitations are eliminated—we can now write back to Glue Catalog-managed tables from Snowflake
   - Architecture impact: true bidirectional catalog federation; this could unblock our "Glue as ingestion, Snowflake as analytics" hybrid
   - 📖 CREATE Catalog-Linked Database Docs

Which features would have changed our decision? None: we would have stayed with our choice (Iceberg + Horizon).

- ✅ Cross-region replication validates our bet on Snowflake-managed Iceberg (Delta still doesn't have this)
- ✅ Catalog-linked databases eliminate the Phase 1 external-table pain without abandoning Glue ingestion

For SQL-first teams on Snowflake, Iceberg v3 eliminates the last major operational friction points we encountered. The combination of Horizon Catalog and the v3 features delivers the "cloud-neutral with vendor optimization" balance we were seeking.

What's your biggest Iceberg-on-Snowflake headache? Drop it below 👇

#Snowflake #ApacheIceberg #Lakehouse #DataEngineering #HorizonCatalog #DataArchitecture #AWS #OpenLakehouse
