Architecture · May 7, 2026 · 4 min read
Building a Modern Lakehouse on Azure
An architecture walkthrough of a production-grade Azure Lakehouse — medallion layout, Delta Lake, governance, and the design choices that matter.
The Lakehouse pattern, which combines the openness of a data lake with the reliability and performance of a warehouse, is the de facto target architecture for new analytical platforms on Azure. The shape is well understood; the engineering choices that make it succeed are not always obvious. This post walks through a reference Azure Lakehouse end to end and calls out the decisions that actually move the needle.

## Why a Lakehouse, not a warehouse

A traditional warehouse couples storage to a single compute engine and requires a structured schema before you can query anything. A pure data lake gives you flexibility but no transactions, no schema enforcement, and weak performance for BI. The Lakehouse splits the difference: open columnar storage (Parquet files managed by Delta Lake) with ACID transactions, schema evolution, and time travel, queryable directly by Spark, SQL warehouses, and BI engines.

For most enterprises, the pragmatic argument is simpler: one copy of data, many compute engines, no vendor lock-in on storage.

## Reference architecture

A production Azure Lakehouse typically has six layers:

1. **Sources**: operational databases, SaaS APIs, event streams, files.
2. **Ingestion**: Azure Data Factory or Fabric Pipelines for batch, Event Hubs / Kafka for streaming, and tools like Fivetran or Airbyte for SaaS.
3. **Storage**: Azure Data Lake Storage Gen2 with the hierarchical namespace enabled, organized into Bronze, Silver, and Gold containers.
4. **Compute**: Databricks or Microsoft Fabric for transformation; Synapse Serverless or Fabric Warehouse for ad-hoc SQL.
5. **Serving**: Power BI semantic models (Direct Lake or Import), and SQL endpoints for downstream apps.
6. **Governance**: Microsoft Purview for catalog and lineage, Unity Catalog (if Databricks) for fine-grained access, Azure RBAC and ACLs at the storage layer.

## The medallion pattern done right

The Bronze / Silver / Gold layout is well documented but often misapplied.
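
To make the layout concrete, here is a minimal sketch of how table locations might be derived when each medallion layer has its own ADLS Gen2 container. The storage account name (`contosolake`), the `domain/table` folder convention, and the `LakehousePaths` helper are assumptions for illustration, not something the reference architecture prescribes; only the `abfss://` URI scheme is standard.

```python
from dataclasses import dataclass

# One ADLS Gen2 container per medallion layer, as in the storage layer above.
LAYERS = ("bronze", "silver", "gold")


@dataclass(frozen=True)
class LakehousePaths:
    """Builds abfss:// URIs for tables in a medallion layout (hypothetical helper)."""

    storage_account: str  # assumed account name, e.g. "contosolake"

    def table_uri(self, layer: str, domain: str, table: str) -> str:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer {layer!r}; expected one of {LAYERS}")
        # ADLS Gen2 URI form:
        #   abfss://<container>@<account>.dfs.core.windows.net/<path>
        return (
            f"abfss://{layer}@{self.storage_account}.dfs.core.windows.net/"
            f"{domain}/{table}"
        )


paths = LakehousePaths(storage_account="contosolake")
print(paths.table_uri("bronze", "sales", "orders_raw"))
# abfss://bronze@contosolake.dfs.core.windows.net/sales/orders_raw
print(paths.table_uri("gold", "finance", "monthly_revenue"))
# abfss://gold@contosolake.dfs.core.windows.net/finance/monthly_revenue
```

Centralizing path construction like this keeps the layer a table belongs to explicit in every job, which makes it harder for Silver and Gold outputs to end up in the wrong container.
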
A few rules of thumb:

- **Bronze is immutable.** Land raw data exactly as received, append-only, with metadata columns (ingestion timestamp, source file, batch id). Do not transform here.
- **Silver is conformed.** Apply schema, deduplicate, parse types, handle late-arriving data, and join in reference data. Silver tables should be reusable across many Gold consumers.
- **Gold is purpose-built.** One Gold model per business domain or use case: finance reporting, customer 360, ML feature store. Optimize physical layout (Z-Order, partitioning) for the query patterns each model serves.

A common mistake is letting Silver and Gold blur: Silver picks up dashboard-specific aggregations and Gold becomes a thin wrapper. Resist this; it leads to duplicated logic across Gold models and brittle dashboards.

## Delta Lake essentials

Delta is the storage format that makes the Lakehouse work. The features worth understanding deeply:

- **ACID transactions** make concurrent writes safe: no more "is this dashboard reading a half-written file" failures.
- **Schema enforcement and evolution** catch upstream changes before they break downstream queries.
- **Time travel** is invaluable for debugging and for reproducible ML training. Keep retention reasonable to control storage costs. By default Delta keeps transaction log history for 30 days (`delta.logRetentionDuration`), but `VACUUM` removes unreferenced data files older than 7 days (`delta.deletedFileRetentionDuration`), so older versions stay queryable only if their data files are retained.
- **OPTIMIZE and Z-ORDER** matter for performance. Run them on a schedule for high-read tables.
- **VACUUM** removes data files the table no longer references. Set retention policies carefully: too aggressive and you break time travel; too generous and storage bills creep.

## Governance and security

Governance is where most Lakehouse projects underinvest and pay for it later. The minimum viable controls:

- **Identity**: managed identities for compute, no service principals with long-lived client secrets lying around.
- **Network**: private endpoints on storage and compute, no public-internet data paths.
- **Access**: Unity Catalog (Databricks) or Fabric workspace roles plus storage ACLs. Avoid broad Storage Blob Data Reader assignments.
- **Catalog and lineage**: Purview scans of ADLS, Databricks, and Synapse, with lineage extracted from Spark and SQL.
- **Audit**: diagnostic logs to Log Analytics; alerts on unusual access patterns.

Treat governance as a product, not a project. Assign a steward per domain.

## Cost considerations

A few choices have outsized cost impact:

- Use **autoscaling clusters** with sensible min/max for Databricks, and **Photon** for SQL workloads.
- Right-size the **Fabric capacity** by load-testing rather than by relying on the vendor's sizing recommendation.
- Tier ADLS Gen2 data: hot for active layers, cool for older Bronze partitions, archive for compliance retention.
- Compact small files. Thousands of tiny Parquet files are the silent killer of Lakehouse query performance and cost.

## Final thoughts

A successful Azure Lakehouse is less about picking Databricks vs. Fabric and more about the disciplines around it: a clean medallion separation, Delta hygiene, governance from day one, and cost guardrails. Get those right and the platform will scale gracefully. Skip them and you will be re-platforming again in three years.
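
As a starting point for the Delta hygiene mentioned above, the sketch below generates the `OPTIMIZE ... ZORDER BY` and `VACUUM ... RETAIN` statements a scheduled job might run per table. The `MaintenancePlan` type, the table name, and the Z-Order columns are hypothetical; the 168-hour floor mirrors Delta's default 7-day VACUUM safety threshold, and the guard is an assumption about what a cautious team would want rather than a required policy.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Delta's default VACUUM retention threshold is 7 days (168 hours).
MIN_SAFE_RETENTION_HOURS = 168


@dataclass(frozen=True)
class MaintenancePlan:
    """Per-table maintenance settings (hypothetical structure)."""

    table: str                             # fully qualified table name
    zorder_columns: Tuple[str, ...] = ()   # columns most often used in filters
    vacuum_retention_hours: int = MIN_SAFE_RETENTION_HOURS


def maintenance_statements(plan: MaintenancePlan) -> List[str]:
    """Return the Delta SQL statements for one maintenance run."""
    if plan.vacuum_retention_hours < MIN_SAFE_RETENTION_HOURS:
        # Shorter retention can break time travel and long-running readers.
        raise ValueError(
            f"retention of {plan.vacuum_retention_hours}h is below the "
            f"{MIN_SAFE_RETENTION_HOURS}h safety floor"
        )
    optimize = f"OPTIMIZE {plan.table}"
    if plan.zorder_columns:
        optimize += f" ZORDER BY ({', '.join(plan.zorder_columns)})"
    vacuum = f"VACUUM {plan.table} RETAIN {plan.vacuum_retention_hours} HOURS"
    # Compact first, then vacuum, so files replaced by compaction age out later.
    return [optimize, vacuum]


plan = MaintenancePlan(
    table="gold.finance.monthly_revenue",
    zorder_columns=("customer_id", "order_date"),
)
for statement in maintenance_statements(plan):
    print(statement)
# OPTIMIZE gold.finance.monthly_revenue ZORDER BY (customer_id, order_date)
# VACUUM gold.finance.monthly_revenue RETAIN 168 HOURS
```

In a Databricks or Fabric notebook, each statement could be passed to `spark.sql(...)`, assuming a Spark session with Delta Lake enabled is available as `spark`. Generating the statements separately from executing them keeps the retention guard testable without a cluster.
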