The Semantic Layer Belongs in the Warehouse

Why governed metrics should live in the warehouse semantic layer, while BI tools and AI systems should consume them rather than define them.

Over the last few years, many teams have ended up defining their business logic inside the BI layer.

That was often the practical path. Power BI models, Qlik semantic layers, calculated measures, hidden join logic, report-specific filters, and a growing number of metric definitions spread across dashboards. It worked well enough to unlock reporting fast.

But it also created a structural problem: the most important definitions in the company were sitting at the edge of the stack, not near the data platform itself.

That is the part I think we should now actively move away from.

If a metric matters to the business, its definition should live as close as possible to the warehouse, versioned with the transformation code, documented with the models it depends on, and reusable by every consumer, not just by one BI product.

That is why warehouse-native semantic layers are becoming one of the most important ideas in modern data platforms. And among the current implementations, the dbt Semantic Layer, powered by MetricFlow, is one of the most interesting and practical.

What a semantic layer is for

A semantic layer is the place where analytical meaning is defined in a reusable way.

Its job is not to store raw data and not to replace the warehouse. Its job is to define things like metrics, dimensions, entities, time logic, and valid aggregation rules so that different consumers can ask consistent analytical questions without each one rebuilding the same business logic from scratch.

Historically, many teams implemented that layer inside the BI tool itself. Power BI models, LookML, Qlik semantic models, calculated measures, hidden joins, report filters, and curated datasets all served this purpose in different ways. The idea was always similar: give users a governed analytical interface on top of the raw warehouse tables.

That is the main purpose of a semantic layer:

  • define business concepts once
  • make them reusable across tools and users
  • reduce metric drift and ad hoc SQL logic
  • give consumers a safer interface than raw warehouse tables

The problem is not the concept. The problem is where the concept lives.

The problem with BI-owned semantics

Semantic layers inside BI tools are not a bad idea. They solved a real need: give analysts and business users curated dimensions, measures, and governed datasets without asking everyone to write SQL.

The problem is where that logic ends up living.

Once definitions are embedded in Power BI or Qlik Cloud, a few things usually happen:

  • Metric definitions start drifting across reports and workspaces.
  • Logic becomes harder to review with normal engineering workflows.
  • Reuse works well inside the BI tool, but poorly outside of it.
  • Documentation becomes partial, stale, or tool-specific.
  • External consumers, APIs, notebooks, and AI agents do not share the same semantic contract.
  • You accumulate product lock-in around the most valuable part of the analytics stack: business meaning.

This is usually not caused by bad engineering. It is just a consequence of putting the semantic center of gravity in the wrong layer.

The warehouse already contains the data. dbt already contains the transformation logic. Tests, lineage, documentation, and deployment processes already live there. Keeping metric definitions somewhere else creates an avoidable split-brain architecture.

What a warehouse-native semantic layer changes

A warehouse-native semantic layer flips the model.

Instead of asking each downstream tool to recreate metric logic, you define entities, dimensions, measures, and metrics once against warehouse models, then let consumers query those definitions consistently.

That brings a few important advantages immediately:

  • Business logic moves closer to the actual source of truth.
  • Metric definitions become versioned, reviewable, and testable in Git.
  • Documentation improves because semantic definitions live next to transformation logic.
  • Re-aggregation becomes a first-class capability rather than an afterthought.
  • BI tools become consumers of metrics, not owners of them.
  • Governance gets stronger without forcing every use case into precomputed marts.

AI tools such as Claude, Codex, and Gemini can inspect and edit semantic definitions much more quickly and consistently when they live as code in a repository rather than inside a proprietary UI built around drag-and-drop flows and manual clicks.

This last point matters.

A good semantic layer is not just a dictionary of metric names. It is a logical layer that understands grain, joins, time, and aggregation rules well enough to answer new questions without rebuilding a new table every time.

That is a major upgrade from the old pattern of “build one mart per dashboard and hope the definitions stay aligned.”

Why dbt and MetricFlow stand out

The generic warehouse-native idea is compelling on its own, but dbt makes it especially practical because the semantic layer sits where many teams already manage transformation logic.

With dbt, the semantic definition is declarative. You describe the semantic model in YAML: entities for joins, dimensions for slicing, measures as the atomic aggregations, and metrics as the business-facing layer on top. MetricFlow then uses those definitions to generate the SQL needed to answer metric queries dynamically.

That also means a consumer can group and filter a metric by all the dimensions linked to that semantic model, not only the columns sitting physically on the base fact table. Dimensions reachable through entity relationships become available through the semantic graph, so analysts and applications do not need to hand-write the join logic each time they want to slice a metric by customer, product, region, or some other linked context.

At a high level, it looks like this:

semantic_models:
  - name: orders
    model: ref('fct_orders')
    defaults:
      agg_time_dimension: ordered_date
    entities:
      - name: order
        type: primary
        expr: order_id
      - name: customer
        type: foreign
        expr: customer_id
    dimensions:
      - name: ordered_date
        type: time
        type_params:
          time_granularity: day
      - name: country
        type: categorical
    measures:
      - name: revenue
        agg: sum
        expr: order_amount
      - name: cost
        agg: sum
        expr: order_cost

metrics:
  - name: total_revenue
    label: Total Revenue
    type: simple
    type_params:
      measure:
        name: revenue
  - name: total_cost
    label: Total Cost
    type: simple
    type_params:
      measure:
        name: cost
  - name: gross_margin
    label: Gross Margin
    type: derived
    type_params:
      expr: (total_revenue - total_cost) / total_revenue
      metrics:
        - name: total_revenue
        - name: total_cost
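The cross-model joins described earlier work the same way. A second semantic model can expose customer-level dimensions through the shared customer entity; the model name dim_customers and its columns here are hypothetical, purely to sketch the pattern:

```yaml
semantic_models:
  - name: customers
    model: ref('dim_customers')   # hypothetical dimension table
    entities:
      - name: customer            # matches the foreign entity on orders
        type: primary
        expr: customer_id
    dimensions:
      - name: region
        type: categorical
      - name: customer_tier
        type: categorical
```

Because both semantic models declare the customer entity, MetricFlow can resolve a request like "total_revenue by customer__region" by generating the join itself; no consumer has to hand-write that SQL.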

That is the important shift.

The metric is no longer trapped in a Power BI model, a Qlik app, or a dashboard-specific SQL query. It becomes part of the data platform contract.

From there, the same metric can be queried at different grains, across valid dimensions, with MetricFlow generating the SQL rather than every analyst rewriting it by hand. Even a lightweight command such as:

mf query --metrics gross_margin --group-by metric_time__month,country

already hints at the real value: the platform understands what gross_margin means, how it should be derived from underlying input metrics, which time dimension it aggregates on, and which dimensions can be joined safely.

That is much more powerful than writing a metric's definition down in documentation and then hoping every downstream implementation respects it.

Not everything should go through the semantic layer

This is where the practical architecture matters more than the slogan.

A warehouse-native semantic layer should become the home for governed metrics and curated analytical concepts. It should not become a mandatory gateway for every table in the platform.

In a realistic setup, you usually want two access patterns to coexist:

  • the semantic layer for shared metrics and governed analytical datasets
  • direct access to raw or exploratory tables for open-ended analysis

That is not a contradiction. It is the healthy split.

A frontend application, notebook, or BI tool may ask the semantic layer for total_revenue, active_customers, or gross_margin, while still querying a raw events table or a curated wide dataset directly for exploration-heavy use cases. Trying to force every unpredictable question through the metric interface usually creates friction rather than governance.

The semantic layer is most valuable where consistency matters most. Raw access is still valuable where flexibility matters most.

Open enough to start now

One reason this is worth adopting now is that you do not need to treat it as an all-or-nothing platform bet.

The managed dbt Semantic Layer experience, with APIs and integrations, is part of the dbt platform. But the underlying modeling approach is not locked away. dbt Core users can still define semantic models and metrics and query them locally with MetricFlow.

That matters a lot.

It means teams can start by using the semantic layer as a governed modeling contract, even before they fully standardize downstream consumption. In practice, that already delivers value:

  • clearer documentation
  • shared metric definitions
  • better reviewability
  • less BI-specific duplication
  • a path away from product lock-in

Even if the first phase is “define it well in dbt, consume it selectively later,” that is still a very good outcome.

And yes, dbt is particularly relevant here because many platform teams already use it across Snowflake, Redshift, and Databricks environments. The semantic layer then extends the same engineering workflow rather than introducing a parallel one.

This is also an anti-lock-in move

A semantic layer in the BI tool often looks harmless until you try to leave the BI tool.

Then you realize your metric logic is buried inside proprietary measures, tool-specific calculation engines, report models, and workspace conventions that do not translate cleanly anywhere else.

That is not just BI lock-in. It is semantic lock-in.

With dbt, the definitions live in code, in your repository, against your warehouse models. Even if your consumption layer changes later, your business definitions remain portable and inspectable. That is a much healthier architecture.

It also changes the relationship between the data team and the business. Instead of each tool inventing its own version of the truth, the platform provides a single semantic contract and every consumer works from it.

That is opinionated, yes. But it is the right kind of opinionated.

The AI angle is not optional anymore

This is where warehouse-native semantic layers become even more important.

AI agents, copilots, notebook assistants, internal data apps, and external consumers all need one thing if they are going to interact safely with analytical data: a rigorous definition of what metrics mean.

Without that, AI does what humans also do in weakly governed environments:

  • guesses joins
  • invents filters
  • mixes grains
  • uses similar-looking columns as if they were equivalent
  • returns numbers that sound plausible but are semantically wrong

A semantic layer is one of the strongest ways to reduce that failure mode.

If metrics, entities, and dimensions are explicitly modeled, an AI consumer does not need to reverse-engineer business logic from raw tables or scattered dashboards. It can operate against a defined contract. That is safer, more scalable, and easier to audit.

This is why I do not see semantic layers as just a BI modernization topic. They are quickly becoming an interface layer for machine consumers as much as for human ones.

In that sense, semantic modeling and AI reinforce each other:

  • AI makes it easier to bootstrap and maintain semantic definitions.
  • Semantic definitions make AI much less likely to produce misleading answers.
  • Shared metric contracts let agents, dashboards, notebooks, and APIs stay aligned.

That positive loop is already enough reason to start now rather than waiting for the ecosystem to become “fully mature.”

The trade-offs are real, but manageable

None of this means the semantic layer is free.

There are obvious trade-offs, and pretending otherwise would just turn the argument into marketing.

The main ones are:

  • Some queries will be more expensive because aggregations are computed dynamically.
  • Star-schema flexibility can mean more joins at query time.
  • Latency and performance tuning still matter, especially for hot paths and high-concurrency use cases.
  • Downstream integration patterns may still be less straightforward than simply building another static aggregate table.

But these are engineering trade-offs, not reasons to avoid the approach.

In fact, one of the most important points to understand is that a semantic layer does not forbid precomputed aggregates. Quite the opposite: it gives you a cleaner contract on top of them.

The elegant default is to start from well-modeled fact and dimension tables, keep the semantic definition logical, and let MetricFlow re-aggregate when that is efficient enough. Then, for the cases where performance really matters, you can absolutely place pragmatic pre-aggregated, denormalized, or semantic-ready wide tables underneath the same contract.

That is not cheating. That is good platform design.

The semantic layer defines meaning. Physical optimization remains a separate concern.

This distinction is easy to miss, but it is central.

You should define the metric once as a business concept, then decide which physical model should back it. If total_revenue is logically sum(order_amount), that metric should stay stable even if the warehouse team later changes the underlying source from a star-schema fact table to a flattened performance-optimized table.

Consumers should not need to care whether the answer came from a normalized model, a denormalized fact, or a pre-aggregated table. They should keep asking for the same metric.
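As a sketch of that separation, swapping the backing table only touches the model reference inside the semantic model; the table name fct_orders_wide is hypothetical:

```yaml
semantic_models:
  - name: orders
    # Previously: model: ref('fct_orders')  (star-schema fact)
    model: ref('fct_orders_wide')   # hypothetical pre-flattened table
    defaults:
      agg_time_dimension: ordered_date
    measures:
      - name: revenue
        agg: sum
        expr: order_amount          # expr can also absorb column renames

# The metrics section (total_revenue, total_cost, gross_margin)
# stays untouched, so every consumer keeps asking the same question.
```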

Star schema, wide facts, and entities

This is also why the “semantic layer versus denormalized tables” framing is often wrong.

You do not have to choose between clean dimensional modeling and practical performance. Most good implementations end up hybrid.

Often the base warehouse still follows a familiar Kimball-ish structure:

  • clear fact tables at a defined grain
  • shared dimensions for reusable business context
  • additional wide or flattened tables for performance-sensitive workloads

The semantic layer then sits on top of that mix.

If joins on the star schema are cheap enough, defining measures directly on those facts is perfectly reasonable. If the same joins are repeatedly expensive, a wide semantic-ready fact table is often the better execution layer. Either way, the goal is to keep the semantic definition stable while remaining pragmatic about query cost.

That does not mean the right answer is always “one giant flat table with every dimension duplicated into it.”

Wide fact tables are useful, but literal mega-tables create their own problems:

  • storage and scan costs grow quickly
  • updates become harder to manage
  • slowly changing dimensions become awkward
  • shared dimensions lose some of their reuse value

In practice, many teams keep the most commonly used slicing attributes close to the fact grain and leave less common or more volatile enrichment in separate dimension models.

MetricFlow’s entity model fits this pattern well. Entities are mostly about expressing join paths between semantic models. If a wide fact table already contains everything needed for a given workload, the semantic model may expose only its primary entity and require no joins for common queries. That is a useful optimization, not a rule for the whole warehouse.
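A minimal sketch of such a self-contained semantic model, assuming a hypothetical wide table fct_sessions_wide that already carries its own slicing attributes:

```yaml
semantic_models:
  - name: sessions
    model: ref('fct_sessions_wide')   # hypothetical wide table
    defaults:
      agg_time_dimension: session_date
    entities:
      - name: session
        type: primary                 # primary entity only: no join paths needed
        expr: session_id
    dimensions:
      - name: session_date
        type: time
        type_params:
          time_granularity: day
      - name: device_type             # slicing attributes live on the fact itself
        type: categorical
    measures:
      - name: session_count
        agg: count_distinct
        expr: session_id
```

Common queries against this model require no joins at all, while other semantic models in the same project can still rely on shared dimensions and explicit entity links.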

The better mental model is simple:

  • metrics are defined once
  • physical tables can change underneath them
  • some semantic models are self-contained and wide
  • other semantic models still rely on shared dimensions and explicit join paths

That is usually what a mature semantic layer looks like in the real world.

Why I would recommend adoption now

If you are already working on a modern data platform, I think the question is no longer whether semantic layers are useful. The question is whether you want your semantic layer to live in the right place.

For me, the answer is increasingly clear:

  • define metrics near the warehouse
  • keep them in code
  • expose them consistently
  • let BI tools consume them rather than own them

dbt Semantic Layer and MetricFlow are not the end state for every team, and not every integration path is equally mature yet. But the core idea is already strong enough to adopt now.

Even in the weakest case, you get better documentation, better governance, and cleaner shared definitions.

In the stronger case, you get a true semantic contract across dashboards, APIs, external consumers, and AI agents.

That is a meaningful architectural step forward.

And unlike many “future of data” ideas, this one is not speculative. It is implementable today.