Go home
Menu

The Snowflake design choices that make downstream models easier to trust

Snowflake models stay easier to trust when raw landing absorbs source drift, staged tables normalize names and types, and curated models keep row meaning, joins, and recovery boundaries explicit.

· 6 min read by Berhan Turkkaynagi

A Snowflake model gets hard to trust when one source change has no obvious place to stop.

A feed sends quantity as text, a timestamp arrives in a new format, a nested location field appears, and suddenly every downstream join has to decide what orders means again.

I do not use raw, stage, and curated layers as warehouse ceremony. I use them to isolate change: raw preserves source fidelity, stage normalizes names and types, and curated locks row meaning plus safe joins.

Problem

A broken query is only the visible symptom.

The deeper problem is a warehouse where nobody can say which layer owns source fidelity, which layer owns type cleanup, or which table a planner should actually trust.

That blur turns one source change into extra joins, wider backfills, and longer incident notes.

If raw, stage, and curated objects all carry half-cleaned business logic, every upstream drift leaks further than it should.

The source contract still matters, so I still want the source table expectations written down first.

I also want the curated model grain stated explicitly.

Both disciplines get easier to sustain when Snowflake itself has clear layer jobs.

Default approach

  • Keep raw landing in an explicit raw schema and keep it close to source shape so replay, audit, and file-level evolution stay possible.
  • Use staged tables to normalize names, cast strings into typed timestamps and numbers, and flatten the predictable parts of semi-structured payloads.
  • Use curated models to freeze business grain, safe joins, and reader-facing measures so downstream users do not reason about source drift directly.
  • Make the layer obvious from the fully qualified name. If orders exists in three places, I want analytics.raw.order_events, analytics.stage.order_lines_typed, and analytics.curated.fct_order_lines.
  • Declare keys and relationships as metadata on trusted tables when they help humans and BI tools understand the join path, while staying honest that Snowflake treats primary-key and foreign-key constraints on standard tables as informational metadata rather than enforced integrity (table design guidance).
  • Treat transient storage, schema evolution, and clustering as bounded tools, not the default trust pattern.

Example

Imagine an order-and-inventory feed lands in Snowflake every hour.

One morning the upstream export changes in three ways:

Source change
- quantity: "12" instead of 12
- event_timestamp: "02/17/2026 05:14:08 -0400" instead of ISO 8601
- location.attributes.zone: new nested attribute

If that change leaks straight into the trusted model, three downstream problems show up at once.

  • Quantity math becomes less safe.
  • Timestamp filters and backfill windows get ambiguous.
  • Location joins start depending on a nested payload shape instead of a stable typed column.

This is the smallest boundary note I want instead:

FieldValue
RAWanalytics.raw.order_events keep landed payload in VARIANT preserve source naming for replay and audit allow schema evolution here only for controlled file loads, with `ENABLE_SCHEMA_EVOLUTION`, `MATCH_BY_COLUMN_NAME`, and the table boundary documented
STAGEanalytics.stage.order_lines_typed expose order_id, order_line_id, sku_id, location_id quantity NUMBER(38,0) event_ts TIMESTAMP_TZ location_zone VARCHAR
CURATEDanalytics.curated.fct_order_lines grain: one row per order_line_id safe joins: dim_dates on order_date, dim_locations on location_id lifecycle: permanent unless recovery boundaries are documented otherwise

The staged model is where I want the ugly conversion work:

select
  payload:order_id::varchar as order_id,
  payload:line_id::varchar as order_line_id,
  try_to_number(payload:quantity) as quantity,
  try_to_timestamp_tz(
    payload:event_timestamp::varchar,
    'MM/DD/YYYY HH24:MI:SS TZHTZM'
  ) as event_ts,
  payload:location:id::varchar as location_id,
  payload:location:attributes:zone::varchar as location_zone
from analytics.raw.order_events;

That is the stage boundary doing its job.

Raw stays faithful to the source. Stage holds the typing and flattening work. Curated facts do not need to reinterpret the payload every time the feed drifts.

If try_to_number(payload:quantity) starts returning NULL, I want that failure to surface in stage, not inside a curated fact with string-shaped quantity logic.

I want the same boundary for timestamps. If the feed is not ISO 8601, stage should parse it with an explicit format instead of relying on session settings.

On the curated side, I still want row meaning frozen. fct_order_lines stays one row per order_line_id. fct_inventory_snapshots stays at its own declared grain.

A backfill or incident note should point to the staged normalization boundary, not force both facts to reinterpret raw payloads on the fly.

I also declare the trusted join path there. On standard Snowflake tables, that metadata does not enforce integrity, but it still makes joins to dates, locations, and inventory snapshots easier to review.

I keep the storage boundary explicit too.

If the raw landing table is reconstructable from external files, I might accept transient storage there. I do not make the same default for curated facts or dimensions.

Snowflake documents that transient tables have no Fail-safe, so I only use them where loss is acceptable or reconstruction is already documented (temporary and transient tables).

I keep the same boundary on schema evolution. If I enable ENABLE_SCHEMA_EVOLUTION, I want it on the raw file-load table that is supposed to absorb column additions. Snowflake limits automatic schema evolution to file loads and Snowpipe with MATCH_BY_COLUMN_NAME; it can add columns and drop NOT NULL when new files omit a field, which is exactly why I keep it out of curated facts (schema evolution docs).

I keep clustering just as narrow.

Snowflake already micro-partitions automatically, and its table design guidance says clustering is unnecessary for most tables and usually only worth revisiting when large tables spend real time scanning on a query path that differs from load order (table design guidance).

Until then, clear layer ownership earns more trust than premature storage tuning.

Tradeoffs

  • Breaks when: raw starts carrying business logic because the team is afraid to create another table → Mitigation: keep raw faithful to source shape and move typed cleanup plus reusable keys into stage.
  • Breaks when: staged models leave predictable timestamps, numbers, or join attributes trapped inside VARIANT → Mitigation: flatten and type the fields the team actually filters, joins, or backfills against.
  • Breaks when: I pretend Snowflake constraints are enforcing integrity on standard tables → Mitigation: use primary and foreign keys as metadata for legibility and tooling, then keep the actual trust checks in model logic and tests.
  • Breaks when: automatic schema evolution in file loads becomes permission to let curated models drift silently → Mitigation: let raw absorb evolving files, but require staged and curated changes to stay deliberate and reviewable.
  • Breaks when: transient tables become the default for long-lived trusted models → Mitigation: reserve transient storage for scratch or reconstructable layers and keep trusted curated models on the safer lifecycle.
  • Breaks when: I treat clustering as part of the base design pattern → Mitigation: keep it as a late optimization for large scan-heavy tables instead of mixing it into the minimum trust boundary.

Close

Next step: For one source change your team has already seen in Snowflake, write where the change should stop: raw replay, staged typing, curated grain, or recovery boundary.

The boundary review is useful when it explains why raw, stage, and curated layers disagree before trust erodes downstream.