Every important model needs an explicit grain

A model can look clean in SQL and still be unsafe to use. If nobody wrote down what one row means, the first many-to-many join turns a reasonable table into an argument.

I treat grain as a trust boundary, not a documentation chore. Before I trust a revenue sum or a dashboard card, I want one line that says what the row represents, which key should be unique, and which joins preserve that meaning.

If the row meaning is fuzzy here, the definition work that stops metric drift across dashboards starts from the wrong model. If the source rules are fuzzy too, I start earlier with a minimum data contract: expected keys, lateness window, and the rule for bad records.

Problem

Imagine fct_customer_orders is meant to hold one row per completed order_id. A dashboard request comes in for revenue by marketing channel, so someone joins that model directly to fct_sessions on customer_id.

The query runs. The chart even looks plausible. But one customer can have several sessions before one order, so booked revenue gets repeated across joined rows. The SQL is only part of the problem. The deeper issue is that the team cannot say, in one sentence, what the model is supposed to preserve.

Default approach

Write the grain in one line: one row equals which business entity, at which time boundary, and what state or event qualifies it.
Name the key or key combination that should be unique at that grain, and what should happen when duplicates appear.
List the joins that preserve the grain and the ones that require reshaping or pre-aggregation first.
Mark which measures are safe to sum, count, or average from the model.
Add one test that fails when the declared grain starts duplicating.

That is enough context to block most accidental many-to-many joins before they ship.

Example

This is the lightweight grain note I would want next to a business-critical order model:

Field	Value
Model	fct_customer_orders
Declared grain	one row = one completed order_id
Expected unique key	order_id
Safe joins	dim_customers on customer_id (many orders to one customer) dim_dates on order_date (many orders to one date)
Unsafe without reshaping first	fct_sessions on customer_id ad_clicks on customer_id
Safe measures from this model	sum(revenue_usd) count(order_id) avg(order_value_usd)
First grain test	fail if count(*) != count(distinct order_id)

Now imagine the data looks like this:

fct_customer_orders

order_id	customer_id	revenue_usd
1001	C42	120
1002	C77	80

fct_sessions

session_id	customer_id	channel
s1	C42	paid
s2	C42	email
s3	C42	organic
s4	C77	paid

A direct join can look innocent:

select
  s.channel,
  sum(o.revenue_usd) as revenue_usd
from fct_customer_orders o
join fct_sessions s
  on o.customer_id = s.customer_id
group by 1

If I run that, order 1001 shows up three times because customer C42 had three sessions. Revenue becomes 440 instead of 200.

The safer move is to reshape the session side to order grain before I bring it onto the order model. I pick one attribution rule, build a helper model with one row per order_id, and join that helper back to orders. Once both sides share grain, the revenue sum is safe again.

If row counts stay roughly stable while duplicates sneak into the declared key, I fall back to the check set I use before I trust a pipeline. Stable volume does not protect me from grain drift.

Tradeoffs

Breaks when: a legacy model already mixes order, order-line, and session logic in one table → Mitigation: write the current grain note first, mark unsafe measures clearly, and split the highest-risk use case into a cleaner model before attempting a full rewrite.
Breaks when: source keys are unstable and duplicates are normal upstream behavior → Mitigation: quarantine raw duplicates, state that the curated model is not yet authoritative, and tighten the source rule before downstream teams start summing it.
Breaks when: one business question genuinely needs two grains, such as order conversion and session behavior in the same analysis → Mitigation: keep separate models for each grain and join only after pre-aggregating both sides to the decision grain.
Breaks when: code review checks SQL syntax and row counts but never checks key duplication → Mitigation: add one uniqueness test on the declared grain and treat failures there as trust failures, not minor cleanup.

Close

Next step: Pick one business-critical model and write down the grain, expected key, and one join you would block before the next pull request touches it.

Which model on your team is one quiet rename away from a mixed-grain row that joins still trust?

Every important model needs an explicit grain

Problem

Default approach

Example

Tradeoffs

Close

Continue reading

The definition card I use to stop metric drift across dashboards

Row counts are not enough: the checks I add before I trust a pipeline