Every important model needs an explicit grain
How I keep models trustworthy with a simple grain note: row meaning, expected key, safe joins, and the first duplication test I write.
A model can look clean in SQL and still be unsafe to use. If nobody wrote down what one row means, the first many-to-many join turns a reasonable table into an argument.
I treat grain as a trust boundary, not a documentation chore. Before I trust a revenue sum or a dashboard card, I want one line that says what the row represents, which key should be unique, and which joins preserve that meaning.
If the row meaning is fuzzy here, the definition work that stops metric drift across dashboards starts from the wrong model. If the source rules are fuzzy too, I start earlier with a minimum data contract: expected keys, lateness window, and the rule for bad records.
Problem
Imagine fct_customer_orders is meant to hold one row per completed order_id. A dashboard request comes in for revenue by marketing channel, so someone joins that model directly to fct_sessions on customer_id.
The query runs. The chart even looks plausible. But one customer can have several sessions before one order, so booked revenue gets repeated across joined rows. The SQL is only part of the problem. The deeper issue is that the team cannot say, in one sentence, what the model is supposed to preserve.
Default approach
- Write the grain in one line: one row equals which business entity, at which time boundary, and what state or event qualifies it.
- Name the key or key combination that should be unique at that grain, and what should happen when duplicates appear.
- List the joins that preserve the grain and the ones that require reshaping or pre-aggregation first.
- Mark which measures are safe to sum, count, or average from the model.
- Add one test that fails when the declared grain starts duplicating.
That is enough context to block most accidental many-to-many joins before they ship.
Example
This is the lightweight grain note I would want next to a business-critical order model:
| Field | Value |
|---|---|
| Model | fct_customer_orders |
| Declared grain | one row = one completed order_id |
| Expected unique key | order_id |
| Safe joins | dim_customers on customer_id (many orders to one customer) dim_dates on order_date (many orders to one date) |
| Unsafe without reshaping first | fct_sessions on customer_id ad_clicks on customer_id |
| Safe measures from this model | sum(revenue_usd) count(order_id) avg(order_value_usd) |
| First grain test | fail if count(*) != count(distinct order_id) |
Now imagine the data looks like this:
fct_customer_orders
| order_id | customer_id | revenue_usd |
|---|---|---|
| 1001 | C42 | 120 |
| 1002 | C77 | 80 |
fct_sessions
| session_id | customer_id | channel |
|---|---|---|
| s1 | C42 | paid |
| s2 | C42 | |
| s3 | C42 | organic |
| s4 | C77 | paid |
A direct join can look innocent:
select
s.channel,
sum(o.revenue_usd) as revenue_usd
from fct_customer_orders o
join fct_sessions s
on o.customer_id = s.customer_id
group by 1
If I run that, order 1001 shows up three times because customer C42 had three sessions. Revenue becomes 440 instead of 200.
The safer move is to reshape the session side to order grain before I bring it onto the order model. I pick one attribution rule, build a helper model with one row per order_id, and join that helper back to orders. Once both sides share grain, the revenue sum is safe again.
If row counts stay roughly stable while duplicates sneak into the declared key, I fall back to the check set I use before I trust a pipeline. Stable volume does not protect me from grain drift.
Tradeoffs
- Breaks when: a legacy model already mixes order, order-line, and session logic in one table → Mitigation: write the current grain note first, mark unsafe measures clearly, and split the highest-risk use case into a cleaner model before attempting a full rewrite.
- Breaks when: source keys are unstable and duplicates are normal upstream behavior → Mitigation: quarantine raw duplicates, state that the curated model is not yet authoritative, and tighten the source rule before downstream teams start summing it.
- Breaks when: one business question genuinely needs two grains, such as order conversion and session behavior in the same analysis → Mitigation: keep separate models for each grain and join only after pre-aggregating both sides to the decision grain.
- Breaks when: code review checks SQL syntax and row counts but never checks key duplication → Mitigation: add one uniqueness test on the declared grain and treat failures there as trust failures, not minor cleanup.
Close
Next step: Pick one business-critical model and write down the grain, expected key, and one join you would block before the next pull request touches it.
Which model on your team is one quiet rename away from a mixed-grain row that joins still trust?