Distribution of a Variable: How to Find the Good, the Bad, and the Ugly?

by BerhanJanuary 22, 2023September 4, 2022 5 min read

As data analysts and data engineer, one of the most important things we do is understand our data. Many times we’ll want to simply understand how our data looks and other times we’ll want to understand how a specific variable or set of variables are distributed in the larger data set. This latter is often tricky because many times we’re dealing with high cardinality and it’s difficult to get good intuition on what’s happening.

The best way to get a handle on this is to use a technique called binning. Binning allows us to explicitly define the bins we want to create. Once we create them, we can calculate statistics for each bin and get a really good sense of what’s happening in the data.

Count values

SELECT unanswered_count, count(*)
FROM stackoverflow
WHERE tag='azure-blob'
GROUP BY unanswered_count
ORDER BY unanswered_count;

For columns with a small number of discrete values, we can view the distribution by counting the number of observations with each distinct value. We group by, and order the results by, the column of interest.

unanswered_count  | count
------------------+-------
               37 | 12
               38 | 40
...
               43 | 10
               44 | 8
               45 | 17
               46 | 4
               47 | 1
...
               54 | 131
               55 | 34
               56 | 1
(20 rows)

There are 20 distinct values in the unanswered_count column in the stackoverflow data with the tag azure-blob. Only partial results are shown here.

Twenty values are manageable to examine, but when the variable you’re interested in takes on many different values, binning or grouping the values can make the output more useful.

Truncate

One way to do this is with the trunc function. Trunc is short for truncate. The trunc function reduces the precision of a number. This means replacing the smallest numeric places – the right-most digits – with zeros. Truncating is not the same as rounding: you’ll never get a result with a larger absolute value than the original number. Trunc takes two arguments: the value to truncate and the number of places to truncate it to. Positive values for the second argument indicate the number of digits after the decimal to keep.

SELECT trunc(24.1256, 2);

For example, truncating 24-point-1256 to 2 places keeps only the first two digits after the decimal. Negative values for the second argument indicate places before the decimal to replace with zero.

24.12

SELECT trunc(12345,-2);

For example, truncating 12,345 to -2 replaces the two digits to the left of the decimal with zero.

Truncating and grouping

We can use the trunc function to group values in the unanswered_count column into three groups based on the digit in the tens place of the number.

SELECT trunc(unanswered_count, -1) AS trunc_ua, count(*)
FROM stackoverflow
WHERE tag='azure_blob'
GROUP BY trunc_ua -- column alias
ORDER BY trunc_ua; -- column alias

Note that the second argument to the trunc function here is a -1. There are 74 values between 30 and 39.

trunc_ua  | count
----------+-------
       30 | 74
       40 | 194
       50 | 480
(3 rows)

Generate series

What if you want to group values by a quantity other than the place value of a number, such as by units of 5 or 20?

SELECT generate_series(start, end, step);

The generate_series function can help. It generates a series of numbers from a starting value to an ending value, inclusive, by steps of a third value.

SELECT generate_series(1, 10, 2);

For example, we can generate a series from 1 to 10 by steps of 2, or a series from 0 to 1 by steps of 1/10th.

generate_series
-----------------
               1
               3
               5
               7
               9
(5 rows)

SELECT generate_series(0, 1, .1);

generate_series
-----------------
               0
             0.1
             0.2
             0.3
             0.4
             0.5
             0.6
             0.7
             0.8
             0.9
             1.0
(11 rows)

Create bins

generate_series can be used to group values into bins. Here’s an example of what we want to create: a series of lower and upper values, and the count of the number of observations falling in each bin.

Output:

  lower | upper | count
-------+-------+-------
   30  |   35  |     0 
   35  |   40  |    74 
   40  |   45  |   155 
   45  |   50  |    39 
   50  |   55  |   445 
   55  |   60  |    35 
   60  |   65  |     0
(7 rows)

Create bins

Let’s build the query to create that output. A WITH clause allows us to alias the results of a subquery to use later in the query. Here, we generate two series: one for the lower bounds of the bins and another for the upper. We name this “bins.”

Query:

-- Create bins
WITH bins AS (
       SELECT generate_series(30, 60, 5) AS lower
              generate_series(35, 65, 5) AS upper),
      -- Subset data to tag of interest
      blob AS (
       SELECT unanswered_count
       FROM stackoverflow 
       WHERE tag='azure-blob')
      -- Count values in each bin
       SELECT lower, upper, count(unanswered_count)
      -- left join keeps all bins
       FROM bins
       LEFT JOIN blob
       ON unanswered_count >= lower
       AND unanswered_count < upper
      -- Group by bin bounds to create the groups
       GROUP BY lower, upper
       ORDER BY lower;

Because we’re only summarizing data for tag azure-blob, we also create that subset of the stackoverflow table and call it blob.

Then write the main select query to join the results of the subqueries we created and count the values. We join blob to bins where the column unanswered_count is greater than or equal to the lower bound and strictly less than the upper bound. A left join keeps all bins in the result, even those with no values in them.

Finally, group by the lower and upper bin values to count the values in each bin.

Each row in the output has the count of days where the number of unanswered questions was greater than or equal to the lower bound and strictly less than the upper bound. Note that the result contains bins with 0 values. This is because we counted non-null values of unanswered_count instead of just the number of rows.

Create bins: output

  lower | upper | count
-------+-------+-------
   30  |   35  |     0 
   35  |   40  |    74 
   40  |   45  |   155 
   45  |   50  |    39 
   50  |   55  |   445 
   55  |   60  |    35 
   60  |   65  |     0
(7 rows)