Analyzing data using buckets in a SQL database

Analyzing data using buckets in a SQL database

by Mike Sherrill

Analyzing data often involves sorting it into “buckets”, where each bucket represents a specific time interval. One bucket might represent one day, one hour, or one minute.

Some platforms can generate tables of buckets on the fly. PostgreSQL’s function generate_series() returns a set of integers or timestamps. Even so, it can be useful to build a base table that can be populated once, then used in any query that covers the same time period.

Assume that users will need several tables of buckets. Choose table names and view names so they’ll group and sort sensibly. If you have many of these, you might want to isolate them in their own schema.

The tables’ names makes it easy to extend the system to support intervals of any number of hours. For example, use “buckets_hour_02” for two-hour intervals, “buckets_minute_15” for 15-minute intervals, etc. Use “buckets_hour_01_2013” for a view that returns one-hour buckets for only the year 2013.

A table for one-hour buckets

The following SQL statement creates a table for one-hour buckets. Pay particular attention to the constraints.

The constraints

The primary key constraint guarantees that “bucket_start” will be unique. One check constraint guarantees that the minutes will be zero, and another guarantees that the seconds will be zero. Taken together, these two check constraints guarantee that “bucket_start” can only contain a value that’s an even hour. A final check constraint guarantees that “bucket_end” can only be one hour later than “bucket_start”.

There’s no need to declare “bucket_end” to be unique. The combination of the primary key constraint and the check constraints guarantee that it will be unique.

create table buckets_hour_01 (
  bucket_start timestamp not null,
  bucket_end timestamp not null,
  primary key (bucket_start),

  check (extract(minute from bucket_start) = 0),

  -- Seconds includes fractional seconds. The range is 0-60, not 
  -- 0-59. Leap seconds are a real thing.
  -- 
  -- Ref: http://www.postgresql.org/docs/9.3/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
  check (extract(second from bucket_start) = 0),
  check (bucket_end = bucket_start + interval '1 hour')
);

create index on buckets_hour_01 (bucket_start, bucket_end);

The missing constraint

Ideally, this table would also have a “no gaps” constraint. The relational model would call that a database constraint. I don’t think any currently shipping SQL database management system implements database constraints. Practice defensive programming.

  • Tightly control privileges for insert, update, and delete statements.
  • Run a periodic job that verifies the number of rows.
  • Run a periodic job that verifies the minimum and maximum values.

For example, using PostgreSQL, you can run this query to verify there are no gaps in a table that should have rows from 1 to 1,000,000.

select count(*) row_count
     , max(n) as max_n
     , min(n) as min_n
     , count(*) = max(n) - min(n) + 1 is_good_count
from integers;

Output looks like this.

row_count  max_n    min_n  is_good_count
--
1000000    1000000  1      t

When there are no gaps,

  • the column min_n will contain the integer 1,
  • the column max_n will contain the integer 1000000,
  • and the column “is_good_count” will contain the Boolean value “t”.

There are many ways to automate this kind of test.

One-hour buckets for ten years

This SQL statement generates ten years of one-hour buckets—less than 100,000 rows. Follow the instructions in the code’s comments to change the range of years and to insert rows rather than just generating them. On my modest development computer, generating and inserting these rows takes just over one second.

with years as (
  -- Change these years as needed.
  select 2014 as start_year, 2024 as end_year
)
, date_range as (
  select 
    cast(((select start_year from years) || '-01-01 00:00:00') as timestamp) as start_time,
    cast(((select end_year   from years) || '-12-31 24:00:00') as timestamp) as end_time
)
-- Uncomment the following line to insert rows into the table.
-- insert into buckets_hour_01 (bucket_start, bucket_end)
select 
  generate_series((select start_time from date_range), 
                  (select end_time - interval '1' hour from date_range), 
                  interval '1 hour') as bucket_start,
  generate_series((select start_time + interval '1' hour from date_range), 
                  (select end_time from date_range), 
                  interval '1 hour') bucket_end
;

Views to simplify reporting

You might find it useful to create one view per year to simplify other queries. Use inequalities rather than the SQL BETWEEN operator. The BETWEEN operator includes both endpoints. Using inequalities avoids having to guess about microsecond support and leap seconds.

create view buckets_hour_01_2014 as
select bucket_start, bucket_end
from buckets_hour_01
where bucket_start >= '2014-01-01'
  and bucket_start <  '2015-01-01';

create view buckets_hour_01_2015 as
select bucket_start, bucket_end
from buckets_hour_01
where bucket_start >= '2015-01-01'
  and bucket_start <  '2016-01-01';

A table for one-minute buckets

create table buckets_minute_01 (
  bucket_start timestamp not null,
  bucket_end timestamp not null,
  primary key (bucket_start),

  -- Seconds includes fractional seconds. The range is 0-60, not 
  -- 0-59. Leap seconds are a real thing.
  -- 
  -- Ref: http://www.postgresql.org/docs/9.3/static/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
  check (extract(second from bucket_start) = 0),
  check (bucket_end = bucket_start + interval '1' minute)
);

One-minute buckets for a ten years

Each year contains over half a million one-minute buckets. My development computer takes less than five seconds to insert half a million rows. The code is very similar to the code above.

with years as (
  -- Change these years as needed.
  select 2014 as start_year, 2024 as end_year
)
, date_range as (
  select 
    cast(((select start_year from years) || '-01-01 00:00:00') as timestamp) as start_time,
    cast(((select end_year   from years) || '-12-31 24:00:00') as timestamp) as end_time
)
-- Uncomment the following line to insert rows into the table.
-- insert into buckets_minute_01 (bucket_start, bucket_end)
select 
  generate_series((select start_time from date_range), 
                  (select end_time - interval '1' minute from date_range), 
                  interval '1 minute') as bucket_start,
  generate_series((select start_time + interval '1' minute from date_range), 
                  (select end_time from date_range), 
                  interval '1 minute') bucket_end
;

Some data to analyze

Here I’ve pasted a couple of hundred timestamps into a common table expression (CTE). Of course, you’d usually store these in a base table. After the CTE, you’ll find a query that counts them by the hour, using the “buckets_hour_01” table I built earlier, along with sample output.

with logfile as (
  select timestamp '2013-01-13 10:16:41' as ts union all
  select timestamp '2013-01-13 12:20:42' as ts union all
  select timestamp '2013-01-13 12:20:42' as ts union all
  select timestamp '2013-01-13 12:20:42' as ts union all
  select timestamp '2013-01-13 12:20:38' as ts union all
  select timestamp '2013-01-13 12:20:42' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 12:20:43' as ts union all
  select timestamp '2013-01-13 14:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:10' as ts union all
  select timestamp '2013-01-13 13:21:13' as ts union all
  select timestamp '2013-01-13 13:21:13' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:14' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:17' as ts union all
  select timestamp '2013-01-13 13:21:23' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 13:21:40' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:19' as ts union all
  select timestamp '2013-01-13 14:24:24' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:40' as ts union all
  select timestamp '2013-01-13 14:24:46' as ts union all
  select timestamp '2013-01-13 15:09:43' as ts union all
  select timestamp '2013-01-13 15:09:58' as ts union all
  select timestamp '2013-01-13 15:09:58' as ts union all
  select timestamp '2013-01-13 15:09:58' as ts union all
  select timestamp '2013-01-13 15:09:58' as ts union all
  select timestamp '2013-01-13 15:09:58' as ts union all
  select timestamp '2013-01-13 15:09:58' as ts union all
  select timestamp '2013-01-13 15:09:59' as ts union all
  select timestamp '2013-01-13 15:09:59' as ts union all
  select timestamp '2013-01-13 15:09:59' as ts union all
  select timestamp '2013-01-13 15:09:59' as ts union all
  select timestamp '2013-01-13 15:09:59' as ts union all
  select timestamp '2013-01-13 15:09:59' as ts union all
  select timestamp '2013-01-13 15:09:59' as ts union all
  select timestamp '2013-01-13 15:09:59' as ts 
)

select bucket_start, bucket_end
     , count(t.ts)
from buckets_hour_01
left join logfile t
  on t.ts >= bucket_start
 and t.ts < bucket_end
where bucket_start >= '2013-01-13'
  and bucket_start <  '2014-01-01'
group by bucket_start, bucket_end
order by bucket_start;

bucket_start          bucket_end           count
--
[snip]
2013-01-13 10:00:00   2013-01-13 11:00:00    1
2013-01-13 11:00:00   2013-01-13 12:00:00    0
2013-01-13 12:00:00   2013-01-13 13:00:00   18
2013-01-13 13:00:00   2013-01-13 14:00:00  121
2013-01-13 14:00:00   2013-01-13 15:00:00   88
2013-01-13 15:00:00   2013-01-13 16:00:00   15
2013-01-13 16:00:00   2013-01-13 17:00:00    0
2013-01-13 17:00:00   2013-01-13 18:00:00    0
[snip]

About the author

Mike Sherrill is a SQL database developer, database administrator, software developer, and web developer living and working in the Midwestern United States. He has clients spanning the globe, kind of like ABC’s Wide World of Sports. Like the athletes, he regularly experiences both the thrill of victory and the agony of defeat.