In this post, we’re going to look at a techinque for finding the lastest full record for each member of a group.
Now, it might not be obvious why this is an annoying problem, but I see people back into having this problem from two different directions.
First, you know how to find the MAX
date for each member in a group:
metric | latest_metric
--------+------------------------
A | 2019-04-30 00:00:00-04
B | 2019-05-08 00:00:00-04
C | 2019-05-08 00:00:00-04
D | 2019-05-08 00:00:00-04
E | 2019-05-08 00:00:00-04
F | 2019-05-08 00:00:00-04
(6 rows)
But you need additional column data from the same row as that most recent date.
Or…
You found the whole record for one member in the set:
metric | date | value
--------+------------------------+-------
A | 2019-04-30 00:00:00-04 | 889
(1 row)
But you need to do the same thing for all the metrics too.
To solve this kind of problem, we can use the ROW_NUMBER
function to sort our data and reference it as a sorted sub-sets.
Let’s break down the pieces that get us to a ROW_NUMBER
solution.
Here is our metrics table:
Table "public.metrics"
Column | Type | Collation | Nullable | Default
--------+--------------------------+-----------+----------+---------
date | timestamp with time zone | | |
metric | text | | |
value | double precision | | |
Start by finding the most recent date for each metric:
SELECT
metric,
max(date) as latest_metric
FROM
metrics
GROUP BY 1
ORDER BY 1;
metric | latest_metric
--------+------------------------
A | 2019-04-30 00:00:00-04
B | 2019-05-08 00:00:00-04
C | 2019-05-08 00:00:00-04
D | 2019-05-08 00:00:00-04
E | 2019-05-08 00:00:00-04
F | 2019-05-08 00:00:00-04
(6 rows)
Again, this gets us the latest date for each metric, but we don’t really have access to the row this data is from. That’s what we need if we want to pull additional column values.
If you only have to answer the question for a single metric, it’s easy:
SELECT
metric,
date,
value
FROM
metrics
WHERE metric = 'A'
ORDER BY date DESC
LIMIT 1;
metric | date | value
--------+------------------------+-------
A | 2019-04-30 00:00:00-04 | 889
(1 row)
If this is code that lives inside an application that is just looking up the latest data for a specific metric, you’re fine with the ORDER BY
and LIMIT
solution. But any time you need to jump from a solution that works for one member of the group to all members of the group, you need a window-based solution to do it in a single query.
Finding Records, Not Just Values
In SQL, window functions allow you to perform operations over subsets of rows by defining a window on the rows. ROW_NUMBER
is a special window function that will number the rows inside your window.
So, if we take all the metric ‘A’ rows and add a ROW_NUMBER
column:
SELECT
metric,
date,
value,
ROW_NUMBER() OVER(PARTITION BY metric ORDER BY date DESC) AS rn
FROM
metrics
WHERE metric = 'A'
ORDER BY date DESC;
The numbered rows look like this:
metric | date | value | rn
--------+------------------------+-------+----
A | 2019-04-30 00:00:00-04 | 889 | 1
A | 2019-04-13 00:00:00-04 | 156 | 2
A | 2019-04-10 00:00:00-04 | 274 | 3
A | 2019-04-02 00:00:00-04 | 812 | 4
A | 2019-03-31 00:00:00-04 | 904 | 5
A | 2019-03-29 00:00:00-04 | 611 | 6
A | 2019-03-27 00:00:00-04 | 761 | 7
A | 2019-03-23 00:00:00-04 | 209 | 8
A | 2019-03-06 00:00:00-05 | 231 | 9
A | 2019-03-04 00:00:00-05 | 82 | 10
A | 2019-02-27 00:00:00-05 | 138 | 11
A | 2019-02-25 00:00:00-05 | 537 | 12
A | 2019-02-19 00:00:00-05 | 228 | 13
A | 2019-02-12 00:00:00-05 | 581 | 14
A | 2019-02-11 00:00:00-05 | 657 | 15
A | 2019-02-10 00:00:00-05 | 9 | 16
A | 2019-02-05 00:00:00-05 | 965 | 17
A | 2019-02-04 00:00:00-05 | 826 | 18
A | 2019-01-31 00:00:00-05 | 878 | 19
A | 2019-01-28 00:00:00-05 | 729 | 20
A | 2019-01-26 00:00:00-05 | 168 | 21
A | 2019-01-22 00:00:00-05 | 797 | 22
A | 2019-01-18 00:00:00-05 | 618 | 23
(23 rows)
Without the WHERE metric = 'A'
filter, you can see that the window function repeats the numbering for each metric:
metric | date | value | rn
--------+------------------------+-------+----
A | 2019-04-30 00:00:00-04 | 889 | 1
A | 2019-04-13 00:00:00-04 | 156 | 2
A | 2019-04-10 00:00:00-04 | 274 | 3
B | 2019-05-08 00:00:00-04 | 954 | 1
B | 2019-04-10 00:00:00-04 | 639 | 2
B | 2019-03-31 00:00:00-04 | 989 | 3
C | 2019-05-08 00:00:00-04 | 594 | 1
C | 2019-04-30 00:00:00-04 | 28 | 2
C | 2019-04-13 00:00:00-04 | 61 | 3
D | 2019-05-08 00:00:00-04 | 160 | 1
D | 2019-04-30 00:00:00-04 | 432 | 2
D | 2019-04-02 00:00:00-04 | 102 | 3
E | 2019-05-08 00:00:00-04 | 49 | 1
E | 2019-04-10 00:00:00-04 | 307 | 2
E | 2019-04-02 00:00:00-04 | 674 | 3
F | 2019-05-08 00:00:00-04 | 773 | 1
F | 2019-04-30 00:00:00-04 | 918 | 2
F | 2019-04-13 00:00:00-04 | 799 | 3
(18 rows)
The window definition in our ROW_NUMBER
function has two critical parts:
PARTITION BY
which defines the columns we want to group byORDER BY
which defines how the numbering will be applied to the windowed rows
Because we defined our window as “grouped by” metric and “ordered by” date descending, the most recent date for each metric will always have a value of 1
in the row number column.
With the numbering in place, we can run the numbering query as a subquery and pull out rn = 1
for the final results:
SELECT
metric,
date,
value
FROM (
SELECT
metric,
date,
value,
ROW_NUMBER() OVER(PARTITION BY metric ORDER BY date DESC) AS rn
FROM
metrics
) m
WHERE rn = 1;
Which leaves us with the full record for the most recent date of each metric:
metric | date | value
--------+------------------------+-------
A | 2019-04-30 00:00:00-04 | 889
B | 2019-05-08 00:00:00-04 | 954
C | 2019-05-08 00:00:00-04 | 594
D | 2019-05-08 00:00:00-04 | 160
E | 2019-05-08 00:00:00-04 | 49
F | 2019-05-08 00:00:00-04 | 773
(6 rows)
Share this post
Twitter
Google+
Facebook
Reddit
LinkedIn
StumbleUpon
Pinterest
Email