Sometimes when you are inspecting data, you come across duplicates that shouldn’t exist.
Here’s an easy way to remove duplicate rows using the ROW_NUMBER function.
In a previous post about using ROW_NUMBER to find the latest record for each member in a group, I set up some sample data with this statement:
CREATE TABLE metrics AS (
SELECT
date,
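-- each random() call is evaluated independently per row, so even the
-- identical-looking 'A' and 'B' conditions still split rows at random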
CASE WHEN n > random() * 2 AND n < random() * 4 THEN 'A'
WHEN n > random() * 2 AND n < random() * 4 THEN 'B'
WHEN n > random() * 4 AND n < random() * 6 THEN 'C'
WHEN n > random() * 6 AND n < random() * 8 THEN 'D'
WHEN n > random() * 8 AND n < random() * 10 THEN 'E'
ELSE 'F' END as metric,
round(random() * 1000) as value
FROM
generate_series(1,11,1) as n
JOIN (
SELECT
day + interval '1 day' * round(random()*100) as date
FROM generate_series('2019-01-01', '2019-01-31', interval '1 day') day
) d
ON true
);
While that statement is a great way to get a bunch of random sample data to play with, there is no guarantee that the values or the dates will be unique. In fact, there were several entries for the same metric on the same date:
SELECT
metric,
date,
value
FROM
metrics
WHERE metric = 'A' AND date > '2019-04-01'
ORDER BY date DESC;
metric | date | value
--------+------------------------+-------
A | 2019-04-30 00:00:00-04 | 889
A | 2019-04-30 00:00:00-04 | 891
A | 2019-04-13 00:00:00-04 | 156
A | 2019-04-10 00:00:00-04 | 274
A | 2019-04-02 00:00:00-04 | 812
A | 2019-04-02 00:00:00-04 | 303
A | 2019-04-02 00:00:00-04 | 786
(7 rows)
Use Row Number to Count Duplicates
With a small set like our example above, you can see that 4/30/2019 has two entries and 4/2/2019 has three, but in a larger set the duplicates become much harder to spot by eye.
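If you just want a quick list of which pairs are duplicated rather than numbering them, a plain GROUP BY with a HAVING filter (a standard alternative to the ROW_NUMBER approach this post covers) will surface them:
SELECT
metric,
date,
COUNT(*) AS entries
FROM
metrics
GROUP BY metric, date
HAVING COUNT(*) > 1;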
If you want to systematically remove the duplicates, it becomes the same class of problem as the “latest record per member in a group” problem we explored in the other post.
We can use ROW_NUMBER to do the detection for us like this:
SELECT
metric,
date,
value,
ROW_NUMBER() OVER(PARTITION BY metric, date) AS rn
FROM
metrics
WHERE metric = 'A' AND date > '2019-04-01'
ORDER BY date DESC, rn;
metric | date | value | rn
--------+------------------------+-------+----
A | 2019-04-30 00:00:00-04 | 889 | 1
A | 2019-04-30 00:00:00-04 | 891 | 2
A | 2019-04-13 00:00:00-04 | 156 | 1
A | 2019-04-10 00:00:00-04 | 274 | 1
A | 2019-04-02 00:00:00-04 | 812 | 1
A | 2019-04-02 00:00:00-04 | 303 | 2
A | 2019-04-02 00:00:00-04 | 786 | 3
(7 rows)
Notice how we set the partition: ROW_NUMBER() OVER(PARTITION BY metric, date)
By partitioning on the duplicate columns, metric and date, we can now count the number of duplicates in each set. More importantly, as we’ll see below, we now have a way to systematically access one of each of the duplicate sets.
Now, if the source of the duplication were copying the exact same rows into the table more than once, we would probably have to add the value column to the partition as well, as sketched below. We only need to use metric and date here because of the way the data was generated, but the principle is the same: build your partition by adding each column you need to dedupe on.
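For instance, a minimal sketch of that wider partition against the same table (only the PARTITION BY list changes):
SELECT
metric,
date,
value,
ROW_NUMBER() OVER(PARTITION BY metric, date, value) AS rn
FROM
metrics;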
Choose One of Each Duplicate Set
We use the same principle to get our deduped set as we did to take the “latest” value: put the ROW_NUMBER query in a sub-query, then filter for rn = 1 in the outer query, leaving only one record for each duplicated set of rows:
SELECT
metric,
date,
value
FROM (
SELECT
metric,
date,
value,
ROW_NUMBER() OVER(PARTITION BY metric, date) AS rn
FROM
metrics
WHERE metric = 'A' AND date > '2019-04-01'
) m
WHERE rn = 1
ORDER BY date DESC;
metric | date | value
--------+------------------------+-------
A | 2019-04-30 00:00:00-04 | 889
A | 2019-04-13 00:00:00-04 | 156
A | 2019-04-10 00:00:00-04 | 274
A | 2019-04-02 00:00:00-04 | 812
(4 rows)
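If you need to go beyond selecting the deduped set and actually delete the extra rows, here is a minimal sketch, assuming Postgres: this table has no primary key, so the system column ctid stands in as a physical row identifier, and we delete every row whose row number within its duplicate set is greater than 1:
DELETE FROM metrics
WHERE ctid IN (
SELECT ctid
FROM (
SELECT
ctid,
ROW_NUMBER() OVER(PARTITION BY metric, date) AS rn
FROM
metrics
) m
WHERE rn > 1
);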
You can build off this concept in a bunch of different ways, depending on how your data is structured.
For example, we don’t actually have an opinion about which value is correct in this case, so we ignored the value column.
You might have a created_at timestamp on each record that signifies the most recently created record should be used. In that case, add ORDER BY created_at DESC to the OVER clause to make sure you keep the correct row from each duplicate set.
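A sketch of that variation, assuming a hypothetical created_at column on the table:
SELECT
metric,
date,
value
FROM (
SELECT
metric,
date,
value,
-- created_at is hypothetical here; ordering descending makes the newest row rn = 1
ROW_NUMBER() OVER(PARTITION BY metric, date ORDER BY created_at DESC) AS rn
FROM
metrics
) m
WHERE rn = 1
ORDER BY date DESC;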
This strategy for deduping data can come in handy as you’re exploring new tables or when you actually need to find, debug, and correct data transformations. Hope it helps!