SQL Quick Tip: Deduping Data with Row Number
Sometimes when you are inspecting data, you come across duplicates that shouldn’t exist. Here’s a an easy way to remove duplicate rows using the ROW_NUMBER function. In a previous post about using ROW_NUMBER to find the latest record for each member in a group, I set up some sample data with this statement: CREATE TABLE metrics AS ( SELECT date, CASE WHEN n > random() * 2 and n < random() * 4 THEN 'A' WHEN n > random() * 2 and n < random() * 4 THEN 'B' WHEN n > random() * 4 and n < random() * 6 THEN 'C' WHEN n > random() * 6 and n < random() * 8 THEN 'D' WHEN n > random() * 8 and n < random() * 10 THEN 'E' ELSE 'F' END as metric, round(random() * 1000) as value FROM generate_series(1,11,1) as n JOIN ( SELECT day + interval '1 day' * round(random()*100) as date FROM generate_series('2019-01-01', '2019-01-31', interval '1 day') day ) d ON true ); While that statement is a great way to get a bunch of random sample data to play with, there is no guarantee that the values or the dates will be unique. In fact, there were several entries for the same metric on the same date: ...