Friday, April 24, 2009

SELECT DISTINCT with ORDER BY

I recently wrote a query in MySQL that didn't seem to be returning the right results, and at first I couldn't figure out why. Here is a toy example, where we are tracking pages and page views (one-to-many relationship):


create table page (
    page_id integer unsigned primary key,
    name varchar(32) not null,
    created datetime not null
) engine=InnoDB;

create table page_view (
    page_view_id integer unsigned primary key,
    page_id integer unsigned not null,
    created datetime not null,
    
    foreign key (page_id) references page (page_id) on delete cascade
) engine=InnoDB;


What I want to get is the most recently viewed pages. Let's say I have the following data in my tables:


mysql> select * from page;
+---------+--------+---------------------+
| page_id | name   | created             |
+---------+--------+---------------------+
|       1 | page 1 | 2000-01-01 00:00:00 |
|       2 | page 2 | 2000-01-02 00:00:00 |
|       3 | page 3 | 2000-01-03 00:00:00 |
|       4 | page 4 | 2000-01-04 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)

mysql> select * from page_view;
+--------------+---------+---------------------+
| page_view_id | page_id | created             |
+--------------+---------+---------------------+
|            1 |       3 | 2000-01-01 00:00:00 |
|            2 |       1 | 2000-01-02 00:00:00 |
|            3 |       1 | 2000-01-03 00:00:00 |
|            4 |       3 | 2000-01-04 00:00:00 |
|            5 |       2 | 2000-01-05 00:00:00 |
|            6 |       4 | 2000-01-06 00:00:00 |
|            7 |       2 | 2000-01-07 00:00:00 |
+--------------+---------+---------------------+
7 rows in set (0.00 sec)


What I want to get back is page 2 (most recently viewed), then page 4, then page 3, then page 1.

So I write my query:


mysql> select distinct p.page_id, p.name, p.created from page p join page_view pv on p.page_id = pv.page_id order by pv.created desc;
+---------+--------+---------------------+
| page_id | name   | created             |
+---------+--------+---------------------+
|       4 | page 4 | 2000-01-04 00:00:00 |
|       2 | page 2 | 2000-01-02 00:00:00 |
|       1 | page 1 | 2000-01-01 00:00:00 |
|       3 | page 3 | 2000-01-03 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)


That's not right at all! What's going on?

The problem is that I'm using distinct just on the page table, but ordering by the page_view table. Since there is a many-to-one, what is the database supposed to do when a page has multiple views? which view should it use for the order by?

What I wanted the query to do is first join, then order, then apply the distinct. That's not what MySQL does, though. It first joins, then applies the distinct, then orders the results (or something like that). You can think of it like MySQL going sequentially through the page_view table, finding rows with distinct page ids. So it would pick rows 1,2,5,6:


+--------------+---------+---------------------+
| page_view_id | page_id | created             |
+--------------+---------+---------------------+
|            1 |       3 | 2000-01-01 00:00:00 |
|            2 |       1 | 2000-01-02 00:00:00 |
|            5 |       2 | 2000-01-05 00:00:00 |
|            6 |       4 | 2000-01-06 00:00:00 |
+--------------+---------+---------------------+
4 rows in set (0.00 sec)


You can see that if you order those by created, you get the page order that the (badly written) query returned (4,2,1,3).

We can force MySQL to do things in the order we want by changing the query to:


mysql> select distinct p.page_id, p.name, p.created from (select p.page_id, p.name, p.created from page p join page_view pv on p.page_id = pv.page_id order by pv.created desc) as p;
+---------+--------+---------------------+
| page_id | name   | created             |
+---------+--------+---------------------+
|       2 | page 2 | 2000-01-02 00:00:00 |
|       4 | page 4 | 2000-01-04 00:00:00 |
|       3 | page 3 | 2000-01-03 00:00:00 |
|       1 | page 1 | 2000-01-01 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)


But I think that's kind of a hack, and depends on MySQL doing the distinct in a certain order (I don't think order by in a subquery is standard sql, and shouldn't necessarily constraint the order of the entire query). So what's the "right" way to write this type of query?

Before I tackled that, I thought, "What would a strict database like PostgreSQL do with this type of query?" My hope was that it would throw it out altogether. And it does. Here's what I get:


postgres=# select distinct t1.id, t1.name, t1.created from table1 t1 join table2 t2 on t1.id = t2.table1_id order by t2.created desc;
ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list


That's much better, and the error message is very helpful, and makes sense. So here's the query I came up with that will give the correct results, and is correct SQL, in MySQL...:


mysql> select p.page_id, p.name, p.created from page p join (select page_id, max(created) as created from page_view group by page_id) v on p.page_id = v.page_id order by v.created desc;
+---------+--------+---------------------+
| page_id | name   | created             |
+---------+--------+---------------------+
|       2 | page 2 | 2000-01-02 00:00:00 |
|       4 | page 4 | 2000-01-04 00:00:00 |
|       3 | page 3 | 2000-01-03 00:00:00 |
|       1 | page 1 | 2000-01-01 00:00:00 |
+---------+--------+---------------------+
4 rows in set (0.00 sec)

...and in PostgreSQL:

postgres=# select p.page_id, p.name, p.created from page p join (select page_id, max(created) as created from page_view group by page_id) v on p.page_id = v.page_id order by v.created desc;
page_id |  name  |       created       
---------+--------+---------------------
       2 | page 2 | 2000-01-02 00:00:00
       4 | page 4 | 2000-01-04 00:00:00
       3 | page 3 | 2000-01-03 00:00:00
       1 | page 1 | 2000-01-01 00:00:00
(4 rows)


Is there a better performing query out there to do the same thing? I'd love to know, please leave a comment! :)

No comments:

Post a Comment