How to use GROUP BY taking two columns into account? - mysql

I have a message table like this in MySQL.
+--------------------+--------------+------+-----+---------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| subject | varchar(120) | NO | | NULL | |
| body | longtext | NO | | NULL | |
| sent_at | datetime | YES | | NULL | |
| recipient_read | tinyint(1) | NO | | 0 | |
| recipient_id | int(11) | NO | MUL | 0 | |
| sender_id | int(11) | NO | MUL | 0 | |
| thread_id | int(11) | NO | MUL | 0 | |
+--------------------+--------------+------+-----+---------------------+----------------+
Messages in a recipient's inbox are to be grouped by thread_id like this:
SELECT * FROM message WHERE recipient_id=42 GROUP BY thread_id ORDER BY sent_at DESC
My problem is how to take recipient_read into account so that each row in the result also shows the recipient_read value of the last message in the thread?

In the original query, the ORDER BY is only applied after the GROUP BY operation. The ORDER BY affects the order of the returned rows; it does not influence which rows are returned.
With the non-aggregate expressions in the SELECT list, it is indeterminate which values will be returned; the value of each column will come from some row in the collapsed group. But it's not guaranteed to be the first row, the latest row, or any other specific row. The behavior of MySQL (allowing the query to run without throwing an error) is enabled by a MySQL extension.
Other relational databases would throw a "non-aggregate in SELECT list not in GROUP BY" type error for this query. MySQL exhibits the same (standard) behavior when ONLY_FULL_GROUP_BY is included in the sql_mode system variable. MySQL allows the original query to run (and return unexpected results) only because of a non-standard, MySQL-specific extension.
The pattern of the original query is essentially broken.
To get a resultset that satisfies the specification, we can write a query to get the latest (maximum) sent_at datetime for each thread_id, for a given set of recipient_id (in the example query, the set is a single recipient_id.)
SELECT lm.recipient_id
, lm.thread_id
, MAX(lm.sent_at) AS latest_sent_at
FROM message lm
WHERE lm.recipient_id = 42
GROUP
BY lm.recipient_id
, lm.thread_id
We can use the result from that query in another query by making it an inline view: wrap it in parentheses, reference it in the FROM clause like a table, and assign an alias.
We can join that resultset to the original table to retrieve all of the columns from the rows that match.
Something like this:
SELECT m.id
, m.subject
, m.body
, m.sent_at
, m.recipient_read
, m.recipient_id
, m.sender_id
, m.thread_id
FROM (
SELECT lm.recipient_id
, lm.thread_id
, MAX(lm.sent_at) AS latest_sent_at
FROM message lm
WHERE lm.recipient_id = 42
GROUP
BY lm.recipient_id
, lm.thread_id
) l
JOIN message m
ON m.recipient_id = l.recipient_id
AND m.thread_id = l.thread_id
AND m.sent_at = l.latest_sent_at
ORDER
BY ...
Note that if (recipient_id,thread_id,sent_at) is not guaranteed to be unique, there is the potential for multiple rows with the same "maximum" sent_at; that is, we could get more than one row back for a given thread.
We can order that result however we want, with whatever expressions. That will affect only the order that the rows are returned in, not which rows are returned.
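To see the pattern end-to-end, here is a minimal runnable sketch using SQLite from Python (the sample rows are made up; the query itself is the same shape as the MySQL version above):

```python
import sqlite3

# In-memory database with a minimal version of the hypothetical message table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (
  id INTEGER PRIMARY KEY,
  sent_at TEXT,
  recipient_read INTEGER,
  recipient_id INTEGER,
  thread_id INTEGER
);
INSERT INTO message VALUES
  (1, '2020-01-01 10:00:00', 1, 42, 7),
  (2, '2020-01-02 10:00:00', 0, 42, 7),   -- latest in thread 7
  (3, '2020-01-01 09:00:00', 1, 42, 8);   -- only (and latest) row in thread 8
""")

# Inline view: latest sent_at per (recipient_id, thread_id),
# joined back to the table to pick up the full row, including recipient_read.
rows = conn.execute("""
SELECT m.id, m.thread_id, m.recipient_read
FROM ( SELECT lm.recipient_id, lm.thread_id, MAX(lm.sent_at) AS latest_sent_at
       FROM message lm
       WHERE lm.recipient_id = 42
       GROUP BY lm.recipient_id, lm.thread_id
     ) l
JOIN message m
  ON m.recipient_id = l.recipient_id
 AND m.thread_id = l.thread_id
 AND m.sent_at = l.latest_sent_at
ORDER BY m.sent_at DESC
""").fetchall()

print(rows)   # one row per thread, carrying the latest message's recipient_read
```

Each thread contributes exactly one row, and its recipient_read value comes from the latest message rather than from an arbitrary row in the collapsed group.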

If you want the last message, you want filtering, not aggregation:
SELECT m.*
FROM message m
WHERE m.recipient_id = 42 AND
m.sent_at = (SELECT MAX(m2.sent_at)
FROM message m2
WHERE m2.thread_id = m.thread_id
)
ORDER BY m.sent_at DESC;
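The correlated-subquery form can be exercised the same way; a small SQLite sketch with hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message (
  id INTEGER PRIMARY KEY,
  sent_at TEXT,
  recipient_id INTEGER,
  thread_id INTEGER
);
INSERT INTO message VALUES
  (1, '2020-01-01', 42, 7),
  (2, '2020-01-05', 42, 7),   -- latest in thread 7
  (3, '2020-01-03', 42, 8);   -- latest in thread 8
""")

# Keep only rows whose sent_at equals the maximum within their own thread.
latest = conn.execute("""
SELECT m.id, m.thread_id
FROM message m
WHERE m.recipient_id = 42
  AND m.sent_at = (SELECT MAX(m2.sent_at)
                   FROM message m2
                   WHERE m2.thread_id = m.thread_id)
ORDER BY m.sent_at DESC
""").fetchall()

print(latest)
```

Note that, as with the join form, ties on the maximum sent_at within a thread would yield more than one row per thread.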

Related

SQL Use Result from one Query for another Query

This is an excerpt from one table:
| id | type | other_id | def_id | ref_def_id|
| 1 | int | NULL | 5 | NULL |
| 2 | string | NULL | 5 | NULL |
| 3 | int | NULL | 5 | NULL |
| 20 | ref | 3 | NULL | 5 |
| 21 | ref | 4 | NULL | 5 |
| 22 | ref | 5 | NULL | 5 |
What I want is to find entries with type ref. Then I would for example have this one entry in my result:
| 22 | ref | 5 | NULL | 5 |
The problem I am facing is that I now want to combine this entry with other entries of the same table where def_id = 5.
So I would get all entries with def_id = 5 for this specific ref type as result. I somehow need the output from my first query, check what the ref_def_id is and then make another query for this id.
I am really having trouble understanding how to proceed. Any input is much appreciated.
If I understand correctly, you need to find rows with a type of 'ref' and then use the values in their ref_def_id columns to get the rows with the same values in def_id. In that case you need a subquery to get the rows with 'ref' type, combined using either IN or EXISTS:
select *
from YourTable
where def_id in (select ref_def_id from YourTable where type='ref');
select *
from YourTable
where exists (select * from YourTable yt
where yt.ref_def_id=YourTable.def_id and yt.type='ref')
Both queries are equivalent; IN is easier to understand at first sight, but EXISTS allows more complex conditions (for example, you can use more than one column for combining with the subquery).
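A quick way to convince yourself the two forms agree is to run them against a toy copy of the table; a SQLite sketch (sample rows invented from the excerpt above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE YourTable (
  id INTEGER, type TEXT, other_id INTEGER, def_id INTEGER, ref_def_id INTEGER
);
INSERT INTO YourTable VALUES
  (1,  'int',    NULL, 5,    NULL),
  (2,  'string', NULL, 5,    NULL),
  (22, 'ref',    5,    NULL, 5);
""")

# IN form: collect the ref rows' ref_def_id values, match them against def_id.
in_rows = conn.execute("""
SELECT id FROM YourTable
WHERE def_id IN (SELECT ref_def_id FROM YourTable WHERE type = 'ref')
ORDER BY id
""").fetchall()

# EXISTS form: correlated check for a ref row pointing at this row's def_id.
exists_rows = conn.execute("""
SELECT id FROM YourTable t
WHERE EXISTS (SELECT 1 FROM YourTable yt
              WHERE yt.ref_def_id = t.def_id AND yt.type = 'ref')
ORDER BY id
""").fetchall()

print(in_rows, exists_rows)   # both forms return the same ids
```

The 'ref' row itself (id 22) is excluded in both forms because its def_id is NULL, and NULL never compares equal to anything.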
Edit: since you commented that you also need the id from the 'ref' rows, you need to join to a subquery:
select source_id, YourTable.*
from YourTable
join (select id as source_id, ref_def_id
from YourTable
where type='ref')
as refs on refs.ref_def_id=YourTable.def_id
order by source_id, id;
With this, for each 'ref' row you would get all the rows with the associated ref_def_id.
Use the query below to get a column from a subquery.
select a.ref_def_id
from (select ref_def_id from YourTable where type='ref') as a;
What you are looking for is a subquery or even better a join operation.
Have a look here: http://www.mysqltutorial.org/mysql-left-join.aspx
A join (here, a left join) allows you to combine rows of tables within one query on a given condition. The condition could be id = 5 for your purpose.
You would seem to want aggregation:
select max(id) as id, type, max(other_id) as other_id,
max(def_id) as def_id, ref_def_id
from t
where type = 'ref'
group by type, ref_def_id

Fastest way to order by having true result on a left join in MYSQL

I am trying to set up something where data is being matched on two different tables. The results would be ordered by some data being true on the second table. However, not everyone in the first table is in the second table. My problem is twofold. 1) Speed. My current MySQL query takes 4 seconds to go through several thousand results on each table. 2) Not ordering correctly. I need it to order the results by who is online, but still be alphabetical. As it stands now it orders everyone by whether or not they are online according to the chathelp table, then fills in the rest with the users table.
What I have:
SELECT u.name, u.id, u.url, c.online
FROM users AS u
LEFT JOIN livechat AS c ON u.url = CONCAT('http://www.software.com/', c.chat_handle)
WHERE u.live_account = 'y'
ORDER BY c.online DESC, u.name ASC
LIMIT 0, 24
users
+-----------------------------------------------------------+--------------+
| id | name | url | live_account |
+-----------------------------------------------------------+--------------|
| 1 | Lisa Fuller | http://www.software.com/LisaHelpLady | y |
| 2 | Eric Reiner | | y |
| 3 | Tom Lansen | http://www.software.com/SaveUTom | y |
| 4 | Billy Bob | http://www.software.com/BillyBob | n |
+-----------------------------------------------------------+--------------+
chathelp
+------------------------------------+
| chat_id | chat_handle | online |
+------------------------------------+
| 12 | LisaHelpLady | 1 |
| 34 | BillyBob | 0 |
| 87 | SaveUTom | 0 |
+------------------------------------+
What I would like the data I receive to look like:
+----------------------------------------------------------------------+
| name | id | url | online |
+----------------------------------------------------------------------+
| Lisa Fuller | 1 | http://www.software.com/LisaHelpLady | 1 |
| Eric Reiner | 4 | | 0 |
| Tom Lansen | 3 | http://www.software.com/SaveUTom | 0 |
+----------------------------------------------------------------------+
Explanation: Billy is excluded right off the bat for not having a live account. Lisa comes before Eric because she is online. Tom comes after Eric because he is offline and alphabetically later in the data. The only matching data between the two tables is a portion of the url column with the chat_handle column.
What I am getting instead:
(basically, I am getting Lisa, Tom, then Eric)
I am getting everybody in the chathelp table listed first, whether they are online or not. So 600 people come first, then I get the remaining people who aren't in both tables from the users table. I need people who are offline in the chathelp table to be sorted in with the users-table people in alphabetical order. So if Lisa and Tom were the only users online they would come first, but everyone else from the users table, regardless of whether or not they set up their chathelp handle, would come alphabetically after those two users.
Again, I need to sort them and figure out how to do this in less than 4 seconds. I have tried indexes on both tables, but they don't help. Explain says it is using a key (name) on table users hitting rows 4771 -> Using where;Using temporary; Using filesort and on table2 NULL for key with 1054 rows and nothing in the extra column.
Any help would be appreciated.
Edit to add table into and explain statement
CREATE TABLE `chathelp` (
`chat_id` int(13) NOT NULL,
`chat_handle` varchar(100) NOT NULL,
`online` tinyint(1) NOT NULL DEFAULT '0',
UNIQUE KEY `chat_id` (`chat_id`),
KEY `chat_handle` (`chat_handle`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `users` (
`id` int(8) NOT NULL AUTO_INCREMENT,
`name` varchar(50) NOT NULL,
`url` varchar(250) NOT NULL,
`live_account` varchar(1) NOT NULL DEFAULT 'n',
PRIMARY KEY (`id`),
KEY `livenames` (`live_account`,`name`)
) ENGINE=MyISAM AUTO_INCREMENT=9556 DEFAULT CHARSET=utf8
+----+-------------+------------+------+---------------+--------------+---------+-------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+--------------+---------+-------+------+----------------------------------------------+
| 1 | SIMPLE | users | ref | livenames | livenames | 11 | const | 4771 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | chathelp | ALL | NULL | NULL | NULL | NULL | 1144 | |
+----+-------------+------------+------+---------------+--------------+---------+-------+------+----------------------------------------------+
We're going to guess that online is an integer datatype.
You can modify the expression in your order by clause like this:
ORDER BY IFNULL(online,0) DESC, users.name ASC
         ^^^^^^^      ^^^
The problem is that for rows in users that don't have a matching row in chathelp, the value of the online column in the resultset is NULL. And with a descending sort, NULL sorts after all non-NULL values (MySQL treats NULL as lower than any non-NULL value).
If we assume that a missing row in chathelp is to be treated the same as a row in chathelp that has a 0 for online, we can replace the NULL value with a 0. (If there are NULL values stored in the online column itself, we won't be able to distinguish between that and a missing row in chathelp using this expression in the ORDER BY.)
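A tiny SQLite sketch (made-up names) shows both the NULL-sorts-last behavior under DESC and the effect of the IFNULL fix; SQLite happens to order NULLs the same way MySQL does here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE u (name TEXT, online INTEGER);  -- online NULL = no chathelp row
INSERT INTO u VALUES ('Lisa', 1), ('Eric', NULL), ('Tom', 0);
""")

# Plain DESC: NULL sorts below 0, so Eric lands after Tom.
plain = [r[0] for r in conn.execute(
    "SELECT name FROM u ORDER BY online DESC, name ASC")]

# IFNULL folds the missing-row case into 0, merging Eric and Tom alphabetically.
fixed = [r[0] for r in conn.execute(
    "SELECT name FROM u ORDER BY IFNULL(online, 0) DESC, name ASC")]

print(plain, fixed)
```

The first ordering pushes users without a chat row to the bottom; the second interleaves them with the offline users alphabetically, which is what the question asks for.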
EDIT
Optimizing Performance
To address performance, we'd need to see the output from EXPLAIN.
With the query as it's written above, there's no getting around the "Using filesort" to get the rows returned in the order specified, on that expression.
We may be able to re-write the query to get an equivalent result faster.
But I suspect the "Using filesort" operation is not really the problem, unless there's a boatload (thousands and thousands) of rows to sort.
I suspect that suitable indexes aren't available for the join operation.
But before we go to the knee-jerk "add an index!", we really need to look at EXPLAIN, and look at the table definitions including the indexes. (The output from SHOW CREATE TABLE is suitable.)
We just don't have enough information to make recommendations yet.
Reference: 8.8.1 Optimizing Queries with EXPLAIN
As a guess, we might want to try a query like this:
SELECT u.name
, u.id
, u.url
, l.online
FROM users u
LEFT
JOIN livechat l
ON u.url = CONCAT('http://www.software.com/', l.chat_handle)
AND l.online = 1
WHERE u.live_account = 'y'
ORDER
BY IF(l.online=1,0,1) ASC
, u.name ASC
LIMIT 0,24
After we've added covering indexes, e.g.
... ON users (live_account, name, url, id)
... ON livechat (chat_handle, online)
(If query is using a covering index, EXPLAIN should show "Using index" in the Extra column.)
One approach might be to break the query into two parts: an inner join, and an anti-join. This is just a guess at something we might try, but again, we'd want to compare the EXPLAIN output.
Sometimes, we can get better performance with a pattern like this. But for better performance, both of the queries below are going to need to be more efficient than the original query:
( SELECT u.name
, u.id
, u.url
, l.online
FROM users u
JOIN livechat l
ON u.url = CONCAT('http://www.software.com/', l.chat_handle)
AND l.online = 1
WHERE u.live_account = 'y'
ORDER
BY u.name ASC
LIMIT 0,24
)
UNION ALL
( SELECT u.name
, u.id
, u.url
, 0 AS online
FROM users u
LEFT
JOIN livechat l
ON u.url = CONCAT('http://www.software.com/', l.chat_handle)
AND l.online = 1
WHERE l.chat_handle IS NULL
AND u.live_account = 'y'
ORDER
BY u.name ASC
LIMIT 0,24
)
ORDER BY 4 DESC, 1 ASC
LIMIT 0,24
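Here is the same join-plus-anti-join shape reduced to a runnable SQLite sketch; the url/CONCAT matching is simplified to a plain handle column, and the sample rows mirror the tables in the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (name TEXT, handle TEXT, live_account TEXT);
CREATE TABLE livechat (chat_handle TEXT, online INTEGER);
INSERT INTO users VALUES
  ('Lisa', 'LisaHelpLady', 'y'),
  ('Eric', NULL,           'y'),
  ('Tom',  'SaveUTom',     'y'),
  ('Billy','BillyBob',     'n');
INSERT INTO livechat VALUES ('LisaHelpLady', 1), ('SaveUTom', 0), ('BillyBob', 0);
""")

rows = conn.execute("""
SELECT u.name, 1 AS online          -- branch 1: users with an online chat row
FROM users u
JOIN livechat l ON l.chat_handle = u.handle AND l.online = 1
WHERE u.live_account = 'y'
UNION ALL
SELECT u.name, 0 AS online          -- branch 2: everyone else (anti-join)
FROM users u
LEFT JOIN livechat l ON l.chat_handle = u.handle AND l.online = 1
WHERE l.chat_handle IS NULL AND u.live_account = 'y'
ORDER BY online DESC, name ASC
""").fetchall()

print(rows)
```

Billy is filtered out by live_account, Lisa lands in the online branch, and Eric and Tom (no chat row / offline) fall through to the anti-join branch in alphabetical order, matching the desired result in the question.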

Optimizing a query for optional fields from another table

I have an InnoDB table called items that powers an ecommerce site. The search system allows you to search on optional/additional fields, so that you can e.g. search for only repaired computers, or only cars older than 2000.
This is done via an additional table called items_fields.
It has a very simple design:
+------------+------------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| field_id | int(11) | NO | MUL | NULL | |
| item_id | int(11) | NO | MUL | NULL | |
| valueText | varchar(500) | YES | | NULL | |
| valueInt | decimal(10,1) unsigned | YES | | NULL | |
+------------+------------------------+------+-----+---------+----------------+
There is also a table called fields which contains only field names and types.
The main query, which returns search results, is the following:
SELECT items...
FROM items
WHERE items... AND (
SELECT count(id)
FROM items_fields
WHERE items_fields.field_id = "59" AND items_fields.item_id = items.id AND
items_fields.valueText = "Damaged")>0
ORDER by ordering desc LIMIT 35;
On a large scale (4 million+ search queries per day), I need to optimize these advanced searches even more. Currently, the average advanced search query takes around 100ms.
How can I speed up this query? Do you have any other suggestions or advice for optimization? Both tables are InnoDB, the server stack is absolutely awesome, however I still have this query to solve :)
Add an index on (item_id, field_id, valueText), since this is your search.
Get rid of the inner select!!! MySQL up to 5.5 cannot optimize queries with inner selects. As far as I know MariaDB 5.5 is the only MySQL replacement that currently supports inner select optimization.
SELECT i.*, f2.*
FROM items i
JOIN items_fields f ON f.field_id = 59
  AND f.item_id = i.id
  AND f.valueText = "Damaged"
JOIN items_fields f2 ON f2.item_id = i.id
ORDER BY i.ordering DESC
LIMIT 35;
The first join will limit the set being returned. The second join will grab all items_fields rows for the items meeting the first join. Between the first and last joins, you can add more join conditions that filter the results on additional criteria. For example:
SELECT i.*, f3.*
FROM items i
JOIN items_fields f ON f.field_id = 59
  AND f.item_id = i.id
  AND f.valueText = "Damaged"
JOIN items_fields f2 ON f2.field_id = 22
  AND f2.item_id = i.id
  AND f2.valueText = "Green"
JOIN items_fields f3 ON f3.item_id = i.id
ORDER BY i.ordering DESC
LIMIT 35;
This would return a result set of all items that had field 59 with the value "Damaged" and field 22 with the value "Green", along with all their items_fields rows.
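The multi-join filtering can be checked on a toy dataset; a SQLite sketch (field ids 59 and 22 reused from the answer, rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, ordering INTEGER);
CREATE TABLE items_fields (field_id INTEGER, item_id INTEGER, valueText TEXT);
INSERT INTO items VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO items_fields VALUES
  (59, 1, 'Damaged'), (22, 1, 'Green'),   -- item 1 matches both filters
  (59, 2, 'Damaged'), (22, 2, 'Red'),     -- item 2 fails the colour filter
  (59, 3, 'OK');                          -- item 3 fails the damage filter
""")

# Each join against items_fields acts as one mandatory field filter.
rows = conn.execute("""
SELECT i.id
FROM items i
JOIN items_fields f  ON f.field_id = 59 AND f.item_id  = i.id AND f.valueText  = 'Damaged'
JOIN items_fields f2 ON f2.field_id = 22 AND f2.item_id = i.id AND f2.valueText = 'Green'
ORDER BY i.ordering DESC
""").fetchall()

print(rows)   # only item 1 satisfies both field filters
```

Each additional inner join against the EAV table intersects another field condition, which is what replaces the correlated COUNT subquery in the original query.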

Retrieving the previous 10 results from a table that are nearest to a certain date and maintaining ascending sorting order

I have a table containing calendar items; in my web application, I have two views:
View 1: primary view that shows the next 10 items, starting from now
View 2: view that shows the previous/next 10 items based on the timestamp of the first/last item in view 1. This is the troublemaker.
On the bottom of the page, previous/next links are shown that lead to view 2.
The problem:
How do I retrieve the previous set of 10 items without knowing what date they are?
At first, this seemed quite simple to me, but apparently, it is not.
Database table:
+-------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+------------------+------+-----+---------+----------------+
| id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(255) | YES | MUL | NULL | |
| start_time | datetime | YES | | NULL | |
| end_time | datetime | YES | | NULL | |
| created | datetime | NO | | NULL | |
| updated | datetime | YES | | NULL | |
| deleted | tinyint(1) | NO | | 0 | |
+-------------+------------------+------+-----+---------+----------------+
SQL query for showing next 10 items starting from now (no problems here):
SELECT ci.*
FROM `calendar_item` AS `ci`
WHERE ci.end_time >= NOW()
GROUP BY `ci`.`id`
ORDER BY `ci`.`end_time` ASC
LIMIT 10
SQL query for showing previous 10 items, based on timestamp of the 1st item in the primary view:
SELECT ci.*
FROM `calendar_item` AS `ci`
WHERE (ci.id IN (
SELECT id FROM calendar_item
WHERE (end_time < FROM_UNIXTIME(1334667600))
ORDER BY end_time DESC
))
GROUP BY `ci`.`id`
ORDER BY `ci`.`end_time` ASC
LIMIT 10
The timestamp is passed via the URL; in view 1 the subquery is not used at all. The problem lies in the fact that items should be sorted ascending; this would result in the earliest items in the database being returned, instead of those nearest to the timestamp. To counter this problem, I created a subquery that sorts descending. This subquery works fine when I run it as a normal query, but when contained in the above query, it simply displays the same results as the following:
SELECT ci.*
FROM `calendar_item` AS `ci`
WHERE ci.end_time <= FROM_UNIXTIME(1334667600)
GROUP BY `ci`.`id`
ORDER BY `ci`.`end_time` ASC
LIMIT 10
I am most likely overlooking something, so I could use your help. Thanks in advance.
This is a simple one: LIMIT the subquery.
SELECT ci.*
FROM `calendar_item` AS `ci`
WHERE (ci.id IN (
SELECT id FROM calendar_item
WHERE (end_time < FROM_UNIXTIME(1334667600))
ORDER BY end_time DESC
LIMIT 10
))
GROUP BY `ci`.`id`
ORDER BY `ci`.`end_time` ASC
LIMIT 10
Without the limit in the subquery you are selecting ALL rows with end_time < FROM_UNIXTIME(1334667600). You are then reordering ASC and selecting the first 10, i.e. the earliest 10.
If you limit the subquery you get the 10 latest rows which satisfy the condition, and the outer query can then select them. (Beware that some MySQL versions reject a LIMIT inside an IN subquery with "This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'"; if you hit that, use the derived-table form below.)
An alternative (and my preferred option) would be the following, where the subquery gets the data and the outer query simply reorders it before spitting it back out:
SELECT i.*
FROM (
SELECT ci.*
FROM calendar_item AS ci
WHERE ci.end_time < FROM_UNIXTIME(1334667600)
ORDER BY ci.end_time DESC
LIMIT 10
) AS i
ORDER BY i.`end_time` ASC
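The derived-table approach is easy to verify on synthetic data; a SQLite sketch with eight hypothetical calendar items:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calendar_item (id INTEGER PRIMARY KEY, end_time TEXT)")
conn.executemany("INSERT INTO calendar_item VALUES (?, ?)",
                 [(i, f"2012-04-{i:02d}") for i in range(1, 9)])

# Inner query: the 3 items nearest *before* the cutoff, newest first.
# Outer query: flip them back into ascending order for display.
rows = conn.execute("""
SELECT i.id, i.end_time
FROM ( SELECT ci.id, ci.end_time
       FROM calendar_item ci
       WHERE ci.end_time < '2012-04-07'
       ORDER BY ci.end_time DESC
       LIMIT 3
     ) AS i
ORDER BY i.end_time ASC
""").fetchall()

print(rows)
```

The inner DESC + LIMIT picks the items closest to the cutoff rather than the earliest ones, and the outer ASC restores the display order, exactly the behavior the question is after.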

SQL LIMIT to get latest records

I am writing a script which will list 25 items for each of 12 categories. Database structure is like:
tbl_items
---------------------------------------------
item_id | item_name | item_value | timestamp
---------------------------------------------
tbl_categories
-----------------------------
cat_id | item_id | timestamp
-----------------------------
There are around 600,000 rows in the table tbl_items. I am using this SQL query:
SELECT e.item_id, e.item_value
FROM tbl_items AS e
JOIN tbl_categories AS cat WHERE e.item_id = cat.item_id AND cat.cat_id = 6001
LIMIT 25
Using the same query in a loop for cat_id from 6000 to 6012. But I want the latest records of every category. If I use something like:
SELECT e.item_id, e.item_value
FROM tbl_items AS e
JOIN tbl_categories AS cat WHERE e.item_id = cat.item_id AND cat.cat_id = 6001
ORDER BY e.timestamp
LIMIT 25
..the query keeps computing for approximately 10 minutes, which is not acceptable. Can I use LIMIT more cleverly to get the latest 25 records for each category?
Can anyone help me achieve this without ORDER BY? Any ideas or help will be highly appreciated.
EDIT
tbl_items
+---------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------------+--------------+------+-----+---------+-------+
| item_id | int(11) | NO | PRI | 0 | |
| item_name | longtext | YES | | NULL | |
| item_value | longtext | YES | | NULL | |
| timestamp | datetime | YES | | NULL | |
+---------------------+--------------+------+-----+---------+-------+
tbl_categories
+----------------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+------------+------+-----+---------+-------+
| cat_id | int(11) | NO | PRI | 0 | |
| item_id | int(11) | NO | PRI | 0 | |
| timestamp | datetime | YES | | NULL | |
+----------------+------------+------+-----+---------+-------+
Can you add indices? If you add an index on the timestamp and other appropriate columns the ORDER BY won't take 10 minutes.
First of all:
It seems to be an N:M relation between items and categories: an item may be in several categories. I say this because tbl_categories has an item_id foreign key.
If it is not an N:M relationship, then you should consider changing the design. If it is a 1:N relationship, where a category has several items, then tbl_items should contain a category_id foreign key.
Working with N:M:
I have rewritten your query to make an inner join instead of a cross join:
SELECT e.item_id, e.item_value
FROM
tbl_items AS e
JOIN
tbl_categories AS cat
on e.item_id = cat.item_id
WHERE
cat.cat_id = 6001
ORDER BY
e.timestamp
LIMIT 25
To optimize performance, the required indexes are:
create index idx_1 on tbl_categories( cat_id, item_id)
An index on tbl_items is not mandatory because the primary key is already indexed.
An index that contains timestamp doesn't help as much here. To be sure, you can try an index on tbl_items with item_id and timestamp, to avoid table access and take the values from the index:
create index idx_2 on tbl_items( item_id, timestamp)
To increase performance you can replace your loop over categories with a single query. (Note that a plain LIMIT cannot return 25 rows per category; limiting per category is a greatest-n-per-group problem. The query below returns all matching rows, ordered so that each category's latest items come first:)
select T.cat_id, T.item_id, T.item_value
from
(SELECT cat.cat_id, e.item_id, e.item_value, e.timestamp
FROM
tbl_items AS e
JOIN
tbl_categories AS cat
on e.item_id = cat.item_id
WHERE
cat.cat_id between 6001 and 6012
) T
ORDER BY
T.cat_id, T.timestamp desc
Please try these queries and come back with your comments to refine them if necessary.
Leaving aside all other factors, I can tell you that the main reason the query is so slow is that the result involves longtext columns.
BLOB and TEXT fields in MySQL are mostly meant to store complete files, textual or binary. They are stored separately from the row data for InnoDB tables. Each time a query involves sorting (explicitly or for a GROUP BY), MySQL is sure to use disk for the sorting (because it cannot know in advance how large any value is).
And it is probably a rule of thumb: if you need to return more than a single row of a column in a query, the type of the field should almost never be TEXT or BLOB; use VARCHAR or VARBINARY instead.
UPD
If you cannot change the table, the query will hardly be fast with the current indexes and column types. But, anyway, here is a similar question with a popular solution to your problem: How to SELECT the newest four items per category?