SQL LIMIT to get latest records - mysql

I am writing a script which will list 25 items for each of 12 categories. The database structure is:
tbl_items
---------------------------------------------
item_id | item_name | item_value | timestamp
---------------------------------------------
tbl_categories
-----------------------------
cat_id | item_id | timestamp
-----------------------------
There are around 600,000 rows in the table tbl_items. I am using this SQL query:
SELECT e.item_id, e.item_value
FROM tbl_items AS e
JOIN tbl_categories AS cat WHERE e.item_id = cat.item_id AND cat.cat_id = 6001
LIMIT 25
I am using the same query in a loop for cat_id from 6000 to 6012, but I want the latest records of every category. If I use something like:
SELECT e.item_id, e.item_value
FROM tbl_items AS e
JOIN tbl_categories AS cat WHERE e.item_id = cat.item_id AND cat.cat_id = 6001
ORDER BY e.timestamp
LIMIT 25
...the query runs for approximately 10 minutes, which is not acceptable. Can I use LIMIT more cleverly to get the latest 25 records for each category?
Can anyone help me achieve this without ORDER BY? Any ideas or help will be highly appreciated.
EDIT
tbl_items
+---------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------------+--------------+------+-----+---------+-------+
| item_id | int(11) | NO | PRI | 0 | |
| item_name | longtext | YES | | NULL | |
| item_value | longtext | YES | | NULL | |
| timestamp | datetime | YES | | NULL | |
+---------------------+--------------+------+-----+---------+-------+
tbl_categories
+----------------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+------------+------+-----+---------+-------+
| cat_id | int(11) | NO | PRI | 0 | |
| item_id | int(11) | NO | PRI | 0 | |
| timestamp | datetime | YES | | NULL | |
+----------------+------------+------+-----+---------+-------+

Can you add indices? If you add an index on the timestamp and other appropriate columns the ORDER BY won't take 10 minutes.
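For example (just a sketch; the index names are arbitrary and the best column choice depends on how the query is finally written):
-- illustrative indexes for the schema shown above
ALTER TABLE tbl_items ADD INDEX idx_items_ts (timestamp);
-- or a composite that also covers the join column:
ALTER TABLE tbl_items ADD INDEX idx_items_id_ts (item_id, timestamp);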

First of all:
It seems to be an N:M relationship between items and categories: an item may belong to several categories. I say this because tbl_categories has an item_id foreign key.
If it is not an N:M relationship, you should consider changing the design. If it is a 1:N relationship, where a category has several items, then tbl_items should contain a category_id foreign key.
Working with N:M:
I have rewritten your query to use an inner join instead of a cross join:
SELECT e.item_id, e.item_value
FROM
tbl_items AS e
JOIN
tbl_categories AS cat
on e.item_id = cat.item_id
WHERE
cat.cat_id = 6001
ORDER BY
e.timestamp
LIMIT 25
To optimize performance, the required index is:
create index idx_1 on tbl_categories (cat_id, item_id)
An index on tbl_items is not mandatory because its primary key is already indexed.
An index that contains timestamp doesn't help as much. To be sure, you can try an index on tbl_items with item_id and timestamp, so values can be taken from the index without accessing the table:
create index idx_2 on tbl_items (item_id, timestamp)
To increase performance you can replace the loop over categories with a single query:
select T.cat_id, T.item_id, T.item_value from
(SELECT cat.cat_id, e.item_id, e.item_value
FROM
tbl_items AS e
JOIN
tbl_categories AS cat
on e.item_id = cat.item_id
ORDER BY
e.timestamp
LIMIT 25
) T
WHERE
T.cat_id between 6001 and 6012
ORDER BY
T.cat_id, T.item_id
Please try these queries and come back with your comments to refine them if necessary.

Leaving aside all other factors, I can tell you that the main reason the query is so slow is that the result involves longtext columns.
BLOB and TEXT fields in MySQL are mostly meant to store complete files, textual or binary. They are stored separately from the row data for InnoDB tables. Each time a query involves sorting (explicitly or for a GROUP BY), MySQL is sure to use disk for the sort (because it cannot know in advance how large any value is).
As a rule of thumb: if you need to return more than a single row of a column in a query, the field type should almost never be TEXT or BLOB; use VARCHAR or VARBINARY instead.
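For illustration, a sketch of the kind of column change being suggested (the VARCHAR lengths are assumptions; check the real maximum lengths in your data first, since longer values would be truncated):
ALTER TABLE tbl_items
MODIFY item_name VARCHAR(255),
MODIFY item_value VARCHAR(1000);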
UPD
If you cannot alter the table, the query will hardly be fast with the current indexes and column types. But anyway, here is a similar question with a popular solution to your problem: How to SELECT the newest four items per category?
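For reference, a rough sketch of that "newest N per group" pattern adapted to the table names above (untested against the real data; it keeps a row only if fewer than 25 rows in the same category have a newer timestamp, and it will still be slow without the indexes discussed above):
SELECT c.cat_id, e.item_id, e.item_value
FROM tbl_categories AS c
JOIN tbl_items AS e ON e.item_id = c.item_id
WHERE c.cat_id BETWEEN 6000 AND 6012
AND (SELECT COUNT(*)
     FROM tbl_categories AS c2
     JOIN tbl_items AS e2 ON e2.item_id = c2.item_id
     WHERE c2.cat_id = c.cat_id
       AND e2.timestamp > e.timestamp) < 25;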


Fastest way to order by having true result on a left join in MYSQL

I am trying to set up something where data is being matched on two different tables. The results would be ordered by some data being true on the second table. However, not everyone in the first table is in the second table. My problem is twofold. 1) Speed. My current MYSQL query takes 4 seconds to go through several thousand results on each table. 2) Not ordering correctly. I need it to order the results by who is online, but still be alphabetical. As it stands now it orders everyone by whether or not they are online according to chathelp table, then fills in the rest with the users table.
What I have:
SELECT u.name, u.id, u.url, c.online
FROM users AS u
LEFT JOIN livechat AS c ON u.url = CONCAT('http://www.software.com/', c.chat_handle)
WHERE u.live_account = 'y'
ORDER BY c.online DESC, u.name ASC
LIMIT 0, 24
users
+-----------------------------------------------------------+--------------+
| id | name | url | live_account |
+-----------------------------------------------------------+--------------|
| 1 | Lisa Fuller | http://www.software.com/LisaHelpLady | y |
| 2 | Eric Reiner | | y |
| 3 | Tom Lansen | http://www.software.com/SaveUTom | y |
| 4 | Billy Bob | http://www.software.com/BillyBob | n |
+-----------------------------------------------------------+--------------+
chathelp
+------------------------------------+
| chat_id | chat_handle | online |
+------------------------------------+
| 12 | LisaHelpLady | 1 |
| 34 | BillyBob | 0 |
| 87 | SaveUTom | 0 |
+------------------------------------+
What I would like the data I receive to look like:
+----------------------------------------------------------------------+
| name | id | url | online |
+----------------------------------------------------------------------+
| Lisa Fuller | 1 | http://www.software.com/LisaHelpLady | 1 |
| Eric Reiner | 4 | | 0 |
| Tom Lansen | 3 | http://www.software.com/SaveUTom | 0 |
+----------------------------------------------------------------------+
Explanation: Billy is excluded right off the bat for not having a live account. Lisa comes before Eric because she is online. Tom comes after Eric because he is offline and alphabetically later in the data. The only matching data between the two tables is a portion of the url column with the chat_handle column.
What I am getting instead:
(basically, I am getting Lisa, Tom, then Eric)
I am getting everybody in the chathelp table listed first, whether they are online or not. So 600 people come first, then I get the remaining people who aren't in both tables from the users table. I need people who are offline in the chathelp table to be sorted in with the users table people in alphabetical order. So if Lisa and Tom were the only users online, they would come first, but everyone else from the users table, regardless of whether they set up their chathelp handle, would come alphabetically after those two users.
Again, I need to sort them and figure out how to do this in less than 4 seconds. I have tried indexes on both tables, but they don't help. Explain says it is using a key (name) on table users hitting rows 4771 -> Using where;Using temporary; Using filesort and on table2 NULL for key with 1054 rows and nothing in the extra column.
Any help would be appreciated.
Edit to add table info and EXPLAIN statement
CREATE TABLE `chathelp` (
`chat_id` int(13) NOT NULL,
`chat_handle` varchar(100) NOT NULL,
`online` tinyint(1) NOT NULL DEFAULT '0',
UNIQUE KEY `chat_id` (`chat_id`),
KEY `chat_handle` (`chat_handle`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
CREATE TABLE `users` (
`id` int(8) NOT NULL AUTO_INCREMENT,
`name` varchar(50) NOT NULL,
`url` varchar(250) NOT NULL,
`live_account` varchar(1) NOT NULL DEFAULT 'n',
PRIMARY KEY (`id`),
KEY `livenames` (`live_account`,`name`)
) ENGINE=MyISAM AUTO_INCREMENT=9556 DEFAULT CHARSET=utf8
+----+-------------+------------+------+---------------+--------------+---------+-------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+--------------+---------+-------+------+----------------------------------------------+
| 1 | SIMPLE | users | ref | livenames | livenames | 11 | const | 4771 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | chathelp | ALL | NULL | NULL | NULL | NULL | 1144 | |
+----+-------------+------------+------+---------------+--------------+---------+-------+------+----------------------------------------------+
We're going to guess that online is an integer datatype.
You can modify the expression in your order by clause like this:
ORDER BY IFNULL(online,0) DESC, users.name ASC
^^^^^^^ ^^^
The problem is that for rows in users that don't have a matching row in chathelp, the value of the online column in the result set is NULL, and with the DESC sort MySQL puts NULL after all non-NULL values.
If we assume that a missing row in chathelp is to be treated the same as a row in chathelp that has 0 for online, we can replace the NULL value with a 0. (If there are NULL values in the online column itself, we won't be able to distinguish between those and a missing row in chathelp using this expression in the ORDER BY.)
EDIT
Optimizing Performance
To address performance, we'd need to see the output from EXPLAIN.
With the query as it's written above, there's no getting around the "Using filesort" to get the rows returned in the order specified, on that expression.
We may be able to re-write the query to get an equivalent result faster.
But I suspect the "Using filesort" operation is not really the problem, unless there's a bloatload (thousands and thousands) of rows to sort.
I suspect that suitable indexes aren't available for the join operation.
But before we go to the knee-jerk "add an index!", we really need to look at EXPLAIN, and look at the table definitions including the indexes. (The output from SHOW CREATE TABLE is suitable.)
We just don't have enough information to make recommendations yet.
Reference: 8.8.1 Optimizing Queries with EXPLAIN
As a guess, we might want to try a query like this:
SELECT u.name
, u.id
, u.url
, l.online
FROM users u
LEFT
JOIN livechat l
ON u.url = CONCAT('http://www.software.com/', l.chat_handle)
AND l.online = 1
WHERE u.live_account = 'y'
ORDER
BY IF(l.online=1,0,1) ASC
, u.name ASC
LIMIT 0,24
After we've added covering indexes, e.g.
.. ON users (live_account, url, name, id)
...ON livechat (chat_handle, online)
(If the query is using a covering index, EXPLAIN should show "Using index" in the Extra column.)
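A sketch of those covering indexes as actual statements (index names are illustrative; long VARCHAR columns may need a prefix length depending on the storage engine and character set):
ALTER TABLE users ADD INDEX live_url_name_idx (live_account, url, name, id);
ALTER TABLE livechat ADD INDEX handle_online_idx (chat_handle, online);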
One approach might be to break the query into two parts: an inner join, and a semi-anti join. This is just a guess at something we might try, but again, we'd want to compare the EXPLAIN output.
Sometimes, we can get better performance with a pattern like this. But for better performance, both of the queries below are going to need to be more efficient than the original query:
( SELECT u.name
, u.id
, u.url
, l.online
FROM users u
JOIN livechat l
ON u.url = CONCAT('http://www.software.com/', l.chat_handle)
AND l.online = 1
WHERE u.live_account = 'y'
ORDER
BY u.name ASC
LIMIT 0,24
)
UNION ALL
( SELECT u.name
, u.id
, NULL AS url
, 0 AS online
FROM users u
LEFT
JOIN livechat l
ON u.url = CONCAT('http://www.software.com/', l.chat_handle)
AND l.online = 1
WHERE l.chat_handle IS NULL
AND u.live_account = 'y'
ORDER
BY u.name ASC
LIMIT 0,24
)
ORDER BY 4 DESC, 1 ASC
LIMIT 0,24

Why is my SQL query so slow?

I run the following query on a weekly basis, but it is getting to the point where it now takes 22 hours to run! The purpose of the report is to aggregate impression and conversion data by ad placement and date, so the main table I am querying does not have a primary key, as there can be multiple events with the same date/placement.
The main data set has about 400K records, so it shouldn't take more than a few minutes to run this report.
The table descriptions are:
tbl_ads (400,000 records)
day_est DATE (index)
conv_day_est DATE (index)
placement_id INT (index)
adunit_id INT (index)
cost_type VARCHAR(20)
cost_value DECIMAL(10,2)
adserving_cost DECIMAL(10,2)
conversion1 INT
estimated_spend DECIMAL(10,2)
clicks INT
impressions INT
publisher_clicks INT
publisher_impressions INT
publisher_spend DECIMAL (10,2)
source VARCHAR(30)
map_external_id (75,000 records)
placement_id INT
adunit_id INT
external_id VARCHAR (50)
primary key(placement_id,adunit_id,external_id)
SQL Query
SELECT A.day_est, A.placement_id, A.placement_name, A.adunit_id, A.adunit_name,
       A.imp, A.clk, C.ads_cost, C.ads_spend,
       B.conversion1, B.conversion2, B.ID_Matched,
       C.pub_imps, C.pub_clicks, C.pub_spend,
       COALESCE(A.cost_type,B.cost_type) as cost_type,
       COALESCE(A.cost_value,B.cost_value) as cost_value,
       D.external_id
FROM (SELECT day_est, placement_id,adunit_id,placement_name,adunit_name,cost_type,cost_value,
SUM(impressions) as imp, SUM(clicks) as clk
FROM tbl_ads
WHERE source='delivery'
GROUP BY 1,2,3 ) as A LEFT JOIN
(
SELECT conv_day_est, placement_id,adunit_id, cost_type,cost_value, SUM(conversion1) as conversion1,
SUM(conversion2) as conversion2,SUM(id_match) as ID_Matched
FROM tbl_ads
WHERE source='attribution'
GROUP BY 1,2,3
) as B on A.day_est=B.conv_day_est AND A.placement_id=B.placement_id AND A.adunit_id=B.adunit_id
LEFT JOIN
(
SELECT day_est,placement_id,adunit_id,SUM(adserving_cost) as ads_cost, SUM(estimated_spend) as ads_spend,sum(publisher_clicks) as pub_clicks,sum(publisher_impressions) as pub_imps,sum(publisher_spend) as pub_spend
FROM tbl_ads
GROUP BY 1,2,3 ) as C on A.day_est=C.day_est AND A.placement_id=C.placement_id AND A.adunit_id=C.adunit_id
LEFT JOIN
(
SELECT placement_id,adunit_id,external_id
FROM map_external_id
) as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id
INTO OUTFILE '/tmp/weekly_report.csv';
Results of EXPLAIN:
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 136518 | |
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 5180 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 198190 | |
| 1 | PRIMARY | <derived5> | ALL | NULL | NULL | NULL | NULL | 23766 | |
| 5 | DERIVED | map_external_id | index | NULL | PRIMARY | 55 | NULL | 20797 | Using index |
| 4 | DERIVED | tbl_ads | index | NULL | PIndex | 13 | NULL | 318400 | |
| 3 | DERIVED | tbl_ads | ALL | NULL | NULL | NULL | NULL | 318400 | Using filesort |
| 2 | DERIVED | tbl_ads | index | NULL | PIndex | 13 | NULL | 318400 | Using where |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
More of a speculative answer, but I don't think 22 hours is too unrealistic..
First things first... you don't need the last subquery, just state
LEFT JOIN map_external_id as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id
Second, the first and second subqueries filter on the field source in their WHERE clauses. Does it have an index? I've had a table with 1,000,000 or so entries where a missing index caused a processing time of 30 seconds for a simple query (can't believe the guy who put that query in the login process).
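If it isn't indexed, a sketch of adding one (the index name is just illustrative; with only a couple of distinct values its selectivity is low, so measure whether it actually helps):
ALTER TABLE tbl_ads ADD INDEX idx_source (source);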
An unrelated question in between: what's the final result set size?
Thirdly, my assumption is that by running the aggregating subqueries mysql actually creates temporary tables that do not have any indices - which is bad.
Have you had a look at the result sets of the individual subqueries yet? What is the typical set size? From your statements and my guesses about your typical data, I would assume that the aggregation only marginally reduces the set size (apart from the WHERE clause). So let me guess, in order of the subqueries: 200,000, 100,000, 200,000.
Each of the subqueries then joins with the next on three presumably non-indexed fields. So the worst case for the first join is 200,000 * 100,000 = 20,000,000,000 comparisons. Going by my 30 seconds for a query on 1,000,000 records, that makes 20,000 * 30 = 600,000 sec, or about 166 hours. Obviously that's way too much; maybe there's a digit missing, maybe it was 20 sec rather than 30, the result sets might be different, and the worst case is not the average case - but you get the picture.
My solution approach would then be to create additional tables that replace your aggregation subqueries. Judging from your queries you could update them daily, as I guess you just insert rows for impressions etc., so you can add the aggregation data incrementally. Then you transform your mega-query into two steps:
updating your aggregation tables
doing the final dump.
The aggregation tables obviously should be indexed meaningfully. I think that should bring the final queries down to a few seconds.
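As a sketch of that approach for the first subquery (table, column, and index names here are illustrative, based on the tbl_ads columns listed above; shown as a full rebuild, while an incremental update would restrict the SELECT to recent days):
-- pre-aggregated delivery data, keyed the same way the subquery groups
CREATE TABLE agg_delivery (
  day_est DATE NOT NULL,
  placement_id INT NOT NULL,
  adunit_id INT NOT NULL,
  imp BIGINT NOT NULL,
  clk BIGINT NOT NULL,
  PRIMARY KEY (day_est, placement_id, adunit_id)
);
INSERT INTO agg_delivery (day_est, placement_id, adunit_id, imp, clk)
SELECT day_est, placement_id, adunit_id, SUM(impressions), SUM(clicks)
FROM tbl_ads
WHERE source = 'delivery'
GROUP BY day_est, placement_id, adunit_id;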
Thanks for all your advice. I ended up splitting the subqueries and creating temporary tables (with primary keys) for each, then joined the temp tables together at the end, and it now takes about 10 minutes to run.

Optimizing a query for optional fields from another table

I have an InnoDB table called items that powers an ecommerce site. The search system allows you to search on optional/additional fields, so that you can e.g. search for only repaired computers, or only cars older than 2000.
This is done via an additional table called items_fields.
It has a very simple design:
+------------+------------------------+------+-----+---------+----------------+
| Field      | Type                   | Null | Key | Default | Extra          |
+------------+------------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| field_id | int(11) | NO | MUL | NULL | |
| item_id | int(11) | NO | MUL | NULL | |
| valueText | varchar(500) | YES | | NULL | |
| valueInt | decimal(10,1) unsigned | YES | | NULL | |
+------------+------------------------+------+-----+---------+----------------+
There is also a table called fields which contains only field names and types.
The main query, which returns search results, is the following:
SELECT items...
FROM items
WHERE items... AND (
SELECT count(id)
FROM items_fields
WHERE items_fields.field_id = "59" AND items_fields.item_id = items.id AND
items_fields.valueText = "Damaged")>0
ORDER by ordering desc LIMIT 35;
On a large scale (4 million+ search queries per day), I need to optimize these advanced searches even more. Currently, the average advanced search query takes around 100 ms.
How can I speed up this query? Do you have any other suggestions or advice for optimization? Both tables are InnoDB; the server stack is absolutely awesome, but I still have this query to solve :)
Add an index on (item_id, field_id, valueText), since this is what your search uses.
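A sketch of that index (the name is illustrative; valueText may need a prefix length to stay within the engine's key size limit, depending on the character set):
ALTER TABLE items_fields ADD INDEX idx_item_field_value (item_id, field_id, valueText(191));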
Get rid of the inner select!!! MySQL up to 5.5 cannot optimize queries with inner selects. As far as I know MariaDB 5.5 is the only MySQL replacement that currently supports inner select optimization.
SELECT i.*, f2.* FROM items i
JOIN items_fields f ON f.field_id = 59
AND f.item_id = i.id
AND f.valueText = "Damaged"
JOIN items_fields f2 ON f2.item_id = i.id
ORDER by i.ordering desc
LIMIT 35;
The first join will limit the set being returned. The second join will grab all items_fields rows for items meeting the first join. Between the first and last joins, you can add more join conditions that filter the results further. For example:
SELECT i.*, f3.* FROM items i
JOIN items_fields f ON f.field_id = 59
AND f.item_id = i.id
AND f.valueText = "Damaged"
JOIN items_fields f2 ON f2.field_id = 22
AND f2.item_id = i.id
AND f2.valueText = "Green"
JOIN items_fields f3 ON f3.item_id = i.id
ORDER by i.ordering desc
LIMIT 35;
This would return a result set of all items that had fields 59 with the value of "Damaged" and field 22 with the value of "Green" along with all their item_fields.

slow left join using mysql

Here is the SQL query in question:
select * from company1
left join company2 on company2.model
LIKE CONCAT(company1.model,'%')
where company1.manufacturer = company2.manufacturer
company1 contains 2000 rows while company2 contains 9000 rows.
The query takes around 25 seconds to complete.
I have company1.model and company2.model indexed.
Any idea how I can speed this up? Thanks!
+----+-------------+-----------+------+---------------+------+---------+------+------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+------+---------+------+------+--------------------------------+
| 1 | SIMPLE | company1 | ALL | NULL | NULL | NULL | NULL | 2853 | |
| 1 | SIMPLE | company2 | ALL | NULL | NULL | NULL | NULL | 8986 | Using where; Using join buffer |
+----+-------------+-----------+------+---------------+------+---------+------+------+--------------------------------+
This query is not conceptually identical to yours, but maybe you want something like this? I am quite sure it will give you the same result as yours:
select
*
from
company1 inner join company2
on company1.manufacturer = company2.manufacturer
where
company2.model LIKE CONCAT(company1.model,'%')
EDIT: I also removed your left join and used an inner join instead. If the join doesn't succeed, company2.model is always NULL, and NULL LIKE 'Something%' can never be true.
One way to speed this up is to remove the LIKE CONCAT() from the join condition.
MySQL is not able to use an index for substring based searches like that, so your query results in a full table scan.
Your EXPLAIN shows that you have no indexes that can be used.
Appropriate indexes on both tables would help. Either a single index on (manufacturer) or a composite (manufacturer, model):
ALTER TABLE company1
ADD INDEX manufacturer_model_IDX --- this is just a name (of your choice)
(manufacturer, model) ; --- for the index
ALTER TABLE company2
ADD INDEX manufacturer_model_IDX
(manufacturer, model) ;

Three Queries Faster than One -- What's Wrong with my Joins?

I've got a JPA ManyToMany relationship set up, which gives me three important tables: my Ticket table, my Join table, and my Inventory table. They're InnoDB tables on MySQL 5.1. The relevant bits are:
Ticket:
+--------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+----------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| Status | longtext | YES | | NULL | |
+--------+----------+------+-----+---------+----------------+
JoinTable:
+-------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| InventoryID | int(11) | NO | PRI | NULL | | Foreign Key - Inventory
| TicketID | int(11) | NO | PRI | NULL | | Foreign Key - Ticket
+-------------+---------+------+-----+---------+-------+
Inventory:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| TStampString | varchar(32) | NO | MUL | NULL | |
+--------------+--------------+------+-----+---------+----------------+
TStampStrings are of the form "yyyy.mm.dd HH:MM:SS Z" (for example, '2010.03.19 22:27:57 GMT'). Right now all of the Tickets created directly correspond to some specific hour TStampString, so that SELECT COUNT(*) FROM Ticket; is the same as SELECT COUNT(DISTINCT(SUBSTRING(TStampString, 1, 13))) FROM Inventory;
What I'd like to do is regroup certain Tickets based on the minute granularity of a TStampString: (SUBSTRING(TStampString, 1, 16)). So I'm profiling and testing the SELECT of an INSERT INTO ... SELECT statement:
EXPLAIN SELECT SUBSTRING(i.TStampString, 1, 16) FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY SUBSTRING(i.TStampString, 1, 16);
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|id| type |tbl| type | psbl_keys | key | len | ref | rows | Extra |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|1 | SMPL | t | ALL | PRI | NULL| NULL| NULL | 35569 | where |
| | | | | | | | | | +temporary|
| | | | | | | | | | +filesort |
|1 | SMPL | j | ref | PRI,FK1,FK2 | FK2 | 4 | t.ID | 378 | index |
|1 | SMPL | i | eq_ref | PRI | PRI | 4 | j.Invent | 1 | |
| | | | | | | | oryID | | |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
What this implies to me is that for each row in Ticket, MySQL first does the joins then later decides that the row is invalid due to the WHERE clause. Certainly the runtime is abominable (I gave up after 30 minutes). Note that it goes no faster with t.Status = 'Regroup' moved to the first JOIN clause and no WHERE clause.
But what's interesting is that if I run this query manually in three steps, doing what I thought the optimizer would do, each step returns almost immediately:
--Step 1: Select relevant Tickets (results dumped to file)
SELECT ID FROM Ticket WHERE Status = 'Regroup';
--Step 2: Get relevant Inventory entries
SELECT InventoryID FROM JoinTable WHERE TicketID IN (step 1s file);
--Step 3: Select what I wanted all along
SELECT SUBSTRING(TStampString, 1, 16) FROM Inventory WHERE ID IN (step 2s file)
GROUP BY SUBSTRING(TStampString, 1, 16);
On my particular tables, the first query gives 154 results, the second creates 206,598 lines, and the third query returns 9198 rows. All of them combined take ~2 minutes to run, with the last query having the only significant runtime.
Dumping the intermediate results to a file is cumbersome, and more importantly I'd like to know how to write my original query such that it runs reasonably. So how do I structure this three-table-join such that it runs as fast as I know is possible?
UPDATE: I've added a prefix index on Status(16), which changes my EXPLAIN profile rows to 153, 378, and 1 respectively (since the first row has a key to use). The JOIN version of my query now takes ~6 minutes, which is tolerable but still considerably slower than the manual version. I'd still like to know why the join performs so suboptimally, but it may be that one can't create independent subqueries in buggy MySQL 5.1. If enough time passes I'll accept Add Index as the solution to my problem, although it's not exactly the answer to my question.
In the end I did end up manually recreating every step of the join on disk. Tens of thousands of files each with a thousand queries was still significantly faster than anything I could get my version of MySQL to do. But since that process would be horribly specific and unhelpful for the layman, I'm accepting ypercube's answer of Add (Partial) Indexes.
What you can do to speed up the query:
Add an index on Status. Even if you don't change the type to VARCHAR, you can still add a partial index:
ALTER TABLE Ticket
ADD INDEX status_idx
(Status(16)) ;
I assume that the Primary key of the Join table is (InventoryID, TicketID). You can add another index on (TicketID, InventoryID) as well. This may not benefit this particular query but it will be helpful in other queries you'll have.
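A sketch of that extra index (the name is illustrative):
ALTER TABLE JoinTable ADD INDEX ticket_inventory_idx (TicketID, InventoryID);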
The answer on why this happens is that the optimizer does not always choose the best plan. You can try this variation of your query and see how the EXPLAIN plan differs and if there is any efficiency gain:
SELECT SUBSTRING(i.TStampString, 1, 16)
FROM
( SELECT DISTINCT j.InventoryID
FROM Ticket t
JOIN JoinTable j
ON t.ID = j.TicketID
WHERE t.Status = 'Regroup'
) AS tmp
JOIN Inventory i
ON tmp.InventoryID = i.ID
GROUP BY SUBSTRING(i.TStampString, 1, 16) ;
Try giving the first SUBSTRING clause an alias and using it in the GROUP BY:
SELECT SUBSTRING(i.TStampString, 1, 16) as blaa FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY blaa;
Also, you could avoid the join altogether since you don't need it:
SELECT distinct(SUBSTRING(i.TStampString, 1, 16)) from Inventory i where i.ID in
( select InventoryID from JoinTable j where j.TicketID in
(select id from Ticket t where t.Status = 'Regroup'));
would that work?
By the way, you do have an index on the Status field, right?