Optimize 5 table SQL query (stores => items => words) - mysql

Tables
stores (100,000 rows): id (pk), name, lat, lng, ...
store_items (9,000,000 rows): store_id (fk), item_id (fk)
items (200,000 rows): id(pk), name, ...
item_words (1,000,000 rows): item_id(fk), word_id(fk)
words (50,000 rows): id(pk), word VARCHAR(255)
Note: all ids are integers.
========
Indexes
CREATE UNIQUE INDEX storeitems_storeid_itemid_i ON store_items(store_id,item_id);
CREATE UNIQUE INDEX itemwords_wordid_itemid_i ON item_words(word_id,item_id);
CREATE UNIQUE INDEX words_word_i ON words(word);
Note: I prefer multi-column indexes (storeitems_storeid_itemid_i and itemwords_wordid_itemid_i); see http://www.mysqlperformanceblog.com/2008/08/22/multiple-column-index-vs-multiple-indexes/
QUERY
select s.name, s.lat, s.lng, i.name
from words w, item_words iw, items i, store_items si, stores s
where iw.word_id=w.id
and i.id=iw.item_id
and si.item_id=i.id
and s.id=si.store_id
and w.word='MILK';
Problem: elapsed time is 20-120 sec (depending on the word)!!!
explain $QUERY$
+----+-------------+-------+--------+--------------------------------------------------+---------------------------+---------+-------------+------+-------------+
| id | select_type | table | type   | possible_keys                                    | key                       | key_len | ref         | rows | Extra       |
+----+-------------+-------+--------+--------------------------------------------------+---------------------------+---------+-------------+------+-------------+
|  1 | SIMPLE      | w     | const  | PRIMARY,words_word_i                             | words_word_i              | 257     | const       |    1 | Using index |
|  1 | SIMPLE      | iw    | ref    | itemwords_wordid_itemid_i,itemwords_itemid_fk    | itemwords_wordid_itemid_i | 4       | const       |    1 | Using index |
|  1 | SIMPLE      | i     | eq_ref | PRIMARY                                          | PRIMARY                   | 4       | iw.item_id  |    1 |             |
|  1 | SIMPLE      | si    | ref    | storeitems_storeid_itemid_i,storeitems_itemid_fk | storeitems_itemid_fk      | 4       | iw.item_id  |   16 | Using index |
|  1 | SIMPLE      | s     | eq_ref | PRIMARY                                          | PRIMARY                   | 4       | si.store_id |    1 |             |
+----+-------------+-------+--------+--------------------------------------------------+---------------------------+---------+-------------+------+-------------+
I want elapsed time to be less than 5 secs!!! Any ideas???
==============
What I tried
I tried to see where the increase in execution time happens by adding tables to the query one at a time.
1 table
select * from words where word='MILK';
Elapsed time: 0.4 sec
2 tables
select count(*)
from words w, item_words iw
where iw.word_id=w.id
and w.word='MILK';
Elapsed time: 0.5-2 sec (depending on word)
3 tables
select count(*)
from words w, item_words iw, items i
where iw.word_id=w.id
and i.id=iw.item_id
and w.word='MILK';
Elapsed time: 0.5-2 sec (depending on word)
4 tables
select count(*)
from words w, item_words iw, items i, store_items si
where iw.word_id=w.id
and i.id=iw.item_id
and si.item_id=i.id
and w.word='MILK';
Elapsed time: 20-120 sec (depending on word)
I guess the problem is with the indexes or with the design of the query/database. But there must be a way to make it work fast. Google does it somehow, and their tables are much bigger!

a) You're actually writing queries to do full-text search (FTS) inside MySQL -> use a real FTS engine like Lucene instead.
b) Clearly, adding the 9M-row join is the performance issue.
c) How about limiting that join (maybe it's being done in full with the current query plan) like this:
SELECT s.name, s.lat, s.lng, i.name
FROM (SELECT * FROM words WHERE word='MILK') w
INNER JOIN item_words iw ON iw.word_id = w.id
INNER JOIN items i ON i.id = iw.item_id
INNER JOIN store_items si ON si.item_id = i.id
INNER JOIN stores s ON s.id = si.store_id;
The logic behind this is that instead of joining the full tables and then limiting the results, you start by limiting the set you join from. If the join order happens to be the one I wrote, this will greatly reduce your working set and the running time of the inner queries.
d) Google does NOT use mysql for FTS

Consider de-normalising the structure. The first candidate is the 1-million-row item_words table: bring the words directly into that table. Creating a list of unique words might then be more easily achieved through a view (it depends on how often you need that data compared to, for example, your need to extract a list of shops with products associated with a keyword).
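A rough sketch of that de-normalisation, assuming the new word column is kept in sync by the application or a trigger (the column and index names here are only illustrative):
ALTER TABLE item_words ADD COLUMN word VARCHAR(255);
UPDATE item_words iw JOIN words w ON w.id = iw.word_id SET iw.word = w.word;
CREATE INDEX itemwords_word_itemid_i ON item_words(word, item_id);
The search can then start directly from item_words with WHERE iw.word='MILK' and skip the words table entirely.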
Secondly - create indexed views (not an option in MySQL, but certainly an option on other commercial databases).

You don't have an index that it can use to find the store_id if given the item_id. If the cardinality of store_id is low enough it might gain some benefit from storeitems_storeid_itemid_i, but since you have 100,000 stores this might not be so useful. You might try creating an index on store_items that lists the item_id first:
CREATE UNIQUE INDEX storeitems_item_store ON store_items(item_id, store_id);
Also, I'm not sure if putting join conditions in the where clause will affect performance adversely in the way you're seeing but you might try changing the query to something like this:
select s.name, s.lat, s.lng, i.name
from words w LEFT JOIN item_words iw ON w.id=iw.word_id
LEFT JOIN items i ON i.id=iw.item_id
LEFT JOIN store_items si ON si.item_id=i.id
LEFT JOIN stores s ON s.id=si.store_id
where w.word='MILK';

Without knowing the exact layout of your tables it's hard to give a good answer, but these kinds of multi-table joins have a tendency to get really bogged down, especially if one of the selection criteria is a dynamic string.
You could try returning multiple result sets of the tables in one go, from a stored procedure or something, and then join the data outside of SQL. This way I got the query time of a massive join down from 2 minutes to 4 seconds. Or do it with a temporary table and return the result set from that when you are done.
Start with selecting from the words table since that's where you have the dynamic string. Then you can select from the other tables based on the data returned from that query.
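A rough sketch of that two-step approach with a temporary table (the table name is illustrative, and it assumes the set of item ids for a word is reasonably small):
CREATE TEMPORARY TABLE tmp_milk_items AS
SELECT iw.item_id
FROM words w
JOIN item_words iw ON iw.word_id = w.id
WHERE w.word = 'MILK';
SELECT s.name, s.lat, s.lng, i.name
FROM tmp_milk_items t
JOIN items i ON i.id = t.item_id
JOIN store_items si ON si.item_id = t.item_id
JOIN stores s ON s.id = si.store_id;
DROP TEMPORARY TABLE tmp_milk_items;
An index on tmp_milk_items(item_id) can be added before the second query if the word matches many items.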

Try this one.
Rewrite the query like this:
select s.name, s.lat, s.lng, i.name
from words w LEFT JOIN item_words iw ON w.id=iw.word_id AND w.word='MILK'
LEFT JOIN items i ON i.id=iw.item_id
LEFT JOIN store_items si ON si.item_id=i.id
LEFT JOIN stores s ON s.id=si.store_id
And create an index on words(id, word).
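As DDL, that suggestion would be something like this (the index name is illustrative):
CREATE INDEX words_id_word_i ON words(id, word);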

Have you tried analyzing the tables? This will help the optimiser select the best possible execution plan.
e.g.:
ANALYZE TABLE words;
ANALYZE TABLE item_words;
ANALYZE TABLE items;
ANALYZE TABLE store_items;
ANALYZE TABLE stores;
see: http://dev.mysql.com/doc/refman/5.0/en/analyze-table.html

Related

get one record at a time from a joined table

I want to get one record at a time from a joined table, but I don't want the tables to be joined as a whole each time.
The actual tables are as follow.
table contents -- stores content information.
+----+-----------+--------+----------+---------------------+
| id | name      | status | priority | last_registered_day |
+----+-----------+--------+----------+---------------------+
| 1  | content_1 | 0      | 1        | 2020/10/10 11:20:20 |
| 2  | content_2 | 2      | 1        | 2020/10/10 11:21:20 |
| 3  | content_3 | 2      | 2        | 2020/10/10 11:22:20 |
+----+-----------+--------+----------+---------------------+
table clusters -- stores cluster information
+----+-----------+
| id | name      |
+----+-----------+
| 1  | cluster_1 |
| 2  | cluster_2 |
+----+-----------+
table content_cluster -- each record indicates that one content is on one cluster
+------------+------------+---------------------+
| content_id | cluster_id | last_update_date    |
+------------+------------+---------------------+
| 1          | 1          | 2020-10-01T11:30:00 |
| 2          | 2          | 2020-10-01T11:30:00 |
| 3          | 1          | 2020-10-01T10:30:00 |
| 3          | 2          | 2020-10-01T10:30:00 |
+------------+------------+---------------------+
By specifying a cluster_id, I want to get one content name at a time where contents.status=2 and the (content_id, cluster_id) pair is in content_cluster. The SQL query is something like the following.
SELECT contents.name
FROM contents
JOIN content_cluster ON contents.content_id = content_cluster.content_id
WHERE contents.status = 2
  AND content_cluster.cluster_id = <cluster_id>
ORDER BY contents.priority, contents.last_registered_day, contents.name
LIMIT 1;
However, I don't want the tables to be joined as a whole every time as I have to do it frequently and the tables are large. Is there any efficient way to do this? I can add some indices to the tables. What should I do?
I would try writing the query like this:
SELECT c.name
FROM contents c
WHERE EXISTS (SELECT 1
              FROM content_cluster cc
              WHERE cc.content_id = c.content_id AND
                    cc.cluster_id = <cluster_id>
             ) AND
      c.status = 2
ORDER BY c.priority, c.last_registered_day, c.name
LIMIT 1;
Then create the following indexes:
contents(status, priority, last_registered_day, name, content_id)
content_cluster(content_id, cluster_id).
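Spelled out as CREATE INDEX statements (the index names are only illustrative, and this assumes the column names used in the query above):
CREATE INDEX contents_status_covering_i ON contents(status, priority, last_registered_day, name, content_id);
CREATE INDEX content_cluster_content_cluster_i ON content_cluster(content_id, cluster_id);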
The goal is for the execution plan to scan the index on contents and, for each row, look up whether there is a match in content_cluster. The query stops at the first match.
I can't guarantee that this will generate that plan (avoiding the sort), but it is worth a try.
This query can easily be optimized by applying the correct indexes. Apply the ALTER statements below, and let me know whether the performance has considerably improved:
alter table contents
add index idx_1 (id),
add index idx_2(status);
alter table content_cluster
add index idx_1 (content_id),
add index idx_2(cluster_id);
If a content can be in multiple clusters and the number of clusters can change, I think that doing a join like this is the best solution.
You could try splitting your contents table into different tables each containing the contents of a specific cluster, but it would need to be updated frequently.

MySQL sorting on joined table column extremely slow (temp table)

I have some tables:
object
person
project
[...] (some more tables)
type
The object table has foreign keys to all other tables.
Now I do a query like:
SELECT * FROM object
LEFT JOIN person ON (object.person_id = person.id)
LEFT JOIN project ON (object.project_id = project.id)
LEFT JOIN [...] (all other joins)
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY object.type_id ASC
LIMIT 25
This works perfectly well and fast, even for big result sets. For example, I have 90000 objects and the query takes about 3 seconds. The result is quite big because the tables have a lot of columns and all of them are fetched. For info: I'm using Symfony with Propel, InnoDB and the "doSelectJoinAll" function.
But if I do a query like this (sorting by type.name):
SELECT * FROM object
LEFT JOIN person ON (object.person_id = person.id)
LEFT JOIN project ON (object.project_id = project.id)
LEFT JOIN [...] (all other joins)
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY type.name ASC
LIMIT 25
The query takes about 200 seconds!
EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | object | ref | object_FI_2 | object_FI_2 | 4 | const | 164966 | Using where; Using temporary; Using filesort
1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | db.object.person_id | 1
1 | SIMPLE | ... | eq_ref | PRIMARY | PRIMARY | 4 | db.object...._id | 1
1 | SIMPLE | type | eq_ref | PRIMARY | PRIMARY | 4 | db.object.type_id | 1
I saw in the process list that MySQL creates a temporary table for such a sort on a joined table's column.
Adding an index to type.name didn't improve the performance. There are only about 800 type rows.
I found out that the many joins and the big result set are the problem, because if I do a query with just one join, like:
SELECT * FROM object
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY type.name ASC
LIMIT 25
it works as fast as expected.
Is there a way to improve such sorting queries on a big resultset with many joined tables? Or is it just a bad habit to sort on a joined table column and this shouldn't be done anyway?
Thank you
LEFT gets in the way of rearranging the order of the tables. How fast is it without any LEFT? Do you get the same answer?
LEFT may be a red herring... Here's what the optimizer is likely to be doing:
1. Decide what order to do the tables in. Take into consideration any WHERE filtering and any LEFTs. Because of WHERE object.customer_id = XXX, object is likely to be the best table to start with.
2. Get the rows from object that satisfy the WHERE.
3. Get the columns needed from the other tables (do the JOINs).
4. Sort according to the ORDER BY (** see below).
5. Deliver the first 25 rows.
** Let's dig deeper into these two:
WHERE object.customer_id = XXX ORDER BY object.id
WHERE object.customer_id = XXX ORDER BY virtually-anything-else
You have INDEX(customer_id), correct? And the table is InnoDB, correct? Well, each secondary index implicitly includes the PRIMARY KEY, as if you had said INDEX(customer_id, id). The optimal index for the first WHERE + ORDER BY is precisely that. It will locate XXX and scan 25 rows, then stop. You might say that steps 2,4,5 are blended together.
The second case has to gather all the matching rows through step 4. This could be thousands of rows, hence it is likely to be a lot slower.
See also article on building optimal indexes.
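One common way to act on that (not spelled out in this answer, but it is the same "deferred join" idea used in the full-text question's answer further below) is to do the filter, sort and LIMIT on a minimal join first, and only then join in the wide tables for the 25 surviving rows. A sketch, assuming object.id is the primary key:
SELECT o.*, p.*, pr.*, t.*
FROM (SELECT object.id
      FROM object
      LEFT JOIN type ON object.type_id = type.id
      WHERE object.customer_id = XXX
      ORDER BY type.name ASC
      LIMIT 25) AS top25
JOIN object o ON o.id = top25.id
LEFT JOIN person p ON o.person_id = p.id
LEFT JOIN project pr ON o.project_id = pr.id
LEFT JOIN type t ON o.type_id = t.id
ORDER BY t.name ASC;
The other LEFT JOINs from the original query can be added back in the outer part in the same way.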

retrieving top-ranking rows from large tables using FULLTEXT is very slow

When we log into our database with mysql-client and launch these queries:
first test query:
select a.*
from ads a
inner join searchs_titles s on s.id_ad = a.id
where match(s.label) against ('"bmw serie 3"' in boolean mode)
order by a.ranking asc limit 0, 10;
The result is:
10 rows in set (1 min 5.37 sec)
second test query:
select a.*
from ads a
inner join searchs_titles s on s.id_ad = a.id
where match(s.label) against ('"ford mondeo"' in boolean mode)
order by a.ranking asc limit 0, 10;
The result is:
10 rows in set (2 min 13.88 sec)
These queries are too slow. Is there a way to improve this?
The 'ads' table contains 2 million rows; triggers are set up to duplicate the data into searchs_titles, which contains the id, title and label of each row in ads.
Table 'ads' uses InnoDB and 'searchs_titles' uses MyISAM, with a fulltext index on the label field.
Do we have too many columns? Too many indexes? Too many rows?
Is it a bad query?
Thanks a lot for the time you will spend helping us!
Edit: add explain
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | s | fulltext | id_ad,label | label | 0 | | 1 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY,id,id_2,id_3 | PRIMARY | 4 | XXXXXX.s.id_ad | 1 | |
Pro tip: Never use * in a SELECT statement in production software (unless you have a very good reason). By asking for all columns, you are denying the optimizer access to information about how best to exploit your indexes.
Observation: you're ordering by ads.ranking and taking ten results. But ads.ranking has very low cardinality -- according to that image in your question, it has 26 distinct values. Is your query working correctly?
Observation: You've said that the fulltext part of your search takes .77 seconds. I mean this part:
select s.id
from searchs_titles AS s
where match(s.label) against ('"ford mondeo"' in boolean mode)
That is good. It means we can focus on the rest of the query.
You also said you've been testing with the insertions to the table turned off. That's good because it rules out contention as a cause for the slow queries.
Suggestion: Create a suitable compound index for ads. For your present query, try an index on (id, ranking). This may allow your ORDER BY operation to avoid a full table scan.
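In DDL form, that suggestion would be something like this (the index name is illustrative):
ALTER TABLE ads ADD INDEX ads_id_ranking_i (id, ranking);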
Then, try this query to extract the set of ten a.id values you need, and then retrieve the data rows. This will exploit your compound index.
select z.*
from ads AS z
join ( select a.id, a.ranking
from ads AS a
inner join searchs_titles s on s.id_ad = a.id
where match(s.label) against ('"ford mondeo"' in boolean mode)
order by a.ranking asc
limit 0, 10
) AS b ON z.id = b.id
order by z.ranking
This uses a subquery to do the ORDER BY ... LIMIT ... data-shuffling operation on a small subset of the columns. This should make the retrieval of the appropriate id values much faster. The outer query then fetches the full rows.
The bottom line is this: ORDER BY ... LIMIT ... can be a very expensive operation if it's done on lots of data. But if you can arrange for it to be done on a minimal choice of columns, and those columns are indexed correctly, it can be very fast.

What type of Join to use?

I've got a core table and 3 tables that extend the 'core' table in different ways.
I'm working with MLS data: I have a 'common' table that contains information common to all MLS listings, and then one table with specifically "residential" information, one for "commercial", etc. I have been using the MLS number to join against a single table when the property type of a listing is known, but for searching I want to join all of them and have the special fields available as search criteria (not simply searching the common table).
What type of join will give me a dataset that will contain all listings (including the extended fields in the idx tables)?
For each Common table record there is a single corresponding record in ONLY ONE of the idx tables.
              _________
             |         |
             | COMMON  |
             |_________|
                  |
     _____________|_____________
 ____|____    ____|____    ____|____
|         |  |         |  |         |
|  IDX1   |  |  IDX2   |  |  IDX3   |
|_________|  |_________|  |_________|
If you want everything in one row, you can use something like this format. Basically it gives you all the "Common" fields, then the other fields if there is a match otherwise NULL:
SELECT Common.*,
Idx1.*,
Idx2.*,
Idx3.*
FROM Common
LEFT JOIN Idx1
ON Idx1.MLSKey = Common.MLSKey
LEFT JOIN Idx2
ON Idx2.MLSKey = Common.MLSKey
LEFT JOIN Idx3
ON Idx3.MLSKey = Common.MLSKey
Bear in mind it's better to list out fields than to use the SELECT * whenever possible...
Also I'm assuming MySQL syntax is the same as SQL Server, which is what I use.
I have a similar set up of tables where the table 'jobs' is the core table.
I have this query that selects certain elements from each of the other 2 tables:
SELECT jobs.frequency, twitterdetails.accountname, feeds.feed
FROM jobs
JOIN twitterdetails ON twitterdetails.ID = jobs.accountID
JOIN feeds ON jobs.FeedID = feeds.FeedID
WHERE jobs.username = '$currentuser';
So, as you can see, no specific JOIN, but the linking fields defined. You'd probably just need an extra JOIN line for your set up.
Ugly solution / poor attempt / may have misunderstood question:
SELECT common.*,IDX1.field,NULL,NULL FROM COMMON
LEFT JOIN IDX1 ON COMMON.ID = IDX1.ID
WHERE TYPE="RESIDENTIAL"
UNION ALL
SELECT common.*,NULL,IDX2.field,NULL FROM COMMON
LEFT JOIN IDX2 ON COMMON.ID = IDX2.ID
WHERE TYPE="COMMERCIAL"
UNION ALL
SELECT common.*,NULL,NULL,IDX3.field FROM COMMON
LEFT JOIN IDX3 ON COMMON.ID = IDX3.ID
WHERE TYPE="INDUSTRIAL"
Orbit is close. Use inner join, not left join. You don't want common to show up in the join if it does not have a row in idx.
You MUST union 3 queries to get the proper results assuming each record in common can only have 1 idx table. Plug in "NULL" to fill in the columns that each idx table is missing so they can be unioned.
BTW your table design is good.

How to optimize SQL query with two where clauses?

My query is something like this
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND (tbl2.date = '$date' OR ('$date' BETWEEN tbl1.planA AND tbl1.planB ))
When I run this query, it is considerably slower than for example this query
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND ('$date' BETWEEN tbl1.planA AND tbl1.planB )
or
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND tbl2.date = '$date'
In localhost, the first query takes about 0.7 second, the second query about 0.012 second and the third one 0.008 second.
My question is how do you optimize this? If currently I have 1000 rows in my tables and it takes 0.7 second to display the first query, it will take 7 seconds if I have 10,000 rows, right? That's a massive slowdown compared to the second query (0.012 second) and the third (0.008).
I've tried adding indexes, but the result is no different.
Thanks
Edit: This application will only be used locally, so there is no need to worry about speed over the web.
Sorry, I didn't include the EXPLAIN because my real query is much more complicated (about 5 joins). But the joins (I think) don't really matter, because I've tried omitting them and still get approximately the same result as above.
The date belongs to tbl1; planA and planB belong to tbl2. I've tried adding indexes to tbl1.date, tbl2.planA and tbl2.planB but the improvement is insignificant.
By schema do you mean MyISAM or InnoDB? It's MyISAM.
Okay, I'll just post my query straight away. Hopefully it's not that confusing.
SELECT *
FROM tb_joborder jo
LEFT JOIN tb_project p ON jo.project_id = p.project_id
LEFT JOIN tb_customer c ON p.customer_id = c.customer_id
LEFT JOIN tb_dispatch d ON jo.joborder_id = d.joborder_id
LEFT JOIN tb_joborderitem ji ON jo.joborder_id = ji.joborder_id
LEFT JOIN tb_mix m ON ji.mix_id = m.mix_id
WHERE dispatch_date = '2011-01-11'
   OR '2011-01-11' BETWEEN planA AND planB
GROUP BY jo.joborder_id
ORDER BY customer_name ASC
And the describe output
id | select_type | table | type   | possible_keys | key     | key_len | ref                     | rows | Extra
1  | SIMPLE      | jo    | ALL    | NULL          | NULL    | NULL    | NULL                    | 453  | Using temporary; Using filesort
1  | SIMPLE      | p     | eq_ref | PRIMARY       | PRIMARY | 4       | db_dexada.jo.project_id | 1    |
1  | SIMPLE      | c     | eq_ref | PRIMARY       | PRIMARY | 4       | db_dexada.p.customer_id | 1    |
1  | SIMPLE      | d     | ALL    | NULL          | NULL    | NULL    | NULL                    | 2048 | Using where
1  | SIMPLE      | ji    | ALL    | NULL          | NULL    | NULL    | NULL                    | 455  |
1  | SIMPLE      | m     | eq_ref | PRIMARY       | PRIMARY | 4       | db_dexada.ji.mix_id     | 1    |
You can just use UNION to merge the results of the 2nd and 3rd queries.
More about UNION.
First thing that comes to mind is to union the two:
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND ('$date' BETWEEN planA AND planB )
UNION ALL
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND date = '$date'
You have provided too little to make optimizations. We don't know anything about your data structures.
Even though most slow queries are usually due to the query itself or the index setup of the tables involved, you can also try to find out where your bottleneck is by using the MySQL Query Profiler. It has been available since MySQL 5.0.37.
Before you start your query, activate the profiler with this statement:
mysql> set profiling=1;
Now execute your long query.
With
mysql> show profiles;
you can now find out what internal number (query number) your long query has.
If you now execute the following query, you'll get a lot of detail about what took how long:
mysql> show profile for query (insert query number here);
(example output)
+--------------------+------------+
| Status | Duration |
+--------------------+------------+
| (initialization) | 0.00005000 |
| Opening tables | 0.00006000 |
| System lock | 0.00000500 |
| Table lock | 0.00001200 |
| init | 0.00002500 |
| optimizing | 0.00001000 |
| statistics | 0.00009200 |
| preparing | 0.00003700 |
| executing | 0.00000400 |
| Sending data | 0.00066600 |
| end | 0.00000700 |
| query end | 0.00000400 |
| freeing items | 0.00001800 |
| closing tables | 0.00000400 |
| logging slow query | 0.00000500 |
+--------------------+------------+
This is a more general, administrative approach, but it can help you narrow down or even find the cause of slow queries very nicely.
A good tutorial on how to use the MySQL Query Profiler can be found here in the MySQL articles.