not sure why join query is returning a result set longer than I'd expect, and taking so long to execute - MySQL

I have reached an impasse with my knowledge of MySQL joins, and the query I'm trying to execute is taking way too long. Although I'm only a short while into learning MySQL on my own, I have put time into reading about the mechanics of indexes and joins, done many Google searches, and tried a few different query formats — to no avail, so I need help please.
Firstly, I will say that my database is, at the moment, meant to be optimized for the speed of SELECT queries. I know I have a few too many indexes; my approach to learning MySQL is to create a few too many indexes, examine which ones the MySQL optimizer chooses for my purposes (determined by using EXPLAIN), and then work out why it has chosen said index.
Anyhow, I have four tables: table1, table2, table3, table4...
table1.ID1 is the primary key, and other data in table1 might be divided into multiple rows of content in table2.
table2.ID1 identifies every entry in table2 that is built upon content from table1.
table2.ID2 is the primary key for table2.
table3.ID2 identifies every entry in table3 that is built upon content from table2.
table3.ID3 is the primary key for table3.
table4.ID3 identifies every entry in table4 that is built upon content from table3.
Not every entry in table1 has corresponding data in table2, and similarly table2 to table3, and table3 to table4.
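To make the structure concrete, here's a minimal sketch of the schema as I've described it. The key names match what EXPLAIN reports below; everything else (the other columns, table4's primary key name, the exact date type) is illustrative only:

-- Minimal sketch of the four-level hierarchy; non-key columns omitted.
CREATE TABLE table1 (
    ID1  BIGINT   NOT NULL PRIMARY KEY,
    Date DATETIME NOT NULL,              -- the only date column in the chain
    KEY Datekey (Date)
);
CREATE TABLE table2 (
    ID2 BIGINT NOT NULL PRIMARY KEY,
    ID1 BIGINT NOT NULL,                 -- refers to table1.ID1
    KEY ID1key (ID1)
);
CREATE TABLE table3 (
    ID3 BIGINT NOT NULL PRIMARY KEY,
    ID2 BIGINT NOT NULL,                 -- refers to table2.ID2
    KEY ID2key (ID2)
);
CREATE TABLE table4 (
    ID4 BIGINT NOT NULL PRIMARY KEY,     -- hypothetical PK; not named in the question
    ID3 BIGINT NOT NULL,                 -- refers to table3.ID3
    KEY ID3key (ID3)
);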
What I need to do is retrieve the distinct values of ID2 that appear within a date range, and also only if the table2 content eventually appears in table4. The challenge I'm facing is that only table1 has a date column, and I need only entries that also appear in table4.
The following query takes approx 2 minutes.
select table2.ID2 from table1
left join table2 on table1.ID1 = table2.ID1
left join table3 on table3.ID2 = table2.ID2
left join table4 on table4.ID3 = table3.ID3
where table1.Date between "2012-03-11" and "2012-03-18"
By using EXPLAIN with the above query, I see no reason why it should take so long:
+----+-------------+--------+-------+---------------+---------+---------+------------------------+-------+--------------------------+
| id | select_type | table  | type  | possible_keys | key     | key_len | ref                    | rows  | Extra                    |
+----+-------------+--------+-------+---------------+---------+---------+------------------------+-------+--------------------------+
|  1 | SIMPLE      | table1 | range | ...           | Datekey | 9       | NULL                   | 17528 | Using where; Using index |
|  1 | SIMPLE      | table2 | ref   | ...           | ID1key  | 8       | mydata.table1.POSTID   |     1 |                          |
|  1 | SIMPLE      | table3 | ref   | ...           | ID2key  | 8       | mydata.table2.SrcID    |    20 |                          |
|  1 | SIMPLE      | table4 | ref   | ...           | ID3key  | 8       | mydata.table3.ParsedID |    10 | Using index              |
+----+-------------+--------+-------+---------------+---------+---------+------------------------+-------+--------------------------+
I've replaced the names of the possible keys with '...' as they're not that important. In any case, a key is selected for each table.
Moreover, the number of rows in the result set of the query is much larger than the purported 17528 matching rows in the EXPLAIN output. How could it be more?
What am I doing wrong? I've also tried an inner join, with no luck. The way I interpret my query is as a 4-way Venn diagram, with a very small number of rows meeting the overlapping criteria, further narrowed by an index on the date range.
I at least get the result set that I want if I add distinct(table2.ID2), but why am I otherwise getting a result set much longer than what I'd expect, and why is it taking so long?
Sorry if any part of my question has been ambiguous, I'd be happy to clarify as needed.
Thanks,
Brian
EDIT:
All indexes refer to a BIGINT column, as I expect my database to get rather large and need quite a number of unique row identifiers... Perhaps BIGINT is overkill, and reducing the size of that column and/or the index would speed things up further.
Here's my final solution, based on the accepted answer below:
select ID2 from table2
where exists
    (select 1 from table1
     where table1.Date between "2012-03-11" and "2012-03-18"
       and table2.ID1 = table1.ID1
    )
and exists
    (select 1 from table3
     where table3.ID2 = table2.ID2
       and exists
           (select 1 from table4 where table4.ID3 = table3.ID3)
    )
Additionally, I realized I was missing a multi-column index associating table2.ID1 and table2.ID2. After adding this index, the statement returns in about 11 seconds, with approx 20,000 rows.
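For reference, the index I added looks something like this (the index name is arbitrary):

-- Multi-column index covering the correlated lookup on (ID1, ID2).
ALTER TABLE table2 ADD INDEX ID1_ID2key (ID1, ID2);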
I think this is reasonable considering the number of rows in each of my tables:
table1: ~480,000
table2: ~480,000
table3: ~6,000,000
table4: ~60,000,000
Does this sound efficient? I'll accept the answer after I get confirmation that this is the best performance I should expect. I'm running on a 3 GHz Xeon system with 3 GB of memory, Ubuntu 12.04, MySQL 5.5.24.

In all likelihood, your tables have multiple matches between them. Say a row in table1 matches 5 rows in table2, and each of those matches 10 rows in table3. Then you end up with 50 rows in the output.
To solve this, you need to limit your joins to one row per table.
One way is to use the IN clause. If you are using the joins only for filtering, then you can use a WHERE clause instead:
where table2.id1 in (select table1.id1 from table1)
The "in" prevents duplicates.
The other alternative is to pre-aggregate the subqueries you join against (for example with GROUP BY), so each join matches at most one row.
MySQL seems to prefer a slightly different construct for the WHERE clause, from an optimization perspective:
where exists (select 1 from table1 where table1.id = table2.id)
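Putting the pieces together for the tables in the question, a sketch of the full filtering query using correlated EXISTS subqueries might look like this (untested; column names per the question):

-- Distinct ID2 values in the date range that eventually reach table4.
select table2.ID2
from table2
where exists (select 1
              from table1
              where table1.ID1 = table2.ID1
                and table1.Date between '2012-03-11' and '2012-03-18')
  and exists (select 1
              from table3
              join table4 on table4.ID3 = table3.ID3
              where table3.ID2 = table2.ID2);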

Please help to optimize MySql UPDATE with two tables and M:N row mapping

I'm post-processing traces for two different kinds of events, where the data is stored in tables A and B. Both tables have a producer ID and a time index value. While the same producer can trigger a record in both tables, the times when the different events occur are independent, and much more frequent for table B.
I want to update table A such that, for every row in table A, a column value is taken from the most recent row in table B (relative to that row's time index) for the same producer.
Example mappings between two tables:
Here is a simplified example with just one producer in both tables. The goal is not to get the oldest entry in table B, but rather the most recent entry in table B relative to each row in table A. I'm showing B.tIdx < A.tIdx in this example, but <= is just as good for my purposes; just a detail.
Table A                                 Table B
+----+------+----------------------+    +----+------+-------+
| ID | tIdx | NEW value SET FROM B |    | ID | tIdx | value |
+----+------+----------------------+    +----+------+-------+
|  1 |    2 | 12.5                 |    |  1 |    1 | 12.5  |
|  1 |    4 | 4.3                  |    |  1 |    2 | 9.0   |
+----+------+----------------------+    |  1 |    3 | 4.3   |
                                        |  1 |    4 | 7.8   |
                                        |  1 |    5 | 6.2   |
                                        +----+------+-------+
The actual tables have thousands of different IDs, millions of rows, and nearly as many distinct time index values as rows. I'm having trouble coming up with an UPDATE that doesn't take days to complete.
The following UPDATE works, but executes far too slowly; it starts off at a rate of 100s of updates/s, but soon slows to roughly 5 updates/s.
UPDATE A AS x SET value =
    (SELECT value
     FROM B AS y
     WHERE x.ID = y.ID AND x.tIdx > y.tIdx
     ORDER BY y.tIdx DESC   -- take the most recent earlier row in B
     LIMIT 1);
I've tried creating indexes for ID and tIdx separately, but also multi-column indexes with both orders (ID,tIdx) and (tIdx,ID). But even when the multi-column indexes exist, EXPLAIN shows that it only ever uses ID or tIdx, not both together.
I was wondering if the solution is to create nested SELECTs, to first get a temporary table with a particular ID, and then find the 1 row in table B that will meet the time constraint for each tIdx for that particular ID. The following SELECT, with hardcoded ID and tIdx, works and is very fast, completing in 0.00 sec.
SELECT value, ID, tIdx
FROM (
SELECT value, ID, tIdx
FROM B
WHERE ID = 5216
) y
WHERE tIdx < 1253707
ORDER BY tIdx DESC LIMIT 1;
I'd like to incorporate this into an UPDATE somehow, but replace the hardcoded ID and tIdx with the ID,tIdx pair for each row in A.
Or try any other suggestion for a more efficient UPDATE statement.
This is my first post to Stack Overflow. Sincere apologies in advance if I have violated any etiquette.
An UPDATE with INNER JOINs should do it, but it's going to get nasty.
UPDATE a INNER JOIN
    (SELECT b.ID, maxb.atIdx, b.value
     FROM b INNER JOIN (SELECT a.ID, a.tIdx AS atIdx, MAX(b.tIdx) AS bigb
                        FROM b INNER JOIN a
                             ON b.ID = a.ID
                        WHERE b.tIdx <= a.tIdx
                        GROUP BY a.ID, a.tIdx) maxb
          ON b.ID = maxb.ID AND b.tIdx = maxb.bigb
    ) bestb ON a.ID = bestb.ID AND a.tIdx = bestb.atIdx
SET a.value = bestb.value
To explain this, it's best to start with the innermost SQL and work your way out to the UPDATE. To start, we join every record in table A to every record in table B for each ID. We can filter out the B records that are too recent and summarize that result for each table A record. That leaves us with the tIdx of the B row whose value goes into A, for every record key in A. We then join that back to the B table to select the values to update, preserving the A table's keys. That result is joined back to A to perform the update.
You'll have to see whether this is fast enough for you - I'm worried that it accesses the B table twice and the inner query creates a LOT of join combinations. I would pull out that inner query and see how long it runs by itself. On the positive side, these are all very simple, straightforward queries connected by inner joins, so there is some opportunity for efficiency in the query optimizer. I think indexes on a(ID,tIdx) [fast lookup to get the UPDATE row] and b(ID) would be useful here.
One thing you can try is LEAD() (available in MySQL 8.0+) to see if that helps the performance:
UPDATE A JOIN
       (SELECT b.*,
               LEAD(tIdx) OVER (PARTITION BY ID ORDER BY tIdx) AS next_tIdx
        FROM B
       ) b
       ON a.ID = b.ID AND
          a.tIdx >= b.tIdx AND
          (b.next_tIdx IS NULL OR a.tIdx < b.next_tIdx)
SET a.value = b.value;
And for this you want an index on b(id, tidx).
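For completeness, the indexes suggested in both answers could be created like so (index names are arbitrary):

-- Hypothetical names; the columns follow the suggestions above.
CREATE INDEX idx_a_id_tidx ON A (ID, tIdx);   -- fast lookup of the UPDATE row
CREATE INDEX idx_b_id_tidx ON B (ID, tIdx);   -- serves PARTITION BY ID ORDER BY tIdx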

2 inner joins between same 2 tables

I am trying to select columns from 2 tables.
The INNER JOIN conditions are $table1.idaction_url = $table2.idaction AND $table1.idaction_name = $table2.idaction.
However, the query below produces no output. It seems like the INNER JOIN can only take one condition; if I put AND to include both conditions as shown, there won't be any output. Please advise.
$mysql=("SELECT conv(hex($table1.idvisitor), 16, 10) as visitorId,
$table1.server_time, $table1.idaction_url,
$table1.time_spent_ref_action,$table2.name,
$table2.type, $table1.idaction_name, $table2.idaction
FROM $table1
INNER JOIN $table2
ON $table1.idaction_url=$table2.idaction
AND $table1.idaction_name=$table2.idaction
WHERE conv(hex(idvisitor), 16, 10)='".$id."'
ORDER BY server_time DESC");
Short answer:
You need to use two separate inner joins, not only a single join.
E.g.
SELECT `actionurls`.`name` AS `actionUrl`, `actionnames`.`name` AS `actionName`
FROM `table1`
INNER JOIN `table2` AS `actionurls` ON `table1`.`idaction_url` = `actionurls`.`idaction`
INNER JOIN `table2` AS `actionnames` ON `table1`.`idaction_name` = `actionnames`.`idaction`
(Modify this query with any additional fields you want to select).
In depth: an INNER JOIN, when done on a value unique to the second table (the table joined to the first in this operation), will only ever fetch one row. What you want to do is fetch data from the other table twice, into the same row, as you can see in the SELECT part of the statement.
INNER JOIN table2 ON [comparison] will, for each row selected from table1, grab any rows from table2 for which [comparison] is TRUE, then copy the row from table1 N times, where N is the number of rows found in table2. If N = 0, the row is skipped. In our case N = 1, so an INNER JOIN of idaction_name in table1 to idaction in table2, for example, will let you select all the action names.
In order to get the action URLs as well, we have to INNER JOIN a second time. Now, you can't join the same table twice normally, as SQL won't know which of the two joined tables is meant when you type table2.name in the first part of your query; it would be ambiguous. There's a solution for this: table aliases.
The output (of my answer above) is going to be something like:
+-----+------------------------+-------------------------+
| Row | actionUrl | actionName |
+-----+------------------------+-------------------------+
| 1 | unx.co.jp/ | UNIX | Kumamoto Home |
| 2 | unx.co.jp/profile.html | UNIX | Kumamoto Profile |
| ... | ... | ... |
+-----+------------------------+-------------------------+
While if you used only a single join, you would get this kind of output (using OR):
+-----+-------------------------+
| Row | actionUrl |
+-----+-------------------------+
| 1 | unx.co.jp/ |
| 2 | UNIX | Kumamoto Home |
| 3 | unx.co.jp/profile.html |
| 4 | UNIX | Kumamoto Profile |
| ... | ... |
+-----+-------------------------+
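That single-join output would come from something like this sketch, where the two conditions are OR'd in one join, interleaving URLs and names in one column:

-- Sketch: single join with OR; each table1 row matches up to two table2 rows.
SELECT `table2`.`name` AS `actionUrl`
FROM `table1`
INNER JOIN `table2`
        ON `table1`.`idaction_url` = `table2`.`idaction`
        OR `table1`.`idaction_name` = `table2`.`idaction`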
Using AND and a single join, you only get output if idaction_name = idaction_url (both would have to equal the same idaction). This is not the case here, so there's no output.
If you want to know more about how to use JOINS, consult the manual about them.
Sidenote
Also, I can't help but notice you're using variables (e.g. $table1) that store the names of the tables. Do you make sure that those values do not contain user input? And, if they do, do you at least whitelist the tables that users can access? You may have a security issue here.
INNER JOIN does not put any restriction on the number of conditions it can have.
Zero resulting rows means there are no rows satisfying the two conditions simultaneously.
Make sure you are joining on the correct columns, and try going step by step to identify where the data is lost.
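For example, a step-by-step check might look like this sketch; if the first two counts are nonzero but the third is zero, no row satisfies both conditions at once:

-- Count matches for each join condition separately, then combined.
SELECT COUNT(*) FROM table1 t1 JOIN table2 t2 ON t1.idaction_url  = t2.idaction;
SELECT COUNT(*) FROM table1 t1 JOIN table2 t2 ON t1.idaction_name = t2.idaction;
SELECT COUNT(*) FROM table1 t1 JOIN table2 t2
       ON t1.idaction_url = t2.idaction AND t1.idaction_name = t2.idaction;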

MySQL sorting on joined table column extremely slow (temp table)

I have some tables:
object
person
project
[...] (some more tables)
type
The object table has foreign keys to all other tables.
Now I do a query like:
SELECT * FROM object
LEFT JOIN person ON (object.person_id = person.id)
LEFT JOIN project ON (object.project_id = project.id)
LEFT JOIN [...] (all other joins)
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY object.type_id ASC
LIMIT 25
This works perfectly well and fast, even for big result sets. For example, I have 90000 objects and the query takes about 3 seconds. The result is quite big because the tables have a lot of columns and all of them are fetched. For info: I'm using Symfony with Propel, InnoDB, and the "doSelectJoinAll" function.
But if I do a query like this (sort by type.name):
SELECT * FROM object
LEFT JOIN person ON (object.person_id = person.id)
LEFT JOIN project ON (object.project_id = project.id)
LEFT JOIN [...] (all other joins)
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY type.name ASC
LIMIT 25
The query takes about 200 seconds!
EXPLAIN:
+----+-------------+--------+--------+---------------+-------------+---------+---------------------+--------+----------------------------------------------+
| id | select_type | table  | type   | possible_keys | key         | key_len | ref                 | rows   | Extra                                        |
+----+-------------+--------+--------+---------------+-------------+---------+---------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | object | ref    | object_FI_2   | object_FI_2 | 4       | const               | 164966 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | person | eq_ref | PRIMARY       | PRIMARY     | 4       | db.object.person_id |      1 |                                              |
|  1 | SIMPLE      | ...    | eq_ref | PRIMARY       | PRIMARY     | 4       | db.object...._id    |      1 |                                              |
|  1 | SIMPLE      | type   | eq_ref | PRIMARY       | PRIMARY     | 4       | db.object.type_id   |      1 |                                              |
+----+-------------+--------+--------+---------------+-------------+---------+---------------------+--------+----------------------------------------------+
I saw in the process list that MySQL is creating a temporary table for such a sort on a joined table's column.
Adding an index to type.name didn't improve the performance. There are only about 800 type rows.
I found out that the many joins and the big result are the problem, because if I do a query with just one join, like:
SELECT * FROM object
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY type.name ASC
LIMIT 25
it works as fast as expected.
Is there a way to improve such sorting queries on a big result set with many joined tables? Or is it just a bad habit to sort on a joined table's column, and should this be avoided anyway?
Thank you
LEFT gets in the way of rearranging the order of the tables. How fast is it without any LEFT? Do you get the same answer?
LEFT may be a red herring... Here's what the optimizer is likely to be doing:
1. Decide what order to do the tables in, taking into consideration any WHERE filtering and any LEFTs. Because of WHERE object.customer_id = XXX, object is likely to be the best table to start with.
2. Get the rows from object that satisfy the WHERE.
3. Get the columns needed from the other tables (do the JOINs).
4. Sort according to the ORDER BY (** see below).
5. Deliver the first 25 rows.
** Let's dig deeper into these two:
WHERE object.customer_id = XXX ORDER BY object.id
WHERE object.customer_id = XXX ORDER BY virtually-anything-else
You have INDEX(customer_id), correct? And the table is InnoDB, correct? Well, each secondary index implicitly includes the PRIMARY KEY, as if you had said INDEX(customer_id, id). The optimal index for the first WHERE + ORDER BY is precisely that: it will locate XXX, scan 25 rows, then stop. You might say that steps 2, 4, and 5 are blended together.
The second case must gather all the matching rows through step 4 before it can sort. That could be thousands of rows, hence it is likely to be a lot slower.
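As a sketch of the fast case (assuming id is the InnoDB primary key):

-- InnoDB secondary indexes implicitly carry the PK, so this behaves
-- like INDEX(customer_id, id):
ALTER TABLE object ADD INDEX idx_customer (customer_id);

-- This shape can read 25 ordered rows straight off that index and stop:
SELECT id FROM object
WHERE customer_id = XXX
ORDER BY id
LIMIT 25;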
See also article on building optimal indexes.

retrieving top-ranking rows from large tables using FULLTEXT is very slow

When we log into our database with mysql-client and launch these queries:
first test query:
select a.*
from ads a
inner join searchs_titles s on s.id_ad = a.id
where match(s.label) against ('"bmw serie 3"' in boolean mode)
order by a.ranking asc limit 0, 10;
The result is:
10 rows in set (1 min 5.37 sec)
second test query:
select a.*
from ads a
inner join searchs_titles s on s.id_ad = a.id
where match(s.label) against ('"ford mondeo"' in boolean mode)
order by a.ranking asc limit 0, 10;
The result is:
10 rows in set (2 min 13.88 sec)
These queries are too slow. Is there a way to improve this?
The 'ads' table contains 2 million rows; triggers are set up to duplicate the data into searchs_titles, which contains the id, title, and label of each row in ads.
Table 'ads' is powered by InnoDB and 'searchs_titles' by MyISAM, with a fulltext index on the label field.
Do we have too many columns? Too many indexes? Too many rows?
Is it a bad query?
Thanks a lot for the time you will spend helping us!
Edit: added EXPLAIN output
+----+-------------+-------+----------+----------------------+---------+---------+----------------+------+----------------------------------------------+
| id | select_type | table | type     | possible_keys        | key     | key_len | ref            | rows | Extra                                        |
+----+-------------+-------+----------+----------------------+---------+---------+----------------+------+----------------------------------------------+
|  1 | SIMPLE      | s     | fulltext | id_ad,label          | label   | 0       |                |    1 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | a     | eq_ref   | PRIMARY,id,id_2,id_3 | PRIMARY | 4       | XXXXXX.s.id_ad |    1 |                                              |
+----+-------------+-------+----------+----------------------+---------+---------+----------------+------+----------------------------------------------+
Pro tip: Never use * in a SELECT statement in production software (unless you have a very good reason). By asking for all columns, you are denying the optimizer access to information about how best to exploit your indexes.
Observation: you're ordering by ads.ranking and taking ten results. But ads.ranking has very low cardinality -- according to that image in your question, it has 26 distinct values. Is your query working correctly?
Observation: You've said that the fulltext part of your search takes .77 seconds. I mean this part:
select s.id
from searchs_titles AS s
where match(s.label) against ('"ford mondeo"' in boolean mode)
That is good. It means we can focus on the rest of the query.
You also said you've been testing with the insertions to the table turned off. That's good because it rules out contention as a cause for the slow queries.
Suggestion: Create a suitable compound index for ads. For your present query, try an index on (id, ranking). This may allow your ORDER BY operation to avoid a full table scan.
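That could be created like so (the index name is hypothetical):

-- Compound index so id lookups also carry ranking.
ALTER TABLE ads ADD INDEX idx_id_ranking (id, ranking);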
Then, try this query to extract the set of ten a.id values you need, and then retrieve the data rows. This will exploit your compound index.
select z.*
from ads AS z
join (select a.id, a.ranking
      from ads AS a
      inner join searchs_titles s on s.id_ad = a.id
      where match(s.label) against ('"ford mondeo"' in boolean mode)
      order by a.ranking asc
      limit 0, 10
     ) AS b ON z.id = b.id
order by z.ranking
This uses a subquery to do the ORDER BY ... LIMIT ... data-shuffling operation on a small subset of the columns, which should make the retrieval of the appropriate id values much faster. The outer query then fetches the full rows.
The bottom line is this: ORDER BY ... LIMIT ... can be a very expensive operation if it's done on lots of data. But if you can arrange for it to be done on a minimal choice of columns, and those columns are indexed correctly, it can be very fast.

How to optimize SQL query with two where clauses?

My query is something like this
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND (tbl2.date = '$date' OR ('$date' BETWEEN tbl1.planA AND tbl1.planB ))
When I run this query, it is considerably slower than for example this query
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND ('$date' BETWEEN tbl1.planA AND tbl1.planB )
or
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND tbl2.date = '$date'
On localhost, the first query takes about 0.7 second, the second query about 0.012 second, and the third one 0.008 second.
My question is: how do you optimize this? If I currently have 1000 rows in my tables and it takes 0.7 second to run the first query, will it take 7 seconds if I have 10,000 rows? That's a massive slowdown compared to the second query (0.012 second) and the third (0.008).
I've tried adding indexes, but the result is no different.
Thanks
Edit: This application will only work locally, so there's no need to worry about speed over the web.
Sorry, I didn't include the EXPLAIN because my real query is much more complicated (about 5 joins). But the joins (I think) don't really matter, because I've tried omitting them and still get approximately the same result as above.
The date belongs to tbl2; planA and planB belong to tbl1. I've tried adding indexes to tbl2.date, tbl1.planA, and tbl1.planB, but the effect is insignificant.
By schema do you mean MyISAM or InnoDB? It's MyISAM.
Okay, I'll just post my query straight away. Hopefully it's not that confusing.
SELECT *
FROM tb_joborder jo
LEFT JOIN tb_project p ON jo.project_id = p.project_id
LEFT JOIN tb_customer c ON p.customer_id = c.customer_id
LEFT JOIN tb_dispatch d ON jo.joborder_id = d.joborder_id
LEFT JOIN tb_joborderitem ji ON jo.joborder_id = ji.joborder_id
LEFT JOIN tb_mix m ON ji.mix_id = m.mix_id
WHERE dispatch_date = '2011-01-11'
OR '2011-01-11'
BETWEEN planA
AND planB
GROUP BY jo.joborder_id
ORDER BY customer_name ASC
And the describe output
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE jo ALL NULL NULL NULL NULL 453 Using temporary; Using filesort
1 SIMPLE p eq_ref PRIMARY PRIMARY 4 db_dexada.jo.project_id 1
1 SIMPLE c eq_ref PRIMARY PRIMARY 4 db_dexada.p.customer_id 1
1 SIMPLE d ALL NULL NULL NULL NULL 2048 Using where
1 SIMPLE ji ALL NULL NULL NULL NULL 455
1 SIMPLE m eq_ref PRIMARY PRIMARY 4 db_dexada.ji.mix_id 1
You can just use UNION to merge the results of the 2nd and 3rd queries.
More about UNION.
First thing that comes to mind is to union the two:
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND ('$date' BETWEEN tbl1.planA AND tbl1.planB)
UNION -- not UNION ALL: avoids duplicating rows that match both conditions
SELECT * FROM tbl1
JOIN tbl2 ON something = something
WHERE 1 AND tbl2.date = '$date'
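The point of splitting the OR into a UNION is that each branch can be served by its own index. Something like these (hypothetical) indexes may let each half use range or ref access instead of a full scan:

ALTER TABLE tbl1 ADD INDEX idx_plan (planA, planB);   -- serves the BETWEEN branch
ALTER TABLE tbl2 ADD INDEX idx_date (date);           -- serves the equality branch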
You have provided too little information to make optimizations; we don't know anything about your data structures.
Even though most slow queries are due to the query itself or the index setup of the tables used, you can also try to find out where your bottleneck is using the MySQL Query Profiler. It has been included in MySQL since version 5.0.37.
Before you start your query, activate the profiler with this statement:
mysql> set profiling=1;
Now execute your long query.
With
mysql> show profiles;
you can now find out what internal number (query number) your long query has.
If you now execute the following query, you'll get a lot of details about what took how long:
mysql> show profile for query (insert query number here);
(example output)
+--------------------+------------+
| Status | Duration |
+--------------------+------------+
| (initialization) | 0.00005000 |
| Opening tables | 0.00006000 |
| System lock | 0.00000500 |
| Table lock | 0.00001200 |
| init | 0.00002500 |
| optimizing | 0.00001000 |
| statistics | 0.00009200 |
| preparing | 0.00003700 |
| executing | 0.00000400 |
| Sending data | 0.00066600 |
| end | 0.00000700 |
| query end | 0.00000400 |
| freeing items | 0.00001800 |
| closing tables | 0.00000400 |
| logging slow query | 0.00000500 |
+--------------------+------------+
This is a more general, administrative approach, but it can help narrow down or even find the cause of slow queries very nicely.
A good tutorial on how to use the MySQL Query Profiler can be found here in the MySQL articles.