MySQL - Extremely inefficient and unreliable query joining 3 tables - how to improve?

I am using MySQL with the InnoDB engine on a DigitalOcean machine. The machine has 4 GB of memory, an 80 GB disk and 2 vCPUs, and runs Ubuntu 16.04.
We have a query joining three tables that runs very slowly (it takes about 5 minutes to return, if it works at all). The tables have 6 million, 20 million and 100 thousand rows, respectively, and each table has a unique index.
The query looks like this:
SELECT *, table2.column1
FROM table1
INNER JOIN table2 on table1.column1 = table2.column1
INNER JOIN table3 on table1.column2 = table3.column1
WHERE table3.column2 = "{ID}";
We want to embed this query in a data processing / analysis pipeline which dynamically pulls relevant data into memory and then runs further analysis using R. For this purpose, we need this query (or an alternative that does the same thing) to run far more efficiently.
Does anyone have any ideas on how to make this query more efficient, or what the reasons for the slowdown may be? Any help would be greatly appreciated.
Many thanks!

For this query:
select table1.*, table2.column1
from table1 inner join
table2
on table1.column1 = table2.column1 inner join
table3
on table1.column2 = table3.column1
where table3.column2 = "{ID}";
You want indexes on:
table3(column2, column1)
table1(column2, column1)
table2(column1)
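A minimal sketch of those three indexes, run against an in-memory SQLite database through Python's standard sqlite3 module purely so the example is self-contained. The table contents, the `payload` column, the index names and the helper `rows_for_id` are made up for illustration. On the real MySQL server only the CREATE INDEX statements and the parameterised query would carry over, and the resulting plan should still be checked with EXPLAIN there.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so the sketch runs anywhere;
# table contents and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (column1 INTEGER, column2 INTEGER);
    CREATE TABLE table2 (column1 INTEGER, payload TEXT);
    CREATE TABLE table3 (column1 INTEGER, column2 TEXT);

    -- The three indexes suggested above.
    CREATE INDEX idx_t3_c2_c1 ON table3 (column2, column1);
    CREATE INDEX idx_t1_c2_c1 ON table1 (column2, column1);
    CREATE INDEX idx_t2_c1    ON table2 (column1);
""")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO table3 VALUES (?, ?)", [(10, "wanted"), (20, "other")])

QUERY = """
    SELECT table1.*, table2.column1
    FROM table1
    INNER JOIN table2 ON table1.column1 = table2.column1
    INNER JOIN table3 ON table1.column2 = table3.column1
    WHERE table3.column2 = ?
"""

def rows_for_id(wanted_id):
    # Bind the ID as a parameter instead of pasting it into the SQL text.
    return sorted(conn.execute(QUERY, (wanted_id,)).fetchall())

result = rows_for_id("wanted")
```

Binding the ID as a parameter rather than substituting it into the "{ID}" placeholder by hand also avoids quoting problems when the query is generated from a pipeline.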

Related

MySQL Optimiser - cost planner doesn't know when DuplicateWeedout Strategy creates disk table

This is my sample query
Select table1.id
from table1
where table1.id in (select table2.id
from table2
where table2.id in (select table3.id
from table3)
)
order by table1.id
limit 100
The optimiser trace for the above query reports these costs:
DUPLICATE-WEEDOUT strategy - Cost: 1.08e7
FIRST MATCH strategy - Cost: 1.85e7
As the DUPLICATE-WEEDOUT cost is lower, MySQL chose the DUPLICATE-WEEDOUT strategy for the above query.
Everything looks fine in the join_optimization part. The problem only shows up in the join_execution part.
DUPLICATE-WEEDOUT usually creates a temp table. But here, as the heap size was not large enough for the temp table, it was converted to an on-disk temp table (converting_tmp_table_to_ondisk).
Because of the on-disk temp table, my query execution became slower.
So what happened here?
The optimiser does not account for the cost of the on-disk table in the join_optimization part. If that cost had been included, DUPLICATE-WEEDOUT would have come out more expensive than FIRST MATCH.
Then final_semijoin_strategy would have been FIRST-MATCH, and my query would have been faster.
Is there any way to make MySQL account for the cost of the disk table in the join_optimization part itself, or any other workaround for this particular issue?
MySQL 5.7, InnoDB
Note: This is a very dynamic query where multiple conditions are added based on the request. I have already optimised the query in every way I could, and I am now stuck on this disk table cost issue. Please avoid suggestions that change the query itself (such as restructuring it or forcing the FIRST-MATCH strategy). As for increasing the heap size, I am not sure about it; on other forums many people said it might cause issues for other queries.
IN( SELECT ... ) has been notoriously inefficient. Try to avoid it.
The query, as presented, is probably equivalent to
SELECT t1.id
FROM t1
JOIN t2 USING(id)
JOIN t3 USING(id)
ORDER BY id
LIMIT 100
which will optimize nicely.
This formulation should not need to build any temp table, much less a disk-based one.
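The "probably equivalent" claim can be checked on toy data. The sketch below uses SQLite via Python's sqlite3 module as a stand-in for MySQL, with made-up rows, and it assumes `id` is unique in every table (declared here as PRIMARY KEY). Without that assumption the JOIN form could return repeated ids that the IN form would not.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL, and id is unique in each table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY);
    CREATE TABLE t3 (id INTEGER PRIMARY KEY);
""")
conn.executemany("INSERT INTO t1 VALUES (?)", [(i,) for i in range(1, 11)])
conn.executemany("INSERT INTO t2 VALUES (?)", [(i,) for i in (2, 4, 6, 8, 10)])
conn.executemany("INSERT INTO t3 VALUES (?)", [(i,) for i in (4, 8, 9)])

# Original shape: nested IN (SELECT ...) subqueries.
nested_in = conn.execute("""
    SELECT t1.id FROM t1
    WHERE t1.id IN (SELECT t2.id FROM t2
                    WHERE t2.id IN (SELECT t3.id FROM t3))
    ORDER BY t1.id LIMIT 100
""").fetchall()

# Suggested shape: plain joins on the shared id column.
joined = conn.execute("""
    SELECT t1.id FROM t1
    JOIN t2 USING (id)
    JOIN t3 USING (id)
    ORDER BY t1.id LIMIT 100
""").fetchall()
```

Both queries return only the ids present in all three tables. Whether the JOIN form actually avoids the temp table on a given MySQL 5.7 server is something to confirm with EXPLAIN and the optimiser trace.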

Optimize JOIN query in MySQL

This is my MySQL query in general form:
SELECT *
FROM
Table1
INNER JOIN Table2 ON (Table1.Table2_id = Table2.id)
INNER JOIN Table3 ON (Table1.Table3_id = Table3.id)
INNER JOIN Table4 ON (Table1.Table4_id = Table4.id)
INNER JOIN Table5 ON (Table1.Table5_id = Table5.id)
LEFT JOIN Table6 ON (Table1.Table6_id = Table6.id)
ORDER BY Table1.barcode DESC
LIMIT 50
I put indexes on all the ID columns (they exist by default, but I rechecked to be sure) and also on Table1.barcode.
It's a very slow query on my database (about 15 seconds).
I ran it without ORDER BY and, as I expected, it was really fast.
I removed LIMIT 50 and ORDER BY, and it took the same time (about 15 seconds).
I should say that there is certainly a lot of data:
Table1: 300442 records,
Table2: 77 records,
Table3: 314085 records,
Table4: 28987 records,
Table5: 127805 records,
Table6: 3230 records
I want to make it fast.
Maybe I can change the * to only the fields that I need (so I will try that).
Would changing the join order help?
I can increase the server's memory, the number of CPUs, or the CPU speed; which of these would help this query most?
Are there any other recommendations?
Thanks in advance.
You should perhaps try to use explain and figure out what is going on:
https://dev.mysql.com/doc/refman/5.7/en/using-explain.html
Would changing the join order help?
The join order usually does not matter, since most engines optimize the order internally. Some also have a way to force a join order (MySQL has STRAIGHT_JOIN), so you can check for yourself whether a specific order gives better results than the one the engine chose.
I can increase the server's memory, the number of CPUs, or the CPU speed; which of these would help this query most?
I would focus on memory (caching; shared_buffers in PostgreSQL terms, the InnoDB buffer pool in MySQL), but before making any server change you should first investigate the actual issue and try to tune your existing system. For ideas, see the General Setup and Optimization section of https://wiki.postgresql.org/wiki/Performance_Optimization (written for PostgreSQL).
Maybe I can change the * to only the fields that I need (so I will try that).
Definitely. Prefer that to * in general.

Do multiple table joins slow down MySQL?

My simple question is: do multiple table joins slow down MySQL performance?
I have a data set where I need to JOIN about 6 tables, on properly indexed columns.
I read the threads like
Join slows down sql
MySQL adding join slows down whole query
MySQL multiple table join query performance issue
But the question still remains.
Can someone who has experienced this reply?
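MySQL, by default, uses the Block Nested-Loop join algorithm for joins (this describes MySQL 5.x; from 8.0.20 onward a hash join is used wherever block nested-loop would have been). Consider this query: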
SELECT t1.*, t2.col1
FROM table1 t1
LEFT JOIN table2 t2
ON t2.id = t1.id
In effect, it yields the same performance as a subquery like the following:
SELECT t1.*, (SELECT col1 FROM table2 t2 WHERE t2.id = t1.id)
FROM table1 t1
Indexes are obviously important to satisfy the WHERE clause in the subquery, and are used in the same fashion for join operations.
The performance of a join, assuming proper indexes, amounts to the number of lookups that MySQL must perform. The more lookups, the longer it takes.
Hence, the more rows involved, the slower the join. Joins with small result sets (few rows) are fast and considered normal usage. Keep your result sets small and use proper indexes, and you'll be fine. Don't avoid the join.
Of course, sorting results from multiple tables can be a bit more complicated for MySQL, and any time you join text or blob columns MySQL requires a temporary table, and there are numerous other details.
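The two query shapes compared above can be run side by side to confirm that they return the same rows. This is only a sketch on made-up data, using SQLite through Python's sqlite3 module rather than MySQL, and it assumes `table2.id` is unique so that the scalar subquery never matches more than one row. It shows result equivalence only; it says nothing about timing.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; rows and the name column are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, col1 TEXT);
    INSERT INTO table1 VALUES (1, 'one'), (2, 'two'), (3, 'three');
    INSERT INTO table2 VALUES (1, 'x'), (3, 'z');
""")

# LEFT JOIN form: one index lookup into table2 per row of table1.
via_join = conn.execute("""
    SELECT t1.*, t2.col1
    FROM table1 t1
    LEFT JOIN table2 t2 ON t2.id = t1.id
    ORDER BY t1.id
""").fetchall()

# Correlated scalar subquery form: the same lookup, written per row.
via_subquery = conn.execute("""
    SELECT t1.*, (SELECT col1 FROM table2 t2 WHERE t2.id = t1.id)
    FROM table1 t1
    ORDER BY t1.id
""").fetchall()
```

Rows of table1 with no match in table2 come back with NULL (None in Python) in both forms.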

Does mysql optimize the IN clause

When I execute a MySQL query like
select * from t1 where column1 in (select column1 from t2),
what really happens?
I want to know whether it executes the inner statement for every row.
PS: I have 300,000 rows in t1 and 50,000 rows in t2, and it is taking a hell of a long time.
I'm flabbergasted to see everyone recommending a JOIN as if it were the same thing. IT IS NOT, at least not with the information given here. What if t2.column1 contains duplicates?
=> Assuming there are no duplicates in t2.column1, then yes, put a UNIQUE INDEX on that column and use a JOIN construction, as it is more readable and easier to maintain. Whether it will be faster depends on what the query engine makes of it. In MSSQL the query optimizer would (probably) consider them the same thing; maybe MySQL is 'not so eager' to recognize this... I don't know.
=> Assuming there can be duplicates in t2.column1, put a (non-unique) INDEX on that column and rewrite the WHERE IN (SELECT ..) as WHERE EXISTS ( SELECT * FROM t2 WHERE t2.column1 = t1.column1). Again, this is mostly for readability and ease of maintenance; most likely the query engine will treat them the same...
The things to remember are
Always make sure you have proper indexing (but don't go overboard)
Always realize that what really happens will be an interpretation of your sql-code; not a 'direct translation'. You can write the same functionality in different ways to achieve the same goal. And some of these are indeed more resilient to different scenarios.
If you only have 10 rows, pretty much everything works. If you have 10M rows it could be worth examining the query plan... which most-likely will be different from the one with 10 rows.
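The duplicates caveat raised in this answer is easy to reproduce. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for MySQL; the rows and the helper `ids` are made up for illustration. It shows that IN and EXISTS return each matching t1 row once, while a plain INNER JOIN repeats a t1 row for every matching t2 row.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (column1 INTEGER);
    CREATE TABLE t2 (column1 INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1), (1), (3);   -- value 1 appears twice in t2
""")

def ids(sql):
    # Run a query and return its first column as a sorted list.
    return sorted(row[0] for row in conn.execute(sql))

with_in = ids("SELECT t1.column1 FROM t1 WHERE column1 IN (SELECT column1 FROM t2)")
with_exists = ids("""
    SELECT t1.column1 FROM t1
    WHERE EXISTS (SELECT * FROM t2 WHERE t2.column1 = t1.column1)
""")
with_join = ids("SELECT t1.column1 FROM t1 INNER JOIN t2 ON t1.column1 = t2.column1")
```

Here the IN and EXISTS forms give the same two rows, while the JOIN gives three, so the JOIN rewrite is only a drop-in replacement when t2.column1 is unique (or the duplicates are removed some other way).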
A join would be quicker, viz:
select t1.* from t1 INNER JOIN t2 on t1.column1=t2.column1
Try with INNER JOIN
SELECT t1.*
FROM t1
INNER JOIN t2 ON t1.column1=t2.column1
You should index column1 in both tables, and then you can use an inner join.
To create the indexes:
CREATE INDEX index1 ON t1 (column1);
CREATE INDEX index2 ON t2 (column1);
select t1.* from t1 INNER JOIN t2 on t1.column1=t2.column1

Create Table from inner join extremely slow

I have a statement as below
CREATE TABLE INPUT_OUTPUT
SELECT T1_C1,.....,T1_C300, T1_PID from T1
INNER JOIN (SELECT T2_C1,T2_C2,T2_PID FROM T2) as RESPONSE ON T1.T1_PID=RESPONSE.T2_PID
which is running extremely slowly: 5 hours so far. The two tables have about 4 million rows and a few hundred columns.
I have an 8-core, 64 GB RAM Ubuntu Linux machine, and using top I can see that the mysql process is using less than 3 GB and just one core, although admittedly its usage of that core is consistently at 100%. It's upsetting that not all cores are being used.
I want to create the table much faster than this.
Should I use
CREATE TABLE INPUT_OUTPUT LIKE T1
then alter INPUT_OUTPUT by adding the extra columns for the relevant ones in T2, and then populate it? I'm not sure of the syntax for this or whether it would speed things up.
Does T1_PID have an index? If so, this should run quickly. Run an EXPLAIN of the SELECT part of your query and see what it says.
That said, I don't understand why you need the subquery. What is wrong with:
CREATE TABLE INPUT_OUTPUT
SELECT T1_C1,.....,T1_C300, T1_PID, T2_C1, T2_C2, T2_PID
FROM T1 INNER JOIN T2 ON T1.T1_PID=T2.T2_PID
Using the latter should work if either T1 or T2 has a PID index.
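A reduced sketch of the direct-join version, run on SQLite through Python's sqlite3 module as a stand-in for MySQL. The hundreds of T1 columns are cut down to a single T1_C1 here, the rows are made up, and the index name `idx_t2_pid` is hypothetical. SQLite requires the AS keyword in CREATE TABLE ... AS SELECT, which MySQL also accepts.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; T1 is reduced to two columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (T1_C1 TEXT, T1_PID INTEGER);
    CREATE TABLE T2 (T2_C1 TEXT, T2_C2 TEXT, T2_PID INTEGER);
    INSERT INTO T1 VALUES ('a', 1), ('b', 2), ('c', 3);
    INSERT INTO T2 VALUES ('x', 'y', 1), ('p', 'q', 3);

    -- Index the join key on one side before building the new table.
    CREATE INDEX idx_t2_pid ON T2 (T2_PID);

    -- Join the base tables directly, with no derived-table subquery.
    CREATE TABLE INPUT_OUTPUT AS
    SELECT T1_C1, T1_PID, T2_C1, T2_C2, T2_PID
    FROM T1 INNER JOIN T2 ON T1.T1_PID = T2.T2_PID;
""")

built = conn.execute("SELECT * FROM INPUT_OUTPUT ORDER BY T1_PID").fetchall()
```

On the real tables, running EXPLAIN on the SELECT part first, as suggested above, shows whether the PID index is actually used before committing to a multi-hour build.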