Create Table from inner join extremely slow - mysql

I have a statement as below
CREATE TABLE INPUT_OUTPUT
SELECT T1_C1, ....., T1_C300, T1_PID
FROM T1
INNER JOIN (SELECT T2_C1, T2_C2, T2_PID FROM T2) AS RESPONSE
  ON T1.T1_PID = RESPONSE.T2_PID
which has been running extremely slowly, for 5 hours now. The two tables have about 4 million rows and a few hundred columns.
I have an 8-core, 64 GB RAM Ubuntu Linux machine, and using top I can see that not even 3 GB is being used by the mysql process, on just one core, although admittedly its usage is consistently at 100%. It's upsetting that not all cores are being used.
I want to create the table much faster than this.
Should I use
CREATE TABLE INPUT_OUTPUT LIKE T1
and then ALTER INPUT_OUTPUT to add the extra columns that are relevant from T2, before populating it? I'm not sure of the syntax to do this, or whether it will lead to a speed-up.
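Roughly, what I have in mind for that alternative is the following sketch (the types of the added columns are only assumptions here):
CREATE TABLE INPUT_OUTPUT LIKE T1;
-- add the columns coming from T2 (types are assumed)
ALTER TABLE INPUT_OUTPUT
  ADD COLUMN T2_C1 INT,
  ADD COLUMN T2_C2 INT;
-- then populate it from the join
INSERT INTO INPUT_OUTPUT
SELECT T1.*, T2.T2_C1, T2.T2_C2
FROM T1
INNER JOIN T2 ON T1.T1_PID = T2.T2_PID;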

Does T1_PID have an index? If so, this should run quickly. Run an EXPLAIN of the SELECT part of your query and see what it says.
That said, I don't understand why you need the subquery. What is wrong with:
CREATE TABLE INPUT_OUTPUT
SELECT T1_C1,.....,T1_C300, T1_PID, T2_C1, T2_C2, T2_PID
FROM T1 INNER JOIN T2 ON T1.T1_PID=T2.T2_PID
Using the latter should work well, provided either T1 or T2 has an index on its PID column.
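As a minimal sketch, assuming no useful index exists yet (the index name below is just a placeholder):
EXPLAIN
SELECT T1_C1, T1_PID, T2_C1, T2_C2
FROM T1 INNER JOIN T2 ON T1.T1_PID = T2.T2_PID;
-- if neither table shows an index lookup in the plan, add one, e.g.
ALTER TABLE T2 ADD INDEX idx_t2_pid (T2_PID);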

Related

MySQL Optimiser - cost planner doesn't know when DuplicateWeedout Strategy creates disk table

This is my sample query
SELECT table1.id
FROM table1
WHERE table1.id IN (SELECT table2.id
                    FROM table2
                    WHERE table2.id IN (SELECT table3.id
                                        FROM table3))
ORDER BY table1.id
LIMIT 100
Checking the optimiser trace for the above query shows the following costs:
DUPLICATE-WEEDOUT strategy - Cost: 1.08e7
FIRST MATCH strategy - Cost: 1.85e7
Since the DUPLICATE-WEEDOUT cost is lower, MySQL chose the DUPLICATE-WEEDOUT strategy for the above query.
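(For reference, a trace like this can be captured with the standard optimizer-trace switches, roughly as follows:)
SET optimizer_trace = 'enabled=on';
SET optimizer_trace_max_mem_size = 1000000;   -- so the trace is not truncated
-- run the query above, then read its trace:
SELECT TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';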
So far everything looks fine in the join_optimization part. But then, checking the join_execution part:
DUPLICATE-WEEDOUT usually creates a temporary table. Here, though, because the heap size was not enough for the temporary table, it went on to create an on-disk temporary table (converting_tmp_table_to_ondisk).
Because of the disk temporary table, my query execution became slower.
So what happened here?
The optimiser doesn't account for the cost of the disk table in the join_optimization part itself. If the disk-table cost had been calculated there, it would have been higher than FIRST MATCH.
Then the final_semijoin_strategy would have been FIRST-MATCH, and my query would have been faster.
Is there any way to make MySQL calculate the cost of the disk table in the join_optimization part itself, or any other workaround for this particular issue?
MySQL 5.7, InnoDB
Note: this is a very dynamic query, where multiple conditions are added based on the request. I have already optimised the query in every way I could and am stuck on this disk-table cost issue, so please avoid suggestions that restructure the query or force the FIRST MATCH strategy. As for increasing the heap size, I'm not sure much about it; on various forums many have said it might cause issues for other queries.
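(For context, the heap size mentioned above is governed by two session variables; the sizes below are only illustrative, and I have not verified the side effects of raising them:)
SET SESSION tmp_table_size      = 256 * 1024 * 1024;  -- in-memory temp table limit
SET SESSION max_heap_table_size = 256 * 1024 * 1024;  -- MEMORY engine table limit
-- an in-memory temp table is converted to disk once it exceeds the smaller of the two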
IN( SELECT ... ) has been notoriously inefficient. Try to avoid it.
The query, as presented, is probably equivalent to
SELECT t1.id
FROM t1
JOIN t2 USING(id)
JOIN t3 USING(id)
ORDER BY id
LIMIT 100
which will optimize nicely.
This formulation should not need to build any temp table, much less a disk-based one.
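One way to check that, assuming the rewrite is applicable to your real query, is to compare the plans:
EXPLAIN FORMAT=JSON
SELECT t1.id
FROM t1
JOIN t2 USING(id)
JOIN t3 USING(id)
ORDER BY id
LIMIT 100;
-- look for "using_temporary_table" or "duplicates_removal" in the JSON output;
-- neither should appear for this formulation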

Does adding a join condition on two different tables (excluding the table being joined) slow down the query and performance

I have 3 tables in MySQL (table1, table2 and table3) and the data in all three tables is large (>100k rows).
My join query is:
select * from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
This query returns results very slowly, and I believe the issue is the second join condition: if I remove that condition, the query returns results instantly.
Can anyone please explain why the query is slow?
Thanks in advance.
Do you have these indexes?
table2: (col1)
table3: (col2, col3) -- in either order
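For example (the index names are just placeholders):
ALTER TABLE table2 ADD INDEX idx_t2_col1 (col1);
ALTER TABLE table3 ADD INDEX idx_t3_col2_col3 (col2, col3);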
Another tip: Don't use * (as in SELECT *) unless you really need all the columns. It prevents certain optimizations. If you want to discuss this further, please provide the real query and SHOW CREATE TABLE for each table.
If any of the columns used for joining are not the same datatype, character set, and collation, then indexes may not be useful.
Please provide EXPLAIN SELECT ...; it will give some clues we can discuss.
How many rows in the resultset? Sounds like over 100K? If so, then perhaps the network transfer time is the real slowdown?
Since the second join condition references both of the other tables, it adds more checks during evaluation; the join graph becomes a triangle rather than a simple chain.
Also, since all three tables have ~100K rows, even with clustered indexes on the given columns there is bound to be a performance hit, not least because all columns are being retrieved.
At the very least, list the columns explicitly in the SELECT, as in T1.col1, T1.col2, ..., T2.col1, ... and so on.
Also make sure each column used in a join condition is indexed.
More to the point, do you really want a huge join without a WHERE clause? Try adding restrictive conditions for each table (see the sketch below) and watch the difference: the available set of rows from each table is filtered first (100k may become 10k) and only then is the join attempted.
Also check the execution plan to see whether a full table scan is being used (most probably yes); if so, getting it to use an index scan should improve the situation.
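A rough sketch of that last suggestion, where the column list and WHERE conditions are only placeholders:
SELECT t1.col1, t1.col3, t2.col1, t2.col2, t3.col2, t3.col3
FROM table1 t1
JOIN table2 t2 ON t1.col1 = t2.col1
JOIN table3 t3 ON t3.col2 = t2.col2 AND t3.col3 = t1.col3
WHERE t1.col1 = 'some_value'   -- placeholder filter on table1
  AND t2.col2 > 100;           -- placeholder filter on table2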

Does a multiple-table join slow down MySQL?

My simple question is: does joining multiple tables slow down MySQL performance?
I have a data set where I need to JOIN about 6 tables, on properly indexed columns.
I have read threads like
Join slows down sql
MySQL adding join slows down whole query
MySQL multiple table join query performance issue
But the question still stands.
Can someone who has experienced this reply?
MySQL, by default, uses the Block Nested-Loop join algorithm for joins.
SELECT t1.*, t2.col1
FROM table1 t1
LEFT JOIN table2 t2
ON t2.id = t1.id
In effect, this yields the same performance as a correlated subquery like the following:
SELECT t1.*, (SELECT col1 FROM table2 t2 WHERE t2.id = t1.id)
FROM table1 t1
Indexes are obviously important to satisfy the WHERE clause in the subquery, and are used in the same fashion for join operations.
The performance of a join, assuming proper indexes, amounts to the number of lookups that MySQL must perform. The more lookups, the longer it takes.
Hence, the more rows involved, the slower the join. Joins with small result sets (few rows) are fast and considered normal usage. Keep your result sets small and use proper indexes, and you'll be fine. Don't avoid the join.
Of course, sorting results from multiple tables can be a bit more complicated for MySQL, and any time you join text or blob columns MySQL requires a temporary table, and there are numerous other details.
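As a quick sanity check on any multi-table join, EXPLAIN shows the per-table row estimates behind those lookups; a minimal sketch, with table3 added purely as an assumed third table:
EXPLAIN
SELECT t1.*, t2.col1
FROM table1 t1
LEFT JOIN table2 t2 ON t2.id = t1.id
LEFT JOIN table3 t3 ON t3.id = t1.id;
-- each table's "rows" value estimates the lookups per row coming from the previous table;
-- very roughly, the product across the tables approximates the total work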

Does this SQL query create Temporary Tables?

SELECT *
FROM (
    (SELECT table1.id1, table1.id1_type, table1.time
     FROM child AS table2
     STRAIGHT_JOIN parent AS table1
            ON table1.id1 = table2.id1
           AND table1.id2 = table2.id2
           AND table1.time = table2.time
     WHERE table1.id1 = 123456
       AND table1.time >= 0
       AND table1.time <= 1361936895)
    UNION ALL
    (SELECT table1.id1, table1.id1_type, table1.time
     FROM child AS table2
     STRAIGHT_JOIN step_parent AS table1
            ON table1.id1 = table2.id1
           AND table1.id2 = table2.id2
           AND table1.time = table2.time
     WHERE table1.id1 = 123456
       AND table1.time <= 1361936895)
) AS T
WHERE id1_type NOT IN (15)
ORDER BY time DESC
LIMIT 2000
I'm using the above SQL query (two joins, one UNION ALL) and I'm seeing heavily increased latency after adding the joins. I can see the storage space usage shoot up on my machines and I'm wondering if it's because I'm creating temporary tables?
As added context: when I added the joins I also added the 'table1' and 'table2' aliases so that I could avoid ambiguity when selecting columns, and that is when I started seeing these space usage increases.
Any suggestions on why this addition, or the query as a whole, is causing a huge storage spike would be appreciated :)
It's up to the database engine to decide what it thinks is the best strategy to fulfill your query. Spooling to temporary tables is definitely one of the options it has.
The table aliases really shouldn't have anything to do with it; the right column is the right column whatever label you're using for it.
Out of interest, did you try it with join instead of straight_join? You're limiting the query optimizer's options by specifying straight_join.
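One way to test the temporary-table suspicion directly is to compare the session counters before and after running the query:
SHOW SESSION STATUS LIKE 'Created_tmp%tables';
-- run the query here
SHOW SESSION STATUS LIKE 'Created_tmp%tables';
-- an increase in Created_tmp_disk_tables means an on-disk temporary table was built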

JOIN or INNER SELECT with IN, which is faster?

I was wondering which is faster: an INNER JOIN, or an inner SELECT with IN?
select t1.* from test1 t1
inner join test2 t2 on t1.id = t2.id
where t2.id = 'blah'
OR
select t1.* from test1 t1
where t1.id IN (select t2.id from test2 t2 where t2.id = 'blah')
Assuming id is a key, these queries mean the same thing, and a decent DBMS would execute them in exactly the same way. Unfortunately MySQL doesn't, as can be seen by expanding the "View Execution Plan" link in this SQL Fiddle. Which one will be faster probably depends on the size of the tables: if TABLE1 has very few rows, then IN has a chance of being faster, while JOIN will likely be faster in all other cases.
This is a peculiarity of MySQL's query optimizer. I've never seen Oracle, PostgreSQL or MS SQL Server execute such simple equivalent queries differently.
If you have to guess, INNER JOIN is likely to be more efficient than an IN (SELECT ...), but that can vary from one query to another.
The EXPLAIN keyword is one of your best friends. Type EXPLAIN in front of your complete SELECT query and MySQL will give you some basic information about how it will execute the query. It'll tell you where it's using file sorts, where it's using indices you've created (and where it's ignoring them), and how many rows it will probably have to examine to fulfill the request.
If all else is equal, use the INNER JOIN, mostly because it's more predictable and thus easier for a new developer coming in to understand. But of course, if you see a real advantage to the IN (SELECT ...) form, use it!
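For example, on the two queries from the question:
EXPLAIN SELECT t1.* FROM test1 t1
INNER JOIN test2 t2 ON t1.id = t2.id
WHERE t2.id = 'blah';

EXPLAIN SELECT t1.* FROM test1 t1
WHERE t1.id IN (SELECT t2.id FROM test2 t2 WHERE t2.id = 'blah');
-- compare the type, key and rows columns of the two plans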
Though you'd have to check the execution plan on whatever RDBMS you're inquiring about, I would guess the inner join would be faster or at least the same. Perhaps someone will correct me if I'm wrong.
The nested select will most likely run the entire inner query anyway, and build a hash table of possible values from test2. If that query returns a million rows, you've incurred the cost of loading that data into memory no matter what.
With the inner join, if test1 only has 2 rows, it will probably just do 2 index scans on test2 for the id values of each of those rows, and not have to load a million rows into memory.
It's also possible that a more modern database system can optimize the first scenario since it has statistics on each table, however at the very best case, the inner join would be the same.
In most cases a JOIN is much faster than a subquery, but a subquery is often more readable than a JOIN.
The RDBMS creates an execution plan for the JOIN, so it can predict what data should be loaded and processed, which saves time. For the subquery, on the other hand, it runs all the queries and loads all of their data before doing the processing.