SELECT *
FROM (
(SELECT table1.id1 AS id1, table1.id1_type
FROM child AS table2 STRAIGHT_JOIN parent AS table1 ON table1.id1=table2.id1
AND table1.id2=table2.id2
AND table1.time=table2.time
WHERE table1.id1=123456
AND ((table1.time>=0
AND table1.time<=1361936895)))
UNION ALL
(SELECT table1.id1 AS id1, table1.id1_type
FROM child AS table2 STRAIGHT_JOIN step_parent AS table1 ON table1.id1=table2.id1
AND table1.id2=table2.id2
AND table1.time=table2.time
WHERE table1.id1=123456
AND table1.time<=1361936895)) AS T
WHERE id1_type NOT IN (15)
ORDER BY time DESC LIMIT 2000
I'm using the SQL query above (two joins, one UNION ALL) and I'm seeing heavily increased latency after adding the joins inside it. I can see storage space usage shoot up on my machines, and I'm wondering whether that's because I'm creating temporary tables.
As added context, when I added the joins I also added the 'table1' and 'table2' aliases so that I could avoid ambiguity when choosing columns. That is when I started seeing these space usage increases.
Any suggestions on why this addition, or the query as a whole, is causing a huge storage spike on these queries would be appreciated :)
It's up to the database engine to decide what it thinks is the best strategy to fulfill your query. Spooling to temporary tables is definitely one of the options it has.
The table aliases really shouldn't have anything to do with it; the right column is the right column whatever label you're using for it.
Out of interest, did you try it with join instead of straight_join? You're limiting the query optimizer's options by specifying straight_join.
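One way to check both points, offered as a sketch rather than a diagnosis: run EXPLAIN on one branch of the query with a plain JOIN and again with STRAIGHT_JOIN, then compare the Extra column. "Using temporary" there means MySQL builds an internal temporary table for that step. The table definitions below are hypothetical stand-ins for the real child and parent tables, and the column types are assumptions.

```sql
-- Hypothetical minimal schemas; the real tables will have more columns.
CREATE TABLE IF NOT EXISTS parent (
  id1 INT NOT NULL, id2 INT NOT NULL, time INT NOT NULL, id1_type INT NOT NULL,
  PRIMARY KEY (id1, id2, time)
);
CREATE TABLE IF NOT EXISTS child (
  id1 INT NOT NULL, id2 INT NOT NULL, time INT NOT NULL,
  PRIMARY KEY (id1, id2, time)
);

-- Plain JOIN: the optimizer is free to choose the join order.
EXPLAIN
SELECT table1.id1, table1.id1_type, table1.time
FROM child AS table2
JOIN parent AS table1
  ON table1.id1 = table2.id1
 AND table1.id2 = table2.id2
 AND table1.time = table2.time
WHERE table1.id1 = 123456
  AND table1.time BETWEEN 0 AND 1361936895
ORDER BY table1.time DESC
LIMIT 2000;

-- STRAIGHT_JOIN: the join order is forced (child is read first, then parent).
EXPLAIN
SELECT table1.id1, table1.id1_type, table1.time
FROM child AS table2
STRAIGHT_JOIN parent AS table1
  ON table1.id1 = table2.id1
 AND table1.id2 = table2.id2
 AND table1.time = table2.time
WHERE table1.id1 = 123456
  AND table1.time BETWEEN 0 AND 1361936895
ORDER BY table1.time DESC
LIMIT 2000;
```

If the two plans differ in join order, row estimates, or the presence of "Using temporary" or "Using filesort", that points to the forced join order rather than the aliases.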
I have 3 tables in MySQL (table1, table2 and table3), and all three hold a large amount of data (>100k rows).
My join condition is :
select * from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
This query returns results very slowly. I believe the issue is the second join condition, because if I remove that condition the query returns results instantly.
Can anyone please explain why the query is slow?
Thanks in advance.
Do you have these indexes?
table2: (col1)
table3: (col2, col3) -- in either order
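As a sketch, those indexes could be created as follows. The index names are made up for illustration, the minimal table definitions are hypothetical (they are skipped if the real tables already exist), and the statements assume the columns are not already indexed.

```sql
-- Hypothetical minimal schemas so the example runs on its own.
CREATE TABLE IF NOT EXISTS table1 (col1 INT, col3 INT);
CREATE TABLE IF NOT EXISTS table2 (col1 INT, col2 INT);
CREATE TABLE IF NOT EXISTS table3 (col2 INT, col3 INT);

-- Index names (idx_...) are hypothetical.
CREATE INDEX idx_table2_col1 ON table2 (col1);

-- Composite index covering both columns of the second join condition.
CREATE INDEX idx_table3_col2_col3 ON table3 (col2, col3);

-- Confirm which indexes now exist on each table.
SHOW INDEX FROM table2;
SHOW INDEX FROM table3;
```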
Another tip: Don't use * (as in SELECT *) unless you really need all the columns. It prevents certain optimizations. If you want to discuss this further, please provide the real query and SHOW CREATE TABLE for each table.
If any of the columns used for joining are not the same datatype, character set, and collation, then indexes may not be useful.
Please provide EXPLAIN SELECT ...; it will give some clues we can discuss.
How many rows in the resultset? Sounds like over 100K? If so, then perhaps the network transfer time is the real slowdown?
Since the second join condition spans both of the other tables, it creates more checks on evaluation. This creates a triangle rather than a long joined line.
Also, since all three tables have ~100K rows, even with a clustered index on the given columns it's bound to take a performance hit, partly because all columns are being retrieved.
At the very least, write the select list as T1.col1, T1.col2..., T2.col1... and so on.
Also have separate indexes on all columns used in the join conditions.
Moreover, do you really want a huge join without a WHERE clause? Try adding restrictive conditions for each table and see the magic: it first filters the available rows from each table (100k may become 10k) and only then attempts the join.
Also check the query plan (EXPLAIN in MySQL) to see whether a full table scan is being used (most probably yes); if so, getting the plan to use an index instead should improve the situation.
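The suggestions above can be sketched together as one statement. The selected columns and the filter range are made up for illustration, and the table definitions are hypothetical stand-ins that are skipped if the real tables exist.

```sql
-- Hypothetical minimal schemas so the example runs on its own.
CREATE TABLE IF NOT EXISTS table1 (col1 INT, col3 INT);
CREATE TABLE IF NOT EXISTS table2 (col1 INT, col2 INT);
CREATE TABLE IF NOT EXISTS table3 (col2 INT, col3 INT);

EXPLAIN
SELECT t1.col1, t2.col2, t3.col3        -- only the columns needed, not *
FROM table1 t1
JOIN table2 t2 ON t1.col1 = t2.col1
JOIN table3 t3 ON t3.col2 = t2.col2 AND t3.col3 = t1.col3
WHERE t1.col1 BETWEEN 1 AND 1000;       -- made-up restrictive filter
```

In the EXPLAIN output, a row with type = ALL indicates a full scan of that table, and the key column shows which index, if any, was chosen.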
I want to union two queries. Both queries use an inner join to a data set that is very expensive to compute, but the dataset query is the same for both queries. For example:
SELECT veggie_id
FROM potatoes
INNER JOIN ( [...] ) massive_market
ON massive_market.potato_id=potatoes.potato_id
UNION
SELECT veggie_id
FROM carrots
INNER JOIN ( [...] ) massive_market
ON massive_market.carrot_id=carrots.carrot_id
Where [...] corresponds to a subquery that takes a second to compute, and returns rows of at least carrot_id and potato_id.
I want to avoid having the query for massive_market [...] twice in my overall query.
What's the best way to do this?
If that subquery takes more than a second to run, I'd say it's down to an indexing issue as opposed to the query itself (of course, without seeing that query, that is somewhat conjecture; I'd recommend posting that query too). In my experience, 9 out of 10 slow-query issues are down to improper indexing of the database.
Ensure veggie_id, potato_id and carrot_id are indexed.
Also, if you're using any joins in the massive_market subquery, ensure the columns you're performing the joins on are indexed too.
Edit
If indexing has been done properly, the only other solution I can think of off the top of my head is:
CREATE TEMPORARY TABLE tmp_veggies (potato_id [datatype], carrot_id [datatype]);
INSERT IGNORE INTO tmp_veggies (potato_id, carrot_id) select potatoes.veggie_id, carrots.veggie_id from [...] massive_market
RIGHT OUTER JOIN potatoes on massive_market.potato_id = potatoes.potato_id
RIGHT OUTER JOIN carrots on massive_market.carrot_id = carrots.carrot_id;
SELECT carrot_id FROM tmp_veggies
UNION
SELECT potato_id FROM tmp_veggies;
This way, you've reversed the query so it's only running the massive subquery once and the UNION is happening on the temporary table (which will be dropped automatically, but not until the connection is closed, so you may want to drop the table manually). You can add any additional columns you need to the CREATE TEMPORARY TABLE and SELECT statements.
The goal is to factor the repeated subquery out of the queries that need it. So I kept potatoes and carrots within one UNION subquery, and placed massive_market after and outside that union.
This seems obvious, but my question originated from a much more complex query, and the work needed to pull this strategy off was a bit more involved in my case. For the simple example in my question above, this would result in something like:
SELECT veggie_id
FROM (
SELECT veggie_id, potato_id, NULL AS carrot_id FROM potatoes
UNION
SELECT veggie_id, NULL AS potato_id, carrot_id FROM carrots
) unionized
INNER JOIN ( [...] ) massive_market
ON massive_market.potato_id=unionized.potato_id
OR massive_market.carrot_id=unionized.carrot_id
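Another way to write the subquery only once, assuming MySQL 8.0 or later (which supports common table expressions), is to name it in a WITH clause and reference it from both branches of the UNION. This is a sketch: the market table and the body of the CTE are placeholders for whatever [...] actually does, and the table definitions are hypothetical.

```sql
-- Hypothetical minimal schemas; market stands in for whatever [...] reads.
CREATE TABLE IF NOT EXISTS potatoes (potato_id INT, veggie_id INT);
CREATE TABLE IF NOT EXISTS carrots  (carrot_id INT, veggie_id INT);
CREATE TABLE IF NOT EXISTS market   (potato_id INT, carrot_id INT);

WITH massive_market AS (
  -- Placeholder for the expensive subquery; it is written only once.
  SELECT potato_id, carrot_id FROM market
)
SELECT p.veggie_id
FROM potatoes p
INNER JOIN massive_market m ON m.potato_id = p.potato_id
UNION
SELECT c.veggie_id
FROM carrots c
INNER JOIN massive_market m ON m.carrot_id = c.carrot_id;
```

This removes the textual duplication. Whether the server also evaluates the subquery only once is up to the optimizer, so it is worth checking the plan with EXPLAIN.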
I have a statement as below
CREATE TABLE INPUT_OUTPUT
SELECT T1_C1,.....,T1_C300, T1_PID from T1
INNER JOIN (SELECT T2_C1,T2_C2,T2_PID FROM T2) as RESPONSE ON T1.T1_PID=RESPONSE.T2_PID
which is running extremely slow - for 5 hours now. The two tables have about 4 million rows and a few hundred columns.
I have an 8-core, 64 GB RAM Ubuntu Linux machine, and using top I can see that the mysql process is using less than 3 GB and just one core, although admittedly that core's usage is consistently at 100%. It's upsetting that not all cores are being used.
I want to create the table much faster than this.
Should I use
CREATE TABLE INPUT_OUTPUT LIKE T1
and then alter INPUT_OUTPUT to add the relevant columns from T2 before populating it? I'm not sure of the syntax, or whether it would lead to a speedup.
Does T1_PID have an index? If so, this should run quickly. Run an EXPLAIN of the SELECT part of your query and see what it says.
That said, I don't understand why you need the subquery. What is wrong with:
CREATE TABLE INPUT_OUTPUT
SELECT T1_C1,.....,T1_C300, T1_PID, T2_C1, T2_C2, T2_PID
FROM T1 INNER JOIN T2 ON T1.T1_PID=T2.T2_PID
Using the latter should work if either T1 or T2 has a PID index.
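A sketch of that sequence, using hypothetical cut-down versions of T1 and T2 (the real tables have hundreds of columns) and a made-up index name. It assumes T2_PID is not already indexed.

```sql
-- Hypothetical cut-down schemas; skipped if the real tables already exist.
CREATE TABLE IF NOT EXISTS T1 (T1_PID INT, T1_C1 INT);
CREATE TABLE IF NOT EXISTS T2 (T2_PID INT, T2_C1 INT, T2_C2 INT);

-- Index the join column on the table that will be probed.
CREATE INDEX idx_t2_pid ON T2 (T2_PID);

-- Check the plan before starting the long-running statement.
EXPLAIN
SELECT T1.T1_C1, T1.T1_PID, T2.T2_C1, T2.T2_C2
FROM T1 INNER JOIN T2 ON T1.T1_PID = T2.T2_PID;

-- Then build the new table directly from the join, without the subquery.
CREATE TABLE INPUT_OUTPUT
SELECT T1.T1_C1, T1.T1_PID, T2.T2_C1, T2.T2_C2
FROM T1 INNER JOIN T2 ON T1.T1_PID = T2.T2_PID;
```

If the EXPLAIN output shows the index in the key column for T2, the join should no longer need to scan T2 for every row of T1.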
I want to ask a question about database queries, specifically the case where the WHERE clause of a query depends on the result of another query. For example
select ? from ? where ? = (select ? from ?)
This is a simple example, so it is easy to write. But for more complex cases, I want to know which approach performs best. A join? Separate queries? Nested queries? Something else?
Thank you for answers.
Best Regards.
You should test it. These things depend a lot on the details of the query and of the indices it can use.
In my experience JOINs tend to be faster than nested queries in MySQL. In some cases MySQL isn't very smart and appears to run the subquery for every row produced by the outer query.
You can read more about these things in the official documentation:
Optimizing subqueries: http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Rewriting subqueries as joins: http://dev.mysql.com/doc/refman/5.6/en/rewriting-subqueries.html
This is case dependent. If the inner query returns very few rows, you should go for the subquery. The inner query is executed first, and its result set is then used by the outer query.
Meanwhile, a join without a selective, indexed condition can approach a Cartesian product, which is again a heavy operation.
As Mitch and Joni stated, it depends. But generally a join will offer the best performance. You're trying to avoid running the nested query for each row of the outer query. A good query optimizer may do this for you anyway, by interpreting what you're trying to do and essentially "fixing" your mistake. But with the vast majority of queries, you should be writing it as a join in the first place. That way you're being explicit about what you're trying to do and you're fully understanding yourself what is being done, and what the most efficient way to do the work is.
I EXPECT the joins to be quicker, mainly because you have an equivalence and an explicit JOIN. Still, use EXPLAIN to see the differences in how the SQL engine will interpret them.
I would not expect these to be so different; where you can get real, large performance gains from using joins instead of subqueries is with correlated subqueries.
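To make the distinction concrete, here is a sketch of a correlated subquery and a join that returns the same rows. The customers and orders tables are hypothetical and exist only for illustration.

```sql
-- Hypothetical tables for illustration.
CREATE TABLE IF NOT EXISTS customers (
  id INT PRIMARY KEY,
  name VARCHAR(50)
);
CREATE TABLE IF NOT EXISTS orders (
  id INT PRIMARY KEY,
  customer_id INT,
  total DECIMAL(10,2)
);

-- Correlated subquery: the inner SELECT refers to the outer row (c.id),
-- so conceptually it is evaluated once per customer.
SELECT c.name,
       (SELECT SUM(o.total)
        FROM orders o
        WHERE o.customer_id = c.id) AS total_spent
FROM customers c;

-- Equivalent join with aggregation: orders is joined and grouped in one pass.
-- LEFT JOIN keeps customers without orders (total_spent is NULL for them),
-- matching the subquery version.
SELECT c.name, SUM(o.total) AS total_spent
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;
```

Which form is faster still depends on the optimizer and the available indexes, so comparing the two with EXPLAIN remains the reliable check.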
Since almost everyone is saying that joins will give the optimal performance, I just logged in to share the exact opposite experience.
Some days back I was writing a query over 3-4 tables that held a huge amount of data. I wrote a big SQL query with joins, and it was taking around 2-3 hours to execute. Then I restructured it: I created a nested select query, put as many WHERE constraints as I could inside the nested one, and made it as strict as possible. The performance improved by >90%; it now takes less than 4 minutes to run.
This is just my experience, and maybe joins are theoretically better. I just wanted to share it. It's better to try out different things; getting additional knowledge about the tables, their indexes, etc. would help a lot.
Update:
And I just found out what I did is actually suggested in this optimization reference page of MySQL. http://dev.mysql.com/doc/refman/5.6/en/optimizing-subqueries.html
Pasting it here for quick reference:
Replace a join with a subquery. For example, try this:
SELECT DISTINCT column1 FROM t1 WHERE t1.column1 IN (SELECT column1 FROM t2);
Instead of this:
SELECT DISTINCT t1.column1 FROM t1, t2 WHERE t1.column1 = t2.column1;
Move clauses from outside to inside the subquery. For example, use this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1 UNION ALL SELECT s1 FROM t2);
Instead of this query:
SELECT * FROM t1 WHERE s1 IN (SELECT s1 FROM t1) OR s1 IN (SELECT s1 FROM t2);
For another example, use this query:
SELECT (SELECT column1 + 5 FROM t1) FROM t2;
Instead of this query:
SELECT (SELECT column1 FROM t1) + 5 FROM t2;
I have the following question. I have two tables and I want to store them separately. However, in my application I need to regularly perform UNION operation on them, so I effectively need to treat them as one. I think this is not good in terms of performance. I was thinking about creating the following view:
CREATE VIEW merged_tables(fields from both tables) AS
SELECT * FROM table_a
UNION
SELECT * FROM table_b
How does this affect the performance of a SELECT query? Does it have any real impact (is the inner representation different), or is it just a matter of using a simpler query to select?
Using a UNION inside a view will be no different in performance than using the UNION in a discrete SELECT statement. Since you are using SELECT * on both tables and require no inner complexity (e.g. no JOINS or WHERE clauses), there won't really be any way to further optimize it.
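For reference, a sketch of what such a view could look like with an explicit column list. The table definitions and the source column are hypothetical. It also uses UNION ALL rather than UNION, which is a swap worth making only if duplicate rows across the two tables are not a concern: UNION has to remove duplicates, while UNION ALL simply concatenates the two result sets.

```sql
-- Hypothetical schemas; both tables share the same structure.
CREATE TABLE IF NOT EXISTS table_a (id INT, name VARCHAR(50));
CREATE TABLE IF NOT EXISTS table_b (id INT, name VARCHAR(50));

-- The source column is an addition for illustration: it records which
-- table each row came from.
CREATE OR REPLACE VIEW merged_tables (id, name, source) AS
  SELECT id, name, 'a' FROM table_a
  UNION ALL
  SELECT id, name, 'b' FROM table_b;

-- The view is then queried like a single table.
SELECT id, name FROM merged_tables WHERE source = 'a';
```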
However, if the tables do hold similar data and you want to keep them logically separated, you might consider storing it all in one table with a boolean column that indicates whether a row would otherwise have been a resident of table_a or table_b. You gain a way to tell the rows apart and avoid the added confusion of the UNION then, and performance isn't significantly impacted either.
It is just a matter of using a simpler query to select. There will be no speed difference; however, the cost of querying the union should not be much worse (in most cases) than if you kept all the data in a single table.
If the tables are really the same structure, you could also consider another design in which you used a single table for storage and two views to logically separate the records:
CREATE VIEW table_a AS SELECT * FROM table_all WHERE rec_type = 'A';
CREATE VIEW table_b AS SELECT * FROM table_all WHERE rec_type = 'B';
Because these are single-table, non-aggregated VIEWs you can use them like tables in INSERT, UPDATE, DELETE, and SELECT but you have the advantage of also being able to program against table_all when it makes sense. The advantage here over your solution is that you can update against either the table_a, table_b entities or the table_all entity, and you don't have two physical tables to maintain.
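A minimal sketch of that design in use. The columns of table_all are hypothetical, and WITH CHECK OPTION is an addition not mentioned above: it makes the server reject rows written through a view that would not be visible in that view.

```sql
-- Hypothetical schema: one physical table, rec_type marks the logical owner.
CREATE TABLE IF NOT EXISTS table_all (
  id INT PRIMARY KEY,
  rec_type CHAR(1) NOT NULL,
  payload VARCHAR(100)
);

CREATE OR REPLACE VIEW table_a AS
  SELECT * FROM table_all WHERE rec_type = 'A'
  WITH CHECK OPTION;

CREATE OR REPLACE VIEW table_b AS
  SELECT * FROM table_all WHERE rec_type = 'B'
  WITH CHECK OPTION;

-- Write through the views as if they were separate tables.
INSERT INTO table_a (id, rec_type, payload) VALUES (1, 'A', 'first');
INSERT INTO table_b (id, rec_type, payload) VALUES (2, 'B', 'second');

-- Read one logical table, or everything at once with no UNION.
SELECT * FROM table_a;
SELECT * FROM table_all;
```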