Querying to find if some columns are in array - mysql

I have a complex nested query inside a join. Is it possible to match several columns against that query without repeating it in the join? For example:
select * from
A left join B on a.xid=b.xid and
(a.userid or b.userid) in (select userid from A where..)
^^^ don't want to duplicate the nested-query...
There is a nested query that should match several columns from the parent query (as seen in the example above). The simple way is to duplicate the nested query several times, i.e.:
select * from A
left join B
on a.xid=b.xid
and a.userid in (select userid from ...)
and b.userid in (Select userid from ....)
BUT - since the subquery is a bit complicated, I don't want MySQL to run it twice, but rather only once and then match it against several of the parent query's columns.

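If your subquery is working properly, running it twice may cost less than you expect, but don't count on the query cache to help: it caches the results of whole statements, not of subqueries inside them, and it was removed in MySQL 8.0. If it's a question of the query being overly complex, then maybe you could use a stored procedure for it: put the results of the subquery into a temp table and join to it.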
There are lots of ways to approach this.
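To make the temp-table suggestion concrete, here is a rough sketch. It runs on Python's built-in SQLite rather than MySQL so that it is self-contained; the `active` column, the sample rows and the `wanted` table name are made up for illustration, and the `WHERE active = 1` filter merely stands in for the expensive nested query. One caveat to check on your own server: MySQL documents that a TEMPORARY table cannot be referenced more than once in the same statement, so there you may need an ordinary scratch table or (from 8.0 on) a common table expression instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (xid INTEGER, userid INTEGER, active INTEGER);
    CREATE TABLE B (xid INTEGER, userid INTEGER);
    INSERT INTO A VALUES (1, 10, 1), (2, 20, 0), (3, 30, 1);
    INSERT INTO B VALUES (1, 30), (2, 10), (3, 20);

    -- Stand-in for the expensive nested query: run it once, keep the result.
    CREATE TEMP TABLE wanted AS
        SELECT userid FROM A WHERE active = 1;
""")

# Both columns are checked against the stored result instead of
# re-running the nested query for each of them.
rows = conn.execute("""
    SELECT A.xid, A.userid, B.userid
    FROM A
    LEFT JOIN B
           ON A.xid = B.xid
          AND A.userid IN (SELECT userid FROM wanted)
          AND B.userid IN (SELECT userid FROM wanted)
    ORDER BY A.xid
""").fetchall()

for row in rows:
    print(row)
```

Only the first row finds a match in B here, because it is the only one where both user ids appear in the stored result.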

Related

How to avoid running an expensive sub-query twice in a union

I want to union two queries. Both queries use an inner join into a data set, that is very intensive to compute, but the dataset query is the same for both queries. For example:
SELECT veggie_id
FROM potatoes
INNER JOIN ( [...] ) massive_market
ON massive_market.potato_id=potatoes.potato_id
UNION
SELECT veggie_id
FROM carrots
INNER JOIN ( [...] ) massive_market
ON massive_market.carrot_id=carrots.carrot_id
Where [...] corresponds to a subquery that takes a second to compute, and returns rows of at least carrot_id and potato_id.
I want to avoid having the query for massive_market [...] twice in my overall query.
What's the best way to do this?
If that subquery takes more than a second to run, I'd say it's down to an indexing issue as opposed to the query itself (of course, without seeing that query, that is somewhat conjecture; I'd recommend posting that query too). In my experience, 9 out of 10 slow-query issues are down to improper indexing of the database.
Ensure veggie_id, potato_id and carrot_id are indexed.
Also, if you're using any joins in the massive_market subquery, ensure the columns you're performing the joins on are indexed too.
Edit
If indexing has been done properly, the only other solution I can think of off the top of my head is:
CREATE TEMPORARY TABLE tmp_veggies (potato_id [datatype], carrot_id [datatype]);
INSERT IGNORE INTO tmp_veggies (potato_id, carrot_id) select potatoes.veggie_id, carrots.veggie_id from [...] massive_market
RIGHT OUTER JOIN potatoes on massive_market.potato_id = potatoes.potato_id
RIGHT OUTER JOIN carrots on massive_market.carrot_id = carrots.carrot_id;
SELECT carrot_id FROM tmp_veggies
UNION
SELECT potato_id FROM tmp_veggies;
This way, you've reversed the query so it's only running the massive subquery once and the UNION is happening on the temporary table (which will be dropped automatically, but not until the connection is closed, so you may want to drop the table manually). You can add any additional columns you need to the CREATE TEMPORARY TABLE and SELECT statements.
The goal is to pull the repeated subquery out of the queries that each need it. So I kept potatoes and carrots within one unionizing subquery, and placed massive_market after and outside this union.
This seems obvious, but my question originated from a much more complex query, and the work needed to pull this strategy off was a bit more involved in my case. For the simple example in my question above, this would result in something like:
SELECT veggie_id
FROM (
SELECT veggie_id, potato_id, NULL AS carrot_id FROM potatoes
UNION
SELECT veggie_id, NULL AS potato_id, carrot_id FROM carrots
) unionized
INNER JOIN ( [...] ) massive_market
ON massive_market.potato_id=unionized.potato_id
OR massive_market.carrot_id=unionized.carrot_id
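As a quick sanity check of that rewrite, the sketch below runs both shapes of the query side by side. It uses Python's built-in SQLite instead of MySQL, and the `market` table with its sample rows is a made-up stand-in for the expensive [...] subquery. One thing it brings out: the outer UNION in the original removes duplicate veggie_id values, while the single-join version can return a veggie_id once per matching market row, so a SELECT DISTINCT may be needed if that matters to you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE potatoes (veggie_id INTEGER, potato_id INTEGER);
    CREATE TABLE carrots  (veggie_id INTEGER, carrot_id INTEGER);
    -- Hypothetical stand-in for the expensive [...] subquery.
    CREATE TABLE market (potato_id INTEGER, carrot_id INTEGER);
    INSERT INTO potatoes VALUES (1, 100), (2, 101);
    INSERT INTO carrots  VALUES (3, 200), (4, 201);
    INSERT INTO market   VALUES (100, 200), (100, 201);
""")

# Original shape: the market subquery appears twice.
original = conn.execute("""
    SELECT veggie_id FROM potatoes
    INNER JOIN (SELECT * FROM market) massive_market
            ON massive_market.potato_id = potatoes.potato_id
    UNION
    SELECT veggie_id FROM carrots
    INNER JOIN (SELECT * FROM market) massive_market
            ON massive_market.carrot_id = carrots.carrot_id
""").fetchall()

# Rewritten shape: union the cheap tables first, join the market once.
rewritten = conn.execute("""
    SELECT veggie_id
    FROM (
        SELECT veggie_id, potato_id, NULL AS carrot_id FROM potatoes
        UNION
        SELECT veggie_id, NULL AS potato_id, carrot_id FROM carrots
    ) unionized
    INNER JOIN (SELECT * FROM market) massive_market
            ON massive_market.potato_id = unionized.potato_id
            OR massive_market.carrot_id = unionized.carrot_id
""").fetchall()

print(sorted(original))
print(sorted(rewritten))
```

With this data both queries find the same set of veggie ids, but veggie 1 shows up twice in the rewritten result because two market rows match it.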

SQL Join Results

Select * from a join b on a.id=b.id and a.vol<5
Select * from a join b on a.id=b.id where a.vol<5
Do they produce the same results?
If they don't produce the same results, and a has 1000 rows and b has 100 rows, how many rows will each produce?
I would say yes, it does.
A "Join" implies an "Inner Join" so it doesn't matter if you have an "and" in the join or a "Where" after the join.
It would be different if it was an "Outer Join": specifying a "Where" condition on a column of the outer-joined table effectively turns the join into an "Inner Join", or simply "Join".
Hope that made sense
For an INNER JOIN, like the simple query you have here, they are the same.
For an OUTER JOIN, they might not be the same.
For example, take these two queries:
select * from orders o left join orderlines ol on ol.order_id = o.id where o.id=12345
and
select * from orders o left join orderlines ol on ol.order_id = o.id and o.id=12345
The first query will give you data on order #12345 and its lines, if any. The second query will give you data from all orders, but only order #12345 will have any item data.
This also illustrates how the two options have different semantic meanings. Even if they produce the same results, the two queries from your question have different semantic meanings, which might be important as an application grows over time.
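The difference between those two orders queries is easy to see on a couple of rows. This is only a sketch: it runs on Python's built-in SQLite rather than MySQL, and the `item` column and the sample data are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE orderlines (order_id INTEGER, item TEXT);
    INSERT INTO orders VALUES (12345), (12346);
    INSERT INTO orderlines VALUES (12345, 'widget'), (12346, 'gadget');
""")

# Filter in WHERE: rows for other orders are removed after the join.
filter_in_where = conn.execute("""
    SELECT o.id, ol.item
    FROM orders o
    LEFT JOIN orderlines ol ON ol.order_id = o.id
    WHERE o.id = 12345
    ORDER BY o.id
""").fetchall()

# Filter in ON: every order is kept, but only 12345 gets its lines.
filter_in_on = conn.execute("""
    SELECT o.id, ol.item
    FROM orders o
    LEFT JOIN orderlines ol ON ol.order_id = o.id AND o.id = 12345
    ORDER BY o.id
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```

The second result still lists order 12346, just with no line data attached to it.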
I think you are satisfied with the answers, but I want to mention another side of this usage.
These two methods generate the same result here, but the query engine may use different techniques to get it.
Of course, different techniques can generate different results. But when? It is hard to illustrate the situation, but I will try to explain.
Suppose we have two tables, and the first table has an IsDeleted column. The application does not delete rows; it just sets the IsDeleted column, and such records are then ignored.
In the first case, if you do not filter these records in the ON clause and filter them only in the WHERE criteria, they will take part in the other joins, and you can calculate the wrong result. Suppose you join this table to an Amounts table: the result is wrong because the deleted records were included in the join and only filtered afterwards in the WHERE criteria.
This difference can lead to very big mistakes, especially in queries that have many joins.
I hope I succeeded with the explanation. I'm not good at it. :)

MySQL creating temp table then join faster than left join

I have a LEFT JOIN that is very expensive:
    select X.c1, COUNT(Y.c3) from X LEFT JOIN Y on X.c1=Y.c2 group by X.c1;
After several minutes (20+), it still does not finish. But I want all rows in X. So I really do need a LEFT JOIN at some point.
It appears that I can hack my way around this to return the result set I am looking for by using a temp table in less than two minutes. I first trim down table Y so that it only contains rows in the join.
CREATE TEMPORARY TABLE IF NOT EXISTS table2 AS 
(select X.c1 as t, COUNT(Y.c2) as c from X
INNER JOIN Y where X.c1=Y.c2 group by X.c1);
select X.c1, table2.c from X 
LEFT JOIN table2 on X.c1 = table2.t; 
This finishes in under two minutes.
My questions are:
1) Are they equivalent?
2) Why is the second so much faster (why doesn't MySQL do this kind of optimization)? That is, do I need to do these kinds of optimizations by hand in MySQL?
EDIT: additional info: c1 and c2 are BIGINTs. c1 is unique, but there can be many c2 values that all point to the same c1. As far as I know, I have not indexed any tables. X.c1 is an _id column that Y.c2 refers to.
Try indexing X.c1 and Y.c2 and running your original query.
It's hard to tell why your 1st query runs slower without the indexes without comparing the query plans of both queries (you can get the query plan by putting EXPLAIN in front of each query), but I suspect it's because the 2nd table contains many rows that do not have a corresponding row in the 1st table.
If x.c1 is unique, then I would suggest writing the query as:
select X.c1,
(select COUNT(Y.c3)
from Y
where X.c1 = Y.c2
)
from X;
For this query, you want an index on Y(c2, c3).
The reason why a left join might take longer is if many rows do not match. In that case, the group by is aggregating by many more rows than it really needs to. And no, MySQL does not attempt this type of optimization.
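On the "are they equivalent?" part of the question, a small experiment suggests not quite. The sketch below uses Python's built-in SQLite rather than MySQL, with invented sample rows, and the join in the temp-table step is written with ON rather than WHERE. It compares the original LEFT JOIN, the temp-table variant and the correlated subquery: the temp-table variant yields NULL rather than 0 for rows of X with no match, and because it counts Y.c2 instead of Y.c3 it also differs wherever c3 is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE X (c1 INTEGER PRIMARY KEY);
    CREATE TABLE Y (c2 INTEGER, c3 INTEGER);
    CREATE INDEX idx_y_c2_c3 ON Y (c2, c3);  -- the index suggested above
    INSERT INTO X VALUES (1), (2), (3);
    INSERT INTO Y VALUES (1, 10), (1, 11), (2, NULL);
""")

# Original query from the question.
left_join = conn.execute("""
    SELECT X.c1, COUNT(Y.c3)
    FROM X LEFT JOIN Y ON X.c1 = Y.c2
    GROUP BY X.c1 ORDER BY X.c1
""").fetchall()

# Temp-table variant from the question.
conn.executescript("""
    CREATE TEMP TABLE table2 AS
        SELECT X.c1 AS t, COUNT(Y.c2) AS c
        FROM X INNER JOIN Y ON X.c1 = Y.c2
        GROUP BY X.c1;
""")
via_temp = conn.execute("""
    SELECT X.c1, table2.c
    FROM X LEFT JOIN table2 ON X.c1 = table2.t
    ORDER BY X.c1
""").fetchall()

# Correlated-subquery variant from the answer.
correlated = conn.execute("""
    SELECT X.c1, (SELECT COUNT(Y.c3) FROM Y WHERE X.c1 = Y.c2)
    FROM X ORDER BY X.c1
""").fetchall()

print(left_join)
print(via_temp)
print(correlated)
```

If the temp-table route is kept, counting the same column and wrapping the result in COALESCE(table2.c, 0) should bring it in line with the other two.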

MySQL using select with 2 queries, subquery or join?

Related to my last question (MySQLi performance, multiple (separate) queries vs subqueries) I came across another question.
Sometimes I'm using a subquery to select the value from another table (eg. the username connected to an ID), but I'm not sure about the select-in-select, because it doesn't seem to be very clean and I'm not sure about the performance.
The subquery could look like this:
SELECT
(SELECT `user_name` FROM `users`
WHERE `user_id` = table2.user_id) AS `user_name`
, `value1`
, `value2`
FROM
`table2`
....
Would it be "better" to use a separate query for the result from table1 and another for table2 (doubles the connections, but no need to cross tables), or should I even use a JOIN to get the results in a single query?
I don't have much experience with JOINs and subqueries yet, so I'm not sure if a JOIN would be "too much" in this case, because I really just need one name connected to an ID (or maybe to count the number of rows from a table), or if it doesn't matter, because the select-in-select is treated like some kind of JOIN, too.
Solution with JOIN could look like this:
SELECT
users.user_name , table2.value1, table2.value2
FROM
`table2`
INNER JOIN
`users`
ON
users.user_id = table2.user_id
....
And if I should prefer JOIN, which one would be best in this case: left join, inner join or something else?
The very fact that you are asking whether to use inner join or left join indeed shows that you haven't done much work with them.
The purposes of these two are entirely different. inner join is used to return rows from two or more tables where the join columns have matching values. left join is used when you want the rows from the table on the left of the join clause to be returned even when there is no matching row in the other table. It depends on your application. If one table has names of players, and another table contains details of penalties paid by them, then you will most certainly want to use left join, to account for players without a penalty, and thus without a record in the 2nd table.
Regarding whether to use a subquery or a join, joins can be much faster when properly used. By properly I mean: there are indices on the join columns, the tables are specified in increasing order of the number of rows they contain (generally; there might be exceptions), the join columns have similar data types, etc. If all these conditions hold, a join would be the better option.
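To see how the three options behave when a row in table2 has no matching user, here is a small sketch. It runs on Python's built-in SQLite rather than MySQL, and the sample rows are invented. The select-in-select behaves like a LEFT JOIN in that it keeps the row and returns NULL for the name, while the INNER JOIN drops the row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE table2 (user_id INTEGER, value1 INTEGER, value2 INTEGER);
    INSERT INTO users VALUES (1, 'alice');
    -- user_id 2 has no row in users.
    INSERT INTO table2 VALUES (1, 10, 11), (2, 20, 21);
""")

subquery = conn.execute("""
    SELECT (SELECT user_name FROM users
            WHERE users.user_id = table2.user_id) AS user_name,
           value1
    FROM table2 ORDER BY value1
""").fetchall()

inner_join = conn.execute("""
    SELECT users.user_name, table2.value1
    FROM table2
    INNER JOIN users ON users.user_id = table2.user_id
    ORDER BY table2.value1
""").fetchall()

left_join = conn.execute("""
    SELECT users.user_name, table2.value1
    FROM table2
    LEFT JOIN users ON users.user_id = table2.user_id
    ORDER BY table2.value1
""").fetchall()

print(subquery)
print(inner_join)
print(left_join)
```

So the choice between INNER and LEFT comes down to whether rows without a user should still appear in the result.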

MySQL Join clause vs WHERE clause

What's the difference in a clause done the two following ways?
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2 AND
table2.member_id = 4
)
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2
)
WHERE table2.member_id = 4
I've compared them both with basic queries and EXPLAIN EXTENDED and don't see a difference. I'm wondering if someone here has discovered a difference in a more complex/processing-intensive environment.
With an INNER join the two approaches give identical results and should produce the same query plan.
However there is a semantic difference between a JOIN (which describes a relationship between two tables) and a WHERE clause (which removes rows from the result set). This semantic difference should tell you which one to use. While it makes no difference to the result or to the performance, choosing the right syntax will help other readers of your code understand it more quickly.
Note that there can be a difference if you use an outer join instead of an inner join. For example, if you change INNER to LEFT and the join condition fails you would still get a row if you used the first method but it would be filtered away if you used the second method (because NULL is not equal to 4).
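The four combinations (INNER or LEFT, condition in ON or in WHERE) can be checked in one go. This sketch uses Python's built-in SQLite rather than MySQL; the sample rows and the `rows` helper are invented for the demonstration. The last statement shows why the WHERE version loses the unmatched row: comparing NULL with 4 yields NULL, which WHERE does not treat as true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col2 INTEGER);
    CREATE TABLE table2 (col1 INTEGER, member_id INTEGER);
    INSERT INTO table1 VALUES (1), (2);
    INSERT INTO table2 VALUES (1, 4), (2, 5);
""")

def rows(join_kind, in_on):
    """Run the query with member_id = 4 placed in ON or in WHERE."""
    on_extra = " AND table2.member_id = 4" if in_on else ""
    where = "" if in_on else " WHERE table2.member_id = 4"
    sql = (
        "SELECT table1.col2, table2.member_id "
        f"FROM table1 {join_kind} JOIN table2 "
        f"ON table2.col1 = table1.col2{on_extra}{where} "
        "ORDER BY table1.col2"
    )
    return conn.execute(sql).fetchall()

inner_on = rows("INNER", True)
inner_where = rows("INNER", False)
left_on = rows("LEFT", True)
left_where = rows("LEFT", False)

# NULL = 4 evaluates to NULL (None in Python), not to false or true.
null_eq = conn.execute("SELECT NULL = 4").fetchone()[0]

print(inner_on, inner_where)
print(left_on, left_where)
print(null_eq)
```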
If you are trying to optimize and know your data, adding the "STRAIGHT_JOIN" keyword can tremendously improve performance. You have an inner join ON..., so, just to confirm, you want only records where table1 and table2 are joined, but only for table2 member_id = some value, in this case 4.
I would change the query to have table2 as the primary table of the select, as it has an explicit "member_id" condition that could be optimized by an index to limit rows, and then join to table1, like:
select STRAIGHT_JOIN
t1.*
from
table2 t2,
table1 t1
where
t2.member_id = 4
and t2.col1 = t1.col2
So the query would pre-qualify only the member_id = 4 records, then match between table1 and table2. So if table2 had 50,000 records and table1 had 400,000 records, table2, being listed first, would be processed first. Limiting to member_id = 4 reduces the rows further, and the join to table1 reduces them further still.
I know for a fact the straight_join works as I've implemented it many times dealing with gov't data of 14+ million records linking to over 15 lookup tables where the engine got confused trying to think for me on the critical table. One such query was taking 24+ hours before hanging... Adding the "STRAIGHT_JOIN" and prioritizing what the "primary" table was in the query dropped it to a final correct result set in under 2 hours.
There's not really much of a difference in the situation you describe. In a situation with multiple complex joins, my understanding is that the first is somewhat preferable, as it will reduce the complexity a little; that said, it's going to be a small difference. Overall, you shouldn't notice much of a difference in most if not all situations.
With an inner join, it makes almost* no difference; if you switch to outer join, all the difference in the world.
*I say "almost" because optimizers are quirky beasts and it isn't impossible that under some circumstances, it might do a better job optimizing the former or the latter. Do not attempt to take advantage of this behavior.