How does a SQL engine handle a join with a non-equality condition? - mysql

A SQL engine can use a hash join for a query like this:
select * from table1 t1 left join table2 t2 on t1.id = t2.id;
That's fine. But what if the query is like this:
select * from table1 t1 left join table2 t2 on t1.id > t2.id;
How is this handled?
A nested loop join would work, but is there a better way?

For distributed SQL, a straight-up non-equi join (t1.id > t2.id) is pretty expensive to execute. If one side is small, you broadcast it and then use a sorted index on every node. If both sides are large, you can range-partition one side and build a sorted index on it, and then replicate each row of the other side to every range that might match.
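The single-node part of this idea (a sorted index on one side, probed once per row of the other side) can be sketched in Python. This is only an illustration under assumed names (`left_join_greater_than` is hypothetical, and rows are reduced to bare ids); it is not how any particular engine implements it.

```python
from bisect import bisect_left

def left_join_greater_than(left_ids, right_ids):
    """Sketch of: left LEFT JOIN right ON left.id > right.id."""
    sorted_right = sorted(right_ids)      # stands in for the sorted index
    result = []
    for lid in left_ids:
        # Every right id strictly smaller than lid sits before this position.
        end = bisect_left(sorted_right, lid)
        if end == 0:
            result.append((lid, None))    # LEFT JOIN keeps unmatched left rows
        else:
            result.extend((lid, rid) for rid in sorted_right[:end])
    return result

rows = left_join_greater_than([1, 3, 5], [2, 4, 4])
print(rows)  # [(1, None), (3, 2), (5, 2), (5, 4), (5, 4)]
```

The binary search only avoids comparing pairs that cannot match. The output itself can still be very large, which is why this kind of join stays expensive.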
Normally, you have a combination of equality and inequality conditions, like t1.id = t2.id and t1.cost < t2.cost. In that case, you can do a normal distributed hash join, and then keep a sorted list of the secondary items to perform the inequality part. This is what Presto does.
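A minimal single-process sketch of that combination, assuming rows are plain (id, cost) tuples and an inner join: hash on the equality key, keep each bucket sorted on cost, and binary-search the bucket for the inequality. The function name and row layout are made up for illustration, and this is not Presto's actual code.

```python
from bisect import bisect_right
from collections import defaultdict

def hash_join_with_range(left_rows, right_rows):
    """Inner join on left.id = right.id AND left.cost < right.cost."""
    # Build phase: hash the right side on the equality key.
    buckets = defaultdict(list)
    for rid, rcost in right_rows:
        buckets[rid].append(rcost)
    for costs in buckets.values():
        costs.sort()                          # sorted list per key

    # Probe phase: hash lookup for "=", binary search for "<".
    out = []
    for lid, lcost in left_rows:
        costs = buckets.get(lid, [])
        start = bisect_right(costs, lcost)    # first right cost > left cost
        out.extend((lid, lcost, rcost) for rcost in costs[start:])
    return out

left = [(1, 10), (2, 5), (3, 7)]
right = [(1, 5), (1, 15), (1, 20), (2, 5), (2, 6)]
joined = hash_join_with_range(left, right)
print(joined)  # [(1, 10, 15), (1, 10, 20), (2, 5, 6)]
```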

Related

Which of the two SELECT statements is faster?

It seems that the second statement applies the WHERE condition before joining, while the first one joins before applying the WHERE condition, so the second one would be faster because it does less joining. But is that really the case? Is there a reference that says definitively that in the first statement the WHERE condition is executed only after all the joins have finished?
SELECT * FROM class t1
LEFT JOIN class_students t2 ON t1.id = t2.class_id
LEFT JOIN student t3 ON t2.student_id = t3.id
WHERE t1.id = 1;
or
SELECT * FROM (SELECT * FROM class WHERE id = 1) t1
LEFT JOIN class_students t2 ON t1.id = t2.class_id
LEFT JOIN student t3 ON t2.student_id = t3.id;
Your second option has a "derived table" (a subquery in FROM or JOIN). Subqueries usually take extra effort. So, usually it is better to avoid them.
In your particular example, the Optimizer will probably start with t1 because the WHERE clause mentions it. That is, the execution will filter based on t1.id = 1, just as you suggest the second version would do.
Note the hedged words ("usually", "probably"). There are exceptions to my statements; if you find a case where the second version runs faster, present it and I may be able to explain why. (A likely example is where the subquery has GROUP BY and/or LIMIT. This is different enough from WHERE to make a difference.)
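Whatever plan the optimizer picks, the two statements are meant to return the same rows. A small check of that, using Python's sqlite3 module as a stand-in for MySQL (the sample data is invented, and this says nothing about MySQL's timing or plans):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE class (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE class_students (class_id INTEGER, student_id INTEGER);
    INSERT INTO class VALUES (1, 'Math'), (2, 'Art');
    INSERT INTO student VALUES (10, 'Ann'), (11, 'Bob');
    INSERT INTO class_students VALUES (1, 10), (1, 11), (2, 10);
""")

where_version = """
    SELECT * FROM class t1
    LEFT JOIN class_students t2 ON t1.id = t2.class_id
    LEFT JOIN student t3 ON t2.student_id = t3.id
    WHERE t1.id = 1
"""
derived_version = """
    SELECT * FROM (SELECT * FROM class WHERE id = 1) t1
    LEFT JOIN class_students t2 ON t1.id = t2.class_id
    LEFT JOIN student t3 ON t2.student_id = t3.id
"""
# No ORDER BY in either query, so sort before comparing.
rows_where = sorted(conn.execute(where_version).fetchall(), key=repr)
rows_derived = sorted(conn.execute(derived_version).fetchall(), key=repr)
print(rows_where == rows_derived)
```

To compare the actual plans in MySQL, run EXPLAIN on both statements.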

SQL: Should I use INNER JOIN conditions as WHERE conditions?

Should I use INNER JOIN conditions as WHERE conditions?
Consider these two sample queries to explain the question:
SELECT t1.*, t2.*
FROM table1 AS t1
INNER JOIN table2 AS t2
ON t1.id = t2.foreign_key
WHERE t1.year < 2014
and this without the WHERE clause
SELECT t1.*, t2.*
FROM table1 AS t1
INNER JOIN table2 AS t2
ON t1.id = t2.foreign_key
AND t1.year < 2014
Since the JOIN type is INNER, both queries will return the same result set.
Which is better in terms of performance?
Generally, performance should be similar, since both queries should execute in the same way (if the query optimizer is good).
I usually use the WHERE clause, since keeping the join condition simple helps ensure that an index scan is used (if there is an appropriate index).
For example, if you change your query slightly (note the order of the conditions):
SELECT t1.*, t2.*
FROM table1 AS t1
INNER JOIN table2 AS t2
ON t1.year < 2014
AND t1.id = t2.foreign_key
Some optimizers might decide not to use the index on the t2.foreign_key column.
Check your query plans; they should be nearly identical.
Also, the database engine can rewrite a query into a better execution plan, so there should be no difference.

SQL "IN" combined with "=" in WHERE clause

I'm struggling with someone else's code. What might the WHERE clause do in the following (MySQL) statement?
SELECT * FROM t1, t2 WHERE t1.id = t2.id IN (1,2,3)
It's not providing the desired result in my case, but I'm trying to figure out what the original author intended.
Can anyone provide an example of the use of a WHERE clause like this?
This condition is evaluated from the right: t2.id IN (1,2,3) is evaluated first, giving a result of 0 or 1, and that result is then compared with t1.id. All rows of t2 whose id is in the IN list are joined to the rows in t1 whose id is 1; all other rows of t2 are joined to the rows in t1 whose id is 0. Here is a small demo on sqlfiddle.com: link.
It is hard to imagine that that was the intent of the author, however: I think a more likely check was for both items to be in the list, and also being equal to each other. The equality to each other is important, because it looks like the author wanted to join the two tables.
A more modern way of doing joins is with ANSI SQL syntax. Here is the equivalent of your query in ANSI SQL:
SELECT * FROM t1 JOIN t2 ON t1.id = t2.id IN (1,2,3)

Shorten a join query

I have a query with 3 joins:
SELECT t1.email, t2.firstname, t2.lastname, t4.value
FROM t1
left join t2 on t1.email = t2.email
Inner join t3 on t2.entity_id = t3.order_id
Inner join t4 on t3.product_id = t4.entity_id
WHERE t4.attribute_id = 126
I think my server just can't handle it :) The query times out and an error occurs!
Thanks a lot
Table structure:
T1:
email (the same as in t2)
T2:
email firstname lastname orderid (which is called entity id in t3)
T3:
entityid product id (which is called entity id in t4)
T4:
entityid attributeid value
Unless t2 links straight to t4 there is no way.
Also, do you need a left join between t1 and t2?
As @Sachin already stated, you can't "shorten" this query unless t2 links straight to t4 without requiring a comparison with t3. However, in order to speed up your query, you should have indexes on some or all of the columns referenced in your join conditions (i.e. t1.email, t2.email, t2.entity_id, etc).
Having an index on each of these columns will give you much faster SELECT queries, but it will slow down your INSERT and UPDATE queries. So if you SELECT more often than you INSERT or UPDATE, then you should definitely be using indexes. If not, try to make indexes in wise places (tables that have INSERT or UPDATE statements run less often but still have a lot of rows, for instance).
For further clarification, see the following links:
More information on how indexes work
Syntax for creating indexes
Try your query this way:
SELECT t1.email, t2.firstname, t2.lastname, t4.value
FROM t4
INNER JOIN t3 ON t3.product_id = t4.entity_id
INNER JOIN t2 ON t2.entity_id = t3.order_id
INNER JOIN t1 ON t1.email = t2.email
WHERE t4.attribute_id = 126
It's basically your query but "backwards". Your original way, your DBMS has to try to join t2 for ALL records in t1, then join t3 for ALL records found in t2 before it can even attempt to address your WHERE clause.
My way, you're finding all the records in t4 where attribute_id = 126 first, THEN attempting to join other tables. It should be a lot quicker. You should then be able to speed things up even more by making sure the proper indexes exist on the tables involved. You can prepend the keyword EXPLAIN to your query to see how the DBMS attempts to seek data in your query.
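The reordered query returns the same rows because the INNER JOINs on t2's columns discard any NULL-extended rows that the original LEFT JOIN would have produced. A quick check of that, again using Python's sqlite3 module as a stand-in for MySQL, with invented sample data and a hypothetical index name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (email TEXT);
    CREATE TABLE t2 (email TEXT, firstname TEXT, lastname TEXT, entity_id INTEGER);
    CREATE TABLE t3 (order_id INTEGER, product_id INTEGER);
    CREATE TABLE t4 (entity_id INTEGER, attribute_id INTEGER, value TEXT);
    -- Hypothetical index on the column used in the WHERE clause.
    CREATE INDEX idx_t4_attribute ON t4 (attribute_id);
    INSERT INTO t1 VALUES ('a@x.test'), ('b@x.test'), ('c@x.test');
    INSERT INTO t2 VALUES ('a@x.test', 'Ann', 'Lee', 1), ('b@x.test', 'Bob', 'Ray', 2);
    INSERT INTO t3 VALUES (1, 100), (2, 200);
    INSERT INTO t4 VALUES (100, 126, 'red'), (200, 99, 'blue');
""")

original = """
    SELECT t1.email, t2.firstname, t2.lastname, t4.value
    FROM t1
    LEFT JOIN t2 ON t1.email = t2.email
    INNER JOIN t3 ON t2.entity_id = t3.order_id
    INNER JOIN t4 ON t3.product_id = t4.entity_id
    WHERE t4.attribute_id = 126
"""
reordered = """
    SELECT t1.email, t2.firstname, t2.lastname, t4.value
    FROM t4
    INNER JOIN t3 ON t3.product_id = t4.entity_id
    INNER JOIN t2 ON t2.entity_id = t3.order_id
    INNER JOIN t1 ON t1.email = t2.email
    WHERE t4.attribute_id = 126
"""
rows_original = sorted(conn.execute(original).fetchall())
rows_reordered = sorted(conn.execute(reordered).fetchall())
print(rows_original == rows_reordered)
```

Note that 'c@x.test' has no row in t2, so the LEFT JOIN keeps it only until the first INNER JOIN removes it again.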

Thoughts on INNER JOIN in MySQL

We have tables with more than 3m records. When using INNER JOIN it is much slower than select * from db1,db2 where db1.field=db2.field
Any thoughts?
INNER JOIN should not be any different from a SELECT FROM t1,t2 WHERE t1.c=t2.c, it is just a different syntax for doing the same thing and is treated the same by the optimiser.
Any difference in performance is due to some other aspect of the query. If you want a reasonable answer, please post:
The schema of both tables including their indexes (SHOW CREATE TABLE gives you this)
Both the queries you're comparing
Some detail about your performance testing methodology (it may be flawed)
The EXPLAIN output of both queries
SELECT * from t1, t2 where t1.id = t2.id
is equivalent to
SELECT * from t1 INNER JOIN t2 on t1.id = t2.id.
However, if there are other criteria in the SQL query, then the behaviour may differ. For instance:
SELECT * from t1, t2 where t1.id = t2.id and t1.col1 is not null;
can be written in two different ways with the INNER JOIN:
SELECT * from t1 INNER JOIN t2 on t1.id = t2.id and t1.col1 is not null
or
SELECT * from t1 INNER JOIN t2 on t1.id = t2.id
WHERE t1.col1 is not null
Whether these end up as the same query depends on the optimiser and on the complexity of the other parts of the query. The EXPLAIN output will tell you whether you are executing the same plan.
Why might the above queries differ? Because the restriction on not null can be applied at different stages of the query, which may have an impact on performance. With an INNER JOIN both forms return the same rows; with an outer join (a LEFT JOIN, for example), moving the condition from WHERE into ON can also change the rows returned.
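The row count can only change when the join is an outer join; for an INNER JOIN, putting the condition in ON or in WHERE gives the same rows. A small sketch of both cases, using Python's sqlite3 module as a stand-in for MySQL with invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, col1 TEXT);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1, 'a'), (2, NULL);
    INSERT INTO t2 VALUES (1), (2);
""")

def run(join_type, placement):
    """Put 't1.col1 IS NOT NULL' either in the ON clause or in WHERE."""
    if placement == "on":
        sql = (f"SELECT * FROM t1 {join_type} JOIN t2 "
               "ON t1.id = t2.id AND t1.col1 IS NOT NULL ORDER BY t1.id")
    else:
        sql = (f"SELECT * FROM t1 {join_type} JOIN t2 ON t1.id = t2.id "
               "WHERE t1.col1 IS NOT NULL ORDER BY t1.id")
    return conn.execute(sql).fetchall()

inner_on, inner_where = run("INNER", "on"), run("INNER", "where")
left_on, left_where = run("LEFT", "on"), run("LEFT", "where")

print(inner_on == inner_where)  # True: same rows for INNER JOIN
print(left_on)                  # [(1, 'a', 1), (2, None, None)]
print(left_where)               # [(1, 'a', 1)]
```

With LEFT JOIN, the ON version keeps the t1 row whose col1 is NULL (padded with NULLs), while the WHERE version filters it out afterwards.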
In general, the ...where db1.field=db2.field... syntax is an inner join. It's just the implicit notation instead of the explicit. If you're joining on the same columns and returning the same columns, performance should be identical. More: http://en.wikipedia.org/wiki/Join_(SQL)#Inner_join
I generally use explicit INNER JOIN or LEFT JOIN syntax according to needs. When the optimizer does a bad job, a STRAIGHT_JOIN can often sort it out, with suitable rearrangement of the query.
With any join involving large tables, it's worth using EXPLAIN.