Query using LEFT() function has bad performance - mysql

I am using INNER JOIN and WHERE with the LEFT function to match records by their first 8 characters.
INSERT INTO result SELECT id FROM tableA a
INNER JOIN tableB b ON a.zip=b.zip
WHERE LEFT(a.street,8)=LEFT(b.street,8)
Both a.street and b.street are indexed (prefix index of length 8).
The query didn't finish in 24+ hours. Is there a problem with the indexes, or is there a more efficient way to perform this task?

MySQL won't use an index on a column that has a function applied to it in the comparison.
Other databases do allow function-based indexes.
You could create a column holding just the first 8 characters of a.street and b.street, index those columns, and join on them; that should be much quicker.
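A minimal sketch of the extra-column approach, assuming MySQL 5.7 or later (which supports generated columns) and that street is a character column. The column name street8 and the index names are made up for illustration, and a.id is assumed to be the id the question wants to insert.

```sql
-- Hypothetical names: street8, idx_a_zip_street8, idx_b_zip_street8.
-- A STORED generated column keeps the 8-character prefix in sync with street.
ALTER TABLE tableA
  ADD COLUMN street8 VARCHAR(8) AS (LEFT(street, 8)) STORED;
CREATE INDEX idx_a_zip_street8 ON tableA (zip, street8);

ALTER TABLE tableB
  ADD COLUMN street8 VARCHAR(8) AS (LEFT(street, 8)) STORED;
CREATE INDEX idx_b_zip_street8 ON tableB (zip, street8);

-- The join now compares plain indexed columns instead of function results.
INSERT INTO result (id)
SELECT a.id
FROM tableA a
INNER JOIN tableB b
  ON a.zip = b.zip
 AND a.street8 = b.street8;
```

On MySQL 8.0.13 and later, an index on the expression itself (a functional key part) may be an alternative to the extra column; checking the plan with EXPLAIN is the way to confirm either variant is actually used.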

This is your query:
INSERT INTO result
SELECT id
FROM tableA a INNER JOIN
tableB b ON a.zip=b.zip
WHERE LEFT(a.street,8)=LEFT(b.street,8);
MySQL is not smart enough to use a prefix index with this comparison. It will use a prefix index for LIKE and direct string comparisons. If I assume that id comes from tableA, then the following may perform better:
INSERT INTO result(id)
SELECT id
FROM tableA a
WHERE exists (select 1
from tableB b
where a.zip = b.zip and
b.street like concat(left(a.street, 8), '%')
);
The index that you want is tableB(zip, street(8)) or tableB(zip, street). This may use both components of the index. In any case, it might get better performance even if it cannot use both sides of the index.
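As a sketch, the suggested index could be created and checked like this. The index name is hypothetical; the column list is the one named above.

```sql
-- Hypothetical index name; columns as suggested in the answer.
CREATE INDEX idx_b_zip_street ON tableB (zip, street(8));

-- EXPLAIN shows whether the lookup on tableB picks up the new index.
EXPLAIN
SELECT a.id
FROM tableA a
WHERE EXISTS (SELECT 1
              FROM tableB b
              WHERE a.zip = b.zip
                AND b.street LIKE CONCAT(LEFT(a.street, 8), '%'));
```

Two caveats worth checking against the data before relying on this rewrite: the EXISTS form returns each tableA row at most once, whereas the original join returns one row per matching tableB row; and the LIKE comparison is not identical to LEFT(a.street,8)=LEFT(b.street,8) when a.street is shorter than 8 characters or contains the wildcard characters % or _.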

Related

mysql - How to forcefully change the order of evaluation of an inner join?

So, I have these two tables: tableA and tableB. Upon doing a simple inner join of these tables,
SELECT *
FROM tableA
JOIN tableB
ON tableA.columnA = tableB.id
Now, tableA contains 29000+ rows, whereas tableB contains just 11000+ rows. tableB.id is a primary key, hence clustered. And there exists a non-clustered index on columnA.
According to my thinking, the query optimizer should treat tableB as the inner table while performing the join, because it has a lesser number of rows, and treat tableA as the outer table, as a lot of rows need to be filtered from tableA based on the value of the tableB.id column.
But, the exact opposite of this actually happens. For some reason, the query optimizer is treating tableA as the inner table and tableB as the outer table.
Can someone please explain why that happens and what error I am making in my thought process? Also, is there a way to override the query optimizer's decision and force it to treat tableB as the inner table? I am just curious to see how the two different executions of the same query compare to each other. Thanks.
In InnoDB, primary key index lookups are marginally more efficient than secondary index lookups. The optimizer is probably preferring to run the join that does lookups against tableB.id because it uses the primary key index.
If you want to override the optimizer's ability to reorder tables, you can use an optimizer hint. The tables will be accessed in the order you specify them in your query.
SELECT *
FROM tableA
STRAIGHT_JOIN tableB
ON tableA.columnA = tableB.id
That syntax should work in any currently supported version of MySQL.
That will give you the opportunity to time the query with either table order and see which one in fact runs faster.
There's also new syntax in MySQL 8.0 to specify join order with greater control: https://dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html#optimizer-hints-join-order
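A short sketch of the MySQL 8.0 hint syntax, assuming the table names from the question. JOIN_ORDER lists the tables in the order they should be read, so the first table named becomes the outer (driving) table.

```sql
-- Read tableA first (outer), then look up tableB by its primary key (inner).
EXPLAIN
SELECT /*+ JOIN_ORDER(tableA, tableB) */ *
FROM tableA
JOIN tableB
  ON tableA.columnA = tableB.id;

-- The opposite order, for comparison.
EXPLAIN
SELECT /*+ JOIN_ORDER(tableB, tableA) */ *
FROM tableA
JOIN tableB
  ON tableA.columnA = tableB.id;
```

Dropping the EXPLAIN keyword and timing both statements gives the side-by-side comparison the question asks about.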

Mysql IN keyword

I am trying to write a query using the IN keyword.
Table A
attrId, attrName
Table B
key, attrId, attrVal
Based on the key provided, I want to return all attrName, attrVal combinations. The result will contain columns from both tables. I don't want to use a join on attrId, as I am trying to practice using the IN keyword.
Below is the query that I have attempted:
Select a.attrName, b2.attrVal
from table_A AS a, table_B AS b2
where a.attrId in (Select b1.attrId from Table_B b1 where key = <someKey>)
However, I am not getting any result for the query. Also, are queries that use the IN keyword slow, and should they be avoided? I have approx. 500 entries in table_A and 500k entries in table_B. The other alternative for me is to fetch all attrId values from table_B and then fire multiple JDBC queries, one for each attrId retrieved, to get the corresponding attrName.
Can you please help out?
Thanks
Your query is performing a CROSS JOIN operation: every row returned from a is being "matched" with every row from b2.
Your query is equivalent to:
SELECT a.attrName
, b2.attrVal
FROM table_A a
CROSS
JOIN table_B b2
WHERE a.attrId IN ( <some_list> )
The only ways this query returns no rows are: 1) no rows in a satisfy the predicate in the WHERE clause, 2) b2 contains no rows, 3) execution of the query generates so many rows that it exceeds some available resource (e.g. temporary space) and returns an error, or 4) the client has timed out or cancelled the query before it completes.
I understand you are attempting to write a query that uses the IN operator, but the set returned by the query you posted really doesn't make much sense.
Q: Are queries that use the IN keyword slow, and should they be avoided?
A: The IN operator itself does not necessarily make a query slow.
For example:
SELECT t.id FROM mytable t WHERE t.id = 2 OR t.id = 3 OR t.id = 5
Could be rewritten using the IN operator as:
SELECT t.id FROM mytable t WHERE t.id IN (2,3,5)
On the other hand, a query using the IN operator with a correlated subquery can be "slow" if 1) the subquery is slow and/or 2) there are a lot of rows that the subquery has to be evaluated for.
If you want to return rows from b that meet some condition, and then match those to rows in a, you should avoid a CROSS JOIN operation by supplying some condition for the match (a join predicate in the ON clause).
For example:
SELECT a.attrName
, b.attrVal
FROM table_A a
JOIN table_B b
ON a.attrId = b.attrId
WHERE b.key = '<someKey>'
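For practising IN specifically, a sketch of a query where the operator fits naturally is one that returns only the attribute names. The literal 'someKey' is a placeholder, and the backticks around key are there because KEY is a reserved word in MySQL when used unqualified.

```sql
-- Returns each attrName that has at least one row in table_B for the given key.
SELECT a.attrName
FROM table_A AS a
WHERE a.attrId IN (SELECT b.attrId
                   FROM table_B AS b
                   WHERE b.`key` = 'someKey');
```

Columns from the subquery are not visible to the outer SELECT, which is why returning attrVal alongside attrName needs the join shown above rather than IN alone.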

Is using an IN over a huge data set a good idea?

Let's say I have a query of the form:
SELECT a, b, c, d
FROM table1
WHERE a IN (
SELECT x
FROM table2
WHERE some_condition);
Now the query for the IN can return a huge number of records.
Assuming that a is the primary key, so that an index is used, is this the best way to write such a query?
Or is it more efficient to loop over each of the records returned by the subquery?
It is clear to me that when I do a WHERE a = X, I just do an index (tree) traversal.
But I am not sure how an IN (especially over a huge data set) would traverse/utilize an index.
The MySQL optimizer isn't really ready (yet) to handle this correctly. You should rewrite this kind of query as an INNER JOIN and index correctly; this will be the fastest method, assuming t1.a and t2.x are unique.
Something like this:
SELECT
a
, b
, c
, d
FROM
table1 as t1
INNER JOIN
table2 as t2
ON t1.a = t2.x
WHERE
t2.some_condition ....
And make sure that t1.a and t2.x have PRIMARY or UNIQUE indexes.
Having one query instead of a loop will definitely be more efficient (and consistent by nature; to get consistent results with a loop you will in general have to use a serializable transaction). One can argue in favour of EXISTS vs IN; as far as I remember mysql generates (or at least it was true for up to 5.1)...
The efficiency of using the index on a depends on the number and order of the subquery results (assuming the optimizer chooses to grab the results from the subquery first and then compare them with a). In my understanding, the fastest option is to perform a merge join, which requires both result sets to be sorted by the same key; however, that may not be possible due to different sort orders. Then I guess it's the optimizer's decision whether to sort or to use a loop join. You can rely on its choice or try using hints and see if it makes a difference.
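For comparison, the EXISTS form mentioned above might look like the sketch below. The predicate on t2.status is a hypothetical stand-in for some_condition, which the question does not spell out.

```sql
-- EXISTS variant: each table1 row is returned at most once,
-- even if several table2 rows match it.
EXPLAIN
SELECT t1.a, t1.b, t1.c, t1.d
FROM table1 AS t1
WHERE EXISTS (SELECT 1
              FROM table2 AS t2
              WHERE t2.x = t1.a
                AND t2.status = 'active');  -- hypothetical stand-in for some_condition
```

Unlike the INNER JOIN rewrite, this does not duplicate table1 rows when table2.x is not unique. Running EXPLAIN on the IN, EXISTS, and JOIN variants on the actual server version is the most reliable way to see which plan the optimizer picks, since newer MySQL releases handle IN subqueries considerably better than the 5.1-era behaviour described here.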

SQL Performance Penalty with Table Self Joins

When doing a select query, how does the performance of a table self-join compare with a join between two different tables? For example, if tableA and tableB are exact duplicates (both structure and data), would it be preferable to:
select ... from tableA a inner join tableA b..., or
select ... from tableA a inner join tableB b...
Perhaps it's not a straightforward answer, in which case, are there any references on this topic?
I am using MySQL.
Thanks!
Assuming that table B is an exact copy of table A, and that all necessary indexes are created, a self-join of table A should be a bit faster than a join of B with A, simply because data from table A and its indexes can be reused from cache in order to perform the self-join (this may also implicitly give more memory to the self-join, so more rows will fit into working buffers).
If table B is not the same, then it is impossible to compare.

Is there any way to merge these three sql statement into single sql statement

select b.b_id from btable b inner join atable a on b.b_id = a_id
go
delete from btable where b_id in (...)
go
insert into btable select * from atable where a_id in (...)
go
The conditions of the second and third SQL statements are the result of the first query.
Now I want to merge these three SQL statements into a single SQL statement.
Is there any way?
No, it's not possible to do.
PS: putting all 3 statements into one could hardly be called "query optimization". Optimization is when you improve a query's performance, not when you just take N queries and squeeze them into a single query.
Actually, it's a common misunderstanding among newbies that fewer queries automagically perform faster. That's just wrong. You should have as many queries as you need to retrieve all the necessary data, no fewer and no more.
The question would be why. If they all must be performed in sequence and they are related, then you could group them as a transaction, but that would depend on your table engines.
As a transaction, all three statements would have to succeed for any of them to take effect.
Also, in a transaction you can test the result of a SQL command and decide whether you want/need to continue.
I believe that merging will not be possible, but what you can do is put the initial result set into a temp table and reuse it.
Like this:
CREATE TEMPORARY TABLE tbl (b_id INT);
INSERT INTO tbl SELECT b.b_id FROM btable b INNER JOIN atable a ON b.b_id = a.a_id;
DELETE FROM btable WHERE b_id IN (SELECT b_id FROM tbl);
INSERT INTO btable SELECT * FROM atable WHERE a_id IN (SELECT b_id FROM tbl);
DROP TEMPORARY TABLE tbl;
I hope that this will give you some idea to start.
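The transaction and temp-table suggestions can be combined. The following is a sketch only, assuming MySQL with a transactional engine such as InnoDB, an integer b_id, and that atable and btable share the same column layout (as the question's INSERT ... SELECT * implies). The name matched_ids is hypothetical.

```sql
-- Hypothetical temp table holding the ids found by the first query.
CREATE TEMPORARY TABLE matched_ids (b_id INT PRIMARY KEY);

INSERT INTO matched_ids (b_id)
SELECT DISTINCT b.b_id
FROM btable b
INNER JOIN atable a ON b.b_id = a.a_id;

-- The delete and the re-insert succeed or fail together.
START TRANSACTION;

DELETE FROM btable
WHERE b_id IN (SELECT b_id FROM matched_ids);

INSERT INTO btable
SELECT * FROM atable
WHERE a_id IN (SELECT b_id FROM matched_ids);

COMMIT;

DROP TEMPORARY TABLE matched_ids;
```

If either statement inside the transaction fails, issuing ROLLBACK instead of COMMIT leaves btable as it was.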