MySQL super slow inner join with group by

I'm having a problem joining the 2 tables below. What I need is all of the parts in the first table where the clei OR part number is found in the second table, with a count of how many matches there are from table 1.
===================   ===================
table: svi            table: svp
===================   ===================
id                    id
po                    price
customer              clei
clei                  partNumber
partNumber            description
===================   ===================
svi has about 1 million rows. svp has about 2000. Here is the join that I'm using...
SELECT svi.clei,
       svi.partNumber,
       count(*)
FROM svp svp
INNER JOIN svi svi
    ON (svp.clei = svi.clei)
    OR (svp.partNumber = svi.partNumber)
GROUP BY svi.partNumber
The query is taking a little over 2 minutes to run, which seems ridiculously slow. clei and partNumber are indexed in both tables. What else can I do to speed up this join?

The indexes don't help very much here because there are no WHERE conditions against constants and because of the OR operator.
All the 2000 rows of the svp table are read; conditions against constants reduce the number of rows read from a table but there is no such condition here.
Then, for each of these 2000 rows, one or two lookups are performed in the indexes of the svi table to identify the matching rows: one for clei and, if that doesn't succeed, another one for partNumber, or vice versa.
A compound index on columns clei and partNumber of table svi doesn't help here either; a compound index helps when the conditions are combined using AND, not OR.
The indexes on table svp are not used. If there is an index on svp that contains both clei and partNumber columns then MySQL can decide to read it here just because it contains less data than the entire table. But it still reads the entire index and processes all the rows. It cannot use the index to filter rows because there is no filtering on svp.
It could be worse (read the entire svi table and use the indexes on svp for lookup) but MySQL is smart enough to process the smaller table first.
Put EXPLAIN in front of your query and MySQL will tell you (in fewer words) what I tried to explain above.
As I also said in a comment, the query is invalid SQL. For one value of svi.partNumber you probably have more than one value for svi.clei. The GROUP BY svi.partNumber clause generates a single output row from all the rows it gets from table svi that have the same value for partNumber.
But, since there are two or more different values of clei for the same partNumber, the final value it picks for the expression svi.clei in the SELECT clause is indeterminate. This means it can change if you run the same query again later, or if you run it on a different server that mirrors the database (or after the database is backed up and then restored from the backup).
If you just forgot to add svi.clei to the GROUP BY clause then it's an easy fix, but otherwise you have to re-think your query because, as it is now, it doesn't produce the results you expect.
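One way around the OR (not from the original answer; just a sketch, and it assumes clei is never NULL) is to split the join into two index-friendly joins and combine them before grouping:
SELECT clei, partNumber, count(*)
FROM (
    -- pairs matched on clei; the single-column clei indexes can be used here
    SELECT svi.clei, svi.partNumber
    FROM svp
    INNER JOIN svi ON svp.clei = svi.clei
    UNION ALL
    -- pairs matched on partNumber only, so no pair is counted twice
    SELECT svi.clei, svi.partNumber
    FROM svp
    INNER JOIN svi ON svp.partNumber = svi.partNumber
                  AND svp.clei <> svi.clei
) AS matched
GROUP BY clei, partNumber;
Each branch can use a single-column index, and adding svi.clei to the GROUP BY also fixes the indeterminate-value problem described above.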

Related

MYSQL search query optimization from two many-to-many tables

I have three tables.
tbl_post for a table of posts. (post_idx, post_created, post_title, ...)
tbl_mention for a table of mentions. (mention_idx, mention_name, mention_img, ...)
tbl_post_mention for a unique many-to-many relation between the two tables. (post_idx, mention_idx)
For example,
PostA can have MentionA and MentionB.
PostB can have MentionA and MentionC.
PostC cannot have MentionC and MentionC (the relation is unique, so duplicates are not allowed).
tbl_post has about a million rows, tbl_mention has fewer than a hundred rows, and tbl_post_mention has a couple of million rows. All three tables are heavily loaded with foreign keys, unique indices, etc.
I am trying to make two separate search queries.
Search for post ids with all of the given mention ids [AND condition]
Search for post ids with any of the given mention ids [OR condition]
Then join with tbl_post and tbl_mention to populate the results with meaningful data, order them, and return the top n. In the end, I hope to have a list of n posts with all the data required for my service to display on the front end.
Here are the respective simpler queries:
SELECT post_idx
FROM (SELECT post_idx, count(*) AS c
      FROM tbl_post_mention
      WHERE mention_idx IN (1, 95)
      GROUP BY post_idx) AS A
WHERE c >= 2;
The problem with this query is that it is already inefficient before the joins and ordering. This process alone takes 0.2 seconds.
SELECT DISTINCT post_idx
FROM tbl_post_mention
WHERE mention_idx in (1,95);
This is a simple index range scan, but because of the IN statement, the query becomes expensive again once you start joining it with other tables.
I tried more complex and "clever" queries and tried indexing different sets of columns, to no avail. Are there special syntaxes that I could use in this case? Maybe a clever trick? Partitioning? Or am I missing some fundamental concept here... :(
Send help.
The query you want is this:
SELECT post_idx
FROM tbl_post_mention
WHERE mention_idx in (1,95)
GROUP BY post_idx
HAVING COUNT(*) >= 2
The HAVING clause does your post-GROUP BY filtering.
The index that will help you is this:
CREATE INDEX mentionsdex ON tbl_post_mention (mention_idx, post_idx);
It covers your query by allowing rapid lookup by mention_idx then grouping by post_idx.
Often so-called join tables with two columns -- like your tbl_post_mention -- work most efficiently when they have a pair of indexes with the columns in opposite orders.
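For illustration (the index name below is made up), the companion index with the columns in the opposite order would be:
CREATE INDEX postsdex ON tbl_post_mention (post_idx, mention_idx);
-- serves the mirror-image access pattern: starting from a known post
-- and looking up its mentions, the reverse of what mentionsdex covers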

MySQL: Group by query optimization

I've got a table of the following schema:
+----+--------+----------------------------+----------------------------+
| id | amount | created_timestamp | updated_timestamp |
+----+--------+----------------------------+----------------------------+
| 1 | 1.00 | 2018-01-09 12:42:38.973222 | 2018-01-09 12:42:38.973222 |
+----+--------+----------------------------+----------------------------+
Here, for id = 1, there could be multiple amount entries. I want to extract the last added entry and its corresponding amount, grouped by id.
I've written a working query with an inner join on the self table as below:
SELECT t1.id,
t1.amount,
t1.created_timestamp,
t1.updated_timestamp
FROM transactions AS t1
INNER JOIN (SELECT id,
Max(updated_timestamp) AS last_transaction_time
FROM transactions
GROUP BY id) AS latest_transactions
ON latest_transactions.id = t1.id
AND latest_transactions.last_transaction_time =
t1.updated_timestamp;
I think the inner join is overkill and this can be replaced with a more optimized/efficient query. I've written the following query with where, group by, and having, but it isn't working. Can anyone help?
select id, any_value(`updated_timestamp`), any_value(amount) from transactions group by `id` having max(`updated_timestamp`);
There are two (good) options when performing a query like this in MySQL. You have already tried one option. Here is the other:
SELECT t1.id,
t1.amount,
t1.created_timestamp,
t1.updated_timestamp
FROM transactions AS t1
LEFT OUTER JOIN transactions AS later_transactions
ON later_transactions.id = t1.id
AND later_transactions.updated_timestamp > t1.updated_timestamp
WHERE later_transactions.id IS NULL
These methods are the ones in the documentation, and also the ones I use in my work basically every day. Which one is most efficient depends on a variety of factors, but usually, if one is slow the other will be fast.
Also, as Strawberry points out in the comments, you need a composite index on (id, updated_timestamp). Having separate indexes for id and updated_timestamp is not equivalent.
Why a composite index?
Be aware that an index is just a copy of the data that is in the table. In many respects, it works the same as a table does. So, creating an index is creating a copy of the table's data that the RDBMS can use to query the table's information in a more efficient manner.
An index on just updated_timestamp will create a copy of the data that contains updated_timestamp as the first column, and that data will be sorted. It will also include a hidden row ID value (that will work as a primary key) in each of those index rows, so that it can use that to look up the full rows in the actual table.
How does that help in this query (either version)? If we wanted just the latest (or earliest) updated_timestamp overall, it would help, since it can just check the first or last record in the index. But since we want the latest for each id, this index is useless.
What about just an index on id? Here we have a copy of the id column, sorted by the id column, with the row ID attached to each row in the index.
How does this help this query? It doesn't, because the index doesn't even contain the updated_timestamp column, so MySQL won't even consider using it.
Now, consider a composite index: (id,updated_timestamp).
This creates a copy of the data with the id column first, sorted, and then the second column updated_timestamp is also included, and it is also sorted within each id.
This is the same way that a phone book (if people still use those things as something more than paperweights) is sorted by last name and then first name.
Because the rows are sorted in this way, MySQL can look, for each id, at just the last record of a given id. It knows that that record contains the highest updated_timestamp value, because of how the index is defined.
So, it only has to look up one row for each id that exists. That is fast. Further explanation into why would take up a lot more space, but you can research it yourself if you like, by just looking into B-Trees. Suffice to say, finding the first (or last) record is easy.
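As an illustration (not part of the original answer), with that composite index in place the grouping subquery from the first solution can be answered from the index alone:
SELECT id, Max(updated_timestamp) AS last_transaction_time
FROM transactions
GROUP BY id;
-- with the (id, updated_timestamp) index, MySQL can jump straight to the
-- last index entry of each id group instead of scanning every row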
Try the following:
ALTER TABLE transactions
ADD INDEX `LatestTransaction` (`id`,`updated_timestamp`)
And then see whether your original query or my alternate query is faster. Likely both will be faster than with no index. As your table grows or your select statement changes, which of these queries is faster may also change, but the index is going to provide the biggest performance boost, regardless of which version of the query you use.

MySQL index use

I have 2 tables with a common field. On one table the common field has an index, while on the other it does not. Running a query like the following:
SELECT *
FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
the query performs far worse than running the opposite:
SELECT *
FROM table_without_index
LEFT JOIN table_with_index ON table_without_index.comcol = table_with_index.comcol
WHERE 1
Could anybody explain why, and the logic behind the use of indexes in this case?
You can prepend your queries with EXPLAIN to find out how MySQL will use the indexes and in which order it will join the tables.
Take a look at the documentation of the EXPLAIN output format to see how to interpret the result.
Because of the LEFT JOINs, the order of the tables cannot be changed. MySQL needs to include in the final result set all the rows from the left table, whether or not they have matches in the right table.
On INNER JOINs, MySQL usually swaps the tables and puts the table having fewer rows first, because this way it has a smaller number of rows to analyze.
Let's take this query (it's your query with shorter names for the tables):
SELECT *
FROM a
LEFT JOIN b ON a.col = b.col
WHERE 1
How MySQL runs this query:
It gets the first row from table a that matches the query conditions. If there are conditions in the WHERE or join clauses that use only fields of table a and constant values, then an index that contains some or all of these fields is used to filter only the rows that match the conditions.
After a row from table a is selected, it goes to the next table in the execution plan (this is table b in our query). It has to select all the rows that match the WHERE condition(s) AND the JOIN condition(s). More specifically, the row(s) selected from table b must match the condition b.col = X, where X is the value of column col for the row currently selected from table a in step 1. It finds the first matching row, then goes to the next table. Since there is no "next table" in this query, it puts the pair of rows (from a and b) into the result set, then discards the row from b and searches for the next one, repeating this step until it finds all the rows from b that match the row currently selected from a (in step 1).
If, in step 2, it cannot find any row from b that matches the row currently selected from a, the LEFT JOIN forces MySQL to make up a row (having the columns of b) full of NULLs; together with the current row from a, this made-up row is put into the result set.
After all the matching rows from b have been processed, MySQL discards the current row from a, selects the next row from a that matches the WHERE and join conditions, and starts over with the selection of matching rows from b (step 2).
This process loops until all the rows from a are processed.
Remarks:
The meaning of "first row" on step 1 depends on a lot of factors. For example, if there is an index on table a that contains all the columns (of table a) specified in the query then MySQL will not read the table data but will use the index instead. In this case, the order of the rows is given by the index. In other cases the rows are read from the table data and the order is provided by the order they are stored on the storage medium.
This simple query doesn't have any WHERE condition (WHERE 1 is always TRUE) and also there is no condition in the JOIN clause that contains only columns from a. All the rows from table a are included in the result set and that leads to a full table scan or an index scan, if possible.
In step 2, if table b has an index on column col then MySQL uses the index to find the rows from b that have value X in column col. This is a fast operation. If table b does not have an index on column col then MySQL needs to perform a full table scan of table b: it has to read all the rows of table b in order to find those having value X in column col. This is a very slow and resource-consuming operation.
Because there is no condition on rows of table a, MySQL cannot use an index of table a to filter the rows it selects. On the other hand, when it needs to select the rows from table b (on step 2), it has a condition to match (b.col = X) and it could use an index to speed up the selection, given such an index exists on table b.
This explains the big difference in performance between your two queries. Moreover, because of the LEFT JOIN, your two queries are not equivalent; they produce different results.
Disclaimer: Please note that the above list of steps is an overly simplified explanation of how the execution of a query works. It attempts to put it in simple words and skips the many technical aspects of what happens behind the scenes.
Hints about how to make your query run faster can be found in the MySQL documentation, section 8: Optimization.
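If changing the schema is an option, the most direct fix is to index the join column on the table that lacks it (a sketch; the index name is made up):
CREATE INDEX idx_comcol ON table_without_index (comcol);
-- with this index in place, the lookups in step 2 become index lookups
-- instead of full table scans, whichever table ends up on the left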
To check what's going on with the MySQL query optimizer, look at the EXPLAIN plan of these two queries. It goes like this:
EXPLAIN
SELECT * FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
and
EXPLAIN
SELECT *
FROM table_without_index
LEFT JOIN table_with_index ON table_without_index.comcol = table_with_index.comcol
WHERE 1

Huge performance difference between two similar SQL queries

I have two SQL queries that provide the same output.
My first intuition was to use this:
SELECT * FROM performance_dev.report_golden_results
where id IN (SELECT max(id) as 'id' from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id)
Now, this took something like 70 secs to complete!
Searching for another solution I tried something similar:
SELECT * FROM performance_dev.report_golden_results e
join (SELECT max(id) as 'id'
from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id) s
ON s.id = e.id;
Surprisingly, this took 0.05 secs to complete!!!
How come these two are so different?
Thanks!
The first thing which might cause the time lag is that MySQL uses a 'semi-join' strategy for subqueries. The semi-join involves the following strategies:
If a subquery meets the preceding criteria, MySQL converts it to a semi-join and makes a cost-based choice from these strategies:
Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout pulls a table out from the subquery to the outer query.
Duplicate Weedout: Run the semi-join as if it was a join and remove duplicate records using a temporary table.
FirstMatch: When scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This "shortcuts" scanning and eliminates production of unnecessary rows.
LooseScan: Scan a subquery table using an index that enables a single value to be chosen from each subquery's value group.
Materialize the subquery into a temporary table with an index and use the temporary table to perform a join. The index is used to remove duplicates. The index might also be used later for lookups when joining the temporary table with the outer tables; if not, the table is scanned.
But giving an explicit join avoids this conversion work, which might be the reason for the difference.
I hope it helped!
MySQL does not consider the first query a candidate for semi-join optimization (MySQL converts semi-joins to classic joins with some kind of optimization: FirstMatch, Duplicate Weedout, ...).
Thus a full scan will be made on the first table and the subquery will be evaluated for each row generated by the outer select: hence the bad performance.
The second one is a classic join. What happens in this case is that MySQL computes the result of the derived query and then matches only values from this query with values from the first query satisfying the condition, hence no full scan is needed on the first table (I assume here that id is an indexed column).
The question right now is why MySQL does not consider the first query a candidate for semi-join optimization. The answer is documented in the MySQL manual: https://dev.mysql.com/doc/refman/5.6/en/semijoins.html
In MySQL, a subquery must satisfy these criteria to be handled as a semijoin:
It must be an IN (or =ANY) subquery that appears at the top level of the WHERE or ON clause, possibly as a term in an AND expression. For example:
SELECT ...
FROM ot1, ...
WHERE (oe1, ...) IN (SELECT ie1, ... FROM it1, ... WHERE ...);
Here, ot_i and it_i represent tables in the outer and inner parts of the query, and oe_i and ie_i represent expressions that refer to columns in the outer and inner tables.
It must be a single SELECT without UNION constructs.
It must not contain a GROUP BY or HAVING clause.
It must not be implicitly grouped (it must contain no aggregate functions).
It must not have ORDER BY with LIMIT.
The STRAIGHT_JOIN modifier must not be present.
The number of outer and inner tables together must be less than the maximum number of tables permitted in a join.
Your subquery uses GROUP BY, hence semi-join optimization was not applied.

MySQL - LIKE search better before join or after join with round trip?

Example:
Table 1 has 100k records, and has a varchar field with a unique index on it.
Table 2 has 1 million records, and relates to table 1 through a table1_id field with a many-to-one relationship, and has three varchar fields, only one of them unique. The engine in question is InnoDB so no fulltext indexes.
For argument's sake, assume these tables will grow to a maximum of 1 million and 10 million records respectively.
When I enter a search term into my form, I want it to search both tables across all four (total) available varchar fields with a LIKE and return only the records from Table1 - so I'm grouping by table1.id here. What I'm wondering is: is it more efficient to search the million-record table first, since it has only one field that needs to be searched and that one field is unique, and then use the fetched IDs in a table1.id IN ({IDS}) query, or would it be better to join them outright and search them right then and there without making a round trip to the database?
In other words, when doing joins, does MySQL join according to the searched term, or join first and search later? That is, if I do a join and the LIKE on both tables in one query, will it first join them and then look through them for matching records, or will it join only the records it found to be matching?
Edit: I have made two sample tables and faked some data. This example query is a join and a LIKE search across all fields. For demo purposes I used LIKE '%q%' but in reality the q may be anything. The actual search on bogus 100k/1mil records took 0.03 seconds, MySQL says. Here is the explain: http://bit.ly/PsFBxK
Here is the explain query of searching just table2 on its one unique field: http://bit.ly/S06Hug and for this one to actually happen, MySQL says it took it 0.0135 seconds.
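For reference, the two approaches under discussion would look roughly like this (the column names are illustrative, not from the actual schema; note that a LIKE with a leading wildcard cannot use a BTREE index in either version):
-- round-trip version: search table2 on its searchable field first
SELECT table1_id
FROM table2
WHERE varchar_col LIKE '%q%';
-- ...then, from the application, filter table1 with the fetched IDs:
SELECT table1.*
FROM table1
WHERE table1.id IN ({IDS})        -- IDs fetched by the query above
   OR table1.varchar_col LIKE '%q%';
-- single-query version: join first, search both tables in one pass
SELECT table1.*
FROM table1
LEFT JOIN table2 ON table2.table1_id = table1.id
WHERE table1.varchar_col LIKE '%q%'
   OR table2.varchar_col LIKE '%q%'
GROUP BY table1.id;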