I have two tables with a common field. On one table the common field has an index,
while on the other it does not. Running a query like the following:
SELECT *
FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
is far slower than running the opposite:
SELECT *
FROM table_without_index
LEFT JOIN table_with_index ON table_without_index.comcol = table_with_index.comcol
WHERE 1
Could anybody explain why, and the logic behind the use of indexes in this case?
You can prepend your queries with EXPLAIN to find out how MySQL will use the indexes and in which order it will join the tables.
Take a look at the documentation of the EXPLAIN output format to see how to interpret the result.
Because of the LEFT JOINs, the order of the tables cannot be changed. MySQL needs to include in the final result set all the rows from the left table, whether or not they have matches in the right table.
On INNER JOINs, MySQL usually swaps the tables and puts the table with fewer rows first, because this way it has a smaller number of rows to analyze.
Let's take this query (it's your query with shorter names for the tables):
SELECT *
FROM a
LEFT JOIN b ON a.col = b.col
WHERE 1
How MySQL runs this query:
1. It gets the first row from table a that matches the query conditions. If there are conditions in the WHERE or JOIN clauses that use only fields of table a and constant values, then an index that contains some or all of these fields is used to select only the rows that match the conditions.
2. After a row from table a has been selected, it goes to the next table in the execution plan (table b in our query). It has to select all the rows that match the WHERE condition(s) and the JOIN condition(s). More specifically, the rows selected from table b must match the condition b.col = X, where X is the value of column col in the row currently selected from table a in step 1. It finds the first matching row, then goes to the next table. Since there is no "next table" in this query, it puts the pair of rows (from a and b) into the result set, then discards the row from b and searches for the next one, repeating this step until it has found all the rows from b that match the row currently selected from a (in step 1).
3. If step 2 cannot find any row from b that matches the row currently selected from a, the LEFT JOIN forces MySQL to make up a row (having the columns of b) full of NULLs. Together with the current row from a, this forms a row that is put into the result set.
4. After all the matching rows from b have been processed, MySQL discards the current row from a, selects the next row from a that matches the WHERE and JOIN conditions, and starts over with the selection of matching rows from b (step 2).
5. This process loops until all the rows from a have been processed.
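To make the steps above concrete, here is a minimal Python sketch of a nested-loop LEFT JOIN, once with a full scan of b for every row of a and once with a lookup structure on b.col. It is only an illustration under stated assumptions: the function names and sample data are made up, and a Python dict stands in for the index (MySQL indexes are typically B-trees, not hash maps). It is not how MySQL is actually implemented.

```python
from collections import defaultdict

def left_join_scan(a_rows, b_rows):
    """LEFT JOIN on 'col' with no index on b: scan all of b for each row of a."""
    result, comparisons = [], 0
    for a in a_rows:                      # step 1: take the next row from a
        matched = False
        for b in b_rows:                  # step 2: full scan of b
            comparisons += 1
            if b["col"] == a["col"]:
                result.append((a, b))
                matched = True
        if not matched:                   # step 3: no match, pad with NULLs
            result.append((a, None))
    return result, comparisons

def left_join_indexed(a_rows, b_rows):
    """Same join, but b.col is looked up through a dict (hypothetical index)."""
    index = defaultdict(list)
    for b in b_rows:
        index[b["col"]].append(b)
    result, lookups = [], 0
    for a in a_rows:                      # step 1
        lookups += 1
        matches = index.get(a["col"], []) # step 2: one lookup instead of a scan
        if matches:
            result.extend((a, b) for b in matches)
        else:                             # step 3
            result.append((a, None))
    return result, lookups

# Made-up sample data: 100 rows in each table, half of a has a match in b.
a_rows = [{"col": i} for i in range(100)]
b_rows = [{"col": i, "val": i * 10} for i in range(0, 200, 2)]

scan_result, scan_cost = left_join_scan(a_rows, b_rows)
index_result, index_cost = left_join_indexed(a_rows, b_rows)
print(scan_cost, index_cost)
```

Both functions return the same rows; only the amount of work per row of a differs, and that difference grows with the size of b.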
Remarks:
The meaning of "first row" on step 1 depends on a lot of factors. For example, if there is an index on table a that contains all the columns (of table a) specified in the query then MySQL will not read the table data but will use the index instead. In this case, the order of the rows is given by the index. In other cases the rows are read from the table data and the order is provided by the order they are stored on the storage medium.
This simple query doesn't have any WHERE condition (WHERE 1 is always TRUE) and also there is no condition in the JOIN clause that contains only columns from a. All the rows from table a are included in the result set and that leads to a full table scan or an index scan, if possible.
On step 2, if table b has an index on column col then MySQL uses the index to find the rows from b that have value X on column col. This is a fast operation. If table b does not have an index on column col then MySQL needs to perform a full table scan of table b. That means it has to read all the rows of table b in order to find those having values X on column col. This is a very slow and resource consuming operation.
Because there is no condition on rows of table a, MySQL cannot use an index of table a to filter the rows it selects. On the other hand, when it needs to select the rows from table b (on step 2), it has a condition to match (b.col = X) and it could use an index to speed up the selection, given such an index exists on table b.
This explains the big difference in performance between your two queries. Moreover, because of the LEFT JOIN, your two queries are not equivalent; they produce different results.
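A tiny example of the non-equivalence, as a sketch only: it uses SQLite through Python's sqlite3 module purely because it needs no server, and the tables a and b with their values are invented. The LEFT JOIN semantics shown here are standard SQL, so MySQL should return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (col INTEGER);
    CREATE TABLE b (col INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

# Every row of a is kept; a.col = 1 has no partner in b.
a_left_b = conn.execute(
    "SELECT a.col, b.col FROM a LEFT JOIN b ON a.col = b.col ORDER BY a.col"
).fetchall()

# Every row of b is kept; b.col = 3 has no partner in a.
b_left_a = conn.execute(
    "SELECT b.col, a.col FROM b LEFT JOIN a ON b.col = a.col ORDER BY b.col"
).fetchall()

print(a_left_b)  # [(1, None), (2, 2)]
print(b_left_a)  # [(2, 2), (3, None)]
```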
Disclaimer: Please note that the above list of steps is an overly simplified explanation of how the execution of a query works. It attempts to put it in simple words and skip the many technical aspects of what happens behind the scene.
Hints on how to make your query run faster can be found in the MySQL documentation, section 8, Optimization.
To check what's going on with the MySQL query optimizer, please show the EXPLAIN plan of these two queries, like this:
EXPLAIN
SELECT * FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
and
EXPLAIN
SELECT *
FROM table_without_index
LEFT JOIN table_with_index ON table_without_index.comcol = table_with_index.comcol
WHERE 1
When I execute a SQL like this;
SELECT *
FROM table_foo
JOIN table_bar
ON table_foo.foo_id = table_bar.bar_id
do I need an index just on table_foo.foo_id ?
Or does MySQL use both indexes, on table_foo.foo_id and table_bar.bar_id?
The result of EXPLAIN is like this.
There are multiple possible execution plans for this query:
SELECT f.*, b.*
FROM table_foo f JOIN
table_bar b
ON f.foo_id = b.bar_id;
Here are some examples:
The one you want to avoid (presumably) is a nested loop join that loops through one table -- row by row -- and then for each row loops through the second one.
Scan foo and look up each value in bar, using an index on table_bar(bar_id). From the row id in the bar index, get the associated columns for each matching row.
Scan bar and look up each value in foo, using an index on table_foo(foo_id). From the row id in the foo index, get the associated columns for each matching row.
Scan both indexes using a merge join and look up the associated rows in each of the tables.
This leaves out other options such as a hash join, which would not normally use indexes.
So, either or both indexes might be used, depending on which algorithms the optimizer implements. That is, one index is often going to be good enough to get the performance you want. But, you give the optimizer more options if you have an index on both tables.
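For readers who have not seen the non-nested-loop options, here is a rough Python sketch of a hash join and a sort-merge join over in-memory tuples. The row layout (id first, then a label), the sample rows and the function names are all assumptions for illustration. Which of these algorithms a given database actually implements depends on the product and version.

```python
from collections import defaultdict

def hash_join(foo, bar):
    """Build a hash table on bar's key, then probe it once per foo row."""
    buckets = defaultdict(list)
    for b in bar:
        buckets[b[0]].append(b)
    return [(f, b) for f in foo for b in buckets.get(f[0], [])]

def merge_join(foo, bar):
    """Walk both inputs in key order (an index scan would supply this order)."""
    foo = sorted(foo, key=lambda r: r[0])
    bar = sorted(bar, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(foo) and j < len(bar):
        fk, bk = foo[i][0], bar[j][0]
        if fk < bk:
            i += 1
        elif fk > bk:
            j += 1
        else:
            # Find the run of bar rows sharing this key ...
            j_end = j
            while j_end < len(bar) and bar[j_end][0] == fk:
                j_end += 1
            # ... and pair it with every foo row sharing the key.
            while i < len(foo) and foo[i][0] == fk:
                out.extend((foo[i], b) for b in bar[j:j_end])
                i += 1
            j = j_end
    return out

# Made-up rows: (foo_id, name) and (bar_id, label).
foo = [(3, "f3"), (1, "f1"), (2, "f2a"), (2, "f2b")]
bar = [(2, "b2a"), (4, "b4"), (2, "b2b"), (3, "b3")]

hash_result = hash_join(foo, bar)
merge_result = merge_join(foo, bar)
print(len(hash_result), len(merge_result))
```

The two functions return the same pairs, possibly in a different order, which is why the choice between them is purely a cost decision for the optimizer.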
I am using MySQL through R. I am working with two tables within the same database and I noticed something strange that I can't explain. To be more specific, when I try to make a connection between the tables using a foreign key the result is not what it should be.
One table is called Genotype_microsatellites, the second table is called Records_morpho. They are connected through the foreign key sample_id.
If I only select records with certain characteristics from the Genotype_microsatellites table using the following command...
Gen_msat <- dbGetQuery(mydb, 'SELECT *
FROM Genotype_microsatellites
WHERE CIDK113a >= 0')
...the query returns 546 observations for 52 variables, exactly what I would expect. Now, I want to do a query that adds a little more info to my results, specifically by including data from the Records_morpho table. I, therefore, use the following code:
Gen_msat <- dbGetQuery(mydb, 'SELECT Genotype_microsatellites.*,
Records_morpho.net_mass_g,
Records_morpho.svl_mm
FROM Genotype_microsatellites
INNER JOIN Records_morpho ON Genotype_microsatellites.sample_id = Records_morpho.sample_id
WHERE CIDK113a >= 0')
The problem is that now the output has 890 observations and 54 variables! Some sample_id values (i.e., the rows or individuals in the data frame) are showing up multiple times, which shouldn't be the case. I have tried to fix this using SELECT DISTINCT, but the problem wouldn't go away.
Any help would be much appreciated.
Sounds like it is working as intended, that is how joins work. With A JOIN B ON A.x = B.y you get every row from A combined with every row from B that has a y matching the A row's x. If there are 3 rows in B that match one row in A, you will get three result rows for those. The A row's data will be repeated for each B row match.
To go a little further: if neither x nor y is unique, and you have two rows in A with the same x value and three rows in B with that value for y, they will produce six result rows.
As you mentioned, DISTINCT does not make this problem go away, because DISTINCT operates across the whole result row. It will only merge result rows if the values in all selected fields are the same on those result rows. Similarly, if you have a query on a single table that has duplicate rows, DISTINCT will merge those rows despite them being separate rows, as they do not have distinct sets of values.
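Here is a small reproduction of the "two times three gives six" case, as a sketch only. It uses SQLite via Python's sqlite3 module for convenience, and the tables A and B, their columns and their values are invented for the example. The row counts follow from standard join and DISTINCT semantics, so MySQL should behave the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (x INTEGER, a_data TEXT);
    CREATE TABLE B (y INTEGER, b_data TEXT);
    INSERT INTO A VALUES (1, 'a1'), (1, 'a2');
    INSERT INTO B VALUES (1, 'b1'), (1, 'b2'), (1, 'b3');
""")

# Two rows in A times three matching rows in B.
joined = conn.execute(
    "SELECT A.*, B.* FROM A JOIN B ON A.x = B.y"
).fetchall()

# DISTINCT compares whole result rows; the B columns differ, so nothing merges.
distinct_all = conn.execute(
    "SELECT DISTINCT A.*, B.* FROM A JOIN B ON A.x = B.y"
).fetchall()

# Selecting only A's columns makes the repeated rows identical, so they merge.
distinct_a_only = conn.execute(
    "SELECT DISTINCT A.* FROM A JOIN B ON A.x = B.y"
).fetchall()

print(len(joined), len(distinct_all), len(distinct_a_only))  # 6 6 2
```

In the question's case, that suggests checking whether sample_id is really unique in Records_morpho before reaching for DISTINCT.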
I'm having a problem joining the 2 tables below. What I need is all of the parts in the first table where the clei OR part number is found in the second table, with a count of how many matches there are from table 1.
===================    ===================
table: svi             table: svp
===================    ===================
id                     id
po                     price
customer               clei
clei                   partNumber
partNumber             description
===================    ===================
svi has about 1 million rows. svp has about 2000. Here is the join that I'm using...
SELECT svi.clei,
svi.partNumber,
count(*)
FROM svp svp
INNER JOIN
svi svi
ON (svp.clei = svi.clei)
OR (svp.partNumber = svi.partNumber)
GROUP BY svi.partNumber
The query is taking a little over 2 minutes to run, which seems ridiculously slow. clei and partNumber are indexed in both tables. What else can I do to speed up this join?
The indexes don't help very much here because there are no WHERE conditions against constants and because of the OR operator.
All the 2000 rows of the svp table are read; conditions against constants reduce the number of rows read from a table but there is no such condition here.
Then, for each of these 2000 rows, one or two lookups are performed in the indexes of the svi table to identify the matching rows: one for clei and, if it doesn't succeed, another one for partNumber, or vice versa.
A compound index on columns clei and partNumber on table svi doesn't help here; it helps when the conditions are combined using AND.
The indexes on table svp are not used. If there is an index on svp that contains both clei and partNumber columns then MySQL can decide to read it here just because it contains less data than the entire table. But it still reads the entire index and processes all the rows. It cannot use the index to filter rows because there is no filtering on svp.
It could be worse (read the entire svi table and use the indexes on svp for lookup) but MySQL is smart enough to process the smaller table first.
Put EXPLAIN in front of your query and MySQL tells you (in fewer words) what I tried to explain above.
As I also said in a comment, the query is invalid SQL. For one value of svi.partNumber you probably have more than one value for svi.clei. The GROUP BY svi.partNumber clause generates a single output row from all the rows it gets from table svi that have the same value for partNumber.
But, since there are two or more different values for clei for the same partNumber, the final value it picks for the expression svi.clei in the SELECT clause is indeterminate. This means it can change if you run the same query again later or if you run it on a different server that mirrors the database (or after the database is backed up and then restored from the backup).
If you just forgot to add svi.clei in the GROUP BY clause then it's an easy fix but otherwise you have to re-think your query because as it is now, it doesn't produce the results you expect.
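To show what the "easy fix" looks like, here is a sketch that groups by both selected columns. It runs on SQLite through Python's sqlite3 module only for convenience, leaves out the join entirely, and uses three invented svi rows, so treat it as an illustration of the GROUP BY point and not as a rewrite of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE svi (clei TEXT, partNumber TEXT);
    -- Made-up rows: one partNumber shared by two different clei values.
    INSERT INTO svi VALUES ('C1', 'P1'), ('C2', 'P1'), ('C2', 'P1');
""")

# Every non-aggregated column in the SELECT list also appears in GROUP BY,
# so each output row has exactly one well-defined clei.
rows = conn.execute("""
    SELECT clei, partNumber, COUNT(*)
    FROM svi
    GROUP BY clei, partNumber
    ORDER BY clei, partNumber
""").fetchall()

print(rows)  # [('C1', 'P1', 1), ('C2', 'P1', 2)]
```

With GROUP BY partNumber alone there would be a single output row for P1, and whether it showed C1 or C2 would not be something you could rely on.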
I have these queries:
1st query:
SELECT (..) FROM db WHERE A = const AND B > const AND C >= const ORDER BY B DESC LIMIT const
2nd query (different db):
SELECT (...) FROM db' WHERE A' = const ORDER BY X' DESC LIMIT const
Question about 1st query:
Is it sufficient to have a multi-column index (A, B, C), or do I need an additional single-column index (B) (or a different one) because of the ORDER BY clause?
Question about 2nd query: Do I need a multi-column index (A', X') or two single-column indexes (A'), (X') to make use of them in this query?
It is an important thing to know that MySQL will use at most one index (for searching, filtering and ordering) per table and subquery (so basically per row in explain), so you can use only one index here.
For your first query, an index (A,B) will allow MySQL to do a range scan and use the order. If you use (A,B,C), the column C cannot be used in the range condition (because B is already a range), but MySQL will save the time needed to read the actual table data to get the value of C to check the last condition. So (A,B,C) is in general the fastest choice here.
"In general", because you can of course have a data distribution where another index would be best: If you e.g. have only one or two rows that match C >= const and 10M+ rows with A = const, using an index on just C would be fastest. And if C is a very big column (e.g. varchar(700)), it could blow up the index and slow it down. But to estimate such exceptions would require deeper knowledge of your data.
For your second query, (A', X') will be the best choice. If you have the two indexes (A'), (X'), MySQL will in most cases (unless A' is unique, but then you wouldn't need an order by anyway) use the index on X' and hope it will find matching rows for A' soon. This will sometimes be unexpectedly and painfully slow if you only have some rows that match A' = const (because it has to jump back and forth in the table (that is ordered by the primary key) in the order of X' to find rows that match the condition for A').
You might get the same problem for your first query if you have the indexes (A) and (B) (but not (A,B) or (A,B,C)) there: MySQL will probably use (B) instead of (A) (but check the EXPLAIN output to make sure). Even if you just add one index now, this can happen, for example, when you add the index (B) to optimize a different query next week and forget about this query, so I'd suggest sticking with (at least) (A,B).
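Why (A,B) serves both the filter and the ORDER BY can be seen with a sorted list standing in for the index. This is a rough Python sketch under assumptions: the key values are invented, the function name is hypothetical, and a sorted list with bisect only approximates how a B-tree range scan behaves.

```python
import bisect

def range_scan_desc(keys, a_const, b_min, limit):
    """Emulate WHERE A = a_const AND B > b_min ORDER BY B DESC LIMIT limit
    on a list of (A, B) keys kept in sorted order, like an index on (A, B)."""
    # First entry with A = a_const and B > b_min.
    lo = bisect.bisect_right(keys, (a_const, b_min))
    # One past the last entry with A = a_const.
    hi = bisect.bisect_right(keys, (a_const, float("inf")))
    # Read backwards from the end of the range; stop after `limit` entries.
    stop = max(lo, hi - limit)
    return [keys[i] for i in range(hi - 1, stop - 1, -1)]

# Made-up index contents: A in {1, 2, 3}, B in {5, 10, 15, 20, 25}.
keys = sorted((a, b) for a in (1, 2, 3) for b in (5, 10, 15, 20, 25))

top_two = range_scan_desc(keys, a_const=2, b_min=10, limit=2)
print(top_two)  # [(2, 25), (2, 20)]
```

All entries with the same A sit next to each other and are already ordered by B, so the matching rows form one contiguous slice that can be read in descending order without a separate sort.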
I join table A to table B and need to know if table B has 1 matching row or more than one.
Of course, I can do it with GROUP BY and COUNT, but it's an overkill, because it has to count all the matches and I don't need this info.
Is there a simple way to get the info I need (only one matching row or more) which short circuits the evaluation and stops when it knows the answer without scanning and counting all the remaining matches?
Or should I not care about this, because it's not a big performance hit, and simply go with COUNT?
It really depends on the size of the DB, and your exact requirements. Generally a count()/Group By/Having combination is a pretty efficient query, with the right indexes. You could do it in a more complicated way, for example, having a trigger on after update that keeps a count table updated.
Are you seeing the count(*)/group/having combination giving you performance issues?
If you just need to know whether a given join produces exactly one matching row or more than one:
-- Without any sample SQL code, here's a return sample
SELECT B.SOMEJOINAPPLICABLECOLUMN
FROM A
LEFT OUTER JOIN B
ON A.SOMEJOINAPPLICABLECOLUMN = B.SOMEJOINAPPLICABLECOLUMN
WHERE
B.SOMEJOINAPPLICABLECOLUMN IS NOT NULL
LIMIT 2;
Naturally:
2 returned rows = more than one match
1 returned row = one match
0 returned rows = no matches
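If the check is needed for one particular row of A at a time, the same LIMIT 2 idea can be applied per key. The following is a sketch under assumptions: it uses SQLite via Python's sqlite3 module for convenience, and the table B, its a_id column, the index name and the helper function are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE B (a_id INTEGER);
    CREATE INDEX idx_b_a_id ON B (a_id);
    -- Made-up data: key 1 has one match, key 2 has two, key 3 has three.
    INSERT INTO B VALUES (1), (2), (2), (3), (3), (3);
""")

def match_count_capped(conn, a_id):
    """Return 0, 1 or 2, where 2 means 'more than one' matching row in B.
    LIMIT 2 lets the database stop as soon as a second match is found."""
    rows = conn.execute(
        "SELECT 1 FROM B WHERE a_id = ? LIMIT 2", (a_id,)
    ).fetchall()
    return len(rows)

counts = {key: match_count_capped(conn, key) for key in (1, 2, 3, 99)}
print(counts)  # {1: 1, 2: 2, 3: 2, 99: 0}
```

Note that the LIMIT applies to the whole result, so a single query with LIMIT 2 answers the question for one key, not for every row of A at once.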