Need some clarification on indexes (WHERE, JOIN)

We are facing performance issues in some reports that work on millions of rows. I tried optimizing the SQL queries, but that only cut the execution time in half.
The next step is to analyse and modify or add some indexes, so I have some questions:
1- The SQL queries contain a lot of joins: do I have to create an index for each foreign key?
2- Take the query SELECT * FROM A LEFT JOIN B ON a.b_id = b.id WHERE a.attribute2 = 'someValue', with an index on table A over b_id and attribute2: does the query use this index for the WHERE part? (I know that if both conditions were in the WHERE clause, the index would be used.)
3- If an index is based on columns C1, C2 and C3, and I decide to add an index based on C2, do I need to remove C2 from the first index?
Thanks for your time

You can run EXPLAIN on a query to see what MySQL will do when executing it. This helps a LOT when trying to figure out why it's slow.
JOIN-ing happens one table at a time, and the order is determined by MySQL analyzing the query and trying to find the fastest order. You will see it in the EXPLAIN result.
1- Only one index can be used per JOIN, and it has to be on the table being joined. In your example the index used will be the id (primary key) on table B. Creating an index on every FK will give MySQL more options for the query plan, which may help in some cases.
2- There is only a difference between WHERE and JOIN conditions when there are NULLs (missing rows) for the joined table (there is no difference at all for INNER JOIN). For your example the index on b_id does nothing, since A is read first and attribute2 is not the leading column of that index. If you change it to an INNER JOIN (e.g. by adding b.something = 42 in the WHERE clause), then it might be used if MySQL determines that it should do the query in reverse (first b, then a).
3- No. It is 100% OK to have a column in multiple indexes. If you have an index on (A,B,C) and you add another one on (A), that will be redundant and pointless (because it is a prefix of another index). An index on B is perfectly fine.
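The prefix rule in the last point can be illustrated with a toy model. The sketch below is Python, not MySQL: it treats a composite index on (A, B, C) as a sorted list of tuples, which is only an approximation of a B-tree, and the names index_abc, lookup_by_a and lookup_by_b are made up for the illustration.

```python
from bisect import bisect_left

# Toy rows (a, b, c). A composite index on (A, B, C) keeps them
# sorted by a, then b, then c.
rows = [(2, 7, 1), (1, 9, 4), (3, 7, 2), (1, 5, 8), (2, 3, 6)]
index_abc = sorted(rows)


def lookup_by_a(a):
    """Seek on the leading column: binary-search to the first entry,
    then read the adjacent run of entries."""
    start = bisect_left(index_abc, (a,))
    matches = []
    for entry in index_abc[start:]:
        if entry[0] != a:
            break  # entries for one A value are contiguous, so stop early
        matches.append(entry)
    return matches


def lookup_by_b(b):
    """B is not a leading column: matching entries are scattered,
    so every entry has to be checked."""
    return [entry for entry in index_abc if entry[1] == b]


print(lookup_by_a(1))
print(lookup_by_b(7))
```

In this model a separate index on (A) would add nothing next to (A, B, C), while a separate index on (B) would give lookups on B the same seek-and-read pattern that lookup_by_a has.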

Related

Should I create any indexes here to optimize my query?

Now I'm trying to figure out what I should do to improve my query's performance.
Right now, it's 47.55.
So, should I create any indexes on the columns? Please tell me.
SELECT bw.workloadId, lrer.lecturerId, lrer.lastname, lrer.name, lrer.fathername, bt.title, ac.activityname, cast(bw.exactday as char(45)) as "date", bw.exacttime as "time" FROM base_workload as bw
right join unioncourse as uc on uc.idunioncourse = bw.idunioncourse
right join basecoursea as bc on bc.idbasecoursea = uc.idbasecourse
right join lecturer as lrer on lrer.lecturerId = uc.lecturerId
right join basetitle as bt on bt.idbasetitle = bc.idbasetitle
right join activity as ac on ac.activityId = bc.activityId
where lrer.lecturerId is not null AND bc.idbasecoursea is not null and bw.idunioncourse != ""
ORDER BY bw.exactday, bw.exacttime ASC;

From MySQL 8.0 documentation:
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially.
MySQL uses indexes for these operations:
To find the rows matching a WHERE clause quickly.
To eliminate rows from consideration.
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to look up rows.
To retrieve rows from other tables when performing joins.
To find the MIN() or MAX() value for a specific indexed column key_col.
To sort or group a table if the sorting or grouping is done on a leftmost prefix of a usable index (for example, ORDER BY key_part1, key_part2).
In some cases, a query can be optimized to retrieve values without consulting the data rows.
As for your requirements, you could add an index on the columns used in the WHERE clause for faster data retrieval.

I think you can get rid of
lrer.lecturerId is not null
AND bc.idbasecoursea is not null
by changing the first 3 RIGHT JOINs to JOINs.
What is the datatype of exactday? What is the purpose of
cast(bw.exactday as char(45)) as "date"
The CAST may be unnecessary.
Re bw.exactday, bw.exacttime: It is usually better to use a single column for DATETIME instead of two columns (DATE and TIME).
What are the PRIMARY KEYs of the tables?
Please convert to LEFT JOIN if possible; I can't wrap my head around RIGHT JOINs.
This index on bw may help: INDEX(exactday, exacttime).

How can i speed up the left join in my query using indexes?

I am new to SQL. At the moment I am experiencing some slow MySQL queries. I think I need to improve my indexes, but I am not sure how.
drop temporary table if exists temp ;
CREATE TEMPORARY TABLE temp
(index idx_a (EXTRACT_DATE, project_id, SERVICE_NAME) )
select distinct DATE(c.EXTRACT_DATETIME) as EXTRACT_DATE,p.project_id, p.project_name, c.CLUSTER_NAME, c.SERVICE_NAME,
UPPER(CONCAT(SUBSTRING_INDEX(c.ENV_NAME, '-', 1),'-',c.CLUSTER_NAME)) as CLUSTER_ID
from p
left join c
on p.project_id = c.project_id ;

The short answer is that you need indexes, at least to optimize the lookups done by the JOIN. The EXPLAIN shows that both tables you are joining are doing a full table scan, then joining them the hard way, using "block nested loop", which indicates it is not using an index.
It would help to at least create an index on c.project_id.
ALTER TABLE c ADD INDEX (project_id);
This would mean there is still a table-scan to read the p table (estimated 5720 rows), but at least when it needs to find the related rows in c, it only reads the rows it needs, without doing a table-scan of 287K rows for each row of p.
The query you posted in an earlier question had another condition:
where DAYNAME(c.EXTRACT_DATETIME) = 'Friday' ;
I don't know why you haven't included this condition in the new question you posted.
If this is still a condition you need to handle, this could help optimize the query further. MySQL 5.7 (which you said in the other question you are using) supports virtual columns, defined for an expression, and you can index virtual columns.
ALTER TABLE c
ADD COLUMN isFriday TINYINT(1) AS (DAYNAME(EXTRACT_DATETIME) = 'Friday'),
ADD INDEX (isFriday);
Then if you search on the new isFriday column, or even if you search on the same expression used for the virtual column definition, it will use the index.
So what you really need is an index on c that uses both columns: one for the join, and one for the additional condition.
ALTER TABLE c
ADD COLUMN isFriday TINYINT(1) AS (DAYNAME(EXTRACT_DATETIME) = 'Friday'),
ADD INDEX (project_id, isFriday);

You aren't filtering on anything other than the outer join column. This leads me to expect that most of the rows in both tables will need to be read. To do this only once, you may be best off using a hash join rather than a nested loop with an index. A hash join allows each table to be read completely once, rather than the back-and-forth approach of a nested loop, which will likely read the same pages again each time a row is looked up.
Hash joins require MySQL 8.0.18 or later. Using the latest available stable release is recommended.
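The idea of a hash join can be sketched in a few lines. This is Python rather than MySQL and only models the technique: the table contents and the function name hash_left_join are invented, and a real engine adds memory limits and spilling to disk that the sketch ignores.

```python
from collections import defaultdict

# Invented sample data: p(project_id, project_name) and c(project_id, service_name).
p = [(1, "alpha"), (2, "beta"), (3, "gamma")]
c = [(1, "web"), (1, "db"), (3, "cache")]


def hash_left_join(left, right):
    """LEFT JOIN on the first field of each row, reading each input once."""
    # Build phase: one pass over the right-hand table, bucketing rows by join key.
    buckets = defaultdict(list)
    for row in right:
        buckets[row[0]].append(row)

    # Probe phase: one pass over the left-hand table, one hash lookup per row.
    result = []
    for project_id, project_name in left:
        matches = buckets.get(project_id)
        if matches:
            for _, service_name in matches:
                result.append((project_id, project_name, service_name))
        else:
            # LEFT JOIN keeps unmatched left rows, padded with NULL (None here).
            result.append((project_id, project_name, None))
    return result


joined = hash_left_join(p, c)
for row in joined:
    print(row)
```

Neither input is revisited after its single pass, which is the property the answer relies on when most rows of both tables have to be read anyway.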

Should I create 2 indexes for the same column to speed up a join?

I am new to database indexes and I've just read about what an index is, the differences between clustered and non-clustered indexes, and what a composite index is.
So for an inner join query like this:
SELECT columnA
FROM table1
INNER JOIN table2
ON table1.columnA= table2.columnA;
In order to speed up the join, should I create 2 indexes, one for table1.columnA and the other for table2.columnA, or just 1 index for table1 or table2?
Is one good enough? I don't get it. For example, if I select some data from table2 first and then join on columnA based on the result, I am looping through the results from table2 one by one, so an index on table2.columnA is totally useless here, because I don't need to find anything in table2 now. So I need an index on table1.columnA.
And vice versa: I need an index on table2.columnA if I select some results from table1 first and want to join on columnA.
Well, I don't know what "select xxxx first, then join based on ..." looks like in reality, but that scenario just came to mind. It would be much appreciated if someone could also give a simple example.

One index is sufficient, but the question is which one?
It depends on how the MySQL optimizer decides to order the tables in the join.
For an inner join, the results are the same for table1 INNER JOIN table2 versus table2 INNER JOIN table1, so the optimizer may choose to change the order. It is not constrained to join the tables in the order you specified them in your query.
The difference from an indexing perspective is whether it will first loop over rows of table1, and do lookups to find matching rows in table2, or vice-versa: loop over rows of table2 and do lookups to find rows in table1.
MySQL does joins as "nested loops". It's as if you had written code in your favorite language like this:
foreach row in table1 {
    look up rows in table2 matching table1.column_name
}
This lookup will make use of the index in table2. An index in table1 is not relevant to this example, since your query is scanning every row of table1 anyway.
How can you tell which table order is used? You can use EXPLAIN. It will show you a row for each table reference in the query, and it will present them in the join order.
Keep in mind the presence of an index in either table may influence the optimizer's choice of how to order the tables. It will try to pick the table order that results in the least expensive query.
So maybe it doesn't matter which table you add the index to, because whichever one you put the index on will become the second table in the join order, because it makes it more efficient to do the lookup that way. Use EXPLAIN to find out.
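To make the cost difference concrete, here is a small Python model of the nested loop above, counting how many table2 rows are examined with and without an index on the join column. It is a sketch, not MySQL's executor: the data and the names join_without_index and join_with_index are invented, and a dict stands in for the index.

```python
from collections import defaultdict

# Invented sample data: each row is (id, columnA).
table1 = [(1, "x"), (2, "y"), (3, "z")]
table2 = [(10, "y"), (11, "x"), (12, "y"), (13, "w")]


def join_without_index(outer, inner):
    """Nested loop with a full scan of the inner table for every outer row."""
    result, examined = [], 0
    for outer_row in outer:
        for inner_row in inner:
            examined += 1
            if outer_row[1] == inner_row[1]:
                result.append((outer_row[0], inner_row[0]))
    return result, examined


def join_with_index(outer, inner):
    """Nested loop where the inner lookup goes through an index on columnA."""
    # The dict plays the role of an index on inner.columnA. In a database the
    # index already exists and is not rebuilt for each query.
    index = defaultdict(list)
    for inner_row in inner:
        index[inner_row[1]].append(inner_row)

    result, examined = [], 0
    for outer_row in outer:
        for inner_row in index.get(outer_row[1], []):
            examined += 1
            result.append((outer_row[0], inner_row[0]))
    return result, examined


rows_scan, examined_scan = join_without_index(table1, table2)
rows_index, examined_index = join_with_index(table1, table2)
print(examined_scan, examined_index)
```

Both functions return the same pairs; only the number of inner rows touched differs. Swapping the arguments models the other join order, in which case the useful index would be the one on table1.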

90% of the time in a properly designed relational database, one of the two columns you join together is a primary key, and so should have a clustered index built for it.
So as long as you're in that case, you don't need to do anything at all. The only reason to add additional non-clustered indexes is if you're also filtering the join further with a WHERE clause at the end of your statement. In that case you need to make sure both the join columns and the filtered columns are together in a suitable index (i.e. correct sort order, etc.).

Is it possible to further optimize this MySQL query?

I was running a query of this kind:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table2.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow, so I split it into 2 queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
This works much better, but now I am going through the tables twice, so I was wondering whether any further optimization is possible, for instance getting the set of entries that satisfies the condition (table1.c1 = table2.c1 OR table1.c2 = table2.c2) and then querying on it. That would bring me back to the first thing I was doing, but maybe there is another solution I don't have in mind. So is there anything more to do with it, or is it already optimal?

Splitting the query into two separate ones is usually better in MySQL, since it rarely uses the "Index OR" operation (Index Merge in MySQL lingo).
There are a few items I would concentrate on for further optimization, all related to indexing:
1. Filter the rows faster
The predicates in the WHERE clause should be optimized to retrieve as few rows as possible. They should also be analyzed in terms of selectivity, so that you create indexes that produce the data with as little filtering as possible (fewer reads).
2. Join access
Retrieving related rows should be optimized as well. Based on selectivity, decide which table is more selective and use it as the driving table, and treat the other one as the nested loop table. For the latter, create an index that retrieves the rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the columns the query needs from the driving and/or secondary tables. This way the InnoDB engine won't need to read two indexes per table (the secondary index and then the clustered index), but a single one.

Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
The OP shows query patterns only, so I cannot predict whether the technique will be effective; I suggest that the OP test my variant on their own structure and data.
Nico Haase > what you've changed
I have added one more condition to the 2nd subquery; see the added comment in the code.
Nico Haase > and why?
This replaces UNION DISTINCT with UNION ALL and eliminates the sorting of the combined rowset that is needed to remove duplicates.
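One caveat when testing this variant: if c1 can be NULL in either table, table1.c1 != table2.c1 evaluates to NULL rather than true, so the second branch silently drops rows that the UNION version returns. The sketch below shows the effect in Python, with the standard-library sqlite3 module standing in for MySQL (an assumption made so the example runs anywhere; comparisons with NULL yield NULL in both). The sample data and the selected id columns are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, c1 INTEGER, c2 INTEGER);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, c1 INTEGER, c2 INTEGER);
    INSERT INTO table1 VALUES (1, 10, 100), (2, 20, 200), (3, NULL, 300);
    INSERT INTO table2 VALUES (1, 10, 100), (2, 20, 999), (3, 30, 300);
""")

BRANCH_C1 = "SELECT table1.id, table2.id FROM table1 JOIN table2 ON table1.c1 = table2.c1"
BRANCH_C2 = "SELECT table1.id, table2.id FROM table1 JOIN table2 ON table1.c2 = table2.c2"


def run(sql):
    """Execute a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())


# The asker's rewrite: UNION removes duplicates across the two branches.
union_rows = run(f"{BRANCH_C1} UNION {BRANCH_C2}")

# Variant from the answer: UNION ALL plus an exclusion on the second branch.
naive_rows = run(f"{BRANCH_C1} UNION ALL {BRANCH_C2} WHERE table1.c1 != table2.c1")

# NULL-aware exclusion: also keep rows where the c1 comparison is unknown.
null_aware_rows = run(
    f"{BRANCH_C1} UNION ALL {BRANCH_C2} "
    "WHERE table1.c1 != table2.c1 OR table1.c1 IS NULL OR table2.c1 IS NULL"
)

print(union_rows)
print(naive_rows)
print(null_aware_rows)
```

If c1 is declared NOT NULL in both tables, the plain != condition is enough. Note also that UNION ALL keeps duplicates produced within a single branch, which UNION would collapse; whether that matters depends on the selected fields.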

Huge performance difference between two similar SQL queries

I have two SQL queries that provide the same output.
My first intuition was to use this:
SELECT * FROM performance_dev.report_golden_results
where id IN (SELECT max(id) as 'id' from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id)
Now, this took something like 70 secs to complete!
Searching for another solution I tried something similar:
SELECT * FROM performance_dev.report_golden_results e
join (SELECT max(id) as 'id'
from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id) s
ON s.id = e.id;
Surprisingly, this took 0.05 secs to complete!!!
How come these two are so different?
Thanks!

The first thing that might cause the time lag is that MySQL uses a 'semi-join' strategy for subqueries. The semi-join involves the following steps:
If a subquery meets the preceding criteria, MySQL converts it to a semi-join and makes a cost-based choice from these strategies:
Convert the subquery to a join, or use table pullout and run the query as an inner join between subquery tables and outer tables. Table pullout pulls a table out from the subquery to the outer query.
Duplicate Weedout: Run the semi-join as if it was a join and remove duplicate records using a temporary table.
FirstMatch: When scanning the inner tables for row combinations and there are multiple instances of a given value group, choose one rather than returning them all. This "shortcuts" scanning and eliminates production of unnecessary rows.
LooseScan: Scan a subquery table using an index that enables a single value to be chosen from each subquery's value group.
Materialize the subquery into a temporary table with an index and use the temporary table to perform a join. The index is used to remove duplicates. The index might also be used later for lookups when joining the temporary table with the outer tables; if not, the table is scanned.
Writing the join explicitly avoids this work, which might be the reason.
I hope it helped!

MySQL does not consider the first query eligible for semi-join optimization (MySQL converts semi-joins to classic joins with some kind of optimization: first match, duplicate weedout ...).
Thus a full scan is made on the first table and the subquery is evaluated for each row generated by the outer select, hence the poor performance.
The second one is a classic join. In this case MySQL computes the result of the derived query and then matches only the values from that result against rows of the first table, so no full scan of the first table is needed (I assume here that id is an indexed column).
The question now is why MySQL does not consider the first query eligible for semi-join optimization. The answer is in the MySQL documentation: https://dev.mysql.com/doc/refman/5.6/en/semijoins.html
In MySQL, a subquery must satisfy these criteria to be handled as a semijoin:
It must be an IN (or =ANY) subquery that appears at the top level of the WHERE or ON clause, possibly as a term in an AND expression. For example:
SELECT ...
FROM ot1, ...
WHERE (oe1, ...) IN (SELECT ie1, ... FROM it1, ... WHERE ...);
Here, ot_i and it_i represent tables in the outer and inner parts of the query, and oe_i and ie_i represent expressions that refer to columns in the outer and inner tables.
It must be a single SELECT without UNION constructs.
It must not contain a GROUP BY or HAVING clause.
It must not be implicitly grouped (it must contain no aggregate functions).
It must not have ORDER BY with LIMIT.
The STRAIGHT_JOIN modifier must not be present.
The number of outer and inner tables together must be less than the maximum number of tables permitted in a join.
Your subquery uses GROUP BY, hence semi-join optimization was not applied.
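The difference between the two plans described in this answer can be modelled in a few lines of Python. This is a toy model of the explanation, not of how MySQL's executor is implemented; the rows and the function names are invented, and a dict stands in for the primary key index on id.

```python
# Invented rows: (id, platform_id, release_id, configuration_id).
rows = [
    (1, "p1", "r1", "c1"),
    (2, "p1", "r1", "c1"),
    (3, "p2", "r1", "c1"),
    (4, "p2", "r1", "c1"),
    (5, "p2", "r2", "c1"),
]


def max_id_per_group(table):
    """SELECT MAX(id) ... GROUP BY platform_id, release_id, configuration_id."""
    latest = {}
    for row_id, *group_key in table:
        key = tuple(group_key)
        latest[key] = max(row_id, latest.get(key, row_id))
    return set(latest.values())


def filter_recomputing(table):
    """Model of the slow plan: the grouped subquery is re-run for each outer row."""
    subquery_runs, result = 0, []
    for row in table:
        subquery_runs += 1
        if row[0] in max_id_per_group(table):
            result.append(row)
    return result, subquery_runs


def filter_materialized(table):
    """Model of the join plan: the grouped result is computed once,
    then each id is looked up directly."""
    wanted = max_id_per_group(table)
    by_id = {row[0]: row for row in table}  # stands in for the primary key index
    return [by_id[row_id] for row_id in sorted(wanted)], 1


slow_rows, slow_runs = filter_recomputing(rows)
fast_rows, fast_runs = filter_materialized(rows)
print(slow_runs, fast_runs)
```

Both functions return the same rows. In the first, the grouping work grows with the square of the table size; in the second it is done once, which matches the gap between the two timings reported in the question.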
