MySQL: why do my JOINs appear to make a query faster? - mysql

I am trying to improve the performance of a query using a "materialized view" to optimize away joins. The first query below is the original, which employs joins. The second is the query written against a table I generated which includes all the joined data (the equivalent of a materialized view). They both return the same result set. Unfortunately, the second query is MUCH slower when handling a very long set of input ids (the IN clause). I don't understand how that could be! Executing all the joins has to have a fair amount of overhead that is saved by the "materialized view", right?
SELECT
clinical_sample.INTERNAL_ID AS "internalId",
sample.STABLE_ID AS "sampleId",
patient.STABLE_ID AS "patientId",
clinical_sample.ATTR_ID AS "attrId",
cancer_study.CANCER_STUDY_IDENTIFIER AS "studyId",
clinical_sample.ATTR_VALUE AS "attrValue"
FROM clinical_sample
INNER JOIN sample ON clinical_sample.INTERNAL_ID = sample.INTERNAL_ID
INNER JOIN patient ON sample.PATIENT_ID = patient.INTERNAL_ID
INNER JOIN cancer_study ON patient.CANCER_STUDY_ID =
cancer_study.CANCER_STUDY_ID
WHERE cancer_study.CANCER_STUDY_IDENTIFIER = 'xxxxx'
AND sample.STABLE_ID IN
('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND clinical_sample.ATTR_ID IN
(
'CANCER_TYPE'
);
SELECT
internalId,
sampleId,
patientId,
attrId,
studyId,
attrValue
FROM test
WHERE
sampleId IN ('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND studyId = 'xxxxx'
AND attrId = 'CANCER_TYPE';
Update: I did notice in the Workbench report that the query with joins seems to scan far fewer rows: about 829k vs ~2400k for the second, joinless query. So having joins seems to actually be a major optimization somehow. I have indexes on sampleId, studyId, attrId and a composite index on all three.
Both table "test" and "clinical_sample" have the same number of rows.

It would help to see what the PRIMARY KEY of each table is.
Some of these indexes are likely to help:
clinical_sample: INDEX(ATTR_ID, INTERNAL_ID, ATTR_VALUE)
sample: INDEX(STABLE_ID, INTERNAL_ID, PATIENT_ID)
patient: INDEX(INTERNAL_ID, STABLE_ID, CANCER_STUDY_ID)
cancer_study: INDEX(CANCER_STUDY_IDENTIFIER, CANCER_STUDY_ID)
I agree with Barmar's INDEX(studyId, attrId, sampleId) for the materialized view.
I have index in sampleId, studyId, attrId and composite of all three.
Let's see the EXPLAIN. It may show that it is using your index just on (sampleId) when it should be using the composite index.
Also put the IN column last, not first, regardless of cardinality. More precisely, put = columns first in a composite index.
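As a rough sketch of the advice above, the composite index on the materialized table could be created with the = columns first and the IN column last, and then checked with EXPLAIN. The index name idx_study_attr_sample is a made-up placeholder; the table and column names come from the question.

```sql
-- Hypothetical index name; columns tested with "=" come first, the IN column last.
ALTER TABLE test
  ADD INDEX idx_study_attr_sample (studyId, attrId, sampleId);

-- Check which index the optimizer picks for the joinless query.
EXPLAIN
SELECT internalId, sampleId, patientId, attrId, studyId, attrValue
FROM test
WHERE studyId = 'xxxxx'
  AND attrId = 'CANCER_TYPE'
  AND sampleId IN ('P-0068343-T02-IM7', 'P-0068353-T01-IM7');
```

If the key column of the EXPLAIN output still shows the single-column index on sampleId, the optimizer is probably not using the composite index as intended.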

Food for thought: When and why are database joins expensive?
This leads me to believe that normalized tables with indexes could actually be faster than my denormalized attempt (materialized view).

Related

Optimize MySQL Index Multiple Table JOIN

I have 5 tables in MySQL, and my query takes too long to execute.
Here is the structure of my tables:
reciept (row count: 23799640)
reciept_goods (row count: 39398989)
good (row count: 17514)
good_categories (row count: 121)
retail_category (row count: 10)
My Indexes:
Date -->reciept.date #1
reciept_goods_index --> reciept_goods.recieptId #1,
reciept_goods.shopId #2,
reciept_goods.goodId #3
category_id -->good.category_id #1
I have the following SQL query:
SELECT
R.shopId,
sales,
sum(Amount) as sum_amount,
count(distinct R.id) as count_reciept,
RC.id,
RC.name
FROM
reciept R
JOIN reciept_goods RG
ON R.id = RG.RecieptId
AND R.ShopID = RG.ShopId
JOIN good G
ON RG.GoodId = G.id
JOIN good_categories GC
ON G.category_id = GC.id
JOIN retail_category RC
ON GC.retail_category_id = RC.id
WHERE
R.date >= '2018-01-01 10:00:00'
GROUP BY
R.shopId,
R.sales,
RC.id
EXPLAIN for this query gives the following result:
(image: Explain query)
and the execution time is 236 sec.
If I use straight_join good ON (good.id = reciept_goods.GoodId), EXPLAIN gives:
(image: Explain query)
and the execution time is 31 sec.
SELECT STRAIGHT_JOIN ... rest of query
I think the problem is in the indexes of my tables, but I don't understand how to fix them. Can someone help me?
With about 2% of the rows in reciept having a matching date, the second execution plan (with straight_join) seems to use the right execution order. You should be able to optimize it by adding the following covering indexes:
reciept(date, sales)
reciept_goods(recieptId, shopId, goodId, amount)
I assume that the column order in your primary key for reciept_goods currently is (goodId, recieptId, shopId) (or (goodId, shopId, recieptId)). You could change that to (recieptId, shopId, goodId) (and judging by e.g. the table name, you may have wanted to do this anyway); in that case, you do not need the second index (at least for this query). I would assume that this primary key made MySQL choose the slower execution plan (of course assuming that it would be faster), although sometimes it's just bad statistics, especially on a test server.
With those covering indexes, MySQL should choose the faster execution plan even without straight_join. If it doesn't, just add it again (although I would like a look at both execution plans then). Also check that those two new indexes are used in the EXPLAIN output, otherwise I may have missed a column.
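A minimal sketch of the two covering indexes suggested above. The index names are made-up placeholders; the column lists are the ones given in this answer and may need an extra column if EXPLAIN does not show "Using index".

```sql
-- Hypothetical index names; column lists as suggested in the answer.
ALTER TABLE reciept
  ADD INDEX idx_date_sales (date, sales);

ALTER TABLE reciept_goods
  ADD INDEX idx_reciept_shop_good_amount (recieptId, shopId, goodId, amount);
```

After adding them, rerunning EXPLAIN on the original query (without STRAIGHT_JOIN) should show whether both indexes are picked up.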
It looks like you are depending on walking through a couple of many:many tables? Many people design them inefficiently.
Here I have compiled a list of 7 tips on making mapping tables more efficient. The most important is use of composite indexes.

Join vs subquery to count nested objects

Let's say my model contains 2 tables: persons and addresses. One person can have 0, 1 or more addresses. I'm trying to execute a query that lists all persons and includes the number of addresses each of them has. Here are the two queries I have to achieve that:
SELECT
persons.*,
count(addresses.id) AS number_of_addresses
FROM `persons`
LEFT JOIN addresses ON persons.id = addresses.person_id
GROUP BY persons.id
and
SELECT
persons.*,
(SELECT COUNT(*)
FROM addresses
WHERE addresses.person_id = persons.id) AS number_of_addresses
FROM `persons`
And I was wondering if one is better than the other in terms of performance.
The way to determine performance characteristics is to actually run the queries and see which is better.
If you have no indexes, then the first is probably better. If you have an index on addresses(person_id), then the second is probably better.
The reason is a little complicated. The basic reason is that group by (in MySQL) uses a sort, and sorts are O(n * log(n)) in complexity. So the time to do a sort grows faster than the data (not much faster, but a bit faster). The consequence is that a bunch of small aggregations, one for each person, is faster than one aggregation by person over all the data.
That is conceptual. In fact, MySQL will use the index for the correlated subquery, so it is often faster than the overall group by, which does not make use of an index.
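As a sketch of the setup this answer assumes, the index on addresses(person_id) could be added as follows and the correlated-subquery version checked with EXPLAIN. The index name idx_person_id is a made-up placeholder.

```sql
-- Hypothetical index name; lets each correlated COUNT(*) be answered from the index.
ALTER TABLE addresses
  ADD INDEX idx_person_id (person_id);

EXPLAIN
SELECT
  persons.*,
  (SELECT COUNT(*)
   FROM addresses
   WHERE addresses.person_id = persons.id) AS number_of_addresses
FROM persons;
```

Timing both variants on real data remains the only reliable way to decide between them.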
I think the first query is optimal, and further optimization can be achieved by changing the table structure. For example, define both person_id and address_id (the order is important) as the primary key of the addresses table to make the join faster.
MySQL (InnoDB) stores tables as index-organized tables (clustered index), so the primary key index is much faster than a normal secondary index, especially in join operations.

MySQL indexes optimisation

I have a big query that joins several tables and has WHERE clauses.
From my understanding, the best way to choose an index is to look at the WHERE clause and index the columns it uses:
select name from Table WHERE name = 'John'
We would have an index on the "name" field .
How would we determine the best index to have if the clause looks like this:
WHERE table1.field = 'x' and table2.field = 'y' etc...
Of course the query is much more complicated than that; I just want to know how to proceed, and whether you have a better idea.
SELECT ...
FROM tA
JOIN tB ON tA.x = tB.y
WHERE tA.name = 'foo'
AND tB.name = 'bar'
begs for
tA: INDEX(name, x)
tB: INDEX(name, y)
On the other hand:
SELECT ...
FROM tA
JOIN tB ON tA.name = tB.name
needs INDEX(name) on both tables.
If name is the PRIMARY KEY on each table, then those indexes are redundant and should not be added.
Etc.
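A minimal sketch of the first case above, with the filter column first and the join column second in each index. The index names and the selected columns are made-up placeholders; tA, tB, name, x and y are the names used in this answer.

```sql
-- Hypothetical index names: filter column first, join column second.
ALTER TABLE tA ADD INDEX idx_name_x (name, x);
ALTER TABLE tB ADD INDEX idx_name_y (name, y);

-- Check that both indexes show up in the plan.
EXPLAIN
SELECT tA.x, tB.y
FROM tA
JOIN tB ON tA.x = tB.y
WHERE tA.name = 'foo'
  AND tB.name = 'bar';
```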
How would we determine the best index to have if the clause looks like this:
WHERE table1.field = 'x' and table2.field = 'y' etc...
First of all, since you are joining 2 tables, the join fields should be indexed, and for better performance these fields should be of an integer type.
Then check which condition filters out the most rows, and create an index on that field or a composite index on multiple fields (make sure the field that filters the most data is leftmost in the index), without letting the index size grow too much.
Normally (not always) a query uses a single index per table, so since you are filtering data from multiple tables, you can create an index on the relevant columns of each table, provided these fields filter the data sufficiently.
Anyone can give better advice after seeing your actual query.
There is no such thing as a single index for multiple tables. The first thing you could do is to create an index for table1 on field and another one for table2 on field. If this is still not fast enough, depending on your database schema, you could set a foreign key.
Lastly, you can create a view which contains data from both tables and then index that view. The advantage would be having the data pre-joined, which might make the query even faster. Note, however, that MySQL itself does not support indexes on views, so in MySQL this means maintaining a pre-joined table yourself.

Optimizing MySQL Left join query between 3 tables to reduce execution time

I have the following query:
SELECT region.id, region.world_id, min_x, min_y, min_z, max_x, max_y, max_z, version, mint_version
FROM minecraft_worldguard.region
LEFT JOIN minecraft_worldguard.region_cuboid
ON region.id = region_cuboid.region_id
AND region.world_id = region_cuboid.world_id
LEFT JOIN minecraft_srvr.lot_version
ON id=lot
WHERE region.world_id = 10
AND region_cuboid.world_id=10;
The MySQL slow query log tells me that it takes more than 5 seconds to execute and returns 2300 rows, but examines 15'404'545 rows to do so.
The three tables each have only about 6500 rows, with unique keys on the id and lot fields as well as keys on the world_id fields. I tried to minimize the number of rows examined by filtering both cuboid and world by their ID and by the double WHERE on world_id, but it did not seem to help.
Any idea how I can optimize this query?
Here is the sqlfiddle with the indexes as of current status.
MySQL can't use an index in this case because the joined fields have different definitions (different collations):
`lot` varchar(20) COLLATE utf8_unicode_ci NOT NULL
`id` varchar(128) COLLATE utf8_bin NOT NULL
If you change these fields to a common type and collation (for example, region.id to utf8_unicode_ci), MySQL uses the primary key (fiddle).
According to docs:
Comparison of dissimilar columns (comparing a string column to a
temporal or numeric column, for example) may prevent use of indexes if
values cannot be compared directly without conversion.
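One way the collation change suggested in this answer might look, as a sketch only. It assumes the column uses the utf8 character set and has no other attributes beyond those shown in the definition above; MODIFY restates the whole column definition, so anything else the real column has would need to be repeated.

```sql
-- Inspect the current collations of the two join columns first.
SHOW FULL COLUMNS FROM minecraft_worldguard.region LIKE 'id';
SHOW FULL COLUMNS FROM minecraft_srvr.lot_version LIKE 'lot';

-- Assumption: utf8 character set, no extra column attributes.
-- Switching from utf8_bin to utf8_unicode_ci makes comparisons case-insensitive.
ALTER TABLE minecraft_worldguard.region
  MODIFY id VARCHAR(128) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;
```

Other columns that join against region.id (such as region_cuboid.region_id) would probably need the same collation to keep those joins index-friendly.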
You have joined the two tables "minecraft_worldguard.region" and "minecraft_worldguard.region_cuboid" on region.world_id and region_cuboid.world_id, so the WHERE clause doesn't need two conditions.
The two columns in the WHERE clause are already equated in the JOIN condition, so you don't need to check both of them in the WHERE clause. Remove one of them from the WHERE clause and add an index on the column that remains in the WHERE condition.
In your example, leave the WHERE clause as below:
WHERE region.world_id = 10
and add an index on the region.world_id column, that would improve the performance a bit.
NOTE: observe that I am suggesting you to discard "AND region_cuboid.world_id=10;" part of the WHERE clause.
Hope that helps.
First, when writing queries that have multiple tables, it is a very good thing to get used to "alias" references to the tables so you don't have to retype the entire long name throughout. Also, it is a really good idea to identify which tables the columns are coming from to allow users to better understand what is where which can also help improve performance (such as suggesting a covering index).
That said, I have applied aliases to your original query, but AM GUESSING the table per the respective columns, but you can obviously identify quickly and adjust.
SELECT
R.id,
R.world_id,
RC.min_x,
RC.min_y,
RC.min_z,
RC.max_x,
RC.max_y,
RC.max_z,
LV.version,
LV.mint_version
FROM
minecraft_worldguard.region R
LEFT JOIN minecraft_worldguard.region_cuboid RC
ON R.id = RC.region_id
AND R.world_id = RC.world_id
LEFT JOIN minecraft_srvr.lot_version LV
ON R.id = LV.lot
WHERE
R.world_id = 10
I also removed from the where clause your "region_cuboid.world_id = 10" as that is redundant as a result of the JOIN clause based on region AND world.
For suggestion of indexes, and if I have the proper alias references to the columns, I would suggest a covering index on the region table of
( world_id, id ). The "World_id" in the first position quickly qualifies the WHERE clause, and the "id" is there for the RC and LV tables.
For the region_cuboid table, I would also have an index on ( world_id, region_id) to match the region table being joined to it.
For the lot_version table, an index on (lot) or a covering index on (lot, version, mint_version).
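The three index suggestions above could be written roughly as follows. The index names are made-up placeholders, and the column-to-table mapping follows the guessed aliases in this answer.

```sql
-- Hypothetical index names; column lists as suggested above.
ALTER TABLE minecraft_worldguard.region
  ADD INDEX idx_world_id_id (world_id, id);

ALTER TABLE minecraft_worldguard.region_cuboid
  ADD INDEX idx_world_id_region_id (world_id, region_id);

ALTER TABLE minecraft_srvr.lot_version
  ADD INDEX idx_lot_versions (lot, version, mint_version);
```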

Optimize mysql query using indexes

I have a problem with this query:
SELECT DISTINCT s.city, pc.start, pc.end
FROM postal_codes pc LEFT JOIN suspects s ON (s.postalcode BETWEEN pc.start AND pc.end)
WHERE pc.user_id = "username"
ORDER BY pc.start
The suspects table has about 340 000 entries and there is an index on postalcode. I have several users, but this individual query takes about 0.5s. When I run this SQL with EXPLAIN, I get something like this: http://my.jetscreenshot.com/7536/20111225-myhj-41kb.jpg - do these NULLs mean that the query isn't using an index? The index is a BTREE, so I think this should run a little faster.
Can you please help me with this? If any other information is needed, just let me know.
Edit: I have indexes on suspects.postalcode, postal_codes.start, postal_codes.end, postal_codes.user_id.
Basically what I'm trying to achieve: I have a table where each user ID has multiple postalcode ranges assigned, so it looks like:
user_id | start | end
Then I have a table of suspects where each suspect has an address (which contains a postalcode), so in this query I'm trying to get the postalcode range (start and end) and also the name of the city in this range.
Hope this helps.
With a LEFT JOIN, every row of the first table is kept, whether or not an index finds a match for it. I would suggest using an inner join, something like the query below.
select distinct
s.city,
pc.start,
pc.end
from postal_codes pc
inner join suspects s
on s.postalcode between pc.start and pc.end
where
pc.user_id = "username"
order by pc.start
It's using only one index, and not for the fields involved in the join. Try creating an index for the start and end fields, or using >= and <= instead of BETWEEN
Not 100% sure, but this might be relevant:
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.) However, if such a query uses LIMIT to retrieve only some of the rows, MySQL uses an index anyway, because it can much more quickly find the few rows to return in the result.
So try testing with LIMIT, and if it uses the index then, you found your cause.
I have to say I'm a little confused by your table naming convention; I would expect the "suspects" table to have a user_id rather than the postal_codes table, but you must have your reasons. If you were to leave this query as it is, you can add an index on postal_codes (start, end) to avoid the complete table scan.
I think you can restructure your query like following,
SELECT DISTINCT s.city, pc1.start, pc1.end FROM
(SELECT pc.start, pc.end FROM postal_codes pc WHERE pc.user_id = "username") AS pc1, suspects s
WHERE s.postalcode BETWEEN pc1.start AND pc1.end ORDER BY pc1.start
Your query is not picking up the index on the s table because of the left join and your BETWEEN condition. Having an index on your table doesn't necessarily mean that it will be used in all queries.
Try FORCE INDEX.
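For reference, a FORCE INDEX hint on the original query might look like the sketch below. The index name idx_postalcode is a made-up placeholder for whatever the existing index on suspects.postalcode is actually called (SHOW INDEX FROM suspects lists the real names).

```sql
-- idx_postalcode is a hypothetical name for the existing index on suspects.postalcode.
SELECT DISTINCT s.city, pc.start, pc.end
FROM postal_codes pc
LEFT JOIN suspects s FORCE INDEX (idx_postalcode)
  ON s.postalcode BETWEEN pc.start AND pc.end
WHERE pc.user_id = 'username'
ORDER BY pc.start;
```

Comparing the EXPLAIN output and timings with and without the hint shows whether forcing the index actually helps.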