Indexes and optimization - MySQL

I'm not brilliant when it comes to going beyond the basics with MySQL; however, I'm trying to optimize this query:
SELECT DATE_FORMAT(t.completed, '%H') AS hour, t.orderId, t.completed as stamp,
t.deadline as deadline, t.completedBy as user, p.largeFormat as largeFormat
FROM tasks t
JOIN orders o ON o.id=t.orderId
JOIN products p ON p.id=o.productId
WHERE DATE(t.completed) = '2013-09-11'
AND t.type = 7
AND t.completedBy IN ('user1', 'user2')
AND t.suspended = '0'
AND o.shanleys = 0
LIMIT 0,100
+----+-------------+-------+--------+----------------------------+-----------+---------+-----------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------+-----------+---------+-----------------+-------+-------------+
| 1 | SIMPLE | o | ref | PRIMARY,productId,shanleys | shanleys | 2 | const | 54464 | Using where |
| 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 4 | sfp.o.productId | 1 | |
| 1 | SIMPLE | t | ref | NewIndex1 | NewIndex1 | 5 | sfp.o.id | 6 | Using where |
+----+-------------+-------+--------+----------------------------+-----------+---------+-----------------+-------+-------------+
Before some of the indexes were added it was performing full table scans on both the p table and the o table.
Basically, I thought that MySQL would:
narrow down the rows from the tasks table with the WHERE clauses (it should be 84 rows without the joins),
then go through the orders table to the products table to get a flag (largeFormat).
My questions are: why does MySQL look up 50,000+ rows when it only has 84 different ids to look for, and is there a way I can optimize the query?
I'm not able to add new fields or new tables.
Thank you in advance!

MySQL needs suitable indexes available to best qualify the query. I would add a compound index on
(type, suspended, completedBy, completed)
to match the criteria you have... Your orders and products tables appear OK with their existing indexes.
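As DDL, a sketch of that index, assuming the tasks table from the question (the index name is hypothetical):
ALTER TABLE tasks ADD INDEX idx_type_susp_user_done (type, suspended, completedBy, completed);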
SELECT
DATE_FORMAT(t.completed, '%H') AS hour,
t.orderId,
t.completed as stamp,
t.deadline,
t.completedBy as user,
p.largeFormat as largeFormat
FROM
tasks t
JOIN orders o
ON t.orderId = o.id
AND o.shanleys = 0
JOIN products p
ON o.productId = p.id
WHERE
t.type = 7
AND t.suspended = 0
AND t.completedBy IN ('user1', 'user2')
AND t.completed >= '2013-09-11'
AND t.completed < '2013-09-12'
LIMIT
0,100
I suspect that suspended is a numeric (int) flag; if so, leave the criterion as a numeric and don't turn it into a string by wrapping it in quotes ('0').
For datetime fields, if you apply a function to the column, the index can't be utilized well. So, if you only care about the one day (or a range in other queries), notice I have the datetime field >= '2013-09-11', which implies 12:00:00 AM, and the datetime field LESS THAN '2013-09-12', which allows up to 11:59:59 PM on 2013-09-11. That covers the entire day, and the index can take advantage of it.
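For contrast, this is the shape that defeats the index (it is what the original WHERE clause used):
WHERE DATE(t.completed) = '2013-09-11'
Because the function has to be applied to every row's value before comparing, the index on completed can't be range-scanned.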

Related

How to improve this MySQL Query using join?

I have got a simple query and it takes more than 14 seconds.
select
e.title, e.date, v.name, v.city, v.region, v.country
from seminar e force index for join (venueid)
left join venues v on e.venueid = v.id
where v.country = 'US'
and v.city = 'New York'
and v.region = 'NY'
and e.date > curdate()
and e.someid != 0
Note: count(e.id) was just an abbreviation for debugging purposes; in fact we select information from both tables.
Explain gives this:
+----+-------------+-------+-------------+--------------------------------------------------------------------------------------+--------------------------+---------+-----------------+------+--------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+--------------------------------------------------------------------------------------+--------------------------+---------+-----------------+------+--------------------------------------------------------+
| 1 | SIMPLE | v | index_merge | PRIMARY,city,country,region | city,region | 378,378 | NULL | 2 | Using intersect(city,region); Using where |
| 1 | SIMPLE | e | ref | venueid | venueid | 5 | v.id | 11 | Using where |
+----+-------------+-------+-------------+--------------------------------------------------------------------------------------+--------------------------+---------+-----------------+------+--------------------------------------------------------+
I have indexes on e.id, e.date, e.someid, as well as v.id, v.country, v.city and v.region.
I know the db-setup is a mess but that's what I have to deal with right now.
Why does the query take so long when the final result is only about 150 rows? There are about 1M entries in events and about 100K in venues.
Both tables are MyISAM. Any ideas how to improve this?
Upon creating an index like this
create index location on venues (city, region, country)
it takes 20 seconds, the explain is this:
+----+-------------+-------+------+--------------------------------------+--------------+---------+-------------------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+--------------------------------------+--------------+---------+-------------------+------+------------------------------------+
| 1 | SIMPLE | v | ref | PRIMARY,city,country,region,location | location | 765 | const,const,const | 410 | Using index condition; Using where |
| 1 | SIMPLE | e | ref | EventVenueID | venueid | 5 | v.id | 11 | Using where |
+----+-------------+-------+------+--------------------------------------+--------------+---------+-------------------+------+------------------------------------+
You have left join venues, but you have conditions in the where clause on the joined venues row, so only joined rows will be returned. However, that's a side issue - read on for why you don't need a join at all.
Next, if the city is vancouver, there's no need to also test for country or state.
Finally, if you're trying to find "how many future events are in Vancouver", you don't need a join, as the venue id is a constant!
Try this:
select count(*) as event_count
from events
where venueid = (select id from venues where city = 'vancouver')
and startdate > curdate()
and te_id != 0
MySQL will use the index on venueid without you having to use a hint. If it doesn't, execute this:
ANALYZE TABLE events;
which will update the statistics of the data distribution in the indexed columns. Note that if a lot of your events are in Vancouver, it's more efficient to not use an index (as most of the rows will have to be accessed anyway).
This would make the first part of the query faster:
INDEX(city, region, country)
I went another way, since it seems that MySQL can't handle the joins effectively:
I created one big new table with all the columns I need from the join,
so the seminars and events are in one table now,
and added indexes.
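A sketch of that consolidation, assuming the tables and columns from the question (the new table and index names are hypothetical):
CREATE TABLE seminar_flat AS
SELECT e.id, e.title, e.date, e.someid, e.venueid,
       v.name, v.city, v.region, v.country
FROM seminar e
JOIN venues v ON e.venueid = v.id;
CREATE INDEX idx_flat_loc_date ON seminar_flat (city, region, country, date);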
Now the query is fast. Don't know why...
From 25 seconds, we are down to .08 seconds
That's how I wanted it.
If anybody still knows why, you are more than welcome to provide an answer.

Why is my MySQL query so slow?

I'm trying to figure out why this query is so slow (it takes about 6 seconds to get a result):
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
WHERE
c.id NOT IN (... big list of ids which should be excluded)
This is execution plan
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| 1 | SIMPLE | z1 | index | PRIMARY | PRIMARY | 4 | NULL | 318563 | 99.85 | Using where; Using index; Using temporary |
| 1 | SIMPLE | c | eq_ref | PRIMARY,member_id | PRIMARY | 4 | z1.id | 1 | 100.00 | |
| 1 | SIMPLE | i | eq_ref | PRIMARY | PRIMARY | 4 | c.member_id | 1 | 100.00 | Using index |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
Is it because MySQL has to read almost the whole first table? Can it be adjusted?
You can try to replace c with a subquery.
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
(select c.id
from c
WHERE
c.id NOT IN (... big list of ids which should be excluded)) c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
to leave only the necessary ids.
It is impossible to say from the information you've provided whether there is a faster solution for obtaining the same data (we would need to know about data distributions and which foreign keys are obligatory). However, assuming that this is a hierarchical data set, the plan is probably not optimal: the only predicate that reduces the number of rows is c.id NOT IN (...).
The first question to ask yourself when optimizing any query is: do I need all the rows? How many rows is this returning?
I'm struggling to see any utility in a query which returns a list of id values (implying a set of auto-increment integers).
You can't use an index for a NOT IN (or <>), hence the most efficient solution is probably to start with a full table scan on c - which should be the outcome of StanislavL's query.
Since you don't use the values from i and z1, the joins could be replaced with EXISTS, which may help performance.
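A sketch of that EXISTS rewrite, assuming the join keys shown above (the big exclusion list stays as in the original):
SELECT c.id
FROM c
WHERE c.id NOT IN (... big list of ids which should be excluded)
  AND EXISTS (SELECT 1 FROM z1 WHERE z1.id = c.id)
  AND EXISTS (SELECT 1 FROM i WHERE i.member_id = c.member_id)
The DISTINCT is no longer needed, since each row of c is examined at most once.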
I would consider creating a compound index for c(id, member_id). This way the query should work at index level only without scanning any rows in tables.
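A sketch of that compound index (the index name is hypothetical):
ALTER TABLE c ADD INDEX idx_id_member (id, member_id);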

Normal select faster than count(*)

I want to do a count like this (as an example, not really counting dogs):
SELECT COUNT(*)
FROM dogs AS d INNER JOIN races AS r ON d.race_id = r.race_id
INNER JOIN colors AS c ON c.color_id = r.color_id
WHERE d.deceased = 'N'
I have 130,000 dogs in a MyISAM table. Races has 1,500 records and is an InnoDB table with 9 columns, colors has 83 records and is also InnoDB and has two columns (id, name).
The *_id columns are all primary keys, I have indices on the 'foreign' keys dogs.race_id and races.color_id and I have an index on dogs.deceased. None of the mentioned columns can be NULL.
# mysql --version
mysql Ver 14.12 Distrib 5.0.51a, for debian-linux-gnu (i486) using readline 5.2
Now the thing is: In my PhpMyAdmin this query takes 1.8 secs (with SQL_NO_CACHE) with a count result of 64,315. Changing COUNT(*) to COUNT(d.dog_id) or COUNT(d.deceased) also takes the query to run for 1.8 secs with the same result.
But when I remove the COUNT() and just do SELECT * or SELECT dog_id, it takes about 0.004 secs to run (and then counting the result with something like mysql_num_rows()).
How can this be? And how can I make the COUNT() work faster?
Edit: Added an EXPLAIN below
EXPLAIN SELECT COUNT(*)
FROM dogs AS d INNER JOIN races AS r ON d.race_id = r.race_id
INNER JOIN colors AS c ON c.color_id = r.color_id
WHERE d.deceased = 'N'
Gives me:
+----+-------------+-------+-------+------------------+----------+---------+----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------+----------+---------+----------------------+------+-------------+
| 1 | SIMPLE | c | index | color_id | color_id | 4 | NULL | 83 | Using index |
| 1 | SIMPLE | r | ref | PRIMARY,color_id | color_id | 4 | database.c.color_id | 14 | Using index |
| 1 | SIMPLE | d | ref | race_id,deceased | race_id | 4 | database.r.race_id | 123 | Using where |
+----+-------------+-------+-------+------------------+----------+---------+----------------------+------+-------------+
The MySQL optimizer does a full table scan only if it is needed because a column can be NULL: if the column is not defined as NOT NULL, there can be NULL values in it, and so MySQL has to check each row to find out. Is your column d.dog_id nullable? Try to run the count on another column which is not nullable; this should give you better performance than COUNT(*).
Try to set an index on dogs.deceased and use SELECT COUNT(*) ... FROM dogs USE INDEX (my_index_name) (the index hint goes in parentheses right after the table reference).
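Spelled out as a sketch, assuming the index on dogs.deceased is named ix_deceased (a hypothetical name):
SELECT COUNT(*)
FROM dogs AS d USE INDEX (ix_deceased)
INNER JOIN races AS r ON d.race_id = r.race_id
INNER JOIN colors AS c ON c.color_id = r.color_id
WHERE d.deceased = 'N';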
Create indexes to make your counting faster:
CREATE INDEX ix_temp ON dogs (race_id);
MySQL has no INCLUDE clause, so to make the index cover the query, append the other needed columns to the key list itself.
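For this particular count, a covering-index sketch (the index name is hypothetical):
ALTER TABLE dogs ADD INDEX ix_deceased_race (deceased, race_id);
With this index, the deceased = 'N' filter and the race_id join column can both be read from the index without touching the row data.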

SQL LIMIT to get latest records

I am writing a script which will list 25 items of all 12 categories. Database structure is like:
tbl_items
---------------------------------------------
item_id | item_name | item_value | timestamp
---------------------------------------------
tbl_categories
-----------------------------
cat_id | item_id | timestamp
-----------------------------
There are around 600,000 rows in the table tbl_items. I am using this SQL query:
SELECT e.item_id, e.item_value
FROM tbl_items AS e
JOIN tbl_categories AS cat WHERE e.item_id = cat.item_id AND cat.cat_id = 6001
LIMIT 25
Using the same query in a loop for cat_id from 6000 to 6012. But I want the latest records of every category. If I use something like:
SELECT e.item_id, e.item_value
FROM tbl_items AS e
JOIN tbl_categories AS cat WHERE e.item_id = cat.item_id AND cat.cat_id = 6001
ORDER BY e.timestamp
LIMIT 25
...the query runs for approximately 10 minutes, which is not acceptable. Can I use LIMIT more cleverly to get the latest 25 records for each category?
Can anyone help me achieve this without ORDER BY? Any ideas or help will be highly appreciated.
EDIT
tbl_items
+---------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------------+--------------+------+-----+---------+-------+
| item_id | int(11) | NO | PRI | 0 | |
| item_name | longtext | YES | | NULL | |
| item_value | longtext | YES | | NULL | |
| timestamp | datetime | YES | | NULL | |
+---------------------+--------------+------+-----+---------+-------+
tbl_categories
+----------------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+------------+------+-----+---------+-------+
| cat_id | int(11) | NO | PRI | 0 | |
| item_id | int(11) | NO | PRI | 0 | |
| timestamp | datetime | YES | | NULL | |
+----------------+------------+------+-----+---------+-------+
Can you add indices? If you add an index on the timestamp and other appropriate columns the ORDER BY won't take 10 minutes.
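For instance, a minimal sketch of such an index (the name is hypothetical, and the most useful column order depends on the data):
ALTER TABLE tbl_items ADD INDEX idx_items_ts (timestamp);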
First of all:
It seems to be an N:M relation between items and categories: an item may be in several categories. I say this because tbl_categories has an item_id foreign key.
If it is not an N:M relationship, then you should consider changing the design. If it is a 1:N relationship, where a category has several items, then the item must contain a category_id foreign key.
Working with N:M:
I have rewritten your query to make an inner join instead of a cross join:
SELECT e.item_id, e.item_value
FROM
tbl_items AS e
JOIN
tbl_categories AS cat
on e.item_id = cat.item_id
WHERE
cat.cat_id = 6001
ORDER BY
e.timestamp
LIMIT 25
To optimize performance, the required indexes are:
create index idx_1 on tbl_categories( cat_id, item_id)
An index on tbl_items is not mandatory, because the primary key is already indexed.
An index that contains timestamp doesn't help as much. To be sure, you can try an index on tbl_items with item_id and timestamp, to avoid accessing the table and take the values from the index:
create index idx_2 on tbl_items( item_id, timestamp)
To increase performance, you can replace your loop over categories with a single query:
select T.cat_id, T.item_id, T.item_value from
(SELECT cat.cat_id, e.item_id, e.item_value
FROM
tbl_items AS e
JOIN
tbl_categories AS cat
on e.item_id = cat.item_id
ORDER BY
e.timestamp
LIMIT 25
) T
WHERE
T.cat_id between 6001 and 6012
ORDER BY
T.cat_id, T.item_id
Please try these queries and come back with your comments to refine them if necessary.
Leaving aside all other factors, I can tell you that the main reason the query is so slow is that the result involves longtext columns.
BLOB and TEXT fields in MySQL are mostly meant to store complete files, textual or binary. They are stored separately from the row data for InnoDB tables. Each time a query involves sorting (explicitly or for a GROUP BY), MySQL is sure to use disk for the sorting (because it cannot know in advance how large any value is).
As a rule of thumb: if you need to return more than a single row of a column in a query, the field type should almost never be TEXT or BLOB; use VARCHAR or VARBINARY instead.
UPD
If you cannot alter the table, the query will hardly be fast with the current indexes and column types. But anyway, here is a similar question with a popular solution to your problem: How to SELECT the newest four items per category?
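For completeness, a sketch of that newest-N-per-category pattern applied to the schema above (correct, though not necessarily fast without supporting indexes; ties on timestamp can return extra rows):
SELECT cat.cat_id, e.item_id, e.item_value
FROM tbl_categories AS cat
JOIN tbl_items AS e ON e.item_id = cat.item_id
WHERE cat.cat_id BETWEEN 6001 AND 6012
  AND (SELECT COUNT(*)
       FROM tbl_categories AS c2
       JOIN tbl_items AS e2 ON e2.item_id = c2.item_id
       WHERE c2.cat_id = cat.cat_id
         AND e2.timestamp > e.timestamp) < 25
ORDER BY cat.cat_id, e.timestamp DESC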

Optimizing MySQL query with inner join

I've spent a lot of time optimizing this query but it's starting to slow down with larger tables. I imagine these are probably the worst types of questions but I'm looking for some guidance. I'm not really at liberty to disclose the database schema so hopefully this is enough information. Thanks,
SELECT tblA.id, tblB.id, tblC.id, tblD.id
FROM tblA, tblB, tblC, tblD
INNER JOIN (SELECT max(tblB.id) AS xid
FROM tblB
WHERE tblB.rdd = 11305
GROUP BY tblB.index_id
ORDER BY NULL) AS rddx
ON tblB.id = rddx.xid
WHERE
tblA.id = tblB.index_id
AND tblC.name = tblD.s_type
AND tblD.name = tblA.s_name
GROUP BY tblA.s_name
ORDER BY NULL;
There is a one-to-many relationship between:
tblA.id and tblB.index_id
tblC.name and tblD.s_type
tblD.name and tblA.s_name
+----+-------------+------------+--------+---------------+-----------+---------+------------------------------+-------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+---------------+-----------+---------+------------------------------+-------+------------------------------+
| 1 | PRIMARY | derived2 | ALL | NULL | NULL | NULL | NULL | 32568 | Using temporary |
| 1 | PRIMARY | tblB | eq_ref | PRIMARY | PRIMARY | 8 | rddx.xid | 1 | |
| 1 | PRIMARY | tblA | eq_ref | PRIMARY | PRIMARY | 8 | tblB.index_id | 1 | Using where |
| 1 | PRIMARY | tblD | eq_ref | PRIMARY | PRIMARY | 22 | tblA.s_name | 1 | Using where |
| 1 | PRIMARY | tblC | eq_ref | PRIMARY | PRIMARY | 22 | tblD.s_type | 1 | |
| 2 | DERIVED | tblB | ref | rdd_idx | rdd_idx | 7 | | 65722 | Using where; Using temporary |
+----+-------------+------------+--------+---------------+-----------+---------+------------------------------+-------+------------------------------+
Unless I've misunderstood the information that you've provided I believe you could re-write the above query as follows
EXPLAIN SELECT tblA.id, MAX(tblB.id), tblC.id, tblD.id
FROM tblA
LEFT JOIN tblD ON tblD.name = tblA.s_name
LEFT JOIN tblC ON tblC.name = tblD.s_type
LEFT JOIN tblB ON tblA.id = tblB.index_id
WHERE tblB.rdd = 11305
ORDER BY NULL;
Obviously I can't provide an explain for this as explain depends on the data in your database. It would be interesting to see the explain on this query.
Obviously EXPLAIN only gives you an estimate of what will happen. You can use SHOW SESSION STATUS to get details of what actually happened when you ran the query. Make sure to run FLUSH STATUS first, before the query you are investigating, so that you have clean counters to read. So in this case you would run:
FLUSH STATUS;
SELECT tblA.id, MAX(tblB.id), tblC.id, tblD.id
FROM tblA
LEFT JOIN tblD ON tblD.name = tblA.s_name
LEFT JOIN tblC ON tblC.name = tblD.s_type
LEFT JOIN tblB ON tblA.id = tblB.index_id
WHERE tblB.rdd = 11305
ORDER BY NULL;
SHOW SESSION STATUS LIKE 'ha%';
This gives you a number of indicators to show what actually happened when a query executed.
Handler_read_rnd_next - Number of requests to read next row in the data file
Handler_read_key - Number of requests to read a row based on a key
Handler_read_next - Number of requests to read the next row in key order
Using these values you can see exactly what is going on under the hood.
Unfortunately without knowing the data in the tables, engine type and the data types used in the queries it is quite hard to advise on how you could optimize.
I have updated the query using joins instead of the join conditions within the WHERE clause. Also, by looking at it, as a developer, you can directly see the relationships between the tables.
A->B, A->D and D->C. Now, on table B, where you want the highest ID based on the common "ID = Index_ID" AND RDD = 11305, it won't require a complete sub-query; however, this moves the MAX() up into the field selection clause. I would ensure you have an index on tblB on (index_id, rdd). Finally, using STRAIGHT_JOIN helps enforce that the query runs in the order the tables are listed.
-- EDIT FROM COMMENT --
It appears you are getting NULLs from tblB. This typically indicates a valid tblA record, but no tblB record with the same ID that has RDD = 11305. That said, it appears you are only concerned with the entries associated with 11305, so I'm adjusting the query accordingly. Please make sure you have an index on tblB based on the RDD column (at least in the first position, if it is a multi-column index).
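A sketch of that index (the name is hypothetical; index_id is added second so the pre-query's GROUP BY can also be served from the index):
ALTER TABLE tblB ADD INDEX idx_rdd_index (rdd, index_id);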
As you can see in this one, I'm pre-querying from table B only for the 11305 entries and pre-grouping by index_id (as linked to tblA). This gives me one record per index_id, where one exists... From THIS result, I'm joining back to A, then directly back to B again, but based on that highest matched ID, then to D and C as before. So NOW, you can get any column from any of the tables and still get the proper record in question... There should be no NULL values left in this query.
Hopefully, I've clarified HOW I'm putting the pieces together for you.
SELECT STRAIGHT_JOIN
PreQuery.HighestPerIndexID,
tblA.id,
tblA.AnotherAField,
tblA.Etc,
tblB.SomeOtherField,
tblB.AnotherField,
tblC.id,
tblD.id
FROM
( select PQ1.Index_ID,
max( PQ1.ID ) as HighestPerIndexID
from tblB PQ1
where PQ1.RDD = 11305
group by PQ1.Index_ID ) PreQuery
JOIN tblA
on PreQuery.Index_ID = tblA.ID
join tblB
on PreQuery.HighestPerIndexID = tblB.ID
join tblD
on tblA.s_Name = tblD.name
join tblC
on tblD.s_type = tblC.Name
ORDER BY
tblA.s_Name