I am going through the slow query log to try to determine why some of the queries behave erratically. For the sake of consistency, the queries are not cached and the system cache was flushed before each test run. The query goes something like this:
SELECT P.id, P.name, P.lat, P.lng, P.price * E.rate AS 'ask' FROM Property P
INNER JOIN Exchange E ON E.currency = P.currency
WHERE P.floor_area >= k?
AND P.closing_date >= CURDATE() // this and key_buffer_size=0 prevents caching
AND P.type ='c'
AND P.lat BETWEEN v? AND v?
AND P.lng BETWEEN v? AND v?
AND P.price * E.rate BETWEEN k? AND k?
ORDER BY P.floor_area DESC LIMIT 100;
The k? are user-defined constant values; the v? are variables that change as the user drags or zooms on the map. 100 results are pulled from the table and sorted by floor area in descending order.
Only a PRIMARY KEY on id and an INDEX on floor_area are set up. No other index is created, so that MySQL consistently uses floor_area as the only key. The query times and rows examined are recorded as follows:
query number             1     2     3     4     5     6     7     8     9    10
user action on map     start   >     +     +     <     ^     +     >     v     +
time in seconds         138  0.21  0.43  32.3  0.12  0.12  36.3  4.33  0.33  2.00
rows examined ('000)     43    43    43    60    43    43   111   139   133   176
The query execution plan is as follows:
+----+-------------+-------+--------+---------------+---------+---------+--------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------------+---------+-------------+
| 1 | SIMPLE | P | range | id_flA | id_flA | 3 | NULL | 4223660 | Using where |
| 1 | SIMPLE | E | eq_ref | PRIMARY | PRIMARY | 3 | BuySell.P.currency | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+--------------------+---------+-------------+
The test was performed a few times and the results are quite consistent with the above. What could be the reason(s) for the spikes in query times for queries 4 and 7, and how do I bring them down?
UPDATE:
Results of removing ORDER BY as suggested by Digital Precision:
query number             1     2     3     4     5     6     7     8     9    10
user action on map     start   >     +     +     <     ^     +     >     v     +
time in seconds         255  3.10  3.16  3.08  3.18  3.21  3.32  3.18  3.17  3.80
rows examined ('000)    131   131   131   131   136   136   136   136   136   157
The query execution plan is the same as above, though it behaves more like a table scan. Note that I am using the MyISAM engine, MySQL version 5.5.14.
As requested, below is the schema:
| Property | CREATE TABLE `Property` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`type` char(1) NOT NULL DEFAULT '',
`lat` decimal(6,4) NOT NULL DEFAULT '0.0000',
`lng` decimal(7,4) NOT NULL DEFAULT '0.0000',
`floor_area` mediumint(8) unsigned NOT NULL DEFAULT '0',
`currency` char(3) NOT NULL DEFAULT '',
`price` int(10) unsigned NOT NULL DEFAULT '0',
`closing_date` date NOT NULL DEFAULT '0000-00-00',
`name` char(25) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
KEY `id_flA` (`floor_area`)
) ENGINE=MyISAM AUTO_INCREMENT=5000000 DEFAULT CHARSET=latin1
| Exchange | CREATE TABLE `Exchange` (
`currency` char(3) NOT NULL,
`rate` decimal(11,10) NOT NULL DEFAULT '0.0000000000',
PRIMARY KEY (`currency`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
2ND UPDATE:
I thought it would be appropriate to post the non-default parameters in the my.cnf configuration file, since two of the answerers mention them:
max_heap_table_size = 1300M
key_buffer_size = 0
read_buffer_size = 1300M
read_rnd_buffer_size = 1024M
sort_buffer_size = 1300M
I have 2GB of RAM on my test server.
I think I have figured out the reason for the spikes. Here is how it goes:
First I created the tables and loaded some randomly generated data into them.
Here is my query:
SELECT SQL_NO_CACHE P.id, P.name, P.lat, P.lng, P.price * E.rate AS 'ask'
FROM Property P
INNER JOIN Exchange E ON E.currency = P.currency
WHERE P.floor_area >= 2000
AND P.closing_date >= CURDATE()
AND P.type ='c'
AND P.lat BETWEEN 12.00 AND 22.00
AND P.lng BETWEEN 10.00 AND 20.00
AND P.price BETWEEN 100 / E.rate AND 10000 / E.rate
ORDER BY P.floor_area DESC LIMIT 100;
And here is the describe :
+----+-------------+-------+-------+---------------+--------+---------+------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------+---------+------+---------+----------------------------------------------+
| 1 | SIMPLE | P | range | id_flA | id_flA | 3 | NULL | 4559537 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | E | ALL | PRIMARY | NULL | NULL | NULL | 6 | Using where; Using join buffer |
+----+-------------+-------+-------+---------------+--------+---------+------+---------+----------------------------------------------+
It took between 3.5 and 3.9 seconds every time I queried the data (it made no difference which parameters I used). That didn't make sense, so I researched "Using join buffer".
Then I wanted to try this query without the join buffer, so I inserted one more random row into the Exchange table.
INSERT INTO Exchange(currency, rate) VALUES('JJ', 1);
Now, using the same SQL, it took 0.3 to 0.5 seconds to respond. And here is the describe:
+----+-------------+-------+--------+---------------+---------+---------+-----------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+---------+-------------+
| 1 | SIMPLE | P | range | id_flA | id_flA | 3 | NULL | 4559537 | Using where |
| 1 | SIMPLE | E | eq_ref | PRIMARY | PRIMARY | 3 | test.P.currency | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+-----------------+---------+-------------+
So the problem, as far as I can see, is the optimizer trying to use the join buffer. The optimal solution would be to force the optimizer not to use the join buffer (which I couldn't find a way to do) or to change the join_buffer_size value. I solved it by adding "dummy" rows to the Exchange table (so the optimizer wouldn't use the join buffer), but it's not a proper solution, just a trick to fool MySQL.
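For reference, a hedged sketch of the knobs that control this behaviour; the optimizer_switch flag only exists in MySQL 5.6 and later, so on 5.5 the buffer size itself is the only lever I know of:
-- MySQL 5.6+: turn off the Block Nested-Loop join buffer for this session
SET SESSION optimizer_switch = 'block_nested_loop=off';
-- MySQL 5.5: shrink the buffer instead (128 bytes is the documented minimum)
SET SESSION join_buffer_size = 128;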
Edit: I researched this "join buffer" behavior in the MySQL forums and bug tracker, then asked about it in the official forums. I am going to file a bug report about this irrational optimizer behavior.
Couple of things:
Why are you calculating the product of P.price and E.rate in the SELECT and aliasing it as 'ask', then doing the calculation again in the WHERE clause? You should be able to do AND ask BETWEEN k? AND k? -- Edit: This won't work due to the way MySQL works. Apparently MySQL evaluates the WHERE clause before any aliases (sourced).
What kind of index do you have on Exchange.currency and Property.currency? If exchange is a lookup table, maybe you would be better off adding a pivot (linking) table with Property.Id and Exchange.Id
The ORDER BY floor_area forces MySQL to create a temp table in order to do the sorting correctly; any chance you can do the sorting at the app layer?
Adding an index on type column will help as well.
-- Edit
Not sure what you mean by the comment // this and key_buffer_size=0 prevents caching on the CURDATE WHERE condition; you can force no SQL caching by using the SQL_NO_CACHE flag on your SELECT statement.
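For example, a minimal sketch of where the flag goes (reusing columns from your query):
SELECT SQL_NO_CACHE P.id, P.name, P.lat, P.lng
FROM Property P
WHERE P.type = 'c'
LIMIT 10;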
What I would recommend, now that you have removed the ORDER BY, is to update your query statement as follows (I added the P alias to the columns to reduce any confusion):
WHERE P.type ='condominium'
AND P.floor_area >= k?
AND P.closing_date >= CURDATE() // No longer necessary with SQL_NO_CACHE
AND P.lat BETWEEN v? AND v?
AND P.lng BETWEEN v? AND v?
AND P.price * E.rate BETWEEN k? AND k?
Then add an index to the 'type' column and a composite index on the 'type' and 'floor_area' columns. As you stated, the type column is a low-cardinality column, but the table is large, so it should still help. And even though floor_area appears to be a high-cardinality column, the composite index will help speed up your query times.
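A sketch of those indexes (the names are just illustrative):
ALTER TABLE Property
    ADD INDEX idx_type (type),
    ADD INDEX idx_type_floor_area (type, floor_area);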
You may also want to research whether there is a penalty for using BETWEEN rather than range operators (>, <, <=, etc.).
Try an index on type and floor_area (and possibly closing_date too).
Apply the exchange rate to your constants instead of to the price column:
P.price between ( k? / E.rate ) and ( k? / E.rate )
then try an index on price.
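Taken together, that would look something like this (a sketch; index names are made up):
ALTER TABLE Property
    ADD INDEX idx_type_floor_closing (type, floor_area, closing_date),
    ADD INDEX idx_price (price);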
I've become a little obsessed with this question; the spike is hard to explain.
Here's what I did:
I re-created your schema, and populated the property table with 4.5 million records, with random values for the numerical and date columns. This almost certainly doesn't match your data - I'm guessing the lat/longs tend to cluster in population areas, the prices around multiples of 10K, and the floor space will be skewed towards lower-end values.
I ran your query with a range of values for lat, long, floorspace and price. With just the index on floor area, I saw that the query plan would ignore the index for some values of floor area. This was presumably because the query analyzer decided the number of records excluded by using the index was too small. However, in re-running the query for a variety of different scenarios, I noticed that the query plan would ignore the index every now and again - can't explain that.
It's always worth running ANALYZE TABLE when dealing with this kind of weirdness.
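For example:
ANALYZE TABLE Property, Exchange;  -- refreshes the index statistics the optimizer relies on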
I did get slightly different "explain" results: specifically, the property table select gave 'Using where; Using temporary; Using filesort'. This suggests the index is only used for the where clause, and not to order the results.
This confirms that the most likely explanation of the performance peaks is not related so much to the query engine, but to the way the temporary table is handled, and the requirement to do a filesort. In trying to reproduce this issue, I did notice that response time went up dramatically as the number of records returned from the "where" clause increased - though I didn't see the spikes you've noticed.
I've tried a variety of different indices; using all the keys in the where clause does speed up the time to retrieve the records matching the where clause, but does nothing for the subsequent order by.
This, once again, suggests it's the performance of the temporary table that's the cause of the spikes. read_rnd_buffer_size would be the obvious thing to look at.
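A sketch of how to inspect and experiment with it for a single session (the 8M figure is just an assumption to test with):
SHOW VARIABLES LIKE 'read_rnd_buffer_size';
SET SESSION read_rnd_buffer_size = 8 * 1024 * 1024;  -- 8MB, illustrative only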
I'll be the first to admit that I'm not great at SQL (and I probably shouldn't be treating it like a rolling log file), but I was wondering if I could get some pointers for improving some slow queries...
I have a large mysql table with 2M rows where I do two full table lookups based on a subset of the most recent data. When I load the page that contains these queries, I often find they take several seconds to complete, but the queries inside are quite quick.
PMA's (supposedly terrible) advisor pretty much throws the entire kitchen sink at me: temporary tables, too many sorts, joins without indexes (I don't even have any joins?), reading from a fixed position, reading the next position, temporary tables written to disk... That last one especially makes me wonder if it's a configuration issue, but I have played with all the knobs, and even paid for a managed service, which didn't seem to help.
CREATE TABLE `archive` (
`id` bigint UNSIGNED NOT NULL,
`ip` varchar(15) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`service` enum('ssh','telnet','ftp','pop3','imap','rdp','vnc','sql','http','smb','smtp','dns','sip','ldap') CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`hostid` bigint UNSIGNED NOT NULL,
`date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
ALTER TABLE `archive`
ADD PRIMARY KEY (`id`),
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-date` (`service`,`date`),
ADD KEY `service-ip` (`service`,`ip`);
Adding indexes definitely helped (even though they're 4x the size of the actual data), but I'm kind of at a loss as to where I can optimize further. Initially I thought about caching the subquery results in PHP and using them twice for the main queries, but I don't think I have access to the result once I close the subquery. I looked into doing joins, but they seem to be meant for two or more separate tables; the subquery is on the same table, so I'm not sure that would even work either. The queries are supposed to find the most active IPs/services based on whether I have data from an IP in the past 24 hours...
SELECT service, COUNT(service) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY service HAVING total > 1
ORDER BY total DESC, service ASC LIMIT 10
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | service,ip,date-service,ip-date,ip-service,service-date,service-ip | ip-service | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44246 | 100.00 | Using where; Using index |
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
SELECT ip, COUNT(ip) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY ip HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | ip,date-ip,ip-date,ip-service,service-ip | ip-date | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44168 | 100.00 | Using where; Using index |
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
common subquery: 0.0351s
whole query 1: 1.4270s
whole query 2: 1.5601s
total page load: 3.050s (7 queries total)
Am I just doomed to terrible performance with this table?
Hopefully there's enough information here to get an idea of what's going on, but if anyone can help I would certainly appreciate it. I don't mind throwing more hardware at the issue, but when an 8c/16t server with 16GB can't handle 150MB of data I'm not sure what will. Thanks in advance for reading my long-winded question.
You have the right indexes (as well as many other indexes) and your query both meets your specs and runs close to optimally. It's unlikely that you can make this much faster: it needs to look all the way back to the beginning of your table.
If you can change your spec so you only have to look back a limited amount of time, like a year, you'll get a good speedup.
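For example, if a one-year window satisfies the spec, the outer scan can be bounded as well (a sketch of the first query with that extra condition):
SELECT service, COUNT(service) AS total FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 1 YEAR)
  AND ip IN
      (SELECT DISTINCT ip FROM `archive`
       WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY service HAVING total > 1
ORDER BY total DESC, service ASC LIMIT 10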
Some possible minor tweaks.
use the latin1_bin collation for your ip column. It uses 8-bit characters and collates them without any case sensitivity. That's plenty for IPv4 dotted-quad addresses (and IPv6 addresses). You'll get rid of a bit of overhead in matching and grouping. Or, even better,
If you know you will have nothing but IPv4 addresses, rework your ip column to store their binary representations (that is, the INET_ATON()-generated value of each IPv4 address). You can fit those in the 32-bit UNSIGNED INT data type, making lookup, grouping, and ordering even faster.
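A sketch of that conversion, assuming IPv4 only (the ip_num column name is made up):
ALTER TABLE archive ADD COLUMN ip_num INT UNSIGNED NOT NULL DEFAULT 0;
UPDATE archive SET ip_num = INET_ATON(ip);
-- look rows up with the inverse function, e.g.:
SELECT INET_NTOA(ip_num) FROM archive WHERE ip_num = INET_ATON('192.0.2.1');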
It's possible to rework the way you gather these data. For example, you could arrange to gather at most one row per service per day. That will reduce the timeseries resolution of your data, but it will also make your queries much faster. Define your table something like this:
CREATE TABLE archive2 (
ip VARCHAR(15) COLLATE latin1_bin NOT NULL,
service ENUM ('ssh','telnet','ftp',
'pop3','imap','rdp',
'vnc','sql','http','smb',
'smtp','dns','sip','ldap') COLLATE latin1_bin NOT NULL,
`date` DATE NOT NULL,
`count` INT NOT NULL,
hostid bigint UNSIGNED NOT NULL,
PRIMARY KEY (`date`, ip, service)
) ENGINE=InnoDB;
Then, when you insert a row, use this query:
INSERT INTO archive2 (`date`, ip, service, `count`, hostid)
VALUES (CURDATE(), ?ip, ?service, 1, ?hostid)
ON DUPLICATE KEY UPDATE
    `count` = `count` + 1;
This will automatically increment your count column if the row for the ip, service, and date already exists.
Then your second query will look like:
SELECT ip, SUM(`count`) AS total
FROM archive2
WHERE ip IN (
    SELECT ip FROM archive2
    WHERE `date` > CURDATE() - INTERVAL 1 DAY
)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10;
The index of the primary key will satisfy this query.
First query
(I'm not convinced that it can be made much faster.)
(currently)
SELECT service, COUNT(service) AS total
FROM `archive`
WHERE ip IN (
SELECT DISTINCT ip
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
Notes:
COUNT(service) --> COUNT(*)
DISTINCT is not needed in IN (SELECT DISTINCT ...)
IN ( SELECT ... ) is often slow; rewrite using EXISTS ( SELECT 1 ... ) or JOIN (see below)
INDEX(date, IP) -- for subquery
INDEX(service, IP) -- for your outer query
INDEX(IP, service) -- for my outer query
Toss redundant indexes; they can get in the way. (See below)
It will have to gather all the possible results before getting to the ORDER BY and LIMIT. (That is, LIMIT has very little impact on performance for this query.)
CHARACTER SET utf8 COLLATE utf8_unicode_ci is gross overkill for IP addresses; switch to CHARACTER SET ascii COLLATE ascii_bin.
If you are running MySQL 8.0 (Or MariaDB 10.2), a WITH to calculate the subquery once, together with a UNION to compute the two outer queries, may provide some extra speed.
MariaDB has a "subquery cache" that might have the effect of skipping the second subquery evaluation.
By using DATETIME instead of TIMESTAMP, you will have two minor hiccups per year when daylight savings kicks in/out.
I doubt if hostid needs to be a BIGINT (8-bytes).
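A sketch combining two of the notes above (the ascii charset and the smaller hostid):
ALTER TABLE archive
    MODIFY ip VARCHAR(15) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
    MODIFY hostid INT UNSIGNED NOT NULL;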
To switch to a JOIN, think of fetching the candidate rows first:
SELECT service, COUNT(*) AS total
FROM ( SELECT DISTINCT IP
FROM archive
WHERE `date` > NOW() - INTERVAL 24 HOUR
) AS x
JOIN archive USING(IP)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
For any further discussion of any slow (but working) query, please provide both flavors of EXPLAIN:
EXPLAIN SELECT ...
EXPLAIN FORMAT=JSON SELECT ...
Drop these indexes:
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
Recommend only
ADD PRIMARY KEY (`id`),
-- as discussed:
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-ip` (`service`,`ip`),
-- maybe other queries need these:
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `service-date` (`service`,`date`),
The general rule here is that you don't need INDEX(a) when you also have INDEX(a,b). In particular, they may be preventing the use of better indexes; see the EXPLAINs.
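The statement for the drops recommended above would be something like:
ALTER TABLE `archive`
    DROP INDEX `service`,
    DROP INDEX `date`,
    DROP INDEX `ip`;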
Second query
Rewrite it
SELECT ip, COUNT(*) AS total
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC
LIMIT 10
It will use only INDEX(date, ip).
I've a query that takes about 18 seconds to finish:
THE QUERY:
SELECT YEAR(c.date), MONTH(c.date), p.district_id, COUNT(p.owner_id)
FROM commission c
INNER JOIN partner p ON c.customer_id = p.id
WHERE (c.date BETWEEN '2018-01-01' AND '2018-12-31')
AND (c.company_id = 90)
AND (c.source = 'ACTUAL')
AND (p.id IN (3062, 3063, 3064, 3065, 3066, 3067, 3068, 3069, 3070, 3071,
3072, 3073, 3074, 3075, 3076, 3077, 3078, 3079, 3081, 3082, 3083, 3084,
3085, 3086, 3087, 3088, 3089, 3090, 3091, 3092, 3093, 3094, 3095, 3096,
3097, 3098, 3099, 3448, 3449, 3450, 3451, 3452, 3453, 3454, 3455, 3456,
3457, 3458, 3459, 3460, 3461, 3471, 3490, 3491, 6307, 6368, 6421))
GROUP BY YEAR(c.date), MONTH(c.date), p.district_id
The commission table has around 2.8 million records, of which 860,000+ belong to the current year, 2018. The partner table currently has 8,600+ records.
RESULT
| `YEAR(c.date)` | `MONTH(c.date)` | district_id | `COUNT(c.id)` |
|----------------|-----------------|-------------|---------------|
| 2018 | 1 | 1 | 19154 |
| 2018 | 1 | 5 | 9184 |
| 2018 | 1 | 6 | 2706 |
| 2018 | 1 | 12 | 36296 |
| 2018 | 1 | 15 | 13085 |
| 2018 | 2 | 1 | 21231 |
| 2018 | 2 | 5 | 10242 |
| ... | ... | ... | ... |
55 rows retrieved starting from 1 in 18 s 374 ms
(execution: 18 s 368 ms, fetching: 6 ms)
EXPLAIN:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | extra |
|----|-------------|-------|------------|-------|------------------------------------------------------------------------------------------------------|----------------------|---------|-----------------|------|----------|----------------------------------------------|
| 1 | SIMPLE | p | null | range | PRIMARY | PRIMARY | 4 | | 57 | 100 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | c | null | ref | UNIQ_6F7146F0979B1AD62FC0CB0F5F8A7F73,IDX_6F7146F09395C3F3,IDX_6F7146F0979B1AD6,IDX_6F7146F0AA9E377A | IDX_6F7146F09395C3F3 | 5 | p.id | 6716 | 8.33 | Using where |
DDL:
create table if not exists commission (
id int auto_increment
primary key,
date date not null,
source enum('ACTUAL', 'EXPECTED') not null,
customer_id int null,
transaction_id varchar(255) not null,
company_id int null,
constraint UNIQ_6F7146F0979B1AD62FC0CB0F5F8A7F73 unique (company_id, transaction_id, source),
constraint FK_6F7146F09395C3F3 foreign key (customer_id) references partner (id),
constraint FK_6F7146F0979B1AD6 foreign key (company_id) references companies (id)
) collate=utf8_unicode_ci;
create index IDX_6F7146F09395C3F3 on commission (customer_id);
create index IDX_6F7146F0979B1AD6 on commission (company_id);
create index IDX_6F7146F0AA9E377A on commission (date);
I noted that if I remove the partner IN condition, MySQL takes only 3 s. I tried to replace it by doing something crazy like this:
AND (',3062,3063,3064,3065,3066,3067,3068,3069,3070,3071,3072,3073,3074,3075,3076,3077,3078,3079,3081,3082,3083,3084,3085,3086,3087,3088,3089,3090,3091,3092,3093,3094,3095,3096,3097,3098,3099,3448,3449,3450,3451,3452,3453,3454,3455,3456,3457,3458,3459,3460,3461,3471,3490,3491,6307,6368,6421,'
LIKE CONCAT('%,', p.id, ',%'))
and the result was about 5 s... great! But it's a hack.
WHY does this query take such a long execution time when I use the IN clause? Workarounds, tips, links, etc. are welcome. Thanks!
MySQL generally uses only one index per table in a query, so for this query you need a compound index covering all aspects of the search. Constant ("=") aspects of the WHERE clause should come before range aspects, like:
ALTER TABLE commission
DROP INDEX IDX_6F7146F0979B1AD6,
ADD INDEX IDX_6F7146F0979B1AD6 (company_id, source, date)
Here's what the Optimizer sees in your query.
Checking whether to use an index for the GROUP BY:
Functions (YEAR()) in the GROUP BY, so no.
Multiple tables (c and p) mentioned, so no.
For a JOIN, Optimizer will (almost always) start with one, then reach into the other. So, let's look at the two options:
If starting with p:
Assuming you have PRIMARY KEY(id), there is not much to think about. It will simply use that index.
For each row selected from p, it will then look into c, and any variation of this INDEX would be optimal.
c: INDEX(company_id, source, customer_id, -- in any order (all are tested "=")
date) -- last, since it is tested as a range
If starting with c:
c: INDEX(company_id, source, -- in any order (all are tested "=")
date) -- last, since it is tested as a range
-- slightly better:
c: INDEX(company_id, source, -- in any order (all are tested "=")
date, -- last, since it is tested as a range
customer_id) -- really last -- added only to make it "covering".
The Optimizer will look at "statistics" to crudely decide which table to start with. So, add all the indexes I suggested.
A "covering" index is one that contains all the columns needed anywhere in the query. It is sometimes wise to extend a 'good' index with more columns to make it "covering".
But there is a monkey wrench in here. c.customer_id = p.id means that customer_id IN (...) effectively exists. But now there are two "range-like" constraints -- one is an IN, the other is a 'range'. In some newer versions, the Optimizer will happily jump around due to the IN and still be able to do "range" scans. So, I recommend this ordering:
Test(s) of column = constant
Test(s) with IN
One 'range' test (BETWEEN, >=, LIKE with trailing wildcard, etc)
Perhaps add more columns to make it "covering" -- but don't do this step if you end up with more than, say, 5 columns in the index.
Hence, for c, the following is optimal for the WHERE, and happens to be "covering".
INDEX(company_id, source, -- first, but in any order (all "=")
customer_id, -- "IN"
date) -- last, since it is tested as a range
p: (same as above)
Since there was an IN or "range", there is no use seeing if the index can also handle the GROUP BY.
A note on COUNT(x) -- it checks that x is NOT NULL. It is usually just as correct to say COUNT(*), which counts the number of rows without any extra checking.
This is a non-starter since it hides the indexed column (id) in a function:
AND (',3062,3063,3064,3065,3066,...6368,6421,'
LIKE CONCAT('%,', p.id, ',%'))
With your LIKE hack you are tricking the optimizer into using a different plan (most probably using the IDX_6F7146F0AA9E377A index in the first place).
You should be able to see this in the EXPLAIN output.
I think the real issue in your case is the second line of the EXPLAIN: the server executes multiple functions (MONTH, YEAR) for 6716 rows and then tries to group all these rows. During this time all 6716 rows have to be stored (in memory or on disk, depending on your server configuration).
SELECT COUNT(*) FROM commission WHERE (date BETWEEN '2018-01-01' AND '2018-12-31') AND company_id = 90 AND source = 'ACTUAL';
=> How many rows are we talking about?
If the number from the above query is much lower than 6716, I'd try to add a covering index on the columns customer_id, company_id, source and date. I'm not sure about the best order, as it depends on the data you have (check the cardinality of these columns); I'd start with the index (date, company_id, source, customer_id). Also, I'd add a unique index (id, district_id, owner_id) on partner.
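A sketch of those two suggestions (index names are made up; verify the column order against your own cardinalities first):
ALTER TABLE commission
    ADD INDEX idx_commission_cover (`date`, company_id, source, customer_id);
ALTER TABLE partner
    ADD UNIQUE INDEX uq_partner_cover (id, district_id, owner_id);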
It is also possible to add generated stored columns _year and _month (if your server is a bit old, you can add normal columns and fill them in with a trigger) to get rid of the repeated function executions.
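A sketch of the generated-column idea (MySQL 5.7+ syntax):
ALTER TABLE commission
    ADD COLUMN _year  SMALLINT AS (YEAR(`date`))  STORED,
    ADD COLUMN _month TINYINT  AS (MONTH(`date`)) STORED;
-- the GROUP BY can then reference _year and _month directly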
I have a sql query
SELECT level, data, string, soln, uid
FROM user_data
WHERE level = 10 AND (timetaken >= 151 AND timetaken <= 217) AND uid != 1
LIMIT 8852, 1;
which fetches from a table with 1.5 million entries.
I have indexed using
alter table user_data add index a_idx (level, timetaken, uid);
The issue I am facing is that the query takes more than 30 seconds in some cases and less than 0.01 seconds in others.
Is there any issue with the indexing here?
Edit:
Added the explain query details
+----+-------------+------------------+-------+---------------+------------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+------------+---------+------+-------+--------------------------+
| 1 | SIMPLE | user_data | range | a_idx | a_idx | 30 | NULL | 24091 | Using where; Using index |
+----+-------------+------------------+-------+---------------+------------+---------+------+-------+--------------------------+
The data field in the table is a text field. Its length is greater than 255 characters in most cases. Does this cause an issue?
First of all you should try getting the execution plan of this query with EXPLAIN:
EXPLAIN SELECT level, data, string, soln, uid
FROM user_data
WHERE level = 10 AND (timetaken >= 151 AND timetaken <= 217) AND uid != 1
LIMIT 8852, 1;
This is a great slide to follow through on this topic:
http://www.slideshare.net/phpcodemonkey/mysql-explain-explained
Try adding different indexes:
one on uid and level
a separate one on timetaken
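For example (index names are made up):
ALTER TABLE user_data
    ADD INDEX idx_uid_level (uid, level),
    ADD INDEX idx_timetaken (timetaken);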
The problem is in the high offset. In order to select the 8853rd result, MySQL has to scan all 8852 rows before this.
Btw, using limit without order by may lead to unexpected results.
In order to speed up queries with a high offset, you should move to a since..until pagination strategy.
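A sketch of that strategy, assuming id is the auto-increment primary key and that the application remembers the last id it has already shown (the :last_seen_id placeholder is illustrative):
SELECT level, data, string, soln, uid
FROM user_data
WHERE level = 10 AND timetaken BETWEEN 151 AND 217 AND uid != 1
  AND id > :last_seen_id   -- value remembered from the previous page
ORDER BY id
LIMIT 1;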
I have a generated query:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from MessageDetails messagedet0_, MessageEntry messageent1_, MailSourceFile mailsource2_
where messagedet0_.id=messageent1_.messageDetails_id
and messageent1_.mailSourceFile_id=mailsource2_.id
order by mailsource2_.file, messageent1_.mboxOffset;
Explain says that there is no full scans and indexes are used:
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys |key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | mailsource2_ | index | PRIMARY |file | 384 | NULL | 1445 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | messageent1_ | ref | msf_idx,md_idx,FKBBB258CB60B94D38,FKBBB258CBF7C835B8 |msf_idx | 9 | skryb.mailsource2_.id | 2721 | Using where |
| 1 | SIMPLE | messagedet0_ | eq_ref | PRIMARY |PRIMARY | 8 | skryb.messageent1_.messageDetails_id | 1 | |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
CREATE TABLE `mailsourcefile` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`file` varchar(127) COLLATE utf8_bin DEFAULT NULL,
`size` bigint(20) DEFAULT NULL,
`archive_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `file` (`file`),
KEY `File_idx` (`file`),
KEY `Archive_idx` (`archive_id`),
KEY `FK7C3F816ECDB9F63C` (`archive_id`),
CONSTRAINT `FK7C3F816ECDB9F63C` FOREIGN KEY (`archive_id`) REFERENCES `archive` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1370 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
Also, I have indexes on file and mboxOffset. SHOW FULL PROCESSLIST says that MySQL is sorting the result, and it takes more than a few minutes. The result set size is 5M records. How can I optimize this?
I don't think there is much optimization to do in the query itself. Explicit JOINs would make it more readable, but IIRC MySQL nowadays is perfectly able to detect this kind of construct and plan the joins itself.
What would probably help is to increase both tmp_table_size and max_heap_table_size to allow the result set of this query to remain in memory, rather than having to write it to disk.
The maximum size for in-memory temporary tables is the minimum of the tmp_table_size and max_heap_table_size values
http://dev.mysql.com/doc/refman/5.5/en/internal-temporary-tables.html
The "using temporary" in the explain denotes that it is using a temporary table (see the link above again) - which will probably be written to disk due to the large amount of data (again, see the link above for more on this).
The file column alone is anywhere between 1 and 384 bytes, so let's take half of that for our estimate and ignore the rest of the columns; that leads to 192 bytes per row in the result set.
1445 * 2721 = 3,931,845 rows
* 192 = 754,914,240 bytes
/ 1024 ~= 737,221 KB
/ 1024 ~= 720 MB
This is certainly more than the max_heap_table_size (16,777,216 bytes) and most likely more than the tmp_table_size.
Not having to write such a result to disk will most certainly increase speed.
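A sketch of the session-level change (the 1G figure is purely illustrative; size it to the RAM you actually have):
SET SESSION tmp_table_size      = 1024 * 1024 * 1024;
SET SESSION max_heap_table_size = 1024 * 1024 * 1024;
-- then re-run the SELECT in the same session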
Good luck!
Optimization is always tricky. In order to make a dent in your execution time, I think you probably need to do some sort of pre-cooking.
If the file names are similar (e.g. /path/to/file/1, /path/to/file/2), sorting them will mean a lot of byte comparisons, probably compounded by the Unicode encoding. I would calculate a hash of the filename on insertion (e.g. MD5()) and then sort using that.
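A sketch of that idea (column and index names are made up; note the result is then ordered by hash value rather than alphabetically by file name):
ALTER TABLE MailSourceFile
    ADD COLUMN file_hash BINARY(16) NULL,
    ADD INDEX idx_file_hash (file_hash);
UPDATE MailSourceFile SET file_hash = UNHEX(MD5(file));
-- then ORDER BY the hash column instead of the file column in the big query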
If the files are already well distributed (e.g. postfix spool names), you probably need to come up with some scheme on insertion whereby either:
simply reading records from some joined table will automatically generate them in correct order; this may not save a lot of time, but it will give you some data quickly so you can start processing, or
find a way to provide a "window" on the data so that not all of it needs to be processed at once.
As @raheel shan said above, you may want to try some JOINs:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from
MessageDetails messagedet0_
inner join
MessageEntry messageent1_
on
messagedet0_.id = messageent1_.messageDetails_id
inner join
MailSourceFile mailsource2_
on
messageent1_.mailSourceFile_id = mailsource2_.id
order by
mailsource2_.file,
messageent1_.mboxOffset
My apologies if the keys are off, but I think I've conveyed the point.
Write the query with joins, like:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from
MessageDetails m0
inner join
MessageEntry m1
on
m0.id = m1.messageDetails_id
inner join
MailSourceFile m2
on
m1.mailSourceFile_id = m2.id
order by
    m2.file,
    m1.mboxOffset;
On seeing your EXPLAIN I found three things which, in my opinion, are not good:
1. filesort in the Extra column
2. index in the type column
3. the key length, which is 384
If you reduce the key length you may get quicker retrieval; for that, consider the character set you use and partial (prefix) indexes.
Here you can FORCE INDEX for the ORDER BY and USE INDEX for the join (create appropriate multi-column indexes and assign them). Remember it is always good to order by columns present in the same table.
The index access type means it is scanning the entire index, which is not good.
I'm hitting some quite major performance issues due to the use of ORDER BY statements in my SQL code.
Everything is fine as long as I'm not using ORDER BY in the SQL. However, once I introduce ORDER BY clauses, everything slows down dramatically due to the lack of correct indexing. One would assume that fixing this would be trivial but, judging from forum discussions etc., this seems to be a rather common issue to which I've yet to see a definitive and concise answer.
Question: Given the following table ...
CREATE TABLE values_table (
id int(11) NOT NULL auto_increment,
...
value1 int(10) unsigned NOT NULL default '0',
value2 int(11) NOT NULL default '0',
PRIMARY KEY (id),
KEY value1 (value1),
KEY value2 (value2),
) ENGINE=MyISAM AUTO_INCREMENT=2364641 DEFAULT CHARSET=utf8;
... how do I create indexes that will be used when querying the table for a value1-range while sorting on the value of value2?
Currently, the fetching is OK when NOT using the ORDER BY clause.
See the following EXPLAIN QUERY output:
OK, when NOT using ORDER BY:
EXPLAIN select ... from values_table this_ where this_.value1 between 12345678 and 12349999 limit 10;
+----+-------------+-------+-------+---------------+----------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+------+------+-------------+
| 1 | SIMPLE | this_ | range | value1 | value1 | 4 | NULL | 3303 | Using where |
+----+-------------+-------+-------+---------------+----------+---------+------+------+-------------+
However, when using ORDER BY I get "Using filesort":
EXPLAIN select ... from values_table this_ where this_.value1 between 12345678 and 12349999 order by this_.value2 asc limit 10;
+----+-------------+-------+-------+---------------+----------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+----------+---------+------+------+-----------------------------+
| 1 | SIMPLE | this_ | range | value1 | value1 | 4 | NULL | 3303 | Using where; Using filesort |
+----+-------------+-------+-------+---------------+----------+---------+------+------+-----------------------------+
Some additional information about the table content:
SELECT MIN(value1), MAX(value1) FROM values_table;
+---------------+---------------+
| MIN(value1) | MAX(value1) |
+---------------+---------------+
| 0 | 4294967295 |
+---------------+---------------+
...
SELECT MIN(value2), MAX(value2) FROM values_table;
+---------------+---------------+
| MIN(value2) | MAX(value2) |
+---------------+---------------+
| 1 | 953359 |
+---------------+---------------+
Please let me know if any further information is needed to answer the question.
Thanks a lot in advance!
Update #1: Adding a new composite index (ALTER TABLE values_table ADD INDEX (value1, value2);) does not solve the problem. You'll still get "Using filesort" after adding such an index.
Update #2: A constraint that I did not mention in my question is that I'd rather change the structure of the table (say adding indexes, etc.) than changing the SQL queries used. The SQL queries are auto-generated using Hibernate, so consider those more or less fixed.
You cannot use an index in this case, as you use a RANGE filtering condition.
If you'd use something like:
SELECT *
FROM values_table this_
WHERE this_.value1 = #value
ORDER BY
value2
LIMIT 10
, then creating a composite index on (VALUE1, VALUE2) would be used both for filtering and for ordering.
But you use a range condition, which is why you'll need to perform the ordering anyway.
Your composite index will look like this:
value1 value2
----- ------
1 10
1 20
1 30
1 40
1 50
1 60
2 10
2 20
2 30
3 10
3 20
3 30
3 40
, and if you select 1 and 2 in value1, you still don't get a whole sorted set of value2.
If your index on value2 is not very selective (i.e. there are not many DISTINCT value2 values in the table), you could try:
CREATE INDEX ix_table_value2_value1 ON mytable (value2, value1)
/* Note the order, it's important */
SELECT *
FROM (
SELECT DISTINCT value2
FROM mytable
ORDER BY
value2
) q,
mytable m
WHERE m.value2 >= q.value2
AND m.value2 <= q.value2
AND m.value1 BETWEEN 13123123 AND 123123123
This is called a SKIP SCAN access method. MySQL does not support it directly, but it can be emulated like this.
The RANGE access will be used in this case, but you probably won't get any performance benefit unless the DISTINCT value2 values comprise less than about 1% of the rows.
Note usage of:
m.value2 >= q.value2
AND m.value2 <= q.value2
instead of
m.value2 = q.value2
This makes MySQL perform RANGE checking on each loop.
It appears to me that you have two totally independent keys, one for value1 and one for value2.
So when you use the value1 key to retrieve, the records aren't necessarily returned in order of value2, so they have to be sorted. This is still better than a full table scan since you're only sorting the records that satisfy your "where value1" clause.
I think (if this is possible in MySQL), a composite key on (value1,value2) would solve this.
Try:
CREATE TABLE values_table (
id int(11) NOT NULL auto_increment,
...
value1 int(10) unsigned NOT NULL default '0',
value2 int(11) NOT NULL default '0',
PRIMARY KEY (id),
KEY value1 (value1),
KEY value1and2 (value1,value2),
) ENGINE=MyISAM AUTO_INCREMENT=2364641 DEFAULT CHARSET=utf8;
(or the equivalent ALTER TABLE), assuming that's the correct syntax in MySQL for a composite key.
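The ALTER TABLE form would be something like:
ALTER TABLE values_table ADD INDEX value1and2 (value1, value2);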
In all databases I know (and I have to admit MySQL isn't one of them), that would cause the DB engine to select the value1and2 key for retrieving the rows and they would already be sorted in value2-within-value1 order, so wouldn't need a file sort.
You can still keep the value2 key if you need it.