I have two MySQL queries which I use to help me arrange news on my website. All the fields indicated here are numeric (int or tinyint) and all have an index on them. Can somebody help me build the multi-column indexes that will speed up these two queries, please?
SELECT MAX(content_time) AS content_time
FROM cm_data
WHERE content_time < UNIX_TIMESTAMP()
AND page = '1'
AND (content_type = '1' OR content_type = '5')
SELECT content_id
FROM cm_data
WHERE (content_time < UNIX_TIMESTAMP()
OR (hp_time >= UNIX_TIMESTAMP() AND content_time < UNIX_TIMESTAMP()
)
)
AND page = '1'
AND (content_type = '1' OR content_type = '5')
ORDER BY hp_time DESC
, content_time DESC
LIMIT 20
And here is the DB schema:
CREATE TABLE IF NOT EXISTS `cm_data` (
`content_id` mediumint(8) NOT NULL AUTO_INCREMENT,
`content_time` int(11) DEFAULT NULL,
`hp_time` int(11) NOT NULL DEFAULT '0',
`content_type` tinyint(2) DEFAULT NULL,
`page` tinyint(2) NOT NULL DEFAULT '1',
PRIMARY KEY (`content_id`),
KEY `content_time` (`content_time`),
KEY `content_type` (`content_type`),
KEY `page` (`page`)
);
Indexes aren't always the answer (especially where many INSERT statements are to be expected). That being said, there are a couple of things you can do to optimize these queries:
1) Make sure your constraints are properly formed. In the second query you essentially have Case1 OR (Case2 AND Case1), which boolean algebra reduces to just Case1. Making sure conditions are as reduced as possible before the query optimizer looks at them not only cuts the effort required by the optimizer, but also catches cases it can't resolve on its own (see the sketch after this list).
2) Check the order of your constraints. The documentation doesn't state that the query optimizer checks the constraints in any specific order for non-unique keys, so a little knowledge of optimization can potentially speed up queries. Generally speaking, evaluating non-integer constraints (STRINGs, etc.) is costly. In your case, UNIX_TIMESTAMP() returns an UNSIGNED INT, whereas the field is declared as a signed INT; this forces a cast operation for each row, which is costly. So, if you can reduce the number of times those comparisons must be done (by filtering out rows with simpler constraints first), fewer operations must be performed, leading to a shorter execution time.
2a) MySQL doesn't support indexes on functions, so you could either change the datatype of the field to UNSIGNED INT (preferred, if available), or create an additional column, kept up to date by a trigger, which contains the UNSIGNED INT equivalent of the field, and index that instead. The latter isn't ideal, but could offer some performance increase at the expense of increased table size.
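For illustration, here is a sketch of both points applied; the index name and the IN rewrite are assumptions, not tested against your data:
-- Point 1: the second query's WHERE clause reduced to Case1 only
SELECT content_id
FROM cm_data
WHERE content_time < UNIX_TIMESTAMP()
  AND page = '1'
  AND content_type IN (1, 5)
ORDER BY hp_time DESC, content_time DESC
LIMIT 20;
-- Points 2/2a: avoid the per-row cast, then add a composite index
-- (equality columns first, the range column last)
ALTER TABLE cm_data MODIFY content_time INT UNSIGNED DEFAULT NULL;
ALTER TABLE cm_data ADD INDEX idx_page_type_time (page, content_type, content_time);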
Related
I have an SQL query as follows:
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
order by reported_at desc
limit 1;
This query at the moment takes 313.24 secs to run.
If I remove the ORDER BY so the query is
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
then it only takes 0.117 secs to run.
The reported_at column is indexed.
So 2 questions: firstly, why is it taking so long with this ORDER BY clause, and secondly, how can I speed it up?
EDIT: In response to the questions below, here is the output when using EXPLAIN:
id: 1 | select_type: SIMPLE | table: incidents | type: index | possible_keys: uniqueReportIndex,idx_incidents_remote_ip | key: incidentsReportedAt | key_len: 4 | ref: NULL | rows: 1044 | filtered: 100.00 | Extra: Using where
And the table create statement:
CREATE TABLE `incidents` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`incident_ip_id` int(10) unsigned DEFAULT NULL,
`remote_id` bigint(20) DEFAULT NULL,
`remote_ip` char(32) NOT NULL,
`is_infringement` tinyint(1) NOT NULL DEFAULT '0',
`messageBody` text,
`reported_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'Formerly : created_datetime',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
UNIQUE KEY `uniqueReportIndex` (`remote_ip`,`host_id_1`,`licence_feature`,`app_end`),
UNIQUE KEY `uniqueRemoteIncidentId` (`remote_id`),
KEY `incident_ip_id` (`incident_ip_id`),
KEY `id` (`id`),
KEY `incidentsReportedAt` (`reported_at`),
KEY `idx_incidents_remote_ip` (`remote_ip`)
)
Note: I have omitted some non-relevant fields, so there are more indexes than fields, but you can safely assume that the columns for all the indexes are in the table.
The output of EXPLAIN reveals that, because of the ORDER BY clause, MySQL decides to use the incidentsReportedAt index. It reads each row from the table data in the order provided by the index and checks the WHERE conditions on it. This requires reading a lot of information from the table data, information that is scattered through the entire table. Not a good workflow.
Update
The OP created an index on columns reported_at and remote_ip (as suggested in the original answer, see below) and the execution time went down from 313 to 133 seconds. An improvement, but not enough. I think the cause of the still-large execution time is the access to table data for each row to verify the is_infringement = 1 part of the WHERE clause, but even adding that column to the index won't help very much.
The OP says in a comment:
Ok after further research and changing the index to be the other way round (remote_ip, reported_at) the query is now super fast (0.083 sec).
This index is better, indeed, because the remote_ip = '192.168.1.1' condition filters out a lot of rows. The same effect can be achieved using the existing uniqueReportIndex index. It is possible that the original index on reported_at fooled MySQL into thinking it is better to use it to check the rows in the order required by ORDER BY instead of filtering first and sorting at the end.
I think MySQL uses the new index on (remote_ip, reported_at) for filtering (WHERE remote_ip = '192.168.1.1') and for sorting (ORDER BY reported_at DESC). The WHERE condition provides a small list of candidate rows that are easily identified and also sorted using this index.
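For reference, that index would look something like this (the index name is an assumption):
ALTER TABLE incidents ADD INDEX idx_remote_ip_reported_at (remote_ip, reported_at);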
The original answer follows.
The advice it provides is not correct but it helped the OP find the correct solution.
Create an index on columns reported_at and remote_ip, in this order,
then see what EXPLAIN says and how the query performs. It should work faster.
You can even create the new index on columns reported_at, remote_ip and is_infringement (the order of columns in the index is very important).
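For example (a sketch; the index names are assumptions):
CREATE INDEX idx_reported_at_remote_ip ON incidents (reported_at, remote_ip);
-- or, with the third column:
CREATE INDEX idx_reported_at_remote_ip_infr ON incidents (reported_at, remote_ip, is_infringement);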
The index on three columns helps MySQL identify the rows without the need to read the table data (because all the columns from WHERE and ORDER BY clauses are in the index). It needs to read the table data only for the rows it returns because of SELECT *.
After you create the new index (either on two or three columns), remove the old index incidentsReportedAt. It is not needed any more; it uses disk and memory space and takes time to be updated but it is not used. The new index (that has the reported_at column on the first position) will be used instead.
The index on two columns requires more reads of the table data for the is_infringement = 1 condition. The query probably runs a little slower than with the three-column index. On the other hand, there is a small gain on table updates and in disk and memory space usage.
The decision to index on two or three columns depends on how often the query posted in the question runs and what it serves (visitors, admins, cron jobs etc).
I have a table for storing stats. Currently this is populated with about 10 million rows a day; at the end of the day they are copied to a daily stats table and deleted. For this reason I can't have an auto-incrementing primary key.
This is the table structure:
CREATE TABLE `stats` (
`shop_id` int(11) NOT NULL,
`title` varchar(255) CHARACTER SET latin1 NOT NULL,
`created` datetime NOT NULL,
`mobile` tinyint(1) NOT NULL DEFAULT '0',
`click` tinyint(1) NOT NULL DEFAULT '0',
`conversion` tinyint(1) NOT NULL DEFAULT '0',
`ip` varchar(20) CHARACTER SET latin1 NOT NULL,
KEY `shop_id` (`shop_id`,`created`,`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
I have a key on shop_id, created, ip but I'm not sure which columns I should use to create the optimal index to increase lookup speed further.
The query below takes about 12 seconds with no key and about 1.5 seconds using the index above:
SELECT DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane')) AS `date`, COUNT(*) AS `views`
FROM `stats`
WHERE `created` <= '2017-07-18 09:59:59'
AND `shop_id` = '17515021'
AND `click` != 1
AND `conversion` != 1
GROUP BY DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane'))
ORDER BY DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane'));
If there is no column (or combination of columns) that is guaranteed unique, then do have an AUTO_INCREMENT id. Don't worry about truncating/deleting. (However, if the id does not reset, you probably need to use BIGINT, not INT UNSIGNED to avoid overflow.)
Don't use id as the primary key; instead, use PRIMARY KEY(shop_id, created, id), INDEX(id).
That unconventional PK will help with performance in 2 ways, while being unique (due to the addition of id). The INDEX(id) is to keep AUTO_INCREMENT happy. (Whether you DELETE hourly or daily is a separate issue.)
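A sketch of that arrangement, assuming the table can be altered in place:
ALTER TABLE stats
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (shop_id, created, id),
  ADD INDEX (id);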
Build a Summary table based on each hour (or minute). It will contain the count for each such period -- roughly 400K rows/hour or 7K/minute at your volume. Augment it each hour (or minute) so that you don't have to do all the work at the end of the day.
The summary table can also filter on click and/or conversion. Or it could keep both, if you need them.
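A minimal sketch of such a summary table, maintained hourly; the table and column names are assumptions:
CREATE TABLE stats_hourly (
  shop_id INT NOT NULL,
  hr DATETIME NOT NULL,          -- the hour, truncated, in UTC
  views INT UNSIGNED NOT NULL,   -- rows with click = 0 AND conversion = 0
  PRIMARY KEY (shop_id, hr)
);
-- run once per hour, covering the previous hour
INSERT INTO stats_hourly (shop_id, hr, views)
SELECT shop_id,
       DATE_FORMAT(created, '%Y-%m-%d %H:00:00'),
       COUNT(*)
FROM stats
WHERE created >= DATE_FORMAT(NOW() - INTERVAL 1 HOUR, '%Y-%m-%d %H:00:00')
  AND created <  DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')
  AND click = 0
  AND conversion = 0
GROUP BY shop_id, DATE_FORMAT(created, '%Y-%m-%d %H:00:00')
ON DUPLICATE KEY UPDATE views = views + VALUES(views);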
If click/conversion have only two states (0 & 1), don't say != 1, say = 0; the optimizer is much better at = than at !=.
If they are 2-state and you changed to =, then this becomes viable and much better: INDEX(shop_id, click, conversion, created) -- created must be last.
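A sketch of that index together with the rewritten query:
ALTER TABLE stats ADD INDEX idx_shop_click_conv_created (shop_id, click, conversion, created);
SELECT DATE(CONVERT_TZ(created, 'UTC', 'Australia/Brisbane')) AS `date`, COUNT(*) AS views
FROM stats
WHERE shop_id = 17515021
  AND click = 0
  AND conversion = 0
  AND created <= '2017-07-18 09:59:59'
GROUP BY `date`
ORDER BY `date`;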
Don't bother with TZ when summarizing into the Summary table; apply the conversion later.
Better yet, don't use DATETIME, use TIMESTAMP so that you won't need to convert (assuming you have TZ set correctly).
After all that, if you still have issues, start over on the Question; there may be further tweaks.
In your WHERE clause, put first the column that will return the smallest set of results, and so on, then create the index in the same order.
You have
WHERE created <= '2017-07-18 09:59:59'
AND shop_id = '17515021'
AND click != 1
AND conversion != 1
If created returns the smallest set compared to the other 3 columns, then you are good; otherwise, put the most selective column in the first position of your WHERE clause, pick the second column by the same reasoning, and create the index to match your WHERE clause.
If you think the order is fine, then create the index:
KEY created_shopid_click_conversion (created, shop_id, click, conversion);
Ok, I have the following MySQL table structure:
CREATE TABLE `creditlog` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`memberId` int(10) unsigned NOT NULL,
`quantity` decimal(10,2) unsigned DEFAULT NULL,
`timeAdded` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`reference` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `memberId` (`memberId`),
KEY `timeAdded` (`timeAdded`));
And I'm querying it like this:
SELECT SUM(quantity) FROM creditlog where timeAdded>'2016-09-01' AND timeAdded<'2016-10-01' AND memberId IN (3,6,8,9,11)
Now, I also use USE INDEX (timeAdded) because, due to the number of entries, it is more convenient. EXPLAINing the above query shows:
type -> range,
key -> timeAdded,
rows -> 921294
extra -> using where
Meanwhile if I use the memberId INDEX it shows:
type -> range,
key -> memberId,
rows -> 1707849
extra -> using where
Now, my question: is it possible to combine these 2 indexes somehow so they are used together, reducing the surface of the query, since I'll also need to add more conditions (on other columns)?
MySQL almost never uses two indexes in a single query; it is just not cost effective. However, composite indexes are often very efficient. You need this order: INDEX(memberId, timeAdded).
Build the index this way...
First include column(s) that are in the WHERE clause tested with =. (None, in your case.)
Any column(s) with IN.
One 'range', such as <, BETWEEN, etc.
Move on to all the fields of the GROUP BY or ORDER BY. (Not relevant here.)
There are a lot of exceptions and caveats. Some are given in my cookbook.
(Contrary to popular opinion, cardinality is almost never relevant in designing an index.)
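Applied to the query above, those rules give something like this (the index name is an assumption):
ALTER TABLE creditlog ADD INDEX idx_member_time (memberId, timeAdded);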
Here is a way to compare two indexes (even with a table that is too small to get reliable timings):
FLUSH STATUS;
SELECT SQL_NO_CACHE ...;
SHOW SESSION STATUS LIKE 'Handler%';
(repeat for other query/index)
Smaller numbers almost always indicate better.
"timeAdded>'2016-09-01' AND timeAdded<'2016-10-01'" -- That excludes midnight on the first day. I recommend this pattern:
timeAdded >= '2016-09-01'
AND timeAdded < '2016-09-01' + INTERVAL 1 MONTH
That also avoids computing dates.
That smells like a common query? Have you considered building and maintaining Summary tables? The equivalent query would probably run 10 times as fast.
I have the following table with millions of rows:
CREATE TABLE `points` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`DateNumber` int(10) unsigned DEFAULT NULL,
`Count` int(10) unsigned DEFAULT NULL,
`FPTKeyId` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_UNIQUE` (`id`),
KEY `index3` (`FPTKeyId`,`DateNumber`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=16755134 DEFAULT CHARSET=utf8$$
As you can see, I have created indexes. I don't know whether I did it right; maybe not.
The problem is that queries execute extremely slowly.
Let's take a simple query
SELECT fptkeyid, count FROM points group by fptkeyid
I can't get a result because the query aborts with a timeout (10 min). What am I doing wrong?
Beware MySQL's stupid behaviour: GROUP BYing implicitly executes ORDER BY.
To prevent this, explicitly add ORDER BY NULL, which suppresses the unnecessary ordering.
http://dev.mysql.com/doc/refman/5.0/en/select.html says:
If you use GROUP BY, output rows are sorted according to the GROUP BY
columns as if you had an ORDER BY for the same columns. To avoid the
overhead of sorting that GROUP BY produces, add ORDER BY NULL:
SELECT a, COUNT(b) FROM test_table GROUP BY a ORDER BY NULL;
+
http://dev.mysql.com/doc/refman/5.6/en/group-by-optimization.html says:
The most important preconditions for using indexes for GROUP BY are
that all GROUP BY columns reference attributes from the same index,
and that the index stores its keys in order (for example, this is a
BTREE index and not a HASH index).
Your query does not make sense:
SELECT fptkeyid, count FROM points group by fptkeyid
You GROUP BY fptkeyid, so selecting a bare count column is not useful here; there should be an aggregate function, not a count field. Note also that COUNT is a MySQL function, which makes it inadvisable to use the same name for a field.
Don't you need something like:
SELECT fptkeyid, SUM(`count`) FROM points group by fptkeyid
If not please explain what result you expect from the query.
I created a database with test data, half a million records, to see if I could reproduce your issue. This is what EXPLAIN tells me:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE points index NULL index3 10 NULL 433756
And on the SUM query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE points index NULL index3 10 NULL 491781
Both queries complete on a laptop (MacBook Air) within a second; nothing takes long. Inserting did take some time though, a few minutes to get half a million records, but retrieving and calculating does not.
We need more information to answer your question completely. Maybe the configuration of the database is wrong, for example almost no memory allocated?
I would personally start with your AUTO_INCREMENT value. You have set it to increase by 16,755,134 for each new record. Your field value is set to INT UNSIGNED which means that the range of values is 0 to 4,294,967,295 (or almost 4.3 billion). This means that you would have only 256 values before the field goes beyond the data type limits thereby compromising the purpose of the PRIMARY KEY INDEX.
You could change the data type to BIGINT UNSIGNED and you would have a value range of 0 to 18,446,744,073,709,551,615 (or slightly more than 18.4 quintillion), which would allow you to have up to 1,100,960,700,983 (or slightly more than 1.1 trillion) unique values with this AUTO_INCREMENT value.
I would first ask if you really need to have your AUTO_INCREMENT value set to such a large number and if not then I would suggest changing that to 1 (or at least some lower number) as storing the field values as INT vs BIGINT will save considerable disk space within larger tables such as this. Either way, you should get a more stable PRIMARY KEY INDEX which should help improve queries.
I think the problem is your server bandwidth. Returning millions of rows would probably need at least high-megabyte bandwidth.
I have a table with over 250 million records. Our reporting server regularly queries that table using this kind of query:
SELECT
COUNT(*),
DATE(updated_at) AS date,
COUNT(DISTINCT INT_FIELD)
FROM
TABLE_WITH_250_Million
WHERE
Field1 = 'value in CHAR'
AND field2 = 'VALUE in CHAR'
AND updated_at > '2012-04-27'
AND updated_at < '2012-04-28 00:00:00'
GROUP BY
Field2,
DATE(updated_at)
ORDER BY
date DESC
I have tried to create a BTREE index on the table including Field1, Field2, Field3 DESC in the same order, but it's not giving me the right result.
Can anyone help me optimize it? My problem is that I can't change the query, as I don't have access to the code the reporting server executes it from.
Any help would be really appreciated.
Thanks
Here's my table:
CREATE TABLE backup_jobs (
id int(11) unsigned NOT NULL AUTO_INCREMENT,
backup_profile_id int(11) DEFAULT NULL,
state varchar(32) DEFAULT NULL,
`limit` int(11) DEFAULT NULL,
file_count int(11) DEFAULT NULL,
byte_count bigint(20) DEFAULT NULL,
created_at datetime DEFAULT NULL,
updated_at datetime DEFAULT NULL,
status_type varchar(32) DEFAULT NULL,
status_param_1 varchar(255) DEFAULT NULL,
status_param_2 varchar(255) DEFAULT NULL,
status_param_3 varchar(255) DEFAULT NULL,
started_at datetime DEFAULT NULL,
PRIMARY KEY (id),
KEY index_backup_jobs_on_state (state),
KEY index_backup_jobs_on_backup_profile_id (backup_profile_id),
KEY index_backup_jobs_created_at (created_at),
KEY idx_backup_jobs_state_updated_at (state,updated_at) USING BTREE,
KEY idx_backup_jobs_state_status_param_1_updated_at (state,status_param_1,updated_at) USING BTREE
) ENGINE=MyISAM AUTO_INCREMENT=508748682 DEFAULT CHARSET=utf8;
Add the int_field into the index:
CREATE INDEX idx_backup_jobs_state_status_param_1_updated_at_profile_id ON backup_jobs (state, status_param_1, updated_at, backup_profile_id)
to make it cover all fields.
This way, table lookups go away (you will see Using index in the plan), which will make your query some 10x faster (your mileage may vary).
Also note that (at least for the single-date range provided) GROUP BY DATE(updated_at) and ORDER BY date DESC are redundant and will only make the query to use temporary and filesort without any real purpose. Not that you can do much about it, though, if you cannot change the query.
I'm sure that not all 250M rows fall in the date range of interest.
The problem is that the range nature of the date check forces a table scan, because the optimizer can't know in advance where the dates fall.
I'd recommend that you partition the 250M-row table into weeks, months, quarters, or years, so you only have to scan the partitions within a given date range. That'll help matters.
If you go down the partition road, you'll need to talk to a MySQL DBA, preferably someone who's familiar with partitioning. It's not for the faint of heart.
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
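A rough sketch of what range partitioning might look like here (partition names and boundaries are assumptions; note that MySQL requires the partitioning column to be part of every unique key, so the primary key would first need to include updated_at):
ALTER TABLE TABLE_WITH_250_Million
PARTITION BY RANGE (TO_DAYS(updated_at)) (
  PARTITION p2012q1 VALUES LESS THAN (TO_DAYS('2012-04-01')),
  PARTITION p2012q2 VALUES LESS THAN (TO_DAYS('2012-07-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);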
Per your query, you'll have to take the lead here -- smallest granularity. We have no idea what the frequency of activity is, what the Field1 and Field2 status entries are, how far back your data goes, or how many entries would be normal on a given SINGLE DATE. All that said, I would build my indexes based first on the smallest granularity that closely matches your query criteria.
Ex: if your "Field1" has a dozen possible "CHAR" values, you are applying an "IN" clause, and Field1 is first in your index, it will hit each char value for each date and Field2 value. 250 million records could force a lot of index paging activity, especially based on history. Likewise with your Field2. However, due to your "Group By" clause on Field2 and date updated, I would put one of those respectively in the first/second position of the index. Based on historical data, I would even lean toward the following index, with dates as the primary basis and, within that, the secondary criteria.
index ( Updated_At, Field2, Field1, INT_FIELD )
This way, your entire query can be answered from the index alone, with no need to read the raw data of the actual record. All the fields are right there in the index to pull from. You have a finite date range, so your updated_at is qualified right away and is already in order for the GROUP BY. From that, your "CHAR" values from Field2 nicely finish your GROUP BY, Field1 qualifies your third criterion, the "IN" char list, and finally your INT_FIELD serves the COUNT(DISTINCT).
Don't know how long the index will take to build on 250 million, but that is where I would start.
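As DDL, that suggestion would look something like this (a sketch; substitute the real table and column names):
CREATE INDEX idx_updated_field2_field1_int
ON TABLE_WITH_250_Million (updated_at, Field2, Field1, INT_FIELD);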