I have a table like the one below:
Field     Type        Null  Key  Default              Extra
id        bigint(11)  NO    PRI  NULL                 auto_increment
deviceId  bigint(11)  NO    MUL  NULL
value     double      NO         NULL
time      timestamp   YES   MUL  0000-00-00 00:00:00
It has more than 2 million rows. When I run select * from tableName; it takes more than 15 minutes.
When I run select value,time from sensor_value where time > '2017-05-21 04:47:48' and deviceId>=812; it takes more than 45 seconds to load.
Note: deviceId 512 has more than 92514 rows.
I have even added an index on the columns, like below:
ALTER TABLE `sensor_value`
ADD INDEX `IDX_FIELDS1_2` (`time`, `deviceId`) ;
How do I make the select query fast (loading in about 1 sec)? Am I indexing incorrectly?
Only 4 columns? Sounds like you have very little RAM, or innodb_buffer_pool_size is set too low. Hence, you were seriously I/O-bound and/or swapping.
WHERE time > '2017-05-21 04:47:48'
AND deviceId >= 812
is two ranges. There is no thorough way to optimize that. Either of these would help. If you have both, the Optimizer might pick the better one:
INDEX(time)
INDEX(deviceId)
When using a 'secondary' index in InnoDB, the query first looks in the index BTree; when there is a match there, it has to look up in the 'data' BTree (using the PRIMARY KEY for lookup).
Some of the anomalous times you saw when trying INDEX(time, deviceId) were because filtering inside the index kept the query from having to reach over into the data as often.
Do you use id for anything other than uniqueness? Is the pair deviceId & time unique? If the answers are 'no' and 'yes', then get rid of id and change to PRIMARY KEY(deviceId, time). Or you could swap those two columns. What other queries do you have?
Getting rid of id shrinks the table some, thereby cutting down on I/O.
When using a combined index, you usually need an equality operator on the first column, and then you can use a range criterion on the second column. So I recommend you change the order of the columns in your index like this:
ALTER TABLE `sensor_value`
ADD INDEX `IDX_FIELDS1_2` (`deviceId`, `time`) ;
Then change the query to use an equality on deviceId (deviceId=812, not deviceId>=812):
select value,time from sensor_value where time > '2017-05-21 04:47:48' and deviceId=812;
I hope it could help.
2 million records is not much for MySQL, and it is normal to get a result in less than 1 sec even for 1 billion records if you do the right things.
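Why equality on the first column plus a range on the second works so well can be sketched outside the database. The following is only a minimal illustration in Python, not how MySQL is implemented: it assumes a composite index behaves like a list sorted by (deviceId, time), and the sample rows and the helper name range_for_device are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample rows: (deviceId, time, value). The timestamps are
# fixed-width strings, so string order matches chronological order.
rows = [
    (811, "2017-05-21 05:00:00", 1.0),
    (812, "2017-05-20 23:00:00", 2.0),
    (812, "2017-05-21 05:10:00", 3.0),
    (812, "2017-05-22 08:00:00", 4.0),
    (813, "2017-05-21 06:00:00", 5.0),
]

# Assumption: a composite index on (deviceId, time) is modelled as a list
# sorted by that pair.
index = sorted((device_id, ts) for device_id, ts, _ in rows)


def range_for_device(device_id, after_ts):
    """Entries with deviceId = device_id AND time > after_ts.

    Because the list is sorted by (deviceId, time), the matches form one
    contiguous slice that two binary searches can locate.
    """
    lo = bisect_right(index, (device_id, after_ts))
    hi = bisect_left(index, (device_id + 1, ""))
    return index[lo:hi]


matches = range_for_device(812, "2017-05-21 04:47:48")
print(matches)
```

With deviceId>=812 instead, the matching entries for each device would be separated by that device's older entries, so no single contiguous slice covers them.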
Hi I currently have a query which is taking 11(sec) to run. I have a report which is displayed on a website which runs 4 different queries which are similar and all take 11(sec) each to run. I don't really want the customer having to wait a minute for all of these queries to run and display the data.
I am using 4 different AJAX requests to call APIs to get the data I need. These all start at once, but the queries run one after another. If there were a way to get these queries to all run at once (in parallel), so that the total load time is only 11(sec), that would also fix my issue, but I don't believe that is possible.
Here is the query I am running:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
I can't think of any way to speed this query up at all. Below are pictures of the table indexes and the explain statement on this query.
I think the above query is using relevant indexes in the where conditions.
If there is anything you can think of to speed this query up please let me know, I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query times down to 5(sec) maximum. If I am wrong about the AJAX issue please let me know as this would also fix my issue.
" EDIT "
I have come across something quite strange which might be causing the issue. When I change the day_epoch range to something smaller (5th - 9th), which returns 130,000 rows, the query time is 0.7(sec), but when I add one more day onto that range (5th - 10th) and it returns over 150,000 rows, the query time is 13(sec). I have run loads of different ranges and have come to the conclusion that if the number of rows returned is over 150,000, that has a huge effect on the query times.
Table Definition -
CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`day_epoch` int(10) NOT NULL,
`day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
`hour` int(2) NOT NULL,
`venue_id` int(5) NOT NULL,
`zone_id` int(5) NOT NULL,
`device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
`device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
`first_seen` int(10) unsigned NOT NULL DEFAULT '0',
`last_seen` int(10) unsigned NOT NULL DEFAULT '0',
`is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
`prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
PRIMARY KEY (`id`,`venue_id`) USING BTREE,
KEY `venue_id` (`venue_id`),
KEY `zone_id` (`zone_id`),
KEY `day_of_week` (`day_of_week`),
KEY `day_epoch` (`day_epoch`),
KEY `hour` (`hour`),
KEY `device_uuid` (`device_uuid`),
KEY `is_repeat` (`is_repeat`),
KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */
The straightforward solution is to add this query-specific index to the table:
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)
WARNING: This schema change can take a while to run on the DB.
And then force it when you call:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
It is definitely not universal but should work for this particular query.
UPDATE: When you have a partitioned table, you can benefit from forcing a particular PARTITION. In our case, since the table is partitioned on venue_id, just force it:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
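Here p46 is the letter p followed by the partition number. With HASH partitioning into 100 partitions, the partition number is venue_id MOD 100, which is 46 for venue_id = 46.
Another trick if you go this way: you can remove AND venue_id = 46 from the WHERE clause, but only if no other venue_id maps to the same partition (for example, venue_id = 146 would also land in p46).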
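What happens if you change the order of the conditions and put venue_id = ? first? The MySQL optimizer normally decides the evaluation order itself, so the textual order may not change the plan, but it helps to think about which condition is the most selective.
Conceptually, the query as written checks all rows for: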
- day_epoch >= 1552435200
- then, the remaining set for day_epoch < 1553040000
- then, the remaining set for venue_id = 46
- then, the remaining set for zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
When working with heavy queries, you should always try to make the first "selector" the most effective. You can do that by using a proper single-column (or composite) index and by making sure that the first selector narrows the result down the most (at least for integers; in the case of strings you need another tactic).
Sometimes, a query simply is slow. When you have a lot of data (and/or not enough resources), you just can't really do anything about that. That's where you need another solution: make a summary table. I doubt you show 150,000 rows x4 to your visitor. You can sum it, e.g., hourly or every few minutes, and select from that much smaller table.
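The summary-table idea can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table names raw_stats and zone_daily_summary, the sample rows and the chosen aggregates are all hypothetical. In practice the summary would be refreshed on a schedule rather than built once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, much-reduced version of the detail table.
conn.execute(
    "CREATE TABLE raw_stats ("
    " day_epoch INTEGER, venue_id INTEGER, zone_id INTEGER,"
    " device_uuid TEXT, is_repeat INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_stats VALUES (?, ?, ?, ?, ?)",
    [
        (1552435200, 46, 102, "dev-a", 0),
        (1552435200, 46, 102, "dev-b", 1),
        (1552435200, 46, 105, "dev-a", 0),
        (1552521600, 46, 102, "dev-c", 1),
    ],
)

# Build the summary once: one row per (day, venue, zone) instead of one row
# per device.
conn.execute(
    "CREATE TABLE zone_daily_summary AS"
    " SELECT day_epoch, venue_id, zone_id,"
    "        COUNT(*) AS devices,"
    "        SUM(is_repeat) AS repeat_devices"
    " FROM raw_stats"
    " GROUP BY day_epoch, venue_id, zone_id"
)

# The report then reads the small table.
summary = conn.execute(
    "SELECT day_epoch, zone_id, devices, repeat_devices"
    " FROM zone_daily_summary"
    " WHERE venue_id = 46"
    " ORDER BY day_epoch, zone_id"
).fetchall()
print(summary)
```

Whether per-day rows are fine-grained enough depends on what the report actually displays.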
Offtopic: Putting an index on everything only slows you down when inserting/updating/deleting. Index as few columns as possible: just the ones you actually filter on (e.g. those used in a WHERE or GROUP BY).
450M rows is rather large. So, I will discuss a variety of issues that can help.
Shrink data
A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)
Any kind of INT, even INT(2), takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. That saves over 1GB in the data, plus a similar amount in INDEX(hour).
If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
Is there some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
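The "over 1GB" figure for the hour column follows from simple arithmetic. A rough sketch in Python, assuming about 450M rows as stated above and ignoring per-row and per-page overhead:

```python
ROWS = 450_000_000   # approximate row count quoted above
INT_BYTES = 4        # any INT takes 4 bytes, whatever the display width
TINYINT_BYTES = 1    # large enough for an hour value (0-23)

saved_per_row = INT_BYTES - TINYINT_BYTES
saved_in_data = ROWS * saved_per_row

print(f"{saved_in_data / 1e9:.2f} GB saved in the data")
```

INDEX(hour) stores its own copy of the column, which is why a similar amount is saved there as well.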
Optimal INDEX (take #1)
If it is always a single venue_id, then this is a good first cut at the optimal index:
INDEX(venue_id, zone_id, day_epoch)
First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)
Better Primary Key (better index)
With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).
InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:
PRIMARY KEY(venue_id, zone_id, day_epoch, -- this order, as discussed above;
id) -- to make sure that the entire PK is unique.
INDEX(id) -- to keep AUTO_INCREMENT happy
And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
UUID
Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.
Multi-dimensional
Assuming day_epoch is sometimes multiple days, you seem to have 2 or 3 "dimensions":
A date range
A list of zones
A venue.
INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .
There is no good way to get 3 dimensions, so let's focus on 2.
You should partition on something that is a "range", such as day_epoch or zone_id.
After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".
Plan A: This assumes you are searching for only one venue_id at a time:
PARTITION BY RANGE(day_epoch) -- see note below
PRIMARY KEY(venue_id, zone_id, id)
Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:
Well, I don't have good advice here; so let's go with Plan A.
The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE, would necessitate BY RANGE(TO_DAYS(...)), which works fine.
You should limit the number of partitions to 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies; "too few" partitions leads to "why bother".
Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.
Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.
Analysis
Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:
Filter on the date range. This is likely to limit the activity to a single partition.
Drill into the data's BTree for that one partition to find the one venue_id.
Hopscotch through the data from there, landing on the desired zone_ids.
For each, further filter based on the date.
I have a table meta with the following structure (this is just an example denormalized data)
`id` int(3) not null auto_increment primary key,
`category_id` int(3),
`subdomain` varchar(191),
`created_at` timestamp,
`updated_at` timestamp
The subdomain field can store unique values as well as repeating ones; a value like 'general' can be repeated many times.
Situation 1
I also have an index on subdomain. This index is used for the query
Select `id` from `table` where `subdomain` = 'general'
But when I try to get some non-indexed field, MySQL scans the whole table and the index is not used:
Select `created_at` from `table` where `subdomain` = 'general'
As far as I know, an InnoDB non-clustered index stores a reference to the row, so there should be no need to perform a linear search over all rows to retrieve some field.
I also know the optimizer can choose a plan that a human would not expect, but what could the reasons be in this case?
No matter how much data is in the table, the result is always the same.
This can happen when the filter backed by the index is not very selective, i.e. the value you filter on is very common. A high percentage of your total rows then match the where-condition supported by the index (e.g. 90% of your rows match subdomain = 'general'). If you use the index under that condition, you end up processing more data than a full table scan would. (The first query avoids this because id is the primary key, which InnoDB stores in every secondary index entry, so it can be answered from the index alone.)
Example: you have 100 rows and 90 of them match subdomain = 'general'.
A full table scan needs to access all 100 rows to check the condition, and 90 values are read for the result.
An index-backed select needs to access 90 items in the index to fulfill the condition and then follow the pointer from each index entry to the actual row to read the non-indexed value. That ends up as 90 lookups on the index + 90 reads from the rows = 180 operations. This is slower than the full table scan, where you just access a few more rows than needed. The operations might not have the same cost, but you end up doing more work in the end.
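The arithmetic in the example above can be written out as a tiny cost model. This is only a sketch in Python that counts every index read and every row read as one operation, which is a simplifying assumption; the function names are hypothetical and a real optimizer weighs these costs differently.

```python
def full_scan_cost(total_rows):
    # A full table scan touches every row once.
    return total_rows


def index_lookup_cost(matching_rows):
    # One index entry read plus one row read per matching row.
    return matching_rows * 2


total_rows = 100

common_value = index_lookup_cost(90)   # 90% of rows match
rare_value = index_lookup_cost(5)      # 5% of rows match

print(full_scan_cost(total_rows), common_value, rare_value)
```

Under this model the index only pays off when fewer than half of the rows match; real engines tend to switch to a scan at a much lower percentage.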
We have a multitenant application that has a table with 129 fields that can all be used in WHERE and ORDER BY clauses. I spent 5 days now trying to find out the best indexing strategy for us, I gained lot of knowledge but I still have some questions.
1) When creating an index, should I always make it a composite index with tenant_id in the first place? (All queries have tenant_id = ? in their WHERE clause.)
2) Since all the columns can be used in both the WHERE clause and the ORDER BY clause, should I create an index on all of them? (Right now, when I order by a column that has no index, it takes 6s to execute for a tenant that has about 1,500,000 rows.)
3) make the PK (tenant_id, ID), but wouldn't this affect the joins to that table ?
Any advice on how to handle this would be much appreciated.
======
The database engine is InnoDB
=======
structure :
ID bigint(20) auto_increment primary
tenant_id int(11)
created_by int(11)
created_on Timestamp
updated_by int(11)
updated_on Timestamp
owner_id int(11)
first_name VARCHAR(60)
last_name VARCHAR(60)
.
.
.
(some 120 other columns that are all searchable)
A few brief answers to the questions. As far as I can see, you are confused about how to use indexes.
Consider creating Indexes on columns if the Ratio -
Consideration 1 -
(Number of UNIQUE Entries of the Columns)/(Number of Total Entries in the Column) ~= 1
That is Count of DISTINCT rows in a particular column is high.
Creating an extra index will always create overhead for the MySQL server, so you MUST NOT index every column. There is also a limit on the number of indexes a single table can have: 64 per table.
Now if your tenant_id is present in all the search queries, you should consider it as an index or in a composite key,
provided that -
Consideration 2 - the number of UPDATEs is less than the number of SELECTs on the tenant_id
Consideration 3 - The indexes should be as small as possible in terms of data types. You MUST NOT index a varchar(64) column
http://www.mysqlperformanceblog.com/2012/08/16/mysql-indexing-best-practices-webinar-questions-followup/
Point to Note 1 - Even if you do index a column, the MySQL optimizer may still not consider it the best plan for query execution. So always use EXPLAIN to know what's going on. http://www.mysqlperformanceblog.com/2009/09/12/3-ways-mysql-uses-indexes/
Point to Note 2 -
You may want to cache your search queries, so remember not to use non-deterministic functions, such as NOW(), in your SELECT queries.
Lastly - making the PK (tenant_id, ID) should not affect the joins on your table.
And an awesome link to answer all your questions in general - http://www.percona.com/files/presentations/WEBINAR-MySQL-Indexing-Best-Practices.pdf
Query
SELECT id FROM `user_tmp`
WHERE `code` = '9s5xs1sy'
AND `go` NOT REGEXP 'http://www.xxxx.example.com/aflam/|http://xx.example.com|http://www.xxxxx..example.com/aflam/|http://www.xxxxxx.example.com/v/|http://www.xxxxxx.example.com/vb/'
AND check='done'
AND `dataip` <1319992460
ORDER BY id DESC
LIMIT 50
MySQL returns:
Showing rows 0 - 29 ( 50 total, Query took 21.3102 sec) [id: 2622270 - 2602288]
if i remove
AND dataip <1319992460
MySQL returns
Showing rows 0 - 29 ( 50 total, Query took 0.0859 sec) [id: 3637556 - 3627005]
and if no data, MySQL returns
MySQL returned an empty result set (i.e. zero rows). ( Query took 21.7332 sec )
Explain plan:
SQL query: Explain SELECT * FROM `user_tmp` WHERE `code` = '93mhco3s5y' AND `too` NOT REGEXP 'http://www.10neen.com/aflam/|http://3ltool.com|http://www.10neen.com/aflam/|http://www.10neen.com/v/|http://www.m1-w3d.com/vb/' and checkopen='2010' and `dataip` <1319992460 ORDER BY id DESC LIMIT 50;
Rows: 1
id  select_type  table     type   possible_keys  key      key_len  ref   rows  Extra
1   SIMPLE       user_tmp  index  NULL           PRIMARY  4        NULL  50    Using where
Example of the database used
CREATE TABLE IF NOT EXISTS user_tmp (
  id int(9) NOT NULL AUTO_INCREMENT,
  ip text NOT NULL,
  dataip bigint(20) NOT NULL,
  ref text NOT NULL,
  click int(20) NOT NULL,
  code text NOT NULL,
  too text NOT NULL,
  name text NOT NULL,
  checkopen text NOT NULL,
  contry text NOT NULL,
  vOperation text NOT NULL,
  vBrowser text NOT NULL,
  iconOperation text NOT NULL,
  iconBrowser text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=4653425 ;
--
-- Dumping data for table user_tmp
INSERT INTO `user_tmp` (`id`, `ip`, `dataip`, `ref`, `click`, `code`, `too`, `name`, `checkopen`, `contry`, `vOperation`, `vBrowser`, `iconOperation`, `iconBrowser`) VALUES
(1, '54.125.78.84', 1319506641, 'http://xxxx.example.com/vb/showthread.php%D8%AA%D8%AD%D9%85%D9%8A%D9%84-%D8%A7%D8%BA%D9%86%D9%8A%D8%A9-%D8%A7%D9%84%D8%A8%D9%88%D9%85-giovanni-marradi-lovers-rendezvous-3cd-1999-a-155712.html', 0, '4mxxxxx5', 'http://www.xxx.example.com/aflam/', 'xxxxe', '2010', 'US', 'Linux', 'Chrome 12.0.742 ', 'linux.png', 'chrome.png');
I want the correct way to do the query and optimize database
You don't have any indexes besides the primary key. You need to index the fields that you use in your WHERE clause. Whether you need to index only 1 field or a combination of several fields depends on the other SELECTs you will be running against that table.
Keep in mind that REGEXP cannot use indexes at all, and LIKE can use an index only when the pattern does not begin with a wildcard (so LIKE 'a%' can use an index, but LIKE '%a' cannot). Range comparisons (< and >) can use an index only as a range scan, and the optimizer often skips it when the range matches a large share of the table.
So the best candidates are the code and check fields. I suppose many rows will have the same value for check, so I would begin the index with the code field. Both are TEXT columns, so MySQL requires a prefix length in the index definition (or convert them to VARCHAR first). Multi-field indexes can be used only in the order in which they are defined...
Imagine index created for fields code, check. This index can be used in your query (where the WHERE clause contains both fields), also in the query with only code field, but not in query with only check field.
Is it important to ORDER BY id? If not, leave it out, it will prevent the sort pass and your query will finish faster.
I will assume you are using mysql <= 5.1
The answers above fall into two basic categories:
1. You are using the wrong column type
2. You need indexes
I will deal with each as both are relevant for performance which is ultimately what I take your questions to be about:
Column Types
The difference between bigint/int or int/char for the dataip question is basically not relevant to your issue. The fundamental issue has more to do with index strategy. However when considering performance holistically, the fact that you are using MyISAM as your engine for this table leads me to ask if you really need "text" column types. If you have short (less than 255 say) character columns, then making them fixed length columns will most likely increase performance. Keep in mind that if any one column is of variable length (varchar, text, etc) then this is not worth changing any of them.
Vertical Partitioning
The fact to keep in mind here is that even though you are only requesting the id column, from the standpoint of disk IO and memory you are getting the entire row back. Since so many of the columns are text, this could mean a massive amount of data. Any of these columns that are not used for lookups of users, or are not often accessed, could be moved into another table where the foreign key has a unique key placed on it, keeping the relationship 1:1.
Index Strategy
Most likely the problem is simply indexing, as noted above. The reason adding the "AND dataip <1319992460" condition slows things down is that it effectively forces a full table scan: the rows are read in id order, and many more of them must be examined before 50 matches are found.
As stated above, placing all the columns in the where clause in a single, composite index will help. The order of the equality columns in the index does not matter for this query so long as all of them appear in the where clause; the range column (dataip) should come last.
However, the order could matter a great deal for other queries. A quick example would be an index made of (colA, colB). A query with "where colA = 'foo'" will use this index. But a query with "where colB = 'bar'" will not because colB is not the left most column in the index definition. So, if you have other queries that use these columns in some combination it is worth minimizing the number of indexes created on the table. This is b/c every index increases the cost of a write and uses disk space. Writes are expensive b/c of necessary disk activity. Don't make them more expensive.
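The leftmost-column rule for (colA, colB) can be observed directly in a query plan. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module rather than MySQL, the table t, its payload column and the index name idx_a_b are hypothetical, and the exact plan wording differs between engines and versions. The rule it shows is the same one described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, colA TEXT, colB TEXT, payload TEXT)"
)
conn.execute("CREATE INDEX idx_a_b ON t (colA, colB)")


def plan(sql):
    """Return the human-readable plan text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row holds the description.
    return " | ".join(row[3] for row in rows)


# Filters on the leftmost column: the composite index can be used.
plan_a = plan("SELECT * FROM t WHERE colA = 'foo'")

# Filters only on the second column: the index cannot be used for a lookup.
plan_b = plan("SELECT * FROM t WHERE colB = 'bar'")

print(plan_a)
print(plan_b)
```

In MySQL the equivalent check is running EXPLAIN on both queries and comparing the key column.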
You need to add index like this:
ALTER TABLE `user_tmp` ADD INDEX(`dataip`);
And if your column 'dataip' contains only unique values you can add unique key like this:
ALTER TABLE `user_tmp` ADD UNIQUE(`dataip`);
Keep in mind that adding an index can take a long time on a big table, so don't do it on a production server without testing.
You need to create an index on the fields used in your where clause; otherwise no index will be used. The order of the columns in the index matters (equality columns first, then the range column); the order in which the conditions are written in the where clause does not.
Does dataip really need to be a bigint? According to the MySQL documentation, the signed range is -9223372036854775808 to 9223372036854775807 (it is a 64-bit number).
You need to choose the right column type for the job, and add the right type of index too. Else these queries will take forever.
I have a simple MyISAM table resembling the following (trimmed for readability -- in reality, there are more columns, all of which are constant width and some of which are nullable):
CREATE TABLE IF NOT EXISTS `history` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`time` int(11) NOT NULL,
`event` int(11) NOT NULL,
`source` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `event` (`event`),
KEY `time` (`time`)
);
Presently the table contains only about 6,000,000 rows (of which currently about 160,000 match the query below), but this is expected to increase. Given a particular event ID and grouped by source, I want to know how many events with that ID were logged during a particular interval of time. The answer to the query might be something along the lines of "Today, event X happened 120 times for source A, 105 times for source B, and 900 times for source C."
The query I concocted does perform this task, but it performs monstrously badly, taking well over a minute to execute when the timespan is set to "all time" and in excess of 30 seconds for as little as a week back:
SELECT COUNT(*) AS count FROM history
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
This is not for real-time use, so even if the query takes a second or two that would be fine, but several minutes is not. Explaining the query gives the following, which troubles me for obvious reasons:
id  select_type  table    type  possible_keys  key    key_len  ref    rows    Extra
1   SIMPLE       history  ref   event,time     event  4        const  160399  Using where; Using temporary; Using filesort
I've experimented with various multi-column indexes (such as (event, time)), but with no improvement. This seems like such a common use case that I can't imagine there not being a reasonable solution, but my Googling all boils down to versions of the query I already have, with no particular suggestions on how to avoid the temporary table (or even on why performance is so abysmal).
Any suggestions?
You say you have tried multi-column indexes. Have you also tried single-column indexes, one per column?
UPDATE: Also, the COUNT(*) operation over a GROUP BY clause is probably a lot faster if the grouped column also has an index on it. (MySQL indexes NULL values too, so this also holds for the nullable source column.)
For event, MySQL can do an equality lookup (type ref in EXPLAIN), which is quite fast, whereas for time a range scan will be applied, which is not so fast. If you separate the indexes, I'd expect better performance than with multi-column ones.
Also, maybe you could gain something by partitioning your table by some expected values / value ranges:
http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html
I suggest trying this multi-column index:
ALTER TABLE `history` ADD INDEX `history_index` (`event` ASC, `time` ASC, `source` ASC);
Then if it doesn't help, try to force index on this query:
SELECT COUNT(*) AS count FROM history USE INDEX (history_index)
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
If the sources are known, or you want the count for specific sources, you can try something like this:
select count(source= 'A' or NULL) as A,count(source= 'B' or NULL) as B from history;
You can do the ordering in your application code. Also try indexing event and source together.
This should be faster than the original query.
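The count(... or NULL) idiom works because COUNT skips NULLs: a false comparison OR NULL evaluates to NULL, while a true one stays true. A small sketch using Python's built-in sqlite3 module as a stand-in for MySQL, with hypothetical sample rows and text labels for source; the event and time filters from the question are added here as an assumption about how the idiom would be combined with them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (event INTEGER, time INTEGER, source TEXT)")
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        (2000, 10, "A"),
        (2000, 20, "A"),
        (2000, 30, "B"),
        (2000, 40, None),   # NULL source: counted for neither label
        (3000, 50, "A"),    # different event: filtered out
    ],
)

# One pass over the matching rows, one output column per known source.
count_a, count_b = conn.execute(
    "SELECT COUNT(source = 'A' OR NULL) AS A,"
    "       COUNT(source = 'B' OR NULL) AS B"
    " FROM history"
    " WHERE event = 2000 AND time >= 0 AND time < 100"
).fetchone()

# Ordering is left to the application, as suggested above.
ranking = sorted([("A", count_a), ("B", count_b)], key=lambda p: p[1], reverse=True)
print(ranking)
```

This avoids GROUP BY and the sort, but it only works when the list of sources is known in advance.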