Scale Large MySQL Table - mysql

I have a table which is growing very quickly; it currently has 47,000,000+ rows.
Even very simple queries such as the one below sometimes take 46 seconds:
SELECT id, userId, visitorId, date FROM user_views LIMIT 20000000, 1;
The table structure is:
Field      Type              Null  Key  Default  Extra
id         int(11) unsigned  NO    PRI  NULL     auto_increment
userId     int(11) unsigned  NO    MUL  NULL
visitorId  int(11)           NO    MUL  NULL
date       datetime          NO    MUL  NULL
The application is already running with 1 master and 6 slaves, and we can't afford more instances.
There is a BTREE index on id.
Is there any way to make it faster?
Thanks

First of all, you should consider using different storage approaches. Depending on your use cases, a relational database might not be the best choice. E.g. if 99% of all operations write new rows to the table but never update existing records (which is what your column names suggest), a NoSQL database might perform far better.
Secondly, skipping 20,000,000 rows without any specific order criterion (backed by an index, of course) leaves the DBMS free to apply an arbitrary order, which may be suboptimal.
I don't know MySQL's internal optimization mechanisms, but LIMIT is only applied after the whole result set has been built, which means the whole table ends up loaded in memory. So please try to reduce the size of the result set using WHERE conditions before LIMITing it.
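One common way to do that (a minimal sketch; it assumes you page through the table in id order, and the starting value 19999999 is only for illustration) is keyset ("seek") pagination, which resumes from the last id seen instead of making the server skip 20 million rows:
SELECT id, userId, visitorId, date
FROM user_views
WHERE id > 19999999   -- last id seen on the previous page
ORDER BY id
LIMIT 1;
Because id is the clustered PRIMARY KEY, this jumps straight to the right spot in the BTree instead of counting rows. It is not byte-for-byte equivalent to OFFSET if ids have gaps, but it is usually what pagination actually wants.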

Related

Improving MySQL Query Speeds - 150,000+ Rows Returned Slows Query

Hi, I currently have a query which takes 11 seconds to run. I have a report displayed on a website which runs 4 similar queries, each taking 11 seconds. I don't really want the customer to wait a minute for all of these queries to run and display the data.
I am using 4 different AJAX requests to call APIs to get the data I need, and they all start at once, but the queries run one after another. If there were a way to run these queries in parallel so the total load time is only 11 seconds, that would also fix my issue, but I don't believe that is possible.
Here is the query I am running:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
I can't think of any way to speed this query up. Below are pictures of the table indexes and the EXPLAIN output for this query.
I think the query above is using the relevant indexes in the WHERE conditions.
If there is anything you can think of to speed this query up, please let me know; I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query times down to 5 seconds maximum. If I am wrong about the AJAX issue, please let me know, as that would also fix my issue.
" EDIT "
I have come across something quite strange which might be causing the issue. When I change the day_epoch range to something smaller (5th - 9th), which returns 130,000 rows, the query time is 0.7 seconds; but when I add one more day to that range (5th - 10th) so it returns over 150,000 rows, the query time is 13 seconds. I have run loads of different ranges and have come to the conclusion that if the number of rows returned is over 150,000, it has a huge effect on the query times.
Table Definition -
CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`day_epoch` int(10) NOT NULL,
`day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
`hour` int(2) NOT NULL,
`venue_id` int(5) NOT NULL,
`zone_id` int(5) NOT NULL,
`device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
`device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
`first_seen` int(10) unsigned NOT NULL DEFAULT '0',
`last_seen` int(10) unsigned NOT NULL DEFAULT '0',
`is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
`prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
PRIMARY KEY (`id`,`venue_id`) USING BTREE,
KEY `venue_id` (`venue_id`),
KEY `zone_id` (`zone_id`),
KEY `day_of_week` (`day_of_week`),
KEY `day_epoch` (`day_epoch`),
KEY `hour` (`hour`),
KEY `device_uuid` (`device_uuid`),
KEY `is_repeat` (`is_repeat`),
KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */
The straightforward solution is to add a query-specific index to the table:
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`);
WARNING: this schema change can take a while on a big DB.
And then force it when you call:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
It is definitely not universal, but it should work for this particular query.
UPDATE: When you have a partitioned table, you can profit from restricting the query to a particular PARTITION. In our case, since the partitioning is by venue_id, just force it:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
where p46 is the partition name: the letter p concatenated with venue_id = 46.
And another trick if you go this way: you can remove AND venue_id = 46 from the WHERE clause, because there is no other data in that partition anyway.
What happens if you change the order of conditions? Put venue_id = ? first. The order matters.
Now it first checks all rows for:
- day_epoch >= 1552435200
- then, the remaining set for day_epoch < 1553040000
- then, the remaining set for venue_id = 46
- then, the remaining set for zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
When working with heavy queries, you should always try to make the first "selector" the most effective. You can do that by using a proper index (single-column or composite) and by making sure that the first selector narrows things down the most (at least for integers; for strings you need another tactic).
Sometimes a query simply is slow. When you have a lot of data (and/or not enough resources) you just can't really do anything about that. That's where you need another solution: make a summary table. I doubt you show 150,000 rows x4 to your visitor. You can aggregate, e.g., hourly or every few minutes, and select from that much smaller table, as sketched below.
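A minimal sketch of that idea (the summary table's name and exact columns are assumptions; adapt them to what the report actually displays):
-- Hypothetical summary table, one row per venue/zone/day:
CREATE TABLE zone_daily_summary (
  day_epoch    int NOT NULL,
  venue_id     int NOT NULL,
  zone_id      int NOT NULL,
  device_count int NOT NULL,
  repeat_count int NOT NULL,
  PRIMARY KEY (venue_id, zone_id, day_epoch)
);
-- Refreshed periodically (e.g., from cron), replacing the rows for one day:
REPLACE INTO zone_daily_summary
SELECT day_epoch, venue_id, zone_id, COUNT(*), SUM(is_repeat)
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch = 1552435200
GROUP BY day_epoch, venue_id, zone_id;
The report then reads a handful of summary rows per day instead of 150,000 detail rows.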
Off-topic: putting an index on everything only slows you down when inserting/updating/deleting. Index the least number of columns, just the ones you actually filter on (e.g. those used in a WHERE or GROUP BY).
450M rows is rather large. So, I will discuss a variety of issues that can help.
Shrink the data
A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached and avoid the I/O burden.)
Any kind of INT, even INT(2), takes 4 bytes. An "hour" easily fits in a 1-byte TINYINT. That saves over 1 GB in the data, plus a similar amount in INDEX(hour).
If hour and day_of_week can be derived, don't bother having them as separate columns; that saves more space.
Is there some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you really do need a 5-byte DATETIME or TIMESTAMP.
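A minimal sketch of the shrink (hedged; ALTERs like these rewrite the whole 450M-row table, so test on a copy and run them in a maintenance window):
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  MODIFY `hour` TINYINT UNSIGNED NOT NULL,  -- 0-23 fits in 1 byte instead of 4
  DROP COLUMN `day_of_week`;                -- derivable from day_epoch, so drop it (its index goes too)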
Optimal INDEX (take #1)
If it is always a single venue_id, then this is a good first cut at the optimal index:
INDEX(venue_id, zone_id, day_epoch)
First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)
Better Primary Key (better index)
With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).
InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:
PRIMARY KEY(venue_id, zone_id, day_epoch,  -- this order, as discussed above
            id),                           -- to make sure that the entire PK is unique
INDEX(id)                                  -- to keep AUTO_INCREMENT happy
And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
UUID
Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.
Multi-dimensional
Assuming a query's day_epoch range sometimes spans multiple days, you seem to have 2 or 3 "dimensions":
A date range
A list of zones
A venue.
INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .
There is no good way to get 3 dimensions, so let's focus on 2.
You should partition on something that is a "range", such as day_epoch or zone_id.
After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".
Plan A: This assumes you are searching for only one venue_id at a time:
PARTITION BY RANGE(day_epoch) -- see note below
PRIMARY KEY(venue_id, zone_id, id)
Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence venue_id does not make a good first column for the PK:
Well, I don't have good advice here; so let's go with Plan A.
The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE would necessitate BY RANGE(TO_DAYS(...)), which works fine.
You should limit the number of partitions to about 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies, while "too few" partitions leads to "why bother".
Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.
Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.
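A minimal sketch of Plan A (hedged; the partition names and epoch boundaries are invented for illustration, and the column list is elided for brevity):
CREATE TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour (
  -- ... same columns as before ...
  PRIMARY KEY (venue_id, zone_id, day_epoch, id),  -- clustering order, id appended for uniqueness
  KEY (id)                                         -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB
PARTITION BY RANGE (day_epoch) (
  PARTITION p20190301 VALUES LESS THAN (1551398400),  -- 2019-03-01 00:00 UTC
  PARTITION p20190401 VALUES LESS THAN (1554076800),  -- 2019-04-01 00:00 UTC
  PARTITION pmax VALUES LESS THAN MAXVALUE
);
Note that day_epoch must appear in the PRIMARY KEY because it is the partitioning column.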
Analysis
Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:
Filter on the date range. This is likely to limit the activity to a single partition.
Drill into the data's BTree for that one partition to find the one venue_id.
Hopscotch through the data from there, landing on the desired zone_ids.
For each, further filter based on the date.

MySQL - speed up select query from 2 million rows

I have a table like below,
Field     Type        Null  Key  Default              Extra
id        bigint(11)  NO    PRI  NULL                 auto_increment
deviceId  bigint(11)  NO    MUL  NULL
value     double      NO         NULL
time      timestamp   YES   MUL  0000-00-00 00:00:00
It has more than 2 million rows. When I run select * from tableName; it takes more than 15 minutes.
When I run select value,time from sensor_value where time > '2017-05-21 04:47:48' and deviceId>=812; it takes more than 45 seconds to load.
Note: deviceId 812 alone has more than 92,514 rows.
I have even added an index on the columns, like below:
ALTER TABLE `sensor_value`
ADD INDEX `IDX_FIELDS1_2` (`time`, `deviceId`) ;
How do I make the select query fast (load within 1 second)? Am I doing the indexing wrong?
15 minutes for select * on only 4 columns? Sounds like you have very little RAM, or innodb_buffer_pool_size is set too low. Hence, you were seriously I/O-bound and/or swapping.
WHERE time > '2017-05-21 04:47:48'
AND deviceId >= 812
is two range conditions. There is no good way to optimize both in a single index. Either of these would help; if you have both, the Optimizer might pick the better one:
INDEX(time)
INDEX(deviceId)
When using a 'secondary' index in InnoDB, the query first looks in the index BTree; for each match there, it has to do a second lookup in the 'data' BTree (using the PRIMARY KEY).
Some of the anomalous times you saw when trying INDEX(time, deviceId) were because the extra filtering in the index kept the query from having to reach over into the data as often.
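One way to avoid that second lookup entirely (a hedged aside, not part of the original advice; the index name is made up) is a "covering" index that contains every column the query reads:
ALTER TABLE sensor_value
  ADD INDEX idx_device_time_value (deviceId, `time`, value);
-- EXPLAIN should then show "Using index": the query is answered from the index BTree alone.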
Do you use id for anything other than uniqueness? Is the pair deviceId & time unique? If the answers are 'no' and 'yes', then get rid of id and change to PRIMARY KEY(deviceId, time). Or you could swap those two columns. What other queries do you have?
Getting rid of id shrinks the table some, thereby cutting down on I/O.
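A minimal sketch of that change (hedged; only do this if (deviceId, time) really is unique and nothing else references id, and back up first):
ALTER TABLE sensor_value
  DROP COLUMN id,                      -- removing the sole PK column also drops the old PRIMARY KEY
  ADD PRIMARY KEY (deviceId, `time`);  -- rows are now clustered by device, then time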
When using a composite index, you usually need an equality comparison on the first column; then you can use range criteria on the second column. So I recommend you change the order of the columns in your index like this:
ALTER TABLE `sensor_value`
ADD INDEX `IDX_FIELDS1_2` (`deviceId`, `time`) ;
Then use an equals sign for deviceId (deviceId=812, not deviceId>=812):
select value,time from sensor_value where time > '2017-05-21 04:47:48' and deviceId=812;
I hope this helps.
2 million records is not much for MySQL; it is normal to get results in less than 1 second even from 1 billion records if you do the right things.

Is it better to force index usage for an ORDER BY?

I'm currently trying to optimize a query generated by Doctrine 2 on this table:
CREATE TABLE `publication` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`global_order` int(11) NOT NULL,
`title` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
`slug` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
`type` varchar(7) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_AF3C6779B12CE9DB` (`global_order`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The query is
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
type is a discriminator column added by Doctrine. Although the WHERE clause is useless because type is always one of the IN values, I cannot remove it.
EXPLAIN shows me
+------+---------------+------+------+-----------------------------+
| type | possible_keys | key | rows | Extra |
+------+---------------+------+------+-----------------------------+
| ALL | NULL | NULL | 562 | Using where; Using filesort |
+------+---------------+------+------+-----------------------------+
(rows is different each time I execute the query)
After some reading I found I can force an index usage like this:
ALTER TABLE `publication` DROP INDEX `UNIQ_AF3C6779B12CE9DB` ,
ADD UNIQUE `UNIQ_AF3C6779B12CE9DB` ( `global_order` , `type` )
and
SELECT *
FROM publication
FORCE INDEX(UNIQ_AF3C6779B12CE9DB)
WHERE global_order > 0
AND type IN ('article', 'event', 'work')
ORDER BY global_order DESC
The WHERE clause is still useless, but this time EXPLAIN shows me
+-------+-----------------------+-----------------------+------+-------------+
| type | possible_keys | key | rows | Extra |
+-------+-----------------------+-----------------------+------+-------------+
| range | UNIQ_AF3C6779B12CE9DB | UNIQ_AF3C6779B12CE9DB | 499 | Using where |
+-------+-----------------------+-----------------------+------+-------------+
It seems better to me, but forcing an index doesn't seem to be common practice, so I wonder whether it is really efficient for such a simple query.
Does anyone know the better way to perform this query?
Thanks!
If your query really is:
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
... and all entries (or nearly all) will match the IN clause, you're actually better off with no index at all. If you toss in a LIMIT clause, then the index you'll want is on global_order alone, without the type field. The reason for this is that it actually costs something to read an index.
If you're going for the entire table, sequentially reading the table and sorting its rows in memory will be your cheapest plan. If you only need a few rows and most will match the WHERE clause, going for the smallest index will do the trick.
To understand why, picture the disk IO involved.
Suppose you want the whole table without an index. To do this, you read data_page1, data_page2, data_page3, etc., visiting the various disk pages involved in order, until you reach the end of the table. You then sort and return.
If you want the top 5 rows without an index, you'd sequentially read the entire table as before, while heap-sorting the top 5 rows. Admittedly, that's a lot of reading and sorting for a handful of rows.
Suppose, now, that you want the whole table with an index. To do this, you read index_page1, index_page2, etc., sequentially. This then leads you to visit, say, data_page3, then data_page1, then data_page3 again, then data_page2, etc., in a completely random order (the order in which the sorted rows happen to sit in the data). The IO involved makes it cheaper to just read the whole mess sequentially and sort the grab bag in memory.
If you merely want the top 5 rows of an indexed table, in contrast, using the index becomes the correct strategy. In the worst case scenario you load 5 data pages in memory and move on.
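To make that contrast concrete (a hedged illustration; it relies on the existing UNIQ_AF3C6779B12CE9DB index on global_order, and the LIMIT value is arbitrary):
SELECT *
FROM publication
ORDER BY global_order DESC
LIMIT 5;
-- With a small LIMIT, MySQL can walk the global_order index backwards,
-- fetch five rows, and stop: no filesort and only a handful of page reads.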
A good SQL query planner, btw, will make its decision on whether to use an index or not based on how fragmented your data is. If fetching rows in order means zooming back and forth across the table, a good planner may decide that it's not worth using the index. In contrast, if the table is clustered using that same index, the rows are guaranteed to be in order, increasing the likelihood that it'll get used.
But then, if you join the same query with another table and that other table has an extremely selective where clause that can use a small index, the planner might decide it's actually better to, e.g. fetch all IDs of rows that are tagged as foo, hash join them with publications, and heap sort them in memory.
MySQL tries to determine the best way to run a given query, and decides whether or not to use indexes based on what it thinks is the best.
It isn't always correct. Sometimes manually forcing a query to use an index is faster; sometimes it's not.
If you run some testing with sample data in your specific situation, you should be able to see which method performs faster, and stick with that one.
Make sure you take into account query caching to get an accurate performance benchmark.
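One hedged way to do that on MySQL versions that still have the query cache (it was removed in MySQL 8.0) is the SQL_NO_CACHE modifier, so repeated runs measure real work rather than cache hits:
SELECT SQL_NO_CACHE *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC;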
Forcing the use of an index is rarely the best answer. In general it is better to create and/or optimize the indices (indexes) so that MySQL chooses to use them. (It is even better to optimize the queries, but I understand you cannot do that here.)
When you are using something like Doctrine where you cannot optimize the queries and the indices don't help, your best bet is to focus on query caching. :-)

MySQL composite indexing with tenant_id

We have a multitenant application with a table of 129 fields that can all be used in WHERE and ORDER BY clauses. I have spent 5 days now trying to find the best indexing strategy for us; I have gained a lot of knowledge, but I still have some questions.
1) When creating an index, should I always make it a composite index with tenant_id in first place? (All queries have tenant_id = ? in their WHERE clause.)
2) Since all the columns can be used in both the WHERE clause and the ORDER BY clause, should I create an index on them all? (Right now, when I order by a column that has no index, it takes 6s to execute for a tenant that has about 1,500,000 rows.)
3) Should I make the PK (tenant_id, ID)? But wouldn't this affect joins to that table?
Any advice on how to handle this would be much appreciated.
======
The database engine is InnoDB
=======
structure :
ID bigint(20) auto_increment primary
tenant_id int(11)
created_by int(11)
created_on Timestamp
updated_by int(11)
updated_on Timestamp
owner_id int(11)
first_name VARCHAR(60)
last_name VARCHAR(60)
.
.
.
(some 120 other columns that are all searchable)
A few brief answers to the questions. As far as I can see, you are confused about how to use indexes.
Consideration 1 - create an index on a column only if its selectivity is high, i.e. the ratio
(number of UNIQUE entries in the column) / (total number of entries in the column) ~= 1
That is, the count of DISTINCT values in that column is high.
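A quick way to check that ratio (a sketch; your_table stands in for your actual table name):
SELECT COUNT(DISTINCT tenant_id) / COUNT(*) AS selectivity
FROM your_table;  -- a value close to 1.0 marks a good index candidate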
Creating an extra index always adds overhead for the MySQL server, so you MUST NOT turn every column into an index. There is also a limit on the number of indexes a single table can have: 64 per table.
Now, since your tenant_id is present in all the search queries, you should consider it for an index or a composite key, provided that:
Consideration 2 - the number of UPDATEs is smaller than the number of SELECTs on tenant_id.
Consideration 3 - the indexes should be as small as possible in terms of data types. You MUST NOT make a VARCHAR(64) an index.
http://www.mysqlperformanceblog.com/2012/08/16/mysql-indexing-best-practices-webinar-questions-followup/
Point to Note 1 - even if you do declare a column an index, the MySQL optimizer may still not consider it the best plan of query execution. So always use EXPLAIN to know what's going on. http://www.mysqlperformanceblog.com/2009/09/12/3-ways-mysql-uses-indexes/
Point to Note 2 - you may want to cache your search queries, so remember not to use non-deterministic expressions such as NOW() in your SELECT queries.
Lastly - making the PK (tenant_id, ID) should not affect the joins on your table.
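A hedged sketch of that PK change (your_table is a placeholder; because ID is AUTO_INCREMENT it must remain the first column of some index, hence the extra KEY):
ALTER TABLE your_table
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (tenant_id, ID),
  ADD KEY (ID);  -- keeps AUTO_INCREMENT valid and still serves joins on ID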
And an awesome link to answer all your questions in general - http://www.percona.com/files/presentations/WEBINAR-MySQL-Indexing-Best-Practices.pdf

MySQL query slow querying table on primary key

So I have a table that's being used basically like a NoSQL setup. The structure is:
id bigint primary key
data mediumblob
modified timestamp
It has around 350k rows. The queries that run on it are all structured as follows:
select data from table where id=XXX;
The table engine is InnoDB. I'm noticing that queries run against this table are sometimes rather slow; sometimes they take 3 seconds to run. The table is 3 GB on disk, and I set innodb_buffer_pool_size to 4G.
Is there anything I'm missing here? Are there any settings I can tweak to improve performance?
Edit: As requested explain output:
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | cache | const | PRIMARY | PRIMARY | 8 | const | 1 | |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+-------+
create table:
CREATE TABLE `cache` (
`id` bigint(20) unsigned NOT NULL DEFAULT '0',
`data` mediumblob,
`modified` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
There are two issues that I see here initially. First, your query retrieves a blob data type, which causes speed issues in data retrieval. Second, you are using InnoDB, which is optimized for writing; while it is probably the best choice overall, in extreme read situations it might be less performant than MyISAM. Neither of these issues is necessarily a deal-killer, but each adds a performance hit. Beyond this, however, I'm not sure I can give you a good answer without having you do profiling first. That is what I would recommend you do: profile your query to figure out what the execution plan is, and then identify why that execution plan is so slow.
Here is a good "Top 10" list of MySQL optimizations. At least a couple apply in your situation directly:
http://20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
Here is another good optimization article that goes into server settings as well (for InnoDB specifically):
http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/
Based on the CREATE TABLE statement you provided, I thought of another thing you should address (again, not a query-killer, but another performance hit). Unless there is a business case for using a BIGINT for your ID field, choose an INT instead. An INT allows 2.1 billion values, so you shouldn't run out of numbers. Making this switch will save disk space and improve query performance. Here is an article about it:
http://ronaldbradford.com/blog/bigint-v-int-is-there-a-big-deal-2008-07-18/
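A hedged sketch of that change (it assumes no existing id exceeds the INT range; the ALTER rewrites the table, so run it in a maintenance window):
ALTER TABLE cache
  MODIFY `id` INT UNSIGNED NOT NULL DEFAULT '0';  -- 4 bytes per row instead of 8, in the PK and on disk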
Try using the minimum size of id possible. If it's a numeric key that you know will never be larger than a few million, you could use a MEDIUMINT UNSIGNED and save yourself a byte per record compared to an INT, which might speed up searches a little. Still, 3 GB is an awful lot for just 350,000 rows.
It sounds like you might also get some bang for your buck by using the partitioning feature to split your table up into logical units. You might want to Google "mysql vertical partitioning" in particular; if there are large columns that you don't access frequently, it would be much more efficient to move them out into a separate table and only query it when you need it.
Could you post your CREATE TABLE statement as well as the output of EXPLAIN select data from table where id=XXX? How is the IO wait on the system?
My best guess is that you're IO-bound, and because the rows aren't all the same size, it's having to search through the data. You have enough memory that it should be able to keep the data cached. This link describes some low-level profiling in MySQL that might be helpful:
http://dev.mysql.com/tech-resources/articles/using-new-query-profiler.html
Things I would look for:
- When do the slow queries appear?
- Is it after a fresh start of the DB? Then this might be just a temporary problem: queries hitting a cold cache.
- Is it during a DB dump/load? Then change your backup policies: use replication, for example, or add more disk IO (more disks in RAID, switching disks to SSDs, repartitioning your system across multiple disks, etc.).
- Is it during peak read/write times? Replication might also help here: write to the master and load-balance the reads between the master and slaves.
Also - is that mediumblob really necessary there?