We have a big table with the following table structure:
CREATE TABLE `location_data` (
`id` int(20) NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` char(30) NOT NULL,
`data` char(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` double(30,10) DEFAULT NULL,
`lng` double(30,10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dt` (`dt`),
KEY `data` (`data`),
KEY `device_sn` (`device_sn`,`data`,`dt`),
KEY `device_sn_2` (`device_sn`,`dt`)
) ENGINE=MyISAM AUTO_INCREMENT=721453698 DEFAULT CHARSET=latin1
We frequently run queries such as the following:
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' ORDER BY dt DESC LIMIT 1;
OR
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' AND dt >= '2014-01-01 00:00:00' AND dt <= '2014-01-01 23:00:00' ORDER BY dt DESC;
We have been optimizing this in a few ways:
Adding an index and using FORCE INDEX on device_sn.
Splitting the data into per-date tables (e.g. location_data_20140101): we first check whether data exists for a given date and then query that daily table alone. A cron job creates each daily table once a day and deletes the corresponding rows from location_data.
The table location_data is HIGH WRITE and LOW READ.
However, the query occasionally runs very slowly. I wonder whether there are other methods, or ways to restructure the data, that would let us read data in sequential date order for a given device_sn.
Any tips are more than welcome.
EXPLAIN STATEMENT 1ST QUERY:
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| 1 | SIMPLE | location_dat | ref | data,device_sn,device_sn_2 | device_sn | 50 | const,const | 1 | Using where |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
EXPLAIN STATEMENT 2nd QUERY:
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | test_udp_new | range | dt,data,device_sn,device_sn_2 | dt | 4 | NULL | 1 | Using where |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
The index device_sn (device_sn, data, dt) is good. MySQL should use it without any FORCE INDEX; you can verify this by running EXPLAIN SELECT ....
However, your table uses MyISAM, which only supports table-level locks. If the table is write-heavy, it may be slow. I would suggest converting it to InnoDB.
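As a minimal sketch (assuming you can afford the rebuild time and extra disk space on a table this size), the conversion is a single statement, and afterwards you can confirm the index choice:

-- Rebuilds the entire table; expect this to take a long time with ~721M ids.
ALTER TABLE location_data ENGINE=InnoDB;

-- Verify the composite index is picked without FORCE INDEX:
EXPLAIN SELECT * FROM location_data
WHERE device_sn = 'XXX' AND data = 'location'
ORDER BY dt DESC LIMIT 1;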
OK, I'll share what I know; this might not answer your question directly, but it could provide some insight.
There are certain differences between InnoDB and MyISAM. Setting aside full-text indexing and spatial indexes, the big difference is in how they operate.
InnoDB has several great features compared to MyISAM.
First off, it can keep the data set it works with in RAM. This is why database servers come with a lot of RAM - so that I/O operations can be done quickly. For example, an index scan is faster if the index is in RAM rather than on an HDD, because finding data on an HDD is several orders of magnitude slower than doing it in RAM. The same applies to full table scans.
The variable that controls this for InnoDB is called innodb_buffer_pool_size. By default it's 8 MB, if I am not mistaken. I personally set this value high, sometimes even up to 90% of available RAM on a dedicated database server. Usually, when this value is tuned properly, people see incredible speed gains.
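For illustration only (the right size depends on how much RAM the machine has and what else runs on it), the setting lives in my.cnf; newer MySQL versions can also resize the pool at runtime:

-- In my.cnf, under [mysqld]:
--   innodb_buffer_pool_size = 8G   -- e.g. a large share of RAM on a dedicated DB server

-- Check the current value and how well the pool is being used:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW STATUS LIKE 'Innodb_buffer_pool_read%';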
The other thing is that InnoDB is a transactional engine: it will tell you whether a write to disk succeeded or failed, and that answer will be correct. MyISAM won't do that, because it doesn't force the OS (and the OS doesn't force the drive) to commit data permanently. That's why records are sometimes lost with MyISAM: it thinks the data is written because the OS said it was, when in reality the OS tried to optimize the write and the drive's buffer may lose the data before it is actually persisted. The OS batches writes in the drive's buffers and flushes larger chunks in a single I/O, which means you don't have control over how the data is actually written.
With InnoDB you can start a transaction, execute, say, 100 INSERT queries and then commit. That effectively forces the drive to flush all 100 inserts at once, using one I/O operation. If each INSERT is 4 KB, 100 of them is 400 KB. That means you use 400 KB of your disk's bandwidth with a single I/O operation, and the remaining I/O capacity is available for other uses. This is how inserts can be optimized.
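A minimal sketch of that pattern against the location_data table (the values are placeholders):

-- Group many small inserts into one transaction so they are flushed together.
START TRANSACTION;
INSERT INTO location_data (device_sn, data, gps_date, lat, lng)
VALUES ('SN001', 'location', '2014-01-01 10:00:00', 3.1390000000, 101.6869000000);
-- ... repeat for the rest of the batch ...
COMMIT;

-- A single multi-row INSERT achieves a similar effect:
INSERT INTO location_data (device_sn, data, gps_date, lat, lng)
VALUES ('SN001', 'location', '2014-01-01 10:00:00', 3.1390000000, 101.6869000000),
       ('SN002', 'location', '2014-01-01 10:00:05', 3.1411000000, 101.6932000000);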
Next are indexes with low cardinality - cardinality is the number of unique values in an indexed column. For a primary key every value is unique, so its cardinality equals the row count, which is the highest possible value. Indexes with low cardinality are columns with only a few distinct values, such as a yes/no flag. If an index's cardinality is too low, MySQL will prefer a full table scan - it's MUCH quicker. Also, forcing an index that MySQL doesn't want to use could (and probably will) slow things down: with an indexed lookup MySQL processes records one by one, whereas with a table scan it can read many records at once with less per-record overhead. If those records were written sequentially on a mechanical disk, further optimizations are possible.
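You can check MySQL's cardinality estimates for each index directly; a quick sketch:

-- Cardinality here is MySQL's estimate of the number of distinct values in the index.
SHOW INDEX FROM location_data;

-- Or compute it exactly for a suspect column:
SELECT COUNT(DISTINCT data) AS distinct_values, COUNT(*) AS total_rows
FROM location_data;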
TL;DR:
use InnoDB on a server where you can allocate sufficient RAM
set innodb_buffer_pool_size large enough that your working set of data and indexes fits in RAM
use an SSD if possible
try to wrap multiple INSERTs into transactions so you can better utilize your hard drive's bandwidth and I/O
avoid indexing columns that have low unique value count compared to row count - they just waste space (though there are exceptions to this)
Related
We are currently evaluating MySQL for one of our use cases related to analytics.
The table schema is somewhat like this:
CREATE TABLE IF NOT EXISTS `analytics`(
`dt` DATE,
`dimension1` BIGINT UNSIGNED,
`dimension2` BIGINT UNSIGNED,
`metrics1` BIGINT UNSIGNED,
`metrics2` BIGINT UNSIGNED,
INDEX `baseindex` (`dimension1`,`dt`)
);
Since most queries filter on dimension1 and the date, we felt that a composite index would be the best way to optimize lookups.
With this table schema in mind, we ran an EXPLAIN on our typical query:
explain
select dimension2,dimension1
from analytics
where dimension1=1123 and dt between '2016-01-01' and '2016-01-30';
The query above returns:
+----+-------------+-----------+------+---------------+-----------+---------+-------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+-----------+---------+-------------+------+-----------------------+
| 1 | SIMPLE | analytics | ref | baseindex | baseindex | 13 | const,const | 1 | Using index condition |
+----+-------------+-----------+------+---------------+-----------+---------+-------------+------+-----------------------+
This looks good so far, as it indicates that the index is being used.
However, we thought we could optimize this a bit further: since most of our lookups are for the current month or for a specific month, we felt that date partitioning would further improve performance.
The table was later modified to add partitions by month:
ALTER TABLE analytics
PARTITION BY RANGE( TO_DAYS(`dt`))(
PARTITION JAN2016 VALUES LESS THAN (TO_DAYS('2016-02-01')),
PARTITION FEB2016 VALUES LESS THAN (TO_DAYS('2016-03-01')),
PARTITION MAR2016 VALUES LESS THAN (TO_DAYS('2016-04-01')),
PARTITION APR2016 VALUES LESS THAN (TO_DAYS('2016-05-01')),
PARTITION MAY2016 VALUES LESS THAN (TO_DAYS('2016-06-01')),
PARTITION JUN2016 VALUES LESS THAN (TO_DAYS('2016-07-01')),
PARTITION JUL2016 VALUES LESS THAN (TO_DAYS('2016-08-01')),
PARTITION AUG2016 VALUES LESS THAN (TO_DAYS('2016-09-01')),
PARTITION SEPT2016 VALUES LESS THAN (TO_DAYS('2016-10-01')),
PARTITION OCT2016 VALUES LESS THAN (TO_DAYS('2016-11-01')),
PARTITION NOV2016 VALUES LESS THAN (TO_DAYS('2016-12-01')),
PARTITION DEC2016 VALUES LESS THAN (TO_DAYS('2017-01-01'))
);
With the partition in place, the same query now returns the following results
+----+-------------+-----------+-------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table     | type  | possible_keys | key       | key_len | ref  | rows | Extra       |
+----+-------------+-----------+-------+---------------+-----------+---------+------+------+-------------+
|  1 | SIMPLE      | analytics | range | baseindex     | baseindex | 13      | NULL | 1    | Using where |
+----+-------------+-----------+-------+---------------+-----------+---------+------+------+-------------+
Now the "Extra" column show that its switching to where instead of using index condition.
We have not noticed any performance boost or degradation, so curios to know how does adding a partition changes the value inside the extra column
This is too long for a comment.
MySQL partitions both the data and the indexes. So the effect for your query is that it accesses a smaller index, which refers to fewer data pages.
Why don't you see a performance boost? Well, looking up rows in a smaller index is negligible savings (although there might be some savings for the first query from a cold start because the index has to be loaded into memory).
I am guessing that the data you are looking for is relatively small -- say, the records come from a handful of data pages. Well, fetching a handful of data pages from a partition is pretty much the same thing as fetching a handful of data pages from the full table.
Does this mean that partitioning is useless? Not at all. For one thing, the partitioned data and indexes are much smaller than the overall table. So you have a savings in memory on the server side - and this can be a big win on a busy server.
In general, though, partitions really shine when you have queries that don't fully use indexes. The smaller data sizes in each partition often make such queries more efficient.
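If you want to confirm that pruning actually happens, EXPLAIN can show which partitions a query touches (EXPLAIN PARTITIONS up to MySQL 5.6; from 5.7 plain EXPLAIN includes a partitions column):

EXPLAIN PARTITIONS
SELECT dimension2, dimension1
FROM analytics
WHERE dimension1 = 1123 AND dt BETWEEN '2016-01-01' AND '2016-01-30';
-- The partitions column should list only JAN2016 if pruning works.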
Use NOT NULL (wherever appropriate).
Don't use BIGINT (8 bytes) unless you really need huge numbers. Dimension ids can usually fit in SMALLINT UNSIGNED (0..64K, 2 bytes) or MEDIUMINT UNSIGNED (0..16M, 3 bytes).
Yes, INDEX(dim1, dt) is optimal for that one SELECT.
No, PARTITIONing will not help for that SELECT.
PARTITION BY RANGE(TO_DAYS(..)) is excellent if you intend to delete old data. But there is rarely any other benefit.
Use InnoDB.
Explicitly specify the PRIMARY KEY. It will be important in the discussion below.
When working with huge databases, it is a good idea to "count the disk hits". So, let's analyze your query.
INDEX(dim1, dt) with WHERE dim1 = a AND dt BETWEEN x AND y will:
1. If partitioned, prune down to the partition(s) representing x..y.
2. Drill down in the index's BTree to [a,x]. With partitioning the BTree might be one level shallower, but that saving is lost to the pruning in step 1.
3. Scan forward until [a,y]. If only one partition is involved, this scan hits exactly the same number of blocks whether partitioned or not. If multiple partitions are needed, there is some extra overhead.
4. For each row, use the PRIMARY KEY to reach over into the data to get dim2. Again, virtually the same amount of effort. Without knowing the engine and PRIMARY KEY, I cannot finish discussing this step.
If (dim1, dim2, dt) is unique, make it the PK. In that case, INDEX(dim1, dt) is actually (dim1, dt, dim2), since the PK is included in every secondary index. That means step 4 really uses a 'covering' index; that is, there is no extra work to reach dim2 (zero extra disk hits).
If, on the other hand, you SELECT metrics..., then step 4 does involve the effort mentioned.
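Putting those suggestions together, a sketch of the revised table might look like this; the column sizes and the uniqueness of (dimension1, dt, dimension2) are assumptions you would need to confirm:

CREATE TABLE analytics (
  dt         DATE               NOT NULL,
  dimension1 MEDIUMINT UNSIGNED NOT NULL,
  dimension2 MEDIUMINT UNSIGNED NOT NULL,
  metrics1   BIGINT UNSIGNED    NOT NULL,
  metrics2   BIGINT UNSIGNED    NOT NULL,
  -- If (dimension1, dt, dimension2) is unique, this PK makes the sample
  -- SELECT a covering lookup, with no secondary index needed.
  PRIMARY KEY (dimension1, dt, dimension2)
) ENGINE=InnoDB;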
I have a database of just 5 million rows, but inner joins and IN are taking too much time (55-60 seconds), so I am checking whether there is a problem with my MyISAM settings.
Query: SHOW STATUS LIKE 'key%'
+------------------------+-------------+
| Variable_name | Value |
+------------------------+-------------+
| Key_blocks_not_flushed | 0 |
| Key_blocks_unused | 275029 |
| Key_blocks_used | 3316428 |
| Key_read_requests | 11459264178 |
| Key_reads | 3385967 |
| Key_write_requests | 91281692 |
| Key_writes | 27930218 |
+------------------------+-------------+
Please give me your suggestions for improving MyISAM performance.
I have worked with a database of more than 45 GB and also faced performance issues.
Here are some steps I have taken to improve performance:
(1) Remove any unnecessary indexes on the table, paying particular attention to UNIQUE indexes as these disable change buffering. Don't use a UNIQUE index if you have no reason for that constraint; prefer a regular INDEX.
(2) Insert rows in key order: this results in fewer page splits (which hurt most on tables that do not fit in memory). Bulk loading is not specifically related to table size, but it helps reduce redo log pressure.
(3) If bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them once all the data is loaded, InnoDB can apply a pre-sorted bulk build, which is both faster and typically results in more compact indexes. This optimization became available in MySQL 5.5. (See the sketch after this list.)
(4) Make sure to use InnoDB instead of MyISAM, although MyISAM can be faster at inserting to the end of a table. InnoDB uses row-level locking while MyISAM uses table-level locking.
(5) Try to avoid complex SELECT queries on MyISAM tables that are updated frequently, and put the most selective condition first so fewer rows have to be examined.
(6) For MyISAM tables that change frequently, try to avoid variable-length columns (VARCHAR, BLOB, and TEXT). The table uses the dynamic row format if it includes even a single variable-length column.
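A sketch of step (3), using hypothetical table and file names; load with only the PRIMARY KEY in place, then add secondary indexes afterwards:

-- Hypothetical staging table: only the primary key exists during the load.
CREATE TABLE big_import (
  id   BIGINT UNSIGNED NOT NULL,
  col1 INT UNSIGNED    NOT NULL,
  col2 VARCHAR(64)     NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Bulk load (a batch of multi-row INSERTs works too).
LOAD DATA INFILE '/tmp/big_import.csv'
INTO TABLE big_import
FIELDS TERMINATED BY ',';

-- Add secondary indexes once the data is in, so InnoDB can build them by sorting.
ALTER TABLE big_import ADD INDEX idx_col1 (col1), ADD INDEX idx_col2 (col2);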
I have a performance issue while handling a billion records with SELECT queries. I have a table like this:
CREATE TABLE `temp_content_closure2` (
`parent_label` varchar(2000) DEFAULT NULL,
`parent_code_id` bigint(20) NOT NULL,
`parent_depth` bigint(20) NOT NULL DEFAULT '0',
`content_id` bigint(20) unsigned NOT NULL DEFAULT '0',
KEY `code_content` (`parent_code_id`,`content_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY KEY (parent_depth)
PARTITIONS 20 */
I used partitioning, which should improve performance by subdividing the table, but it is not useful in my case. A sample of the data in this table:
+----------------+----------------+--------------+------------+
| parent_label | parent_code_id | parent_depth | content_id |
+----------------+----------------+--------------+------------+
| Taxonomy | 20000 | 0 | 447 |
| Taxonomy | 20000 | 0 | 2286 |
| Taxonomy | 20000 | 0 | 3422 |
| Taxonomy | 20000 | 0 | 5916 |
+----------------+----------------+--------------+------------+
Here content_id is unique with respect to parent_depth, so I used parent_depth as the partitioning key. At every depth I have 2,577,833 rows to handle, so partitioning is not useful here. I got an idea from some websites to use the ARCHIVE storage engine, but it does full table scans and does not use an index for SELECTs. Basically, 99% of my use of this table is SELECT queries, and the table grows every day. Currently I am on MySQL version 5.0.1. I have also heard about NoSQL databases, but is there any way to handle this in MySQL? If you are suggesting NoSQL, which should I use: Cassandra or Accumulo?
Add an index like this:
ALTER TABLE temp_content_closure2 ADD INDEX content_id (content_id);
You can also add further indexes if you have more specific SELECT criteria, which will also speed things up.
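For example, if your SELECTs filter on content_id together with parent_depth (an assumption about your query shape), a composite index along these lines might help:

-- Assumed query shape: WHERE content_id = ? AND parent_depth = ?
ALTER TABLE temp_content_closure2
  ADD INDEX content_depth (content_id, parent_depth);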
Multiple and single indexes
Overall though, if you have a table like this that is growing so fast, then you should probably be looking at restructuring your SQL design.
Check out "Big Data" solutions as well.
With that size and volume of data, you'd need to either set up a sharded MySQL setup across a cluster of machines (Facebook and Twitter have stored massive amounts of data on sharded MySQL setups, so it is possible), or alternatively use a Bigtable-style solution that natively distributes the data among nodes in various clusters - Cassandra and HBase are the most popular alternatives here. You must realize that a billion records on a single machine will hit almost every limit of the system: I/O first, followed by memory, followed by CPU. It's simply not feasible.
If you do go the Bigtable way, Cassandra will be the quickest to set up and test. However, if you anticipate map-reduce-type analytic needs, then HBase is more tightly integrated with the Hadoop ecosystem and should work out well. Performance-wise, they are neck and neck, so take your pick.
I have a generated query:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from MessageDetails messagedet0_, MessageEntry messageent1_, MailSourceFile mailsource2_
where messagedet0_.id=messageent1_.messageDetails_id
and messageent1_.mailSourceFile_id=mailsource2_.id
order by mailsource2_.file, messageent1_.mboxOffset;
EXPLAIN says that there are no full scans and indexes are used:
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys |key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | mailsource2_ | index | PRIMARY |file | 384 | NULL | 1445 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | messageent1_ | ref | msf_idx,md_idx,FKBBB258CB60B94D38,FKBBB258CBF7C835B8 |msf_idx | 9 | skryb.mailsource2_.id | 2721 | Using where |
| 1 | SIMPLE | messagedet0_ | eq_ref | PRIMARY |PRIMARY | 8 | skryb.messageent1_.messageDetails_id | 1 | |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
CREATE TABLE `mailsourcefile` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`file` varchar(127) COLLATE utf8_bin DEFAULT NULL,
`size` bigint(20) DEFAULT NULL,
`archive_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `file` (`file`),
KEY `File_idx` (`file`),
KEY `Archive_idx` (`archive_id`),
KEY `FK7C3F816ECDB9F63C` (`archive_id`),
CONSTRAINT `FK7C3F816ECDB9F63C` FOREIGN KEY (`archive_id`) REFERENCES `archive` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1370 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
Also, I have indexes on file and mboxOffset. SHOW FULL PROCESSLIST says that MySQL is sorting the result, and it takes more than a few minutes. The result set size is 5M records. How can I optimize this?
I don't think there is much optimization to do in the query itself. Explicit joins would make it more readable, but IIRC MySQL nowadays is perfectly able to detect this kind of construct and plan the joins itself.
What would probably help is to increase both tmp_table_size and max_heap_table_size to allow the result set of this query to remain in memory, rather than having to write it to disk.
The maximum size for in-memory temporary tables is the minimum of the tmp_table_size and max_heap_table_size values
http://dev.mysql.com/doc/refman/5.5/en/internal-temporary-tables.html
The "using temporary" in the explain denotes that it is using a temporary table (see the link above again) - which will probably be written to disk due to the large amount of data (again, see the link above for more on this).
The file column alone is anywhere between 1 and 384 bytes, so let's take half of that for our estimate and ignore the rest of the columns; that gives 192 bytes per row in the result set.
1445 * 2721 = 3,931,845 rows
* 192 bytes = 754,914,240 bytes
/ 1024 ~= 737,221 KB
/ 1024 ~= 720 MB
This is certainly more than the default max_heap_table_size (16,777,216 bytes) and most likely more than the tmp_table_size.
Not having to write such a result to disk will most certainly increase speed.
Good luck!
Optimization is always tricky. In order to make a dent in your execution time, I think you probably need to do some sort of pre-cooking.
If the file names are similar (e.g. /path/to/file/1, /path/to/file/2), sorting them will mean a lot of byte comparisons, probably compounded by the Unicode encoding. I would calculate a hash of the filename on insertion (e.g. MD5()) and then sort using that.
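A sketch of that idea, with a hypothetical file_hash column (note that sorting by the hash groups rows by file but no longer orders them alphabetically by file name):

-- Hypothetical column: a fixed-width binary hash of the file name.
ALTER TABLE mailsourcefile
  ADD COLUMN file_hash BINARY(16) NULL,
  ADD INDEX file_hash_idx (file_hash);

-- Backfill existing rows (and set it on every insert in the application):
UPDATE mailsourcefile SET file_hash = UNHEX(MD5(file));

-- Then ORDER BY mailsource2_.file_hash, messageent1_.mboxOffset in the query.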
If the files are already well distributed (e.g. postfix spool names), you probably need to come up with some scheme on insertion whereby either:
simply reading records from some joined table will automatically generate them in correct order; this may not save a lot of time, but it will give you some data quickly so you can start processing, or
find a way to provide a "window" on the data so that not all of it needs to be processed at once.
As @raheel shan said above, you may want to try some JOINs:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from
MessageDetails messagedet0_
inner join
MessageEntry messageent1_
on
messagedet0_.id = messageent1_.messageDetails_id
inner join
MailSourceFile mailsource2_
on
messageent1_.mailSourceFile_id = mailsource2_.id
order by
mailsource2_.file,
messageent1_.mboxOffset
My apologies if the keys are off, but I think I've conveyed the point.
Write the query with joins, like:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from
MessageDetails m0
inner join
MessageEntry m1
on
m0.id = m1.messageDetails_id
inner join
MailSourceFile m2
on
m1.mailSourceFile_id = m2.id
order by
m2.file,
m1.mboxOffset;
On seeing your EXPLAIN I found three things which, in my opinion, are not good:
1. the filesort in the Extra column
2. the "index" access type in the type column
3. the key length of 384
If you reduce the key length you may get quicker retrieval; for that, consider the character set you use and partial (prefix) indexes.
Here you can use FORCE INDEX FOR ORDER BY and USE INDEX FOR JOIN (create appropriate multi-column indexes and assign them). Remember it is always better to ORDER BY columns from a single table.
The "index" access type means it is scanning the entire index, which is not good.
I previously asked a question on how to analyse large datasets (how can I analyse 13GB of data). One promising response was to add the data into a MySQL database using natural keys and thereby make use of InnoDB's clustered indexing.
I've added the data to the database with a schema that looks like this:
TorrentsPerPeer
+----------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| ip | int(10) unsigned | NO | PRI | NULL | |
| infohash | varchar(40) | NO | PRI | NULL | |
+----------+------------------+------+-----+---------+-------+
The two fields together form the primary key.
This table represents known instances of peers downloading torrents. I'd like to be able to report how many torrents can be found at each peer. I'm going to draw a histogram of the frequencies with which I see given numbers of torrents (e.g. 20 peers have 2 torrents, 40 peers have 3, ...).
I've written the following query:
SELECT `count`, COUNT(`ip`)
FROM (SELECT `ip`, COUNT(`infohash`) AS `count`
FROM TorrentsPerPeer
GROUP BY `ip`) AS `counts`
GROUP BY `count`;
Here's the EXPLAIN for the sub-select:
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_length | ref | rows | Extra |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| 1 | SIMPLE | TorrentPerPeer | index | [Null] | PRIMARY | 126 | [Null] | 79262772 | Using index |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
I can't seem to do an EXPLAIN for the full query because it takes way too long. This bug suggests it's because it's running the sub query first.
This query is currently running (and has been for an hour). top is reporting that mysqld is only using ~5% of the available CPU whilst its RSIZE is steadily increasing. My assumption here is that the server is building temporary tables in RAM that it's using to complete the query.
My question is then: how can I improve the performance of this query? Should I change the query somehow? I've been altering the server settings in the my.cnf file to increase the InnoDB buffer pool size; should I change any other values?
If it matters, the table is 79,262,772 rows deep and takes up ~8 GB of disk space. I'm not expecting this to be an easy query; maybe 'patience' is the only reasonable answer.
EDIT: Just to add that the query has finished and took 105 minutes. That's not unbearable; I'm just hoping for some improvements.
My hunch is that with an unsigned int and a varchar(40) (especially the varchar!) you now have a HUGE primary key, and it is making your index too big to fit in whatever RAM you have for the InnoDB buffer pool. This makes InnoDB rely on disk to swap index pages as it searches, and that means a LOT of disk seeks and not much CPU work.
One thing I did for a similar issue is to use something in between a truly natural key and a surrogate key. We would take the two fields that are actually unique (one of which was also a varchar), make a fixed-width MD5 hash of them in the application layer, and use THAT as the key. Yes, it means more work for the app, but it makes for a much smaller index since you are no longer keying on an arbitrary-length field.
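A sketch of that approach for this table, using a hypothetical BINARY(16) hash of (ip, infohash) as the clustered key; the hash can be computed in the application or in SQL:

-- Hypothetical restructuring: a fixed-width hash replaces the wide composite PK.
CREATE TABLE TorrentsPerPeerHashed (
  pair_hash BINARY(16)   NOT NULL,  -- e.g. UNHEX(MD5(CONCAT(ip, ':', infohash)))
  ip        INT UNSIGNED NOT NULL,
  infohash  VARCHAR(40)  NOT NULL,
  PRIMARY KEY (pair_hash),
  KEY ip_idx (ip)                   -- still useful for the GROUP BY ip aggregation
) ENGINE=InnoDB;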
OR, you could just use a server with tons of RAM and see if that makes the index fit in memory but I always like to make 'throw hardware at it' a last resort :)