I have a database of just 5 million rows, but inner joins and IN clauses are taking too much time (55-60 seconds), so I am checking whether there is a problem with my MyISAM settings.
Query: SHOW STATUS LIKE 'key%'
+------------------------+-------------+
| Variable_name | Value |
+------------------------+-------------+
| Key_blocks_not_flushed | 0 |
| Key_blocks_unused | 275029 |
| Key_blocks_used | 3316428 |
| Key_read_requests | 11459264178 |
| Key_reads | 3385967 |
| Key_write_requests | 91281692 |
| Key_writes | 27930218 |
+------------------------+-------------+
Please give me your suggestions for increasing the performance of MyISAM.
I have worked with a database of more than 45 GB and also faced performance issues. Here are some of the steps I took to improve performance.
(1) Remove any unnecessary indexes on the table, paying particular attention to UNIQUE indexes as these disable change buffering. Don't use a UNIQUE index if you have no reason for that constraint; prefer a regular INDEX.
(2) Insert rows in primary-key order where possible; out-of-order inserts cause more page splits, which hurt most on tables that do not fit in memory. Bulk loading is not specifically related to table size, but it does help reduce redo log pressure.
(3) If bulk loading a fresh table, delay creating any indexes besides the PRIMARY KEY. If you create them once all the data is loaded, InnoDB can apply a pre-sorted bulk build, which is both faster and typically results in more compact indexes (see the sketch after this list). This optimization became available in MySQL 5.5.
(4) Make sure to use InnoDB instead of MyISAM. MyISAM can be faster at inserts to the end of a table, but InnoDB uses row-level locking whereas MyISAM uses table-level locking.
(5) Try to avoid complex SELECT queries on MyISAM tables that are updated frequently, and order your conditions so that the first one filters out as many rows as possible.
(6) For MyISAM tables that change frequently, try to avoid variable-length columns (VARCHAR, BLOB, and TEXT). The table uses the dynamic row format if it includes even a single variable-length column.
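As a sketch of point (3), here is a hypothetical fresh table and CSV file (table name, columns, and file path are all made up): only the PRIMARY KEY exists during the load, and the secondary index is added afterwards so InnoDB can build it with a sorted bulk build.
-- Hypothetical staging table: only the PRIMARY KEY exists during the load.
CREATE TABLE events_staging (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  device_sn CHAR(30) NOT NULL,
  gps_date DATETIME NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;
LOAD DATA INFILE '/tmp/events.csv'
  INTO TABLE events_staging
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  (device_sn, gps_date);
-- Secondary index is created only after the load, so it is bulk-built from sorted data.
ALTER TABLE events_staging ADD INDEX idx_device_date (device_sn, gps_date);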
We have a big table with the following table structure:
CREATE TABLE `location_data` (
`id` int(20) NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` char(30) NOT NULL,
`data` char(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` double(30,10) DEFAULT NULL,
`lng` double(30,10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dt` (`dt`),
KEY `data` (`data`),
KEY `device_sn` (`device_sn`,`data`,`dt`),
KEY `device_sn_2` (`device_sn`,`dt`)
) ENGINE=MyISAM AUTO_INCREMENT=721453698 DEFAULT CHARSET=latin1
We often run queries such as the following:
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' ORDER BY dt DESC LIMIT 1;
OR
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' AND dt >= '2014-01-01 00:00:00' AND dt <= '2014-01-01 23:00:00' ORDER BY dt DESC;
We have been optimizing this in a few ways:
By adding an index and using FORCE INDEX on device_sn.
Separating the table into multiple tables based on the date (e.g. location_data_20140101), pre-checking whether there is data for a given date, and then querying that particular table alone (see the sketch below). These tables are created by a cron job once a day, and the data in location_data for that date is then deleted.
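Roughly, that daily cron step looks something like the following (a sketch only; the date boundaries and the copy/delete handling are assumptions):
CREATE TABLE location_data_20140101 LIKE location_data;
INSERT INTO location_data_20140101
  SELECT * FROM location_data
  WHERE dt >= '2014-01-01 00:00:00' AND dt < '2014-01-02 00:00:00';
DELETE FROM location_data
  WHERE dt >= '2014-01-01 00:00:00' AND dt < '2014-01-02 00:00:00';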
The table location_data is HIGH WRITE and LOW READ.
However, at times the query still runs really slowly. I wonder if there are other methods, or ways to restructure the data, that would allow us to read data in sequential date order for a given device_sn.
Any tips are more than welcomed.
EXPLAIN STATEMENT 1ST QUERY:
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| 1 | SIMPLE | location_dat | ref | data,device_sn,device_sn_2 | device_sn | 50 | const,const | 1 | Using where |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
EXPLAIN STATEMENT 2nd QUERY:
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | test_udp_new | range | dt,data,device_sn,device_sn_2 | dt | 4 | NULL | 1 | Using where |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
The index device_sn (device_sn,data,dt) is good. MySQL should use it without needing any FORCE INDEX. You can verify this by running EXPLAIN SELECT ...
However, your table is MyISAM, which only supports table-level locks. If the table is written to heavily, it may be slow. I would suggest converting it to InnoDB.
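If you do decide to convert, a minimal sketch is a single ALTER (note that this rebuilds the table, so expect downtime and extra disk space on a table with 700M+ rows):
ALTER TABLE location_data ENGINE=InnoDB;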
OK, I'll provide the info that I know; this might not answer your question directly, but it could provide some insight.
There exist certain differences between InnoDB and MyISAM. Setting aside full-text indexing and spatial indexes, the huge difference is in how they operate.
InnoDB has several great features compared to MyISAM.
First off, it can store the data set it works with in RAM. This is why database servers come with a lot of RAM: so that I/O operations can be done quickly. For example, an index scan is faster if the indexes are in RAM rather than on HDD, because finding data on HDD is several orders of magnitude slower than doing it in RAM. The same applies to full table scans.
The variable that controls this when using InnoDB is called innodb_buffer_pool_size. By default it's 8 MB, if I am not mistaken. I personally set this value high, sometimes even up to 90% of available RAM. Usually, when this value is tuned properly, people experience incredible speed gains.
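A sketch of checking and raising it (the 8 GB figure is only an example; on versions before MySQL 5.7.5 the buffer pool cannot be resized at runtime, so the value has to go into my.cnf followed by a restart):
-- Current size, in bytes.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Raise it to 8 GB (online resize requires MySQL 5.7.5+).
SET GLOBAL innodb_buffer_pool_size = 8589934592;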
The other thing is that InnoDB is a transactional engine. That means it will tell you whether a write to disk succeeded or failed, and that answer will be 100% correct. MyISAM can't do that, because it doesn't force the OS to make the HDD commit data permanently. The OS tries to optimize the write operation, using the HDD's buffers to accumulate larger chunks of data and then flushing them in a single I/O, and the drive can lose that buffered data before it reaches the platters. That's why records are sometimes lost with MyISAM: it thinks the data is written because the OS said it was, when in reality you have no control over how (or whether) the data was actually written down.
With InnoDB you can start a transaction, execute say 100 INSERT queries, and then commit. That effectively forces the hard drive to flush all 100 inserts at once, using one I/O. If each INSERT is 4 KB long, 100 of them is 400 KB. That means you'll use 400 KB of your disk's bandwidth with a single I/O operation, and the remaining I/O capacity is available for other uses. This is how inserts get optimized.
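A minimal illustration of that pattern (the table and values here are made up):
START TRANSACTION;
INSERT INTO events (device_sn, data) VALUES ('A1', 'location');
INSERT INTO events (device_sn, data) VALUES ('A2', 'location');
-- ... more INSERTs ...
COMMIT;  -- all of the rows above are made durable with a single commit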
Next are indexes with low cardinality. Cardinality is the number of unique values in an indexed column; for a primary key every value is unique, so its cardinality equals the row count, the highest possible value. Indexes with low cardinality are on columns with only a few distinct values, such as a yes/no flag. If an index's cardinality is too low, MySQL will prefer a full table scan, which is MUCH quicker. Also, forcing an index that MySQL doesn't want to use could (and probably will) slow things down: with an indexed search, MySQL processes records one by one, whereas with a table scan it can read multiple records at once and avoid that per-row handling. If those records were written sequentially on a mechanical disk, further optimizations are possible.
TL;DR:
use InnoDB on a server where you can allocate sufficient RAM
set the value of innodb_buffer_pool_size large enough so you can allocate more resources for faster querying
use an SSD if possible
try to wrap multiple INSERTs into transactions so you can better utilize your hard drive's bandwidth and I/O
avoid indexing columns that have low unique value count compared to row count - they just waste space (though there are exceptions to this)
I previously asked a question on how to analyse large datasets (how can I analyse 13 GB of data). One promising response was to add the data to a MySQL database using natural keys and thereby make use of InnoDB's clustered indexing.
I've added the data to the database with a schema that looks like this:
TorrentsPerPeer
+----------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| ip | int(10) unsigned | NO | PRI | NULL | |
| infohash | varchar(40) | NO | PRI | NULL | |
+----------+------------------+------+-----+---------+-------+
The two fields together form the primary key.
This table represents known instances of peers downloading torrents. I'd like to be able to report how many torrents can be found at each peer. I'm going to draw a histogram of how frequently each torrent count occurs (e.g. 20 peers have 2 torrents, 40 peers have 3, ...).
I've written the following query:
SELECT `count`, COUNT(`ip`)
FROM (SELECT `ip`, COUNT(`infohash`) AS `count`
FROM TorrentsPerPeer
GROUP BY `ip`) AS `counts`
GROUP BY `count`;
Here's the EXPLAIN for the sub-select:
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_length | ref | rows | Extra |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| 1 | SIMPLE | TorrentPerPeer | index | [Null] | PRIMARY | 126 | [Null] | 79262772 | Using index |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
I can't seem to run an EXPLAIN for the full query because it takes way too long. This bug suggests that's because MySQL runs the subquery first.
This query is currently running (and has been for an hour). top is reporting that mysqld is only using ~5% of the available CPU whilst its RSIZE is steadily increasing. My assumption here is that the server is building temporary tables in RAM that it's using to complete the query.
My question then is: how can I improve the performance of this query? Should I change the query somehow? I've been altering the server settings in the my.cnf file to increase the InnoDB buffer pool size; should I change any other values?
If it matters, the table is 79'262'772 rows deep and takes up ~8 GB of disk space. I'm not expecting this to be an easy query; maybe 'patience' is the only reasonable answer.
EDIT: Just to add that the query has finished and it took 105 minutes. That's not unbearable; I'm just hoping for some improvements.
My hunch is that with an unsigned int plus a varchar(40) (especially the varchar!) you now have a HUGE primary key, and it is making your index file too big to fit in whatever RAM you have allocated to innodb_buffer_pool_size. This forces InnoDB to swap index pages from disk as it searches, which means a LOT of disk seeks and not much CPU work.
One thing I did for a similar issue was to use something in between a truly natural key and a surrogate key. We took the two fields that are actually unique (one of which was also a varchar) and, in the application layer, built a fixed-width MD5 hash and used THAT as the key. Yes, it means more work for the app, but it makes for a much smaller index file since you are no longer using an arbitrary-length field.
OR, you could just use a server with tons of RAM and see if that makes the index fit in memory but I always like to make 'throw hardware at it' a last resort :)
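For illustration, here is what the hash-key idea could look like if done in SQL instead of the application layer (the table name, column sizes, and values are assumptions); the point is that a fixed-width BINARY(16) primary key is far narrower than the (int, varchar(40)) composite:
CREATE TABLE TorrentsPerPeerHashed (
  peer_key  BINARY(16) NOT NULL,
  ip        INT UNSIGNED NOT NULL,
  infohash  VARCHAR(40) NOT NULL,
  PRIMARY KEY (peer_key)
) ENGINE=InnoDB;
-- MD5 of the two natural-key fields, stored as 16 raw bytes.
INSERT INTO TorrentsPerPeerHashed (peer_key, ip, infohash)
VALUES (UNHEX(MD5(CONCAT(1684300900, ':', 'aabbccdd'))), 1684300900, 'aabbccdd');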
I have a statistics table which grows at a high rate (around 25M rows/day) that I'd like to optimize for selects. The table fits in memory, and the server has plenty of spare memory (32 GB; the table is 4 GB).
My simple roll-up query is:
EXPLAIN select FROM_UNIXTIME(FLOOR(endtime/3600)*3600) as ts,sum(numevent1) as success , sum(numevent2) as failure from stats where endtime > UNIX_TIMESTAMP()-3600*96 group by ts order by ts;
+----+-------------+--------------+------+---------------+------+---------+------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+------+----------+----------------------------------------------+
| 1 | SIMPLE | stats | ALL | ts | NULL | NULL | NULL | 78238584 | Using where; Using temporary; Using filesort |
+----+-------------+--------------+------+---------------+------+---------+------+----------+----------------------------------------------+
Stats is an InnoDB table, and there is a normal index on endtime. How should I optimize this?
Note: I do plan on adding roll-up tables, but currently this is what I'm stuck with, and I'm wondering if it's possible to fix it without additional application code.
I've been doing local tests. Try the following:
alter table stats add index (endtime, numevent1, numevent2);
And remove the ORDER BY, as it should be implicit in the GROUP BY (I guess the parser just ignores the ORDER BY in this case, but just in case :)
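With that index in place, the roll-up becomes a covering-index query; this is the same statement as above, just without the ORDER BY (the EXPLAIN should then at least show "Using index" instead of a full table scan):
SELECT FROM_UNIXTIME(FLOOR(endtime/3600)*3600) AS ts,
       SUM(numevent1) AS success,
       SUM(numevent2) AS failure
FROM stats
WHERE endtime > UNIX_TIMESTAMP() - 3600*96
GROUP BY ts;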
Since you are using InnoDB you can also try the following:
a) Change innodb_buffer_pool_size to 24 GB (requires a server restart). This ensures that your whole table can be loaded into memory and will therefore speed up sorting even as the table grows larger.
b) Enable innodb_file_per_table, which causes InnoDB to place each new table in its own tablespace file. This requires you to drop the existing table and recreate it (see the sketch after this list).
c) Use the smallest column types that can fit the data. Without seeing the actual column definitions and some sample rows, I cannot provide specific ideas. Can you provide a sample schema and maybe 5 rows of data?
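For (b), a rough sketch (innodb_file_per_table is a dynamic variable on recent MySQL versions; an existing table only moves into its own .ibd file once it is rebuilt or recreated):
SET GLOBAL innodb_file_per_table = ON;
-- Rebuild the table so it gets its own tablespace file.
ALTER TABLE stats ENGINE=InnoDB;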
I'm troubleshooting a query performance problem. Here's an expected query plan from explain:
mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:16';
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| 1 | SIMPLE | table1 | range | tdcol | tdcol | 8 | NULL | 5437848 | Using where |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
That makes sense, since the index named tdcol (KEY tdcol (tdcol)) is used, and about 5M rows should be selected from this query.
However, if I query for just one more minute of data, we get this query plan:
mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:17';
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| 1 | SIMPLE | table1 | ALL | tdcol | NULL | NULL | NULL | 381601300 | Using where |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)
The optimizer believes that the scan will be better, but it's over 70x more rows to examine, so I have a hard time believing that the table scan is better.
Also, the 'USE KEY tdcol' syntax does not change the query plan.
Thanks in advance for any help, and I'm more than happy to provide more info/answer questions.
5 million index probes could well be more expensive (lots of random disk reads, potentially more complicated synchronization) than reading all 350 million rows (sequential disk reads).
This case might be an exception, because presumably the order of the timestamps roughly matches the order of the inserts into the table. But unless the index on tdcol is a "clustered" index (meaning that the database ensures that the order in the underlying table matches the order in tdcol), it's unlikely that the optimizer knows this.
In the absence of that order-correlation information, it is reasonable to assume that the 5 million rows you want are roughly evenly distributed among the 350 million rows, and thus that the index approach would involve reading most or nearly all of the pages of the underlying table anyway, in which case the scan will be much less expensive than the index approach: fewer reads outright, and sequential rather than random reads.
MySQL's query optimizer has a cutoff when deciding whether to use an index. As you've correctly identified, MySQL has decided a table scan will be faster than using the index, and won't be dissuaded from its decision. The irony is that when the key range matches more than about a third of the table, it is probably right. So why in this case?
I don't have an answer, but I have a suspicion that MySQL doesn't have enough memory to explore the index. I would be looking at the server memory settings, particularly the InnoDB buffer pool and some of the other key storage pools.
What's the distribution of your data like? Try running min(), avg(), and max() on it to see where it lies. It's possible that that one extra minute makes the difference in how much data is contained in the range.
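Something like the following would show it (a sketch of that check using the column and table names from the question, plus a count of the problematic range):
SELECT MIN(tdcol), MAX(tdcol) FROM table1;
SELECT COUNT(*) FROM table1
WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:17';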
It could also just be the background settings of InnoDB. There are a few factors like page size and memory, as staticsan said. You may want to explicitly define a B+Tree index.
"so I have a hard time believing that the table scan is better."
True. YOU have a hard time believing it. But the optimizer seems not to.
I won't pronounce on your being "right" versus your optimizer being "right". But optimizers do as they do, and, all in all, their "intellectual" capacity must still be considered as being fairly limited.
That said, do your database statistics show a MAX value (for this column) that happens to be equal to the "one minute more" value?
If so, then the optimizer might have concluded that all rows satisfy the upper limit anyway, and might have decided to proceed differently compared to the case where it has to conclude "oh, there are definitely some rows that won't satisfy the upper limit, so I'll use the index just to be on the safe side".
I have a table watchlist that today contains almost 3 million records.
mysql> select count(*) from watchlist;
+----------+
| count(*) |
+----------+
| 2957994 |
+----------+
It is used as a log to record product-page views on a large e-commerce site (50,000+ products). It records the productID of the viewed product, the IP address and USER_AGENT of the viewer, and a timestamp of when it happened:
mysql> show columns from watchlist;
+-----------+--------------+------+-----+-------------------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+-------------------+-------+
| productID | int(11) | NO | MUL | 0 | |
| ip | varchar(16) | YES | | NULL | |
| added_on | timestamp | NO | MUL | CURRENT_TIMESTAMP | |
| agent | varchar(220) | YES | MUL | NULL | |
+-----------+--------------+------+-----+-------------------+-------+
The data is then reported on several pages throughout the site on both the back-end (e.g. checking what GoogleBot is indexing), and front-end (e.g. a side-bar box for "Recently Viewed Products" and a page showing users what "People from your region also liked" etc.).
So that these "report" pages and side-bars load quickly I put indexes on relevant fields:
mysql> show indexes from watchlist;
+-----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| watchlist | 1 | added_on | 1 | added_on | A | NULL | NULL | NULL | | BTREE | |
| watchlist | 1 | productID | 1 | productID | A | NULL | NULL | NULL | | BTREE | |
| watchlist | 1 | agent | 1 | agent | A | NULL | NULL | NULL | YES | BTREE | |
+-----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
Without the INDEXES, pages with the side-bar for example would spend about 30-45sec executing a query to get the 7 most-recent ProductIDs. With the indexes it takes <0.2sec.
The problem is that with the INDEXES the product pages themselves are taking longer and longer to load because as the table grows the write operations are taking upwards of 5sec. In addition there is a spike on the mysqld process amounting to 10-15% of available CPU each time a product page is viewed (roughly once every 2sec). We already had to upgrade the server hardware because on a previous server it was reaching 100% and caused mysqld to crash.
My plan is to attempt a 2-table solution. One table for INSERT operations, and another for SELECT operations. I plan to purge the INSERT table whenever it reaches 1000 records using a TRIGGER, and copy the oldest 900 records into the SELECT table. The report pages are a mixture of real-time (recently viewed) and analytics (which region), but the real-time pages tend to only need a handful of fresh records while the analytical pages don't need to know about the most recent trend (i.e. last 1000 views). So I can use the small table for the former and the large table for the latter reports.
My question: Is this an ideal solution to this problem?
Also: With TRIGGERS in MySQL is it possible to nice the trigger_statement so that it takes longer, but doesn't consume much CPU? Would running a cron job every 30min that is niced, and which performs the purging if required be a better solution?
Write operations for a single row into a data table should not take 5 seconds, regardless how big the table gets.
Is your clustered index based on the timestamp field? If not, it should be, so you're not writing into the middle of your table somewhere. Also, make sure you are using InnoDB tables - MyISAM is not optimized for writes.
I would propose writing into two tables: one long-term table, one short-term reporting table with little or no indexing, which is then dumped as needed.
Another solution would be to use memcached or an in-memory database for the live reporting data, so there's no hit on the production database.
One more thought: exactly how "live" must either of these reports be? Perhaps retrieving a new list on a timed basis versus once for every page view would be sufficient.
A quick fix might be to use the INSERT DELAYED syntax, which allows MySQL to queue the inserts and execute them when it has time. That's probably not a very scalable solution though.
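For illustration, against the watchlist schema above (INSERT DELAYED only applies to MyISAM-style engines and was deprecated and later removed in newer MySQL versions; the values here are made up):
INSERT DELAYED INTO watchlist (productID, ip, agent)
VALUES (12345, '192.168.0.10', 'Mozilla/5.0 (compatible; Googlebot/2.1)');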
I actually think that the principles of what you are attempting are sound, although I wouldn't use a trigger. My suggested solution would be to let the data accumulate for a day and then purge it to the secondary log table with a batch script that runs at night. This is mainly because such frequent transfers of a thousand rows would still put a rather heavy load on the server, and because I don't really trust the MySQL trigger implementation (although that isn't based on any real substance).
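A rough sketch of such a nightly batch, assuming an archive table with the same structure as watchlist (the archive table name is made up; in practice this would be wrapped in a script run from cron):
INSERT INTO watchlist_archive
  SELECT * FROM watchlist WHERE added_on < CURDATE();
DELETE FROM watchlist WHERE added_on < CURDATE();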
Instead of optimizing indexes, you could offload the database writes entirely. You could delegate writing to a background process via an asynchronous queue (ActiveMQ, for example). Inserting a message into an ActiveMQ queue is very fast. We are using ActiveMQ and see about 10-20K insert operations on our test platform (and that is a single-threaded test application, so you could get more).
Look into 'shadow tables'; when reconstructing tables this way, you don't need to write to the production table.
I had the same issue even when using InnoDB tables, or MyISAM which, as mentioned before, is not optimized for writes. I solved it by using a second table for writing temporary data, which periodically updates the huge master table. The master table (over 18 million records) is used only for reading, with results written to the second, small table.
The problem is that an insert/update into the big master table takes a while, and it gets worse if there are several updates or inserts waiting in the queue, even with the INSERT DELAYED or UPDATE [LOW_PRIORITY] options enabled.
To make it even faster, read the small secondary table first when looking up a record; if the record is there, work on the secondary table only. Use the big master table only for reference and for picking up new records: if the data is not in the small secondary table, read the record from the master (reads are fast on both InnoDB and MyISAM) and then insert it into the small secondary table.
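A sketch of that read-through pattern (table and column names here are illustrative; the "was the row found?" decision lives in application code):
-- 1) Look in the small, fast secondary table first.
SELECT * FROM work_small WHERE record_id = 42;
-- 2) Only if nothing came back, pull the row from the big master table
--    and cache it in the small table for next time.
INSERT INTO work_small
  SELECT * FROM master_big WHERE record_id = 42;
SELECT * FROM work_small WHERE record_id = 42;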
It works like a charm: reading from the huge 20-million-record master takes much less than 5 seconds, and reading from and writing to the second small table of 100K to 300K records takes less than a second.
This works just fine.
Something that often helps when doing bulk loads is to drop any indexes, do the bulk load, then recreate the indexes. This is generally much faster than the database having to constantly update the index for each and every row inserted.
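As a sketch, using the watchlist table from the earlier question (the index names are taken from its SHOW INDEXES output; the load step itself is illustrative):
ALTER TABLE watchlist DROP INDEX added_on, DROP INDEX productID, DROP INDEX agent;
LOAD DATA INFILE '/tmp/watchlist.csv' INTO TABLE watchlist;
ALTER TABLE watchlist
  ADD INDEX added_on (added_on),
  ADD INDEX productID (productID),
  ADD INDEX agent (agent);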