I am working on an e-shop which sells products only via loans. I display 10 products per page in any category, and each product has 3 different price tags for 3 different loan types. Everything went well during testing and query execution time was fine, but today, when I transferred the changes to the production server, the site "collapsed" in about 2 minutes. The query that selects the loan types frequently hangs for ~10 seconds, so the server can't keep up and the site is painfully slow. The table that stores the data has approximately 2 million records, and each select looks like this:
SELECT *
FROM products_loans
WHERE KOD IN("X17/Q30-10", "X17/12", "X17/5-24")
AND 369.27 BETWEEN CENA_OD AND CENA_DO;
There are 3 loan types, and the price has to fall between CENA_OD and CENA_DO, so 3 rows are returned.
But since I need to display 10 products per page, I have to run a modified select using OR, because I didn't find any other solution. I asked about it here before but got no answer. As mentioned in the referenced post, this has to be done separately, since there is no column that could be used in a join (except, of course, price and code, but that ended very, very badly). Here is the SHOW CREATE TABLE; KOD and CENA_OD/CENA_DO are indexed via plain indexes.
CREATE TABLE `products_loans` (
`KOEF_ID` bigint(20) NOT NULL,
`KOD` varchar(30) NOT NULL,
`AKONTACIA` int(11) NOT NULL,
`POCET_SPLATOK` int(11) NOT NULL,
`koeficient` decimal(10,2) NOT NULL default '0.00',
`CENA_OD` decimal(10,2) default NULL,
`CENA_DO` decimal(10,2) default NULL,
`PREDAJNA_CENA` decimal(10,2) default NULL,
`AKONTACIA_SUMA` decimal(10,2) default NULL,
`TYP_VYHODY` varchar(4) default NULL,
`stage` smallint(6) NOT NULL default '1',
PRIMARY KEY (`KOEF_ID`),
KEY `CENA_OD` (`CENA_OD`),
KEY `CENA_DO` (`CENA_DO`),
KEY `KOD` (`KOD`),
KEY `stage` (`stage`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Selecting all loan types and filtering them later through PHP doesn't work well either, since each type has over 50k records and that select also takes too long...
Any ideas for improving the speed are appreciated.
Edit:
Here is the explain
+----+-------------+----------------+-------+---------------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------+-------+---------------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | products_loans | range | CENA_OD,CENA_DO,KOD | KOD | 92 | NULL | 190158 | Using where |
+----+-------------+----------------+-------+---------------------+------+---------+------+--------+-------------+
I have tried the combined index and it improved the performance on the test server from 0.44 sec to 0.06 sec. I can't access the production server from home, though, so I will have to try it there tomorrow.
Your issue is that you are searching for intervals that contain a point (rather than the more usual query for all points within an interval). Such queries do not work well with a standard B-tree index, so you need an R-Tree index instead. Unfortunately MySQL doesn't let you create an R-Tree index on an ordinary column, but you can get the desired index by changing the column type to GEOMETRY and using the geometric functions to check whether the interval contains the point.
See Quassnoi's article Adjacency list vs. nested sets: MySQL where he explains this in more detail. The use case is different, but the techniques involved are the same. Here's an extract from the relevant part of the article:
There is also a certain class of tasks that require searching for all ranges containing a known value:
Searching for an IP address in the IP range ban list
Searching for a given date within a date range
and several others. These tasks can be improved by using R-Tree capabilities of MySQL.
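A minimal sketch of that GEOMETRY approach against this table (the price_range column and index name are made up here, and note that before MySQL 5.7 SPATIAL indexes required MyISAM, so the engine may have to change as well):
-- Encode each (CENA_OD, CENA_DO) price range as a short line segment
-- so that an R-Tree (SPATIAL) index can answer "which ranges contain this price?".
ALTER TABLE products_loans ADD COLUMN price_range GEOMETRY NULL;

UPDATE products_loans
   SET price_range = LineString(Point(CENA_OD, -1), Point(CENA_DO, 1));

ALTER TABLE products_loans
      MODIFY price_range GEOMETRY NOT NULL,
      ADD SPATIAL INDEX sp_price_range (price_range);

-- "369.27 BETWEEN CENA_OD AND CENA_DO" becomes a bounding-box containment test:
SELECT *
  FROM products_loans
 WHERE KOD IN ("X17/Q30-10", "X17/12", "X17/5-24")
   AND MBRContains(price_range, Point(369.27, 0));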
Try to refactor your query like:
SELECT * FROM products_loans
WHERE KOD IN("X17/Q30-10", "X17/12", "X17/5-24")
AND CENA_OD <= 369.27
AND CENA_DO >= 369.27;
(mysql is not very smart when choosing indexes) and check the performance.
The next try is to add a combined key - (KOD,CENA_OD,CENA_DO)
And the next major step is to refactor your database so that products are separated from prices. This should really help.
PS: you could also migrate to PostgreSQL; it's smarter than MySQL at choosing the right indexes.
MySQL will generally use only one index per table in a query. If you always look rows up by those 3 columns, then depending on the actual data distribution (ranges) in the columns, one of the following could very well add a serious amount of performance:
ALTER TABLE products_loans ADD INDEX(KOD, CENA_OD, CENA_DO);
ALTER TABLE products_loans ADD INDEX(CENA_OD, CENA_DO, KOD);
Notice that the order of the columns matters! If that doesn't improve performance, give us the EXPLAIN output of the query.
SELECT ticker.ticker_id,
ticker.ticker_code,
inter_day_ticker_candle_price_history.close AS previousDayClose
FROM inter_day_ticker_candle_price_history
INNER JOIN
(SELECT MAX(inter_day_ticker_candle_price_history.candle_price_history_id) AS candle_price_history_id
FROM inter_day_ticker_candle_price_history
WHERE inter_day_ticker_candle_price_history.close>0
GROUP BY inter_day_ticker_candle_price_history.ticker_id) derivedTable
ON inter_day_ticker_candle_price_history.candle_price_history_id = derivedTable.candle_price_history_id
RIGHT JOIN ticker ON ticker.ticker_id = inter_day_ticker_candle_price_history.ticker_id
WHERE ticker.is_active = 1
Kindly suggest any other technique I can apply here to reduce the query time.
This is the table structure:
Field Type Null Key Default Extra
----------------------- ------------- ------ ------ ------- ----------------
candle_price_history_id int(8) NO PRI (NULL) auto_increment
ticker_id bigint(11) NO MUL (NULL)
candle_interval int(11) YES 1
trade_date datetime YES (NULL)
trade_price decimal(16,2) YES (NULL)
trade_size decimal(30,2) YES (NULL)
open decimal(16,2) YES (NULL)
high decimal(16,2) YES (NULL)
low decimal(16,2) YES (NULL)
close decimal(16,2) YES (NULL)
volume bigint(30) YES (NULL)
creation_date datetime YES (NULL)
is_ebabled bit(1) YES b'1'
It would look more natural to select from the ticker table first, then find the latest history entry and then join that:
SELECT
t.ticker_id,
t.ticker_code,
h.close AS previousDayClose
FROM ticker t
LEFT JOIN
(
SELECT ticker_id, MAX(candle_price_history_id) AS candle_price_history_id
FROM inter_day_ticker_candle_price_history
WHERE close > 0
GROUP BY ticker_id
) m on m.ticker_id = t.ticker_id
LEFT JOIN inter_day_ticker_candle_price_history h
ON h.candle_price_history_id = m.candle_price_history_id
WHERE t.is_active = 1;
However, your query should also work.
Make sure to have appropriate indexes. I'd suggest:
create index idx_ticker on ticker(is_active,
ticker_id,
ticker_code);
and
create index idx_history on inter_day_ticker_candle_price_history(ticker_id,
close,
candle_price_history_id);
or
create index idx_history on inter_day_ticker_candle_price_history(close,
ticker_id,
candle_price_history_id);
(The order of columns may make a difference, so you may want to try both versions for the history index. Well, you can of course create both indexes at the same time with different names and see which one gets used.)
Generally, creating the appropriate indexes will speed up queries with multiple filtering conditions.
For instance, creating an index on ticker_id may be the key to making the query faster.
On the other hand, indexes on close and is_active could help, but only if the rows with is_active = 1 amount to roughly 10% or less of the records in the table.
Also, you could replace the MAX() aggregation with an ORDER BY candle_price_history_id DESC LIMIT 1, since the table is already ordered by candle_price_history_id.
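As a sketch of that idea (assuming you only need the latest positive close per ticker), the MAX() derived table could be replaced by a correlated subquery that stops at the first matching row:
SELECT t.ticker_id,
       t.ticker_code,
       (SELECT h.close
          FROM inter_day_ticker_candle_price_history h
         WHERE h.ticker_id = t.ticker_id
           AND h.close > 0
         ORDER BY h.candle_price_history_id DESC
         LIMIT 1) AS previousDayClose
  FROM ticker t
 WHERE t.is_active = 1;
Whether this beats the GROUP BY version depends on how many active tickers there are, so it is worth comparing both with EXPLAIN.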
This seems to be a "groupwise max" problem. For optimization techniques for that pattern, see http://mysql.rjweb.org/doc.php/groupwise_max .
See my comments on shrinking the table size. I assume this could be a huge table, possibly bigger than will fit in RAM?
If this is InnoDB, innodb_buffer_pool_size needs to be about 70% of available RAM.
If (ticker_id, trade_date) is unique, then make it the PRIMARY KEY and get rid of id completely. The order is important -- this clusters all the rows for a given ticker together, thereby cutting down on I/O. (If you are currently I/O-bound, this may give you a 10-fold speedup.) A sketch of this change follows after these points.
Provide EXPLAIN SELECT .... The query (as written) needs to start with the derived table, and LEFT JOIN will not allow that.
Consider getting rid of inactive rows and rows with close <= 0.
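A hedged sketch of that primary-key change, only applicable if (ticker_id, trade_date) really is unique; queries that reference candle_price_history_id would have to be rewritten first:
ALTER TABLE inter_day_ticker_candle_price_history
      DROP PRIMARY KEY,
      DROP COLUMN candle_price_history_id,
      -- trade_date must contain no NULLs to be part of the primary key
      MODIFY trade_date datetime NOT NULL,
      ADD PRIMARY KEY (ticker_id, trade_date);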
I am working on a data analytics dashboard for a media content broadcasting company. Whenever a user clicks a certain channel, a log record is stored in a MySQL DB. The following table stores data about channel play times.
Here is the table structure:
Field                         Type
---------------------------   -----------
ID                            INT(11)
Channel_ID                    INT(11)
playing_date                  DATE
country_code                  VARCHAR(50)
playtime_in_sec               INT(11)
count_more_then_30_min_play   INT(11)
count_15_30_min_play          INT(11)
count_0_15_min_play
channel_report_tag            VARCHAR(50)
device_report_tag             VARCHAR(50)
genre_report_tag              VARCHAR(50)
The query that I run to build one of the dashboard graphs is:
SELECT
channel_report_tag,
SUM(count_more_then_30_min_play) AS '>30 minutes',
SUM(count_15_30_min_play) AS '15-30 Minutes',
SUM(count_0_15_min_play) AS '0-15 Minutes'
FROM
channel_play_times_cleaned
WHERE
playing_date BETWEEN '' AND ''
AND country_code LIKE ''
AND device_report_tag LIKE ''
AND channel_report_tag LIKE ''
GROUP BY
channel_report_tag
LIMIT 10
This query is taking a lot of time to return the result set (the table exceeds a million records per day and keeps growing every second). I came across this Stack Overflow question: What generic techniques can be applied to optimize SQL queries?, which mentions employing indexes as one of the techniques for optimizing SQL queries. At the moment I am confused about how to apply indexes (i.e. on which columns) in order to optimize the above query. I would be very grateful if someone could offer help in creating indexes for my specific scenario. Any other expert advice for a beginner like me is surely welcome.
EDIT :
As suggested by @Thomas G, I have tried to improve my query and make it more specific:
SELECT
channel_report_tag,
SUM(count_more_then_30_min_play) AS '>30 minutes',
SUM(count_15_30_min_play) AS '15-30 Minutes',
SUM(count_0_15_min_play) AS '0-15 Minutes'
FROM
channel_play_times_cleaned
WHERE
playing_date BETWEEN '' AND ''
AND country_code = 'US'
AND device_report_tag = 'j8'
AND channel_report_tag = 'NAT GEO'
GROUP BY
channel_report_tag
LIMIT 10
I started writing this as a comment because these are hints rather than a clear answer, but it got way too long.
First of all, it is common sense (though not an absolute rule) to index the columns appearing in the WHERE clause:
playing_date BETWEEN '' AND ''
AND country_code LIKE ''
AND device_report_tag LIKE ''
AND channel_report_tag LIKE ''
If a column has very low cardinality (only a handful of distinct values - is that the case for your tag columns?), indexing it on its own probably won't help much. country_code and playing_date should be indexed.
The other issue here is all the LIKEs in your query. That operator is a performance killer, and you are using it on 3 columns. That's awful for the database. So the question is: is that really needed?
For instance, I see no obvious reason to use LIKE on a country code. Will you really query like this:
AND country_code LIKE 'U%'
To retrieve UK and US ??
You probably won't. Chances are high that you will know the countries you are searching for, so you should do this instead:
AND country_code IN ('UK','US')
Which will be a lot faster if the country column is indexed
Next, if you really want LIKE-style matching on your 2 tag columns, you can try this instead of a LIKE:
AND MATCH(device_report_tag) AGAINST ('anything*' IN BOOLEAN MODE)
It is also possible to index your tag columns as FULLTEXT, especially if you search with LIKE 'anything%'. If you search with LIKE '%anything%', the index probably won't help much.
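For what it's worth, a sketch of that FULLTEXT setup (the index name is made up, and InnoDB FULLTEXT indexes need MySQL 5.6 or later):
ALTER TABLE channel_play_times_cleaned
    ADD FULLTEXT INDEX ft_device_report_tag (device_report_tag);

SELECT channel_report_tag
  FROM channel_play_times_cleaned
 WHERE MATCH(device_report_tag) AGAINST ('anything*' IN BOOLEAN MODE);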
I should also mention that with millions of rows a day, you might have to PARTITION your table (on the date, for instance). And depending on your data, a composite index on the date and something else might help.
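Purely as an illustration of that partitioning idea (the boundaries are made up, and MySQL requires the partitioning column to be part of every unique key, so the primary key would likely have to become (ID, playing_date) first):
-- Hypothetical monthly RANGE partitioning on the date column:
ALTER TABLE channel_play_times_cleaned
    PARTITION BY RANGE (TO_DAYS(playing_date)) (
        PARTITION p2017_01 VALUES LESS THAN (TO_DAYS('2017-02-01')),
        PARTITION p2017_02 VALUES LESS THAN (TO_DAYS('2017-03-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );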
Really, there's no simple, straightforward answer to your complex question, especially given what you have shown (not a lot).
Separate indexes are not as useful as composite indexes. Unfortunately, you have many possible combinations, and you are (apparently) allowing wildcards, which may destroy the utility of indexes.
I suggest you use client code to build the WHERE clause rather than populating it with '' placeholders.
In composite indexes, put one range last. date BETWEEN ... AND ... is a "range".
LIKE 'abc' -- same as = 'abc', so why not change to that.
LIKE 'abc%' -- is a "range"
LIKE '%abc' -- can't use an index.
IN ('CA', 'TX') -- sometimes optimizes like '=', sometimes like 'range'.
So... Watch what queries the users ask for, then build composite indexes to satisfy them. Some rules:
At most one range, and put it last.
Put '=' column(s) first.
INDEX(a,b) is handled by INDEX(a,b,c), so include only the latter.
Don't have more than, say, a dozen indexes.
Index Cookbook
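Applied to the narrowed-down query above (equality on country_code, device_report_tag and channel_report_tag, plus a date range), those rules would suggest something along these lines (the index name is illustrative):
-- '=' columns first, the single range (the date) last.
ALTER TABLE channel_play_times_cleaned
    ADD INDEX idx_country_device_channel_date
        (country_code, device_report_tag, channel_report_tag, playing_date);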
We have a big table with the following table structure:
CREATE TABLE `location_data` (
`id` int(20) NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` char(30) NOT NULL,
`data` char(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` double(30,10) DEFAULT NULL,
`lng` double(30,10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dt` (`dt`),
KEY `data` (`data`),
KEY `device_sn` (`device_sn`,`data`,`dt`),
KEY `device_sn_2` (`device_sn`,`dt`)
) ENGINE=MyISAM AUTO_INCREMENT=721453698 DEFAULT CHARSET=latin1
We often perform queries such as the following:
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' ORDER BY dt DESC LIMIT 1;
OR
SELECT * FROM location_data WHERE device_sn = 'XXX' AND data = 'location' AND dt >= '2014-01-01 00:00:00 ' AND dt <= '2014-01-01 23:00:00' ORDER BY dt DESC;
We have been optimizing this in a few ways:
By adding an index and using FORCE INDEX on device_sn.
Splitting the table into multiple tables based on the date (e.g. location_data_20140101), pre-checking whether there is data for a given date, and then querying that daily table alone. Each daily table is created by cron once a day, and the data in location_data for that date is then deleted.
The table location_data is HIGH WRITE and LOW READ.
However, few times, the query is running really slow. I wonder if there are other methods / ways / restructure the data that allows us to read a data in sequential date manner based on a given device_sn.
Any tips are more than welcomed.
EXPLAIN STATEMENT 1ST QUERY:
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
| 1 | SIMPLE | location_dat | ref | data,device_sn,device_sn_2 | device_sn | 50 | const,const | 1 | Using where |
+----+-------------+--------------+------+----------------------------+-----------+---------+-------------+------+-------------+
EXPLAIN STATEMENT 2nd QUERY:
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | test_udp_new | range | dt,data,device_sn,device_sn_2 | dt | 4 | NULL | 1 | Using where |
+----+-------------+--------------+-------+-------------------------------+------+---------+------+------+-------------+
The index device_sn (device_sn, data, dt) is good. MySQL should use it without any need for FORCE INDEX. You can verify this by running EXPLAIN SELECT ...
However, your table is MyISAM, which only supports table-level locks. If the table is written to heavily, it may be slow. I would suggest converting it to InnoDB.
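The conversion itself is a single statement, though it rebuilds the whole table, so on a table with an AUTO_INCREMENT near 721 million expect it to take a long time:
ALTER TABLE location_data ENGINE=InnoDB;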
OK, I'll provide the info that I know; it might not answer your question directly, but it could provide some insight.
There are certain differences between InnoDB and MyISAM. Forget about full-text indexing or spatial indexes; the huge difference is in how they operate.
InnoDB has several great features compared to MyISAM.
First off, it can keep the data set it works with in RAM. This is why database servers come with a lot of RAM - so that I/O operations can be served quickly. For example, an index scan is faster if the indexes are in RAM rather than on an HDD, because finding data on an HDD is several orders of magnitude slower than doing it in RAM. The same applies to full table scans.
The variable that controls this when using InnoDB is called innodb_buffer_pool_size. By default it's 8 MB if I am not mistaken. I personally set this value high, sometimes even up to 90% of available RAM. Usually, when this value is optimized - a lot of people experience incredible speed gains.
The other thing is that InnoDB is a transactional engine. That means it will tell you whether a write to disk succeeded or failed, and that answer will be 100% correct. MyISAM can't do that, because it doesn't force the OS to make the HDD commit data permanently. That's why records are sometimes lost with MyISAM: it thinks the data is written because the OS said it was, when in reality the OS tried to optimize the write and the HDD's buffer might lose the data before it is actually written down. The OS optimizes write operations by collecting larger chunks of data in the HDD's buffers and then flushing them in a single I/O, which means you don't have control over how the data is actually written.
With InnoDB you can start a transaction, execute say 100 INSERT queries and then commit. That effectively forces the hard drive to flush all 100 inserts at once, using 1 I/O. If each INSERT is 4 KB long, 100 of them is 400 KB. That means you'll use 400 KB of your disk's bandwidth in 1 I/O operation, and the remaining I/O capacity stays available for other uses. This is how inserts are optimized.
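A minimal sketch of that batching pattern against this table, once it is converted to InnoDB (the values are made up):
-- Many inserts in one transaction are flushed together at COMMIT.
START TRANSACTION;
INSERT INTO location_data (device_sn, data, gps_date, lat, lng)
VALUES ('SN0001', 'location', '2014-01-01 00:00:01', 1.3521000000, 103.8198000000);
INSERT INTO location_data (device_sn, data, gps_date, lat, lng)
VALUES ('SN0001', 'location', '2014-01-01 00:00:11', 1.3522000000, 103.8199000000);
-- ... repeat for the rest of the batch ...
COMMIT;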
Next are indexes with low cardinality - cardinality is the number of unique values in an indexed column. For a primary key every value is unique, so its cardinality equals the number of rows, the highest possible. Indexes with low cardinality are columns with only a few distinct values, such as yes/no flags. If an index's cardinality is too low, MySQL will prefer a full table scan - it's MUCH quicker. Also, forcing an index that MySQL doesn't want to use could (and probably will) slow things down: with an indexed lookup MySQL processes records one by one, whereas with a table scan it can read multiple records at once and avoid that per-row processing. If those records were written sequentially on a mechanical disk, further optimizations are possible.
TL;DR:
use InnoDB on a server where you can allocate sufficient RAM
set the value of innodb_buffer_pool_size large enough so you can allocate more resources for faster querying
use an SSD if possible
try to wrap multiple INSERTs into transactions so you can better utilize your hard drive's bandwidth and I/O
avoid indexing columns that have low unique value count compared to row count - they just waste space (though there are exceptions to this)
I have a performance issue while handling a billion records with a select query. I have a table as follows:
CREATE TABLE `temp_content_closure2` (
`parent_label` varchar(2000) DEFAULT NULL,
`parent_code_id` bigint(20) NOT NULL,
`parent_depth` bigint(20) NOT NULL DEFAULT '0',
`content_id` bigint(20) unsigned NOT NULL DEFAULT '0',
KEY `code_content` (`parent_code_id`,`content_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY KEY (parent_depth)
PARTITIONS 20 */ |
I used partitioning, which should increase performance by subdividing the table, but it is not useful in my case. Here is some sample data from this table:
+----------------+----------------+--------------+------------+
| parent_label | parent_code_id | parent_depth | content_id |
+----------------+----------------+--------------+------------+
| Taxonomy | 20000 | 0 | 447 |
| Taxonomy | 20000 | 0 | 2286 |
| Taxonomy | 20000 | 0 | 3422 |
| Taxonomy | 20000 | 0 | 5916 |
+----------------+----------------+--------------+------------+
Here content_id is unique with respect to parent_depth, so I used parent_depth as the partitioning key. At every depth I have 2,577,833 rows to handle, so partitioning is not useful here. I got an idea from some websites to use the ARCHIVE storage engine, but it does full table scans and doesn't use indexes for selects. Basically 99% of my usage of this table is select queries, and the table grows every day. Currently I am on MySQL version 5.0.1. I have thought about using a NoSQL database, but is there any way to handle this in MySQL? And if you are suggesting NoSQL, should I use Cassandra or Accumulo?
Add an index like this:
ALTER TABLE temp_content_closure2 ADD INDEX content_id (content_id);
You can also add multiple indexes if you have more specific SELECT criteria which will also speed things up.
Multiple and single indexes
Overall, though, if you have a table like this that's growing so fast, then you should probably be looking at restructuring your SQL design.
Check out "Big Data" solutions as well.
With that size and volume of data, you'd need to either set up a sharded MySQL cluster across several machines (Facebook and Twitter have stored massive amounts of data on sharded MySQL setups, so it is possible), or use a Bigtable-style solution that natively distributes the data among nodes in various clusters - Cassandra and HBase are the most popular alternatives here. You must realize that a billion records on a single machine will hit almost every limit of the system: I/O first, followed by memory, followed by CPU. It's simply not feasible.
If you do go the Bigtable way, Cassandra will be the quickest to set up and test. However, if you anticipate map-reduce-style analytic needs, then HBase is more tightly integrated with the Hadoop ecosystem and should work out well. Performance-wise, they are neck and neck, so take your pick.
Database is MySQL with MyISAM engine.
Table definition:
CREATE TABLE IF NOT EXISTS matches (
id int(11) NOT NULL AUTO_INCREMENT,
game int(11) NOT NULL,
user int(11) NOT NULL,
opponent int(11) NOT NULL,
tournament int(11) NOT NULL,
score int(11) NOT NULL,
finish tinyint(4) NOT NULL,
PRIMARY KEY ( id ),
KEY game ( game ),
KEY user ( user ),
KEY i_gfu ( game , finish , user )
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3149047 ;
I have set an index on (game, finish, user) but this GROUP BY query still needs 0.4 - 0.6 seconds to run:
SELECT user AS player
, COUNT( id ) AS times
FROM matches
WHERE finish = 1
AND game = 19
GROUP BY user
ORDER BY times DESC
The EXPLAIN output:
| id | select_type | table   | type | possible_keys | key   | key_len | ref         | rows   | Extra                                        |
|  1 | SIMPLE      | matches | ref  | game,i_gfu    | i_gfu | 5       | const,const | 155855 | Using where; Using temporary; Using filesort |
Is there any way I can make it faster? The table has about 800K records.
EDIT: I changed COUNT(id) into COUNT(*) and the time dropped to 0.08 - 0.12 seconds. I think I've tried that before making the index and forgot to change it again after.
In the EXPLAIN output, the 'Using index' explains the speed-up:
| rows | Extra |
| 168029 | Using where; Using index; Using temporary; Using filesort |
(Side question: is this dropping of a factor of 5 normal?)
There are about 2000 users, so the final sort, even if it uses filesort, doesn't hurt performance. I tried without the ORDER BY and it still takes almost the same time.
Get rid of the 'game' key - it's redundant with 'i_gfu'. Since 'id' is unique, COUNT(id) just returns the number of rows in each group, so you can replace it with COUNT(*). Try it that way and paste the output of EXPLAIN:
SELECT user AS player, COUNT(*) AS times
FROM matches
WHERE finish = 1
AND game = 19
GROUP BY user
ORDER BY times DESC
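Dropping the redundant single-column index mentioned above would simply be:
ALTER TABLE matches DROP INDEX game;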
Eh, tough. Try reordering your index: put the user column first (so make the index (user, finish, game)), as that increases the chance that the GROUP BY can use the index. However, in general GROUP BY can only use an index if you limit the aggregate functions to MIN and MAX (see http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html and http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html). Your ORDER BY isn't really helping either.
One of the shortcomings of this query is that you order by an aggregate. That means you can't return any rows until the full result set has been generated; no index can exist (for MySQL/MyISAM, anyway) to fix that.
You can denormalize your data fairly easily to overcome this, though; you could, for instance, add an insert/update trigger that maintains a count in an indexed summary table, so that you can start returning rows immediately.
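A rough sketch of that denormalization, with hypothetical names: a summary table keyed by (game, user), kept up to date by an insert trigger.
-- Hypothetical summary table: one row per (game, user) with a running count.
CREATE TABLE match_counts (
    game  int(11) NOT NULL,
    user  int(11) NOT NULL,
    times int(11) NOT NULL DEFAULT 0,
    PRIMARY KEY (game, user),
    KEY idx_game_times (game, times)
) ENGINE=MyISAM;

DELIMITER //
CREATE TRIGGER matches_after_insert
AFTER INSERT ON matches
FOR EACH ROW
BEGIN
    -- Only finished matches are counted, mirroring the WHERE finish = 1 filter.
    IF NEW.finish = 1 THEN
        INSERT INTO match_counts (game, user, times)
        VALUES (NEW.game, NEW.user, 1)
        ON DUPLICATE KEY UPDATE times = times + 1;
    END IF;
END//
DELIMITER ;
The per-game ranking then reads straight off match_counts (WHERE game = 19 ORDER BY times DESC), which the (game, times) key can serve without a filesort.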
The EXPLAIN verifies the (game, finish, user) index was used in the query. That seems like the best possible index to me. Could it be a hardware issue? What is your system RAM and CPU?
I take it that the bulk of the time is spent extracting and, more importantly, sorting (twice, including the sort skipped by reading the index) 150k rows out of 800k. I doubt you can optimize it much further.
As others have noted, you may have reached the limit of what tuning the query itself can do. Next, check the settings of the max_heap_table_size and tmp_table_size variables on your server. The default is 16 MB, which may be too small for your table.
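A quick way to check and (temporarily, until restart) raise them; 64 MB here is only an example value:
SHOW VARIABLES LIKE 'max_heap_table_size';
SHOW VARIABLES LIKE 'tmp_table_size';

-- In-memory temp tables are capped by the smaller of the two, so raise both.
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;
SET GLOBAL tmp_table_size      = 64 * 1024 * 1024;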