I have a MySQL (5.0.22) MyISAM table with roughly 300k records in it, and I want to do a lat/lon distance search within a five-mile radius.
I have an index that covers the lat/lon fields and is fast (millisecond response) when I just select for lat/lon. But when I select additional fields from the table, it slows down horribly to 5-8 seconds.
I'm using MyISAM to take advantage of fulltext search. The other indexes perform well (e.g. select * from Listing where slug = 'xxxxx').
How can I optimize my query, table or index to speed things up?
My schema is:
CREATE TABLE `Listing` (
`id` int(10) unsigned NOT NULL auto_increment,
`name` varchar(125) collate utf8_unicode_ci default NULL,
`phone` varchar(18) collate utf8_unicode_ci default NULL,
`fax` varchar(18) collate utf8_unicode_ci default NULL,
`email` varchar(55) collate utf8_unicode_ci default NULL,
`photourl` varchar(55) collate utf8_unicode_ci default NULL,
`thumburl` varchar(5) collate utf8_unicode_ci default NULL,
`website` varchar(85) collate utf8_unicode_ci default NULL,
`categoryid` int(10) unsigned default NULL,
`addressid` int(10) unsigned default NULL,
`deleted` tinyint(1) default NULL,
`status` int(10) unsigned default '2',
`parentid` int(10) unsigned default NULL,
`organizationid` int(10) unsigned default NULL,
`listinginfoid` int(10) unsigned default NULL,
`createuserid` int(10) unsigned default NULL,
`createdate` datetime default NULL,
`lasteditdate` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
`lastedituserid` int(10) unsigned default NULL,
`slug` varchar(155) collate utf8_unicode_ci default NULL,
`aclid` int(10) unsigned default NULL,
`alt_address` varchar(80) collate utf8_unicode_ci default NULL,
`alt_website` varchar(80) collate utf8_unicode_ci default NULL,
`lat` decimal(10,7) default NULL,
`lon` decimal(10,7) default NULL,
`city` varchar(80) collate utf8_unicode_ci default NULL,
`state` varchar(10) collate utf8_unicode_ci default NULL,
PRIMARY KEY (`id`),
KEY `idx_fetch` USING BTREE (`slug`,`deleted`),
KEY `idx_loc` (`state`,`city`),
KEY `idx_org` (`organizationid`,`status`,`deleted`),
KEY `idx_geo_latlon` USING BTREE (`status`,`lat`,`lon`),
FULLTEXT KEY `idx_name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ROW_FORMAT=DYNAMIC;
My query is:
SELECT Listing.name, Listing.categoryid, Listing.lat, Listing.lon
, 3956 * 2 * ASIN(SQRT( POWER(SIN((Listing.lat - 37.369195) * pi()/180 / 2), 2) + COS(Listing.lat * pi()/180) * COS(37.369195 * pi()/180) * POWER(SIN((Listing.lon - -122.036849) * pi()/180 / 2), 2) )) rawgeosearchdistance
FROM Listing
WHERE
Listing.status = '2'
AND ( Listing.lon between -122.10913433498 and -121.96456366502 )
AND ( Listing.lat between 37.296909665016 and 37.441480334984)
HAVING rawgeosearchdistance < 5
ORDER BY rawgeosearchdistance ASC;
Explain plan without geosearch:
+----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len |ref | rows | Extra |
+----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-------------+
| 1 | SIMPLE | Listing | range | idx_geo_latlon | idx_geo_latlon | 19 | NULL | 453 | Using where |
+----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-------------+
Explain plan with geosearch:
+----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-----------------------------+
| 1 | SIMPLE | Listing | range | idx_geo_latlon | idx_geo_latlon | 19 | NULL | 453 | Using where; Using filesort |
+----+-------------+------------+-------+-----------------+-----------------+---------+------+------+-----------------------------+
Here's the explain plan with the covering index. Having the columns in the correct order made a big difference:
+----+-------------+--------+-------+---------------+---------------+---------+------+--------+------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------------+---------+------+--------+------------------------------------------+
| 1 | SIMPLE | Listing | range | idx_geo_cover | idx_geo_cover | 12 | NULL | 453 | Using where; Using index; Using filesort |
+----+-------------+--------+-------+---------------+---------------+---------+------+--------+------------------------------------------+
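For reference, the definition of idx_geo_cover isn't pasted above. A covering index for this query would likely look something like the sketch below (an assumption, not the exact index used): it needs status and lat/lon for the range scan, plus name and categoryid so the whole query can be satisfied from the index alone.
ALTER TABLE `Listing`
  ADD INDEX `idx_geo_cover` (`status`, `lat`, `lon`, `name`, `categoryid`);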
Thank you!
I think you really should consider the use of PostgreSQL (combined with Postgis).
I have given up on MySQL for geospatial data (for now) because of the following reasons:
MySQL only supports spatial datatypes / spatial indexes on MyISAM tables with the inherent disadvantages of MyISAM (concerning transactions, referential integrity...)
MySQL implements some of the OpenGIS specifications only on an MBR basis (minimum bounding rectangle), which is pretty useless for most serious geospatial querying/processing (see this link in the MySQL manual). Chances are you will need some of this functionality sooner or later.
PostgreSQL/Postgis with proper (GIST) spatial indexes and proper queries can be extremely fast.
Example: determining overlapping polygons between a 'small' selection of polygons and a table with over 5 million (!) very complex polygons, calculating the amount of overlap between these results, and sorting. Average runtime: between 30 and 100 milliseconds (this particular machine has a lot of RAM, of course; don't forget to tune your PostgreSQL install -- read the docs).
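To give a feel for it, a query of that general shape might look roughly like this in PostGIS (table and column names here are hypothetical, not from the case described above):
-- Hypothetical sketch: overlap area between a small selection and a large polygon table.
-- A GIST index on each geometry column lets the planner prune candidates first:
--   CREATE INDEX big_polygons_geom_gist ON big_polygons USING GIST (geom);
SELECT b.id,
       ST_Area(ST_Intersection(s.geom, b.geom)) AS overlap_area
FROM   small_selection s
JOIN   big_polygons b ON ST_Intersects(s.geom, b.geom)
ORDER  BY overlap_area DESC;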
You are probably using a 'covering index' in your lat/lon only query. A covering index occurs when the index used by the query contains the data that you are selecting for. MySQL only needs to visit the index and never the data rows. See this for more info. That would explain why the lat/lon query is so fast.
I suspect that the calculations and the sheer number of rows returned slow down the longer query (plus any temp table that has to be created for the HAVING clause).
When I implemented geo radius search, I just loaded all of the US zip codes into memory with their lat/long, then used my starting point and radius to get a list of zip codes within the radius, and used that list in my DB query. Of course, I was using Solr to do my searching because the search space was in the 20-million-row range, but the same principles should apply. Apologies for the shallowness of this response; I'm on my phone.
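Roughly, the approach looks like the sketch below. It assumes a small zipcodes(zip, lat, lon) table and a zip column on the searched table, neither of which appears in the schema above, so treat it purely as an illustration.
-- Step 1: find the zip codes whose centroids fall inside the radius (tiny table, cheap).
SELECT zip
FROM   zipcodes
WHERE  3956 * 2 * ASIN(SQRT(
         POWER(SIN((lat - 37.369195) * PI()/180 / 2), 2)
       + COS(lat * PI()/180) * COS(37.369195 * PI()/180)
         * POWER(SIN((lon - -122.036849) * PI()/180 / 2), 2))) < 5;
-- Step 2: use the resulting list in the real query, which can hit a plain index on zip.
SELECT name, categoryid FROM Listing WHERE zip IN ('94087', '94086', '95014');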
Depending on the number of your listings, could you create a view that contains
Listing1Id, Listing2ID, Distance
Basically just have all of the distances "pre-calculated"
Then you could do something like:
SELECT listing2ID FROM v_Distance d
WHERE distance < 5 AND listing1ID = XXX
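A hedged sketch of what that pre-calculated structure could look like, materialized as a table rather than a view (names are hypothetical; with ~300k listings a full cross join is tens of billions of pairs, so in practice you would only store pairs that can possibly be close):
CREATE TABLE listing_distance (
  listing1ID INT UNSIGNED NOT NULL,
  listing2ID INT UNSIGNED NOT NULL,
  distance   DECIMAL(8,3) NOT NULL,   -- miles
  PRIMARY KEY (listing1ID, distance, listing2ID)
) ENGINE=MyISAM;
-- Lookup is then a simple range scan on the primary key:
SELECT listing2ID
FROM   listing_distance
WHERE  listing1ID = 123               -- the "XXX" placeholder above
  AND  distance < 5;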
You really should avoid doing that much math in your select statement. That's probably the source of a lot of your slowdowns. Remember, SQL is a query language; it's really not optimized for trigonometric functions.
SQL will be faster and your overall results will be faster if you do a very naive distance search (which will return more results) and then winnow your results.
If you want to use distance in your query, at the very least use a squared-distance comparison; SQRT calls are notoriously slow. Instead of computing the exact distance and comparing it to your desired distance (say, 5), compute the squared distance and compare it to the square of the desired distance (25). For Cartesian coordinates the squared distance is just the sum of the two squared deltas (Pythagoras without the square root), which is much cheaper to evaluate.
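Applied to the original query, a sketch of the "naive first, exact later" idea could look like this. The BETWEEN bounding box does the cheap cut, and the final comparison uses a squared flat-earth distance in degrees instead of the haversine formula; that approximation is an assumption on my part, but it is generally fine for a 5-mile radius (about 0.072 degrees of latitude).
SELECT name, categoryid, lat, lon
FROM   Listing
WHERE  status = '2'
  AND  lon BETWEEN -122.10913433498 AND -121.96456366502
  AND  lat BETWEEN 37.296909665016 AND 37.441480334984
  AND  POWER(lat - 37.369195, 2)
     + POWER((lon - -122.036849) * COS(37.369195 * PI()/180), 2)
     < POWER(5 / 69.0, 2);            -- ~69 miles per degree of latitude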
I'm trying to understand whether the table is being loaded into the InnoDB buffer pool. For that I'm querying the INFORMATION_SCHEMA.INNODB_BUFFER_PAGE table.
From what I see, the table is fully loaded. However, the amount of data (MB) loaded into the buffer is very different from the numbers displayed in INFORMATION_SCHEMA.TABLES.
For example:
SELECT TABLE_NAME, TABLE_ROWS
, CAST(DATA_LENGTH/POWER(1024,2) AS DECIMAL(5, 0)) AS DATA_LENGTH_MB
, CAST(DATA_FREE/POWER(1024,2) AS DECIMAL(5, 0)) AS DATA_FREE_MB
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = '<db_name>'
AND TABLE_NAME = '<table_name>';
| TABLE_NAME   | TABLE_ROWS | DATA_LENGTH_MB | DATA_FREE_MB |
|--------------|------------|----------------|--------------|
| <table_name> | 39735968   | 10516          | 548          |
So there are around 39.7 million records and 10.5 GB in data pages according to INFORMATION_SCHEMA.TABLES.
However, when I'm running this:
SELECT p.TABLE_NAME, p.INDEX_NAME
, ROUND(SUM(DATA_SIZE)/POWER(1024,2)) AS DATA_SIZE_MB
, SUM(NUMBER_RECORDS) AS NUMBER_RECORDS
FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE AS p
WHERE p.TABLE_NAME LIKE '`<db_name>`.`<table_name>`' AND p.INDEX_NAME = 'PRIMARY'
AND p.PAGE_TYPE = 'INDEX' AND p.PAGE_STATE = 'FILE_PAGE'
GROUP BY p.TABLE_NAME, p.INDEX_NAME
ORDER BY p.TABLE_NAME, p.INDEX_NAME;
I'm getting:
| TABLE_NAME              | INDEX_NAME | DATA_SIZE_MB | NUMBER_RECORDS |
|-------------------------|------------|--------------|----------------|
| <db_name>.<table_name>  | PRIMARY    | 3505         | 45224835       |
And finally,
SELECT COUNT(1) FROM <db_name>.<table_name>;
44947428
NUMBER_RECORDS is slightly greater than TABLE_ROWS in INFORMATION_SCHEMA.TABLES, so I assume the table is fully loaded into memory and TABLE_ROWS is either approximate or not up to date.
But why is DATA_SIZE in INFORMATION_SCHEMA.INNODB_BUFFER_PAGE so different (3.5 GB vs. 10.5 GB)?
What am I missing? Is the data size in TABLES completely incorrect?
Database is running on Amazon RDS (Aurora MySQL 5.7) if that matters.
Thanks.
P.S. CREATE TABLE statement (column names obfuscated, sorry):
CREATE TABLE `table_name` (
`recid` BINARY(32) NOT NULL,
`col1` INT(11) UNSIGNED NOT NULL,
`col2` TINYINT(1) UNSIGNED NOT NULL,
`col3` VARCHAR(250) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col4` TINYINT(1) UNSIGNED NOT NULL,
`col5` VARCHAR(250) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col6` TINYINT(1) UNSIGNED NOT NULL,
`col7` TINYINT(1) UNSIGNED NOT NULL,
`col8` VARCHAR(100) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col9` TINYINT(1) UNSIGNED NOT NULL,
`col10` TINYINT(1) UNSIGNED NOT NULL,
`col11` VARCHAR(100) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col12` TINYINT(1) UNSIGNED NOT NULL DEFAULT '0',
`col13` TINYINT(1) UNSIGNED NOT NULL DEFAULT '1',
`col14` INT(11) UNSIGNED NULL DEFAULT NULL,
`col15` BINARY(32) NULL DEFAULT NULL,
`col16` CHAR(2) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col17` TINYINT(1) NULL DEFAULT NULL,
`col18` VARCHAR(50) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col19` TINYINT(1) NULL DEFAULT NULL,
`col20` TINYINT(1) NULL DEFAULT NULL,
PRIMARY KEY (`recid`) USING BTREE,
UNIQUE INDEX `col3` (`col3`) USING BTREE,
INDEX `col5` (`col5`) USING BTREE,
INDEX `col8` (`col8`) USING BTREE
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
The Information Schema INNODB_BUFFER_PAGE table contains information about pages in the buffer pool.
Note the last 4 words.
That suggests that SUM from INNODB_BUFFER_PAGE may be smaller than what you get from INFORMATION_SCHEMA.TABLES.
(I don't know all the details, but here are some general statements.)
The buffer_pool may contain:
Some or all of the leaf nodes for a table.
Some or all of the non-leaf nodes for a table.
Ditto for leaf and non-leaf nodes for each non-PRIMARY index for a table.
Ditto for more tables.
TEXT and BLOB (and large VARCHAR) may be stored off-record. This greatly increases the disk space occupied. But I don't think this happens in your case. However, see below.
25% (tunable) of the buffer_pool is reserved for the "change buffer"; this is sort of a write cache for changes to secondary indexes.
Other stuff
A few percent of the buffer_pool is held in reserve or lost for other reasons.
Blocks are kicked out of the buffer_pool in [roughly] a least-recently-used order.
I don't know, but I would expect that it might not be possible to keep a table in the buffer_pool if that table is half the size of the buffer_pool.
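As a side note, one way to see how much of each index (not just the PRIMARY key) is currently cached is to group the buffer-pool pages per index, something along these lines (adjust the LIKE pattern; INNODB_BUFFER_PAGE reports names as `db`.`table`):
SELECT INDEX_NAME,
       COUNT(*)                            AS pages_cached,
       ROUND(SUM(DATA_SIZE)/POWER(1024,2)) AS data_size_mb,
       SUM(NUMBER_RECORDS)                 AS records_cached
FROM   INFORMATION_SCHEMA.INNODB_BUFFER_PAGE
WHERE  TABLE_NAME LIKE '%table_name%'
  AND  PAGE_TYPE = 'INDEX'
GROUP  BY INDEX_NAME;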
Another thing to note... The Data_free metric for each table is just one of quite a few categories of "overhead" in a table.
Pre-allocated blocks (perhaps reflected in Data_free)
Unfilled blocks (perhaps no data or index block is 100% full)
Transactions lead to extra copies of rows -- these come and go, either in the undo/redo space or in the data blocks.
Block splits
etc.
A Rule of Thumb is that the disk space occupied by the data (Data_length) is 2x-3x the predicted size. ("Predicted" = adding up individual data sizes, such as 4 bytes for each INT.)
Wild idea
What is the ROW_FORMAT?
Your 3.5 GB computation may be the on-record space, with all the VARCHARs stored off-record. The math almost works out.
Let's pursue 2 avenues of thought with
SELECT count(*),
AVG(LENGTH(col3)) AS avg3,
AVG(LENGTH(col5)) AS avg5,
... -- the rest of the VARCHARs
FROM table_name;
(I specifically want LENGTH(), not CHAR_LENGTH().)
Sorry for the long delay. I have finally managed to confirm that there was in fact a data clean-up performed on the table in question. Around 60% of the records were deleted.
That should explain the difference between the size and n_leaf_pages values in the mysql.innodb_index_stats table. Not sure whether that's normal behavior or not.
So, to answer my own question: to estimate how much space a table takes in the InnoDB buffer pool, I should probably be looking at mysql.innodb_index_stats.size instead of INFORMATION_SCHEMA.TABLES.
SELECT TABLE_NAME, ROUND((stat_value*@@innodb_page_size)/POWER(1024,2)) AS DATA_SIZE_MB
FROM mysql.innodb_index_stats
WHERE database_name = 'db_name' AND index_name = 'PRIMARY' AND table_name = 'table_name'
AND stat_name = 'n_leaf_pages';
Thanks @Rick James for helping me with this one.
In MySQL slow query log I have the following query:
SELECT * FROM `news_items`
WHERE `ctime` > 1465013901 AND `feed_id` IN (1, 2, 9) AND
`moderated` = '1' AND `visibility` = '1'
ORDER BY `views` DESC
LIMIT 5;
Here is the result of EXPLAIN:
+----+-------------+------------+-------+---------------------------------------------------------------------------------------+-------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------------------------------------------------------------------------------+-------+---------+------+------+-------------+
| 1 | SIMPLE | news_items | index | feed_id,ctime,ctime_2,feed_id_2,moderated,visibility,feed_id_3,cday_complex,feed_id_4 | views | 4 | NULL | 5 | Using where |
+----+-------------+------------+-------+---------------------------------------------------------------------------------------+-------+---------+------+------+-------------+
1 row in set (0.00 sec)
When I run this query manually, it takes something like 0.00 sec, but for some reason it appears in MySQL's slow log taking 1-5 seconds sometimes. I believe it happens when the server is under high load.
Here is the table structure:
CREATE TABLE IF NOT EXISTS `news_items` (
`item_id` int(10) NOT NULL,
`category_id` int(10) NOT NULL,
`source_id` int(10) NOT NULL,
`feed_id` int(10) NOT NULL,
`title` varchar(255) CHARACTER SET utf8 NOT NULL,
`announce` varchar(255) CHARACTER SET utf8 NOT NULL,
`content` text CHARACTER SET utf8 NOT NULL,
`hyperlink` varchar(255) CHARACTER SET utf8 NOT NULL,
`ctime` varchar(11) CHARACTER SET utf8 NOT NULL,
`cday` tinyint(2) NOT NULL,
`img` varchar(100) CHARACTER SET utf8 NOT NULL,
`video` text CHARACTER SET utf8 NOT NULL,
`gallery` text CHARACTER SET utf8 NOT NULL,
`comments` int(11) NOT NULL DEFAULT '0',
`views` int(11) NOT NULL DEFAULT '0',
`visibility` enum('1','0') CHARACTER SET utf8 NOT NULL DEFAULT '0',
`pin` tinyint(1) NOT NULL,
`pin_dttm` varchar(10) CHARACTER SET utf8 NOT NULL,
`moderated` tinyint(1) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The index named "views" consists of one field only -- views.
I also have many other indexes consisting of (for example):
feed_id + views + visibility + moderated
moderated + visibility + feed_id + ctime
moderated + visibility + feed_id + views + ctime
I used the fields in the order mentioned because that was the only way MySQL would start using them. However, I never got "Using where; Using index" in EXPLAIN.
Any ideas on how to make EXPLAIN show me "Using index"?
If you change the storage engine to InnoDB and create the correct composite index, you can try this. The inner query only gets the item_id values for the first 5 rows. LIMIT is applied after the complete SELECT is done, so it is better to do that part without dragging along the big columns, and then fetch the whole row only for those 5 rows:
SELECT idata.* FROM (
SELECT item_id FROM `news_items`
WHERE `ctime` > 1465013901 AND `feed_id` IN (1, 2, 9) AND
`moderated` = '1' AND `visibility` = '1'
ORDER BY `views` DESC
LIMIT 5 ) as i_ids
LEFT JOIN news_items AS idata ON idata.item_id = i_ids.item_id
ORDER BY `views` DESC;
If your table "also have many other indexes", why do they not show in the SHOW CREATE TABLE?
There are two ways that
WHERE `ctime` > 1465013901
AND `feed_id` IN (1, 2, 9)
AND `moderated` = '1'
AND `visibility` = '1'
ORDER BY `views` DESC
could use indexing:
INDEX(views) and hope that the desired 5 rows (see LIMIT) show up early.
INDEX(moderated, visibility, feed_id, ctime)
This last 'composite' index starts with the two columns (in either order) that are compared = constant, then moves on to IN and finally a "range" (ctime > const). Older versions won't get past IN; newer versions will leapfrog through the IN values and make use of the range on ctime. More discussion.
It is useless to include the ORDER BY columns in a composite index before all of the WHERE columns. And it will not be useful to include views after them in your case, because of the "range" on ctime.
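A minimal sketch of adding option 2 (the index name is arbitrary):
ALTER TABLE news_items
  ADD INDEX idx_mod_vis_feed_ctime (moderated, visibility, feed_id, ctime);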
The tip about 'lazy evaluation' that Bernd suggests will also help.
I agree that InnoDB would probably be better. Conversion tips.
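The conversion itself is a single statement; it rewrites the whole table, so run it off-peak and with a backup in hand:
ALTER TABLE news_items ENGINE=InnoDB;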
To answer your question: "Using index" means that MySQL will use only the index to satisfy your query. To do this we would need a "covering" index (an index which "covers" the query), i.e. one which covers the WHERE, the ORDER BY/GROUP BY, and all fields from the SELECT. However, you are doing SELECT *, so that will not be practical.
MySQL chooses the index on views because you have LIMIT 5 in the query. It does that because 1) the index is small and 2) it can avoid a filesort in this case.
I believe the problem is not with the index but rather with ENGINE=MyISAM. MyISAM uses table-level locking, so if you change news_items the whole table gets locked. I would suggest converting the table to InnoDB.
Another possibility is that, if the table is large, the index on (views) may not be the best option.
If you use Percona Server you can enable slow log verbosity option and see the query plan for the slow query as described here: https://www.percona.com/doc/percona-server/5.5/diagnostics/slow_extended_55.html
I'm learning about MySQL performance with a pet project consisting of ~2 million rows + ~600k rows (two MyISAM tables). A range query using BETWEEN on two INT(10) indexed columns, LIMITed to 1 returned result, takes about 160ms (including an INNER JOIN). I figure my configuration isn't optimised and am looking for advice on how to diagnose this, or perhaps for a "common configuration".
I created a gist containing both tables, the query and the contents of my.cnf.
I created the b-tree index after inserting all data which was imported from a CSV file from MaxMinds open database. I tried two separate, and now a combined index with no difference in performance.
I'm running this locally on a MacBook Pro clocked at 2.6 GHz (i5) with 8 GB of 1600 MHz RAM. MySQL is installed using the downloadable binary from MySQL's download page (unable to supply a third link because my rep is too low). It's a default installation with no major additions to the my.cnf config file, included in the gist (located under the /usr/local/mysql-5.6.xxx/ directory on my system).
My concern is that I'm seeing ~160ms, which indicates to me that I'm missing something. I've considered compressing the table, but I have a feeling that I'm missing other configuration. Also, myisampack wasn't in my PATH (I think), so I'm considering other optimisations before I explore this further.
Any advice is appreciated!
$ mysql --version
/usr/local/mysql-5.6.23-osx10.8-x86_64/bin/mysql Ver 14.14 Distrib 5.6.23, for osx10.8 (x86_64) using EditLine wrapper
Tables
CREATE TABLE `blocks` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`begin_range` int(10) unsigned NOT NULL,
`end_range` int(10) unsigned NOT NULL,
`_location_id` int(11) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `begin_range` (`begin_range`,`end_range`)
) ENGINE=MyISAM AUTO_INCREMENT=2008839 DEFAULT CHARSET=ascii;
CREATE TABLE `locations` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`country` varchar(2) NOT NULL DEFAULT '',
`region` varchar(255) DEFAULT NULL,
`city` varchar(255) DEFAULT NULL,
`postalcode` varchar(255) DEFAULT NULL,
`latitude` float NOT NULL,
`longitude` float NOT NULL,
`metro_code` int(11) DEFAULT NULL,
`area_code` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=641607 DEFAULT CHARSET=utf8;
Query
SELECT locations.latitude, locations.longitude
FROM blocks
INNER JOIN locations ON blocks._location_id = locations.id
WHERE INET_ATON('139.130.4.5') BETWEEN begin_range AND end_range
LIMIT 0, 1;
Edit:
Updated gist with EXPLAIN on the SELECT, also posted here for convenience.
EXPLAIN SELECT locations.latitude, locations.longitude FROM blocks INNER JOIN locations ON blocks._location_id = locations.id WHERE INET_ATON('94.137.106.123') BETWEEN begin_range AND end_range LIMIT 0, 1;
+----+-------------+-----------+--------+---------------+-------------+---------+---------------------------+---------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+--------+---------------+-------------+---------+---------------------------+---------+------------------------------------+
| 1 | SIMPLE | blocks | range | begin_range | begin_range | 4 | NULL | 1095345 | Using index condition; Using where |
| 1 | SIMPLE | locations | eq_ref | PRIMARY | PRIMARY | 4 | geoip.blocks._location_id | 1 | NULL |
+----+-------------+-----------+--------+---------------+-------------+---------+---------------------------+---------+------------------------------------+
2 rows in set (0.00 sec)
Edit 2: Included the data in the question for convenience.
The problem is that the normal approach (which your code exemplifies) leads to hitting 1,095,345 rows. I have an approach that can do that query in one disk hit, even when the cache is cold.
Excerpts from http://mysql.rjweb.org/doc.php/ipranges :
The Situation
Your data includes a large set of non-overlapping 'ranges'. These could be IP addresses, datetimes (show times for a single station), zipcodes, etc.
You have pairs of start and end values; one 'item' belongs to each such 'range'. So, instinctively, you create a table with start and end of the range, plus info about the item. Your queries involve a WHERE clause that compares for being between the start and end values.
The Problem
Once you get a large set of items, performance degrades. You play with the indexes, but find nothing that works well. The indexes fail to lead to optimal functioning because the database does not understand that the ranges are non-overlapping.
The Solution
I will present a solution that enforces the fact that items cannot have overlapping ranges. The solution builds a table to take advantage of that, then uses Stored Routines to get around the clumsiness imposed by it.
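Stripped of the stored routines, the core of that solution looks roughly like this (a sketch with hypothetical names, not the article's exact code). Because the ranges are contiguous and non-overlapping, only the start of each range is stored, and the lookup becomes a single index probe:
CREATE TABLE ip_blocks (
  ip_start     INT UNSIGNED NOT NULL,   -- first address of the block
  _location_id INT UNSIGNED NULL,       -- NULL for gaps with no location
  PRIMARY KEY (ip_start)
) ENGINE=InnoDB;
-- One probe instead of scanning ~1M candidate rows:
SELECT _location_id
FROM   ip_blocks
WHERE  ip_start <= INET_ATON('139.130.4.5')
ORDER  BY ip_start DESC
LIMIT  1;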
I have noticed that I am having problems with SQL queries, because of a problem with mysqld, during peak hours. It causes my website to load 3-5 times slower than usual. So I tried siege -d5 -c150 http://mydomain.com/ and looked into top, and my mysqld takes over 700% of the CPU! I've also noticed Copying to tmp table in the MySQL status, and queries piling up in some kind of queue.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25877 mysql 20 0 1076m 227m 8268 S 749.0 2.8 224:02.21 mysqld
This is my query
SELECT COUNT(downloaded.id) AS downloaded_count
, downloaded.file_name
,uploaded.*
FROM `downloaded` JOIN uploaded
ON downloaded.file_name = uploaded.file_name
WHERE downloaded.completed = '1'
AND uploaded.active = '1'
AND uploaded.nsfw = '0'
AND downloaded.datetime > DATE_SUB(NOW(), INTERVAL 7 DAY)
GROUP BY downloaded.file_name
ORDER BY downloaded_count DESC LIMIT 30;
Showing rows 0 - 29 ( 30 total, Query took 0.1639 sec) //is this that much? shouldn't it be 0.01s instead?
UPDATED: (removed ORDER BY)
Showing rows 0 - 29 ( 30 total, Query took 0.0064 sec)
Why does ORDER BY make it 20x slower?
EXPLAIN
+----+-------------+------------+------+---------------+-----------+---------+--------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+-----------+---------+--------------------------+------+----------------------------------------------+
| 1 | SIMPLE | uploaded | ALL | file_name_up | NULL | NULL | NULL | 3139 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | downloaded | ref | file_name | file_name | 767 | piqik.uploaded.file_name | 8 | Using where |
+----+-------------+------------+------+---------------+-----------+---------+--------------------------+------+----------------------------------------------+
table: uploaded (Total 720.5 KiB)
CREATE TABLE IF NOT EXISTS `uploaded` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`sid` int(1) NOT NULL,
`file_name` varchar(255) NOT NULL,
`file_size` varchar(255) NOT NULL,
`file_ext` varchar(255) NOT NULL,
`file_name_keyword` varchar(255) NOT NULL,
`access_key` varchar(40) NOT NULL,
`upload_datetime` datetime NOT NULL,
`last_download` datetime NOT NULL,
`file_password` varchar(255) NOT NULL DEFAULT '',
`nsfw` int(1) NOT NULL,
`votes` int(11) NOT NULL,
`downloads` int(11) NOT NULL,
`video_thumbnail` int(1) NOT NULL DEFAULT '0',
`video_duration` varchar(255) NOT NULL DEFAULT '',
`video_resolution` varchar(11) NOT NULL,
`video_additional` varchar(255) NOT NULL DEFAULT '',
`active` int(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`id`),
FULLTEXT KEY `file_name_keyword` (`file_name_keyword`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=3328 ;
table: downloaded (Total 5,152.0 KiB)
CREATE TABLE IF NOT EXISTS `downloaded` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`file_name` varchar(255) NOT NULL,
`completed` int(1) NOT NULL,
`client_ip_addr` varchar(40) NOT NULL,
`client_access_key` varchar(40) NOT NULL,
`datetime` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=31475 ;
(not sure why I've chosen InnoDB here)
Please note that I am (still) not using indexes (which, as I read, is very important!) because of a lack of knowledge; I am not sure how to add them correctly.
So the question is: how do I improve this query to keep the webserver from loading the website slowly? I have only a "few" records and cannot believe I am having such major problems; people here deal with millions of records and their projects work. How do webhosting companies prevent this problem? (I am only hosting my own webpages, with over 150 concurrent clients.)
Additional info:
Mysql: 5.5.33
Nginx 1.2.1, php5-fpm
Debian 7.1 Wheezy
2x L5420 @ 2.50GHz
8GB RAM
A few observations:
You may not have actively chosen InnoDB as a storage engine: it will be the default engine for your version of MySQL. It's probably the right choice for your context, though, as it offers row-level locking instead of table-level locking (amongst other things) which you likely want.
Don't quote your integers in your comparisons (eg uploaded.active = '1'). You'll end up with slower string comparison, instead of integer comparison.
The comparison downloaded.datetime > DATE_SUB(NOW(), INTERVAL 7 DAY) with a derived value is going to be slower than comparison with a normal column value.
Regarding the last point, you could replace this with a user defined variable declared before the query:
SET @one_week_ago := DATE_SUB(NOW(), INTERVAL 7 DAY);
and then within the query compare to that pre-computed value:
...
downloaded.datetime > @one_week_ago
...
More importantly, though, you'll definitely want to have an index on any key that you're joining on.
In this case, you can add them by:
CREATE INDEX idx_file_name ON uploaded(file_name);
CREATE INDEX idx_file_name ON downloaded(file_name);
If you don't have indices, you're going to end up with multiple full table scans, which is slow.
There is a cost to adding an index: it takes up space, and it also means writes to the table are slower because the index has to be updated to include them. If this is a query that is running as part of the operation of your website, though, you definitely need the indices.
We want to map the entries of calibration_data to the calibration data with the following query, but the duration of this query is far too long in my opinion (>24h).
Is there any optimization possible?
For testing we added more indexes than are needed right now, but that didn't have any impact on the duration.
[Edit]
The hardware shouldn't be the biggest bottleneck
128 GB RAM
1TB SSD RAID 5
32 cores
EXPLAIN result
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
| 1 | SIMPLE | cal | NULL | ALL | NULL | NULL | NULL | NULL | 2009 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | m | NULL | ALL | visit | NULL | NULL | NULL | 3082466 | 100.00 | Range checked for each record (index map: 0x1) |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+------------------------------------------------+
Query which takes too long:
Insert into knn_data (SELECT cal.X AS X,
cal.Y AS Y,
cal.BeginTime AS BeginTime,
cal.EndTime AS EndTime,
avg(m.dbm_ant) AS avg_dbm_ant,
m.ant_id AS ant_id,
avg(m.location) avg_location,
count(*) AS count,
m.visit
FROM calibration cal
LEFT join calibration_data m
ON m.visit BETWEEN cal.BeginTime AND cal.EndTime
GROUP BY cal.X,
cal.Y,
cal.BeginTime,
cal.BeaconId,
m.ant_id,
m.macHash,
m.visit;
Table knn_data:
CREATE TABLE `knn_data` (
`X` int(11) NOT NULL,
`Y` int(11) NOT NULL,
`BeginTime` datetime NOT NULL,
`EndTIme` datetime NOT NULL,
`avg_dbm_ant` float DEFAULT NULL,
`ant_id` int(11) NOT NULL,
`avg_location` float DEFAULT NULL,
`count` int(11) DEFAULT NULL,
`visit` datetime NOT NULL,
PRIMARY KEY (`ant_id`,`visit`,`X`,`Y`,`BeginTime`,`EndTIme`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Table calibration
BeaconId, X, Y, BeginTime, EndTime
41791, 1698, 3944, 2016-11-12 22:44:00, 2016-11-12 22:49:00
CREATE TABLE `calibration` (
`BeaconId` int(11) DEFAULT NULL,
`X` int(11) DEFAULT NULL,
`Y` int(11) DEFAULT NULL,
`BeginTime` datetime DEFAULT NULL,
`EndTime` datetime DEFAULT NULL,
KEY `x,y` (`X`,`Y`),
KEY `x` (`X`),
KEY `y` (`Y`),
KEY `BID` (`BeaconId`),
KEY `beginTime` (`BeginTime`),
KEY `x,y,beg,bid` (`X`,`Y`,`BeginTime`,`BeaconId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Table calibration_data
macHash, visit, dbm_ant, ant_id, mac, isRand, posX, posY, sources, ip, dayOfMonth, location, am, ar
'f5:dc:7d:73:2d:e9', '2016-11-12 22:44:00', '-87', '381', 'f5:dc:7d:73:2d:e9', NULL, NULL, NULL, NULL, NULL, '12', '18.077636300207715', 'inradius_41791', NULL
CREATE TABLE `calibration_data` (
`macHash` varchar(100) COLLATE utf8_bin NOT NULL,
`visit` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`dbm_ant` int(3) NOT NULL,
`ant_id` int(11) NOT NULL,
`mac` char(17) COLLATE utf8_bin DEFAULT NULL,
`isRand` tinyint(4) DEFAULT NULL,
`posX` double DEFAULT NULL,
`posY` double DEFAULT NULL,
`sources` int(2) DEFAULT NULL,
`ip` int(10) unsigned DEFAULT NULL,
`dayOfMonth` int(11) DEFAULT NULL,
`location` varchar(80) COLLATE utf8_bin DEFAULT NULL,
`am` varchar(300) COLLATE utf8_bin DEFAULT NULL,
`ar` varchar(300) COLLATE utf8_bin DEFAULT NULL,
KEY `visit` (`visit`),
KEY `macHash` (`macHash`),
KEY `ant, time` (`dbm_ant`,`visit`),
KEY `beacon` (`am`),
KEY `ant_id` (`ant_id`),
KEY `ant,mH,visit` (`ant_id`,`macHash`,`visit`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Is this a one-time task? Then does the duration really matter? After getting this data loaded, will you incrementally update the "summary table" each day?
Shrink datatypes -- bulky data takes longer to process. Example: a 4-byte INT DayOfMonth could be a 1-byte TINYINT UNSIGNED.
You are moving a TIMESTAMP into a DATETIME. This may or may not work as you expect.
INT UNSIGNED is OK for IPv4, but you can't fit IPv6 in it.
COUNT(*) probably does not need a 4-byte INT; see the smaller variants.
Use UNSIGNED where appropriate.
A mac-address takes 19 bytes the way you have it; it could easily be converted to/from a 6-byte BINARY(6). See REPLACE(), UNHEX(), HEX(), etc.
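For the mac-address point, a quick sketch of the packing and unpacking with the functions mentioned (purely illustrative):
-- Pack: strip the colons and UNHEX -> 6 bytes.
SELECT UNHEX(REPLACE('f5:dc:7d:73:2d:e9', ':', '')) AS mac_bin;
-- Unpack: HEX it again and re-insert the colons for display.
SELECT LOWER(CONCAT_WS(':',
         SUBSTRING(HEX(mac_bin), 1, 2), SUBSTRING(HEX(mac_bin), 3, 2),
         SUBSTRING(HEX(mac_bin), 5, 2), SUBSTRING(HEX(mac_bin), 7, 2),
         SUBSTRING(HEX(mac_bin), 9, 2), SUBSTRING(HEX(mac_bin), 11, 2))) AS mac_text
FROM  (SELECT UNHEX('F5DC7D732DE9') AS mac_bin) AS t;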
What is the setting of innodb_buffer_pool_size? It could be about 100G for the big RAM you have.
Do the time ranges overlap? If not, take advantage of that. Also, don't include unnecessary columns in the PRIMARY KEY, such as EndTime.
Have the GROUP BY columns in the same order as the PRIMARY KEY of knn_data; this will avoid a lot of block splits during the INSERT.
The big problem is that there is no useful index in calibration_data, so the JOIN has to do a full table scan again and again -- an estimated 2K scans of 3M rows! Let me focus on that problem...
There is no good way to do WHERE x BETWEEN start AND end because MySQL does not know whether the datetime ranges overlap. There is no real cure for that in this context, so let me approach it differently...
Are start and end 'regular'? Like every hour? If so, we can do some sort of computation instead of the BETWEEN, as sketched below. Let me know if this is the case; I will continue my thoughts.
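Purely as an illustration of what such a computation could look like: if every calibration window were a regular, aligned 5-minute slot (an assumption, not something the posted data guarantees), the range join could become an equality join on a computed slot start, which an ordinary index on calibration.BeginTime can use:
SELECT cal.X, cal.Y, m.ant_id, m.dbm_ant
FROM   calibration_data m
JOIN   calibration cal
  ON   cal.BeginTime = FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(m.visit) / 300) * 300);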
That's a nasty and classical problem with "range" queries: the optimiser doesn't use your indexes and ends up doing a full table scan. In your EXPLAIN plan you can see this in the type=ALL column.
Ideally you should have type=range and something in the key column.
Some ideas:
I doubt that changing your join condition from
ON m.visit BETWEEN cal.BeginTime AND cal.EndTime
to
ON m.visit >= cal.BeginTime AND m.visit <= cal.EndTime
will work, but still give it a try.
Do trigger an ANALYZE TABLE on both tables. This will update the statistics on your tables and might help the optimiser make the right decision (i.e. use the indexes).
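In MySQL the statement is ANALYZE TABLE, and it can take both tables at once:
ANALYZE TABLE calibration, calibration_data;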
Changing the query to this might also help force the optimiser to use the indexes:
Insert into knn_data (SELECT cal.X AS X,
cal.Y AS Y,
cal.BeginTime AS BeginTime,
cal.EndTime AS EndTime,
avg(m.dbm_ant) AS avg_dbm_ant,
m.ant_id AS ant_id,
avg(m.location) avg_location,
count(*) AS count,
m.visit
FROM calibration cal
LEFT join calibration_data m
ON m.visit >= cal.BeginTime
WHERE m.visit <= cal.EndTime
GROUP BY cal.X,
cal.Y,
cal.BeginTime,
cal.BeaconId,
m.ant_id,
m.macHash,
m.visit;
That's all I am thinking of...