In a database I have a table with order items. The table holds roughly 36 million records.
Running a query like this takes about 3 minutes:
SELECT COUNT(DISTINCT DATE(created_on), product_id) FROM order_items;
Running a query like this takes about 13 seconds:
SELECT COUNT(1) FROM order_items;
Something tells me that 36 million records is not that much, and that both queries are running rather slowly.
What would be the checklist to start looking into the performance issue here?
We are using MySQL (in fact, a Clustrix version of it, MySQL 5.0.45-clustrix-6.0.1).
Edit. Adding more info:
/* SHOW CREATE TABLE order_items; */
CREATE TABLE `order_items` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`state` enum('pending','sold_out','approved','declined','cancelled','processing','completed','expired') CHARACTER SET utf8 NOT NULL DEFAULT 'pending',
`order_id` int(10) unsigned NOT NULL,
`product_id` int(10) unsigned NOT NULL,
`quantity` smallint(5) unsigned NOT NULL,
`price` decimal(10,2) unsigned NOT NULL,
`total` decimal(10,2) unsigned NOT NULL,
`created_on` datetime NOT NULL,
`updated_on` datetime NOT NULL,
`employee_id` int(11),
`customer_id` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`) /*$ DISTRIBUTE=1 */,
KEY `updated_on` (`updated_on`) /*$ DISTRIBUTE=1 */,
KEY `state` (`state`,`quantity`) /*$ DISTRIBUTE=3 */,
KEY `product_id` (`product_id`,`state`) /*$ DISTRIBUTE=2 */,
KEY `product` (`product_id`) /*$ DISTRIBUTE=1 */,
KEY `order_items_quantity` (`quantity`) /*$ DISTRIBUTE=2 */,
KEY `order_id` (`order_id`,`state`,`created_on`) /*$ DISTRIBUTE=3 */,
KEY `order` (`order_id`) /*$ DISTRIBUTE=1 */,
KEY `index_order_items_on_employee_id` (`employee_id`) /*$ DISTRIBUTE=2 */,
KEY `customer_id` (`customer_id`) /*$ DISTRIBUTE=2 */,
KEY `created_at` (`created_on`) /*$ DISTRIBUTE=1 */
) AUTO_INCREMENT=36943352 CHARACTER SET utf8 ENGINE=InnoDB /*$ REPLICAS=2 SLICES=12 */
And:
/* SHOW VARIABLES LIKE '%buffer%'; */
+----------------------------------------+-------+
| Variable_name | Value |
+----------------------------------------+-------+
| backup_compression_buffer_size_bytes | 8192 |
| backup_read_buffer_size_bytes | 8192 |
| backup_write_buffer_size_bytes | 8192 |
| mysql_master_trx_buffer_kb | 256 |
| mysql_slave_session_buffer_size_events | 100 |
| net_buffer_length | 16384 |
| replication_master_buffer_kb | 65536 |
+----------------------------------------+-------+
Edit 2. Here's EXPLAIN statements for both queries:
mysql> EXPLAIN SELECT COUNT(1) FROM order_items;
+----------------------------------------------------------+-------------+-------------+
| Operation | Est. Cost | Est. Rows |
+----------------------------------------------------------+-------------+-------------+
| row_count "expr1" | 29740566.81 | 1.00 |
| stream_combine | 26444732.70 | 32958341.10 |
| compute expr0 := param(0) | 1929074.80 | 2746528.43 |
| filter isnotnull(param(0)) | 1915342.16 | 2746528.43 |
| index_scan 1 := order_items.order_items_quantity | 1854308.19 | 3051698.25 |
+----------------------------------------------------------+-------------+-------------+
5 rows in set (0.13 sec)
And:
mysql> EXPLAIN SELECT COUNT(DISTINCT DATE(created_on), product_id) FROM order_items;
+----------------------------------------------------------------------------------+-------------+------------+
| Operation | Est. Cost | Est. Rows |
+----------------------------------------------------------------------------------+-------------+------------+
| hash_aggregate_combine expr1 := count(DISTINCT (0 . "expr0"),(1 . "product_id")) | 10115923.36 | 4577547.38 |
| hash_aggregate_partial GROUPBY((0 . "expr0"), (1 . "product_id")) | 3707357.04 | 4577547.38 |
| compute expr0 := cast(1.created_on, date) | 2166388.20 | 3051698.25 |
| index_scan 1 := order_items.__idx_order_items__PRIMARY | 2151129.71 | 3051698.25 |
+----------------------------------------------------------------------------------+-------------+------------+
4 rows in set (0.24 sec)
The first query must walk the entire table, checking every row. An index on (created_on, product_id) would probably speed it up significantly. If you don't know about indexes, http://use-the-index-luke.com is a great place to start.
The second query, it seems to me, should be near-instant, because it only has to check table metadata and doesn't need to examine any rows.
You should publish the query plan, but I suspect that to process the query MySQL must walk through the product_id and created_on indexes. For the created_on field it must also derive the date from each value (the field is a datetime but you want to group by date). If you need speed, I would add an additional field created_on_date holding only the date, and create an index on product_id and created_on_date. It should make your query much faster.
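A minimal sketch of that approach (the column and index names are illustrative, not from the original schema; backfilling 36 million rows will take a while):

-- Add a date-only column, backfill it, and index it together with product_id
ALTER TABLE order_items ADD COLUMN created_on_date DATE;
UPDATE order_items SET created_on_date = DATE(created_on);
CREATE INDEX idx_product_date ON order_items (product_id, created_on_date);

-- The aggregate can then be answered from the new index alone:
SELECT COUNT(DISTINCT created_on_date, product_id) FROM order_items;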
Of course, the COUNT(1) query is faster because it doesn't read the table at all; it can use the index metadata.
Some things to note:
If you add INDEX(product_id, created_on), the first query should run faster because it would be a "covering index". (The columns can also be in the opposite order.) See the sketch below.
Running those two queries in the order given could have caused info to be cached, thereby making the second query run faster.
SELECT COUNT(*) FROM tbl will use the smallest index. (In InnoDB.)
If you have enough RAM, and innodb_buffer_pool_size is bigger than the table, then one or the other of the operations may be performed entirely in RAM. RAM is a lot faster than disk.
Please provide SHOW CREATE TABLE order_items; I am having to guess too much.
Please provide SHOW VARIABLES LIKE '%buffer%';. How much RAM do you have?
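A sketch of the covering index from the first note (the index name is illustrative):

-- Both columns in one index let the COUNT(DISTINCT ...) scan the index alone,
-- never touching the table rows:
ALTER TABLE order_items ADD INDEX idx_product_created (product_id, created_on);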
Edit
Since it is Clustrix, there could be radically different things going on. Here's a guess:
SELECT COUNT(1) ... can probably be distributed to the nodes; each node would get a subtotal; then the subtotals could (very rapidly) be added.
SELECT COUNT(DISTINCT ...) ... really has to look at all the rows, one way or another. That is, the effort cannot be fully distributed. Perhaps what happens is that all the rows are shoveled to one node for processing; I would guess that is a couple of GB of data.
Is there some way in Clustrix to get EXPLAIN? I would be interested to see what it says about each of the SELECTs. (And whether it backs up my guess.)
I would expect GROUP BY and DISTINCT to be inefficient in a 'sharded' system (such as Clustrix).
COUNT(1)
In the plan, stream_combine was used; it read only the index order_items_quantity (on quantity).
COUNT(DISTINCT DATE(created_on), product_id)
In general, COUNT(DISTINCT ...) can be inefficient in a relational database, and even more so in a scale-out NewSQL database, because inter-node traffic is hard to reduce (in many cases a lot of data must be forwarded to the GTM node). So Clustrix needs dist_stream_aggregate and the right index (the right columns in the right order).
In the plan, hash_aggregate_partial was shown. It scanned the full table (__idx_order_items__PRIMARY), which is much bigger, so it took a lot of time.
As for parallelism, SLICES=12 may not be enough to keep all available CPUs busy. How many nodes are there, and how many CPUs per node?
Because of DATE(created_on), the index created_at (created_on) would not be used. The optimizer decided a full table scan was more efficient than looking up INDEX(created_at) and then accessing the table (__idx_order_items__PRIMARY).
For this case, I recommend testing the following. Add a DATE column and an index on it together with product_id:
ALTER TABLE order_items ADD COLUMN create_on_date_type DATE;
CREATE INDEX new_index ON order_items (create_on_date_type, product_id);
Regarding DISTRIBUTE=? and SLICES=?, testing should be done on your own dataset (the number of slices can affect how much CPU parallelism you get).
You have to make sure the plan shows dist_stream_aggregate.
dist_stream_aggregate can work efficiently only with the 'new_index' columns for your query.
I believe you would be able to get better performance.
Related
I have a MySQL table structured like this:
CREATE TABLE `messages` (
`id` int NOT NULL AUTO_INCREMENT,
`author` varchar(250) COLLATE utf8mb4_unicode_ci NOT NULL,
`message` varchar(2000) COLLATE utf8mb4_unicode_ci NOT NULL,
`serverid` varchar(200) COLLATE utf8mb4_unicode_ci NOT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`guildname` varchar(1000) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=27769461 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
I need to query this table for various statistics using date ranges for Grafana graphs; however, all of those queries are extremely slow, despite the table being indexed with a composite key of id and date.
"id" is auto-incrementing and date is also always increasing.
The queries generated by Grafana look like this:
SELECT
UNIX_TIMESTAMP(date) DIV 120 * 120 AS "time",
count(DISTINCT(serverid)) AS "servercount"
FROM messages
WHERE
date BETWEEN FROM_UNIXTIME(1615930154) AND FROM_UNIXTIME(1616016554)
GROUP BY 1
ORDER BY UNIX_TIMESTAMP(date) DIV 120 * 120
This query takes over 30 seconds to complete with 27 million records in the table.
Explaining the query results in this output:
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| 1 | SIMPLE | messages | NULL | ALL | PRIMARY | NULL | NULL | NULL | 26952821 | 11.11 | Using where; Using filesort |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
This indicates that MySQL is indeed using the composite primary key I created for indexing the data, but still has to scan almost the entire table, which I do not understand. How can I optimize this table for date range queries?
Plan A:
PRIMARY KEY(date, id), -- to cluster by date
INDEX(id) -- needed to keep AUTO_INCREMENT happy
Assuming the table is quite big, having date at the beginning of the PK puts the rows in the given date range all next to each other. This minimizes (somewhat) the I/O.
Plan B:
PRIMARY KEY(id),
INDEX(date, serverid)
Now the secondary index is exactly what is needed for the one query you have provided. It is optimized for searching by date, and it is smaller than the whole table, hence even faster (I/O-wise) than Plan A.
But, if you have a lot of different queries like this, adding a lot more indexes gets impractical.
Plan C: There may be a still better way:
PRIMARY KEY(id),
INDEX(serverid, date)
In theory, it can hop through that secondary index, checking each serverid. But I am not sure that such an optimization exists.
Plan D: Do you need id for anything other than providing a unique PRIMARY KEY? If not, there may be other options.
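To make Plans A and B concrete, here is the DDL (a sketch; test on a copy first, since rebuilding the primary key of a 27-million-row table takes a while):

-- Plan A: cluster by date; the extra index keeps AUTO_INCREMENT valid
ALTER TABLE messages
  DROP PRIMARY KEY,
  ADD PRIMARY KEY(`date`, id),
  ADD INDEX(id);

-- Plan B: keep the existing PRIMARY KEY and add a secondary index tailored to the query
ALTER TABLE messages ADD INDEX idx_date_server (`date`, serverid);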
The index on (id, date) doesn't help because the first column is id, not date.
You can either
(a) drop the current index and index (date, id) instead -- when date comes first, the index can be used to filter on date regardless of the following columns -- or
(b) just create an additional index only on (date) to support the query.
I have a rather large database where I would like to search/filter on a MEDIUMTEXT (tags), DATETIME (created_time) and a BIT (include) column.
Let's say the database looks like this:
+------+-----------------------+--------------------------+---------+
| id | created_time | tags | include |
|(INT) | (DATETIME) | (MEDIUMTEXT) | (BIT) |
+------+-----------------------+--------------------------+---------+
| 1 | '2017-02-20 08:58:06' | 'client 1' | 1 |
| 2 | '2017-03-01 18:12:00' | 'client 1 and client 2' | 0 |
| 3 | '2017-03-02 02:52:35' | 'client 3 plus client 1' | 0 |
| 4 | '2017-03-03 12:41:58' | 'client 1' | 1 |
| 5 | '2017-03-05 18:03:12' | 'client 2, client 3' | 1 |
| 6 | '2017-03-06 20:25:45' | 'client 1 and client 3' | 0 |
| 7 | '2017-03-08 22:51:22' | 'client 1' | 1 |
+------+-----------------------+--------------------------+---------+
I have indexed the DATETIME and BIT columns and I have used a FULLTEXT index on the MEDIUMTEXT column.
If I run this statement:
select statement 1
------------------
SELECT COUNT(*)
FROM database
WHERE (MATCH(tags) AGAINST('"client 1"' IN BOOLEAN MODE))
AND created_time >= '2017-03-01 12:00:00'
AND include = 0;
It takes 14 sec. to run and returns 6700 rows.
However, if I run:
select statement 2
------------------
SELECT COUNT(*)
FROM database
WHERE (MATCH(tags) AGAINST('"client 1"' IN BOOLEAN MODE));
It takes 0.4 sec. to run and returns 145000 rows. And if I run:
select statement 3
------------------
SELECT COUNT(*)
FROM database
WHERE created_time >= '2017-03-01 12:00:00'
AND include = 0;
It takes 0.5 sec. to run and returns 25000 rows.
Now my question is: how do I make 'select statement 1' run faster? Do I need to first run 'select statement 2' and then run 'select statement 3' on the results? If so, how? Does anyone have experience with UNION, and can I use it here? Or is there a way to create a multiple-column index combining a regular INDEX and a FULLTEXT index?
Added info on the actual table (and not the example above) with special thanks to #rick-james
Query 1:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE created_time >= '2017-01-01 23:00:00'
AND MATCH(tags) AGAINST('\"dkpol\"' IN BOOLEAN MODE);
Query 2:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE MATCH(tags) AGAINST('\"dkpol\"' IN BOOLEAN MODE);
Query 3:
SELECT SQL_NO_CACHE count(*)
FROM Twitter_tweet
WHERE created_time >= '2017-01-01 23:00:00';
EXPLAIN for the 3 queries:
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 1 | SIMPLE | Twitter_tweet | fulltext | created_time_INDEX,SELECT_tags_INDEX,tags_FULLTEXT | tags_FULLTEXT | 0 | const | 1 | 50.00 | Using where; Ft_hints: no_ranking |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 2 | SIMPLE | | | | | | | | | Select tables optimized away |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
| 3 | SIMPLE | Twitter_tweet | range | created_time_INDEX,SELECT_tags_INDEX | created_time_INDEX | 6 | | 572286 | 100.00 | Using where; Using index |
+----+-------------+---------------+----------+----------------------------------------------------+--------------------+---------+-------+--------+----------+-----------------------------------+
SHOW CREATE TABLE:
CREATE TABLE `Twitter_tweet` (
`post_id` bigint(20) unsigned NOT NULL,
`from_user_id` bigint(20) unsigned NOT NULL,
`from_user_username` tinytext,
`from_user_fullname` tinytext,
`message` mediumtext,
`created_time` datetime DEFAULT NULL,
`quoted_post_id` bigint(20) unsigned DEFAULT NULL,
`quoted_user_id` bigint(20) unsigned DEFAULT NULL,
`quoted_user_username` tinytext,
`quoted_user_fullname` tinytext,
`to_post_id` bigint(20) unsigned DEFAULT NULL,
`to_user_id` bigint(20) unsigned DEFAULT NULL,
`to_user_username` tinytext,
`truncated` bit(1) DEFAULT NULL,
`is_retweet` bit(1) DEFAULT NULL,
`retweeting_post_id` bigint(20) unsigned DEFAULT NULL,
`retweeting_user_id` bigint(20) unsigned DEFAULT NULL,
`retweeting_user_username` tinytext,
`retweeting_user_fullname` tinytext,
`tags` text,
`mentions_user_id` text,
`mentions_user_username` text,
`mentions_user_fullname` text,
`post_urls` text,
`count_favourite` int(11) DEFAULT NULL,
`count_retweet` int(11) DEFAULT NULL,
`lang` tinytext,
`location_longitude` float(13,10) DEFAULT NULL,
`location_latitude` float(13,10) DEFAULT NULL,
`place_id` tinytext,
`place_fullname` tinytext,
`source` tinytext,
`fetchtime` datetime DEFAULT NULL,
PRIMARY KEY (`post_id`),
UNIQUE KEY `post_id_UNIQUE` (`post_id`),
KEY `from_user_id_INDEX` (`from_user_id`),
KEY `quoted_user_id_INDEX` (`quoted_user_id`),
KEY `to_user_id_INDEX` (`to_user_id`),
KEY `retweeting_user_id_INDEX` (`retweeting_user_id`),
KEY `created_time_INDEX` (`created_time`),
KEY `retweeting_post_id_INDEX` (`retweeting_post_id`),
KEY `post_all_id_INDEX` (`post_id`,`retweeting_post_id`,`to_post_id`,`quoted_post_id`),
KEY `quoted_post_id_INDEX` (`quoted_post_id`),
KEY `to_post_id_INDEX` (`to_post_id`),
KEY `is_retweet_INDEX` (`is_retweet`),
KEY `SELECT_tags_INDEX` (`created_time`,`is_retweet`,`post_id`),
FULLTEXT KEY `tags_FULLTEXT` (`tags`),
FULLTEXT KEY `mentions_user_id_FULLTEXT` (`mentions_user_id`),
FULLTEXT KEY `message_FULLTEXT` (`message`),
FULLTEXT KEY `content_select` (`tags`,`message`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
When timing, do two things:
Turn off the Query cache (or use SELECT SQL_NO_CACHE ...)
Run the query twice.
When a query is run, these happen:
Check the QC to see if exactly the same query was recently run; if so, return the result from that run. This usually takes ~1ms. (This is not what happened in the examples you gave.)
Perform the query. Now there are multiple sub-cases:
If the "buffer pool" is 'cold', this is likely to involve lots of I/O. I/O is slow. This may explain your 14 second run.
If the desired data is cached in RAM, then it will run faster. This probably explains why the other two runs were a lot faster.
If, after compensating for these, you still have issues, please provide SHOW CREATE TABLE and EXPLAIN SELECT ... for the cases. (There could be other factors involved.)
Schema critique
One way to improve performance (some) is to shrink the data.
lang tinytext, -- there is a 5 char standard
BIGINT takes 8 bytes. A 4-byte INT is enough for half the people in the world. (But first verify that your AUTO_INCREMENTs are not burning a lot of ids.)
For subtle reasons, VARCHAR(255) is better than TINYTEXT, even though they seem equivalent. Whenever practical, use something less than 255.
FLOAT(13,10) has some issues; I recommend DECIMAL(8,6)/(9,6) as sufficient for distinguishing two tweeters sitting next to each other (not that GPS is that precise).
A PRIMARY KEY is a UNIQUE key; get rid of the redundant UNIQUE.
With INDEX(a, b), you don't also need INDEX(a). (You have at least two such cases.)
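A sketch of some of those changes in one ALTER (verify each against your data before running; the DECIMAL sizes assume longitude/latitude ranges of ±180/±90, and CHAR(5) assumes all lang values fit in 5 ASCII characters):

ALTER TABLE Twitter_tweet
  MODIFY lang CHAR(5) CHARACTER SET ascii,      -- the 5-char standard
  MODIFY location_longitude DECIMAL(9,6),
  MODIFY location_latitude DECIMAL(8,6),
  DROP KEY post_id_UNIQUE,                      -- redundant with the PRIMARY KEY
  DROP KEY created_time_INDEX;                  -- covered by SELECT_tags_INDEX(created_time, ...)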
Bulk
What will you do with 6700 or 25000 rows in the resultset? I ask because the effort of returning lots of rows is part of the performance problem. If your next step is to further whittle down the output, then it may be better to do the whittling in SQL.
Analysis
Looking at the second set of Queries:
FT + date range. This first did the FT search, then further filtered by date.
FT, count results, quit. Note that all of that was done in the EXPLAIN, hence "Select tables optimized away" -- and the EXPLAIN time is the same as the SELECT time.
Scan one index for an estimated 572K rows -- done entirely in the index. This cannot be improved. However, it can be made severely worse -- such as by adding a seemingly innocuous AND include = 0. In that case it would not be able to use just the index, but would instead have to bounce between the index and the data -- a lot more costly. A cure for that case: INDEX(include, created_time), which would run faster (see the sketch after these notes).
COUNT(*) is potentially cheap -- no need to return lots of data, often can be completed within an index, etc.
SELECT col1, col2 is faster than SELECT * -- especially because of TEXT columns.
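For the include = 0 case above, a sketch of the cure on the first example table (the table and index names are illustrative):

ALTER TABLE `database` ADD INDEX idx_include_created (include, created_time);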
I'm trying to create a new table by joining four existing ones. My database is static, so making one large preprocessed table will simplify programming, and save lots of time in future queries. My query works fine when limited with a WHERE, but seems to either hang, or go too slowly to notice any progress.
Here's the working query. The result only takes a few seconds.
SELECT group.group_id, MIN(application.date), person.person_name, pers_appln.sequence
FROM group
JOIN application ON group.appln_id=application.appln_id
JOIN pers_appln ON pers_appln.appln_id=application.appln_id
JOIN person ON person.person_id=pers_appln.person_id
WHERE group_id="24601"
GROUP BY group.group_id, pers_appln.sequence
;
If I simply remove the WHERE line, it will run for days with nothing to show. Adding a CREATE TABLE newtable AS at the beginning does the same thing. It never moves beyond 0% progress.
The group, application, and person tables all use the MyISAM engine, while pers_appln uses InnoDB. The columns are all indexed. The table sizes range from about 40 million to 150 million rows. I know it's rather large, but I wouldn't think it would pose this much of a problem. The computer currently has 4GB of RAM.
Any ideas how to make this work?
Here's the SHOW CREATE TABLE info. There are no views or virtual tables:
CREATE TABLE `group` (
`APPLN_ID` int(10) unsigned NOT NULL,
`GROUP_ID` int(10) unsigned NOT NULL,
KEY `idx_appln` (`APPLN_ID`),
KEY `idx_group` (`GROUP_ID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
CREATE TABLE `application` (
`APPLN_ID` int(10) unsigned NOT NULL,
`APPLN_AUTH` char(2) NOT NULL DEFAULT '',
`APPLN_NR` varchar(20) NOT NULL DEFAULT '',
`APPLN_KIND` char(2) DEFAULT '',
`DATE` date DEFAULT NULL,
`IPR_TYPE` char(2) DEFAULT '',
PRIMARY KEY (`APPLN_ID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
CREATE TABLE `person` (
`PERSON_ID` int(10) unsigned NOT NULL,
`PERSON_CTRY_CODE` char(2) NOT NULL,
`PERSON_NAME` varchar(300) DEFAULT NULL,
`PERSON_ADDRESS` varchar(500) DEFAULT NULL,
KEY `idx_person` (`PERSON_ID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 MAX_ROWS=30000000 AVG_ROW_LENGTH=100
CREATE TABLE `pers_appln` (
`PERSON_ID` int(10) unsigned NOT NULL,
`APPLN_ID` int(10) unsigned NOT NULL,
`SEQUENCE` smallint(4) unsigned DEFAULT NULL,
`PLACE` smallint(4) unsigned DEFAULT NULL,
KEY `idx_pers_appln` (`APPLN_ID`),
KEY `idx_person` (`PERSON_ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY HASH (appln_id)
PARTITIONS 20 */
Here's the EXPLAIN of my query:
+----+-------------+-------------+--------+----------------------------+-----------------+---------+--------------------------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+----------------------------+-----------------+---------+--------------------------+----------+---------------------------------+
| 1 | SIMPLE | person | ALL | idx_person | NULL | NULL | NULL | 47827690 | Using temporary; Using filesort |
| 1 | SIMPLE | pers_appln | ref | idx_application,idx_person | idx_person | 4 | mydb.person.PERSON_ID | 1 | |
| 1 | SIMPLE | application | eq_ref | PRIMARY | PRIMARY | 4 | mydb.pers_appln.APPLN_ID | 1 | |
| 1 | SIMPLE | group | ref | idx_application | idx_application | 4 | mydb.pers_appln.APPLN_ID | 1 | |
+----+-------------+-------------+--------+----------------------------+-----------------+---------+--------------------------+----------+---------------------------------+
Verify that key_buffer_size is about 200M and innodb_buffer_pool_size is about 1200M. Perhaps they could be bigger, but make sure you are not swapping.
group should have PRIMARY KEY(appln_id, group_id) and INDEX(group_id, appln_id) instead of the two KEYs it has.
pers_appln should have INDEX(person_id, appln_id) and INDEX(appln_id, person_id) instead of the two keys it has. If possible, one of those should be PRIMARY KEY, but watch out for the PARTITIONing.
A minor improvement would be to change those CHAR(2) fields to be CHARACTER SET ascii -- assuming you don't really need utf8. That would shrink the field from 6 bytes to 2 bytes per row.
The PARTITIONing is probably not helping at all. (No, I can't say that removing the PARTITIONing will speed it up much.)
If these suggestions do not help enough, please provide the output from EXPLAIN SELECT ...
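A sketch of those key changes (it assumes the (APPLN_ID, GROUP_ID) pairs are unique; rebuilds on tables this size take time, so test on a copy):

ALTER TABLE `group`
  DROP KEY idx_appln, DROP KEY idx_group,
  ADD PRIMARY KEY (APPLN_ID, GROUP_ID),
  ADD INDEX (GROUP_ID, APPLN_ID);

ALTER TABLE pers_appln
  DROP KEY idx_pers_appln, DROP KEY idx_person,
  ADD INDEX (PERSON_ID, APPLN_ID),
  ADD INDEX (APPLN_ID, PERSON_ID);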
Edit
Converting to InnoDB and specifying PRIMARY KEYs for all tables will help. This is because InnoDB "clusters" the PRIMARY KEY with the data. What you have now is a lot of bouncing between a MyISAM index and its data -- literally hundreds of millions of times. Assuming not everything can be cached in your small 4GB, that means a lot of disk I/O. I would not be surprised if the non-WHERE version would take a week to run. Even with InnoDB, there will be I/O, but some of it will be avoided because:
1. reaching into a table with the PK gets the data without another disk hit.
2. the extra indexes I proposed will avoid hitting the data, again avoiding an extra disk hit.
(Millions of references * "an extra disk hit" = days of time.)
If you switch all of your tables to InnoDB, you should lower key_buffer_size to 20M and raise innodb_buffer_pool_size to 1500M. (These are approximate; do not raise them so high that there is any swapping.)
Please show us the CREATE TABLEs with InnoDB -- I want to make sure each table has a PRIMARY KEY and which column(s) that is. The PRIMARY KEY makes a big difference in this particular situation.
For person, the MyISAM version has just a KEY(person_id). If you did not change the keys in the conversion, InnoDB will invent a PRIMARY KEY. When the JOIN to that table occurs, InnoDB will (1) drill down the secondary-key BTree to find the invented PK value, then (2) drill down the PK+data BTree to find the row. If, instead, person_id could be the PK, that JOIN would run twice as fast. Possibly even faster, depending on how big the table is and how much it needs to jump around in the index / data. That is, the two BTree lookups add to the pressure on the cache (buffer_pool).
How big is each table? What was the final value for innodb_buffer_pool_size? Once you have changed everything from MyISAM to InnoDB, set key_buffer_size to 40M or less, and set innodb_buffer_pool_size to about 70% of available RAM. If the Data + Index sizes for all the tables are less than the buffer_pool, then (once cache is primed) the query won't have to do any I/O. This is easily a 10x speedup.
pers_appln is a many-to-many relationship? Then, probably
PRIMARY KEY(appln_id, person_id),
INDEX(person_id, appln_id) -- if you need to go the other direction, too.
I found the solution: switching to an SSD. My table creation time went from an estimated 45 days to 16 hours. Previously, the database spent all its time on hard drive I/O, barely even using 5% of the CPU or RAM.
Thanks everyone.
I've been spending several days profiling a wide variety of queries used by a distributed application of ours in a MySQL database. Our app potentially stores millions of records on client database servers, and the queries can vary enough that the design of the indexes isn't always clear or easy. A tiny bit of extra overhead on writes is acceptable if the lookup speed is fast enough.
I've managed to narrow down a few composite indexes that work very well for nearly all of our most common queries. There may be some columns in the below indexes I can weed out, but I need to run tests to be sure.
However, my problem: A certain query actually runs faster when it uses an index that contains fewer columns present in the conditions.
The table structure with current composite indexes:
CREATE TABLE IF NOT EXISTS `prism_data` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`epoch` int(10) unsigned NOT NULL,
`action_id` int(10) unsigned NOT NULL,
`player_id` int(10) unsigned NOT NULL,
`world_id` int(10) unsigned NOT NULL,
`x` int(11) NOT NULL,
`y` int(11) NOT NULL,
`z` int(11) NOT NULL,
`block_id` mediumint(5) DEFAULT NULL,
`block_subid` mediumint(5) DEFAULT NULL,
`old_block_id` mediumint(5) DEFAULT NULL,
`old_block_subid` mediumint(5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `epoch` (`epoch`),
KEY `block` (`block_id`,`action_id`,`player_id`),
KEY `location` (`world_id`,`x`,`z`,`y`,`epoch`,`action_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I have eight common queries that I've been testing and they all show incredible performance improvement on a database with 50 million records. One query however, doesn't.
The following query returns 11088 rows in (9.77 sec) and uses the location index
SELECT SQL_NO_CACHE id,
epoch,
action,
player,
world_id,
x,
y,
z
FROM prism_data
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
WHERE world_id =
(SELECT w.world_id
FROM prism_worlds w
WHERE w.world = 'world')
AND (a.action = 'world-edit')
AND (prism_data.x BETWEEN -7220 AND -7020)
AND (prism_data.y BETWEEN -22 AND 178)
AND (prism_data.z BETWEEN -9002 AND -8802)
AND prism_data.epoch >= 1392220467;
+----+-------------+------------+--------+----------------+----------+---------+--------------------------------+--------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------+----------+---------+--------------------------------+--------+------------------------------------+
| 1 | PRIMARY | a | ref | PRIMARY,action | action | 77 | const | 1 | Using where; Using index |
| 1 | PRIMARY | prism_data | ref | epoch,location | location | 4 | const | 660432 | Using index condition; Using where |
| 1 | PRIMARY | p | eq_ref | PRIMARY | PRIMARY | 4 | minecraft.prism_data.player_id | 1 | NULL |
| 2 | SUBQUERY | w | ref | world | world | 767 | const | 1 | Using where; Using index |
+----+-------------+------------+--------+----------------+----------+---------+--------------------------------+--------+------------------------------------+
If I remove the world condition, the query no longer matches the location index and instead uses the epoch index. Amazingly, it returns 11088 rows in (0.31 sec).
9.77 sec versus 0.31 sec is too much of a difference to ignore. I don't understand why I'm not seeing such a performance hit on my other queries that use the location index, but more importantly I don't know what I can do to fix this.
Presumably, the "epoch" index is more selective than the "location" index.
Note that MySQL might be running the subquery once for every row. That could have considerable overhead, even with an index. Doing 30 million index lookups might take a little time.
Try doing the query this way:
SELECT SQL_NO_CACHE id,
epoch,
action,
player,
world_id,
x,
y,
z
FROM prism_data
INNER JOIN prism_players p ON p.player_id = prism_data.player_id
INNER JOIN prism_actions a ON a.action_id = prism_data.action_id
CROSS JOIN (SELECT w.world_id FROM prism_worlds w WHERE w.world = 'world') w
WHERE world_id = w.world_id
AND (a.action = 'world-edit')
AND (prism_data.x BETWEEN -7220 AND -7020)
AND (prism_data.y BETWEEN -22 AND 178)
AND (prism_data.z BETWEEN -9002 AND -8802)
AND prism_data.epoch >= 1392220467;
If this doesn't show an improvement, then the issue is the selectivity of the indexes: MySQL is simply making the wrong decision about which index is best to use. If it does show an improvement, it is because the subquery is executed only once, in the FROM clause.
EDIT:
Your location index is:
KEY `location` (`world_id`,`x`,`z`,`y`,`epoch`,`action_id`)
Can you change this to:
KEY `location` (`world_id`, `action_id`, `x`, `z`, `y`, `epoch`)
This allows the WHERE filtering to use action_id as well as x. (Only the first inequality in the index can use direct index lookups.)
or, better yet, one of these:
KEY `location` (`world_id`, action_id, epoch, `x`, `z`, `y`)
KEY `location` (`world_id`, epoch, action_id, `x`, `z`, `y`)
KEY `location` (epoch, `world_id`, action_id, `x`, `z`, `y`)
The idea is to move epoch before x so it will be used for the where clause conditions.
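A sketch of applying the second of those variants (dropping and adding in one ALTER so the table is rebuilt only once):

ALTER TABLE prism_data
  DROP KEY location,
  ADD KEY location (world_id, epoch, action_id, x, z, y);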
I have this table, that contains around 80,000,000 rows.
CREATE TABLE `mytable` (
`date` date NOT NULL,
`parameters` mediumint(8) unsigned NOT NULL,
`num` tinyint(3) unsigned NOT NULL,
`val1` int(11) NOT NULL,
`val2` int(10) NOT NULL,
`active` tinyint(3) unsigned NOT NULL,
`ref` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`ref`) USING BTREE,
KEY `parameters` (`parameters`)
) ENGINE=MyISAM AUTO_INCREMENT=79092001 DEFAULT CHARSET=latin1
It's organized around 2 main columns: "parameters" and "date".
there are around 67,000 possible values for "parameters"
for each "parameters" there are around 1200 rows, each with a different date.
so for each date, there are 67,000 rows.
1200 * 67,000 = 80,400,000.
table size appears as 1.5GB, index size 1.4GB.
now, I want to query the table to retrieve all rows of one "parameters"
(actually I want to do it for each parameter, but this is a good start)
SELECT val1 FROM mytable WHERE parameters=1;
the first run gives me results in 8 seconds
subsequent runs for different but close values of parameters (2, 3, 4...) are instantaneous
a run for a "far away" value (parameters=1000) gives me results in 8 seconds again.
I did tests running the same query without the index and got results in 20 seconds, so I guess the index is kicking in, as shown by EXPLAIN, but it isn't giving a drastic jump in performance:
+----+-------------+----------+------+---------------+------------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------------+---------+-------+------+-------+
| 1 | SIMPLE | mytable | ref | parameters | parameters | 3 | const | 1097 | |
+----+-------------+----------+------+---------------+------------+---------+-------+------+-------+
But I'm still baffled by the time taken for such an easy request (no join, directly on the index).
The server is a two-year-old machine with two quad-core 2.6GHz CPUs running Ubuntu, with 4GB of RAM.
I've raised the key_buffer_size parameter to 1G and restarted mysql, but noticed no change whatsoever.
Should I consider this normal, or is there something I'm doing wrong? I get the feeling that with the right config the request should be almost immediate.
Try using a covering index, i.e. create an index that includes both of the columns you need. It won't need a second disk I/O to fetch the values from the main table, since the data is right there in the index.
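A sketch of such a covering index for this query (the index name is illustrative):

ALTER TABLE mytable ADD INDEX idx_parameters_val1 (parameters, val1);

-- EXPLAIN should then show "Using index", meaning the query is answered
-- from the index alone, with no second read into the data file:
SELECT val1 FROM mytable WHERE parameters=1;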