Slow INSERT query on 200m table - mysql

We have the following table with about 200 million records:
CREATE TABLE IF NOT EXISTS `history` (
`airline` char(2) NOT NULL,
`org` char(3) NOT NULL,
`dst` char(3) NOT NULL,
`departat` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`arriveat` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`validon` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`price` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
/*!50500 PARTITION BY RANGE COLUMNS(org)
(PARTITION p0 VALUES LESS THAN ('AHI') ENGINE = MyISAM,
PARTITION p1 VALUES LESS THAN ('ARO') ENGINE = MyISAM,
...
PARTITION p39 VALUES LESS THAN ('WMA') ENGINE = MyISAM,
PARTITION p40 VALUES LESS THAN (MAXVALUE) ENGINE = MyISAM) */;
--
-- Indexes for table `history`
--
ALTER TABLE `history`
ADD KEY `tail` (`org`,`dst`,`departat`);
We're doing bulk inserts of some VALUES frequently, usually up to 1000 records in simple INSERT queries, without any decoration such as ON DUPLICATE KEY (the index is not unique anyway).
Sometimes when I go to server status in phpMyAdmin, I see a bunch of INSERT statements waiting for each other, sometimes for up to 300-400 seconds. Nothing else seems to be going on on the server at that particular time. We have 32 GB of RAM and otherwise excellent performance.
How to troubleshoot this issue? Thanks for help.

Probably the first step is to do a couple of test runs with profiling on.
Usually you'd do something like:
SET LOCAL PROFILING=ON;
-- run your INSERT, like:
INSERT INTO yourtable (id) VALUES (1),(2),(3);
SHOW PROFILES;
+----------+------------+------------------------------------------------+
| Query_ID | Duration | Query |
+----------+------------+------------------------------------------------+
| 1012 | 6.25220000 | INSERT INTO yourtable (id) VALUES (1),(2),(3); |
+----------+------------+------------------------------------------------+
This tells you very basic information, like duration of the query (6.25 sec in this case). To get the actual details you need to pull up the profile for said query:
SHOW PROFILE FOR QUERY 1012;
+------------------------------+----------+
| Status | Duration |
+------------------------------+----------+
| starting | 0.004356 |
| checking permissions | 0.000015 |
| Opening tables | 6.202999 |
| System lock | 0.000017 |
| init | 0.000342 |
| update | 0.023951 |
| Waiting for query cache lock | 0.000008 |
| update | 0.000007 |
| end | 0.000011 |
| query end | 0.019984 |
| closing tables | 0.000019 |
| freeing items | 0.000304 |
| logging slow query | 0.000006 |
| cleaning up | 0.000181 |
+------------------------------+----------+
You may notice that 'Opening tables' took very long. In this example, another process had locked the table (LOCK TABLES) in order to delay the query's execution. Further information about the states is available in the manual.

Set the default to 0 for the timestamp fields and try again, e.g.:
departat timestamp NOT NULL DEFAULT 0,
arriveat timestamp NOT NULL DEFAULT 0,
A TIMESTAMP is stored like an integer (the number of seconds that have passed since the epoch); it is not stored as a formatted value the way a DATETIME is.
In your case you have written the default in datetime format for a timestamp field type.

There are several things you can do to optimize bulk inserts.
One of them is turning off these variables, if you are sure your data doesn't contain duplicates (don't forget to set them back to 1 after the upload is complete):
SET AUTOCOMMIT = 0; SET FOREIGN_KEY_CHECKS = 0; SET UNIQUE_CHECKS = 0;
You also need to check that no other users are accessing the table. You can also try InnoDB, since it is said to handle bulk inserts into a table that already contains data better than MyISAM does.
You can also check your tables for fragmentation; sometimes the overhead the OS incurs when assigning free space on fragmented drives is the cause of the delay.
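To make the "turn off, load, turn back on" sequence concrete, here is a minimal sketch against the history table from the question. The inserted row values are made up for illustration. Note that these settings mainly affect InnoDB tables (MyISAM has no transactions or foreign keys), so the sketch assumes the table has been converted to InnoDB as suggested above.

```sql
-- Relax per-row checks for this session only.
SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

-- One multi-row INSERT per batch; the values here are hypothetical.
INSERT INTO history (airline, org, dst, departat, arriveat, price)
VALUES
  ('LH', 'FRA', 'JFK', '2014-05-01 10:00:00', '2014-05-01 18:30:00', 450),
  ('BA', 'LHR', 'SFO', '2014-05-02 09:15:00', '2014-05-02 20:05:00', 520);

-- Make the batch durable, then restore the defaults.
COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;
```

Restoring the variables in the same session matters, because they are session-scoped and a pooled connection would otherwise keep the relaxed settings.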

Related

MySQL 8 ignoring integer lengths

I have a MySQL 8.0.19 running in a Docker container and using the InnoDB engine. I have noticed that table integer field lengths are getting ignored.
The issue occurs with integer datatypes regardless of whether I run a CREATE or an ALTER query.
CREATE TABLE `test` (
`id` int DEFAULT NULL,
`text_field` varchar(20) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`decimal_field` decimal(6,2) DEFAULT NULL,
`int_field` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
The lengths are showing as 0 in my MySQL client (Navicat), but the same occurs when checking in the console with SHOW FULL COLUMNS FROM `test`;
mysql> SHOW FULL COLUMNS FROM `test`;
+---------------+--------------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+---------------+--------------+--------------------+------+-----+---------+-------+---------------------------------+---------+
| id | int | NULL | YES | | NULL | | select,insert,update,references | |
| text_field | varchar(20) | utf8mb4_unicode_ci | YES | | NULL | | select,insert,update,references | |
| decimal_field | decimal(6,2) | NULL | YES | | NULL | | select,insert,update,references | |
| int_field | int | NULL | YES | | NULL | | select,insert,update,references | |
+---------------+--------------+--------------------+------+-----+---------+-------+---------------------------------+---------+
The Type column should be showing int(11) for the two integer fields, but it isn't.
Is this related to something in my MySQL settings, and if so, which variable would have to be changed?
This is a change documented in the MySQL 8.0.19 release notes:
Display width specification for integer data types was deprecated in
MySQL 8.0.17, and now statements that include data type definitions in
their output no longer show the display width for integer types, with
these exceptions:
The type is TINYINT(1). MySQL Connectors make the assumption that
TINYINT(1) columns originated as BOOLEAN columns; this exception
enables them to continue to make that assumption.
The type includes the ZEROFILL attribute.
This change applies to tables, views, and stored routines, and affects
the output from SHOW CREATE and DESCRIBE statements, and from
INFORMATION_SCHEMA tables.
For DESCRIBE statements and INFORMATION_SCHEMA queries, output is
unaffected for objects created in previous MySQL 8.0 versions because
information already stored in the data dictionary remains unchanged.
This exception does not apply for upgrades from MySQL 5.7 to 8.0, for
which all data dictionary information is re-created such that data
type definitions do not include display width. (Bug #30556657, Bug #97680)
The "length" of an integer column doesn't mean anything. A column of int(11) is the same as int(2) or int(40). They are all a fixed-size, 32-bit integer data type. They support the same minimum and maximum value.
The "length" of integer columns has been a confusing feature of MySQL for years. It's only a hint that affects the display width, not the storage or the range of values. Practically, it only matters when you use the ZEROFILL option.
mysql> create table t ( i1 int(6) zerofill, i2 int(12) zerofill );
Query OK, 0 rows affected (0.02 sec)
mysql> insert into t set i1 = 123, i2 = 123;
Query OK, 1 row affected (0.00 sec)
mysql> select * from t;
+--------+--------------+
| i1 | i2 |
+--------+--------------+
| 000123 | 000000000123 |
+--------+--------------+
1 row in set (0.00 sec)
So it's a good thing that the misleading integer "length" is now deprecated and removed. It has caused confusion for many years.
I can confirm that having upgraded AWS RDS to MySQL 8.0.19 that you can now sync using Navicat correctly.
However, PLEASE BE AWARE!!
When updating the id column, if auto_increment is set, Navicat removes auto_increment to change the length and then re-applies it at the end. This causes the auto_increment column to reassign the ids in sequential order!
ALTER TABLE `database`.`table` MODIFY COLUMN `id` mediumint(0) NOT NULL FIRST;
...
...
ALTER TABLE `database`.`table` MODIFY COLUMN `id` mediumint(0) NOT NULL AUTO_INCREMENT;
If you are using table relationships and do not have the foreign keys setup properly, this will break your database!
Also, if you have id numbers of zero or below in your auto_increment column, this will cause the following error:
Result: 1062 - ALTER TABLE causes auto_increment resequencing,
resulting in duplicate entry '1' for key 'table.PRIMARY'
To avoid the above, you will need to manually change each table's id length to 0 and then save the changes before attempting to use the Navicat sync feature. When you save the changes in Navicat, it will automatically change any other int column lengths to 0.
Please ensure you thoroughly test your changes on a copy of the database before trying to apply them to any production databases.

Update large table from smaller, mission critical, table without locking small table

In MySQL, I have two innodb tables, a small mission critical table, that needs to be readily available at all times for reads/writes. Call this mission_critical. I have a larger table (>10s of millions of rows), called big_table. I need to update big_table, for instance:
update mission_critical c, big_table b
set
b.something = c.something_else
where b.refID=c.id
This query could take more than an hour, but this creates a write-lock on the mission_critical table. Is there a way I can tell mysql, "I don't want a lock on mission_critical" so that that table can be written to?
I understand that this is not ideal from a transactional point of view. The only workaround I can think of right now is to make a copy of the small mission_critical table and do the update from that (which I don't care gets locked), but I'd rather not do that if there's a way to make MySQL natively deal with this more gracefully.
It is not the table that is locking but all of the records in mission_critical that are locked, since they are basically all scanned by the update. I am not assuming this; the symptom is that when a user logs in to an online system, it tries to update a datetime column in mission_critical to update the last time they logged in. These queries die due to a Lock wait timeout exceeded error while the query above is running. If I kill the query above, all pending queries run immediately.
mission_critical.id and big_table.refID are both indexed.
The pertinent portions of the creation statements for each table are:
mission_critical:
CREATE TABLE `mission_critical` (
`intID` int(11) NOT NULL AUTO_INCREMENT,
`id` smallint(6) DEFAULT NULL,
`something_else` varchar(50) NOT NULL,
`lastLoginDate` datetime DEFAULT NULL,
PRIMARY KEY (`intID`),
UNIQUE KEY `id` (`id`),
UNIQUE KEY `something_else` (`something_else`)
) ENGINE=InnoDB AUTO_INCREMENT=1432 DEFAULT CHARSET=latin1
big_table:
CREATE TABLE `big_table` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`postDate` date DEFAULT NULL,
`postTime` int(11) DEFAULT NULL,
`refID` smallint(6) DEFAULT NULL,
`something` varchar(50) NOT NULL,
`change` decimal(20,2) NOT NULL,
PRIMARY KEY (`id`),
KEY `refID` (`refID`),
KEY `postDate` (`postDate`)
) ENGINE=InnoDB AUTO_INCREMENT=138139125 DEFAULT CHARSET=latin1
The explanation of the query is:
+----+-------------+------------------+------------+------+---------------+-------+---------+------------------------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+------+---------------+-------+---------+------------------------------------+------+----------+-------------+
| 1 | SIMPLE | mission_critical | | ALL | id | | | | 774 | 100 | Using where |
| 1 | UPDATE | big_table | | ref | refID | refID | 3 | db.mission_critical.something_else | 7475 | 100 | |
+----+-------------+------------------+------------+------+---------------+-------+---------+------------------------------------+------+----------+-------------+
I first suggested a workaround with a subquery, to create a copy in an internal temporary table. But in my test the small table was still locked for writes. So I guess your best bet is to make a copy manually.
The reason for the lock is described in this bug report: https://bugs.mysql.com/bug.php?id=72005
This is what Sinisa Milivojevic wrote in an answer:
update table t1,t2 ....
any UPDATE with a join is considered a multiple-table update. In that
case, a referenced table has to be read-locked, because rows must not
be changed in the referenced table during UPDATE until it has
finished. There can not be concurrent changes of the rows, nor DELETE
of the rows, nor, much less, exempli gratia any DDL on the referenced
table. The goal is simple, which is to have all tables with consistent
contents when UPDATE finishes, particularly since multiple-table
UPDATE can be executed with several passes.
In short, this behavior is for a good reason.
Consider writing INSERT and UPDATE triggers, which will update the big_table on the fly. That would delay writes on the mission_critical table. But it might be fast enough for you, and wouldn't need the mass-update-query any more.
Also check whether it would be better to use char(50) instead of varchar(50). I'm not sure, but it's possible that it will improve the update performance because the row size wouldn't need to change. In a test, I was able to improve the update performance by about 50%.
UPDATE will lock the rows that it needs to change. It may also lock the "gaps" after those rows.
You can use MySQL transactions in a loop and update only 100 rows at a time:
BEGIN;
SELECT ... FOR UPDATE; -- arrange to have this select include the 100 rows
UPDATE ...; -- update the 100 rows
COMMIT;
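As a rough sketch of one iteration of that loop for the tables in the question: the chunk is selected by a range of big_table.id, and both the range boundaries and the chunk size of 100 are illustrative assumptions to be tuned. The application (or a stored procedure) would repeat this with the next id range until it passes MAX(id).

```sql
-- One iteration: update only the big_table rows with id 1..100.
-- Locks taken on mission_critical are released at COMMIT,
-- so they should be held only briefly per chunk.
START TRANSACTION;

UPDATE big_table b
JOIN mission_critical c ON b.refID = c.id
SET b.something = c.something_else
WHERE b.id BETWEEN 1 AND 100;

COMMIT;

-- Next iteration would use: WHERE b.id BETWEEN 101 AND 200, and so on.
```

Between chunks, the login updates against mission_critical get a chance to run, which is the point of splitting the hour-long statement up.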
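It may be worth trying a correlated subquery to see if the optimiser comes up with a different plan, but performance may be worse. Note that, unlike the join, this form also touches the rows of big_table that have no match in mission_critical (the subquery returns NULL for them).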
update big_table b
set b.something = (select c.something_else from mission_critical c where b.refID = c.id)

Database performance checklist

In a database I have a table with order items. The table holds roughly 36 million records.
Running a query like this takes about 3 minutes:
SELECT COUNT(DISTINCT DATE(created_on), product_id) FROM order_items;
Running a query like this takes about 13 seconds:
SELECT COUNT(1) FROM order_items;
Something tells me that 36 million records is not that much, and that both queries are running rather slowly.
What would be the checklist to start looking into the performance issue here?
We are using MySQL (in fact, a Clustrix version of it, MySQL 5.0.45-clustrix-6.0.1).
Edit. Adding more info:
/* SHOW CREATE TABLE order_items; */
CREATE TABLE `order_items` (
`id` int(10) unsigned not null AUTO_INCREMENT,
`state` enum('pending','sold_out','approved','declined','cancelled','processing','completed','expired') CHARACTER SET utf8 not null default 'pending',
`order_id` int(10) unsigned not null,
`product_id` int(10) unsigned not null,
`quantity` smallint(5) unsigned not null,
`price` decimal(10,2) unsigned not null,
`total` decimal(10,2) unsigned not null,
`created_on` datetime not null,
`updated_on` datetime not null,
`employee_id` int(11),
`customer_id` int(11) unsigned not null,
PRIMARY KEY (`id`) /*$ DISTRIBUTE=1 */,
KEY `updated_on` (`updated_on`) /*$ DISTRIBUTE=1 */,
KEY `state` (`state`,`quantity`) /*$ DISTRIBUTE=3 */,
KEY `product_id` (`product_id`,`state`) /*$ DISTRIBUTE=2 */,
KEY `product` (`product_id`) /*$ DISTRIBUTE=1 */,
KEY `order_items_quantity` (`quantity`) /*$ DISTRIBUTE=2 */,
KEY `order_id` (`order_id`,`state`,`created_on`) /*$ DISTRIBUTE=3 */,
KEY `order` (`order_id`) /*$ DISTRIBUTE=1 */,
KEY `index_order_items_on_employee_id` (`employee_id`) /*$ DISTRIBUTE=2 */,
KEY `customer_id` (`customer_id`) /*$ DISTRIBUTE=2 */,
KEY `created_at` (`created_on`) /*$ DISTRIBUTE=1 */,
) AUTO_INCREMENT=36943352 CHARACTER SET utf8 ENGINE=InnoDB /*$ REPLICAS=2 SLICES=12 */
And:
/* SHOW VARIABLES LIKE '%buffer%'; */
+----------------------------------------+-------+
| Variable_name | Value |
+----------------------------------------+-------+
| backup_compression_buffer_size_bytes | 8192 |
| backup_read_buffer_size_bytes | 8192 |
| backup_write_buffer_size_bytes | 8192 |
| mysql_master_trx_buffer_kb | 256 |
| mysql_slave_session_buffer_size_events | 100 |
| net_buffer_length | 16384 |
| replication_master_buffer_kb | 65536 |
+----------------------------------------+-------+
Edit 2. Here's EXPLAIN statements for both queries:
mysql> EXPLAIN SELECT COUNT(1) FROM order_items;
+----------------------------------------------------------+-------------+-------------+
| Operation | Est. Cost | Est. Rows |
+----------------------------------------------------------+-------------+-------------+
| row_count "expr1" | 29740566.81 | 1.00 |
| stream_combine | 26444732.70 | 32958341.10 |
| compute expr0 := param(0) | 1929074.80 | 2746528.43 |
| filter isnotnull(param(0)) | 1915342.16 | 2746528.43 |
| index_scan 1 := order_items.order_items_quantity | 1854308.19 | 3051698.25 |
+----------------------------------------------------------+-------------+-------------+
5 rows in set (0.13 sec)
And:
mysql> EXPLAIN SELECT COUNT(DISTINCT DATE(created_on), product_id) FROM order_items;
+----------------------------------------------------------------------------------+-------------+------------+
| Operation | Est. Cost | Est. Rows |
+----------------------------------------------------------------------------------+-------------+------------+
| hash_aggregate_combine expr1 := count(DISTINCT (0 . "expr0"),(1 . "product_id")) | 10115923.36 | 4577547.38 |
| hash_aggregate_partial GROUPBY((0 . "expr0"), (1 . "product_id")) | 3707357.04 | 4577547.38 |
| compute expr0 := cast(1.created_on, date) | 2166388.20 | 3051698.25 |
| index_scan 1 := order_items.__idx_order_items__PRIMARY | 2151129.71 | 3051698.25 |
+----------------------------------------------------------------------------------+-------------+------------+
4 rows in set (0.24 sec)
The first query must walk the entire table, checking every row. An index on created_on and product_id would probably speed it up significantly. If you don't know about indexes, http://use-the-index-luke.com is a great place to start.
It seems to me that the second query should be nearly instant, because it only has to check table metadata and doesn't need to check any rows. (That holds for engines that store an exact row count, such as MyISAM; InnoDB has to scan an index to count the rows.)
You should publish the query plan, but I suspect that to process the query MySQL must walk through the product_id and the created_on indexes. For the created_on field it must also aggregate the values (the field is a datetime but you want to group by date). If you need speed, I would add an additional field created_on_date with only the date, and I would create an index on product_id and created_on_date. It should make your query much faster.
Of course the count(1) query is faster, because it doesn't need to read the table rows at all; it can be answered from an index alone.
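A minimal sketch of the extra date column suggested above, using the order_items table from the question. The column name created_on_date comes from the answer; the index name is made up, and whether Clustrix accepts exactly this syntax is an assumption (it is plain MySQL syntax).

```sql
-- Add a date-only copy of created_on and backfill it once.
ALTER TABLE order_items ADD COLUMN created_on_date DATE NULL;

UPDATE order_items
SET created_on_date = DATE(created_on);

-- Hypothetical index name; both columns the query needs are in the index.
CREATE INDEX idx_product_created_date
    ON order_items (product_id, created_on_date);

-- The original query, rewritten to use the new column instead of DATE(...).
SELECT COUNT(DISTINCT created_on_date, product_id)
FROM order_items;
```

New rows would also need created_on_date filled in, either by the application or by a trigger; that part is left out of the sketch.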
Some things to note:
If you add INDEX(product_id, created_on), the first query should run faster because it would be a "covering index". (The fields can be in the opposite order.)
Running those two queries in the order given could have caused info to be cached, thereby making the second query run faster.
SELECT COUNT(*) FROM tbl will use the smallest index. (In InnoDB.)
If you have enough RAM, and if innodb_buffer_pool_size is bigger than the table, then one or the other of the operations may have been performed entirely in RAM. RAM is a lot faster than disk.
Please provide SHOW CREATE TABLE order_items; I am having to guess too much.
Please provide SHOW VARIABLES LIKE '%buffer%';. How much RAM do you have?
Edit
Since it is Clustrix, there could be radically different things going on. Here's a guess:
SELECT COUNT(1) ... can probably be distributed to the nodes; each node would get a subtotal; then the subtotals could (very rapidly) be added.
SELECT COUNT(DISTINCT ...)... really has to look at all the rows, one way or another. That is, the effort cannot be distributed. Perhaps what happens is that all the rows are shoveled to one node for processing. I would guess it is a couple GB of stuff.
Is there some way in Clustrix to get EXPLAIN? I would be interested to see what it says about each of the SELECTs. (And whether it backs up my guess.)
I would expect GROUP BY and DISTINCT to be inefficient in a 'sharded' system (such as Clustrix).
COUNT(1)
In the plan, stream_combine was used. It read only the index order_items_quantity (quantity).
COUNT(DISTINCT DATE(created_on), product_id)
In general, COUNT(DISTINCT ...) can be inefficient in an RDB, and even more so in a NewSQL scale-out RDB, because it is difficult to reduce inter-node traffic (in many cases a lot of data has to be forwarded to the GTM node). So Clustrix needs 'dist_stream_aggregate' and the right index (the right columns in the right order).
In the plan, hash_aggregate_partial was shown. It scanned the FULL TABLE (__idx_order_items__PRIMARY), which is much bigger than a secondary index, and that took a lot of time.
As for parallelism, the number of slices (SLICES=12) may not be enough to use all the CPUs available. I wonder how many nodes you have, and how many CPUs per node?
Because of DATE(created_on), the index created_at (created_on) cannot be used. The optimizer decided that a FULL TABLE SCAN is more efficient than looking up INDEX(created_at) and then accessing the TABLE (__idx_order_items__PRIMARY).
For this case, I recommend testing the following:
Add column create_on_date_type
create index new_index on order_items(create_on_date_type, product_id)
Regarding distribute=? and slices=?, tests should be run against your own dataset (the number of slices may affect how well CPU parallelism works).
You have to make sure the plan has dist_stream_aggregate.
dist_stream_aggregate can work efficiently only with the 'new_index' columns for your query.
I believe you would be able to get better performance.

Inserting 74G of data into a MySQL table takes more than 2 days; how to improve insert performance

Inserting 74G of data into a MySQL table takes more than 2 days. How can I improve the insert performance?
Table t1 is as follows:
+-------+----------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+----------------+------+-----+---------+-------+
| id | varchar(50) | NO | PRI | | |
| value | varchar(10000) | YES | | NULL | |
| dt | int(11) | YES | | NULL | |
+-------+----------------+------+-----+---------+-------+
The SQL looks like this: insert into t1 values(XX,XX,XX),(XX,XX,XX),(XX,XX,XX),....(XX,XX,XX)
What might be slowing down things significantly is the VARCHAR(50) PRIMARY KEY in two ways:
The PRIMARY KEY being active during insertion slows insertion down. Normally this doesn't matter and is the desired behavior because of the other things the PRIMARY KEY does, but for bulk INSERT operation of 74G the PRIMARY KEY might just be in your way, performance wise.
The PRIMARY KEY being a VARCHAR(50) makes a slow PRIMARY KEY.
Personally I would try to avoid using VARCHAR for PRIMARY KEY.
How PRIMARY KEY slows down INSERT
The PRIMARY KEY is a unique index which is used frequently. An index speeds things up for reading. There are many read access operations which need to compare, for example JOIN queries and WHERE X = Y queries. Without an index, these queries need to resort to linear search, which is O(n), with n being the number of rows in the table in question. That is slow. With an index, these queries resort to smarter algorithms, which usually have a best case access time of O(C), meaning constant access time, in case a hash could be used and there are no collisions, and O(log2(n)) if, due to hash collisions, a sorted list or tree needs to be walked to find the exact match.
But the index comes at a price. An index needs to be maintained (complex). Plus, in the case of a unique index, duplicates need to be avoided (trivial).
You should imagine the index like a sorted list. Let's compare a normal table and an index.
In a normal table, new entries would simply go to the end of the table. They are appended. SQL calls this INSERT, but physically it's an append operation. That's trivial, as nothing needs to be compared, copied or moved around. For the table itself, it hardly makes any difference if the row that you insert is row number 1 or row number 20 billion.
In an index, new entries must be inserted at the right place. Finding the right place is trivial; that's a read access operation between O(C) and O(log2(n)). Once the right place is found, the insert operation needs to perform an insertion, that is, moving all elements after the insert position by one position towards the end. In this simplified sorted-list picture the complexity of the INSERT thus is O(n). (Real MySQL indexes are B-trees, where an insert costs O(log n) plus occasional page splits, but the point stands that every insert has to do extra work to keep the index ordered.)
A pre-sorted PRIMARY KEY, that is, INSERT operations performed in the sequence of the PRIMARY KEY, is not guaranteed to speed up the INSERT operation. It would only speed up the INSERT operation if the PRIMARY KEY is a plain array; it does not help if the PRIMARY KEY is hashed, because without knowing the hashing function used, the order is seemingly random.
How the Datatype of the PRIMARY KEY influences speed
For the PRIMARY KEY I would always use something which is a 32 bit value, if 4 billion rows are sufficient, or a 64 bit value otherwise. The simple reason is that on a 64 bit machine, comparing a 32 or 64 bit value is trivial. That basically boils down to a single CPU instruction, cmp on many CPUs. If you use VARCHAR, the comparison is far more complex. It needs to compare a string byte by byte. Depending on the DBMS, the locale settings and the collation used, it might even be more complex than that.
The special case of a fast PRIMARY KEY
A PRIMARY KEY of the form
CREATE TABLE Person (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(50)
);
would be fast because due to the AUTO_INCREMENT it's basically guaranteed that new keys are appended at the end of the index, and MySQL generates that new unique value itself.
What you can do in your case
In case your 74G of data is clean, i.e. contains no duplicate keys, you can disable the PRIMARY KEY for the INSERT operation and re-enable it after the INSERT operation. That should speed things up significantly. There wouldn't be anything that slows down the insertion operation. And the creation of an INDEX afterwards has roughly the complexity of a sort operation.
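MySQL has no switch that literally disables a PRIMARY KEY, so one way to read the advice above is to drop the key before the load and add it back afterwards. The following is a sketch under that assumption, using table t1 from the question; the bulk INSERT itself is left as a placeholder comment.

```sql
-- Remove the primary key so the load does not have to maintain it.
ALTER TABLE t1 DROP PRIMARY KEY;

-- ... run the bulk INSERT statements here ...

-- Rebuild the key in one pass. This fails with a duplicate-entry
-- error if the loaded data contains repeated id values.
ALTER TABLE t1 ADD PRIMARY KEY (id);
```

On InnoDB, both ALTER statements rebuild the table because the primary key is the clustered index, so it is worth timing this on a sample of the data before committing to it for the full 74G.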
By refactoring the table structure, the insertion was sped up dramatically. The new table is as follows:
+-------+----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+----------------+------+-----+---------+----------------+
| id | varchar(50) | NO | UNI | | |
| value | varchar(10000) | YES | | NULL | |
| dt | int(11) | YES | | NULL | |
| uid | int(11) | NO | PRI | NULL | auto_increment |
+-------+----------------+------+-----+---------+----------------+
With the former table structure, data has to be moved around very frequently. The new table structure avoids that. Here is an experimental result comparing the insertion speed (data size = 1G):
+----------------+------------+
| former table | 30 minutes |
+----------------+------------+
| new table | 5 minutes |
+----------------+------------+
We can further speed up the insertion by using this SQL:
"insert into t1 (id,value,dt) values (XX,XX,XX),(XX,XX,XX),....(XX,XX,XX) on duplicate key update value=values(value),dt=values(dt)" instead of "replace into t1 values(XX,XX,XX),(XX,XX,XX),(XX,XX,XX),....(XX,XX,XX)", because this avoids deleting the existing row first.
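For reference, here is a sketch of the refactored table and the upsert written out in full. The DDL is reconstructed from the DESCRIBE output above, so details such as the unique key's name (left to MySQL here) are assumptions, and the inserted values are made up.

```sql
-- Reconstructed from the DESCRIBE output; an approximation, not the exact DDL.
CREATE TABLE t1 (
  id    VARCHAR(50)    NOT NULL DEFAULT '',
  value VARCHAR(10000) DEFAULT NULL,
  dt    INT            DEFAULT NULL,
  uid   INT            NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (uid),
  UNIQUE KEY (id)
);

-- Rows whose id already exists are updated in place;
-- new ids are appended with the next uid.
INSERT INTO t1 (id, value, dt)
VALUES ('k1', 'first value', 20140101),
       ('k2', 'second value', 20140102)
ON DUPLICATE KEY UPDATE
  value = VALUES(value),
  dt    = VALUES(dt);
```

REPLACE, by contrast, deletes the conflicting row and inserts a new one, which also assigns it a new uid.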

MySQL I/O bound InnoDB query optimization problem without setting innodb_buffer_pool_size to 5GB

I got myself into a MySQL design scalability issue. Any help would be greatly appreciated.
The requirements:
Storing users' SOCIAL_GRAPH and USER_INFO about each user in their social graph. Many concurrent reads and writes per second occur. Dirty reads acceptable.
Current design:
We have 2 (relevant) tables. Both InnoDB for row locking, instead of table locking.
USER_SOCIAL_GRAPH table that maps a logged in (user_id) to another (related_user_id). PRIMARY key composite user_id and related_user_id.
USER_INFO table with information about each related user. PRIMARY key is (related_user_id).
Note 1: No relationships defined.
Note 2: Each table is now about 1GB in size, with 8 million and 2 million records, respectively.
Simplified table SQL creates:
CREATE TABLE `user_social_graph` (
`user_id` int(10) unsigned NOT NULL,
`related_user_id` int(11) NOT NULL,
PRIMARY KEY (`user_id`,`related_user_id`),
KEY `user_idx` (`user_id`)
) ENGINE=InnoDB;
CREATE TABLE `user_info` (
`related_user_id` int(10) unsigned NOT NULL,
`screen_name` varchar(20) CHARACTER SET latin1 DEFAULT NULL,
[... and many other non-indexed fields irrelevant]
`last_updated` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`related_user_id`),
KEY `last_updated_idx` (`last_updated`)
) ENGINE=InnoDB;
MY.CFG values set:
innodb_buffer_pool_size = 256M
key_buffer_size = 320M
Note 3: Memory available 1GB, these 2 tables are 2GBs, other innoDB tables 3GB.
Problem:
The following example SQL statement, which needs to access all records found, takes 15 seconds to execute (!!) and num_results = 220,000:
SELECT SQL_NO_CACHE COUNT(u.related_user_id)
FROM user_info u LEFT JOIN user_social_graph u2 ON u.related_user_id = u2.related_user_id
WHERE u2.user_id = '1'
AND u.related_user_id = u2.related_user_id
AND (NOT (u.related_user_id IS NULL));
For a user_id with a count of 30,000, it takes about 3 seconds (!).
EXPLAIN EXTENDED for the 220,000 count user. It uses indices:
+----+-------------+-------+--------+------------------------+----------+---------+--------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+------------------------+----------+---------+--------------------+--------+----------+--------------------------+
| 1 | SIMPLE | u2 | ref | user_user_idx,user_idx | user_idx | 4 | const | 157320 | 100.00 | Using where |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | u2.related_user_id | 1 | 100.00 | Using where; Using index |
+----+-------------+-------+--------+------------------------+----------+---------+--------------------+--------+----------+--------------------------+
How do we speed these up without setting innodb_buffer_pool_size to 5GB?
Thank you!
The user_social_graph table is not indexed correctly !!!
You have this:
CREATE TABLE user_social_graph
(user_id int(10) unsigned NOT NULL,
related_user_id int(11) NOT NULL,
PRIMARY KEY (user_id,related_user_id),
KEY user_idx (user_id))
ENGINE=InnoDB;
The second index is redundant, since the first column of the primary key is already user_id. You are attempting to join the related_user_id column over to the user_info table. That column needs to be indexed.
Change user_social_graph as follows:
CREATE TABLE user_social_graph
(user_id int(10) unsigned NOT NULL,
related_user_id int(11) NOT NULL,
PRIMARY KEY (user_id,related_user_id),
UNIQUE KEY related_user_idx (related_user_id,user_id))
ENGINE=InnoDB;
This should change the EXPLAIN PLAN. Keep in mind that the index order matters, depending on the way you query the columns.
Give it a Try !!!
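Since the table already exists and holds data, the same change can probably be applied in place rather than by re-creating it. A sketch, using the index names from the question and from the answer above:

```sql
-- Drop the redundant single-column index and add the reversed
-- composite index in a single table rebuild.
ALTER TABLE user_social_graph
  DROP INDEX user_idx,
  ADD UNIQUE KEY related_user_idx (related_user_id, user_id);

-- Confirm the resulting indexes before re-running EXPLAIN on the query.
SHOW INDEX FROM user_social_graph;
```

Doing both changes in one ALTER TABLE avoids rebuilding the 8 million row table twice.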
What is the MySQL version? Its manual contains important information for speeding up statements and code in general.
Change your paradigm to a data warehouse capable of managing tables up to terabytes in size. Migrate your legacy MySQL database to the new paradigm with a free tool or application. This is one example: http://www.infobright.org/Downloads/What-is-ICE/ and there are many others (free and commercial).
PostgreSQL is not commercial, and there are a lot of tools to migrate MySQL to it!