I have this simple query:
INSERT IGNORE INTO beststat (bestid,period,rawView) VALUES ( 4510724 , 201205 , 1 )
On the table:
CREATE TABLE `beststat` (
`bestid` int(11) unsigned NOT NULL,
`period` mediumint(8) unsigned NOT NULL,
`view` mediumint(8) unsigned NOT NULL DEFAULT '0',
`rawView` mediumint(8) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`bestid`,`period`)
) ENGINE=InnoDB AUTO_INCREMENT=2020577 DEFAULT CHARSET=utf8
And it takes 1 second to complete.
Side note: it doesn't always take 1 second. Sometimes it completes in as little as 0.05 seconds, but often it takes 1 second.
This table (beststat) currently has ~500,000 records and its size is 40 MB. I have 4 GB of RAM and innodb_buffer_pool_size = 104,857,600 (100 MB), on MySQL 5.1.49-3.
This is the only InnoDB table in my database (others are MyISAM)
ANALYZE TABLE beststat shows: OK
Maybe there is something wrong with InnoDB settings?
I ran some simulations about three years ago as part of an evaluation project for a customer. They had a requirement to be able to search a table to which data is constantly being added, and they wanted results to be up to date to within a minute.
InnoDB showed much better results in the beginning, but quickly deteriorated (well before 1 million records), until I removed all indexes (including the primary key). At that point InnoDB became superior to MyISAM for inserts/updates. (I had much worse hardware than you, running the tests only on my laptop.)
Conclusion: inserts will always suffer if you have indexes, especially unique ones.
I would suggest the following optimizations:
Remove all indexes from your beststat table and use it as a simple dump.
If you really need these unique indexes, consider some programmable solution (like remembering the max bestid at all times and insisting that the new record is above that number, then immediately increasing it). But do you really need so many unique fields? They all sound to me like plain indexes.
Have a background thread move new records from InnoDB to another table (which can be MyISAM) where they would be indexed; a sketch of this follows below the list.
Consider dropping indexes temporarily and re-indexing the table after a bulk update, possibly switching between two tables so that querying is never interrupted.
These are theoretical solutions, I admit, but they are the best I can offer given your question.
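For the background-thread approach, a minimal sketch of the move (the beststat_archive table and the bestid watermark are assumptions for illustration, not part of your schema):

-- Find how far the archive already got, then copy only the newer rows.
SET @last := (SELECT COALESCE(MAX(bestid), 0) FROM beststat_archive);

INSERT INTO beststat_archive (bestid, period, rawView)
SELECT bestid, period, rawView
FROM   beststat
WHERE  bestid > @last;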
Oh, and if your table is planned to grow to many millions, consider a NoSQL solution.
So you have two unique indexes on the table. Your primary key is an autonumber. Since it is not really part of the data as you add it, it is what you call an artificial (surrogate) primary key. You also have a unique index on bestid and period. If bestid and period are supposed to be unique, that combination would be a good candidate for the primary key.
InnoDB stores the table as a B-tree organized by the primary key (it always clusters the data; if you don't define a primary key it falls back to a hidden clustered key). So in your case the tree is stored on disk ordered by the autonumber key. When you create the second index, it actually creates a second tree on disk containing the bestid and period values. That index does not contain the other columns of the table, only bestid, period, and your primary-key value.
OK, so now you insert the data. The first thing MySQL does is ensure the unique index stays unique, so it reads the index to see whether you are trying to insert a duplicate value. This is where the slowdown comes into play: it first has to check uniqueness, and only if that test passes does it write the data. Then it also has to insert the bestid, period, and primary-key values into the unique index. So the total operation is: 1 index read to check the value, 1 row insert into the table, 1 insert of bestid and period into the index, for a total of three operations. If you removed the autonumber and used only the unique index as the primary key, it would read the table to check uniqueness and, if unique, insert into the table: 1 table read to check the values, 1 insert into the table. That is two operations versus three, so you do 33% less work by removing the redundant autonumber.
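A sketch of that change, assuming the earlier version of the table had a surrogate id AUTO_INCREMENT primary key and a unique index named bestid_period on (bestid, period); adjust names to your actual schema:

-- 1. Remove AUTO_INCREMENT so the surrogate column can be dropped cleanly.
ALTER TABLE beststat MODIFY id INT UNSIGNED NOT NULL;

-- 2. Promote the natural key to primary key and drop the surrogate column
--    and the now-redundant unique index.
ALTER TABLE beststat
    DROP PRIMARY KEY,
    DROP COLUMN id,
    DROP INDEX bestid_period,
    ADD PRIMARY KEY (bestid, period);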
I hope this is clear as I am typing from my Android and autocorrect keeps on changing innodb to inborn. Wish I was at a computer.
I have multiple big tables for business data, with the smallest one having 38 million rows (24 GB data, 26 GB index). I have indexes set up to speed up the lookups and the buffer pool set to 80% of total RAM (116 GB). Even with these settings, over time we have started observing performance issues. I am constrained by the disk size (1 TB) and sharding is not an option currently. Data growth has increased to 0.5M rows per day, which is leading to frequent optimisation and master-switch exercises. Table schemas and indexes have already been optimised, so I have started looking at partitioning the table to improve performance.

My primary partitioning use case is to delete data on a monthly basis by dropping partitions, so that optimisations are not required and read/write latencies are improved. Following is the structure of one of the big tables (column names have been changed for legal reasons; assume that the columns with indexes defined have lookup use cases):
CREATE TABLE `table_name` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`data_1` int(11) NOT NULL,
`data_2` varchar(40) COLLATE utf8_unicode_ci NOT NULL,
`data_3` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`data_4` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_data1` (`data_1`),
KEY `index_data2` (`data_2`)
) ENGINE=InnoDB AUTO_INCREMENT=100572 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I am planning to partition on the created_at column. However, the problem is that the partitioning column has to be part of all the unique keys. I can add the created_at column to the primary key, but that would lead to an increase in index size, which in turn has its own side effects. Is there some workaround or a better solution?
Apart from solving this problem, there are a few more questions whose answers I couldn't find in any documentation or articles.
1. Why does MySQL require the partitioning column to be part of every unique key?
2. The queries from the ORM don't include a created_at clause, which means pruning is not possible for reads; we were okay with that, provided inserts are always pruned. However, that doesn't seem to be the case. Why does MySQL open all the partitions for inserts?
MySQL version: 5.6.33-79.0-log Percona Server (GPL), Release 79.0, Revision 2084bdb
PRIMARY KEY(id, created_at) will take only a tiny bit more space than PRIMARY KEY(id). I estimate it at much less than 1% for your data. I can't tell about the index space -- can you show us the non-primary index(es)?
Explanation: The leaf nodes of the data (which is a BTree organized by the PK) will not change in size. The non-leaf nodes will have created_at added to each 'row'. As a rule of thumb in InnoDB, non-leaf nodes take up about 1% of the space for the BTree.
For the INDEX BTrees, the leaf nodes need an extra 4 bytes/row for created_at unless created_at is already in the index.
Let's say you currently have INDEX(foo) where foo is INT and id is also INT. That's a total of 8 bytes (plus overhead). Adding created_at (a 4-byte TIMESTAMP) expands each leaf 'row' to 12+overhead. So, that index may double in size.
A guess: Your 24G+26G might grow to 25G+33G.
It sounds like you have several indexes. You do understand that INDEX(a) is not useful if you also have INDEX(a,b)? And that INDEX(x,y) is a lot better than INDEX(x), INDEX(y) in some situations? Let's discuss your indexes.
The main benefit for PARTITIONing is your use case -- DROP PARTITION is a lot faster than DELETE. My blog on such.
Don't be lulled by partitioning. You are hoping for "read/write latencies are improved"; such is not likely to happen. If you would like further explanation please provide a SELECT where you think it might happen.
How many "months" will you partition on? I recommend not more than 50. PARTITIONing has some inefficiencies when there are lots of partitions.
Because of the need for the partition key to be in UNIQUE keys, the uniqueness constraint is almost totally useless. Having it on the end of an AUTO_INCREMENT id is not an issue.
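A hedged sketch of what that could look like for the table above (partition boundaries are just examples; it assumes created_at contains no NULLs, since primary-key columns must be NOT NULL):

ALTER TABLE table_name
    MODIFY created_at DATETIME NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id, created_at);

ALTER TABLE table_name
    PARTITION BY RANGE COLUMNS (created_at) (
        PARTITION p2018_01 VALUES LESS THAN ('2018-02-01'),
        PARTITION p2018_02 VALUES LESS THAN ('2018-03-01'),
        PARTITION pfuture  VALUES LESS THAN (MAXVALUE)
    );

-- The monthly purge then becomes a near-instant metadata operation:
ALTER TABLE table_name DROP PARTITION p2018_01;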
Consider whether something other than id can be the PK.
Question 1: When INSERTing a row, all UNIQUE keys are immediately checked for "dup key". Without the partition key being part of the unique key, this would mean probing every partition. This is too costly to contemplate; so it was not done. (In the future, a 'global-to-the-table' UNIQUE key may be implemented. Version 8.0 has some hooks for such.)
Question 2a: Yes, if the SELECT's WHERE does not adequately specify the partition key, all partitions will be opened and looked at. This is another reason to minimize the number of partitions. Hmmm... If you do a SELECT on the 31st of the month and do the same SELECT the next day, you could get fewer rows (even without any deletes, just the DROP PARTITION); this seems "wrong".
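If you want to check which partitions a given query would touch, MySQL 5.6 supports EXPLAIN PARTITIONS (the query below is only an illustration):

EXPLAIN PARTITIONS
SELECT * FROM table_name
WHERE  created_at >= '2018-01-01' AND created_at < '2018-02-01';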
Question 2b: "Why does mysql open all the partitions for inserts?" -- What makes you think it does? There is an odd case where the "first" partition is 'unnecessarily' opened -- the partition key is DATETIME.
I am building a website (LAMP stack) with an Amazon RDS MySQL instance as the back end (type db.m3.medium).
I am happy with database integrity, and it works perfectly with regards to SELECT/JOIN/ETC queries (everything is normalized, indexed, and foreign keyed, all tables have id primary keys and relevant secondary keys / unique keys).
I have a table 'df_products' with approx half a million products in it. The products need to be updated nightly. The process involves a PHP script reading over a large products data-file and inserting data into several tables (products table, product_colours table, brands table, etc), calling either INSERT or UPDATE depending on whether or not a row already exists. This is done as one giant transaction.
What I am seeing is that the UPDATE commands are sufficiently fast (50/sec, not exactly lightning but it should do); however, the INSERT commands are super slow (1/sec) and appear to be consuming 100% of a CPU core. On a dual-core instance we see 50% overall CPU use (i.e. one full core).
I assume that this is because the indexes (1x PRIMARY + 5x INDEX + 1x UNIQUE + 1x FULLTEXT) are being rebuilt after every INSERT. However, I thought that putting the entire process into one transaction would stop indexes being rebuilt until the transaction is committed.
I have tried setting the following params via PHP but there is negligible performance improvement:
$this->db->query('SET unique_checks=0');
$this->db->query('SET foreign_key_checks=0;');
The process will take weeks to complete at this rate so we must improve performance. Google appears to suggest using LOAD DATA. However:
I would have to generate five files in order to populate five tables
The process would have to use UPDATE commands as opposed to INSERT since the tables already exist
I would still need to loop over the products and scan the database for what values already do and don't exist
The database is entirely InnoDB and I don't plan to move to MyISAM (I want transactions, foreign keys, etc). This means that I cannot disable indexes. Even if I did it would probably be a big performance drain as we need to check if a row already exists before we insert it, and without an index this will be super slow.
I have provided the products table definition below for information. Can you please provide advice on what process we should be using to achieve faster INSERT/UPDATE on multiple large related tables? Or what optimisations we can make to our existing process?
Thank you,
CREATE TABLE `df_products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`id_brand` int(11) NOT NULL,
`title` varchar(255) NOT NULL,
`id_gender` int(11) NOT NULL,
`id_colourSet` int(11) DEFAULT NULL,
`id_category` int(11) DEFAULT NULL,
`desc` varchar(500) DEFAULT NULL,
`seoAlias` varchar(255) CHARACTER SET ascii NOT NULL,
`runTimestamp` timestamp NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `seoAlias_UNIQUE` (`seoAlias`),
KEY `idx_brand` (`id_brand`),
KEY `idx_category` (`id_category`),
KEY `idx_seoAlias` (`seoAlias`),
KEY `idx_colourSetId` (`id_colourSet`),
KEY `idx_timestamp` (`runTimestamp`),
KEY `idx_gender` (`id_gender`),
FULLTEXT KEY `fulltext_title` (`title`),
CONSTRAINT `fk_id_colourSet` FOREIGN KEY (`id_colourSet`) REFERENCES `df_productcolours` (`id_colourSet`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `fk_id_gender` FOREIGN KEY (`id_gender`) REFERENCES `df_lu_genders` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=285743 DEFAULT CHARSET=utf8
How many "genders" are there? If the usual 2, don't normalize it, don't index it, and don't use a 4-byte INT to store it; use a CHAR(1) CHARACTER SET ascii (only 1 byte) or an ENUM (1 byte).
Each unnecessary index is a performance drain on the load, regardless of how it is done.
For INSERT vs UPDATE, look into using INSERT ... ON DUPLICATE KEY UPDATE.
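A sketch against your df_products table (the column list is abbreviated and the values are placeholders; it relies on the seoAlias UNIQUE key to detect the duplicate):

INSERT INTO df_products (id_brand, title, id_gender, seoAlias, runTimestamp)
VALUES (1, 'Example product', 2, 'example-product', NOW())
ON DUPLICATE KEY UPDATE
    id_brand     = VALUES(id_brand),
    title        = VALUES(title),
    id_gender    = VALUES(id_gender),
    runTimestamp = VALUES(runTimestamp);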
Load the nightly data into a separate table (this could be MyISAM with no indexes). Then run one query to update existing rows and one to insert new rows. (Each needs a JOIN.) See http://mysql.rjweb.org/doc.php/staging_table, especially the 2 SQLs used for "normalizing". They can be adapted to your situation.
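A rough sketch of that flow, using a hypothetical df_products_staging table and seoAlias as the match key:

-- 1. Bulk-load the nightly feed into an unindexed staging table
--    (LOAD DATA INFILE, or multi-row INSERTs).

-- 2. Update rows that already exist:
UPDATE df_products p
JOIN   df_products_staging s ON s.seoAlias = p.seoAlias
SET    p.id_brand     = s.id_brand,
       p.title        = s.title,
       p.runTimestamp = s.runTimestamp;

-- 3. Insert rows that are new:
INSERT INTO df_products (id_brand, title, id_gender, seoAlias, runTimestamp)
SELECT s.id_brand, s.title, s.id_gender, s.seoAlias, s.runTimestamp
FROM   df_products_staging s
LEFT JOIN df_products p ON p.seoAlias = s.seoAlias
WHERE  p.id IS NULL;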
Any kind of multi-row query runs noticeably faster than 1-row at a time. (A 100-row INSERT runs 10 times as fast as 100 1-row inserts.)
innodb_flush_log_at_trx_commit = 2 will let the individual write statements run much faster. (Batching them as I suggest won't speed up much.)
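For example (note that on Amazon RDS dynamic server variables are changed through the DB parameter group rather than with SET GLOBAL):

-- Trades a little durability for speed: up to ~1 second of committed
-- transactions can be lost on a crash.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;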
I have a set of integer data. The first is the number 0 and the last is 47055833459. There are two billion of these numbers from the first to the last, and they will never change or be added to. The only insert into the MySQL table will be loading this data into it; from then on, it will only be read from.
I predict the size of the database table to be roughly 20 GB. I plan on having two columns:
id, data
id will be the primary key, an auto-incremented unsigned INT, and data will be an unsigned BIGINT.
What will be the best way of optimising this data for read only with those two columns? I have looked at the other questions which are similar but they all take into account write speeds and ever increasing tables. The host I am using does not support MySQL partitioning so unfortunately this is not an option at the moment. If it turns out that partitioning is the only way, then I will reconsider a new host.
The table will only ever be accessed by the id column so there does not need to be an index on the data column.
So to summarise, what is the best way of handling a table with 2 billion rows with two columns, without partitioning, optimised for reads, in MySQL?
Assuming you are using InnoDB, you should simply:
CREATE TABLE T (
ID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
DATA BIGINT UNSIGNED
);
This will effectively create one big B-Tree and nothing else, and retrieving a row by ID can be done in a single index seek [1]. Take a look at "Understanding InnoDB clustered indexes" for more info.
[1] Without table heap access; in fact, there is no heap at all.
Define your table like so.
CREATE TABLE `lkup` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`data` BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (`id`, `data`)
)
The compound primary key will consume disk space, but will make lookups very fast; your queries can be satisfied just by reading the index (which is known as a covering index).
And, do OPTIMIZE TABLE lkup when you're finished loading your static data into it. That may take a while, but it will pay off big at runtime.
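For example, a lookup like the following never needs to touch anything other than the primary-key B-tree:

SELECT data FROM lkup WHERE id = 123456789;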
In MySQL, if you have a MyISAM table that looks something like:
CREATE TABLE `table1` (
`col1` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`col2` INT(10) UNSIGNED NOT NULL,
PRIMARY KEY (`col2`, `col1`)
)
COLLATE='utf8_general_ci'
ENGINE=MyISAM;
if you insert rows, the auto-increment counter is effectively maintained separately for every distinct col2 value. If my explanation isn't clear enough, this answer should explain it better. InnoDB, however, doesn't follow this behavior. In fact, InnoDB won't even let you put col2 first in the primary key definition.
My question is, is it possible to model this behavior in InnoDB somehow without resorting to methods like MAX(id)+1 or the likes? The closest I could find is this, but it's for PostgreSQL.
It's a neat feature of MyISAM that I have used before, but you can't do it with InnoDB. InnoDB determines the highest number on startup, then keeps the number in RAM and increments it when needed.
Since InnoDB handles simultaneous inserts/updates, it has to reserve the number at the start of a transaction. On a transaction rollback, the number is still "used" but not saved. Your MAX(id) solution could get you in trouble because of this. A transaction starts, the number is reserved, you pull the highest "saved" number + 1 in a separate transaction, which is the same as that reserved for the first transaction. The transaction finishes and the reserved number is now saved, conflicting with yours.
MAX(id) returns the highest saved number, not the highest used number. You could have a MyISAM table whose sole purpose is to generate the numbers you want. It's the same number of queries as your MAX(id) solution; it's just that one is a SELECT, the other an INSERT.
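A hedged sketch of that idea (names are made up; it assumes the real table1 is InnoDB with col1 defined as a plain NOT NULL column, not AUTO_INCREMENT):

-- MyISAM table whose only job is to hand out per-col2 numbers.
CREATE TABLE `table1_seq` (
    `col1` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    `col2` INT(10) UNSIGNED NOT NULL,
    PRIMARY KEY (`col2`, `col1`)
) ENGINE=MyISAM;

-- Reserve the next number for col2 = 42, then reuse it for the InnoDB insert.
INSERT INTO table1_seq (col2) VALUES (42);
INSERT INTO table1 (col1, col2) VALUES (LAST_INSERT_ID(), 42);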
A short recap of what happened. I am working with 71 million records (not much compared to billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my need. My table structure is:
CREATE TABLE `IPAddresses` (
`id` int(11) unsigned NOT NULL auto_increment,
`ipaddress` bigint(20) unsigned default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM;
And I added the 71 million records and then did a:
ALTER TABLE IPAddresses ADD INDEX(ipaddress);
It's been 14 hours and the operation is still not completed. Upon Googling, I found that there is a well-known approach for solving this problem - Partitioning. I understand that I need to partition my table now based on the ipaddress but can I do this without recreating the entire table? I mean, through an ALTER statement? If yes, there was one requirement saying that the column to be partitioned on should be a primary key. I will be using the id of this ipaddress in constructing a different table so ipaddress is not my primary key. How do I partition my table given this scenario?
OK, it turns out that this problem was more than just a simple "create a table, index it and forget it" problem :) Here's what I did, just in case someone else faces the same problem (I have used IP addresses as an example, but it works for other data types too):
Problem: Your table has millions of entries and you need to add an index really fast
Usecase: Consider storing millions of IP addresses in a lookup table. Adding the IP addresses should not be a big problem but creating an index on them takes more than 14 hours.
Solution: Partition your table using MySQL's Partitioning strategy
Case #1: When the table you want is not yet created
CREATE TABLE IPADDRESSES(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ipaddress BIGINT UNSIGNED,
PRIMARY KEY(id, ipaddress)
) ENGINE=MYISAM
PARTITION BY HASH(ipaddress)
PARTITIONS 20;
Case #2: When the table you want is already created.
There seems to be a way to use ALTER TABLE to do this but I have not yet figured out a proper solution for this. Instead, there is a slightly inefficient solution:
CREATE TABLE IPADDRESSES_TEMP(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ipaddress BIGINT UNSIGNED,
PRIMARY KEY(id)
) ENGINE=MYISAM;
Insert your IP addresses into this table. And then create the actual table with partitions:
CREATE TABLE IPADDRESSES(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ipaddress BIGINT UNSIGNED,
PRIMARY KEY(id, ipaddress)
) ENGINE=MYISAM
PARTITION BY HASH(ipaddress)
PARTITIONS 20;
And then finally
INSERT INTO IPADDRESSES(ipaddress) SELECT ipaddress FROM IPADDRESSES_TEMP;
DROP TABLE IPADDRESSES_TEMP;
ALTER TABLE IPADDRESSES ADD INDEX(ipaddress)
And there you go... indexing on the new table took me about 2 hours on a 3.2GHz machine with 1GB RAM :) Hope this helps.
Creating indexes with MySQL is slow, but not that slow. With 71 million records, it should take a couple of minutes, not 14 hours. Possible problems are:
you have not configured sort buffer sizes and other configuration options
look here: http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_myisam_sort_buffer_size
If you try to generate a 1 GB index with an 8 MB sort buffer, it's going to take lots of passes. But if the buffer is larger than your CPU cache it will get slower, so you have to test and see what works best (see the example at the end of this answer).
someone has a lock on the table
your IO system sucks
your server is swapping
etc
As usual, check iostat, vmstat, logs, etc. Issue a LOCK TABLE on your table to check whether someone has a lock on it.
FYI on my 64-bit desktop creating an index on 10M random BIGINTs takes 17s...
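As an illustration of the sort-buffer point above (the value is only an example; session scope keeps it from affecting other connections):

SET SESSION myisam_sort_buffer_size = 256 * 1024 * 1024;  -- 256 MB
ALTER TABLE IPAddresses ADD INDEX (ipaddress);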
I had a problem where I wanted to speed up my query by adding an index. The table only had about 300,000 records, but it also took way too long. When I checked the MySQL server processes, it turned out that the query I was trying to optimize was still running in the background. Four times! After I killed those queries, indexing was done in a jiffy. Perhaps the same problem applies to your situation.
You are using MyISAM, which is being deprecated soon. An alternative would be InnoDB.
"InnoDB is a transaction-safe (ACID compliant) storage engine for MySQL that has commit, rollback, and crash-recovery capabilities to protect user data. InnoDB row-level locking (without escalation to coarser granularity locks) and Oracle-style consistent nonlocking reads increase multi-user concurrency and performance. InnoDB stores user data in clustered indexes to reduce I/O for common queries based on primary keys. To maintain data integrity, InnoDB also supports FOREIGN KEY referential-integrity constraints. You can freely mix InnoDB tables with tables from other MySQL storage engines, even within the same statement."
http://dev.mysql.com/doc/refman/5.0/en/innodb.html
According to http://dev.mysql.com/tech-resources/articles/storage-engine/part_1.html, you should be able to switch between engines with a simple ALTER command, which gives you some flexibility. It also states that each table in your DB can be configured independently.
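For example, converting the table from the question is a single statement (it rewrites the whole table, so expect it to take a while on 71 million rows):

ALTER TABLE IPAddresses ENGINE=InnoDB;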
You have already inserted 71 million records into your table. Now if you want to create partitions on the primary key column of your table, you can use ALTER TABLE. An example is given below for your reference.
CREATE TABLE t1 (
id INT,
year_col INT
);
ALTER TABLE t1
PARTITION BY HASH(id)
PARTITIONS 8;