We have a fairly unoptimized table with the following definition:
CREATE TABLE `Usage` (
`TxnDate` varchar(30) DEFAULT NULL,
`TxnID` decimal(13,2) NOT NULL,
`UserID2015` varchar(20) DEFAULT NULL,
`UserRMN` decimal(13,0) DEFAULT NULL,
`CustomerNo` decimal(13,0) DEFAULT NULL,
`OperatorName` varchar(50) DEFAULT NULL,
`AggregatorName` varchar(30) DEFAULT NULL,
`TransAmount` decimal(10,2) DEFAULT NULL,
`MMPLTxnID` decimal(13,0) DEFAULT NULL,
`ProductType` varchar(30) DEFAULT NULL,
`YearMonthRMN` varchar(50) DEFAULT NULL,
PRIMARY KEY (`TxnID`),
UNIQUE KEY `TxnID` (`TxnID`) USING BTREE,
KEY `TxnDate` (`TxnDate`),
KEY `OperatorName` (`OperatorName`),
KEY `AggregatorName` (`AggregatorName`),
KEY `MMPLTxnID` (`MMPLTxnID`),
KEY `ProductType` (`ProductType`),
KEY `UserRMN` (`UserRMN`),
KEY `YearMonthRMN` (`YearMonthRMN`) USING BTREE,
KEY `CustomerNo` (`CustomerNo`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The table has about 170M records.
I want to drop the primary key and instead add an auto number primary key. So far the index dropping has taken 2h.
Why is it taking so long to remove an index, is there any sorting happening?
How can I estimate the time to drop the index?
When I add the auto-increment column, will I have to allow time for sorting the table, or is that not necessary for a new auto-increment index?
You're not just dropping an index, you're dropping the primary key.
Normally, InnoDB tables are stored as a clustered index based on the primary key, so by dropping the primary key, it has to create a new table that uses either the secondary unique key or else an auto-generated key for its clustered index.
I've done a fair amount of MySQL consulting, and the question of "how much time will this take?" is a common question.
It takes as long as it takes to build a new clustered index on your server. This is hard to predict. It depends on several factors, like how fast your server's CPUs are, how fast your storage is, and how much other load is going on concurrently, competing for CPU and I/O bandwidth.
In other words, in my experience, it's not possible to predict how long it will take.
Your table will be rebuilt with TxnID as the new clustered index, which happens to be the same column as the old primary key. But apparently MySQL Server doesn't recognize this special case as one that could use the shortcut of an in-place alter.
Your table also has eight other secondary indexes, five of which are on varchar columns. It has to build those indexes during the table restructure. That's a lot of I/O to build those indexes in addition to the clustered index. That's likely what's taking so much time.
You'll go through a similar process when you add your new auto-increment primary key. You could have saved some time if you had dropped your old primary key and created the new auto-increment primary key in one ALTER TABLE statement.
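For example, both changes can go into a single statement (a sketch; choose INT vs. BIGINT based on how many rows you expect):
ALTER TABLE `Usage` DROP PRIMARY KEY, ADD COLUMN id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY;
That way the table is rebuilt only once instead of twice.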
(I agree with Bill's answer; here are more comments.)
I would kill the process and rethink whether there is any benefit in an AUTO_INCREMENT.
I try to look beyond the question to the "real" question. In this case it seems to be something as-yet-unspoken that calls for an AUTO_INCREMENT; please elaborate.
Your current PRIMARY KEY is 6 bytes. Your new PK will be 4 bytes if INT or 8 bytes if BIGINT. So, there will be only a trivial savings or loss in disk space utilization.
Any lookups by TxnID will be slowed down, because they would have to go through the secondary index to reach the new AUTO_INCREMENT clustered key. And since TxnID is UNIQUE and non-null, it seems like the optimal "natural" PK.
A PK is a Unique key, so UNIQUE(TxnID) is totally redundant; DROPping it would save space without losing anything. That is the main recommendation I would have (just looking at the schema); a sketch follows these notes.
When I see a table with essentially every column being NULLable, I am suspicious that the designer did not make a conscious decision about the nullness of the columns.
DECIMAL(13,2) would be a lot of dollars or Euros, but as a PK, it is quite unusual. What's up?
latin1? No plans for globalization?
Lots of single-column indexes? WHERE a=1 AND b=2 begs for a composite INDEX(a,b).
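For instance, sketches of both schema-level suggestions (the composite-index column pair is purely hypothetical; pick the columns your WHERE clauses actually combine):
-- The UNIQUE key duplicates the PRIMARY KEY, so it can simply be dropped:
ALTER TABLE `Usage` DROP INDEX `TxnID`;
-- Hypothetical composite index replacing two single-column indexes that are always filtered together:
ALTER TABLE `Usage`
  ADD INDEX `OperatorName_ProductType` (`OperatorName`, `ProductType`),
  DROP INDEX `OperatorName`,
  DROP INDEX `ProductType`;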
Back to estimating time...
If the ALTER rebuilds the 8-9 indexes, then it should do what it can with a disk sort. This involves writing stuff to disk, using an efficient disk-based sort that involves some RAM, then reading the sorted result to recreate the index. A sort is O(N log N), so it is non-linear in the number of rows. This makes it hard to predict the time taken. Some newer versions of MariaDB attempt to estimate the remaining time, but I don't trust it.
A secondary index includes the column(s) being indexed, plus any column(s) of the PK not already in it. Each index in that table will occupy about 5-10GB of disk space. This may help you convert to IOPs or whatever. But note that (assuming you don't have much RAM) that 5-10GB will be reread a few (several?) times during the sort that rebuilds the index.
When doing multiple ALTERs, do them in a single ALTER statement. That way, all the work (especially rebuilding of secondary indexes) need be done only once.
You have not said what version you are using. Older versions had only one choice, "COPY": create a new table; copy the data over; rebuild the indexes; rename. Newer versions can handle secondary indexes "INPLACE". Note: changes to the PRIMARY KEY require the copy method.
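As an aside, on versions with online DDL (MySQL 5.6+ / MariaDB 10.0+) you can request the algorithm explicitly and the server will refuse rather than silently fall back to a copy; a hypothetical secondary-index example:
SELECT VERSION();  -- check which server/version you are actually on
-- Adding a secondary index can usually avoid a full table copy:
ALTER TABLE `Usage` ADD INDEX `TransAmount` (`TransAmount`), ALGORITHM=INPLACE, LOCK=NONE;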
For anyone interested:
This is run on Amazon Aurora with 30GB of data stored. I could not find any information on how IOPS is provisioned for this, but I expected that, in the worst case, there would be 90 IOPS available consistently. To write 10GB in and out would take around 4 hours.
I upgraded the instance to db.r3.8xlarge before running the alter table.
Then ran
alter table `Usage` drop primary key, add id bigint auto_increment primary key
it took 1h 21m, which is much better than expected.
First, I am new to mysqlslap
I want to test insert queries using mysqlslap on my existing database. The table I want to test has a primary key and a composite unique key.
So, how do I run a concurrent performance test on this table using mysqlslap without hitting MySQL duplicate-key errors?
Below is a skeleton of my table:
CREATE TABLE data (
id bigint(20) NOT NULL,
column1 bigint(20) DEFAULT NULL,
column2 varchar(255) NOT NULL DEFAULT '0',
datacolumn1 VARCHAR(255) NOT NULL DEFAULT '',
datacolumn2 VARCHAR(2048) NOT NULL DEFAULT '',
PRIMARY KEY (id),
UNIQUE KEY profiles_UNIQUE (column1,column2),
INDEX id_idx (id),
INDEX unq_id_idx (column1, column2) USING BTREE
) ENGINE=innodb DEFAULT CHARSET=latin1;
Please help me on this
There are several problems with benchmarking INSERTs. The speed will change as you insert more and more, but not in an easily predictable way.
An Insert performs (roughly) this way:
Check for duplicate key. You have two unique keys (the PK and a UNIQUE). Each BTree will be drilled down to check for a dup. Assuming no dup...
The row will be inserted in the data (a BTree keyed by the PK)
A "row" will be inserted into each Unique's BTree. In your case, there is a BTree effectively ordered by (column1, column2) and containing (id).
Stuff is put into the "Change Buffer" for each non-unique index.
If you had an AUTO_INCREMENT or a UUID or ..., there would be more to discuss.
The Change Buffer is effectively a "delayed write" to non-unique indexes. This delay has to be dealt with eventually. That is, at some time, things will slow down if a background process fails to keep up with the changes. That is, if you insert 1 million rows, you may not hit this slowdown; if you insert 10 million rows, you may hit it.
Another variable: VARCHAR(2048) (and other TEXT and BLOB columns) may or may not be stored "off record". This depends on the size of the row, the size of that column, and "row format". A big string may take an extra disk hit, thereby slowing down the benchmark, probably by a noticeable amount. That is, if you benchmark with only small strings and certain row formats, you will get a faster insert time than otherwise.
And you need to understand how the benchmark program runs -- versus how your application will run:
Insert rows one at a time in a single thread -- each being a transaction.
Insert rows one at a time in a single thread -- lots batched into a transaction.
Insert 100 rows at a time in a single thread in a single transaction.
LOAD DATA.
Multiple threads with each of the above.
Different transaction isolation settings.
Etc.
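To make the first three concrete (the values are placeholders against the skeleton above):
-- One row per statement, autocommit on: every INSERT is its own transaction.
INSERT INTO data (id, column1, column2) VALUES (1, 10, 'a');
INSERT INTO data (id, column1, column2) VALUES (2, 10, 'b');
-- Many single-row statements wrapped in one explicit transaction:
START TRANSACTION;
INSERT INTO data (id, column1, column2) VALUES (3, 10, 'c');
INSERT INTO data (id, column1, column2) VALUES (4, 10, 'd');
COMMIT;
-- Many rows in one statement (typically the fastest of the three):
INSERT INTO data (id, column1, column2) VALUES (5, 10, 'e'), (6, 10, 'f'), (7, 10, 'g');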
(I am not a fan of benchmarks because of how many flaws they have.) The 'best' benchmark for comparing hardware or limited schema/app changes: Capture the "general log" from a running application; capture the database at the start of that; time the re-applying of that log.
Designing a table/insert for 50K inserted rows/sec
Minimize indexes. In your case, all you need is PRIMARY KEY(col1, col2); toss the rest; toss id. Please explain what col1 and col2 are; there may be more tips here.
Get rid of the table. Seriously, consider summarizing the 50K rows every second and storing only the summary. If that is practical, it will greatly speed things up. Or maybe summarize a minute's worth.
Batch insert rows in some way. The details here depend on whether you have one or many clients doing the inserts, whether you need to massage the data as it comes in, etc. More discussion: http://mysql.rjweb.org/doc.php/staging_table
What is in those strings? Can/should they be 'normalized'?
Let's discuss the math. Will you be loading about 10 petabytes per year? Do you have that much disk space? What will you do with the data? How long will it take to read even a small part of that data? Or will it be a "write only" database??
More math. 50K rows * 0.5KB = 25MB writing to disk per second. What device do you have? Can it handle, say, 2x that? (With your original schema, it would be more like 60MB/s because of all the indexes.)
After comments
OK, so more like 3TB before you toss the data and start over (in 2 hours)? For that, I would suggest PARTITION BY RANGE and use some time function that gives you 5 minutes in each partition. This will give you a reasonable number of partitions (about 25) and the DROP PARTITION will be dropping only about 100GB, which might not overwhelm the filesystem. More discussion: http://mysql.rjweb.org/doc.php/partitionmaint
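A rough sketch of that layout (table and column names are made up; the boundary values would come from your own clock):
CREATE TABLE events (
  ts INT UNSIGNED NOT NULL,        -- assumed: unix timestamp of the row
  payload VARCHAR(255) NOT NULL,
  KEY (ts)
) ENGINE=InnoDB
PARTITION BY RANGE (ts DIV 300) (  -- 300 seconds = one 5-minute bucket per partition
  PARTITION p00 VALUES LESS THAN (5680225),
  PARTITION p01 VALUES LESS THAN (5680226),
  PARTITION pMax VALUES LESS THAN MAXVALUE
);
-- Discarding the oldest 5 minutes is then a quick metadata operation:
ALTER TABLE events DROP PARTITION p00;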
As for the strings... You suggest 25KB, yet the declarations don't allow for that much???
I have a table in a MariaDB database for which no primary key is defined. However, it has an index. I'd like to add a primary key with the same definition as that index. The naïve way might be:
alter table `foo` add primary key (`bar`, `baz`),
drop index `qux`;
...but that will take a very long time and seems wasteful. (The table is tens of gigabytes in size and is running on a machine with less free disk space than the total size of the table.) I realize an index and a primary key aren't the same thing (at the very least, the primary key includes a uniqueness constraint which must be checked during the creation process), but is there any way to use the index to “bootstrap” the primary key?
Assuming the table is ENGINE=InnoDB??...
If there is not enough free space on disk for another copy of the table, the task cannot be performed without the help of a second server. Can you drop some tables? Or otherwise free up space?
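A quick way to see what is taking the space (a standard information_schema query; the sizes are approximate):
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS approx_gb
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY approx_gb DESC;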
A PRIMARY KEY is UNIQUE and is an index. If the combination of bar and baz is not unique, you should not turn it into the PK.
Using a PK for looking up a single row is faster than using a secondary index. A secondary-index lookup first drills down the secondary index's BTree; there it finds the PRIMARY KEY value, which is then used to find the row in the data's BTree.
If the table is bigger than innodb_buffer_pool_size, your change would also (in many cases) eliminate a disk hit. (Disk hits are the slowest part of database operations.)
Yes, there is currently a PRIMARY KEY on your table. It is a 6-byte hidden 'column'. Your ALTER would throw that away, thereby making the table a little smaller (another small benefit).
Do you have innodb_file_per_table=ON (or =1)? If the table is in its own .ibd file, you will recover the disk space after the operation (assuming it can run at all). With OFF, it will increase the size of the ibdata1 file, but fail to shrink it back. Have it ON when creating tables that will eventually be 'big'.
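You can check the current setting with:
SHOW VARIABLES LIKE 'innodb_file_per_table';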
OK, there may be hope. If you are running with OFF, and there is enough space in ibdata1, then the task may complete. (But that means, as alluded to above, that you have already bloated ibdata1.)
I am working on a project where I need to use a UUID (16-byte) as the unique identifier in the database (MySQL). The database has a lot of tables with relations. I have the following questions about using a UUID as PK:
Should I index the unique identifier as PK / FK or is it not necessary?
If I index it, the index size will increase, but is it really needed?
Here is an example where I have to use a UUID:
Table user with a unique identifier (oid) and a foreign key (language).
CREATE TABLE user (
oid binary(16) NOT NULL,
username varchar(80),
f_language_oid binary(16) NOT NULL,
PRIMARY KEY (oid),
KEY f_language_oid (f_language_oid)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Is it helpful / necessary to define "oid" as PRIMARY_KEY and language as FOREIGN_KEY, or would it be better if I only created the table without key definitions?
I have read in this article (here) that InnoDB will automatically generate a hidden 6-byte integer as the primary key. In that case, would it be better to use the 6-byte internal PK than the 16-byte binary key?
If no index is required, should I use MyISAM or InnoDB?
Many thanks in advance.
With MySQL it's often advantageous to use a regular INT as your primary key and have a UUID as a secondary UNIQUE index. This is mostly because I believe MySQL uses the primary key as a row identifier in all secondary indexes, and having large values here can lead to vastly bigger index sizes. Do some testing at scale to see if this impacts you.
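A minimal sketch of that layout, reusing the asker's table (an id column is added; names are otherwise illustrative):
CREATE TABLE user (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- small clustered key copied into every secondary index
  oid BINARY(16) NOT NULL,                  -- the UUID the application hands out
  username VARCHAR(80),
  PRIMARY KEY (id),
  UNIQUE KEY uq_oid (oid)
) ENGINE=InnoDB;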
The one reason to use a UUID as a primary key would be if you're trying to spread data across multiple independent databases and want to avoid primary key conflicts. UUID is a great way to do this.
In either case, you'll probably want to express the UUID as text so it's human readable and it's possible to manipulate the data easily. It's difficult to paste binary data into a query, for example, just to do a simple UPDATE. It will also ensure that you can export to or import from JSON without a whole lot of conversion overhead.
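If you do keep it binary, MySQL 8.0's UUID_TO_BIN()/BIN_TO_UUID() conversion functions (assuming you are on 8.0) take some of the pain out of ad-hoc queries:
SELECT BIN_TO_UUID(oid) AS oid, username FROM user;
UPDATE user
SET username = 'alice'
WHERE oid = UUID_TO_BIN('3f06af63-a93c-11e4-9797-00505690773f');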
As for MyISAM vs. InnoDB, it's highly ill-advised to use the old MyISAM engine in a production environment where data integrity and uptime are important. That engine can suffer catastrophic data loss if the database becomes corrupted (something as simple as an unanticipated reboot can cause this), and it has trouble recovering. InnoDB is a modern, journaled, transactional storage engine that's significantly more resilient and recovers from most sudden-failure situations automatically, even database crashes.
One more consideration is evaluating if PostgreSQL is a suitable fit because it has a native UUID column type.
I have a table -
CREATE TABLE `DBMSProject`.`ShoppingCart` (
`Customer_ID` INT NOT NULL,
`Seller_ID` INT NOT NULL,
`Product_ID` INT NOT NULL,
`Quantity` INT NOT NULL,
PRIMARY KEY (`Customer_ID`,`Seller_ID`,`Product_ID`));
This table has a lot more insert and delete operations than update operations.
Which storage structure is best suited to decreasing overall operation and access time? I'm undecided between a hash structure, a B+ tree structure, and ISAM. PS: The number of records is on the order of 10 million.
For a shopping cart, data integrity is more important than raw speed.
MyISAM doesn't support ACID transactions, and it doesn't enforce foreign key constraints. With all those ID numbers you're showing, I wouldn't proceed with an engine that doesn't enforce foreign key constraints.
In general, I favor testing over speculation. You can build tables, load them with 10 million rows of random(ish) data, index them several different ways, and run timing tests with representative SQL statements in just a couple of hours. Use similar hardware, if possible.
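For example, a rough sketch (MySQL 8.0+; the value ranges are made up) that loads a million test rows while keeping the composite primary key collision-free:
SET SESSION cte_max_recursion_depth = 1000000;
INSERT INTO `DBMSProject`.`ShoppingCart` (Customer_ID, Seller_ID, Product_ID, Quantity)
WITH RECURSIVE seq AS (
  SELECT 1 AS n
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 1000000
)
SELECT n MOD 100000,           -- spread of Customer_ID values
       n MOD 1000,             -- spread of Seller_ID values
       n,                      -- Product_ID: distinct per row, so the PK never collides
       1 + FLOOR(RAND() * 5)   -- random Quantity 1..5
FROM seq;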
And when it comes to indexes, you can drop them and add them without having to rewrite any application code. If you can't be bothered to test, just pick one. Later, if you have a performance problem that explain suggests might be related to the index, drop it and create a different one. (After you do this a couple of times, you'll probably discover that you can spare the time to test.)
Reason I'm using GUID / UUID as primary key: Data syncing across devices. I have a master database on the server, and then each device has its own database with the same structure (although, different engines. MySQL on the server, SQLite on the Android devices, etc).
I've read that if you're going to use GUID's as your primary key, it should at least not be the clustering key. However, I can't find how to do that with MySQL. All I can find is this reference that says if you have a primary key, InnoDB will use it as the clustering key.
Am I missing something?
The article you linked to is about Microsoft SQL Server, which gives you the option of which index to use as the clustering key.
With MySQL's InnoDB storage engine, you have no option. The primary key is always used as the clustering key. If there is no primary key, then it uses the first non-null unique key. Absent any such unique key, InnoDB generates its own internal 6-byte key as the clustering key.
So you could make a table that uses a GUID as a non-unique secondary key, but treat it as unique in practice.
CREATE TABLE MyTable (
guid CHAR(32) NOT NULL,
/* other columns... */
KEY (guid) -- just a regular secondary index, neither primary nor unique
);
However, there's a legitimate reason to use the GUID as the clustering key: if you frequently do lookups based on the GUID, they will be more efficient when the GUID is the clustering key.
The concerns about using a GUID as the clustering key are mostly about space. Inserting into the middle of a clustered index can cause a bit of fragmentation, but that's not necessarily a huge problem in MySQL.
The other issue is that in InnoDB, secondary indexes implicitly contain the primary key, so a CHAR(32) or whatever you use to store the GUID is going to be appended to each entry in other indexes. This makes them take more space than if you had used an integer as the primary key.