Which storage structure is the best? - mysql

I have a table -
CREATE TABLE `DBMSProject`.`ShoppingCart` (
`Customer_ID` INT NOT NULL,
`Seller_ID` INT NOT NULL,
`Product_ID` INT NOT NULL,
`Quantity` INT NOT NULL,
PRIMARY KEY (`Customer_ID`,`Seller_ID`,`Product_ID`));
This table has far more insert and delete operations than update operations.
Which storage structure is best suited to reducing overall operation and access time? I'm not sure whether to choose a hash structure, a B+ tree structure, or ISAM. PS - the number of records is on the order of 10 million.

For a shopping cart, data integrity is more important than raw speed.
MyISAM doesn't support ACID transactions, and it doesn't enforce foreign key constraints. With all those ID numbers you're showing, I wouldn't proceed with an engine that doesn't enforce foreign key constraints.
In general, I favor testing over speculation. You can build tables, load them with 10 million rows of random(ish) data, index them several different ways, and run timing tests with representative SQL statements in just a couple of hours. Use similar hardware, if possible.
And when it comes to indexes, you can drop them and add them without having to rewrite any application code. If you can't be bothered to test, just pick one. Later, if you have a performance problem that EXPLAIN suggests might be related to the index, drop it and create a different one. (After you do this a couple of times, you'll probably discover that you can spare the time to test.)
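As a rough illustration of that test approach, here is the kind of throwaway harness I mean (table name, row counts, and the random value ranges are placeholders; any sufficiently large row source will do for generating data):
CREATE TABLE `DBMSProject`.`ShoppingCart_test` LIKE `DBMSProject`.`ShoppingCart`;
-- Fill with random(ish) rows; IGNORE skips the rare random PK collision.
-- Repeat the INSERT if one pass yields fewer than ~10M rows.
INSERT IGNORE INTO `DBMSProject`.`ShoppingCart_test`
SELECT FLOOR(RAND()*1000000), FLOOR(RAND()*10000), FLOOR(RAND()*100000), FLOOR(RAND()*10)+1
FROM information_schema.COLUMNS c1 JOIN information_schema.COLUMNS c2
LIMIT 10000000;
-- Time representative statements, then change the indexes and repeat the timings.
SELECT * FROM `DBMSProject`.`ShoppingCart_test` WHERE Customer_ID = 12345;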

Related

Test insert query using mysqlslap

First, I am new to mysqlslap.
I want to test INSERT queries using mysqlslap on my existing database. The table I want to test has a primary key and a composite unique key.
How can I run a concurrent performance test on this table using mysqlslap without hitting MySQL duplicate-key errors?
Below is the skeleton of my table:
CREATE TABLE data (
id bigint(20) NOT NULL,
column1 bigint(20) DEFAULT NULL,
column2 varchar(255) NOT NULL DEFAULT '0',
datacolumn1 VARCHAR(255) NOT NULL DEFAULT '',
datacolumn2 VARCHAR(2048) NOT NULL DEFAULT '',
PRIMARY KEY (id),
UNIQUE KEY profiles_UNIQUE (column1,column2),
INDEX id_idx (id),
INDEX unq_id_idx (column1, column2) USING BTREE
) ENGINE=innodb DEFAULT CHARSET=latin1;
Please help me with this.
There are several problems with benchmarking INSERTs. The speed will change as you insert more and more, but not in an easily predictable way.
An INSERT proceeds (roughly) this way:
Check for duplicate keys. You have two unique keys (the PK and a UNIQUE). Each BTree will be drilled down to check for a dup. Assuming no dup...
The row will be inserted into the data BTree (which is keyed by the PK).
A "row" will be inserted into each UNIQUE key's BTree. In your case, there is a BTree effectively ordered by (column1, column2) and containing (id).
Stuff is put into the "Change Buffer" for each non-unique index.
If you had an AUTO_INCREMENT or a UUID or ..., there would be more to discuss.
The Change Buffer is effectively a "delayed write" to non-unique indexes. This delay has to be dealt with eventually. That is, at some time, things will slow down if a background process fails to keep up with the changes. That is, if you insert 1 million rows, you may not hit this slowdown; if you insert 10 million rows, you may hit it.
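If you want to check whether the Change Buffer is a factor in your test, a couple of commands may help (a sketch; the exact output section name and the usual default of 'all' are as documented for InnoDB):
SHOW ENGINE INNODB STATUS;  -- the "INSERT BUFFER AND ADAPTIVE HASH INDEX" section shows change-buffer activity
SET GLOBAL innodb_change_buffering = 'none';  -- disable buffering to measure its effect, then set it back (usual default is 'all')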
Another variable: VARCHAR(2048) (and other TEXT and BLOB columns) may or may not be stored "off record". This depends on the size of the row, the size of that column, and "row format". A big string may take an extra disk hit, thereby slowing down the benchmark, probably by a noticeable amount. That is, if you benchmark with only small strings and certain row formats, you will get a faster insert time than otherwise.
And you need to understand how the benchmark program runs -- versus how your application will run:
Insert rows one at a time in a single thread -- each being a transaction.
Insert rows one at a time in a single thread -- lots batched into a transaction.
Insert 100 rows at a time in a single thread in a single transaction.
LOAD DATA.
Multiple threads with each of the above.
Different transaction isolation settings.
Etc.
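To make the first and third patterns above concrete for the table in the question (the values are placeholders):
-- One row per statement, autocommit on: one transaction per row (slowest pattern).
INSERT INTO data (id, column1, column2) VALUES (1, 101, 'a');
INSERT INTO data (id, column1, column2) VALUES (2, 102, 'b');
-- Many rows per statement inside one transaction: usually several times faster.
START TRANSACTION;
INSERT INTO data (id, column1, column2) VALUES
  (3, 103, 'c'), (4, 104, 'd'), (5, 105, 'e');  -- ...continue up to ~100 rows per statement
COMMIT;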
(I am not a fan of benchmarks because of how many flaws they have.) The "best" benchmark for comparing hardware or limited schema/app changes is to capture the "general log" from a running application, capture the database as it was at the start of that log, and time re-applying the log.
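As for the duplicate-key concern in the question: if each benchmark statement generates its own key values, concurrent mysqlslap threads (e.g. driven via --query and --concurrency) won't collide. A sketch using the table from the question:
-- UUID_SHORT() yields a distinct 64-bit value per call, so the PK and the
-- (column1, column2) unique key stay collision-free across concurrent threads.
INSERT INTO data (id, column1, column2, datacolumn1, datacolumn2)
VALUES (UUID_SHORT(), UUID_SHORT(), UUID(), 'payload1', 'payload2');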
Designing a table/insert for 50K inserted rows/sec
Minimize indexes. In your case, all you need is PRIMARY KEY(col1, col2); toss the rest; toss id. Please explain what col1 and col2 are; there may be more tips here.
Get rid of the table. Seriously, consider summarizing the 50K rows every second and storing only the summarization. If it is practical, this will greatly speed things up. Or maybe a minute's worth.
Batch-insert rows in some way. The details here depend on whether you have one or many clients doing the inserts, whether you need to massage the data as it comes in, etc. More discussion: http://mysql.rjweb.org/doc.php/staging_table
What is in those strings? Can/should they be 'normalized'?
Let's discuss the math. Will you be loading about 10 petabytes per year? Do you have that much disk space? What will you do with the data? How long will it take to read even a small part of that data? Or will it be a "write only" database??
More math. 50K rows * 0.5KB = 25MB writing to disk per second. What device do you have? Can it handle, say, 2x that? (With your original schema, it would be more like 60MB/s because of all the indexes.)
After comments
OK, so more like 3TB before you toss the data and start over (in 2 hours)? For that, I would suggest PARTITION BY RANGE and use some time function that gives you 5 minutes in each partition. This will give you a reasonable number of partitions (about 25) and the DROP PARTITION will be dropping only about 100GB, which might not overwhelm the filesystem. More discussion: http://mysql.rjweb.org/doc.php/partitionmaint
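A sketch of that layout, assuming a DATETIME column (here called `created_at`) that the posted schema doesn't show; note that MySQL requires the partitioning column to be part of every unique key, so the primary key would have to include it:
ALTER TABLE data
  PARTITION BY RANGE (TO_SECONDS(created_at)) (
    PARTITION p0 VALUES LESS THAN (TO_SECONDS('2023-01-01 00:05:00')),
    PARTITION p1 VALUES LESS THAN (TO_SECONDS('2023-01-01 00:10:00')),
    -- ...one partition per 5 minutes, pre-created by a maintenance job...
    PARTITION pmax VALUES LESS THAN MAXVALUE
  );
-- Discarding the oldest chunk is then a cheap metadata operation, not a huge DELETE:
ALTER TABLE data DROP PARTITION p0;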
As for the strings... You suggest 25KB, yet the declarations don't allow for that much???

Mysql - estimate time to drop index

We have a fairly unoptimized table with the following definition:
CREATE TABLE `Usage` (
`TxnDate` varchar(30) DEFAULT NULL,
`TxnID` decimal(13,2) NOT NULL,
`UserID2015` varchar(20) DEFAULT NULL,
`UserRMN` decimal(13,0) DEFAULT NULL,
`CustomerNo` decimal(13,0) DEFAULT NULL,
`OperatorName` varchar(50) DEFAULT NULL,
`AggregatorName` varchar(30) DEFAULT NULL,
`TransAmount` decimal(10,2) DEFAULT NULL,
`MMPLTxnID` decimal(13,0) DEFAULT NULL,
`ProductType` varchar(30) DEFAULT NULL,
`YearMonthRMN` varchar(50) DEFAULT NULL,
PRIMARY KEY (`TxnID`),
UNIQUE KEY `TxnID` (`TxnID`) USING BTREE,
KEY `TxnDate` (`TxnDate`),
KEY `OperatorName` (`OperatorName`),
KEY `AggregatorName` (`AggregatorName`),
KEY `MMPLTxnID` (`MMPLTxnID`),
KEY `ProductType` (`ProductType`),
KEY `UserRMN` (`UserRMN`),
KEY `YearMonthRMN` (`YearMonthRMN`) USING BTREE,
KEY `CustomerNo` (`CustomerNo`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The table has about 170M records.
I want to drop the primary key and instead add an auto number primary key. So far the index dropping has taken 2h.
Why is it taking so long to remove an index, is there any sorting happening?
How can I estimate the time to drop the index?
When I add the autonumber, will I have to estimate time for sorting the table or will this not be necessary with a new autonumber index?
You're not just dropping an index, you're dropping the primary key.
Normally, InnoDB tables are stored as a clustered index based on the primary key, so by dropping the primary key, it has to create a new table that uses either the secondary unique key or else an auto-generated key for its clustered index.
I've done a fair amount of MySQL consulting, and the question of "how much time will this take?" is a common question.
It takes as long as it takes to build a new clustered index on your server. This is hard to predict. It depends on several factors, like how fast your server's CPUs are, how fast your storage is, and how much other load is going on concurrently, competing for CPU and I/O bandwidth.
In other words, in my experience, it's not possible to predict how long it will take.
Your table will be rebuilt with TxnID as the new clustered index, which is coincidentally the same as the primary key. But apparently MySQL Server doesn't recognize this special case as one that can use the shortcut of doing an inplace alter.
Your table also has eight other secondary indexes, five of which are varchars. It has to build those indexes during the table restructure. That's a lot of I/O to build those indexes in addition to the clustered index. That's likely what's taking so much time.
You'll go through a similar process when you add your new auto-increment primary key. You could have saved some time if you had dropped your old primary key and created the new auto-increment primary key in one ALTER TABLE statement.
(I agree with Bill's answer; here are more comments.)
I would kill the process and rethink whether there is any benefit in an AUTO_INCREMENT.
I try to look beyond the question to the "real" question. In this case it seems to be something as-yet-unspoken that calls for an AUTO_INCREMENT; please elaborate.
Your current PRIMARY KEY is 6 bytes. Your new PK will be 4 bytes if INT or 8 bytes if BIGINT. So, there will be only a trivial savings or loss in disk space utilization.
Any lookups by TxnID will be slowed down by having to go through the AUTO_INCREMENT. And since TxnID is UNIQUE and non-null, it seems like the optimal 'natural' PK.
A PK is a Unique key, so UNIQUE(TxnID) is totally redundant; DROPping it would save space without losing anything. That is the main recommendation I would have (just looking at the schema).
When I see a table with essentially every column being NULL, I am suspicious that the designer did not make a conscious decision about the nullness of the columns.
DECIMAL(13,2) would be a lot of dollars or Euros, but as a PK, it is quite unusual. What's up?
latin1? No plans for globalization?
Lots of single-column indexes? WHERE a=1 AND b=2 begs for a composite INDEX(a,b).
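To illustrate the two index points above (the composite index is purely hypothetical; it only makes sense if your queries actually filter on both columns together):
-- Drop the redundant UNIQUE; the PRIMARY KEY already guarantees uniqueness of TxnID.
ALTER TABLE `Usage` DROP INDEX `TxnID`;
-- Replace two single-column indexes with one composite index, if queries look like
-- WHERE OperatorName = ? AND ProductType = ?
ALTER TABLE `Usage`
  DROP INDEX `OperatorName`,
  DROP INDEX `ProductType`,
  ADD INDEX `Operator_Product` (`OperatorName`, `ProductType`);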
Back to estimating time...
If the ALTER rebuilds the 8-9 indexes, then it should do what it can with a disk sort. This involves writing stuff to disk, using an efficient disk-based sort that involves some RAM, then reading the sorted result to recreate the index. A sort is O(N log N), thereby making it non-linear. This makes it hard to predict the time taken. Some newer versions of MariaDB attempt to estimate the remaining time, but I don't trust it.
A secondary index includes the column(s) being indexed, plus any other column(s) of the PK. Each index in that table will occupy about 5-10GB of disk space. This may help you convert to IOPs or whatever. But note that (assuming you don't have much RAM) that 5-10GB will be reread a few (several?) times during the sort that rebuilds the index.
When doing multiple ALTERs, do them in a single ALTER statement. That way, all the work (especially rebuilding of secondary indexes) need be done only once.
You have not said what version you are using. Older versions had only one choice, "COPY": create a new table; copy the data over; rebuild the indexes; rename. Newer versions can deal with secondary indexes "INPLACE". Note: changes to the PRIMARY KEY require the copy method.
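On versions that support online DDL you can request the algorithm explicitly; if the requested algorithm isn't possible for that particular change, the statement fails with an error instead of silently copying the table, which tells you up front what you are in for. A sketch:
-- Dropping a secondary index can normally be done in place, without a table copy:
ALTER TABLE `Usage` DROP INDEX `CustomerNo`, ALGORITHM=INPLACE, LOCK=NONE;
-- Requesting INPLACE for a change that needs a full rebuild produces an error rather than a surprise.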
For anyone interested:
This is run on Amazon Aurora with 30GB of data stored. I could not find any information on how IOPS is provisioned for this, but I expected that at worst there would be 90 IOPS available consistently. To write 10GB in and out would take around 4 hours.
I upgraded the instance to db.r3.8xlarge before running the alter table.
Then ran
alter table `Usage` drop primary key, add id bigint auto_increment primary key
it took 1h 21m, which is much better than expected.

Mysql: MyIsam or InnoDB when UUID will be used as PK

I am working on a project where I need to use a UUID (16 bytes) as the unique identifier in the database (MySQL). The database has a lot of tables with relations. I have the following questions about using a UUID as PK:
Should I index the unique identifier as PK / FK, or is it not necessary?
If I index it, the index size will increase, but is it really needed?
Here is an example where I have to use a UUID:
Table user with one unique identifier (oid) and foreign key (language).
CREATE TABLE user (
oid binary(16) NOT NULL,
username varchar(80) ,
f_language_oid binary(16) NOT NULL,
PRIMARY KEY (oid),
KEY f_language_oid (f_language_oid)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Is it helpful / necessary to define "oid" as PRIMARY KEY and language as FOREIGN KEY, or would it be better if I only created the table without key definitions?
I have read in this article (here) that InnoDB will automatically generate a 6-byte integer as a hidden primary key. In that case, would it be better to use the 6-byte internal PK rather than the 16-byte binary key?
If no index is required, should I use MyISAM or InnoDB?
Many thanks in advance.
With MySQL it's often advantageous to use a regular INT as your primary key and have a UUID as a secondary UNIQUE index. This is mostly because I believe MySQL uses the primary key as a row identifier in all secondary indexes, and having large values here can lead to vastly bigger index sizes. Do some testing at scale to see if this impacts you.
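A sketch of that layout applied to the table in the question (the columns other than the added INT key come from the original schema):
CREATE TABLE user (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- small clustered key, copied into every secondary index
  oid BINARY(16) NOT NULL,                  -- the UUID, still available for external references
  username VARCHAR(80),
  f_language_oid BINARY(16) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY oid_unique (oid),              -- still enforces UUID uniqueness and allows lookups by UUID
  KEY f_language_oid (f_language_oid)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;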
The one reason to use a UUID as a primary key would be if you're trying to spread data across multiple independent databases and want to avoid primary key conflicts. UUID is a great way to do this.
In either case, you'll probably want to express the UUID as text so it's human-readable and easy to manipulate. It's difficult to paste binary data into a query, for example, just to do a simple UPDATE. Text also ensures that you can export to or import from JSON without a whole lot of conversion overhead.
As for MyISAM vs. InnoDB, it's highly ill-advised to use the old MyISAM engine in a production environment where data integrity and uptime are important. That engine can suffer catastrophic data loss if the database becomes corrupted (something as simple as an unanticipated reboot can cause this), and it has trouble recovering. InnoDB is a modern, journaled, transactional database engine that's significantly more resilient and recovers from most sudden-failure situations automatically, even database crashes.
One more consideration is evaluating if PostgreSQL is a suitable fit because it has a native UUID column type.

What could cause very slow performance of single UPDATEs of a InnoDB table?

I have a table in my web app for storing session data. It's performing badly, and I can't figure out why. Slow query log shows updating a row takes anything from 6 to 60 seconds.
CREATE TABLE `sessions` (
`id` char(40) COLLATE utf8_unicode_ci NOT NULL,
`payload` text COLLATE utf8_unicode_ci NOT NULL,
`last_activity` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `session_id_unique` (`id`) USING HASH
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
The PK is a char(40) which stores a unique session hash generated by the framework this project uses (Laravel).
(I'm aware of the redundancy of the PK and unique index, but I've tried all combinations and it doesn't have any impact on performance in my testing. This is the current state of it.)
The table is small - fewer than 200 rows.
A typical query from the slow query log looks like this:
INSERT INTO sessions (id, payload, last_activity)
VALUES ('d195825ddefbc606e9087546d1254e9be97147eb',
'YTo1OntzOjY6Il90b2tlbiI7czo0MDoi...around 700 chars...oiMCI7fX0=',
1405679480)
ON DUPLICATE KEY UPDATE
payload=VALUES(payload), last_activity=VALUES(last_activity);
I've done obvious things like checking the table for corruption. I've tried adding a dedicated PK column as an auto increment int, I've tried without a PK, without the unique index, swapping the text column for a very very large varchar, you name it.
I've tried switching the table to use MyISAM, and it's still slow.
Nothing I do seems to make any difference - the table performs very slowly.
My next thought was the query. This is generated by the framework, but I've tested hacking it out into an UPDATE with an INSERT if that fails. The slowness continued on the UPDATE statement.
I've read a lot of questions about slow INSERT and UPDATE statements, but those usually related to bulk transactions. This is just one insert/update per user per request. The site is not remotely busy, and it's on its own VPS with plenty of resources.
What could be causing the slowness?
This is not an answer but SE comment length is too damn short. So.
What happens if you run an identical INSERT ... ON DUPLICATE KEY UPDATE ... statement directly on the command line? Please try with and without actual usage of the application. The application may be artificially slowing down this UPDATE (for example, in InnoDB a transaction might be opened but committed only after a lot of time has been consumed; you tested with MyISAM too, which does not support transactions, but perhaps an explicit LOCK could account for the same effect there; whether the framework uses such a trick I don't know, as I'm not familiar with Laravel). Try to benchmark to see if there is a concurrency effect.
Another question: is this a single server? Or is it a master that replicates to one or more slaves?
Apart from this question, a few observations:
The values for id are hex strings, but the column is unicode. This means 3*40 bytes are reserved while only 40 are utilized. This waste will make things inefficient in general. It would be much better to use BINARY or ASCII as the character encoding. Better yet, change the id column to a BINARY data type and store the (unhexed) binary value; see the sketch after this list.
A hash as the PK of an InnoDB table will scatter the data across pages. Using an AUTO_INCREMENT PK, or not explicitly declaring a PK at all (which causes InnoDB to create an internal auto-increment PK of its own), is a good idea.
It looks like the payload is base64-encoded. Again, the character encoding is specified as unicode. ASCII or binary (the character encoding, not the data type) is much more appropriate.
The HASH keyword in the unique index on id is meaningless. InnoDB does not implement HASH indexes. Unfortunately MySQL is perfectly silent about this (see http://bugs.mysql.com/bug.php?id=73326).
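A sketch of the first observation, assuming the 40-character id is a hex digest; for a table that already holds data, the stored values would need converting with UNHEX() rather than just a column MODIFY:
ALTER TABLE sessions MODIFY id BINARY(20) NOT NULL;  -- 20 raw bytes instead of up to 120 (utf8 x 40 chars)
-- Write and read the value through UNHEX()/HEX():
INSERT INTO sessions (id, payload, last_activity)
VALUES (UNHEX('d195825ddefbc606e9087546d1254e9be97147eb'), '...', 1405679480)
ON DUPLICATE KEY UPDATE payload = VALUES(payload), last_activity = VALUES(last_activity);
SELECT payload FROM sessions WHERE id = UNHEX('d195825ddefbc606e9087546d1254e9be97147eb');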
(While this list does offer angles for improvement, it seems unlikely that the extreme slowness can be fixed by these alone; there must be something else going on.)
Frustratingly, the answer in this case was a bad disk. One of the disks in the storage array had gone bad, and so writes were taking forever to complete. Simply that.

Mysql 'Partitioning' vs Splitting data into different tables

We have a mysql table called posts_content.
The structure is as follows :
CREATE TABLE IF NOT EXISTS `posts_content` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`post_id` int(11) NOT NULL,
`forum_id` int(11) NOT NULL,
`content` longtext CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=79850 ;
The problem is that the table is getting pretty huge - many gigabytes of data (we have a crawling engine).
We keep inserting data into the table on a daily basis, but we seldom retrieve the data. Now that the table is getting pretty huge, it's getting difficult to handle.
We discussed two possibilities
Use MySQL's partitioning feature to partition the table using forum_id (there are about 50 forum_ids, so there would be about 50 partitions). Note that each partition would still eventually grow to many gigabytes of data and might even need its own drive.
Create separate tables for each forum_id and split the data like that.
I hope I have clearly explained the problem. What I need to know is which of the above two would be a better solution in the long run, and what the advantages and disadvantages of each are.
Thanking you
The difference is that in the first case you leave MySQL to do the sharding, and in the second case you are doing it on your own. MySQL won't scan any shards that do not contain the data; however, if you have a query with WHERE forum_id IN (...), it may need to scan several shards. As far as I remember, in that case the operation is synchronous, i.e. MySQL queries one partition at a time, and you may want to implement it asynchronously. Generally, if you do the partitioning on your own, you are more flexible, but for simple partitioning based on forum_id, if you query only one forum_id at a time, MySQL partitioning is OK.
My advice is to read the MySQL documentation on partitioning, especially the restrictions and limitations section, and then decide.
Although this is an old post, a caveat with regard to partitioning if your engine is still MyISAM: MySQL 8.0 supports partitioning only with the InnoDB and NDB storage engines. In that case you have to convert your MyISAM table to InnoDB or NDB, but you need to remove partitioning before converting it, or it cannot be used afterwards.
here you have a good answer for your question: https://dba.stackexchange.com/a/24705/15243
Basically, let your system grow while you get familiar with partitioning, and when your system really needs to be "cropped into pieces", do it with partitioning.
A quick solution for 3x space shrinkage (and probably a speedup) is to compress the content and put it into a MEDIUMBLOB. Do the compression in the client, not the server; this saves on bandwidth and allows you to distribute the computation among the many client servers you have (or will have).
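A sketch of the column change; the compression itself would happen in the client before the INSERT, so the server only ever sees the compressed bytes. MySQL's COMPRESS()/UNCOMPRESS() are shown only to give a rough idea of the savings if client-side compression isn't practical:
ALTER TABLE posts_content MODIFY content MEDIUMBLOB NOT NULL;
-- Server-side compression as a quick approximation (values are placeholders):
INSERT INTO posts_content (post_id, forum_id, content)
VALUES (12345, 7, COMPRESS('...long post text...'));
SELECT UNCOMPRESS(content) FROM posts_content WHERE post_id = 12345;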
"Sharding" is separating the data across multiple servers. See MariaDB and Spider. This allows for size growth and possibly performance scaling. If you end up sharding, the forum_id may be the best. But that assumes no forum is too big to fit on one server.
"Partitioning" splits up the data, but only within a single server; it does not appear that there is any advantage for your use case. Partitioning by forum_id will not provide any performance.
Remove the FOREIGN KEYs; debug your application instead.