MySQL large index integers for few rows performance - mysql

A developer of mine was making an application and came up with the following schema
purchase_order int(25)
sales_number int(12)
fulfillment_number int(12)
purchase_order is the index in this table. (There are other fields, but they are not relevant to this issue.) purchase_order is a concatenation of sales_number + fulfillment_number.
Instead I proposed an auto_incrementing id field.
The current format could be essentially 12-15 characters long and effectively random (though always unique, as sales_number + fulfillment_number would always be unique).
My question here is:
If I have 3 rows each with a random but unique ID (e.g. 983903004, 238839309, 288430274) vs. three rows with the IDs 1, 2, 3, is there a performance hit?
As an aside, my other argument (for those interested) was that the schema makes little sense on the grounds of data redundancy (you can easily do a SELECT CONCAT(sales_number, fulfillment_number) ... rather than storing the concatenation of two columns in a third).
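For concreteness, a rough sketch of what I proposed instead (the exact column types/sizes here are illustrative, not the real schema):
CREATE TABLE purchase_orders (
  id                 INT UNSIGNED NOT NULL AUTO_INCREMENT,
  sales_number       INT UNSIGNED NOT NULL,
  fulfillment_number INT UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uk_sales_fulfillment (sales_number, fulfillment_number)
);
-- the "purchase order" value can be derived when needed instead of stored:
SELECT CONCAT(sales_number, fulfillment_number) AS purchase_order FROM purchase_orders;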

The problem as I see it is not bigint vs. int (an auto_increment column can be BIGINT as well; there is nothing wrong with that) but the random value for the primary key. If you use the InnoDB engine, the primary key is also the clustered key, which defines the physical order of the data. Inserting random values can cause more page splits and, as a result, greater fragmentation, which in turn slows down not only insert/update queries but also selects.
Your argument about concatenating makes sense, but executing CONCAT also has its cost (unfortunately, MySQL doesn't support calculated persistent columns, so in some cases it's OK to store the result of the concatenation in a separate column).
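For what it's worth, MySQL 5.7 and later did add generated columns, so on a newer server the concatenation can be declared rather than maintained by hand. A rough sketch (names illustrative):
ALTER TABLE purchase_orders
  ADD COLUMN purchase_order VARCHAR(24)
  AS (CONCAT(sales_number, fulfillment_number)) STORED;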

AFAIK integers are stored and compared as integers so the comparisons should take the same length of time.
Concatenating two ints (32bit) into one bigint (64bit) may have a performance hit that is hardware dependent.

Having incremental IDs will put records that were created around the same time near each other on disk. This might make some queries faster, if the IDs are the primary key on InnoDB or are used in an index.
Incremental records can sometimes be inserted a little bit quicker; test to see.
You'll need to make sure that the random ID is unique, so you'll need an extra lookup.
I don't know if these points are material for your application.
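On the extra lookup for a random ID: one sketch (table and columns hypothetical) is to let the primary key do the checking and retry on a collision:
-- hypothetical table with a client-generated random id as the PK
INSERT IGNORE INTO orders (id, note) VALUES (983903004, 'demo');
SELECT ROW_COUNT();  -- 0 means that id already existed; generate a new one and retry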

Related

Which one is faster for getting a row: a primary key that carries numbers, or one that carries characters?

ID (INT 11, Primary key, Auto increment) | TITLE
1 | ...
2 | ...
3 | ...
4 | ...
5 | ...
...up to 10 million rows
ID (CHAR 32, Primary key) | TITLE
a4a0FCBbE614497581da84454f806FbA | ...
40D553d006EF43f4b8ef3BcE6B08a542 | ...
781DB409A5Db478f90B2486caBaAdfF2 | ...
fD07F0a9780B4928bBBdbb1723298F92 | ...
828Ef8A6eF244926A15a43400084da5D | ...
...up to 10 million rows
If I want to get a specific row from the first table, how much time will it take, approximately? Same for the second table: how much time will it take, approximately?
Will a primary key that carries numbers be found faster than one that carries characters?
I do not want to use an auto-increment INT like in the first table because of this problem.
UUIDs and MD5s and other hashes suck because of the "randomness" and lack of "locality of reference", not because of being characters instead of numeric.
You could convert those to BINARY(16), thereby making them half as big.
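A minimal sketch of that conversion (table/column names hypothetical), assuming the 32-char values are plain hex with no dashes:
ALTER TABLE t ADD COLUMN id_bin BINARY(16);
UPDATE t SET id_bin = UNHEX(id);        -- pack 32 hex characters into 16 bytes
SELECT LOWER(HEX(id_bin)) FROM t;       -- unpack for display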
10M INT = 40MB = 600 entries/block
10M CHAR(32) = 320MB = 300 entries/block
10M VARCHAR(32) = 330MB = 300 entries/block
10M BINARY(16) = 160MB = 450 entries/block
Add that much more for each secondary key in that table.
Add again for each other table that references that PK (eg, FOREIGN KEY).
Let's look at the B+Tree that is the structure of the PK and secondary indexes. In a 16KB block, some number of entries can be placed. I have estimated them above. (Yes, the 'overhead' is much more than for an INT.) For INT, the BTree for 10M rows will probably be 3 levels deep. Ditto for the others. (As the table grows, VARCHAR would move to 4 levels before the others.)
So, I conclude, there is little or no difference in how many BTree blocks are needed to do your "point query".
Summary of how much slower a string is than an INT:
BTree depth -- little or none
Cacheability of index blocks -- some; not huge
CPU time to compare numbers vs strings -- some; not huge
Use of a fancy COLLATION -- some; not huge
Overall -- not enough difference to worry about.
What I will argue, in some cases, is whether you need a fabricated PK at all. In 2/3 of the tables I build, I find that there is a 'natural' PK -- some column(s) that is, by the business logic, naturally UNIQUE and NOT NULL. These are the two main qualifications (in MySQL) for a PRIMARY KEY. In some situations the speedup afforded by a "natural PK" can be more than a factor of 2.
A Many-to-many mapping table is an excellent (and common) example of such.
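A sketch of such a mapping table (names made up), where the natural compound key is the PRIMARY KEY and no surrogate id is needed:
CREATE TABLE student_course (
  student_id INT UNSIGNED NOT NULL,
  course_id  INT UNSIGNED NOT NULL,
  PRIMARY KEY (student_id, course_id),
  INDEX (course_id, student_id)   -- for lookups in the other direction
);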
It is impossible to tell the exact times needed to retrieve a specific record, because it depends on lots of factors.
In general, numeric values take less storage space, thus scanning the index requires less I/O operations, therefore are usually faster.
However in this specific case the second table looks like a hexadecimal representation of a large number. You can probably store it as a binary value to save storage space.
On top of the above, in general numeric values are not affected by various database and column settings, while strings are (like collation), which also can add some processing time while querying.
The real question is what the purpose of the 32-character hexadecimal key is. 10 million values easily fit in an INT; what is the need for a key that can store far more?
As long as you are within the range of the numeric type and there is no requirement other than being able to store that many distinct values, I would go with an integer.
The 'problem' you mention in the question is usually not a problem. There is no need to avoid gaps in the identifiers in most cases. In fact, in lots of systems gaps occur naturally during normal operation. You most probably won't reassign records to other IDs when one record is deleted from the middle of the table.
Unless there is a semantic meaning to the ID (there should not be), I would just go with AUTO_INCREMENT; there is no need to reinvent the wheel.

Mysql : 'UNIQUE' constraint over a large string

What could be the possible downsides of having a UNIQUE constraint on a large string (varchar) column (roughly 100 characters or so) in MySQL during:
insert phase
retrieval phase (on another primary key)
Can the length of the key impact the performance of reads/writes? (Apart from the disk/memory usage for bookkeeping.)
Thanks
Several issues. There is a limit on the size of a column in an index (191, 255, 767, 3072, etc, depending on various things).
Your column fits within the limit.
Simply make a UNIQUE or PRIMARY key for that column. There are minor performance concerns, but keep this in mind: Fetching a row is more costly than any datatype issues involving the key used to locate it.
Your column won't fit.
Now the workarounds get ugly.
Index prefixing (INDEX foo(50)) has a number of problems and inefficiencies.
UNIQUE foo(50) is flat out wrong. It is declaring that the first 50 characters are constrained to be unique, not the entire column.
Workarounds with hashing the string (cf md5, sha1, etc) have a number of problems and inefficiencies. Still, this may be the only viable way to enforce uniqueness of a long string.
(I'll elaborate if needed.)
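A sketch of the hashing workaround (hypothetical names): store the long string as-is and enforce uniqueness on a fixed-size hash of it, with the application supplying the hash:
CREATE TABLE long_strings (
  id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
  long_str TEXT NOT NULL,
  str_hash BINARY(16) NOT NULL,   -- e.g. UNHEX(MD5(long_str))
  PRIMARY KEY (id),
  UNIQUE KEY uk_hash (str_hash)
);
INSERT INTO long_strings (long_str, str_hash)
VALUES ('a very long string goes here', UNHEX(MD5('a very long string goes here')));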
Fetching a row (Assuming the statement is parsed and the PRIMARY KEY is available.)
Drill down the BTree containing the data (and ordered by the PK). This may involve bringing a block (or more) from disk into the buffer_pool.
Parse the block to find the row. (There are probably dozens of rows in the block.)
At some point in the process, lock the row for reading and/or get blocked by some other connection that is, say, updating or deleting it.
Pick apart the row -- that is, split into columns.
For any text/blob columns needed, reach into the off-record storage. (Wide columns are not stored with the narrow columns of the row; they are stored in other block(s).) The costly part is locating (and reading from disk if not cached) the extra block(s) containing the big TEXT/BLOB.
Convert from the internal storage (not word-aligned, little-endian, etc) into the desired format. (A small amount of CPU code, but necessary. This means that the data files are compatible across OS and even hardware.)
If the next step is to compare two strings (for a JOIN or ORDER BY), then that is a simple subroutine call that scans over however many characters there are. (OK, most utf8 collations are not 'simple'.) And, yes, comparing two INTs would be faster.
Disk space
Should INT be used instead of VARCHAR(100) for the PRIMARY KEY? It depends.
Every secondary key has a copy of the PRIMARY KEY in it. This implies that a PK that is VARCHAR(100) makes secondary indexes bulkier than if the PK were INT.
If there are no secondary keys, then the above comment implies that INT is the bulkier approach!
If there are more than 2 secondary keys, then using varchar is likely to be bulkier.
(For exactly one secondary key, it is a tossup.)
Speed
If all the columns of a SELECT are in a secondary index, the query may be performed entirely in the index's BTree. ("Covering index", as indicated in EXPLAIN by "Using index".) This is sometimes a worthwhile optimization.
If the above does not apply, and it is useful to look up row(s) via a secondary index, then there are two BTree lookups -- once in the index, then via the PK. This is sometimes a noticeable slowdown.
The point here is that artificially adding an INT id may be slower than simply using the bulky VARCHAR as the PK. Each case should be judged on its tradeoffs; I am not making a blanket statement.
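To make the covering-index point concrete, a small sketch (hypothetical table):
CREATE TABLE items (
  name VARCHAR(100) NOT NULL,
  qty  INT NOT NULL,
  PRIMARY KEY (name),
  INDEX idx_qty_name (qty, name)
);
EXPLAIN SELECT name FROM items WHERE qty = 5;  -- Extra should say "Using index": answered from the index's BTree alone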

Storing a 100k by 100k array in MySQL

I need to store a massive, fixed size square array in MySQL. The values of the array are just INTs but they need to be accessed and modified fairly quickly.
So here is what I am thinking:
Just use one column for the primary key and translate the 2D array indexes into single-dimensional indexes.
So if the 2D array is n by n => 2dArray[i][j] = 1dArray[n*(i-1)+j] (e.g. with n = 100,000, element [3][7] maps to 100000*(3-1)+7 = 200007).
This translates the problem into storing a massive 1D array in the database.
Then use another column for the values.
Make every entry in the array a row.
However, I'm not very familiar with the internal workings of MySQL.
100k*100k makes 10 billion data points, which is more than 32 bits can address, so I can't use INT as a primary key. And from researching Stack Overflow, some people have experienced performance issues with using BIGINT as a primary key.
In this case where I'm only storing INTs, would the performance of MySQL drop as the number of rows increases?
Or if I were to scatter the data over multiple tables on the same server, could that improve performance? Right now, it looks like I won't have access to multiple machines, so I can't really cluster the data.
I'm completely flexible about every idea I've listed above and open to suggestions (except not using MySQL because I've kind of committed to that!)
As for your concern that BIGINT or adding more rows decreases performance, of course that's true. You will have 10 billion rows, that's going to require a big table and a lot of RAM. It will take some attention to the queries you need to run against this dataset to decide on the best storage method.
I would probably recommend using two columns for the primary key. Developers often overlook the possibility of a compound primary key.
Then you can use INT for both primary key columns if you want to.
CREATE TABLE MyTable (
  array_index1 INT NOT NULL,
  array_index2 INT NOT NULL,
  datum WHATEVER_TYPE NOT NULL,
  PRIMARY KEY (array_index1, array_index2)
);
Note that a compound index like this means that if you search on the second column without an equality condition on the first column, the search won't use the index. So you need a secondary index if you want to support that.
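For example:
-- so that WHERE array_index2 = ? alone can also use an index
ALTER TABLE MyTable ADD INDEX idx_index2 (array_index2);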
100,000 columns is not supported by MySQL. MySQL has limits of 4096 columns and of 65,535 bytes per row (not counting BLOB/TEXT columns).
Storing the data in multiple tables is possible, but will probably make your queries terribly awkward.
You could also look into using table PARTITIONING, but this is not as useful as it sounds.
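If you do experiment with partitioning, one hedged sketch (the partitioning column must be part of the primary key, which array_index1 is):
ALTER TABLE MyTable PARTITION BY HASH(array_index1) PARTITIONS 64;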

What are the merits of using numeric row IDs in MySQL?

I'm new to SQL, and thinking about my datasets relationally instead of hierarchically is a big shift for me. I'm hoping to get some insight on the performance (both in terms of storage space and processing speed) versus design complexity of using numeric row IDs as a primary key instead of string values which are more meaningful.
Specifically, this is my situation. I have one table ("parent") with a few hundred rows, for which one column is a string identifier (10-20 characters) which would seem to be a natural choice for the table's primary key. I have a second table ("child") with hundreds of thousands (or possibly millions or more) of rows, where each row refers to a row in the parent table (so I could create a foreign key constraint on the child table). (Actually, I have several tables of both types with a complex set of references among them, but I think this gets the point across.)
So I need a column in the child table that gives an identifier to rows in the parent table. Naively, it seems like creating the column as something like VARCHAR(20) to refer to the "natural" identifier in the first table would lead to a huge performance hit, both in terms of storage space and query time, and therefore I should include a numeric (probably auto_increment) id column in the parent table and use this as the reference in the child. But, as the data that I'm loading into MySQL don't already have such numeric ids, it means increasing the complexity of my code and more opportunities for bugs. To make matters worse, since I'm doing exploratory data analysis, I may want to muck around with the values in the parent table without doing anything to the child table, so I'd have to be careful not to accidentally break the relationship by deleting rows and losing my numeric id (I'd probably solve this by storing the ids in a third table or something silly like that.)
So my question is, are there optimizations I might not be aware of that mean a column with hundreds of thousands or millions of rows that repeats just a few hundred string values over and over is less wasteful than it first appears? I don't mind a modest compromise of efficiency in favor of simplicity, as this is for data analysis rather than production, but I'm worried I'll code myself into a corner where everything I want to do takes a huge amount of time to run.
Thanks in advance.
I wouldn't be concerned primarily about space considerations. An integer key would typically occupy four bytes. The varchar will occupy between 1 and 21 bytes, depending on the length of the string. So, unless most values are just a few characters, a varchar(20) key will occupy more space than an integer key. But not an extraordinary amount more.
Both, by the way, can take advantage of indexes. So speed of access is not particularly different (of course, longer/variable length keys will have marginal effects on index performance).
There are better reasons to use an auto-incremented primary key.
You know which values were most recently inserted.
If duplicates appear (which shouldn't happen for a primary key of course), it is easy to determine which to remove.
If you decide to change the "name" of one of the entries, you don't have to update all the tables that refer to it.
You don't have to worry about leading spaces, trailing spaces, and other character oddities.
You do pay for the additional functionality with four more bytes in a record devoted to something that may not seem useful. However, chasing such small efficiencies is premature optimization and probably not worth the effort.
Gordon is right (which is no surprise).
Here are the considerations for you not to worry about, in my view.
When you're dealing with dozens of megarows or less, storage space is basically free. Don't worry about the difference between INT and VARCHAR(20), and don't worry about the disk space cost of adding an extra column or two. It just doesn't matter when you can buy decent terabyte drives for about US$100.
INTs and VARCHARS can both be indexed quite efficiently. You won't see much difference in time performance.
Here's what you should worry about.
There is one significant pitfall in index performance that you might hit with character indexes. You want the columns upon which you create indexes to be declared NOT NULL, and you never want to write a query that says
WHERE colm IS NULL /* slow! */
or
WHERE colm IS NOT NULL /* slow! */
This kind of thing defeats indexing. In a similar vein, your performance will suffer big time if you apply functions to columns in your search conditions. For example, don't do this, because it too defeats indexing.
WHERE SUBSTR(colm,1,3) = 'abc' /* slow! */
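If you really need a prefix test, a leading-prefix LIKE (with no leading wildcard) can still use the index:
SELECT * FROM mytable WHERE colm LIKE 'abc%' /* hypothetical table; index-friendly */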
One more question to ask yourself. Will you uniquely identify the rows in your subsidiary tables, and if so, how? Do they have some sort of natural compound primary key? For example, you could have these columns in a "child" table.
parent varchar(20) pk fk to parent table
birthorder int pk
name varchar(20)
Then, you could have rows like...
parent birthorder name
homer 1 bart
homer 2 lisa
homer 3 maggie
But, if you tried to insert a fourth row here like this
homer 1 badbart
you'd get a primary key collision because (homer,1) is occupied. It's probably a good idea to work out how you'll manage primary keys for your subsidiary tables.
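A sketch of that child table with its natural compound primary key:
CREATE TABLE child (
  parent     VARCHAR(20) NOT NULL,  -- refers to the parent table's natural key
  birthorder INT NOT NULL,
  name       VARCHAR(20),
  PRIMARY KEY (parent, birthorder)
);
INSERT INTO child VALUES ('homer', 1, 'bart'), ('homer', 2, 'lisa'), ('homer', 3, 'maggie');
-- this fourth row fails with a duplicate-key error, because (homer, 1) is already taken:
INSERT INTO child VALUES ('homer', 1, 'badbart');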
Character strings containing numbers sort funny. For example, '2' comes after '101'. You need to be on the lookout for this.
The main benefit you get from numeric values is that they are easier to 'index'. Indexing is a process that MySQL uses to make it easier to find a value.
Typically, if you want to find a value in a group, you have to loop through the group looking for your value. That is slow and has a worst case of O(n). If instead your data is in a nice, searchable format, like a binary search tree, it can be found in O(log n), which is much faster.
Indexing is the process MySQL uses to prepare data to be searched; it generates search trees and other clever structures that make finding data quick. It makes many searches much faster. However, to do this, it has to compare the value you are searching for to various 'key' values to determine whether your value is greater than or less than the key.
This comparison can be done on non-numeric values. However, comparing non-numeric values is much slower. If you want to be able to look up data quickly, your best bet is to have an integer 'key'.
Numeric row id's have many advantages over a string based id.
Most of them are mentioned in other answers.
1. One of them is indexing. Primary keys are indexed by default in a relational database, so having a numeric key is more efficient.
2. Numeric fields are stored much more efficiently.
3. Joins are much faster with numeric keys.
4. A row id can serve as a foreign key; numeric ids are compact to store, making them efficient.
5. Using auto-increment on the primary key has its advantages too.

SQL - performance in varchar vs. int

I have a table which has a primary key of varchar data type, and another table with a foreign key of varchar data type.
I am making a join using this pair of varchar columns. Though I am dealing with a small number of rows (say a hundred), it is taking 60 ms. But when the system is finally deployed, it will have thousands of rows.
I read Performance of string comparison vs int join in SQL, and concluded that the performance of a SQL query depends on the DB and the number of rows it is dealing with.
But when I am dealing with a very large amount of data, would it matter much?
Should I create a new column with a numeric datatype in both tables and join on that to reduce the time taken by the SQL query?
You should use the correct data type for that data that you are representing -- any dubious theoretical performance gains are secondary to the overhead of having to deal with data conversions.
It's really impossible to say what that is based on the question, but most cases are rather obvious. Where they are not obvious are in situations where you have a data element that is represented by a set of digits but which you do not treat as a number -- for example, a phone number.
Clues that you are dealing with this situation are:
leading zeroes that must be preserved
no arithmetic operations are carried out on the element.
string operations are carried out: eg. "take the last four characters"
If that's the case then you probably want to store your "number" as a varchar.
Yes, you should give that a shot. But before you do, make a test version of your DB that you populate with the level of data you expect to have in production, and run some tests not just on SELECT, but on INSERT, UPDATE, and DELETE as well. Then make a version with integer keys, and perform equivalent tests.
The numeric keys WILL be faster, for the simple reason that the keys are smaller, but the difference may not be noticeable. Don't blindly trust books when you can test and measure the difference yourself.
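One way to structure that comparison (table names hypothetical): build both versions at production volume, compare the plans with EXPLAIN, then run the same queries to compare timings.
EXPLAIN SELECT c.* FROM child_v c JOIN parent_v p ON p.code = c.parent_code WHERE p.code = 'ABC123';  -- varchar keys
EXPLAIN SELECT c.* FROM child_i c JOIN parent_i p ON p.id = c.parent_id WHERE p.id = 42;              -- int keys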
(One thing to remember: if there are occasions when all you need from a relation is the value you currently have as its key, your database may run significantly faster if you can skip entire table lookups by just referencing the foreign-key on the records you have.)
Should I create a new column with a numeric datatype in both tables and join on that to reduce the time taken by the SQL query?
If you're in a position where you can change the design of the database with ease, then yes, your primary key should be an integer. Unless there is a really good reason to have an FK as a varchar, they should be integers as well.
If you can't change the PK or FK fields, then make sure they're indexed properly. This will eventually become a bottleneck though.
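For example (hypothetical names):
-- index the varchar FK on the child table so the join doesn't have to scan
ALTER TABLE child_table ADD INDEX idx_parent_code (parent_code);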
It just does not sound right to me. It will use more space and result in more reads, etc. Also, is the varchar the clustered index key? If so, the table is going to get very fragmented.