Does MySQL slow down with really high primary key values?

If the values of the primary key in my table range from 200,000,000 to 200,000,100, will queries be much slower than if the values were 1,000 to 1,100?

No. Unless you're using CHAR/VARCHAR fields for those numbers, a number is stored in a fixed-size raw binary format, and the "size" of a number has no bearing on search speed: 1 occupies exactly as much space inside the database as 999999999999.

The answer to your specific question is no, they will not be much slower.
Maybe a few nanoseconds as the extra bits get sent down the wire and processed, but nothing significant.
However, if your question was "will having 200,000,100 rows be slower than 1,100?" then yes, it would be a bit slower.

No, but as others have noted, a larger number of records in your table will generally require more IO to retrieve records, although it can be mitigated through indexes.
The data type for your primary key will make a slight difference (200M still fits into a 4-byte INT, but an 8-byte BIGINT will be slightly larger, and a CHAR(100) would be more wasteful, etc.). This choice will result in slightly larger storage requirements for this table, its indexes, and other tables with foreign keys to this table.
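A minimal sketch of that point (the table names are invented for illustration): the stored key size depends only on the declared type, never on whether the values happen to be near 1,000 or near 200,000,000.
CREATE TABLE orders_int (
  id INT UNSIGNED NOT NULL PRIMARY KEY      -- always 4 bytes, whether id = 1000 or id = 200000100
);
CREATE TABLE orders_bigint (
  id BIGINT UNSIGNED NOT NULL PRIMARY KEY   -- always 8 bytes, regardless of the value stored
);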

Well, define "much slower"...
The values 1000-1100 can fit in a SMALLINT (2 bytes), while 200,000,000 has to go into an INT (4 bytes). So twice as many records can fit in memory with a SMALLINT key as with an INT key, and 1000-1100 will be faster.

Nope, otherwise the algorithm would be very stupid... Your values fit perfectly into a 32-bit integer.

The queries will be exactly the same.
They will be slower, however, when you have between 200,000,000 to 200,000,100 rows in your table when compared to 1000 to 1100 rows.

Not at all. The MySQL developers take care that lookup speed does not depend on how large the key values are.

Related

In MySQL, is it faster to compare with integer or string of integer?

I have a very large table that looks like this:
int_id str_id
1 1
1 1
2 2
3 3
... ...
99999 99999
3 3
Column int_id has type INT while column str_id has type VARCHAR. They always contain the "same" value (i.e. if int_id is 1, then str_id is "1", etc...).
Now let's say I want to query rows whose id is in (1,4,5,2,....5282,55,232) (a list of around 15 ids), which of the 2 queries below is faster?
select * from table where int_id IN (1,4,5,2,....5282,55,232)
or
select * from table where str_id IN ('1','4','5','2',....'5282','55','232')
assuming I create an index for each column. My table will be absolutely enormous and speed is very important to me so I want to optimize it as much as possible. Appreciate any help.
MySQL ultimately runs on some processor, and in general an integer comparison can be done in a single CPU cycle, while string comparisons will generally take multiple cycles, perhaps one cycle per character. See Why is integer comparison faster then string comparison? for more information.
For a table scan, fetching the row and extracting the column to test is 99% of the effort. So speeding up the comparison may not make a noticeable difference.
With the column indexed, the question is really about how fast an INT can be looked up versus a VARCHAR. Again, fetching the row (in the index's BTree) is the overwhelming determinant of performance; I'd guess 98% for the index's BTree.
Bottom Line: Use the datatype that is appropriate to your application; don't worry about performance of INT vs VARCHAR.
(Caveat: I pulled "99%" and "98%" out of thin air.)
The link to C++ code is mostly irrelevant, because it talks only about a single comparison, not the fetching of the row, parsing the row, drilling down a BTree, etc.
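If you want to check this on your own data, you can compare the plans of the two queries yourself (my_table here is a stand-in for your actual table name; int_id and str_id are the question's columns):
EXPLAIN SELECT * FROM my_table WHERE int_id IN (1, 4, 5, 2, 55, 232);
EXPLAIN SELECT * FROM my_table WHERE str_id IN ('1', '4', '5', '2', '55', '232');
-- Both should use their respective index (typically a range access); the difference
-- shows up mainly in key_len, not in how the BTree is traversed.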
Oh, another thing: since you are talking about 5-digit numbers, you can get another tiny performance boost by changing from INT (4 bytes) to MEDIUMINT UNSIGNED (3 bytes).
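For example (again using my_table as a stand-in table name and the question's int_id column), that change would look something like:
ALTER TABLE my_table MODIFY int_id MEDIUMINT UNSIGNED NOT NULL;   -- 3 bytes; holds values up to 16,777,215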
Oh, there is even a hardware component to the question. INT (and its siblings) could store the bytes either "big-endian" or "little-endian". MySQL, on day one, standardized on one of those for storing on disk. Now, when they fetch numbers on some hardware platforms, they have to reverse the bytes.

Does moving a varchar column to a different table and later using a join improve performance?

We have a table with 150 million rows, and one of the columns is still a VARCHAR (128 symbols); we have optimized every other column to TINYINT and similar types to reduce size. We're trying to improve the performance further. Would moving the column to another table and using a join when selecting cause any performance issues? There are around 500 unique varchar values at the moment, and growth shouldn't exceed around 100-200 per year, so in theory it should decrease the size of the table drastically.
It depends on how long those strings are. Just because the string is defined as varchar(128) doesn't mean that it contains that many characters. A varchar is going to contain a length (either one or two bytes) and then the data. In this case, the length is 1 byte.
So, if your strings are very short, then an integer used for mapping to another table might actually be bigger.
If your strings are long -- say 100 characters -- then replacing them with a lookup key will be smaller. And this might actually have a significant impact on the data size (and hence on performance).
The join itself should add little to the cost of a query, particularly if the join key is a primary key. In fact, because the data in the larger table is smaller, such queries might run faster with the join.
What you should do depends on your data.
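As a rough sketch of that normalization (all table and column names here are invented), the lookup table and the join could look like:
CREATE TABLE labels (
  label_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,   -- ~500 distinct values today; SMALLINT allows up to 65,535
  label VARCHAR(128) NOT NULL,
  UNIQUE KEY (label)
);
-- The 150M-row table then stores the 2-byte label_id instead of the long string:
SELECT b.id, l.label
FROM big_table AS b
JOIN labels AS l ON l.label_id = b.label_id
WHERE b.id = 12345;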

Most efficient way to store VARCHAR in 150B-row table

We have to ingest and store 150 billion records in our MySQL InnoDB database. One field in particular is a VARCHAR and is taking up a lot of space. Its characteristics:
Can be NULL
Highly duplicated but we can't de-dupe because it increases ingestion time exponentially
Average length is about 75 characters
It has to have an index as it will have to join with another table
We don't need to store it in human readable format but we need to be able to match it to another table which would have to have the same format for this column
I've tried the following:
Compressing the table: this helps with space but dramatically increases ingestion time, so I'm not sure compression is going to work for us
Tried hashing with SHA2, which reduced the string length to 56; that gives us reasonable space savings but just not quite enough. Also, I'm not sure SHA2 will generate unique values for this sort of data
Was thinking about MD5, which would further reduce the string length to probably the right level, but I'm not sure whether MD5 is strong enough to generate unique values to be able to match with another table
A hash function like MD5 produces a 128-bit hash in a string of 32 hex characters, but you can use UNHEX() to cut that in half to 16 binary characters, and store the result in a column of type BINARY(16). See my answer to What data type to use for hashed password field and what length?
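As a rough sketch (the table and column names are assumptions), the storage pattern looks like this:
-- Store the 128-bit MD5 as 16 raw bytes instead of 32 hex characters:
ALTER TABLE big_table ADD COLUMN str_hash BINARY(16) NULL;
-- At ingestion time, store UNHEX(MD5(...)) rather than the ~75-character string:
INSERT INTO big_table (str_hash, other_col) VALUES (UNHEX(MD5('the original string')), 42);
-- Matching against the other table becomes a fixed-width 16-byte comparison:
SELECT b.id, o.id FROM big_table AS b JOIN other_table AS o ON o.str_hash = b.str_hash;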
MD5 has 2^128 distinct hashes, or 340,282,366,920,938,463,463,374,607,431,768,211,456. The chances of two different strings resulting in a collision are reasonably low, even if you have 15 billion distinct inputs. See How many random elements before MD5 produces collisions? If you're still concerned, use SHA1 or SHA2.
I'm a bit puzzled by your attempts to use a hash function, though. You must not care what the original string is, since you must understand that hashing is not reversible. That is, you can't get the original string from a hash.
I like the answer from @Data Mechanics, that you should enumerate the unique string inputs in a lookup table and use a BIGINT primary key (an INT has only about 4 billion values, so it isn't large enough for 15 billion rows).
I understand what you mean that you'd have to look up the strings to get the primary key. What you'll have to do is write your own program to do this data input. Your program will do the following:
Create an in-memory hash table to map strings to integer primary keys.
Read a line of your input
If the hash table does not yet have an entry for the input, insert that string into the lookup table and fetch the generated insert id (a SQL get-or-create variant of this step is sketched after the list). Store this as a new entry in your hash table, with the string as the key and the insert id as the value.
Otherwise the hash table does have an entry already, and just read the primary key bigint from the hash table.
Insert the bigint into your real data table, as a foreign key, along with other data you want to load.
Loop to step 2.
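If you would rather push the get-or-create step into MySQL itself instead of (or alongside) the in-memory map, a common pattern uses INSERT ... ON DUPLICATE KEY UPDATE with LAST_INSERT_ID(); the table and column names below are assumptions:
CREATE TABLE lookup (
  id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  str VARCHAR(100) NOT NULL,    -- the strings average ~75 characters
  UNIQUE KEY (str)
);
-- Insert the string if it is new; if it already exists, make LAST_INSERT_ID()
-- report the existing row's id instead of generating a new one.
INSERT INTO lookup (str) VALUES ('example-string')
  ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);
SELECT LAST_INSERT_ID();   -- the bigint to store in the real data table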
Unfortunately it would take over 1 TB of memory to hold a HashMap of 15 billion entries, even if you MD5 the string before using it as the key of your HashMap.
So I would recommend putting the full collection of mappings into a database table, and keeping a subset of it in memory. You then need an extra check at step 3 above: if the in-memory HashMap doesn't have an entry for your string, first check the database. If it's in the database, load it into the HashMap. If it isn't in the database, then proceed to insert it into the database and then into the HashMap.
You might be interested in using a class like LruHashMap. It's a HashMap with a maximum size (which you choose according to how much memory you can dedicate to it). If you put a new element when it's full, it kicks out the least recently referenced element. I found an implementation of this in Apache Lucene, but there are other implementations too. Just Google for it.
Is the varchar ordinary text? Ordinary text typically compresses about 3:1. Compressing just the one field may get it down to 25-30 bytes. Then use something like VARBINARY(99).
INT (4 bytes) is not big enough for normalizing 15 billion distinct values, so you need something bigger. BIGINT takes 8 bytes. BINARY(5) and DECIMAL(11,0) are 5 bytes each, but are messier to deal with.
But you are concerned by the normalization speed. I would be more concerned by the ingestion speed, especially if you need to index this column!
How long does it take to build the table? You haven't said what the schema is; I'll guess that you can put 100 rows in an InnoDB block. I'll say you are using SSDs and can get 10K IOPs. 1.5B blocks / 10K blocks/sec = 150K seconds = 2 days. This assumes no index other than an ordered PRIMARY KEY. (If it is not ordered, then you will be jumping around the table, and you will need a lot more IOPs; change the estimate to 6 months.)
The index on the column will effectively be a table 150 billion 'rows' -- it will take several terabytes just for the index BTree. You can either index the field as you insert the rows, or you can build the index later.
Building index as you insert, even with the benefit of InnoDB's "change buffer", will eventually slow down to not much faster than 1 disk hit per row inserted. Are you using SSDs? (Spinning drives are rated about 10ms/hit.) Let's say you can get 10K hits (inserts) per second. That works out to 15M seconds, which is 6 months.
Building the index after loading the entire table... This effectively builds a file with 150 billion lines, sorts it, then constructs the index in order. This may take a week, not months. But... It will require enough disk space for a second copy of the table (probably more) during the index-building.
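In SQL terms, the "build it later" option is just loading with no secondary index and then adding it afterwards, which lets InnoDB build the index by sorting rather than one row at a time (the index and column names below are assumptions):
-- After the bulk load finishes:
ALTER TABLE big_table ADD INDEX idx_str_hash (str_hash);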
So, maybe we can do the normalization in a similar way? But wait. You said the column was so big that you can't even get the table loaded? So we have to compress or normalize that column?
How will the load be done?
Multiple LOAD DATA calls (probably best)? Single-row INSERTs (change "2 days" to "2 weeks" at least)? Multi-row INSERTs (100-1000 is good)?
autocommit? Short transactions? One huge transaction (this is deadly)? (Recommend 1K-10K rows per COMMIT; see the sketch after this list.)
Single threaded (perhaps cannot go fast enough)? Multi-threaded (other issues)?
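A hedged sketch of what the recommended shapes look like; the file names, table names, and columns are all invented:
-- Multi-row INSERTs, a few hundred to a thousand rows per statement,
-- committing every 1K-10K rows rather than once at the very end:
SET autocommit = 0;
INSERT INTO big_table (str_hash, other_col) VALUES
  (UNHEX(MD5('addr-1')), 1),
  (UNHEX(MD5('addr-2')), 2),
  (UNHEX(MD5('addr-3')), 3);
COMMIT;
-- Or chunked LOAD DATA calls, one file per chunk:
LOAD DATA INFILE '/data/chunks/chunk_000123.csv'
  INTO TABLE big_table
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  (@addr, other_col)
  SET str_hash = UNHEX(MD5(@addr));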
See my discussion of high-speed ingestion.
Or will the table be MyISAM? The disk footprint will be significantly smaller. Most of my other comments still apply.
Back to MD5/SHA2. Building the normalization table, assuming it is much bigger than can be cached in RAM, will be a killer, too, no matter how you do it. But, let's get some of the other details ironed out first.
See also TokuDB (available with newer versions of MariaDB) for good high-speed ingestion and indexing. TokuDB will slow down some for your table size, whereas InnoDB/MyISAM will slow to a crawl, as I already explained. TokuDB also compresses automatically; some say by 10x. I don't have any speed or space estimates, but I see TokuDB as very promising.
Plan B
It seems that the real problem is in compressing or normalizing the 'router address'. To recap: Of the 150 billion rows, there are about 15 billion distinct values, plus a small percentage of NULLs. The strings average 75 bytes. Compressing may be ineffective because of the nature of the strings. So, let's focus on normalizing.
The id needs to be at least 5 bytes (to handle 15B distinct values); the string averages 75 bytes. (I assume that is bytes, not characters.) Add on some overhead for BTree, etc, and the total ends up somewhere around 2TB.
I assume the router addresses are rather random during the load of the table, so lookups for the 'next' address to insert is a random lookup in the ever-growing index BTree. Once the index grows past what can fit in the buffer_pool (less than 768GB), I/O will be needed more and more often. By the end of the load, approximately 3 out of 4 rows inserted will have to wait for a read from that index BTree to check for the row already existing. We are looking at a load time of months, even with SSDs.
So, what can be done? Consider the following. Hash the address with MD5 and UNHEX it (16 bytes), and leave that in the table. Meanwhile, write a file with the hex value of the MD5 plus the router address (150B lines, skipping the NULLs). Sort the file, with deduplication, on the MD5. Build the normalization table from the sorted file (15B lines).
Result: The load is reasonably fast (but complex). The router address is not 75 bytes (nor 5 bytes), but 16. The normalization table exists and works.
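A sketch of how that last step might look; the file layout, names, and lengths are assumptions:
CREATE TABLE addr_lookup (
  md5  BINARY(16)     NOT NULL PRIMARY KEY,   -- UNHEX'd MD5 of the router address
  addr VARBINARY(255) NOT NULL                -- the original address, ~75 bytes on average
);
-- The sorted, de-duplicated file holds two tab-separated columns: hex MD5, then the address.
LOAD DATA INFILE '/data/addr_dedup_sorted.tsv'
  INTO TABLE addr_lookup
  FIELDS TERMINATED BY '\t'
  (@hex_md5, addr)
  SET md5 = UNHEX(@hex_md5);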
You state it's highly duplicated?
My first thought would be to create another table with the actual varchar value and an INT primary key pointing to that value.
Then the existing table can simply be changed to contain the reference to this value as a foreign key (which is additionally efficiently indexable).

Does changing a column length from varchar 18 to varchar 8 improve search speed drastically?

I have a MySQL database with ~500k rows. I have a column that is basically a short URL used as an index for each row. I can make the indexed column a char/varchar of 8 or 18 characters. I want to know if the extra 10 chars will really slow my searches down drastically. A pro of varchar(18) is that I won't have to generate a short URL, whereas if I use varchar or char of 8, I will. The index will be used for retrieving comments, updating the entry, etc. The index, however, will be unique regardless.
Thanks
Not drastically. Search speed in B+-tree indexes varies as the number of index reads, which varies with log_x(N), where N is the number of index entries and x is the number of keys per index block, which in turn is inversely proportional to their length. You can see it is a sub-linear relationship.
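To put rough, made-up numbers on that: suppose an index block held about 400 keys for the 8-character column and about 250 keys for the 18-character column (figures invented purely for illustration). For N = 500,000 rows, log_400(500,000) ≈ 2.2 versus log_250(500,000) ≈ 2.4, so the B+-tree is about three levels deep either way and a lookup does essentially the same number of index reads.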
The difference in index performance between 8-char and 18-char column values is not zero. But it is quite small. Index performance deteriorates more markedly when you're dealing with stuff like 255-character indexes.
You should be fine either way with a half-megarow table unless you have really extraordinary performance requirements.
When you're done with the table's initial load, do OPTIMIZE TABLE to make the indexes work optimally. If the table gets lots of DELETE or INSERT operations, try to find a time of day or week when you routinely can redo the OPTIMIZE TABLE operation. http://dev.mysql.com/doc/refman/5.6/en/optimize-table.html

How to handle a large amount of data in a specific table of a database

I am working on a project where I constantly insert rows into a table, and within a few days this table is going to be very big. I came up with a question and can't find the answer:
what is going to happen when I have more rows than 'bigint' allows in that table, given that I have an 'id' column (which is an int)? Can my database (MySQL) handle that properly? How do big companies handle that kind of problem and joins on big tables?
I don't know if there are short answers to that kind of problem, but any lead toward solving my question would be welcome!
You would run out of storage before you run out of BIGINT primary key sequence.
Unsigned BIGINT can represent a range of 0 to 18,446,744,073,709,551,615. Even if you had a table with a single column that held the primary key of BIGINT type (8 bytes), you would consume (18,446,744,073,709,551,615×8)÷1,024^4 = 134,217,728 terabytes of storage.
Also, the maximum table size in MySQL is 256 terabytes for MyISAM and 64 terabytes for InnoDB, so really you're limited to 256×1,024^4÷8 = 35 trillion rows.
Oracle supports NUMBER(38) (which takes 20 bytes) as the largest possible PK, covering 0 to 1e38. However, having a 20-byte primary key is of little use because the maximum table size in Oracle is 4G blocks × 32 KB = 128 terabytes (at a 32K block size).
numeric data type
If this column is the primary key, you are not able to insert more rows.
If it is not a primary key, the column is truncated to the maximum value that can be represented in that data type.
You should change the id column to BIGINT as well if you need to perform joins.
You can use a UUID to replace the integer primary key (this is what big companies do);
take note that a UUID is a string, so your field will no longer be numeric.
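A hedged sketch of the two options mentioned above, with invented table and column names:
-- Option 1: widen the key before it becomes a problem
ALTER TABLE events MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;
-- Option 2: use a UUID key; storing it as BINARY(16) keeps it compact
-- (UUID_TO_BIN() / BIN_TO_UUID() exist in MySQL 8.0+)
CREATE TABLE events_uuid (
  id BINARY(16) NOT NULL PRIMARY KEY,
  payload VARCHAR(255)
);
INSERT INTO events_uuid (id, payload) VALUES (UUID_TO_BIN(UUID()), 'example');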
That is one of the big problems of every website with LOTS of users. Think about Facebook, how many requests do they get every second? How many servers do they have to store all the data? If they have many servers, how do they separate the data across the servers? If they separate the data across the servers, how would they be able to call normal SQL queries on multiple servers and then join the results? And so on. Now to avoid complicating things for you by answering all these questions (which will most probably make you give up :-)), I would suggest using Google AppEngine. It is a bit difficult at the beginning, but once you get used to it you will appreciate the time you spent learning it.
If you only have a database and you don't have many requests, and your concern is just the storage, then you should consider moving to MSSQL or, better as far as I know, Oracle.
Hope that helps.
To put BIGINT even more into perspective, if you were inserting rows non-stop at 1 row per millisecond (1,000 rows per second), you would get 31,536,000,000 rows per year.
With an unsigned BIGINT maximum of 18,446,744,073,709,551,615, you would be good for roughly 585 million years.
You could make your bigint unsigned, giving you 18,446,744,073,709,551,615 available IDs
Big companies handle it by using DB2 or Oracle