I am running a MySQL 5.7.30-0ubuntu0.16.04.1-log Server where I have the option of saving in char(4) or in smallint(5, unsigned).
There will be a primary index on the column and the key will be used as a referrence accross tables.
What is faster? Char or Int?
Unsigned SMALLINT values use two bytes and have values in the range [0, 65535]. CHAR(4) values take four bytes. So, indexing SMALLINT values will make for a smaller index. Smaller is faster. Plus indexes on character columns usually have all sorts of character-set and case-insensitivity monkey business built in to them, which also takes time and space.
But, for a table with at most 65K rows, the effect of this choice will be so small you'll have trouble measuring it. If you build something that's hard to debug, you'll spend your precious time and ten thousand times as much computer time debugging it than it will save.
Design your tables so they match your application. If you're using a four-digit number use SMALLINT.
The next person to work on your code (even if that person is you a year from now) will thank you for a clear implementation.
And keep in mind that MySQL ignores the number in parentheses on INT declarations. SMALLINT(4), SMALLINT(5), and SMALLINT all mean precisely the same thing. MySQL uses the native processor integer datatypes: TINYINT is an 8-bit number, SMALLINT a 16-bit number, INT a 32-bit number, and BIGINT a 64-bit number. Likewise FLOAT is a 32-bit IEEE 754 floating point number and DOUBLE a 64-bit one. The number of digits SMALLINT(4) is a nod to SQL standards compatibility.
As mentioned by O. Jones, SMALLINT will be faster and more space-efficient.
This is related to the following answer: mysql-char-vs-int
Also, MySQL Documentation:
CHAR and VARCHAR types
Integer Types
Case 1: The difference between CHAR(4) and SMALLINT is insignificant. It should not influence you choice of datatypes. Instead, use the datatypes that match the data.
Case 2: If you are comparing TINYINT to VARCHAR(255), the answer is probably different. Note that there is a much bigger difference in the choices.
Case 3: If the choice comes down to whether to "normalize" a column, there are arguments either way. I much prefer using a CHAR(2) for country_code than normalizing in order to shrink to a TINYINT. The overhead of extra normalization always(?) outweighs the space savings.
Another consideration: How many secondary keys are on the table? And how many other tables will you be joining to?
Case 4: PRIMARY KEY(big_string) but no secondary keys. There is no possibly no advantage in switching to an int.
Case 5: Since secondary keys include the PK, consider:
PRIMARY KEY(big_string),
INDEX(foo),
INDEX(bar)
versus
PRIMARY KEY(id), -- surrogate AUTO_INCREMENT
INDEX(big_string),
INDEX(foo),
INDEX(bar)
The latter will take less disk space.
Another consideration: Fetching a row is far more costly than comparing an int or string. My point is that you should not worry about comparison performance; you should look at the bigger picture when optimizing.
Case 6: USA 5-digit zip code. CHAR(5) (5 bytes) is reasonable. MEDIUMINT(5) UNSIGNED ZEROFILL (3 bytes) is better because it does everything better. (And it is a very rare case of the *INT(n) being meaningful.)
And the debate goes on and on.
Related
I am thinking about the best way to index my data. Is it a good idea to use the timestamp as my primary key? I am saving it anyway and I though about saving some columns. The timestamp should be an integer not a datetime column, because of performance. Moreover I don't want to be restricted on the amount of data in a short time (between two seconds). Therefore, I thought about an additionary AUTO_INCREMENT column. Now I have a unique key (timestamp and AI) and I can get the current inserted id easily by using the command "LAST_INSERT_ID". Is it possible to reset the AI counter every second / when there is a new timestamp? Or is it possible to detect if there is a dataset with the same timestamp and increase the AI value (I still want to be able to use LAST_INSERT_ID).
Please share some thoughts.
The timestamp should be an integer not a datetime column, because of performance.
I think you are of the belief that datetime is stored as a string. It is stored as numbers quite efficiently and with a wider range and more accuracy than an integer.
Using an integer may decrease performance because the database may not be able to correctly index it for use as a timestamp. It will complicate queries because you will not be able to use the full suite of date and time functions without first converting the integer to a datetime.
Use the appropriate date/time type, index it, and let the database optimize it.
Moreover I don't want to be restricted on the amount of data in a short time (between two seconds). Therefore, I thought about an [additional] AUTO_INCREEMENT column.
This would seem to defeat the point of "saving some columns". Now your primary key is two integers. Worse, it's a compound key which requires all references to store both values increasing storage requirements and complicating joins.
All the extra work necessary to determine the next primary key could be done in an insert trigger, but now you'd added complexity and extra work to every insert.
Is it a good idea to use the timestamp as my primary key?
A primary key should be A) unique and B) immutable. A timestamp is not unique, and you might need to change it.
Your primary key is unlikely to be a performance or storage bottleneck. Unless you have a good reason, stick with a simple, auto-incrementing big integer. A big integer because 2 billion is smaller than you think.
MySQL encapsulates this in serial which is bigint unsigned not null auto_increment unique.
TIMESTAMP and DATETIME are risky as a PRIMARY KEY since the PK must be Unique.
Otherwise, it is fine to use them for the PK or an index. But here are some caveats:
When using composite indexes (multi-column), put the things tested with = first; put the datetime last.
Smaller is slightly better when picking a PK. TIMESTAMP and DATETIME take 5 bytes (when not including microseconds); INT is 4 bytes; BIGINT is 8.
The time taken for comparing one PK value to another is insignificant. That includes character PKs. For example, country_code CHAR(2) CHARACTER SET ascii is only 2 bytes -- better than 'normalizing' it and replacing it with a 4-byte cc_id INT.
So, no, don't bother using INT instead of TIMESTAMP.
In my experience, 2/3 of tables have a "natural" PK and don't need an auto_increment PK.
One of the worst places to use a auto_inc is on a many-to-many mapping table. It is likely to slow down most operations by a factor of 2.
You hinted at PRIMARY KEY(timestamp, ai):
You need to add INDEX(ai) to keep AUTO_INCREMENT happy.
It provides locality of reference for temporarily 'near' rows. But so does ai, by itself.
No, there is no practical way to reset the ai each second. (MyISAM has such, but do not use that engine.) Instead be sure to declare ai big enough to last 'forever' before overflowing.
But I can't think of a use case where there isn't a better way.
I have created one table and I want to store numbers from 1 to 60 numbers only inside the field.
What should I put in the datatype of the table field? Should I use TINYINT (4) datatype?
"Best" data type is open to interpretation. Here are three options:
numeric(2, 0)
varchar(2)
tinyint(2)
These have different sizes, but that doesn't make them "best" -- except under certain circumstances where storage space is a primary concern. I am guessing that your "numbers" are not really numbers, but are codes of some sort that vary from 1 to 60.
If these are referencing a reference table, then tinyint makes sense as the key, because keys are often numbers. However, I often use int for such keys. The extra three bytes usually have little impact on performance.
If the code is zero-padded (so '01' rather than '1'), then char(2) is the appropriate type. It might take more space, but it accurately represents the value.
If these are indeed numbers -- like addition or multiplication is defined -- then tinyint is definitely the most appropriate type.
Yes TINYINT is the best option its from -128 to 127!
Documentation Link: https://dev.mysql.com/doc/refman/8.0/en/integer-types.html
Would there be a significant performance difference between looking up a MySQL table (currently 2M rows) using PK unsigned int(4 bytes) vs varchar(15) (unique index) column?
The problem with VARCHAR being used for any KEY is that they can hold WHITE SPACE. White space consists of ANY non-screen-readable character, like spaces tabs, carriage returns etc. Using a VARCHAR as a key can make your life difficult when you start to hunt down why tables aren't returning records with extra spaces at the end of their keys.
Sure, you CAN use VARCHAR, but you do have to be very careful with the input and output. They also take up more space and are likely slower when doing a Queries.
Integer types have a small list of 10 characters that are valid, 0,1,2,3,4,5,6,7,8,9. They are a much better solution to use as keys.
You could always use an integer-based key and use VARCHAR as a UNIQUE value if you wanted to have the advantages of faster lookups.
Shorter keys means more keys per index block, means fewer index block retrievals, means faster lookups.
I have some small tables that don't need the bigint primary key, they won't get that big, but, all tables have bigint primary key as standard.
Can this affect my performance or mysql is smart on that?
I wouldn't like to change the PKs to int on those tables, but if it can slow me down, surely I will.
One of optimization rules for DBMS is "keep your data as small as possible" - so if you don't need bigint - declare it as an int (and change type when you need it)
Based on benchmarks here using an BIGINT could increase the database size by a significant factor, which would affect performance, probably not noticeable until you reached a significant size.
MySQL won't do this for you, as it (and you) never know(s) how big the tables will get. The performance benefits of changing BIGINT to INT on smaller tables is negligable, although it might be an idea to keep the BIGINT type in case your row count goes above INTs limits of 2147483647 (4294967295 unsigned). It is, however, advisable to keep your data in as compact a way as possible.
If it's a relatively small table, you might be better off going with MEDIUMINT actually. It's limits are 8388607 and 16777215 (unsigned).
I seem to see a lot of people arbitrarily assigning large sizes to primary/foreign key fields in their MySQL schemas, such as INT(11) and even BIGINT(20) as WordPress uses.
Now correct me if I'm wrong, but even an INT(4) would support (unsigned) values up to over 4 billion. Change it to INT(5) and you allow for values up to a quadrillion, which is more than you would ever need, unless possibly you're storing geodata at NASA/Google, which I'm sure most of us aren't.
Is there a reason people use such large sizes for their primary keys? Seems like a waste to me...
The size is neither bits nor bytes.
It's just the display width, that is
used when the field has ZEROFILL
specified.
and
INT[(M)] [UNSIGNED] [ZEROFILL] A
normal-size integer. The signed range
is -2147483648 to 2147483647. The
unsigned range is 0 to 4294967295.
See this explanation.
I don't see any good reason to use a number larger than 32-bit integer for indexing data in normal business-sized databases. Most of them have maybe millions of records (or that order of magnitude).