I have two tables:
table A with columns: INT t_a, INT t_b, INT t_c, INT t_d, VARCHAR t_var
table B with columns: INT t_a, INT t_b, INT t_c, INT t_d, CHAR t_cha
If I select column t_a, will there be any performance difference between table A and table B?
In theory there should be a performance penalty when dealing with VARCHARs, as you cannot use fixed-address computation. But in practice this is not visible nowadays.
There is no difference between CHAR and VARCHAR when selecting.
Yes, but it is quite subtle. I presume that the character fields actually have lengths associated with them.
There is a difference in how the data is stored. The CHAR field stores the full declared length in the database, padding with spaces at the end, whereas the VARCHAR field only stores the characters actually needed.
So, if you had a table that contained US state names, then:
create table states (
stateid int,
statename char(100)
);
Would occupy something like 50*100 + 50*4 = 5,200 bytes in the database (50 rows, 100 bytes of CHAR plus 4 bytes of INT each). With a varchar(), the space usage would be much less.
In larger tables, this can increase the number of pages needed to store the data. This additional storage can slow down queries by some amount. (It could be noticeable on a table with a large number of records and lots of such wasted space.)
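For comparison, a rough sketch of the varchar version (the table name is just made up to distinguish it here):
create table states_v (
stateid int,
statename varchar(100)
);
-- each statename now takes only its actual length plus a 1-byte length prefix,
-- so 50 rows need well under 1 KB for that column instead of 5,000 bytes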
Related
Is it possible to use the LOCATE() function on a TEXT column, or is there any alternative to it for TEXT fields?
The thing is, we have large VARCHARs (65 KB) that we use to track subscriptions, so we append subscription_ids into one long string in the VARCHAR.
This string can hold up to 5,000 subscription_ids in one row. We use LOCATE to see if a user is subscribed, i.e. whether a subscription_id is found inside the VARCHAR string.
The problem is that we plan to have more than 500,000 rows like this, and it seems this can have a big impact on performance.
So we decided to move to TEXT instead, but now there is a problem with indexing and with how to LOCATE sub-text inside a TEXT column.
Billions of subscriptions? Please show an abbreviated example of a TEXT value. Have you tried FIND_IN_SET()?
Is one TEXT field showing up to 5000 subscriptions for one user? Or is it the other way -- up to 5K users for one magazine?
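For reference, FIND_IN_SET() works on a comma-separated value directly; a sketch with made-up table and column names (note it still has to scan every row, it cannot use an index on the text):
-- subs is a TEXT column like '17,102,4930,...' (comma-separated, no spaces)
SELECT user_id
FROM user_subs_text
WHERE FIND_IN_SET('4930', subs) > 0;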
In any case, it would be better to have a table with 2 columns:
CREATE TABLE user_sub (
user_id INT UNSIGNED NOT NULL,
sub_id INT UNSIGNED NOT NULL,
PRIMARY KEY(user_id, sub_id),
INDEX(sub_id, user_id)
) ENGINE=InnoDB;
The two composite indexes let you very efficiently find the 5K subscriptions for a user or the 500K users for a sub.
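For example, with the user_sub table above, both directions are simple indexed lookups (the literal ids are just placeholders):
-- all subscriptions for one user: served by PRIMARY KEY(user_id, sub_id)
SELECT sub_id FROM user_sub WHERE user_id = 123;

-- all users for one subscription: served by INDEX(sub_id, user_id)
SELECT user_id FROM user_sub WHERE sub_id = 456;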
Shrink the id that stays under 500K to MEDIUMINT UNSIGNED (16M limit instead of 4 billion; 3 bytes each instead of 4).
Shrink the id that stays under 5K to SMALLINT UNSIGNED (64K limit instead of 4 billion; 2 bytes each instead of 4).
If you desire, you can use GROUP_CONCAT() to reconstruct the commalist. Be sure to change group_concat_max_len to a suitably large number (default is only 1024 bytes.)
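A sketch of that reconstruction, again using the user_sub table above (the 1,000,000 limit is an arbitrary example value):
SET SESSION group_concat_max_len = 1000000;   -- the default of 1024 bytes would truncate long lists

SELECT user_id,
       GROUP_CONCAT(sub_id ORDER BY sub_id) AS sub_list
FROM user_sub
GROUP BY user_id;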
I want to convert a CSV database into a MySQL one, and I know I will never add any new row to the database tables. I know the max ID of each table, for example: 9898548.
What is the proper way to compute the INT size? Would CEIL(LOG2(last_id)) be sufficient for this? With my example, it would be LOG2(9898548) = 23.2387, so INT(24)? Is this correct?
When you're defining your table and you know your max values, you can refer to the documented ranges of the integer types. See http://dev.mysql.com/doc/refman/5.7/en/integer-types.html for a table of numeric sizes.
IDs are usually positive, so you can use the unsigned types. In your case 9898548 is less than 16777215 (the MEDIUMINT UNSIGNED max value), so that would be the most space-efficient storage option. So your calculation is correct: you need 24 bits, or 3 bytes, i.e. a MEDIUMINT UNSIGNED.
CREATE TABLE your_table (id MEDIUMINT UNSIGNED PRIMARY KEY);
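If you want MySQL to do that arithmetic for you, a quick sanity check (purely illustrative):
SELECT CEIL(LOG2(9898548)) AS bits_needed;   -- returns 24, i.e. 3 bytes -> MEDIUMINT UNSIGNED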
The number in brackets only affects how MySQL displays the value; it does nothing to the storage size. So INT(11) and INT(24) can both store the same range of numbers, but the one defined as INT(11) has a display width of 11 digits even if the number is smaller. See
http://dev.mysql.com/doc/refman/5.7/en/numeric-type-attributes.html
"This optional display width may be used by applications to display integer values having a width less than the width specified for the column by left-padding them with spaces"
Yes, in this case, you need an integer type with at least 24 bits (equal to 3 bytes). The smallest type in MySQL satisfying this is MEDIUMINT UNSIGNED, according to the documentation.
Edit: Added the UNSIGNED.
Let's consider this strange situation,
where there's a redundancy of indexes.
TableA (item_id, code_key, data01, ... data0n)
TableB (item_id, code_key, dataA1, ... dataAn)
Both item_id and code_key are unique, and either could be the primary key in both tables; either item_id or code_key could be removed from both tables without losing any reference/relation.
It's redundant I know, but this is not the point of the question.
Consider that, both columns are indexed.
item_id is an INT, code_key is a VARCHAR(100).
Someone is suggesting that it's better to query:
select * from TableA INNER JOIN TableB USING(item_id)
rather than :
select * from TableA INNER JOIN TableB USING(code_key)
I don't see the point of it, since both columns are indexed and the performance would be the same... wouldn't it?
Is it that an INT would be faster than a VARCHAR in the ON clause, even if they're both indexed?
Int comparisons are faster than varchar comparisons, for the simple
fact that ints take up much less space than varchars.
This holds true both for unindexed and indexed access. The fastest way
to go is an indexed int column.
-- Robert Munteanu
Hope that helps. There isn't much difference, but we value speed, and the longer the VARCHAR, the slower the comparison gets.
You seem to be asking about having two columns for the same information. That is almost always frowned on.
Moving on... Should you have an INT or a VARCHAR...
Fetching a row costs a lot more (even if cached) than anything to do with the individual columns. So, while VARCHAR might be more costly than INT, it is not enough more costly to warrant going out of your way to make the change just for that reason.
The same argument goes for complexity of expressions.
In a related vein, there are multiple reasons for using an ENUM instead of a VARCHAR when appropriate. (Ditto for changing a VARCHAR into a TINYINT.)
Smaller --> faster, especially if I/O-bound.
If indexed, then the index(es) are smaller, too.
Less disk space
"Normalization" is a deliberate attempt to replace a VARCHAR by some size of INT. But there are multiple reasons for that.
Only one place to change the string, not many rows in many tables. If this reason exists, then it trumps other considerations.
Space savings.
But it adds complexity (now need to JOIN). Hence speed may or may not be improved.
When picking an INT type, always pick the smallest flavor that covers the range: INT takes 4 bytes, MEDIUMINT 3 bytes, and so on. And usually use UNSIGNED.
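As a sketch of that normalization idea (all table and column names here are made up for illustration):
CREATE TABLE color (
    color_id    TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    color_name  VARCHAR(40) NOT NULL,
    UNIQUE(color_name)
) ENGINE=InnoDB;

CREATE TABLE item (
    item_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    color_id  TINYINT UNSIGNED NOT NULL,   -- 1 byte per row instead of the full string
    FOREIGN KEY (color_id) REFERENCES color(color_id)
) ENGINE=InnoDB;

SELECT i.item_id, c.color_name             -- the added JOIN is the complexity cost
FROM item i
JOIN color c USING (color_id);
The string lives in one place (the color table) and the big table plus its indexes shrink, but every query that needs the text back pays for a JOIN.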
I need to create a table which saves measurements consisting of a device id (int), a logdate (datetime) and a value (decimal) (SQL Server 2008). The measurements are always on the quarter hour, e.g. 00:00, 00:15, 00:30, 00:45, 01:00, 01:15... so I was thinking that an int holding the number of quarter-hours since a certain date would result in better performance than a datetime.
Retrieving would usually be done using the following:
- where DeviceId = x and QuarterNumber between a and b
- where DeviceId in (x, y, ...) and QuarterNumber between a and b
- where DeviceId = x and QuarterNumber = a
What would be the best design for this table?
PK DeviceId int
PK QuarterNumber int
Value int
or
PK MeasurementId int
UQ QuarterNumber int
UQ DeviceId int
Value int
(UQ=unique index)
or something totally different?
Thanks!
You might get marginally better SELECT performance by defining the number of quarter hours since a certain date if you have many millions of rows.
Personally, I don't think the marginal performance gain will be worth the reduced readability. I also wouldn't like basing the design on a quarter-hour assumption. (In my experience, that kind of requirement often changes over time.) You could include a quarter-hour CHECK constraint on a datetime column now, and drop it later if that requirement changes.
But there's no point in relying on opinion when you can test and measure. Build three tables, load several million rows of sample data, and study the query plans. (It's not completely impractical to load 50 million rows into each table. I've sometimes loaded 20 million rows into a test table when answering a question on SO.) Don't assume that your first try at indexing will be optimal. Consider multiple indexes, and consider a multi-column index, too.
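For instance, a minimal sketch of the datetime-based design with the quarter-hour CHECK constraint mentioned above (table, column, and constraint names are illustrative, as is the decimal precision):
CREATE TABLE Measurement (
    DeviceId  int            NOT NULL,
    LogDate   datetime       NOT NULL,
    Value     decimal(18, 4) NOT NULL,
    CONSTRAINT PK_Measurement PRIMARY KEY (DeviceId, LogDate),
    -- enforces "always on the quarter hour"; drop it if that requirement ever changes
    CONSTRAINT CK_Measurement_QuarterHour CHECK (
        DATEPART(minute, LogDate) % 15 = 0
        AND DATEPART(second, LogDate) = 0
        AND DATEPART(millisecond, LogDate) = 0
    )
);
The composite primary key on (DeviceId, LogDate) lines up with the "where DeviceId = x and ... between a and b" access patterns in the question.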
I don't think there can be any specific guidelines for your criteria. You might need to create and test (you can insert demo data into each). Since you want a performance improvement, I would suggest using an index on your table.
Funny thing I've found about MySQL: it has a 3-byte numeric type, MEDIUMINT, with a range from -8388608 to 8388607. That seems strange to me. I thought the sizes of numeric types were chosen for performance, so data should be aligned to a machine word or double word, and if we need restriction rules for numeric ranges, they should be external to the datatype. For example:
CREATE TABLE ... (
id INT RANGE(0, 500) PRIMARY KEY
)
So, does anyone know why 3 bytes? Is there any reason?
The reason is that if you have a number that falls within the 3-byte range, you don't waste space by storing it in 4 bytes.
When you have twenty billion rows, it matters: one byte saved per row is roughly 20 GB, before you even count the indexes that repeat the column.
The alignment issue you mentioned applies mostly to data in RAM. Nothing forces MySQL to use 3 bytes to store that type as it processes it.
This might have a small advantage in using disk cache more efficiently though.
We frequently use tinyint, smallint, and mediumint for very significant space savings. Keep in mind, it also makes your indexes that much smaller.
This effect is magnified when you have really small join tables, like:
create table join_table (       -- table name is illustrative
id1 smallint unsigned not null,
id2 mediumint unsigned not null,
primary key (id1, id2)
);
And then you have hundreds of millions or billions of records.