I have a table with quite a lot of entries.
I need an additional column with an integer value or null.
The thing is that only very few rows will have that field populated.
So I wonder whether it's better to create a separate table where I link the entries in a 1:1 relation.
I know one integer entry takes 4 bytes in MySQL/MyISAM. If I have the column set to allow NULL values, and only 100 of 100,000 rows have the field populated, will the rest still consume 4 bytes for every NULL value?
Or is MySQL intelligent enough to store the value only where it is populated and simply treat everything else as NULL?
This depends on the ROW_FORMAT value you give when you create your table. (REDUNDANT and COMPACT are InnoDB row formats.)
Before version 5.0.3, the default format is "REDUNDANT": any fixed-length field uses the same space, even if its value is NULL.
Starting with version 5.0.3, the default is "COMPACT": a NULL value takes no space in the data part of the row, only a single bit in the row header.
You can run an ALTER TABLE to make sure the table uses the format you want:
ALTER TABLE ... ROW_FORMAT=COMPACT
More details here:
http://dev.mysql.com/doc/refman/5.1/en/data-size.html
As far as my understanding goes, once you declare a field as int, 4 bytes will be set aside for it. So, for 100,000 rows you are looking at ~ 400 KB of space.
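The estimate above is plain multiplication. A minimal Python sketch of it follows, assuming (as this answer does) that every row reserves the full 4 bytes whether or not the column is populated. The variable names are illustrative only.

```python
# Worst-case estimate: every row reserves the full INT width,
# populated or not (the assumption made in the answer above).
ROWS = 100_000
INT_BYTES = 4  # storage size of a MySQL INT

total_bytes = ROWS * INT_BYTES
print(f"{total_bytes} bytes = {total_bytes / 1000:.0f} KB")
```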
If space is a constraint, then a separate table will be better. On the other hand, if performance is the criterion, you'll have to take into account how often that field is queried and whether it is checked for existence or non-existence. With a separate table, you'll need a join in either case. To check whether the field is set, you can use an inner join, which will be slower than a single-table query. To check for non-existence, you'll need a left/right outer join, which will be slower than an inner join.
It will use bitfields to store NULLs, so it may need less than one byte. But even if it needed more, who cares, unless you are using 3.5" floppies to store your backend on ;-)
NULL in MySQL (Performance & Storage)
I'm designing DB tables for a log system. I have two ideas in mind for a field. Should I create three "bit(1)" columns or one "enum" column?
is_error bit(1)
is_test bit(1)
is_embedded bit(1)
or
boolErrors enum('is_error_true', 'is_error_false', 'is_test_true', 'is_test_false', 'is_embedded_true', 'is_embedded_false')
Obviously, an enum does not seem right in terms of semantics and space, but what about performance? Does the fetch time increase when I have 3 columns instead of 1?
If, as it seems, the flags represent states (that is, only one flag may be true at a given point in time), then I would recommend a single column with an integer datatype. Instead of using ENUM, you can use a reference table to store all possible flags and their names, and reference it from the original table using the integer column.
On the other hand, if several flags may be on (say, both is_error and is_test), then a single column is not sufficient. You can either create several columns (if the list of flags never changes), or use a bridge table to store each status on a separate row.
If only one of those flags can be set at a time, use ENUM.
If multiple flags can be set at the same time, use SET.
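To picture what "multiple flags at the same time" means, here is a small Python sketch that models the three flags from the question as bits in one integer, which is roughly how a SET column keeps its members. The class name LogFlags is made up for this illustration; it is not MySQL syntax.

```python
from enum import IntFlag


class LogFlags(IntFlag):
    # One bit per flag, in the order the members would be declared.
    IS_ERROR = 1
    IS_TEST = 2
    IS_EMBEDDED = 4


# A log row that is both an error and a test entry.
row_flags = LogFlags.IS_ERROR | LogFlags.IS_TEST

has_error = LogFlags.IS_ERROR in row_flags
has_embedded = LogFlags.IS_EMBEDDED in row_flags
stored_value = int(row_flags)  # the single integer that would be stored
```

With an ENUM-like design, by contrast, exactly one of the named values would be stored per row.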
Performance is not really something to worry about. The main "cost" in working with a row in a table is fetching the row, not the details of what goes on in the columns.
Sure, "smaller is better" for several reasons -- I/O, etc. But an ENUM is 1 or 2 bytes; a SET is up to 8 bytes (for up to 64 flags). Both of those are reasonably small for any use case.
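The sizes quoted above can be written down as two small helper functions. This is only a sketch based on the storage rules in the MySQL manual (ENUM: 1 or 2 bytes; SET: 1, 2, 3, 4, or 8 bytes); the function names are invented for this example.

```python
def enum_storage_bytes(value_count: int) -> int:
    """Bytes per ENUM value: 1 byte for up to 255 values, otherwise 2."""
    if not 1 <= value_count <= 65_535:
        raise ValueError("ENUM allows 1 to 65,535 values")
    return 1 if value_count <= 255 else 2


def set_storage_bytes(member_count: int) -> int:
    """Bytes per SET value: 1, 2, 3, 4, or 8, depending on member count."""
    if not 1 <= member_count <= 64:
        raise ValueError("SET allows 1 to 64 members")
    needed = (member_count + 7) // 8  # one bit per member, rounded up
    return needed if needed <= 4 else 8


print(enum_storage_bytes(6), set_storage_bytes(3))
```

For the three flags in the question, either type fits in a single byte.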
As for speed and indexability, let's see the main queries.
Right now I'm trying to learn the details of MySQL. The type BINARY needs as many storage bytes as provided via its parameter, so for example, if I define a column as BINARY(8) it consumes 8 bytes.
On the site https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html#data-types-storage-reqs-strings, there is a table mapping the types to their storage requirements. It says that I can define a BINARY(0), but in my opinion that does not make sense. BINARY(0) would mean that I can store 0 bytes, so nothing. Am I missing something? What use does it have? Or what is the reason for that?
On the other hand, I cannot define a bigger BINARY-column than one with 255 bytes. I always thought the reason for 255 is that you start counting at 0. But when you don't need a BINARY(0) you could define a BINARY(256) without problems...
I had to poke around on this one, because I didn't know myself. From this link, we can see that BINARY(0) can store two types of values:
NULL
empty string
So, you could use a BINARY(0) column in much the same way as you would use a non-nullable BIT(1) column, namely as a true/false or yes/no column. However, the storage requirement of a nullable BINARY(0) column is just one bit, the NULL flag, so it needs no additional storage beyond the space already set aside for the NULL flags of nullable columns.
Since the non NULL state of the BINARY(0) column would be empty string, which translates to zero, you could find all such records using:
SELECT *
FROM yourTable
WHERE bin_zero_column = 0;
The unmarked NULL records could be found using WHERE bin_zero_column IS NULL.
I have a table with the structure shown below.
[Image: structure of the roles table]
It is a "roles" table that contains records related to the roles of users.
I have added a column "is_archived" (INT), which I use to tell whether a role still exists or has been deleted.
So I am considering two values for that column:
"NULL" => if the role still exists (like TRUE),
"1" => if it is deleted/inactive (like FALSE)
Most records in my table will contain "NULL" in this column, and the default value is also "NULL".
Now I am unsure whether there is any performance issue in this case, since I am using "NULL" instead of "0".
I need to know the pros and cons of this approach (such as search performance, storage, indexing, etc.).
And if there are cons, what are the best alternatives?
My opinion is that NULL is for "out of band", not for kludging an in-band value. If there is any performance or space difference, it is insignificant.
For true/false, use TINYINT NOT NULL. It is only 1 byte. You could use ENUM('false', 'true'); it is also 1 byte.
INT, regardless of the number after it, takes 4 bytes. Don't use INT for something of such low cardinality.
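As a quick reference for the point about INT always taking 4 bytes, the sketch below lists the storage sizes of MySQL's integer types and picks the smallest unsigned type for a given maximum value. The helper name smallest_unsigned_type is hypothetical, chosen for this example.

```python
# (type name, storage bytes), smallest first.
INT_TYPES = [
    ("TINYINT", 1),
    ("SMALLINT", 2),
    ("MEDIUMINT", 3),
    ("INT", 4),
    ("BIGINT", 8),
]


def smallest_unsigned_type(max_value: int) -> tuple[str, int]:
    """Return the smallest UNSIGNED integer type that can hold max_value."""
    for name, size in INT_TYPES:
        if 0 <= max_value <= 2 ** (8 * size) - 1:
            return name, size
    raise ValueError("value does not fit in an unsigned BIGINT")


print(smallest_unsigned_type(1))  # a true/false flag
```

A display width such as the 1 in INT(1) does not enter into this calculation at all.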
Leave NULL to mean "not yet known" or any other situation where you can't yet say "true" or "false". (Since you probably always know if it is 'archived', NULL has no place here.)
You could even use ENUM('male', 'female', 'decline_to_state', 'transgender', 'gay', 'lesbian', 'identifies_as_male', 'North_Carolina_resident', 'other'). (Caveat: That is only a partial list; it may be better to set up a table and JOIN to it.)
I agree with @RickJames about NULL. Don't use NULL where you mean to use a real value like true. Likewise, don't use a real value like 0 or '' to signify absence of a value.
As for performance impact, you should know that to search for the presence/absence of NULL you would use the predicate is_archived IS [NOT] NULL.
If you use EXPLAIN on the query, you'll see that that predicate counts as a "range" access type, whereas searching for a single specific value, e.g. is_archived = 1 or is_archived = 0, is a "ref" access type.
That will have performance implications for some queries. For example if you have an index on (is_archived, created_on) and you try to do a query like:
SELECT ... FROM roles
WHERE is_archived IS NULL AND created_on = '2017-01-31'
Then the index will only be half-useful. The WHERE clause cannot search the second column in the index.
But if you use real values, then the query like:
SELECT ... FROM roles
WHERE is_archived = 0 AND created_on = '2017-01-31'
Will use both columns in the index.
Re your comment about NULL storage:
Yes, in the InnoDB storage engine, internally each row stores a bitfield with 1 bit per column, where the bits indicate whether each column is NULL or not. These bits are stored compactly, i.e. one byte contains up to 8 bits. Following the bitfield is the series of column values. A column that is NULL stores no value. So yes, technically it is true that using a NULL reduces storage.
However, I urge you to simplify your data management and use false when you mean false. Do not use NULL for one of your values. I suppose there's an exception if you manage data at a scale where saving one byte per row matters. For example, if you are managing tens of billions of rows.
But at a smaller scale than that, the potential space savings aren't worth the extra complexity you add to your project.
To put it in perspective, InnoDB only fills each data page 15/16 full anyway. So the overhead of the InnoDB page format is likely to be greater than the savings you could get from micro-optimizing boolean storage.
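A rough back-of-the-envelope comparison in Python makes the point concrete. It assumes the default 16 KiB InnoDB page size and a hypothetical average row size of 100 bytes, and it ignores page headers and other per-row overhead, so treat the numbers as an illustration rather than a measurement.

```python
PAGE_BYTES = 16 * 1024   # default InnoDB page size
ROW_BYTES = 100          # hypothetical average row size (assumption)
SAVED_PER_ROW = 1        # the one byte being micro-optimized

left_free = PAGE_BYTES // 16           # the 1/16 of each page left unfilled
filled = PAGE_BYTES - left_free        # the 15/16 that holds rows
rows_per_page = filled // ROW_BYTES
saved_per_page = rows_per_page * SAVED_PER_ROW

print(f"left free per page: {left_free} bytes")
print(f"saved per page:     {saved_per_page} bytes")
```

Under these assumptions, the space left free in each page is several times larger than what the one-byte saving recovers.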
I have a MySQL database table that has more than 100 columns. I have to add two more columns that hold text data if the user enters any, but they will rarely be used.
Now my question is: what happens if I make each of them a MEDIUMTEXT column and most users leave it empty? Will the column still take up the full space, or is space allocated only when a user enters something?
I don't have much knowledge of this, so any explanations are welcome. Also let me know if there is a better way to do this.
It's not bad practice to use large texts or blobs even if it's not going to be used frequently, however try to use the smallest data type that suits your needs.
TEXT requires L + 2 bytes
MEDIUMTEXT requires L + 3 bytes
LONGTEXT requires L + 4 bytes
(where L is the actual length of the stored value in bytes)
See: https://dev.mysql.com/doc/refman/5.7/en/storage-requirements.html
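The storage rule above can be sketched in a few lines of Python. The sketch assumes a non-NULL value in a utf8mb4 column, so that the byte length matches Python's UTF-8 encoding; the function name text_storage_bytes is invented for this example.

```python
# Length-prefix size in bytes for each TEXT variant.
LENGTH_PREFIX_BYTES = {"TINYTEXT": 1, "TEXT": 2, "MEDIUMTEXT": 3, "LONGTEXT": 4}


def text_storage_bytes(value: str, column_type: str = "MEDIUMTEXT") -> int:
    """Length prefix plus the encoded byte length of the value itself."""
    return len(value.encode("utf-8")) + LENGTH_PREFIX_BYTES[column_type]


print(text_storage_bytes(""))              # empty string in a MEDIUMTEXT column
print(text_storage_bytes("hello", "TEXT"))
```

The declared maximum size of the type does not appear in the calculation: an empty value costs only the length prefix.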
Additionally, if you allow them to be NULL (and assuming you are using the InnoDB engine with the COMPACT row format), a NULL will only use 1 bit per nullable column per row. So, if you have 2 nullable columns, they will share a single byte.
Space required for the NULL flags = CEILING(N/8) bytes, where N is the number of columns in the row that can be NULL.
More on: https://dev.mysql.com/doc/refman/5.7/en/innodb-physical-record.html
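The CEILING(N/8) rule is easy to check with a few values. The sketch below assumes N counts the columns declared nullable, as in the manual's description of the COMPACT row format; the function name is made up for this example.

```python
def null_bitmap_bytes(nullable_columns: int) -> int:
    """CEILING(N/8): one bit per nullable column, rounded up to whole bytes."""
    if nullable_columns < 0:
        raise ValueError("column count cannot be negative")
    return (nullable_columns + 7) // 8


for n in (0, 2, 8, 9):
    print(n, "nullable columns ->", null_bitmap_bytes(n), "byte(s)")
```

So adding two nullable columns to a table costs at most one extra byte per row for the flags, and nothing extra if spare bits are already available.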
On the other hand, having that many columns might not be ideal. Try restructuring your table into several tables.
I think you need to split that information into three tables. One contains general info about the entry, one contains the list of fields, and the third holds the relation between the first two.
[Product]
ID | name | model | price
[Fields]
ID | field_name | field_key | is_mandatory
[Field_to_product]
field_id | product_id | value
And in Field_to_product you hold only the values that the product actually has.
On update, delete all entries for that product from Field_to_product and rewrite its values.
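The three-table layout and the "delete, then rewrite" update can be tried out with a small script. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL, the column types are guesses, the sample rows are invented, and save_fields is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, model TEXT, price REAL);
CREATE TABLE fields (id INTEGER PRIMARY KEY, field_name TEXT, field_key TEXT,
                     is_mandatory INTEGER);
CREATE TABLE field_to_product (
    field_id   INTEGER REFERENCES fields(id),
    product_id INTEGER REFERENCES product(id),
    value      TEXT,
    PRIMARY KEY (field_id, product_id)
);
""")


def save_fields(conn, product_id, values):
    """Replace all optional values of one product: delete, then re-insert."""
    with conn:  # both steps run in one transaction
        conn.execute(
            "DELETE FROM field_to_product WHERE product_id = ?", (product_id,)
        )
        conn.executemany(
            "INSERT INTO field_to_product (field_id, product_id, value) "
            "VALUES (?, ?, ?)",
            [(field_id, product_id, v) for field_id, v in values.items()],
        )


# Invented sample data.
conn.execute("INSERT INTO product VALUES (1, 'Widget', 'W-1', 9.99)")
conn.execute("INSERT INTO fields VALUES (1, 'Notes', 'notes', 0)")
conn.execute("INSERT INTO fields VALUES (2, 'Remarks', 'remarks', 0)")

save_fields(conn, 1, {1: "first note", 2: "a remark"})
save_fields(conn, 1, {1: "updated note"})  # the remark is dropped on rewrite

rows = conn.execute(
    "SELECT field_id, value FROM field_to_product WHERE product_id = 1"
).fetchall()
print(rows)
```

Only products that actually have a value for a field get a row in field_to_product, which is the point of the design.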
If the data in the column is shorter than 65,535 bytes, you should consider using VARCHAR rather than a TEXT type variant.
BLOB, TEXT, VARCHAR, and VARBINARY are variable-length types.
They store only a short length prefix (1-2 bytes for VARCHAR) and take up space for the data part dynamically: space is allocated as data fills the column. The N in a declaration such as VARCHAR(N) is an upper limit used for validation.
BLOB and BINARY differ from TEXT and VARCHAR in how the database engine sorts index entries and which matching algorithm it uses to compare data.
Text-based types are stored and compared using the collation of their character set. How the database engine physically stores the characters is defined by the character set. Some character sets, such as those for Japanese or Chinese, require double-byte storage, while Latin characters use a single byte, and so on.
BLOB and BINARY data, by contrast, is saved as is, with no reference to any character set.
Aside from the data type, you should consider normalizing the table.
https://en.wikipedia.org/wiki/Database_normalization
A table with 100 columns will drag performance down as the number of rows grows.
It makes searching, inserting, and updating the table take more time as it grows.
You can use the following statement from the SQL query console to show your table's status:
SHOW TABLE STATUS FROM your_database_name LIKE 'your_table_name';
The actual size of a table is determined not only by the data types but also by its indexes and keys. An index can be defined on a set of columns, so multiple indexes can be created on a single table.
The space requirement will also grow considerably if a full-text index is enabled on a TEXT column.
I have a column deleted in my table. In every SQL statement I check whether this flag IS NULL. When someone deletes entries, the flag is set to the current timestamp.
When entries are restored, this timestamp is used to restore them. This is the only use case in which the value of this column is used.
In all other cases, it's only important to know whether it IS NULL or it IS NOT NULL.
In the future the table can and will contain millions of rows.
Is it useful to create an index on this column? Because 99% of the statements & use cases don't care about the value.
Does MySQL optimize IS NULL conditions and therefore an index is not needed?
An index on 'deleted' will also index NULL values, and thus allow for faster lookups of NULL/non-NULL rows.
I think this will be sufficient in this case and not cause too much overhead, since the timestamp is set on deletion and therefore won't change all that much. (The opposite case, an edit timestamp that changes all the time and is only sometimes set to NULL, would mean adjusting the index every time a record is changed. That might not be optimal. That is not the case here.)
(Also, but I don't know if the indexer is smart enough to take advantage of that, the expected changes always go to the ends of the index, either at the null-end or at the 'most recent' end.)
Of course, profile (both query execution times and storage space if important) to find out if there are actual problems arising from this.
Can't you create an "archive" table and store deleted rows there with their timestamp?
If the user wants to restore a row, you just have to transfer it from the archive to your main table.
And you don't have to check "flag IS NOT NULL" in every query.
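A minimal sketch of the archive-table idea follows. It uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL; the table names entries and entries_archive, the columns, and the two helper functions are all invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not MySQL
conn.executescript("""
CREATE TABLE entries (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE entries_archive (id INTEGER PRIMARY KEY, title TEXT, deleted_at TEXT);
""")


def archive_entry(conn, entry_id):
    """Move one row to the archive, recording when it was deleted."""
    with conn:  # copy and delete in one transaction
        conn.execute(
            "INSERT INTO entries_archive (id, title, deleted_at) "
            "SELECT id, title, CURRENT_TIMESTAMP FROM entries WHERE id = ?",
            (entry_id,),
        )
        conn.execute("DELETE FROM entries WHERE id = ?", (entry_id,))


def restore_entry(conn, entry_id):
    """Move one row back from the archive to the main table."""
    with conn:
        conn.execute(
            "INSERT INTO entries (id, title) "
            "SELECT id, title FROM entries_archive WHERE id = ?",
            (entry_id,),
        )
        conn.execute("DELETE FROM entries_archive WHERE id = ?", (entry_id,))


conn.execute("INSERT INTO entries VALUES (1, 'first'), (2, 'second')")

archive_entry(conn, 1)
active_after_delete = [r[0] for r in conn.execute("SELECT id FROM entries ORDER BY id")]

restore_entry(conn, 1)
active_after_restore = [r[0] for r in conn.execute("SELECT id FROM entries ORDER BY id")]

print(active_after_delete, active_after_restore)
```

Ordinary queries then read only from the main table, with no NULL check on a deleted column.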
According to the book High Performance MySQL, 2nd Edition, allowing NULL in column definitions is not recommended in MySQL. MySQL uses an additional byte to store the status of the cell (NULL or not NULL), and the index will be bigger than without "allow NULL". A better solution is to make the column a TINYINT and store 1 for active rows and 0 for deleted rows. So the recommendation is never to allow NULL in a column definition.