Performance effects of using NULL-able fields in MySQL - mysql

Sometimes an absent value can be represented (with no loss of function) without resort to a NULL-able column, e.g.:
Zero integer in a column that references the AUTO_INCREMENT row ID of another table
Invalid date value (0000-00-00)
Zero timestamp value
Empty string
On the other hand, according to Ted Codd's relational model, NULL is the marker of an absent datum. I always feel better doing something "the correct way" and MySQL supports it and the associated 3-value logic, so why not?
A few years ago I was working on a performance problem and found that I could resolve it simply by adding NOT NULL to a column definition. The column was indexed, but I don't remember other details. I have avoided NULL-able columns when there is an alternative ever since.
But it has always bothered me that I don't properly understand the performance effects of allowing NULL in a MySQL table. Can anyone help out?

Declaring a column NOT NULL saves 1 bit per column in each row. http://dev.mysql.com/doc/refman/5.0/en/data-size.html
That doesn't seem like much, but over millions of rows it starts to make a difference.
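To put a rough number on that, here is a back-of-the-envelope sketch. It assumes one flag bit per NULL-able column, rounded up to a whole byte per row (which is how the MySQL manual describes MyISAM rows; other storage engines may differ). The function names are made up for illustration.

```python
import math

def null_flag_bytes_per_row(nullable_columns: int) -> int:
    # Assumption: one bit per NULL-able column, rounded up to a whole byte.
    return math.ceil(nullable_columns / 8)

def null_flag_bytes_total(nullable_columns: int, rows: int) -> int:
    # Total overhead across the table under the same assumption.
    return null_flag_bytes_per_row(nullable_columns) * rows

# Three NULL-able columns over ten million rows: one extra byte per row.
overhead = null_flag_bytes_total(3, 10_000_000)
print(overhead)  # 10000000 bytes, roughly 10 MB
```

So the storage saving is real but small; it is unlikely to be the whole story behind a large speedup.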

Related

Theoretical situation about MySQL

I have been searching Google for a question I've been asking myself since this morning, but couldn't find any information or article about it.
I was wondering whether the following change would improve performance, even by a small percentage:
Context: I have two columns: ID and AddedAt (AddedAt is the Unix timestamp of when the row was created).
Theoretically, if you insert a new row, ID will be +1 and AddedAt will be the current time.
Now, let's say it is impossible in the current situation to have two simultaneous inserts. Would it be better to use AddedAt as the PK and remove the ID column? AddedAt would then be the single column serving as both PK and Unix timestamp, so in the end I would have one column instead of two.
The only downside I see is the size of the key created on AddedAt, since a Unix timestamp is 10 digits nowadays.
Would it be better in this situation? What's your opinion?
EDIT: What about using timestamp + ms?
Timestamps are in seconds. While you might not have simultaneous inserts, as the world tends to speed up you might get multiple inserts in a second. Build your system to function soundly: don't use timestamps as primary keys.
Also, with statement-based replication, timestamps sometimes aren't consistent across databases. Row-based replication alleviates this, but it's still another reason for concern when using them.
From a good-convention standpoint, primary keys should have some clear meaning to people other than yourself if they are anything other than a plain old auto-incrementing ID field. Generally, people expect numbers or char values for keys, not things like blobs, timestamps, datetimes, etc. This is especially true if the key is later used as a foreign key in another table; using a timestamp as a foreign key can be confusing to later developers. Sure, if you have a varchar GUID field you know is unique, use it as the key. Just remember that when it is used as a foreign key, you're also going to eat up quite a bit of memory if you have a huge string.
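A minimal sketch of what happens when two inserts do land in the same second. It uses Python's built-in sqlite3 as a stand-in for MySQL (an assumption made only so the snippet runs anywhere); the events table and the timestamp value are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table that uses the Unix timestamp (whole seconds) as its key.
conn.execute("CREATE TABLE events (added_at INTEGER PRIMARY KEY, payload TEXT)")

same_second = 1_700_000_000  # arbitrary example timestamp
conn.execute("INSERT INTO events VALUES (?, ?)", (same_second, "first"))
try:
    # A second insert within the same second reuses the key value.
    conn.execute("INSERT INTO events VALUES (?, ?)", (same_second, "second"))
    collided = False
except sqlite3.IntegrityError:
    collided = True

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(collided, row_count)  # True 1
```

The second row is simply lost unless the application catches the error and retries with a different key.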
Assuming you can guarantee that two events won't occur within the same 1-second interval, then sure, you could use the timestamp field as a PK.
That being said, why are you worried about key sizes? A timestamp may be 10 digits, but its internal storage requirement is only 4 bytes. By comparison, an int is also 4 bytes, so you wouldn't be losing anything, unless you're using bigints, in which case it's 8 bytes.
Also, note that timestamp fields are subject to the Year 2038 problem. They're essentially Unix timestamps that auto-format into a human-readable date for you. If your app is going to be around for more than 26 years, then you should stick with an int/bigint, whose wraparound point depends on "however fast you insert rows", not on a fixed date/time.
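The 4-byte size and the 2038 limit are two sides of the same arithmetic. A small sketch, assuming the timestamp is stored as a signed 32-bit count of seconds since 1970-01-01 UTC:

```python
from datetime import datetime, timezone

# Largest value that fits in a signed 32-bit integer.
MAX_SIGNED_32 = 2**31 - 1

# 31 value bits plus 1 sign bit, rounded up to whole bytes.
storage_bytes = (MAX_SIGNED_32.bit_length() + 1 + 7) // 8

# The last second such a counter can represent.
last_moment = datetime.fromtimestamp(MAX_SIGNED_32, tz=timezone.utc)

print(storage_bytes)            # 4
print(last_moment.isoformat())  # 2038-01-19T03:14:07+00:00
```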
The primary key is not only a technical thing, it is the business representation of something that makes each object represented by a row unique.
A timestamp is a unique field of your object because you cannot (in your case) insert two objects at the same time, but it is NOT the primary definition of a business object (if you had a business object called "timestamp" then yes, the time when it was inserted should be the primary key)
An ID stands for "my client has a physical id that represents him": in the past, we would give numbers to clients on papers, bills...
Never forget that computer science is not the objective per se but the means to achieve your goals.
I would leave the ID column as the primary key, as there may be scenarios in which the Unix timestamp gives you a value you're not expecting. One is that inserting rows in very quick succession returns the same timestamp; another is that the server admin decides to monkey with the server's time settings.
Joins will probably also be much more obvious, as people typically expect the primary key to be some sort of unique ID, not a timestamp.
Yes, of course, but the performance gain will be minimal, and only when adding new records.
Moreover, you will be forced to use the timestamp for foreign keys in all related objects.
It is worth considering only if you expect many inserts per second and a lot of records (to save storage on the ID column and its index), but as you said the timestamp will be unique, so that's at most 1 record per second :-)

Performance optimization: Null allowed/not allowed vs Performance, if not a key [duplicate]

Possible Duplicate:
NULL in MySQL (Performance & Storage)
Do you HIGHLY recommend to uncheck NULL allowed for a column in a table, if:
that column is not a key,
that column could be SELECTed or INSERTed into the table or UPDATEd, but never used in WHERE, join and other parts of any query (WHERE, JOIN etc. could be counted when we decide what column to index for better performance).
So, if you leave the checkbox "NULL" checked (i.e. NULL is allowed) for the column described above, will it affect the performance?
I hope the answer (at least for MySQL) is:
no, it will not affect even for 1%, or
no, it will not affect at all.
Yes, I do recommend you make any column which does not need to be NULL, NOT NULL.
In my view, it is one of the "bugs" in SQL, that NULL is enabled by default.
The reason has nothing to do with performance; it is about application correctness. If you have a column value which is mandatory, it makes more sense to make it NOT NULL (and indeed, a NULL value would be an error). Do use a DEFAULT value if it makes sense.
In short: No column should ever be nullable unless there is a valid reason within your database structure for it to be null.
Note also that NULLs do something "special" in unique indexes - several rows with a NULL are permitted.
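That "special" behaviour in unique indexes can be sketched as follows. The snippet uses Python's built-in sqlite3 as a runnable stand-in (an assumption; SQLite happens to treat NULLs in a UNIQUE column the same way as described above), and the users table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a NULL-able column under a UNIQUE constraint.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

# Several rows with NULL are accepted: NULLs are not considered equal to each other.
conn.execute("INSERT INTO users (email) VALUES (NULL)")
conn.execute("INSERT INTO users (email) VALUES (NULL)")

# A repeated non-NULL value is still rejected.
conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
try:
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

null_rows = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email IS NULL"
).fetchone()[0]
print(null_rows, duplicate_rejected)  # 2 True
```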

mysql: 'WHERE something!=true' excludes fields with NULL

I have two tables: one in which I have groups, and another where I set user restrictions on which groups are seen.
When I do a LEFT JOIN and specify no condition, it shows me all records. When I add WHERE group_hide.hide!='true', it only shows the records that have the enum value false set. With the JOIN, the other groups get the hide field set to NULL.
How can I make it exclude only those that are set to true, and show everything else that has either NULL or false?
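In MySQL you must use IS NULL or IS NOT NULL when dealing with nullable values. An ordinary comparison such as != evaluates to NULL (unknown) when either operand is NULL, and WHERE only keeps rows for which the condition is true.
Here you should use (group_hide.hide IS NULL OR group_hide.hide != 'true')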
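A runnable sketch of the difference between the two WHERE clauses. It uses Python's built-in sqlite3 as a stand-in for MySQL (an assumption; the NULL comparison rules shown are standard SQL). The user_groups table and the sample rows are made up, and hide is stored as plain text rather than an enum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_groups (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE group_hide (group_id INTEGER, hide TEXT);
INSERT INTO user_groups VALUES (1, 'visible'), (2, 'hidden'), (3, 'no rule');
INSERT INTO group_hide VALUES (1, 'false'), (2, 'true');
""")

base = (
    "SELECT g.name FROM user_groups g "
    "LEFT JOIN group_hide h ON h.group_id = g.id "
)

# Group 3 has no group_hide row, so h.hide is NULL and NULL != 'true' is unknown.
naive = [r[0] for r in conn.execute(base + "WHERE h.hide != 'true' ORDER BY g.id")]

# Testing for NULL explicitly keeps the unmatched group.
fixed = [
    r[0]
    for r in conn.execute(
        base + "WHERE (h.hide IS NULL OR h.hide != 'true') ORDER BY g.id"
    )
]

print(naive)  # ['visible']
print(fixed)  # ['visible', 'no rule']
```

In MySQL itself the same condition can also be written with the NULL-safe equality operator, as NOT (group_hide.hide <=> 'true').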
Don already provided a good answer to the question you asked, and it will solve your immediate problem.
However, let me address the point of the wrong data type domain. Normally you would make hide a BOOLEAN, but MySQL does not really implement it completely: it converts it to TINYINT(1), which allows values from -128 to 127 (see the overview of data types for MySQL). Since MySQL does not enforce CHECK constraints (versions before 8.0.16 parse and ignore them), you are left with using either a trigger or a foreign key reference to properly enforce the domain.
Here are the problems with wrong data domain (your case), in order of importance:
The disadvantage of allowing NULL for a field that can be only 1 or 0 is that you have to employ 3-value logic (true, false, null), which btw is not perfectly implemented in SQL. This makes certain queries more complex and slower than they need to be. If you can make a column NOT NULL, do.
The disadvantages of using VARCHAR for a field that can be only 1 or 0 are the speed of the query, due to the extra I/O, and the bigger storage needs (this slows down reads and writes, makes indexes bigger if the field is part of an index, and increases the size of backups; keep in mind that none of these effects might be noticeable with the wrong domain on a single field in a smaller table, but if data types are consistently set too big or if the table has a serious number of records, the effects will bite). Also, you will always need to convert the VARCHAR to a 1 or 0 to use natural MySQL boolean operators, increasing the complexity of queries.
The disadvantage of mysql using TINYINT(1) for BOOL is that certain values are allowed by RDBMS that should not be allowed, theoretically allowing for meaningless values to be stored in the system. In this case your application layer must guarantee the data integrity and it is always better if RDBMS guarantees integrity as it would protect you from certain bugs in application layer and also mistakes that might be done by database administrator.
An obvious answer would be:
WHERE (group_hide.hide IS NULL OR group_hide.hide = 'false')
I'm not sure off the top of my head what the null behaviour rules are.

NULL in table fields better than empty fields?

I wonder if one should have NULL in the fields with no value, or should they be empty?
What is best?
thanks
NULL means that no data is set, while empty string could be some valid data.
Thus, using NULL helps you to differentiate these two cases.
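A tiny sketch of how the database itself tells the two cases apart, run through Python's built-in sqlite3 as a stand-in (an assumption; the expressions are plain standard SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An empty string is a real value; NULL is the absence of one.
empty_is_null, empty_equals_empty, null_equals_null = conn.execute(
    "SELECT '' IS NULL, '' = '', NULL = NULL"
).fetchone()

print(empty_is_null)       # 0     -> '' is not NULL
print(empty_equals_empty)  # 1     -> '' compares equal to ''
print(null_equals_null)    # None  -> NULL = NULL is unknown, not true
```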
From a programming standpoint, I try not to allow null values, for a few reasons. One of them is that code often reacts badly to unexpected NULL values. If a query filter ran faster checking null values I might consider using them, but I have seen no evidence of this. I have, however, experienced many a function that pooped out when doing some kind of comparison without testing for NULL beforehand.
There is a certain argument that you should never allow NULL in your data: if you are using it to indicate that you don't know what the value should be, or that you just don't have that data yet, then use an explicit value in the field to indicate those states. Similarly for 'empty' fields. That said, I think everyone does it or has done it and may do it again. NULL has odd comparison properties, which is why it's always best, if you can, to avoid it and have explicit values for missing-data states.
Avoid NULLs in base tables whenever three valued logic is likely to come back to bite you. That's easy to say, but lengthy to explain. Three valued logic can sometimes be successfully managed, but your intuition is likely to be based on two valued logic, and can be misleading.
If you avoid NULLS in base tables, but create views or queries with outer joins, be prepared to deal with NULLS. NULLS in fields that are never used in where clauses and never used "incorrectly" with aggregates (as in sum(FIELD)) are OK.
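To make the aggregate caveat concrete, here is a short sketch using Python's built-in sqlite3 as a stand-in (an assumption; the aggregate rules shown are standard SQL). The payments table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (amount INTEGER)")
conn.executemany("INSERT INTO payments VALUES (?)", [(10,), (None,), (20,)])

# Aggregates skip NULLs, so COUNT(*) and COUNT(amount) disagree,
# and AVG divides by 2 rather than 3.
count_all, count_amount, total, average = conn.execute(
    "SELECT COUNT(*), COUNT(amount), SUM(amount), AVG(amount) FROM payments"
).fetchone()

# SUM over no rows yields NULL, not 0.
empty_sum = conn.execute(
    "SELECT SUM(amount) FROM payments WHERE amount > 100"
).fetchone()[0]

print(count_all, count_amount, total, average)  # 3 2 30 15.0
print(empty_sum)  # None
```

Whether 15.0 or 10.0 is the "right" average depends on what the NULL was meant to represent, which is exactly where intuition tends to go wrong.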
NULL fields are always empty, but empty doesn't always imply NULL. In particular, an empty or non existent field in a form can translate into a non NULL value in a table. Autonumber fields are an example.
Oracle made a mistake way back in the 1980s by using the same representation for the VARCHAR string of length zero (the empty string), and NULL. They've been about to fix it "real soon now" for a quarter of a century. Don't hold your breath.
Don't use NULLs to convey a meaningful message. This almost always confuses your colleagues, even when they deny it.
Nulls are necessary and important tools in database design. If you don't know the value at the time the record is inserted, null is entirely appropriate and the best practice. Turning an unknown into a known value such as the empty string is silly. This is especially true when you get away from string data into dates or numeric data. 0 is not the same as null; some arbitrary date far in the past or future is not the same as null. For that matter, an empty string means there is no value, while null means we don't know what the value is. This is an important distinction.
It's not hard to handle nulls, any competent programmer should be able to do so.

Column Nullability/Optionality: NULL vs NOT NULL

Is there a reason for or against setting some fields as NULL or NOT NULL in a mysql table, apart from primary/foreign key fields?
That completely depends on your domain to be honest. Functionally it makes little difference to the database engine, but if you're looking to have a well defined domain it is often best to have both the database and application layer mirror the requirements you are placing on the user.
If it's moot to you whether or not the user enters their "Display Name", then by all means mark the column as nullable. On the other hand, if you are going to require a "Display Name" you should mark it non null in the database as well as enforcing the constraint in the application. By doubling the constraint, you ensure that should your front-end change, the domain is still fully qualified.
MySQL has a NOT NULL condition on a field, but this will not stop you from inserting "empty" data. There is no way to flag a field as "required".
As Pekka mentioned, you should be doing some sort of validation to prevent this at a higher level in your application.
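A quick sketch of that gap: NOT NULL rejects NULL but happily stores an empty string. It uses Python's built-in sqlite3 as a stand-in for MySQL (an assumption), with a hypothetical profiles table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profiles (display_name TEXT NOT NULL)")

# An empty string satisfies NOT NULL, so "empty" data gets through.
conn.execute("INSERT INTO profiles VALUES ('')")

try:
    # An explicit NULL is what the constraint actually blocks.
    conn.execute("INSERT INTO profiles VALUES (NULL)")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

empty_rows = conn.execute(
    "SELECT COUNT(*) FROM profiles WHERE display_name = ''"
).fetchone()[0]
print(null_rejected, empty_rows)  # True 1
```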
It's not a MySQL specific thing - every database that I'm aware of allows for defining columns with a constraint that either allows a NULL value in the column, or does not allow this to happen.
Defining a column as NOT NULL means there always has to be a value present that matches the data type. NULL is a sentinel value, and its data type transcends whatever is defined for the column.
If the column is a foreign key, the value also has to already exist in the related table before you insert the value into the current table. DEFAULT constraints are common, but not necessary, on columns defined as NOT NULL so that the columns will be populated with an appropriate value if NULL was attempted to be inserted into these columns. Getting back to foreign keys, a foreign key column can be nullable, which means the relationship is optional - the business rules allow for there to be no relationship.
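The optional-relationship point can be sketched like this, again with Python's built-in sqlite3 as a stand-in (an assumption; SQLite needs foreign key enforcement switched on explicitly). The customers and orders tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enforcement is off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customers(id))"  # NULL-able foreign key
)
conn.execute("INSERT INTO customers VALUES (1)")

conn.execute("INSERT INTO orders (customer_id) VALUES (1)")     # parent exists
conn.execute("INSERT INTO orders (customer_id) VALUES (NULL)")  # no relationship: allowed

try:
    conn.execute("INSERT INTO orders (customer_id) VALUES (99)")  # no such parent
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orphan_rejected, order_count)  # True 2
```

Declaring customer_id NOT NULL instead would turn the relationship from optional into mandatory.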
When Should NULL & NOT NULL be Used?
Ideally, every column should be NOT NULL but it really depends on what the business rules require.
I don't know how you would define a required field in MySQL. Care to enlighten me? I really don't know.
Anyway, even if this can be done, I can hardly think of a scenario where it would make sense. IMO, you would have to validate faulty (=incorrectly empty) data much earlier. Validation, sanitation and cutting should be done long before anything enters the database. The only time a database error should occur is when something exceptional occurs, e.g. when the database is physically not reachable.