When do nullable columns affect performance severely?

From what I understand, one should avoid nullable columns in databases whenever possible.
But, in what specific situations do nullable columns actually cause a significant performance drop?
In other words, when does NULL really hurt performance? (As opposed to when its effect is negligible and doesn't matter at all.)
I'm asking so I can know when and how it actually makes a difference.

Don't know where you heard it, but it's not true.
Nullable columns are there to represent data accurately: if a value is unknown, or not yet entered, NULL is the natural value to store. NULL values are no more onerous to store or retrieve than values of any other type. Most database servers record them in a single bit, so retrieving a NULL typically takes less I/O and processor effort than assembling a varchar, BLOB, or text value from fragments, which may require walking a linked list or reading extra disk blocks off the hard drive.
There are a couple instances marginally related to nullable columns that may affect performance:
If you create an index on a nullable column and the actual values in the column are sparse (i.e. many rows hold NULL, or only a very few distinct values are present, as with, say, a controlled-vocabulary value), the B-tree data structure used to index the column becomes much less efficient. Index traversals become more expensive operations when half the values in an index are identical: you end up with an unbalanced tree.
Inapt use of NULL values, or inappropriate query techniques that don't use NULL values as they were designed, often results in poor performance, because programmers often fall back on the bad habit of searching or joining on computed column values, which ignores the fantastic set-processing ability of modern database servers. I've consulted at lots of places where the development staff has made a habit of writing clauses like:
WHERE IFNULL(myColumn, '') = ''
which means that the DB server cannot use an index directly, and must perform a computation on every single row of that section of the execution tree to evaluate the query. That is not because there is any intrinsic inefficiency in storing, comparing, or evaluating NULL values, but because the query thwarts the strengths of the database engine to achieve a particular result.
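For illustration, a minimal sketch of the sargable alternative (the table name orders is hypothetical; the point is the shape of the WHERE clause):

-- Non-sargable: the function runs for every row, so an index on myColumn is ignored
SELECT * FROM orders WHERE IFNULL(myColumn, '') = '';

-- Sargable: both predicates can be answered from an index on myColumn
SELECT * FROM orders WHERE myColumn IS NULL OR myColumn = '';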

Related

Performance (not space) differences between storing dates in DATE vs. VARCHAR fields

We have a MySQL-based system that stores date values in VARCHAR(10) fields, as SQL-format strings (e.g. '2021-09-07').
I am aware of the large additional space requirements of using this format over the native DATE format, but I have been unable to find any information about performance characteristics and how the two formats would differ in terms of speed. I would assume that working with strings would be slower for non-indexed fields. However, if the field is indexed I could imagine that the indexes on string fields could potentially yield much larger improvements than those on date fields, and possibly even overtake them in terms of performance, for certain tasks.
Can anyone advise on the speed implications of choosing one format over the other, in relation to the following situations (assuming the field is indexed):
Use of the field in a JOIN condition
Use of the field in a WHERE clause
Use of the field in an ORDER BY clause
Updating the index on INSERT/UPDATE
I would like to migrate the fields to the more space-efficient format but want to get some information about any potential performance implications (good or bad) that may apply. I plan to do some profiling of my own if I go down this route, but don't want to waste my time if there is a clear, known advantage of one format over the other.
Note that I am also interested in the same question for VARCHAR(19) vs. DATETIME, particularly if it yields a different answer.
Additional space is a performance issue. Databases work by reading data pages (and index pages) from disk. Bigger records require more data pages to store. And that has an effect on performance.
In other words, your VARCHAR(10) date column is 11 bytes (including the length byte) versus 3 bytes for a native DATE. If you had a table with only ten such columns, that would be 110 bytes versus 30, requiring more than three times as much I/O to scan the data.
As for your operations, if you have indexes that are being used, then the indexes will also be larger. Because of the way that MySQL handles collations for columns, comparing string values is (generally) going to be less efficient than comparing binary values.
Certain operations such as extracting the date components are probably comparable (a string function versus a date function). However, other operations such as extracting the day of the week or the month name probably require converting to a date and then to a string, which is more expensive.
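As a hedged sketch of that difference, assuming a table events with a VARCHAR(10) column vdate and a DATE column d (all names hypothetical):

-- Native DATE column: the date function applies directly
SELECT DAYNAME(d) FROM events;

-- VARCHAR column: the string must be parsed into a date on every row first
SELECT DAYNAME(STR_TO_DATE(vdate, '%Y-%m-%d')) FROM events;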
More bytes being compared --> slightly longer to do the comparison.
More bytes in the column --> more space used --> slightly slower operations because of more bulk on disk, in cache, etc.
This applies to either 1+10 or 1+19 bytes for VARCHAR versus 3 for DATE or 5 for DATETIME. The "1" is for the 'length' of VARCHAR; if the strings are a consistent length, you could switch to CHAR.
A BTree is essentially the same whether the value is an Int, Float, Decimal, Enum, Date, or Char. VARCHAR is slightly different in that it has a 'length'; I don't see this as a big issue in the grand scheme of things.
The number of rows that need to be fetched is the most important factor in the speed of a query. Other things, such as datatype size, come further down the list of what slows things down.
There are lots of DATETIME functions that you won't be able to use. This may lead to a table scan instead of using an INDEX, which would have a big impact on performance. Let's see some specific SQL queries.
Using CHARACTER SET ascii COLLATE ascii_bin for your date/datetime columns would make comparisons slightly faster.
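If you do go down the migration route the question contemplates, a minimal sketch, assuming the column is called vdate on a table events and holds 'YYYY-MM-DD' strings (names hypothetical):

-- Check for malformed values first; STR_TO_DATE returns NULL when parsing fails
SELECT vdate FROM events
WHERE vdate IS NOT NULL AND STR_TO_DATE(vdate, '%Y-%m-%d') IS NULL;

-- Then convert in place; MySQL casts well-formed 'YYYY-MM-DD' strings to DATE
ALTER TABLE events MODIFY vdate DATE;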

MySQL multiple rows vs storing values all in one string

I was just wondering about the efficiency of storing a large number of boolean values inside a CHAR or VARCHAR
data
"TFTFTTF"
vs
isFoo  isBar  isText
false  true   false
Would the worse performance be worth it for storing the values in this manner? I figured it would be easier to set a single value rather than maintaining all of those separate fields
thanks
Don't do it. MySQL offers types such as char(1) and tinyint that occupy the same space as a single character. In addition, MySQL offers enumerated types, if you want your flags to have more than one value -- and for the values to be recognizable.
That last point is the critical point. You want your code to make sense. The string 'FTF' does not make sense. The columns isFoo, isBar, and isText do make sense.
There is no need to obfuscate your data model.
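A minimal sketch of the column-per-flag layout (the table name flags_demo is made up; the flag names come from the question):

CREATE TABLE flags_demo (
  id     INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  isFoo  BOOLEAN NOT NULL DEFAULT FALSE,  -- BOOLEAN is shorthand for TINYINT(1) in MySQL
  isBar  BOOLEAN NOT NULL DEFAULT FALSE,
  isText BOOLEAN NOT NULL DEFAULT FALSE
);

INSERT INTO flags_demo (isFoo, isBar, isText) VALUES (FALSE, TRUE, FALSE);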
This would be a bad idea: not only does it have no advantage in terms of the space used, it also has a bad influence on query performance and on the comprehensibility of your data model.
Disk Space
In terms of storage usage, it makes no real difference whether the data is stored in a single varchar(n) or char(n) column or in multiple tinyint, char(1), or bit(1) columns. Only with varchar would you need 1 to 2 extra bytes of disk space per entry for the length prefix.
For more information about the storage requirements of the different data types, see the MySQL documentation.
Query Performance
If the boolean values were stored in a VARCHAR, searching for all entries where a specific value is true would take much longer, since string operations would be needed to find the matching entries. Even when searching for a full combination of boolean values such as "TFTFTFTFTT", the query would still take longer than if the values were stored in individual columns. Furthermore, you can put indexes on single columns like isFoo or isBar, which has a strongly positive effect on query performance.
Data Model
A data model should be as comprehensible as possible and if possible independent of any kind of implementation considerations.
Realistically, a database field should only contain one atomic value, that is to say: a value that can't be subdivided into separate parts.
Columns that do not contain atomic values:
cannot be sorted on the component values
cannot be grouped by those values
cannot be usefully indexed on them
So let's say you want to find all rows where isFoo is true: you wouldn't be able to do it unless you resorted to string operations like "take the first character of the string and check whether it equals 'T'". That implies a full table scan on every query, which degrades performance quite dramatically.
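As a sketch, with hypothetical tables t_packed (one CHAR column flags) and t_split (separate boolean columns):

-- Packed string: a per-row string operation; no plain index on flags can help
SELECT * FROM t_packed WHERE SUBSTRING(flags, 1, 1) = 'T';

-- Separate column: an ordinary, indexable comparison
SELECT * FROM t_split WHERE isFoo = TRUE;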
It depends on what you want to do with the data after storing it in this format.
After retrieving a record you would have to do further processing on the server side, which worsens performance if you want to load the data filtered by specific conditions; the logic on the server would become complex.
The columns isFoo, isBar, and isText would help you to write queries better.

How do databases handle redundant values?

Suppose I have a database with several columns. In each column there are lots of values that are often similar.
For example I can have a column with the name "Description" and a value could be "This is the description for the measurement". This description can occur up to 1000000 times in this column.
My question is not how I could optimize the design of this database but how a database handles such redundant values. Are these redundant values stored as effectively as with a perfect design (with respect to the total size of the database)? If so, how are the values compressed?
The only correct answer would be: it depends on the database and the configuration, because there is no silver bullet for this one. Some databases store each distinct value of a column only once (some column stores or the like), but technically there is no necessity either way.
In some databases you can let the DBMS propose optimizations and in such a case it could possibly propose an ENUM field that holds only existing values, which would reduce the string to an id that references the string. This "optimization" comes at a price, for example, when you want to add a new value in the field description you have to adapt the ENUM field.
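A sketch of the manual version of that idea, normalizing the repeated string into a lookup table instead of an ENUM (all names hypothetical):

CREATE TABLE descriptions (
  id          INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  description VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE measurements (
  id             INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  description_id INT UNSIGNED NOT NULL,
  FOREIGN KEY (description_id) REFERENCES descriptions(id)
);
-- Each distinct description is stored once; every measurement row carries only a 4-byte id.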
Depending on the actual use case, those optimizations may be worth nothing or may even be a show stopper, for example when the data changes very often (inserts or updates): the DBMS would spend more time managing uniqueness/duplicates than actually processing queries.
On the question of compression: that also depends on the configuration and the database system, and on the field type. Text data can be compressed, and for non-indexed text fields there should be almost no drawback to using a simple compression algorithm; which algorithm depends on the DBMS and configuration, I suspect.
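As one concrete, hedged example: InnoDB in MySQL offers a compressed row format, though whether it helps depends on the data and the server settings (the table name notes is made up):

CREATE TABLE notes (
  id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;  -- pages compressed to 8 KB where possible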
Unless you become more specific, there is no more specific answer, I believe.

sql query LIKE % on Index

I am using a MySQL database.
My website is divided into different elements (PRJ_12 for project 12, TSK_14 for task 14, DOC_18 for document 18, etc.). We currently store the references to these elements in our database as VARCHAR. The reference columns are indexed so SELECTs are faster.
We are thinking of splitting these columns into 2 columns (one column "element_type" holding PRJ and one "element_id" holding 12). We are considering this because we run a lot of requests containing LIKE ...% (for example, retrieving all tasks of one user, no matter the id of the task).
However, splitting these columns in two will increase the number of indexed columns.
So, I have two questions:
Is a LIKE ...% request on an indexed column really slower than a simple WHERE query (without LIKE)? I know that if the column is not indexed, WHERE ... LIKE % requests are not advisable, but I don't really know how indexes work.
The fact that we split the reference column in two will double the number of indexed columns. Is that a problem?
Thanks,
1) A LIKE is always more costly than a full comparison (with =); however, it all comes down to the field data types and the number of records (unless we're talking about a huge table, you shouldn't have issues).
2) Multicolumn indexes are not a problem; yes, they make the index bigger, but so what? Data types and the amount of total rows matter, but that's what indexes are for.
So go for it
There are a number of factors involved, but in general, adding one more index to a table that has only one index already is unlikely to be a big problem. Some things to consider:
If the table is mostly read-only, then it is almost certainly not a problem. If updates are rare, the indexes won't need to be modified often, meaning there will be very little extra cost (aside from the additional disk space).
If updates to existing records do not change either of those key values, then no index modification should be needed and so again there would be no additional runtime cost.
DELETEs and INSERTs will need to update both indexes. So if those make up the majority of the operations (far exceeding reads), an additional index might incur measurable performance degradation (though it might not be much, nor noticeable from a human perspective).
The LIKE operator, used as you describe, should be fully optimized. In other words, the clause WHERE combinedfield LIKE 'PRJ%' should perform essentially the same as WHERE element_type = 'PRJ', given an index in both situations. The more expensive situation is a wildcard at the beginning (e.g., LIKE '%abc%'). You can think of a LIKE search as looking up a word in a dictionary. The search for 'overf%' is basically the same as a search for 'overflow': you can do a "manual" binary search in the dictionary and quickly find the first word beginning with 'overf'. Searching for '%low', though, is much more expensive: you have to scan the entire dictionary in order to find all the words that end with "low".
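A sketch of the two shapes, assuming the combined column is called ref on a table elements (both names hypothetical):

-- Prefix match: can walk an index on ref, like the dictionary example above
SELECT * FROM elements WHERE ref LIKE 'PRJ%';

-- Leading wildcard: a plain B-tree index cannot help; the whole "dictionary" is scanned
SELECT * FROM elements WHERE ref LIKE '%12';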
Having two separate fields to represent two separate values is almost always better in the long run since you can construct more efficient queries, easily perform joins, etc.
So based on the given information, I would recommend splitting it into two fields and index both fields.
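And a minimal sketch of the split layout; one possible variant uses a single composite index rather than two separate ones (names adapted from the question):

CREATE TABLE elements_split (
  element_type CHAR(3) NOT NULL,      -- 'PRJ', 'TSK', 'DOC', ...
  element_id   INT UNSIGNED NOT NULL,
  PRIMARY KEY (element_type, element_id)
);

-- The old LIKE 'PRJ%' becomes an indexed equality test
SELECT * FROM elements_split WHERE element_type = 'PRJ';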

mysql: 'WHERE something!=true' excludes fields with NULL

I have 2 tables: one in which I have groups, the other where I store user restrictions on which groups are shown.
When I do a LEFT JOIN and specify no condition, it shows me all records. When I do WHERE group_hide.hide != 'true', it only shows the records that have the enum value false set on them. With the JOIN, the other groups get the hide field as NULL.
How can I make it exclude only those that are set to true, and show everything else that has either NULL or false?
In MySQL you must use IS NULL or IS NOT NULL when dealing with nullable values.
Here you should use (group_hide.hide IS NULL OR group_hide.hide != 'true')
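Applied to the query in the question, a sketch (the table and join-column names are assumed):

SELECT g.*
FROM `groups` AS g
LEFT JOIN group_hide AS gh
       ON gh.group_id = g.id          -- join columns assumed
WHERE gh.hide IS NULL OR gh.hide != 'true';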
Don already provided a good answer to the question that you asked, and it will solve your immediate problem.
However, let me address the point of the wrong data type domain. Normally you would make hide a BOOLEAN, but MySQL does not really implement it completely: it converts it to TINYINT(1), which allows values from -128 to 127 (see the overview of data types for MySQL). Since MySQL does not support CHECK constraints, you are left with the options of using either a trigger or a foreign-key reference to properly enforce the domain (a trigger sketch follows after the list below).
Here are the problems with wrong data domain (your case), in order of importance:
The disadvantage of allowing NULL for a field that can only be 1 or 0 is that you have to employ three-valued logic (true, false, null), which, by the way, is not perfectly implemented in SQL. This makes certain queries more complex and slower than they need to be. If you can make a column NOT NULL, do.
The disadvantages of using VARCHAR for a field that can only be 1 or 0 are slower queries, due to the extra I/O, and bigger storage needs (which slow down reads and writes, make indexes bigger if the field is part of an index, and inflate backups). Keep in mind that none of these effects may be noticeable with a wrong domain on a single field in a smaller table, but if data types are consistently set too big, or the table has a serious number of records, the effects will bite. Also, you will always need to convert the VARCHAR to a 1 or 0 to use MySQL's natural boolean operators, increasing the complexity of queries.
The disadvantage of MySQL using TINYINT(1) for BOOL is that the RDBMS allows values that should not be allowed, theoretically letting meaningless values be stored in the system. In this case your application layer must guarantee data integrity, and it is always better if the RDBMS guarantees integrity, as that protects you from certain bugs in the application layer and also from mistakes a database administrator might make.
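A hedged sketch of the trigger option, assuming hide has been converted to TINYINT(1) NOT NULL (SIGNAL requires MySQL 5.5 or later; names assumed):

DELIMITER //
CREATE TRIGGER group_hide_bi BEFORE INSERT ON group_hide
FOR EACH ROW
BEGIN
  -- Reject anything outside the intended 0/1 domain
  IF NEW.hide NOT IN (0, 1) THEN
    SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'hide must be 0 or 1';
  END IF;
END//
DELIMITER ;
-- A matching BEFORE UPDATE trigger is needed to cover updates as well.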
An obvious answer would be:
WHERE (group_hide.hide IS NULL OR group_hide.hide = 'false')
I'm not sure off the top of my head what the null behaviour rules are.