I was just wondering about the efficiency of storing a large number of boolean values in a single CHAR or VARCHAR column
data
"TFTFTTF"
vs
isFoo isBar isText
false true false
Would the worse performance be worth it if I switched to storing the values this way? I figured it would be easier to set a single value rather than maintain all of those separate fields.
thanks
Don't do it. MySQL offers types such as char(1) and tinyint that occupy the same space as a single character. In addition, MySQL offers enumerated types, if you want your flags to have more than two values -- and for the values to be recognizable.
That last point is the critical point. You want your code to make sense. The string 'FTF' does not make sense. The columns isFoo, isBar, and isText do make sense.
There is no need to obfuscate your data model.
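For instance, a minimal sketch of what that could look like (the table, column, and ENUM value names here are hypothetical):

    -- one small column per flag instead of a packed string
    CREATE TABLE measurements (
        id     INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        isFoo  TINYINT(1) NOT NULL DEFAULT 0,
        isBar  TINYINT(1) NOT NULL DEFAULT 0,
        isText TINYINT(1) NOT NULL DEFAULT 0,
        -- an ENUM where a flag needs more than two recognizable states
        status ENUM('draft', 'published', 'archived') NOT NULL DEFAULT 'draft'
    );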
This would be a bad idea: not only does it offer no advantage in terms of the space used, it also hurts query performance and the comprehensibility of your data model.
Disk Space
In terms of storage usage, it makes no real difference whether the data is stored in a single varchar(n) or char(n) column or in multiple tinyint, char(1), or bit(1) columns. Only with varchar would you need 1 to 2 extra bytes of disk space per entry for the length prefix.
For more information about the storage requirements of the different data types, see the MySQL documentation.
Query Performance
If the boolean values were stored in a VARCHAR, the search for all entries where a specific value is true would take much longer, since string operations would be necessary to find the correct entries. Even when searching for a combination of boolean values such as "TFTFTFTFTT", the query would still take longer than if the boolean values were stored in individual columns. Furthermore, you can put indexes on single columns like isFoo or isBar, which has a great positive effect on query performance.
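A small sketch of that last point, assuming a hypothetical measurements table with one column per flag:

    -- an index on an individual flag column...
    CREATE INDEX idx_isfoo ON measurements (isFoo);

    -- ...can be used directly by a predicate on that column
    SELECT * FROM measurements WHERE isFoo = 1;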
Data Model
A data model should be as comprehensible as possible and if possible independent of any kind of implementation considerations.
Realistically, a database field should only contain one atomic value, that is to say: a value that can't be subdivided into separate parts.
Columns that do not contain atomic values:
cannot be sorted
cannot be grouped
cannot be indexed
So let's say you want to find all rows where isFoo is true: you wouldn't be able to do it unless you resorted to string operations like "take the character at isFoo's position in the string and see whether it equals 'T'". This would imply a full table scan on every query, which would degrade performance quite dramatically.
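In SQL terms, the packed-string lookup would have to look something like the sketch below (hypothetical measurements table with a flags string column, isFoo as the first character); a plain index on flags is of no help for this predicate:

    -- inspect the character at isFoo's position in every row: full table scan
    SELECT * FROM measurements
    WHERE SUBSTRING(flags, 1, 1) = 'T';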
It depends on what you want to do with the data after storing it in this format.
After retrieving such a record you would have to do further processing on the server side, which hurts performance whenever you want to load data filtered by specific conditions; the logic on the server would become complex.
The columns isFoo, isBar, and isText would help you to write queries better.
Related
We have a MySQL-based system that stores date values in VARCHAR(10) fields, as SQL-format strings (e.g. '2021-09-07').
I am aware of the large additional space requirements of using this format over the native DATE format, but I have been unable to find any information about performance characteristics and how the two formats would differ in terms of speed. I would assume that working with strings would be slower for non-indexed fields. However, if the field is indexed I could imagine that the indexes on string fields could potentially yield much larger improvements than those on date fields, and possibly even overtake them in terms of performance, for certain tasks.
Can anyone advise on the speed implications of choosing one format over the other, in relation to the following situations (assuming the field is indexed):
Use of the field in a JOIN condition
Use of the field in a WHERE clause
Use of the field in an ORDER BY clause
Updating the index on INSERT/UPDATE
I would like to migrate the fields to the more space-efficient format but want to get some information about any potential performance implications (good or bad) that may apply. I plan to do some profiling of my own if I go down this route, but don't want to waste my time if there is a clear, known advantage of one format over the other.
Note that I am also interested in the same question for VARCHAR(19) vs. DATETIME, particularly if it yields a different answer.
Additional space is a performance issue. Databases work by reading data pages (and index pages) from disk. Bigger records require more data pages to store. And that has an effect on performance.
In other words, your date column is 11 bytes as a VARCHAR(10) versus 3 bytes as a native DATE. If you had a table with only ten date columns, that would be 110 bytes versus 30, which would require more than three times as much data to be scanned.
As for your operations, if you have indexes that are being used, then the indexes will also be larger. Because of the way that MySQL handles collations for columns, comparing string values is (generally) going to be less efficient than comparing binary values.
Certain operations such as extracting the date components are probably comparable (a string function versus a date function). However, other operations such as extracting the day of the week or the month name probably require converting to a date and then to a string, which is more expensive.
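For example (hypothetical events table with an event_date column), a sketch of that extra conversion step:

    -- native DATE column: the date function applies directly
    SELECT MONTHNAME(event_date) FROM events;

    -- VARCHAR(10) column: each row's string is parsed into a date first
    -- (explicitly here; MySQL would otherwise do an implicit per-row conversion)
    SELECT MONTHNAME(STR_TO_DATE(event_date, '%Y-%m-%d')) FROM events;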
More bytes being compared --> slightly longer to do the comparison.
More bytes in the column --> more space used --> slightly slower operations because of more bulk on disk, in cache, etc.
This applies to either 1+10 or 1+19 bytes for VARCHAR versus 3 for DATE or 5 for DATETIME. The "1" is for the 'length' of VARCHAR; if the strings are a consistent length, you could switch to CHAR.
A BTree is essentially the same whether the value is an Int, Float, Decimal, Enum, Date, or Char. VARCHAR is slightly different in that it has a 'length'; I don't see this as a big issue in the grand scheme of things.
The number of rows that need to be fetched is the most important factor in the speed of a query. Other things, such as datatype size, come further down the list of what slows things down.
There are lots of DATETIME functions that you won't be able to use. This may lead to a table scan instead of using an INDEX. That would have a big impact on performance. Let's see some specific SQL queries.
Using CHARACTER SET ascii COLLATE ascii_bin for your date/datetime columns would make comparisons slightly faster.
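To illustrate the index point with a generic sketch (hypothetical orders table with an index on order_date): the first query can be written as a sargable range against a DATE column, while wrapping a VARCHAR in conversion functions, as in the second, typically prevents index use.

    -- native DATE column: a range predicate like this can use the index
    SELECT * FROM orders
    WHERE order_date >= '2021-09-01' AND order_date < '2021-10-01';

    -- VARCHAR(10) column: date functions need an explicit conversion, and
    -- wrapping the column in functions forces a full scan
    SELECT * FROM orders
    WHERE MONTH(STR_TO_DATE(order_date, '%Y-%m-%d')) = 9;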
Suppose I have a database with several columns. In each column there are lots of values that are often similar.
For example I can have a column with the name "Description" and a value could be "This is the description for the measurement". This description can occur up to 1000000 times in this column.
My question is not how I could optimize the design of this database but how a database handles such redundant values. Are these redundant values stored as effectively as with a perfect design (with respect to the total size of the database)? If so, how are the values compressed?
The only correct answer is: it depends on the database and the configuration, because there is no silver bullet for this one. Some databases store each distinct value of a column only once (some column stores and the like), but technically there is no necessity to do or not do this.
In some databases you can let the DBMS propose optimizations, and in such a case it could propose an ENUM field that holds only the existing values, which reduces the string to an id that references it. This "optimization" comes at a price: for example, when you want to add a new value to the description field, you have to adapt the ENUM definition.
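To make the ENUM idea concrete, a hypothetical sketch (the table name and the second value are made up; the first value is the example from the question), including the ALTER needed whenever a new description appears:

    -- replace the free-text column with an enumeration of the existing values
    ALTER TABLE measurements
        MODIFY description ENUM(
            'This is the description for the measurement',
            'Another recurring description'
        );

    -- a previously unseen description later requires changing the ENUM again
    ALTER TABLE measurements
        MODIFY description ENUM(
            'This is the description for the measurement',
            'Another recurring description',
            'A brand new description'
        );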
Depending on the actual use case, those optimizations are worth nothing or are even a showstopper, for example when the data changes very often (inserts or updates). The DBMS would spend more time managing uniqueness/duplicates than actually processing queries.
On the question of compression: that also depends on the configuration and the database system, I suspect, and on the field type as well. Text data can be compressed, and for non-indexed text fields there should be almost no drawback to using a simple compression algorithm. Which algorithm is used depends on the DBMS and its configuration.
Unless you become more specific, there is no more specific answer, I believe.
Is there any performance benefit in using the exact data types needed for a column? Or is it just storage optimisation?
For example, I'm creating a users table and I know for certain that there will only be 200 users in total. When I'm manipulating the data on the server, doing some select/update/insert/delete, is there any performance difference between using an unsigned TINYINT for the users_id column and using just INT?
The same applies to the user's name. I know, for now, that the longest name is 48 characters, but I don't know whether a user with a 65-character name won't be inserted in the future. Is there any performance benefit in reserving only the length needed for now with VARCHAR(48), or can I avoid having to constantly check the column's allowed length for each new user and just use VARCHAR(255)?
There is little advantage in either case.
For the number, you do gain a slight performance advantage. Typically, an INT is 4 bytes and a TINYINT is 1 byte. So, if you have multiple smaller fields, your records will be smaller. Smaller records in turn mean fewer data pages and ultimately slightly faster queries. This shows up when you start to have lots of records.
For the varchar, you don't even have that advantage. Both varchar(48) and varchar(255) occupy the same amount of space (there is one additional length byte for columns that can exceed 255 bytes). With this data type, the actual values determine the space used.
In other cases, it can make a big difference. In particular, storing dates in the native format is usually important, both to take advantage of date/time functions and to make better use of indexes.
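For what it's worth, a sketch of the compact variant from the question (names hypothetical), with the wider alternatives noted in comments:

    -- TINYINT UNSIGNED is 1 byte, INT is 4; VARCHAR(48) and VARCHAR(255)
    -- both store only the actual string plus a 1-byte length prefix
    CREATE TABLE users (
        users_id TINYINT UNSIGNED PRIMARY KEY,  -- vs. INT (4 bytes)
        name     VARCHAR(48) NOT NULL           -- vs. VARCHAR(255)
    );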
From what I understand, one should avoid nullable columns in databases whenever possible.
But, in what specific situations do nullable columns actually cause a significant performance drop?
In other words, when does null really hurt performance? (As opposed to when it's negligible, and does not matter at all).
I'm asking so I can know when and how it actually makes a difference.
Don't know where you heard it, but it's not true.
Nullable columns are there to represent data accurately: if a value is unknown, or not yet entered, NULL is a natural value to store. Null values are no more onerous to store or retrieve than values of any other type: most database servers store them in a single bit, which means that it will take less I/O and processor effort to retrieve a NULL value than assembling a varchar, BLOB, or text field from a bunch of fragments that may require walking through a linked list, or reading more disk blocks off the hard drive.
There are a couple instances marginally related to nullable columns that may affect performance:
If you create an index on a nullable column and the actual values in the column are sparse (i.e. many rows have a NULL value, or only a very few distinct values are present, as with, say, a controlled-vocabulary column), the B-tree data structure used to index the column becomes much less efficient. Index traversals become more expensive operations when half the values in an index are identical: the index is far less selective and buys you little.
Inapt use of NULL values, or query techniques that don't use NULL values as they were designed, often results in poor performance, because programmers often fall back on the bad habit of searching or joining on computed column values, which ignores the fantastic set-processing ability of modern db servers. I've consulted at lots of places where the development staff has made a habit of writing clauses like:
WHERE COALESCE(myColumn, '') = ''
which means that the DB server cannot use an index directly, and must perform a computation on every single row of that section of the execution tree to evaluate the query. That is not because there is any intrinsic inefficiency in storing, comparing, or evaluating NULL values, but because the query thwarts the strengths of the database engine to achieve a particular result.
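A sketch of an index-friendlier way to express the same intent (hypothetical table name; the column name follows the clause above):

    -- both predicates are applied to the bare column, so an index on
    -- myColumn can still be considered (e.g. a range or index-merge plan)
    SELECT * FROM myTable
    WHERE myColumn IS NULL OR myColumn = '';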
I'm using MySQL, and I'm reading in some places that using CHAR for indexed columns is 20% faster than using VARCHAR. In other places it seems the benefit only applies when the table doesn't have any VARCHAR columns at all. Is that true?
The information I want to store is a GUID. Is it a better option to store the data in a BINARY or in a CHAR column if the database uses character set utf8? Is it worth converting my data to BINARY every time I insert, update, or query filtering by the GUID? I prefer faster data access over saving disk space.
Any fixed-width column type will lend itself to faster seeking operations when compared to variable-width types. Unless your table is partitioned, it is also true that the variable-width types can degrade performance even on operations which do not involve them. For consideration, think through the algorithm for how you would iterate all the values in a column when all columns are fixed-width and then when some aren't:
For all-fixed width tables (or partitions), you might use simple pointer arithmetic, where you add the value of the combined data-width of all the columns in the partition each time through the loop.
If there are any variable-width columns, however, you would need to calculate the amount to add to the pointer every iteration, based on the actual on-disk "width" of the columns.
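As a sketch of the BINARY option for the GUID (table and column names hypothetical), converting the textual form on the way in and out:

    -- fixed-width 16-byte key instead of a 36-character utf8 string
    CREATE TABLE items (
        guid BINARY(16) PRIMARY KEY,
        name VARCHAR(100)
    );

    -- store and look up by converting the textual GUID once per statement
    INSERT INTO items (guid, name)
    VALUES (UNHEX(REPLACE('3f06af63-a93c-11e4-9797-00505690773f', '-', '')), 'example');

    SELECT HEX(guid), name
    FROM items
    WHERE guid = UNHEX(REPLACE('3f06af63-a93c-11e4-9797-00505690773f', '-', ''));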