I have a table (millions of rows) where one of the columns is a TEXT field that stores JSON blobs, but only about 10-20% of the rows actually have a non-NULL value.
What is the best practice when it comes to sparse columns?
Should I
a) Just keep the table as is or
b) Create a new table with just that Text column?
If I am not mistaken, option (a) is fine because InnoDB will dynamically allocate only as much space as is needed for that TEXT column, right? Is there any reason to go with option (b)? It seems like option (b) would just add complexity when querying (joining) these tables and increase the space used as well.
MySQL (InnoDB storage engine) stores nothing for a NULL. Well, almost nothing: each row has a bitfield with one bit per nullable column, and the bitfield is followed by the data values for the non-NULL columns. Variable-length columns like VARCHAR, TEXT, BLOB, or JSON take only the space their actual length requires.
So I'd suggest keeping your table as is, keep the TEXT field in the table, and make it NULL when there's no JSON data.
P.S.: Aren't you using the JSON data type?
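As a minimal sketch of that layout (the table and column names here are made up for illustration):

-- Hypothetical table; only ~10-20% of rows carry a payload.
CREATE TABLE events (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created_at DATETIME NOT NULL,
    payload    JSON NULL  -- stays NULL for the rows with no JSON data
);

-- If the column already exists as TEXT, it could be converted (MySQL 5.7+);
-- this fails if any existing non-NULL value is not valid JSON:
ALTER TABLE events MODIFY payload JSON NULL;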
You mentioned the storage/space consideration, but I think what matters most is how you will use the data. If your performance is okay with doing a LIKE '%...%' match against the JSON text, then just leave it as is.
Denormalizing the data would allow you to query and index the content better.
In general, it does not matter whether you do (a) or (b). But here are some more considerations:
If you do SELECT * but ignore that column, then (a) is wasteful.
Certain InnoDB ROW_FORMATs will store 'short' strings inline in the row; others will store them in a separate block, leaving behind 20 or 767 bytes in the main block. (It gets rather tedious and confusing to work out whether this will really matter for (a).)
(b) involves a LEFT JOIN in your code when you do want the column. You may consider this a bother.
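If you do go with (b), the extra table and the LEFT JOIN would look roughly like this (a sketch only; the names thing and thing_json are invented):

-- Option (b): move the sparse TEXT column into its own table.
CREATE TABLE thing_json (
    thing_id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    json_data TEXT NOT NULL
);

-- LEFT JOIN only when the column is needed; rows without JSON come back as NULL.
SELECT t.*, j.json_data
FROM thing t
LEFT JOIN thing_json j ON j.thing_id = t.id;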
I have a MariaDB InnoDB table with several million rows, but with short, fixed-width rows consisting of numbers and timestamps only.
We usually search, filter and sort the rows using any of the existing columns.
We want to add a column to store an associated "url" for each row. Ideally every row will have its url.
We know for a fact that we won't be sorting, searching and filtering by the url column.
We don't mind truncating the URL to its first 255 bytes, so we are going to give it the VARCHAR type.
But of course that column's width would be variable. The whole record will become variable-width and the width of the original record will double in many cases.
We were considering the alternative of using a different, secondary table for storing the varchar.
We could join them when querying the data, or, probably even more efficiently, just fetch the urls for the page we are showing.
Would this approach be advisable?
Is there a better alternative that would also allow us to preserve performance?
Update: As user Bill Karwin noted in one comment below, InnoDB does not benefit from fixed width as much as MyISAM does, so the real issue here is about the size of the row and not so much about the fixed versus variable width discussion.
Assuming you have control over how the URL is generated, you may want to make it fixed-length. YouTube video identifiers, for instance, are always 11 base-64 characters. This removes the variable-length problem and avoids joining tables.
If changing URI generation is not an option, you have a few alternatives to make it fixed-length:
You could pad every url with a special character to force it to 255 characters within the database, and strip the padding just before returning it. This is not a clean solution, but it makes read (DQL) operations faster than joining.
You could fetch the urls separately as you have stated, but beware that two requests may be more time-consuming than any option that needs just one.
You could join with the secondary table only when the user actually requires the url, rather than by default (see the sketch after this answer).
Consider that having variable length may not be as big a problem, depending on your needs. The only issue might be if you're grossly oversizing fields, but it doesn't seem to be your case.
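If you do go the secondary-table route, a minimal sketch might look like this (table and column names are invented; the join only happens for the page being displayed):

CREATE TABLE record_url (
    record_id INT UNSIGNED NOT NULL PRIMARY KEY,
    url       VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

-- Fetch urls only for the rows on the current page:
SELECT r.*, u.url
FROM record r
LEFT JOIN record_url u ON u.record_id = r.id
ORDER BY r.created_at
LIMIT 50;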
I'm going to create a table with between 1,000 and 20,000 rows, and one of its fields will repeat a lot: about 60% of the rows will have a value in this field, and roughly every 50-100 rows share the same value.
I've been concerned about efficiency lately, and I'm wondering whether it would be better to store this string on each row (it would be between 8 and 20 characters) or to create another table and link to it by a representative ID instead, so that ~1-50 rows in that table replace about 300-5,000 strings with ints.
Is this a good approach, or is it even necessary at all?
Yes, it's a good approach in most circumstances. It's called normalisation, and is mainly done for two reasons:
Removing repeated data
Avoiding repeating entities
I can't tell from your question what the reason would be in your case.
The difference between the two is that the first reuses values that just happen to look the same, while the second connects values that have the same meaning. The practical difference lies in what should happen if a value changes: if the value changes for one record, should the value itself change so that it changes for all the other records using it, or should that record be connected to a new value so that the other records are left unchanged?
If it's for the first reason then you will save space in the database, but it will be more complicated to update records. If it's for the second reason you will not only save space, but you will also reduce the risk of inconsistency, as a value is only stored in one place.
It is a good approach to have a look-up table for the strings. That way you can build more efficient indexes on the integer values. It wouldn't be absolutely necessary, but as a good practice I would do it.
I would recommend using an int with a foreign key to a lookup table (like you describe in your second scenario). This will cause the index to be much smaller than indexing a VARCHAR so the storage required would be smaller. It should perform better, too.
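A minimal sketch of that layout (the names tag and item are invented for illustration):

CREATE TABLE tag (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(20) NOT NULL UNIQUE        -- the 8-20 character string, stored once
) ENGINE=InnoDB;

CREATE TABLE item (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    tag_id INT UNSIGNED NULL,               -- NULL for the ~40% of rows with no value
    FOREIGN KEY (tag_id) REFERENCES tag(id)
) ENGINE=InnoDB;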
Avitus is right that it's generally a good practice to create lookups.
Think about the JOINs you will use this table in. 1,000-20,000 rows are not a lot for MySQL to handle. If you don't have any joins, I would not bother with the lookups; just index the column.
BUT as soon as you start joining the table with others (of the same size), that's where the performance loss comes in, and you can (most likely) compensate for it by introducing lookups.
When we create the database for our application, we limit the lengths of the database columns. For example:
String (200)
int (5)
etc
Does this have any effect on speed, or any other effect?
First of all, one does not limit the length of a "database". Rather, one limits the size of the columns of the tables in a database.
Why do we do this, you ask?
We don't want to waste any space for data that's never going to use it (that's what the varchar, varbinary and the like are for).
It's a best practice because it forces you to think of your data structure BEFORE you actually use it.
The less data there is the faster the processing of the application (that's a tautology).
It makes it easier to validate your data if you know exactly how much space it is allowed to take.
Full-text indexes gain greatly when limited in size.
One reason I can think of: when you don't specify the length of a column's data type, the MySQL engine will assume a default length that may be a lot larger than the actual data stored in that column. So it is always best practice never to ignore a column's length.
Limiting the length of database fields helps validate the data: you won't get any unexpected data of a length other than what has been specified. Also, certain types such as LONG cannot be indexed, so choose appropriately and wisely. With regard to performance, the effect is negligible. You also need to think about the data itself; for example, storing data in a Unicode encoding such as UTF-8 may increase the storage requirements.
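To make the question's example concrete, here is a sketch of a table with explicitly limited columns (names and sizes are invented). Note that for integer types the number in parentheses is only a display width, not a storage limit:

CREATE TABLE customer (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(200) NOT NULL,      -- never more than 200 characters
    age  TINYINT UNSIGNED NOT NULL   -- 1 byte of storage; INT(5) would still use 4 bytes
) ENGINE=InnoDB;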
I've heard that if you have a table with a TEXT column that will hold a large chunk of text data, it's better for performance to move that column into a separate table and get it via JOINs to the base record.
Is this true, and if so why?
Not with PostgreSQL, from the manual:
Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
So a large character column (such as TEXT or VARCHAR without a specified size limit) is stored away from the main table data. In other words, PostgreSQL has your "put it in a separate table" optimization built in. If you're using PostgreSQL, arrange your table sensibly and leave the data layout to PostgreSQL.
I don't know how MySQL or other RDBMSs arrange their data.
The reason behind this optimization is that the database will usually keep the data for each row in contiguous blocks on disk to cut down on seeking when the row needs to be read or updated. If you have a TEXT (or other variable length type) column in a row then the size of the row is variable so more work is needed to go from row to row. An analogy would be the difference between accessing something in a linked list versus accessing an array; with a linked list, you have to read three elements one at a time to get to the fourth one, with an array you just offset 3 * element_size bytes from the beginning and you're there in one step.
From the MySQL Manual:
For a table with several columns, to reduce memory requirements for queries that do not use the BLOB column, consider splitting the BLOB column into a separate table and referencing it with a join query when needed.
In some scenarios, this might be true. To see why, let's say your table is:
create table foo (
    id serial primary key,
    title varchar(200) not null,
    pub_date datetime not null,
    text_content text
);
Then you do a query like this:
select id, title, pub_date
from foo;
You will have to load many more pages from disk than you would if you didn't have the text_content field in that table. And query optimization is mostly about reducing disk I/O to the minimum possible.
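Splitting the column out, then, might look roughly like this (a sketch only; foo_content and the join are made up to continue the example above):

create table foo_content (
    foo_id bigint not null primary key,
    text_content text not null
);

-- listing queries touch only the narrow foo table:
select id, title, pub_date from foo;

-- the full view joins the content back in for a single row:
select f.id, f.title, f.pub_date, c.text_content
from foo f
left join foo_content c on c.foo_id = f.id
where f.id = 42;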
MySQL specifies the row format of a table as either fixed or dynamic, depending on the column data types. If a table has a variable-length column data type, such as TEXT or VARCHAR, the row format is dynamic; otherwise, it's fixed.
My question is, what's the difference between the two row formats? Is one more efficient than the other?
The difference really only matters for MyISAM; other storage engines do not care about the difference.
EDIT: Many users commented that InnoDB does care: link 1 by steampowered, link 2 by Kaan.
With MyISAM with fixed width rows, there are a few advantages:
No row fragmentation: It is possible with variable width rows to get single rows split into multiple sections across the data file. This can increase disk seeks and slow down operations. It is possible to defrag it with OPTIMIZE TABLE, but this isn't always practical.
Data file pointer size: In MyISAM, there is a concept of a data file pointer, which is used when the engine needs to reference the data file. For example, it is used in indexes to refer to where the row is actually present. With fixed-width rows, this pointer is based on the row offset in the file (i.e. rows are 1, 2, 3 regardless of their size). With variable-width rows, the pointer is based on the byte offset (i.e. rows might be 1, 57, 163). The result is that with large tables, the pointer needs to be larger, which potentially adds a lot more overhead to the table.
Easier to fix in the case of corruption: since every row is the same size, if your MyISAM table gets corrupted it is much easier to repair, and you will only lose the data that is actually corrupted. With variable width, in theory the variable-width pointers can get messed up, which can result in hosing data in a bad way.
Now the primary drawback of fixed width is that it wastes more space. For example, you need to use CHAR fields instead of VARCHAR fields, so you end up with extra space taken up.
Normally, you won't have much choice in the format, since it is dictated by the schema. However, if you only have a few VARCHARs or a single BLOB/TEXT, it might be worth trying to optimize towards fixed width. For example, consider switching the only VARCHAR to a CHAR, or splitting the BLOB into its own table.
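As a rough sketch of what that looks like (MyISAM only; the table name is invented), you can check which format a table ended up with via SHOW TABLE STATUS:

-- All columns fixed width, so MyISAM uses the Fixed row format.
CREATE TABLE log_fixed (
    id      INT UNSIGNED NOT NULL,
    code    CHAR(8) NOT NULL,        -- CHAR instead of VARCHAR keeps the row fixed width
    created TIMESTAMP NOT NULL
) ENGINE=MyISAM;

SHOW TABLE STATUS LIKE 'log_fixed';  -- check the Row_format column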
You can read even more about this at:
http://dev.mysql.com/doc/refman/5.0/en/static-format.html
http://dev.mysql.com/doc/refman/5.0/en/dynamic-format.html
One key difference occurs when you update a record. If the row format is fixed, there is no change in the length of the record. In contrast, if the row format is dynamic and the new data causes the record to grow, a link (the overflow pointer) is used to point to the "overflow" data.
This fragments the table and generally slows things down. There is a command to defragment (OPTIMIZE TABLE), which somewhat mitigates the issue.
This page in MySQL's documentation seems to contradict the top answer here, in that DYNAMIC row format means something for InnoDB tables as well:
https://dev.mysql.com/doc/refman/5.7/en/innodb-row-format.html
Fixed means that every row is exactly the same size. That means that if the 3rd row on a data page needs to be loaded, it will be at exactly PageHeader+2*RowSize, saving some access time.
In order to find the beginning of a dynamic record, the list of record offsets must be consulted, which involves an extra indirection.
In short, yes, there's a slight performance hit for dynamic rows. No, it's not a very big one. If you think it will be a problem, test for it.
Fixed should be faster and more secure than dynamic, with the drawback of requiring fixed character lengths.
You can find this information here: http://dev.mysql.com/doc/refman/5.0/en/static-format.html