I have a MariaDB InnoDB table with several million rows, but with short, fixed-width rows consisting of numbers and timestamps only.
We usually search, filter and sort the rows using any of the existing columns.
We want to add a column to store an associated "url" for each row. Ideally every row will have its url.
We know for a fact that we won't be sorting, searching and filtering by the url column.
We don't mind truncating the URL to its first 255 bytes, so we are going to give it the VARCHAR type.
But of course that column's width would be variable. The whole record will become variable-width and the width of the original record will double in many cases.
We were considering the alternative of using a different, secondary table for storing the varchar.
We could join them when querying the data, or even more efficiently -probably- just fetch the url's for the page we are showing.
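For illustration, here is a minimal sketch of the layout we are considering (all table and column names are made up):

```sql
-- Hypothetical sketch: the existing fixed-width table plus a
-- secondary table holding only the urls, keyed by the same id.
CREATE TABLE readings (
  id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  sensor_id INT UNSIGNED    NOT NULL,
  value     INT             NOT NULL,
  taken_at  TIMESTAMP       NOT NULL
) ENGINE=InnoDB;

CREATE TABLE reading_urls (
  reading_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  url        VARCHAR(255)    NOT NULL,
  FOREIGN KEY (reading_id) REFERENCES readings (id)
) ENGINE=InnoDB;

-- Either join on demand...
SELECT r.*, u.url
FROM readings r
LEFT JOIN reading_urls u ON u.reading_id = r.id
ORDER BY r.taken_at DESC
LIMIT 50;

-- ...or fetch the urls only for the rows on the current page:
SELECT reading_id, url
FROM reading_urls
WHERE reading_id IN (101, 102, 103);
```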
Would this approach be advisable?
Is there a better alternative that would also allow us to preserve performance?
Update: As user Bill Karwin noted in one comment below, InnoDB does not benefit from fixed width as much as MyISAM does, so the real issue here is about the size of the row and not so much about the fixed versus variable width discussion.
Assuming you have control over how the URL is generated, you may want to make it fixed-length. YouTube video IDs, for instance, are always 11 base-64 characters long. This fixes the variable-length problem and avoids joining tables.
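As a sketch of that idea, reusing the hypothetical readings table from the question and assuming the id really is always 11 characters:

```sql
-- Store only the fixed-length id, not the whole URL.
ALTER TABLE readings ADD COLUMN yt_id CHAR(11) NOT NULL DEFAULT '';

-- Rebuild the full URL on the way out:
SELECT CONCAT('https://youtu.be/', yt_id) AS url FROM readings;
```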
If changing URI generation is not an option, you have a few alternatives to make it fixed-length:
You could pad every url to exactly 255 characters with a special filler character within the database, and strip it just before returning it (see the sketch after this list). This is not a clean solution, but it makes DQL operations faster than joining.
You could fetch the urls separately, as you stated, but beware that two requests may be more time-consuming than any option that needs just one.
You could join with another table only when the user requires it, as opposed to it being the default.
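A minimal sketch of the padding idea from the first alternative, assuming a hypothetical table t with a url CHAR(255) column and a filler character like '#' that can never appear in a url:

```sql
CREATE TABLE t (
  id  INT PRIMARY KEY,
  url CHAR(255) NOT NULL
) ENGINE=InnoDB;

-- Pad to a fixed 255 characters on the way in...
INSERT INTO t (id, url)
VALUES (1, RPAD('https://example.com/x', 255, '#'));

-- ...and strip the filler on the way out:
SELECT id, TRIM(TRAILING '#' FROM url) AS url FROM t;
```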
Consider that having variable length may not be as big a problem, depending on your needs. The only issue might be if you're grossly oversizing fields, but it doesn't seem to be your case.
Related
I have a table (millions of rows) where one of the columns is a Text field (stores json blobs). But only about 10-20% of them are actually non-Null.
What is the best practice when it comes to sparse columns?
Should I
a) Just keep the table as is or
b) Create a new table with just that Text column?
If I am not mistaken, option (a) is fine, because InnoDB will dynamically allocate only as much space as is needed for that Text column, right? Is there any reason to go with option (b)? It seems like option (b) would just add complexity in querying (joining) these tables and increase the space used as well.
MySQL (InnoDB storage engine) stores nothing for a NULL. Well, each row has a bitfield with 1 bit for each nullable column. The bitfield is followed by data values for non-NULL columns. And variable-length columns like VARCHAR, TEXT, BLOB, or JSON take only the space needed given their length.
So I'd suggest keeping your table as is, keep the TEXT field in the table, and make it NULL when there's no JSON data.
P.S.: Aren't you using the JSON data type?
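For example, a minimal sketch with hypothetical names (the JSON type requires MySQL 5.7+; on MariaDB, JSON is an alias for LONGTEXT):

```sql
CREATE TABLE events (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  kind    INT NOT NULL,
  payload JSON NULL              -- NULL for the ~80-90% of rows with no data
) ENGINE=InnoDB;

INSERT INTO events (kind, payload) VALUES
  (1, '{"source": "api", "retries": 2}'),
  (2, NULL);                     -- costs only a bit in the NULL bitfield

-- The JSON type also lets you pull out fields server-side:
SELECT id, JSON_EXTRACT(payload, '$.source') AS source
FROM events
WHERE payload IS NOT NULL;
```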
You mentioned the storage/space consideration, but I think what matters most is how you will use the data. If your performance is acceptable when doing a LIKE '%...%' match, then just leave it.
Denormalizing the data would allow you to query and index the content better.
In general, it does not matter whether you do (a) or (b). But here are some more considerations:
If you do SELECT * but ignore that column, then (a) is wasteful.
Certain InnoDB ROW_FORMATs will put 'short' strings in the table, not separate; others will store them in a separate block, leaving behind 20 or 767 bytes in the main block. (It gets rather tedious and confusing to see if this will really matter for (a).)
(b) involves a LEFT JOIN in your code when you do want the column. You may consider this a bother.
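For completeness, a sketch of the query shape for (b), with hypothetical table and column names:

```sql
-- Fetch a row together with its (possibly missing) JSON blob.
SELECT t.*, j.json_blob
FROM main_table AS t
LEFT JOIN json_blobs AS j ON j.id = t.id
WHERE t.id = 42;
```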
I know that changing a table with fixed width rows to have variable width rows (by changing a CHAR column to a VARCHAR) has performance implications.
However my question is, given a preexisting table with variable width rows (due to many VARCHAR columns), and thus with that performance penalty already paid, would adding another variable length column further impact performance?
My hunch is that it wouldn't; the biggest performance penalty is paid when switching from fixed-width rows to variable-width rows, and adding another variable-width column would have a negligible impact.
Yes and no. It is true that variable-width character columns are slightly slower than fixed-width character columns. But the "penalty" (or performance cost) is cumulative and per column. So, every column you add to your query in general (fixed width or otherwise) is going to impact performance (as you query more data, it takes longer to fetch all of the data).
Each variable-length column you add to the table makes data retrieval slower.
Another consideration is whether the variable-length columns are part of the query (filter/WHERE clause) and whether you are going to use them in indexes. Variable-length fields in an index also add to the index overhead. For details, you will need to look at the documentation of the particular database you are using, e.g. http://dev.mysql.com/doc/refman/5.6/en/innodb-table-and-index.html
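In MySQL/InnoDB, one common way to cap that overhead is a prefix index. A sketch, assuming a hypothetical pages table with a url VARCHAR(255) column:

```sql
-- Index only the first 32 characters instead of the full value.
CREATE INDEX idx_url_prefix ON pages (url(32));
```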
Sure. Extracting a data row into the component fields will take a few extra cycles.
That, however, will be more than offset by the almost certain reduction in row size — meaning more rows per data page and thus faster lookups across the board.
It will make a tiny (measured in microseconds) difference to data retrieval performance, BUT the human performance impact of using the wrong datatype just to squeeze every last drop out of the database could be large and therefore costly.
Use the datatype most appropriate for the attribute you're persisting in the database.
Don't be driven by "performance", be driven by the usual guidelines for software development, like readability, maintainability, usability, etc.
Use the wrong datatype and your code will be more complex (possibly losing more performance than you gained), and you'll regret ever doing it. And I doubt you would ever notice those gains anyway.
Only do such things when you have proof that there's a problem, and the problem is large enough to matter. Doing what you're proposing is called "premature optimization", and is probably the worst design strategy there is.
When creating the database for our application, we limited the lengths of the database columns.
For example: String(200), int(5), etc.
Does this have any effect on speed, or any other effect?
First of all, one does not limit the length of a "database". Instead, one limits the size of the columns of the tables in a database.
Why do we do this, you ask?
We don't want to waste any space for data that's never going to use it (that's what the varchar, varbinary and the like are for).
It's a best practice because it forces you to think of your data structure BEFORE you actually use it.
The less data there is, the faster the application can process it (that's almost a truism).
It makes it easier to validate your data if you know exactly how much space it is allowed to take.
Full-text indexes benefit greatly from being limited in size.
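A small illustration of the point, with made-up names and sizes chosen to fit the data rather than "just in case":

```sql
CREATE TABLE users (
  id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  username VARCHAR(30)  NOT NULL,  -- validated to at most 30 chars
  email    VARCHAR(200) NOT NULL,
  country  CHAR(2)      NOT NULL   -- ISO 3166-1 alpha-2 code
) ENGINE=InnoDB;
```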
One reason I can think of: when you don't specify the length of a column's data type, the MySQL engine assumes a default length that may be a lot larger than the length of the actual data stored in that column. So it is best practice never to ignore the length attribute of a column.
Limiting the length of database fields helps validate your data: you won't get any unexpected data of a length other than what has been specified. Also, certain types, such as LONG, cannot be indexed, so choose appropriately and wisely. With regard to performance, the effect is negligible. You also need to think about the data itself; for example, storing data in a Unicode encoding such as UTF-8 may increase the storage requirements.
I've heard (from a colleague, who heard it from another developer) that VARCHAR columns should always be put at the end of a table definition in MySQL, because they are variable in length and could therefore slow down queries.
The research I've done on Stack Overflow seems to contradict this, however, and suggests that column order is important, with varying agreement on how much this applies to VARCHARs.
He wasn't specific about storage engines, or about whether this only applied to VARCHAR columns which are infrequently accessed.
Asking that question about "MySQL" is not helpful, as MySQL relegates storage to storage engines, and they implement storage in very different ways. It makes sense to ask this question for any individual storage engine.
In the MEMORY engine, variable length data types do not exist. A VARCHAR is silently changed into a CHAR. In the context of your question: It does not matter where in a table definition you put your VARCHAR.
In the MyISAM engine, if a table has no variable length data whatsoever (VARCHAR, VARBINARY or any TEXT or BLOB type) it is of the FIXED variant of MyISAM, that is, records have a fixed byte length. This can have performance implications, especially if data is deleted and inserted repeatedly (i.e. the table is not append only). As soon as any variable length data type is part of a table definition it becomes the DYNAMIC variant of MyISAM, and MyISAM internally changes any but the shortest CHAR type internally to VARCHAR. Again, position and even definition of CHAR/VARCHAR do not matter.
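You can check which variant a given table ended up with (hypothetical table name):

```sql
SHOW TABLE STATUS LIKE 'mytable';  -- look at the Row_format column
```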
In the InnoDB engine, data is stored in pages of 16 KB size. A page has a page footer with a checksum, and a page header, with among other things a page directory. The page directory contains for each row the offset of that row relative to the beginning of the page. A page also contains free space, and all I/O is done in pages.
Hence InnoDB can, as long as there is free space in a page, grow VARCHAR in place, and move rows around inside a page, without incurring any additional I/O. Also, since all rows are being addressed as (pagenumber, page directory entry), movement of a row inside a page is localized to the page and not visible from the outside.
It also means that for InnoDB too, the order of columns inside a row does not matter at all.
These are the three storage engines that are most commonly used with MySQL, and order of columns does not matter for any of these three. It may be that other, more exotic storage engines exist for which this is not true.
It does not matter. And some engines store varlena types in a separate area (e.g. TOAST in Postgres).
Moreover, the logical order (what you see when you select *) may actually differ from the physical order (how it's stored, which is based on the order in which you've created the actual columns using subsequent alter table statements).
http://www.sqlskills.com/BLOGS/PAUL/post/Inside-the-Storage-Engine-Anatomy-of-a-record.aspx
MySQL specifies the row format of a table as either fixed or dynamic, depending on the column data types. If a table has a variable-length column data type, such as TEXT or VARCHAR, the row format is dynamic; otherwise, it's fixed.
My question is, what's the difference between the two row formats? Is one more efficient than the other?
The difference really only matters for MyISAM; other storage engines do not care about it.
EDIT: Many users commented that InnoDB does care: link 1 by steampowered, link 2 by Kaan.
With MyISAM, fixed-width rows have a few advantages:
No row fragmentation: It is possible with variable width rows to get single rows split into multiple sections across the data file. This can increase disk seeks and slow down operations. It is possible to defrag it with OPTIMIZE TABLE, but this isn't always practical.
Data file pointer size: In MyISAM, there is a concept of a data file pointer which is used when it needs to reference the data file. For example, this is used in indexes when they refer to where the row actually is present. With fixed width sizes, this pointer is based on the row offset in the file (ie. rows are 1, 2, 3 regardless of their size). With variable width, the pointer is based on the byte offset (ie. rows might be 1, 57, 163). The result is that with large tables, the pointer needs to be larger which then adds potentially a lot more overhead to the table.
Easier to fix in the case of corruption. Since every row is the same size, if your MyISAM table gets corrupted it is much easier to repair, so you will only lose data that is actually corrupted. With variable width, in theory it is possible that the variable width pointers get messed up, which can result in hosing data in a bad way.
Now the primary drawback of fixed width is that it wastes more space. For example, you need to use CHAR fields instead of VARCHAR fields, so you end up with extra space taken up.
Normally, you won't have much choice in the format, since it is dictated by the schema. However, if you only have a few VARCHARs or a single BLOB/TEXT, it might be worth trying to optimize towards this. For example, consider switching the only VARCHAR into a CHAR, or splitting the BLOB into its own table.
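Hypothetical sketches of those two moves (all names are made up):

```sql
-- 1. Turn the only VARCHAR into a CHAR; with no variable-length
--    columns left, MyISAM switches to the FIXED row format:
ALTER TABLE log MODIFY code CHAR(8) NOT NULL;

-- 2. Or move a lone TEXT column out into its own table:
CREATE TABLE log_details (
  log_id  INT UNSIGNED NOT NULL PRIMARY KEY,
  details TEXT NOT NULL
) ENGINE=MyISAM;
```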
You can read even more about this at:
http://dev.mysql.com/doc/refman/5.0/en/static-format.html
http://dev.mysql.com/doc/refman/5.0/en/dynamic-format.html
One key difference occurs when you update a record. If the row format is fixed, there is no change in the length of the record. In contrast, if the row format is dynamic and the new data causes the record to increase in length, a link is used to point to the "overflow" data (this is called the overflow pointer).
This fragments the table and generally slows things down. There is a command to defragment (OPTIMIZE TABLE), which somewhat mitigates the issue.
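The defragmentation command, for a hypothetical table name:

```sql
OPTIMIZE TABLE mytable;
```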
This page in MySQL's documentation seems to contradict the top answer here, in that DYNAMIC row format means something for InnoDB tables as well:
https://dev.mysql.com/doc/refman/5.7/en/innodb-row-format.html
Fixed means that every row is exactly the same size. That means that if the 3rd row on a data page needs to be loaded, it will be at exactly PageHeader+2*RowSize, saving some access time.
In order to find the beginning of a dynamic record, the list of record offsets must be consulted, which involves an extra indirection.
In short, yes, there's a slight performance hit for dynamic rows. No, it's not a very big one. If you think it will be a problem, test for it.
Fixed should be faster and safer than dynamic, with the drawback of requiring a fixed character length.
You can find this information here: http://dev.mysql.com/doc/refman/5.0/en/static-format.html