store TEXT/BLOB in same table or not? - mysql

While searching through SO, I've found two contradicting answers (and even a comment pointing out the contradiction), but no definitive answer:
The problem is: is there any performance benefit if you store a TEXT/BLOB field outside of a table?
We assume:
You SELECT correctly (only selecting the TEXT/BLOB if required, no SELECT *)
Tables are indexed properly, where it makes sense (so it's not a matter of 'if you index it')
The database design doesn't really matter. This is a question to identify MySQL's behaviour in this special case, not to solve a particular database design problem. Let's assume this database has only one table (or two, if the TEXT/BLOB gets separated).
Used engine: InnoDB (others would be interesting too, if they yield different results)
This post states that putting the TEXT/BLOB into a separate table only helps if you're already SELECTing in a wrong way (always SELECTing the TEXT/BLOB even when it's not necessary) - basically stating that keeping the TEXT/BLOB in the same table is the better solution (less complexity, no performance hit, etc.) since the TEXT/BLOB is stored separately anyway:
The only time that moving TEXT columns into another table will offer any benefit is if there is a tendency to usually select all columns from tables. This is merely introducing a second bad practice to compensate for the first. It should go without saying that two wrongs is not the same as three lefts.
MySQL Table with TEXT column
This post however, states that:
When a table has TEXT or BLOB columns, the table can't be stored in memory
Does that mean that merely having a TEXT/BLOB column in a table is enough to cause a performance hit?
MySQL varchar(2000) vs text?
My Question basically is: What's the correct answer?
Does it really matter if you store TEXT/BLOB into a separate table, if you SELECT correctly?
Or does even having a TEXT/BLOB inside a table create a potential performance hit?

Update: Barracuda is the default InnoDB file format since version 5.7.
If available on your MySQL version, use the InnoDB Barracuda file format by setting
innodb_file_format=barracuda
in your MySQL configuration, and create your tables with ROW_FORMAT=DYNAMIC (or COMPRESSED) to actually use it.
This will make InnoDB store BLOBs, TEXTs and bigger VARCHARs outside the row pages, thus making it a lot more efficient. See this MySQLperformanceblog.com blog article for more information.
As far as I understand it, using the Barracuda format removes the performance argument for storing TEXT/BLOB/VARCHARs in separate tables. However, I think it's always good to keep proper database normalization in mind.
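As a hedged illustration (table and column names are made up), this is roughly what opting into the Barracuda row formats looks like once innodb_file_format=barracuda and innodb_file_per_table are enabled; on 5.7+ the file format setting is already the default:

CREATE TABLE document (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    title VARCHAR(200) NOT NULL,
    body MEDIUMTEXT,
    PRIMARY KEY (id)
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;   -- long values are stored fully off-page

-- An existing table can be converted (this rebuilds the table):
ALTER TABLE document ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;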

One performance gain is to have a table with fixed length records. This would mean no variable length fields like varchar or text/blob. With fixed length records, MySQL doesn't need to "seek" the end of a record since it knows the size offset. It also knows how much memory it needs to load X records. Tables with fixed length records are less prone to fragmentation since space made available from deleted records can be fully reused. MyISAM tables actually have a few other benefits from fixed length records.
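As a rough sketch of the fixed-length point (names are invented, and this applies to MyISAM rather than InnoDB): a table declared with only fixed-width column types ends up with a fixed row format, which you can verify yourself:

CREATE TABLE session_log (
    id INT UNSIGNED NOT NULL,
    user_id INT UNSIGNED NOT NULL,
    status CHAR(8) NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id)
) ENGINE=MyISAM;            -- no VARCHAR/TEXT/BLOB, so every row is the same width

SHOW TABLE STATUS LIKE 'session_log';   -- Row_format should report 'Fixed'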
Assuming you are using innodb_file_per_table, keeping the text/blob in a separate table will increase the likelihood that the file system caching will be used, since the table will be smaller.
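A minimal sketch of that split design, assuming made-up table names: the narrow, frequently read table stays small, and the TEXT lives in a sibling table keyed by the same id.

CREATE TABLE article (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    title VARCHAR(200) NOT NULL,
    author_id INT UNSIGNED NOT NULL,
    published_at DATETIME NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE article_body (
    article_id INT UNSIGNED NOT NULL,
    body MEDIUMTEXT NOT NULL,
    PRIMARY KEY (article_id),
    CONSTRAINT fk_article_body FOREIGN KEY (article_id) REFERENCES article (id)
) ENGINE=InnoDB;

-- Fetch the body only when it is actually needed:
SELECT id, title FROM article WHERE author_id = 42;
SELECT body FROM article_body WHERE article_id = 123;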
That said, this is a micro optimization. There are many other things you can do to get much bigger performance gains. For example, use SSD drives. It's not going to give you enough of a performance boost to push out the day of reckoning when your tables get so big you'll have to implement sharding.
You don't hear about databases using the "raw file system" anymore even though it can be much faster. "Raw" is when the database accesses the disk hardware directly, bypassing any file system. I think Oracle still supports this. But it's just not worth the added complexity, and you have to really know what you are doing. In my opinion, storing your text/blob in a separate table just isn't worth the added complexity for the possible performance gain. You really need to know what you are doing, and your access patterns, to take advantage of it.

Related

In which cases would MySQL InnoDB's COMPACT row format be faster/better than REDUNDANT?

I saw many comparisons on Stack Overflow, DBA and Server Fault, but when it comes to performance in specific situations, it's never actually clear whether to use the COMPACT or REDUNDANT row format with InnoDB.
Are there any cases in which some simple tables would get a definite performance boost? For example, a simple relational table user_roles that maps a users table to a roles table, using two integers that will always make a row the same size on disk?
If that's not a good example, are there good examples that would make a clear difference?
Thanks!
The COMPACT format is slightly better at storing field lengths. REDUNDANT stores a length for every field, even if it's a fixed-size INT. COMPACT, however, stores lengths only for variable-length fields.
IMO that contributes so little to the performance difference that it doesn't make sense to bother with formats.
YMMV.
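If you do want to benchmark the two yourself, here is a hedged sketch (the table name is hypothetical) of how to check which row format a table uses and how to switch it:

SELECT table_name, row_format
FROM information_schema.tables
WHERE table_schema = DATABASE();

ALTER TABLE user_roles ROW_FORMAT=COMPACT;    -- rebuilds the table
ALTER TABLE user_roles ROW_FORMAT=REDUNDANT;  -- likewise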
I'm pretty sure this is a case where there is no easy categorization of schemas into "this schema would be better (worse) using that row format".
You did not mention DYNAMIC or COMPRESSED, which were introduced in later versions. Let me give my opinion on all 4 formats...
Better would be to see if your code takes advantage of any of the row formats. The main difference has to do with the handling of TEXT (etc) columns. SELECT * asks for all columns to be returned, but if you specify all but the text columns, the query will probably run a lot faster because the off-row columns don't need to be fetched.
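A hedged illustration of that point (table and column names are invented): naming only the columns you need lets InnoDB skip fetching the off-page TEXT data entirely.

-- No TEXT column touched; off-page data is never read:
SELECT id, title, author_id, published_at
FROM post
WHERE author_id = 42;

-- The off-page body is fetched only for the single row that needs it:
SELECT id, title, body
FROM post
WHERE id = 123;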
Having the first 767 bytes of a column included in the main part of the row may provide some speedup. But this depends on what you are doing that could use only the first part of the text, and whether there is an optimization to actually take advantage of the particular case. Note: "prefix indexes" (e.g., INDEX my_text(44)) are useful only in a few cases.
COMPRESSED is of dubious utility -- there is a lot of overhead involved in having both the compressed and uncompressed copies in the buffer_pool. I have trouble imagining a situation where compressed is clearly better. And the compression rate is typically only 2:1. Ordinary text compression is 3:1. If you have big text/blob strings, I think it is better to do client compression (to cut back on network bandwidth) of selected TEXT columns and store into BLOBs.
If you have a table with only two integers in it, there will be a lot of overhead. See SHOW TABLE STATUS; it will probably say Avg_row_length of maybe 40 bytes. And even more if that does not include a PRIMARY KEY.
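A quick, hedged way to see that overhead for yourself on such a two-integer table (names invented):

CREATE TABLE user_roles (
    user_id INT UNSIGNED NOT NULL,
    role_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, role_id)
) ENGINE=InnoDB;

SHOW TABLE STATUS LIKE 'user_roles';   -- look at Avg_row_length and Data_free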
Furthermore, due to various other things, it would be difficult to see much difference in "faster/better" unless the table has over, say, a million rows.
Bottom line: Go with the default for your version. Don't lose sleep over the decision.
For performance, focus on indexes and the formulation of queries. For space, test your table and report back. Be aware that "free space" comes in many flavors, most of which are not reported anywhere, so the size numbers can change if you sneeze at the table.

Strategies for large databases with changing schemas

We have a mysql database table with hundreds of millions of rows. We run into issues with performing any kind of operation on it. For example, adding columns is becoming impossible to do with any kind of predictable time frame. When we want to roll out a new column, the "ALTER TABLE" command takes forever, so we don't have a good idea as to what the maintenance window is.
We're not tied to keeping this data in mysql, but I was wondering if there are strategies for mysql or databases in general, for updating schemas for large tables.
One idea, which I don't particularly like, would be to create a new table with the old schema plus the additional column, and run queries against a view which unions the results until all data could be moved to the new table schema.
Right now we already run into issues where deleting large numbers of records based on a WHERE clause exits with an error.
Ideas?
In MySQL, you can create a new table using an entity-attribute-value model. This would have one row per entity and attribute, rather than putting the attribute in a new column.
This is particularly useful for sparse data. Words of caution: types are problematic (everything tends to get turned into strings) and you cannot define foreign key relationships.
EAV models are particularly useful for sparse values -- when you have attributes that only apply to a minimal number of rows. They may prove useful in your case.
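A minimal EAV sketch along those lines (names invented), with the caveats already mentioned: values tend to become strings, and foreign keys to the attribute values cannot be enforced.

CREATE TABLE entity_attribute (
    entity_id  BIGINT UNSIGNED NOT NULL,
    attr_name  VARCHAR(64) NOT NULL,
    attr_value VARCHAR(255) NULL,
    PRIMARY KEY (entity_id, attr_name)
) ENGINE=InnoDB;

-- "Adding a column" becomes inserting rows instead of running ALTER TABLE:
INSERT INTO entity_attribute (entity_id, attr_name, attr_value)
VALUES (1001, 'new_flag', 'true');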
In NOSQL data models, adding new attributes or lists of attributes is simpler. However, there is no relationship to the attributes in other rows.
Columnar databases (at least the one in MariaDB) are very frugal on space -- some say 10x smaller than InnoDB. The shrinkage alone may be well worth it for 100M rows.
You have not explained whether your data is sparse. If it is, then JSON is not that costly for space -- you completely leave out any 'fields' that are missing: zero space. With almost any other approach, there is at least some overhead for missing cells.
As you suggest, use regular columns for common fields. But also for the main fields that you are likely to search on. Then throw the rest into JSON.
I like to compress (in the client) the JSON string and store it in a BLOB. This gives 3x shrinkage over using uncompressed TEXT.
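A hedged sketch of that hybrid layout (names invented): searchable fields as real columns, the sparse rest as one compressed JSON document in a BLOB. The answer recommends compressing in the client; MySQL's COMPRESS()/UNCOMPRESS() is used here only to keep the example self-contained.

CREATE TABLE item (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    sku VARCHAR(40) NOT NULL,
    price DECIMAL(10,2) NOT NULL,
    extra BLOB,                      -- compressed JSON for the sparse fields
    PRIMARY KEY (id),
    KEY idx_sku (sku)
) ENGINE=InnoDB;

INSERT INTO item (sku, price, extra)
VALUES ('AB-123', 19.99, COMPRESS('{"color":"red","weight_g":250}'));

SELECT sku, price, UNCOMPRESS(extra) AS extra_json
FROM item
WHERE sku = 'AB-123';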
I dislike the one-row per attribute EAV approach; it is very costly in space, JOINs, etc, etc.
[More thoughts] on EAV.
Do avoid ALTER whenever possible.

Smaller VARCHAR column on MySQL better for indexing?

If a column that is indexed in a MySQL table with the data type VARCHAR(255) can be brought down to say, VARCHAR(10), how much can that possibly improve performance for queries?
The key_len gets reduced if you take a look at an EXPLAIN statement, but I still don't have enough data/insight to understand how much of a performance improvement, if any at all, this would have.
One of the most basic optimizations is to design your tables to take as little space on the disk as possible. This can give huge improvements because disk reads are faster, and smaller tables normally require less main memory while their contents are being actively processed during query execution. Indexing also is a lesser resource burden if done on smaller columns.
MySQL supports a lot of different table types and row formats. For each table, you can decide which storage/index method to use. Choosing the right table format for your application may give you a big performance gain. See Chapter 8, "MySQL Storage Engines and Table Types."
You can get better performance on a table and minimize storage space using the techniques listed here:
One of them is:
The primary index of a table should be as short as possible. This makes identification of each row easy and efficient.
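As a hedged illustration of both points (names invented; exact key_len depends on character set and nullability): a narrower indexed column, or a prefix index, means fewer bytes per index entry, so more entries fit in each index page.

CREATE TABLE account (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    code VARCHAR(10) NOT NULL,
    note VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),                 -- short primary key, as the manual advises
    KEY idx_code (code),
    KEY idx_note_prefix (note(10))    -- index only the first 10 characters
) ENGINE=InnoDB;

EXPLAIN SELECT id FROM account WHERE code = 'ABC';
-- With utf8mb4, idx_code entries cost roughly 10*4+2 bytes each, versus
-- roughly 255*4+2 bytes for an index on the full VARCHAR(255) column.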

MySQL Performance: Single table or multiple tables

I have 8 sets of data of about 30,000 rows each; the data has the same structure, just for different languages.
The front end of the site will get relatively high traffic.
So my question is regarding MySQL performance: should I have a single table with one column to distinguish which set the data belongs to (i.e. a column "language"), or create individual tables for each language set?
(An explanation of why, if possible, would be really helpful.)
Thanks in advance
Shadi
I would go with the single table design. Seek time, with a proper index, should be exactly the same, no matter how "wide" the table is.
Apart from performance issues, this will simplify design and relations with other tables (foreign keys etc).
Another drawback to the "one table per language" design is that you have to change your schema every time you add one.
A language column means you just have to add data, which is not intrusive. That is the way to go.
I'd go with the one-table design too. Since the cardinality of the language_key is very low, I'd partition the table over language_key instead of defining an index (if your database supports it).
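A hedged sketch of that idea (table name and language codes are hypothetical), using MySQL LIST COLUMNS partitioning; note that MySQL requires the partitioning column to be part of every unique key, including the primary key.

CREATE TABLE translation (
    id INT UNSIGNED NOT NULL,
    language CHAR(2) NOT NULL,
    label VARCHAR(255) NOT NULL,
    PRIMARY KEY (id, language)
) ENGINE=InnoDB
PARTITION BY LIST COLUMNS (language) (
    PARTITION p_en VALUES IN ('en'),
    PARTITION p_de VALUES IN ('de'),
    PARTITION p_fr VALUES IN ('fr')
);

-- A query that filters on language only touches the matching partition:
SELECT label FROM translation WHERE language = 'de' AND id = 42;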
I agree with the other responses - I'd use a single table. With regards to performance optimization, a number of things have the potential to make a bigger impact on performance:
appropriate indexing
writing/testing for query efficiency
choosing appropriate storage engine(s)
the hardware
type and configuration of the filesystem(s)
optimizing mysql configuration settings
... etc. I'm a fan of High Performance MySQL

What is the optimal amount of data for a table?

How much data should be in a table so that reading is optimal? Assuming that I have 3 fields varchar(25). This is in MySQL.
I would suggest that you consider the following in optimizing your database design:
Consider what you want to accomplish with the database. Will you be performing a lot of inserts to a single table at very high rates? Or will you be performing reporting and analytical functions with the data?
Once you've determined the purpose of the database, define what data you need to store to perform whatever functions are necessary.
Normalize till it hurts. If you're performing transaction processing (the most common function for a database) then you'll want a highly normalized database structure. If you're performing analytical functions, then you'll want a more denormalized structure that doesn't have to rely on joins to generate report results.
Typically, if you've really normalized the structure till it hurts then you need to take your normalization back a step or two to have a data structure that will be both normalized and functional.
A normalized database is mostly pointless if you fail to use keys. Make certain that each table has a primary key defined. Don't use surrogate keys just because it's what you always see. Consider what natural keys might exist in any given table. Once you are certain that you have the right primary key for each table, you need to define your foreign key references. Establishing explicit foreign key relationships rather than relying on implicit definition will give you a performance boost, provide integrity for your data, and self-document the database structure.
Look for other indexes that exist within your tables. Do you have a column or set of columns that you will search against frequently like a username and password field? Indexes can be on a single column or multiple columns so think about how you'll be querying for data and create indexes as necessary for values you'll query against.
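A small, hedged example of that advice (names invented): declare indexes that match how the application actually looks rows up, so the common queries do not scan the table.

CREATE TABLE app_user (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    username VARCHAR(64) NOT NULL,
    password_hash CHAR(60) NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY idx_username (username),                -- single-column index
    KEY idx_created_username (created_at, username)    -- multi-column example
) ENGINE=InnoDB;

-- The login lookup is resolved through idx_username:
SELECT id, password_hash FROM app_user WHERE username = 'alice';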
Number of rows should not matter. Make sure the fields you're searching on are indexed properly. If you only have 3 varchar(25) fields, then you probably need to add a primary key that is not a varchar.
Agree that you should ensure that your data is properly indexed.
Apart from that, if you are worried about table size, you can always implement some type of data archival strategy later down the line.
Don't worry too much about this until you see problems cropping up, and don't optimise prematurely.
For optimal reading you should have an index. A table exists to hold the rows it was designed to contain. As the number of rows increases, the value of the index comes into play and reading remains brisk.
Phrased as such, I don't know how to answer this question. An indexed table of 100,000 records is faster than an unindexed table of 1,000.
What are your requirements? How much data do you have? Once you know the answer to these questions you can make decisions about indexing and/or partitioning.
This is a very loose question, so a very loose answer :-)
In general if you do the basics - reasonable normalization, a sensible primary key and run-of-the-mill queries - then on today's hardware you'll get away with most things on a small to medium sized database - i.e. one with the largest table having less than 50,000 records.
However, once you get past 50k - 100k rows, which roughly corresponds to the point when the rdbms is likely to become memory constrained, then unless you have your access paths set up correctly (i.e. indexes), performance will start to fall off catastrophically. That is catastrophically in the mathematical sense - in such scenarios it's not unusual to see performance deteriorate by an order of magnitude or two for a doubling in table size.
Obviously therefore the critical table size at which you need to pay attention will vary depending upon row size, machine memory, activity and other environmental issues, so there is no single answer, but it is well to be aware that performance generally does not degrade gracefully with table size and plan accordingly.
I have to disagree with Cruachan about "50k - 100k rows .... roughly correspond(ing) to the point when the rdbms is likely to be memory constrained". This blanket statement is just misleading without two additional data points: the approximate size of the row and the available memory. I'm currently developing a database to find the longest common subsequence (a la bio-informatics) of lines within source code files, and reached millions of rows in one table, even with a VARCHAR field of close to 1000, before it became memory constrained. So, with proper indexing and sufficient RAM (a gig or two), as regards the original question, with rows of 75 bytes at most, there is no reason why the proposed table couldn't hold tens of millions of records.
The proper amount of data is a function of your application, not of the database. There are very few cases where a MySQL problem is solved by breaking a table into multiple subtables, if that's the intent of your question.
If you have a particular situation where queries are slow, it would probably be more useful to discuss how to improve that situation by modifying the query or the table design.