varchar(max) everywhere? - sql-server-2008

Is there any problem with making all your Sql Server 2008 string columns varchar(max)? My allowable string sizes are managed by the application. The database should just persist what I give it. Will I take a performance hit by declaring all string columns to be of type varchar(max) in Sql Server 2008, no matter what the size of the data that actually goes into them?

By using VARCHAR(MAX) you are basically telling SQL Server "store the values in this field how you see best"; SQL Server will then choose whether to store values as a regular VARCHAR or as a LOB (large object). In general, if the values stored are less than 8,000 bytes, SQL Server will treat them as a regular VARCHAR type.
If the values stored are too large then the column is allowed to spill off the page into LOB pages, exactly as they do for the other LOB types (text, ntext and image) - if this happens then additional page reads are required to read the data stored in the additional pages (i.e. there is a performance penalty), however this only happens if the values stored are too large.
In fact, under SQL Server 2005 or later, data can overflow onto additional pages even with the length-limited variable types (e.g. VARCHAR(3000)), however these pages are called row-overflow data pages and are treated slightly differently.
Short version: from a storage perspective there is no disadvantage of using VARCHAR(MAX) over VARCHAR(N) for some N.
(Note that this also applies to the other variable-length field types NVARCHAR and VARBINARY)
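To see this for yourself, here is a minimal sketch (the table names are made up for illustration) showing that the same short value stored in a VARCHAR(100) and a VARCHAR(MAX) column lands in ordinary in-row pages either way:

CREATE TABLE dbo.DemoN   (Val VARCHAR(100));
CREATE TABLE dbo.DemoMax (Val VARCHAR(MAX));

INSERT INTO dbo.DemoN   VALUES (REPLICATE('x', 100));
INSERT INTO dbo.DemoMax VALUES (REPLICATE('x', 100));

-- Both tables should report only IN_ROW_DATA allocation units here;
-- re-run with REPLICATE(CAST('x' AS VARCHAR(MAX)), 20000) in DemoMax
-- and a LOB_DATA allocation unit appears.
SELECT alloc_unit_type_desc, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.DemoMax'), NULL, NULL, 'DETAILED');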
FYI - you can't create indexes on VARCHAR(MAX) columns (they can only appear as INCLUDEd columns in a nonclustered index)

Index keys cannot be over 900 bytes wide, for one. So you can probably never create an index. If your data is less than 900 bytes, use varchar(900).
This is one downside: it gives really bad searching performance and allows no unique constraints.
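A quick sketch of the index restriction (table and index names are hypothetical): a MAX column can't be an index key, but it can ride along as an included column.

CREATE TABLE dbo.Customer (
    Id   INT IDENTITY PRIMARY KEY,
    Name VARCHAR(MAX) NOT NULL
);

-- Fails: MAX types are invalid as index key columns.
CREATE INDEX IX_Customer_Name ON dbo.Customer (Name);

-- Works: MAX columns may appear as included (leaf-level) columns only.
CREATE INDEX IX_Customer_Covering ON dbo.Customer (Id) INCLUDE (Name);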

Simon Sabin wrote a post on this some time back. I don't have the time to grab it now, but you should search for it, because he comes up with the conclusion that you shouldn't use varchar(max) by default.
Edited: Simon's got a few posts about varchar(max). The links in the comments below show this quite nicely. I think the most significant one is http://sqlblogcasts.com/blogs/simons/archive/2009/07/11/String-concatenation-with-max-types-stops-plan-caching.aspx, which talks about the effect of varchar(max) on plan caching. The general principle is to be careful. If you don't need it to be max, then don't use max - if you need more than 8000 characters, then sure... go for it.

For this question specifically, a few points I don't see mentioned:
On 2005/2008/2008 R2, if a LOB column is included in an index this will block online index rebuilds (see the sketch after this list).
On 2012 the online index rebuild restriction is lifted, but LOB columns cannot participate in the new "Adding NOT NULL Columns as an Online Operation" functionality.
Locks can be held longer on rows containing columns of this data type. (more)
A couple of other reasons are covered in my answer as to why not varchar(8000) everywhere.
Your queries may end up requesting huge memory grants not justified by the size of data.
On tables with triggers it can prevent an optimisation where versioning tags are not added.
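To make the online rebuild point above concrete, a sketch (the table name is hypothetical):

-- On 2005/2008/2008 R2 this fails if an index involved contains a LOB
-- column such as VARCHAR(MAX); you'd have to fall back to ONLINE = OFF.
-- On 2012+ the same rebuild is allowed online.
ALTER INDEX ALL ON dbo.Documents REBUILD WITH (ONLINE = ON);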

I asked a similar question earlier and got some interesting replies; check it out here.
There was one site with a post about the detriment of using wide columns; however, if your data is limited in the application, my testing disproved it.
The fact you can't create indexes on the columns means I wouldn't use them all the time (personally I wouldn't use them that much at all, but I'm a bit of a purist in that regard).
However, if you know there isn't much stored in them, I don't think they are that bad.
If you do any sorting on a recordset with a varchar(max) in it (or any wide char or varchar column), then you could suffer performance penalties. These could be resolved (if required) by indexes, but you can't put indexes on varchar(max).
If you want to future-proof your columns, why not just put them at something reasonable, e.g. a name column at 255 characters instead of max... that kinda thing.

There is another reason to avoid using varchar(max) on all columns. For the same reason we use check constraints (to avoid filling tables with junk caused by errant software or user entries), we would want to guard against any faulty process that adds much more data than intended. For example, if someone or something tried to add 3,000 bytes into a City field, we would know for certain that something is amiss and would want to stop the process dead in its tracks to debug it at the earliest possible point. We would also know that a 3000-byte city name could not possibly be valid and would mess up reports and such if we tried to use it.
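A minimal sketch of that guard rail (the table and column names are invented): declaring a sane limit makes the faulty process fail at the INSERT instead of polluting the data.

CREATE TABLE dbo.Address (
    Id   INT IDENTITY PRIMARY KEY,
    City VARCHAR(50) NOT NULL  -- a sane domain limit instead of MAX
);

-- Fails immediately ("String or binary data would be truncated"),
-- surfacing the errant process at the earliest possible point.
INSERT INTO dbo.Address (City) VALUES (REPLICATE('x', 3000));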

Ideally, you should only allow what you need. Meaning if you're certain a particular column (say a username column) is never going to be more than 20 characters long, using a VARCHAR(20) vs. a VARCHAR(MAX) lets the database optimize queries and data structures.
From MSDN:
http://msdn.microsoft.com/en-us/library/ms176089.aspx
Variable-length, non-Unicode character data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes.
Are you really ever going to come close to 2^31-1 bytes for these columns?

Related

MySQL best way to store long strings

I'm looking for some advice on the best way to store long strings of data from the mySQL experts.
I have a general purpose table which is used to store any kind of data, by which I mean it should be able to hold alphanumeric and numeric data.
Currently, the table structure is simple with an ID and the actual data stored in a single column as follows:
id INT(11)
data VARCHAR(128)
I now have a requirement to store a larger amount of data (up to 500 characters) and am wondering whether the best way would be to simply increase the varchar column size, or whether I should add a new column (a TEXT type column?) for the times I need to store longer strings.
If any experts out there have any advice, I'm all ears!
My preferred method would be to simply increase the varchar column, but that's because I'm lazy.
The mySQL version I'm running is 5.0.77.
I should mention the new 500 character requirement will only be for the odd record; most records in the table will be not longer than 50 characters.
I thought I'd be future-proofing by making the column 128. Shows how much I knew!
Generally speaking, this is not a question that has a "correct" answer. There is no "infinite length" text storage type in MySQL. You could use LONGTEXT, but that still has an (absurdly high) upper limit. Yet if you do, you're kicking your DBMS in the teeth for having to deal with that absurd blob of a column for your 50-character text. Not to mention the fact that you hardly do anything with it.
So, most futureproofness(TM) is probably offered by LONGTEXT. But it's also a very bad method of resolving the issue. Honestly, I'd revisit the application requirements. Storing strings that have no "domain" (as in, being well-defined in their application) and arbitrary length is not one of the strengths of RDBMS.
If I wanted to solve this on the "application design" level, I'd use a NoSQL key-value store for this (and I'm as anti-NoSQL-hype as they get, so you know it's serious), even though I recognize it's a rather expensive solution for such a minor change. But if this is an indication of what your DBMS is eventually going to hold, it might be more prudent to switch now to avoid this same problem a hundred times in the future. Data domain is very important in RDBMS, whereas it's explicitly sidelined in non-relational solutions, which seems to be what you're trying to solve here.
Stuck with MySQL? Just increase it to VARCHAR(1000). If you have no requirements for your data, it's irrelevant what you do anyway.
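For reference, widening the existing column is a one-liner (my_table stands in for the real table name, which the question doesn't give):

ALTER TABLE my_table MODIFY data VARCHAR(1000);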
Be careful if using TEXT. TEXT data is not stored in the database server's memory; whenever you query TEXT data, MySQL has to read it from disk, which is much slower in comparison with CHAR and VARCHAR, and it cannot make full use of indexes. A better way to store very long strings may be a NoSQL database.
We can use varchar(<maximum_limit>). The maximum limit that we can pass is 65,535 bytes.
Note: this maximum is really a row-size limit shared among all of a table's columns (TEXT/BLOB columns count only a few bytes towards it), and the effective maximum length of a VARCHAR also depends on the character set used.
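A short sketch of that limit (single-byte character set assumed so characters map 1:1 to bytes):

CREATE TABLE big_ok  (data VARCHAR(65000)) CHARACTER SET latin1; -- fits under the 65,535-byte row limit
CREATE TABLE big_bad (data VARCHAR(70000)) CHARACTER SET latin1; -- fails: MySQL suggests BLOB or TEXT instead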

SQL Server maximum 8KB per row?

I just happened to read the Maximum Capacity Specification for SQL Server 2008 and saw a maximum of 8,060 bytes per row? What the... Only 8KB per row allowed? (Yes, I saw "row-overflow storage" special handling, I'm talking about standard behavior)
Did I misunderstand something here? I'm sure I have, because I'm sure I saw binary objects with several MB sizes stored inside SQL Server databases. Does this ominous per row really mean a table row as in one row, multiple columns?
So when I have three nvarchar(4000) columns in one row (suppose three legal documents written in textboxes...) - the server spits out a warning?
Yes, you'll get a warning on CREATE TABLE, an error on INSERT or UPDATE
LOB types (nvarchar(max), varchar(max) and varbinary(max)) allow 2GB-1 bytes, which is how you'd store large chunks of data and is what you'd have seen before.
For a single field > 4000 characters/8000 bytes I'd use nvarchar(max)
For 3 x nvarchar(4000) in one row I'd consider one of:
my design is wrong
nvarchar(max) for one or more columns
1:1 child table for the "least populated" columns
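To make the warning concrete, a sketch of the scenario from the question (the table name is invented):

-- Creates successfully; SQL Server may warn that the maximum possible
-- row size (3 x 8,000 bytes plus overhead) exceeds 8,060 bytes.
CREATE TABLE dbo.LegalDocs (
    Doc1 NVARCHAR(4000),
    Doc2 NVARCHAR(4000),
    Doc3 NVARCHAR(4000)
);
-- On 2008, an oversized row can be handled by pushing variable-length
-- data to row-overflow pages (see the next answer).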
2008 will handle the overflow while in 2000, it would simply refuse to insert a record that overflowed. However, it is still best to design with this in mind, because a significant number of overflowing records might cause performance issues in querying. In the case you described, I might consider a related table with a column for document type, a large field for the document, and a foreign key to the initial table. If, however, it is unlikely that all three columns would be filled in the same record, or filled to their max values, then the design might be fine. You have to know your data to determine which is best. Another consideration is to continue as you have now until you have problems and then move to a separate document table. You could even refactor by renaming the existing table, creating a new one, and then creating a view with the existing table name that pulls the data from the new structure. This could keep a lot of your code from breaking, although you would still have to adjust any insert or update statements.

MySQL: Why use VARCHAR(20) instead of VARCHAR(255)? [duplicate]

Possible Duplicate:
Are there disadvantages to using a generic varchar(255) for all text-based fields?
In MYSQL you can choose a length for the VARCHAR field type. Possible values are 1-255.
But what are its advantages if you use VARCHAR(255) that is the maximum instead of VARCHAR(20)? As far as I know, the size of the entries depends only on the real length of the inserted string.
size (bytes) = length+1
So if you have the word "Example" in a VARCHAR(255) field, it would have 8 bytes. If you have it in a VARCHAR(20) field, it would have 8 bytes, too. What is the difference?
I hope you can help me. Thanks in advance!
Check out: Reference for Varchar
In short there isn't much difference unless you go over the size of 255 in your VARCHAR which will require another byte for the length prefix.
The length indicates more of a constraint on the data stored in the column than anything else. This inherently constrains the MAXIMUM storage size for the column as well. IMHO, the length should make sense with respect to the data. If you're storing a Social Security # it makes no sense to set the length to 128, even though it doesn't cost you anything in storage if all you actually store is an SSN.
There are many valid reasons for choosing a value smaller than the maximum that are not related to performance. Setting a size helps indicate the type of data you are storing and also can also act as a last-gasp form of validation.
For instance, if you are storing a UK postcode then you only need 8 characters. Setting this limit helps make clear the type of data you are storing. If you chose 255 characters it would just confuse matters.
I don't know about mySQL but in SQL Server it will let you define fields such that the total number of bytes used is greater than the total number of bytes that can actually be stored in a record. This is a bad thing. Sooner or later you will get a row where the limit is reached and you cannot insert the data.
It is far better to design your database structure to consider row size limits.
Additionally yes, you do not want people to put 200 characters in a field where the maximum value should be 10. If they do, it is almost always bad data.
You say, well I can limit that at the application level. But data does not get into the database just from one application. Sometimes multiple applications use it, sometimes data is imported and sometimes it is fixed manually from the query window (update all the records to add 10% to the price for instance). If any of these other sources of data don't know about the rules you put in your application, you will have bad, useless data in your database. Data integrity must be enforced at the database level (which doesn't stop you from also checking before you try to enter data) or you have no integrity. Plus it has been my experience that people who are too lazy to design their database are often also too lazy to actually put the limits into the application and there is no data integrity check at all.
They have a word for databases with no data integrity - useless.
There is a semantic difference (and I believe that's the only difference): if you try to fill 30 non-space characters into varchar(20), it will produce an error, whereas it will succeed for varchar(255). So it is primarily an additional constraint.
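A sketch of that constraint behaviour (assuming MySQL strict SQL mode; the table and column names are invented):

CREATE TABLE t (short_col VARCHAR(20), long_col VARCHAR(255));

INSERT INTO t (long_col)  VALUES (REPEAT('a', 30)); -- succeeds
INSERT INTO t (short_col) VALUES (REPEAT('a', 30)); -- fails in strict mode
-- ("Data too long for column"); in non-strict mode it is silently truncated.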
Well, if you want to allow for a larger entry, or limit the entry size perhaps.
For example, you may have first_name as a VARCHAR(20), but perhaps street_address as a VARCHAR(50), since 20 may not be enough space. At the same time, you may want to control how large that value can get.
In other words, you have set a ceiling of how large a particular value can be, in theory to prevent the table (and potentially the index/index entries) from getting too large.
You could just use CHAR, which is fixed-width as well, but unlike VARCHAR, which can be smaller, CHAR pads the values (although this can make for quicker access).
From a database performance perspective, I do not believe there is going to be a difference.
However, I think a lot of the decision on the length to use comes down to what you are trying to accomplish and documenting the system to accept just the data that it needs.

When setting MySQL schema, why use certain types?

When I'm setting up a MySQL table, it asks me to define the name of the column, type of input, and length. My assumption, without having read anything about it, is that it's for minimization. Specify the smallest possible int/smallint/tinyint for your needs, and it will reduce overhead of some sort. If it's all positives, make it unsigned to double your positive range, etc.
What happens if I just make every field a VARCHAR of 200 characters? When/why is this bad, what will I miss out on, and when will any inefficiencies manifest themselves? 100k records?
I think about this every time I set up a DB, but I haven't built anything to scale enough where I've ever had my schema set up inappropriately, either too "strict/small" or "loose/big". Can someone confirm that I'm making good assumptions about speed and efficiency?
Thanks!
Data types not only optimize storage, but how data is indexed. As your databases get bigger, it will become apparent that it's quicker to search for all the records that have a 1 in an integer field than those that have a "1" in a varchar field. This becomes especially important when you're joining data from more than one table and your database engine is having to do this sort of thing repeatedly. (Daren also rightly points out below that it's important that the types of the fields you're matching on are identical as well.)
The level at which these inefficiencies become an issue depends greatly on your hardware and your application design. We have big enough iron these days that if you're building moderate-scale apps, you may not see an appreciable difference. (Aside from feeling a little bit guilty about your database design!) But establishing good habits on small projects makes the bigger ones easier when they come along.
If you have two columns as varchar and put in the values 10 and 20, concatenating them gives you 1020 instead of the 30 you'd likely expect from adding them as numbers.
Sure, you could save everything as VARCHAR strings. But you'd be giving up a lot of functionality provided by the database engine.
You should choose the database type that most closely matches the intended use of the column. For example, using DATE or DATETIME to store dates provides you with all sorts of date/time functions that you don't get with basic VARCHAR types.
Likewise, fields used to count things or provide simple unique IDs should be INT or one of its related types. Also bear in mind that an INT occupies only 4 bytes, whereas a 9-digit string uses at least 9 bytes.
For character data, it's wise to use NVARCHAR for internationalized values that users in any locale are going to enter (esp. names and locations). If you know the text is limited to US or internal use only, VARCHAR is safe.
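A small sketch pulling these points together (the table and column names are invented):

CREATE TABLE page_view (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    viewed_at  DATETIME NOT NULL,                -- unlocks date/time functions
    view_count INT UNSIGNED NOT NULL DEFAULT 0,  -- 4 bytes vs. 9+ for a digit string
    page_url   VARCHAR(255) NOT NULL
);

-- Date arithmetic that VARCHAR storage would not support directly:
SELECT * FROM page_view WHERE viewed_at >= NOW() - INTERVAL 7 DAY;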

char vs varchar for performance in stock database

I'm using mySQL to set up a database of stock options. There are about 330,000 rows (each row is 1 option). I'm new to SQL so I'm trying to decide on the field types for things like option symbol (varies from 4 to 5 characters), stock symbol (varies from 1 to 5 characters), company name (varies from 5 to 60 characters).
I want to optimize for speed: both creating the database (which happens every 5 minutes as new price data comes out -- I don't have a real-time data feed, but it's near real-time in that I get a new text file with 330,000 rows delivered to me every 5 minutes; this new data completely replaces the prior data), and also lookup speed (there will be a web-based front end where many users can run ad hoc queries).
If I'm not concerned about space (since the db lifetime is 5 minutes, and each row contains maybe 300 bytes, so maybe 100MBs for the whole thing) then what is the fastest way to structure the fields?
Same question for numeric fields, actually: Is there a performance difference between int(11) and int(7)? Does one length work better than another for queries and sorting?
Thanks!
In MyISAM, there is some benefit to making fixed-width records. VARCHAR is variable width. CHAR is fixed-width. If your rows have only fixed-width data types, then the whole row is fixed-width, and MySQL gains some advantage calculating the space requirements and offset of rows in that table. That said, the advantage may be small and it's hardly worth a possible tiny gain that is outweighed by other costs (such as cache efficiency) from having fixed-width, padded CHAR columns where VARCHAR would store more compactly.
The breakpoint where it becomes more efficient depends on your application, and this is not something that can be answered except by you testing both solutions and using the one that works best for your data under your application's usage.
Regarding INT(7) versus INT(11), this is irrelevant to storage or performance. It is a common misunderstanding that MySQL's argument to the INT type has anything to do with size of the data -- it doesn't. MySQL's INT data type is always 32 bits. The argument in parentheses refers to how many digits to pad if you display the value with ZEROFILL. E.g. INT(7) will display 0001234 where INT(11) will display 00000001234. But this padding only happens as the value is displayed, not during storage or math calculation.
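You can verify the display-only effect yourself with a throwaway sketch:

CREATE TABLE demo (a INT(7) ZEROFILL, b INT(11) ZEROFILL);
INSERT INTO demo VALUES (1234, 1234);
SELECT * FROM demo; -- a: 0001234, b: 00000001234; both are stored as 4-byte INTs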
If the actual data in a field can vary a lot in size, varchar is better because it leads to smaller records, and smaller records mean a faster DB (more records can fit into cache, smaller indexes, etc.). For the same reason, using smaller ints is better if you need maximum speed.
OTOH, if the variance is small, e.g. a field has a maximum of 20 chars, and most records actually are nearly 20 chars long, then char is better because it allows some additional optimizations by the DB. However, this really only matters if it's true for ALL the fields in a table, because then you have fixed-size records. If speed is your main concern, it might even be worth it to move any non-fixed-size fields into a separate table, if you have queries that use only the fixed-size fields (or if you only have shotgun queries).
In the end, it's hard to generalize because a lot depends on the access patterns of your actual app.
Given your system constraints, I would suggest a varchar, since anything you do with the data will have to accommodate whatever padding you put in place to make use of a fixed-width char. This means more code somewhere, which is more to debug and more potential for errors. That being said:
The major bottleneck in your application is due to dropping and recreating your database every five minutes. You're not going to get much performance benefit out of microenhancements like choosing char over varchar. I believe you have some more serious architectural problems to address instead. – Princess
I agree with the above comment. You have bigger fish to fry in your architecture before you can afford to worry about the difference between a char and varchar. For one, if you have a web user attempting to run an ad hoc query and the database is in the process of being recreated, you are going to get errors (i.e. "database doesn't exist" or simply "timed out" type issues).
I would suggest that instead you build (at the least) a quote table for the most recent quote data (with a time stamp), a ticker symbol table and a history table. Your web users would query against the ticker table to get the most recent data. If a symbol comes over in your 5-minute file that doesn't exist, it's simple enough to have the import script create it before posting the new info to the quote table. All others get updated and queries default to the current day's data.
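A rough sketch of that structure (names and types are illustrative only):

CREATE TABLE ticker (
    id     INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    symbol VARCHAR(5)  NOT NULL UNIQUE,
    name   VARCHAR(60) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE quote (
    ticker_id INT UNSIGNED   NOT NULL,
    quoted_at DATETIME       NOT NULL,  -- time stamp of the 5-minute feed
    price     DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (ticker_id, quoted_at),
    FOREIGN KEY (ticker_id) REFERENCES ticker (id)
) ENGINE=InnoDB;

The import script then INSERTs new symbols into ticker as they appear and posts each file's rows to quote, rather than dropping and recreating anything.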
I would definitely not recreate the database each time. Instead I would do the following:
read in the update/snapshot file and create some object based on each row.
for each row, get the symbol/option name (which is unique) and update it in the database
If it were me, I would also have an in-memory cache of all the symbols and the current price data.
Price data is never an int - if you don't use a decimal type, you can fall back to characters.
The company name is probably not unique, as there are many options for a particular company. Companies should be an indexed table of their own, and you can save space by storing just the id of a company.
As someone else also pointed out - your web clients do not need to hit the actual database and run a query - you can probably just hit your cache. (Though that really depends on what tables and data you expose to your clients and what data they want.)
Having query access for other users is also a reason NOT to keep removing and creating a database.
Also remember that creating databases is subject to whatever actual database implementation you use. If you ever port from MySQL to, say, PostgreSQL, you will discover the very unpleasant fact that creating databases in PostgreSQL is a comparatively slow operation. It is orders of magnitude slower than reading and writing table rows, for instance.
It looks like there is an application design problem to address first, before you optimize for performance choosing proper data types.