I just happened to read the Maximum Capacity Specification for SQL Server 2008 and saw a maximum of 8,060 bytes per row. What the... only 8 KB per row allowed? (Yes, I saw the "row-overflow storage" special handling; I'm talking about standard behavior.)
Did I misunderstand something here? I'm sure I have, because I've definitely seen binary objects several MB in size stored inside SQL Server databases. Does this ominous "per row" really mean a table row, as in one row with multiple columns?
So when I have three nvarchar columns, each holding 4,000 characters (suppose three legal documents typed into textboxes...), the server spits out a warning?
Yes, you'll get a warning on CREATE TABLE and an error on INSERT or UPDATE.
LOB types (nvarchar(max), varchar(max) and varbinary(max)) allow up to 2^31-1 bytes (just under 2 GB), which is how you'd store large chunks of data, and is what you'll have seen before.
For a single field > 4000 characters/8000 bytes I'd use nvarchar(max)
For 3 x nvarchar(4000) in one row I'd consider one of the following (rough sketch after the list):
my design is wrong
nvarchar(max) for one or more columns
1:1 child table for the "least populated" columns
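As a rough sketch of the last two options combined (table and column names here are invented for illustration, not taken from the question), the overflow-prone documents move into a 1:1 child table and become nvarchar(max):

CREATE TABLE Cases (
    CaseId int NOT NULL PRIMARY KEY,
    Title  nvarchar(200) NOT NULL
);

CREATE TABLE CaseDocuments (
    CaseId    int NOT NULL PRIMARY KEY REFERENCES Cases (CaseId),  -- 1:1 via shared primary key
    Document1 nvarchar(max) NULL,
    Document2 nvarchar(max) NULL,
    Document3 nvarchar(max) NULL
);

This keeps the main row narrow, while rows that actually carry documents pay the LOB cost only when they need it.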
SQL Server 2008 will handle the overflow, while 2000 would simply refuse to insert a record that overflowed. However, it is still best to design with this in mind, because a significant number of overflowing records might cause performance issues when querying. In the case you described, I might consider a related table with a column for document type, a large field for the document, and a foreign key to the initial table. If, however, it is unlikely that all three columns would be filled in the same record, or filled to their maximum values, then the design might be fine. You have to know your data to determine which is best.

Another consideration is to continue as you are now until you have problems, and then replace the design with a separate document table. You could even refactor by renaming the existing table, creating a new one, and then creating a view with the existing table name that pulls the data from the new structure (sketched below). This could keep a lot of your code from breaking, although you would still have to adjust any insert or update statements.
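A hypothetical sketch of that rename-and-view refactor (all object and column names are mine, and it assumes CaseId is the primary key of the existing table):

EXEC sp_rename 'dbo.Cases', 'CasesBase';
GO

CREATE TABLE dbo.CaseDocuments (
    CaseId   int NOT NULL REFERENCES dbo.CasesBase (CaseId),
    DocType  varchar(50) NOT NULL,
    Document nvarchar(max) NOT NULL
);
GO

-- The view reuses the old table name so existing SELECTs keep working;
-- inserts and updates still need to target the new tables directly.
CREATE VIEW dbo.Cases
AS
SELECT b.CaseId, b.Title, d.DocType, d.Document
FROM dbo.CasesBase AS b
LEFT JOIN dbo.CaseDocuments AS d ON d.CaseId = b.CaseId;
GO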
I have a big table with over 100 million rows. I have been trimming it down for months, getting rid of bad data (row-wise), trying to keep it small. I already have 9 columns on this table, and I want to add a new boolean column to it. Below is the current state.
This table started off small, and now it's getting pretty wide. Yet again, I am tasked with adding more information per row; this time it's a new boolean field. I expect this field to be low-volume, meaning less than 10% of rows will have it set to true. I know I can make it default to null, and it is a boolean column, which should be small.
However, I wanted to get some advice. This table cannot get infinitely wide, and I will need to work around this. Under these circumstances, does it make more sense to create another table and reference the record with a foreign key when I have additional data to add? How do the pros handle this in database design?
The best situation for usability is to have all the data on the record, so that any query can read or compute from the table itself without joins. I just do not have confidence that it will scale to 1 BILLION rows (insert meme).
At my job I support MySQL instances that have multi-billion-row tables. At that scale, care must be taken to optimize queries properly; you don't want to do a table scan.
But that's about rows, not columns. You asked first about columns.
The way to choose between adding a column versus adding another table is to follow rules of database normalization. If the new column is for an attribute of the same entity as your current table, add the column to that table. If it's a multi-valued attribute or if it's really an attribute of some other entity, then add it to a different table.
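As a rough illustration of that rule (table and column names here are invented, not from the question), a single-valued attribute of the same entity becomes a column, while a multi-valued attribute gets its own table:

-- Same-entity attribute: just add the column.
ALTER TABLE orders ADD COLUMN is_gift BOOLEAN NOT NULL DEFAULT FALSE;

-- Multi-valued attribute: give it its own table instead of tag1, tag2, tag3 columns.
CREATE TABLE order_tags (
    order_id BIGINT NOT NULL,
    tag      VARCHAR(50) NOT NULL,
    PRIMARY KEY (order_id, tag),
    FOREIGN KEY (order_id) REFERENCES orders (order_id)
);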
Very, very rarely is it the right choice to make another table solely for the sake of having too many columns. A given MySQL table can have dozens of columns pretty easily, and hundreds if you're careful.
In theory, there is no limit to the number of columns that might be appropriate to put in the same table with respect to normalization. But there are limitations due to the code to store those columns in a given implementation (e.g. InnoDB storage engine in MySQL).
See https://www.percona.com/blog/2013/04/08/understanding-the-maximum-number-of-columns-in-a-mysql-table/
So the maximum number of columns for a table in MySQL is somewhere between 191 and 2829, depending on a number of factors.
In the comments on that blog, I was able to design a table that failed to be created at just 59 columns. Read the blog for details.
I have a situation where I have to create tables dynamically. Depending on some criteria I am going to vary the size of the columns of a particular table.
For that purpose I need to calculate the size of one row.
e.g.
If I am going to create the following table
CREATE TABLE sample(id int, name varchar(30));
then the formula would give me the size of a single row for the table above, considering all the overhead of storing a row in a MySQL table.
Is it possible to do so, and is it feasible?
It depends on the storage engine you use, the row format chosen for that table, and also your indexes. But it is not very useful information.
Edit:
I suggest going against normalization only when you know exactly what you're doing. A DBMS is created to deal with large amounts of data. You probably don't need to serialize your structured data into a single field.
Keep in mind that your application layer then has to tokenize (or worse) the serialized field data to get the original meaning back, which certainly has larger overhead than getting the data from the DB already in structured form.
The only exception I can think of is a client-heavy architecture, where moving processing to the client side actually takes burden off the server, and you would serialize your data anyway for the sake of the transfer. In server-side code (like PHP) it is not good practice to save serialized-style data into the DB.
(Though using PHP's built-in serialization may be a good idea in some cases, your current project does not seem to benefit from it.)
VARCHAR is a variable-length data type: it has a declared maximum length, but the stored value can be shorter or even empty, so any calculation may not be exact. Have a look at the 'Avg_row_length' field in information_schema.tables.
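For example, something along these lines reads that estimate for the sample table above; it includes storage-engine overhead, but it is only an average over the rows that already exist:

SELECT table_name, avg_row_length, data_length, table_rows
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'sample';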
Is there any problem with making all your SQL Server 2008 string columns varchar(max)? My allowable string sizes are managed by the application. The database should just persist what I give it. Will I take a performance hit by declaring all string columns to be of type varchar(max) in SQL Server 2008, no matter what the size of the data that actually goes into them?
By using VARCHAR(MAX) you are basically telling SQL Server "store the values in this field however you see best". SQL Server will then choose whether to store the values as a regular VARCHAR or as a LOB (large object). In general, if the values stored are less than 8,000 bytes, SQL Server will treat them as a regular VARCHAR type.
If the values stored are too large, then the column is allowed to spill off the page into LOB pages, exactly as it does for the other LOB types (text, ntext and image). If this happens, additional page reads are required to fetch the data stored in the extra pages (i.e. there is a performance penalty); however, this only happens when the values stored are too large.
In fact, under SQL Server 2008 or later, data can overflow onto additional pages even with the non-MAX, length-limited types (e.g. VARCHAR(3000)); however, these pages are called row-overflow data pages and are treated slightly differently.
Short version: from a storage perspective there is no disadvantage of using VARCHAR(MAX) over VARCHAR(N) for some N.
(Note that this also applies to the other variable-length field types NVARCHAR and VARBINARY)
FYI - You can't create indexes on VARCHAR(MAX) columns
Index keys cannot be over 900 bytes wide, for one. So you can probably never create an index on it. If your data is less than 900 bytes, use varchar(900).
This is one downside, because it gives:
really bad searching performance
no unique constraints
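To illustrate the key-size point (the table and index names here are invented, not from the answer): a bounded column can be an index key, a MAX column cannot, although it can still ride along as an included column.

CREATE TABLE Docs (
    Id    int NOT NULL PRIMARY KEY,
    Title varchar(900) NOT NULL,
    Body  varchar(max) NULL
);

CREATE INDEX IX_Docs_Title ON Docs (Title);                      -- OK: fits the 900-byte key limit
-- CREATE INDEX IX_Docs_Body ON Docs (Body);                     -- fails: MAX types can't be index key columns
CREATE INDEX IX_Docs_Title_Incl ON Docs (Title) INCLUDE (Body);  -- MAX column allowed only as an INCLUDEd column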
Simon Sabin wrote a post on this some time back. I don't have the time to grab it now, but you should search for it, because he comes up with the conclusion that you shouldn't use varchar(max) by default.
Edited: Simon's got a few posts about varchar(max). The links in the comments below show this quite nicely. I think the most significant one is http://sqlblogcasts.com/blogs/simons/archive/2009/07/11/String-concatenation-with-max-types-stops-plan-caching.aspx, which talks about the effect of varchar(max) on plan caching. The general principle is to be careful. If you don't need it to be max, then don't use max - if you need more than 8000 characters, then sure... go for it.
For this question specifically, a few points I don't see mentioned:
On 2005/2008/2008 R2, if a LOB column is included in an index, this will block online index rebuilds.
On 2012 the online index rebuild restriction is lifted, but LOB columns cannot participate in the new "adding NOT NULL columns as an online operation" functionality.
Locks can be held longer on rows containing columns of this data type. (more)
A couple of other reasons are covered in my answer on why not to use varchar(8000) everywhere:
Your queries may end up requesting huge memory grants not justified by the size of the data.
On tables with triggers it can prevent an optimisation where versioning tags are not added.
I asked a similar question earlier and got some interesting replies; check it out here.
There was one site where a guy talked about the detriment of using wide columns; however, if your data is limited in the application, my testing disproved it.
The fact that you can't create indexes on these columns means I wouldn't use them all the time (personally I wouldn't use them much at all, but I'm a bit of a purist in that regard).
However, if you know there isn't much stored in them, I don't think they are that bad.
If you do any sorting on a recordset with a varchar(max) column in it (or any wide char or varchar column), then you could suffer performance penalties. These could be resolved (if required) by indexes, but you can't put indexes on varchar(max).
If you want to future-proof your columns, why not just size them to something reasonable, e.g. a name column at 255 characters instead of max... that kind of thing.
There is another reason to avoid using varchar(max) on all columns. For the same reason we use check constraints (to avoid filling tables with junk caused by errant software or user entries), we would want to guard against any faulty process that adds much more data than intended. For example, if someone or something tried to add 3,000 bytes into a City field, we would know for certain that something is amiss and would want to stop the process dead in its tracks to debug it at the earliest possible point. We would also know that a 3000-byte city name could not possibly be valid and would mess up reports and such if we tried to use it.
Ideally, you should only allow what you need. Meaning if you're certain a particular column (say a username column) is never going to be more than 20 characters long, using a VARCHAR(20) vs. a VARCHAR(MAX) lets the database optimize queries and data structures.
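A minimal sketch of that idea (the table, column names and sizes are illustrative only): the declared lengths act like an implicit sanity check, so a 3,000-byte "city" raises a truncation error instead of silently polluting the table.

CREATE TABLE Customers (
    CustomerId int NOT NULL PRIMARY KEY,
    Username   varchar(20)  NOT NULL,   -- known upper bound from the application
    City       varchar(100) NOT NULL    -- generous, but an oversized value still fails loudly
);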
From MSDN:
http://msdn.microsoft.com/en-us/library/ms176089.aspx
Variable-length, non-Unicode character data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes.
Are you really ever going to come close to 2^31-1 bytes for these columns?
Problem 1: What is the maximum number of columns we can have in a table?
Problem 2: What is the maximum number of columns we should have in a table?
Answer 1: Probably more than you have, but not more than you will grow to have.
Answer 2: Fewer than you have.
Asking these questions usually indicates that you haven't designed the table well. You probably are practicing the Metadata Tribbles antipattern. The columns tend to accumulate over time, creating an unbounded set of columns that store basically the same type of data. E.g. subtotal1, subtotal2, subtotal3, etc.
Instead, I'm guessing you should create an additional dependent table, so your many columns become many rows. This is part of designing a proper normalized database.
CREATE TABLE Subtotals (
entity_id INT NOT NULL,
year_quarter SMALLINT NOT NULL, -- e.g. 20094
subtotal NUMERIC(9,2) NOT NULL,
PRIMARY KEY (entity_id, year_quarter),
FOREIGN KEY (entity_id) REFERENCES Entities (entity_id)
);
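For example (entity 42 and the amounts are invented, and assume that entity exists in Entities), each new quarter becomes a new row rather than a new subtotalN column, and totals fall out of a simple aggregate:

INSERT INTO Subtotals (entity_id, year_quarter, subtotal)
VALUES (42, 20094, 1234.56);

SELECT entity_id, SUM(subtotal) AS total
FROM Subtotals
GROUP BY entity_id;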
My former colleague also wrote a blog about this:
Understanding the maximum number of columns in a MySQL table
The answer is not so straightforward as you might think.
SQL 2000 : 1024
SQL 2005 : 1024
SQL 2008 : 1024 for a non-wide table, 30k for a wide table.
The wide tables are for when you have used the new sparse column feature in SQL 2008, which is designed for when you have a large number of columns that are normally empty.
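A minimal sketch of a sparse column declaration (table and column names are invented); SPARSE is only worthwhile when the column is NULL for the vast majority of rows:

CREATE TABLE Products (
    ProductId  int NOT NULL PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,
    NicheAttr1 varchar(50) SPARSE NULL,   -- NULL for most rows, so SPARSE saves space
    NicheAttr2 varchar(50) SPARSE NULL
);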
Just because these limits are available does not mean you should be using them, however. I would start by designing the tables based on the requirements, and then check whether vertical partitioning of one table into two smaller tables is required, etc.
1)
http://msdn.microsoft.com/en-us/library/aa933149%28SQL.80%29.aspx
1024 seems to be the limit.
2)
Much less than 1024 :). Seriously, it depends on how normalized you want your DB to be. Generally, the fewer columns you have in a table, the easier it will be for someone to understand. For a person table, for example, you might want to store the person's address in another table (person_address). It's best to break your data up into entities that make sense for your business model, and go from there.
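A rough sketch of that person / person_address split (column names are made up for illustration):

CREATE TABLE person (
    person_id  int NOT NULL PRIMARY KEY,
    first_name varchar(50) NOT NULL,
    last_name  varchar(50) NOT NULL
);

CREATE TABLE person_address (
    address_id  int NOT NULL PRIMARY KEY,
    person_id   int NOT NULL,
    street      varchar(100) NOT NULL,
    city        varchar(100) NOT NULL,
    postal_code varchar(20)  NOT NULL,
    FOREIGN KEY (person_id) REFERENCES person (person_id)
);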
2) There are plenty of guidelines out there, in particular regarding database normalization. The overarching principle is always to be able to adapt. Similar to classes, tables with a large number of columns are not very flexible. Some of the questions you should ask yourself:
Does Column A describe an attribute of the object (table) that could/should be grouped with Column B?
Data updates: keep in mind that most RDBMSs perform a row lock when updating values. This means that if you are constantly updating Column A while another process is updating Column B, they will fight for the row, and this will create contention.
Database design is an art more than a science. While guidelines and technical limitations will get you in the right direction, there are no hard rules that will make your system work or fail 100%.
I think it's 4096 in MySQL; for SQL Server I don't know.
I asked the same question a few months ago for a specific scenario; maybe the answers will help you decide. Usually as few as possible, I would say.
I have a table where one of the columns is a sort of id string used to group several rows from the table. Let's say the column name is "map" and one of the values for map is e.g. "walmart". The column has an index on it, because I use it to filter the rows which belong to a certain map.
I have lots of such maps, and I don't know how much space the different map values take up in the table. Does MySQL recognize that the same map value is stored for multiple rows, and store it only once internally, referencing it with an internal numeric id?
Or do I have to replace the map string with a numeric id explicitly and use a different table to pair map strings to ids if I want to decrease the size of the table?
MySQL will store the whole data for every row, regardless of whether the data already exists in a different row.
If you have a limited set of options, you could use an ENUM field, else you could pull the names into another table and join on it.
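A sketch of both options (table names and the extra ENUM values are mine; only "walmart" comes from the question), shown as alternatives rather than something to run together:

-- Option 1: fixed, known-in-advance set of maps
CREATE TABLE locations_enum (
    id  INT PRIMARY KEY,
    map ENUM('walmart', 'target', 'costco') NOT NULL,
    KEY idx_map (map)
);

-- Option 2: open-ended set, normalized into a lookup table and joined on
CREATE TABLE maps (
    map_id   INT AUTO_INCREMENT PRIMARY KEY,
    map_name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE locations_lookup (
    id     INT PRIMARY KEY,
    map_id INT NOT NULL,
    KEY idx_map_id (map_id),
    FOREIGN KEY (map_id) REFERENCES maps (map_id)
);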
I think MySQL will duplicate your content each time: it stores data row by row, unless you explicitly specify otherwise (putting the data in another table, as you suggested).
Using another table will mean you need to add a JOIN to some of your queries: you might want to think a bit about the size of your data (is it really that big?), compared to the (small?) performance loss you may incur because of that join.
Another solution would be to use an ENUM datatype, at least if you know in advance which strings you will have in your table, and there are only a few of them.
Finally, another solution might be to store an integer "code" corresponding to each string, and have those codes translated to strings by your application, totally outside of the database (or use some table to store the correspondences, but have that table cached by your application instead of using joins in SQL queries).
It would not be as "clean", but might be better for performance. Still, this may be the kind of micro-optimization that is not necessary in your case...
If you are using the same values over and over again, then there is a good functional reason to move them to a separate table, totally aside from disk space considerations: to avoid problems with inconsistent data.
Suppose you have a table of Stores, which includes a column for StoreName. Among the values in StoreName "WalMart" occurs 300 times, and then there's a "BalMart". Is that just a typo for "WalMart", or is that a different store?
Also, if there's other data associated with a store that would be constant across the chain, you should store it just once and not repeatedly.
Of course, if you're just showing locations on a map and you really don't care what they are (it's just a name to display), then this would all be irrelevant.
And if that's the case, then buying a bigger disk is probably a simpler solution than redesigning your database just to save a few bytes per record. Because if we're talking arbitrary strings for place names here, then trying to find duplicates and have look-ups for them is probably a lot of work for very little gain.