char vs varchar for performance in stock database - mysql

I'm using MySQL to set up a database of stock options. There are about 330,000 rows (each row is one option). I'm new to SQL, so I'm trying to decide on the field types for things like option symbol (varies from 4 to 5 characters), stock symbol (varies from 1 to 5 characters), and company name (varies from 5 to 60 characters).
I want to optimize for speed, both for creating the database (which happens every 5 minutes as new price data comes out; I don't have a real-time data feed, but it's near real-time in that I get a new text file with 330,000 rows delivered to me every 5 minutes, and this new data completely replaces the prior data) and for lookup speed (there will be a web-based front end where many users can run ad hoc queries).
If I'm not concerned about space (since the db lifetime is 5 minutes, and each row contains maybe 300 bytes, so maybe 100 MB for the whole thing) then what is the fastest way to structure the fields?
Same question for numeric fields, actually: Is there a performance difference between int(11) and int(7)? Does one length work better than another for queries and sorting?
Thanks!

In MyISAM, there is some benefit to making fixed-width records. VARCHAR is variable width. CHAR is fixed-width. If your rows have only fixed-width data types, then the whole row is fixed-width, and MySQL gains some advantage calculating the space requirements and offset of rows in that table. That said, the advantage may be small, and a possible tiny gain is hardly worth it if it is outweighed by other costs (such as cache efficiency) of having fixed-width, padded CHAR columns where VARCHAR would store more compactly.
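To make the "offset of rows" point concrete, here is a rough Python sketch of what fixed-width records buy you. It is not how MyISAM is implemented; the layout (CHAR(5), CHAR(5), CHAR(60)) and the helper names pack_row and read_row are assumptions made up for illustration.

```python
import struct

# Hypothetical fixed-width layout: CHAR(5) option symbol,
# CHAR(5) stock symbol, CHAR(60) company name.
ROW_FORMAT = "5s5s60s"
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # every row occupies the same 70 bytes


def pack_row(option_symbol, stock_symbol, company):
    """Pad each field with spaces to its fixed width, as a CHAR column would."""
    return struct.pack(
        ROW_FORMAT,
        option_symbol.ljust(5).encode("ascii"),
        stock_symbol.ljust(5).encode("ascii"),
        company.ljust(60).encode("ascii"),
    )


def read_row(table_bytes, row_number):
    """Jump straight to a row: its offset is simply row_number * ROW_SIZE."""
    offset = row_number * ROW_SIZE
    fields = struct.unpack_from(ROW_FORMAT, table_bytes, offset)
    return tuple(field.decode("ascii").rstrip() for field in fields)


# Two made-up sample rows stored back to back.
table = b"".join([
    pack_row("IBMAA", "IBM", "International Business Machines"),
    pack_row("FAB", "F", "Ford Motor Company"),
])
```

The flip side is visible too: the short Ford row still takes 70 bytes, which is the padding cost mentioned above.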
The breakpoint where it becomes more efficient depends on your application, and this is not something that can be answered except by you testing both solutions and using the one that works best for your data under your application's usage.
Regarding INT(7) versus INT(11), this is irrelevant to storage or performance. It is a common misunderstanding that MySQL's argument to the INT type has anything to do with size of the data -- it doesn't. MySQL's INT data type is always 32 bits. The argument in parentheses refers to how many digits to pad if you display the value with ZEROFILL. E.g. INT(7) will display 0001234 where INT(11) will display 00000001234. But this padding only happens as the value is displayed, not during storage or math calculation.
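As an analogy only (this is Python, not MySQL), the display width behaves like zero-padding applied when a number is printed. The function name display_zerofill is made up for this sketch.

```python
def display_zerofill(value, display_width):
    """Mimic ZEROFILL output: pad on the left with zeros up to the display width.

    The stored value is untouched; only the printed form changes.
    """
    return str(value).zfill(display_width)


narrow = display_zerofill(1234, 7)    # like INT(7) ZEROFILL
wide = display_zerofill(1234, 11)     # like INT(11) ZEROFILL
```

Both strings represent the same integer, and a value with more digits than the width is padded with nothing rather than cut off.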

If the actual data in a field can vary a lot in size, varchar is better because it leads to smaller records, and smaller records mean a faster DB (more records can fit into cache, smaller indexes, etc.). For the same reason, using smaller ints is better if you need maximum speed.
OTOH, if the variance is small, e.g. a field has a maximum of 20 chars, and most records actually are nearly 20 chars long, then char is better because it allows some additional optimizations by the DB. However, this really only matters if it's true for ALL the fields in a table, because then you have fixed-size records. If speed is your main concern, it might even be worth it to move any non-fixed-size fields into a separate table, if you have queries that use only the fixed-size fields (or if you only have shotgun queries).
In the end, it's hard to generalize because a lot depends on the access patterns of your actual app.

Given your system constraints I would suggest a varchar since anything you do with the data will have to accommodate whatever padding you put in place to make use of a fixed-width char. This means more code somewhere which is more to debug, and more potential for errors. That being said:
The major bottleneck in your application is due to dropping and recreating your database every five minutes. You're not going to get much performance benefit out of microenhancements like choosing char over varchar. I believe you have some more serious architectural problems to address instead. – Princess
I agree with the above comment. You have bigger fish to fry in your architecture before you can afford to worry about the difference between a char and varchar. For one, if you have a web user attempting to run an ad hoc query and the database is in the process of being recreated, you are going to get errors (i.e. "database doesn't exist" or simply "timed out" type issues).
I would suggest that instead you build (at the least) a quote table for the most recent quote data (with a time stamp), a ticker symbol table and a history table. Your web users would query against the ticker table to get the most recent data. If a symbol comes over in your 5-minute file that doesn't exist, it's simple enough to have the import script create it before posting the new info to the quote table. All others get updated and queries default to the current day's data.
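Here is a minimal sketch of that import step, assuming SQLite (through Python's sqlite3 module) as a stand-in for MySQL so that it runs anywhere. The table and column names, the load_snapshot helper and the sample prices are all invented for illustration, and the history table is left out to keep it short.

```python
import sqlite3


def load_snapshot(conn, rows, as_of):
    """Apply one 5-minute snapshot of (symbol, price) pairs in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        for symbol, price in rows:
            # Create the symbol the first time it shows up in a file.
            conn.execute("INSERT OR IGNORE INTO ticker (symbol) VALUES (?)", (symbol,))
            # Overwrite the previous quote for this symbol.
            conn.execute(
                "INSERT OR REPLACE INTO quote (symbol, price, as_of) VALUES (?, ?, ?)",
                (symbol, price, as_of),
            )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticker (symbol TEXT PRIMARY KEY);
    CREATE TABLE quote (
        symbol TEXT PRIMARY KEY REFERENCES ticker(symbol),
        price  TEXT NOT NULL,
        as_of  TEXT NOT NULL
    );
""")

# Two consecutive snapshots with made-up prices.
load_snapshot(conn, [("IBM", "164.10"), ("F", "13.75")], "2011-06-14 09:30")
load_snapshot(conn, [("IBM", "164.35"), ("GE", "18.60")], "2011-06-14 09:35")
```

Because the tables are never dropped, a web query that arrives in the middle of an import still finds the tables and sees either the old or the new quote for a symbol.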

I would definitely not recreate the database each time. Instead I would do the following:
Read in the update/snapshot file and create some object based on each row.
For each row, get the symbol/option name (unique) and set that in the database.
If it were me I would also have an in memory cache of all the symbols and the current price data.
Price data is never an int - you can use characters.
The company name is probably not unique as there are many options for a particular company. That should be an index and you can save space just using the id of a company.
As someone else also pointed out - your web clients do not need to have to hit the actual database and do a query - you can probably just hit your cache. (though that really depends on what tables and data you expose to your clients and what data they want)
Having query access for other users is also a reason NOT to keep removing and creating a database.

Also remember that creating databases is subject to whatever actual database implementation you use. If you ever port from MySQL to, say, PostgreSQL, you will discover the very unpleasant fact that creating databases in PostgreSQL is a comparatively very slow operation. It is orders of magnitude slower than reading and writing table rows, for instance.
It looks like there is an application design problem to address first, before you optimize for performance choosing proper data types.

MySQL/PostgreSQL Column Sizes, Why?

I'm developing a program and ran into a bug where inserting a value larger than Integer.MAX_VALUE into a table column of type int produces an error saying the number is too large. I read that the fix for this is quite simply to alter the column to BigInt. But that got me thinking: why don't all programmers just use the max column values (such as Varchar(255), BigInt, etc.) rather than something smaller like Varchar(30) or Int?
Wouldn't this almost completely eliminate an error like mine occurring when you're not sure of what's going to be inserted, especially if it's based off of user input? Are there any cons to just using the largest possible type for the columns? Would the table size be bigger even if you just store "2" in a BigInt column (even though that would work with int)? Is there a performance loss?
Thanks!
For Varchar, the reason you generally don't just use MAX is because it stores it differently and puts limitations on your index maintenance operations. For instance, you cannot rebuild an index "online" with a varchar(max) field on it. While there's a little hand waving involved, basically varchar(max) data gets stored off row so there's overhead in maintaining that extra data store.
For numeric types, the main thing is space. Bigint is an 8 byte signed integer whereas an int is only 4 bytes. If you don't need values larger than about 2.1 billion (the maximum of a signed int is 2,147,483,647), that's just wasted space (and often a lot of it if you have, say, 2.4 billion rows of data).
Data Compression can solve some of those issues, but not without the cost of having to de-compress the data when it's queried.
So the reasons are varied, but with the possible exception of using larger size varchars (not varchar(max)), picking the "right" data type for your data is just a good idea.
I can't speak to any RDBMS other than SQL Server (but I imagine this applies to all of them)... A BIGINT takes up twice as much space as an INT... which means less data fitting onto a page, meaning less data in cache, meaning slower performance.
In SQL Server there are actually 4 INT types:
TINYINT (1 byte),
SMALLINT (2 bytes),
INT (4 bytes),
BIGINT (8 bytes).
A good database developer will put very careful thought into choosing the proper data type based on the data that's expected to be put in the column. Aside from the issue of storage space, data types function as data constraints. So if I choose TINYINT as my data type, that means I only expect to see values between 0 and 255 and will reject anything that falls outside of that range.
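The ranges behind those byte counts are easy to work out. A small Python sketch, assuming SQL Server's conventions (TINYINT unsigned, the other three signed); the helper names int_range and smallest_type_for are made up here.

```python
def int_range(num_bytes, signed=True):
    """Return (min, max) for a two's-complement integer of the given size."""
    bits = 8 * num_bytes
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


# SQL Server's four integer types, smallest first.
SQL_SERVER_INTS = {
    "TINYINT": int_range(1, signed=False),
    "SMALLINT": int_range(2),
    "INT": int_range(4),
    "BIGINT": int_range(8),
}


def smallest_type_for(low, high):
    """Pick the narrowest type whose range covers every expected value."""
    for name, (type_min, type_max) in SQL_SERVER_INTS.items():
        if type_min <= low and high <= type_max:
            return name
    raise ValueError("no integer type is wide enough")
```

For example, smallest_type_for(0, 200) picks TINYINT, while an upper bound of 3,000,000,000 is already past INT and needs BIGINT.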
If a coworker were to submit a table design with all VARCHAR(255)s and BIGINTs, I'd reject it and have them size everything appropriately. It's lazy thinking like that which causes huge problems on the DB side of the house.
"why don't all programmers just use the max column values (such as Varchar(255), BigInt, etc.) rather than something smaller like Varchar(30) or Int?"
Some do exactly that. It's also not at all uncommon to see developers store numeric or date/time values in varchar columns too.
I often see performance and storage costs called out as the reason not to do this. Those are considerations (which vary by DBMS) but a more important one in the world of relational databases is data integrity. The chosen datatype is a critical part of the data model because it determines the domain of data that can be stored. On top of that, relational databases provide check, referential, and NULL constraints to further limit column values.
"Wouldn't this almost completely eliminate an error like mine occurring when you're not sure of what's going to be inserted, especially if it's based off of user input?"
Of course, but why stop at a 64-bit integer? Why not NUMERIC(1000)? That's a rhetorical question to point out that one must know about the business domain so data can be properly modeled and validation rules enforced. A 64-bit integer is certainly overkill to store a person's number of children but you may end up with a value of several billion due to careless data entry. The column data type is the last defense for bad data and is especially important when it's based off of users input.
All that being said, one can use a RDBMS as nothing more than a dumb storage engine and enforce data integrity rules (if any) in application code. In that case, storage and performance are the only consideration.

Performance benefit in using correct data types

Is there any performance benefit in using the exact data types needed for a column? Or is it just storage optimisation?
For example, I'm creating a users table and I know for certain that there will only be 200 users in total. When I'm manipulating the data on the server, doing some select/update/insert/delete, is there any performance difference between using TINYINT UNSIGNED for the users_id column or using just INT?
The same applies to the user's name. I know, for now, that the longest name is 48 characters, but I don't know whether a future user will have a name 65 characters long. Is there any performance benefit in reserving only the length needed for now, using VARCHAR(48), or can I avoid constantly checking the column's allowed length for each new user and just use VARCHAR(255)?
There is little advantage in either case.
For the number, you do gain a slight performance advantage. Typically, an integer is 4 bytes and a tinyint is 1 byte. So, if you have multiple smaller fields, then your records will be smaller. Smaller records then imply fewer data pages and ultimately slightly faster queries. This shows up when you start to have lots of records.
For the varchar, you don't even have that advantage. Both varchar(48) and varchar(255) occupy the same amount of space (there is one additional byte for lengths greater than 255). The values determine the space for this data type.
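A back-of-the-envelope Python sketch of that rule, assuming a single-byte character set (so characters equal bytes) and the usual layout of a length prefix followed by the data. The function name varchar_storage_bytes is invented for the example.

```python
def varchar_storage_bytes(value, declared_length):
    """Approximate bytes used by one VARCHAR value in a single-byte charset.

    Assumption: a 1-byte length prefix when the column can hold at most
    255 bytes, a 2-byte prefix otherwise, followed by the actual data.
    """
    data = value.encode("latin-1")
    if len(data) > declared_length:
        raise ValueError("value does not fit in the column")
    prefix = 1 if declared_length <= 255 else 2
    return prefix + len(data)


longest_name = "x" * 48  # stand-in for the 48-character name
```

The same 48-character value costs 49 bytes in both VARCHAR(48) and VARCHAR(255), and 50 bytes in, say, VARCHAR(300).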
In other cases, it can make a big difference. In particular, storing dates as the native format is usually important, both to take advantage of date/time functions and to make better use of indexes.
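A quick Python illustration of why the native format matters, using made-up dates in an assumed dd/mm/yyyy text format: as strings they sort by their first characters, as real dates they sort chronologically and support date arithmetic.

```python
from datetime import date, datetime

raw = ["15/01/2012", "02/03/2011", "10/12/2010"]  # made-up sample values


def parse(text):
    """Convert a dd/mm/yyyy string into a real date value."""
    return datetime.strptime(text, "%d/%m/%Y").date()


as_text = sorted(raw)                     # compares character by character
as_dates = sorted(parse(t) for t in raw)  # compares actual points in time

span = as_dates[-1] - as_dates[0]         # a timedelta; impossible with plain strings
```

Sorted as text, "02/03/2011" comes first even though 10 December 2010 is the earliest date, and an index on such a text column would be ordered the same unhelpful way.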

MySQL PRIMARY KEYs: UUID / GUID vs BIGINT (timestamp+random)

tl;dr: Is assigning rows IDs of {unixtimestamp}{randomdigits} (such as 1308022796123456) as a BIGINT a good idea if I don't want to deal with UUIDs?
Just wondering if anyone has some insight into any performance or other technical considerations / limitations in regards to IDs / PRIMARY KEYs assigned to database records across multiple servers.
My PHP+MySQL application runs on multiple servers, and the data needs to be able to be merged. So I've outgrown the standard sequential / auto_increment integer method of identifying rows.
My research into a solution brought me to the concept of using UUIDs / GUIDs. However the need to alter my code to deal with converting UUID strings to binary values in MySQL seems like a bit of a pain/work. I don't want to store the UUIDs as VARCHAR for storage and performance reasons.
Another possible annoyance of UUIDs stored in a binary column is the fact that row IDs aren't obvious when looking at the data in PhpMyAdmin - I could be wrong about this though - but straight numbers seem a lot simpler overall anyway and are universal across any kind of database system with no conversion required.
As a middle ground I came up with the idea of making my ID columns a BIGINT, and assigning IDs using the current unix timestamp followed by 6 random digits. So let's say my random number came out to be 123456, my generated ID today would come out as: 1308022796123456
A one in a million chance of a conflict between two rows created within the same second (6 random digits give 1,000,000 possible suffixes) is fine with me. I'm not doing any sort of mass row creation quickly.
One issue I've read about with randomly generated UUIDs is that they're bad for indexes, as the values are not sequential (they're spread out all over the place). The UUID() function in MySQL addresses this by generating the first part of the UUID from the current timestamp. Therefore I've copied that idea of having the unix timestamp at the start of my BIGINT. Will my indexes be slow?
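For reference, the scheme can be sketched like this. It is in Python rather than PHP to keep it short and runnable; make_id and RANDOM_DIGITS are names invented for the sketch.

```python
import secrets
import time

RANDOM_DIGITS = 6


def make_id(timestamp=None, suffix=None):
    """Build {unix timestamp}{6 random digits} as one integer.

    Multiplying by 10**6 and adding the suffix is the same as
    concatenating the timestamp with the zero-padded random digits.
    """
    if timestamp is None:
        timestamp = int(time.time())
    if suffix is None:
        suffix = secrets.randbelow(10 ** RANDOM_DIGITS)  # uniform over 0..999999
    return timestamp * 10 ** RANDOM_DIGITS + suffix


example = make_id(1308022796, 123456)
```

On the "randomer" follow-up: drawing one uniform number from 0 to 999999 and zero-padding it gives the same distribution as six independent uniform digits, so neither is more random. Drawing from 1 to 999999 only drops the all-zeros suffix. The result stays well inside a signed BIGINT.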
Pros of my BIGINT idea:
Gives me the multi-server/merging advantages of UUIDs
Requires very little change to my application code (everything is already programmed to handle integers for IDs)
Half the storage of a UUID (8 bytes vs 16 bytes)
Cons:
??? - Please let me know if you can think of any.
Some follow up questions to go along with this:
Should I use more or less than 6 random digits at the end? Will it make a difference to index performance?
Is one of these methods any "randomer" ?: Getting PHP to generate 6 digits and concatenating them together -VS- getting PHP to generate a number in the 1 - 999999 range and then zerofilling to ensure 6 digits.
Thanks for any tips. Sorry about the wall of text.
I have run into this very problem in my professional life. We used timestamp + random number and ran into serious issues when our applications scaled up (more clients, more servers, more requests). Granted, we (stupidly) used only 4 digits, and then changed to 6, but you would be surprised how often the errors still happen.
Over a long enough period of time, you are guaranteed to get duplicate key errors. Our application is mission critical, and therefore even the smallest chance it could fail due to inherently random behavior was unacceptable. We started using UUIDs to avoid this issue, and carefully managed their creation.
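To put numbers on "how often", this is the standard birthday-problem calculation, sketched in Python. It assumes the random suffixes are uniform and independent; the function name collision_probability is made up.

```python
def collision_probability(rows_in_same_second, random_digits):
    """Chance that at least two rows created in the same second share a suffix."""
    slots = 10 ** random_digits
    p_all_unique = 1.0
    for already_taken in range(rows_in_same_second):
        # The next row must avoid every suffix that is already in use.
        p_all_unique *= (slots - already_taken) / slots
    return 1.0 - p_all_unique


two_rows = collision_probability(2, 6)        # about one in a million
busy_second = collision_probability(1200, 6)  # already better than even odds
```

With 4 digits the odds pass 50% at only around 120 rows in the same second, and with 6 digits at around 1,200, which is why the errors keep turning up once traffic grows.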
Using UUIDs, your index size will increase, and a larger index will result in poorer performance (perhaps unnoticeable, but poorer nonetheless). However, MySQL can store UUIDs compactly in a BINARY(16) column (never use varchar as a primary key!!), and can handle indexing, searching, etc. pretty damn efficiently even compared to bigint. The biggest performance hit to your index is almost always the number of rows indexed, rather than the size of the item being indexed (unless you want to index on a longtext or something ridiculous like that).
To answer your question: Bigint (with random numbers attached) will be OK if you do not plan on scaling your application/service significantly. If your code can handle the change without much alteration and your application will not explode if a duplicate key error occurs, go with it. Otherwise, bite the bullet and go for the more substantial option.
You can always implement a larger change later, like switching to an entirely different backend (which we are now facing... :P)
You can manually change the autonumber starting number.
ALTER TABLE foo AUTO_INCREMENT = ####
An unsigned int can store up to 4,294,967,295; let's round it down to 4,290,000,000.
Use the first 3 digits for the server serial number, and the final 7 digits for the row id.
This gives you up to 430 servers (including 000), and up to 10 million IDs for each server.
So for server #172 you manually change the autonumber to start at 1,720,000,000, then let it assign IDs sequentially.
If you think you might have more servers, but fewer IDs per server, then adjust it to 4 digits per server and 6 for the ID (i.e. up to 1 million IDs).
You can also split the number using binary digits instead of decimal digits (perhaps 10 binary digits per server, and 22 for the ID. So, for example, server 76 starts at 2^22*76 = 318,767,104 and ends at 322,961,407).
For that matter you don't even need a clear split. Take 4,294,967,295 divide it by the maximum number of servers you think you will ever have, and that's your spacing.
You could use a bigint if you think you need more identifiers, but that's a seriously huge number.
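The range arithmetic above is easy to check. A short Python sketch with invented helper names (decimal_start, binary_range), assuming a 32-bit unsigned ID column:

```python
def decimal_start(server, id_digits=7):
    """First ID for a server when the leading decimal digits are the server number."""
    return server * 10 ** id_digits


def binary_range(server, id_bits=22):
    """(first, last) ID for a server when the high bits hold the server number."""
    start = server << id_bits            # same as server * 2**id_bits
    return start, start + (1 << id_bits) - 1


server_172_start = decimal_start(172)    # value to use in ALTER TABLE ... AUTO_INCREMENT
server_76_range = binary_range(76)
```

With 10 server bits and 22 ID bits, the last server (number 1023) ends exactly at 4,294,967,295, so the binary split uses the whole unsigned range with nothing left over.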
Use the GUID as a unique index, but also calculate a 64-bit (BIGINT) hash of the GUID, store that in a separate NOT UNIQUE column, and index it. To retrieve, query for a match to both columns - the 64-bit index should make this efficient.
What's good about this is that the hash:
a. Doesn't have to be unique.
b. Is likely to be well-distributed.
The cost: extra 8-byte column and its index.
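One possible way to derive such a hash, sketched in Python. The choice of SHA-256 truncated to 8 bytes is my assumption, not something the approach requires; any well-distributed 64-bit hash would do, and guid_hash64 is an invented name.

```python
import hashlib
import uuid


def guid_hash64(guid):
    """Derive a signed 64-bit value (fits a BIGINT column) from a GUID.

    Takes the first 8 bytes of the SHA-256 digest of the GUID's 16 raw bytes.
    Collisions are possible, which is fine because the column is NOT UNIQUE.
    """
    digest = hashlib.sha256(guid.bytes).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


row_guid = uuid.uuid4()
row_hash = guid_hash64(row_guid)  # store next to the GUID and index it
```

A lookup then filters on both columns: the hash narrows the search through the small integer index, and the GUID comparison removes any rows that merely share a hash.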
If you want to use the timestamp method then do this:
Give each server a number, to that append the process ID of the application that is doing the insert (or the thread ID) (in PHP it's getmypid()), then to that append how long that process has been alive/active for (in PHP it's getrusage()), and finally add a counter that starts at 0 at the start of each script invocation (i.e. each insert within the same script adds one to it).
Also, you don't need to store the full unix timestamp - most of those digits are for saying it's year 2011, and not year 1970. So if you can't get a number saying how long the process was alive for, then at least subtract a fixed timestamp representing today - that way you'll need far fewer digits.
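A loose Python sketch of that recipe, using the fallback of subtracting a fixed timestamp instead of measuring process lifetime. The class name IdGenerator, the CUSTOM_EPOCH value and the dash-separated string output are all assumptions made for readability; packing the parts into a single number would need a fixed width for each part.

```python
import itertools
import os
import time

CUSTOM_EPOCH = 1308000000  # hypothetical fixed timestamp (June 2011) to subtract


class IdGenerator:
    """Combine server number, process ID, shortened timestamp and a counter."""

    def __init__(self, server_id, pid=None):
        self.server_id = server_id
        self.pid = os.getpid() if pid is None else pid  # Python's getmypid()
        self._counter = itertools.count()               # starts at 0 per process

    def next_id(self, now=None):
        seconds = int(time.time() if now is None else now) - CUSTOM_EPOCH
        # Dashes keep the parts visible in this sketch.
        return f"{self.server_id}-{self.pid}-{seconds}-{next(self._counter)}"


generator = IdGenerator(server_id=3, pid=4242)
first = generator.next_id(now=1308022796)
second = generator.next_id(now=1308022796)
```

Two inserts from the same process in the same second differ in the counter, and two processes on the same server differ in the process ID, so nothing here depends on random digits.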

varchar(max) everywhere?

Is there any problem with making all your Sql Server 2008 string columns varchar(max)? My allowable string sizes are managed by the application. The database should just persist what I give it. Will I take a performance hit by declaring all string columns to be of type varchar(max) in Sql Server 2008, no matter what the size of the data that actually goes into them?
By using VARCHAR(MAX) you are basically telling SQL Server "store the values in this field how you see best", SQL Server will then choose whether to store values as a regular VARCHAR or as a LOB (Large object). In general if the values stored are less than 8,000 bytes SQL Server will treat values as a regular VARCHAR type.
If the values stored are too large then the column is allowed to spill off the page into LOB pages, exactly as they do for other LOB types (text, ntext and image) - if this happens then additional page reads are required to read the data stored in the additional pages (i.e. there is a performance penalty), however this only happens if the values stored are too large.
In fact under SQL Server 2008 or later data can overflow onto additional pages even with the length-limited variable-length data types (e.g. VARCHAR(3000)), however these pages are called row overflow data pages and are treated slightly differently.
Short version: from a storage perspective there is no disadvantage of using VARCHAR(MAX) over VARCHAR(N) for some N.
(Note that this also applies to the other variable-length field types NVARCHAR and VARBINARY)
FYI - You can't create indexes on VARCHAR(MAX) columns
Indexes cannot be over 900 bytes wide, for one. So you can probably never create an index. If your data is less than 900 bytes, use varchar(900).
This is one downside, because it gives really bad searching performance and no unique constraints.
Simon Sabin wrote a post on this some time back. I don't have the time to grab it now, but you should search for it, because he comes up with the conclusion that you shouldn't use varchar(max) by default.
Edited: Simon's got a few posts about varchar(max). The links in the comments below show this quite nicely. I think the most significant one is http://sqlblogcasts.com/blogs/simons/archive/2009/07/11/String-concatenation-with-max-types-stops-plan-caching.aspx, which talks about the effect of varchar(max) on plan caching. The general principle is to be careful. If you don't need it to be max, then don't use max - if you need more than 8000 characters, then sure... go for it.
For this question specifically a few points I don't see mentioned.
On 2005/2008/2008 R2 if a LOB column is included in an index this will block online index rebuilds.
On 2012 the online index rebuild restriction is lifted but LOB columns cannot participate in the new functionality Adding NOT NULL Columns as an Online Operation.
Locks can be taken out longer on rows containing columns of this data type.
A couple of other reasons are covered in my answer as to why not varchar(8000) everywhere.
Your queries may end up requesting huge memory grants not justified by the size of data.
On table with triggers it can prevent an optimisation where versioning tags are not added.
I asked a similar question earlier and got some interesting replies.
There was one site that had a guy talking about the detriment of using wide columns, however if your data is limited in the application, my testing disproved it.
The fact you can't create indexes on the columns means I wouldn't use them all the time (personally I wouldn't use them that much at all, but I'm a bit of a purist in that regard).
However, if you know there isn't much stored in them, I don't think they are that bad.
If you do any sorting on a recordset with a varchar(max) column in it (or any wide column, char or varchar), then you could suffer performance penalties. These could be resolved (if required) by indexes, but you can't put indexes on varchar(max).
If you want to future-proof your columns, why not just set them to something reasonable, e.g. make a name column 255 characters instead of max.
There is another reason to avoid using varchar(max) on all columns. For the same reason we use check constraints (to avoid filling tables with junk caused by errant software or user entries), we would want to guard against any faulty process that adds much more data than intended. For example, if someone or something tried to add 3,000 bytes into a City field, we would know for certain that something is amiss and would want to stop the process dead in its tracks to debug it at the earliest possible point. We would also know that a 3000-byte city name could not possibly be valid and would mess up reports and such if we tried to use it.
Ideally, you should only allow what you need. Meaning if you're certain a particular column (say a username column) is never going to be more than 20 characters long, using a VARCHAR(20) vs. a VARCHAR(MAX) lets the database optimize queries and data structures.
From MSDN:
http://msdn.microsoft.com/en-us/library/ms176089.aspx
Variable-length, non-Unicode character data. n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes.
Are you really ever going to come close to 2^31-1 bytes for these columns?

When setting MySQL schema, why use certain types?

When I'm setting up a MySQL table, it asks me to define the name of the column, type of input, and length. My assumption, without having read anything about it, is that it's for minimization. Specify the smallest possible int/smallint/tinyint for your needs, and it will reduce overhead of some sort. If it's all positives, make it unsigned to double your positive range, etc.
What happens if I just make every field a varchar-200 characters? When/why is this bad, what will I miss out on, and when will any inefficiencies manifest themselves? 100k records?
I think about this every time I set up a DB, but I haven't built anything to scale enough where I've ever had my scheme setup inappropriately, either too "strict/small" or "loose/big". Can someone confirm that I'm making good assumptions about speed and efficiency?
Thanks!
Data types not only optimize storage, but how data is indexed. As your databases get bigger, it will become apparent that it's quicker to search for all the records that have a 1 in an integer field than those that have a "1" in a varchar field. This becomes especially important when you're joining data from more than one table and your database engine is having to do this sort of thing repeatedly. (Daren also rightly points out below that it's important that the types of the fields you're matching on are identical as well.)
The level at which these inefficiencies become an issue depends greatly on your hardware and your application design. We have big enough iron these days that if you're building moderate-scale apps, you may not see an appreciable difference. (Aside from feeling a little bit guilty about your database design!) But establishing good habits on small projects makes the bigger ones easier when they come along.
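If you have two columns as varchar, put in the values 10 and 20, and combine them as strings, you'll get 1020 instead of the 30 you'd likely expect. (In SQL Server, + on two varchar values concatenates them. MySQL's + converts the strings to numbers first, but CONCAT() gives 1020.)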
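Python strings happen to behave the same way, so here is a tiny illustration of what storing numbers as text does to both arithmetic and ordering. It is an analogy only; the variable names are arbitrary.

```python
as_text = ["10", "20", "9"]            # numbers stored as strings
as_numbers = [int(v) for v in as_text]

text_sum = as_text[0] + as_text[1]     # concatenation, not addition
number_sum = as_numbers[0] + as_numbers[1]

text_order = sorted(as_text)           # character-by-character comparison
number_order = sorted(as_numbers)      # numeric comparison
```

Besides the 1020 surprise, the text version sorts "9" after "20", which is the kind of thing that bites in ORDER BY and range queries on a varchar column holding numbers.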
Sure, you could save everything as VARCHAR strings. But you'd be giving up a lot of functionality provided by the database engine.
You should choose the database type that most closely matches the intended use of the column. For example, using DATE or DATETIME to store dates provides you with all sorts of date/time functions that you don't get with basic VARCHAR types.
Likewise, fields used to count things or provide simple unique IDs should be INT or one of its related types. Also bear in mind that an INT occupies only 4 bytes, whereas a 9-digit string uses at least 9 bytes.
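The size difference can be seen directly by packing the same value both ways. A small Python sketch, assuming a 4-byte signed integer and one byte per digit character:

```python
import struct

value = 123456789

as_int = struct.pack("<i", value)       # 4-byte signed integer, like INT
as_text = str(value).encode("ascii")    # one byte per digit, like a 9-character string

int_size = len(as_int)
text_size = len(as_text)
```

The integer form is 4 bytes regardless of how many digits the value has, while the text form grows with every digit (before counting any length prefix the database adds).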
For character data, it's wise to use NVARCHAR for internationalized values that users in any locale are going to enter (esp. names and locations). If you know the text is limited to US or internal use only, VARCHAR is safe.