Best way to store 'extra' user data in MySQL? - mysql

I'm adding a new feature to my user module for my CMS and I've hit a road block... Or I guess, a fork in the road, and I wanted to get some opinions from stackoverflow before I commit to anything.
Basically I want to allow admins to add new, 'extra' user fields that users can fill out on registration, edit in their profile, and/or be controlled by other modules. An example of this would be a birthday field, a lengthy description of themselves, or maybe points the user has earned on the site. Needless to say, the data stored will be varied and can range from large amounts of text, to a small integer value. To make matters worse - I want there to be the option to search this data.
With that out of the way - what would be the best way to do this? Right now I'm leaning towards having a table with the following columns.
userid, refFieldID, varchar, tinyint, smallint, int, text, date, datetime, etc.
I would prefer this as it would make searching significantly faster, and the reference table (which holds all of the field's metadata, such as the name of the field, whether it's searchable or not, etc.) can record which column should be used when storing data for that field.
The other idea, which was suggested to me and which I've seen used in other solutions (vBulletin being one, although I have seen others whose names escape me at the moment), is to just have the userid, a reference id, and a mediumtext field. I don't know enough about MySQL to say this with any certainty, but this method seems like it would be slower to search and possibly have a larger overhead.
So which method would be 'best'? Is there another method I'm missing? Whichever method I end up using, it needs to be fast to search, not massive (a tiny bit of overhead is fine), and preferably allow complex queries to be run against the data.
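For reference, here's a rough sketch of the first layout I'm leaning towards (the table and column names are just placeholders):

CREATE TABLE user_field_data (
    userid       INT UNSIGNED NOT NULL,
    refFieldID   INT UNSIGNED NOT NULL,
    val_varchar  VARCHAR(255) NULL,
    val_int      INT NULL,
    val_text     TEXT NULL,
    val_date     DATE NULL,
    val_datetime DATETIME NULL,
    PRIMARY KEY (userid, refFieldID)
);

Only the column matching the field's declared type would be populated for any given row; the rest stay NULL.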

I agree that a key-value table is probably the best solution. My first inclination would be to just store a text column, like vBulletin did. But, if you wanted to add the ability for the data store to be a bit more extensible and searchable like you've laid out, I might suggest:
1 medium/longtext or medium/longblob field for arbitrary text/binary storage (whatever is stored + an overhead of 3-4 bytes for the string length). The only reason to choose medium over long is to limit what can be stored to 2^24 bytes (16.7 MB) versus 2^32 bytes (4 GB).
1 integer (4 bytes) or bigint (8 bytes)
1 datetime (8 bytes)
Perhaps 1 float or double (4-8 bytes) for floating point storage
These fields will allow you to store nearly any type of data in the table without inflating the width of the table** (like a varchar would) and avoid any redundant storage (like having tinyint and mediumint, etc.). The text stored in the longtext field can still be reasonably searched using a fulltext index or a regular limited-length index (e.g. index longtext_storage(8)).
** all blob values, such as longtext, are stored independently from the main table.
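To make that concrete, here's a sketch of the table I have in mind (names are mine, adapt as needed; note that FULLTEXT indexes require MyISAM before MySQL 5.6):

CREATE TABLE user_extra_data (
    userid           INT UNSIGNED NOT NULL,
    refFieldID       INT UNSIGNED NOT NULL,
    text_storage     MEDIUMTEXT NULL,   -- arbitrary text (use MEDIUMBLOB for binary)
    int_storage      BIGINT NULL,
    datetime_storage DATETIME NULL,
    float_storage    DOUBLE NULL,
    PRIMARY KEY (userid, refFieldID),
    INDEX idx_text_prefix (text_storage(8)),
    FULLTEXT INDEX ft_text (text_storage)
) ENGINE=MyISAM;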

One technique that might work for you is to store this arbitrary data as text, in some notation like JSON, XML, or YAML. This decision depends on how you'll need to access the data: if you only look up each user's full chunk of user data, it could be ideal. If you need to run SQL queries on specific fields in the user data, you'll need to use a pure SQL or a hybrid approach.
Many of the newer, highly scalable "NoSQL" systems seem to favor JSON data (e.g., MongoDB, CouchDB, and Project Voldemort). It's nice and terse, and you can create arbitrarily complex structures including maps (JSON objects) and lists (JSON arrays).
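As a rough illustration (the table and column names here are invented), the serialized data might be inserted alongside the user id like this:

-- assumes a hypothetical user_profile table with userid INT and profile_json TEXT columns
INSERT INTO user_profile (userid, profile_json)
VALUES (42, '{"birthday":"1985-06-14","about_me":"A lengthy description...","points":1250,"interests":["chess","hiking"]}');

Reading it back is a single-row lookup by userid; filtering on individual fields inside the blob is where the hybrid approach (a few real columns plus the blob) earns its keep.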

Related

MySQL/PostgreSQL Column Sizes, Why?

I'm developing a program and ran into a bug where inserting a value into a table's column of type int fails when the value is larger than Integer.MAX_VALUE; it spits out an error saying the number is too large. I read that the fix for this is quite simply to alter the column to BigInt and that should fix it. But that got me thinking: why don't all programmers just use the max column values (such as Varchar(255), BigInt, etc.) rather than something smaller like Varchar(30) or Int?
Wouldn't this almost completely eliminate an error like mine from occurring when you're not sure what's going to be inserted, especially if it's based off of user input? Are there any cons to just using the largest possible type you might need for the columns? Would the table size be bigger even if you only store "2" in a bigint column (even though that would fit in an int)? Is there a performance loss?
Thanks!
For Varchar, the reason you generally don't just use MAX is that it's stored differently and puts limitations on your index maintenance operations. For instance, you cannot rebuild an index "online" with a varchar(max) field on it. While there's a little hand waving involved, basically varchar(max) data gets stored off-row, so there's overhead in maintaining that extra data store.
For numeric types, the main thing is space. Bigint is an 8-byte signed integer whereas an int is only 4 bytes. If you don't need values bigger than about 2.1 billion (the signed int maximum), that's just wasted space (and often a lot of it if you have, say, a couple of billion rows of data).
Data Compression can solve some of those issues, but not without the cost of having to de-compress the data when it's queried.
So the reasons are varied, but with the possible exception of using larger size varchars (not varchar(max)), picking the "right" data type for your data is just a good idea.
I can't speak to any RDBMS other than SQL Server (but I imagine this applies to all of them)... A BIGINT takes up twice as much space as an INT... which means less data fitting onto a page, meaning less data in cache, meaning slower performance.
In SQL Server there are actually 4 INT types:
TINYINT (1 byte),
SMALLINT (2 bytes),
INT (4 bytes),
BIGINT (8 bytes).
A good database developer will put very careful thought into choosing the proper data type based on the data that's expected to be put in the column. Aside from the issue of storage space, data types function as data constraints. So if I choose TINYINT as my data type, that means I only expect to see values between 0 and 255 and will reject anything that falls outside of that range.
If a coworker were to submit a table design with all VARCHAR(255)s & BIGINTs, I'd reject it and have them size everything appropriately. It's lazy thinking like that which causes huge problems on the DB side of the house.
why don't all programmers just use the max column values (such as Varchar(255), BigInt, etc.) rather than something smaller like Varchar(30) or Int?
Some do exactly that. It's not at all uncommon to see developers store numeric or date/time values in varchar columns, either.
I often see performance and storage costs called out as the reason not to do this. Those are considerations (which vary by DBMS) but a more important one in the world of relational databases is data integrity. The chosen datatype is a critical part of the data model because it determines the domain of data that can be stored. On top of that, relational databases provide check, referential, and NULL constraints to further limit column values.
Wouldn't this almost completely eliminate an error like mine from occurring when you're not sure what's going to be inserted, especially if it's based off of user input?
Of course, but why stop at a 64-bit integer? Why not NUMERIC(1000)? That's a rhetorical question to point out that one must know the business domain so data can be properly modeled and validation rules enforced. A 64-bit integer is certainly overkill for storing a person's number of children, but you may end up with a value of several billion due to careless data entry. The column data type is the last defense against bad data and is especially important when the data is based off of user input.
All that being said, one can use an RDBMS as nothing more than a dumb storage engine and enforce data integrity rules (if any) in application code. In that case, storage and performance are the only considerations.
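To illustrate the "last defense" point, here's a sketch in MySQL syntax (the table is made up, and note that CHECK constraints are only enforced from MySQL 8.0.16 onward; other RDBMSs have enforced them for much longer):

CREATE TABLE person (
    person_id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    num_children TINYINT UNSIGNED NOT NULL DEFAULT 0,  -- the type already limits values to 0..255
    CONSTRAINT chk_children CHECK (num_children <= 30) -- the constraint tightens the domain further
);

With strict SQL mode, the type alone rejects the accidental "several billion", and the CHECK narrows the range to what the business domain actually allows.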

MySQL best way to store long strings

I'm looking for some advice from the MySQL experts on the best way to store long strings of data.
I have a general purpose table which is used to store any kind of data, by which I mean it should be able to hold alphanumeric and numeric data.
Currently, the table structure is simple with an ID and the actual data stored in a single column as follows:
id INT(11)
data VARCHAR(128)
I now have a requirement to store a larger amount of data (up to 500 characters) and am wondering whether the best way would be to simply increase the varchar column size, or whether I should add a new column (a TEXT type column?) for the times I need to store longer strings.
If any experts out there have any advice, I'm all ears!
My preferred method would be to simply increase the varchar column, but that's because I'm lazy.
The MySQL version I'm running is 5.0.77.
I should mention the new 500-character requirement will only be for the odd record; most records in the table will be no longer than 50 characters.
I thought I'd be future-proofing by making the column 128. Shows how much I knew!
Generally speaking, this is not a question that has a "correct" answer. There is no "infinite length" text storage type in MySQL. You could use LONGTEXT, but that still has an (absurdly high) upper limit. Yet if you do, you're kicking your DBMS in the teeth for having to deal with that absurd blob of a column for your 50-character text. Not to mention the fact that you hardly do anything with it.
So, most futureproofness(TM) is probably offered by LONGTEXT. But it's also a very bad method of resolving the issue. Honestly, I'd revisit the application requirements. Storing strings that have no "domain" (as in, being well-defined in their application) and arbitrary length is not one of the strengths of RDBMS.
If I wanted to solve this at the "application design" level, I'd use a NoSQL key-value store for this (and I'm as anti-NoSQL-hype as they get, so you know it's serious), even though I recognize it's a rather expensive fix for such a minor change. But if this is an indication of what your DBMS is eventually going to hold, it might be more prudent to switch now to avoid hitting this same problem a hundred times in the future. Data domain is very important in RDBMS, whereas it's explicitly sidelined in non-relational solutions, and an undefined domain seems to be exactly what you're dealing with here.
Stuck with MySQL? Just increase it to VARCHAR(1000). If you have no requirements for your data, it's irrelevant what you do anyway.
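If you do just widen the column, it's a one-liner (the table name here is made up, since the question doesn't give one):

ALTER TABLE general_data MODIFY data VARCHAR(1000);

Keep in mind MODIFY takes the full new column definition, so repeat NOT NULL, defaults, etc. or they'll be dropped along the way.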
Be careful if using TEXT. TEXT data is stored separately from the rest of the row, so whenever you query it MySQL may have to do extra reads from disk, which is slower in comparison with CHAR and VARCHAR, and ordinary indexes can only cover a prefix of the column. Some would argue that a NoSQL database is the better way to store long strings.
We can use VARCHAR(<maximum_limit>). The maximum limit we can pass is 65,535 bytes. Note: this maximum is really the row-size limit, shared among all columns in the table (TEXT/BLOB columns don't count toward it), and the effective maximum character length of a VARCHAR also depends on the character set used.

Field size for user registration form in MySQL?

When I add user info to MySQL through a PHP registration form, there are limits on the data fields (e.g. name is 20 chars max, email 18 chars, additional info 200, pass 12 chars, etc.).
Should I create the exact same field sizes in the MySQL table, or should I define longer fields?
Are there any benefits to doing so rather than just creating all string fields, e.g., 500 characters long?
When storing age as an integer, should I use a small int (i.e. with max 256) or not?
In general, it doesn't really matter. The important part is how you validate the information on the server side.
Make sure the entered data does not exceed the size of the column. If you don't, you can run into issues where mysql will auto-truncate the data.
Don't limit the password size. If someone wants to enter a 200-character password, let them. You should be storing a hash of it and not the plain text, so the exact length shouldn't make a difference.
Always store your data types properly. If you expect an integer age, store it in an integer column. There's no real reason to store it in a string column type.
As far as the rest of your limits, it's really application dependent more than anything. If you expect 200 character info limit, then store it in a VARCHAR(200). But if you're just assuming, store it in a TEXT type so that the user can enter as much as they'd like. But that's more application and use-case dependent than anything else...
Suggest you be liberal with your database column lengths (for your varchar), but strict on your application in enforcing size/lengths.
The business logic may change over time. Your application tier will be the keeper and enforcer of those rules.
Your database shouldn't have to adjust often to the changing business rules regarding length. Defining a column of type varchar(100) doesn't cost you anything today. The length is variable up to 100, so your performance and storage won't suffer at all.
Application and database changes/maintenance are expensive; database storage is cheap.
Some other detailed suggestions, if you will:
don't store age. Derive it from a date (birthdate) by using math (Today-Birthdate).
plain-text passwords shouldn't be stored at all, and the password itself shouldn't have a max length!
all your string fields -- define them as varchar(256) or 1024 and be done with them. Let your application enforce the business rules of the day.
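Putting those suggestions together, a sketch of what the table might look like (column names and lengths are illustrative only, and the hash length assumes something like bcrypt):

CREATE TABLE users (
    user_id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name            VARCHAR(256)  NOT NULL,
    email           VARCHAR(256)  NOT NULL,
    additional_info VARCHAR(1024) NULL,
    password_hash   CHAR(60)      NOT NULL,  -- store a hash, never the raw password
    birthdate       DATE          NULL       -- derive age at query time
);

Age then falls out of a query such as SELECT TIMESTAMPDIFF(YEAR, birthdate, CURDATE()) FROM users, rather than being stored and going stale.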
If you're using the MyISAM table type it will be a bit more efficient for queries if you keep the record length static. So if you can use char fields of a fixed size instead of varchar it's better (as long as all the fields in a record are static). However, the full number of characters you specify will be allocated for every record, so you need to decide if that storage cost is more important.
Unless your application is huge and your DB is going to be massive, this should matter very little in terms of performance. I would say that you should give yourself a bit of extra room in the MySQL fields as it is a lot easier to change the registration form max lengths than the MySQL max lengths later. Using smallint for age is fine. Or not. Generally you shouldn't allow for fields to take up more space than they are going to need, but I would give myself some padding just in case. Again, though, it shouldn't make much of a difference.

When setting MySQL schema, why use certain types?

When I'm setting up a MySQL table, it asks me to define the name of the column, type of input, and length. My assumption, without having read anything about it, is that it's for minimization. Specify the smallest possible int/smallint/tinyint for your needs, and it will reduce overhead of some sort. If it's all positives, make it unsigned to double your positive range, etc.
What happens if I just make every field a varchar-200 characters? When/why is this bad, what will I miss out on, and when will any inefficiencies manifest themselves? 100k records?
I think about this every time I set up a DB, but I haven't built anything to scale enough where I've ever had my schema set up inappropriately, either too "strict/small" or too "loose/big". Can someone confirm that I'm making good assumptions about speed and efficiency?
Thanks!
Data types not only optimize storage, but how data is indexed. As your databases get bigger, it will become apparent that it's quicker to search for all the records that have a 1 in an integer field than those that have a "1" in a varchar field. This becomes especially important when you're joining data from more than one table and your database engine is having to do this sort of thing repeatedly. (Daren also rightly points out below that it's important that the types of the fields you're matching on are identical as well.)
The level at which these inefficiencies become an issue depends greatly on your hardware and your application design. We have big enough iron these days that if you're building moderate-scale apps, you may not see an appreciable difference. (Aside from feeling a little bit guilty about your database design!) But establishing good habits on small projects makes the bigger ones easier when they come along.
If you have two columns as varchar and put in the values 10 and 20 and add them, you can end up with the string 1020 (concatenation) instead of the 30 you'd likely expect.
Sure, you could save everything as VARCHAR strings. But you'd be giving up a lot of functionality provided by the database engine.
You should choose the data type that most closely matches the intended use of the column. For example, using DATE or DATETIME to store dates provides you with all sorts of date/time functions that you don't get with basic VARCHAR types.
Likewise, fields used to count things or provide simple unique IDs should be INT or one of its related types. Also bear in mind that an INT occupies only 4 bytes, whereas a 9-digit string uses at least 9 bytes.
For character data, it's wise to use NVARCHAR for internationalized values that users in any locale are going to enter (esp. names and locations). If you know the text is limited to US or internal use only, VARCHAR is safe.
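A small sketch of what you give up with varchar-everything (the members table and its columns are invented for illustration):

-- With a real DATE column you get date arithmetic and sane range filters for free:
SELECT username, DATEDIFF(CURDATE(), signup_date) AS days_registered
FROM members
WHERE signup_date >= DATE_SUB(CURDATE(), INTERVAL 30 DAY);

-- Stored in a VARCHAR(200), '2010-02-03' and '2010-2-3' are different strings,
-- range comparisons become string comparisons, and the functions above no longer apply cleanly.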

char vs varchar for performance in stock database

I'm using MySQL to set up a database of stock options. There are about 330,000 rows (each row is 1 option). I'm new to SQL so I'm trying to decide on the field types for things like option symbol (varies from 4 to 5 characters), stock symbol (varies from 1 to 5 characters), company name (varies from 5 to 60 characters).
I want to optimize for speed. Both creating the database (which happens every 5 minutes as new price data comes out -- I don't have a real-time data feed, but it's near real-time in that I get a new text file with 330,000 rows delivered to me every 5 minutes; this new data completely replaces the prior data), and also for lookup speed (there will be a web-based front end where many users can run ad hoc queries).
If I'm not concerned about space (since the db lifetime is 5 minutes, and each row contains maybe 300 bytes, so maybe 100MBs for the whole thing) then what is the fastest way to structure the fields?
Same question for numeric fields, actually: Is there a performance difference between int(11) and int(7)? Does one length work better than another for queries and sorting?
Thanks!
In MyISAM, there is some benefit to making fixed-width records. VARCHAR is variable width. CHAR is fixed width. If your rows have only fixed-width data types, then the whole row is fixed width, and MySQL gains some advantage calculating the space requirements and offsets of rows in that table. That said, the advantage may be small, and the possible tiny gain is easily outweighed by other costs (such as cache efficiency) of having fixed-width, padded CHAR columns where VARCHAR would store more compactly.
The breakpoint where it becomes more efficient depends on your application, and this is not something that can be answered except by you testing both solutions and using the one that works best for your data under your application's usage.
Regarding INT(7) versus INT(11), this is irrelevant to storage or performance. It is a common misunderstanding that MySQL's argument to the INT type has anything to do with size of the data -- it doesn't. MySQL's INT data type is always 32 bits. The argument in parentheses refers to how many digits to pad if you display the value with ZEROFILL. E.g. INT(7) will display 0001234 where INT(11) will display 00000001234. But this padding only happens as the value is displayed, not during storage or math calculation.
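A quick way to see this for yourself (throwaway table, nothing assumed beyond stock MySQL):

CREATE TABLE width_demo (a INT(7) ZEROFILL, b INT(11) ZEROFILL);
INSERT INTO width_demo VALUES (1234, 1234);
SELECT a, b FROM width_demo;
-- a = 0001234, b = 00000001234; both are stored as the same 4-byte integer.

(ZEROFILL also implies UNSIGNED; without ZEROFILL the display width is cosmetic metadata that most clients simply ignore.)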
If the actual data in a field can vary a lot in size, varchar is better because it leads to smaller records, and smaller records mean a faster DB (more records can fit into cache, smaller indexes, etc.). For the same reason, using smaller ints is better if you need maximum speed.
OTOH, if the variance is small, e.g. a field has a maximum of 20 chars, and most records actually are nearly 20 chars long, then char is better because it allows some additional optimizations by the DB. However, this really only matters if it's true for ALL the fields in a table, because then you have fixed-size records. If speed is your main concern, it might even be worth it to move any non-fixed-size fields into a separate table, if you have queries that use only the fixed-size fields (or if you only have shotgun queries).
In the end, it's hard to generalize because a lot depends on the access patterns of your actual app.
Given your system constraints I would suggest a varchar since anything you do with the data will have to accommodate whatever padding you put in place to make use of a fixed-width char. This means more code somewhere which is more to debug, and more potential for errors. That being said:
The major bottleneck in your application is due to dropping and recreating your database every five minutes. You're not going to get much performance benefit out of microenhancements like choosing char over varchar. I believe you have some more serious architectural problems to address instead. – Princess
I agree with the above comment. You have bigger fish to fry in your architecture before you can afford to worry about the difference between a char and a varchar. For one, if a web user attempts to run an ad hoc query while the database is in the process of being recreated, you are going to get errors (e.g. "database doesn't exist" or simply "timed out" type issues).
I would suggest that instead you build (at the least) a quote table for the most recent quote data (with a time stamp), a ticker symbol table and a history table. Your web users would query against the ticker table to get the most recent data. If a symbol comes over in your 5-minute file that doesn't exist, it's simple enough to have the import script create it before posting the new info to the quote table. All others get updated and queries default to the current day's data.
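A rough sketch of that layout and the import step (the names are illustrative, not from the question, and the '?' placeholders stand in for values parsed from the 5-minute file):

CREATE TABLE ticker (
    ticker_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    symbol    VARCHAR(5) NOT NULL UNIQUE
);

CREATE TABLE quote (
    ticker_id INT UNSIGNED NOT NULL PRIMARY KEY,
    price     DECIMAL(10,4) NOT NULL,
    quoted_at DATETIME NOT NULL,
    FOREIGN KEY (ticker_id) REFERENCES ticker (ticker_id)
);

-- Import script: update rows in place instead of dropping and recreating anything.
INSERT INTO quote (ticker_id, price, quoted_at)
VALUES (?, ?, NOW())
ON DUPLICATE KEY UPDATE price = VALUES(price), quoted_at = VALUES(quoted_at);

A history table would take a plain INSERT of the same data, keyed by (ticker_id, quoted_at).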
I would definitely not recreate the database each time. Instead I would do the following:
read in the update/snapshot file and create some object based on each row.
for each row get the symbol/option name (unique) and set that in the database
If it were me I would also have an in memory cache of all the symbols and the current price data.
Price data is never an int - prices have fractional parts, so use a DECIMAL type (or store them as characters if you never do math on them).
The company name is probably not unique as there are many options for a particular company. That should be an index and you can save space just using the id of a company.
As someone else also pointed out, your web clients do not need to hit the actual database and run a query - you can probably just hit your cache. (Though that really depends on what tables and data you expose to your clients and what data they want.)
Having query access for other users is also a reason NOT to keep removing and creating a database.
Also remember that creating databases is subject to whatever actual database implementation you use. If you ever port from MySQL to, say, PostgreSQL, you will discover the very unpleasant fact that creating databases in PostgreSQL is a comparatively very slow operation. It is orders of magnitude slower than reading and writing table rows, for instance.
It looks like there is an application design problem to address first, before you optimize for performance choosing proper data types.