I often find myself wanting to store data of more than one type (usually integers and text) in the same column in a MySQL database. I know this is horrible, but it happens when I'm storing responses that people have made to questions in a questionnaire. Some questions need an integer response, some need a text response and some might be an item selected from a list.
The approaches I've taken in the past have been:
1. Store everything as text and convert to int (or whatever) when needed later.
2. Have two columns, one for text and one for int. Then you fill in one per row per response and leave the other one as null.
3. Have two tables, one for text responses and one for integer responses.
I don't really like any of those, though, and I have a feeling there must be a much better way to deal with this kind of situation.
To make it more concrete, here's an example of the tables I might have:
CREATE TABLE question (
id int(11) NOT NULL auto_increment,
text VARCHAR(200) NOT NULL default '',
PRIMARY KEY (id)
);
CREATE TABLE response (
id int(11) NOT NULL auto_increment,
question int(11) NOT NULL,
user int(11) NOT NULL,
response VARCHAR(200) NOT NULL default '',
PRIMARY KEY (id)
);
or, if I went with using option 2 above:
CREATE TABLE response (
id int(11) NOT NULL auto_increment,
question int(11) NOT NULL,
user int(11) NOT NULL,
text_response VARCHAR(200),
numeric_response int(11),
PRIMARY KEY (id)
);
and if I used option 3 there'd be a responseInteger table and a responseText table.
Is any of those the right approach, or am I missing an obvious alternative?
[Option 2 is] NOT the most normalized option [as @Ray claims]. The most normalized would have no nullable fields, and obviously option 2 would require a null on every row.
At this point in your design you have to think about the usage, the queries you'll do, the reports you'll write. Will you want to do math on all of the numeric responses at the same time? i.e. WHERE numeric_response IS NOT NULL? Probably unlikely.
More likely you'd ask, "What's the average response WHERE question = 11?" In that case you can query either the INT table or the INT column, and neither is easier than the other.
If you did use two tables, you'd more than likely be constantly UNIONing them together for questions like "What % of questions have a response?"
Can you see how the questions you ask your database to answer start to drive the design?
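As a rough illustration of how such a question drives the design, here is a minimal sketch of the "what % of questions have a response" query against the two-table layout (option 3). It uses Python's sqlite3 module as a stand-in for MySQL, and the table names response_integer and response_text, as well as the sample rows, are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE question (id INTEGER PRIMARY KEY, text TEXT NOT NULL);
-- Option 3: one table per response type (illustrative names).
CREATE TABLE response_integer (question INTEGER NOT NULL, user INTEGER NOT NULL, response INTEGER NOT NULL);
CREATE TABLE response_text (question INTEGER NOT NULL, user INTEGER NOT NULL, response TEXT NOT NULL);

INSERT INTO question (id, text) VALUES
    (1, 'Age?'), (2, 'Comments?'), (3, 'Rating?'), (4, 'Anything else?');
INSERT INTO response_integer VALUES (1, 100, 34), (3, 100, 5);
INSERT INTO response_text VALUES (2, 100, 'Fine');
""")

# With two tables, any question that spans all responses needs a UNION.
answered = conn.execute("""
    SELECT COUNT(DISTINCT question) FROM (
        SELECT question FROM response_integer
        UNION ALL
        SELECT question FROM response_text
    )
""").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM question").fetchone()[0]
percent_answered = 100.0 * answered / total
print(percent_answered)
```

With a single response table (options 1 or 2) the inner UNION disappears and the same count comes from one SELECT.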
I'd opt for Option 1. The answers are always text strings, but sometimes the text string happens to be the representation of an integer. What is less easy is to determine what constraints, if any, should be placed on the answer to a given question. If some answer should only be a sequence of one or more digits, how do you validate that? Most likely, the Questions table should contain information about the possible answers, and that should guide the validation.
I note that the combination of QuestionID and UserID is (or should be) unique (for a given questionnaire). So, you really don't need the auto-increment column in the answer. You should also have a unique constraint (or primary key constraint) on the QuestionID and UserID anyway (regardless of whether you keep the auto-increment column).
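A minimal sketch of both suggestions, again using Python's sqlite3 module as a stand-in for MySQL: the question table carries a hypothetical answer_type column that guides validation, and the response table uses (question, user) as its primary key instead of an auto-increment id. The column name answer_type and the helper save_response are assumptions for illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE question (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    -- Hypothetical column describing what kind of answer is allowed.
    answer_type TEXT NOT NULL CHECK (answer_type IN ('int', 'text'))
);
CREATE TABLE response (
    question INTEGER NOT NULL REFERENCES question (id),
    user INTEGER NOT NULL,
    response TEXT NOT NULL,          -- always stored as text (option 1)
    PRIMARY KEY (question, user)     -- one answer per user per question
);
INSERT INTO question VALUES (1, 'How old are you?', 'int'), (2, 'Any comments?', 'text');
""")

def save_response(question_id, user_id, answer):
    """Validate the answer against the question's declared type, then store it."""
    row = conn.execute(
        "SELECT answer_type FROM question WHERE id = ?", (question_id,)
    ).fetchone()
    if row is None:
        raise ValueError("unknown question")
    if row[0] == "int" and not (answer.isascii() and answer.isdigit()):
        raise ValueError("answer must be a sequence of one or more digits")
    conn.execute(
        "INSERT INTO response VALUES (?, ?, ?)", (question_id, user_id, answer)
    )

save_response(1, 100, "34")
```

A second answer from the same user to the same question is rejected by the primary key, and a non-numeric answer to an 'int' question is rejected before it reaches the table.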
Option 2 is the correct, most normalized option.
Related
I am trying to create a MySQL table that has a generic ID column, but also a secondary ID column, both of which need some form of auto incrementing
currently my MySQL table looks like this:
`ban_id` mediumint unsigned NOT NULL AUTO_INCREMENT,
`student_uuid` varchar(36) NOT NULL,
`student_ban_id` tinyint unsigned NOT NULL AUTO_INCREMENT,
(a bunch of data irrelevant to this question)
PRIMARY KEY (`student_uuid`, `student_ban_id`),
UNIQUE (`ban_id`)
The desired behavior is that ban_id is just a generic entry_id and that student_ban_id is the ban's number for the given student. (my reasoning is that I want to be able to reference bans by an id value if the student_uuid is unavailable, but the program spec also requires the ability to take student:banID as a valid means of reference)
An example row might be BanID:501, {studentUUID}, studentBanID:2 (501st ban, 2nd ban against the given student).
I have run into the issue that the MyISAM engine does not support tracking two separate incremental columns at once (I believe it can handle both desired behaviors, but not at the same time).
What might be the best way to achieve such a behavior?
Much appreciated!
-Cryptic
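One possible way to get the numbering described above is to keep a single auto-increment column for ban_id and compute student_ban_id as "highest existing number for this student, plus one" at insert time. The sketch below shows that idea with Python's sqlite3 module as a stand-in for MySQL; the table name ban and the helper add_ban are assumptions for illustration. In a real MySQL setup with concurrent writers, this would additionally need locking or a retry when the unique key is violated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE ban (
    ban_id INTEGER PRIMARY KEY AUTOINCREMENT,   -- global entry id
    student_uuid TEXT NOT NULL,
    student_ban_id INTEGER NOT NULL,            -- per-student counter
    UNIQUE (student_uuid, student_ban_id)
)
""")

def add_ban(student_uuid):
    """Insert a ban and return (ban_id, student_ban_id) for the new row."""
    with conn:  # commits on success, rolls back on error
        # The aggregate SELECT always yields exactly one row, so a student
        # with no bans yet gets COALESCE(NULL, 0) + 1 = 1.
        conn.execute(
            """
            INSERT INTO ban (student_uuid, student_ban_id)
            SELECT ?, COALESCE(MAX(student_ban_id), 0) + 1
            FROM ban WHERE student_uuid = ?
            """,
            (student_uuid, student_uuid),
        )
        return conn.execute(
            """
            SELECT ban_id, student_ban_id FROM ban
            WHERE student_uuid = ?
            ORDER BY student_ban_id DESC LIMIT 1
            """,
            (student_uuid,),
        ).fetchone()
```

Either ban_id alone or the pair (student_uuid, student_ban_id) then identifies a row, which matches the two ways of referencing a ban described in the question.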
I have a question regarding primary keys in Relational Databases. Let's assume that I have the following tables:
Box
  id
  box_name
BoxItems
  id
  item_name
  belongs_to_box_id (foreign key)
Let's also assume that I intend to store millions of items per day. I would probably use bigint or a guid for the BoxItems.Id.
What I was thinking, and I need your advice on that, is that instead of a bigint id for BoxItems, I could use a sequential tinyint number, so that what identifies each item is the combination of belongs_to_box_id plus the tinyint column (e.g. item_number).
So now instead of the above we get:
BoxItems
  belongs_to_box_id
  item_sequence_number [TINYINT]
  item_name
Example:
Items.Insert(1,1, "my item 1");
Items.Insert(1,2, "my item 2");
So instead of using bigint or GUID for that matter, I can use tinyint and save a lot of disk space.
I want to know what the pros and cons of such an approach are. I am developing my app using MySQL and ASP.NET 4.5.
When you think about it, there's really not much difference between the "box/contents" problem and the "order/line item" problem.
create table boxes (
box_id integer primary key,
box_name varchar(35) not null
);
create table boxed_items (
box_id integer not null references boxes (box_id),
box_item_num tinyint not null,
item_name varchar(35) not null
);
For MySQL, you'd probably use unsigned integer and unsigned tinyint. There's no compelling reason for a database to avoid negative numbers, but developers should lean on the Principle of Least Surprise.
Make sure 256 values are enough. Getting that wrong can be expensive to correct in a table that gets millions of rows each day.
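To make the composite identifier and the 256-value limit concrete, here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL. The composite primary key on (box_id, box_item_num) and the CHECK constraint emulating an unsigned tinyint range are assumptions added for the illustration, as is the helper try_insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE boxes (
    box_id INTEGER PRIMARY KEY,
    box_name TEXT NOT NULL
);
CREATE TABLE boxed_items (
    box_id INTEGER NOT NULL REFERENCES boxes (box_id),
    -- Emulates MySQL's TINYINT UNSIGNED range of 0..255.
    box_item_num INTEGER NOT NULL CHECK (box_item_num BETWEEN 0 AND 255),
    item_name TEXT NOT NULL,
    PRIMARY KEY (box_id, box_item_num)
);
INSERT INTO boxes VALUES (1, 'first box');
INSERT INTO boxed_items VALUES (1, 1, 'my item 1'), (1, 2, 'my item 2');
""")

def try_insert(box_id, item_num, name):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO boxed_items VALUES (?, ?, ?)", (box_id, item_num, name)
            )
        return True
    except sqlite3.IntegrityError:
        return False

print(try_insert(1, 3, "my item 3"))    # new item number in an existing box
print(try_insert(1, 256, "too many"))   # outside the tinyint-sized range
```

A box that ever needs a 257th item cannot be represented without widening the column, which is the expensive correction the warning above refers to.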
I would recommend writing a simple test for both approaches, comparing performance, disk space and ease of implementation, and then making a judgement call. Both of your suggestions are reasonable and I doubt there will be much of a difference in performance, but the best way to find out is to just try it, and then you will know for sure.
I have a database design where I store image filenames in a table called resource_file.
CREATE TABLE `resource_file` (
`resource_file_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`resource_id` int(11) NOT NULL,
`filename` varchar(200) NOT NULL,
`extension` varchar(5) NOT NULL DEFAULT '',
`display_order` tinyint(4) NOT NULL,
`title` varchar(255) NOT NULL,
`description` text NOT NULL,
`canonical_name` varchar(200) NOT NULL,
PRIMARY KEY (`resource_file_id`)
) ENGINE=InnoDB AUTO_INCREMENT=592 DEFAULT CHARSET=utf8;
These "files" are gathered under another table called resource (which is something like an album):
CREATE TABLE `resource` (
`resource_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
`description` text NOT NULL,
PRIMARY KEY (`resource_id`)
) ENGINE=InnoDB AUTO_INCREMENT=285 DEFAULT CHARSET=utf8;
The logic behind this design comes in handy if I want to assign a certain type of "resource" (album) to a certain type of "item" (product, user, project, etc.), for example:
CREATE TABLE `resource_relation` (
`resource_relation_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`module_code` varchar(32) NOT NULL DEFAULT '',
`resource_id` int(11) NOT NULL,
`data_id` int(11) NOT NULL,
PRIMARY KEY (`resource_relation_id`)
) ENGINE=InnoDB AUTO_INCREMENT=328 DEFAULT CHARSET=utf8;
This table holds the relationship of a resource to a certain type of item like:
Product
User
Gallery
etc.
I do exactly this by giving the "module_code" a value like, "product" or "user" and assigning the data_id to the corresponding unique_id, in this case, product_id or user_id.
So at the end of the day, if I want to query the resources assigned to a product with the id of 123, I query the resource_relation table (a very simplified pseudo-query):
SELECT * FROM resource_relation WHERE data_id = 123 AND module_code = 'product'
And this gives me the resources for which I can find the corresponding images.
I find this approach very practical, but I don't know if it is a correct approach to this particular problem.
What is the name of this approach?
Is it a valid design?
Thank you
This one uses super-type/sub-type. Note how the primary key propagates from the super-type table into the sub-type tables.
To answer your second question first: the table resource_relation is an implementation of an Entity-attribute-value model.
So the answer to the next question is, it depends. According to relational database theory it is bad design, because we cannot enforce a foreign key relationship between data_id and say product_id, user_id, etc. It also obfuscates the data model, and it can be harder to undertake impact analysis.
On the other hand, lots of people find, as you do, that EAV is a practical solution to a particular problem, with one table instead of several. Although, if we're talking practicality, EAV doesn't scale well (at least in relational products, there are NoSQL products which do things differently).
From which it follows, the answer to your first question, is it the correct approach?, is "Strictly, no". But does it matter? Perhaps not.
"I can't see a problem why this would "not" scale. Would you mind explaining it a little bit further?"
There are two general problems with EAV.
The first is that small result sets (say DATA_ID=USER_ID) and big result sets (say DATA_ID=PRODUCT_ID) use the same query, which can lead to sub-optimal execution plans.
The second is that adding more attributes to the entity means the query needs to return more rows, whereas a relational solution would return the same number of rows, with more columns. This is the major scaling cost. It also means we end up writing horrible queries like this one.
Now, in your specific case perhaps neither of these concerns is relevant. I'm just explaining the reasons why EAV can cause problems.
"How would I be supposed to assign "resources" to, for example, my product table, "the normal way"?"
The more common approach is to have a different intersection table (AKA junction table) for each relationship, e.g. USER_RESOURCES, PRODUCT_RESOURCES, etc. Each table would consist of a composite primary key, e.g. (USER_ID, RESOURCE_ID), and probably not much else.
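The junction-table approach might look roughly like the sketch below, which uses Python's sqlite3 module as a stand-in for MySQL. The minimal product, user and resource tables, the sample rows and the helper resources_for_product are assumptions for illustration; only the table names USER_RESOURCES and PRODUCT_RESOURCES come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE resource (resource_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE user (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- One junction table per relationship, each with real foreign keys.
CREATE TABLE product_resources (
    product_id INTEGER NOT NULL REFERENCES product (product_id),
    resource_id INTEGER NOT NULL REFERENCES resource (resource_id),
    PRIMARY KEY (product_id, resource_id)
);
CREATE TABLE user_resources (
    user_id INTEGER NOT NULL REFERENCES user (user_id),
    resource_id INTEGER NOT NULL REFERENCES resource (resource_id),
    PRIMARY KEY (user_id, resource_id)
);

INSERT INTO resource VALUES (1, 'front'), (2, 'back');
INSERT INTO product VALUES (123, 'Widget');
INSERT INTO user VALUES (5, 'ann');
INSERT INTO product_resources VALUES (123, 1), (123, 2);
INSERT INTO user_resources VALUES (5, 2);
""")

def resources_for_product(product_id):
    """Names of the resources attached to one product, sorted by name."""
    rows = conn.execute(
        """
        SELECT r.name
        FROM product_resources pr
        JOIN resource r ON r.resource_id = pr.resource_id
        WHERE pr.product_id = ?
        ORDER BY r.name
        """,
        (product_id,),
    )
    return [row[0] for row in rows]

print(resources_for_product(123))
```

Unlike the generic data_id column, a row here cannot point at a product or resource that does not exist, because the database checks both foreign keys.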
The other approach is to use a generic super-type table with specific sub-type tables. This is the implementation which Damir has modelled. The normal use case for super-types is when we have a bunch of related entities which have some attributes, behaviours and usages in common plus some distinct features of their own. For instance, PERSON and USER, CUSTOMER, SUPPLIER.
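As a rough sketch of the super-type/sub-type pattern, using Python's sqlite3 module as a stand-in for MySQL: each sub-type table reuses the super-type's primary key as both its own primary key and a foreign key. The columns full_name, credit_limit and payment_terms are hypothetical, chosen only to show "common" versus "distinct" attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
-- Super-type: attributes every kind of person shares.
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
-- Sub-types: the super-type key propagates down as primary key and foreign key.
CREATE TABLE customer (
    person_id INTEGER PRIMARY KEY REFERENCES person (person_id),
    credit_limit INTEGER NOT NULL      -- hypothetical customer-only attribute
);
CREATE TABLE supplier (
    person_id INTEGER PRIMARY KEY REFERENCES person (person_id),
    payment_terms TEXT NOT NULL        -- hypothetical supplier-only attribute
);

INSERT INTO person VALUES (1, 'Ada'), (2, 'Bob');
INSERT INTO customer VALUES (1, 500);
INSERT INTO supplier VALUES (2, 'net 30');
""")

# Shared attributes come from the super-type, specific ones from the sub-type.
customers = conn.execute("""
    SELECT p.full_name, c.credit_limit
    FROM customer c
    JOIN person p ON p.person_id = c.person_id
""").fetchall()
print(customers)
```

Anything that needs to reference "any person" can then point at person_id in the super-type table with an ordinary foreign key.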
Regarding your scenario, I don't think USER, PRODUCT and GALLERY fit this approach. Sure, they are all consumers of RESOURCE, but that is pretty much all they have in common. So trying to map them to an ITEM super-type is a procrustean solution; gaining a generic ITEM_RESOURCE table is likely to be a small reward for the additional hoops you're going to have to jump through elsewhere.
I have a database design where i store images in a table called
resource_file.
You're not storing images; you're storing filenames. The filename may or may not identify an image. You'll need to keep database and filesystem permissions in sync.
Your resource_file table structure says, "Image filenames are identifiable in the database, but are unidentifiable in the filesystem." It says that because resource_file_id is the primary key, but there are no unique constraints besides that id. I suspect your image files actually are identifiable in the filesystem, and you'd be better off with database constraints that match that reality. Maybe a unique constraint on (filename, extension).
Same idea for the resource table.
For resource_relation, you probably need a unique constraint on either (resource_id, data_id) or (resource_id, data_id, module_code). But . . .
I'll try to give this some more thought later. It's kind of hard to figure out what you're trying to do with resource_relation, which is usually a red flag.
I just read the accepted answer of this question, which left me with this question.
Here's a quote from that answer:
"But since you tagged this question with MySQL, I'll mention a MySQL-specific tip: when your query implicitly generates a temporary table, for instance while sorting or GROUP BY, VARCHAR fields are converted to CHAR to gain the advantage of working with fixed-width rows. If you use a lot of VARCHAR(255) fields for data that doesn't need to be that long, this can make the temporary table very large."
As I understand it, the advantage of CHAR is that you get fixed-width rows, so doesn't a VARCHAR in the same table mess that up? Are there any advantages of using CHAR when you have a VARCHAR in the same table?
Here's an example:
Table with CHAR:
CREATE TABLE address (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
street VARCHAR(100) NOT NULL,
postcode CHAR(8) NOT NULL,
PRIMARY KEY (id)
);
Table without CHAR:
CREATE TABLE address (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
street VARCHAR(100) NOT NULL,
postcode VARCHAR(8) NOT NULL,
PRIMARY KEY (id)
);
Will the table with CHAR perform any better than the table without CHAR, and if so, in what situations?
"VARCHAR" basically sets a maximum length for the field and only stores the data that is entered into it, thus saving on space. The "CHAR" type has a fixed length, so if you set "CHAR(100)", 100 characters' worth of space will be used regardless of what the contents are.
The only time you will gain a speed advantage is if you have no variable-length fields in your record ("VARCHAR", "TEXT", etc.). You may notice that MySQL internally changes all your "CHAR" fields to "VARCHAR" as soon as a variable-length field type is added.
Also "CHAR" is less efficient from a space storage point of view, but more efficient for searching and adding. It's faster because the database only has to read an offset value to get a record rather than reading parts until it finds the end of a record. And fixed length records will minimize fragmentation, since deleted record space can be reused for new records.
Hope it helps.
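The offset argument can be illustrated with a small sketch in Python. This is a toy model of fixed-width versus length-prefixed records, not MySQL's actual on-disk format; the 4-byte id, the 8-character space-padded postcode and all function names are assumptions for the example.

```python
import struct

# Fixed-width record: 4-byte unsigned id + postcode padded to 8 bytes (like CHAR(8)).
FIXED = struct.Struct("<I8s")

def pack_fixed(rows):
    return b"".join(
        FIXED.pack(row_id, postcode.ljust(8).encode("ascii")) for row_id, postcode in rows
    )

def read_fixed(buffer, index):
    """Jump straight to record `index`: its offset is index * record size."""
    row_id, raw = FIXED.unpack_from(buffer, index * FIXED.size)
    return row_id, raw.decode("ascii").rstrip(" ")

def pack_variable(rows):
    """Variable-width record: 4-byte id, 1-byte length, then the bytes (like VARCHAR)."""
    out = b""
    for row_id, postcode in rows:
        data = postcode.encode("ascii")
        out += struct.pack("<IB", row_id, len(data)) + data
    return out

def read_variable(buffer, index):
    """Walk over every earlier record to find where record `index` starts."""
    offset = 0
    for _ in range(index):
        (length,) = struct.unpack_from("<B", buffer, offset + 4)
        offset += 5 + length
    row_id, length = struct.unpack_from("<IB", buffer, offset)
    return row_id, buffer[offset + 5 : offset + 5 + length].decode("ascii")

rows = [(1, "SW1A 1AA"), (2, "M1 1AE"), (3, "B33 8TH")]
print(read_fixed(pack_fixed(rows), 2))
print(read_variable(pack_variable(rows), 2))
```

Both readers return the same row, but the fixed-width one computes the position directly while the variable-width one has to read the lengths of the records before it. The fixed-width buffer is also larger whenever values are shorter than the declared width.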
I plan to create a table to store the race result like this:
Place  RaceNumber  Gender  Name       Result
12     0112        Male    Mike Lee   1:32:40
16     0117        Female  Rose Mary  2:20:40
I am confused about the column type definitions.
Can the result be stored as varchar(32), or should it be some other type?
For racenumber, which is better: int(11) or varchar(11)?
Can I use a UNIQUE KEY the way I have below?
Do I need to split name into firstname and lastname in my DB table?
DROP TABLE IF EXISTS `race_result`;
CREATE TABLE IF NOT EXISTS `race_result` (
`id` int(11) NOT NULL auto_increment,
`place` int(11) NOT NULL,
`racenumber` int(11) NOT NULL,
`gender` enum('male','female') NOT NULL,
`name` varchar(16) NOT NULL,
`result` varchar(32) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `racenumber` (`racenumber`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
Some advice/opinions regarding datatypes.
Result - This is a time, you may want to do some calculations on this time, therefore you should store it as a time type.
RaceNumber - This is a reference. Whilst it is a number, you will be performing no calculations on it. Therefore you should store it as a varchar rather than an int. This will avoid confusion as to its usage and avoid accidental manipulation of it as a number.
Name - Look at the length of string you allow for the Name. Be careful about limiting this value by so much. 16 characters may be too small for some names in the future.
Place - Is this required storage? Can you calculate the place of a runner based on their Result alone? However, you should keep a good primary key for your table.
In answer to your specific questions:
Result: I would just set the result to an integer number of seconds. My opinion is that data should be stored in databases, not formatting. Since the likely things you're going to want to do with this is sort by it and return rows less than or greater than specific values of it, an integer seems better to me.
Race number: Same for race number. If it's always going to be numeric, use an integer and worry about the formatting in the application. If it can be non-numeric then by all means make it varchar but, for a numeric value, I can't see enough gain in making it so.
Unique key: I don't really see the point in having a unique index on race number and ID. ID is, by definition, already unique as a primary key. Perhaps you meant race number and place although even that is risky in the event of two people drawing for a place.
Split names: If you're ever going to treat them as individual values, then yes. Otherwise no. In other words, avoid things like where fullname like 'Mike %'.
For the name, if you ever want to sort on lastname, while you display it as "firstname lastname", then you will need to use separate columns.
In general: think about what you want to do with the data. Leave formatting to the application that is displaying the data. Avoid situations where you need string manipulation or complicated maths to get at the values you need.
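The suggestion to store the result as an integer number of seconds, and the race number as a plain integer, leaves the formatting to the application. A minimal sketch of that application-side conversion in Python follows; the function names are made up for the example, and the four-digit race number width is an assumption based on the sample rows.

```python
def parse_result(text):
    """Convert 'H:MM:SS' as shown in the results list into total seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def format_result(total_seconds):
    """Convert stored seconds back into 'H:MM:SS' for display."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

def format_race_number(number):
    """Zero-pad an integer race number to four digits for display."""
    return f"{number:04d}"

stored = parse_result("1:32:40")
print(stored, format_result(stored), format_race_number(112))
```

Stored this way, sorting by result or filtering for times under a threshold is a plain integer comparison in SQL, with no string manipulation needed.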