I plan to create a table to store race results like this:
Place  RaceNumber  Gender  Name       Result
12     0112        Male    Mike Lee   1:32:40
16     0117        Female  Rose Mary  2:20:40
I am confused about the column type definitions.
Can the result be stored as varchar(32), or is some other type better?
And for racenumber, which is better: int(11) or varchar(11)?
Can I use a UNIQUE KEY the way I do below?
Do I need to split name into firstname and lastname in my table?
DROP TABLE IF EXISTS `race_result`;
CREATE TABLE IF NOT EXISTS `race_result` (
  `id` int(11) NOT NULL auto_increment,
  `place` int(11) NOT NULL,
  `racenumber` int(11) NOT NULL,
  `gender` enum('male','female') NOT NULL,
  `name` varchar(16) NOT NULL,
  `result` varchar(32) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `racenumber` (`racenumber`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
Some advice and opinions regarding datatypes:
Result - This is a time, and you may want to do calculations on it, so store it as a TIME type.
RaceNumber - This is a reference. While it is numeric, you will perform no calculations on it, so store it as a varchar rather than an int. That avoids confusion about its usage and accidental manipulation of it as a number.
Name - Look at the length of string you allow for the name. Be careful about limiting it too tightly; 16 characters may be too small for some names in the future.
Place - Is this required storage? Can you calculate a runner's place from the Result alone? You should still keep a good primary key on the table.
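Putting that advice together, a sketch of what the table might look like (the TIME type, varchar race number, and longer name column are the suggestions above; dropping place follows if you decide it can be derived from result):
DROP TABLE IF EXISTS `race_result`;
CREATE TABLE IF NOT EXISTS `race_result` (
  `id` int(11) NOT NULL auto_increment,
  `racenumber` varchar(11) NOT NULL,        -- a reference, not a quantity
  `gender` enum('male','female') NOT NULL,
  `name` varchar(100) NOT NULL,             -- roomier than varchar(16)
  `result` time NOT NULL,                   -- '01:32:40'; sortable, supports time arithmetic
  PRIMARY KEY (`id`),
  UNIQUE KEY `racenumber` (`racenumber`)    -- one entry per race number
) ENGINE=MyISAM DEFAULT CHARSET=utf8;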
In answer to your specific questions:
Result: I would store the result as an integer number of seconds. My opinion is that databases should store data, not formatting. Since the likely operations are sorting by it and returning rows less than or greater than specific values, an integer seems better to me.
Race number: the same goes for race number. If it's always going to be numeric, use an integer and handle the formatting in the application. If it can be non-numeric then by all means make it varchar but, for a numeric value, I can't see enough gain in doing so.
Unique key: I don't really see the point of a unique index on race number and id. id is, by definition, already unique as the primary key. Perhaps you meant race number and place, although even that is risky if two people tie for a place.
Split names: if you're ever going to treat them as individual values, then yes; otherwise no. In other words, avoid things like WHERE fullname LIKE 'Mike %'.
For the name: if you ever want to sort on lastname while displaying it as "firstname lastname", you will need separate columns.
In general: think about what you want to do with the data. Leave formatting to the application that is displaying the data. Avoid situations where you need string manipulation or complicated maths to get at the values you need.
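For example, if the result is stored as an integer number of seconds (in a column result_seconds, a name assumed here for illustration), sorting and range queries stay trivial and MySQL's built-in SEC_TO_TIME can format it at query time:
-- 1:32:40 is stored as 5560 seconds
SELECT name, SEC_TO_TIME(result_seconds) AS result
FROM race_result
WHERE result_seconds < 2 * 60 * 60   -- finishers under two hours
ORDER BY result_seconds;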
I have to store DOIs in a MySQL database. The handbook says:
There is no limitation on the length of a DOI name.
So far, the maximum length of a DOI in my current data is 78 characters. What field length would you recommend, in order not to waste storage space while staying on the safe side? More generally:
How do you handle not knowing the maximum length of input data that has to be stored in a database, considering space and the efficiency of transactions?
EDIT
There are these two (simplified) tables document and topic with a one-to-many relationship:
CREATE TABLE document
(
  ID int(11) NOT NULL,
  DOI ??? NOT NULL,
  PRIMARY KEY (ID)
);
CREATE TABLE topic
(
  ID int(11) NOT NULL,
  DocID int(11) NOT NULL,
  Name varchar(255) NOT NULL,
  PRIMARY KEY (ID),
  FOREIGN KEY (DocID) REFERENCES document(ID)
);
I have to run the following (simplified) query for statistics, returning the total value of referenced topic-categories per document (if there are any references):
SELECT COUNT(topic.Name) AS number, document.DOI
FROM document LEFT OUTER JOIN topic
ON document.ID = topic.DocID
GROUP BY document.DOI;
The collation used is utf8_general_ci (character set utf8).
TEXT and VARCHAR can both store up to 64KB. If you're being extra paranoid, use LONGTEXT, which allows 4GB, though if DOI names are actually longer than 64KB then that is a really abusive standard. VARCHAR(65535) is probably a reasonable accommodation (note that the 65,535-byte limit applies to the whole row and is counted in bytes, so with utf8 the practical per-column maximum is about 21,844 characters).
Since VARCHAR is variable length then you really only pay for the extra storage if and when it's used. The limit is just there to cap how much data can, theoretically, be put in the field.
Space is not a problem; indexing may be a problem. Please provide the queries that will need an index on this column, and the CHARACTER SET needed. With those, we can discuss the ramifications of various cutoffs: 191, 255, 767, 3072, etc. (these come from InnoDB's index-prefix byte limits, 767 bytes for COMPACT rows and 3072 for DYNAMIC, divided by the bytes per character of the chosen character set: 767/4 ≈ 191 for utf8mb4, 767/3 ≈ 255 for utf8).
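Assuming the utf8 character set above and that DOIs must be unique and joinable, a sketch of one reasonable cutoff (the 255-character cap is an assumption driven by the index limit, not a DOI rule):
CREATE TABLE document
(
  ID int(11) NOT NULL,
  DOI varchar(255) NOT NULL,   -- 255 utf8 chars = 765 bytes, under the 767-byte index limit
  PRIMARY KEY (ID),
  UNIQUE KEY doi_idx (DOI)     -- supports lookups and the GROUP BY, and keeps DOIs distinct
) ENGINE=InnoDB DEFAULT CHARSET=utf8;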
I have a question regarding primary keys in Relational Databases. Let's assume that I have the following tables:
Box
id
box_name
BoxItems
id
item_name
belongs_to_box_id (foreign key)
Let's also assume that I intend to store millions of items per day. I would probably use a bigint or a GUID for BoxItems.id.
What I was thinking, and I need your advice on this, is that instead of a bigint id for BoxItems, I could use a sequential tinyint number, so that what identifies each item is the combination of belongs_to_box_id plus the tinyint row number (e.g. item_number).
So now instead of the above we get:
BoxItems
belongs_to_box_id
item_sequence_number [TINYINT]
item_name
Example:
Items.Insert(1,1, "my item 1");
Items.Insert(1,2, "my item 2");
So instead of using a bigint, or a GUID for that matter, I can use a tinyint and save a lot of disk space.
I want to know the pros and cons of such an approach. I am developing my app using MySQL and ASP.NET 4.5.
When you think about it, there's really not much difference between the "box/contents" problem and the "order/line item" problem.
create table boxes (
  box_id integer primary key,
  box_name varchar(35) not null
);
create table boxed_items (
  box_id integer not null references boxes (box_id),
  box_item_num tinyint not null,
  item_name varchar(35) not null,
  primary key (box_id, box_item_num)  -- the pair identifies each item
);
For MySQL, you'd probably use unsigned integer and unsigned tinyint. There's no compelling reason for a database to avoid negative numbers, but developers should lean on the Principle of Least Surprise.
Make sure 256 values are enough: an unsigned tinyint tops out at 255. Getting that wrong can be expensive to correct in a table that gains millions of rows each day.
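One practical wrinkle, sketched below: AUTO_INCREMENT numbers a whole table, not each box (MyISAM can auto-increment the second column of a composite key, but InnoDB cannot), so the per-box sequence has to be generated by the application or by the insert itself. This is a minimal illustration only; concurrent inserts into the same box would need locking or retry logic:
-- Allocate the next item number within box 1 (illustrative; not concurrency-safe)
INSERT INTO boxed_items (box_id, box_item_num, item_name)
SELECT 1, COALESCE(MAX(box_item_num), 0) + 1, 'my item 1'
FROM boxed_items
WHERE box_id = 1;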
I would recommend writing a simple test for both approaches, comparing performance, disk space and ease of implementation, and making a judgement call. Both of your suggestions are reasonable, and I doubt there will be much difference in performance, but the best way to find out is to try it; then you will know for sure.
I am currently working on a project which involves altering data stored in a MySQL database. Since the table I am working on does not have a key, I add one with the following command:
ALTER TABLE deCoupledData ADD COLUMN MY_KEY INT NOT NULL AUTO_INCREMENT KEY
Since I want to group my records according to selected fields, I try to create an index on deCoupledData consisting of MY_KEY along with the selected fields. For example, if I want to work with the fields STATED_F and NOT_STATED_F, I type:
ALTER TABLE deCoupledData ADD INDEX (MY_KEY, STATED_F, NOT_STATED_F)
The real issue is that the fields I usually work with number more than 16, and MySQL does not allow an index over more than 16 columns.
In conclusion: is there another way to do this? Can I somehow make MySQL order the records according to the desired super-key (something like clustering)? I really need to make my script faster; the main overhead is that each group may contain records which are not stored on the same disk page, and I assume my PC starts doing random I/O to retrieve them.
CREATE TABLE deCoupledData (
  AA double NOT NULL DEFAULT '0',
  STATED_F double DEFAULT NULL,
  NOT_STATED_F double DEFAULT NULL,
  MIN_VALUES varchar(128) NOT NULL DEFAULT '-1,-1',
  MY_KEY int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (MY_KEY),
  KEY AA (AA)
) ENGINE=InnoDB AUTO_INCREMENT=74358 DEFAULT CHARSET=latin1
Okay, first of all: when you add an index over multiple columns and your query does not filter on the first (leftmost) column, the index is useless.
Example: You have a query like
SELECT *
FROM deCoupledData
WHERE
stated_f = 5
AND not_stated_f = 10
and an index over (MY_KEY, STATED_F, NOT_STATED_F).
The index can only be used if you also have something like AND my_key = 1 in the WHERE clause.
Imagine you want to look up every person in a telephone book whose first name is 'John'. The knowledge that the book is sorted by last name is useless; you still have to look at every single entry.
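What would help the query above is an index whose leftmost columns are the ones in the WHERE clause, for example (the index name is illustrative):
-- Leftmost columns match the WHERE clause, so the optimizer can use this index
ALTER TABLE deCoupledData ADD INDEX idx_stated (STATED_F, NOT_STATED_F);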
Also, the primary key does not have to be a surrogate/artificial one. It's nearly always better to have a primary key made up of columns which identify each row uniquely anyway.
It's also not always good to have many indexes. Not only do indexes slow down INSERTs and UPDATEs, sometimes they just cause an extra lookup: first the index is consulted, then a second lookup finds the actual data.
That's just a few tips. Maybe Jordan's hint is not a bad idea: "You should maybe post a new question that has your actual SQL query, table layout, and performance questions".
UPDATE:
Yes, that is possible. According to the manual:
If you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index.
which means that the data is practically sorted on disk, yes.
Be aware that it's also possible to define a primary key over multiple columns, like this:
CREATE TABLE deCoupledData (
  AA double NOT NULL DEFAULT '0',
  STATED_F double NOT NULL,       -- primary key columns cannot be NULL
  NOT_STATED_F double NOT NULL,
  MIN_VALUES varchar(128) NOT NULL DEFAULT '-1,-1',
  MY_KEY int(11) NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (NOT_STATED_F, STATED_F, AA),
  KEY MY_KEY (MY_KEY),            -- an AUTO_INCREMENT column must still be indexed
  KEY AA (AA)
) ENGINE=InnoDB AUTO_INCREMENT=74358 DEFAULT CHARSET=latin1
as long as the combination of the columns is unique.
I have a database design where I store image filenames in a table called resource_file.
CREATE TABLE `resource_file` (
  `resource_file_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `resource_id` int(11) NOT NULL,
  `filename` varchar(200) NOT NULL,
  `extension` varchar(5) NOT NULL DEFAULT '',
  `display_order` tinyint(4) NOT NULL,
  `title` varchar(255) NOT NULL,
  `description` text NOT NULL,
  `canonical_name` varchar(200) NOT NULL,
  PRIMARY KEY (`resource_file_id`)
) ENGINE=InnoDB AUTO_INCREMENT=592 DEFAULT CHARSET=utf8;
These "files" are gathered under another table called resource (which is something like an album):
CREATE TABLE `resource` (
  `resource_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL,
  `description` text NOT NULL,
  PRIMARY KEY (`resource_id`)
) ENGINE=InnoDB AUTO_INCREMENT=285 DEFAULT CHARSET=utf8;
The logic behind this design comes in handy if I want to assign a certain type of "resource" (album) to a certain type of "item" (product, user, project, etc.), for example:
CREATE TABLE `resource_relation` (
  `resource_relation_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `module_code` varchar(32) NOT NULL DEFAULT '',
  `resource_id` int(11) NOT NULL,
  `data_id` int(11) NOT NULL,
  PRIMARY KEY (`resource_relation_id`)
) ENGINE=InnoDB AUTO_INCREMENT=328 DEFAULT CHARSET=utf8;
This table holds the relationship of a resource to a certain type of item like:
Product
User
Gallery
etc.
I do exactly this by giving module_code a value like "product" or "user" and assigning data_id the corresponding unique id, in this case product_id or user_id.
So at the end of the day, if I want to query the resources assigned to a product with the id 123, I query the resource_relation table (very simplified pseudo-query):
SELECT * FROM resource_relation WHERE data_id = 123 AND module_code = 'product'
And this gives me the resources, from which I can find the corresponding images.
I find this approach very practical, but I don't know if it is the correct approach to this particular problem.
What is the name of this approach?
Is it a valid design?
This one uses super-type/sub-type. Note how the primary key propagates from the super-type table into the sub-type tables.
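Since the original diagram isn't reproduced here, a minimal SQL sketch of the idea (table and column names are assumed): an item super-type owns the key, each sub-type reuses it as both primary key and foreign key, and the resource link can then point at item with a real foreign key.
CREATE TABLE `item` (
  `item_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `item_type` varchar(32) NOT NULL,    -- 'product', 'user', 'gallery', ...
  PRIMARY KEY (`item_id`)
) ENGINE=InnoDB;

CREATE TABLE `product` (
  `item_id` int(11) unsigned NOT NULL, -- same value as the super-type row
  `product_name` varchar(255) NOT NULL,
  PRIMARY KEY (`item_id`),
  FOREIGN KEY (`item_id`) REFERENCES `item` (`item_id`)
) ENGINE=InnoDB;

-- user, gallery, etc. follow the same shape; the resource link becomes enforceable:
CREATE TABLE `item_resource` (
  `item_id` int(11) unsigned NOT NULL,
  `resource_id` int(11) unsigned NOT NULL,
  PRIMARY KEY (`item_id`, `resource_id`),
  FOREIGN KEY (`item_id`) REFERENCES `item` (`item_id`),
  FOREIGN KEY (`resource_id`) REFERENCES `resource` (`resource_id`)
) ENGINE=InnoDB;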
To answer your second question first: the table resource_relation is an implementation of an Entity-attribute-value model.
So the answer to the next question is: it depends. According to relational database theory it is bad design, because we cannot enforce a foreign key relationship between data_id and, say, product_id, user_id, etc. It also obfuscates the data model, and it can make impact analysis harder.
On the other hand, lots of people find, as you do, that EAV is a practical solution to a particular problem, with one table instead of several. Although, if we're talking practicality, EAV doesn't scale well (at least in relational products; there are NoSQL products which do things differently).
From which it follows that the answer to your first question, "is it the correct approach?", is "Strictly, no". But does it matter? Perhaps not.
" I can't see a problem why this would "not" scale. Would you mind
explaining it a little bit further? "
There are two general problems with EAV.
The first is that small result sets (say DATA_ID matching a user) and big result sets (say DATA_ID matching a product) use the same query, which can lead to sub-optimal execution plans.
The second is that adding more attributes to the entity means the query needs to return more rows, whereas a relational solution would return the same number of rows with more columns. This is the major scaling cost. It also means we end up writing horrible queries like this one.
Now, in your specific case perhaps neither of these concerns are relevant. I'm just explaining the reasons why EAV can cause problems.
"How would i be supposed to assign "resources" to for example, my
product table, "the normal way"?"
The more common approach is to have a separate intersection table (AKA junction table) for each relationship, e.g. USER_RESOURCES, PRODUCT_RESOURCES, etc. Each table would consist of a composite primary key, e.g. (USER_ID, RESOURCE_ID), and probably not much else.
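A sketch of that approach against the resource table above (the user and product tables themselves aren't shown in the question, so their key columns are assumed):
CREATE TABLE `product_resources` (
  `product_id` int(11) unsigned NOT NULL,
  `resource_id` int(11) unsigned NOT NULL,
  PRIMARY KEY (`product_id`, `resource_id`),
  -- a foreign key to the product table would go here; that table isn't shown
  FOREIGN KEY (`resource_id`) REFERENCES `resource` (`resource_id`)
) ENGINE=InnoDB;

CREATE TABLE `user_resources` (
  `user_id` int(11) unsigned NOT NULL,
  `resource_id` int(11) unsigned NOT NULL,
  PRIMARY KEY (`user_id`, `resource_id`),
  FOREIGN KEY (`resource_id`) REFERENCES `resource` (`resource_id`)
) ENGINE=InnoDB;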
The other approach is to use a generic super-type table with specific sub-type tables. This is the implementation which Damir has modelled. The normal use case for super-types is when we have a bunch of related entities which have some attributes, behaviours and usages in common, plus some distinct features of their own. For instance, PERSON and USER, CUSTOMER, SUPPLIER.
Regarding your scenario, I don't think USER, PRODUCT and GALLERY fit this approach. Sure, they are all consumers of RESOURCE, but that is pretty much all they have in common. So trying to map them to an ITEM super-type is a Procrustean solution; gaining a generic ITEM_RESOURCE table is likely to be a small reward for the additional hoops you'll have to jump through elsewhere.
"I have a database design where I store images in a table called resource_file."
You're not storing images; you're storing filenames. The filename may or may not identify an image. You'll need to keep database and filesystem permissions in sync.
Your resource_file table structure says, "Image filenames are identifiable in the database, but are unidentifiable in the filesystem." It says that because resource_file_id is the primary key and there are no other unique constraints. I suspect your image files actually are identifiable in the filesystem, and you'd be better off with database constraints that match that reality; maybe a unique constraint on (filename, extension).
Same idea for the resource table.
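In SQL, those constraints might look like this (key names are illustrative; confirm these really are the natural keys first):
-- Assumes filename + extension uniquely identify a file on disk
ALTER TABLE resource_file ADD UNIQUE KEY uq_file (filename, extension);

-- Assumes an album's name identifies it; pick whatever is actually unique
ALTER TABLE resource ADD UNIQUE KEY uq_resource_name (name);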
For resource_relation, you probably need a unique constraint on either (resource_id, data_id) or (resource_id, data_id, module_code). But . . .
I'll try to give this some more thought later. It's kind of hard to figure out what you're trying to do with resource_relation, which is usually a red flag.
I often find myself wanting to store data of more than one type (usually integers and text) in the same column of a MySQL database. I know this is horrible, but it happens because I'm storing responses that people have made to questions in a questionnaire. Some questions need an integer response, some need a text response, and some might be an item selected from a list.
The approaches I've taken in the past have been:
Store everything as text and convert to int (or whatever) when needed later.
Have two columns - one for text and one for int. Then you just fill one in per row per response, and leave the other one as null.
Have two tables - one for text responses and one for integer responses.
I don't really like any of those, though, and I have a feeling there must be a much better way to deal with this kind of situation.
To make it more concrete, here's an example of the tables I might have:
CREATE TABLE question (
  id int(11) NOT NULL auto_increment,
  text VARCHAR(200) NOT NULL default '',
  PRIMARY KEY (id)
)
CREATE TABLE response (
  id int(11) NOT NULL auto_increment,
  question int(11) NOT NULL,
  user int(11) NOT NULL,
  response VARCHAR(200) NOT NULL default '',
  PRIMARY KEY (id)
)
or, if I went with using option 2 above:
CREATE TABLE response (
  id int(11) NOT NULL auto_increment,
  question int(11) NOT NULL,
  user int(11) NOT NULL,
  text_response VARCHAR(200),
  numeric_response int(11),
  PRIMARY KEY (id)
)
and if I used option 3 there'd be a responseInteger table and a responseText table.
Is any of those the right approach, or am I missing an obvious alternative?
[Option 2 is] NOT the most normalized option [as #Ray claims]. The most normalized design would have no nullable fields, and option 2 obviously requires a null on every row.
At this point in your design you have to think about usage: the queries you'll run, the reports you'll write. Will you want to do math on all of the numeric responses at the same time, i.e. WHERE numeric_response IS NOT NULL? Probably not.
More likely is something like "what's the average response WHERE question = 11?". For those cases you can choose either the INT table or the INT column, and neither is easier than the other.
If you did use two tables, you'd more than likely be constantly unioning them together for questions like "what % of questions have a response?". For example:
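A sketch using the option 3 table names from the question:
-- Two tables: counting responses to question 11 needs a UNION ALL
SELECT COUNT(*) AS responses
FROM (
  SELECT question FROM responseInteger WHERE question = 11
  UNION ALL
  SELECT question FROM responseText WHERE question = 11
) AS combined;

-- One table (option 2): the same question is a single query
SELECT COUNT(*) AS responses FROM response WHERE question = 11;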
Can you see how the questions you ask your database to answer start to drive the design?
I'd opt for Option 1. The answers are always text strings, but sometimes the text string happens to be the representation of an integer. What is less easy is determining what constraints, if any, should be placed on the answer to a given question. If some answer should only be a sequence of one or more digits, how do you validate that? Most likely, the question table should contain information about the possible answers, and that should guide the validation.
I note that the combination of question and user is (or should be) unique for a given questionnaire, so you don't really need the auto-increment column in the response table. Either way, you should have a unique constraint (or primary key constraint) on question and user, regardless of whether you keep the auto-increment column.
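In SQL, that constraint might look like this (the key name is illustrative):
-- One response per user per question
ALTER TABLE response ADD UNIQUE KEY uq_question_user (question, user);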
Option 2 is the correct, most normalized option.