I will be storing draw numbers (1-60) as fixed values in a fixed order. One draw type has 4 numbers, another has 6 numbers, and another has 2 numbers.
My idea was to put each draw type in a separate SQL table. The question is: would it be better to store the numbers in a single column, separated by a delimiter...
ID(int) | numbers(varchar)
or store each number in a separate column instead?
ID(int) | num1(tinyint) | num2(tinyint) | num3(tinyint) | num4(tinyint)
I won't be needing to search for the numbers when they're stored.
If you don't ever need to search for them or retrieve them separately, then they are just one opaque "blob" from the database's perspective, and you won't be violating atomicity or 1NF by storing them in a single field.
But just because that's the case now doesn't mean it won't change in the future. So use at least the second option. It would also allow the DBMS to enforce domain integrity and ensure these are actually numbers, not just arbitrary strings.
However, to future-proof your data, I'd go even further and use the following structure:
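A minimal sketch of that structure, assuming MySQL; the DRAW and DRAW_NUMBER names come from below, while the column names and the explicit ordinal column (to preserve the fixed order) are assumptions:

CREATE TABLE DRAW (
  draw_id INT PRIMARY KEY
  -- any other per-draw fields
);

CREATE TABLE DRAW_NUMBER (
  draw_id INT NOT NULL,
  ordinal TINYINT NOT NULL,   -- position of the number within the draw
  number  TINYINT NOT NULL,   -- the drawn number, 1-60
  PRIMARY KEY (draw_id, ordinal),
  FOREIGN KEY (draw_id) REFERENCES DRAW (draw_id)
);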
In addition to treating the numbers in a uniform way and avoiding many NULLs, it will also allow you to easily vary the maximum number of numbers if that ever becomes necessary. I suspect querying will also be easier with this structure.
BTW, if a draw has no other fields and cannot exist without at least one number, you can dispense with the DRAW table altogether and just use DRAW_NUMBER.
Separate columns (database normalization).
If you don't need to search for the numbers (i.e. find which draw has a certain number), then I would store the numbers in the same field.
CLARIFICATION
He said it himself: he's just storing data and doesn't need to do any sort of operation on it. What that data is doesn't matter. It happens to be between 2 and 6 numbers, but that's irrelevant. There is no reason to put them in separate columns unless you need to.
What I would do is to use only one table, with three columns: id, draw_type, numbers
It's much easier to work with than 3 different tables with 3 to 7 columns each.
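A sketch of that single table; the layout and the delimited numbers column are assumptions:

CREATE TABLE draw (
  id INT AUTO_INCREMENT PRIMARY KEY,
  draw_type TINYINT NOT NULL,    -- e.g. 1 = 4 numbers, 2 = 6 numbers, 3 = 2 numbers
  numbers VARCHAR(20) NOT NULL   -- e.g. '5,17,23,42'
);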
Related
I have a table in which I have a field that requires 3 letters and 3 numbers (which have to be between the values 2000 & 7000).
I've been reading around, and I'm still not sure of the better way to handle this: whether it can be a simple datatype, say char(6), or whether there have to be two fields, one containing only the 3 letters and another containing the 3 numbers, with a check constraint to ensure that the values of the latter are between 2000 & 7000.
Any help you can offer would be appreciated. Thanks in advance.
You may have to give more specificity about the requirements, but it sounds to me like a single column is the best option -- especially if order matters. If the letters and numbers have meanings separately, then they should be in two columns. Otherwise, you'll just end up having to concatenate them together.
char(6) is fine as long as you know it will always be exactly 6 characters. You can't enforce a limit as specific as 2000 to 7000 at the column level anyway (and that's 4 digits, isn't it?).
Every field should represent an attribute of the entities the table holds. In other words, if these three letters and three numbers represent different attributes, they should be in separate fields; otherwise (e.g. if together they represent a serial number) you can keep them in one field.
Another approach is to think of a possible use case, like: am I ever going to query based on the second number? If the answer is yes, they should be in separate fields; otherwise they represent one attribute and should be in one field.
Hope it helps.
If the value is "one" value, use one column, say char(6), but...
Here's a surprising fact: MySQL doesn't support CHECK constraints!
MySQL allows CHECK constraints to be defined, but they are completely ignored and are allowed only for compatibility with SQL from other databases.
If you want to enforce a format, you'll need to use a trigger, but MySQL doesn't support raising exceptions, so you'll have to use a work-around.
The best option is probably to use app code for validation.
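For example, a sketch of the classic work-around, assuming a hypothetical codes table whose code column is CHAR(6) NOT NULL (MySQL 5.5+ could use SIGNAL instead):

DELIMITER //
CREATE TRIGGER codes_format_check BEFORE INSERT ON codes
FOR EACH ROW
BEGIN
  -- Force the insert to fail by violating the NOT NULL constraint
  -- whenever the value is not exactly 3 letters followed by 3 digits.
  IF NEW.code NOT REGEXP '^[A-Za-z]{3}[0-9]{3}$' THEN
    SET NEW.code = NULL;
  END IF;
END//
DELIMITER ;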
Please help me understand which of the following is better for scaling and performance.
Table: test
columns: id <int, primary key>, doc <int>, keyword <string>
The data I want to store is a pointer to the documents containing a particular keyword.
Design 1:
have a unique constraint on the keyword column and store the list of documents as an array
e.g. id: 1, doc: [4,5,6], keyword: google
Design 2:
insert a row for each document
id | doc | keyword
1  | 4   | google
2  | 5   | google
3  | 6   | google
Let's say the average number of documents a particular keyword appears in is close to 100,000, and there may be no maximum number of documents a keyword appears in.
You can forget about option 1 because there's no array data type in MySQL.
To be honest, if you want a scalable solution for this type of data, I think you should look into a different type of database. Research NoSQL and 'key-value store' databases.
With MySQL, the best I can think of is your 2nd option, with the exception that you should create another table with a numeric ID and the unique keywords. That way, when you do your search, you first look up the ID, then filter the big table by that ID instead of by string. Numeric comparison is faster than string comparison.
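A sketch of that split, with assumed table and column names:

CREATE TABLE keyword (
  keyword_id INT AUTO_INCREMENT PRIMARY KEY,
  keyword VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE keyword_doc (
  keyword_id INT NOT NULL,
  doc INT NOT NULL,
  PRIMARY KEY (keyword_id, doc),
  FOREIGN KEY (keyword_id) REFERENCES keyword (keyword_id)
);

-- Look up the ID once, then filter the big table numerically:
SELECT d.doc
FROM keyword k
JOIN keyword_doc d ON d.keyword_id = k.keyword_id
WHERE k.keyword = 'google';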
A lot of factors come into scaling and performance so it's not usually a good idea to try to optimise unknowns early in development.
For database design I find it's usually best to go with the more correct normalised approach (your design 2) and then worry about the scaling and performance if it becomes an issue. You can then de-normalise certain areas or take other approaches depending on what issues you face.
Your design option 1 is likely to hit other issues more immediately: the inability to join the doc column with another table, plus added complexity when updating and searching it.
Design 1 is potentially limited by MySQL's row size limit.
Design 2 makes the most sense to me. What if you need to remove one of those values? You just delete a row rather than having to search through and update an array. It's also nice because it allows you to limit the size of your results if necessary (e.g., for pagination).
You might also consider creating a many-to-many relationship between this table and a keywords table instead of storing keywords as a field here.
My database stores version numbers; however, they come in two formats: major.minor.build (e.g. 8.2.0, 12.0.1) and dates (e.g. YY-MM-DD). I had thought of two solutions:
+---+---+-----+-----------+   +-----+-----+-----+-----+   +-----+--------+
|...|...|id   |versionType|   |id   |major|minor|build|   |id   |date    |
|---+---+-----+-----------|   |-----+-----+-----+-----|   |-----+--------|
|...|...|12345|0          |   |12345|0    |1    |2    |   |21432|12-04-05|
|---+---+-----+-----------|   +-----+-----+-----+-----+   +-----+--------+
|...|...|21432|1          |
+---+---+-----+-----------+
or
+---+---+-----+-----+-----+-----+--------+
|...|...|id   |major|minor|build|date    |
|---+---+-----+-----+-----+-----+--------|
|...|...|12345|0    |1    |2    |null    |
|---+---+-----+-----+-----+-----+--------|
|...|...|21432|null |null |null |12-04-05|
+---+---+-----+-----+-----+-----+--------+
Neither of these looks particularly efficient: the first requires a join across two tables just to get a version number, while the second wastes twice as much space per version entry compared to the first. Alternatively, I could just store the value as some number of bits in a column and interpret it on the client side, but I'm hoping there's a standard practice for this situation that I've overlooked.
Is there a proper way to store two different types of data for the same 'column' in a relational database?
Is your situation one where you have distinct kinds of versioned objects, one kind versioned using dates and another using version numbers? Or is it one where the same kind of object's version is referenced both by dates and by version numbers?
In the first case, don't bother creating such an artificial table; it doesn't serve any useful purpose. You need to create tables only if they solve a business problem that actually exists, and translating a version date to a version number (or vice versa) is not one that exists in this situation. And even if the need arises later on, you can still add the table then.
In the second case, define a table like the one in your second option, but:
WITHOUT all those stupid, meaningless IDs. Just keep the four columns maj/min/bld/date, and DON'T make any of them nullable. Define two keys: maj/min/bld and date. Register a row for each new build, recording the creation (/activation/whatever ...) date of the build. Use the maj/min/bld construct as the version indicator in whatever table describes the versioned objects you are managing, and whenever a request comes in where the version reference is a date, resolve it to a version number with a query on your 4-column table.
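A sketch of that table, with assumed names and types:

CREATE TABLE build_version (
  major SMALLINT NOT NULL,
  minor SMALLINT NOT NULL,
  build SMALLINT NOT NULL,
  release_date DATE NOT NULL,
  PRIMARY KEY (major, minor, build),
  UNIQUE KEY (release_date)
);

-- Resolving a date reference to a version number:
SELECT major, minor, build
FROM build_version
WHERE release_date = '2012-04-05';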
I don't think there's a silver bullet here. If you are adamant about having a single column, you could just naively slap them both into a CHAR(10), though this has its own issues (e.g. invalid dates or malformed build-number strings).
I think the key question is really what sort of queries do you envision running, and how many rows do you expect to have?
I'd let the anticipated query need drive the DB design.
The first one is better. If you feel that doing the join every time is a headache, you can create a view (with the join) and use that view rather than using the tables directly (or doing the join each time).
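For instance, a sketch of such a view; all table and column names here are assumptions:

CREATE VIEW object_version AS
SELECT o.id,
       o.versionType,
       n.major, n.minor, n.build,
       d.date
FROM object o
LEFT JOIN version_number n ON n.id = o.id
LEFT JOIN version_date   d ON d.id = o.id;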
Possibly you can store your data in one int/bigint field, but in this case you will have to convert all values:
Date: convert it to a number of days since some starting date, or possibly use a Unix timestamp.
Version: put limits on the values of build (say, max 1000) and minor (1000), then pack:
version = major*1000*1000 + minor*1000 + build
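A sketch of the packing and unpacking arithmetic in MySQL, assuming those 1000-per-component limits:

-- Pack: version 8.2.0 -> 8*1000*1000 + 2*1000 + 0 = 8002000
SELECT 8*1000*1000 + 2*1000 + 0 AS packed;

-- Unpack the three components again:
SELECT FLOOR(packed / 1000000)     AS major,
       FLOOR(packed / 1000) % 1000 AS minor,
       packed % 1000               AS build
FROM (SELECT 8002000 AS packed) AS t;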
I would go with the second option because the grain of the table is a version, no matter which form it comes in. Storing it in 3 number columns and 1 date column should only be about 16 bytes per row (4 bytes for each number and date column) on a database like Oracle. If your database supports virtual (calculated) columns, you could add one that looks something like nvl(to_char(date), major||'.'||minor||'.'||build) and always select from it to get the version formatted as a varchar, or you can put a view on it.
It really depends on how you want to use this version number in queries.
If it's enough to treat version numbers as labels, then you should consider storing it in a varchar column.
If that is not enough and you want to be able to sort on version number, things get more complex, because it would then be more convenient to store the version number in a data type that preserves natural sort order. I'd probably go with your solution 2. Is there a reason to worry about efficiency? Are you expecting millions of rows? If not, you probably shouldn't worry too much.
If space, speed, and volume are definite considerations, you could consider storing the actual version number as a text label, and deriving a surrogate version number (a single integer) that you store in a separate column which you use for sorting purposes.
If, on the other hand, it turns out some objects have a version date, some a version number, and some both, I would still consider your second solution. If you really want an uber-normalized model, you could create separate version_date and version_number tables with a 1:1 relationship to the object table. And you would still need those joins. Again, though, there is no reason to worry about that: databases are good at joins; it's what they were made to do well.
I have a membership database that I am looking to rebuild. Every member has 1 row in a main members table. From there, I will use a JOIN to reference information from other tables. My question is: which of the following would be better for performance?
1 data table that specifies a data type and then the data. Example:
data_id | member_id | data_type | data
1       | 1         | email     | test@domain.com
2       | 1         | phone     | 1234567890
3       | 2         | email     | test@domain2.com
Or
Would it be better to make a table of all the email addresses, then a table of all the phone numbers, etc., and use a select statement with multiple joins?
Keep in mind, this database will start with over 75,000 rows in the members table and will include phone, email, fax, first and last name, company name, and address (city, state, zip). Each member will have at least one of each of those but can have multiple (normally 1-3 per member), so well in excess of 75,000 phone numbers, email addresses, etc.
So basically: join 1 table of in excess of 750,000 rows, or join 7-10 tables each of in excess of 75,000 rows?
Edit: performance of this database becomes an issue when we insert sales data that needs to be matched to existing data: we take a CSV file of 10k rows of sales and contact data and query the database to find which member each sales row belongs to. Oh yeah, and this is done on a web server, not a local machine (not my choice).
The obvious way to structure this would be to have one table with one column for each data item (email, phone, etc) you need to keep track of. If a particular data item can occur more than once per member, then it depends on the exact nature of the relationship between that item and the member: if the item can naturally occur a variable number of times, it would make sense to put these in a separate table with a foreign key to the member table. But if the data item can occur multiple times in a limited, fixed set of roles (say, home phone number and mobile phone number) then it makes more sense to make a distinct column in the member table for each of them.
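To make the two cases concrete, a sketch with assumed names:

-- Variable repetition: a separate table with a foreign key
CREATE TABLE member_phone (
  member_id INT NOT NULL,
  phone VARCHAR(20) NOT NULL,
  FOREIGN KEY (member_id) REFERENCES member (id)
);

-- Fixed roles: distinct columns on the member table itself
ALTER TABLE member
  ADD COLUMN home_phone   VARCHAR(20),
  ADD COLUMN mobile_phone VARCHAR(20);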
If you run into performance problems with this design (personally, I don't think 75,000 rows is much; it should not cause problems if you have indexes that properly support your queries), then you can partition the data. MySQL supports native partitioning (http://dev.mysql.com/doc/refman/5.1/en/partitioning.html), which essentially distributes collections of rows over separate physical compartments (the partitions) while maintaining one logical compartment (the table). The obvious advantage is that you keep querying one logical table and do not need to manually gather up the data from several places.
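A standalone sketch of native partitioning (schema simplified; the table name and partition count are arbitrary):

-- Hash-partition rows over 8 physical partitions by member id;
-- queries filtering on member_id only touch the matching partition.
CREATE TABLE member_contact (
  member_id INT NOT NULL,
  email VARCHAR(255),
  phone VARCHAR(20)
)
PARTITION BY HASH(member_id) PARTITIONS 8;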
If you still don't think this is an option, you could consider vertical partitioning: making groups of columns, or even single columns, into their own table. This makes sense if some queries always need one particular set of columns and other queries tend to use another set. Only then would vertical partitioning pay off, because the join itself costs performance.
(If you're really getting into the billions, you could consider sharding: using separate database servers, each keeping a partition of the rows. This makes sense only if you can either quickly limit the number of shards you need to query to find a particular member row, or efficiently query all shards in parallel. Personally, it doesn't seem to me that you are going to need this.)
I would strongly recommend against making a single "data" table. This essentially spreads out each thing that would naturally be a column into rows. It requires a whole bunch of joins and complicates what would otherwise be a pretty straightforward query. Not only that, it also makes it virtually impossible to create proper, efficient indexes over your data. And on top of that, it makes it very hard to apply constraints to your data (things like enforcing the data type and length of data items according to their type).
There are a few corner cases where such a design could make sense, but improving performance is not one of them. (See: the entity-attribute-value antipattern, http://karwin.blogspot.com/2009/05/eav-fail.html)
You should research scaling out vs. scaling up when it comes to databases. In addition to the aforementioned research, I would recommend that you use one table in your case if you are not expecting a great deal of data. If you are, then look up dimensions in database design.
75k rows is really nothing for a DB. You might not even notice the benefit of indexes with that many (index anyway :)).
The point is that, though you should be aware of "scale-out" systems, most DBs, MySQL included, can address this through partitioning, allowing your data access code to stay truly declarative rather than programmatic about which object you're addressing/querying. It is important to distinguish sharding from partitioning, but honestly those are conversations for when your record counts approach 9+ digits, not 5+.
Use neither
Although a variant of the first option is the right approach.
Create a 'lookup' table that will store the data-type values (mail, phone, etc...). Then use the ID from your lookup table in your 'data' table.
That way you actually have 3 tables instead of two.
It's best practice for a classic many-to-many relationship such as this.
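A sketch of those tables, with assumed names:

CREATE TABLE data_type (
  type_id TINYINT PRIMARY KEY,
  type_name VARCHAR(20) NOT NULL UNIQUE   -- 'email', 'phone', 'fax', ...
);

CREATE TABLE member_data (
  data_id INT AUTO_INCREMENT PRIMARY KEY,
  member_id INT NOT NULL,
  type_id TINYINT NOT NULL,
  data VARCHAR(255) NOT NULL,
  FOREIGN KEY (type_id) REFERENCES data_type (type_id)
);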
I have content identified by an ID, and I can assign multiple types to a piece of content.
The question is: should I use multiple rows to store the multiple types, or use a single type field with the types separated by commas and parse them in PHP?
Multiple Rows
`content_id` | `type`
1            | 1
1            | 2
1            | 3
VS
Single Row
`content_id` | `type`
1            | 1,2,3
EDIT
I'm looking for the faster option, not the easier one; please consider this. Performance is really important to me: I'm talking about a really huge database with millions or tens of millions of rows.
I'd generally always recommend the "multiple rows" approach as it has several advantages:
You can use SQL to return for example WHERE type=3 without any great difficulty as you don't have to use WHERE type LIKE '%3%', which is less efficient
If you ever need to store additional data against each content_id and type pair, you'll find it a lot easier in the multiple row version
You'll be able to apply one, or more, indexes to your table when it's stored in the "multiple row" format to improve the speed at which data is retrieved
It's easier to write a query to add/remove content_id and type pairs when each pair is stored separately than when they are stored as a comma-separated list
It'll (nearly) always be quicker to let SQL process the data to give you a subset than to pass it to PHP, or anything else, for processing
In general, let SQL do what it does best, which is allow you to store the data, and obtain subsets of the data.
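To illustrate the first point, assuming a content_type table with content_id and type columns (MySQL's FIND_IN_SET stands in here for the LIKE pattern; either way, the comma-separated variant cannot use an index):

-- Multiple rows: a simple lookup that can use an index on (type, content_id)
SELECT content_id FROM content_type WHERE type = 3;

-- Comma-separated list: every row must be scanned and parsed
SELECT content_id FROM content WHERE FIND_IN_SET('3', type) > 0;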
I always use multiple rows. If you use single rows your data is hard to read and you have to split it up once you grab it from the database.
Use multiple rows. That way, you can index that type column later, and search it faster if you need to in the future. Also it removes a dependency on your front-end language to do parsing on query results.
Normalised vs de-normalised design.
Usually I would recommend sticking to the "multiple rows" style (normalised),
although sometimes (for performance/storage reasons) people deliberately implement the "single row" style (denormalised).
Have a look here:
http://www.databasedesign-resource.com/denormalization.html
The single row could be better in a few cases; reporting, which tends to be easier with some denormalization, is the main example. So if your code is cleaner or performs better with the single row, go for that. Otherwise, multiple rows would be the way to go.
Never, ever, ever cram multiple logical fields into a single field with comma separators.
The right way is to create multiple rows.
If there's some performance reason that demands you use a single row, at least make multiple fields in the row. But that said, there is almost never a good performance reason to do this. First make a good design.
Do you ever want to know all the records with, say, type=2? With multiple rows, this is easy: "select content_id from mytable where type=2". With the crammed field, you would have to say "select content_id from mytable where type like '%2%'". Oh, except what happens if there are more than 11 types? The above query would find "12". Okay, you could say "where type like '%,2,%'", except that doesn't work if 2 is first or last in the list. Even if you came up with a way to do it reliably, a LIKE search with a leading % means a sequential read of every record in the table, which is very slow.
How big will you make the crammed field? What if the string of types is too big to fit in your maximum?
Do you carry any data about the types? If you create a second table keyed on "type" with, say, a description of each type, how will you join to it? With multiple rows, you could simply write "select content_id, type_id, description from content join type using (type_id)". With a crammed field ... not so easy.
If you add a new type, how do you keep the field consistent? Suppose it used to say "3,7,9" and now you add "5". Can you say "3,7,9,5", or do they have to be in order? If they're not in order, it's impossible to check for equality, because "1,2" and "2,1" won't look equal even though they're equivalent. Either way, updating a type field now becomes a program rather than a single SQL statement.
Even if there is some trivial performance gain, it's just not worth it.