MySQL normalizing data not always present - mysql

I'm storing the information for Magic: The Gathering cards in a database. My question is how to store information that is not always present.
For example, creatures have a power/toughness, while other cards don't. Should I store it for every card and leave it empty for the ones that don't have it? Or create another table to store this information, like:
cardID power resist
How would this affect query times?

From a normalization point of view, power and resistance are functionally dependent on the cardID, and should thus be in the same table as all the other attributes that depend on the cardID.
If I remember correctly how Magic works, you would have one big table with a column for each feature of a card: power, resistance, cost (which I would model as a VARCHAR), picture.
Then, since a 1-N (or M-N, I don't remember) relationship exists between cards and effects, you will have an extra table for the effects, which uses the cardID as a foreign key (or a junction table, if it's M-N).
Anyway, I count only 4 fields plus the ID for the cards, which is nothing to worry about; if anything, it's quite few. Even if you wanted to model the cost as multiple INTs, the number of elements is limited (5-6?). A table with 10, or even 20, columns is not a problem for a DBMS.
You may have a harder time with the effects (if you're planning to implement them).
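A minimal sketch of that layout, assuming a 1-N relationship between cards and effects. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL; the table names, column names and sample rows are illustrative assumptions (I used "toughness" where the answer says "resistance").

```python
import sqlite3

# In-memory SQLite stands in for MySQL; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE card (
    card_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    cost      VARCHAR(20),   -- cost kept as a string, e.g. '1G'
    power     INTEGER,       -- NULL for cards that are not creatures
    toughness INTEGER        -- NULL for cards that are not creatures
);
CREATE TABLE effect (
    effect_id   INTEGER PRIMARY KEY,
    card_id     INTEGER NOT NULL REFERENCES card(card_id),
    description TEXT NOT NULL
);
""")

conn.executemany(
    "INSERT INTO card (card_id, name, cost, power, toughness) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Grizzly Bears", "1G", 2, 2),      # creature: both values present
        (2, "Giant Growth", "G", None, None),  # not a creature: both NULL
    ],
)
conn.execute(
    "INSERT INTO effect (card_id, description) VALUES (?, ?)",
    (2, "Target creature gets +3/+3 until end of turn"),
)

# Cards with power/toughness are simply the rows where those columns are set.
creatures = conn.execute(
    "SELECT name, power, toughness FROM card WHERE power IS NOT NULL ORDER BY card_id"
).fetchall()
print(creatures)
```

No join is needed to read power/toughness, which is the main practical difference from the separate-table variant in the question.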

Related

Should I split tables based on activity?

I'm working on a hobby project that is an online game. The game stores player data in one big flat file. The data contains all the information about the player, from the name down to the items the player carries. That is a rather large number of columns by itself, and having dozens of items only increases the flat file size further.
To give you a visual. My current player file is 192 columns (not accounting for items).
Player Data
There are 51 columns in my flat files for player data after I reduced the fluff. This does not include the items or the abilities for the players. I've already decided those can be separated into their own tables and linked with a FK.
The 51 columns of data are unique to the player and should not be duplicated. From what I've been told, they are not good candidates for normalization.
Table
id
name
password
race
sex
class
level
gold
silver
experience
quest
armor
strength
wisdom
dexterity
etc
Activity
However, how often these columns are selected and updated differs vastly from one column to another. Some are updated whenever the player moves; others are rarely used except when the player logs into the game and is loaded into memory. Records are never dropped or rebuilt. Every column has a value. The frequency of activity ranges anywhere from every second to once a month.
Question
That leads me to a question. Instead of the traditional way of normalizing data, can I split these columns up based on activity and get better performance than if they were all in the same table? Or should I leave them in the same table altogether and just rely on proper indexing? Most of the columns look good to go, but as I said, there is a vast difference in how often some are used compared with others. This sort of scares me.
What you're describing is called denormalization, and it is a well-known and frequently discussed topic.
There are no general rules as to when to denormalize.
This depends on so many things specific to each project (the hardware, the type of DB, and the "activity" you mention, to name a few) that it comes down to profiling each application to reach a conclusion.
Also, sometimes denormalization means splitting a table into two tables with a one-to-one relationship (like in your case). Sometimes it means getting rid of FKs and putting everything in a BIG table with many columns to avoid the joins when selecting.
Most importantly, keep in mind that your question is as much about scalability as it is about performance. Separating the data into different tables/databases means you could eventually store it on different machines, each with a hardware architecture and a database that fit the use case.
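As a rough sketch of the one-to-one split described above, assuming the player columns are divided into a rarely changing part and a frequently updated part. It uses Python's sqlite3 with an in-memory database in place of MySQL; the table names player_profile and player_state, and which column goes where, are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player_profile (        -- rarely changes, read at login
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    race  TEXT NOT NULL,
    class TEXT NOT NULL
);
CREATE TABLE player_state (          -- updated constantly during play
    player_id  INTEGER PRIMARY KEY REFERENCES player_profile(id),
    level      INTEGER NOT NULL,
    gold       INTEGER NOT NULL,
    experience INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO player_profile VALUES (1, 'Aria', 'elf', 'ranger')")
conn.execute("INSERT INTO player_state VALUES (1, 3, 120, 4500)")

# A frequent write touches only the narrow "hot" table.
conn.execute("UPDATE player_state SET gold = gold + 10 WHERE player_id = 1")

# Loading the whole player reassembles both halves with a one-to-one join.
row = conn.execute("""
    SELECT p.name, p.race, p.class, s.level, s.gold, s.experience
    FROM player_profile AS p
    JOIN player_state AS s ON s.player_id = p.id
    WHERE p.id = 1
""").fetchone()
print(row)
```

Whether this actually beats a single indexed table is exactly the kind of thing that has to be profiled on the real workload.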
Example of denormalization in the gaming industry
One example of denormalization I can think of when it comes to MMORPGs is to store all the infrequently changed user data in a BLOB. Not only is this denormalizing, but the whole row is stored as a series of bytes. Dr. E.F. Codd wouldn't be happy at all.
One company that does this is Playfish.
This means that you have faster selects at the cost of slower updates and, most importantly, changing the schema for the user becomes a real hassle (but the reasoning here is it will always be Username, Password, E-mail until the end of time). This also means that your user data can now be stored in a simpler key/value store instead of an RDBMS with more overhead. Of course, the login server fetching user information won't need to be as performant as the one handling the gameplay.
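To make the trade-off concrete, here is a small sketch of the BLOB idea, assuming the user record is serialized as JSON. This is not how any particular company does it; the user_blob table and the save_user/load_user helpers are hypothetical, and SQLite (via Python's sqlite3) stands in for whatever store would really be used.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_blob (user_id INTEGER PRIMARY KEY, data BLOB NOT NULL)")

def save_user(user_id, fields):
    """Serialize the whole record and store it as one opaque value."""
    payload = json.dumps(fields).encode("utf-8")
    # INSERT OR REPLACE is SQLite syntax; MySQL has REPLACE INTO for the same idea.
    conn.execute(
        "INSERT OR REPLACE INTO user_blob (user_id, data) VALUES (?, ?)",
        (user_id, payload),
    )

def load_user(user_id):
    """A select is a single primary-key lookup, with no joins."""
    row = conn.execute(
        "SELECT data FROM user_blob WHERE user_id = ?", (user_id,)
    ).fetchone()
    return json.loads(row[0].decode("utf-8")) if row else None

save_user(1, {"username": "aria", "email": "aria@example.com"})

# Changing one field means reading, modifying and rewriting the whole blob.
user = load_user(1)
user["email"] = "new@example.com"
save_user(1, user)

loaded = load_user(1)
print(loaded)
```

The database can no longer filter or index on the fields inside the blob, which is part of why schema changes become a hassle.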
So try reading about use cases for denormalization (this is a very active topic) and see where you can apply your findings to your case. Also, keep in mind that premature optimization can sometimes be counterproductive; maybe you should focus on developing your game for now. When you have scaling/performance problems, you will most probably have the funding that comes with a high number of users to address them. Good luck!

MySQL design regarding a web

I am tackling a class problem: designing a MySQL representation of a web that stores a list of events associated with a person. So, the table(s) would have 2 columns: one is the person's name and the other is the event. However, a person will generally have anywhere from 30 to 1000 events, so this table, which we plan to use for our entire undergraduate class of 6000 students, will have millions of entries. Is there a better way to store this in MySQL that takes less space but can still retrieve individual events and the list of people who attended them just as easily as a two-column table?
Yes, there is a technique called many-to-many, which essentially breaks your one table into three. That fits, because there are indeed exactly three entities being modeled (a good sanity check):
Person
Event
A Person's association with an Event
You model this as three tables, with the first two having essentially two columns each: one holding a unique identifier (the "primary key"), and the second being a semantic name (person name, event name). Note that you can also add any number of columns to these, and the extra storage is paid only once per person or event rather than once per attendance (most likely your first move will be to add a date column to the event table).
The third table is the interesting one: it contains only 2 columns, each numeric, both of which are references to the other tables (each row is simply: (person_id, event_id)). We term these "foreign keys".
This structure means a few things:
No matter how many events someone goes to, that someone is only represented once.
The same goes for events, no matter how many attendees they have.
The attendance is a "first-class" entity, and can grow to include its own attributes (e.g. "role").
This structure is called many-to-many because each person may attend many events, and each event may have many attendees.
The quintessential feature of the design is that no single piece of domain knowledge is repeated; only "keys" are repeated as necessary to model the real-world domain. (In your first example, by contrast, accounting for a name change would require an unknown number of updates and might lead to data anomalies, the avoidance of which is a primary concern of database normalization.)
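The three-table structure described above can be sketched like this. It is only an illustration under stated assumptions: Python's sqlite3 with an in-memory database is used in place of MySQL, and the table names, helper functions and sample rows are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE event  (event_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE attendance (
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    event_id  INTEGER NOT NULL REFERENCES event(event_id),
    PRIMARY KEY (person_id, event_id)   -- one row per (person, event) pair
);
""")
conn.executemany("INSERT INTO person VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO event VALUES (?, ?)", [(10, "Orientation"), (20, "Career Fair")])
conn.executemany("INSERT INTO attendance VALUES (?, ?)", [(1, 10), (1, 20), (2, 10)])

def events_for(person_id):
    """Names of all events one person attended."""
    rows = conn.execute("""
        SELECT e.name FROM attendance AS a
        JOIN event AS e ON e.event_id = a.event_id
        WHERE a.person_id = ? ORDER BY e.name
    """, (person_id,))
    return [name for (name,) in rows]

def attendees_of(event_id):
    """Names of all people who attended one event."""
    rows = conn.execute("""
        SELECT p.name FROM attendance AS a
        JOIN person AS p ON p.person_id = a.person_id
        WHERE a.event_id = ? ORDER BY p.name
    """, (event_id,))
    return [name for (name,) in rows]

alice_events = events_for(1)
before_rename = attendees_of(10)

# A name change is a single-row update, however many events were attended.
conn.execute("UPDATE person SET name = 'Alice Smith' WHERE person_id = 1")
after_rename = attendees_of(10)
print(alice_events, before_rename, after_rename)
```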
Don't worry about "space". This isn't the 1970s and we're not going to run out of columns on punch cards to store data. You should be concerned with expressing your requirements in the proper, most normalized data structure. With proper indexing there shouldn't be a problem, not with this volume of data.
Remember indexes need to be defined on anything you will include as part of a WHERE clause, and sometimes you may need to add additional indexes for large lists fetched with ORDER BY and LIMIT.
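One way to check whether a WHERE clause is actually served by an index is to look at the query plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 as a stand-in (MySQL has its own EXPLAIN with a different output format); the attendance table and the index name idx_attendance_event are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE attendance (
    person_id INTEGER NOT NULL,
    event_id  INTEGER NOT NULL,
    PRIMARY KEY (person_id, event_id)
)""")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT person_id FROM attendance WHERE event_id = 10"

# The primary key starts with person_id, so it cannot narrow a lookup by event_id.
plan_before = plan(query)

# An index that leads with event_id serves the "who attended this event?" query.
conn.execute("CREATE INDEX idx_attendance_event ON attendance (event_id, person_id)")
plan_after = plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, so it is better to look for the index name than to match the whole string.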
Whenever possible or practical use an integer identifier instead of a string. These are stored as a small number of bytes, typically 4, compared with a variable length string which is typically at least the length of the string in bytes plus 1.
A properly normalized database will use numerical identifiers for things anyway, so this kind of thing isn't a huge concern. The only time you go against this, or deliberately de-normalize your data, is when you have a legitimate performance problem that cannot be easily solved using some other method.
As always, test your schema by generating large amounts of dummy data and see how it performs. Since you have a good idea of the requirements in advance, do some testing at those levels, and then, to be on the safe side, try 2x, 5x and 10x the data to see how much flexibility your design has. It's okay to have performance limitations so long as you know at what kind of scale you'll experience them.
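A toy version of that kind of test might look like the following. It is only a sketch: it runs against in-memory SQLite through Python's sqlite3 rather than MySQL, the base size of 200 people with 30 events each is an arbitrary assumption kept small so it runs quickly, and the build and time_lookup helpers are made up for the example.

```python
import random
import sqlite3
import time

def build(n_people, events_per_person, n_events=500, seed=0):
    """Create an attendance table filled with random dummy data."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE attendance (
            person_id INTEGER NOT NULL,
            event_id  INTEGER NOT NULL,
            PRIMARY KEY (person_id, event_id)
        )""")
    conn.execute("CREATE INDEX idx_event ON attendance (event_id, person_id)")
    rows = [
        (person, event)
        for person in range(n_people)
        # sample() picks distinct events, so the primary key is never violated
        for event in rng.sample(range(n_events), events_per_person)
    ]
    conn.executemany("INSERT INTO attendance VALUES (?, ?)", rows)
    return conn

def time_lookup(conn, event_id=42):
    """Time one 'who attended this event?' query."""
    start = time.perf_counter()
    count = conn.execute(
        "SELECT COUNT(*) FROM attendance WHERE event_id = ?", (event_id,)
    ).fetchone()[0]
    return count, time.perf_counter() - start

results = {}
for factor in (1, 2, 5, 10):
    conn = build(n_people=200 * factor, events_per_person=30)
    total = conn.execute("SELECT COUNT(*) FROM attendance").fetchone()[0]
    _, elapsed = time_lookup(conn)
    results[factor] = (total, elapsed)
    conn.close()
    print(f"{factor:>2}x: {total} rows, lookup took {elapsed * 1000:.3f} ms")
```

For a real test you would run this against the actual MySQL schema with realistic sizes and repeat each measurement several times.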
Relational databases such as MySQL were designed specifically to handle this sort of problem. Handling millions of entries is not a problem. Complex queries may take a couple of seconds, but overall it will perform remarkably well.
The best design is to store one event per row. The approach you describe sounds like the right one. Good luck.

Suggested database design for columns that are usually empty

I have a table with four fields that are usually filled in:
`animal`
- id
- type
- name
- weight
- location
Three additional fields are filled in if the animal type = 'person'. This happens about 5% of the time. The additional table would be:
`person_additional`
- animal_id (FK)
- IQ
- native_language
- handedness
Is the suggested practice in db design to store this in two tables or one table? It almost makes no difference to me, but I was curious about best practices and why one would be preferable over the other.
Two tables is probably the right approach, but I might suggest a different second table. I would define it as:
`animal_additional`
- animal_id (FK)
- Trait (this would enumerate allowable traits)
- value
This would give you more flexibility in having different traits for different types, or even different traits for the same type.
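Here is one way the trait table could look, as a sketch only. It assumes the allowable traits are enumerated in a small lookup table enforced by a foreign key (a MySQL ENUM column would be another option), and it uses Python's sqlite3 with an in-memory database in place of MySQL. The trait table and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE animal (id INTEGER PRIMARY KEY, type TEXT NOT NULL, name TEXT NOT NULL);
CREATE TABLE trait  (name TEXT PRIMARY KEY);          -- the allowable traits
CREATE TABLE animal_additional (
    animal_id INTEGER NOT NULL REFERENCES animal(id),
    trait     TEXT    NOT NULL REFERENCES trait(name),
    value     TEXT    NOT NULL,
    PRIMARY KEY (animal_id, trait)
);
INSERT INTO trait VALUES ('IQ'), ('native_language'), ('handedness');
INSERT INTO animal VALUES (1, 'person', 'Ann'), (2, 'dog', 'Rex');
INSERT INTO animal_additional VALUES
    (1, 'IQ', '120'), (1, 'handedness', 'left'), (2, 'handedness', 'right');
""")

# Different types can carry different traits without any schema change.
ann_traits = dict(conn.execute(
    "SELECT trait, value FROM animal_additional WHERE animal_id = 1"
))

# A trait that is not in the lookup table is rejected by the foreign key.
try:
    conn.execute("INSERT INTO animal_additional VALUES (2, 'wingspan', '30')")
    unknown_trait_rejected = False
except sqlite3.IntegrityError:
    unknown_trait_rejected = True

print(ann_traits, unknown_trait_rejected)
```

The price of the flexibility is that every value is stored as text and has to be converted by the application.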
If you were to store them in the same table, then that would effectively be a multivalued dependency; a violation of 4th Normal Form, so from a purist point of view, separate tables is better.
Also, what happens if another kind of animal is added that requires different kinds of supplementary fields - if all your data were in one table, then eventually, you'd have a bunch of different fields for different purposes.
From a practical point of view, it depends on how the data is used, etc;
From a pedantic point of view, other animals have handedness :)
Normalization issues aside, animal and person are an instance of the pattern called generalization specialization, or gen-spec for short. The design of relational tables for cases of gen-spec has been covered in other questions. Do a search on "class table hierarchy" in SO.
Example: Table design and class hierarchies
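For reference, the two-table layout from the question can be sketched as a gen-spec pair in which the specialized table reuses the general table's key as its own primary key. This is a minimal illustration with Python's sqlite3 in place of MySQL; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE animal (
    id       INTEGER PRIMARY KEY,
    type     TEXT NOT NULL,
    name     TEXT NOT NULL,
    weight   REAL,
    location TEXT
);
CREATE TABLE person_additional (
    -- same key as animal: at most one extra row per animal
    animal_id       INTEGER PRIMARY KEY REFERENCES animal(id),
    IQ              INTEGER,
    native_language TEXT,
    handedness      TEXT
);
INSERT INTO animal VALUES (1, 'person', 'Ann', 62.0, 'Oslo'), (2, 'dog', 'Rex', 30.5, 'Oslo');
INSERT INTO person_additional VALUES (1, 120, 'Norwegian', 'left');
""")

# LEFT JOIN keeps every animal; the extra columns come back as NULL for non-persons.
rows = conn.execute("""
    SELECT a.name, a.type, p.IQ, p.handedness
    FROM animal AS a
    LEFT JOIN person_additional AS p ON p.animal_id = a.id
    ORDER BY a.id
""").fetchall()
print(rows)
```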
One additional good reason to split this into 2 tables is that, with everything in one table, the space required to store one row increases unnecessarily: most of the time those columns will be empty, but the database still has to allocate a certain number of bytes for them in every row.
Splitting into 2 tables makes more efficient use of hard drive space.

How many columns in table to keep? - MySQL

I am stuck between a row-based and a column-based table design for storing some items. The decision comes down to which table is easier to manage and, if I go with columns, how many columns are best to have. For example, I have object metadata: ideally there are 45 pieces of information (after being normalized) on the same level that I need to store per object. So is 45 columns in a heavy read/write table good? Can it work flawlessly in a real-world situation of heavy concurrent reads/writes?
If all or most of your columns are filled with data and this number is fixed, then just use 45 fields. There's nothing inherently bad about 45 columns.
If all of these conditions are met:
There may be attributes that are neither known nor predictable at design time
The attributes are only occasionally filled (say, 10 or less per entity)
There are many possible attributes (hundreds or more)
No attribute is filled for most entities
then you have a so-called sparse matrix. This (and only this) model can be better represented with an EAV table.
MySQL's documentation states "There is a hard limit of 4096 columns per table", so 45 columns should be just fine.
Taking the "easier to manage" part of the question:
If the property names you are collecting do not change, then columns are just fine. Even if the table is sparsely populated, disk space is cheap.
However, if you have up to 45 properties per item (row) but those properties might be radically different from one element to another, then using rows is better.
Take a product catalog, for example. One product might have color, weight, and height. Another might have a number of buttons or handles. These are obviously radically different properties. Further, this type of data suggests that new properties will be added that might only be related to a particular set of products. In this case, rows are much better.
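A small sketch of the row-based variant for the product catalog case, using Python's sqlite3 with an in-memory database in place of MySQL. The product_property table, the properties helper and the sample products are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product_property (
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    property   TEXT NOT NULL,
    value      TEXT NOT NULL,
    PRIMARY KEY (product_id, property)   -- one row per property per product
);
INSERT INTO product VALUES (1, 'Mug'), (2, 'Remote control');
INSERT INTO product_property VALUES
    (1, 'color', 'blue'), (1, 'weight_g', '350'), (1, 'height_cm', '10'),
    (2, 'buttons', '24');
""")

def properties(product_id):
    """All properties of one product as a dict; the application does the assembly."""
    return dict(conn.execute(
        "SELECT property, value FROM product_property WHERE product_id = ?",
        (product_id,),
    ))

mug = properties(1)
remote = properties(2)

# Turning rows back into columns needs conditional aggregation in the query.
pivot = conn.execute("""
    SELECT p.name,
           MAX(CASE WHEN pp.property = 'color'   THEN pp.value END) AS color,
           MAX(CASE WHEN pp.property = 'buttons' THEN pp.value END) AS buttons
    FROM product AS p
    LEFT JOIN product_property AS pp ON pp.product_id = p.product_id
    GROUP BY p.product_id, p.name
    ORDER BY p.product_id
""").fetchall()
print(mug, remote, pivot)
```

A new property is just a new row, but queries that want one column per property get more verbose, as the pivot shows.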
Another option is to go NoSQL and use a document-based database server. This would allow you to set the named "columns" on a per-item basis.
All of that said, management of rows will be done by the application. This will require some advanced DB skills. Management of columns will be done by the developer at design time; which is usually easier for most people to get their minds around.
I don't know if I'm correct, but I once read that in MySQL you should keep your tables to a minimum number of columns IF POSSIBLE (read: http://dev.mysql.com/doc/refman/5.0/en/data-size.html ). NOTE: this is if you are using MySQL; I don't know if the same concept applies to other DBMSs like Oracle, Firebird, PostgreSQL, etc.
You could take a look at your table with 45 columns, analyze what you truly need, and move the optional fields into another table.
Hope it helps, good luck

Database Tables, more the better?

Lately I've been rethinking a database design I made a couple of months ago. The main reason is that last night I read the database schema of vBulletin and saw that they use many, MANY, tables.
The current "idea" I'm using for my schema, for instance my log table, is to keep everything in one table and distinguish the type of log with an integer:
id, type, type_id, action, message
1,  1,    305,     2,      'Explanation for user Ban'
2,  2,    1045,    1,      'Reason for deletion of Article'
Where type 1 = user, type 2 = article, type_id = the ID of the user, article or whatever, and action 2 = ban, action 1 = deletion.
Should I change the design to separate tables (logBans, logSomething and so on), or is it better to keep the method I'm currently using?
The issue here is subtyping. There are three basic approaches to dealing with subtypes:
(1) Put each record type into a completely separate table;
(2) Put a record in a parent table and then a record in a subtype table; and
(3) Put all the records in one table, having nullable columns for the "optional" data (i.e. things that don't apply to that type).
Each strategy has its merits.
For example, (3) is particularly applicable if there is little to no difference between different subtypes. In your case, do different log records have extra columns if they're of a particular type? If they don't, or there are few cases when they do, putting them all in one table makes perfect sense.
(2) is commonly used for a Party table. This is a common model in CRMs that involves a parent Party object which has subtypes for Person and Organization (Organization may also have subtypes like Company, Association, etc). Person and Organization have different properties (e.g. salutation, given names, date of birth, etc for Person), so it makes sense to split this up rather than using nullable columns.
(2) is potentially more space efficient (although the overhead of NULL columns in modern DBMSs is very low). The bigger issue is that (3) might be more confusing to developers. You will get a situation where someone needs to store an extra field somewhere and will whack it in a column that's empty for that type simply because it's easier than getting approval from the DBAs to add a column (no, I'm not kidding).
(1) is probably the least frequently used scheme of the 3 in my experience.
Lastly, scalability has to be considered and is probably the best case for (1). At a certain point JOINs don't scale effectively and you'll need to use some kind of partitioning scheme to cut down your table sizes. (1) is one method of doing that (but a crude method).
I wouldn't worry too much about that though. You'll typically need to get to hundreds of millions or billions of records before that becomes an issue (unless your records are really really large, in which case it'll happen sooner).
It depends. If you're going to have 1500000000 entries of type 1 and 1000 entries of type 2 and you'll be doing a LOT of queries on type 2, separate the tables. If not, it's more convenient to keep only one table.
Keep in mind scalability:
How many entries of each type will I have in 1 year?
How many requests on this table will I be doing?
Can you, at some point, clear this log? Can you move it to another table (like archive entries older than X months)?
The one drawback I see right now is that you cannot enforce foreign key integrity on your type_id since it points to many different tables.
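The foreign key drawback can be demonstrated with a short sketch. It uses Python's sqlite3 with an in-memory database in place of MySQL; the users table and the per-type log_user table are assumptions, and the values 305 and action 2 = ban are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Single log table: type_id can point at a user, an article, ...
-- so no foreign key can be declared on it.
CREATE TABLE log (
    id      INTEGER PRIMARY KEY,
    type    INTEGER NOT NULL,
    type_id INTEGER NOT NULL,
    action  INTEGER NOT NULL,
    message TEXT
);

-- Per-type log table: the reference can be enforced.
CREATE TABLE log_user (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    action  INTEGER NOT NULL,
    message TEXT
);
INSERT INTO users VALUES (305, 'someone');
""")

# The generic table happily accepts a ban for a user that does not exist.
conn.execute(
    "INSERT INTO log (type, type_id, action, message) VALUES (1, 9999, 2, 'ban')"
)
orphans = conn.execute("""
    SELECT COUNT(*) FROM log
    WHERE type = 1 AND type_id NOT IN (SELECT id FROM users)
""").fetchone()[0]

# The typed table rejects the same mistake.
try:
    conn.execute("INSERT INTO log_user (user_id, action, message) VALUES (9999, 2, 'ban')")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(orphans, fk_rejected)
```

With the single table, this kind of check has to live in the application or in a periodic cleanup query like the one counting orphans.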
I want to add a small tip. It's a little off topic, and quite basic, but it's a lot clearer to use enum instead of tinyint for status flags, e.g.
enum('user','type')
If there are only two statuses, tinyint is a little more memory efficient, but less clear. Another disadvantage of enum is that you put part of the business logic in the data tier: when you need to add or remove statuses, you have to alter the DB. Otherwise it's much clearer, and I prefer enum.
I would keep things as specific as possible - in this case I would create two tables.
Each table has a specific purpose so I cannot see why you would combine them.
I wouldn't do what vBulletin does. The problem with older apps like vBulletin is that while they might have started as lean-machines, over the time they collect a lot of entropy and end up being bloated. Since there are plugins, and third-party tools, and developers who've worked on the old code, breaking it is a tough choice.
That's why there is not much refactoring going on there. Don't make them your programming model. Look around, find out what works best and use that. A lot of tables sounds like a bad thing to me, not a good one.