I'm creating a music player, where the user can search for artists, albums, or songs.
I have created a script that reads all the tags from the mp3s in the music library, and updates a database of songs, in a single table, containing artist names, albums, track titles, etc.
Currently, this works well, because it can scan for any changes in the music library, and add/delete rows for corresponding songs in the database.
This scan routine is therefore a fairly short and easy-to-understand piece of code, because it maintains only a single table.
I understand the database would be more powerful if artists, albums, and tracks have their own table, and are all linked to each other. I haven't done anything about the search part yet -- how screwed am I, if I keep everything in one table?
Thanks.
Your database is not normalized. You say it's all in one table, but you haven't given any information about the schema.
The kinds of problems which non-normalized databases have include problems with consistency related to storing redundant information - if you have something like:
Album, Track, Artist
then to change the Album name, you have to change it on every track associated with the Album.
Of course, there are all kinds of "database" systems out there which are not normalized, but these usually have mechanisms to handle these kinds of things which are appropriate to their paradigms.
In regards to the Pink/P!nk situation, if that's a big deal to you, then yes, normalization would be useful.
Your songs table would reference an artist_id.
You'd also have a table of artist aliases, which would map the various names that a particular artist has gone by to that artist_id.
But this can get pretty complex, and technically it may not even be correct in your situation: if an artist chooses to release projects under different names, they may not want them all lumped together.
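As a minimal sketch of that alias design (all table and column names here are hypothetical, not anything prescribed):

CREATE TABLE artist (
  artist_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  canonical_name VARCHAR(255) NOT NULL
);

CREATE TABLE artist_alias (
  alias VARCHAR(255) PRIMARY KEY,     -- e.g. 'Pink' and 'P!nk'
  artist_id INT UNSIGNED NOT NULL,
  FOREIGN KEY (artist_id) REFERENCES artist (artist_id)
);

CREATE TABLE song (
  song_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  artist_id INT UNSIGNED NOT NULL,    -- the songs table references an artist_id
  FOREIGN KEY (artist_id) REFERENCES artist (artist_id)
);

A search resolves whatever the user typed through artist_alias to a single artist_id, so 'Pink' and 'P!nk' return the same songs.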
In general, normalized databases are a safe place to start, but there are plenty of good reasons to denormalize, and it is more important to understand those reasons than to blindly always do things one way.
pretty screwed, indeed. it's hardly normalized. go for separate tables.
if you've never heard of normalization or understood why it was important, perhaps you should read this. it's a succinct, simple explanation without a lot of jargon.
or you could go straight to the source since you're already using mysql:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
think about the cardinalities and relationships in your model:
an album will have zero or more tracks; a track will belong to only one album (album-to-track is one to many)
an artist can create zero or more albums; an album can be created by one or more artists (artist-to-album is many-to-many)
you'll want to think carefully about indexes, primary, and foreign keys. add indexes to non-key columns or groups that you'll want to search on.
this design would have four tables: album, track, artist, and an artist_to_album many-to-many join table.
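a rough sketch of those four tables; the names and column choices are only suggestions, so adjust them to your model:

CREATE TABLE artist (
  artist_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  INDEX (name)                        -- searched on, so indexed
);

CREATE TABLE album (
  album_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  INDEX (title)
);

CREATE TABLE track (
  track_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  album_id INT UNSIGNED NOT NULL,     -- a track belongs to exactly one album
  title VARCHAR(255) NOT NULL,
  INDEX (title),
  FOREIGN KEY (album_id) REFERENCES album (album_id)
);

CREATE TABLE artist_to_album (        -- resolves the many-to-many
  artist_id INT UNSIGNED NOT NULL,
  album_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (artist_id, album_id),
  FOREIGN KEY (artist_id) REFERENCES artist (artist_id),
  FOREIGN KEY (album_id) REFERENCES album (album_id)
);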
So the subject you're asking about is called "Normalization" and while this is useful in many circumstances, it can't always be applied.
Consider the artist Pink. Some of her albums have her name as Pink and others as P!nk, which we recognize visually as the same, because we know it's her. But a database would necessarily see these as two separate artists (which also makes searching for her songs harder, but that's another story). Also consider Prince, "The Artist Formerly Known as Prince", etc.
So it might be possible to have an artist ID that matches to both Pink and P!nk but that also matches to her albums Funhouse etc. (I'm really gonna stop with the examples now, as any more examples will need to be tabular).
So, I think the question becomes: how complex do you want your searching to be? As is, you're able to maintain a 1:1 correlation between tag and database info. It just depends how fancy you want things to be. Also, for the lookup I mentioned above, consider that most of the time that information is coming from the user; you really can't supply a lookup from P!nk to Pink any more than you would from Elephant to Pachyderm, because you don't know what people are going to want to enter.
I think in this case, the naive approach is just as well.
I'm struggling with a database design issue, and it's kind of a long-winded one:
My website will have an unlimited number of organizations that users can join, subgroups under those organizations, and finally specific profiles for those subgroups. Subgroups within the same organization will be able to borrow and make changes to each other's profiles. Users will generate the organizations, the subgroups, and the profiles.
I can draw it out and make the flow sensible on paper. When it comes to actually putting it into SQL, I'm lost. The majority of the help guides out there assume static groups, so a simple primary and foreign key setup can refer back to the right information. Mine has too much dynamic information for most of these to work outright, as I understand it.
Most writers say to stay away from dynamically generated tables, but that's where my instinct takes me. Another idea I had was three massive tables: one each for Organizations, Groups, and Profiles.
So is there a better way to go about this? Or are there any good documents I should read up on to help me translate from drawing to actual code?
I have some experience with both SQL and MongoDB if that helps explain things.
I don't know about MongoDB (NoSQL), but from the SQL standpoint, here is my opinion.
As far as your schema goes: most of the time, when your "instinct" says that a dynamic-tables solution is the only fit for the problem you are working on, remember there is a high chance that the very same problem can be solved by multiple static tables with the right relationships between them. (By static I mean tables which you have created yourself as a developer, at design time.)
I'll also mention that in my early days I approached problem-solving the same way, until I started to understand the principles of how databases actually work.
Back to your problem:
If your organisation hierarchy consists of three major types of objects/levels, viz. Organizations, Groups, and Profiles, then I'd suggest you go with the 3 tables and the correct relationships between them; any SQL engine is quite efficient at handling that, in comparison to creating tables at runtime.
Now if the hierarchy is dynamic (say, an organisation can contain many groups, which in turn contain profiles, which again can contain other organisations, and so on), then you may want to look at recursive structures in SQL. (Just do a Google search; there are a lot of articles about that.)
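As a sketch of the static three-table design (all names are placeholders), followed by the self-referencing alternative for a truly dynamic hierarchy:

CREATE TABLE organization (
  org_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL
);

CREATE TABLE subgroup (
  group_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  org_id INT UNSIGNED NOT NULL,       -- each group belongs to one organization
  name VARCHAR(255) NOT NULL,
  FOREIGN KEY (org_id) REFERENCES organization (org_id)
);

CREATE TABLE profile (
  profile_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  group_id INT UNSIGNED NOT NULL,     -- each profile belongs to one group
  data TEXT,
  FOREIGN KEY (group_id) REFERENCES subgroup (group_id)
);

-- if the nesting really is arbitrary, one self-referencing table plus a
-- recursive query (MySQL 8+) replaces any need for runtime-created tables:
CREATE TABLE node (
  node_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  parent_id INT UNSIGNED NULL,
  FOREIGN KEY (parent_id) REFERENCES node (node_id)
);

WITH RECURSIVE subtree AS (
  SELECT node_id FROM node WHERE node_id = 1   -- start at some root
  UNION ALL
  SELECT n.node_id FROM node n JOIN subtree s ON n.parent_id = s.node_id
)
SELECT node_id FROM subtree;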
I'm rewriting a system that is currently linked to a MySQL database that is roughly 1GB in size. There are hundreds of thousands of articles, each with a list of contributors (think Wiki style). I've not yet been given access to the existing database schema, but while I wait I've been brainstorming a bit.
Basically, what I'm wondering is if having an article_contributors table would be an efficient way of handling this or if there is a better method to approaching this situation. Considering there are roughly 200,000 articles, if there are 5 contributors on each, that'd be 1,000,000 rows in the meta table.
I'd call that a one-to-many table, not a "meta" table. Or else a multi-valued attribute.
Storing contributors in a separate table, one per row, is the proper way of designing a relational database. There may be other ways to store the data, but they are not relational.
Consider my answer to Is storing a delimited list in a database column really that bad? Storing the contributors as a list in the articles table causes a lot of common SQL queries to break or become horribly inefficient. If you need to do a variety of queries against this data, you will thank yourself for storing it in a normalized fashion.
On the other hand, if you never query anything but the list of contributors as an indivisible unit, then why not store it denormalized (as a list)? That's a valid choice too -- but it depends on how you're going to use the table.
By the way, 1 million rows is not a large MySQL database by some people's standards. This week I'm advising a client who has a table with 900 million rows.
An interesting question!
You're going to need to see the schema to get a straight answer about this. That's because the schema probably embodies some core decisions made by experts in bibliography (reference librarians, etc).
If you try to use a join table (articles_contributors) so you can avoid listing a given contributor multiple times when she contributes to multiple articles, you're implicitly declaring that you can create a canonical list of contributors, with a contributor_id for each distinct person.
In the world of bibliography and library science, that sort of list is called a "controlled vocabulary". It's controlled by an "authority". (Read this: http://en.wikipedia.org/wiki/Authority_control) That is, some organization has the responsibility to decide whether this "Jane Smaith" is a different person from that "Jane Smith". That is surprisingly hard to do correctly with people.
For an example of a relatively simple controlled vocabulary, see the "North American Industry Classification System" (NAICS). This has a code for each distinct kind of industry. http://www.census.gov/eos/www/naics/ It's controlled by national committees in three countries. Many bibliographic databases that cover industry include those terms as one of the ways of classifying their contents.
The designers of the system you're soon to take over will have made decisions about these kinds of controlled vocabularies. Will they have one for contributors? You could wait and see, or you could ask. But one thing is sure: the bibliographic designers won't be too delighted if you, on your own authority, create that kind of controlled vocabulary.
The Library of Congress in the USA doesn't attempt to create a controlled list of authors and contributors.
Edit
If you do have a definitive list of contributors, it is a good idea to create a join table articles_contributors as you suggested. You should consider the following columns:
article_id - part of the primary key
contributor_id - part of the primary key
role - part of the primary key; values like "author", "illustrator", "editor", etc.
order - 1, 2, 3, so contributors can be listed in the proper order
contact - 1 or 0, indicating whether readers should contact this author for more info
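As a sketch, those columns could translate into DDL like the following; note that ORDER is a reserved word in SQL, so it is renamed display_order here, and minimal articles and contributors tables are assumed for the foreign keys:

CREATE TABLE articles (
  article_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(500) NOT NULL
);

CREATE TABLE contributors (
  contributor_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) NOT NULL
);

CREATE TABLE articles_contributors (
  article_id     INT UNSIGNED NOT NULL,
  contributor_id INT UNSIGNED NOT NULL,
  role           VARCHAR(20)  NOT NULL,          -- 'author', 'illustrator', ...
  display_order  SMALLINT UNSIGNED NOT NULL,     -- 1, 2, 3 for listing order
  contact        TINYINT(1) NOT NULL DEFAULT 0,  -- 1 = contact this person
  PRIMARY KEY (article_id, contributor_id, role),
  FOREIGN KEY (article_id) REFERENCES articles (article_id),
  FOREIGN KEY (contributor_id) REFERENCES contributors (contributor_id)
);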
I am designing a database and am wondering which approach should I use. I am going to describe the database I intend to design and the possible approaches that I can use to store the data in the tables.
Please recommend which approach I should use and why?
About the data:
A) I have seven attributes that need to be taken care of. These are just examples and not the actual ones I intend to store. Let me call them:
1) Name
2) DOB (modified: I had earlier put in age here)
3) Gender
4) Marital Status
5) Salary
6) Mother Tongue
7) Father's Name
B) There will be a minimum of 10000 rows in the table and they can go up from there in the long term
C) The number of attributes can change over the period of time. That is, new attributes can be added to the existing dataset. No attributes will ever be removed.
Approach 1
Create a table with 7 attributes and store the data as it is. Add new columns if and when new attributes need to be added.
Pro: Easier to read the data and information is well organized
Con: There can be a lot of null values in certain rows for certain attributes for which values are unknown.
Approach 2
Create a table with 3 attributes. Let them be called :
1) Attr_Name: stores the attribute name, e.g. name, age, gender, etc.
2) Attr_Value: stores the value for the above attribute, e.g. Tom, 25, Male
3) Unique ID: uniquely identifies the name/value pair in the database, e.g. SSN
So, in approach 2, in case new attributes need to be added for certain rows, we can just add them to the hashmap we have created without worrying about null values.
Pro: Hashmap structure. Eliminates nulls.
Con: Data is not easy to read. Information cannot be easily grasped.
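To make the two approaches concrete, here is a minimal sketch of each (the table and column names are only illustrative):

-- Approach 1: one wide row per person
CREATE TABLE person (
  person_id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name           VARCHAR(255),
  dob            DATE,
  gender         CHAR(1),
  marital_status VARCHAR(20),
  salary         DECIMAL(10,2),
  mother_tongue  VARCHAR(50),
  father_name    VARCHAR(255)       -- unknown values are simply NULL
);

-- Approach 2: entity-attribute-value (EAV), one row per known fact
CREATE TABLE person_attribute (
  person_id  INT UNSIGNED NOT NULL,
  attr_name  VARCHAR(50)  NOT NULL,    -- 'name', 'dob', 'gender', ...
  attr_value VARCHAR(255) NOT NULL,    -- everything is stored as a string
  PRIMARY KEY (person_id, attr_name)
);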
D) The Question
Which is the better approach?
I feel that approach 1 is the better approach, because it's not too tough to handle null values, the data is well organized, and it's easy to grasp this kind of data. Please suggest which approach I should use and why?
Thanks!
This is a typical narrow table (attribute based) vs. wide table discussion. The problem with approach #2 is that you are probably going to have to pivot the data, to get it into a form the user can work with (back into a wide view format). This can be very resource intensive as the number of rows grows, and as the number of attributes grows. It's also hard to look at the table, in raw table view, and see what's going on.
We have had this discussion many times at our company. We have some tables that lend themselves very well to an attribute-type schema. We've always decided against it because of the necessity to pivot the data and the inability to view the data and have it make sense (but this is the lesser of the two problems for us; we just don't want to pivot millions of rows of data).
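For illustration, here is roughly what that pivot looks like in MySQL, assuming an EAV table shaped like the person_attribute sketch in the question (hypothetical names); it needs one more MAX(CASE ...) line, and a rewrite, every time an attribute is added:

SELECT person_id,
       MAX(CASE WHEN attr_name = 'name'   THEN attr_value END) AS name,
       MAX(CASE WHEN attr_name = 'dob'    THEN attr_value END) AS dob,
       MAX(CASE WHEN attr_name = 'gender' THEN attr_value END) AS gender
FROM person_attribute
GROUP BY person_id;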
BTW, I wouldn't store age as a number. I would store the birth date, if you have it. Also, I don't know what 'Mother Tongue' refers to, but, if it's the language the mother speaks, I would store this as a FK to a master language table. It's more efficient and lessens the problem of bad data because of a misspelled language.
Your second option is one of the worst design mistakes you can make. This should only be done when you have hundreds of attributes that change constantly and are in no way the same from object to object (such as medical lab tests). If you need to do that, then do not under any circumstances use a relational database to do it. NoSQL databases handle EAV designs far better than relational ones.
Another problem with design 2 is that it becomes almost impossible to have good data integrity, as you cannot correctly enforce FKs and data types or add constraints to the data. Since this stuff should never be enforced only in the application (things other than the application often affect the data), this factor alone is enough to make your second idea foolish and foolhardy.
The first design will perform better in general. It will be easier to write queries, and it will force you to think about what needs to change when you add an attribute (this is a plus, not a minus) instead of having to design to always show all attributes whether you need them or not. If you would have a lot of nulls, then add a related table rather than more columns (you can have one-to-one related tables). Usually in this case you might have something that you know only a subset of the records will have, and they often fall into groupings by subject fairly naturally. For instance, you might have general people-related attributes (name, phone, email, address) that belong in one table. Then you might have student-related attributes that belong in a separate table and teacher-related attributes that belong in a third table. Or you might have things you need for all insurance policies and separate tables for vehicle insurance, health insurance, house insurance and life insurance.
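A sketch of that one-to-one pattern, using the student/teacher example (hypothetical names; the shared primary key is what makes each relationship one-to-one):

CREATE TABLE people (
  person_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name  VARCHAR(255) NOT NULL,
  phone VARCHAR(20),
  email VARCHAR(255)
);

CREATE TABLE student_details (        -- only students get a row here
  person_id INT UNSIGNED PRIMARY KEY,
  enrollment_year YEAR,
  major VARCHAR(100),
  FOREIGN KEY (person_id) REFERENCES people (person_id)
);

CREATE TABLE teacher_details (        -- only teachers get a row here
  person_id INT UNSIGNED PRIMARY KEY,
  department VARCHAR(100),
  hire_date DATE,
  FOREIGN KEY (person_id) REFERENCES people (person_id)
);

No row simply means "not applicable", so the base table carries no blocks of NULL columns.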
There is a third design possibility. If you have a set of attributes you know up front then put them in one table and have an EAV table only for attributes that cannot be determined at design time. This is the common pattern when the application wants to have the flexibility for the user to add customer specific data fields.
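A sketch of that hybrid (the names are invented for the example): attributes known at design time get real columns, and only the user-defined extras go into the EAV side table.

CREATE TABLE customer (
  customer_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name  VARCHAR(255) NOT NULL,        -- known at design time: real columns
  email VARCHAR(255)
);

CREATE TABLE customer_custom_field (  -- EAV only for customer-specific extras
  customer_id INT UNSIGNED NOT NULL,
  field_name  VARCHAR(50)  NOT NULL,
  field_value VARCHAR(255),
  PRIMARY KEY (customer_id, field_name),
  FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);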
I don't think anyone can really determine which one is better immediately, but here are a couple of things to think about:
Do you have sample data? If yes, then see whether there will be a lot of nulls; if there are not, then just go with option 1.
Do you have a good sense of how the attributes will grow? For instance, looking at the attributes you listed above, you may not know all of them, but they all do exist, so in theory you could fill the table. If you will have a lot of sparse data then #2 may work.
When you do get new types of data can you group it into another table and use a foreign key? For instance if you want to capture the address you could always have an address table that references your initial table
What type of queries do you plan on using? It's much harder to query a key-value table than a "normal one" (not super hard, just harder - if you're comfortable using implied joins and the like to normalize the data then it's probably not a big deal).
Overall I'd be really careful before you implemented #2 - I've done it for certain specialized cases (metrics gathering where I have dozens of different metrics and don't really want to maintain dozens of different tables) but in general it's more trouble than it's worth.
For something like this I'd just create one table, and either add columns as you go along, or just create new tables for new data structures if necessary.
A while ago, I came to the realization that a way I would like to hold the skills for a player in a game would be through CSV format. On the player's stats, I made a varchar of skills that would be stored as CSV. (1,6,9,10 etc.) I made a 'skills' table with affiliated stats for each skill (name, effect) and when it comes time to see what skills they have, all I have to do is query that single column and use PHP's str_getcsv() to see if a certain skill exists because it'll be in an array.
However, my coworker suggests that a superior system is to have each skill simply be an entry into a master "skills" table that each player will use, and each skill will have an ID foreign key to the player. I just query all rows in this table, and what's returned will be their skills!
At first I thought this wouldn't be very good at all, but it appears the Internet disagrees. I understand that it's less searchable - but it was not my intention to ever say, "does the player have x skill?" or "show me all players with this skill!". At worst if I wanted such data, I'd just make a PHP report for it that would, admittedly, be slow.
But it appears as though this is really faster?! I'm having trouble finding a hard answer extending beyond "yeah it's good and normalized". Can Stack Overflow help me out?
Edit: Thanks, guys! I never realized how bad this was. And sorry about the dupe, but believe me, I didn't type all of that without at least checking for dupes. :P
Putting comma-separated values into a single field in a database is not just a bad idea, it is the incarnation of Satan expressed in a database model.
It cannot represent a great many situations accurately (cases in which the value contains a comma or something else that your CSV-consuming code has trouble with), often has problems with values nested in other values, cannot be properly indexed, cannot be used in database JOINs, is difficult to dedupe, cannot have additional information added to it (number of times the skill was earned, in your case, or a skill level), cannot participate in relational integrity, cannot enforce type constraints, and so on. The list is almost endless.
This is especially easy to fix in MySQL, which has the very convenient group_concat function that makes it easy to present this data as a comma-separated string when needed, while still maintaining the full functionality and speed of a normalized database.
You gain nothing from using the comma-separated approach but lose searchability and performance. Get Satan behind thee, and normalize your data.
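To illustrate the group_concat point: assuming a normalized join table like player_skill(player_id, skill_id) (hypothetical names; see the schema sketch further down), the old CSV string can still be produced on demand:

SELECT player_id,
       GROUP_CONCAT(skill_id ORDER BY skill_id) AS skills_csv  -- e.g. '1,6,9,10'
FROM player_skill
GROUP BY player_id;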
Well, there are things such as scalability to consider. What if you need to add or remove a skill? How about renaming a skill? What happens if the number of skills outgrows the size of your field? It's bad practice to have to resize a field just to accommodate something like this.
What about maintainability? Could another developer come in and understand what you've done? What happens if the same skill is given to a player twice?
Your coworker's suggestion is not correct either. You would have 3 tables in this case: a master player table, a skills table, and a table that has a relationship to both, creating a many-to-many relationship, allowing a single skill to be associated with many players, and many players to have the same skill.
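A sketch of those three tables (hypothetical names), plus the query that the normalized design makes cheap:

CREATE TABLE player (
  player_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);

CREATE TABLE skill (
  skill_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  name   VARCHAR(100) NOT NULL,
  effect VARCHAR(255)
);

CREATE TABLE player_skill (           -- the many-to-many link
  player_id INT UNSIGNED NOT NULL,
  skill_id  INT UNSIGNED NOT NULL,
  PRIMARY KEY (player_id, skill_id),  -- also prevents giving a skill twice
  FOREIGN KEY (player_id) REFERENCES player (player_id),
  FOREIGN KEY (skill_id)  REFERENCES skill (skill_id)
);

-- "what skills does player 42 have?" becomes a simple indexed join:
SELECT s.name, s.effect
FROM player_skill ps
JOIN skill s ON s.skill_id = ps.skill_id
WHERE ps.player_id = 42;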
Since the database will index the content (assuming that you use indexes), it will be very, very fast to search the content and get the desired results. Remember: databases are designed to hold a lot of information, and a database such as MySQL, which is a relational database, is made for relations.
Another matter is the maintainability of the system. It will be much much easier to maintain a system that's normalized. And when you are to remove or add a skill it will be easier.
When you are about to get the information from the database regarding the skills of the player you can easily get information connected to the concerned skills with a simple JOIN.
I say: Let the database do what it does best - handle the data. And let your programming do what it should do ;)
Overview (sorry it's vague; I think if I went into more detail it would just overcomplicate things)
I have three tables, table one contains an id, table two contains its own id and table one's id and table three contains its own id and table two's id.
I have spent a lot of time pondering, and I think it would be more efficient for table three to also contain the related table one's id.
- It will mean I will not have to join three tables; I can just query table three (for a query that will be used very often)
- It will allow me to implement a reservation system more easily, by only locking rows within table three that contain a specific id from table one
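For concreteness, the proposed layout might look like this (hypothetical names; the one_id column in table_three is the denormalized copy):

CREATE TABLE table_one (
  one_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY
);

CREATE TABLE table_two (
  two_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  one_id INT UNSIGNED NOT NULL,
  FOREIGN KEY (one_id) REFERENCES table_one (one_id)
);

CREATE TABLE table_three (
  three_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  two_id INT UNSIGNED NOT NULL,
  one_id INT UNSIGNED NOT NULL,   -- denormalized: must stay in sync with
                                  -- the one_id on the parent table_two row
  FOREIGN KEY (two_id) REFERENCES table_two (two_id),
  FOREIGN KEY (one_id) REFERENCES table_one (one_id)
);

-- the frequent query, and the reservation lock, then avoid the joins:
SELECT * FROM table_three WHERE one_id = 1 FOR UPDATE;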
For anyone who wants to know more about the database layout there is more info here
Question
What are the disadvantages of de-normalisation? I have seen some people who are completely against it and others who believe that in the right situation it is a useful tool. The ids will never change, so I do not really see any disadvantage other than having to insert the same data twice, and thus the additional space it will consume (which, as it is just ids, will surely be negligible).
My advice is to follow this general rule: Normalise by default, then denormalise if and when you identify a performance problem which it will solve.
I find normalised data, and code dealing with it, easier and more logical to maintain. I don't think there is any problem using denormalisation to improve performance, but I would not speculatively apply any performance optimisation which results in a decrease in maintainability until I was sure it was necessary.
The only time you really want to denormalize is if it's required to get the performance you want.
This was already asked several times. See here
As it's a one (table 1) to many (table 2), with another one (table 2) to many (table 3), I would keep the same structure, as there seem to be 3 layers there.
e.g.
Table 1
  Table 2
    Table 3
Also, a lot will depend on what additional fields you are storing within those tables.
Every rule might be broken if there is a good reason for it.
In your case I wonder what the three tables contain. Does table three really describe table two, or does it describe table one directly?
The disadvantage of having self-id, table-two-id, and table-one-id in table three in this case is that it can lead to inconsistency: what if, by mistake, you have table-one-id 1 in table two and table-one-id 15 in table three?
It depends on the data and the entity relationships in your data. For me, it would be more important to have no inconsistencies, even if selects take a little more time...
EDIT: After reading about your tables, I would suggest adding a table-one-id to table three (areas), because table-one-id doesn't change after all, and for that reason it's relatively safe from inconsistency.
Normalization vs efficiency is usually a trade-off, while normalization is generally a good thing, it is not a silver bullet. If you have a clear reason (as it seems you do), denormalization is perfectly acceptable.
Schemas containing less than fully normalized tables suffer from what is called "harmful redundancy". Harmful redundancy can result in storing the same fact in more than one place, or in not having any place to store a fact that needs to be stored. These problems are known as "insert anomalies", "update anomalies", or "delete anomalies".
To make a long story short, if you store a fact in more than one place, then sooner or later you are going to store mutually contradictory facts in the two places, and your database will begin to give contradictory answers, depending on which version of the facts the query found.
If you are forced to "invent a dummy record" in order to have a place to store a needed fact, then sooner or later you are going to write a query that mistakenly treats the dummy record like a real one.
If you are a super programmer, and you never make mistakes, then you don't have to worry about the above. I never met such a programmer, although I've met lots of people who think they never make mistakes.
I would refrain from "denormalizing" as a practice. That's like "driving away from Chicago". You still don't know where you are going. However, there are times when normalization rules should be disregarded, as others have noted. If you are designing a star schema (or a snowflake schema) you are going to have to disregard some of the normalization rules in order to get the best star (or snowflake).