This is a more theoretical question, not a specific scenario:
Let's assume we have a simplified table schema like this:
items contains some basic data, item_data holds additional properties for each item, and rel_items establishes a tree relationship between the different items. There are different types of items (represented by the field items.item_type) which have different fields stored in item_data, for example: dog, cat, mouse.
If we have bigger queries with several joins and conditions (things like getting items together with their parent items, with conditions involving other items, and so on), could this become a performance issue compared to splitting the different types of items into separate tables (dog, cat, mouse) instead of merging them into a single one?
If we keep it all in one basic item table, does creating views (dog, cat, mouse) impact performance somehow?
edit (as commented below): I thought of "species", "house-pets" and so on as item_types. Each type has different properties. The intention of using a basic item table and the item_data table is to have a basic "object" and attach as many properties to it as necessary without having to modify the database schema. For example, I don't know how many animals there will be in the application or what properties they have, so I thought of a database schema that doesn't need to be altered each time the user creates a new animal.
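For concreteness, a minimal sketch of such a schema; only the three table names and items.item_type come from the description above, all column names are my own assumptions:

CREATE TABLE items (
    item_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    item_type VARCHAR(50) NOT NULL,          -- e.g. 'dog', 'cat', 'mouse'
    name      VARCHAR(255) NOT NULL
);

CREATE TABLE item_data (
    item_id  INT UNSIGNED NOT NULL,          -- which item the property belongs to
    prop_key VARCHAR(100) NOT NULL,          -- property name, e.g. 'fur_color'
    prop_val VARCHAR(255),                   -- property value
    PRIMARY KEY (item_id, prop_key),
    FOREIGN KEY (item_id) REFERENCES items (item_id)
);

CREATE TABLE rel_items (
    parent_id INT UNSIGNED NOT NULL,         -- tree relationship between items
    child_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (parent_id, child_id),
    FOREIGN KEY (parent_id) REFERENCES items (item_id),
    FOREIGN KEY (child_id)  REFERENCES items (item_id)
);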
If we have some bigger queries with some joins ..., could this become a performance issue compared to splitting all different types of items into separate tables (dog, cat, mouse) and not merging them into a single one?
No.
If we keep it all in one basic item table, does creating views (dog, cat, mouse) impact performance somehow?
No.
Separate tables means they're fundamentally different things -- different attributes or different operations (or both).
Same table means they're fundamentally the same things -- same attributes and same operations.
Performance is not the first consideration.
Meaning is the first consideration.
After you sort out what these things mean, and what the real functional dependencies among the items are, then you can consider join performance.
"Dog, cat, mouse" are all mammals. One table.
"Dog, cat, mouse" are two carnivores and one omnivore. Two tables.
"Dog, cat, mouse" are two conventional house-pets and one conventional pest. Two tables.
"Dog, cat, mouse" are one cool animal and two nasty animals. Two tables.
"Dog, cat, mouse" are three separate species. Three tables.
It's about the meaning.
The attempt to build a schema that can accommodate new objects, ones not analyzed and included when the database was designed, is an idea that comes up over and over again in discussions of relational databases.
In classical relational data modeling, relations can be devised in the light of certain propositions that are to be asserted about the universe of discourse. These propositions are the facts that users of the data can obtain by retrieving data from the database. Base relations are asserted by storing something in the database. Derived relations can be obtained by operations on the base relations. When an SQL database is built using a relational data model as a guide, base relations become tables and derived relations become views.
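As a minimal sketch (assuming the single items table from the question), the dog, cat and mouse views would simply be derived relations over the one base table. MySQL can typically merge such simple views into the outer query, so they add essentially no overhead:

CREATE VIEW dog AS
    SELECT * FROM items WHERE item_type = 'dog';

CREATE VIEW cat AS
    SELECT * FROM items WHERE item_type = 'cat';

-- A query against the view is rewritten into a query against items, e.g.
--   SELECT * FROM dog WHERE name = 'Rex'
-- becomes
--   SELECT * FROM items WHERE item_type = 'dog' AND name = 'Rex'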
But all of this presupposes that the attributes are discovered during data analysis, before database design begins.
In practice, over the last 25 years, most databases have been built on the basis of analysis later revealed to have been incomplete or incorrect. Databases then get revised in the light of new and improved analysis, and the revised database sometimes requires application code maintenance. To be sure, the relational model and the SQL databases created fewer application dependencies than the pre-relational databases did.
But it's natural to try to come up with a generic data schema like yours, one that can accommodate any subject matter whatsoever with no schema changes. There are consequences to this approach, and they involve far greater costs than mere performance issues. For small projects, these costs are quite manageable, and the completely generic schema may work well in those cases.
But in the very big cases, where there are dozens of entity types and hundreds of relevant propositions based on those entities and their relationships, the attempt to build a schema that is "subject matter agnostic" has often resulted in disaster. These development disasters are well documented, and the larger disasters involve millions of dollars of wasted effort.
I can't prove to you that such an approach has to lead to disaster. But learning from other people's mistakes is often much more worthwhile than taking the risk of repeating them.
Admittedly, accessing data across joined tables WILL always be slower than reading a single table.
But with proper indexes the slowdown may be acceptable (something like 2x).
I would move the columns you commonly use in queries into the items table, and keep in item_data only the values you need for display, i.e. the ones not used in WHERE and JOIN conditions.
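A sketch of what that looks like in practice (column names assumed): the columns used for filtering live in items and are indexed, and item_data is joined only to fetch display values:

CREATE INDEX idx_items_type ON items (item_type);

SELECT i.item_id, i.name, d.prop_val
FROM items i
JOIN item_data d
  ON d.item_id = i.item_id AND d.prop_key = 'fur_color'   -- display value only
WHERE i.item_type = 'dog';                                 -- filter on an indexed items column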
Let's say you have a recipe app. Each recipe could have ingredients, directions, and tags.
Each of those could be ManyToMany with recipe.
As well, directions could be ManyToMany with steps, and ingredients could be ManyToMany with amounts, and amounts could be ManyToMany with measurements.
Is there a convention for when it does and doesn't make sense to use a junction table?
Yes, the convention is called database normalization.
Each many-to-many relationship requires its own junction table, if you want your database to be in normal form.
The purpose of normal forms is to reduce the chance for data anomalies. That is, accidentally inserting data that contradicts other data. If you follow rules of normalization, your database represents every fact only once, so there are no cases of data anomalies.
If you are uncomfortable with the level of complexity this represents, don't blame the tables! It's the real-world information you're trying to model that is causing the complexity.
Some people like to avoid the junction table, for example by storing a string in the recipes table with a comma-separated list of tags. This is an example of denormalization. It would make it simpler to query a recipe along with its tags, but it would make it harder to do some other types of queries, such as the following (see the sketch after this list for how the junction table handles them):
How many / which recipes share a given tag?
What's the average number of tags per recipe?
Update all recipes to change the spelling of a given tag, or remove a given tag from all recipes that have it.
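A minimal sketch of the normalized version, assuming hypothetical recipes, tags and recipe_tags tables; with the junction table, each of the queries above stays straightforward:

CREATE TABLE recipes (
    recipe_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    title     VARCHAR(255) NOT NULL
);

CREATE TABLE tags (
    tag_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE recipe_tags (                   -- junction table for the many-to-many
    recipe_id INT UNSIGNED NOT NULL,
    tag_id    INT UNSIGNED NOT NULL,
    PRIMARY KEY (recipe_id, tag_id),
    FOREIGN KEY (recipe_id) REFERENCES recipes (recipe_id),
    FOREIGN KEY (tag_id)    REFERENCES tags (tag_id)
);

-- How many recipes share a given tag?
SELECT COUNT(*)
FROM recipe_tags rt
JOIN tags t ON t.tag_id = rt.tag_id
WHERE t.name = 'vegan';

-- Average number of tags per recipe
SELECT AVG(tag_count)
FROM (
    SELECT r.recipe_id, COUNT(rt.tag_id) AS tag_count
    FROM recipes r
    LEFT JOIN recipe_tags rt ON rt.recipe_id = r.recipe_id
    GROUP BY r.recipe_id
) AS per_recipe;

-- Fix the spelling of a tag everywhere, in one statement
UPDATE tags SET name = 'dessert' WHERE name = 'desert';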
You may also like my answer: Is storing a delimited list in a database column really that bad?
In general every kind of optimization optimizes for one type of query, at the expense of other types of queries against the same data.
Organizing your data to conform to rules of normalization makes your database better able to serve a wider variety of queries.
What is the difference between a relation in a relational database and a dimension as represented in a star diagram?
As part of an assignment I have a relational data warehouse design, where most of the tables have been normalised using many-to-many, one-to-one and one-to-many relationships (I think this is the right terminology? Please correct me if I'm wrong). The next step is to draw a star diagram that could be used in a data-mining environment, which I guess means a fact table that draws from different dimensions...
I'm a little confused here because 1. any data-analysis I could think of could be taken from the relational database, so what's the point of re-structuring it? and 2. if some of the tables that you want to draw data from contain foreign keys, how do you split that into dimensions?
for example:
I have these relations:
Courses {course_id, description}
Modules {module_id, description}
Course_modules {course_id, module_id}
Students {student_id, address, enrollment_option, enrollment_date, name, surname, nationality, home_language, gender ...}
Module_grades {student_id, module_id, assignment_1, assignment_1_sub_date, assignment_2, assignment_2_sub_date, exam, exam_date, overall_result}
and I'd like to know how course results relate to module grades. With a relational database I would write a query joining a table containing student information with the module grades table. What would be the equivalent with dimensions and reports? Especially as I'm using multiple columns as my primary key in the grades relation.
An operational database is highly normalized, which improves write performance and minimizes write anomalies. It is designed to facilitate transaction processing.
An analytic database (data warehouse) is highly denormalized, which improves read performance and makes it easier for non-DBAs to understand. It is designed to facilitate analysis.
what is the difference between a relation of a relational database and a dimension
A data warehouse can be in a relational database, and can use its relations (tables), so there is no difference.
any data-analysis I could think of could be taken from the relational database, so what's the point of re-structuring it?
A data warehouse often includes data from many sources, not just your operational database. Examples: emails, website scraping.
If you tell your boss to join ten tables to do a simple analysis, you will get fired.
If some of the tables that you want to draw data from contain foreign keys, how do you split that into dimensions?
This depends entirely on what you are trying to analyze, but in general you denormalize and copy the data to dimension tables.
Dimensional Design
You need to start with a process or event that you want to analyze.
Use Excel. Add all the columns that are pertinent to your analysis. For example, if you were analyzing the process of people visiting your website, each row in Excel would represent a site visit, and columns might be start_time, # pages visited, first page, last page, etc.
Now do ONE level of normalization. Find categorical columns that you can group together (like info about the user's web browser); these would go in a browsers dimension table. Find (true) numerical values that you cannot normalize out; these are measures, for example the number of pages visited.
The measures, and keys that refer to your dimension tables, are your fact table.
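Tying this back to the example in the question, a heavily simplified (and assumed) star layout could treat "a student completing a module" as the event: one fact row per student per module, with students and modules (with their course information copied in) as dimensions:

CREATE TABLE dim_student (
    student_key   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- surrogate key
    student_id    INT UNSIGNED NOT NULL,                    -- natural key from the source system
    nationality   VARCHAR(100),
    home_language VARCHAR(100),
    gender        CHAR(1)
);

CREATE TABLE dim_module (
    module_key         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    module_id          INT UNSIGNED NOT NULL,
    description        VARCHAR(255),
    course_id          INT UNSIGNED,          -- denormalized: course data copied into the dimension
    course_description VARCHAR(255)           -- (assumes one course per module, for simplicity)
);

CREATE TABLE fact_module_grade (              -- one row per student per module
    student_key    INT UNSIGNED NOT NULL,     -- refers to dim_student
    module_key     INT UNSIGNED NOT NULL,     -- refers to dim_module
    assignment_1   DECIMAL(5,2),              -- the measures
    assignment_2   DECIMAL(5,2),
    exam           DECIMAL(5,2),
    overall_result DECIMAL(5,2),
    PRIMARY KEY (student_key, module_key)
);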
Now go read this book.
I hope the title is clear; please read further and I will explain what I mean.
We are having a disagreement with our database designer about high-level structure. We are designing a MySQL database and we have a trove of data that will become part of it. Conceptually, the data is complex - there are dozens of different types of entities (representing a variety of real-world things - you could think of them as product developers, factories, products, inspections, certifications, etc.), each with associated characteristics and with relationships to each other.
I am not an experienced DB designer but everything I know tells me to start by thinking of each of these entities as a table (with associated fields representing characteristics and data populating them), to be connected as appropriate given the underlying relationships. Every example of DB design I have seen does this.
However, the data is currently in a totally different form. There are four tables, each representing a level of data. A top-level table lists the 39 entity types and has a long alphanumeric string tying it to the other three tables, which represent all the entities (in one table), entity characteristics (in one table) and values of all the characteristics in the DB (in one table with tens of millions of records). This works - we have a basic view in PHP which lets you navigate among the levels and view the data, etc. - but it's non-intuitive, to say the least. The reason given for having it this way is that it makes the DB smaller, shortens query time and makes expansion easier. But it's not clear to me that DB size is something we should optimize for over, say, clarity of organization.
So the question is: is there ever a reason to structure a DB this way, and what is it? I find it difficult to get a handle on the underlying data - you can't, for example, run through a table in traditional rows-and-columns format - and it hides connections. But a more "traditional" structure with tables based on entities would result in many more tables, definitely more than 50 after normalization. Which approach seems better?
Many thanks.
OK, I will go ahead and answer my own question, based on the comments I got and the further research they led me to. The immediate answer is yes, there can be a reason to structure a DB with very few tables and all the data in one of them: it's an Entity-Attribute-Value (EAV) database. These are characterized by the following (a minimal sketch follows the list):
A very unstructured approach: each fact or data point is just dumped into a big table along with the characteristics necessary to understand it. This makes it easy to add more data, but it can be slow and/or difficult to get it back out. An EAV is optimized for adding data and for organizational flexibility; the price is that access is slower and queries are harder to write.
A "long and skinny" format: lots of rows, very few columns.
Because the data is "self-encoded" with its own characteristics, it is often used in situations where you know there will be lots of possible characteristics or data points but most of them will be empty ("sparse data"). A table approach would have lots of empty cells, but an EAV doesn't really have cells, just data points.
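A minimal sketch of that "long and skinny" shape, with assumed names (the database described above spreads this over four tables - entity types, entities, characteristics and values - but the shape is the same):

CREATE TABLE entity (
    entity_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    entity_type VARCHAR(100) NOT NULL        -- e.g. 'factory', 'product', 'inspection'
);

CREATE TABLE entity_value (                  -- one row per fact: lots of rows, very few columns
    entity_id  INT UNSIGNED NOT NULL,
    attr_name  VARCHAR(100) NOT NULL,        -- e.g. 'inspection_date', 'certified_by'
    attr_value VARCHAR(255),
    PRIMARY KEY (entity_id, attr_name),
    FOREIGN KEY (entity_id) REFERENCES entity (entity_id)
);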
In our particular case, we don't have sparse data. But we do have a situation where flexibility in adding data could be important. On the other hand, while I don't think that speed of access will be that important for us because this won't be a heavy-access site, I would worry about the ease of creating queries and forms. And most importantly, I think this structure would be hard for us DB noobs to understand and control, so I am leaning towards the traditional model - sacrificing flexibility and maybe ease of adding new data in favor of clarity. Also, people seem to agree that large numbers of tables are OK as long as they are really called for by the data relationships. So, decision made.
I'm currently designing a web application using PHP, JavaScript, and MySQL. I'm considering two options for the database.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables, with each record linked to the appropriate tournament (or matches to brackets, brackets to divisions, etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on-the-fly for me, and it seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned that with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!
Do not create tables for each tournament. A table represents a type of entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned that with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
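A rough sketch of those entities as plain tables (all names and columns assumed), linked by foreign keys rather than by table names:

CREATE TABLE tournaments (
    tournament_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name          VARCHAR(255) NOT NULL
);

CREATE TABLE divisions (
    division_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    tournament_id INT UNSIGNED NOT NULL,
    name          VARCHAR(255) NOT NULL,
    FOREIGN KEY (tournament_id) REFERENCES tournaments (tournament_id)
);

CREATE TABLE brackets (
    bracket_id  INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    division_id INT UNSIGNED NOT NULL,
    FOREIGN KEY (division_id) REFERENCES divisions (division_id)
);

-- All brackets of one tournament, no per-tournament table names required
SELECT b.*
FROM brackets b
JOIN divisions d ON d.division_id = b.division_id
WHERE d.tournament_id = 42;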
Don't prematurely optimize. It adds complexity without any actual reason.
First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or groups of columns) that uniquely identify a row, so you can use those as the primary key. However, usually it's best just to have a numeric column named ID or something similar. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to first normal form, since that ensures data atomicity. This is important because it will save you a lot of headaches later on. By doing this correctly, you will also end up with the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows; modern RDBMSes handle tables with hundreds of millions of rows, and they're designed to do that efficiently.
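A small illustration of that last point, using an assumed matches table: add an index on the column you filter by most, and use EXPLAIN to check that MySQL actually uses it:

-- Matches are looked up by bracket far more often than they are written
CREATE INDEX idx_matches_bracket ON matches (bracket_id);

-- The query plan shows whether the index is used or the whole table is scanned
EXPLAIN SELECT * FROM matches WHERE bracket_id = 123;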
Besides compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX", and that search is done in the DBMS's data dictionary, which is itself a bunch of tables. So you are replacing a search within your tables with a search within the data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have performance characteristics similar to real tables, but I bet those characteristics are unlikely to be better than those of "normal" tables for large numbers of rows. Also, the performance of the data dictionary is unlikely to be documented, and you really shouldn't rely on undocumented behavior.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would put additional pressure on performance.
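For comparison, the conventional design keeps a single brackets table with a tournament_id column (names assumed here), so one prepared statement serves every tournament; in application code you would use your driver's placeholder syntax instead of PREPARE/EXECUTE:

PREPARE get_brackets FROM 'SELECT * FROM brackets WHERE tournament_id = ?';
SET @tid = 42;
EXECUTE get_brackets USING @tid;    -- the same statement and plan are reused for any tournament id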
The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
If you decide to add to or revise a table's structure later (e.g. adding a new field), you will have to apply the change to hundreds of tables, which will be cumbersome, error-prone and a big maintenance headache.
An RDBMS is built to scale in terms of rows, not in terms of tables and their associated elements (indexes, triggers, constraints) - so you are working against your tool, not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"? (See the sketch right after this list.)
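With shared tables, both requests are ordinary single queries; a rough sketch with all column names assumed (with one table per tournament you would instead have to loop over every BRACKETS_/MATCHES_ table in application code):

-- All matches played on a Sunday, across every tournament (DAYOFWEEK() returns 1 for Sunday)
SELECT * FROM matches WHERE DAYOFWEEK(played_at) = 1;

-- The three most recent brackets in which a given competitor was active
SELECT DISTINCT b.*
FROM brackets b
JOIN matches m     ON m.bracket_id = b.bracket_id
JOIN competitors c ON c.competitor_id IN (m.competitor1_id, m.competitor2_id)
WHERE c.name = 'Frank Perry'
ORDER BY b.created_at DESC
LIMIT 3;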
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?
I have 2 scenarios for a MySQL DB and I'm not sure which to choose, and I've run into the same dilemma for a few tables.
I'm making a web application only accessed by members. Each member has their own deals, expenses, and say "listings". The criteria for the records are the same across users, but each user can have a completely different number of records.
My two scenarios are: should I have one table for deals, one for listings and one for expenses, each with a field that links to the primary key of a particular user? Or is it better to have a separate deal table, expense table and listing table for each user (using a combined name like "user"+deals or "user"+exp)? Deals can be shared between 1 or 2 users, but expenses and listings are completely independent. I am going to have a master deal table to hold all the info for each deal, plus a user-deal table that links a user's primary key to a deal's primary key.
So, separate tables or one table? If there are thousands of users with hundreds of deals/expenses/listings each, I just don't want the queries to become extremely slow after a lot of deals or expenses have built up. No user will ever need to view anything from other users - strictly just their own data.
Also, I'm familiar with how a database works and stores data, but I'm not 100% clear. I just want it to work quickly, so my other question is (although it may be stupid): when a user submits a new deal or expense, is it inserted at the beginning or the end of the table? Or is it irrelevant, because a query will search everything in the table either way before returning information?
Always use one table to store one kind of entity.
Or more specifically, what you're talking about is a nasty, complicated optimisation that works in an incredibly small subset of cases which almost certainly isn't yours.
You want to use just one table for one kind of entry. Index it appropriately, and try to get rid of old records when you don't need them any more.
Also, a lot of people's idea of "big data" isn't actually particularly big. Databases normally need little optimisation while their data still fits in RAM, which on a modern system means, say, 32 GB.
Regarding your second question:
In MySQL (with InnoDB) the order of the records on disk is defined by your PRIMARY KEY, meaning a record does not get inserted at the end or the beginning, but rather wherever it belongs based on the primary key.
In other DBs you have the option to use a CLUSTERED index on a key other than the PRIMARY KEY to order the records on disk, but this is not supported in MySQL to my knowledge.
Regarding your first question:
I have found myself in this position a couple of times, and I keep coming back to one blog post (the last of a series; the conclusion is at the bottom):
http://weblogs.asp.net/manavi/archive/2011/01/03/inheritance-mapping-strategies-with-entity-framework-code-first-ctp5-part-3-table-per-concrete-type-tpc-and-choosing-strategy-guidelines.aspx
I quote:
Before we get into this discussion, I want to emphasize that there is no single "best strategy that fits all scenarios". As you saw, each of the approaches has its own advantages and drawbacks. Here are some rules of thumb to identify the best strategy in a particular scenario:
If you don't require polymorphic associations or queries, lean toward TPC (Table per Concrete Type) -- in other words, if you never or rarely query for BillingDetails and you have no class with an association to the BillingDetail base class. I recommend TPC only for the top level of your class hierarchy, where polymorphism isn't usually required, and when modification of the base class in the future is unlikely.
If you do require polymorphic associations or queries, and subclasses declare relatively few properties (particularly if the main difference between subclasses is in their behavior), lean toward TPH (Table per Hierarchy). Your goal is to minimize the number of nullable columns and to convince yourself (and your DBA) that a denormalized schema won't create problems in the long run.
If you do require polymorphic associations or queries, and subclasses declare many properties (subclasses differ mainly by the data they hold), lean toward TPT (Table per Type). Or, depending on the width and depth of your inheritance hierarchy and the possible cost of joins versus unions, use TPC.
By default, choose TPH only for simple problems. For more complex cases (or when you're overruled by a data modeler insisting on the importance of nullability constraints and normalization), you should consider the TPT strategy. But at that point, ask yourself whether it may not be better to remodel inheritance as delegation in the object model (delegation is a way of making composition as powerful for reuse as inheritance). Complex inheritance is often best avoided for all sorts of reasons unrelated to persistence or ORM. EF acts as a buffer between the domain and relational models, but that doesn't mean you can ignore persistence concerns when designing your classes.
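To make the three strategies concrete in plain SQL terms, here is a loose sketch using the BillingDetails example from the quote (all table and column names are illustrative, not taken from the post):

-- TPH (Table per Hierarchy): one table, a discriminator column, nullable subtype columns
CREATE TABLE billing_details (
    id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    billing_type VARCHAR(20) NOT NULL,       -- discriminator: 'CARD' or 'BANK'
    owner        VARCHAR(255) NOT NULL,
    card_number  VARCHAR(32),                -- NULL unless billing_type = 'CARD'
    bank_account VARCHAR(34)                 -- NULL unless billing_type = 'BANK'
);

-- TPT (Table per Type): a base table plus one table per subclass, joined on the shared key
CREATE TABLE billing_details_base (
    id    INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    owner VARCHAR(255) NOT NULL
);
CREATE TABLE credit_cards (
    id          INT UNSIGNED PRIMARY KEY,    -- same value as billing_details_base.id
    card_number VARCHAR(32) NOT NULL,
    FOREIGN KEY (id) REFERENCES billing_details_base (id)
);

-- TPC (Table per Concrete Type): no shared base table; each concrete class repeats the common columns
CREATE TABLE bank_accounts (
    id           INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    owner        VARCHAR(255) NOT NULL,      -- common column repeated per concrete table
    bank_account VARCHAR(34) NOT NULL
);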