I'm building a database and I'm not sure how deep of a hierarchy I should make.
It seems the best case for space saving would be three layers.
group->sub_group->item
In average scenarios there are 300 items to a subgroup and 100 subgroups to groups. Items are currently nearing 1 million and growth is accelerating.
I'm attached to distinguishing GROUP from ITEM because it reflects the real world, but SUB_GROUP only exists because ITEMs are usually identical for a few hundred rows. To be clear, I could take the data in one instance of sub group and attach it to every instance of item.
Would making at least 3 joins in every query be better for performance? Or am I better off making less tables with more repetitive data?
This isn't so much a MySQL question as a SQL RDBMS question in general.
You have a database normalized to eliminate duplication and minimize storage.
You may be right to normalize and put the subgroup info with the item if:
The subgroup data is small.
You often query both subgroup and item information together and view isn't sufficing.
You actually have a performance issue that would drive a change--don't prematurely optimize.
There are probably other considerations, but that shoud get you started.
There is not absolute here.
Your decision should not be based, in the first instance, on performance issues, number of JOINs required to retrieve the data, or any related issue. You should choose the design that correctly models your data. If you're using a relational database to store your data, that means following the principles of data normalization independent of perceived performance issues. Having done that, if you really cannot achieve the performance that you require you can consider limited, strategic denormalizations to get the performance that you need. In most cases, however, the normalized solution will also be the most performant.
That said, it's not entirely clear to me what "group", "sub-group", and "item" are modelling in the real world. However, principles of data normalization absolutely state that common values in multiple columns of multiple records in items should be abstracted out to a single record in sub-group. What's not clear is whether you can get by with only three tables, or whether there are additional levels of grouping that need to be normalized out.
Related
I want to create a table about "users" for each of the 50 states. Each state has about 2GB worth of data. Which option sounds better?
Create one table called "users" that will be 100GB large OR
Create 50 separate tables called "users_{state}", each which will be 2GB large
I'm looking at two things: performance, and style (best practices)
I'm also running RDS on AWS, and I have enough storage space. Any thoughts?
EDIT: From the looks of it, I will not need info from multiples states at the same time (i.e. won't need to frequently join tables if I go with Option 2). Here is a common use case: The front-end passes a state id to the back-end, and based on that id, I need to query data from the db regarding the specified state, and return data back to front-end.
Are the 50 states truly independent in your business logic? Meaning your queries would only need to run over one given state most of the time? If so, splitting by state is probably a good choice. In this case you would only need joining in relatively rarer queries like reporting queries and such.
EDIT: Based on your recent edit, this first option is the route I would recommend. You will get better performance from the table partitioning when no joining is required, and there are multiple other benefits to having the smaller partitioned tables like this.
If your queries would commonly require joining across a majority of the states, then you should definitely not partition like this. You'd be better off with one large table and just build the appropriate indices needed for performance. Most modern enterprise DB solutions are capable of handling the marginal performance impact going from 2GB to 100GB just fine (with proper indexing).
But if your queries on average would need to join results from only a handful of states (say no more than 5-10 or so), the optimal solution is a more complex gray area. You will likely be able to extract better performance from the partitioned tables with joining, but it may make the code and/or queries (and all coming maintenance) noticeably more complex.
Note that my answer assumes the more common access frequency breakdowns: high reads, moderate updates, low creates/deletes. Also, if performance on big data is your primary concern, you may want to check out NoSQL (for example, Amazon AWS DynamoDB), but this would be an invasive and fundamental departure from the relational system. But the NoSQL performance benefits can be absolutely dramatic.
Without knowing more of your model, it will be difficult for anyone to make judgement calls about performance, etc. However, from a data modelling point of view, when thinking about a normalized model I would expect to see a User table with a column (or columns, in the case of a compound key) which hold the foreign key to a State table. If a User could be associated with more than one state, I would expect another table (UserState) to be created instead, and this would hold the foreign keys to both User and State, with any other information about that relationship (for instance, start and end dates for time slicing, showing the timespan during which the User and the State were associated).
Rather than splitting the data into separate tables, if you find that you have performance issues you could use partitioning to split the User data by state while leaving it within a single table. I don't use MySQL, but a quick Google turned up plenty of reference information on how to implement partitioning within MySQL.
Until you try building and running this, I don't think you know whether you have a performance problem or not. If you do, following the above design you can apply partitioning after the fact and not need to change your front-end queries. Also, this solution won't be problematic if it turns out you do need information for multiple states at the same time, and won't cause you anywhere near as much grief if you need to look at User by some aspect other than State.
For a project, I was asked to look at an existing SQL database and to see if it could be improved. It was basically a customer database with a bunch of different types of data per customer. This is (basically) how it was organized:
Each customer had a row in the customer table with a customer ID. Then for each type of data, each customer had its own table. So, for instance, there would not be one central table for "jobs", with a customer ID in each row, but for each customer there would be a jobs table called "jobs1234" (1234 being a customer ID.
Now, my first response was confusion as to why you would organize it like that. I've always just learned that it's always better to normalize without really thinking beyond that point. But when I discussed it with people, a few pointed out it may have been for performance reasons. They said that if there were too many rows for "jobs", it would be better to have them split up per customer than to have them all in one table.
Something about indexing and the customer ID being the identifier. I'm confused as to why this approach would improve performance and haven't really gotten a very clear answer so far. Can anyone explain to me why that's the case and if it's even true that this approach is better in some cases?
I find this statement rather shocking:
They said that if there were too many rows for "jobs", it would be
better to have them split up per customer than to have them all in one
table.
Databases are designed to have tables that have lots and lots of rows -- millions of rows should be no problem. You don't specify what the volume of data is, but with a name like jobs, I'd be surprised if the total volume exceeds a few million rows in total. For this volume of data, a single table with suitable indexes should be fine.
There are cases where splitting data by customer would make sense. The strongest case is when it is an explicit requirement, typically for security reasons. In other words, the clients are promised that "their data is never mixed with anyone else's data". And, in most databases (MySQL included), it is easier to deal with security at the table level than at the row level.
Another possible reason would be when the tables have different formats, reflecting different data for each customer. In this case, you would really be dealing separate applications, and each customer should have their own database.
Are there any the downsides to splitting the customer data into multiple tables per customer? Yes. Here are some:
You cannot write generic queries/views to access the data. Basically, all queries in the code need to by dynamic, so you can put in the right table name.
Maintaining the data becomes cumbersome. Instad of updating a single table, you have to update multiple tables.
Answering questions such as "How many jobs does each customer have?" or "What is the growth in the number of jobs over time?" become so difficult to answer that people probably won't even bother asking them.
Performance is a mixed bag. Although you might save the overhead of storing the customer id in each table, you incur another cost. Having lots of smaller tables means lots of tables with partially filled pages. Depending on the number of jobs per customer and number of overall customers, you might actually be multiplying the amount of space used. In the worst case of one job per customer where a page contains -- say -- 100 jobs, you would be multiplying the required space by about 100.
The last point also applies to the page cache in memory. So, data in one table that would fit into memory might not fit into memory when split among many tables.
Partitioning is one way to implement something similar. However, this would work best when the query load is focused on one customer at a time. If all customers are accessing the data at the same time, then partitioning is going to be less of a win, and indexing should be sufficient.
Unless there is a really good reason for splitting the data into separate tables (a requirement, cumbersome security for each client, or custom formats for each client), you simply would not take that approach. Even when there are reasons for doing it, there are often other solutions (such as partitioning) that solve the same problem.
I will have a table with a few million entries and I have been wondering if it was smarter to create more than just this one table, even though they would all have the same structure? Would it save resources and would it be more efficient in the end?
This is my particular concern, because I plan creating a small search engine which indexes about 3.000.000 sites and each sites will have approximately 30 words that are being indexed. This is my structure right now
site
--id
--url
word
--id
--word
appearances
--site_id
--word_id
--score
Should I keep this structure? Or should I create tables for A words, B words, C words etc? Same with the appearances table
Select queries are faster on smaller tables. You want to fit the indexes you have to sort on into your systems memory for better performance.
More importantly, tables should not be defined in order to hold a certain type of data, but instead a collection of associated data. So if the data you are storing has logical differences they maybe should be broken into separate tables.
(Incomplete)
Pros:
Faster data access
Easier to copy or back up
Cons:
Cannot easily compare data from different tables.
Union and join queries are needed to compare across tables
If you aren't concerned with some latency on your database it should be able to handle this on the other of a few million records without too much trouble.
Here's some questions to ask yourself:
Are the records all inter-related? Is there any way of cleanly dividing them into different, non-overlapping groups? Are these groups well defined, or subject to change?
Is maintaining optimal write speed more of a concern than simplicity of access to data?
Is there any way of partitioning the records into different categories?
Is replication a concern? Redundancy?
Are you concerned about transaction safety?
Is it possible to re-structure the data later if you get the initial schema wrong?
There are a lot of ways of tackling this problem, but until you know the parameters you're working with, it's very hard to say.
Usually step one is to collect either a large corpus of genuine data, or at least simulate enough data that's reasonably similar to the genuine data to be structurally the same. Then you use your test data to try out different methods of storing and retrieving it.
Without any test data you're just stabbing in the dark
I hope the title is clear, please read further and I will explain what I mean.
We having a disagreement with our database designer about high level structure. We are designing a MySQL database and we have a trove of data that will become part of it. Conceptually, the data is complex - there are dozens of different types of entities (representing a variety of real-world entities, you could think of them as product developers, factories, products, inspections, certifications, etc.) each with associated characteristics and with relationships to each other.
I am not an experienced DB designer but everything I know tells me to start by thinking of each of these entities as a table (with associated fields representing characteristics and data populating them), to be connected as appropriate given the underlying relationships. Every example of DB design I have seen does this.
However, the data is currently in a totally different form. There are four tables, each representing a level of data. A top level table lists the 39 entity types and has a long alphanumeric string tying it to the other three tables, which represent all the entities (in one table), entity characteristics (in one table) and values of all the characteristics in the DB (in one table with tens of millions of records.) This works - we have a basic view in php which lets you navigate among the levels and view the data, etc. - but it's non-intuitive, to say the least. The reason given for having it this way is that it makes the size of the DB smaller, shortens query time and makes expansion easier. But it's not clear to me that the size of the DB means we should optimize this over, say, clarity of organization.
So the question is: is there ever a reason to structure a DB this way, and what is it? I find it difficult to get a handle on the underlying data - you can't, for example, run through a table in traditional rows-and-columns format - and it hides connections. But a more "traditional" structure with tables based on entities would result in many more tables, definitely more than 50 after normalization. Which approach seems better?
Many thanks.
OK, I will go ahead and answer my own question based on comments I got and more research they led me to. The immediate answer is yes, there can be a reason to structure a DB with very few tables and with all the data in one of them, it's an Entity-Attribute-Value database (EAV). These are characterized by:
A very unstructured approach, each fact or data point is just dumped into a big table with the characteristics necessary to understand it. This makes it easy to add more data, but it can be slow and/or difficult to get it out. An EAV is optimized for adding data and for organizational flexibility, and the payment is it's slower to access and harder to write queries, etc.
A "long and skinny" format, lots of rows, very few columns.
Because the data is "self encoded“ with its own characteristics, it is often used in situations when you know there will be lots of possible characteristics or data points but that most of them will be empty ("sparse data"). A table approach would have lots of empty cells, but an EAV doesn't really have cells, just data points.
In our particular case, we don't have sparse data. But we do have a situation where flexibility in adding data could be important. On the other hand, while I don't think that speed of access will be that important for us because this won't be a heavy-access site, I would worry about the ease of creating queries and forms. And most importantly I think this structure would be hard for us BD noobs to understand and control, so I am leaning towards the traditional model - sacrificing flexibility and maybe ease of adding new data in favor of clarity. Also, people seem to agree that large numbers of tables are OK as long as they are really called for by the data relationships. So, decision made.
i'm learning mysql and was working on a database for work. Everything's fine so far but I had a question. I am organizing financial statements for firms(balance sheet table, income statement table, cashflow table,etc.) and most companies have quarterly statements(they are unaudited) and annual statements(which are audited). Right now for each statement I have a column that flags it for annual or quarterly.
Its not likely that someone will be running a report on an audited and unaudited statement at the same time, so I was thinking if it was worth it to create a table for audited and one for unaudited. The reason I was thinking this was eventually the data will get fairly large and I thought the smaller the tables the faster performance.
So when I design a database should I be designing based on the content(i.e. group everything thats the same regardless) or should I be grouping based on how people will access it?
Another question this raises is should I be grouping financial statements by countries..since all analysis down at our firm in 90% within the same country
This is impossible to answer definitively without knowing the whole problem.
However, usually you want a single table to represent each logical entity in your system. From the sound of it, quarterly and annual statements represent the same logical entity, but differ by a single category column/field. The same holds true for the country question--if the only difference is the country (a categorization), then they likely should all be stored in the same table.
If you were to split your data into separate tables by category, your data would be scattered across multiple tables, and would be very hard to query. For example, if you wanted a count of all statements in the system, you would have to query ALL country tables and add the results together.
Edit: Joe Celko calls this anti-pattern "Attribute Splitting".
First of all I have to point out, I'm not a professional DB designer.
But if I ware you, in this case I would create one table as the entities are the same basically.
If you fear of mysql's performace on lager datasets, maybe it would be better to start building your app on Postgres. You can boost mysql's performace with stored functions/procedures or maybe views if you have to run complicated queries and of course you can use memcache or any nosql stuff to let the SQL rest a bit.
If you are sure in that users will search mainly only for this or that type of records, you can build three tables. One for all of the records, one-one for the audited and unaudited ones. You can keep them syncronized with the InnoDB's triggers (ON UPDATE/DELETE/INSERT). They could work like views, but I think (not tested) they would be faster then views. In this case you have to manage only the first "large" table. If you insert an audited record, the trigger fires and puts a record in to the audited table an so on...
Best wishes!
I agree with Phil and Damien - one table is better. What you want is one table per type of real business thing. If you design your tables to resemble real things, even abstract or conceptual things, then your data design is more likely to stand the test of time. Once you've sketched out a schema based on the real things you have data about, then you can go back and apply the rules of normalization to formalize your design.
As a rule, it is a bad idea to design for a performance problem you are worried about, but haven't actually seen. Your intuition about big tables being slower might actually be wrong. Most DBMS systems like bigger tables, at least to a point. When tables are big the query optimizers choose to use indexes. When tables are small they often end up getting full table scans, which can really slow down concurrent access. If your tables get so big that they are beyond the capabilities of your DBMS then it is time to consider either archiving out old data that you aren't using anymore or buying a more scalable DBMS.