There are two tables, users and orders:

users:
id | first_name | orders_amount_total
1  | Jone       | 5634200
2  | Mike       | 3982830
orders:
id | user_id | order_amount
1  | 1       | 200
2  | 1       | 150
3  | 2       | 70
4  | 1       | 320
5  | 2       | 20
6  | 2       | 10
7  | 2       | 85
8  | 1       | 25
The tables are linked by user id. The task is to show each user the sum of all his orders; there can be thousands of orders, maybe tens of thousands, while hundreds or thousands of users may be making the request at the same time. There are two options:
1. With each new order, in addition to writing to the orders table, increase the orders_amount_total counter, and then simply show it to the user.
2. Remove the orders_amount_total field, and to show the sum of all orders, JOIN the two tables and use the SUM operator to calculate the total for a particular user.
Which option is better to use? Why? Why is the other option bad?
P.S. I believe that the second option is concise and correct, given that the database is relational, but I have strong doubts about the load on the server, because the set of rows scanned when calculating the sum is large even for one user, and there are many users.
Option 2 is the correct one for the vast majority of cases.
Option 1 would cause data redundancy that may lead to inconsistencies. With option 2 you're on the safe side and always get the right values.
Yes, denormalizing tables can improve performance, but that's a last resort and great care needs to be taken. "Tens of thousands" of rows isn't a particularly large set for an RDBMS. They are built to handle millions and more pretty well. So you seem to be far away from that last resort and should go with option 2 and proper indexes.
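A minimal sketch of option 2 against the question's sample rows, using Python's sqlite3 as a stand-in for MySQL; the index on orders.user_id is the "proper indexes" part that keeps the per-user SUM cheap:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     user_id INTEGER REFERENCES users(id),
                     order_amount INTEGER);
-- The index on the foreign key is what keeps the per-user SUM cheap.
CREATE INDEX idx_orders_user ON orders(user_id);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Jone"), (2, "Mike")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 1, 200), (2, 1, 150), (3, 2, 70), (4, 1, 320),
    (5, 2, 20), (6, 2, 10), (7, 2, 85), (8, 1, 25),
])

# Option 2: aggregate on demand instead of maintaining a counter column.
rows = conn.execute("""
    SELECT u.id, u.first_name, COALESCE(SUM(o.order_amount), 0) AS total
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.id, u.first_name
    ORDER BY u.id
""").fetchall()
print(rows)  # [(1, 'Jone', 695), (2, 'Mike', 185)]
```

The LEFT JOIN also covers users with no orders yet, which the counter column would otherwise handle with its default value.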
I agree with #sticky_bit that Option 2. is better than 1. There's another possibility:
Create a VIEW that's a pre-defined invocation of the JOIN/SUM query. A smart DBMS should be able to infer that each time the orders table is updated, it also needs to adjust orders_amount_total for the user_id.
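As a sketch of that VIEW idea (SQLite syntax via Python's sqlite3; note that an ordinary, non-materialized view is simply re-run on each read, so most DBMSs recompute the SUM rather than maintain it incrementally):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, order_amount INTEGER);
-- The view packages the JOIN/SUM so callers never duplicate the aggregation.
CREATE VIEW user_order_totals AS
    SELECT u.id, u.first_name,
           COALESCE(SUM(o.order_amount), 0) AS orders_amount_total
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.id, u.first_name;
""")
conn.execute("INSERT INTO users VALUES (1, 'Jone')")
conn.execute("INSERT INTO orders VALUES (1, 1, 200)")
conn.execute("INSERT INTO orders VALUES (2, 1, 150)")

# Reads see the current total without any caller-side aggregation.
total = conn.execute(
    "SELECT orders_amount_total FROM user_order_totals WHERE id = 1"
).fetchone()[0]
print(total)  # 350
```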
BTW re your schema design: don't name columns id; don't use the same column name in two different tables except if they mean the same thing.
Related
This is a performance question. In a query that joins another table (the other table acting as a dictionary), where the joined data repeat because a foreign key value occurs in many records of the base table, will the database engine fetch the repeating data multiple times (I mean not the presented output, but actually accessing and searching the table again and again), or is it smart enough to cache the results and fetch everything just once? I am using MySQL.
I mean a situation like this:
SELECT *
FROM Tasks
JOIN People
ON Tasks.personID = People.ID;
Let's assume the People table consists of:
ID | Name
1 | John
2 | Mary
And Tasks:
ID | personID
1 | 1
2 | 1
3 | 2
Will the "John" data be physically extracted twice or once? Is it worth trying to avoid such queries?
John will show up twice in the result set.
However, if I interpret your question right, this is not about the resulting result set, but more about how the data is internally read to produce this result set.
In this case you have a join between two tables. In a join between two tables there's a "driving table" that's read first, and then the "secondary table" that is accessed once per each row of the driving table.
Now:
If MySQL chooses Tasks as the driving table, then the row John from the People will be accessed twice (because it will be in the secondary table).
If MySQL chooses People as the driving table, then naturally the row John will be accessed only once.
So, which option will MySQL pick? Get the execution plan and you'll find out. The table that shows up first in the plan is the driving table; the other is the secondary table. Mind that the execution plan may change in the future without notice.
Note: accessing doesn't mean to perform physical I/O on the disk. Once the row is read, it becomes "hot" and it's usually cached for some time; any repeated access will probably end up reading from the cache and won't cause more physical I/O.
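To see which table the optimizer picked as the driving table, inspect the plan. MySQL uses EXPLAIN; the sketch below uses SQLite's EXPLAIN QUERY PLAN (via Python's sqlite3) as a stand-in, read the same way: the first line names the driving table, the second names the table probed once per driving-table row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Tasks  (ID INTEGER PRIMARY KEY, personID INTEGER);
""")
# Ask the engine how it would execute the join, without running it.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM Tasks JOIN People ON Tasks.personID = People.ID
""").fetchall()
for step in plan:
    print(step[-1])  # e.g. "SCAN Tasks" then "SEARCH People USING ..."
```

The exact wording varies by engine and version, which is also why the plan may change in the future without notice.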
The answer to your question is that it repeats the data. The string values are not cached or reduced to one per distinct value.
In general, this isn't a problem because you would run queries that have small result sets by selecting a limited subset of data.
But if you don't limit the query, it would produce a large result set, potentially with strings repeated.
MySQL takes the Tasks table and, for every row, adds the row(s) from People that fit.
It has to gather every People row that belongs to a row of the Tasks table, so for a second row with the same personID it would grab the same data again.
This is usually not a problem, as you would put the columns in an INDEX, so it would find them quickly.
Take this table as an example :
CREATE TABLE UserServices (
    ID BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    Service1 TEXT,
    Service2 TEXT,
    ...
) ENGINE = MYISAM;
Every user will have a different number of services, so let's say the table starts with 10 service columns for each user. If one user has 11 services, must all other users have 11 columns too? Of course it is a table and every row needs the same number of columns, but it just seems like an awful waste of memory. Maybe another type of database would be better?
Thank you!!
Storing a boatload of NULLs isn't really a "waste of memory", because the space is negligible: hard disks cost pence per gigabyte, while programmers cost tens or hundreds of dollars per hour, so it's certainly economical to burn the space, and it's not really a great argument for avoidance.
There is a better argument though, as others have said: databases don't do variable numbers of columns for a particular ID in a table, but they DO do variable numbers of rows per ID. This is how DBs are designed: columns are fixed, rows are variable. Everything a database does and offers in terms of querying, storage, retrieval, internal design etc. is optimised towards this pattern.
There are well-established operations (called pivots) that will turn your vertical arrangement of data into horizontal (with NULLs) at query time, so you don't have to store the data horizontally.
Here's a pivot example:
Table:
ID, ServiceIdentifier, ServiceOwner
1, SV1, John
1, SV2, Sarah
2, SV1, Phil
2, SV2, John
2, SV3, Joe
3, SV2, Mark
SELECT
ID,
MAX(CASE WHEN ServiceIdentifier = 'SV1' THEN ServiceOwner END) as SV1_Owner,
MAX(CASE WHEN ServiceIdentifier = 'SV2' THEN ServiceOwner END) as SV2_Owner,
MAX(CASE WHEN ServiceIdentifier = 'SV3' THEN ServiceOwner END) as SV3_Owner
FROM
Table
GROUP BY
ID
Result:
ID | SV1_Owner | SV2_Owner | SV3_Owner
1  | John      | Sarah     |
2  | Phil      | John      | Joe
3  |           | Mark      |
As noted, it's not a huge cost to just store the data horizontally, and if you're sure the table will never change or need new columns added on a weekly basis to cope with new services etc., then it might be a sensible developer optimisation to just have columns full of NULLs. If you'll add columns regularly, or one day have thousands of services, then vertical storage is going to have to be the way you go.
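For what it's worth, the conditional-aggregation pivot above runs essentially unchanged on most engines. Here it is executed through Python's sqlite3, with the table renamed Services since TABLE is a reserved word:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Services" stands in for the answer's table, since TABLE is a reserved word.
conn.execute(
    "CREATE TABLE Services (ID INTEGER, ServiceIdentifier TEXT, ServiceOwner TEXT)"
)
conn.executemany("INSERT INTO Services VALUES (?, ?, ?)", [
    (1, "SV1", "John"), (1, "SV2", "Sarah"), (2, "SV1", "Phil"),
    (2, "SV2", "John"), (2, "SV3", "Joe"),  (3, "SV2", "Mark"),
])
# Conditional aggregation pivots rows into columns; absent services come back NULL.
rows = conn.execute("""
    SELECT ID,
           MAX(CASE WHEN ServiceIdentifier = 'SV1' THEN ServiceOwner END) AS SV1_Owner,
           MAX(CASE WHEN ServiceIdentifier = 'SV2' THEN ServiceOwner END) AS SV2_Owner,
           MAX(CASE WHEN ServiceIdentifier = 'SV3' THEN ServiceOwner END) AS SV3_Owner
    FROM Services
    GROUP BY ID
    ORDER BY ID
""").fetchall()
```

MAX here just picks the single non-NULL owner per group; with at most one row per (ID, ServiceIdentifier) pair, any aggregate that ignores NULLs would do.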
To expand a little on what's already been said:
Is there a way to add an attribute to only 1 row in SQL?
No, and that's kinda fundamental to how relational databases (SQL) work; that goes for any flavour of SQL, whether it's MySQL, T-SQL, etc. If you have a table and you want to add an attribute to that table, it's going to be another column, and that column will be there for every row. Not just relational databases: that's just how tables work.
But, that's not how anyone would do it. What you would do is what Alan suggested - a separate table for Services, then a 3rd table (he suggested naming it 'UserServices') that links the two. And that's not a one-off suggestion - that's pretty much "the" way to do it. There's no waste.
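A sketch of that three-table shape (the names Users, Services and UserServices follow the suggestion above; Python's sqlite3 used for illustration). Each user gets exactly as many UserServices rows as he has services, with no NULL padding:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users    (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Services (ID INTEGER PRIMARY KEY, Name TEXT);
-- One row per (user, service) pair: variable rows, fixed columns.
CREATE TABLE UserServices (UserID    INTEGER REFERENCES Users(ID),
                           ServiceID INTEGER REFERENCES Services(ID),
                           PRIMARY KEY (UserID, ServiceID));
""")
conn.executemany("INSERT INTO Users VALUES (?, ?)", [(1, "John"), (2, "Mary")])
conn.executemany("INSERT INTO Services VALUES (?, ?)",
                 [(1, "SV1"), (2, "SV2"), (3, "SV3")])
conn.executemany("INSERT INTO UserServices VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 2)])  # John: 3 services, Mary: 1

counts = conn.execute("""
    SELECT u.Name, COUNT(us.ServiceID)
    FROM Users u
    LEFT JOIN UserServices us ON us.UserID = u.ID
    GROUP BY u.ID, u.Name
    ORDER BY u.ID
""").fetchall()
```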
Maybe the use of another database type is better?
Possibly, if you want something with less restrictions, then you could go with something other than SQL. Since SQL is so dominant, everything is usually categorized as NOSQL. - Mongo is the most popular NOSQL database currently, which is why RC brought it up.
Let's say I want to create a table like this:
id | some_foreign_id | attribute | value
1  | 1               | Weight    | 100
2  | 1               | Reps      | 5
3  | 2               | Reps      | 40
4  | 3               | Time      | 10
5  | 4               | Weight    | 50
6  | 4               | Reps      | 60
Versus the same data represented this way
id | some_foreign_id | weight | reps | time
1  | 1               | 100    | 5    | NULL
2  | 2               | NULL   | 40   | NULL
3  | 3               | NULL   | NULL | 10
4  | 4               | 50     | 60   | NULL
And since in this case id = some_foreign_id, I think we can just append these columns to whatever table some_foreign_id refers to.
I would assume most people would overwhelmingly say the latter approach is the accepted practice.
Is the former approach considered a bad idea, even though it doesn't result in any NULLs? What are the tradeoffs between these two approaches exactly? It seems like the former might be more versatile, at the expense of not really having a clear defined structure, but I don't know if this would actually result in other ramifications. I can imagine a situation where you have tons of columns in the latter example, most of which are NULL, and maybe only like three distinct values filled in.
EAV is the model your first example uses. It has a few advantages; however, you are on MySQL, and MySQL doesn't handle it the best. As pointed out in the thread Crosstab View in mySQL?, MySQL lacks functions that other databases have. Postgres and other databases have some more convenient functions (see PostgreSQL Crosstab Query) that make this significantly easier. In the MSSQL world this gets referred to as sparsely populated columns. I find columnar structures (Vertica, or high-end Oracle) actually lend themselves quite well to this.
Advantages:
Adding a new column is significantly easier than altering a table schema. If you are unsure what future column names will be, this is the way to go.
It avoids sparsely populated columns, which leave tables full of NULLs and redundant data. You can also set up logic to supply a 'default' value for a column, i.e. if no value is specified for this attribute, then use this value.
Downsides:
A bit harder to program with, in MySQL in particular, as per the comments above. Not all SQL devs are familiar with the model, and you might accidentally introduce a steeper learning curve for new team members.
Not the most scalable. Indexing is a challenge and you need workarounds (Strawberry's input in the comments points at this: your value column is basically forced to VARCHAR, and that neither indexes nor searches well; welcome to table-scan hell). You can get around this with a third table (say you query on dates like create date and close date a lot: create a third 'control' table that contains those frequently queried columns, index that, and refer to the EAV tables from there), or by creating multiple EAV tables, one per data type.
The first one is the right one.
If you later want to change the number of properties, you don't have to change your DB structure.
Changing the DB structure can cause your app to break.
If the number of NULLs is too big, you are wasting a lot of storage.
My take on this
The first one I would probably use if I had a lot of different attributes and values I would like to add in a more dynamic way, like user tags or user-specific information, etc.
The second one I would probably use if I had just the three attributes from your example (weight, reps, time) and no need for anything dynamic or for adding more attributes (and if that changed, I would just add another column).
I would say both work; it is as you yourself say, "the former might be more versatile". Both ways need their own structure around them to extract, process and store data :)
Edit: for the first one to achieve the structure of the second one, you would have to add a join for each attribute you want to include in the extract.
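That "one join per attribute" reconstruction can be sketched like this (Python's sqlite3, using the question's sample rows; each LEFT JOIN pulls one attribute back out as a column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE eav (id INTEGER, some_foreign_id INTEGER,
                                  attribute TEXT, value INTEGER)""")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?, ?)", [
    (1, 1, "Weight", 100), (2, 1, "Reps", 5), (3, 2, "Reps", 40),
    (4, 3, "Time", 10), (5, 4, "Weight", 50), (6, 4, "Reps", 60),
])
# One LEFT JOIN per wanted attribute flattens the EAV rows back into columns.
rows = conn.execute("""
    SELECT f.some_foreign_id, w.value AS weight, r.value AS reps, t.value AS time
    FROM (SELECT DISTINCT some_foreign_id FROM eav) f
    LEFT JOIN eav w ON w.some_foreign_id = f.some_foreign_id AND w.attribute = 'Weight'
    LEFT JOIN eav r ON r.some_foreign_id = f.some_foreign_id AND r.attribute = 'Reps'
    LEFT JOIN eav t ON t.some_foreign_id = f.some_foreign_id AND t.attribute = 'Time'
    ORDER BY f.some_foreign_id
""").fetchall()
```

This reproduces the second table from the question, NULLs and all, which is exactly the trade-off being discussed: the wide shape appears at query time instead of in storage.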
I think the first way contributes better towards normalization. You could even create a new table with attributes:
id | attribute
1  | reps
2  | weight
3  | time
And turn the second-to-last column (attribute) into a foreign id. This will save space and save you the risk of mistyping the attribute names. Like this:
id | some_foreign_id | attribute | value
1  | 1               | 2         | 100
2  | 1               | 1         | 5
3  | 2               | 1         | 40
4  | 3               | 3         | 10
5  | 4               | 2         | 50
6  | 4               | 1         | 60
As others have stated, the first way is the better way. Why? Well, it normalizes the structure. Reference: https://en.wikipedia.org/wiki/Database_normalization
As that article states, normalization reduces database size & allows for easy expansion.
I guess that title isn't very descriptive, so I will explain! I have a table called users_favs where all the info is stored about which posts a user has liked, which posts he has favourited, and the same for comments. The info there is stored as a serialized array / or JSON, who cares.
Question: What is better? Stay like this, or make 4 tables, one per field, and store not a serialized version but rows like user_id => post_id?
What I think about the second option is that after some time these tables will be GIGANTIC. Also, I will need to make 4 queries (or use JOINs) to get all of the info from these tables.
Keeping it in 1 table means that you'll only need 1 table access and 0 joins to get all the data, while storing it in 4 tables, you'll need at least 1 table access and n-1 joins when you need n fields of information. Your result set at the end of the query will probably be the same either way, so the amount of data sent over the network is independent of your table structure.
I presume a scenario where you have data for fav_categories and the other columns are null, and similarly for fav_posts, liked_posts, liked_comments. So there is a high probability that in each row only three columns will have data most of the time (id, user_id, any one of the rest). If my assumptions are right, and the use cases as well, then I would definitely go for four tables.
To add to the above: you can always choose whether you want to make it read-friendly or write-friendly.
In my case I want to maintain a table storing some kind of data, and after some period remove rows from that first table and store them in another table.
I want to clarify what the best practice is in this kind of scenario.
I am using a MySQL database in a Java-based application.
Generally, I follow this procedure. In case I want to delete a row, I have a TINYINT column called deleted, and I mark this column for that row as true.
That indicates that the row has been marked as deleted, so I don't pick it up.
Later (maybe once a day), I run a script which in a single shot either deletes those rows entirely or migrates them to another table, etc.
This is useful because every time you delete a row (even if it's 1 row), MySQL has to update its indexes. This might require significant system resources depending on your data size or number of indexes. You might not want to incur that overhead every time.
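The flag-then-sweep procedure described above, sketched with Python's sqlite3 (the table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE live_data    (id INTEGER PRIMARY KEY, payload TEXT,
                           deleted TINYINT NOT NULL DEFAULT 0);
CREATE TABLE archive_data (id INTEGER PRIMARY KEY, payload TEXT);
""")
conn.executemany("INSERT INTO live_data (id, payload) VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

# Step 1: "deleting" a row is just flagging it; readers filter on deleted = 0.
conn.execute("UPDATE live_data SET deleted = 1 WHERE id = 2")

# Step 2 (the nightly script): move flagged rows in one shot, then purge them.
conn.execute("INSERT INTO archive_data SELECT id, payload FROM live_data WHERE deleted = 1")
conn.execute("DELETE FROM live_data WHERE deleted = 1")

live = conn.execute("SELECT id FROM live_data ORDER BY id").fetchall()
archived = conn.execute("SELECT id FROM archive_data").fetchall()
```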
You did not provide enough information, but I think if both tables have the same data structure then you can avoid using two tables. Just add another column to the first table and set a status/type for the records that would have gone to the second table.
For Example:
id | Name | BirthDate | type
------------------------------------
1 | ABC | 01-10-2001 | firsttable
2 | XYZ | 01-01-2000 | secondtable
You can pick records like this:
select * from tablename where type='firsttable'
OR
select * from tablename where type='secondtable'
If you are archiving old data, there should be a way to set up a scheduled job in MySQL. I know there is in SQL Server, and it's the kind of function most databases provide, so I imagine it can be done in MySQL. Schedule the job to run in the low-usage hours. Have it select all records more than a year old (or however much data you want to keep active), move them to an archive table, and then delete them. Depending on the number of records you would be moving, it might be best to do this once a week or daily. You don't want the number of expiring records to be so large that it affects performance greatly or makes the job take too long.
In archiving, the critical piece is to make sure you keep all the records that will be needed frequently, and don't forget to consider reporting (many reports need a year's or two years' worth of data; do not archive records those reports will need). Then you also need to set up a way for users to access the archived records on the rare occasions they may need to see them.
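The core move-then-delete step of such an archive job can be sketched like this (Python's sqlite3, hypothetical table names; the scheduler itself, e.g. a MySQL EVENT or cron, is outside the snippet):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders_live    (id INTEGER PRIMARY KEY, created DATE, amount INTEGER);
CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created DATE, amount INTEGER);
""")
conn.executemany("INSERT INTO orders_live VALUES (?, ?, ?)", [
    (1, "2015-03-01", 100), (2, "2017-06-01", 250), (3, "2017-07-15", 75),
])

cutoff = "2016-01-01"  # "more than a year old" relative to a hypothetical run date
# Move then delete inside one transaction, so a crash can't lose rows in between.
with conn:
    conn.execute(
        "INSERT INTO orders_archive SELECT * FROM orders_live WHERE created < ?",
        (cutoff,))
    conn.execute("DELETE FROM orders_live WHERE created < ?", (cutoff,))

live = [r[0] for r in conn.execute("SELECT id FROM orders_live ORDER BY id")]
archived = [r[0] for r in conn.execute("SELECT id FROM orders_archive")]
```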