I have a parent model Post and a child model Comment. Posts have a privacy setting - a privacy column in the DB. Any time I have to deal with the child model Comment, I have to check the privacy setting of the parent model: $comment->post->privacy.
My app is getting bigger and bigger, and this approach requires more and more SQL queries. Eager loading helps, but sometimes there is no reason to touch the parent model other than checking the privacy field.
My question is: is it good practice to duplicate the privacy column into the Comments table and keep the two in sync? It would allow me to simply use $comment->privacy without touching the Posts table.
Planned redundancy (denormalization of the model) for a specific purpose can be good.
You specifically mention keeping the privacy column on the child table "in sync" with the privacy column in the parent table. That implies you have control of the redundancy. That's acceptable practice, especially for improved performance.
If it doesn't improve performance, then there wouldn't really be a need.
Uncontrolled redundancy can be bad.
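If you do go down that road, one way to keep the redundancy controlled is a pair of triggers. A minimal sketch, assuming MySQL and hypothetical posts/comments tables with a duplicated privacy column on comments:

    DELIMITER //

    -- Copy the parent's privacy when a comment is created
    CREATE TRIGGER trg_comment_privacy_insert
    BEFORE INSERT ON comments
    FOR EACH ROW
    BEGIN
        SET NEW.privacy = (SELECT privacy FROM posts WHERE id = NEW.post_id);
    END//

    -- Push changes down when a post's privacy changes
    CREATE TRIGGER trg_post_privacy_update
    AFTER UPDATE ON posts
    FOR EACH ROW
    BEGIN
        IF NEW.privacy <> OLD.privacy THEN
            UPDATE comments SET privacy = NEW.privacy WHERE post_id = NEW.id;
        END IF;
    END//

    DELIMITER ;

With the triggers in place, application code can read $comment->privacy without ever being able to drift out of sync with the parent.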
Assuming the privacy property has to live on the parent (if Post is not used directly on its own, you could always move the privacy property to all the children):
First, you should try to enhance performance using optimization techniques (indexes, materialized views, etc.; see the index sketch below).
Second, if that doesn't help much with performance (a very rare case), you can start thinking about duplicating the information. But that should be your last option, and you need to take every possible measure to preserve data consistency (using constraints, triggers, or whatever fits).
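To illustrate the first step: a covering index on the parent can often make the privacy lookup cheap enough that no duplication is needed. A sketch, assuming the posts/comments schema from the question:

    CREATE INDEX idx_posts_id_privacy ON posts (id, privacy);

    -- The privacy check can then be answered from the index alone,
    -- without touching the posts rows themselves:
    SELECT c.*, p.privacy
    FROM comments c
    JOIN posts p ON p.id = c.post_id
    WHERE c.post_id = 123;

Whether the optimizer actually treats it as a covering index depends on the engine, so measure before and after.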
Duplicating columns is bad in terms of space. Suppose you have a huge amount of data in the Posts model: if you duplicate it, the same amount of space is consumed again, just to save time.
Basically, you always have to think about the trade-off between space and time optimization.
Try to optimize time first with algorithmic approaches such as hash tables, indexing, or binary search trees. If you find things are still too slow beyond a certain amount of data, then think about duplicating data. But remember: performance may increase, but more space will be used for it.
Need input on data model design
I have parent_table as:

- id (PK)
- current_version
- latest_child_id

and child_table as:

- id (PK)
- parent_table_id (FK to parent)
- version (running number; the largest number implies the latest child record)
The relationship between parent_table and child_table is 1:m.
The parent_table additionally keeps a pointer to the latest version of the record in the child table.
The system inserts n mutable rows into child_table and updates parent_table to point to the latest version, for faster reads.
My questions are:

1. Is it bad practice to have the parent_table store the latest version of the child table?
2. Am I looking at potential performance problems / locking issues, since any insert into the child table needs a lock on the parent table as well?
Database in question: MySQL
Is it bad practice to have the parent_table store the latest version of the child table?
Phrases like "bad practice" are loaded with context. I much prefer to consider the trade-offs, and understand the decision at that level.
By storing an attribute which you could otherwise calculate, you're undertaking denormalization. This is an established way of dealing with performance challenges - but it's only one of several. The trade-offs are roughly as follows.
Negative: takes more storage space. I'll assume this doesn't matter in your case.
Negative: requires more code. More code means more opportunity for bugs. Consider wrapping the data access code in a test suite.
Negative: denormalized schemas can require additional "brain space" - you have to remember that you calculate (for instance) the number of children a parent has, but find the latest one by looking at the attribute in the parent table. In an ideal world, a normalized schema describes the business context without having to remember implementation details.
Negative: may make your data model harder to extend in future. As you add more entities and attributes, this denormalized table may become harder and harder to keep in sync. One denormalized column is usually easy to work with, but if you have lots of denormalized columns, keeping them all up to date may be very difficult.
Negative: for data that is not accessed often, the denormalized design may be a bigger performance hit than calculating on the fly. Your question 2 is an example of this. In complex scenarios, it's possible that multiple threads create inconsistencies in the denormalized data.
Positive: with data that is read often, and where the calculation is expensive, a denormalized schema will allow faster read access.
In your case, I doubt you need to store this data as a denormalized attribute. By creating an index on parent_table_id, version DESC, retrieving this data on the fly will be too fast to measure (assuming your database holds 10s of millions of records, not 10s of billions).
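A sketch of that index and lookup (the question specifies MySQL; 8.0+ honours the DESC in the index definition, and older versions can still scan the index backwards):

    CREATE INDEX idx_latest_child
        ON child_table (parent_table_id, version DESC);

    -- Fetch the latest child of parent 42 straight off the index:
    SELECT *
    FROM child_table
    WHERE parent_table_id = 42
    ORDER BY version DESC
    LIMIT 1;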
In general, I recommend only denormalizing if:
You can prove you have a performance problem (i.e. you have measured it)
You cannot improve performance by creating better indexes
You cannot improve performance through better hardware
Am I looking at potential performance problems / locking issues, since any insert into the child table needs a lock on the parent table as well?
As @TheImpaler writes, probably not. However, it depends on the complexity of your insert logic (does it do any complicated calculations which might slow things down?), and the likelihood of several concurrent threads trying to update the parent record. You may also end up with inconsistent data in these scenarios.
ORDER BY child_id DESC LIMIT 1

is a very efficient way to get the "latest" child (assuming you have INDEX(child_id)).
This eliminates the need for the naughty "redundant" info you are proposing.
Is it bad practice to have the parent_table store the latest version of the child table?
No, that's perfectly OK, if it fits the requirements of your application. You need to add the extra logic to update the tables correctly, but that's it. Databases offer you a range of possibilities to store your data and relationships, and this is a perfectly good one.
Am I looking at potential performance problems / locking issues, since any insert into the child table needs a lock on the parent table as well?
It depends on how often you update/insert/delete children. Most likely it won't be a problem unless the rate of changes goes above roughly 200 per second, considering current database servers. Exclusive locking can become a problem at high transaction volumes.
Normally the locks will be at the row level. That is, they will lock only the row you are working with, so multiple threads working on different parents will not create a bottleneck.
If your system really requires a high transaction rate (1000+/second), then the options I see are:
Throw more hardware at it: the easiest way. Just buy a bigger machine and the problem is solved... at least for a while, until your system grows again.
Use optimistic locking: this strategy doesn't require you to take any actual lock at all. However, you'll need to add an extra numeric column to store the version number of each row (see the sketch after this list).
Switch to another database: MySQL may not handle really high volumes perfectly well. If that's the case, you can consider PostgreSQL, or even Oracle Database, which arguably has better caching technology but is also very expensive.
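A minimal sketch of the optimistic-locking option, with a hypothetical row_version column added to parent_table:

    -- Read the row and remember its version number:
    SELECT latest_child_id, row_version
    FROM parent_table
    WHERE id = 42;

    -- Suppose that returned row_version = 7. The update succeeds only
    -- if no other transaction changed the row in the meantime:
    UPDATE parent_table
    SET latest_child_id = 1001,
        row_version = row_version + 1
    WHERE id = 42
      AND row_version = 7;

    -- If zero rows were affected, another transaction won the race:
    -- re-read the row and retry.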
So basically I am in the process of creating a personal finance tracking system. It occurred to me that keeping tabs on when each instance and transaction was last edited or updated might be relevant information some day.
Now as far as I can see there are two approaches to implement something like this:
Create "updated" fields to all the tables I want to keep track of and then let mysql update those fields for me (ON UPDATE clause)
Create a completely seperate table for holding the log data and then update that with a triggers and transactions
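To sketch what I mean by the first approach (illustrative names; MySQL fills in updated_at by itself):

    CREATE TABLE transactions (
        id         INT PRIMARY KEY AUTO_INCREMENT,
        amount     DECIMAL(10,2) NOT NULL,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
        updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                   ON UPDATE CURRENT_TIMESTAMP
    );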
Now it seems the first approach has the benefit of keeping things simple and easy to maintain. However, how will this impact performance if I suddenly decide to pull every log entry in the database for review? It also somewhat goes against normalization (not by much, though), with the same data stored in multiple tables.
The second approach allows more flexibility in the logging system and might actually shorten the SQL queries necessary to retrieve certain data. However, it makes the schema more complex, as two additional tables would have to be created (the actual log table and a many-to-many relation table for holding the keys) and maintained. On the other hand, if I ever want to implement an activity history, this approach is probably the only one capable of doing it.
As such, I would like to know some more pros and cons of each method. Since the second option allows more flexibility, I am considering implementing it, but I am not sure about performance issues. In the end it comes down to two questions:
1. Are there any real-life examples where both approaches are implemented?
2. Are there any studies, comparisons, or other resources that might shed some light on which is considered the more performance-friendly and "best practices" approach?
It depends on what kind of reporting you need and your current architecture.
If you just want to know the last update date, then having two fields (creation date and last update) should be enough. A separate table won't give any performance boost there, but it will make your code harder to maintain.
It's another story if you want something more elaborate, like reporting differences (what was changed) and/or keeping a full change log for each transaction (there might be a few updates to one transaction, right?). In that case you really must have a separate table, because otherwise it will bloat your main table and reduce performance.
Based on my experience, I'd go with the separate table. It will be easier to maintain - your logging logic will be practically separated from everything else - and I think one day you'll need that additional info on your transactions and a full transaction history.
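To sketch the separate-table route (all names illustrative): a log table plus a trigger that records every change to a transaction:

    CREATE TABLE transaction_log (
        id             INT PRIMARY KEY AUTO_INCREMENT,
        transaction_id INT NOT NULL,
        old_amount     DECIMAL(10,2),
        new_amount     DECIMAL(10,2),
        changed_at     TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Record the before/after values on every update:
    CREATE TRIGGER trg_transactions_audit
    AFTER UPDATE ON transactions
    FOR EACH ROW
    INSERT INTO transaction_log (transaction_id, old_amount, new_amount)
    VALUES (OLD.id, OLD.amount, NEW.amount);

This keeps the logging logic entirely inside the database, so no application code has to remember to write the log.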
As far as performance goes, you won't notice any formidable difference unless your system is under serious load. And since your system is personal, either choice would suffice; just don't forget about proper indexing.
Note that I'm making a lot of assumptions here, so if you want something more specific, please provide your actual architecture and reporting needs. I'd suggest some books on high availability/performance, but they address general availability/performance rather than your specific needs.
I'm trying to add musical style columns to an event table. From what I've gathered, this can be done either by adding a column for each musical style or through a many-to-many table relation, which I don't want because I want each event returned only once in a table. Do you think that with that many boolean columns in a row I would face considerable database slowdown? (Data will only be read by the users.)
Thank you :)
The columns will not slow down the database per se, but keep in mind that adding a boolean column for every music style is very poor design. Over time, the possible musical styles in your application are likely to change: maybe new ones must be added, redundant or useless ones must be removed, whatever. With your proposed design, you'd have to modify your database structure to add the new columns to the table. This is usually painful, error-prone, and you'll also have to go over all your queries to make sure they don't break because of the new structure.
You should design your database schema so that it's flexible enough to allow for variation over time in the contents of your application. For example, you could have a master table with one row for every musical style, defining its ID, name, description, etc. Then, a relationship table contains the link between an entity (an event, if I understood your question correctly) and a music style from the master table. You enforce consistency by putting foreign keys in place, to ensure the data is always clean (e.g. you cannot reference a music style that is not in the master table). This way, you can modify the music styles without touching the database structure.
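A sketch of that structure (all names are illustrative, and an event table with an id primary key is assumed to exist):

    -- Master table: one row per musical style
    CREATE TABLE musical_style (
        id          INT PRIMARY KEY AUTO_INCREMENT,
        name        VARCHAR(100) NOT NULL,
        description TEXT
    );

    -- Relationship table: which event has which style
    CREATE TABLE event_style (
        event_id INT NOT NULL,
        style_id INT NOT NULL,
        PRIMARY KEY (event_id, style_id),
        FOREIGN KEY (event_id) REFERENCES event(id),
        FOREIGN KEY (style_id) REFERENCES musical_style(id)
    );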
Reading a bit on database normalization will help you a lot; you don't have to go all the way to a fully normalized database, but understanding the principles behind it will allow you to design efficient and clean database structures.
The likely answer is no
Having multiple boolean columns in a row should not significantly slow down DB performance, assuming your indices are set up appropriately.
EDIT: That being said, it may be better to move this data into a details table and JOIN to it... but you said you didn't want to do that.
I assume you want to do something like have an event row with a bunch of columns like "isCountry", "isMetal", "isPunk" and you'll query all events marked as
isPunk = 1 OR isMetal = 1
or something like that.
The weakness of this design is that to add/remove musical styles you need to change your DB schema.
An alternative is a TBLMusicalStyles table with an ID and a Name, plus a TBLEventStyles table which just contains EventID and StyleID.
Then you could join them and just search on the styles table... and adding and removing styles would be relatively simple.
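And because the original concern was getting each event back only once, an EXISTS query avoids the duplicate rows a plain join would produce. A sketch (TBLEvents and its ID column are assumed; the other names are from above):

    SELECT e.*
    FROM TBLEvents e
    WHERE EXISTS (
        SELECT 1
        FROM TBLEventStyles es
        JOIN TBLMusicalStyles ms ON ms.ID = es.StyleID
        WHERE es.EventID = e.ID
          AND ms.Name IN ('Punk', 'Metal')
    );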
The performance of requests that do not involve musical styles will not be affected.
If your columns are properly indexed, then requests that involve finding rows that match client-provided musical styles should actually be faster.
However, all other requests that involve musical styles will be significantly slower, and also harder to write. For instance, "get all rows that share at least one style with the current row" would be a much harder request to write and execute.
What are the pros and cons of CreatedDate and ModifiedDate columns? When should we have them and when shouldn't we?
UPDATE
What is this comment in an UPDATE stored procedure auto-generated with RepositoryFactory? Does it have anything to do with the above columns not being present?
--The [dbo].[TableName] table doesn't have a timestamp column. Optimistic concurrency logic cannot be generated
If you don't need historical information about your data, adding these columns will fill space unnecessarily and cause fewer records to fit on a page.
If you do (or might) need historical information, then this might not be enough for your needs anyway. You might want to consider a different scheme, such as ValidFrom and ValidTo columns, where you never modify or delete the data in any row; you just mark it as no longer valid and create a new row.
See Wikipedia for more information on different schemes for keeping historic information about your data. The method you proposed is similar to Type 3 on that page and suffers from the same drawback that only information about the last change is recorded. I suggest you read some of the other methods too.
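A minimal sketch of the ValidFrom/ValidTo idea (all names illustrative):

    CREATE TABLE customer_history (
        customer_id INT NOT NULL,
        name        VARCHAR(100) NOT NULL,
        valid_from  DATETIME NOT NULL,
        valid_to    DATETIME NULL,   -- NULL marks the current row
        PRIMARY KEY (customer_id, valid_from)
    );

    -- An "update" closes the current row and inserts a new one:
    UPDATE customer_history
    SET valid_to = NOW()
    WHERE customer_id = 42 AND valid_to IS NULL;

    INSERT INTO customer_history (customer_id, name, valid_from)
    VALUES (42, 'New Name', NOW());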
All I can say is that this data (or full-blown audit tables) has helped me find what or who caused a major data problem. It only takes one such occasion to convince you that it is worth the extra time to keep these fields up to date.
I don't usually add them to tables that are only populated through a single automated process, where no one else has write permissions to the table. And they usually aren't needed for lookup tables, which users generally can't update either.
There are pretty much no cons to having them, so if there is any chance you will need them, then add them.
People may mention performance or storage concerns, but:

- in reality they will have little to no effect on SELECT performance with modern hardware and properly specified SELECT clauses
- there can be a minor impact on write performance, but this will likely only be a concern in OLTP-type systems, and that is exactly the case where you usually want these kinds of columns
- if you are at the point where adding columns like this is a dealbreaker in terms of performance, then you are probably looking at moving away from SQL databases as a storage platform
With CreatedDate, I almost always set it up with a default value of GETDATE(), so I never have to think about it. When building out my schema, I will add both of these columns unless it is a lookup table with no GUI for administering it, because then I know it is unlikely the data will be kept up to date if modified manually.
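A sketch of that setup (SQL Server syntax, since the answer mentions GETDATE(); table and column names are illustrative):

    CREATE TABLE Widget (
        Id           INT IDENTITY PRIMARY KEY,
        Name         NVARCHAR(100) NOT NULL,
        CreatedDate  DATETIME NOT NULL DEFAULT GETDATE(),
        ModifiedDate DATETIME NULL  -- set by application code or a trigger
    );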
Some DBMSs provide other means to capture this information automatically, for example Oracle Flashback or Microsoft Change Tracking / Change Data Capture. Those methods also capture more detail than just the latest modification date.
The column type timestamp is misleading: it has nothing to do with time; it is a rowversion. It is widely used for optimistic concurrency.
I'm storing columns in a database where users are able to add and remove columns - "fake" columns. How do I implement this efficiently?
The best way would be to implement the data structure vertically, instead of the usual horizontal layout.
This can be done using something like a TableAttribute table:

- AttributeID
- AttributeType
- AttributeValue
This vertical design is mostly seen in applications where users can create their own custom forms and fields (if I recall correctly, the DevExpress form layout control allows you to create custom layouts), and it is common in CRM applications. It is easily modified in production and easily maintainable, but it can greatly decrease SQL performance once the data set becomes very large.
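A minimal sketch of such a vertical (entity-attribute-value) table, with illustrative names:

    CREATE TABLE entity_attribute (
        entity_id       INT NOT NULL,
        attribute_name  VARCHAR(100) NOT NULL,
        attribute_type  VARCHAR(20)  NOT NULL,  -- e.g. 'string', 'int', 'date'
        attribute_value VARCHAR(255),
        PRIMARY KEY (entity_id, attribute_name)
    );

    -- Reading one "fake column" for one record:
    SELECT attribute_value
    FROM entity_attribute
    WHERE entity_id = 42 AND attribute_name = 'favourite_colour';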
EDIT:
This will depend on how far you wish to take it. You can set it up per form/table: add attributes that describe the actual control (lookup, combo, datetime, etc.), the position of the controls, and the allowed values (min/max/allow null).
This can become a very cumbersome task, but will greatly depend on your actual needs.
I'd think you could allow that at the user-permission level (grant the ALTER privilege on the appropriate tables) and then restrict what types of data can be added/deleted using your presentation layer.
But why add columns? Why not a linked table?
Allowing users to define columns is generally a poor choice, as they don't know what they are doing or how to relate the data properly to the rest of the data. Some people take the EAV approach to this and let users add as many columns as they want, but this quickly gets out of control and causes performance issues and difficulty in querying the data.
Others take the approach of a table with user-defined columns, giving users a set number of columns they can define. This performs better but is more limiting in terms of how many new columns they can define.
In any event, you should severely restrict who can define the additional columns, limiting it to system admins (who can be at the client level). It is a better idea to actually talk to users in the design phase and see what they need. You will find that you can properly design a system that has 90+% of what the customer needs if you actually talk to them (and not just to managers, but to users at all levels of the organization).
I know it is common in today's world to slough off our responsibility to design by saying we are making things flexible, but I've had to use and provide DBA support for many of these systems, and the more flexible they try to make the design, the harder the system is for the users to use and the more they hate it.