I have a MySQL DB containing entries for the pages of a website.
Let's say it has fields like:
Table pages:
id | title | content | date | author
Each of the pages can be voted on by users, so I have two other tables:
Table users:
id | name | etc etc etc
Table votes:
id | id_user | id_page | vote
Now, I have a page where I show a list of the pages (10-50 at a time) with various information along with the average vote of the page.
So, I was wondering if it were better to:
a) Run the query to display the pages (note that this is already fairly heavy, as it queries three tables) and then, for each entry, run another query to calculate the mean vote (or add a fourth join to the main query?).
or
b) Add an "average vote" column to the pages table, which I will update (along with the vote table) when an user votes the page.
Use the database for what it's meant for; option a is by far your best bet. It's worth noting that your query isn't actually particularly heavy, joining three tables; SQL really excels at this sort of thing.
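For illustration, here's a minimal sketch of what option a could look like as a single query, using the table and column names from the question (the ordering and LIMIT are placeholders):

SELECT p.id, p.title, p.author, p.date,
       AVG(v.vote)  AS average_vote,
       COUNT(v.id)  AS vote_count
FROM pages p
LEFT JOIN votes v ON v.id_page = p.id
GROUP BY p.id, p.title, p.author, p.date
ORDER BY p.date DESC
LIMIT 10;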
Be cautious of this sort of attempt at premature optimization of SQL; SQL is far more efficient at what it does than most people think it is.
Note that another benefit of your option a is that there's less code to maintain and less chance of the data diverging as the code gets updated; it's a lifecycle benefit, and those are too often ignored for minuscule optimization gains.
You might "repeat yourself" (violate DRY) for the sake of performance. The trade-offs are (a) extra storage, and (b) extra work in keeping everything self-consistent within your DB.
There are advantages/disadvantages both ways. Optimizing too early has its own set of pitfalls, though.
Honestly, for this issue, I would recommend the redundant information. Many votes across many pages can put a real load on the server, in my opinion. If you foresee real traffic on your website, of course... :-)
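If you do go the redundant route (option b), the usual safeguard is to let the database keep the copy in sync itself. A hedged sketch, assuming vote_count and vote_sum columns have been added to pages (the column and trigger names here are made up):

-- Hypothetical: keep running totals on pages; the average is vote_sum / vote_count
CREATE TRIGGER votes_after_insert
AFTER INSERT ON votes
FOR EACH ROW
  UPDATE pages
     SET vote_sum   = vote_sum + NEW.vote,
         vote_count = vote_count + 1
   WHERE id = NEW.id_page;
-- Matching AFTER UPDATE / AFTER DELETE triggers would be needed to stay consistent.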
Related
I'm familiar with normalized databases and I'm able to produce all kinds of queries. But since I'm starting a green-field project now, one question has kept me busy this week:
It's the typical "webshop problem", I'd say (even if I'm not building a webshop): how do you model the product information?
There are some approaches, each with its own advantages or disadvantages:
One Table to rule them all
Putting every "product" into a single table, generating every column possible and working with this monster-table.
Pro:
Easy queries
Easy layout
Con:
Lots of NULL values
The application code becomes sensitive to the query (different product types require different columns)
EAV-Pattern
Obviously the EAV pattern can provide a nicer solution for this. However, I've worked with EAV in the past, and when it comes down to performance, it can become a problem with a huge number of entries.
Searching is easy, but reconstructing a "normalized" row requires one join per attribute, which is slow (see the sketch after the list below).
Pro:
Clean
Flexible
Con:
Performance
Not Normalized
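To make the "one join per attribute" point concrete, here is a hedged sketch with made-up table and attribute names (products, product_attributes); every attribute you want back as a column costs another self-join:

-- Hypothetical EAV layout: product_attributes(product_id, attribute, value)
SELECT p.id,
       color.value  AS color,
       weight.value AS weight
FROM products p
LEFT JOIN product_attributes color
       ON color.product_id = p.id AND color.attribute = 'color'
LEFT JOIN product_attributes weight
       ON weight.product_id = p.id AND weight.attribute = 'weight';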
Single Table per category
Basically the opposite of the EAV pattern: create one table per product type, e.g. "cats", "dogs", "cars", ...
While this might be feasible for a small, fixed number of categories, it becomes a maintenance nightmare with a steadily growing number of categories.
Pro:
Clean
Performance
Con:
Maintenance
Query-Management
Best of both worlds
So, on my journey through the internet I found recommendations to mix both approaches: use a single table for the common information, while grouping the other attributes into "attribute groups" organized in the EAV fashion.
However, I think this basically imports the drawbacks of EACH approach: you still have to work with regular tables (basic information) and still do a huge number of joins to get ALL the information.
Storing enhanced information in JSON/XML
Another approach is to store the extended information as JSON/XML entries (within a column of the "root table").
However, I don't really like this, as it seems harder to query and to work with than a regular database layout.
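For what it's worth, a minimal sketch of the JSON variant, assuming MySQL 5.7+ and a hypothetical products "root table" (column and attribute names are made up):

-- Hypothetical: a JSON column on the "root" table holds category-specific attributes
ALTER TABLE products ADD COLUMN extra_attributes JSON;

SELECT id, title,
       JSON_UNQUOTE(JSON_EXTRACT(extra_attributes, '$.color')) AS color
FROM products
WHERE JSON_EXTRACT(extra_attributes, '$.wheels') = 4;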
Automating single tables
Another idea was to automate the creation of one table per category (and therefore automate the queries on those tables), while maintaining a "master table" containing just the id and the category, in order to get the best performance for an undetermined number of tables. For example:
Products
id | category | actualId
1 | cat | 1
2 | car | 1
cats
id | color | mew
1 | white | true
cars
id | wheels | bhp
1 | 4 | 123
The (abstract) Products table would allow querying for everything, while the details are available via an easy join on "actualId" with the responsible table.
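A hedged sketch of that detail lookup, using the example tables above:

-- Fetch details for one category via the master table and "actualId"
SELECT p.id, c.color, c.mew
FROM Products p
JOIN cats c ON c.id = p.actualId
WHERE p.category = 'cat';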
However, this leads to problems if you want to run a "show all" query, because that is not solvable by SQL alone: the table name in the join needs to be explicit in the query.
What other options are available? There are a lot of "webshops", each dealing with this problem to some degree - how do they solve it in an efficient way?
I strongly disagree with your opinion that the "monster" table approach leads to "Easy queries", and that the EAV approach will cause performance issues (premature optimization?). And it doesn't have to require complex queries:
SELECT base.id, base.other_attributes,
       GROUP_CONCAT(CONCAT(ext.key, '[', ext.type, ']', ext.value)) AS ext_attributes
FROM base_attributes base
LEFT JOIN extended_attributes ext
       ON base.id = ext.id
WHERE base.id = ?
GROUP BY base.id, base.other_attributes;
You would need to do some parsing of the result, but a wee bit of polishing would give you something parseable as JSON or XML without putting your data inside anonymous blobs.
If you don't care about data integrity and are happy to solve performance via replication, then NoSQL is the way to go (this is really the same thing as using JSON or XML to store your data).
I'm working on a website that, boiled down, will work for the end user as a glorified to-do list. SQL is the area where I have the least of the experience I need to do this well. Ignoring whether or not it will actually attract a massive user base, how could I design for the scenario where tens of thousands or more people each add dozens of their own items to this table?
Here's the layout I currently have planned for the Items table:
ItemID | UserID| Content | Subcontent | Parent | Hierarchy | Days | Note | Alert | Deadline
So, the items created by each user are contained in that table, to be queried using something like "SELECT * FROM Items WHERE UserID = $thisUser", then placed on the page and handled correctly using the other information from that row.
With this layout, would hundreds of thousands or millions of entries become a serious performance problem? If you have any suggestions or resources that you think would be helpful, I would appreciate them. Thank you.
If you index the UserID column, a few hundred thousand or a few million rows should be no big problem. If we're talking about even more rows, maybe several tens or hundreds of millions, you should think of a way to distribute the items evenly across their users. However, the row count is only one aspect influencing performance. The modeling of your data and the code that queries your database are likely to have more impact.
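A one-line sketch of the index being referred to, assuming the Items table and UserID column from the question (the index name is made up):

CREATE INDEX idx_items_userid ON Items (UserID);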
I believe you need to rethink your database layout. Rarely are individual users going to use the same content. I think you should have a table for each user; then it would be UserID | ItemID | Content | Subcontent ...
This allows you to maintain your database when a user quits.
I have made a custom CMS and I'd like to make it multi-client while duplicating as little as possible. I don't mean multi-user as in different people in the same organisation accessing the same program, I mean multiple-clients as in different organisations using their own access of the same program as though they are independent applications.
I understand the principle of sharing functions, and I imagine I'd need to put all the functions I've created into a shared folder in a parent directory.
I think I have got my head around at least the way the code works, but the database(sql) structure seems like the biggest challenge.
How is this typically accomplished?
My tables are fairly basic, and after doing some reading I can see that it's normal to simply add a 'client_id' or 'app_id' or something like this to every table and entry. This way the database isn't duplicated, but you do get a mixture of all the clients' data in the same tables. The problem, it seems, comes if this program grows very large with many clients: the data multiplies and the system slows down for everybody. I'm not at that stage yet, however, so should I not worry that far ahead and cross that bridge when I come to it, since for now the speed sacrifice would be negligible?
Is it possible to somehow keep databases separate without doubling up on work if I change the structure of a table in the future or add extra fields etc?
I understand this might be difficult to answer without knowing the way I've structured my tables, but they are quite simple, like:
unique_id | title | modified_date | content
xx | hello | 0000-00-00 00:00:00 | i am content
The best I can think so far is that this would then become:
client_id | unique_id | title | modified_date | content
xx | xx | hello | 0000-00-00 00:00:00 | i am content
Like I said, I can see this could run into some problems, mostly with becoming bloated down the track, but right now I don't see another way - perhaps you have another way of looking at this. Thanks.
Keep it as a single database with the client_id column added. If it gets large with many clients, partition the tables by LIST: http://dev.mysql.com/doc/refman/5.5/en/partitioning-list.html
Horizontal partitioning allows one logical table to be sub-divided, so when your SQL includes "... WHERE client_id = 1", it only ever has to read the index(es) or partition that contain "client_id = 1" data. Other partitions get ignored, almost as if you had a separate table for each client_id.
DISCLAIMER: I haven't used partitioning in MySQL myself. I'm just familiar with the concept from Oracle. Be sure your MySQL storage engine supports partitioning: http://dev.mysql.com/doc/refman/5.5/en/partitioning.html
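For reference, a hedged sketch of what LIST partitioning on client_id could look like; the table, column, and partition names here are assumptions, and note that every unique key must include the partitioning column:

CREATE TABLE content (
    client_id     INT NOT NULL,
    unique_id     INT NOT NULL,
    title         VARCHAR(255),
    modified_date DATETIME,
    content       TEXT,
    PRIMARY KEY (client_id, unique_id)   -- must include client_id for partitioning
)
PARTITION BY LIST (client_id) (
    PARTITION p_client1 VALUES IN (1),
    PARTITION p_client2 VALUES IN (2),
    PARTITION p_other   VALUES IN (3, 4, 5)
);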
Your best bet is to use a separate database for each client's data and retain all shared information (users, etc.) in one central database. Then, when a client grows, they can be moved off to another database server without affecting anyone else.
I have a similar situation with my web-app: many users sharing one database, where most of the tables have a client identifier in them. It's not a CMS, but it's similar enough: the users perform CRUD operations on their data.
There are pros and cons, but I wouldn't worry about performance overly. Since you will have to re-create your existing unique indexes to contain the client-id, you should see no great difference in performance: your look-ups now have an additional predicate, which appears in the indexes.
As George3 said, if you have significant volumes from one or a few clients, horizontal partitioning could be worth pursuing. But premature optimization and all that: I would wait and see if it becomes an issue first.
Managing multiple databases, or multiple versions of tables for different clients doesn't scale well, and is a maintenance nightmare. Just get the security on the content right.
I'm making my first site with Django, and I'm having a database design problem.
I need to store some of the users' history, and I don't know whether it's better to create a table like this for each user every time one signs up:
table: $USERNAME$
id | some_data | some_more | even_more
or have one massive table from the start, with everyone's data in:
table: user_history
id | username | some_data | some_more | even_more
I know how to do the second one, just declare it in my Django models. If I should do the first one, how can I in Django?
The first one organises the data more hierarchically but could potentially create a lot of tables depending on the popularity of the service (is this a bad thing?)
The second one seems to suit Django's design philosophies better (from what I've seen so far), and would make it easier to run comparative searches between users, but could get huge (number of users * average items in history). Can MySQL handle, say, 1 billion records? (I won't get that many, but it's good to plan ahead.)
Definitely the second format is the way you want to go. MySQL is pretty good at handling large numbers of rows (assuming they're indexed and cached appropriately, of course). For example, all versions of all pages on Wikipedia are stored in one table in their database, and that works absolutely fine.
I don't know what Django is, but I'm sure it's not good practice to create a table per user for logging (or for almost anything, for that matter).
You should definitely store all users in one table, one row per user. It's the only way you can filter data with a WHERE clause. I'm not sure whether MySQL can handle 1 billion records, but I've never found the record count to be the limiting factor, so I wouldn't worry about it for now.
You see, every heavily loaded project started out as something that was simply well designed, and a well-designed system has better prospects of being improved to handle huge loads.
Also keep in mind that even the brilliant people at Twitter/Facebook/etc. did not know what issues they would run into after a while, and you will not know either. Solving and predicting load and scalability challenges is a sort of rocket science.
So the best you can do now is start with the most normalized DB and the textbook solutions, and fix the bottlenecks as they appear.
When creating a relational database, you would only want to create a new table if it contains significantly different data than the original table. In this case, all of the tables will be pretty much the same, so you would only want 1 table for all users.
If you want to break it down even further, you may not want to store all the users' actions in the user table. You may want one table for user information and another for user history, e.g.:
table: User
Id | UserName | Password | other data
table: User_history
Id | some_data | timestamp
There's no need to be worried about the speed of your database as long as you define proper indexes on the fields you plan to search. Using those indexes will definitely speed up your response time as more records are put into your table. The database I work on has several tables with 30,000,000+ records and there's no slow-down.
Definitely DO NOT create a TABLE per user. Create a row per user, and possibly a row per user and smaller tables if some data can be factored.
Definitely stick with one table for all users; consider that complicated queries may require extra resources when running across multiple tables instead of just one.
Run some tests; regarding resources, I am sure you will find that one table works best.
Everyone has pointed out that the second option is the way to go, I'll add my +1 to that.
About the first option, in Django, you create tables by declaring subclasses of django.models.Model and then when you run the management command syncdb it will look at all the models and create missing tables for all "managed" models. It might be possible to invoke this behavior at run time, but it isn't the way things are done.
We have a MySQL database table for products. We are using a cache layer to reduce database load, but we think it's a good idea to minimize the amount of data that needs to be stored in the cache layer, to speed up the application further.
All the products in the database that are visible to visitors have a price attached to them.
The prices are stored in a different table, called prices. There are multiple price categories, depending on which discount level applies to each visitor (customer). From time to time there are campaigns, which means a special price is available for each product. The special prices are stored in a table called specials.
Is it bad to make a temp table that binds these tables together?
It would only hold the necessary information and would of course be cached.
productId | hasPrice | hasSpecial
1 | 1 | 0
2 | 1 | 1
By doing so, it would be super easy to know whether a specific product really has a price, without having to scan the complete prices or specials tables each time a product is listed or presented.
Are temp tables a common thing for web applications or is it just bad design?
If you're going to cache this data anyways, does it really need to be in a temp table? You would only incur the overhead of the query when you needed to rebuild the cache, so the temp table might not even be necessary.
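For comparison, a hedged sketch of deriving the same flags on demand instead of maintaining a temp table; the products table and product_id column names are assumptions:

SELECT p.id AS productId,
       EXISTS (SELECT 1 FROM prices   pr WHERE pr.product_id = p.id) AS hasPrice,
       EXISTS (SELECT 1 FROM specials sp WHERE sp.product_id = p.id) AS hasSpecial
FROM products p;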
You should approach it like any other performance problem: decide how much performance is necessary, then iterate, testing on production-grade hardware in your lab. Do not do needless optimisations.
You should profile your app and discover whether it's doing too many queries or whether the queries themselves are slow; most cases of web-app slowness are caused by doing too many queries (in my experience), even when the individual queries are cheap.
Normally the best engineering solution is to restructure the database, in some cases denormalising, so that the common read use-cases require fewer queries. Caching may help as well, but refactoring so you need fewer queries is often the better option.
Essentially you can increase the amount of work on the write-path to reduce the amount on the read-path, if you are planning to do a lot more reading than writing.