I have some very write-intensive tables (user tracking tables) that are written to nonstop. The problem is that on a fully normalized schema I will have 16 foreign keys. Some keys are purely for lookup references; some are important, like linking the user ID, user session ID, activity ID, etc.
With this many FKs on a write-intensive table, performance is an issue (I have a user-content website which needs near-real-time updates). So I am planning to drop all FKs for these write-intensive tables, but before that I want to know how else I can link data. When people say "in the code", what exactly are we doing at the code level to keep data linked together, since I assume in the application we cannot have relationships?
Secondly, if I don't use FKs, I assume the data will still be consistent as long as the correct ID is written? It's not as if member ID 2000 will somehow be written as 3000 just because no FK is used, for whatever reason?
Lastly, this will not affect joins, right? While I hope to avoid joins, I may need some. But I assume that, FKs or not, joins can still be done as usual?
Secondly, if I don't use FKs, I assume the data will still be consistent as long as the correct ID is written?
Yes. The database stores whatever ID your application writes; the FK only validated that value, it never changed it.
Lastly, this will not affect joins, right?
Right. Joins only need matching column values and, for performance, an index on the join columns; they do not depend on a FK constraint existing.
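A quick sketch of such a join (table and column names are hypothetical):

-- no FK constraint here, but the join works as long as the values match;
-- an index on tracking.member_id is what keeps it fast
SELECT t.activity_id, m.name
FROM tracking t
JOIN member m ON m.id = t.member_id
WHERE t.member_id = 2000;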
When people say "in the code", what exactly are we doing at the code level to keep data linked together?
This is the real question. Actually, the really real two questions are:
1) How confident are you that the incoming values are all valid and do not need to be checked?
2) How big are the lookup tables being referenced?
If the answers are "not very confident" and "really small", then you can enforce in code by caching the lookup tables in the app layer and checking against these super-fast in-memory copies before inserting. However, consider this: the database will also cache those small tables, so it might still be simpler to keep the FKs.
If the answers are "not very confident" and "really huge", then you have a choice. You can drop the FK constraints, knowingly insert bad values, and do some post-job cleanup (see the sketch at the end of this answer), or you can keep the FKs in the database, because otherwise you end up with all of that bad data.
For this combination it is not practical to cache the tables in the app, and if you drop the FKs and do lookups from the app, it is even slower than having the FKs in the database.
If the answer to the first question is "100% confident", then the second question does not matter. Drop the FKs and insert the data with speed and confidence.
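If you go the drop-and-clean-up route, a hedged sketch of a post-job cleanup pass (tracking and member are hypothetical table names):

-- find rows whose member_id no longer matches any member
SELECT t.id
FROM tracking t
LEFT JOIN member m ON m.id = t.member_id
WHERE m.id IS NULL;

-- then delete them (or move them to a quarantine table first)
DELETE t
FROM tracking t
LEFT JOIN member m ON m.id = t.member_id
WHERE m.id IS NULL;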
Related
Would there be any advantages/disadvantages to having one million tables in my database?
I am trying to implement comments. So far, I can think of two ways to do this:
1. Have all comments from all posts in 1 table.
2. Have a separate table for each post and store all comments from that post in its respective table.
Which one would be better?
Thanks
You're better off having one table for comments, with a field that identifies which post id each comment belongs to. It will be a lot easier to write queries to get comments for a given post id if you do this, as you won't first need to dynamically determine the name of the table you're looking in.
I can only speak for MySQL here (not sure how this works in PostgreSQL), but make sure you add an index on the post id field so the queries run quickly.
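Assuming the table is called comments with a post_id column, that index is one statement:

ALTER TABLE comments ADD INDEX idx_post_id (post_id);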
You can have a million tables but this might not be ideal for a number of reasons[*]. Classical RDBMS are typically deployed & optimised for storing millions/billions of rows in hundreds/thousands of tables.
As for the problem you're trying to solve, as others state, use foreign keys to relate a pair of tables: posts & comments a la [MySQL syntax]:
create table post(id integer primary key, post text);
-- one row per comment, keyed back to its post; the index on postid keeps per-post lookups fast
create table comment(id integer primary key, postid integer, comment text, key fk (postid));
{You can add constraints to enforce referential integrity between comment and post to avoid orphaned comments, but this requires certain capabilities of the storage engine to be effective, as sketched below.}
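For example, a variant of the comment table above with the constraint enforced (in MySQL the engine must be InnoDB for the constraint to actually be enforced):

create table comment(
    id integer primary key,
    postid integer not null,
    comment text,
    foreign key (postid) references post(id)  -- rejects comments pointing at missing posts
) engine=InnoDB;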
The generation of primary key IDs is left to the reader, but something as simple as auto increment might give you a quick start [http://dev.mysql.com/doc/refman/5.0/en/example-auto-increment.html].
Which is better?
Unless this is a homework assignment, storing this kind of material in a classic RDBMS might not fit with contemporary idioms. Keep the same spiritual schema but use something like Solr/Elasticsearch to store your material and benefit from the content indexing, since I trust you'll want to avoid writing your own search engine. You can use something like Sphinx [http://sphinxsearch.com] to index MySQL in a similar manner.
[*] Without some unconventional structuring of your schema, the amount of metadata and the pressure on the underlying filesystem will be problematic (for example, some dated/legacy storage engines, like MyISAM on MySQL, create three files per table).
When working with relational databases, you have to understand (a little bit about) normalization. The third normal form (3NF) is easy to understand and works in almost any case. A short tutorial can be found here. Use Google if you need more/other/better examples.
One table per record is a red flag: you know you're missing something. It also means you need dynamic DDL; you must create new tables when you get new records. It is also a security issue: the database user needs too many permissions and becomes a security risk.
I have an InnoDB based schema with roughly 100 tables, most use GUID/UUID's as the primary key. I started this at a point in time where I didn't really understand the implications of a UUID PK with regard to Disk IO and fragmentation, but wanted the benefits of avoiding a single key dispenser when dealing with server clusters. We're not currently dealing with large numbers of rows, but we will be (in the hundreds of millions) and I would like to be prepared for that.
Now that I understand indexing in InnoDB better, specifically the clustered nature of the primary key, I can see that my UUID's are a poor choice for scalability from a DISK IO perspective, but I don't want to stop using them due to the server clustering requirement.
The accepted/recommended solution seems to be a mix of an auto-increment PK (INT|BIGINT) with a UNIQUE-indexed UUID key. My intention is to add a new first column ai_col to each table and assign it as the new PK; I'm taking cues from:
http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html
I would then update/recreate a new "UNIQUE" index on my UUID keys and continue to use them in our application layer.
My expectation is that once this is done, I can essentially ignore ai_col and everything else runs business as usual. InnoDB will have a relatively small integer-based PK to cluster on and to store in the other unique (secondary) indexes.
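For reference, here's the kind of statement I'm planning to run on each table (my_table and its uuid column stand in for the real names, and any FKs referencing the old PK would need to be re-pointed first):

ALTER TABLE my_table
    DROP PRIMARY KEY,
    ADD COLUMN ai_col BIGINT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
    ADD PRIMARY KEY (ai_col),        -- small, monotonic clustered key
    ADD UNIQUE KEY uq_uuid (uuid);   -- UUID stays unique for the app layer

The rebuild rewrites the whole table, so I'll be testing it on a copy first.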
Question 1: Am I correct in assuming that in this new scenario, I can have my cake and eat it too?
The follow-up question is with regard to smaller 'associational' tables, i.e. only two columns, both foreign keys joining two other tables implicitly. In these cases I typically have two indexes: a UNIQUE two-column index with the more heavily used column first, then a second single-column index on the other column. I know that this is essentially 2.5x as large as the actual row data, but it seems to really help our more complex queries during optimization, and it is on smaller tables, so it is relatively acceptable.
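To illustrate with hypothetical names, a typical one looks like:

CREATE TABLE post_tag (
    post_id BIGINT UNSIGNED NOT NULL,
    tag_id BIGINT UNSIGNED NOT NULL,
    UNIQUE KEY uq_post_tag (post_id, tag_id),  -- more heavily used column first
    KEY idx_tag_id (tag_id)                    -- covers lookups from the other direction
);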
Most of these associational tables will have only a fraction of the number of records in the primary tables because they're typically more specific; however, there are a few cases where they have many multiples of the number of records of their foreign parents, i.e. potentially billions.
Question 2: Is it a good idea to add the numeric PK's to these tables as well? I'm guessing that the answer will be something along the lines of "Benchtest it" but I'm just looking for helpful nuggets of wisdom.
If I've obviously mis-interpreted anything or you can offer insights that I may not be considering, I'd really appreciate that too!
Many thanks!
EDIT: As promised in the answer, I just wanted to follow up for anyone interested... This solution has worked famously :) Read and write performance increased across the board, and so far it's been tested up to about 6 billion I/Os per month without breaking a sweat.
Without any other suggestions, confirmations, or otherwise, I've begun testing on our dev server with a number of less-used tables, but ones that would nonetheless be affected if the new AI-based IDs were going to affect our application layer.
So far it's looking good: indexes are performing as expected, and the new table fields haven't required any changes to our application layer; we've been basically able to ignore them.
I haven't run any thorough benchmarks yet to test the actual disk IO under heavy load, but from the sheer amount of information out there on the subject, I can surmise that we're in good shape for scaling up.
Once this has been in place for a while I'll drop in a follow up in case anyone's in the same boat we were.
So I started working for a company where they had 3 to 5 different tables that were often queried with either a complex join or two or three separate queries (I'm probably the 4th person to start working here; it's very messy).
Anyhow, I created a table that is populated alongside the other 3 to 5 tables: whenever data normally gets inserted into them, the combined data also gets inserted into my table. It has drastically sped up the page loads for many applications, and I'm wondering if I made a mistake here.
I'm hoping in the future to stop inserting into those other tables, simply insert all that information into the table I've created, and switch the applications over to that one table. It's just a lot faster.
Could someone tell me why it's much faster to group all the information into one massive table and if there is any downside to doing it this way?
If the joins are slow, it may be because the tables did not have FOREIGN KEY relationships and indexes properly defined. If the tables had been properly normalized before, it is probably not a good idea to denormalize them into a single table unless they were not performant with proper indexing. FOREIGN KEY constraints require indexing on both the PK table and the related FK column, so simply defining those constraints if they don't already exist may go a long way toward improving performance.
The first course of action is to make sure the table relationships are defined correctly and the tables are indexed before you begin denormalizing.
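As a sketch with hypothetical names, defining the relationship and its supporting index looks like:

ALTER TABLE order_item ADD INDEX idx_order_id (order_id);
ALTER TABLE order_item
    ADD CONSTRAINT fk_order_item_order
    FOREIGN KEY (order_id) REFERENCES orders (id);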
There is a concept called materialized views, which serve as a sort of cache for views or queries whose result sets are deterministic, by storing the results of a view's query into a temporary table. MySQL does not support materialized views directly, but you can implement them by occasionally selecting all rows from a multi-table query and storing the output into a table. When the data in that table is stale, you overwrite it with a new rowset. For simple SELECT queries which are used to display data that doesn't change often, you may be able to speed up your page loads using this method. It is not advisable to use it for data which is constantly changing, though.
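A minimal sketch of that refresh pattern, assuming hypothetical post and comment tables:

CREATE TABLE IF NOT EXISTS post_summary (
    post_id INT PRIMARY KEY,
    comment_count INT NOT NULL
);

-- refresh: discard the stale rowset and store the current one
TRUNCATE TABLE post_summary;
INSERT INTO post_summary (post_id, comment_count)
SELECT p.id, COUNT(c.id)
FROM post p
LEFT JOIN comment c ON c.postid = p.id
GROUP BY p.id;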
A good use for materialized views might be constructing rows to populate your site's dropdown lists or to store the result of complicated reports which are only run once a week. A bad use for them would be to store customer order information, which requires timely access.
Without seeing the table structures, etc., it would be guesswork. But it sounds like possibly the database was over-normalized.
It is hard to say exactly what the issue is without seeing it. But you might want to look at adding indexes and foreign keys to the tables.
If you are adding a table with all of the data in it, you might be denormalizing the database.
There are some cases where de-normalizing your tables has its advantages, but I would be more interested in finding out if the problem really lies with the table schema or with how the queries are being written. You need to know if the queries utilize indexes (or whether indexes need to be added to the table), whether the original query writer did things like using subselects when they could have been using joins to make a query more efficient, etc.
I would not just denormalize because it makes things faster unless there is a good reason for it.
Having a separate copy of the data in your newly defined table is a valid performance-enhancing practice, but on the other hand it can become a total mess when it comes to keeping the data in your table and the other ones in sync. You essentially have two sources of truth, with no good way to invalidate this "cache" when updates or deletes happen.
Read more about "normalization" and read more about "EXPLAIN" in MySQL; it will tell you why the other queries are slow, and you might get away with a few proper indexes and foreign keys instead of copying the data.
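For example (table names are made up), prefixing a slow query with EXPLAIN shows whether it uses an index or scans every row:

EXPLAIN
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;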
I have this idea I've been mulling around in my head, based on another concept I read somewhere. Basically you have a single "Primary" table with very few fields, and other tables inherit that primary table through a foreign key. This much has been done before, so it's no news. What I would like to do is have virtually every table in the database inherit from that Primary table. This way, every object, every record, every entry in every table can have a fully unique primary key (since the PK is actually stored in the Primary table), and can be referenced simply by ID instead of by table.
Another benefit is that it becomes easy to make relationships that can touch multiple tables. For example: I have a Transaction table, and this table wants to have a FK to whatever it is a transaction for (inventory, account, contact, order, etc.). The Transaction can just have a FK to the Primary table, and the necessary piece of data is referenced through that.
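A sketch of what I have in mind, with hypothetical names:

CREATE TABLE primary_record (
    id BIGINT UNSIGNED PRIMARY KEY AUTO_INCREMENT
);

-- every table "inherits" its PK from primary_record,
-- so each insert costs an extra insert into the primary table
CREATE TABLE account (
    id BIGINT UNSIGNED PRIMARY KEY,
    name VARCHAR(100),
    FOREIGN KEY (id) REFERENCES primary_record (id)
);

-- a transaction can point at anything via the primary table
CREATE TABLE txn (
    id BIGINT UNSIGNED PRIMARY KEY,
    target_id BIGINT UNSIGNED NOT NULL,
    FOREIGN KEY (id) REFERENCES primary_record (id),
    FOREIGN KEY (target_id) REFERENCES primary_record (id)
);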
The issue that keeps coming up in my head is whether or not that Primary table will become a bottleneck. The thing is going to have literally millions of records at some point. I know that gigantic record sets can be handled by good table design, but what's the limit?
Has anyone attempted anything similar to this, and what were your results?
You have to consider that this table will have a ton of foreign key relations. These can cause performance issues if you want to delete a row from the root table (which can produce some nasty execution plans on delete).
So if you plan to remove rows, it could impact performance. I recently had issues with a setup like this, and it was a pain to clean up (it was referencing 120 other tables; deletes were slow as hell).
To overcome this performance issue, you might consider not enforcing the constraints (bad plan), dropping the constraints entirely for performance (bad plan), or grouping all data that belongs to one entity into one row and sticking to normal normalization practices (good plan).
Yes, the primary table will almost certainly be a bottleneck.
How do you enforce real referential integrity?
For example, how can you be sure that the transaction's FK is actually linked to an inventory, account, contact or order, rather than an apple, orange or pineapple?
I think this would be a horrible bottleneck. Not only that, it would make enforcing the real PK/FK relationships much harder. It could create a data integrity nightmare. I don't see where you gain any benefit at all.
I want to create a new table for each new user on the web site, and I assume that there will be many users. I am sure that search performance will be good, but what about maintenance?
It is MySQL, which has no limit on the number of tables.
Thanks a lot.
Actually, tables are stored in a table too. So in this case you would just be trading a search through a table of users for a search through the system tables for a table.
Performance AND maintainability will suffer badly.
This is not a good idea:
The maximum number of tables is unlimited, but the table cache is finite in size, and opening tables is expensive. In MyISAM, closing a table throws its key cache away. Performance will suck.
When you need to change the schema, you will need to do one ALTER TABLE per user, which will be an unnecessary pain
Searching for things for no particular user will involve a horrible UNION query across all or many users' tables (see the sketch after this list)
It will be difficult to construct foreign key constraints correctly, as you won't have a single table with all the user ids in any more
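To make the UNION problem concrete, a sketch with hypothetical per-user tables versus one shared table:

-- per-user tables: every cross-user query grows with your user count
SELECT * FROM tasks_user_1
UNION ALL SELECT * FROM tasks_user_2
UNION ALL SELECT * FROM tasks_user_3;  -- ...one branch per signed-up user

-- one shared table: the same query is a plain, indexable filter
SELECT * FROM tasks WHERE user_id IN (1, 2, 3);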
Why are you sure that performance will be good? Have you tested it?
Why would you possibly want to do this? Just have one table for each thing that needs a table, and add a "user" column. Having a bunch of tables vs a bunch of rows isn't going to make your performance better.
To give you a direct answer to your question: maintenance will lower your enthusiasm at the same rate that new users sign up for your site.
Not sure what language/framework you are using for your web site, but at this stage it is best to look up some small examples in it. Our guess is that in every example you'll find, every new user gets one record in a table, not a table in the database.
I would go with option 1 (a table called tasks with a user_id foreign key) in the short run, assuming that a task can't have more than one user; if it can, you'll need a join table. Check into setting up an actual foreign key as well; this promotes referential integrity in the data itself.
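A minimal sketch of option 1, assuming a users table with an id primary key:

CREATE TABLE tasks (
    id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    user_id INT UNSIGNED NOT NULL,
    title VARCHAR(255) NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (id)  -- enforces that every task belongs to a real user
) ENGINE=InnoDB;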