I want to create a table about "users" for each of the 50 states. Each state has about 2GB worth of data. Which option sounds better?
Create one table called "users" that will be 100GB large OR
Create 50 separate tables called "users_{state}", each of which will be 2GB large
I'm looking at two things: performance, and style (best practices)
I'm also running RDS on AWS, and I have enough storage space. Any thoughts?
EDIT: From the looks of it, I will not need info from multiple states at the same time (i.e. I won't need to frequently join tables if I go with Option 2). Here is a common use case: the front-end passes a state id to the back-end, and based on that id, I need to query the db for data regarding the specified state and return it to the front-end.
Are the 50 states truly independent in your business logic? Meaning your queries would only need to run over one given state most of the time? If so, splitting by state is probably a good choice. In this case you would only need joining in relatively rarer queries like reporting queries and such.
EDIT: Based on your recent edit, splitting by state is the route I would recommend. You will get better performance from the table partitioning when no joining is required, and there are multiple other benefits to having the smaller partitioned tables like this.
If your queries would commonly require joining across a majority of the states, then you should definitely not partition like this. You'd be better off with one large table and just build the appropriate indices needed for performance. Most modern enterprise DB solutions are capable of handling the marginal performance impact going from 2GB to 100GB just fine (with proper indexing).
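For the single-table route, here is a minimal sketch of the kind of index that matters, assuming a users table with a state column (names are illustrative):

    -- One index on the state column keeps per-state queries fast
    -- even when the table is 100GB.
    CREATE INDEX idx_users_state ON users (state);

    -- The common use case from the question: fetch one state's rows.
    SELECT *
    FROM users
    WHERE state = 'CA';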
But if your queries on average would need to join results from only a handful of states (say no more than 5-10 or so), the optimal solution is a more complex gray area. You will likely be able to extract better performance from the partitioned tables with joining, but it may make the code and/or queries (and all coming maintenance) noticeably more complex.
Note that my answer assumes the more common access frequency breakdowns: high reads, moderate updates, low creates/deletes. Also, if performance on big data is your primary concern, you may want to check out NoSQL (for example, Amazon AWS DynamoDB), but this would be an invasive and fundamental departure from the relational system. But the NoSQL performance benefits can be absolutely dramatic.
Without knowing more of your model, it will be difficult for anyone to make judgement calls about performance, etc. However, from a data modelling point of view, when thinking about a normalized model I would expect to see a User table with a column (or columns, in the case of a compound key) which hold the foreign key to a State table. If a User could be associated with more than one state, I would expect another table (UserState) to be created instead, and this would hold the foreign keys to both User and State, with any other information about that relationship (for instance, start and end dates for time slicing, showing the timespan during which the User and the State were associated).
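As a sketch of that normalized shape (MySQL syntax; the table and column names are illustrative, not taken from your actual model):

    CREATE TABLE State (
        state_id CHAR(2) PRIMARY KEY,  -- e.g. 'CA'
        name     VARCHAR(50) NOT NULL
    );

    CREATE TABLE User (
        user_id  INT PRIMARY KEY,
        state_id CHAR(2) NOT NULL,     -- FK for the one-state-per-user case
        FOREIGN KEY (state_id) REFERENCES State (state_id)
    );

    -- If a User can be associated with more than one State:
    CREATE TABLE UserState (
        user_id    INT NOT NULL,
        state_id   CHAR(2) NOT NULL,
        start_date DATE,               -- optional time slicing
        end_date   DATE,
        PRIMARY KEY (user_id, state_id),
        FOREIGN KEY (user_id)  REFERENCES User (user_id),
        FOREIGN KEY (state_id) REFERENCES State (state_id)
    );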
Rather than splitting the data into separate tables, if you find that you have performance issues you could use partitioning to split the User data by state while leaving it within a single table. I don't use MySQL, but a quick Google turned up plenty of reference information on how to implement partitioning within MySQL.
Until you try building and running this, I don't think you know whether you have a performance problem or not. If you do, following the above design you can apply partitioning after the fact and not need to change your front-end queries. Also, this solution won't be problematic if it turns out you do need information for multiple states at the same time, and won't cause you anywhere near as much grief if you need to look at User by some aspect other than State.
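For illustration, applying partitioning after the fact could look roughly like this in MySQL (a sketch only; note that MySQL requires the partitioning column to appear in every unique key of the table, and partitioned InnoDB tables do not support foreign keys):

    ALTER TABLE users
    PARTITION BY LIST COLUMNS (state) (
        PARTITION p_ca VALUES IN ('CA'),
        PARTITION p_ny VALUES IN ('NY'),
        PARTITION p_tx VALUES IN ('TX')
        -- ...one partition per state, 50 in total
    );

Front-end queries stay exactly the same; MySQL prunes to the single relevant partition when the WHERE clause names a state.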
I'm building a PHP app to prefill third party PDF account forms with client data, and am getting stuck on the database design.
The current form has about 70 fields, which seems like far too many to set up as individual columns, especially as some (i.e. company/trust information) are not relevant depending on the type of account the client requires.
I've tried to normalize, but it seems like there would be a lot of joins, and it would also require several subqueries for things like multiple addresses.
It also means a ton of extra queries when updating, to check whether rows exist or not and decide if the script needs to do an INSERT, a DELETE or an UPDATE, whereas if it were all in one row it would basically just be an UPDATE each time.
Not sure if this helps but here is a list of most of the fields:
id, account_type, account_phone, account_email, account_designation, account_adviser, account_source, account_complete,
account_residential_unit_number, account_residential_street_number, account_residential_street_name, account_residential_street_type, account_residential_suburb, account_residential_state, account_residential_postcode,
account_postal_unit_number, account_postal_street_number, account_postal_street_name, account_postal_street_type, account_postal_suburb, account_postal_state, account_postal_postcode,
individual_1_title, individual_1_firstname, individual_1_middlename, individual_1_lastname, individual_1_dob, individual_1_occupation, individual_1_email, individual_1_phone,
individual_1_unit_number, individual_1_street_number, individual_1_street_name, individual_1_street_type, individual_1_suburb, individual_1_state, individual_1_postcode,
individual_2_title, individual_2_firstname, individual_2_middlename, individual_2_lastname, individual_2_dob, individual_2_occupation, individual_2_email, individual_2_phone,
individual_2_unit_number, individual_2_street_number, individual_2_street_name, individual_2_street_type, individual_2_suburb, individual_2_state, individual_2_postcode,
company_name, company_date,
company_unit_number, company_street_number, company_street_name, company_street_type, company_suburb, company_state, company_postcode,
trust_name, trust_date,
settlement_bank, settlement_account, settlement_bsb
The most this will need to handle is around 200,000 applications, and once the data is in the database, it won't change very often, if at all - not sure if that is relevant?
So really I just wanted to figure out the smartest way to design this, even if it's just a name or topic to research further.
Generally speaking you can divide a database into two broad categories:
OLTP Systems
Online Transaction Processing systems are normally write-intensive, i.e. a lot of updates compared to reads of the data. This system is typically a day-to-day application used by business users of all scopes, i.e. data capture, admin etc. These databases are usually normalized to the extreme and then selectively denormalized for performance gains in certain areas.
OLAP/DSS Systems
Online Analytical Processing systems are normally large, data-warehouse-like databases used to support analytic activities such as data mining, data cubes etc. Typically the information is used by a more limited set of users than in OLTP. These databases are normally heavily denormalised.
Go read here for a short description of these and the main differences: OLTP vs OLAP.
Regarding your INSERT/UPDATE/DELETE point, go read about the MySQL INSERT ... ON DUPLICATE KEY UPDATE statement, which will resolve that issue for you easily. It is called a MERGE operation in most database systems.
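As a sketch of that pattern, using a few of the field names from your list (the table name is assumed):

    -- Inserts a new application row, or updates the existing row
    -- when the primary/unique key value is already present.
    INSERT INTO application (id, account_type, account_phone, account_email)
    VALUES (42, 'individual', '0400 000 000', 'client@example.com')
    ON DUPLICATE KEY UPDATE
        account_type  = VALUES(account_type),
        account_phone = VALUES(account_phone),
        account_email = VALUES(account_email);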
Now I don't understand why you are worried about JOINs. I have had tables with hundreds of millions (500,000,000+) of rows that I joined with other tables also large in size, and the queries ran very fast. So designing a database to eliminate joins is NOT a good idea.
My suggestion is:
If designing an OLTP system, normalise as much as possible, then denormalise to increase performance where needed. For an OLAP system, look at star schemas etc. and don't even bother normalising it first. Oh, by the way, most OLAP systems normally use an OLTP system as a data source.
Usually I normalise and then denormalise for performance. However:
If I didn't have too much validation to do, e.g. valid addresses, duplicated individuals,
and I didn't want to reuse parts of the data for another version of the form, e.g. selecting an existing individual's name and address,
and I didn't want to analyse it, e.g. find all mentions of Fred Bloggs,
and my users were happy with entering all of this in one form (I wouldn't be),
then I'd denormalise from the get-go.
Thing is, if you normalise, denormalising later if required is fairly trivial and low risk; normalising denormalised data usually means de-duplication, which is likely to be really painful both data- and design-wise.
Normalize your input, de-normalize the output. Meaning, for reporting, extract your data into a de-normalized format like Mongo and use that for querying. Or create rollups of some sort. With large datasets, I have found it most efficient to extract the reporting data from the input data.
I find denormalized data extremely painful to work with at a very basic level. What if I want a tally of the number of people who live in Georgia? In your denormalized structure I would have to count where ind_1_state = GA or ind_2_state = GA.
This is not too bad, I guess, but to anyone who has seen the ease of querying that normalization provides, it is quite painful.
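To make that concrete (a sketch; table names are illustrative, columns follow the field list above):

    -- Denormalized: every query must enumerate the repeated columns.
    SELECT COUNT(*)
    FROM application
    WHERE individual_1_state = 'GA'
       OR individual_2_state = 'GA';

    -- Normalized: one individual table with a single state column.
    SELECT COUNT(*)
    FROM individual
    WHERE state = 'GA';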
Normalization establishes the foundation for more and more complex queries. Without it, you will find it increasingly difficult to implement richer data analysis.
Normalization also provides the basis for integrity and consistency in your database. If you have all the occurrences of a particular thing (state abbreviations) in one place (one column), you can easily check and constrain those values to disallow nonexistent codes.
The rationale for normalization goes on and on, but I hope I've hit a few no-brainers.
This is a no-brainer - all you have now is a noun soup which you have shoved into a single table-storage shoebox, with some ID glued onto the beginning of each row.
Create some kind of schema. If this is more like OLAP - and you decide on a star schema - it will have dimensions in 2NF-5NF and facts in 2NF-6NF. For OLTP (or other warehouse models) aim for BCNF-6NF.
I would argue that you do not even have 1NF here; gluing that ID onto the beginning does not count as preventing duplicates. Therefore, you cannot de-normalize from this point even if you wanted to :) - OK, maybe you could put a comma-separated list somewhere to make things definitely not in 1NF.
Joins are what relational databases do, so do not worry about that.
What are the pros and cons of having CreatedDate and ModifiedDate columns on a table? When should we have them and when shouldn't we?
UPDATE
What is this comment in an update SP auto-generated with RepositoryFactory? Does it have anything to do with the above columns not being present?
--The [dbo].[TableName] table doesn't have a timestamp column. Optimistic concurrency logic cannot be generated
If you don't need historical information about your data adding these columns will fill space unnecessarily and cause fewer records to fit on a page.
If you do or might need historical information, then this might not be enough for your needs anyway. You might want to consider a different scheme, such as ValidFrom and ValidTo columns, where you never modify or delete the data in any row; you just mark it as no longer valid and create a new row.
See Wikipedia for more information on different schemes for keeping historic information about your data. The method you proposed is similar to Type 3 on that page and suffers from the same drawback that only information about the last change is recorded. I suggest you read some of the other methods too.
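A minimal sketch of the ValidFrom/ValidTo idea mentioned above (T-SQL flavoured; the table and key are hypothetical - the key would have to be composite, e.g. (CustomerId, ValidFrom)):

    -- Close off the current version of the row...
    UPDATE Customer
    SET ValidTo = GETDATE()
    WHERE CustomerId = 42 AND ValidTo IS NULL;

    -- ...and insert the new version; nothing is ever modified in place.
    INSERT INTO Customer (CustomerId, Name, ValidFrom, ValidTo)
    VALUES (42, 'New name', GETDATE(), NULL);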
All I can say is that this data (or full-blown audit tables) has helped me find what or who caused a major data problem. All it takes is one such use to convince you that it is worth the extra time to keep these fields up to date.
I don't usually do it for tables that are only populated through a single automated process and no one else has write permissions to the table. And usually it isn't needed for lookup tables which users generally can't update either.
There are pretty much no cons to having them, so if there is any chance you will need them, then add them.
People may mention performance or storage concerns, but:
in reality they will have little to no effect on SELECT performance with modern hardware and properly specified SELECT clauses;
there can be a minor impact on write performance, but this will likely only be a concern in OLTP-type systems, and that is exactly the case where you usually want these kinds of columns;
if you are at the point where adding columns like this is a dealbreaker in terms of performance, then you are likely looking at moving away from SQL databases as a storage platform.
With CreatedDate, I almost always set it up with a default value of GetDate(), so I never have to think about it. When building out my schema, I will add both of these columns unless it is a lookup table with no GUI for administering it, because I know it is unlikely the data will be kept up to date if modified manually.
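For example (SQL Server syntax, since GetDate() is mentioned; the table itself is hypothetical):

    CREATE TABLE Widget (
        WidgetId     INT IDENTITY PRIMARY KEY,
        Name         NVARCHAR(100) NOT NULL,
        CreatedDate  DATETIME NOT NULL DEFAULT GETDATE(), -- set once on insert
        ModifiedDate DATETIME NOT NULL DEFAULT GETDATE()  -- app or trigger keeps this current
    );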
Some DBMSs provide other means to capture this information automatically, for example Oracle Flashback or Microsoft Change Tracking / Change Data Capture. Those methods also capture more detail than just the latest modification date.
The column type timestamp is misleading. It has nothing to do with time; it is rowversion. It is widely used for optimistic concurrency; example here.
I'm considering a design for a private messaging system and I need some input here, basically I have several questions regarding this. I've read most of the related questions and they've given me some thought already.
All of the basic messaging systems I've thus far looked into use a single table for all of the users' messages. With indexes etc this approach would seem fine.
What I wanted to know is if there would be any benefit to splitting the user messages into separate tables. So when a new user is created, a new table is created (either in the same or a dedicated message database) which stores all of the messages - sent and received - for that user.
What are the pitfalls/benefits to approaching things that way?
I'm writing in PHP; would the code required be particularly more cumbersome than with the first, large-table option?
Would the eventual result, with a large amount of smaller tables be a more robust, trouble free design than one large table?
In the event of large numbers of concurrent users, how would the performance of the server compare when dealing with one large table versus many small tables?
Any help with those questions or other input would be appreciated. I'm currently working through a smaller-scale design for my test site before rewriting the PM module, and would like to optimise it. My poor human brain handles separate tables far more easily, but the same isn't necessarily true for a computer.
You'll just get headaches from moving to numerous small tables. Databases are made for handling lots of data; let the database do its thing.
You'll likely end up using dynamic table names in queries (SELECT * FROM $username WHERE ...), making smart features like stored procedures and possibly parameterized queries a lot trickier if not outright impossible. Usually a really bad idea.
Try rewriting SELECT * FROM messages WHERE authorID = 1 ORDER BY date_posted DESC, but where "messages" is anywhere between 1 and 30,000 different tables. Keeping your table relations monogamous will keep them bidirectional, which is way more useful.
If you think table size will really be a problem, set up an "archived messages" clone table and periodically move old, already-read messages there where they won't get in the way. Also note how most forum software with private messaging allows limiting user inbox sizes. There are a few ways to solve the problem while keeping things sane.
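A sketch of that periodic move (MySQL flavoured; the archive table is assumed to have the same structure as messages, and the column names are illustrative):

    -- Move read messages older than a year out of the hot table.
    INSERT INTO messages_archive
        SELECT * FROM messages
        WHERE date_posted < NOW() - INTERVAL 1 YEAR
          AND is_read = 1;

    DELETE FROM messages
    WHERE date_posted < NOW() - INTERVAL 1 YEAR
      AND is_read = 1;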
I'm agreeing with MarkR here in that, initially, the one table for messages is definitely the way to proceed. As time progresses, should you end up with a very large table, you can then consider how best to partition it. That's counter to the way I'd normally advise design, but we're talking about one fairly simple table - not a huge enterprise system.
A very long time ago (pre availability of SQL databases) I built a system that stored private and public messages, and I can confirm that once you split a message base logical entity into more than one everything¹ becomes a lot more complicated; and I doubt that a user per file is the right approach - the overheads will be massive compared to the benefit.
Avoid auto-increment² - using natural keys is very important for future scalability. Designing well, so that you can insert and retrieve without locking, will be of more benefit.
¹ Indexing, threading, searching, purging/archiving.
² Natural keys are better if you can find one for your data: an autoincremented ID does not describe the data at all, and databases are good at locating rows by primary key, so a natural primary key can improve things. Autoincrement can cause problems with a distributed database; it also leaks data when presented externally (to see the number of users registered, just create a new account and check your user ID). If you can't find a natural key then a UUID (or GUID) may still be a better option, provided that the database has good support for it as a primary key. See When to use an auto-incremented primary key and when not to
Creating one table per user certainly won't scale well when there are a large number of users with a small number of messages each. Because of the way MySQL handles table opening and closing, very large numbers of tables (say, more than 10k) become quite inefficient, especially at server startup and shutdown, and when backing up non-transactional tables.
However, the way you've worded your question sounds like a case of premature optimisation. Make it work first, then fix performance problems. This is always the right way to do things.
Partitioning / sharding will become necessary once your scale gets high enough. But there are a lot of other things to worry about in the mean time. Sort them out first :)
One table is the right way to go from an RDBMS PoV. I recommend you use it until you know better.
Splitting large amounts of data into smaller sets makes sense if you're trying to avoid locking issues: for example, locking the messages table while doing big selects or updating huge amounts of data at once. In that case long-running queries could block the whole table and everyone would need to wait... You should ask yourself whether this is going to happen in your case. At least to me it looks like a messaging system won't have such problems, because all information is pushed into the table, or retrieved from it, in rather small sets. If this is a user-centric application, then, for example, getting all messages for a single user is quite easy and fast to do, and the same goes for creating new messages for one or another particular user... unless you have really huge numbers of users/messages in your system.
Splitting data into multiple tables also has some drawbacks - you will need some kind of management system or logic for how you split everything. Giving a separate table to each user could soon grow into hundreds or thousands of tables, which is, in my opinion, not that nice. So you would probably need some other criterion for how to split the data. And if you want the splitting logic to be dynamic and easily adjustable, you would probably also need to store it in the DB somehow. As you see, the complexity grows...
An advantage of such data sharding is scalability - you could easily put different sets of data on different machines once a single machine is not able to handle the whole load.
It depends how your messaging system works.
Are there concurrency issues?
Does it need to scale as the application accommodates more customers?
One table will work perfectly for a small, one-message-at-a-time, single-user system.
However, if you are considering a multi-user, concurrent messaging system, the tables should be split.
The data model for a real-time application is recommended to be normalized (split tables) due to locking & latching and data redundancy issues.
Locking policy varies by database vendor. If you have tables that are updated and selected by the application concurrently, locking issues (page level, row level or table level, depending on the vendor) arise. Some bad DB and app designs completely lock the table, so messages never go through.
The redundancy issue is more clear-cut. If you use only one table, some information (like the user - I assume one user can send multiple messages) is redundant.
Try googling "normalization" and "locking".
I am designing a system and I don't think it's a good idea to give end users the ability to delete entries in the database. I think that because the end user, once given admin rights, often ends up making a mess in the database and then turns to me to fix it.
Of course, they will need to be able to remove entries - or at least think that they did - if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it will just set the flag to "false" or something similar. Then there will be some kind of super admin that would be my company's team who could change this field.
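A minimal sketch of that flag approach (table and column names are illustrative):

    -- The end user's "delete" just flips the flag...
    UPDATE entry
    SET active = FALSE
    WHERE entry_id = 42;

    -- ...and every normal query has to filter on it.
    SELECT * FROM entry WHERE active = TRUE;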
I already saw this at another company I worked for, but I wonder whether it is a good idea. I could just take regular database backups and roll back if they commit an error, and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?
In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words, transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status etc., but you cannot dismiss the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
Like, a city can be deleted from the database. But when a rule appeared that said "send an SMS to all customers in Moscow", the cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (say he made a report, some management decision was based on it, and then the data the report was based on disappeared), it was considered OK to delete these data.
But if the decision affected some immediate actions with customers (like calling, messing with the customer's balance etc.), everything that led to those decisions was kept forever.
It may vary from one business model to another: sometimes it may be required to record even internal data, sometimes it's OK to delete data that affects the outside world.
But for our business model, the rule from above worked fine.
A couple of reasons people do things like this are auditing and automated rollback. If a row is completely deleted, there's no way to automatically roll back that deletion if it was made in error. Also, keeping a row around with its previous state is important for auditing - a super user should be able to see who deleted what and when, as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.
The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive that record into a different database. This way it's been physically moved to a location that is not normally searched. You might add a couple fields to capture who deleted it and when; but the point is it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.
I prefer the method that you are describing. It's nice to be able to undo a mistake; more often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method, and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.
I use a combination of techniques to work around this issue. For some things, adding the extra "active" field makes sense; the user then has the impression that an item was deleted because it no longer shows up on the application screen. The scenarios where I would implement this include items that are required to keep a history - let's say invoices and payments. I wouldn't want such things deleted for any reason.
However, there are some items in the database that are not so sensitive - let's say a list of categories that I want to be dynamic. I may then allow users with admin privileges to add and delete a category, and the delete could be permanent. However, as part of the application logic I check whether the category is used anywhere before allowing the delete.
I suggest having a second database, like DB_Archives, where you add every row deleted from the DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that a row is not marked as deleted while it is still referenced elsewhere. This becomes overly complicated when your DB structure is massive.
There is an acceptable practice used in many applications (Drupal's versioning system, et al.). Since MySQL scales very quickly and easily, you should be okay.
I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changes inside the DB and thus kept the complex stuff out of the application (for the most part).
I should mention that the DB was an MSSQL 2005 server, but I guess the same approach should work with MySQL, too.
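A rough sketch of the view-plus-trigger combination (T-SQL, matching the MSSQL mention; the table, view and status codes are illustrative):

    -- Status codes: 1 = active, 2 = deleted,
    -- 3 = in need of manual correction, 4 = historic.
    CREATE VIEW ActiveEntry AS
        SELECT EntryId, Payload
        FROM Entry
        WHERE Status = 1;
    GO

    -- A DELETE issued against the view becomes a flag change on the base table.
    CREATE TRIGGER ActiveEntry_Delete ON ActiveEntry
    INSTEAD OF DELETE
    AS
    BEGIN
        UPDATE Entry
        SET Status = 2
        WHERE EntryId IN (SELECT EntryId FROM deleted);
    END;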
Yes and no.
It will complicate your application much more than you expect, since every table that does not allow deletion will need an extra check (IsDeleted = false) etc. It does not sound like much, but when you build a larger application and 9 tables out of the 11 in a query require the non-deletion check... it's tedious and error-prone. (Well yeah, then there are deleted/non-deleted views... when you remember to create and use them.)
Some schema upgrades will become a PITA, since you'll have to relax FKs and invent "suitable" data for very, very old rows.
I've not tried it, but I have thought a moderate amount about a solution where you'd zip the row data to XML and store that in some "Historical" table. Then, in case of "must have that restored now OMG the world is dying!1eleven", it's possible to dig it out.
I agree with all respondents that if you can afford to keep old data around forever, it's a good idea. For performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags. (Moving them to a totally different database seems a bit like overkill, but you can easily change to that more drastic approach later if the amount of accumulated data eventually turns out to be a problem for a single db with normal and "old stuff" tables.)