Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Background
I'm a first year CS student and I work part time for my dad's small business. I don't have any experience in real world application development. I have written scripts in Python, some coursework in C, but nothing like this.
My dad has a small training business and currently all classes are scheduled, recorded and followed up via an external web application. There is an export/"reports" feature but it is very generic and we need specific reports. We don't have access to the actual database to run the queries. I've been asked to set up a custom reporting system.
My idea is to create the generic CSV exports and import (probably with Python) them into a MySQL database hosted in the office every night, from where I can run the specific queries that are needed. I don't have experience in databases but understand the very basics. I've read a little about database creation and normal forms.
We may start having international clients soon, so I want the database to not explode if/when that happens. We also currently have a couple of big corporations as clients, with different divisions (e.g. ACME parent company, ACME healthcare division, ACME bodycare division).
The schema I have come up with is the following:
From the client perspective:
Clients is the main table
Clients are linked to the department they work for
Departments can be scattered around a country: HR in London, Marketing in Swansea, etc.
Departments are linked to the division of a company
Divisions are linked to the parent company
From the classes perspective:
Sessions is the main table
A teacher is linked to each session
A statusid is given to each session. E.g. 0 - Completed, 1 - Cancelled
Sessions are grouped into "packs" of an arbitrary size
Each pack is assigned to a client
I "designed" (more like scribbled) the schema on a piece of paper, trying to keep it in third normal form. I then plugged it into MySQL Workbench and it made it all pretty for me:
(Schema diagram; source: maian.org)
Example queries I'll be running
Which clients with credit still left are inactive (those without a class scheduled in the future)
What is the attendance rate per client/department/division (measured by the status id in each session)
How many classes has a teacher had in a month
Flag clients who have low attendance rate
Custom reports for HR departments with attendance rates of people in their division
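As a rough sketch of how the first two reports might be written, the snippet below uses Python's built-in sqlite3 module as a stand-in for MySQL. The real schema is only shown in the diagram, so every table and column name here (clients, packs, sessions, credit, starts_at, status_id) is an assumption, as is the idea that a session which has not happened yet has a NULL status.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the office MySQL server
conn.executescript("""
CREATE TABLE clients  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE packs    (id INTEGER PRIMARY KEY,
                       client_id INTEGER REFERENCES clients(id),
                       credit INTEGER);
CREATE TABLE sessions (id INTEGER PRIMARY KEY,
                       pack_id INTEGER REFERENCES packs(id),
                       starts_at TEXT,        -- ISO date, so text comparison works
                       status_id INTEGER);    -- 0 = completed, 1 = cancelled
INSERT INTO clients VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO packs VALUES (10, 1, 3), (20, 2, 5);
INSERT INTO sessions VALUES
  (100, 10, '2000-01-10', 0),
  (101, 10, '2000-01-17', 1),
  (200, 20, '2000-01-10', 0),
  (201, 20, '2999-01-01', NULL);  -- scheduled in the future, no status yet
""")

today = "2024-01-01"  # passed in explicitly so the example is deterministic

# Clients with credit left and no session scheduled in the future
inactive = [name for (name,) in conn.execute("""
    SELECT DISTINCT c.name
    FROM clients c
    JOIN packs p ON p.client_id = c.id
    WHERE p.credit > 0
      AND NOT EXISTS (
          SELECT 1
          FROM sessions s
          JOIN packs p2 ON p2.id = s.pack_id
          WHERE p2.client_id = c.id AND s.starts_at > ?)
    ORDER BY c.name
""", (today,))]

# Attendance rate per client: share of sessions with a status that were completed
rates = conn.execute("""
    SELECT c.name, AVG(s.status_id = 0) AS attendance_rate
    FROM clients c
    JOIN packs p    ON p.client_id = c.id
    JOIN sessions s ON s.pack_id = p.id
    WHERE s.status_id IS NOT NULL
    GROUP BY c.id, c.name
    ORDER BY c.name
""").fetchall()

print(inactive)
print(rates)
```

Both reports need only two or three joins, which is well within what a small reporting database should handle.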
Question(s)
Is this overengineered or am I headed the right way?
Will the need to join multiple tables for most queries result in a big performance hit?
I have added a 'lastsession' column to clients, as it is probably going to be a common query. Is this a good idea or should I keep the database strictly normalised?
Thanks for your time
Some more answers to your questions:
1) You're pretty much on target for someone who is approaching a problem like this for the first time. I think the pointers from others on this question thus far pretty much cover it. Good job!
2 & 3) The performance hit you will take will largely be dependent on having and optimizing the right indexes for your particular queries / procedures and more importantly the volume of records. Unless you are talking about well over a million records in your main tables you seem to be on track to having a sufficiently mainstream design that performance will not be an issue on reasonable hardware.
That said, and this relates to your question 3, with the start you have you probably shouldn't really be overly worried about performance or hyper-sensitivity to normalization orthodoxy here. This is a reporting server you are building, not a transaction based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. Not only does a report server function have more tolerance for complex and lengthy queries, but the strategies to improve performance are much different.
For example, in a transaction based application environment your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this but you can have an even greater impact on performance by introducing a snapshot mechanism where a scheduled process runs and stores pre-configured reports and your users access the snapshot data with no stress on your db tier on a per request basis.
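The snapshot idea can be sketched roughly as follows, again with sqlite3 standing in for MySQL. The table report_teacher_monthly and the function refresh_snapshot are hypothetical names; the point is only that a scheduled job rebuilds a small pre-computed table and the report readers query that table instead of the raw data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (id INTEGER PRIMARY KEY, teacher TEXT, month TEXT);
INSERT INTO sessions (teacher, month) VALUES
  ('Smith', '2024-01'), ('Smith', '2024-01'), ('Jones', '2024-01');

-- Pre-computed report: one row per teacher and month
CREATE TABLE report_teacher_monthly (
    teacher TEXT, month TEXT, session_count INTEGER,
    PRIMARY KEY (teacher, month));
""")

def refresh_snapshot(conn):
    """Rebuild the pre-computed report; meant to be run by a scheduled job."""
    with conn:  # single transaction, so readers never see a half-built snapshot
        conn.execute("DELETE FROM report_teacher_monthly")
        conn.execute("""
            INSERT INTO report_teacher_monthly
            SELECT teacher, month, COUNT(*)
            FROM sessions
            GROUP BY teacher, month
        """)

refresh_snapshot(conn)

# Report users read the small snapshot table, not the raw sessions
snapshot = conn.execute("""
    SELECT teacher, session_count
    FROM report_teacher_monthly
    ORDER BY teacher
""").fetchall()
print(snapshot)
```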
All of this is a long-winded rant to illustrate that what design principles and tricks you employ may differ given the role of the db you're creating. I hope that's helpful.
You've got the right idea. You can however clean it up, and remove some of the mapping (has*) tables.
What you can do is in the Departments table, add CityId and DivisionId.
Besides that, I think everything is fine...
The only changes I would make are:
1- Change your VARCHAR to NVARCHAR, if you might be going international, you may want unicode.
2- Change your int IDs to GUIDs (uniqueidentifier) if possible (this might just be my personal preference). Assuming you eventually get to the point where you have multiple environments (dev/test/staging/prod), you may want to migrate data from one to the other. Having GUID IDs makes this significantly easier.
3- Three layers for your Company -> Division -> Department structure may not be enough. Now, this might be over-engineering, but you could generalize that hierarchy such that you can support n-levels of depth. This will make some of your queries more complex, so that may not be worth the trade-off. Further, it could be that any client that has more layers may be easily "stuffable" into this model.
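One common way to generalize the hierarchy in point 3 is a single self-referencing table queried with a recursive common table expression. The sketch below is only an illustration: org_units and its columns are made-up names, and sqlite3 is used as a stand-in. Recursive CTEs are available in MySQL from 8.0 onwards, so on an older server this query would not work as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE org_units (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES org_units(id),  -- NULL for a top-level company
    name      TEXT NOT NULL
);
INSERT INTO org_units VALUES
  (1, NULL, 'ACME'),
  (2, 1,    'ACME Healthcare'),
  (3, 1,    'ACME Bodycare'),
  (4, 2,    'HR'),
  (5, NULL, 'Globex'),
  (6, 5,    'Marketing');
""")

def subtree(conn, root_id):
    """Return (name, depth) for a unit and everything below it, any depth."""
    return conn.execute("""
        WITH RECURSIVE tree(id, name, depth) AS (
            SELECT id, name, 0 FROM org_units WHERE id = ?
            UNION ALL
            SELECT o.id, o.name, t.depth + 1
            FROM org_units o
            JOIN tree t ON o.parent_id = t.id
        )
        SELECT name, depth FROM tree ORDER BY depth, name
    """, (root_id,)).fetchall()

acme = subtree(conn, 1)
print(acme)
```

This is the query complexity trade-off mentioned above: every roll-up report now needs the recursive step, where the fixed three-level design needs only plain joins.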
4- You also have a Status in the Client Table that is a VARCHAR and has no link to the Statuses table. I'd expect a little more clarity there as to what the Client Status represents.
No. It looks like you're designing at a good level of detail.
I think that Countries and Companies are really the same entity in your design, as are Cities and Divisions. I'd get rid of the Countries and Cities tables (and Cities_Has_Departments) and, if necessary, add a boolean flag IsPublicSector to the Companies table (or a CompanyType column if there are more choices than simply Private Sector / Public Sector).
Also, I think there's an error in your usage of the Departments table. It looks like the Departments table serves as a reference to the various kinds of departments that each customer division can have. If so, it should be called DepartmentTypes. But your clients (who are, I assume, attendees) do not belong to a department TYPE, they belong to an actual department instance in a company. As it stands now, you will know that a given client belongs to an HR department somewhere, but not which one!
In other words, Clients should be linked to the table that you call Divisions_Has_Departments (but that I would call simply Departments). If this is so, then you must collapse Cities into Divisions as discussed above if you want to use standard referential integrity in the database.
By the way, it's worth noting that if you're generating CSVs already and want to load them into a MySQL database, LOAD DATA LOCAL INFILE is your best friend: http://dev.mysql.com/doc/refman/5.1/en/load-data.html . mysqlimport is also worth looking into; it is a command-line tool that's basically a nice wrapper around LOAD DATA INFILE.
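If the import is done from Python instead, as the question suggests, a minimal version could look something like this. It is a sketch under assumptions: the CSV columns and the staging_sessions table are invented, an in-memory string stands in for the nightly export file, and sqlite3 stands in for MySQL.

```python
import csv
import io
import sqlite3

# Stand-in for the nightly export; in practice this would be an opened CSV file
export = io.StringIO(
    "session_id,client,teacher,status\n"
    "1,Alice,Smith,0\n"
    "2,Bob,Jones,1\n"
)

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE staging_sessions (
        session_id INTEGER PRIMARY KEY,
        client     TEXT,
        teacher    TEXT,
        status     INTEGER)
""")

def load_export(conn, fileobj):
    """Load one export into the staging table; safe to re-run on the same file."""
    rows = [
        (int(r["session_id"]), r["client"], r["teacher"], int(r["status"]))
        for r in csv.DictReader(fileobj)
    ]
    with conn:  # all rows are committed together, or none are
        # OR REPLACE keeps a repeated nightly import from duplicating rows
        conn.executemany(
            "INSERT OR REPLACE INTO staging_sessions VALUES (?, ?, ?, ?)", rows)
    return len(rows)

loaded = load_export(conn, export)
total = conn.execute("SELECT COUNT(*) FROM staging_sessions").fetchone()[0]
print(loaded, total)
```

Loading into a staging table first and then filling the normalized tables from it keeps the import step simple and easy to re-run.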
Most things have already been said, but I feel that I can add one thing: it is quite common for younger developers to worry about performance a little bit too much up-front, and your question about joining tables seems to go into that direction. This is a software development anti-pattern called 'Premature Optimization'. Try to banish that reflex from your mind :)
One more thing: Do you believe you really need the 'cities' and 'countries' tables? Wouldn't having a 'city' and 'country' column in the departments table suffice for your use cases? E.g. does your application need to list departments by city and cities by country?
Following comments based on role as a Business Intelligence/Reporting specialist and strategy/planning manager:
I agree with Larry's direction above. IMHO, it's not so much over-engineered; some things just look a little out of place. To keep it simple, I would tag each client directly with a Company ID, Department Description, Division Description, Department Type ID and Division Type ID. Use Department Type ID and Division Type ID as references to lookup tables and as internal reporting/analysis fields for long-term consistency.
The Packs table contains a "Credit" column. Shouldn't that actually be tied to the Client base table, so that if they have many packs you can see how much credit is left for future classes? The application can take care of the calculation and store it centrally in the Client table.
Company info could use many more fields, including the obvious address/phone/etc. information. I'd also be prepared to add in D&B "DUNs" columns (Site/Branch/Ultimate) long term, Dun and Bradstreet (D&B) has a huge catalog of companies and you'll find later down the road their information is very helpful for reporting/analysis. This will take care of the multiple division issue you mention, and allow you to roll up their hierarchy for sub/division/branches/etc. of large corps.
You don't mention how many records you'll be working with, which could mean you are setting yourself up for a large development initiative that could have been done more quickly, and with far fewer headaches, using prepackaged "reporting" software. If you're not dealing with a large database (fewer than 65,000 rows), make sure MS Access, OpenOffice Base or a related report/app development solution couldn't do the trick. I use Oracle's free APEX software quite a bit myself; it comes with their free database, Oracle XE, which you can download from their site.
FYI - Reporting insight: for large databases, you typically have two database instances: a) a transaction database for recording each detailed record, and b) a reporting database (data mart/data warehouse) housed on a separate machine. For more information, search for both Star Schema and Snowflake Schema.
Regards.
I want to address only the concern that joining multiple tables will cause a performance hit. Do not be afraid to normalize because you will have to do joins. Joins are normal and expected in relational databases, and those databases are designed to handle them well. You will need to set PK/FK relationships (for data integrity, which is important to consider when designing), but in many databases FKs are not automatically indexed. Since they will be used in the joins, you will definitely want to start by indexing the FKs. PKs generally get an index on creation, as they have to be unique. It is true that data warehouse design reduces the number of joins, but usually one doesn't get to the point of data warehousing until millions of records need to be accessed in one report. Even then, almost all data warehouses start with a transactional database that collects the data in real time, and the data is then moved to the warehouse on a schedule (nightly, monthly, or whatever the business need is). So this is a good start even if you need to design a data warehouse later to improve report performance.
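The effect of indexing a foreign key column can be checked with the database's query plan. The sketch below uses sqlite3 and its EXPLAIN QUERY PLAN statement as a stand-in (MySQL has EXPLAIN for the same purpose); the table and index names are made up. As far as I know, MySQL's InnoDB engine creates an index on a foreign key column automatically when the constraint is declared, so there the explicit CREATE INDEX matters mainly when no FK constraint is defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sessions (id INTEGER PRIMARY KEY,
                       client_id INTEGER REFERENCES clients(id),
                       status_id INTEGER);
""")

def plan(sql):
    """Return the planner's description of how it would run the query."""
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM sessions WHERE client_id = 1"

before = plan(query)  # no index on client_id yet, so the table is scanned
conn.execute("CREATE INDEX idx_sessions_client_id ON sessions(client_id)")
after = plan(query)   # the planner can now search via the new index

print(before)
print(after)
```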
I must say your design is impressive for a first year CS student.
It isn't over-engineered, this is how I would approach the problem. Joining is fine, there won't be much of a performance hit (it's completely necessary unless you de-normalise the database out which isn't recommended!). For statuses, see if you can use an enum datatype instead to optimise that table out.
I've worked in the training / school domain and I thought I'd point out that there's generally a M:1 relationship between what you call "sessions" (instances of a given course) and the course itself. In other words, your catalog offers the course ("Spanish 101" or whatever), but you might have two different instances of it during a single semester (Tu-Th taught by Smith, Wed-Fri taught by Jones).
Other than that, it looks like a good start. I bet you'll find that the client domain (graphs leading to "clients") is more complex than you've modeled, but don't go overboard with that until you've got some real data to guide you.
A few things came to mind:
The tables seemed geared to reporting, but not really running the business. I would think when a client signs up, there's essentially an order being placed for the client attending a list of sessions, and that order might be for multiple employees in one company. It would seem an "order" table would really be at the center of your system and driving your data capture and eventual reporting. (Compare the paper documents you've been using to run the business with your database design to see if there's a logical match.)
Companies often don't have divisions. Employees sometimes change divisions/departments, maybe even mid-session. Companies sometimes add/delete/rename divisions/departments. Make sure the possible realtime changing contents of your tables doesn't make subsequent reporting/grouping difficult. With so much contact data split over so many tables, you might have to enforce very strict data entry validation to keep your reports meaningful and inclusive. Eg, when a new client is added, making sure his company/division/department/city match the same values as his coworkers.
The "packs" concept isn't clear at all.
Since you indicate it's a small business, it would be surprising if performance would be an issue, considering the speed and capacity of current machines.
From mongoDB docs:
When would MySQL be a better fit?
A concrete example would be the booking engine behind a travel
reservation system, which also typically involves complex
transactions. While the core booking engine might run on MySQL, those
parts of the app that engage with users – serving up content,
integrating with social networks, managing sessions – would be better
placed in MongoDB
Two things I don't understand in this (not even a little) concrete example:
What kind of queries are complex enough to be better suited to MySQL? (A concrete example of such a query would be of great help.)
Where is the line that separates the "core booking engine" from the "parts of the app that engage with users"?
My concern is not theoretical, as we use both MySQL and MongoDB in our app, and a better understanding of the above would really help us in designing our DB models for future features.
MySQL is ACID compliant (assuming you're using InnoDB or similar); MongoDB is not, at least not across multiple documents (single-document writes are atomic). Read the MongoDB docs about atomicity here:
MongoDB Atomicity
Think about going to the grocery store checkout, and that the POS system is using MySQL. What steps might take place in a single transaction?
Item scanned, price retrieved
Inventory updated, quantity on hand is reduced by 1
Department metrics updated (add dollar amount, quantity, item type, etc)
Is the item on sale? Show how much money the customer saved on the receipt
Customer used a coupon, make sure we notify the vendor so we get reimbursed
Send receipt total to accounting, update month / year / week stats
Now it's time to pay. OOPS! Customer left wallet at home, and says he'll come back later. We've made all these changes to many database tables, now what do we do? If we were using MySQL and had all these updates in a single transaction, we could just rollback that one transaction and no harm is done. All changes will be reverted automatically, and in the correct order.
Doing that in a non-transactional database means writing code to backtrack through all those changes, in the correct order.
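A minimal sketch of the rollback behaviour described above, using Python's sqlite3 module as a stand-in for MySQL. The tables and the checkout function are invented for illustration and cover only two of the checkout steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory  (sku TEXT PRIMARY KEY, on_hand INTEGER);
CREATE TABLE dept_sales (department TEXT PRIMARY KEY, total_cents INTEGER);
INSERT INTO inventory  VALUES ('MILK', 10);
INSERT INTO dept_sales VALUES ('dairy', 0);
""")

def checkout(conn, payment_ok):
    """Apply all checkout updates in one transaction; undo them if payment fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE inventory SET on_hand = on_hand - 1 WHERE sku = 'MILK'")
            conn.execute(
                "UPDATE dept_sales SET total_cents = total_cents + 199 "
                "WHERE department = 'dairy'")
            if not payment_ok:
                raise RuntimeError("customer left wallet at home")
    except RuntimeError:
        return False
    return True

paid = checkout(conn, payment_ok=False)
on_hand = conn.execute("SELECT on_hand FROM inventory").fetchone()[0]
total = conn.execute("SELECT total_cents FROM dept_sales").fetchone()[0]
print(paid, on_hand, total)  # both updates were undone together
```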
MongoDB is good for document storage and retrieval. It wouldn't be my first choice for creating small pieces of a document a little at a time, where you want to store bits and pieces of information in separate places.
How do we use MongoDB in our grocery store example? We could use it as part of an inventory system.
Our MySQL inventory could have a schema of things we absolutely MUST have --- SKU, price, department. We don't necessarily want to clutter it up with things that we don't often need to know, however, by adding columns such as 'Easter_2016_Promotion'. In MongoDB, since we don't have a schema that's set in stone, this isn't a problem.
Something like
db.inventory.update(
{ _id: 1 },
{ $set: { "Easter_2016": "y" } }
)
Could add the "Easter_2016" field to a single inventory item without affecting any of the others. In MySQL, you affect every row in a table by adding a single column --- not so in MongoDB. Additionally, when querying Mongo, you can search all records (documents) for a field that MAY or MAY not exist. In MySQL, the field either exists or it doesn't.
MongoDB is built for schemas that are fluid, dynamic, and (potentially) somewhat unknown. Its speed relies partly on the fact that there aren't monolithic transactions that it may have to undo, and partly on the fact that there isn't a schema to constantly validate against when inserting.
Need to analyze 100,000 receipt JSON files from our POS system? Just run mongoimport and start querying for what you want.
Need to add some special data for just a few inventory items or flag a handful of customers as 'special handling'? MongoDB for this as well.
Need to import and query tax returns from 20 different states (think: different field names, different number of fields, with a few overlaps)? Mongo wins here, hands down.
However, anything that has several known, concrete steps that MUST work, and work in the proper sequence (think: ATM), is a better fit for MySQL.
A query with multiple joins would be a good example. The main idea behind this point is that in a relational DB, m:n relations are symmetrical, whilst in a document-oriented DB they are not. Since v3.2, MongoDB has had $lookup, which addresses this issue to some degree.
The line between the core booking engine and the representation engine is drawn by the CAP theorem. The core part must be consistent, while the client-facing part can be implemented with eventual consistency. A recommended workaround for the lack of atomic transactions in MongoDB should shed some light on this statement. Alternatively, your core booking part can use event sourcing to keep state consistent without transactions.
I'm building a PHP app to prefill third party PDF account forms with client data, and am getting stuck on the database design.
The current form has about 70 fields, which seems like far too many to set up as individual columns, especially as some (e.g. company/trust information) are not relevant, depending on the type of account the client requires.
I've tried to normalize but it seems like there would be a lot of joins, and also require several sub queries for things like multiple addresses.
It also means a ton of extra queries to check if rows exist or not when updating to decide if the script needs to do an INSERT, a DELETE or an UPDATE, whereas if it was all in one row, it would basically just be an UPDATE each time.
Not sure if this helps but here is a list of most of the fields:
id, account_type, account_phone, account_email, account_designation, account_adviser, account_source, account_complete,
account_residential_unit_number, account_residential_street_number, account_residential_street_name, account_residential_street_type, account_residential_suburb, account_residential_state, account_residential_postcode,
account_postal_unit_number, account_postal_street_number, account_postal_street_name, account_postal_street_type, account_postal_suburb, account_postal_state, account_postal_postcode,
individual_1_title, individual_1_firstname, individual_1_middlename, individual_1_lastname, individual_1_dob, individual_1_occupation, individual_1_email, individual_1_phone,
individual_1_unit_number, individual_1_street_number, individual_1_street_name, individual_1_street_type, individual_1_suburb, individual_1_state, individual_1_postcode,
individual_2_title, individual_2_firstname, individual_2_middlename, individual_2_lastname, individual_2_dob, individual_2_occupation, individual_2_email, individual_2_phone,
individual_2_unit_number, individual_2_street_number, individual_2_street_name, individual_2_street_type, individual_2_suburb, individual_2_state, individual_2_postcode,
company_name, company_date,
company_unit_number, company_street_number, company_street_name, company_street_type, company_suburb, company_state, company_postcode,
trust_name, trust_date,
settlement_bank, settlement_account, settlement_bsb
The most this will need to handle is around 200,000 applications, and once the data is in the database, it won't change very often, if at all - not sure if that is relevant?
So really I just wanted to figure out the smartest way to design this, even if it's just a name or topic to research further.
Generally speaking you can divide a database into two broad categories:
OLTP Systems
Online Transaction Processing systems are normally write-intensive, i.e. there are many updates compared to reads of the data. Such a system is typically a day-to-day application used by business users of all kinds, e.g. for data capture or administration. These databases are usually normalized to the extreme and then denormalized in certain areas for performance gains.
OLAP/DSS Systems
Online Analytical Processing systems are normally large, data-warehouse-like databases used to support analytic activities such as data mining and data cubes. Typically the information is used by a more limited set of users than in OLTP. These databases are normally heavily denormalized.
Go read here for a short description of these and the main differences.
OLTP VS OLAP
Regarding your INSERT/UPDATE/DELETE point, go and read about MySQL's INSERT ... ON DUPLICATE KEY UPDATE statement, which will resolve that issue for you easily. Many other database systems offer the same thing as a MERGE (or "upsert") operation.
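To show the shape of such an "insert or update" in one statement, here is a small sketch using sqlite3, whose equivalent syntax is INSERT ... ON CONFLICT ... DO UPDATE (this needs SQLite 3.24 or newer). The accounts table is a cut-down, hypothetical version of the field list in the question. In MySQL the tail of the statement would be written as ON DUPLICATE KEY UPDATE instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id            INTEGER PRIMARY KEY,
        account_email TEXT,
        account_phone TEXT)
""")

# SQLite spelling of an upsert. The MySQL form would end with:
#   ON DUPLICATE KEY UPDATE account_email = VALUES(account_email), ...
UPSERT = """
    INSERT INTO accounts (id, account_email, account_phone)
    VALUES (?, ?, ?)
    ON CONFLICT(id) DO UPDATE SET
        account_email = excluded.account_email,
        account_phone = excluded.account_phone
"""

with conn:
    conn.execute(UPSERT, (1, "old@example.com", "555-0100"))  # row is inserted
    conn.execute(UPSERT, (1, "new@example.com", "555-0100"))  # same id: updated

rows = conn.execute("SELECT * FROM accounts").fetchall()
print(rows)
```

No existence check is needed beforehand; the key constraint decides whether the row is inserted or updated.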
Now, I don't understand why you are worried about JOINs. I have had tables with hundreds of millions of rows (500,000,000+) that I joined with other tables that were also large, and the queries ran very fast. So designing a database to eliminate joins is NOT a good idea.
My suggestion is:
If designing an OLTP system, normalise as much as possible, then denormalise to increase performance where needed. For an OLAP system, look at star schemas etc. and don't even bother with normalising it first. Oh, by the way, most OLAP systems normally use an OLTP system as a data source.
Usually I normalise and then denormalise for performance. However
If I didn't have too much validation to do, e.g. valid address, duplicated individual
And I didn't want to reuse parts of the data for another version of the form, e.g. select an existing individual, name and address etc
And I didn't want to analyse it, e.g. find all mentions of Fred Bloggs
And my users were happy with entering all of this on one form (I wouldn't be)
Then I'd go with denormalised from the get-go.
The thing is, if you normalise, then denormalising later if required is fairly trivial and low risk, whereas normalising denormalised data usually means de-duplication, which is likely to be really painful both data-wise and design-wise.
Normalize your input, de-normalize the output. Meaning, for reporting, extract your data into a de-normalized format like Mongo and use that for querying. Or, create rollups of some sort. I have found that, with large datasets, extracting the reporting data from the input data gives the best efficiency.
I find denormalized data extremely painful to work with at a very basic level. What if I want a tally of the number of people who live in Georgia. In your denormalized structure I would have to count where ind_1_state = GA or ind_2_state = GA.
This is not too bad I guess, but to anyone who has seen the ease of querying that normalization provides, it is quite painful.
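The Georgia tally can be compared side by side in a small sketch. The two tables below are simplified, hypothetical versions of the denormalized layout from the question and a normalized alternative, with sqlite3 standing in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized: one row per application, one column group per individual
CREATE TABLE applications_flat (
    id INTEGER PRIMARY KEY,
    individual_1_state TEXT,
    individual_2_state TEXT);
INSERT INTO applications_flat VALUES
  (1, 'GA', 'GA'), (2, 'GA', NULL), (3, 'FL', 'NY');

-- Normalized: one row per individual
CREATE TABLE individuals (
    id INTEGER PRIMARY KEY,
    application_id INTEGER,
    state TEXT);
INSERT INTO individuals (application_id, state) VALUES
  (1, 'GA'), (1, 'GA'), (2, 'GA'), (3, 'FL'), (3, 'NY');
""")

# Denormalized: every repeated column group has to be counted separately
flat_count = conn.execute("""
    SELECT (SELECT COUNT(*) FROM applications_flat WHERE individual_1_state = 'GA')
         + (SELECT COUNT(*) FROM applications_flat WHERE individual_2_state = 'GA')
""").fetchone()[0]

# Normalized: one predicate on one column
normalized_count = conn.execute(
    "SELECT COUNT(*) FROM individuals WHERE state = 'GA'").fetchone()[0]

print(flat_count, normalized_count)
```

Adding an individual_3 to the form would mean changing every query of the first kind, while the second query would stay the same.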
Normalization establishes the foundation for more and more complex queries. Without it, you will find it increasingly difficult to implement richer data analysis.
Normalization also provides the basis for integrity and consistency in your database. If you have all the occurrences of a particular thing ( state abbreviations ) in one place ( one column ) you can easily check and constrain those values to not allow nonexistent codes.
The rationale for normalization goes on and on, but I hope I hit a few no brainers.
This is a no-brainer: all you have now is a noun-soup which you have shoved into a single table-storage-shoebox, with some ID glued to the beginning of each row.
Create some kind of schema. If this is more like a OLAP -- and you decide for star schema -- it will have dimensions in 2-5 NF and facts in 2-6 NF. For OLTP (or different warehouse models) aim for BCNF - 6NF.
I would argue that you do not even have 1NF here, gluing that ID at the beginning does not count as preventing duplicates. Therefore, you can not de-normalize from this point even if you wanted to :) -- ok, maybe you could put some comma-separated list somewhere to make things definitely not in 1NF.
Joins are what relational databases do, so do not worry about that.
I'm struggling to find the best way to build out a structure that will work for my project. The answer may be simple but I'm struggling due to the massive number of columns or tables, depending on how it's set up.
We have several tools, each that can be run for many customers. Each tool has a series of questions that populate a database of answers. After the tool is run, we populate another series of data that is the output of the tool. We have roughly 10 tools, all populating a spreadsheet of 1500 data points. Here's where I struggle... each tool can be run multiple times, and many tools share the same data point. My next project is to build an application that can begin data entry for a tool, but allow import of data that shares the same datapoint for a tool that has already been run.
A simple example:
Tool 1 - company, numberofusers, numberoflocations, cost
Tool 2 - company, numberofusers, totalstorage, employeepayrate
So if the same company completed tool 1, I need to be able to populate "numberofusers" (or offer to populate) when they complete tool 2 since it already exists.
I think what it boils down to is, would it be better to create a structure that has 1500 tables, 1 for each data element with additional data around each data element, or to create a single massive table - something like...
customerID(FK), EventID(fk), ToolID(fk), numberofusers, numberoflocations, cost, total storage, employee pay,.....(1500)
If I go this route and have one large table I'm not sure how that will impact performance. Likewise - how difficult it will be to maintain 1500 tables.
Another dimension is that it would be nice to have a description of each field:
numberofusers,title,description,active(bool). I assume this is only possible if each element is in its own table?
Thoughts? Suggestions? Sorry for the lengthy question, new here.
Build a main table with all the common data: company, # users, .. other stuff. Give each row a unique id.
Build a table for each unique tool with the company id from above and any data unique to that implementation. Give each table a primary (unique) key for 'tool use' and 'company'.
This covers the common data in one place, identifies each 'customer' and provides for multiple uses of a given tool for each customer. Every use and customer is trackable and distinct.
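A rough sketch of this layout, using the two example tools from the question. The table names (companies, tool1_runs, tool2_runs) and the helper prefill_for_tool2 are assumptions made up for the illustration, and sqlite3 stands in for whatever database is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared data points live once, in the main table
CREATE TABLE companies (
    id INTEGER PRIMARY KEY,
    name TEXT,
    numberofusers INTEGER);

-- One table per tool, holding only what is unique to that tool;
-- several rows per company allow a tool to be run more than once
CREATE TABLE tool1_runs (
    id INTEGER PRIMARY KEY,
    company_id INTEGER REFERENCES companies(id),
    numberoflocations INTEGER,
    cost REAL);
CREATE TABLE tool2_runs (
    id INTEGER PRIMARY KEY,
    company_id INTEGER REFERENCES companies(id),
    totalstorage INTEGER,
    employeepayrate REAL);

INSERT INTO companies VALUES (1, 'ACME', 250);
INSERT INTO tool1_runs (company_id, numberoflocations, cost) VALUES (1, 4, 1200.0);
""")

def prefill_for_tool2(conn, company_id):
    """Offer values already known for this company when tool 2 is started."""
    row = conn.execute(
        "SELECT numberofusers FROM companies WHERE id = ?", (company_id,)
    ).fetchone()
    return {"numberofusers": row[0]} if row else {}

known = prefill_for_tool2(conn, 1)
unknown = prefill_for_tool2(conn, 2)
print(known, unknown)
```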
More about normalization here.
I agree with etherbubunny on normalization but with larger datasets there are performance considerations that quickly become important. Joins which are often required in normalized databases to display human readable information can be performance killers on even medium sized tables which is why a lot of data warehouse models use de-normalized datasets for reporting. This is essentially pre-building the joined reporting data into new tables with heavy use of indexing, archiving and partitioning.
In many cases smart use of partitioning on its own can also effectively help reduce the size of the datasets being queried. This usually takes quite a bit of maintenance unless certain parameters remain fixed though.
Ultimately in your case (and most others) I highly recommend building it the way you are able to maintain and understand what is going on and then performing regular performance checks via slow query logs, explain, and performance monitoring tools like percona's tool set. This will give you insight into what is really happening and give you some data to come back here or the MySQL forums with. We can always speculate here but ultimately the real data and your setup will be the driving force behind what is right for you.
So this is more of a design question.
I have one primary key (say the user's ID), and I have tons of information associated with that user.
Should I have multiple tables broken down into categories according to the information, or should I have just one table with many columns?
The way I used to do it was to have multiple tables, so say, one table for application usage data, one table for profile info, one table for back end tokens etc. to keep things looking organized.
Recently someone told me that it's better not to do it that way and that having a table with lots of columns is fine. The thing is, all those columns have the same primary key.
I'm pretty new to database design so which approach is better and what are the pros and cons?
What's the conventional way of doing it?
Any time information is one-to-one (each user has one name and password), then it's probably better to have it in one table, since it reduces the number of joins the database will need to do to retrieve results. I think some databases have a limit on the number of columns per table, but I wouldn't worry about it in normal cases, and you can always split it later if you need to.
If the data is one-to-many (each user has thousands of rows of usage info), then it should be split into separate tables to reduce duplicate data (duplicate data wastes storage space, cache space, and makes the database harder to maintain).
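As a small illustration of the two cases, the sketch below keeps the one-to-one fields on the users table and puts the one-to-many usage data in its own table. All names are invented, and sqlite3 is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One-to-one data: stays on the user row
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT,
    password_hash TEXT);

-- One-to-many data: one row per event, linked back to the user
CREATE TABLE usage_events (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    action TEXT);

INSERT INTO users VALUES (1, 'ann', 'x'), (2, 'bo', 'y');
INSERT INTO usage_events (user_id, action) VALUES
  (1, 'login'), (1, 'export'), (1, 'logout');
""")

# LEFT JOIN keeps users that have no usage rows at all
counts = conn.execute("""
    SELECT u.name, COUNT(e.id)
    FROM users u
    LEFT JOIN usage_events e ON e.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY u.name
""").fetchall()
print(counts)
```

The user's name and password hash are stored once, however many usage rows accumulate.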
You might find the Wikipedia article on database normalization interesting, since it discusses the reasons for this in depth:
Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.
Denormalization is also something to be aware of, because there are cases where repeating data is better (since it reduces the amount of work the database needs to do when reading data). I'd highly recommend making your data as normalized as possible to start out, and only denormalize if you're aware of performance problems in specific queries.
One big table is often a poor choice. Related tables are what relational databases were designed to work with. If you index properly and know how to write performant queries, they are going to perform fine.
When tables get too many columns, you can run into issues with the actual size of the page that the database stores the information on. Either the record can end up being too large for the page, in which case you may not be able to create or update a specific record (which makes users unhappy), or you may (in SQL Server at least) be allowed some overflow for particular datatypes (with a set of rules you need to look up if you are doing this). But if many records overflow the page size, you can create tremendous performance problems. How MySQL handles pages, and whether you have a problem when the potential page size gets too large, is something you would have to look up in the documentation for that database.
Came across this, and as someone who used to use MySQL a lot, and then switched over to Postgres recently, one of the big advantages is that you can add JSON objects to a field in Postgres.
So if you are in this situation, you don't have to necessarily decide between one large table with many columns and splitting it up, but you can merge columns into JSON objects to reduce it e.g. instead of address being 5 columns, it can just be one. You can also query on that object too.
I have a good example. Overly Normalized database with the following set of relationships:
people -> rel_p2staff -> staff
and
people -> rel_p2prosp -> prospects
Where people has names and personal details, staff has just the staff record details, prospects has just the prospect details, and the rel tables are relationship tables with foreign keys from people linking to staff and prospects.
This sort of design carries on for the entire database.
To query this set of relations it's a multi-table join every time, sometimes a join of 8 or more tables. It worked fine until the middle of this year, when it started getting very slow now that we have passed 40000 records of people.
Indexing and all the low-hanging fruit had been used up last year, and all queries are optimized to perfection. This is the end of the road for this particular normalized design, and management has now approved a rebuild of the entire application that depends on it, as well as a restructure of the database, over a term of 6 months. $$$$ Ouch.
The solution will be to have a direct relation for people -> staff and people -> prospect
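A rough sketch of the difference, using Python's built-in sqlite3 module. The table names people, staff and rel_p2staff come from the description above; the column names and the staff_direct table are assumptions made up for the illustration, and this is not the actual schema being discussed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Original style: staff is reached through a relationship table.
    CREATE TABLE staff (id INTEGER PRIMARY KEY, job_title TEXT NOT NULL);
    CREATE TABLE rel_p2staff (
        person_id INTEGER NOT NULL REFERENCES people(id),
        staff_id  INTEGER NOT NULL REFERENCES staff(id)
    );

    -- Direct style (hypothetical table): the staff row points at its person.
    CREATE TABLE staff_direct (
        id        INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL UNIQUE REFERENCES people(id),
        job_title TEXT NOT NULL
    );

    INSERT INTO people VALUES (1, 'Ann');
    INSERT INTO staff VALUES (10, 'Trainer');
    INSERT INTO rel_p2staff VALUES (1, 10);
    INSERT INTO staff_direct VALUES (10, 1, 'Trainer');
""")

# Through the relationship table: two joins to get from a person to staff details.
via_link = conn.execute("""
    SELECT p.name, s.job_title
    FROM people p
    JOIN rel_p2staff r ON r.person_id = p.id
    JOIN staff s ON s.id = r.staff_id
""").fetchall()

# Direct relation: one join returns the same rows.
direct = conn.execute("""
    SELECT p.name, s.job_title
    FROM people p
    JOIN staff_direct s ON s.person_id = p.id
""").fetchall()

print(via_link, direct)
```

Each relationship table removed this way takes one join out of every query that crosses it, which adds up when queries span 8 or more tables.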
Ask yourself these questions: If you put everything in one table, will you have multiple rows for that user? If you have to update a user, do you want to keep an audit trail? Can the user have more than one instance of a data element (a phone number, for instance)? Will you have a case where you might want to add an element or set of elements later?
If you answer yes to any of these, then most likely you want child tables with foreign key relationships.
The pros of parent/child tables are data integrity, performance via indexes (yes, you can index a flat table too) and, IMO, easier maintenance if you need to add a field later, especially if it will be a required field.
The cons: design is harder, and queries become slightly more complex.
But there are many cases where one big flat table will be appropriate, so you have to look at your situation to decide.
I've already done some database design. For me, it depends on how difficult the system is to manage. Yes, it is true that each piece of data should live in one place only, but it is really hard to write queries against an overly normalized database with lots of records. Just combine the two approaches: use one huge table if you feel you'll have a massive number of records that are hard to maintain, as with Facebook, Gmail, etc., and use a separate table for each set of records in a simple system. Well, this is just my opinion. I hope it helps. Just do it, you can do it :)
The conventional way of doing this would be to use different tables, as in a star schema or snowflake schema. However, I would make this strategy twofold. I believe in the theory that data should only exist in one place, and therefore the schema I mentioned would work well. However, I also believe that for reporting engines and BI suites a columnar approach would be hugely beneficial, because it is more supportive of the reporting needs. Columnar approaches like those from infobright.org have huge performance gains and compression, which makes using both approaches incredibly useful. A lot of companies are starting to realize that having just one database architecture in the organization does not support the full range of their needs, and a lot of them are adopting more than one database architecture.
I think having a single table is more effective, but you should make sure the table is organised so that it shows the relationships, the trends and the differences between variables in the same row.
For example, if the table shows the ages and grades of students, you should arrange it so that the highest scorer is clearly distinguished from the lowest scorer and the difference in the students' ages is even.
I am working on an app right now which has the potential to grow quite large. The whole application runs through a single domain, with customers being given sub-domains, which means that it all, of course, runs through a common code-base.
What I am struggling with is the database design. I am not sure if it would be better to have a column in each table specifying the customer id, or to create a new set of tables (in the same database), or to create a complete new database per customer.
The nice thing about a "flag" in the database specifying the customer id is that everything is in a single location. The downsides are obvious: tables can (will) get huge, and maintenance can become a complete nightmare. If growth occurs, splitting this up over several servers is going to be a huge pain.
The nice thing about creating new tables is that it is easy to do, and it also keeps the tables pretty small. And since customers' data doesn't need to interact, there aren't any problems there. But again, maintenance might become an issue (although I do have a migrations library that will do updates on the fly per customer, so that is no big deal). The other issue is that I have no idea how many tables can be in a single database. Does anyone know what the limit is, and what the performance issues would be?
The nice thing about creating a new database per customer is that when I need to scale, I will be able to, quite nicely. There are several sites that make use of this design (wordpress.com, etc.). It has been shown to be effective, but it also has some downsides.
So, basically I am just looking for some advice on which direction I should (could) go.
Single Database Pros
One database to maintain. One database to rule them all, and in the darkness - bind them...
One connection string
Can use Clustering
Separate Database per Customer Pros
Support for customization on per customer basis
Security: No chance of customers seeing each others data
Conclusion
The separate database approach would be valid if you plan to support customer customization. Otherwise, I don't see the security as a big issue - if someone gets the db credentials, do you really think they won't see what other databases are on that server?
Multiple Databases.
Different customers will have different needs, and it will allow you to serve them better.
Furthermore, if a particular customer is hammering the database, you don't want that to negatively affect the site performance for all your other customers. If everything is on one database, you have no damage control mechanism.
The risk of accidentally sharing data between customers is much smaller with separate databases. If you'd like to have all data in one place, for example for reporting, set up a reporting database the customers cannot access.
Separate databases allow you to roll out, and test, a bugfix for just one customer.
MySQL itself puts no limit on the number of tables (the underlying file system or storage engine may), so you can make an insane number of them. I'd call anything above a hundred tables per database a maintenance nightmare though.
Are you planning to develop a Cloud App?
I don't think you need to create tables or databases per customer. I recommend using a more scalable relational database management system. Personally, I don't know the capabilities of MySQL, but I'm pretty sure it should support a distributed database model in order to handle the load.
Creating tables or databases per customer can lead you to a maintenance nightmare.
I have worked with multi-company databases where every table contains customer IDs, and to access each customer's data we develop views per customer (for reporting purposes).
Good luck,
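The shared-table approach with per-customer views described in this answer might look something like the following. It is a minimal sketch using Python's built-in sqlite3 module; the orders table, its columns and the create_customer_view helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Every row carries the id of the customer that owns it.
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        total       REAL NOT NULL
    );
    -- An index on customer_id keeps per-customer lookups cheap as the table grows.
    CREATE INDEX idx_orders_customer ON orders (customer_id);

    INSERT INTO orders (customer_id, total) VALUES (1, 10.0), (1, 5.0), (2, 99.0);
""")


def create_customer_view(conn, customer_id):
    """Create a view exposing only one customer's rows (hypothetical helper)."""
    # View definitions cannot take bound parameters, so the id is forced
    # to an int before being placed in the SQL text.
    safe_id = int(customer_id)
    view_name = f"orders_customer_{safe_id}"
    conn.execute(
        f"CREATE VIEW {view_name} AS "
        f"SELECT id, total FROM orders WHERE customer_id = {safe_id}"
    )
    return view_name


view_1 = create_customer_view(conn, 1)
view_2 = create_customer_view(conn, 2)

# Reports for a customer query the view and never see other customers' rows.
totals_1 = [row[0] for row in conn.execute(f"SELECT total FROM {view_1} ORDER BY id")]
totals_2 = [row[0] for row in conn.execute(f"SELECT total FROM {view_2} ORDER BY id")]
print(totals_1, totals_2)
```

The data still lives in one table, so cross-customer reporting remains a plain query on orders.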
You can do whatever you want.
If you've got the customer_id in each table, then you've got to write the whole application that way. That's not exactly true, though, as it should be enough to add that column to only some tables; the rest can be handled with some simple joins.
If you've got one database per customer, there won't be any additional code in the application, so that could be easier.
If you take the first approach, it won't be a problem to move to many databases later, as you can keep the customer_id column in all those tables. Of course, the column will then hold the same value in every row of each table, but that's not a problem.
Personally I'd take the simple one-customer-one-database approach. It's easier to use more database servers for all the customers, and harder to show a customer data that belongs to some other customer.
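As a rough illustration of the one-database-per-customer idea, here is a sketch using Python's built-in sqlite3 module, with one in-memory database standing in for each customer's database. The TenantDatabases class, the orders table and the subdomain names are all made up for the example; a real MySQL setup would map each subdomain to a connection string instead.

```python
import sqlite3

# The same schema is applied to every customer's database.
SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL NOT NULL);"


class TenantDatabases:
    """Maps a customer sub-domain to that customer's own database (hypothetical)."""

    def __init__(self):
        self._connections = {}

    def connection_for(self, subdomain):
        # Each sqlite3.connect(":memory:") call creates a separate database.
        if subdomain not in self._connections:
            conn = sqlite3.connect(":memory:")
            conn.executescript(SCHEMA)
            self._connections[subdomain] = conn
        return self._connections[subdomain]


tenants = TenantDatabases()
tenants.connection_for("acme").execute("INSERT INTO orders (total) VALUES (10.0)")
tenants.connection_for("globex").execute("INSERT INTO orders (total) VALUES (99.0)")

# The queries carry no customer_id filter; isolation comes from the connection.
acme_totals = [
    row[0] for row in tenants.connection_for("acme").execute("SELECT total FROM orders")
]
globex_totals = [
    row[0] for row in tenants.connection_for("globex").execute("SELECT total FROM orders")
]
print(acme_totals, globex_totals)
```

The application code stays the same for every customer; only the connection lookup changes, which is also what would make moving a single customer to another server straightforward.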