I'm learning MySQL and working on a database for work. Everything's fine so far, but I had a question. I am organizing financial statements for firms (balance sheet table, income statement table, cash flow table, etc.), and most companies have quarterly statements (which are unaudited) and annual statements (which are audited). Right now, for each statement I have a column that flags it as annual or quarterly.
It's not likely that someone will run a report on an audited and an unaudited statement at the same time, so I was wondering whether it would be worth creating one table for audited statements and one for unaudited ones. My reasoning was that the data will eventually get fairly large, and I assumed that the smaller the tables, the faster the performance.
So when I design a database, should I be designing based on the content (i.e. group everything that's the same, regardless of use), or should I be grouping based on how people will access it?
Another question this raises: should I be grouping financial statements by country, since about 90% of the analysis done at our firm is within a single country?
This is impossible to answer definitively without knowing the whole problem.
However, usually you want a single table to represent each logical entity in your system. From the sound of it, quarterly and annual statements represent the same logical entity, but differ by a single category column/field. The same holds true for the country question--if the only difference is the country (a categorization), then they likely should all be stored in the same table.
If you were to split your data into separate tables by category, your data would be scattered across multiple tables, and would be very hard to query. For example, if you wanted a count of all statements in the system, you would have to query ALL country tables and add the results together.
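To make that concrete, here is a rough sketch (table and column names are invented for illustration) of what the same count looks like against split-by-country tables versus a single table with a country column:

-- Split by country: every table has to be touched and the results added up
SELECT (SELECT COUNT(*) FROM statements_us)
     + (SELECT COUNT(*) FROM statements_uk)
     + (SELECT COUNT(*) FROM statements_de) AS total_statements;

-- Single table with a country column: one query, and per-country filtering is just a WHERE clause
SELECT COUNT(*) FROM statements;
SELECT COUNT(*) FROM statements WHERE country = 'US';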
Edit: Joe Celko calls this anti-pattern "Attribute Splitting".
First of all, I have to point out that I'm not a professional DB designer.
But if I were you, in this case I would create one table, since the entities are basically the same.
If you're worried about MySQL's performance on larger datasets, it might be better to start building your app on Postgres. You can boost MySQL's performance with stored functions/procedures, or maybe views if you have to run complicated queries, and of course you can use memcache or some NoSQL store to let the SQL database rest a bit.
If you are sure that users will mainly search for only one type of record at a time, you could build three tables: one for all of the records, and one each for the audited and unaudited ones. You can keep them synchronized with triggers (ON INSERT/UPDATE/DELETE). They would work like views, but I think (not tested) they would be faster than views. In this case you only have to manage the first "large" table: if you insert an audited record, the trigger fires and puts a copy into the audited table, and so on.
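Here is a minimal sketch of the insert trigger, assuming a master statements table with an is_audited flag and a couple of invented columns; you would need matching UPDATE and DELETE triggers as well:

DELIMITER //
CREATE TRIGGER statements_after_insert
AFTER INSERT ON statements
FOR EACH ROW
BEGIN
  -- copy the new row into the matching "audited" or "unaudited" table
  IF NEW.is_audited = 1 THEN
    INSERT INTO statements_audited (statement_id, firm_id, period_end)
    VALUES (NEW.id, NEW.firm_id, NEW.period_end);
  ELSE
    INSERT INTO statements_unaudited (statement_id, firm_id, period_end)
    VALUES (NEW.id, NEW.firm_id, NEW.period_end);
  END IF;
END//
DELIMITER ;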
Best wishes!
I agree with Phil and Damien - one table is better. What you want is one table per type of real business thing. If you design your tables to resemble real things, even abstract or conceptual things, then your data design is more likely to stand the test of time. Once you've sketched out a schema based on the real things you have data about, then you can go back and apply the rules of normalization to formalize your design.
As a rule, it is a bad idea to design for a performance problem you are worried about, but haven't actually seen. Your intuition about big tables being slower might actually be wrong. Most DBMS systems like bigger tables, at least to a point. When tables are big the query optimizers choose to use indexes. When tables are small they often end up getting full table scans, which can really slow down concurrent access. If your tables get so big that they are beyond the capabilities of your DBMS then it is time to consider either archiving out old data that you aren't using anymore or buying a more scalable DBMS.
I am designing a MySQL database for a listing website. This is my first time, and I did some googling on this. :)
I wanted to cross-check whether there are any problems with my approach.
So basically, I will have 5 tables. Let's say:
Studio (actual studio data points)
Studio Facilities (like transport, water, etc.)
Studio Images
Studio Reviews (name, stars, etc.)
Location & Locality (City, City Direction, Locality, e.g. Bangalore, North Bangalore, Hebbal)
So the Studio table will hold the master record. Facilities, images and reviews will hold data using the studio as an FK.
Location ID will be saved on the studio record.
My question now is that if I want to show a listing of all Hebbal studios, I will have to perform a join across all 5 tables to show the studio data, images, reviews and facilities.
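Roughly something like this, I imagine (column names are just placeholders):

-- join the studio master record to its location, facilities, images and reviews
SELECT s.*, l.city, l.locality, f.facility_name, i.image_url, r.stars
FROM studio s
JOIN location l               ON l.id = s.location_id
LEFT JOIN studio_facilities f ON f.studio_id = s.id
LEFT JOIN studio_images i     ON i.studio_id = s.id
LEFT JOIN studio_reviews r    ON r.studio_id = s.id
WHERE l.locality = 'Hebbal';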
1) Is this ok? Do you see any potential problems with this approach? Are there any better approaches?
2) Will the query execution time increase as the number of records increases?
Thanks
This looks like a solid approach; databases are designed to use joins, and yours are appropriate. Denormalization would cause more problems, especially concerning data integrity, than your current design.
Will the queries get slower with more data? Of course they will but that is true of any design. Sending 10 gig of data back across the network or internet is going to be slower than sending 10 bytes.
You don't ask, but are there things you can do to prevent the queries from getting unacceptably slow? Yes. First, make sure that foreign keys and fields that will appear frequently in the WHERE clause are indexed, as well as the PKs. This will be enough to prevent horrible performance for most simple queries.
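For example, with the listing schema above, the sort of indexes I mean (index and column names are just illustrative):

CREATE INDEX idx_studio_location   ON studio (location_id);
CREATE INDEX idx_facilities_studio ON studio_facilities (studio_id);
CREATE INDEX idx_images_studio     ON studio_images (studio_id);
CREATE INDEX idx_reviews_studio    ON studio_reviews (studio_id);
CREATE INDEX idx_location_locality ON location (locality);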
Another thing you should do is read a good book on performance tuning for the particular database backend you are using. Most of the ones I have read have at least one chapter on designing queries for performance. Read it and re-read it and re-read it until this is sunk into your bones and you would not consider writing queries any other way.
Many databases are considered slow because their developers are not skilled at writing performant queries, and the joins get blamed. Learn about sargability. Understand how indexes work and what the trade-offs for having them are. Understand how wide your table can be before performance suffers (hint: sometimes two tables in a one-to-one relationship are faster than one table that would be very wide).
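A quick illustration of sargability, using a made-up orders table: wrapping an indexed column in a function hides it from the index, while a plain range predicate can use it.

-- Non-sargable: the function on the column forces a scan of every row
SELECT * FROM orders WHERE YEAR(order_date) = 2015;

-- Sargable: the bare column can use an index on order_date
SELECT * FROM orders
WHERE order_date >= '2015-01-01' AND order_date < '2016-01-01';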
You cannot effectively design a database without a solid grounding in performance tuning.
You can use one master table (here it is Studio) and reference it with foreign keys from all the other tables.
It is always better to split up the tables for normalization.
Benefits of normalization
Eliminate data redundancy
Improve performance
Query optimization
Faster updates due to fewer columns in one table
Index improvement
Research normalization and you will get a good idea of how to build the DB.
I want to create a table about "users" for each of the 50 states. Each state has about 2GB worth of data. Which option sounds better?
Create one table called "users" that will be 100GB large OR
Create 50 separate tables called "users_{state}", each which will be 2GB large
I'm looking at two things: performance, and style (best practices)
I'm also running RDS on AWS, and I have enough storage space. Any thoughts?
EDIT: From the looks of it, I will not need info from multiple states at the same time (i.e. I won't need to frequently join tables if I go with Option 2). Here is a common use case: the front-end passes a state id to the back-end, and based on that id, I need to query data from the db regarding the specified state and return it to the front-end.
Are the 50 states truly independent in your business logic? Meaning your queries would only need to run over one given state most of the time? If so, splitting by state is probably a good choice. In this case you would only need joining in relatively rarer queries like reporting queries and such.
EDIT: Based on your recent edit, this split-by-state approach is the route I would recommend. You will get better performance from the table partitioning when no joining is required, and there are multiple other benefits to having the smaller partitioned tables like this.
If your queries would commonly require joining across a majority of the states, then you should definitely not partition like this. You'd be better off with one large table and just build the appropriate indices needed for performance. Most modern enterprise DB solutions are capable of handling the marginal performance impact going from 2GB to 100GB just fine (with proper indexing).
But if your queries on average would need to join results from only a handful of states (say no more than 5-10 or so), the optimal solution is a more complex gray area. You will likely be able to extract better performance from the partitioned tables with joining, but it may make the code and/or queries (and all coming maintenance) noticeably more complex.
Note that my answer assumes the more common access frequency breakdowns: high reads, moderate updates, low creates/deletes. Also, if performance on big data is your primary concern, you may want to check out NoSQL (for example, Amazon AWS DynamoDB), but this would be an invasive and fundamental departure from the relational system. But the NoSQL performance benefits can be absolutely dramatic.
Without knowing more of your model, it will be difficult for anyone to make judgement calls about performance, etc. However, from a data modelling point of view, when thinking about a normalized model I would expect to see a User table with a column (or columns, in the case of a compound key) which hold the foreign key to a State table. If a User could be associated with more than one state, I would expect another table (UserState) to be created instead, and this would hold the foreign keys to both User and State, with any other information about that relationship (for instance, start and end dates for time slicing, showing the timespan during which the User and the State were associated).
Rather than splitting the data into separate tables, if you find that you have performance issues you could use partitioning to split the User data by state while leaving it within a single table. I don't use MySQL, but a quick Google turned up plenty of reference information on how to implement partitioning within MySQL.
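From that reference material, the shape of it is roughly the following (column names are invented; note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key):

CREATE TABLE users (
  id       BIGINT NOT NULL,
  state_id INT    NOT NULL,
  name     VARCHAR(100),
  PRIMARY KEY (id, state_id)
)
PARTITION BY LIST (state_id) (
  PARTITION p_alabama VALUES IN (1),
  PARTITION p_alaska  VALUES IN (2),
  PARTITION p_arizona VALUES IN (3)
  -- ... one partition per state
);

-- Queries that filter on state_id only touch the matching partition:
SELECT * FROM users WHERE state_id = 2;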
Until you try building and running this, I don't think you know whether you have a performance problem or not. If you do, following the above design you can apply partitioning after the fact and not need to change your front-end queries. Also, this solution won't be problematic if it turns out you do need information for multiple states at the same time, and won't cause you anywhere near as much grief if you need to look at User by some aspect other than State.
I'm building a PHP app to prefill third party PDF account forms with client data, and am getting stuck on the database design.
The current form has about 70 fields, which seems like far too many to set up as individual columns, especially as some (i.e. company/trust information) are not relevant depending on the type of account the client requires.
I've tried to normalize but it seems like there would be a lot of joins, and also require several sub queries for things like multiple addresses.
It also means a ton of extra queries to check if rows exist or not when updating to decide if the script needs to do an INSERT, a DELETE or an UPDATE, whereas if it was all in one row, it would basically just be an UPDATE each time.
Not sure if this helps but here is a list of most of the fields:
id, account_type, account_phone, account_email, account_designation, account_adviser, account_source, account_complete,
account_residential_unit_number, account_residential_street_number, account_residential_street_name, account_residential_street_type, account_residential_suburb, account_residential_state, account_residential_postcode,
account_postal_unit_number, account_postal_street_number, account_postal_street_name, account_postal_street_type, account_postal_suburb, account_postal_state, account_postal_postcode,
individual_1_title, individual_1_firstname, individual_1_middlename, individual_1_lastname, individual_1_dob, individual_1_occupation, individual_1_email, individual_1_phone,
individual_1_unit_number, individual_1_street_number, individual_1_street_name, individual_1_street_type, individual_1_suburb, individual_1_state, individual_1_postcode,
individual_2_title, individual_2_firstname, individual_2_middlename, individual_2_lastname, individual_2_dob, individual_2_occupation, individual_2_email, individual_2_phone,
individual_2_unit_number, individual_2_street_number, individual_2_street_name, individual_2_street_type, individual_2_suburb, individual_2_state, individual_2_postcode,
company_name, company_date,
company_unit_number, company_street_number, company_street_name, company_street_type, company_suburb, company_state, company_postcode,
trust_name, trust_date,
settlement_bank, settlement_account, settlement_bsb
The most this will need to handle is around 200,000 applications, and once the data is in the database, it won't change very often, if at all - not sure if that is relevant?
So really I just wanted to figure out the smartest way to design this, even if it's just a name or topic to research further.
Generally speaking, you can divide databases into two broad categories:
OLTP Systems
Online Transaction Processing systems are normally write-intensive, i.e. a lot of updates compared to reads of the data. Such a system is typically a day-to-day application used by business users of all scopes, i.e. data capture, admin, etc. These databases are usually normalised to the extreme and then selectively denormalised for performance gains in certain areas.
OLAP/DSS Systems
Online Analytical Processing systems are normally large, data-warehouse-like databases. They are used to support analytic activities such as data mining, data cubes, etc. Typically the information is used by a more limited set of users than in OLTP. These databases are normally heavily denormalised.
Go read here for a short description of these and the main differences.
OLTP VS OLAP
Regarding your INSERT/UPDATE/DELETE point, go read about MySQL's ON DUPLICATE KEY UPDATE clause, which will resolve that issue for you easily. It is called a MERGE operation in most other database systems.
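For example, assuming the application data lives in a table keyed by id (column names taken from your field list, table name assumed):

INSERT INTO account (id, account_type, account_phone, account_email)
VALUES (42, 'individual', '555-0100', 'client@example.com')
ON DUPLICATE KEY UPDATE
  account_type  = VALUES(account_type),
  account_phone = VALUES(account_phone),
  account_email = VALUES(account_email);

-- If no row with id 42 exists it is inserted; otherwise the listed columns are updated.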
Now, I don't understand why you are worried about JOINs. I have had tables with hundreds of millions (500,000,000+) of rows that I joined with other tables that were also large, and the queries ran very fast. So designing a database to eliminate joins is NOT a good idea.
My suggestion is:
If designing an OLTP system, normalise as much as possible, then denormalise to increase performance where needed. For an OLAP system, look at star schemas etc. and don't even bother normalising it first. Oh, by the way, most OLAP systems normally use an OLTP system as a data source.
Usually I normalise and then denormalise for performance. However
If I didn't have much validation to do, e.g. valid addresses, duplicated individuals,
And I didn't want to reuse parts of the data for another version of the form, e.g. selecting an existing individual, name and address, etc.,
And I didn't want to analyse it, e.g. find all mentions of Fred Bloggs,
And my users were happy with entering all of this in one form (I wouldn't be),
Then I'd denormalise from the get-go.
The thing is, if you normalise, then denormalising later if required is fairly trivial and low risk. Normalising denormalised data usually means de-duplication, which is likely to be really painful both data- and design-wise.
Normalize your input, de-normalize the output. Meaning, for reporting, extract your data into a de-normalized format like Mongo and use that for querying. Or create rollups of some sort. I have found that, with large datasets, extracting the reporting data from the input data gives the best efficiency.
I find denormalized data extremely painful to work with at a very basic level. What if I want a tally of the number of people who live in Georgia? In your denormalized structure I would have to count rows where ind_1_state = 'GA' or ind_2_state = 'GA'.
This is not too bad, I guess, but to anyone who has seen the ease of querying that normalization provides, it is quite painful.
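To spell that out (table names are hypothetical):

-- Denormalized: every individual "slot" has to be checked by hand,
-- and this still counts applications, not people
SELECT COUNT(*) FROM application
WHERE individual_1_state = 'GA' OR individual_2_state = 'GA';

-- Normalized: one row per individual, so the question maps directly to SQL
SELECT COUNT(*) FROM individual WHERE state = 'GA';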
Normalization establishes the foundation for more and more complex queries. Without it, you will find it increasingly difficult to implement richer data analysis.
Normalization also provides the basis for integrity and consistency in your database. If you have all the occurrences of a particular thing (state abbreviations, say) in one place (one column), you can easily check and constrain those values to disallow nonexistent codes.
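For example, a small lookup table plus a foreign key keeps invalid codes out entirely (a sketch with invented names; older MySQL versions parse but ignore CHECK constraints, so the foreign key is the reliable route):

CREATE TABLE state (
  code CHAR(2) PRIMARY KEY   -- 'GA', 'NY', ...
);

CREATE TABLE address (
  id    INT AUTO_INCREMENT PRIMARY KEY,
  state CHAR(2) NOT NULL,
  FOREIGN KEY (state) REFERENCES state (code)
);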
The rationale for normalization goes on and on, but I hope I hit a few no brainers.
This is a no-brainer - all you have now is noun soup that you have shoved into a single table-storage shoebox and glued an ID onto the beginning of each row.
Create some kind of schema. If this is more like OLAP -- and you decide on a star schema -- it will have dimensions in 2NF-5NF and facts in 2NF-6NF. For OLTP (or other warehouse models) aim for BCNF to 6NF.
I would argue that you do not even have 1NF here; gluing that ID onto the beginning does not count as preventing duplicates. Therefore, you cannot de-normalize from this point even if you wanted to :) -- OK, maybe you could put a comma-separated list somewhere to make things definitely not in 1NF.
Joins are what relational databases do, so do not worry about that.
I have 40+ columns in my table and I have to add a few more fields like current city, hometown, school, work, uni, college...
This user data will be pulled for many matching users who are mutual friends (joining the friend table against other users' friends to find mutual friends), who are not blocked, and who are not already friends with the user.
The above request is a little complex, so I thought it would be a good idea to put the extra data in the same user table for fast access, rather than adding more joins, which I assumed would slow the query down further. But I wanted to get your suggestions on this.
My friend told me to add the extra fields that won't be searched on as serialized data in a single field.
ERD Diagram:
My current table: http://i.stack.imgur.com/KMwxb.png
If I join into more tables: http://i.stack.imgur.com/xhAxE.png
Some Suggestions
Nothing wrong with this table and its columns.
Follow the approach from "MySQL: Optimize table with lots of columns" - serialize the extra, non-searchable fields into one field.
Create another table and put most of the data there. (This makes the joins harder, since I already have 3 or more tables to join to pull the records for users, e.g. friends, user, checking mutual friends.)
As usual - it depends.
Firstly, there is a maximum number of columns MySQL can support, and you don't really want to get there.
Secondly, there is a performance impact when inserting or updating if you have lots of columns with an index (though I'm not sure if this matters on modern hardware).
Thirdly, large tables are often a dumping ground for all data that seems related to the core entity; this rapidly makes the design unclear. For instance, the design you present shows 3 different "status" type fields (status, is_admin, and fb_account_verified) - I suspect there's some business logic that should link those together (an admin must be a verified user, for instance), but your design doesn't support that.
This may or may not be a problem - it's more a conceptual architecture/design question than a performance/will-it-work thing. However, in such cases, you may consider creating tables to reflect the related information about the account, even if it doesn't have an x-to-many relationship. So, you might create "user_profile", "user_credentials", "user_fb", "user_activity", all linked by user_id.
This makes it neater, and if you have to add more facebook-related fields, they won't dangle at the end of the table. It won't make your database faster or more scalable, though. The cost of the joins is likely to be negligible.
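A rough sketch of that layout (every column here is invented, just to show the shape):

CREATE TABLE user_credentials (
  user_id       INT PRIMARY KEY,
  email         VARCHAR(255) NOT NULL,
  password_hash CHAR(60)     NOT NULL,
  FOREIGN KEY (user_id) REFERENCES users (id)
);

CREATE TABLE user_fb (
  user_id             INT PRIMARY KEY,
  fb_account_id       VARCHAR(64),
  fb_account_verified TINYINT(1) NOT NULL DEFAULT 0,
  FOREIGN KEY (user_id) REFERENCES users (id)
);

-- Joining them back is cheap because both share the users primary key:
SELECT u.*, c.email, f.fb_account_verified
FROM users u
JOIN user_credentials c ON c.user_id = u.id
LEFT JOIN user_fb f     ON f.user_id = u.id;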
Whatever you do, option 2 - serializing "rarely used fields" into a single text field - is a terrible idea. You can't validate the data (so dates could be invalid, numbers might be text, not-nulls might be missing), and any use in a "where" clause becomes very slow.
A popular alternative is "Entity/Attribute/Value" or "Key/Value" stores. This solution has some benefits - you can store your data in a relational database even if your schema changes or is unknown at design time. However, they also have drawbacks: it's hard to validate the data at the database level (data type and nullability), it's hard to make meaningful links to other tables using foreign key relationships, and querying the data can become very complicated - imagine finding all records where the status is 1 and the facebook_id is null and the registration date is greater than yesterday.
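To make that last point concrete, here is roughly what that query looks like against a hypothetical key/value table user_attributes (user_id, attr_key, attr_value), compared with a plain table:

-- Key/value version: one self-join per attribute, values stored as text
SELECT s.user_id
FROM user_attributes s
JOIN user_attributes r
  ON r.user_id = s.user_id AND r.attr_key = 'registered_at'
LEFT JOIN user_attributes f
  ON f.user_id = s.user_id AND f.attr_key = 'facebook_id'
WHERE s.attr_key   = 'status'
  AND s.attr_value = '1'
  AND f.user_id IS NULL   -- "facebook_id is null" means the row simply doesn't exist
  AND STR_TO_DATE(r.attr_value, '%Y-%m-%d') > CURDATE() - INTERVAL 1 DAY;

-- Plain-table version of the same question
SELECT id FROM users
WHERE status = 1
  AND facebook_id IS NULL
  AND registered_at > CURDATE() - INTERVAL 1 DAY;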
Given that you appear to know the schema of your data, I'd say "key/value" is not a good choice.
I would advise running some tests. Try it both ways and benchmark it. Nobody will be able to give you a definitive answer because you have not shared your hardware configuration, sample data, sample queries, how you plan on using the data, etc. Here is some information that you may want to consider.
Use The Database as it was intended
A relational database is designed specifically to handle data. Use it as such. When written correctly, joining data in a well written schema will perform well. You can use EXPLAIN to optimize queries. You can log SLOW queries and improve their performance. Databases have been around for years, if putting everything into a single table improved performance, don't you think that would be all the buzz on the internet and everyone would be doing it?
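For example (the users table and user_id column are placeholders; the slow-query settings can also go in my.cnf):

-- Show how MySQL will execute the query and whether an index is used
EXPLAIN SELECT * FROM users WHERE user_id = 123;

-- Log any query that takes longer than 2 seconds
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;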
Engine Types
How will inserts be affected as the row count grows? Are you using MyISAM or InnoDB? You will most likely want to use InnoDB so you get row-level locking rather than table-level locking. Make sure you are using the correct engine type for your tables. Get the information you need to understand the pros and cons of both. The wrong engine type can kill performance.
Enhancing Performance using Partitions
Find ways to enhance performance. For example, as your datasets grow you could partition the data. Data partitioning improves the performance of a large dataset by keeping slices of the data in separate partitions, allowing you to run queries against parts of a large dataset instead of all of the information.
Use correct column types
Consider using UUID primary keys for portability and future growth. If you use proper column types, it will improve the performance of your database.
Do not serialize data
Using serialized data is the worst way to go. When you use serialized fields, you are basically using the database as a file management system. It will save and retrieve the "file", but then your code is responsible for unserializing, searching, sorting, etc. I just spent a year trying to unravel a mess like that. It's not what a database was intended to be used for. Anyone advising you to do that is not only giving you bad advice, they do not know what they are doing. There are very few circumstances where you would use serialized data in a database.
Conclusion
In the end, you have to make the final decision. Just make sure you are well informed and educated on the pros and cons of how you store data. The last piece of advice I would give is to find out what heavy users of mysql are doing. Do you think they store data in a single table? Or do they build a relational model and use it the way it was designed to be used?
When you say "I am going to put everything into a single table", you are saying that you know more about performance and can make better choices for optimization in your code than the team of developers that constantly work on MySQL to make it what it is today. Consider weighing your knowledge against the cumulative knowledge of the MySQL team and the DBAs, companies, and members of the database community who use it every day.
At a certain point you should look at the "short row model", also known as entity-key-value stores, as well as the traditional "long row model".
If you look at the schema used by WordPress you will see that there is a table wp_posts with 23 columns and a related table wp_postmeta with 4 columns (meta_id, post_id, meta_key, meta_value). The meta table is a "short row model" table that allows WordPress to attach an effectively unlimited collection of attributes to a post.
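The meta table is approximately this (simplified; check the current WordPress schema for the exact definition):

CREATE TABLE wp_postmeta (
  meta_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  post_id    BIGINT UNSIGNED NOT NULL,
  meta_key   VARCHAR(255),
  meta_value LONGTEXT
);

-- Any post can carry any number of arbitrary attributes:
SELECT meta_value
FROM wp_postmeta
WHERE post_id = 10 AND meta_key = 'some_custom_field';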
Neither the "long row model" or the "short row model" is the best model, often the best choice is a combination of the two. As #nevillek pointed out searching and validating "short row" is not easy, fetching data can involve pivoting which is annoyingly difficult in MySql and Oracle.
The "long row model" is easier to validate, relate and fetch, but it can be very inflexible and inefficient when the data is sparse. Some rows may have only a few of the values non-null. Also you can't add new columns without modifying the schema, which could force a system outage, depending on your architecture.
I recently worked on a financial services system that had over 700 possible facts for each instrument, though most instruments had fewer than 20. This could have been built by setting up dozens of tables, each for a particular asset class, or as a table with 700 columns, but we chose a combination: a table with about 20 columns containing the most popular facts, and a 4-column table which contained the other facts. This design was efficient but difficult to access, so we built a few table functions in PL/SQL to assist with this.
I have a general comment for you,
Think about it: if you put more than 10-12 columns in a table, even if it seems to make sense to put them there, I guess you are going to pay the price in the short, medium and long term.
Your 3-table approach seems better than the 1-table approach, but consider splitting those into 5-6 tables rather than 3 while you still can.
Move currently, currently_position and currently_link from the user table, and work from the user profile, into a new table called USERWORKPROFILE keyed by your primary key.
Move the locale information from the user profile into a new USERPROFILELOCALE table, because it is generic in nature.
And yes, all your generic attributes in all the tables should be int and not varchar.
For instance, City needs to move out to a new table called LIST_OF_CITIES with cityid.
And your attribute city should change from varchar to int and point to cityid in LIST_OF_CITIES.
Do not worry about performance issues; the more tables you have, the better the performance, because you are actually handing the performance work over to the database provider instead of taking it all into your own hands.
I'm currently designing a web application using php, javascript, and MySQL. I'm considering two options for the databases.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables with each record being linked to the appropriate tournament, (or matches to brackets, brackets to divisions etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on the fly for me, and seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!
Do not create tables for each tournament. A table is a type of an entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
Don't prematurely optimize. It adds complexity for no actual reason.
First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or group of columns) that uniquely identify a row, so you can use that as primary key. However, usually it's best just to have a column named ID or something similar of numeric type. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to the First normal form, since that assures data atomicity. This is important because it will save you a lot of headache later on. By doing this correctly, you will also have the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows; RDBMSes these days handle tables with hundreds of millions of rows, and they're designed to do that efficiently.
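A minimal sketch of what that looks like for the tournament/bracket part (all names are placeholders):

CREATE TABLE tournament (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
);

CREATE TABLE bracket (
  id            INT AUTO_INCREMENT PRIMARY KEY,
  tournament_id INT NOT NULL,
  name          VARCHAR(100),
  FOREIGN KEY (tournament_id) REFERENCES tournament (id),
  INDEX idx_bracket_tournament (tournament_id)
);

-- One table holds every tournament's brackets; you select by key instead of by table name:
SELECT * FROM bracket WHERE tournament_id = 42;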
Besides compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX", and that search is done in the DBMS's data dictionary, which is itself a bunch of tables. So you are replacing a search within your tables with a search within the data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have similar performance characteristics as real tables, but I bet these performance characteristics are unlikely to be better than "normal" tables for large numbers of rows. Also, performance of data dictionary is unlikely to be documented and you really shouldn't rely on undocumented features.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would present the additional pressure on performance.
The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
If you decide to add or revise the table structure later (e.g. adding a new field), you will have to apply the change to hundreds of tables, which will be cumbersome, error-prone and a big maintenance headache.
A RDBMS is built to scale in terms of rows, not tables and associated (indexes, triggers, constraints) elements - so you are working against your tool and not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"?
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?