Tips on designing a MySQL database

I am designing a MySQL database for a listing website. This is my first time and I did some googling on this :)
I wanted to cross-check whether there are any problems in my approach.
So basically, I will have 5 tables. Let's say:
Studio (actual studio data points)
Studio Facilities (like transport, water, etc.)
Studio Images
Studio Reviews (name, stars, etc.)
Location & Locality (City, City Direction, Locality, e.g. Bangalore, North Bangalore, Hebbal).
So the Studio table will hold the master record. Facilities, images and reviews will hold data by using the studio as an FK.
The Location ID will be saved on the studio record.
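Roughly, a sketch of the tables I have in mind (the column names here are just placeholders to show the FK layout):

    CREATE TABLE location (
        location_id    INT PRIMARY KEY AUTO_INCREMENT,
        city           VARCHAR(50),   -- e.g. Bangalore
        city_direction VARCHAR(50),   -- e.g. North Bangalore
        locality       VARCHAR(50)    -- e.g. Hebbal
    );

    CREATE TABLE studio (
        studio_id   INT PRIMARY KEY AUTO_INCREMENT,
        name        VARCHAR(100),
        location_id INT,
        FOREIGN KEY (location_id) REFERENCES location (location_id)
    );

    -- studio_facilities, studio_images and studio_reviews all follow the same
    -- pattern: their own columns plus a studio_id FK back to studio.
    CREATE TABLE studio_reviews (
        review_id     INT PRIMARY KEY AUTO_INCREMENT,
        studio_id     INT,
        reviewer_name VARCHAR(100),
        stars         TINYINT,
        FOREIGN KEY (studio_id) REFERENCES studio (studio_id)
    );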
My question now is: if I want to show a listing of all Hebbal studios, I will have to perform a join of all 5 tables to show the studio data, images, reviews and facilities.
1) Is this ok? Do you see any potential problems with this approach? Are there any better approaches?
2) Will the query execution time increase as the number of records increases?
Thanks

This looks like a solid approach; databases are designed to use joins, and yours are appropriate. Denormalization would cause more problems, especially concerning data integrity, than your current design.
Will the queries get slower with more data? Of course they will, but that is true of any design. Sending 10 GB of data back across the network or internet is going to be slower than sending 10 bytes.
You don't ask, but are there things you can do to prevent the queries from getting unacceptably slow? Yes. First, make sure that foreign keys and fields that will appear frequently in the WHERE clause are indexed, as well as the PKs. This will be enough to prevent horrible performance for most simple queries.
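A sketch of what that looks like for the studio schema (table and column names follow the hypothetical sketch in the question). Note that InnoDB creates an index on the referencing column automatically when you declare a FOREIGN KEY, so often the main index you add by hand is the one on the WHERE-clause field:

    -- Index the field that appears in the WHERE clause of the listing query
    CREATE INDEX idx_location_locality ON location (locality);

    -- The "all studios in Hebbal" listing, joining all five tables
    -- (in practice you may fetch images/reviews separately to avoid row explosion)
    SELECT s.*, f.facility, i.image_url, r.stars
    FROM studio s
    JOIN location l               ON l.location_id = s.location_id
    LEFT JOIN studio_facilities f ON f.studio_id = s.studio_id
    LEFT JOIN studio_images i     ON i.studio_id = s.studio_id
    LEFT JOIN studio_reviews r    ON r.studio_id = s.studio_id
    WHERE l.locality = 'Hebbal';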
Another thing you should do is read a good book on performance tuning for the particular database backend you are using. Most of the ones I have read have at least one chapter on designing queries for performance. Read it and re-read it and re-read it until this is sunk into your bones and you would not consider writing queries any other way.
Many databases are considered slow because their developers are incompetent at writing performant queries, and the joins get blamed. Learn about sargability. Understand how indexes work and what the tradeoffs for having them are. Understand how wide your table can be before performance suffers (hint: sometimes two tables in a one-to-one relationship are faster than one table that would be very wide).
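A quick illustration of the sargability point, using a hypothetical created_at column on the studio table:

    -- Not sargable: wrapping the indexed column in a function prevents index use
    SELECT * FROM studio WHERE YEAR(created_at) = 2013;

    -- Sargable: a range on the bare column lets an index on created_at be used
    SELECT * FROM studio
    WHERE created_at >= '2013-01-01' AND created_at < '2014-01-01';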
You cannot effectively design a database without a solid grounding in performance tuning.

You can use one master table (here it is Studio) and use foreign keys to reference it from all the other tables.
It is generally better to split up the tables for normalization.
Benefits of normalization
Eliminate data redundancy
Improve performance
Query optimization
Faster updates due to fewer columns in one table
Index improvement
Research normalization and you will get a good idea of how to build the DB.

Optimizing a database with multiple JOINs

First, some details about the website and the database structure -
On my website you can learn English words, and you can attach to each word a sentence, an association, and an image; in addition, each word has a category, sub-category, group...
My database includes about 20 tables. Any user who registers on my website 'adds' something like 4,000 rows to the users table - the number of words on my website. I have a serious problem when the user filters words (something like a word 'search', but by character(s) & category(s) & group(s), etc.). I have 9 JOINs in my SQL query, and it takes something like 1 minute to display results.
The purpose of the JOINs: inside the users table (where each user has 4,000 rows / each row = a word) there are joins in this style:
$this->db->join('users', 'sentences.id = users.sentence_id' ,'left');
The same thing happens with associations, groups, images, links between words, etc.
The users table includes the ids of sentences, associations, groups... and with the JOINs there is a connection.
I don't know what to do; it takes too much time. Maybe the problem is the structure of the database? The multiple joins? Maybe indexing would help? But how and where? Sometimes it's necessary to retrieve all the words, so I assumed indexing wouldn't help.
I'm using MySQL.
First of all, if you're using that many joins, indexes will not save you (as they will not be used in joins most of the time).
There are a few things you can do.
Schema Design
You probably would want to reconsider your schema design/query if you need 9 joins to achieve what you are doing!
From the looks of it, it seems your tables are very normalized, perhaps in 3rd normal form? In that case, consider denormalizing your tables into a larger one to avoid joins (joins are more expensive than full table scans!). There is plenty of online documentation on this; however, there are always costs, as it increases development complexity and data redundancy. Also, by denormalizing your tables you avoid joins and can make better use of indexes.
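As a rough illustration only (I am guessing at your table and column names), denormalizing could mean folding the pieces each filter needs into one wide search table:

    -- One wide, denormalized table holding everything the word filter needs,
    -- so the search hits a single table instead of nine joins.
    CREATE TABLE user_word_search (
        user_id      INT NOT NULL,
        word_id      INT NOT NULL,
        word         VARCHAR(100),
        sentence     TEXT,
        association  VARCHAR(255),
        category     VARCHAR(50),
        sub_category VARCHAR(50),
        group_name   VARCHAR(50),
        PRIMARY KEY (user_id, word_id),
        KEY idx_filter (user_id, category, sub_category, group_name)
    );

    -- The filter becomes a single-table query:
    SELECT word, sentence, association
    FROM user_word_search
    WHERE user_id = 42
      AND category = 'verbs'
      AND word LIKE 'a%';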
Also, I believe MyISAM is the only storage engine in MySQL that supports FULL TEXT indexes. However, it does not have transactions, has table-level locking and no MVCC, so it depends on what you need.
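For instance, if it is the free-text 'search' part that is slow, a FULLTEXT index (on a MyISAM table, per the caveat above; the column name is a guess) looks roughly like this:

    ALTER TABLE sentences ENGINE = MyISAM;   -- FULLTEXT needs MyISAM here
    ALTER TABLE sentences ADD FULLTEXT INDEX ft_sentence (sentence_text);

    SELECT id, sentence_text
    FROM sentences
    WHERE MATCH(sentence_text) AGAINST('apple' IN NATURAL LANGUAGE MODE);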
Resources
I suggest you have a read of the book High Performance MySQL.
A truly awesome book on tuning MySQL databases.
I also suggest having a read of the official documentation on your chosen storage engine. This is significant as each storage engine is VERY DIFFERENT! InnoDB is completely different from MyISAM, which is also completely different from PBXT. Each engine has its benefits and you will have to consider which one fits your situation.
I would draw out the relational schema and work out the number of operations for the queries you are running, and go from there. Most DBMS's attempt to optimise queries implicitly, but not always optimally. You should look into re-ordering the joins so that the most restrictive are carried out first. Indexes could help, and again, would require some analysis to find which attributes you are searching on.
Building databases to deal with natural language is a very challenging subject and there is a lot of research on the subject. Have you looked into Markov chains? Have you taken a step back and thought about the computational complexity of what you are trying to do? If you arrive at the same conclusion of nine joins, then it may be fair to say that the problem is not scalable enough for a real-time application.
As an aside, I believe Google App Engine's data store attempts to index attributes for you, with implicit scalability. If you're running your database on a small web server, then you may see better results deploying it with a more comprehensive DBMS. I would only look into this as a last resort, however.

Is it better to create tables based on content or views?

I'm learning MySQL and was working on a database for work. Everything's fine so far, but I had a question. I am organizing financial statements for firms (balance sheet table, income statement table, cash flow table, etc.), and most companies have quarterly statements (which are unaudited) and annual statements (which are audited). Right now, for each statement I have a column that flags it as annual or quarterly.
It's not likely that someone will be running a report on an audited and an unaudited statement at the same time, so I was wondering whether it was worth creating one table for audited statements and one for unaudited. The reason I was thinking this is that eventually the data will get fairly large, and I thought the smaller the tables, the faster the performance.
So when I design a database, should I be designing based on the content (i.e. group everything that's the same, regardless) or should I be grouping based on how people will access it?
Another question this raises is whether I should be grouping financial statements by country, since 90% of the analysis done at our firm is within the same country.
This is impossible to answer definitively without knowing the whole problem.
However, usually you want a single table to represent each logical entity in your system. From the sound of it, quarterly and annual statements represent the same logical entity, but differ by a single category column/field. The same holds true for the country question--if the only difference is the country (a categorization), then they likely should all be stored in the same table.
If you were to split your data into separate tables by category, your data would be scattered across multiple tables, and would be very hard to query. For example, if you wanted a count of all statements in the system, you would have to query ALL country tables and add the results together.
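For example (hypothetical table and column names), with a single table the question is one query, while with per-country tables it becomes a UNION over every table:

    -- One statements table with category columns:
    SELECT COUNT(*) FROM statements WHERE audited = 1;

    -- Split into a table per country, the same question touches every table:
    SELECT SUM(c) FROM (
        SELECT COUNT(*) AS c FROM statements_us WHERE audited = 1
        UNION ALL
        SELECT COUNT(*) AS c FROM statements_uk WHERE audited = 1
        UNION ALL
        SELECT COUNT(*) AS c FROM statements_de WHERE audited = 1
        -- ...one branch per country table
    ) AS t;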
Edit: Joe Celko calls this anti-pattern "Attribute Splitting".
First of all I have to point out, I'm not a professional DB designer.
But if I were you, in this case I would create one table, as the entities are basically the same.
If you fear MySQL's performance on larger datasets, maybe it would be better to start building your app on Postgres. You can boost MySQL's performance with stored functions/procedures, or maybe views if you have to run complicated queries, and of course you can use memcache or any NoSQL stuff to let the SQL rest a bit.
If you are sure that users will mainly search for only one type of record or the other, you can build three tables: one for all of the records, and one each for the audited and unaudited ones. You can keep them synchronized with triggers (ON UPDATE/DELETE/INSERT). They would work like views, but I think (not tested) they would be faster than views. In this case you only have to manage the first "large" table. If you insert an audited record, the trigger fires and puts a record into the audited table, and so on...
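A minimal sketch of such a trigger, assuming a master statements table with an audited flag and an audited-only copy (all names invented for illustration):

    CREATE TRIGGER statements_ai AFTER INSERT ON statements
    FOR EACH ROW
        INSERT INTO statements_audited (statement_id, firm_id, period_end, audited)
        SELECT NEW.statement_id, NEW.firm_id, NEW.period_end, NEW.audited
        WHERE NEW.audited = 1;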
Best wishes!
I agree with Phil and Damien - one table is better. What you want is one table per type of real business thing. If you design your tables to resemble real things, even abstract or conceptual things, then your data design is more likely to stand the test of time. Once you've sketched out a schema based on the real things you have data about, then you can go back and apply the rules of normalization to formalize your design.
As a rule, it is a bad idea to design for a performance problem you are worried about, but haven't actually seen. Your intuition about big tables being slower might actually be wrong. Most DBMS systems like bigger tables, at least to a point. When tables are big the query optimizers choose to use indexes. When tables are small they often end up getting full table scans, which can really slow down concurrent access. If your tables get so big that they are beyond the capabilities of your DBMS then it is time to consider either archiving out old data that you aren't using anymore or buying a more scalable DBMS.

Determining Best Table Structure for MySQL Performance

I'm working on a browser-based RPG for one of my websites, and right now I'm trying to determine the best way to organize my SQL tables for performance and maintenance.
Here's my question:
Does the number of columns in an SQL table affect the speed in which it can be queried?
I am not a newbie when it comes to PHP or MySQL. I used to develop things with the common goal of getting them to work, but I've recently advanced to the stage where a functional program is not good enough unless it's fast and reliable.
Anyways, right now I have a members table that has around 15 columns. It contains information such as the player's username, password, email, logins, page views, etcetera. It doesn't contain any information on the player's progress in the game, however. If I added columns for things such as army size, gold, turns, and whatnot, then it could easily rise to around 40 or 50 total columns.
Oh, and my database structure IS normalized.
Will a table with 50 columns that gets constantly queried be a bad idea? Should I split it into two tables; one for the user's general information and one for the user's game statistics?
I know I could check the query time myself, but I haven't actually created the tables yet and I think I'd be better off with some professional advice on this important decision for my game.
Thank you for your time! :)
The number of columns can have a measurable cost if you're relying on table-scans or on caching pages of table data. But the best way to get good performance is to create indexes to assist your queries. If you have indexes in place that benefit your queries, then the width of a row in the table is pretty much inconsequential. You're looking up specific rows through much faster means than scanning through the table.
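For instance (using column names from your description), a lookup like this is fast whether the row has 15 columns or 50, as long as the column you search on is indexed:

    CREATE INDEX idx_members_username ON members (username);

    -- Finds the row via the index rather than scanning the table,
    -- so the width of the members row barely matters.
    SELECT username, email, gold, army_size
    FROM members
    WHERE username = 'some_player';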
Here are some resources for you:
EXPLAIN Demystified
More Mastering the Art of Indexing
Based on your caveat at the end of your question, you already know that you should be measuring performance and only fixing code that has problems. Don't try to make premature optimizations.
Unfortunately, there are no one-size-fits-all rules for defining indexes. The best set of indexes needs to be designed custom for the queries that you need to be fastest. It's hard work, requiring a lot of analysis, testing, and comparative performance measurements. It also requires a lot of reading to understand how your given RDBMS technology uses indexes.

Database design for very large amount of data

I am working on a project involving a large amount of data from the delicious website. The data available is "Date, UserId, Url, Tags" (for each bookmark).
I normalized my database to 3NF, and because of the nature of the queries that we wanted to use in combination, I came down to 6 tables... The design looks fine; however, now that a large amount of data is in the database, most of the queries need to join at least 2 tables together to get the answer, sometimes 3 or 4. At first we didn't have any performance issues, because for testing purposes we had not added too much data to the database. Now that we have a lot of data, simply joining extremely large tables takes a lot of time, and for our project, which has to be real-time, this is a disaster.
I was wondering how big companies solve these issues. It looks like normalizing tables just adds complexity, but how do big companies handle large amounts of data in their databases? Don't they use normalization?
Thanks.
Since you asked how big companies (generally) approach this:
They usually have a DBA (database administrator) who lives and breathes the database the company uses.
This means they have people who know everything from how to design the tables optimally and how to profile and tune the queries/indexes/OS/server, to knowing which firmware revision of the RAID controller can cause problems for the database.
You don't talk much about what kind of tuning you've done, e.g.
Are you using MyISAM or InnoDB tables? Their performance (and not least their features) is radically different for different workloads.
Are the tables properly indexed according to the queries you run ?
Run EXPLAIN on all your queries - this will help you identify keys that could be added or removed and whether the proper keys are selected, and lets you compare queries (SQL leaves you with lots of ways to accomplish the same thing); see the example after this list.
Have you tuned the query cache? For some workloads the query cache (on by default) can cause considerable slowdown.
How much memory does your box have, and is MySQL tuned to take advantage of it?
Do you use a file system and RAID setup geared towards the database?
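As an example of the EXPLAIN point above (the table names are only a guess at your 3NF bookmark schema), prefix the slow query with EXPLAIN and look at the type, key and rows columns; a type of ALL with a NULL key means that table is scanned for every combination of the preceding rows, which usually points at a missing index:

    EXPLAIN
    SELECT u.url, t.name, b.bookmarked_at
    FROM bookmarks b
    JOIN urls u           ON u.url_id = b.url_id
    JOIN bookmark_tags bt ON bt.bookmark_id = b.bookmark_id
    JOIN tags t           ON t.tag_id = bt.tag_id
    WHERE b.user_id = 42;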
Sometimes a little denormalization is needed.
Different database products have different characteristics; MySQL might be blazingly fast for some workloads and slow for others.

What techniques are most effective for dealing with millions of records?

I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query take minutes to execute. I ended up making partitions, separating them into a couple of tables. What I'm asking is: are there any patterns or design techniques for handling this kind of problem (a huge number of records)? Are MSSQL or Oracle better at handling lots of records?
P.S.
The COUNT(*) problem stated above is just an example case; in reality the app does CRUD functionality and some aggregate queries (for reporting), but nothing really complicated. It's just that it takes quite a while (minutes) to execute some of these queries because of the table volume.
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast using MySQL InnoDB; it's not going to be fast in Oracle, or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs).
Having the whole table in RAM will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
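A sketch of the trigger-maintained summary table idea (the table and column names are invented; big_table stands in for your 25-million-row table):

    CREATE TABLE row_counts (
        table_name VARCHAR(64) PRIMARY KEY,
        row_count  BIGINT NOT NULL
    );
    INSERT INTO row_counts SELECT 'big_table', COUNT(*) FROM big_table;

    CREATE TRIGGER big_table_ai AFTER INSERT ON big_table FOR EACH ROW
        UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'big_table';

    CREATE TRIGGER big_table_ad AFTER DELETE ON big_table FOR EACH ROW
        UPDATE row_counts SET row_count = row_count - 1 WHERE table_name = 'big_table';

    -- The count is now a single-row lookup instead of a 25M-row scan:
    SELECT row_count FROM row_counts WHERE table_name = 'big_table';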
Many performance issues around large tables relate to indexing problems, or a lack of indexing altogether. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regard to your slow count(*) on the huge table, I would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL, and the count(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the total size of the table. They maintain this because it's useful to the optimiser in some cases, but a side effect is that count(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in This Stackoverflow Posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data warehousing point of view but many of the differences also matter on transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a book's worth of an answer, and I therefore propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indices on columns based upon expected lookup AND update needs, as update performance is often overlooked.
Third, don't put functions in where clauses if at all possible.
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that while it has improved greatly in the recent past, MySQL does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high price and quality; Postgres (aka PostgreSQL) is available open-source and is truly fantastic - and has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
I'm going to second @Mark Baker and say that you need to build indices on your tables.
For queries other than the one you mentioned, you should also be aware that using constructs such as IN() is faster than a series of OR statements in the query. There are lots of little steps you can take to speed up individual queries.
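For example (hypothetical table and column):

    -- A chain of ORs:
    SELECT * FROM records
    WHERE status = 'new' OR status = 'open' OR status = 'pending';

    -- The equivalent IN() list:
    SELECT * FROM records
    WHERE status IN ('new', 'open', 'pending');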
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance tuning methods vary by database, but in general: avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common WHERE clause fields), and avoid cursors (although I think this is less true in Oracle than SQL Server; I don't know about MySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance tuning book and read it. Here is a link to one for MySQL:
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716