I'm writing a new feature using new tables on a MySQL database. Is there a performance hit I get by indexing columns (that I'll use for WHERE in SELECT queries) from the beginning or should I wait until my table reaches a considerable size before I start indexing?
If you are going to need the indexes eventually, you might as well create them with the tables. They do slow down inserts somewhat, but why wait for slow queries when you already know the right answer?
One argument against putting them in right away is that the actual queries may inform the indexing strategy, but you seem to have a pretty good idea of what the usage will be.
Do recognize that indexes make some operations fast (notably selects). However, they make other operations slower (notably inserts, updates, and deletes). For this reason, be thoughtful about which indexes you add.
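If you already know which columns your WHERE clauses will use, a minimal sketch of declaring the index at table-creation time might look like this (the table and column names are hypothetical):

CREATE TABLE orders (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
    customer_id INT UNSIGNED NOT NULL,
    created_at  DATETIME NOT NULL,
    PRIMARY KEY (id),
    INDEX idx_customer (customer_id)  -- the column you already know you'll filter on
) ENGINE=InnoDB;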
From this related post:
One more index than you need is too many. One less is too little.
I've tried searching for a case where having too many indexes was a problem and couldn't really find anything
You KNOW you have too many if your inserts are too slow and the indexes used for reading aren't speeding things up enough to make up for it.
When you insert into, update, or delete from your table, the indexes need to be updated too.
See this article about indexing
Yes, inserts and updates can be slower because of an index, but in my experience this has not been a common problem. Add only the indexes you know you are going to need, and wait to add others until you are addressing a new problem. One thing to consider: "what if" indexes are often forgotten, never needed, and can actually cause performance issues that are hard to track down. After finding the problem index, the developer has to spend additional time determining whether the index is actually needed by some other part of the application. As for waiting to add indexes until a table reaches a certain size: I seriously doubt that would buy you any performance, and if it did I would question the system design.
Related
I am working on the efficiency of a MySQL server and have a very large database which uses several composite keys. Would server performance be improved by de-normalizing the data and not using composite keys? I am asking for an "in general" answer; due to non-disclosure I cannot post any code or the database schema.
If it's a large database, you shouldn't denormalize the data; that would be criminal to your client, etc. The secret to speed is always more indexes. Figure out which queries they use the most and optimize those by looking at their query plans. Ideally, you want to see the word "seek" instead of "scan". If you still see the word "scan", keep adding indexes to your database until you see the word "seek".
You'd be surprised how much you could gain by getting a better understanding of exactly how they're using the database and what's really slowing it down. De-normalization, unfortunately, is a last resort. They'll thank you later when they don't have to create obscenely complex queries because there's no other way to accomplish anything, all thanks to a short-sighted solution you gave in to. You should be thinking more long-term, both for them and for your practice, imho. Cheers
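In MySQL terms, "seek" vs "scan" shows up in the type column of EXPLAIN output. A rough sketch of the workflow, with made-up table and column names:

EXPLAIN SELECT o.total FROM orders o WHERE o.customer_id = 42;
-- type: ALL  means a full table scan (the "scan" case)
-- type: ref  means an index lookup (the "seek" case)

CREATE INDEX idx_customer ON orders (customer_id);
-- re-run the EXPLAIN; the type should change from ALL to ref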
I realize this is a sort of meta-programming question, but I'm assuming there are enough experienced people here to give a decent answer.
I was just building a query again, to retrieve some data from a table.
SELECT pl.field1, pl.field2
FROM table pl
LEFT JOIN table2 dp on pl.field1 = dp.field1
WHERE dp.field1 IS NULL
Executing this query took ages (1800+ seconds).
After I got sick of waiting, and made the effort to EXPLAIN the query, it turned out that a full table scan was done.
I created an index on dp.field1 and the query was almost instant thereafter; creating that index took less than a second.
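For reference, the fix was along these lines (keeping the placeholder names from the query above):

CREATE INDEX idx_field1 ON table2 (field1);
-- re-running EXPLAIN on the query should now show an index lookup on dp
-- instead of a full scan of table2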
Judging from the EXPLAIN, this wasn't too difficult to determine. Why can't, or won't, MySQL do this automatically? Spending just a second to create that index will make the query instant, so MySQL could theoretically create a temporary index, use it to do the query and then remove it again, which would still be orders of magnitude faster than the alternative.
I'm expecting the usual answers of 'to make sure you design a good schema' or 'mysql just does what you tell it to do', but I'm wondering if there might be a technical reason why this is a bad idea.
For columns with low cardinality, a B-Tree index is not a good idea. An index on a column with only a few distinct values is not selective, so using it can in fact increase query time compared to a full table scan.
So always creating a B-Tree index automatically is not a good idea. At the very least the engine would have to consider cardinality too, and probably several other things as well.
Quite simply - because the idea doesn't really scale using the current design of RDBMS engines.
It's okay for a single user, but databases are designed to support many concurrent users. Having each user's query also run a speculative optimization step ("can I speed this query up by creating an index?") and then build that index, which in some circumstances is a very expensive operation, would become slow at any degree of scale. Making the index "single use" would waste both computation time and disk space, while keeping lots of permanent indices would in turn slow down the query optimizer, which has to investigate many indices for a given query, and would also slow down data modification operations.
Admittedly, on modern hardware these concerns are a lot less significant - the basic design of RDBMS engines dates back to the days when disk space was expensive, CPUs were several orders of magnitude slower, and memory was an unimaginable luxury.
I'm only speaking for MySQL because there may be a database system out there that automatically modifies your database design.
The simple answer is, MySQL simply does what you tell it to do.
MySQL cannot predict the future. Only you can. You know much more about your data than MySQL does. MySQL keeps some statistics, but it's guessing the best way to execute your query on very sparse information (that is sometimes outdated) before it actually tries to do so. Once it starts executing, it doesn't change its plan, no matter how wrong the guess was.
The methods that it uses to guess are all very well documented. It's our job to provide the indexes that will provide the most benefit, and even, at times, hint that it should use those indexes.
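A hedged sketch of such a hint; the table and index names here are invented for illustration:

SELECT m.id, m.name
FROM members m FORCE INDEX (idx_country)
WHERE m.country_id = 42;
-- FORCE INDEX tells the optimizer to prefer idx_country over its own guess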
If you tell MySQL to perform a query that requires a table scan, it assumes you know that it's going to do a table scan, because it told you in its documentation that it would. It simply obeys.
Database systems that don't allow the DBA to make decisions don't scale well. There are always tradeoffs to be made, and you're the one to make them. MySQL is a hammer, not a carpenter.
This is my first time building a database with a table containing 10 million records. The table is a members table that will contain all the details of a member.
What do I need to pay attention to when I build the database?
Do I need a special version of MySQL? Should I use MyISAM or InnoDB?
For a start, you may need to step back and re-examine your schema. How did you end up with 10 million rows in the member table? Do you actually have 10 million members (it seems like a lot)?
I suspect (although I'm not sure) that you have fewer than 10 million members, in which case your table will not be correctly structured. Please post the schema; that's the first step to us helping you out.
If you do have 10 million members, my advice is to make your application vendor-agnostic to begin with (i.e., standard SQL). Then, if you start running into problems, just toss out your current DBMS and replace it with a more powerful one.
Once you've established you have one that's suitable, then, and only then would I advise using vendor-specific stuff. Otherwise it will be a painful process to change.
BTW, 10 million rows is not really considered a big database table, at least not where I come from.
Beyond that, the following is important (not necessarily an exhaustive list but a good start).
Design your tables for 3NF always. Once you identify performance problems, you can violate that rule provided you understand the consequences.
Don't bother performance tuning during development; your queries are in a state of flux. Just accept the fact that they may not run fast.
Once the majority of queries are locked down, then start tuning your tables. Add whatever indexes speed up the selects, de-normalize and so forth.
Tuning is not a set-and-forget operation (which is why we pay our DBAs so much). Continuously monitor performance and tune to suit.
I prefer to keep my SQL standard to retain the ability to switch vendors at any time. But I'm pragmatic. Use vendor-specific stuff if it really gives you a boost. Just be aware of what you're losing and try to isolate the vendor-specific stuff as much as possible.
People that use "select * from ..." when they don't need every column should be beaten into submission.
Likewise those that select every row to filter out on the client side. The people that write our DBMSs aren't sitting around all day playing Solitaire; they know how to make queries run fast. Let the database do what it's best at. Filtering and aggregation are best done on the server side - only send what is needed across the wire.
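As a small illustration of pushing work to the server (hypothetical names): instead of pulling every row across the wire and summing in application code, let the database do it:

SELECT customer_id, SUM(total) AS total_spent
FROM orders
WHERE created_at >= '2009-01-01'
GROUP BY customer_id;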
Generate your queries to be useful. Other than the DoD who require reports detailing every component of their aircraft carriers down to the nuts-and-bolts level, no-one's interested in reading your 1200-page report no matter how useful you think it may be. In fact, I don't think the DoD reads theirs either, but I wouldn't want some general chewing me out because I didn't deliver - those guys can be loud and they have a fair bit of sophisticated weaponry under their control.
At least use InnoDB. You will feel the pain when you realize MyISAM has just lost your data...
Apart from this, you should give more information about what you want to do.
You don't need to use InnoDB if you don't have data integrity and atomic action requirements. You want to use InnoDB if you have foreign keys between tables and are required to keep the constraints, or if you need to update multiple tables in an atomic operation. Otherwise, if you just need to use the table for analysis, MyISAM is fine.
For queries, make sure you build smart indexes to suit your query. For example, if you want to sort by column c and select based on columns a and b, make sure you have an index that covers columns a, b, and c, in that order, and that the index includes the full length of each column rather than a prefix. If you don't get your indexes right, sorting over a large amount of data will kill you. See http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
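A quick sketch of that advice with made-up names: filtering on a and b and sorting on c can all be satisfied by one composite index:

CREATE INDEX idx_a_b_c ON my_table (a, b, c);

SELECT * FROM my_table WHERE a = 1 AND b = 2 ORDER BY c;
-- equality on a and b plus ORDER BY c can be resolved from the index alone,
-- avoiding a filesort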
Just a note about InnoDB and setting up and testing a large table with it. If you start injecting your data, it will take hours. Make sure you issue commits periodically, otherwise if you want to stop and redo for whatever reason, you end up having to either 1) wait hours for transaction recovery, or 2) kill mysqld, set the InnoDB recovery flag to skip recovery, and restart. Also, if you want to re-inject the data from scratch, dropping the table and recreating it is almost instantaneous, but it will take hours to actually DELETE FROM the table.
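A rough sketch of that loading pattern (the table name is hypothetical and the batch size is up to you):

SET autocommit = 0;
-- ... run a batch of INSERTs here ...
COMMIT;  -- commit every few thousand rows so a restart doesn't mean hours of rollback

-- to start over, drop and recreate instead of DELETE FROM:
DROP TABLE IF EXISTS members_import;
-- then CREATE TABLE members_import (...) again, which is nearly instantaneous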
I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query take minutes to execute. I ended up making partitions, separating the data into a couple of tables. What I'm asking is: is there any pattern or design technique to handle this kind of problem (a huge number of records)? Is MSSQL or Oracle better at handling lots of records?
P.S.
The COUNT(*) problem stated above is just an example case. In reality the app does CRUD functionality and some aggregate queries (for reporting), but nothing really complicated. It's just that some of these queries take quite a while (minutes) to execute because of the table volume.
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
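Two hedged examples of that advice (the names and sizes are assumptions, not recommendations):

CREATE INDEX idx_status ON members (status);
SELECT COUNT(status) FROM members;  -- can be satisfied from the index

-- in my.cnf, if you have the RAM to spare (InnoDB assumed; key_buffer_size is the MyISAM equivalent):
-- innodb_buffer_pool_size = 4G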
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast with MySQL InnoDB, and it's not going to be fast in Oracle, or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs).
Having the whole table in RAM will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
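A minimal sketch of the trigger-maintained summary table mentioned above (all names are hypothetical; only insert and delete triggers are shown):

CREATE TABLE member_counts (
    table_name VARCHAR(64) NOT NULL PRIMARY KEY,
    row_count  BIGINT NOT NULL
);

-- seed it once, e.g. INSERT INTO member_counts SELECT 'members', COUNT(*) FROM members;

CREATE TRIGGER members_count_ins AFTER INSERT ON members
FOR EACH ROW UPDATE member_counts SET row_count = row_count + 1 WHERE table_name = 'members';

CREATE TRIGGER members_count_del AFTER DELETE ON members
FOR EACH ROW UPDATE member_counts SET row_count = row_count - 1 WHERE table_name = 'members';

-- reading the count is then an instant primary-key lookup:
SELECT row_count FROM member_counts WHERE table_name = 'members';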
Many performance issues around large tables relate to indexing problems, or a lack of indexing altogether. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regards to your slow count(*) on the huge table, I would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL and the count(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the total row count for the table. They maintain this because it's useful to the optimiser in some cases, but a side effect is that count(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in this Stack Overflow posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data warehousing point of view, but many of the differences also matter on transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a book's worth of answer, so I propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indices on columns based upon expected lookup AND update needs, as update performance is often overlooked.
Third, don't put functions in WHERE clauses if at all possible; wrapping a column in a function usually prevents the engine from using an index on that column.
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that while it has improved greatly in the recent past, MySQL does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high price and quality; Postgres (aka PostgreSQL) is available open-source and is truly fantastic - and has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
I'm going to second #Mark Baker, and say that you need to build indices on your tables.
For queries other than the one you posted, you should also be aware that using constructs such as IN() is faster than a series of OR clauses. There are lots of little steps you can take to speed up individual queries.
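For example (hypothetical table and column):

SELECT id FROM members WHERE country_id IN (1, 2, 3);
-- rather than:
-- SELECT id FROM members WHERE country_id = 1 OR country_id = 2 OR country_id = 3;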
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance tuning methods vary by database, but in general: avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common WHERE clause fields), and avoid cursors (although I think this is less true in Oracle than SQL Server; I don't know about MySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance tuning book and read it. Here is a link to one for mySQL
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716
Consider an indexed MySQL table with 7 columns, being constantly queried and written to. What is the advisable number of rows that this table should be allowed to contain before the performance would be improved by splitting the data off into other tables?
Whether or not you would get a performance gain by partitioning the data depends on the data and the queries you will run on it. You can store many millions of rows in a table, and with good indexes and well-designed queries it will still be super-fast. Only consider partitioning if you are already confident that your indexes and queries are as good as they can be, as it can be more trouble than it's worth.
There's no magic number, but there's a few things that affect performance in particular:
Index Cardinality: don't bother indexing a column that has only 2 or 3 distinct values (like an ENUM). On a large table, the query optimizer will ignore these indexes.
There's a trade off between writes and indexes. The more indexes you have, the longer writes take. Don't just index every column. Analyze your queries and see which columns need to be indexed for your app.
Disk IO and memory play an important role. If you can fit your whole table into memory, you take disk IO out of the equation (once the table is cached, anyway). My guess is that you'll see a big performance change when your table is too big to buffer in memory.
Consider partitioning your servers based on use. If your transactional system is reading/writing single rows, you can probably buy yourself some time by replicating the data to a read only server for aggregate reporting.
As you probably know, table performance changes based on the data size. Keep an eye on your table/queries. You'll know when it's time for a change.
MySQL 5 has partitioning built in, and it's very nice: you can define how your table should be split up. For instance, if you query mostly by userid you can partition your tables by userid, or if you're querying by dates, partition by date. The benefit is that MySQL will know exactly which partition to search to find your values. The downside is that if you search on a field that isn't part of the partitioning key, it's going to scan through every partition, which could decrease performance.
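A hedged sketch of date-based range partitioning (MySQL 5.1+ syntax; the names and ranges are invented):

CREATE TABLE events (
    id         BIGINT UNSIGNED NOT NULL,
    user_id    INT UNSIGNED NOT NULL,
    created_at DATE NOT NULL,
    PRIMARY KEY (id, created_at)  -- the partitioning column must be part of every unique key
)
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2007 VALUES LESS THAN (2008),
    PARTITION p2008 VALUES LESS THAN (2009),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
-- WHERE created_at BETWEEN '2008-01-01' AND '2008-12-31' only touches p2008;
-- a query that doesn't filter on created_at has to check every partition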
While after the fact you could point to the table size at which performance became a problem, I don't think you can predict it, and certainly not from the information given on a web site such as this!
Some questions you might usefully ask yourself:
Is performance currently acceptable?
How is performance measured - is there a metric?
How do we recognise unacceptable performance?
Do we measure performance in any way that might allow us to forecast a problem?
Are all our queries using an efficient index?
Have we simulated extreme loads and volumes on the system?
Using the MyISAM engine, you'll run into a 2GB hard limit on table size unless you change the default.
Don't ever apply an optimisation if you don't think it's needed. Ideally this should be determined by testing (as others have alluded).
Horizontal or vertical partitioning can improve performance, but it also complicates your application. Don't do it unless you're sure that you need it AND it will definitely help.
The 2G MyISAM data file size is only a default and can be changed at table creation time (or later with an ALTER TABLE, but that requires rebuilding the table). It doesn't apply to other engines (e.g. InnoDB).
Actually this is a good question for performance. Have you read Jay Pipes? There isn't a specific number of rows but there is a specific page size for reads and there can be good reasons for vertical partitioning.
Check out his kung fu presentation and have a look through his posts. I'm sure you'll find that he's written some useful advice on this.
Are you using MyISAM? Are you planning to store more than a couple of gigabytes? Watch out for MAX_ROWS and AVG_ROW_LENGTH.
Jeremy Zawodny has an excellent write-up on how to solve this problem.
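A sketch of raising those limits at creation time (the table and values are illustrative only):

CREATE TABLE big_log (
    id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    msg VARCHAR(255) NOT NULL
) ENGINE=MyISAM
  MAX_ROWS = 1000000000
  AVG_ROW_LENGTH = 100;
-- MAX_ROWS * AVG_ROW_LENGTH tells MyISAM to use wider row pointers,
-- raising the maximum data file size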