I'm struggling with MySQL index optimization for some queries that should be simple but are taking forever. Rather than post the specific problem, I wanted to ask if there is an automated way of dealing with these.
I searched around but couldn't find anything. Surely, if query/index optimization is just a matter of following a set of steps, then someone must have written an app to automate it for a given query... or am I not appreciating the complexities involved?
Well, I can offer a SQL indexing tutorial. Let us know if you succeed with automation ;)
Not so sure about MySQL, but there are tools for Oracle and SQL Server. They cover the trivial cases, but they tend to give a false sense of safety for the non-trivial ones. Nor do they consider the overall workload very well; they are usually limited to suggesting indexes for particular statements.
If it were that simple, you'd have an automated index builder within MySQL.
Actually, there is a query optimizer built into MySQL, and it transparently rewrites your queries into what it considers the most optimal form before executing them. It doesn't always work all that well, though, and it has its own quirks. Knowing these helps you avoid some common pitfalls (like using dependent subqueries).
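For example, on hypothetical `orders`/`customers` tables (not from the original question), the classic dependent-subquery pitfall and the join rewrite that the optimizer handles much better look roughly like this:

    -- older MySQL versions turned this IN (...) into a dependent subquery and
    -- re-ran it for every row of `orders`
    SELECT o.*
    FROM orders o
    WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.country = 'CA');

    -- a rewrite the optimizer copes with far better: a plain join
    -- (assumes customers.id is the primary key, so no duplicate rows appear)
    SELECT o.*
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE c.country = 'CA';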
There are tools that can, given the query log, show you which indexes are not used, and by enabling logging of queries that don't use an index, you can see which ones need one. The problem is that indexes are expensive and you cannot just index everything; which indexes you need depends on your queries.
Related
We have some large indexes that we suspect are not being used in our Rails site, and would like to drop them to save space and computation. However, doing so could be catastrophic if it turns out they were being used. How can we confirm they are not being used?
One option is to log all queries for a time and run 'explain plan' on any of them that use the table in question. But I've heard 'explain plan' can occasionally be inaccurate. We would also have to collect queries for a few hours to be sure, which is quite a lot of log to store and process.
If there was a way to temporarily disable an index, we'd be willing to do that, as long as we could quickly enable it if problems arose. But I don't see a way to do that universally; you can only specify an 'ignore index' hint to individual sql statements.
Short Answer:
With MySQL 5.6 it is possible to do this by using the PERFORMANCE_SCHEMA and ps_helper.
ps_helper is a series of views and routines that present the data in PERFORMANCE_SCHEMA in more useful ways. The VIEW you are after is this one: http://www.markleith.co.uk/ps_helper/#schema_unused_indexes
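If you can't (or don't want to) install ps_helper, a rough equivalent of that view, sketched here from the PERFORMANCE_SCHEMA documentation rather than copied from ps_helper itself, is:

    -- indexes that have not been read since the server last started
    SELECT object_schema, object_name, index_name
    FROM performance_schema.table_io_waits_summary_by_index_usage
    WHERE index_name IS NOT NULL
      AND index_name <> 'PRIMARY'
      AND count_star = 0
      AND object_schema NOT IN ('mysql', 'performance_schema')
    ORDER BY object_schema, object_name;

Note that "unused" only means unused since the last restart, so let the server run through a representative workload (including any periodic reports) before acting on the output.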
More detailed:
The idea of disabling an index is called 'invisible indexes' in Oracle. MySQL doesn't support them, but I'd love to see this feature as well - I filed http://bugs.mysql.com/bug.php?id=70299 a couple of months ago on it.
Removing unused indexes is very important, as it can help optimizer performance. I have a war-story of using the ps_helper + unused_indexes view here: http://www.tocker.ca/2013/09/05/migrating-from-postgresql-to-mysql.html
There's only one procedure for this: Testing, testing, testing and benchmarking.
The primary function of indexes, apart from ensuring uniqueness, is to expedite data access. If all operations were O(1) there would be no need for indexes in the first place.
You need to have another instance of your application where you can experiment with adding, removing, and adjusting your indexes. It's impossible to replicate both real-world hardware and real-world loads, but you can come pretty close if you pay careful attention to how your hardware is configured and how your application is exercised, and can produce a load that's reasonably similar.
If you have application logs that are sufficiently detailed, sometimes you can replay these operations. Read operations are easier to replay than writes, but both can be simulated if you've got enough time to invest in this.
For any application running at scale, you want to know where performance falls off a cliff. So long as your production load is well below this level you'll be okay. If you don't know where the cliff is, you might hit it without any warning.
Remember that indexes don't just take up space (a minor issue); the size of an index also affects how expensive it is to update, making writes more costly. Ideally you have only the ones you need, but it's almost impossible to identify which are actually used. There are many that might be used in theory but never are, and some that shouldn't be used but are, because the query optimizer is a little dumb sometimes.
What MySQL server variables should we be looking at, and what thresholds are significant, for the following problem scenarios:
CPU bound
Disk read bound
Disk write bound
And for each scenario, what solutions are recommended to improve them, short of getting better hardware or scaling the database to multiple servers?
This is a complicated area. The "thresholds" that will affect each of your three categories overlap quite a bit.
If you are having problems with your operations being CPU bound, then you definitely need to look at:
(a) The structure of your database - is it fully normalized? Bad DB structure leads to complex queries which hit the processor.
(b) Your indexes - is everything needed for your queries sufficiently indexed? Lack of indexes can hit both the processor and the memory VERY hard. To check indexes, run "EXPLAIN ...your query". Any row in the resulting plan that says it isn't using an index is one you need to look at closely and, if possible, add an index for (see the example after this list).
(c) Use prepared statements wherever possible. These can save the CPU from doing quite a bit of crunching.
(d) Use a better compiler with optimizations appropriate for your CPU. This is one for the dedicated types, but it can gain you the odd extra percent here and there.
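To illustrate point (b), here's roughly what that workflow looks like on a hypothetical `orders` table (the column and index names are made up):

    -- type: ALL in the EXPLAIN output means a full table scan - the row to worry about
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
    -- ... type: ALL, key: NULL, rows: 1000000, Extra: Using where

    ALTER TABLE orders ADD INDEX idx_customer (customer_id);

    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
    -- ... type: ref, key: idx_customer, rows: 12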
If you are having problems with your operations being read bound
(a) Ensure that you are caching where possible. Check the configuration variables query_cache_limit and query_cache_size; this isn't a magic fix, but raising them can help (a sketch follows this list).
(b) As with above, check your indexes. Good indexes reduce the amount of data that needs to be read.
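For point (a), a my.cnf sketch of those variables; the values here are arbitrary examples to tune against your own hit rate, not recommendations, and note that the query cache was removed entirely in MySQL 8.0:

    query_cache_type  = 1     # cache SELECT results unless SQL_NO_CACHE is given
    query_cache_size  = 64M   # total memory set aside for the cache
    query_cache_limit = 1M    # don't cache any single result bigger than this

You can watch the Qcache% counters in SHOW STATUS to see whether raising these actually helps.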
If you are having problems with your operations being write bound:
(a) See if you need all the indexes you currently have. Indexes are good, but the trade-off for improved query time is that maintaining those indexes adds to the cost of writing the data and keeping them up to date. Normally you want indexes if in doubt, but sometimes you're more interested in rapidly writing to a table than in reading from it.
(b) Where possible, make use of INSERT DELAYED to "queue" writes to the database (example after this list). Note, this is not a magic fix and is often inappropriate, but in the right circumstances it can help.
(c) Check for tables that are heavily read from and written to at the same time, e.g. an access list that updates visitors' session data constantly and is read from just as much. It's easy to optimize a table for reading or for writing, but not really possible to design one that's good at both. If you have such a case and it's a bottleneck, consider whether you can split its functions, or move any complex operations using that table to a temporary table that you update as a block periodically.
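The INSERT DELAYED mentioned in (b) looks like an ordinary insert; the difference is that the client gets control back immediately and the row is written when the table is free. It is MyISAM only and was later deprecated and removed in MySQL 5.7, so treat this purely as an illustration on a made-up table:

    INSERT DELAYED INTO access_log (visitor_ip, path, hit_at)
    VALUES ('203.0.113.7', '/index.html', NOW());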
Note, the only stuff in the above that has a major effect is good query design / indexing. Beyond that, you want to start considering better hardware. In particular, you can get a lot of benefit out of a RAID-0 array, which doesn't do a lot for write-bound problems but can do wonders for read-bound ones. And it can be a pretty cheap solution for a big boost.
You also missed two items off your list.
Memory bound. If you are hitting memory problems, then you must check that everything that can be usefully indexed is indexed. You can also look at greater connection pooling if, for some reason, you're using a lot of discrete connections to your DB.
Network bound. If you are hitting network bound problems... well you probably aren't, but if you are, you need another network card or a better network.
Note that a convenient way to analyze your DB performance is to turn on the log_slow_queries option and set long_query_time either to 0 to get everything, or to 0.3 or similar to catch anything that might be holding your database up. You can also turn on log-queries-not-using-indexes to see if anything interesting shows up. Note, this sort of logging can kill a busy live server. Try it on a development box to start.
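A my.cnf sketch of that logging setup; the option spellings changed a little between versions (these are the 5.1+ names) and the file path is just an example:

    slow_query_log                = 1
    slow_query_log_file           = /var/log/mysql/slow.log
    long_query_time               = 0.3   # seconds; set to 0 to log everything
    log_queries_not_using_indexes = 1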
Hope that's of some help. I'd be interested in anyone's comments on the above.
The question says it all. Which one is faster? And when should we use a view instead of a subquery, and vice versa, when it comes to speed optimisation?
I don't have a specific situation in mind, but was thinking about this while trying some things with views in MySQL.
A smart optimizer will come up with the same execution plan either way. But if there were to be a difference, it would be because the optimizer was for some reason not able to correctly predict how the view would behave, meaning a subquery might, in some circumstances, have an edge.
But that's beside the point; this is a correctness issue. Views and subqueries serve different purposes. You use views to provide code re-use or security. Reaching for a subquery when you should use a view, without understanding the security and maintenance implications, is folly. Correctness trumps performance.
Neither is particularly efficient in MySQL. In any case, MySQL does no caching of data for views, so the view simply adds another step in query execution. This makes views slower than subqueries. Check out this blog post for some extra info: http://www.mysqlperformanceblog.com/2007/08/12/mysql-view-as-performance-troublemaker/
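To make the comparison concrete, here is the same aggregate written both ways on a hypothetical `orders` table; because the view contains a GROUP BY, MySQL processes it with the TEMPTABLE algorithm, which is the kind of extra step the linked post is talking about:

    CREATE VIEW customer_totals AS
      SELECT customer_id, SUM(total) AS spent
      FROM orders
      GROUP BY customer_id;

    -- querying through the view
    SELECT * FROM customer_totals WHERE spent > 100;

    -- the same thing written as a subquery in the FROM clause
    SELECT * FROM (
      SELECT customer_id, SUM(total) AS spent
      FROM orders
      GROUP BY customer_id
    ) AS customer_totals
    WHERE spent > 100;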
One possible alternative (if you can deal with slightly outdated data) is materialized views. Check out Flexviews for more info and an implementation.
As I've been looking into the differences between Postgres and MySQL, it has struck me that, if what I read is to be believed, MySQL should be (disclaimer: by reading the rest of this sentence, you agree to read the next paragraph as well) the laughingstock of the RDBMS world: it doesn't enforce ACID by default, the net is rife with stories of MySQL-related data loss, and by all accounts the query optimizer is a joke.
But none of this seems to matter. It's not hard to tell that MySQL has about a million times* as much hype as Postgres (it's LAMP, not LAPP), big installations of MySQL are not unheard of (LJ? Digg?) and I haven't noticed a drop in MySQL's popularity.
This makes me wonder: are these "problems" with MySQL really that bad?
So, if you have used MySQL for a reasonably large project**, what was your experience like? Did you use Postgres as well? How was it worse? How was it better?
*: [citation needed]
**: I'm well aware that, for "small things" (blogs, what have you), MySQL (along with practically every other RDB) is just fine.
Since it's tagged [subjective], I'll be subjective. For me it's about the little things. PostgreSQL is more developer friendly and makes it easy to do the right thing regarding data integrity by default.
If you give MySQL an incorrect type, it will implicitly convert it even if the conversion is incorrect. PostgreSQL will complain.
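A small illustration of that difference, on a made-up table, assuming MySQL's historical non-strict sql_mode (strict mode, the default since 5.7, behaves more like PostgreSQL here):

    CREATE TABLE t (n INT, d DATE);
    INSERT INTO t VALUES ('abc', '2009-02-31');
    -- MySQL (non-strict): stores n = 0 and d = '0000-00-00', with two warnings
    -- PostgreSQL: rejects the statement outright with an error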
EXPLAIN in PostgreSQL is way more useful than in MySQL. It gives you the exact structured query plan: what kind of algorithm it will use, what cost each step has, etc. This means that if the query optimizer in MySQL doesn't do what you think it does, you will have a hard time debugging it.
If you ever wrote anything more complex in the MySQL stored procedure language, you will know how painful it is. PL/pgSQL is actually a nice language + you can use many other languages.
MySQL doesn't have sequences, so if you need them you have to roll your own. Most people will do it wrong and have race conditions in their code.
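For comparison, a PostgreSQL sequence and the usual MySQL workaround (a single-row counter table); LAST_INSERT_ID(expr) is what makes the read-back race-free, and forgetting it is exactly the mistake being described:

    -- PostgreSQL
    CREATE SEQUENCE invoice_seq;
    SELECT nextval('invoice_seq');

    -- MySQL emulation with a counter table
    CREATE TABLE invoice_seq (id INT NOT NULL);
    INSERT INTO invoice_seq VALUES (0);

    UPDATE invoice_seq SET id = LAST_INSERT_ID(id + 1);
    SELECT LAST_INSERT_ID();   -- returns this connection's new value, not another session's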
PostgreSQL exposes most of its internal lock types to the developer. If you need to lock your table in a special way, you can do that.
Everything is programmable in PostgreSQL. For example, if you need your own data type for some specific data, you can add it. You can add casts and operators for the data types. Probably not worth the effort for small projects, but it's better than storing things as strings.
PostgreSQL makes every action, including DDL changes, part of a transaction, unlike MySQL. If you have a conversion script that creates/drops tables, BEGIN/END won't help you keep things in a consistent state in MySQL.
That doesn't mean it's impossible to write good database applications with MySQL, it just requires more effort.
MySQL can be used for reasonably large applications, provided you really know what you do and don't trust the defaults.
MySQL's defaults are optimized for ease of use, getting started quickly, and (usually) the best performance. Other databases choose defaults that are at the very least ACID and scalable, i.e. defaults that are not necessarily the best/fastest for small data sets.
Another point is that MySQL only learned to be a "real database" relatively recently, while almost all competing products started life with full ACID in mind.
MySQL had problems with almost all aspects of ACID at one time or another. Most of them are gone or can be configured away, but you will have to check each one. The problem with troubles in atomicity for example is that you will not notice them until you place your system under heavy load (which often coincides with it being a production system, unfortunately).
So my summary would be: MySQL is capable of working in these environments, but it takes work. And the path it took to get to that point cost it quite a few points in the confidence area.
Provided you know what its capabilities are, then it may fit your use case.
If used correctly, then it is ACID compliant. If used incorrectly, it is not. The trouble is that people seem to assume that ACID compliance is always a good thing to have.
In reality ACID is often the enemy of performance (particularly the D for durability). By relaxing durability very slightly, we can typically get a very large performance boost.
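For example (InnoDB), the usual way to trade a little of the D in ACID for throughput is the documented knob below; this is an illustration, not a recommendation:

    innodb_flush_log_at_trx_commit = 2   # write the log at commit, fsync it ~once a second
                                         # (the default of 1 flushes and syncs on every commit)
    sync_binlog = 0                      # let the OS decide when to sync the binary log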
Likewise, even using the MyISAM engine (which doesn't offer much in the way of durability, nor much of the other ACID properties) is still appropriate for some problem domains.
We are using MySQL in some applications - and it is doing a pretty good job.
In the newer projects we are using the InnoDB engine, and although it may be slower than the default engine, it is working well.
Right now we are using an ORM mapper, and so most of the complexity is hidden behind it (and working nicely).
I think the infrastructure (tools and information) is one of MySQL's big plusses: we are using really nice tools like Toad for MySQL and MySQL Administrator.
Although I have to admit that I had a shocking experience last week when helping a friend with a SQL statement: a correlated subquery nearly stopped his MySQL server, but with the trick of enclosing it in another query it worked really well.
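I don't know the friend's actual statement, but on made-up tables the pattern and the "enclose it in another query" trick usually look like this:

    -- the painful form: old MySQL versions re-evaluated the IN subquery
    -- as a dependent subquery for every row of `orders`
    SELECT * FROM orders
    WHERE id IN (SELECT order_id FROM payments WHERE amount > 100);

    -- the wrapping trick: the extra derived table forces MySQL to materialize
    -- the inner result once instead of re-running it per row
    SELECT * FROM orders
    WHERE id IN (SELECT order_id FROM (
        SELECT order_id FROM payments WHERE amount > 100
    ) AS p);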
This is nothing which REALLY shocks me, because I've used other DB systems which cost big bucks (I'm looking at you, DB2) and they had other things to work around (maybe not as drastic, but you still had to optimize for them).
I haven't used both for a single large project, but having used both I have some idea of how they compare.
In general, almost all of MySQL's problems can be worked around with good discipline. The issue is more that the developer has to know all the gotchas and work around them. After working with PostgreSQL or Oracle this feels a bit like death by a thousand papercuts. You get used to stuff just working.
This is a pretty significant issue in the types of projects I have worked on: complex schemas, complex queries, and lots of data. Tight schedules leave little time for performance engineering, so getting consistently reasonable performance without having to manually optimize queries is important, and a good cost-based optimizer is almost a requirement. Combine that with quite a lot of outsourcing to development teams that don't have the experience to catch all the gotchas in time, and the little issues escalate into large QA problems. Hitting any of MySQL's silent data corruption gotchas in production is something that really scares me. I'll take any declarative constraints at the database level that I can get, to have at least some safety net; MySQL unfortunately falls short on that.
PostgreSQL has the added benefit that it can run significantly more algorithms using more advanced data-structures in the database. Most of our large projects have a few cases where MySQL will hit its limits. Moving the algorithms outside the database requires considerably more effort with pretty tricky code involving correct locking and synchronization. In particular I have at one time or another hit the need for partial indexes, indexes on expressions, custom aggregate functions, set returning stored procedures, array and hash datatypes, inverted indexes on array values, update/delete-returning, deferrable foreign key constraints.
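As a flavour of two of those features, on a hypothetical `users` table in PostgreSQL:

    CREATE INDEX active_users_idx ON users (last_login) WHERE active;   -- partial index
    CREATE INDEX users_email_lower_idx ON users (lower(email));         -- index on an expression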
On the other hand, MySQL has, at least for now, a better story for scale-out. If I had to support a huge number of users on a reasonably simple application, and had the team to build a heavily partitioned and replicated database with eventual consistency, I'd pick MySQL over PostgreSQL for the low-level data storage building block. On the other hand, the real competitors in that space are the key-value databases.
are these "problems" with MySQL really that bad?
Actually, the pain MySQL will inflict on you can range from moderate to insane, and much of it depends on MyISAM.
I find a good rule of thumb is this:
are you backing up some MyISAM tables?
MyISAM is great for data you don't really care about, like traffic logs, or for data that you can easily restore in case of a problem, since it's read-only and hence never changed after you loaded that 10GB dump. In those cases the compact row format of MyISAM brings great space savings (which, however, don't translate into faster sequential scan speed, for some reason).
If the data you put in MyISAM tables is worth backing up, you are going to enter a world of hurt when you realize some day that it is all inconsistent because of the lack of FK and constraint checks, and incidentally all your backups will contain inconsistent data too.
If you make lots of concurrent updates to MyISAM tables, then you are going to go way past the world-of-hurt stage: when the load reaches a certain threshold, you are doomed. Of course readers block writers, which block readers, which block queued writers, and so on, so performance is bad, load average goes to 200, and your box is nuked; but I could also consistently crash MyISAM tables in a benchmark I wrote two years ago just by hitting them with too much load. Random data ensued, sometimes crashing mysqld on SELECTs or spewing random errors.
So, if you avoid MyISAM like the plague it is, the problems with MySQL aren't really that bad. InnoDB is robust. However, I generally find it inferior to Postgres, which is faster, has far fewer gotchas, and Gets The Job Done more easily and quickly.
No, the issues you mention are NOT a big deal. See Google and Facebook as two examples of companies that are using MySQL to accomplish Herculean tasks you'll only ever dream of encountering.
I use the following rules when running MySQL to prevent headaches down the line:
Take daily, weekly, and monthly snapshots of the database (a minimal example follows this list). More often than not, the problems you'll run into have nothing to do with MySQL; instead it's a boneheaded developer running:
DELETE FROM mytable; # Where is the WHERE?
Use InnoDB by default, the only reason to use MyISAM is for full text search.
Get your database schema under source control.
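For rule 1, a minimal cron-able snapshot command, assuming InnoDB tables (--single-transaction gives a consistent dump without locking them); paths and schedule are just placeholders:

    mysqldump --single-transaction --all-databases | gzip > /backups/mysql-$(date +%F).sql.gz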
Why don't databases automatically index tables based on query frequency? Do any tools exist to analyze a database and the queries it is receiving, and automatically create, or at least suggest which indexes to create?
I'm specifically interested in MySQL, but I'd be curious for other databases as well.
That is one of the best questions I have seen on stackoverflow. Unfortunately I don't have an answer. Google's BigTable does automatically index the right columns, but BigTable doesn't allow arbitrary joins, so the problem space is much smaller.
The only answer I can give is this:
One day someone asked, "Why can't the computer just analyze my code and compile & statically type the pieces of code that run most often?"
People are solving this problem today (e.g. Tamarin in FF3.1), and I think "auto-indexing" relational databases is the same class of problem, but it isn't as much a priority. A decade from now, manually adding indexes to a database will be considered a waste of time. For now, we are stuck with monitoring slow queries and running optimizers.
There are database optimizers that can be enabled or attached to databases to suggest (and in some cases perform) indexes that might help things out.
However, it's not actually a trivial problem, and when these aids first came out users sometimes found it actually slowed their databases down due to inferior optimizations.
Lastly, there's a LOT of money in the industry for database architects, and they prefer the status quo.
Still, databases are becoming more intelligent. If you use SQL server profiler with Microsoft SQL server you'll find ways to speed your server up. Other databases have similar profilers, and there are third party utilities to do this work.
But if you're the one writing the queries, hopefully you know enough about what you're doing to index the right fields. If not then having the right indexes is likely the least of your problems...
-Adam
MS SQL 2005 also maintains an internal record of suggested indexes to create, based on usage data. It's not as complete or accurate as the Tuning Advisor, but it is automatic. Research dm_db_missing_index_groups for more information.
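A rough sketch of querying those DMVs (column names taken from the SQL Server documentation; weight the ordering however suits your workload):

    SELECT d.statement AS table_name,
           d.equality_columns, d.inequality_columns, d.included_columns,
           s.user_seeks, s.avg_user_impact
    FROM sys.dm_db_missing_index_groups g
    JOIN sys.dm_db_missing_index_group_stats s ON s.group_handle = g.index_group_handle
    JOIN sys.dm_db_missing_index_details d ON d.index_handle = g.index_handle
    ORDER BY s.user_seeks * s.avg_user_impact DESC;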
There is a script, on I think an MS SQL blog, for suggesting indexes in SQL 2005, but I can't find the exact one right now! From the description it's just the thing, as I recall. Here's a link to some more info: http://blogs.msdn.com/bartd/archive/2007/07/19/are-you-using-sql-s-missing-index-dmvs.aspx
PS: just for SQL Server 2005+
There are tools out there for this.
For MS SQL, use the SQL Profiler (to record activity against the database), and the Database Engine Tuning Advisor (SQL 2005) or the Index Tuning Wizard (SQL 2000) to analyze the activities and recommend indexes or other improvements.
Yes, some engines DO support automatic indexing. One such example for MySQL is Infobright; their engine does not support "conventional" indexes and instead implicitly indexes everything - it is a column-based storage engine.
The behaviour of such engines tends to be very different from what developers expect (and yes, you need to be a DEVELOPER to even be thinking about using Infobright; it is not a plug-in replacement for a standard engine).
I agree with what Adam Davis says in his comment. I'll add that if such a mechanism existed to create indexes automatically, the most common reaction to this feature would be, "That's nice... How do I turn it off?"
Part of the reason may be that indexes don't just give a small speedup. If you don't have a suitable index on a large table queries can run so slowly that the application is entirely unusable, and possibly if it is interacting with other software it simply won't work. So you really need the indexes to be right before you start trying to use the application.
Also, rather than building an index in the background, and slowing things down further while it's being built, it is better to have the index defined before you start adding significant amounts of data.
I'm sure we'll get more tools that take sample queries and work out what indexes are necessary; we will also probably eventually get databases that do as you suggest and monitor performance and add the indexes they think are necessary, but I don't think they will be a replacement for starting off with the right indexes.
It seems that MySQL doesn't have a user-friendly profiler. Maybe you want to try something like this: a PHP class based on the MySQL profiler.
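If you just want something built in, MySQL's own (rather bare-bones) profiler has been available since 5.0.37 and is what such classes typically wrap:

    SET profiling = 1;
    SELECT COUNT(*) FROM orders WHERE customer_id = 42;   -- whatever query you're investigating
    SHOW PROFILES;                 -- recent statements with their Query_ID and duration
    SHOW PROFILE FOR QUERY 1;      -- per-stage timing breakdown for statement 1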
Amazon's SimpleDB has automatic indexing on all columns based on your usage:
http://aws.amazon.com/simpledb/
It has other limitations though:
It's a key-value store, not an RDB. Obviously that means slow joins (and no built-in join support).
It has a 10gb limit on table size. There are libraries that will handle partitioning big data for you although this locks you into that library's way of doing things, which can have its own problems.
It stores all values as strings, even numbers, which makes sorting a column containing 1, 9, and 10 come out as 1, 10, 9 unless you use a library that works around this by zero-padding. This also impacts negative numbers.
The 10gb limit is bigger than many might assume, so you could proceed with this for a simple site that you plan on rewriting if it ever hits big.
It's unfortunate this kind of automatic indexing didn't make it into DynamoDB, which appears to have replaced it - they don't even mention SimpleDB in their product list anymore; you have to find it through old links.
Google App Engine does that (see the index.yaml file).