Performance considerations when using MySQL VIEWs - mysql

I was considering using MySQL views to provide an abstraction when pulling data from the DB. As I was looking for material on this, I came across this article, which ends with:
MySQL has long way to go getting
queries with VIEWs properly optimized.
The article is from 2007. Is this still applicable? Eg: Has MySQL solved these issues?

MySQL views work fine functionally, but they perform badly in the majority of cases.
This is because introducing even quite a simple view tends to cause a much worse query plan to be used by the optimiser. This makes using views impractical in the general case.
Often the server will materialise the entire view as a temporary table, which is not helpful for good performance (unless it happens to have only a very small number of rows in it).
So if you thought the explain plan was ok without a view, with a view it can turn terrible. This is still unresolved in the latest dev version of MySQL as far as I know.
You can have views, but don't expect decent (i.e. acceptable) performance.

Related

Slow remote queries

I am working on a Rails application and was using SQLite during development and the speed had been very fast.
I have started to use a remote MySQL database hosted by Amazon and am getting very slow query times. Besides trying to optimize the remote database, is there anything on the Rails side of things I can do?
Local database access vs. remote will show a significant difference in speed. Since you've not provided any specifics I can't zero in on the issue but I can make a suggestion:
Try caching your queries and views as much as possible. This will reduce the amount of queries you need to do. This works well especially for static data like menus.
Optimization is the key. Make sure you eliminate as many unnecessary queries as you can, and those queries you make only request the fields you need using the select method.
Profile the various components involved. The database server itself is one of them. The network latency is another. While for the second one probably there is little you can do, probably you can tweak alot the first part. Starting from profiling the queries and going to tweaking the server itself.
Knowing where to look for will help you start with the best approach. As for caching, always keep that in mind, but that can prove to be quite problematic depending on the nature of your application.

Creating a view that contains a view - performance hit?

When composing a view - should I stick with the base tables or can I feel confident that including a view within a view will not hurt performance. I want to include the view because it would allow me to change one base view if i have a change to the table design opposed to updating every single view that is dependent on the table changed. It just seems like the smarter thing to do, but want to make sure I am not doing something considered bad practice or hurts performance.
In SQL Server views get resolved at compile time. There is a very small performance impact during the compilation. There is no impact on the actual query execution. That assumes however that the same plan will be selected. If you nest views that contain complex joins, you might run into a situation where you access a table more often than necessary. The optimizer wont be able to figure that out and the system will end up doing a lot more work than necessary. So be careful to put only views into a query that do not contain more tables than you would have accessed by writing the query without the view.
Historically, some platforms have had "trouble" optimizing queries that incorporated multiple levels of views. I say "trouble", because most of the time even the poorly optimized queries were fast enough for me. (Almost all the time. But I try not to live near the bleeding edge.)
Several years ago, I decided I'd use views whenever it made sense to. Thoughtful use of views can greatly simplify complex databases; we all know that. But I decided to trust the optimizer to do a good enough job, and to trust the developers to release upgrades that made the optimizer better before my queries buried the server.
So if I thought a view would reduce the mental load on me, I created a view. If I needed to query a view of a view of a view, I just did it.
So far, that decision has proven to be a good one for me. I've never killed a server with a query, and I still understand my tables and views. (I still look at execution plans and test performance before I move a query to production, though.)

Are the consistency/data loss/query optimization issues I read about "that bad"?

As I've been looking into the differences between Postgres and MySQL, it has struck me that, if what I read is to be believed, MySQL should be (disclaimer: by reading the rest of this sentence, you agree to read the next paragraph as well) the laughingstock of the RMDB world: it doesn't enforce ACID by default, the net is rife with stories of MySQL-related data loss and by all accounts and the query optimizer is a joke.
But none of this seems to matter. It's not hard to tell that MySQL has about a million times* as much hype as Postgres (it's LAMP, not LAPP), big installations of MySQL are not unheard of (LJ? Digg?) and I haven't noticed a drop in MySQL's popularity.
This makes me wonder: are these "problems" with MySQL really that bad?
So, if you have used MySQL for a reasonably large project**, what was your experience like? Did you use Postgres as well? How was it worse? How was it better?
*: [citation needed]
**: I'm well aware that, for "small things" (blogs, what have you), MySQL (along with practically every other RDB) is just fine.
Since it's tagged [subjective], I'll be subjective. For me it's about the little things. PostgreSQL is more developer friendly and makes it easy to do the right thing regarding data integrity by default.
If you give MySQL an incorrect type, it will implicitly convert it even if the conversion is incorrect. PostgreSQL will complain.
EXPLAIN in PostgeSQL is way more useful than in MySQL. It gives you the exact structured query plan. What kind of algorithm will it use, what cost does does each step have, etc. This means that if the query optimizer in MySQL doesn't do what you think it does, you will have hard time to debug it.
If you ever wrote anything more complex in the MySQL stored procedure language, you will know how painful it is. PL/pgSQL is actually a nice language + you can use many other languages.
MySQL doesn't have sequences, so if you need them you have to roll your own. Most people will do it wrong and have race conditions in their code.
PostgreSQL exposes most of it's internal lock types to the developer. If you need to lock your table in a special way, you can do that.
Everything is programmable in PostgreSQL. For example, if you need your own data type for some specific data, you can add it. You can add casts and operators for the data types. Probably not worth the effort for small projects, but it's better than storing things as strings.
PostgreSQL adds every action including DDL changes to a transaction, unlike MySQL. If you have a conversion script that creates/drops tables, BEGIN/END won't help you in MySQL to keep it in consistent state.
That doesn't mean it's impossible to write good database applications with MySQL, it just requires more effort.
MySQL can be used for reasonably large applications, provided you really know what you do and don't trust the defaults.
MySQL defaults are optimized to be easy-to-use and to get started quickly and to provide best performance (usually). Other databases choose defaults that are at the very least ACID and are scalable (i.e. choose defaults that are not necessarily the best/fastest for small data sets)
Another item is that MySQL only learned to be a "real database" relatively recent, while almost all competing products started life with full ACID in mind.
MySQL had problems with almost all aspects of ACID at one time or another. Most of them are gone or can be configured away, but you will have to check each one. The problem with troubles in atomicity for example is that you will not notice them until you place your system under heavy load (which often coincides with it being a production system, unfortunately).
So my summary would be: MySQL is capable of working in this environments, but it takes work. And the path it took to get to that point cost it quite a few points in the confidence area.
Provided you know what its capabilities are, then it may fit your use case.
If used correctly, then it is ACID compliant. If used incorrectly, it is not. The trouble is, that people seem to assume that it's a good thing to have ACID compliance.
In reality ACID is often the enemy of performance (Particularly the D for durability). By relaxing durability very slightly, we can typically get a very large performance boost.
Likewise, even using the MyISAM engine (which doesn't have much by way of durability, and not a lot of the others either) is still appropriate to some problem domains.
We are using MySQL in some applications - and it is doing a pretty good job.
In the newer projects we are using the InnoDB engine - and albeit it may be slower than the default engine it is working well.
Right now we are using an ORM mapper - and so most of the complexity is hidden behind the ORM mapper (and working nice).
I think the infrastructure (Tools and information) is one of MySQL's big plusses: we are using really nice tools: Toad for MySQL and MySQL Administrator.
Altough I have to admit that I had a shocking experience last week when helping a friend with a SQL statement and the correleated subquery nearly stopped his MySQL server - but with the trick of enclosing it in another query - it worked really well.
This is nothing which REALLY shocks me - because I've used other DB systems which cost big bucks (I'm looking at you - DB2) - and they had other things to work around. (maybe not as drastic - but still you had to optimize for them).
I haven't used both for a single large project, but having used both I have some idea of how they compare.
In general almost all MySQL's problems can be worked around with good discipline. The issue is more that developer has to know all the gotchas and work around them. After working with PostgreSQL or Oracle this feels a bit like death by a thousand papercuts. You get that used to stuff just working.
This is a pretty significant issue in the types of stuff that I have worked on. Complex schemas with complex queries and lots of data. tight schedules with little time for performance engineering meaning that getting consistently reasonable performance without having to manually optimize queries is important. A good cost based optimizer is almost a requirement. Combine that with quite a lot of outsourcing with development teams that don't have the experience to catch all the gotchas in time and the little issues escalate to large QA problems. Hitting any of MySQL silent data corruption gotchas in production is something that really scares me. I'll take any declarative constraints at the database level that I can get to have atleast some safety net, MySQL unfortunately falls short on that.
PostgreSQL has the added benefit that it can run significantly more algorithms using more advanced data-structures in the database. Most of our large projects have a few cases where MySQL will hit its limits. Moving the algorithms outside the database requires considerably more effort with pretty tricky code involving correct locking and synchronization. In particular I have at one time or another hit the need for partial indexes, indexes on expressions, custom aggregate functions, set returning stored procedures, array and hash datatypes, inverted indexes on array values, update/delete-returning, deferrable foreign key constraints.
On the other hand MySQL has at least for now a better story for scale out. If I had to support a huge number users on a reasonably simple application, and had the team to build a heavily partitioned and replicated database with eventual consistency, I'd pick MySQL over PostgreSQL for the low level data storage building block. On the other the competitors in that space are the key-value databases.
are these "problems" with MySQL really that bad?
Actually, the pain MySQL will inflict on you can range from moderate to insane, and much of it depends on MyISAM.
I find a good rule of thumb is this :
are you backing up some MyISAM tables ?
MyISAM is great for data you don't really care about, like traffic logs and the like, or for data that you can easily restore in case of a problem since it's read-only and hence never changed since the time you loaded that 10GB dump. In those cases the compact row format of MyISAM brings great space savings (that however do not translate into faster seq scan speed, for some reason).
If the data you put in MyISAM tables is worth backing up, you are going to enter in a world of hurt when you realize some day that it is all inconsistent because of the lack of FK and constraint checks, and incidentally all your backups will contain inconsistent data too.
If you make lots of concurrent updates to MyISAM tables, then you are gonna go way past the world of hurt stage : when the load reaches a certain threshold, you are doomed. Of course the readers block writers which block readers which block queued writers, etc, so the performance is bad, load avg goes to 200, and your box is nuked, but also I could consistently crasy MyISAM tables in a benchmark I wrote 2 years ago just by hitting them with too much load. Random data ensued, sometimes crashing the mysql on selects or spewing random errors.
So, if you avoid MyISAM like the plague it is, the problems with MySQL aren't really that bad. InnoDB is robust. However, generally I find it inferior to Postgres, which is faster and has so many less gotchas, and Gets The Job Done easier and faster.
No, the issues you mention are NOT a big deal. See Google and Facebook as two examples of companies that are using MySQL to accomplish Herculean tasks you'll only ever dream of encountering.
I use the following rules when running a MySQL to prevent headaches down the line:
Take daily, weekly, monthly snapshots of database. More often than not the problems you'll run in to have nothing to do with MySQL, instead it's a boneheaded developer running:
DELETE FROM mytable; # Where is the WHERE?
Use InnoDB by default, the only reason to use MyISAM is for full text search.
Get your database schema under source control.

Is MySQL appropriate for a read-heavy database with 3.5m+ rows? If so, which engine?

My experience with databases is with fairly small web applications, but now I'm working with a dataset of voter information for an entire state. There are approximately 3.5m voters and I will need to do quite a bit of reporting on them based on their address, voting history, age, etc. The web application itself will be written with Django, so I have a few choices of database including MySQL and PostgreSQL.
In the past I've almost exclusively used MySQL since it was so easily available. I realize that 3.5m rows in a table isn't really all that much, but it's the largest dataset I've personally worked with, so I'm out of my personal comfort zone. Also, this project isn't a quickie throw-away application though, so I want to make sure I choose the best database for the job and not just the one I'm most comfortable with.
If MySQL is an appropriate tool for the job I would also like to know if it makes sense to use InnoDB or MyISAM. I understand the basic differences between the two, but some sources say to use MyISAM for speed but InnoDB if you want a "real" database, while others say all modern uses of MySQL should use InnoDB.
Thanks!
I've run DB's far bigger than this on mysql- you should be fine. Just tune your indexes carefully.
InnoDB supports better locking semantics, so if there will be occasional or frequent writes (or if you want better data integrity), I'd suggest starting there, and then benchmarking myisam later if you can't hit your performance targets.
MyISAM only makes sense if you need speed so badly that you're willing to accept many data integrity issues downsides to achieve it. You can end up with database corruption on any unclean shutdown, there's no foreign keys, no transactions, it's really limited. And since 3.5 million rows on modern hardware is a trivial data set (unless your rows are huge), you're certainly not at the point where you're forced to optimize for performance instead of reliability because there's no other way to hit your performance goals--that's the only situation where you should have to put up with MyISAM.
As for whether to choose PostgreSQL instead, you won't really see a big performance difference between the two on an app this small. If you're familiar with MySQL already, you could certainly justify just using it again to keep your learning curve down.
I don't like MySQL because there are so many ways you can get bad data into the database where PostgreSQL is intolerant of that behavior (see Comparing Speed and Reliability), the bad MyISAM behavior is just a subset of the concerns there. Given how fractured the MySQL community is now and the uncertainties about what Oracle is going to do with it, you might want to consider taking a look at PostgreSQL just so you have some more options here in the future. There's a lot less drama around the always free BSD licensed PostgreSQL lately, and while smaller at least the whole development community for it is pushing in the same direction.
Since it's a read-heavy table, I will recommend using MyISAM table type.
If you do not use foreign keys, you can avoid the bugs like this and that.
Backing up or copying the table to another server is as simple as coping frm, MYI and MYD files.
If you need to compute reports and complex aggregates, be aware that postgres' query optimizer is rather smart and ingenious, wether the mysql "optimizer" is quite simple and dumb.
On a big join the difference can be huge.
The only advantage MySQL has is that it can hit the indexes without hitting the tables.
You should load your dataset in both databases and experiment the biger queries you intend to run. It is better to spend a few days of experimenting, rather than be stuck with the wrong choice.

Do any databases support automatic Index Creation?

Why don't databases automatically index tables based on query frequency? Do any tools exist to analyze a database and the queries it is receiving, and automatically create, or at least suggest which indexes to create?
I'm specifically interested in MySQL, but I'd be curious for other databases as well.
That is a best question I have seen on stackoverflow. Unfortunately I don't have an answer. Google's bigtable does automatially index the right columns, but BigTable doesn't allow arbitrary joins so the problem space is much smaller.
The only answer I can give is this:
One day someone asked, "Why can't the computer just analyze my code and and compile & statically type the pieces of code that run most often?"
People are solving this problem today (e.g. Tamarin in FF3.1), and I think "auto-indexing" relational databases is the same class of problem, but it isn't as much a priority. A decade from now, manually adding indexes to a database will be considered a waste of time. For now, we are stuck with monitoring slow queries and running optimizers.
There are database optimizers that can be enabled or attached to databases to suggest (and in some cases perform) indexes that might help things out.
However, it's not actually a trivial problem, and when these aids first came out users sometimes found it actually slowed their databases down due to inferior optimizations.
Lastly, there's a LOT of money in the industry for database architects, and they prefer the status quo.
Still, databases are becoming more intelligent. If you use SQL server profiler with Microsoft SQL server you'll find ways to speed your server up. Other databases have similar profilers, and there are third party utilities to do this work.
But if you're the one writing the queries, hopefully you know enough about what you're doing to index the right fields. If not then having the right indexes is likely the least of your problems...
-Adam
MS SQL 2005 also maintains an internal reference of suggested indexes to create based on usage data. It's not as complete or accurate as the Tuning Advisor, but it is automatic. Research dm_db_missing_index_groups for more information.
There is a script on I think an MS SQL blog with a script for suggesting indexes in SQL 2005 but I can't find the exact script right now! Its just the thing from the description as I recall. Here's a link to some more info http://blogs.msdn.com/bartd/archive/2007/07/19/are-you-using-sql-s-missing-index-dmvs.aspx
PS just for SQL Server 2005 +
There are tools out there for this.
For MS SQL, use the SQL Profiler (to record activity against the database), and the Database Engine Tuning Advisor (SQL 2005) or the Index Tuning Wizard (SQL 2000) to analyze the activities and recommend indexes or other improvements.
Yes, some engines DO support automatic indexing. One such example for mysql is Infobright, their engine does not support "conventional" indexes and instead implicitly indexes everything - this is a column-based storage engine.
The behaviour of such engines tends to be very different from what developers (And yes, you need ot be a DEVELOPER to even be thinking about using Infobright; it is not a plug-in replacement for a standard engine) expect.
I agree with what Adam Davis says in his comment. I'll add that if such a mechanism existed to create indexes automatically, the most common reaction to this feature would be, "That's nice... How do I turn it off?"
Part of the reason may be that indexes don't just give a small speedup. If you don't have a suitable index on a large table queries can run so slowly that the application is entirely unusable, and possibly if it is interacting with other software it simply won't work. So you really need the indexes to be right before you start trying to use the application.
Also, rather than building an index in the background, and slowing things down further while it's being built, it is better to have the index defined before you start adding significant amounts of data.
I'm sure we'll get more tools that take sample queries and work out what indexes are necessary; also probably we will eventually get databases that do as you suggest and monitor performance and add indexes they think are necessary, but I don't think they will be a replacement for starting off with the right indexes.
Seems that MySQL doesn't have a user-friendly profiler. Maybe you want to try something like this, a php class based in MySQL profiler.
Amazon's SimpleDB has automatic indexing on all columns based on your usage:
http://aws.amazon.com/simpledb/
It has other limitations though:
It's a key-value store, not an RDB. Obviously that means slow joins (and no built-in join support).
It has a 10gb limit on table size. There are libraries that will handle partitioning big data for you although this locks you into that library's way of doing things, which can have its own problems.
It stores all values as strings, even numbers, which makes sorting a column with a 1,9, and 10 come out like 1,10,9 unless you use a library which hacks this by 0 padding. This also impacts negative numbers.
The 10gb limit is bigger than many might assume, so you could proceed with this for a simple site that you plan on rewriting if it ever hits big.
It's unfortunate this kind of automatic indexing didn't make it into DynamoDb, which appears to have replaced it - they don't even mention SimpleDb in their Product list anymore, you have to find it through old links to it.
Google App Engine does that (see the index.yaml file).