MySQL speed, table index and SELECT/UPDATE/INSERT

We have a MySQL table with more than 7,000,000 (yes, seven million) rows.
We run a large number of SELECT / INSERT / UPDATE queries every 5 seconds.
Is it a good idea to create a MySQL INDEX on that table? Could there be bad consequences, like data corruption or losing the MySQL service?
A little info:
MySQL version 5.1.56
Server: CentOS
Table engine: MyISAM
MySQL CPU load is always between 200% and 400%

In general, indexes will improve the speed of SELECT operations and will slow down INSERT/UPDATE/DELETE operations, as both the base table and the indexes must be modified when a change occurs.
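For example, a minimal sketch of adding and checking an index (the orders table and customer_id column are hypothetical, not from the question):

CREATE INDEX idx_orders_customer ON orders (customer_id);
-- Confirm the optimizer actually uses it before relying on it:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;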

It is very difficult to say in general. I would expect that building the index itself might take some time, but after that you should see some improvement. As @Joe and @Patrick said, it might hurt your modification time, but selecting will be faster.
Of course, there are other ways of improving insert and update performance. You could, for example, batch updates if it is not important for changes to be visible immediately.
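For instance, one way to batch in MySQL is a multi-row INSERT ... ON DUPLICATE KEY UPDATE; this is a sketch with an invented counters table, collapsing three statements into one round trip:

INSERT INTO counters (id, hits) VALUES (1, 10), (2, 7), (3, 4)
ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits);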

The indexes will help dramatically with selects, especially if they match up well with the commonly filtered fields and you have a good, simple primary key. They will help with both query time and processing cycles.
The drawbacks come if you are very often updating/altering/deleting these records, especially the indexed fields. Even in this case, though, it is often worth it.
How much you're going to be reporting (SELECT statements) vs. updating hugely affects both your initial design and your later adjustments once your db is in the wild. Since you already have what you have, testing will give you the answers you need. If you really do a lot of select queries and a lot of updating, your solution might be to copy out data now and then to a reporting table. Then you can index like crazy with no ill effects.
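A rough sketch of that copy-out approach (table and column names here are purely illustrative):

CREATE TABLE reporting_copy AS SELECT * FROM live_table;
CREATE INDEX idx_report_field1 ON reporting_copy (field1);
CREATE INDEX idx_report_field2 ON reporting_copy (field2, field3);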
You have actually asked a large question, and you should study up on this more. The general points I've mentioned above hold for almost all relational dbs, but there are also particular behaviors of the particular database (MySQL in your case), mainly in how it decides when and where to use indexes.

If you are looking for performance, indexes are the way to go. Indexes speed up your queries. If you have 7 million records, your queries are probably taking many seconds, possibly a minute, depending on your memory size.
Generally speaking, I would create indexes that match the most frequent SELECT statements. Everyone talks about the negative impact of indexes on table size and speed, but I would neglect those impacts unless you have a table where 95% of the activity is inserts and updates. And even then, if those inserts happen at night and you query during the day, go and create those indexes; your daytime users will appreciate it.
What is the actual time impact of an additional index on an insert or update statement, 0.001 seconds maybe? If the index saves you many seconds per query, I'd guess the additional time required to update the index is well worth it.
The only time I ever had an issue with creating an index (it actually broke the program logic) was when we added a primary key to a table that had previously been created (by someone else) without one, and the program expected the SELECT statement to return records in the order they were created. Adding the primary key changed that: records selected without any WHERE clause came back in a different order.
This is obviously wrong design in the first place; nevertheless, if you have an older program and encounter tables without a primary key, I suggest looking at the code that reads the table before adding one, just in case.
One last thought about creating indexes: the choice of fields and the order in which the fields appear in the index have an impact on the performance of the index.
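To illustrate that last point with a hypothetical tickets table: a composite index can only be used through its leftmost prefix, so column order decides which queries it serves.

CREATE INDEX idx_status_created ON tickets (status, created_at);
-- Can use the index (leftmost column present):
SELECT id FROM tickets WHERE status = 'open';
SELECT id FROM tickets WHERE status = 'open' AND created_at > '2011-01-01';
-- Cannot use it (leftmost column missing):
SELECT id FROM tickets WHERE created_at > '2011-01-01';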

I had the same kind of problem you describe.
I made a few changes, and one query went from 11 seconds to a few milliseconds:
1- Upgraded to MariaDB 10.1
2- Changed ALL my DB to the ARIA engine
3- Changed my.cnf to the strict minimum
4- Upgraded to PHP 7.1 (but this one had only a small impact)
5- On CentOS: "yum update" in the terminal or via ssh (keeping everything up to date)
Why these help:
1- MariaDB is the open-source continuation of MySQL
2- The ARIA engine is the evolution of MyISAM
3- my.cnf usually carries too many changed options that hurt performance
Here is an example:
[mysqld]
performance-schema=1
general_log=0
slow_query_log=0
max_allowed_packet=268435456
Removing all the extra options from my.cnf tells MySQL to use default values.
In MySQL 5 (5.1, 5.5, 5.6...), when I did that, I only noticed a small difference.
But in MariaDB, a small my.cnf like this made a BIG difference.
Note: for ALL of these changes, the server hardware remained the same.
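For reference, converting to Aria is one statement per table, and information_schema can list what is still on MyISAM (mytable and mydb are placeholders):

ALTER TABLE mytable ENGINE=Aria;
SELECT table_name FROM information_schema.tables
WHERE table_schema = 'mydb' AND engine = 'MyISAM';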
Hope it can help you

Related

How to improve a "select min(my_col)" query in MySQL without adding an index

The query below takes about a minute to run on my MySQL instance (running on a fairly beefy machine with 64GB memory, 2TB disk, and a 2.30GHz CPU with 8 cores and 16 logical processors; the query runs on localhost). This same query runs in less than a second on a SQL Server database I have access to. Unfortunately, I do not have access to the SQL Server host or the DBA, etc.
select min(visit_start_date)
from visit_occurrence;
The table is set to ENGINE=MyISAM, and default-storage-engine=INNODB and innodb_buffer_pool_size=16G are set in my.ini.
Is there some configuration I could be missing that would cause this query to run so slowly on MySQL? How can I fix it?
I have a large number of tables and queries I will need to support so I would really like to be able to fix this issue globally rather than having to create indexes everywhere I have slow queries.
The SQL Server database does not seem to have an index on the column being queried.
EDIT:
Untagged MS SQL Server. I had tagged it hoping our MS SQL Server colleagues could explain whether SQL Server structures data and/or queries in some way that makes this type of query run faster on that platform than on others such as MySQL.
Removed the image of code to conform more closely with community standards.
You never know if there is a magic go-faster button if you don't ask (ENGINE=MyISAM is sometimes kind of like a magic go-faster button for some queries in MySQL). I'm kind of fishing for a potential hardware or clustering solution here. Is Apache Ignite a potential solution?
Thanks again to the community for all of your support and help. I hope this fixes most of the issues that have been raised for this post.
SECOND EDIT:
Is the partitioning/sharding described in the links below a potential solution here?
https://user3141592.medium.com/how-to-scale-mysql-42ebd2841fa6
https://dev.mysql.com/doc/refman/8.0/en/partitioning-overview.html
THIRD EDIT: A note on community standards.
Part of our community standards is explicitly to be welcoming, inclusive, and to be nice.
https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/?fbclid=IwAR1gr6r2qmXs506SAV3H_h6H8LoFy3mlXucfa-fqiiEXMHUR3aF_tdoZGsw
https://meta.stackexchange.com/questions/240839/the-new-new-be-nice-policy-code-of-conduct-updated-with-your-feedback
The MS SQL Server tag was used here because one of the systems I'm comparing is MS SQL Server. We're really working with very limited information here. I have two systems: my MySQL system, which is knowable since I'm running it, and the MS SQL Server instance running the same database in someone else's system, about which I have very little information (all I have is a read-only SQL prompt). I am comparing apples and oranges: the same query runs well on the orange (MS SQL Server) and does not run well on the apple (my MySQL instance). I'd like to know why, so I can make an informed decision about how to get my queries to run in a reasonable amount of time. How do I get my apple to look like an orange? Do I switch to MS SQL Server? Do I need to deploy on different hardware? Is the other system running some kind of in-memory caching system on top of their database instance? Most of these possibilities would require a non-trivial amount of time to explore and validate. So yes, I would like help from MS SQL Server experts who might know if there are caching options, transactional vs. warehouse options, etc. that could be set and would make a world of difference; those would be magic go-fast buttons.
The magic go-fast button comment was perhaps a little bit condescending.
The picture showing the indexes was included because I was just trying to make the point that the other system does not seem to have an index on the column being queried. In this case a picture was worth a thousand words.
If the table says ENGINE=MyISAM, then that is what counts. In almost all cases, this is a bad choice. innodb_buffer_pool_size=16G is not relevant except that it robs memory from MyISAM.
default-storage-engine=INNODB is relevant only when creating a table explicitly specifying the ENGINE=.
Are some of your tables MyISAM and some are InnoDB? How much RAM do you have?
Most performance solutions necessarily involve an INDEX. Please explain why you can't afford an index. It could turn that query into less than 10ms, regardless of the number of rows in the table.
Sorry, but I don't accept "rather than having to create indexes everywhere I have slow queries".
Changing tables from MyISAM to InnoDB will, in some cases, help with performance. I suggest you change the engine as you add the indexes.
Show us some more queries and we can help you decide what indexes are needed. select min(visit_start_date) from visit_occurrence; needs INDEX(visit_start_date); other queries may not be so trivial. Do not fall into the trap of indexing every column.
More
In MySQL...
A single connection only uses one core, so more cores only helps when you have more connections. (Some tiny exceptions exist in MySQL 8.0.)
Partitioning rarely helps with performance; don't use it without getting advice. (PS: BY RANGE is perhaps the only useful variant.)
Replication is for read-scaling (and backup and ...)
Sharding is for write-scaling. It requires a bunch of extra architectural things -- such as routing queries to the appropriate servers. (MariaDB has Spider and FederatedX as possible tools.) In any case, sharding is a non-trivial undertaking.
Clustering is for HA (High Availability, auto-failover, etc), while helping some with read and write scaling. Cf: Galera, InnoDB Cluster.
Hardware is rarely more than a temporary solution to performance issues.
Caching leads to potentially inconsistent results, so beware. Also, consider my mantra "don't bother putting a cache in front of a cache".
(I can advise further on any of these topics.)
Whether in MyISAM or InnoDB, or even SQL Server, your query
select min(visit_start_date) from visit_occurrence;
can be satisfied almost instantaneously by this index, because it uses a so-called loose index scan.
CREATE INDEX visit_start_date ON visit_occurrence (visit_start_date);
A query with an aggregate function like MIN() is always a GROUP BY query; if no GROUP BY clause is present in the SQL statement, the server treats the entire table as a single group.
You mentioned a query that can be satisfied immediately when using MyISAM. That's SELECT COUNT(*) FROM whatever_table. Behind the scenes MyISAM keeps table metadata showing the total number of rows in the table, so that query comes back right away. The transactional storage engine InnoDB doesn't do that. It supports so much concurrency that its designers didn't include the total row count in their metadata, because it would be wrong in so many circumstances that it wasn't worth the risk.
Index design isn't a black art. But it is an art informed by the kind of measurements we get from EXPLAIN (or ANALYZE or EXPLAIN ANALYZE). A basic truth of database-driven apps (in any make of database server) is that indexing needs to be revisited as the app grows. The good news: changing, adding, or dropping indexes doesn't change your data.
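As a quick check, assuming the index above is in place, EXPLAIN should confirm the loose index scan:

EXPLAIN SELECT MIN(visit_start_date) FROM visit_occurrence;
-- The Extra column should read "Select tables optimized away":
-- the answer comes from one end of the index without touching the table.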

Is Postgres better than MySql when one needs to add a column to a table with millions of rows?

We're having problems with MySQL. When I search around, I see many people having the same problem.
I have joined up with a product where the database has some tables with as many as 150 million rows. One example of our problem is that one of these tables has over 30 columns and about half of them are no longer used. When trying to remove columns or renaming columns, mysql wants to copy the entire table and rename. With this amount of data, it would take many hours to do this and the site would be offline pretty much the whole time. This is just the first of several large migrations to improve the schema. These aren't intended as a regular thing. Just a lot of cleanup I inherited.
I tried searching to see whether people have the same problem with Postgres and found almost nothing discussing this issue. Is this because Postgres is a lot better at it, or just because fewer people are using Postgres?
In PostgreSQL, adding a new column without default value to a table is instantaneous, because the new column is only registered in the system catalog, not actually added on disk.
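A sketch, with big_table and new_flag as stand-in names; in the PostgreSQL versions discussed here, it is adding a DEFAULT that forces the expensive rewrite:

-- Instant: only the system catalog changes.
ALTER TABLE big_table ADD COLUMN new_flag boolean;
-- Forces a full table rewrite in these versions:
ALTER TABLE big_table ADD COLUMN new_flag2 boolean DEFAULT false;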
When the only tool you know is a hammer, all your problems look like nails. For this problem, PostgreSQL is much, much better at handling these types of changes. And the fact is, no matter how well you designed your app, you WILL have to change the schema on a live database someday. While MySQL's various engines really are amazing for certain corner cases, here none of them help. PostgreSQL's very close integration between the various layers means you can have things like transactional DDL, which lets you roll back anything that isn't an ALTER / CREATE DATABASE / TABLESPACE. Or very, very fast ALTER TABLEs. Or non-blocking index creation. And so on. This limits PostgreSQL to the things it does well (traditional transactional db load handling is a strong point) and makes it not so great at the things MySQL often fills the gaps on, like live networked clustered storage with the NDB engine.
In this case none of MySQL's engines lets you easily solve this problem. The very versatility of multiple storage engines means that the lexer / parser / top layer of the DB cannot be as tightly integrated with the storage engines, and therefore a lot of the cool things pgsql can do here mysql can't.
I've got a 118GB table in my stats db with 1.1 billion rows in it. It really should be partitioned, but it's not read a whole lot, and when it is, we can wait on it. At 300MB/sec (the speed the array it's on can read), it takes approximately 118 × ~3 seconds to read, or right around six minutes. This machine has 32GB of RAM, so it cannot hold the table in memory.
When I ran the simple statement on this table:
alter table mytable add test text;
it hung waiting for a vacuum. I killed the vacuum (SELECT pg_cancel_backend(12345), with the vacuum's pid in there) and the ALTER finished immediately. A vacuum on this table takes a long time to run, by the way. Normally that's not a big deal, but when making changes to table structure, you have to wait on vacuums, or kill them.
Dropping a column is just as simple and fast.
Now we come to the problem with PostgreSQL, and that is the in-heap MVCC storage. If you add that column and then do UPDATE mytable SET test = 'abc', it updates each row and exactly doubles the size of the table. Unless HOT can update the rows in place; but then you need a 50% fill-factor table, which is double-sized to begin with. The only way to get the space back is to either wait and let vacuum reclaim it over time, reusing it one update at a time, or run CLUSTER or VACUUM FULL to shrink it back down.
You can get around this by running updates on parts of the table at a time (UPDATE ... WHERE pkid BETWEEN 1 AND 10000000; ...) and running VACUUM between each run to reclaim the space.
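Something like the following, with the ranges sized to taste (the pkid bounds are illustrative):

UPDATE mytable SET test = 'abc' WHERE pkid BETWEEN 1 AND 10000000;
VACUUM mytable;
UPDATE mytable SET test = 'abc' WHERE pkid BETWEEN 10000001 AND 20000000;
VACUUM mytable;
-- ...and so on through the rest of the key range.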
So, both systems have warts and bumps to deal with.
Maybe because this should not be a regular occurrence.
Perhaps, reading between the lines, you need to be adding rows to another table instead of columns to a large existing table..?

MySQL ALTER TABLE on very large table - is it safe to run it?

I have a MySQL database with a MyISAM table with 4 million rows. I update this table about once a week with about 2000 new rows. After updating, I then alter the table like this:
ALTER TABLE x ORDER BY PK DESC
I order the table by the primary key field in descending order. This has not given me any problems on my development machine (Windows with 3GB memory). I have run it successfully three times on the production Linux server (512MB RAM), getting the resulting sorted table in about 6 minutes each time; but the last time I tried it, I had to stop the query after about 30 minutes and rebuild the database from a backup.
Can a 512MB server cope with that alter statement on such a large table? I have read that a temporary table is created to perform the ALTER TABLE command.
Question: Can this alter command be safely run? What should be the expected time for the alteration of the table?
As I have just read, the ALTER TABLE ... ORDER BY ... query is useful to improve performance in certain scenarios. I am surprised that the PK index does not help with this. But, from the MySQL docs, it seems that InnoDB does use the index. However, InnoDB tends to be slower than MyISAM. That said, with InnoDB you wouldn't need to re-order the table, though you would lose the blazing speed of MyISAM. It still may be worth a shot.
The way you explain the problems, it seems that there is too much data loaded into memory (maybe there is even swapping going on?). You could easily check that by monitoring your memory usage. It's hard to say, as I do not know MySQL all that well.
On the other hand, I think your problem lies in a very different place: you are using a machine with only 512MB of RAM as a database server, with a table containing more than 4 million rows... and you are performing a very memory-heavy operation on the whole table on that machine. It seems that 512MB will not nearly be enough for that.
A much more fundamental issue I am seeing here: You are doing development (and quite likely testing as well) in an environment that is very different to the production environment. The kind of problem you are explaining is to be expected. Your development machine has six times as much memory as your production machine. I believe I can safely say, that the processor is much faster as well. In that case, I suggest you create a virtual machine mimicking your production site. That way you can easily test your project without disrupting the production site.
What you're asking it to do is rebuild the entire table and all its indexes; this is an expensive operation particularly if the data doesn't fit in ram. It will complete, but it will be vastly slower if the data doesn't fit in ram, particularly if you have lots of indexes.
I question your judgement when choosing to run a machine with such tiny memory in production. Anyway:
Is this ALTER TABLE really necessary; what specific query are you trying to speed up, and have you tried it without?
Have you considered making your development machine more like production? I mean, using a dev box with MORE memory than production is never a good idea, and using a different OS definitely isn't either.
There is probably also some tuning you can do to try to help; it largely depends on your schema (indexes in particular). 4M rows is not very many (for a machine with normal amounts of ram).
Is the primary key auto_increment? If so, then doing ALTER TABLE ... ORDER BY isn't going to improve anything, since everything will be inserted in order anyway (unless you have lots of deletes).
I'd probably create a view instead, ordered by the PK value, so that, for one thing, you don't need to lock up that huge table while the ALTER is being performed.
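Something like this, using the x and PK names from the question:

CREATE VIEW x_desc AS SELECT * FROM x ORDER BY PK DESC;

Bear in mind a view only stores the query, not a physically reordered copy, so the sort is still paid at query time; an index on PK should cover it cheaply.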
If you're using InnoDB, you shouldn't have to explicitly perform the ORDER BY either post-insert or at query time. According to the MySQL 5.0 manual, InnoDB already defaults to primary key ordering for query results:
http://dev.mysql.com/doc/refman/5.0/en/alter-table.html#id4052480
MyISAM tables, by contrast, return records in insertion order by default, which may work just as well if you only ever append to the table rather than using UPDATE queries to modify rows in place.

Database for analytics

I'm setting up a large database that will generate statistical reports from incoming data.
The system will for the most part operate as follows:
1. Approximately 400k-500k rows - about 30 columns, mostly varchar(5-30) and datetime - will be uploaded each morning. It's approximately 60MB in flat-file form, but grows steeply in the DB with the addition of suitable indexes.
2. Various statistics will be generated from the current day's data.
3. Reports from these statistics will be generated and stored.
4. The current data set will get copied into a partitioned history table.
5. Throughout the day, the current data set (which was copied, not moved) can be queried by end users for information that is not likely to include constants, but relationships between fields.
6. Users may request specialized searches from the history table, but the queries will be crafted by a DBA.
7. Before the next day's upload, the current data table is truncated.
This will essentially be version 2 of our existing system.
Right now, we're using MySQL 5.0 MyISAM tables (InnoDB was killing us on space usage alone) and suffering greatly on #6 and #4. #4 is currently not a partitioned table, as 5.0 doesn't support it. In order to get around the tremendous amount of time (hours and hours) it's taking to insert records into history, we write each day to an unindexed history_queue table, and then on the weekends, during our slowest time, write the queue to the history table. The problem is that any historical queries generated during the week can be several days behind. We can't reduce the indexes on the historical table or its queries become unusable.
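For what it's worth, once on 5.1 a range-partitioned history table could look roughly like this (column names and partition bounds are invented; note that in 5.1 the partitioning column must be part of every unique key, including the primary key):

CREATE TABLE history (
  id BIGINT NOT NULL,
  entry_date DATE NOT NULL,
  -- ...the ~30 data columns...
  PRIMARY KEY (id, entry_date)
) ENGINE=MyISAM
PARTITION BY RANGE (TO_DAYS(entry_date)) (
  PARTITION p2008 VALUES LESS THAN (TO_DAYS('2009-01-01')),
  PARTITION p2009 VALUES LESS THAN (TO_DAYS('2010-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);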
We're definitely moving to at least MySQL 5.1 (if we stay with MySQL) for the next release but strongly considering PostgreSQL. I know that debate has been done to death, but I was wondering if anybody had any advice relevant to this situation. Most of the research is revolving around web site usage. Indexing is really our main beef with MySQL and it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.
I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?
Commercial databases (Oracle and MS SQL) are simply not an option - although I wish Oracle was.
NOTE on MyISAM vs InnoDB for us:
We were running InnoDB, and for us we found it MUCH slower, like 3-4 times slower. BUT, we were also much newer to MySQL, and frankly I'm not sure we had the db tuned appropriately for InnoDB.
We're running in an environment with a very high degree of uptime: battery backup, fail-over network connections, backup generators, fully redundant systems, etc. So the integrity concerns with MyISAM were weighed and deemed acceptable.
In regards to 5.1:
I've heard the stability concerns with 5.1. Generally I assume that any recently released (within the last 12 months) piece of software is not rock-solid stable. The updated feature set in 5.1 is just too much to pass up given the chance to re-engineer the project.
In regards to PostgreSQL gotchas:
COUNT(*) without any where clause is a pretty rare case for us. I don't anticipate this being an issue.
COPY FROM isn't nearly as flexible as LOAD DATA INFILE but an intermediate loading table will fix that.
My biggest concern is the lack of INSERT IGNORE. We've often used it when building some processing table so that we could avoid inserting records twice and then having to do a giant GROUP BY at the end just to remove the dups. I think it's used just infrequently enough for the lack of it to be tolerable.
My work tried a pilot project to migrate historical data from an ERP setup. The size of the data is on the small side, only 60GB, covering ~21 million rows, the largest table having 16 million rows. There's an additional ~15 million rows waiting to come into the pipe, but the pilot has been shelved due to other priorities. The plan was to use PostgreSQL's "Job" facility to schedule queries that would regenerate data on a daily basis suitable for use in analytics.
Running simple aggregates over the large 16-million record table, the first thing I noticed is how sensitive it is to the amount of RAM available. An increase in RAM at one point allowed for a year's worth of aggregates without resorting to sequential table scans.
If you decide to use PostgreSQL, I would highly recommend re-tuning the config file, as it tends to ship with the most conservative settings possible (so that it will run on systems with little RAM). Tuning takes a little bit, maybe a few hours, but once you get it to a point where response is acceptable, just set it and forget it.
Once you have the server-side tuning done (and it's all about memory, surprise!) you'll turn your attention to your indexes. Indexing and query planning also requires a little effort but once set you'll find it to be effective. Partial indexes are a nice feature for isolating those records that have "edge-case" data in them, I highly recommend this feature if you are looking for exceptions in a sea of similar data.
Lastly, use the tablespace feature to relocate the data onto a fast drive array.
In my practical experience I have to say that PostgreSQL had quite a performance jump from 7.x/8.0 to 8.1 (for our use cases, in some instances 2x-3x faster); from 8.1 to 8.2 the improvement was smaller but still noticeable. I don't know the improvements between 8.2 and 8.3, but I expect there is some performance improvement there too; I haven't tested it so far.
Regarding indexes, I would recommend dropping them and only creating them again after filling the database with your data; it is much faster.
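In outline, with invented index and file names:

DROP INDEX idx_big_created;
COPY big_table FROM '/tmp/daily_load.csv' WITH CSV;
CREATE INDEX idx_big_created ON big_table (created);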
Furthermore, tune the crap out of your PostgreSQL settings; there is so much to gain from it. The default settings are at least sensible now; in pre-8.2 times, pg was optimized for running on a PDA.
In some cases, especially if you have complicated queries it can help to deactivate nested loops in your settings, which forces pg to use better performing approaches on your queries.
Ah, yes, did I say that you should go for postgresql?
(An alternative would be firebird, which is not so flexible, but in my experience it is in some cases performing much better than mysql and postgresql)
In my experience InnoDB is slightly faster for really simple queries, pg for more complex queries. MyISAM is probably even faster than InnoDB for retrieval, but perhaps slower for indexing/index repair.
These mostly varchar fields, are you indexing them with char(n) indexes?
Can you normalize some of them? It'll cost you on the rewrite, but may save time on subsequent queries, as your row size will decrease, thus fitting more rows into memory at one time.
ON EDIT:
OK, so you have two problems, query time against the daily, and updating the history, yes?
As to the second: in my experience, MySQL MyISAM is bad at re-indexing. On tables the size of your daily set (0.5 to 1M records, with rather wide (denormalized flat input) records), I found it was faster to re-write the table than to insert and wait for the re-indexing and attendant disk thrashing.
So this might or might not help:
CREATE TABLE new_table SELECT * FROM old_table;
This copies the table but no indexes.
Then insert the new records as normal, create the indexes on the new table, and wait a while. Then drop the old table and rename the new table to the old table; the full sequence is sketched below.
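Put together, the whole dance is roughly this (the index name and column are invented):

CREATE TABLE new_table SELECT * FROM old_table;  -- data only, no indexes
-- bulk-insert the day's records into new_table here
CREATE INDEX idx_entry_date ON new_table (entry_date);
DROP TABLE old_table;
RENAME TABLE new_table TO old_table;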
Edit: In response to the fourth comment: I don't know that MyISAM is always that bad. I know in my particular case, I was shocked at how much faster copying the table and then adding the index was. As it happened, I was doing something similar to what you were doing, copying large denormalized flat files into the database, and then renormalizing the data. But that's an anecdote, not data. ;)
(I also think I found that overall InnoDb was faster, given that I was doing as much inserting as querying. A very special case of database use.)
Note that copying with a SELECT a.*, b.value AS foo ... JOIN was also faster than the corresponding UPDATE a.foo = b.value ... JOIN, which makes sense, as the update was to an indexed column.
What is not clear to me is how complex the analytical processing is. In my opinion, having 500K records to process should not be such a big problem; in terms of analytical processing, it is a small record set.
Even if it is a complex job, if you can leave it over night to complete (since it is a daily process, as I understood from your post), it should still be enough.
Regarding the resulting table, I would not reduce the indexes on the table. Again, you can do the loading overnight, including the index refresh, and have the resulting, updated data set ready for use in the morning, with quicker access than with raw (non-indexed) tables.
I saw PostgreSQL used in a data-warehouse-like environment, working with the setup I've described (data transformation jobs overnight), with no performance complaints.
I'd go for PostgreSQL. You need, for example, partitioned tables, which have been in stable Postgres releases since at least 2005; in MySQL they are a novelty. I've heard about stability issues in new features of 5.1. With MyISAM you have no referential integrity or transactions, and concurrent access suffers a lot; read the blog entry "Using MyISAM in production" for more.
And Postgres is much faster on complicated queries, which will be good for your #6.
There is also a very active and helpful mailing list, where you can get support for free, even from core Postgres developers. Postgres does have some gotchas, though.
The Infobright people appear to be doing some interesting things along these lines:
http://www.infobright.org/
If Oracle is not considered an option because of cost issues, then Oracle Express Edition is available for free (as in beer). It has size limitations, but if you do not keep history around for too long anyway, it should not be a concern.
Check your hardware. Are you maxing out the I/O? Do you have buffers configured properly? Is your hardware sized correctly? Memory for buffering and fast disks are key.
If you have too many indexes, it'll slow inserts down substantially.
How are you doing your inserts? If you're doing one record per INSERT statement:
INSERT INTO blah VALUES (?, ?, ?, ?)
and calling it 500K times, your performance will suck. I'm surprised it's finishing in hours. With MySQL you can insert hundreds or thousands of rows at a time:
INSERT INTO blah VALUES
(?, ?, ?, ?),
(?, ?, ?, ?),
(?, ?, ?, ?)
If you're doing one insert per web request, you should consider logging to the file system and doing bulk imports on a crontab. I've used that design in the past to speed up inserts. It also means your web pages don't depend on the database server.
It's also much faster to use LOAD DATA INFILE to import a CSV file. See http://dev.mysql.com/doc/refman/5.1/en/load-data.html
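A sketch (the path and delimiters are placeholders; blah is the table from above):

LOAD DATA INFILE '/var/tmp/batch.csv'
INTO TABLE blah
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';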
The other thing I can suggest is be wary of the SQL hammer -- you may not have SQL nails. Have you considered using a tool like Pig or Hive to generate optimized data sets for your reports?
EDIT
If you're having troubles batch importing 500K records, you need to compromise somewhere. I would drop some indexes on your master table, then create optimized views of the data for each report.
Have you tried playing with the key_buffer_size parameter (the MyISAM key buffer)? It is very important for index update speed.
Also, if you have indexes on date, id, etc., which are correlated columns, you can do:
INSERT INTO archive SELECT .. FROM current ORDER BY id (or date)
The idea is to insert the rows in order, in this case the index update is much faster. Of course this only works for the indexes that agree with the ORDER BY... If you have some rather random columns, then those won't be helped.
but strongly considering PostgreSQL.
You should definitely test it.
it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.
Yep.
I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?
Well that depends. As with any database,
IF YOU DON'T KNOW HOW TO CONFIGURE AND TUNE IT, IT WILL BE SLOW
If your hardware is not up to the task, it will be slow
Some people who know MySQL well and want to try Postgres don't factor in the fact that they need to re-learn some things and read the docs; as a result, a really badly configured Postgres gets benchmarked, and that can be pretty slow.
For web usage, I've benchmarked a well configured postgres on a low-end server (Core 2 Duo, SATA disk) with a custom benchmark forum that I wrote and it spit out more than 4000 forum web pages per second, saturating the database server's gigabit ethernet link. So if you know how to use it, it can be screaming fast (InnoDB was much slower due to concurrency issues). "MyISAM is faster for small simple selects" is total bull, postgres will zap a "small simple select" in 50-100 microseconds.
Now, for your usage, you don't care about that ;)
You care about the ways your database can compute Big Aggregates and Big Joins, and a properly configured postgres with a good IO system will usually win against a MySQL system on those, because the optimizer is much smarter, and has many more join/aggregate types to choose from.
My biggest concern is the lack of INSERT IGNORE. We've often used it when building some processing table so that we could avoid putting multiple records in twice and then having to do a giant GROUP BY at the end just to remove some dups. I think its used just infrequently enough for the lack of it to be tolerable.
You can use a GROUP BY, but if you want to insert into a table only records that are not already there, you can do this:
INSERT INTO target SELECT .. FROM source LEFT JOIN target ON (...) WHERE target.id IS NULL
In your use case you have no concurrency problems, so that works well.

How to predict MySQL tipping points?

I work on a big web application that uses a MySQL 5.0 database with InnoDB tables. Twice over the last couple of months, we have experienced the following scenario:
The database server runs fine for weeks, with low load and few slow queries.
A frequently-executed query that previously ran quickly will suddenly start running very slowly.
Database load spikes and the site hangs.
The solution in both cases was to find the slow query in the slow query log and create a new index on the table to speed it up. After applying the index, database performance returned to normal.
What's most frustrating is that, in both cases, we had no warning about the impending doom; all of our monitoring systems (e.g., graphs of system load, CPU usage, query execution rates, slow queries) told us that the database server was in good health.
Question #1: How can we predict these kinds of tipping points or avoid them altogether?
One thing we are not doing with any regularity is running OPTIMIZE TABLE or ANALYZE TABLE. We've had a hard time finding a good rule of thumb about how often (if ever) to manually do these things. (Since these commands LOCK tables, we don't want to run them indiscriminately.) Do these scenarios sound like the result of unoptimized tables?
Question #2: Should we be manually running OPTIMIZE or ANALYZE? If so, how often?
More details about the app: database usage pattern is approximately 95% reads, 5% writes; database executes around 300 queries/second; the table used in the slow queries was the same in both cases, and has hundreds of thousands of records.
The MySQL Performance Blog is a fantastic resource. Namely, this post covers the basics of properly tuning InnoDB-specific parameters.
I've also found that the PDF version of the MySQL Reference Manual to be essential. Chapter 7 covers general optimization, and section 7.5 covers server-specific optimizations you can toy with.
From the sound of your server, the query cache may be of IMMENSE value to you.
The reference manual also gives you some great detail concerning slow queries, caches, query optimization, and even disk seek analysis with indexes.
It may be worth your time to look into multi-master replication, allowing you to lock one server entirely and run OPTIMIZE/ANALYZE, without taking a performance hit (as 95% of your queries are reads, the other server could manage the writes just fine).
Section 12.5.2.5 covers OPTIMIZE TABLE in detail, and 12.5.2.1 covers ANALYZE TABLE in detail.
Update for your edits/emphasis:
Question #2 is easy to answer. From the reference manual:
OPTIMIZE:
OPTIMIZE TABLE should be used if you have deleted a large part of a table or if you have made many changes to a table with variable-length rows. [...] You can use OPTIMIZE TABLE to reclaim the unused space and to defragment the data table.
And ANALYZE:
ANALYZE TABLE analyzes and stores the key distribution for a table. [...] MySQL uses the stored key distribution to decide the order in which tables should be joined when you perform a join on something other than a constant. In addition, key distributions can be used when deciding which indexes to use for a specific table within a query.
OPTIMIZE is good to run when you have the free time. MySQL optimizes well around deleted rows, but if you go and delete 20GB of data from a table, it may be a good idea to run this. It is definitely not required for good performance in most cases.
ANALYZE is much more critical. As noted, having the needed table data available to MySQL (provided with ANALYZE) is very important when it comes to pretty much any query. It is something that should be run on a common basis.
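Both are single statements; mytable is a placeholder:

ANALYZE TABLE mytable;   -- refreshes key-distribution statistics
OPTIMIZE TABLE mytable;  -- reclaims space/defragments after big deletes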
Question #1 is a bit more of a trick. I would watch the server very carefully when this happens, namely disk I/O. My bet would be that your server is thrashing either your swap or the (InnoDB) caches. In either case, it may be query, tuning, or load related. Unoptimized tables could cause this. As mentioned, running ANALYZE can immensely help performance, and will likely help out too.
I haven't found any good way of predicting MySQL "tipping points" -- and I've run into a few.
Having said that, I've found tipping points are related to table size. But not merely raw table size; rather, how big the "area of interest" is to a query. For example, in a table of over 3 million rows and about 40 columns (about three-quarters integers), most queries that select a portion of them based on indexes are fast. However, when one value in a query on one indexed column means two-thirds of the rows are now "interesting", the query is about 5 times slower than normal. Lesson: try to arrange your data so such a scan isn't necessary.
However, such behaviour does give you a size to look for. That size will be heavily dependent on your server setup, the MySQL server variables, and the table's schema and data.
Similarly, I've seen reporting queries run in reasonable time (~45 seconds) if the period is two weeks, but take half-an-hour if the period is extended to four weeks.
Use the slow query log; it will help you narrow down the queries you want to optimize.
For time-critical queries, it is sometimes better to keep a stable plan by using hints.
It sounds like you have a frustrating situation and maybe not the best code review process and development environment.
Whenever you add a new query to your code you need to check that it has the appropriate indexes ready and add those with the code release.
If you don't do that, your second option is to constantly monitor the slow query log and then go beat the developers; I mean, go add the index.
There's an option to enable logging of queries that don't use an index, which would be useful to you.
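Using the MySQL 5.0-era option names, that looks something like this in my.cnf (the log path is a placeholder):

[mysqld]
log-slow-queries = /var/log/mysql/slow.log
log-queries-not-using-indexes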
If there are queries that "work and then stop working" (but are "using an index"), then it's likely the query wasn't very good in the first place (low cardinality in the index; inefficient join; ...), and the first rule of evaluating the query carefully when it's added would apply.
For question #2: on InnoDB, "analyze table" is basically free to run, so if you have bad join performance it doesn't hurt to run it. Unless the balance of the keys in the table is changing a lot, though, it's unlikely to help. It almost always comes down to bad queries. "optimize table" rebuilds the InnoDB table; in my experience it's relatively rare that it helps enough to be worth the hassle of having the table unavailable for the duration (or doing the master-master failover stuff while it's running).