mysql stored routine vs. mysql-alternative? - mysql

We are using a mysql database w/ about 150,000 records (names) total. Our searches on the 'names' field are done through an autocomplete function in php. We have the table indexed but still feel that the searching is a bit sluggish (a few full seconds vs. something like Google Finance w/ near-instant response). We came up w/ 2 possibilities, but wanted to get more insight:
Can we create a bunch (many thousands or more) of stored procedures to speed up searches, or will creating that many stored procedures bog-down the db?
Is there a faster alternative to mysql for "select" statements (speed on inserting & updating rows isn't too important so we can sacrifice that, if necessary). I've vaguely heard of BigTable & others that don't support JOIN statements....we need JOIN statements for some of our other queries we do.
thx

Forget about stored procedures. They won't do you any good.
MySQL is a good choice; it's often considered the fastest RDBMS. And there is no need to look for a 'faster alternative to select statements'.
The abnormal query execution time you mentioned is a result of server misconfiguration, a wrong database schema, or both. Please read this response on serverfault or update your question here: provide the server configuration, the relevant part of the database schema, and the problem query along with its explain select ... output.

You need to cache the information in memory to avoid making repeated calls to the database.

Yes, you need to expire the cache if you change the data, but as you said, that's not common, so you can even do that on a semi-automated basis and not worry about it if necessary. You should check out this MySQL.com article, as well as perhaps explore the MEMORY storage engine (sorry, new and can't post more than one hyperlink per post?!) which takes a little bit of coding around to use but can be extremely efficient.
What's the actual query time (vs page time)? On a reasonably modern server that's not loaded to hell, MySQL should be able to do an autocomplete query on 150k rows much, much, faster than two seconds. Missing some indexes?
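To make the two suggestions above concrete (an indexed prefix lookup plus an in-memory cache in front of it), here is a minimal sketch. It uses Python with an in-memory SQLite database purely so it runs anywhere; the table people, the index idx_people_name and the function autocomplete are made-up names, not anything from the question. In MySQL the same lookup would normally be written as name LIKE 'al%', which can use a B-tree index on name because the pattern does not start with a wildcard.

```python
import sqlite3
from functools import lru_cache

# Hypothetical table and index names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_people_name ON people (name)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("alice",), ("albert",), ("alfred",), ("bob",), ("carol",)],
)

@lru_cache(maxsize=10_000)
def autocomplete(prefix, limit=10):
    # Prefix search expressed as a range on the indexed column:
    # every name >= prefix and < prefix followed by a very high code point.
    # (Names containing characters above U+FFFF right after the prefix
    # would be missed; good enough for a sketch.)
    upper = prefix + "\uffff"
    rows = conn.execute(
        "SELECT name FROM people WHERE name >= ? AND name < ? "
        "ORDER BY name LIMIT ?",
        (prefix, upper, limit),
    ).fetchall()
    # Return a tuple so the cached value cannot be mutated by callers.
    return tuple(row[0] for row in rows)

suggestions = autocomplete("al")
```

Repeated calls with the same prefix are answered from the cache without touching the database. After the names table changes, autocomplete.cache_clear() throws the cached results away, which is the "expire the cache" step mentioned above.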

Related

How to improve "select min(my_col)" query in MySQL without adding an index

The query below takes about a minute to run on my MySQL instance (running on a fairly beefy machine with 64G memory, 2T disc, 2.30Ghz CPU with 8 cores and 16 logical, and the query is running on localhost). This same query runs in less than a second on a SQL Server database I have access to. Unfortunately, I do not have access to the SQL Server host or the DBA, etc.
select min(visit_start_date)
from visit_occurrence;
The table has been set to ENGINE=MyISAM and default-storage-engine=INNODB and innodb_buffer_pool_size=16G are set in my.ini.
Is there some configuration I could be missing that would cause this query to run so slowly on MySQL? How can I fix it?
I have a large number of tables and queries I will need to support so I would really like to be able to fix this issue globally rather than having to create indexes everywhere I have slow queries.
The SQL Server database does not seem to have an index on the column being queried as shown below.
EDIT:
Untagged MS Sql Server, I had tagged it hoping for the help of our MS Sql Server colleagues with information that Sql Server had some way of structuring data and/or queries that would make this type of query run faster on that platform v other such as MySql
Removed image of code to more closely conform with community standards
You never know if there is a magic go-faster button if you don't ask (ENGINE=MyISAM is sometimes kind of like a magic go-faster button for some queries in MySql). I'm kind of fishing for a potential hardware or clustering solution here. Is Apache Ignite a potential solution here?
Thanks again to the community for all of your support and help. I hope this fixes most of the issues that have been raised for this post.
SECOND EDIT:
Is the partitioning/sharding described in the links below a potential solution here?
https://user3141592.medium.com/how-to-scale-mysql-42ebd2841fa6
https://dev.mysql.com/doc/refman/8.0/en/partitioning-overview.html
THIRD EDIT: A note on community standards.
Part of our community standards is explicitly to be welcoming, inclusive, and to be nice.
https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/?fbclid=IwAR1gr6r2qmXs506SAV3H_h6H8LoFy3mlXucfa-fqiiEXMHUR3aF_tdoZGsw
https://meta.stackexchange.com/questions/240839/the-new-new-be-nice-policy-code-of-conduct-updated-with-your-feedback).
The MS Sql Server tag was used here as one of the systems I'm comparing is MS Sql Server. We're really working with very limited information here. I have two systems: My MySql system, which is knowable as I'm running it, and the MS Sql Server running the same database in someone else's system that I have very little information about (all I have is a read only sql prompt). I am comparing apples and oranges: The same query runs well on the orange (MS Sql Server) and does not run well on the apple (My MySql instance). I'd like to know why so I can make an informed decision about how to get my queries to run in a reasonable amount of time. How do I get my apple to look like an orange? Do I switch to MS Sql Server? Do I need to deploy on different hardware? Is the other system running some kind of in memory caching system on top of their database instance? Most of these possibilities would require a non trivial amount of time to explore and validate. So yes, I would like help from MS Sql Server experts that might know if there are caching options, transactional v warehouse options, etc. that could be set that would make a world of difference, that would be magic go-fast buttons.
The magic go-fast button comment was perhaps a little bit condescending.
The picture showing the indexes was included because I was trying to make the point that the other system does not seem to have an index on the column being queried. In this case a picture was worth a thousand words.
If the table says ENGINE=MyISAM, then that is what counts. In almost all cases, this is a bad choice. innodb_buffer_pool_size=16G is not relevant except that it robs memory from MyISAM.
default-storage-engine=INNODB is relevant only when creating a table without explicitly specifying the ENGINE=.
Are some of your tables MyISAM and some are InnoDB? How much RAM do you have?
Most performance solutions necessarily involve an INDEX. Please explain why you can't afford an index. It could turn that query into less than 10ms, regardless of the number of rows in the table.
Sorry, but I don't accept "rather than having to create indexes everywhere I have slow queries".
Changing tables from MyISAM to InnoDB will, in some cases help with performance. Suggest you change the engine as you add the indexes.
Show us some more queries, we can help you decide what indexes are needed. select min(visit_start_date) from visit_occurrence; needs INDEX(date); other queries may not be so trivial. Do not fall into the trap of "indexing every column".
More
In MySQL...
A single connection only uses one core, so more cores only helps when you have more connections. (Some tiny exceptions exist in MySQL 8.0.)
Partitioning rarely helps with performance; do not use that without getting advice. (PS: BY RANGE is perhaps the only useful variant.)
Replication is for read-scaling (and backup and ...)
Sharding is for write-scaling. It requires a bunch of extra architectural things -- such as routing queries to the appropriate servers. (MariaDB has Spider and FederatedX as possible tools.) In any case, sharding is a non-trivial undertaking.
Clustering is for HA (High Availability, auto-failover, etc), while helping some with read and write scaling. Cf: Galera, InnoDB Cluster.
Hardware is rarely more than a temporary solution to performance issues.
Caching leads to potentially inconsistent results, so beware. Also, consider my mantra "don't bother putting a cache in front of a cache".
(I can advise further on any of these topics.)
Whether in MyISAM or InnoDB, or even SQL Server, your query
select min(visit_start_date) from visit_occurrence;
can be satisfied almost instantaneously by this index, because it uses a so-called loose index scan.
CREATE INDEX visit_start_date ON visit_occurrence (visit_start_date);
A query with an aggregate function like MIN() is always a GROUP BY query. But if the GROUP BY clause isn't present in the SQL statement, the server groups by the entire table.
You mentioned a query that can be satisfied immediately when using MyISAM. That's SELECT COUNT(*) FROM whatever_table. Behind the scenes MyISAM keeps table metadata showing the total number of rows in the table, so that query comes back right away. The transactional storage engine InnoDB doesn't do that. It supports so much concurrency that its designers didn't include the total row count in their metadata, because it would be wrong in so many circumstances that it wasn't worth the risk.
Index design isn't a black art. But it is an art informed by the kind of measurements we get from EXPLAIN (or ANALYZE or EXPLAIN ANALYZE). A basic truth of database-driven apps (in any make of database server) is that indexing needs to be revisited as the app grows. The good news: changing, adding, or dropping indexes doesn't change your data.
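As a rough illustration of measuring before and after adding the index, here is a small sketch. It runs against an in-memory SQLite database from Python rather than MySQL, so the plan text is SQLite's EXPLAIN QUERY PLAN output and not MySQL's EXPLAIN; the visit_id column, the sample dates and the index name idx_visit_start_date are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# visit_id and the sample rows are invented for the example.
conn.execute(
    "CREATE TABLE visit_occurrence ("
    "visit_id INTEGER PRIMARY KEY, visit_start_date TEXT)"
)
conn.executemany(
    "INSERT INTO visit_occurrence (visit_start_date) VALUES (?)",
    [("2021-03-04",), ("2019-11-30",), ("2020-07-15",)],
)

query = "SELECT MIN(visit_start_date) FROM visit_occurrence"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)   # no index yet: the whole table is scanned
conn.execute(
    "CREATE INDEX idx_visit_start_date ON visit_occurrence (visit_start_date)"
)
plan_after = plan(query)    # the plan should now mention the new index

earliest = conn.execute(query).fetchone()[0]
print(plan_before)
print(plan_after)
print(earliest)
```

The point is only the workflow: look at the plan, add the one index the query needs, look at the plan again. The data itself is untouched by adding or dropping the index.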

MySQL vs SQL Server 2008 R2 simple select query performance

Can anyone explain to me why there is a dramatic difference in performance between MySQL and SQL Server for this simple select statement?
SELECT email from Users WHERE id=1
Currently the database has just one table with 3 users. MySQL time is on average 0.0003 while SQL Server is 0.05. Is this normal, or is the MSSQL server not configured properly?
EDIT:
Both tables have the same structure, primary key is set to id, MySQL engine type is InnoDB.
I tried the query with WITH(NOLOCK) but the result is the same.
Are the servers of the same level of power? Hardware makes a difference, too. And are there roughly the same number of people accessing the db at the same time? Are any other applications using the same hardware (databases in general should not share servers with other applications).
Personally I wouldn't worry about this type of difference. If you want to see which is performing better, then add millions of records to the database and then test queries. Databases in general all perform well with simple queries on tiny tables, even badly designed or incorrectly set up ones. To know if you will have a performance problem you need to test with large amounts of data and many simultaneous users on hardware similar to the one you will have in prod.
The issue with diagnosing low cost queries is that the fixed cost may swamp the variable costs. Not that I'm a MS-Fanboy, but I'm more familiar with MS-SQL, so I'll address that, primarily.
MS-SQL probably has more overhead for optimization and query parsing, which adds a fixed cost to the query when deciding whether to use the index, looking at statistics, etc. MS-SQL also logs a lot of stuff about the query plan when it executes, and stores a lot of data for future optimization, which adds overhead.
This would all be helpful when the query takes a long time, but when benchmarking a single tiny query it just shows up as a slower result.
There are several factors that might affect that benchmark but the most significant is probably the way MySQL caches queries.
When you run a query, MySQL will cache the text of the query and the result. When the same query is issued again it will simply return the result from cache and not actually run the query.
Another important factor is that the SQL Server metric is the total elapsed time, not just the time it takes to seek to that record or pull it from cache. In SQL Server, running SET STATISTICS TIME ON will break it down a little bit more, but you're still not really comparing like for like.
Finally, I'm not sure what the goal of this benchmarking is since that is an overly simplistic query. Are you comparing the platforms for a new project? What are your criteria for selection?
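If you do want numbers for a query this small, one way to keep fixed costs and caching from distorting a single measurement is to run it many times and look at the distribution rather than one timing. A minimal sketch, assuming an in-memory SQLite table standing in for Users (the e-mail addresses are invented, and time_query is a hypothetical helper, not part of any driver):

```python
import sqlite3
import statistics
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO Users (id, email) VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (3, "c@example.com")],
)

def time_query(sql, params, runs=1000):
    # Time each execution separately so warm-up effects stay visible.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        samples.append(time.perf_counter() - start)
    return samples

samples = time_query("SELECT email FROM Users WHERE id = ?", (1,))
first_run = samples[0]                    # includes parsing / cold-cache cost
median_run = statistics.median(samples)   # closer to the steady-state cost
print(f"first: {first_run:.6f}s  median: {median_run:.6f}s")
```

The same loop pointed at each server through its own client library gives a fairer comparison than two single timings taken from two different tools.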

Run analytics on huge MySQL database

I have a MySQL database with a few (five to be precise) huge tables. It is essentially a star topology based data warehouse. The table sizes range from 700GB (fact table) to 1GB and the whole database goes up to 1 terabyte. Now I have been given a task of running analytics on these tables which might even include joins.
A simple analytical query on this database can be "find number of smokers per state and display it in descending order" this requirement could be converted in a simple query like
select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state....
This query (and many other of same nature) takes a lot of time to execute on this database, time taken is in order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked in Cassandra which seemed easy to implement but I am not sure if it is going to be as easy for running analytical queries on the database especially when I have to use "where clause and group by construct"
I have also looked into Hadoop but I am not sure how I can implement RDBMS type queries. I am not too sure if I want to right away invest in getting at least three machines for name-node, zookeeper and data-nodes!! Above all our company prefers windows based solutions.
I have also thought of pre-computing all the data in a simpler summary tables but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the mysql environment setup
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option, but the requirements for this kind of ad-hoc aggregated value keep changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too).
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ & based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the Infobright DB, then point your analytical queries to the Infobright server, keep the rest of the system as it is, job done) - or at Vertica, if it ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - it's pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of your system it may take more care & feeding than other options.
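The "move the strings out of the fact table" step from the list above could look roughly like this. It is only a sketch on an in-memory SQLite database driven from Python; the table names smoking_status and fact, the column names and the sample rows are all assumptions, since the real schema isn't shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Small lookup (dimension) table holding each distinct string once.
CREATE TABLE smoking_status (
    status_id INTEGER PRIMARY KEY,
    label     TEXT NOT NULL UNIQUE
);
-- The fact table stores only the integer key.
CREATE TABLE fact (
    fact_id   INTEGER PRIMARY KEY,
    state     TEXT NOT NULL,
    status_id INTEGER NOT NULL REFERENCES smoking_status (status_id)
);
-- An integer-led index is much narrower than one on the status string.
CREATE INDEX idx_fact_status_state ON fact (status_id, state);

INSERT INTO smoking_status (status_id, label) VALUES
    (1, 'current smoker'), (2, 'recently quit'),
    (3, 'quit 1+ years'),  (4, 'never smoked');
INSERT INTO fact (state, status_id) VALUES
    ('CA', 1), ('CA', 1), ('CA', 4), ('NY', 1), ('NY', 2), ('TX', 3);
""")

# Smokers per state, descending, resolved through the lookup table.
smokers_per_state = conn.execute("""
    SELECT f.state, COUNT(*) AS smokers
    FROM fact AS f
    JOIN smoking_status AS s ON s.status_id = f.status_id
    WHERE s.label = 'current smoker'
    GROUP BY f.state
    ORDER BY smokers DESC, f.state
""").fetchall()
print(smokers_per_state)
```

Whether this helps on the real 700GB table depends on the query plans, so it is worth checking with EXPLAIN on a copy before restructuring anything.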
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many reads do you do per day and how many writes? Can you live with some lag, e.g. write to a new table each day and merge it into the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate number of new smokers every day?
Can you use hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically use hadoop just for daily or batch processing and store the results into the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on String columns? Can you make it ints etc.?
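The partial aggregation idea in the questions above can be sketched as a small daily summary table that reports read instead of the big fact table. This is an illustration only, using Python with in-memory SQLite; the events and daily_smokers tables, their columns and the summarize_day helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_date TEXT, state TEXT, smoking_status TEXT);
-- One row per (day, state): tiny compared with the raw events.
CREATE TABLE daily_smokers (
    event_date TEXT,
    state      TEXT,
    smokers    INTEGER NOT NULL,
    PRIMARY KEY (event_date, state)
);
INSERT INTO events VALUES
    ('2024-01-01', 'CA', 'current smoker'),
    ('2024-01-01', 'CA', 'never smoked'),
    ('2024-01-01', 'NY', 'current smoker'),
    ('2024-01-02', 'CA', 'current smoker'),
    ('2024-01-02', 'NY', 'current smoker'),
    ('2024-01-02', 'NY', 'current smoker');
""")

def summarize_day(day):
    # Recompute one day's slice; safe to re-run for the same day.
    with conn:
        conn.execute("DELETE FROM daily_smokers WHERE event_date = ?", (day,))
        conn.execute(
            """
            INSERT INTO daily_smokers (event_date, state, smokers)
            SELECT event_date, state, COUNT(*)
            FROM events
            WHERE event_date = ? AND smoking_status = 'current smoker'
            GROUP BY event_date, state
            """,
            (day,),
        )

for day in ("2024-01-01", "2024-01-02"):
    summarize_day(day)

# Reports now aggregate the small summary table only.
totals = conn.execute(
    "SELECT state, SUM(smokers) FROM daily_smokers GROUP BY state ORDER BY state"
).fetchall()
print(totals)
```

The obvious trade-off is the one raised in the question: a summary table only answers the questions it was built for, so it suits the recurring reports better than truly ad-hoc queries.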
Hope that helps
MySQL has a serious limitation that prevents it from performing well in such scenarios. The problem is the lack of parallel query capability: it cannot utilize multiple CPUs in a single query.
Hadoop has an RDBMS-like addition called Hive. It is an application capable of translating your queries, written in Hive QL (a SQL-like language), into MapReduce jobs. Since it is actually a small addition on top of Hadoop, it inherits Hadoop's linear scalability.
I would suggest deploying Hive alongside MySQL, replicating daily data there, and running heavy aggregations against it. It will offload a serious part of the load from MySQL. You still need MySQL for the short interactive queries, usually backed by indexes. You need them since Hive is inherently non-interactive: each query will take at least a few dozen seconds.
Cassandra is built for key-value access and does not have a scalable GROUP BY capability built in. There is DataStax's Brisk, which integrates Cassandra with Hive/MapReduce, but it might not be trivial to map your schema into Cassandra, and you still would not get the flexibility and indexing capabilities of an RDBMS.
As a bottom line: Hive alongside MySQL should be a good solution.

de-normalization, weighted aggregates for updated tables in MySQL

this time I got a more general question. Should I use multiple views rather than stored procedures for weighted aggregation of data, if the original data is updated periodically?
Basically I have a local MySQL database that is updated periodically by importing the same kind of data (tables) from a bigger transaction database.
The local database is used for statistical analysis. Thus I de-normalize (basically aggregate) the data locally for use with statistical software packages. So far I used stored procedures because I felt it was easier to handle (and arranged more clearly) when weighting schemes (basically other tables containing weights that are multiplied with variables) came into play.
The disadvantage of stored procedures, though, is that I have to run all of 'em again when the tables are populated with new data. Obviously I am not a DBA... So don't shy away from stating the obvious :) What's the best approach to handle this kind of scenario? SP or views? Or something completely different?
thx for any suggestions in advance!
It depends (that's the generic answer to any "general" questions, isn't it? :) ). You need to evaluate the tradeoffs to see what the best solution is for your needs.
Views are basically just query re-writing (in MySQL), so using a view will be performing the aggregation/denormalization every time the query is run. That may make your queries slower than you would like. Also, if your procedures are really complicated, maybe it's not practical to try to put that logic into a view.
Stored procedures do the work once, so queries will be faster. But then your updates won't show up automatically. So I think the answer depends on how often the data changes, how often queries are run, and how important the performance of the queries is.
As for alternative suggestions, you could also run your stored procedures using events, if your data updates are regular, and you are just trying to save yourself from the manual task of running the procedures.
Another option is to have denormalization/aggregation tables that are updated with triggers. As you update your data in the source table, the triggers will automatically keep the aggregate tables current.
Here is a link to documentation for stored procedures, views, triggers, and events.
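For the trigger option, a minimal sketch of a weighted aggregate that stays current as rows are imported might look like the following. It is written for an in-memory SQLite database driven from Python so that it runs as-is; the MySQL trigger syntax differs (FOR EACH ROW, and typically INSERT ... ON DUPLICATE KEY UPDATE instead of the two statements used here). The tables measurements and weighted_totals and their columns are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurements (
    id     INTEGER PRIMARY KEY,
    grp    TEXT NOT NULL,
    value  REAL NOT NULL,
    weight REAL NOT NULL
);
-- Running sums per group; the weighted mean is weighted_sum / weight_sum.
CREATE TABLE weighted_totals (
    grp          TEXT PRIMARY KEY,
    weighted_sum REAL NOT NULL,
    weight_sum   REAL NOT NULL
);
CREATE TRIGGER measurements_after_insert
AFTER INSERT ON measurements
BEGIN
    -- Make sure the group has a row, then add the new contribution.
    INSERT OR IGNORE INTO weighted_totals (grp, weighted_sum, weight_sum)
    VALUES (NEW.grp, 0, 0);
    UPDATE weighted_totals
    SET weighted_sum = weighted_sum + NEW.value * NEW.weight,
        weight_sum   = weight_sum + NEW.weight
    WHERE grp = NEW.grp;
END;
""")

# Importing new data is all it takes to refresh the aggregate.
conn.executemany(
    "INSERT INTO measurements (grp, value, weight) VALUES (?, ?, ?)",
    [("a", 10.0, 1.0), ("a", 20.0, 3.0), ("b", 5.0, 2.0)],
)
conn.commit()

weighted_means = conn.execute(
    "SELECT grp, weighted_sum / weight_sum FROM weighted_totals ORDER BY grp"
).fetchall()
print(weighted_means)
```

This sketch only covers inserts. If imported rows can also be updated or deleted, matching AFTER UPDATE and AFTER DELETE triggers would be needed, and at that point re-running a procedure from a scheduled event may be simpler.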

database for analytics

I'm setting up a large database that will generate statistical reports from incoming data.
The system will for the most part operate as follows:
1) Approximately 400k-500k rows - about 30 columns, mostly varchar(5-30) and datetime - will be uploaded each morning. It's approximately 60MB while in flat file form, but grows steeply in the DB with the addition of suitable indexes.
2) Various statistics will be generated from the current day's data.
3) Reports from these statistics will be generated and stored.
4) Current data set will get copied into a partitioned history table.
5) Throughout the day, the current data set (which was copied, not moved) can be queried by end users for information that is not likely to include constants, but relationships between fields.
6) Users may request specialized searches from the history table, but the queries will be crafted by a DBA.
7) Before the next day's upload, the current data table is truncated.
This will essentially be version 2 of our existing system.
Right now, we're using MySQL 5.0 MyISAM tables (InnoDB was killing us on space usage alone) and suffering greatly on #6 and #4. #4 is currently not a partitioned table as 5.0 doesn't support it. In order to get around the tremendous amount of time (hours and hours) it's taking to insert records into history, we're writing each day to an unindexed history_queue table, and then on the weekends during our slowest time, writing the queue to the history table. The problem is that any historical queries generated in the week are then possibly several days behind. We can't reduce the indexes on the historical table or its queries become unusable.
We're definitely moving to at least MySQL 5.1 (if we stay with MySQL) for the next release but strongly considering PostgreSQL. I know that debate has been done to death, but I was wondering if anybody had any advice relevant to this situation. Most of the research is revolving around web site usage. Indexing is really our main beef with MySQL and it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.
I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?
Commercial databases (Oracle and MS SQL) are simply not an option - although I wish Oracle was.
NOTE on MyISAM vs Innodb for us:
We were running Innodb and for us, we found it MUCH slower, like 3-4 times slower. BUT, we were also much newer to MySQL and frankly I'm not sure we had db tuned appropriately for Innodb.
We're running in an environment with a very high degree of uptime - battery backup, fail-over network connections, backup generators, fully redundant systems, etc. So the integrity concerns with MyISAM were weighed and deemed acceptable.
In regards to 5.1:
I've heard the concerns about stability issues with 5.1. Generally I assume that any recently released (within the last 12 months) piece of software is not rock-solid stable. The updated feature set in 5.1 is just too much to pass up given the chance to re-engineer the project.
In regards to PostgreSQL gotchas:
COUNT(*) without any where clause is a pretty rare case for us. I don't anticipate this being an issue.
COPY FROM isn't nearly as flexible as LOAD DATA INFILE but an intermediate loading table will fix that.
My biggest concern is the lack of INSERT IGNORE. We've often used it when building some processing table so that we could avoid putting multiple records in twice and then having to do a giant GROUP BY at the end just to remove some dups. I think it's used just infrequently enough for the lack of it to be tolerable.
My work tried a pilot project to migrate historical data from an ERP setup. The size of the data is on the small side, only 60Gbyte, covering over ~ 21 million rows, the largest table having 16 million rows. There's an additional ~15 million rows waiting to come into the pipe but the pilot has been shelved due to other priorities. The plan was to use PostgreSQL's "Job" facility to schedule queries that would regenerate data on a daily basis suitable for use in analytics.
Running simple aggregates over the large 16-million record table, the first thing I noticed is how sensitive it is to the amount of RAM available. An increase in RAM at one point allowed for a year's worth of aggregates without resorting to sequential table scans.
If you decide to use PostgreSQL, I would highly recommend re-tuning the config file, as it tends to ship with the most conservative settings possible (so that it will run on systems with little RAM). Tuning takes a little bit, maybe a few hours, but once you get it to a point where response is acceptable, just set it and forget it.
Once you have the server-side tuning done (and it's all about memory, surprise!) you'll turn your attention to your indexes. Indexing and query planning also requires a little effort but once set you'll find it to be effective. Partial indexes are a nice feature for isolating those records that have "edge-case" data in them, I highly recommend this feature if you are looking for exceptions in a sea of similar data.
Lastly, use the table space feature to relocate the data onto a fast drive array.
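To show what a partial index looks like, here is a small sketch. SQLite happens to accept the same CREATE INDEX ... WHERE form that PostgreSQL uses, so the example runs on an in-memory SQLite database from Python; the orders table, its columns and the 'failed' status are invented stand-ins for "edge-case" records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    id     INTEGER PRIMARY KEY,
    status TEXT NOT NULL,
    amount REAL NOT NULL
);
-- Only the rare 'failed' rows get index entries, so the index stays small
-- and cheap to maintain while the bulk of the rows is loaded.
CREATE INDEX idx_orders_failed ON orders (amount) WHERE status = 'failed';
""")

conn.executemany(
    "INSERT INTO orders (status, amount) VALUES (?, ?)",
    [("failed", 30.0), ("failed", 10.0)]
    + [("ok", float(i)) for i in range(1000)],
)
conn.commit()

# The query repeats the index condition literally so the planner can
# recognise that the partial index covers it.
failed = conn.execute(
    "SELECT id, amount FROM orders WHERE status = 'failed' ORDER BY amount"
).fetchall()
print(failed)
```

Whether the planner actually picks the partial index depends on the database and its statistics, so it is worth confirming with EXPLAIN on real data.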
In my practical experience I have to say that postgresql had quite a performance jump from 7.x/8.0 to 8.1 (for our use cases in some instances 2x-3x faster); from 8.1 to 8.2 the improvement was smaller but still noticeable. I don't know the improvements between 8.2 and 8.3, but I expect there is some performance improvement too; I haven't tested it so far.
Regarding indices, I would recommend to drop those, and only create them again after filling the database with your data, it is much faster.
Further, tune the crap out of your postgresql settings; there is so much to gain from it. The default settings are at least sensible now; in pre-8.2 times pg was optimized for running on a pda.
In some cases, especially if you have complicated queries it can help to deactivate nested loops in your settings, which forces pg to use better performing approaches on your queries.
Ah, yes, did I say that you should go for postgresql?
(An alternative would be firebird, which is not so flexible, but in my experience it is in some cases performing much better than mysql and postgresql)
In my experience InnoDB is slightly faster for really simple queries, pg for more complex queries. MyISAM is probably even faster than InnoDB for retrieval, but perhaps slower for indexing/index repair.
These mostly varchar fields, are you indexing them with char(n) indexes?
Can you normalize some of them? It'll cost you on the rewrite, but may save time on subsequent queries, as your row size will decrease, thus fitting more rows into memory at one time.
ON EDIT:
OK, so you have two problems, query time against the daily, and updating the history, yes?
As to the second: in my experience, MySQL MyISAM is bad at re-indexing. On tables the size of your daily (0.5 to 1M records, with rather wide (denormalized flat input) records), I found it was faster to re-write the table than to insert and wait for the re-indexing and attendant disk thrashing.
So this might or might not help:
create table new_table select * from old_table;
copies the table but no indices.
Then insert the new records as normal. Then create the indexes on the new table, wait a while. Drop the old table, and rename the new table to the old table's name.
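The whole copy, load, index, swap sequence described above can be sketched end to end as follows. This runs on an in-memory SQLite database from Python, so the copy statement uses SQLite's CREATE TABLE ... AS SELECT spelling and the swap uses ALTER TABLE ... RENAME TO (MySQL would use RENAME TABLE); the history table, its columns and the index names are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE history (id INTEGER, name TEXT);
CREATE INDEX idx_history_name ON history (name);
INSERT INTO history VALUES (1, 'a'), (2, 'b');
""")
new_rows = [(3, "c"), (4, "d")]

# 1. Copy the rows into a fresh table; indexes are not copied.
conn.execute("CREATE TABLE history_new AS SELECT * FROM history")
# 2. Load the new records while there is no index to maintain.
conn.executemany("INSERT INTO history_new VALUES (?, ?)", new_rows)
# 3. Build the index once, over the complete data set.
conn.execute("CREATE INDEX idx_history_new_name ON history_new (name)")
# 4. Swap: drop the old table (and its index) and take over its name.
conn.execute("DROP TABLE history")
conn.execute("ALTER TABLE history_new RENAME TO history")
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'history'"
    )
]
print(row_count, index_names)
```

Note that readers see no table at all between the drop and the rename unless the swap is done atomically, which in MySQL means a single RENAME TABLE statement that moves both tables.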
Edit: In response to the fourth comment: I don't know that MyIsam is always that bad. I know in my particular case, I was shocked at how much faster copying the table and then adding the index was. As it happened, I was doing something similar to what you were doing, copying large denormalized flat files into the database, and then renormalizing the data. But that's an anecdote, not data. ;)
(I also think I found that overall InnoDb was faster, given that I was doing as much inserting as querying. A very special case of database use.)
Note that copying with a select a.*, b.value as foo join ... was also faster than an update a.foo = b.value ... join, which follows, as the update was to an indexed column.
What is not clear to me is how complex the analytical processing is. In my opinion, having 500K records to process should not be such a big problem; in terms of analytical processing, it is a small recordset.
Even if it is a complex job, if you can leave it over night to complete (since it is a daily process, as I understood from your post), it should still be enough.
Regarding the resulting table, I would not reduce the indexes of the table. Again, you can do the loading overnight, including the index refresh, and have the resulting, updated data set ready for use in the morning, with quicker access than in the case of raw (non-indexed) tables.
I saw PostgreSQL used in a data-warehouse-like environment, working on the setup I've described (data transformation jobs overnight) and with no performance complaints.
I'd go for PostgreSQL. You need, for example, partitioned tables, which have been in stable Postgres releases since at least 2005 - in MySQL they are a novelty. I've heard about stability issues in new features of 5.1. With MyISAM you have no referential integrity or transactions, and concurrent access suffers a lot - read the blog entry "Using MyISAM in production" for more.
And Postgres is much faster on complicated queries, which will be good for your #6.
There is also a very active and helpful mailing list, where you can get support even from core Postgres developers for free. It has some gotchas though.
The Infobright people appear to be doing some interesting things along these lines:
http://www.infobright.org/
-- psj
If Oracle is not considered an option because of cost issues, then Oracle Express Edition is available for free (as in beer). It has size limitations, but if you do not keep history around for too long anyway, it should not be a concern.
Check your hardware. Are you maxing the IO? Do you have buffers configured properly? Is your hardware sized correctly? Memory for buffering and fast disks are key.
If you have too many indexes, it'll slow inserts down substantially.
How are you doing your inserts? If you're doing one record per INSERT statement:
INSERT INTO blah VALUES (?, ?, ?, ?)
and calling it 500K times, your performance will suck. I'm surprised it's finishing in hours. With MySQL you can insert hundreds or thousands of rows at a time:
INSERT INTO blah VALUES
(?, ?, ?, ?),
(?, ?, ?, ?),
(?, ?, ?, ?)
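Building those multi-row statements from application code could look something like this. It is a sketch only: it targets an in-memory SQLite database from Python so it can be run directly, the four columns of blah are invented, and insert_batched is a hypothetical helper. The batch size is kept small here because SQLite limits the number of bound parameters per statement; MySQL's practical limit is max_allowed_packet instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names are made up; the original only shows four placeholders.
conn.execute("CREATE TABLE blah (a INTEGER, b INTEGER, c TEXT, d REAL)")

def insert_batched(conn, rows, batch_size=100):
    """Insert rows using one multi-row INSERT per batch; returns statement count."""
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One "(?, ?, ?, ?)" group per row in this batch.
        placeholders = ", ".join(["(?, ?, ?, ?)"] * len(batch))
        # Flatten the rows into a single parameter list.
        params = [value for row in batch for value in row]
        conn.execute("INSERT INTO blah VALUES " + placeholders, params)
        statements += 1
    conn.commit()
    return statements

rows = [(i, i * 2, "n%d" % i, 0.5) for i in range(250)]
statements_used = insert_batched(conn, rows)
inserted = conn.execute("SELECT COUNT(*) FROM blah").fetchone()[0]
print(statements_used, inserted)
```

With 250 rows and a batch size of 100 this issues three statements instead of 250, and all of them are committed together at the end.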
If you're doing one insert per web request, you should consider logging to the file system and doing bulk imports on a crontab. I've used that design in the past to speed up inserts. It also means your webpages don't depend on the database server.
It's also much faster to use LOAD DATA INFILE to import a CSV file. See http://dev.mysql.com/doc/refman/5.1/en/load-data.html
The other thing I can suggest is be wary of the SQL hammer -- you may not have SQL nails. Have you considered using a tool like Pig or Hive to generate optimized data sets for your reports?
EDIT
If you're having troubles batch importing 500K records, you need to compromise somewhere. I would drop some indexes on your master table, then create optimized views of the data for each report.
Have you tried playing with the MyISAM key buffer (the key_buffer_size parameter)? It is very important for index update speed.
Also if you have indexes on date, id, etc which are correlated columns, you can do :
INSERT INTO archive SELECT .. FROM current ORDER BY id (or date)
The idea is to insert the rows in order, in this case the index update is much faster. Of course this only works for the indexes that agree with the ORDER BY... If you have some rather random columns, then those won't be helped.
but strongly considering PostgreSQL.
You should definitely test it.
it seems like PostgreSQL may help us out through partial indexes and indexes based on functions.
Yep.
I've read dozens of articles about the differences between the two, but most are old. PostgreSQL has long been labeled "more advanced, but slower" - is that still generally the case comparing MySQL 5.1 to PostgreSQL 8.3 or is it more balanced now?
Well that depends. As with any database,
IF YOU DONT KNOW HOW TO CONFIGURE AND TUNE IT IT WILL BE SLOW
If your hardware is not up to the task, it will be slow
Some people who know mysql well and want to try postgres don't factor in the fact that they need to re-learn some things and read the docs, as a result a really badly configured postgres is benchmarked, and that can be pretty slow.
For web usage, I've benchmarked a well configured postgres on a low-end server (Core 2 Duo, SATA disk) with a custom benchmark forum that I wrote and it spit out more than 4000 forum web pages per second, saturating the database server's gigabit ethernet link. So if you know how to use it, it can be screaming fast (InnoDB was much slower due to concurrency issues). "MyISAM is faster for small simple selects" is total bull, postgres will zap a "small simple select" in 50-100 microseconds.
Now, for your usage, you don't care about that ;)
You care about the ways your database can compute Big Aggregates and Big Joins, and a properly configured postgres with a good IO system will usually win against a MySQL system on those, because the optimizer is much smarter, and has many more join/aggregate types to choose from.
My biggest concern is the lack of INSERT IGNORE. We've often used it when building some processing table so that we could avoid putting multiple records in twice and then having to do a giant GROUP BY at the end just to remove some dups. I think it's used just infrequently enough for the lack of it to be tolerable.
You can use a GROUP BY, but if you want to insert into a table only records that are not already there, you can do this :
INSERT INTO target SELECT .. FROM source LEFT JOIN target ON (...) WHERE target.id IS NULL
In your use case you have no concurrency problems, so that works well.
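Spelled out on a toy data set, the LEFT JOIN ... IS NULL insert above behaves like this. The sketch runs on an in-memory SQLite database from Python rather than PostgreSQL, and the column val plus the sample rows are assumptions added for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT);
CREATE TABLE source (id INTEGER, val TEXT);
INSERT INTO target VALUES (1, 'a'), (2, 'b');
INSERT INTO source VALUES (2, 'b-dup'), (3, 'c'), (4, 'd');
""")

# Anti-join: keep only source rows that have no match in target.
conn.execute("""
    INSERT INTO target (id, val)
    SELECT s.id, s.val
    FROM source AS s
    LEFT JOIN target AS t ON t.id = s.id
    WHERE t.id IS NULL
""")
conn.commit()

rows = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
print(rows)
```

The row with id 2 is skipped because it already exists, and the existing value 'b' is left alone. Duplicates inside source itself are not removed by this pattern, so the SELECT would need a DISTINCT or GROUP BY if the incoming batch can repeat a key.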